In this systematic literature review (SLR), we use a series of quantitative bibliometric analyses to (1) identify the main papers, journals, and authors of the publications that make use of statistical analysis (SA) and machine learning (ML) tools as well as technological elements of smart cities (TESC) and Geographic Information Systems to predict road traffic accidents (RTAs); (2) determine the extent to which the identified methods are used for the analysis of RTAs and current trends regarding their use; (3) establish the relationship between the set of variables analyzed and the frequency and severity of RTAs; and (4) identify gaps in method use to highlight potential areas for future research. A total of 3888 papers published between January 2000 and June 2021, distributed in four clusters—RTA + HA + SA (SA, n = 399); RTA + HA + ML (ML, n = 858); RTA + HA + SC (TESC, n = 2327); and RTA + HA + GIS (GIS, n = 304)—were analyzed. We identified Accident Analysis and Prevention as the most important journal, Fred Mannering as the main author, and The Statistical Analysis of Crash-Frequency Data: A Review and Assessment of Methodological Alternatives as the most cited publication. Although the negative binomial regression method was used for several years, we noticed that other regression models as well as methods based on deep learning, convolutional neural networks, transfer learning, 5G technology, Internet of Things, and intelligent transport systems have recently emerged as suitable alternatives for RTA analysis. By introducing a new approach based on computational algorithms and data visualization, this SLR fills a gap in the area of RTA analysis and provides a clear picture of the current scientific production in the field. This information is crucial for projecting further research on RTA analysis and developing computational and data visualization tools oriented to the automation of RTA predictions based on intelligent systems.

1. Introduction

Road traffic accidents (RTAs) are one of the leading causes of death in all age groups and the leading cause among people aged 15–29 years. About 1.3 million people die annually on the world’s roads as a result of RTAs [1]. According to a report published by the World Health Organization in December 2018 [2], progress in addressing the lack of safety on the world’s roads has been insufficient. The Americas account for 11% of global road deaths, with a death rate of 15.6 per 100,000 people annually. In the last decade, the trend in Africa and Southeast Asia has been increasing, while in Europe, America, and Oceania it has been stable [2]. After Brazil, Colombia is the South American country with the highest number of road accident fatalities; 3629 people died in traffic accidents between January and July 2019 [3].

There is a current trend in several cities around the world to take advantage of technology to develop and monitor their daily activities, from which the term “intelligent city” derives. According to the European Union smart city model [4], a city is considered intelligent if it has at least one initiative that addresses one or more of the following characteristics: smart economy, smart people, smart mobility, smart environment, smart governance, and/or smart living. These characteristics, especially smart mobility, can provide data that facilitates the analysis of RTAs [5].

Since 2009, the interest of researchers in comprehensively analyzing, predicting, and preventing RTAs has led to the use of various statistical models, including the logit, probit, Poisson, and negative binomial regression models [6]. These statistical models have good theoretical interpretability, which allows a direct and clear understanding of the relationship between the frequency and/or severity of accidents and the variables analyzed. However, their main drawback is that they assume a linear relationship between risk factors and accident frequency, which may not be suitable in most cases [7]. These models also rely on very specific assumptions, such as independence between variables and normal distribution of the data, which are difficult to satisfy in the real world [8].
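For reference, the two count models named above are usually written in the following textbook form. This is a generic statement, not taken from any specific reviewed paper, and the symbols $y_i$ (crash count at site $i$), $\mathbf{x}_i$ (risk factors), $\boldsymbol{\beta}$ (coefficients), and $\alpha$ (overdispersion parameter) are introduced here only for illustration. The linearity assumption mentioned above enters through the linear predictor $\mathbf{x}_i^{\top}\boldsymbol{\beta}$.

```latex
\begin{align}
P(Y_i = y_i) &= \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!},
  \qquad \ln \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta},
  \qquad \operatorname{Var}(Y_i) = \mu_i
  && \text{(Poisson)} \\
\operatorname{Var}(Y_i) &= \mu_i + \alpha\,\mu_i^{2}, \qquad \alpha > 0
  && \text{(negative binomial, NB2 form)}
\end{align}
```

Under this formulation, the negative binomial model keeps the same log-linear mean as the Poisson model but relaxes the equality of mean and variance, which is one common reason it is preferred for overdispersed crash counts.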

Recently, models based on machine learning (ML), which include classification models, deep learning, artificial neural networks, random forest, and support vector machines (SVMs), have also been used for this purpose [9, 10]. ML-based models require no assumptions or prior knowledge, can automatically extract useful information from the dataset, and deal systematically with outliers and missing values [8]. One of the main disadvantages of ML models is their “black box” nature, which limits the direct and clear interpretation of their results compared to statistical models [7].

At present, it is possible to analyze large numbers of publications related to RTAs using bibliometric tools such as VOSviewer® [11]. Through text mining and mapping, VOSviewer® allows users to build and visualize bibliometric networks of individual journals, researchers, or publications, constructed on the basis of the number of citations each publication has [11]. Another approach is the use of R [12], one of the most powerful and flexible statistical software environments, together with the Bibliometrix [13] and causalizeR [14] packages, which facilitate a comprehensive quantitative bibliometric analysis. Via word processing algorithms, it is possible to extract causal links within a group of papers of interest based on simple grammar rules, which can subsequently be used to synthesize evidence from unstructured texts in a structured way [14]. These tools make it possible to perform analyses based on the title, author, abstract, keywords, and references of the set of publications of interest.

In this SLR, we use quantitative bibliometric analysis to (1) identify the most cited research papers, journals, authors, and methods that contribute to the state of the art of RTA analysis based on statistical techniques, ML, technological elements of smart cities, and geographic information systems, so that they can serve as references for new research related to RTAs; (2) determine the extent to which the methods identified for the analysis of RTAs and the variables included in the main publications are used, in order to recognize current trends in their use; (3) establish the relationship between the set of variables analyzed and the methods applied, and hence determine the variables with the highest incidence in the frequency and severity of RTAs, so that the types of studies in which these critical variables have been analyzed can be specified; and (4) identify gaps, that is, methods and/or variables that have not been sufficiently scrutinized, with the aim of raising possible problems and areas for further research. In Section 2, we describe the methods and infometric tools used to analyze the selected literature, and in Section 3, we present the main results. Finally, we discuss our findings and identify promising lines of research.

2. Materials and Methods

2.1. Searching for Publications

A search for publications on the Web of Science (WOS, URL: https://www.webofknowledge.com) was conducted. WOS is an online platform that contains databases of bibliographic information and resources for obtaining and analyzing such information, in order to study the performance of research, and whose purpose is to provide analysis tools that allow the assessment of its scientific quality. Through WOS, it is possible to have access to one or several databases simultaneously. The content of WOS is wide, and its high quality allows it to be a reference in the academic and scientific field [10].

After accessing WOS, we ran the search in the “Web of Science Core Collection” in the categories “Road Traffic Accident or Highway Accident and Statistical Analysis,” “Road Traffic Accident or Highway Accident and Machine Learning,” “Road Traffic Accident or Highway Accident and Smart Cities,” and “Road Traffic Accident or Highway Accident and Technological Elements of Smart Cities (TESC).” Each of these terms corresponds to one of the consulted categories, or clusters, in this document. A separate search was made for each cluster, which groups the publications of the corresponding category. The result of each search was exported to text files using the “Full Record and References Cited” option. These files were exported as unformatted (plain) text and as BibTeX, as required by VOSviewer® [15] and the Bibliometrix [13] package of R [16], respectively.

2.2. Definition of Clusters

A total of 3888 papers published between January 1, 2000 and June 30, 2021 related to road traffic accidents or highway accidents (RTA + HA) were identified using the string “road traffic accident” or “highway accident” or “road safety” or “road crash” or “crash injury.” After adding the terms “statistical analysis” or “Poisson regression” or “negative binomial regression” or “ordered probit regression” or “tobit regression” or “kernel density estimation” or “Bayesian networks,” publications were filtered and included as part of the RTA + HA + SA cluster (n = 399, 10.26%). Similarly, publications in the RTA + HA + ML cluster (n = 858, 22.06%) were included after adding the terms “machine learning” or “classification model” or “regression model” or “data mining” or “support vector machine” or “svm” or “neural networks” or “neural network” or “ann” or “deep learning” or “Big data” or “Decision Tree” or “random forest” or “supervised learning” or “regression analysis.” Publications in the RTA + HA + SC cluster (n = 2327, 59.85%) were filtered after adding the terms “smart city” or “smart cities” or “intelligent transportation systems” or “its” or “vehicular ad hoc networks” or “vanets” or “wireless sensor networks” or “wsn” or “internet of things” or “iot.” Finally, filtering the original search by adding the terms “gis” or “geographic information system” or “geographic information systems” led to the identification of the publications in the RTA + HA + GIS cluster (n = 304, 7.81%).
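The cluster definitions above can be sketched in code. The following is a minimal illustration, assuming a generic Boolean `OR`/`AND` syntax (the exact Web of Science field tags are not given in the text) and showing only the SA and GIS term lists for brevity. The function names are hypothetical, and the use of truncation to two decimals is an assumption that happens to reproduce the reported percentages.

```python
import math

# Base search string and per-cluster terms, copied from the text above.
BASE_TERMS = ["road traffic accident", "highway accident", "road safety",
              "road crash", "crash injury"]
CLUSTER_TERMS = {
    "SA": ["statistical analysis", "Poisson regression",
           "negative binomial regression", "ordered probit regression",
           "tobit regression", "kernel density estimation",
           "Bayesian networks"],
    "GIS": ["gis", "geographic information system",
            "geographic information systems"],
}
# Cluster sizes as reported in the text.
CLUSTER_SIZES = {"SA": 399, "ML": 858, "SC": 2327, "GIS": 304}


def or_group(terms):
    """Join quoted terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(cluster):
    """Combine the base RTA + HA string with one cluster's extra terms."""
    return f"{or_group(BASE_TERMS)} AND {or_group(CLUSTER_TERMS[cluster])}"


def truncated_share(n, total):
    """Percentage of total, truncated (not rounded) to two decimals."""
    return math.floor(n / total * 10000) / 100


total = sum(CLUSTER_SIZES.values())
shares = {name: truncated_share(n, total) for name, n in CLUSTER_SIZES.items()}
print(build_query("GIS"))
print(total, shares)
```

The four cluster sizes add up to the 3888 papers reported, so each percentage is a share of that total.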

2.3. Bibliometric Analysis

The bibliometric analysis was performed using VOSviewer® [11] and the Bibliometrix [13] and causalizeR [14] packages for R [16]. VOSviewer® is a tool that uses the results of WOS searches to build and visualize bibliometric networks based on number of citations, bibliographic linkage, cocitations, or authorship relationships [17]. This procedure generates network visualizations of the most important publications, journals, and authors in the categories described above. With the information extracted from VOSviewer®, we used R statistical software packages to construct a graph illustrating the number of publications in the last 20 years in each cluster, as well as graphical representations of the top 20 publications and the top 10 authors, journals, and methods.

The Bibliometrix package provides several functions for performing advanced bibliometric analysis using bibliographic data obtained from WOS [13]. Such analyses include the identification of countries with higher scientific production and of authors with more experience, and the construction of dendrograms for the statistical analysis methods and variables used to analyze RTA data.

Bibliometric information was graphically represented using a Sankey diagram [18], which helps to explore the relationship between publications, the methods used, and the variables analyzed. We considered variables related to road and environment factors (i.e., “ramps,” “curve,” “pavement,” “visibility,” “weather,” “rain,” “snow,” and “traffic volume”), human factors (i.e., “gender,” “age,” “sex,” “alcohol,” “cell,” “phone,” and “seat belt”), characterization of the crash (i.e., “injury,” “fatal,” “severity,” “week,” “hour,” “peak,” and “year”), and vehicle factors (i.e., “motorcycle,” “truck,” “car,” and “pedestrian”) as defined by Mannering et al. [19]. In addition, we included the following statistical methods widely used in RTA analysis: “Poisson Models” [20], “Binomial Negative Models” [21], “Kernel Functions” [22], “Probit Models,” “Logit Models” [23], “Hotspot” [24], “Spatial Model,” “Temporal Model” [25], “Regression Model,” “Classification Model,” “Neural Network” [26], “Clustering,” “Bayesian Methods” [27], “Likert Scale” [28], “Image” [29], “Lidar,” “GIS” [30], “Markov” [31], “GPS,” “IoT” [32], and “Unobserved Heterogeneity” [33]. Based on these categories, we used the causalizeR [14] package for R with the keywords “accident” and “severity” to generate a diagram that synthesizes the grammatical rules creating causal links around the keywords, that is, a diagram identifying the variables that affect the frequency and severity of accidents. To the best of the authors’ knowledge, this is the first study making use of methods implemented in the Bibliometrix and causalizeR packages to explore RTA-related publications.
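The idea of linking methods to variables by keyword can be sketched as follows. This is an illustrative Python approximation based on whole-word matching, not the R workflow used in this study. The two abstracts are invented toy inputs, the keyword dictionaries cover only a small subset of the lists above, and the function names are hypothetical. Each (method, variable) pair counted here would correspond to one flow in a Sankey-style diagram.

```python
import re
from collections import Counter

# Small subset of the method and variable keywords listed in the text.
METHODS = {
    "Poisson Models": ["poisson"],
    "Negative Binomial Models": ["negative binomial"],
    "Kernel Functions": ["kernel"],
    "GIS": ["gis", "geographic information system"],
}
VARIABLES = ["age", "rain", "severity", "car", "weather"]


def contains(text, phrase):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text.lower()) is not None


def method_variable_links(abstracts):
    """Count how often a method and a variable appear in the same abstract."""
    links = Counter()
    for text in abstracts:
        methods = [name for name, patterns in METHODS.items()
                   if any(contains(text, p) for p in patterns)]
        variables = [v for v in VARIABLES if contains(text, v)]
        for m in methods:
            for v in variables:
                links[(m, v)] += 1
    return links


# Invented toy abstracts, for illustration only.
toy_abstracts = [
    "A negative binomial model relates crash frequency to driver age and rain.",
    "Kernel density estimation in a GIS locates hotspots; "
    "severity varies with weather.",
]
links = method_variable_links(toy_abstracts)
for (method, variable), count in sorted(links.items()):
    print(method, "->", variable, count)
```

Whole-word matching is used so that, for example, “car” does not match inside “crash”; a real analysis would also need to handle plurals and synonyms.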

3. Results

3.1. Scientific Production per Cluster

Figure 1 shows the number of publications per year for each cluster between January 2000 and June 2021. Our results suggest a similarity between the RTA + HA + SA and RTA + HA + GIS clusters. In particular, these clusters present a linear trend between 2000 and 2010, and a slightly increasing trend between 2010 and 2021. On the other hand, the number of publications in the RTA + HA + SC cluster increases exponentially from 2007, while the number in the RTA + HA + ML cluster shows an exponential trend from 2014. Interestingly, the number of publications peaked in 2020 for all clusters except RTA + HA + SC, which peaked in 2018.

3.2. Bibliometric Analysis Based on Citations Ranking

Table 1 lists the top-20 most cited papers in each cluster. Figure 2 illustrates these papers together with the journals in which they were published and their number of citations; it also shows the main (i.e., most cited) authors of each cluster. In the RTA + HA + SA cluster, the most cited papers are The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives (n = 885, 22.13%) [33], The statistical analysis of highway crash-injury severities: A review and assessment of methodological alternatives (n = 487, 12.17%) [36], and Unobserved heterogeneity and the statistical analysis of highway accident data (n = 468, 11.7%) [19]. The most cited authors are Fred Mannering (n = 2593, 35.6%), Dominique Lord (n = 1408, 19.32%), and Mohammed A. Quddus (n = 578, 7.93%). The most cited journals are Accident Analysis and Prevention with 89 documents and 3513 (56.87%) citations, Transportation Research Part A-Policy and Practice with 4 documents and 915 (14.81%) citations, and Analytic Methods in Accident Research with 13 documents and 833 (13.48%) citations.

In the RTA + HA + ML cluster, the most cited papers are Multilevel data and Bayesian analysis in traffic safety (n = 151, 7.22%) [100], A study of factors affecting highway accident rates using the random-parameters Tobit Model (n = 145, 6.93%) [37], and The red-light running behavior of electric bike riders and cyclists at urban intersections in China: An observational study (n = 128, 6.12%) [40]. The most cited authors are Helai Huang (n = 383, 17.26%), Mohammed Abdel (n = 340, 15.32%), and Mohammed A. Quddus (n = 298, 13.42%). Similarly, the most cited journals are Accident Analysis and Prevention with 91 documents and 2735 (56.90%) citations, Safety Science with 12 documents and 413 (8.59%) citations, and Journal of Safety Research with 16 documents and 356 (7.4%) citations.

In the RTA + HA + SC cluster, the most cited papers are A comprehensive survey on vehicular Ad Hoc network (n = 633, 13.61%) [34], Vehicular ad hoc networks (VANETS): status, results, and challenges (n = 475, 10.21%) [38], and Unobserved heterogeneity and the statistical analysis of highway accident data (n = 468, 10.06%) [41]. Similarly, the most cited authors are Sherali Zeadally (n = 921, 14.15%), Fred Mannering (n = 885, 13.59%), and Ali H. Al-Bayatti (n = 736, 11.31%). The most cited journals are Accident Analysis and Prevention with 114 documents and 3083 (33.51%) citations, Journal of Network and Computer Applications with 2 documents and 1045 (11.35%) citations, and IEEE Transactions on Vehicular Technology with 28 documents and 970 (10.54%) citations.

In the RTA + HA + GIS cluster, the most cited papers are Kernel density estimation and K-means clustering to profile road accident hotspots (n = 303, 20.47%) [35], Recognizing basic structures from mobile laser scanning data for road inventory studies (n = 205, 13.85%) [39], and The longitudinal influence of home and neighbourhood environments on children’s body mass index and physical activity over 5 years: the CLAN study (n = 105, 13.42%) [42]. The most cited authors are Tesa K. Anderson (n = 303, 18.38%), Sander Oude Alberink (n = 205, 14.43%), and Shi Pu (n = 205, 14.43%). In addition, the most cited journals are Accident Analysis and Prevention with 22 documents and 715 (41.93%) citations, ISPRS Journal of Photogrammetry and Remote Sensing with 3 documents and 275 (16.13%) citations, and Journal of Safety Research with 4 documents and 151 (8.85%) citations.

Overall, Accident Analysis and Prevention is the most cited journal in the four clusters, the Journal of Safety Research and Safety Science appear in the top 10 journals of all clusters, and Analytic Methods in Accident Research appears in the top 10 of the RTA + HA + SA, RTA + HA + ML, and RTA + HA + SC clusters, as shown in Figure 2.

3.3. Methods of RTA Based on the Abstracts

Figure 3 shows word clouds of the methods mentioned in the abstracts for analyzing RTAs. In the RTA + HA + SA cluster (Figure 3(a)), negative binomial regression has been used consistently for several years. Currently, Poisson regression, Bayesian networks, and kernel density estimation are used to a similar extent. In addition, the issue of unobserved heterogeneity is addressed in an important number of studies within the RTA + HA + SA cluster. Variables associated with human factors, such as driving behavior and driving performance, were also identified. According to our analyses, methods such as deep learning, convolutional neural networks, transfer learning, and computer vision correspond to the current trends in the RTA + HA + ML cluster (Figure 3(b)). In the RTA + HA + SC cluster, routing protocols were used before 2015, and sensor networks were frequently used between 2016 and 2018 (Figure 3(c)). Starting in 2019, 5G technologies, intelligent transport systems (ITSs), smartphones, Internet of Things (IoT), and VANETS have been more frequently used (see Figure 3(c)). Finally, in the RTA + HA + GIS cluster, spatial analysis and kernel density estimation are the most widely used methods for analyzing RTAs (Figure 3(d)).

Figure 4 shows the 10 most used analytical methods for RTA analysis in each cluster. In the RTA + HA + SA cluster, spatial-temporal models are the main method, representing 20.67% of the methods, followed by negative binomial models (16.82%) and machine learning methods (15.86%) (Figure 4(a)). In the RTA + HA + ML cluster, neural network models (26.07%), general ML models (21.55%), and data mining techniques (11.42%) are the most frequently used techniques (Figure 4(b)). In the RTA + HA + SC cluster (Figure 4(c)), we found that vehicular networks (48.05%), intelligent transportation systems (20.92%), and Internet of Things (7%) are the most frequently used. Finally, in the RTA + HA + GIS cluster, spatial analysis (17.59%) is the most frequently used method, followed by kernel density estimation (13.88%) and hot spot analysis (13.88%) (Figure 4(d)).

Based on the information from the abstracts, three major topics related to the strategies used for the analysis of RTAs were derived (Figure 5). Broadly speaking, these topics include methods used to analyze RTAs, variables included to study RTAs, and indicators associated with road safety such as crash frequency, crash injury severity, fatalities, and mortality. In the RTA + HA + SA cluster, the first topic is defined by the identification and location of accidents using GIS and kernel density estimation methods; the second topic is mainly composed of variables associated with the environment, road geometry, alcohol use, age, severity (i.e., fatalities, mortality, injuries, etc.), and frequency, in addition to classic methods such as Poisson and negative binomial regression models; and the third topic includes methods such as multinomial logit models and logit models as well as unobserved heterogeneity, single-vehicle crashes, and SVM (Figure 5(a)). Similarly, in the RTA + HA + ML cluster, the first topic contains the analysis of motor vehicle crashes using regression models and Poisson regression, together with variables such as geometric design; the second topic includes frequency and severity analysis supported by SVM and logistic regression; and the third topic includes variables associated with frequency, severity, injuries, and prevalence, as well as other variables such as weather, design of the road, speed, age and behavior of the driver, and performance of preventive measures (Figure 5(b)). In the RTA + HA + SC cluster, the first topic includes, in general, RTA severity (i.e., traffic fatalities and mortality); the second topic includes technological elements associated with ITS; and the third topic includes a general list of technological elements used to collect road traffic data (Figure 5(c)). Finally, in the RTA + HA + GIS cluster, we identified walking, physical activity, and travel as the first topic; the second topic is defined by the identification and location of accidents using the kernel density estimation method; and the third topic includes classical regression and classification methods for RTA analysis (Figure 5(d)).

3.4. Relationship between Methods for RTAs and Variables Used Based on Papers’ Methodologies

We identified that in the RTA + HA + SA cluster, the papers with the highest number of relationships are Wang et al. [81] with 5 methods and 18 associated variables, Savolainen et al. [36] with 9 methods and 14 associated variables, and Lord & Mannering [33] with 9 methods and 10 associated variables (Figure 6(a)). According to our analysis, the most used models are regression models (n = 12 publications), spatial models (n = 11), negative binomial regression (n = 10), classification models (n = 8), and Poisson regression and temporal models (n = 7). The most used variables are “age” (n = 20), “year” (n = 17), “car” (n = 14), “severity” (n = 13), and “rain” (n = 12).

In the RTA + HA + ML cluster, the papers with the highest number of relationships are Mannering [43] with 13 methods and 13 associated variables, Montella [85] with one method and 14 associated variables, and Zhang et al. [78] with 2 methods and 11 associated variables (Figure 6(b)). The most used models are regression models (n = 17), classification models (n = 8), Bayesian models (n = 7), and neural networks and negative binomial regression (n = 5). The most used variables are “age” (n = 20), “car” (n = 18), “injury” (n = 16), “year” (n = 13), and “rain” (n = 12).

On the other hand, the publications with the highest number of relationships in the RTA + HA + SC cluster are Mannering et al. [41] with 10 methods and 16 associated variables, Wang & Abdel-Aty [92] with 4 methods and 16 associated variables, and Fiore et al. [98] with 3 methods and 15 associated variables (Figure 6(c)). In this cluster, the most used models are VANETS (n = 11), GPS (n = 11), spatial models (n = 8), and GIS (n = 8), while the most used variables are “age” (n = 20), “car” (n = 19), “rain” (n = 14), “cell” (n = 14), “year” (n = 13), and “weather” (n = 10).

As for the RTA + HA + GIS cluster, the papers with the highest number of relationships are Zou et al. [59] with 14 methods and 18 associated variables, Anderson [35] with 7 methods and 12 associated variables, and Plug et al. [45] with 9 methods and 10 associated variables (Figure 6(d)). The most frequently used models for RTA analysis in this cluster are GIS (n = 15), spatial models (n = 15), image processing (n = 13), classification models (n = 12), regression models (n = 11), and clustering (n = 9). The most used variables are “age” (n = 20), “car” (n = 14), “rain” (n = 14), “year” (n = 13), and “pedestrian” (n = 10).

Overall, regression, classification, and spatial models are among the top-5 methods most frequently used to analyze RTAs in three of the four clusters analyzed. In addition, “age” is the most frequently analyzed variable, with a total of 80 publications, while “car,” “year,” and “rain” are also among the most analyzed variables in all four clusters.
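The aggregation behind this summary can be reproduced from the per-cluster counts reported in this section. The sketch below is a minimal Python illustration (the variable names are hypothetical) that sums the counts and finds the variables appearing in the top list of every cluster. Because only the top variables of each cluster are reported, totals for variables missing from a cluster's list are lower bounds.

```python
from collections import Counter

# Most used variables per cluster, as reported in Section 3.4.
TOP_VARIABLES = {
    "SA": {"age": 20, "year": 17, "car": 14, "severity": 13, "rain": 12},
    "ML": {"age": 20, "car": 18, "injury": 16, "year": 13, "rain": 12},
    "SC": {"age": 20, "car": 19, "rain": 14, "cell": 14, "year": 13,
           "weather": 10},
    "GIS": {"age": 20, "car": 14, "rain": 14, "year": 13, "pedestrian": 10},
}

# Sum the publication counts of each variable across clusters.
totals = Counter()
for counts in TOP_VARIABLES.values():
    totals.update(counts)

# Variables that appear in the reported top list of every cluster.
in_all_clusters = set.intersection(
    *(set(counts) for counts in TOP_VARIABLES.values())
)

print(totals.most_common(4))
print(sorted(in_all_clusters))
```

With these inputs, “age” totals 80 publications, and “age,” “car,” “year,” and “rain” are the only variables listed for all four clusters, which matches the summary above.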

3.5. Variables Influencing the Frequency and Severity of RTAs Based on Papers’ Abstracts

Figure 7 depicts the variables influencing the frequency and severity of RTAs. In the RTA + HA + SA cluster, “age,” “sex,” “van,” “model” (vehicle type), and “snowy” (presence of snow) can be identified as the main variables that increase the frequency of road accidents (Figure 7(a)). Conversely, “effective strategy” and “measure,” which refer to the implemented road safety measures, are mitigating factors of the frequency of RTAs, and “helmet” (the use of a helmet) is identified as a factor that mitigates the severity of road accidents (Figure 7(a)). In addition, the main variables increasing RTA frequency in the RTA + HA + ML cluster are “age,” “drink” (consumption of alcohol while driving), “mobile phone” (use of the mobile phone while driving), “speed” (excess speed), and “visibility,” while “effective strategy” and “policy,” which refer to the implemented road safety measures, are identified as mitigating factors (Figure 7(b)). Regarding severity, “helmet” (i.e., the use of a helmet) and “prevention strategy” are identified as factors that mitigate RTA severity, while “age group” and “speed limit” (i.e., violation of speed limits) increase it (Figure 7(b)).

We found that “year,” “vehicle” (type of vehicle), “mobility,” “cannabis use” (use of cannabis while driving), and “angle” (crash angle) are the main variables increasing RTA frequency in the RTA + HA + SC cluster, while the mitigating factors of RTA frequency are “traffic safety” and “safety information” (Figure 7(c)). Interestingly, no factors affecting RTA severity were identified in this cluster. In the RTA + HA + GIS cluster, we identified “age,” “car” (type of vehicle), “peak,” and “noise” as the main variables increasing RTA frequency, while “strategy” and “policy” were identified as mitigating factors (Figure 7(d)). Regarding RTA severity, “speed traffic” and “speed road” were identified as factors increasing it.

Overall, our analyses indicate that (1) variables such as “age” and those associated with the type of vehicle (i.e., “model,” “car,” “vehicle,” or “van”) are represented in three of the four clusters, and (2) variables that mitigate the frequency and severity of RTAs are related to the implementation of safety measures.

4. Discussion

The first goal of this systematic literature review (SLR) was to identify the most cited research papers, journals, authors, and methods that contribute to the state of the art of RTAs based on statistical analysis, machine learning (ML), technological elements of smart cities, and geographic information systems (GIS). As shown in Figure 1, an exponential growth can be observed from 2010 onwards, mainly in the RTA + HA + SC and RTA + HA + ML clusters. In the first cluster, this behavior may be due to the contribution of the Industry 4.0 revolution with technologies such as Internet of Things (IoT), intelligent transportation systems (ITSs), and wireless sensor networks (WSNs) [4] (Figure 4(c)). The exponential growth in the second cluster may be related to the appearance of processors capable of executing computational models with large databases [101]. The performance of the RTA + HA + SA and RTA + HA + GIS clusters is similar because the publications applying statistical models tend to rely on GIS to clearly and concisely show the results as previously discussed by [5]. Some publications using this strategy include Refs. [35, 39, 42], among others.

Zou & Vu [102] proposed four clusters to identify the main publications in road safety via scientometric analysis: Cluster 1: effects of driving psychology and behavior on road safety; Cluster 2: causation, frequency, and injury severity analysis of road crashes; Cluster 3: epidemiology, assessment, and prevention of road traffic injury; and Cluster 4: effects of driver risk factors on driver performance and road safety. Following this approach, in this SLR, we proposed the same number of clusters to analyze the frequency and severity of RTAs: Cluster 1: statistical analysis (RTA + HA + SA; n = 399, 10.26%), Cluster 2: machine learning (RTA + HA + ML; n = 858, 22.06%), Cluster 3: technological elements of smart cities (RTA + HA + SC; n = 2327, 59.85%), and Cluster 4: Geographic Information Systems (RTA + HA + GIS; n = 304, 7.81%). Figure 2 depicts the 20 most cited publications analyzing RTAs per cluster, which is in line with the proposal by Zou et al. [103] when analyzing the top 50 most cited publications in the journal Accident Analysis and Prevention. In this SLR, we identified that the most cited publications are the first three of the RTA + HA + SA cluster (i.e., Lord & Mannering [33] with 885 citations, Savolainen et al. [36] with 487 citations, and Mannering et al. [19] with 468 citations) and the first four of the RTA + HA + SC cluster (i.e., Al-Sultan et al. [34] with 633 citations, Zeadally et al. [38] with 475 citations, Mannering et al. [41] with 468 citations, and Whaiduzzaman et al. [44] with 412 citations) (Figure 2). These publications have in common that they are literature reviews, which highlights the importance of this type of publication for scientific production: reviews serve as a reference for exploring gaps in the literature and propose new research avenues focused on expanding the frontier of knowledge. We strongly believe that these results constitute a valuable resource for future studies related to RTAs.

The most cited author is Fred Mannering, who has four publications in the journal Analytic Methods in Accident Research: Unobserved heterogeneity and the statistical analysis of highway accident data [41] with 523 citations; Temporal instability and the analysis of highway accident data [105] with 155 citations; Big data, traditional data and the tradeoffs between prediction and causality in highway-safety analysis (n = 26) [104], which is not among the top 20 publications in any of the clusters but is part of this SLR; and Analytic methods in accident research: Methodological frontier and future directions (n = 722) [106], which was not included in any of the clusters generated by Web of Science. These publications are literature reviews that address the advantages and disadvantages of the methods and variables used to model RTAs, as well as the problems and difficulties encountered when applying specific methods. In addition, these publications highlight the importance of taking unobserved heterogeneity into account when predicting RTAs.

A similar SLR listed the 50 most cited papers in the journal Accident Analysis and Prevention [103] and identified the research papers with the greatest influence in the field of RTAs. The authors hypothesized that the oldest papers accumulate the greatest number of citations, emphasizing that research papers from the 1990s and the first decade of 2010 are the most important [103]. In the present SLR, we considered other journals within this line of research and found that the most cited papers are from 2010 onwards.

VOSviewer® has previously been used to identify clusters related to causality and accident frequency via co-occurrence of keywords [59], and to illustrate the methods, technological tools, and variables correlated with the frequency and severity of RTAs [103]. Here, we found that generalized linear models, mainly negative binomial regression and Poisson regression, were widely used between 2012 and 2020 (Figure 3(a)). However, methods such as spatial analysis and the analysis of unobserved heterogeneity are of particular importance today (Figure 5(d)). Unobserved heterogeneity is understood as the error introduced when variables that are correlated with the observed variables are left out of the analysis of RTAs. In many cases, these “unobserved” variables are unknown to the analyst [41]. Although unobserved heterogeneity is not a method, it is a problem that must be addressed to improve the quality of road accident prediction models.
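
To make the notion of unobserved heterogeneity more concrete, the following minimal sketch simulates crash counts at hypothetical road sites under two assumptions: all sites share the same crash rate (pure Poisson), or each site's rate is scaled by a gamma-distributed factor with mean 1 that stands in for omitted site-level variables. All names and parameter values (number of sites, base rate, gamma shape, seed) are illustrative assumptions and are not taken from the reviewed studies. Mixing a Poisson rate with a gamma factor is the construction that yields the negative binomial distribution, which is one reason that model handles the resulting overdispersion better than plain Poisson regression.

```python
import math
import random
from statistics import mean, variance


def poisson_sample(rate, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_sites, base_rate, heterogeneity_shape, rng):
    """Simulate one crash count per site.

    If heterogeneity_shape is None, every site shares base_rate.
    Otherwise each site's rate is multiplied by a gamma factor with
    mean 1 (shape k, scale 1/k), a stand-in for omitted variables.
    """
    counts = []
    for _ in range(n_sites):
        factor = 1.0
        if heterogeneity_shape is not None:
            factor = rng.gammavariate(heterogeneity_shape, 1.0 / heterogeneity_shape)
        counts.append(poisson_sample(base_rate * factor, rng))
    return counts


rng = random.Random(2021)  # fixed seed so the run is repeatable
homogeneous = simulate_counts(5000, 3.0, None, rng)
heterogeneous = simulate_counts(5000, 3.0, 2.0, rng)

for label, counts in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
    ratio = variance(counts) / mean(counts)
    print(f"{label}: mean = {mean(counts):.2f}, variance/mean = {ratio:.2f}")
```

In the homogeneous case the variance-to-mean ratio stays close to 1, as the Poisson model assumes; in the heterogeneous case it is clearly larger, even though the average count is similar. A model that ignores the omitted factor would therefore understate the variability of the counts.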

As shown in Figure 3(b), the most recently implemented methods in the RTA + HA + ML cluster are artificial intelligence, deep learning, CNN, transfer learning, and computer vision. We believe that these findings may be explained by the development of GPUs [107] and next-generation processors. Similarly, in the RTA + HA + SC cluster, 5G, ITS-G5, V2X, and deep learning technologies (Figure 3(c)) are emerging as suitable alternatives for collecting RTA data, owing to the development of technologies associated with the fourth industrial revolution.

In terms of the variables considered when predicting the frequency and severity of RTAs, we identified four main groups: variables related to human factors, variables related to climatological and road factors, variables defining vehicle-specific characteristics, and variables associated with accident characteristics (Figure 7). Silva, Andrade, and Ferreira also identified these four groups of variables. However, in the present SLR, we provide a visualization that makes it easy to identify which variable was analyzed in each publication. The same grouping was used by Mannering et al. [19] to describe textually why each variable belongs to its group and how it contributes to the occurrence of RTAs. We believe that a graphical representation such as Figure 7 is useful for this type of analysis and facilitates a quick understanding of this important topic.

We identified that variables such as age or age group, sex, type of vehicle, and year contribute the most to the occurrence of RTAs and are frequently analyzed among the 80 highly cited papers across all clusters (Figures 2 and 6). In contrast, variables associated with human factors (e.g., the use of a helmet and driving under the influence of alcohol), traffic congestion, and visibility, which have been identified as factors associated with the frequency and severity of RTAs (Figure 7), are the least frequently analyzed variables in all clusters (Figure 6). The use of helmets as a mitigating factor in the severity of RTAs involving motorcyclists is examined in Savolainen et al. [36], while driving under the influence of alcohol is analyzed as a cause of RTAs in Mannering et al. [105]. Analysis of mobile phone use and its effect on the frequency and severity of RTAs is even rarer, although handling a cell phone while driving has been shown to cause serious RTAs [108]. Figure 6 summarizes the methods and variables most frequently used to analyze RTAs across all research papers identified in this SLR. In all clusters, the mobile phone has been analyzed not as a cause of RTAs but as a communication tool, that is, to create VANETs, as evidenced in 14 of the 20 research papers comprising the RTA + HA + SC cluster (Figure 6(a)). Another variable that reduces the severity of RTAs is the use of seat belts by drivers and passengers of automobiles [108]. Unfortunately, this variable is analyzed in only 3 of the 80 main studies (Table 1 and Figure 2).

Another important aspect of this SLR is the use of quantitative tools to analyze research papers covering the analytical tools used to assess RTAs, together with the development of specific visualizations. In particular, Figure 2 shows the top 80 highly cited publications, and Figure 6 connects them with the different analytical methods and variables for predicting the frequency and severity of RTAs. This visualization was developed following Silva et al. [109]. Of note, this connectivity plot makes it easy to identify how each publication uses different methods and variables and to what extent these methods contribute to the analysis of RTAs. Motella [85] identified several factors contributing to the occurrence of RTAs. However, a graphical representation such as Figure 7 allows us to (1) illustrate and determine potential causal relationships between different factors and the frequency and severity of RTAs, and (2) offer a more reader-friendly presentation. The latter makes complex information easier to understand, especially when large volumes of data are available.

In light of the results of this SLR, we suggest the following topics for future research related to the analysis of RTAs, framed in four areas: (1) data sources and organization; (2) statistical- and ML-based methods to be used; (3) variables to be analyzed; and (4) visualization tools to communicate the results easily and intuitively.

(1) Database sources and organization. In this SLR, the Web of Science database was used as the publication database. For future literature reviews, other bibliographic databases (e.g., SCOPUS and Dimensions) and resources should be included and explored.

(2) Data analytic methods. Considering the trends identified in this SLR, future studies would greatly benefit from comparing the feasibility and predictive power of different methods for RTA analysis, such as spatiotemporal vs. ML-based methods (e.g., a recurrent neural network model and/or an ensemble method). In this regard, it would be crucial to ensure that ML-based methods are capable of addressing unobserved heterogeneity.

(3) Variables to be analyzed. In this SLR, human factors that cause RTAs (e.g., the use of a cell phone while driving and driving under the influence of alcohol) as well as mitigating factors such as the use of a helmet (for motorcyclists) and a seat belt (for drivers and car occupants) were identified. Future studies assessing RTAs should properly analyze, using different statistical and ML-based methods, the influence of these and other risk and mitigating factors on RTAs in order to establish a baseline comparison and provide more accurate results.

(4) Data visualization tools. Data visualization makes it possible to understand information more effectively than representations based on numbers or text, since the human mind processes graphic images more easily than tables of numerical or textual information [110]. Currently, tools are available for analyzing, presenting, and transforming bibliometric data from metadata (i.e., title, journal, authors, date of publication, keywords, abstract, references, and country of origin). These tools include, but are not limited to, biblioshiny (https://bibliometrix.org/Biblioshiny.html), VOSviewer®, CiteSpace© [111], CitNetExplorer [112], and the R package causalizeR [14]. Correctly integrated, these tools allow information to be extracted from bibliometric data. Although here we have shown the usefulness of these tools for visualizing this type of data, further lines of research include expanding these tools or developing new data visualization approaches for analyzing RTA-related publications.
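
As an illustration of how keyword metadata can be turned into the raw material for a co-occurrence map, the following minimal sketch counts how often pairs of keywords appear in the same publication. The records shown are hypothetical and the function name is our own; the sketch covers only the counting step and does not reproduce the normalization, layout, or clustering performed by tools such as VOSviewer®.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each entry is the keyword list of one publication.
records = [
    ["negative binomial", "crash frequency", "unobserved heterogeneity"],
    ["deep learning", "crash severity", "computer vision"],
    ["Negative Binomial", "crash frequency", "spatial analysis"],
    ["unobserved heterogeneity", "crash severity", "crash frequency"],
]


def keyword_cooccurrence(keyword_lists):
    """Count, for every keyword pair, the publications containing both."""
    pair_counts = Counter()
    for keywords in keyword_lists:
        # Normalize case and whitespace, and drop duplicates within a record.
        unique = sorted({k.strip().lower() for k in keywords})
        # Sorting makes each pair appear in one canonical order.
        pair_counts.update(combinations(unique, 2))
    return pair_counts


pairs = keyword_cooccurrence(records)
for (first, second), count in pairs.most_common(3):
    print(f"{first} + {second}: {count}")
```

The resulting pair counts can be read as the edge weights of a keyword network, in which frequently co-occurring terms end up strongly linked.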

5. Conclusions

Literature reviews are useful to support other investigations in the production of new scientific knowledge. Here, we used bibliometric analyses to identify the main papers, journals, and authors contributing significantly to the scientific production in RTA analysis using statistical and ML-based methods as well as technological elements of smart cities. Interestingly, the most cited publications identified in this study are literature reviews. The position adopted in this work is that bibliometric analyses can be used to complement research in all areas of knowledge for which scientific publications are indexed in known databases. Ideally, the research topics suggested herein will be taken up in future studies.

In the present SLR related to the analysis of RTAs, we found that scientific production has grown over the last decade. This result may be due, in part, to the emergence of new generations of computers and data analytical methods that allow large databases to be analyzed faster, more securely, and with high accuracy, as well as to the adoption and development of technological elements of smart cities to collect data produced on the world’s roads in real time.

Unobserved heterogeneity is not a statistical analysis method but a problem that arises when applying other statistical methods to analyze RTAs. Thus, it is important to develop and/or use models/methods that are robust and have good performance for predicting RTAs under the potential presence of unobserved heterogeneity.

Bibliometric tools such as VOSviewer® and the scientometric methods implemented in the Bibliometrix and causalizeR packages of R constitute powerful and easy-to-use tools for analyzing large volumes of publications. Used cautiously and in a complementary way, these tools give researchers the possibility of identifying, in a short amount of time, patterns and relationships that would be almost impossible to detect manually.

Human factors are one of the main causes of RTAs. However, in this SLR, we showed that these factors have not been sufficiently analyzed in the most cited publications. It is important to highlight that the use of helmets by motorcycle users and the use of seat belts by car users are factors that mitigate the severity of road accidents, whereas the use of cell phones while driving and driving under the influence of alcohol are factors that increase the frequency of RTAs. Therefore, it is worthwhile for future research to include human factors as predictor variables, since they may improve the performance of the resulting models. Finally, data visualization tools allow research results with high informative content to be presented in a coherent and clear manner, making that content easy for readers to understand. Hence, it is advisable to use them whenever possible.

Data Availability

The data retrieved from Web of Science and used to support the findings reported in this article are available from the corresponding author upon reasonable request.


The sponsor of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript. CMFV is a doctoral student at the Universidad del Norte, Barranquilla, Colombia. Some of this work is to be presented in partial fulfilment of the requirements for the PhD degree.

Conflicts of Interest

None of the authors of this article has a financial or personal relationship with other people or organizations that could inappropriately influence or bias the content of the article.

Authors’ Contributions

JIV and GAGL conceived the study; JIV contributed to the methodology; JIV and GAGL validated the study; CMFV performed the formal analysis and the investigation; CMFV, JIV, and GAGL contributed resources; CMFV and JIV curated the data; CMFV prepared the original draft of the manuscript; CMFV, JIV, and GAGL reviewed and edited the manuscript; CMFV and JIV prepared the visualizations; JIV and GAGL supervised the study; GAGL performed the project administration; and CMFV and GAGL contributed to the funding acquisition. Jorge I. Velez and Guisselle A. Garcia-Llinas contributed equally to this work.


CMFV was supported by COLCIENCIAS, project “Design of a framework for Integral Road Traffic Accident Management in Smart Cities,” project # 785 of 2017, contract 84-091-894.