Road traffic carnage is a global concern and is seemingly on the rise in Ghana. Several risk factors have been studied in association with road traffic fatalities. However, inadequate road traffic fatality (RTF) data and inconsistent probability outcomes for RTF remain major challenges. The objective of this study was to illustrate and estimate probability models that can predict road traffic fatalities. We relied on 66,159 recorded casualties who were involved in road traffic accidents (RTAs) in Ghana from 2015 to 2019. Three generalized linear models, namely, the logistic regression, probit regression, and linear probability models, were used for the analysis. We found that gender and age group have significant effects in predicting the probability of road traffic fatality in all three models. Through likelihood ratio tests, however, it was determined that the logit regression model produced consistent probabilities of traffic fatalities that are very close to the actual probability values across the age groups and gender, compared to the other two models. Thus, we recommend an intensified campaign for the use of seat belts in vehicles, targeted at the aged and male users of road transport, to reduce the possibility of death in any RTA.

1. Introduction

Reducing the risk of road traffic fatalities (RTFs) remains the ultimate objective of many road safety regulations and studies. Deaths resulting from road traffic crashes have become an existential threat, as available statistics indicate that the world loses close to 1.35 million people through road accidents each year, the majority of whom are young people between the ages of 5 and 29 years [1]. The repercussions are severe, given the immense losses they bring to the victims’ families and their communities. Further, it is estimated that several countries lose about three to five percent of their gross domestic product (GDP) to road traffic crashes [2]. Global leaders, through Sustainable Development Goals (SDG) target 3.6, aimed to halve road traffic deaths by the year 2020 [3]. This target was, however, missed, prompting the UN General Assembly to pass a new resolution titled “Improving global road safety.” The resolution emphasized achieving the previous target of reducing the global number of deaths and injuries from road traffic crashes by 50% by the year 2030 [3]. To this end, examining the risks of road traffic fatalities by victims’ characteristics is crucial in order to fully comprehend the impact of the various risk factors contributing to the fatalities, so that appropriate safety interventions can be identified and implemented to reduce the number of deaths from these crashes. Road traffic accident fatalities, according to [4], include only deaths which occur within 30 days following a road accident.

Statistics available from [2] suggest significant differences in the rate of road traffic deaths per 100,000 people across different regions of the world. Among these regions (Africa, America, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific), Africa was noted to have the highest rate of road traffic deaths (26.6/100,000 people) compared to Europe (9.3/100,000 people). Further evidence from OECD [5] mentioned South Africa to have had a road traffic mortality rate of 22.4/100,000 people in 2018. These figures present an unpleasant trend in terms of the progress made in fighting road traffic carnages in Africa. Particularly in sub-Saharan Africa, Aga et al. [6] explained that the situation of road traffic accidents is severe, and that the region has the highest road traffic death rate, with significant number of properties damaged through road traffic accidents.

The evidence in Ghana is frightening, as Blankson and Lartey [7] confirmed that deaths resulting from road traffic accidents constitute 62 percent of all emergency cases reported at designated referral hospitals for accident victims in Ghana. This was corroborated by available statistics from WHO [8], which indicate that an average of 8 persons out of every 100,000 population in Ghana died from RTAs annually over the past decade [7]. Many of these RTA-related deaths, according to Konlan et al. [9], are caused by the road traffic behavior of motorcyclists. Predictor variables such as age, alcohol influence, excessive speeding, bad roads, overloading, and disregard for road regulations have been found to be significant risk factors for road fatalities in Ghana [10–13]. However, these studies are devoid of rigorous probability models that attempt to predict road traffic deaths in Ghana. For instance, Asare and Mensah [13] only applied the ordinal regression model to identify factors that contribute to accident severity in Ghana.

2. Literature Review

This section discusses contributory risk factors to RTF and methodological approaches used in estimating these risk factors. It further presents a brief empirical review of RTF studies across the globe. Extensive research has already provided many insights into risk factors that influence road traffic crashes. The literature discussion has focused on six areas: demographic factors, human factors, road factors, vehicle factors, circumstantial factors, and environmental factors. Demographic characteristics such as gender, age, education, employment sector, and income earned by drivers dominate recent studies [14, 15], but road traffic crashes resulting in deaths are not exclusive to only drivers of the vehicle. Other studies [16, 17] show that RTF cuts across different spectrum of road users with different demographic backgrounds.

Regev et al. [16] and Melchor et al. [18] noted that men are more likely to suffer death in a road accident because of their higher frequency in engaging in the transport and distribution business compared to women. In relation to age, Vaa [19] observed that the biological and psychological system of a person deteriorates faster as they age. This makes the aged (60 years and above) more likely to suffer death in a road accident. Vaa’s [19] observation is in contrast with Hesse et al.’s [20] study where between the periods of 2001 and 2010, they found that persons between the ages of 26 and 35 were the highest casualties in road traffic fatalities in Ghana.

On the part of human factors, Wu and Xu [21] and Abele et al. [22] highlighted that driver behaviors such as speeding, drunk-driving, fatigue, safety measures adopted, and risk-taking behaviors are the most influential causes of traffic-related casualties. Febres et al. [23] blamed young drivers for risky driving behaviors. Mazankova [24] maintained that these human behaviors are the main causes of about 70 percent of road traffic fatalities. Xie et al. [25] and Zhang et al. [26] found that road-related factors such as increased motorization, lane changing, and overtaking have negative effects on traffic safety. According to Febres et al. [23], failure to use appropriate safety accessories in vehicles contributes to a higher probability of human deaths on motorways. The environmental factors examined by Altwaijri et al. [27], including road types, nighttime travel, and weather conditions, were found to amplify the risk of exposure to fatal road injuries.

The literature presents a plethora of statistical models used in predicting human road traffic fatalities and injuries given some risk factors. In Farooq and Moslem’s [28] study, the analytic network process was used to conclude that driving without alcohol and obeying speed limits were significant factors compared to other factors causing road traffic injuries in Hungary. This conclusion is problematic, as the study is limited in its sampling. Twenty drivers were used without regard to any probabilistic approach; besides, the analytic network process is only a decision analysis tool. The analytic network process functions like the artificial neural networks applied in similar previous studies such as Delen et al. [29] and Chimba and Sando [30]. A related study by Febres et al. [23] used a Bayesian network to conclude that failure to use appropriate safety accessories, high speed violations, distractions, and errors have a higher probability of predicting fatal injuries for drivers in Spain. Febres et al.’s [23] study used secondary data in which 66,253 drivers were selected using systematic sampling, whereas Farooq and Moslem’s [28] study gathered data through a questionnaire survey.

The use of Bayesian statistics in predicting risk of RTF is increasing. Varied approaches such as Bayesian ordered probit and Bayesian hierarchical binomial logit are common in the literature [25, 31]. For instance, Hesse et al. [20] used Bayesian analysis to confirm that population and numbers of registered vehicles were the predominant factors influencing road traffic fatalities in Ghana. The weakness of the Bayesian approach is however exposed mainly in its subjective choice of priors.

Another statistical concept common in the RTF literature is logistic regression analysis. It is presented in three forms, namely, binomial (binary), multinomial, and ordinal logistic regression. Santos et al. [32] applied binary logistic regression to identify the factors influencing work zone road crashes. Other studies that employed the binary logistic regression model in RTF analyses include Potoglou et al. [33], Zeng et al. [34], and Eboli et al. [35]. The major weakness of binary logistic regression is that it can handle only two possible outcome values.

Multinomial logistic regression, however, addresses this weakness of binary logistic regression. In multinomial logistic regression analysis, there is room for at least three possible outcome values, but these values are not ordered. Vilaça et al. [36] used the multinomial logistic regression model to identify statistically significant variables for predicting the injury risk of vulnerable road users based on spatial and temporal assessment. Similar studies that used multinomial logistic regression in RTF analyses include the works of Abdulhafedh [37], Useche et al. [38], and Damsere-Derry et al. [39].

Unlike the multinomial logistic regression, the ordinal logistic regression has its outcome values ordered. The ordinal logistic regression has seen several usages in diverse fields of research [40–43]. With regard to studies about road traffic fatalities, Kadilar [44] and Qian et al. [45] used the ordinal logistic regression to analyze the effect of drivers’ behavior, roadway, and vehicle characteristics on crash severity in traffic accidents in Turkey and China, respectively.

Another commonly used probabilistic model in RTF data analysis is the linear regression model. Previous studies from Ghana [10, 11] mainly applied multiple regression analysis to draw conclusion that the road traffic accidents were surging and at a faster rate in Ghana. However, Shankar et al. [46] critiqued the use of linear regression models as inappropriate for making probabilistic statements about the occurrences of vehicular accidents on the road because it lacks the ability to establish a precise relationship between the dependent variable and independent variables for smaller sample sizes. In such circumstances, Abdullah and Zamri [47] proposed the use of fuzzy linear regression models to analyze factors responsible for RTF.

In other related studies [48–50], Poisson regression models were fitted to RTF data under the basic assumption that the data have equal mean and variance. However, Shaik and Hossain [51] faulted the use of the Poisson regression model, as the underlying assumptions are difficult to satisfy. They contend that, in many instances, RTF data tend to be overdispersed (variance larger than the mean) and so do not have equal mean and variance. Consequently, it is improper to use the Poisson regression model to analyze the underdispersed and overdispersed data sets that are typical of RTF data.

With the knowledge gained in the reviewed literature, we proceed to present statistical methods suitable for predicting RTF in Ghana.

3. Methods

3.1. Model Formulation

In a generalized linear model (GLM), it is assumed that a linear relationship exists between a variable called the dependent variable (outcome variable), $y$, and $k$ independent variables, $x_1, x_2, \ldots, x_k$. Thus, given $y$ and $x_1, x_2, \ldots, x_k$, the basic model for the GLM is of the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \tag{1}$$

where $\beta_0, \beta_1, \ldots, \beta_k$ are the partial GLM coefficients, and it is assumed that $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. The link function, $g(\mu)$, specifies the link between the random and systematic components. It says how the expected value of the response, $\mu = E(Y)$, relates to the explanatory variables through a prediction equation having the linear form

$$g(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. \tag{2}$$

The simplest link function is $g(\mu) = \mu$. This models the mean directly and is called the identity link. Suppose we want a GLM that models $n$ Bernoulli trials, that is, the response variable $Y$ is binary and takes the value 1 (for success) or 0 (for failure). The probability mass function of $Y$ is given by

$$P(Y = y) = \pi^{y}(1 - \pi)^{1 - y}, \quad y = 0, 1, \tag{3}$$

where $\pi$ is the probability of success. Since $E(Y) = \pi$, it follows that $g(\mu)$ is a function of $\pi$. In subsequent sections, we present three different link functions that can be used to estimate $\pi$.
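The role of the link function can be illustrated numerically. The following R sketch uses `make.link()` from base R to apply the identity, logit, and probit links to a few probabilities and then inverts each link. The values in `p` are illustrative assumptions, not study data.

```r
# Illustrative probabilities of "success" (assumed values, not from the study)
p <- c(0.05, 0.20, 0.50)

# Link objects from base R; each carries linkfun (g) and linkinv (g inverse)
logit_link    <- make.link("logit")
probit_link   <- make.link("probit")
identity_link <- make.link("identity")

# Map probabilities to the scale of the linear predictor
eta_logit    <- logit_link$linkfun(p)     # log(p / (1 - p))
eta_probit   <- probit_link$linkfun(p)    # qnorm(p)
eta_identity <- identity_link$linkfun(p)  # p itself

# Applying the inverse link recovers the original probabilities
back <- cbind(
  logit    = logit_link$linkinv(eta_logit),
  probit   = probit_link$linkinv(eta_probit),
  identity = identity_link$linkinv(eta_identity)
)
print(back)
```

Only the identity link leaves the linear predictor on the probability scale, which is why a linear probability model can in principle produce fitted values outside the interval from 0 to 1.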

3.1.1. Logistic Regression Model

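The link function for the logistic regression model is the logit or log-odds function, which is defined, according to McCullagh and Nelder [52], as

$$g(\pi) = \operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right). \tag{4}$$

Thus, the logistic regression model can be written as

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}. \tag{5}$$

Solving for the probability in the logit model gives

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})},$$

which gives

$$\pi_i = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})},$$

where $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ik})'$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$. Let $y_1, y_2, \ldots, y_n$ be the $n$ observed independent Bernoulli trials with parameters $\pi_1, \pi_2, \ldots, \pi_n$, respectively. From equation (3), the likelihood function is given by

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \tag{6}$$

The maximum likelihood estimates of the components of the vector $\boldsymbol{\beta}$ are the values of $\boldsymbol{\beta}$ which maximize the likelihood function. They are also the values of $\boldsymbol{\beta}$ which maximize

$$\ell(\boldsymbol{\beta}) = \ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\,\mathbf{x}_i'\boldsymbol{\beta} - \ln\left(1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})\right)\right]. \tag{7}$$

The first derivative of $\ell(\boldsymbol{\beta})$ with respect to $\beta_j$ is thus

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial \beta_j} = \sum_{i=1}^{n}(y_i - \pi_i)\,x_{ij}, \quad j = 0, 1, \ldots, k, \tag{8}$$

where $x_{i0} = 1$. Setting each partial derivative in equation (8) to zero and replacing $\boldsymbol{\beta}$ by $\hat{\boldsymbol{\beta}}$, we obtain the maximum likelihood estimates of $\boldsymbol{\beta}$. The methods of solution are iterative in nature and have been programmed into logistic regression software.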
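To make the iterative solution concrete, the following R sketch applies Newton-Raphson updates to the score equations of the logistic model and compares the result with `glm()`. It is a minimal illustration on simulated data, not the software routine used in the study. The variable names, sample size, and coefficient values in `beta_true` are assumptions.

```r
set.seed(1)

# Simulated data (hypothetical): coded age group x1 and gender x2
n  <- 500
x1 <- sample(1:8, n, replace = TRUE)
x2 <- sample(1:2, n, replace = TRUE)
X  <- cbind(1, x1, x2)                 # design matrix with intercept column
beta_true <- c(-2, 0.15, -0.3)         # assumed values for the simulation
y  <- rbinom(n, 1, as.vector(plogis(X %*% beta_true)))

# Newton-Raphson: solve score(beta) = 0 for the logit model
beta <- rep(0, ncol(X))
for (iter in 1:25) {
  pi_hat <- as.vector(plogis(X %*% beta))
  score  <- t(X) %*% (y - pi_hat)                     # first derivatives of the log-likelihood
  info   <- t(X) %*% (X * (pi_hat * (1 - pi_hat)))    # negative second derivatives
  step   <- solve(info, score)
  beta   <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break                   # stop once updates are negligible
}

# Reference fit from base R
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
print(cbind(newton = beta, glm = coef(fit)))
```

For the logit link, Newton-Raphson coincides with Fisher scoring, so the hand-rolled loop and `glm()` should agree up to numerical tolerance.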

3.1.2. Probit Regression Model

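The probit regression model uses the cumulative distribution function of the standard normal distribution as the basis of its link function. In the probit model, the inverse standard normal distribution function of the probability is modeled as a linear combination of the predictors [53, 54]. Thus, the link function for the model is

$$g(\pi_i) = \Phi^{-1}(\pi_i), \tag{9}$$

where $\pi_i = P(Y_i = 1)$, and $\Phi$ is the distribution function of the standard normal distribution. Thus, the probit model has the following expression:

$$\Phi^{-1}(\pi_i) = \mathbf{x}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}, \quad \text{that is,} \quad \pi_i = \Phi(\mathbf{x}_i'\boldsymbol{\beta}). \tag{10}$$

Therefore, from equation (3),

$$P(Y_i = y_i) = \left[\Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{y_i}\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{1 - y_i}. \tag{11}$$

Similar to equation (6), the likelihood function is given by

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n}\left[\Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{y_i}\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]^{1 - y_i}. \tag{12}$$

The maximum likelihood estimates of $\boldsymbol{\beta}$ maximize the likelihood function. They are also the values of $\boldsymbol{\beta}$ which maximize the log-likelihood function. Greene [55] showed that the estimator could be calculated through maximizing the following log-likelihood function $\ell(\boldsymbol{\beta})$:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left\{y_i \ln \Phi(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i)\ln\left[1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta})\right]\right\}. \tag{13}$$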
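As an illustration of maximizing the probit log-likelihood directly, the R sketch below minimizes the negative log-likelihood with the general-purpose optimizer `optim()` and compares the result with `glm()`. The data are simulated, and the coefficient values and variable names are assumptions made for the example. `optim()` stands in for whatever iterative routine a given software package uses.

```r
set.seed(2)

# Simulated data (hypothetical): coded age group x1 and gender x2
n  <- 500
x1 <- sample(1:8, n, replace = TRUE)
x2 <- sample(1:2, n, replace = TRUE)
X  <- cbind(1, x1, x2)
y  <- rbinom(n, 1, as.vector(pnorm(X %*% c(-1.2, 0.08, -0.2))))

# Negative probit log-likelihood; log.p = TRUE keeps the tails numerically stable
negloglik <- function(beta) {
  eta <- as.vector(X %*% beta)
  -sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}

# General-purpose quasi-Newton maximization of the log-likelihood
opt <- optim(rep(0, ncol(X)), negloglik, method = "BFGS",
             control = list(maxit = 500, reltol = 1e-10))

# Reference fit from base R
fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
print(cbind(optim = opt$par, glm = coef(fit)))
```

The term `pnorm(-eta, log.p = TRUE)` equals the logarithm of one minus the standard normal distribution function at `eta`, by the symmetry of the normal distribution.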

3.1.3. Linear Probability Model

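The link function for the linear probability model is the identity, which is defined as

$$g(\pi_i) = \pi_i. \tag{14}$$

Thus, the identity regression model can be written as

$$\pi_i = \mathbf{x}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}. \tag{15}$$

The likelihood function is given by

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n}(\mathbf{x}_i'\boldsymbol{\beta})^{y_i}(1 - \mathbf{x}_i'\boldsymbol{\beta})^{1 - y_i}. \tag{16}$$

The log-likelihood function is

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i \ln(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i)\ln(1 - \mathbf{x}_i'\boldsymbol{\beta})\right]. \tag{17}$$

The first derivative of $\ell(\boldsymbol{\beta})$ with respect to $\beta_j$ is

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial \beta_j} = \sum_{i=1}^{n}\left[\frac{y_i}{\mathbf{x}_i'\boldsymbol{\beta}} - \frac{1 - y_i}{1 - \mathbf{x}_i'\boldsymbol{\beta}}\right]x_{ij},$$

and thus

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial \beta_j} = \sum_{i=1}^{n}\frac{(y_i - \mathbf{x}_i'\boldsymbol{\beta})\,x_{ij}}{\mathbf{x}_i'\boldsymbol{\beta}\,(1 - \mathbf{x}_i'\boldsymbol{\beta})}, \quad j = 0, 1, \ldots, k. \tag{18}$$

Setting each partial derivative in equation (18) to zero and replacing $\boldsymbol{\beta}$ by $\hat{\boldsymbol{\beta}}$, we obtain the maximum likelihood estimates of $\boldsymbol{\beta}$.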

3.2. The Significance of Model Coefficient

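The hypothesis for testing the significance of any individual regression coefficient such as $\beta_j$ is

$$H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0. \tag{19}$$

If $H_0$ is rejected, we conclude that the regressor $x_j$ contributes significantly to the model. The Wald chi-squared test statistic for testing $H_0$ against $H_1$ is

$$W = \left(\frac{\hat{\beta}_j}{\widehat{\operatorname{se}}(\hat{\beta}_j)}\right)^{2}, \tag{20}$$

where $\widehat{\operatorname{se}}(\hat{\beta}_j)$ is the estimated standard error of $\hat{\beta}_j$. When $H_0$ is true, $W$ has the chi-square distribution with 1 degree of freedom. We reject $H_0$ at significance level $\alpha$ if $W > \chi^2_{\alpha, 1}$.

The comparison of observed to predicted values using the likelihood function is based on the following expression:

$$D = -2\ln\left[\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}\right]. \tag{21}$$

In particular, to assess the significance of an independent variable, we compare the value of $D$ with and without the independent variable in the equation. The change in $D$ due to the inclusion of the independent variable in the model is

$$G = D(\text{model without the variable}) - D(\text{model with the variable}). \tag{22}$$

If the hypothesis that the $k$ coefficients associated with the added variable are all zero is true, then $G$ has the chi-square distribution with $k$ degrees of freedom [56]. It can be shown that a $100(1 - \alpha)\%$ equal-tailed confidence interval for $\beta_j$ is

$$\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\operatorname{se}}(\hat{\beta}_j). \tag{23}$$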
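The quantities in this subsection can be computed by hand from a fitted model. The R sketch below, on simulated data, obtains the Wald statistic, the change in deviance $G$, and an equal-tailed 95% confidence interval for one coefficient. The data-generating values and the names `reduced` and `full` are assumptions for illustration only.

```r
set.seed(3)

# Simulated data (hypothetical): coded age group x1 and gender x2
n  <- 1000
x1 <- sample(1:8, n, replace = TRUE)
x2 <- sample(1:2, n, replace = TRUE)
y  <- rbinom(n, 1, plogis(-2 + 0.15 * x1 - 0.3 * x2))

reduced <- glm(y ~ x1,      family = binomial(link = "logit"))  # without x2
full    <- glm(y ~ x1 + x2, family = binomial(link = "logit"))  # with x2

# Wald chi-squared statistic for the coefficient of x2
est <- coef(summary(full))["x2", "Estimate"]
se  <- coef(summary(full))["x2", "Std. Error"]
W   <- (est / se)^2
p_wald <- pchisq(W, df = 1, lower.tail = FALSE)

# Change in deviance when x2 enters the model (one added coefficient)
G    <- deviance(reduced) - deviance(full)
p_lr <- pchisq(G, df = 1, lower.tail = FALSE)

# Equal-tailed 95% confidence interval for the coefficient of x2
ci <- est + c(-1, 1) * qnorm(0.975) * se

print(c(W = W, p_wald = p_wald, G = G, p_lr = p_lr))
print(ci)
```

In large samples the Wald and likelihood ratio statistics usually lead to the same conclusion, although their numerical values differ.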

3.3. Study Setting

The National Road Safety Authority (NRSA) in Ghana, which is responsible for ensuring road safety and compliance of road regulation, has recently revealed that a total of 2,924 persons died in 2021 through road crashes. In addition, 15,972 road crashes were recorded within the same period, resulting in 13,048 injuries. We conducted this study in retrospect to past road accidents that occurred from January 2015 to December 2019 in Ghana. Secondary data in the form of recorded road traffic fatalities were obtained from the registry of the NRSA of Ghana for this study. Sixty-six thousand one hundred and fifty-nine (66,159) recorded casualties were included in the current study.

3.4. Data Processing and Analysis

The secondary data procured from the NRSA were tested for completeness. This was done by first keying all values into Microsoft Excel for cleaning, editing, and interpolating for missing values. Thereafter, numerical methods were applied to estimate the proportion of road traffic fatalities across various age groups and gender. The estimation of the model parameters was done using the R statistical application. The respective R programs can be found in appendices A and B.

Table 1 shows the data frame for the logit, probit, and the identity regression models, which is made up of 66,159 casualties who were involved in road traffic accidents in Ghana from 2015 to 2019.

The dependent variable for our models was the number of human deaths per road traffic accident in Ghana, measured on a nominal scale. It is named casualty (y) and coded as fatality ≡ 1 and injury ≡ 0. A human death includes passengers, pedestrians, road users, and any other person whose death was due to road traffic crash. Two main independent variables were used to assess robustness of our model, namely, age groups (x1) and gender (x2). The age groups were coded according to the increasing order of the fatality index of the age groups. This was necessary because of the use of fatalities per 100 casualties index (F.I.) as a requirement for characterization and comparison of the extent and risk of road traffic fatality [57, 58]. Consequently, age group (x1) was coded as follows: 0–5 years ≡ 7, 6–15 years ≡ 5, 16–25 years ≡ 3, 26–35 years ≡ 1, 36–45 years ≡ 2, 46–55 years ≡ 4, 56–65 years ≡ 6, and over 65 years ≡ 8. Finally, gender (x2) was measured on a nominal scale and coded as follows: male ≡ 1 and female ≡ 2.
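The coding scheme described above can be expressed as simple lookup tables. The following R sketch applies the stated codes to a small data frame. The four example records and the column names `age_group`, `gender`, and `outcome` are hypothetical and do not come from the NRSA registry.

```r
# Codes as stated in the text: age groups ordered by increasing fatality index
age_codes <- c("0-5" = 7, "6-15" = 5, "16-25" = 3, "26-35" = 1,
               "36-45" = 2, "46-55" = 4, "56-65" = 6, "over 65" = 8)
gender_codes <- c(male = 1, female = 2)

# Hypothetical casualty records used only to demonstrate the recoding
casualties <- data.frame(
  age_group = c("26-35", "over 65", "0-5", "16-25"),
  gender    = c("male", "female", "male", "female"),
  outcome   = c("injury", "fatality", "fatality", "injury"),
  stringsAsFactors = FALSE
)

# Apply the coding: X1 = age group, X2 = gender, Y = 1 for fatality, 0 for injury
casualties$X1 <- unname(age_codes[casualties$age_group])
casualties$X2 <- unname(gender_codes[casualties$gender])
casualties$Y  <- as.integer(casualties$outcome == "fatality")

print(casualties)
```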

4. Results and Discussion

4.1. Descriptive Statistics

It is observed from Table 1 that overall male fatalities outnumber female fatalities by an approximate ratio of 4 : 1 (78% vs. 21%) for the period 2015–2019. Considering that females constitute the majority of Ghana’s population, males are highly over-represented in road traffic fatalities. A similar conclusion was reached by Regev et al. [16] and Melchor et al. [18], who noted that men are more likely to suffer death in a road accident because of their higher patronage of motor transport for economic activities compared to women.

From Table 2, it can be observed that over the 5-year period, “over 65” is the age group with the highest national fatality rate. That is, about 32% of all road traffic casualties who were over 65 years lost their lives while 27% of casualties who were 5 years old or less died as a result of road traffic accidents. This finding is consistent with the works of Mcdoy et al. [59]; Vaa [19]; and Etehad et al. [60]. Vaa’s [19] findings reiterate that people aged 60 years and above may have their biological and psychological system deteriorating faster, and therefore, they are more likely to suffer death in a road accident. Similar explanation could be true for those 5 years and below since they may have a very weak or immature biological and psychological make-up, and therefore, they are more prone to death in case of a road accident. Table 2 shows the fatality indices of each of the eight age groups, computed based on road traffic accident data in Ghana from 2015 to 2019.
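The fatality index used here is the number of fatalities per 100 casualties. The R sketch below computes it for three age groups and ranks the groups in increasing order of the index. The counts are hypothetical placeholders chosen for illustration and are not the values in Table 2.

```r
# Hypothetical counts (not the study's Table 2 values)
counts <- data.frame(
  age_group  = c("0-5", "26-35", "over 65"),
  fatalities = c(54, 300, 64),
  casualties = c(200, 2000, 200)
)

# Fatality index: fatalities per 100 casualties
counts$fatality_index <- 100 * counts$fatalities / counts$casualties

# Rank in increasing order of the index (1 = lowest fatality index)
counts$rank <- rank(counts$fatality_index)

print(counts)
```

Ranking the groups by the index in this way mirrors how the age-group codes were assigned in increasing order of fatality index.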

4.2. Estimated Models

Three models were fitted to the data described in Table 1, namely, the logistic regression model, the probit regression model, and the linear probability model. Under the logistic regression model, the independent variables were introduced into the model stepwise to measure their individual contributions to the model. In models A, B, and C, respectively, we fitted the following models:

$$\text{Model A:} \quad \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1,$$

$$\text{Model B:} \quad \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2,$$

$$\text{Model C:} \quad \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2,$$

where $\pi = P(y = 1)$ is the probability of fatality, $x_1$ is the age group of the casualty, $x_2$ is the gender of the casualty, and $x_1 x_2$ is the interaction of the two variables. The objective is to predict the probability of road traffic fatality as a function of age group and gender as stated in model B and then compare it with model A, which has only age group as predictor. The analysis using model B assumes that there is no interaction between age group and gender. Therefore, to test for interaction, we compare the full model (model C) with model B. The coefficient estimates for the three models and the corresponding standard errors, together with their p values, are given in Table 3.

The results reveal that $\hat{\beta}_0$ and $\hat{\beta}_1$ are significant for all three logistic models in Table 3, with p values approximating 0.0000. The observed value of the deviance D (54847) is less than the corresponding chi-square critical value for model A, which suggests a good fit of the logit regression model. Similar conclusions can be drawn for models B and C.

The likelihood ratio test comparing model A and model B gave a test statistic of 70.2064 with a very small p value (less than 0.00001). We, therefore, reject the hypothesis that “there is no significant difference between the log-likelihood for model A and that of model B” and conclude that gender has a significant effect in predicting the probability of road traffic fatality. Thus, we would choose to use model B. This result supports the recent findings of Islam and Mannering [61] and Useche et al. [38]. Islam and Mannering’s [61] study confirmed significant differences in gender behavior in predicting the likelihood of a road traffic fatality. A similar conclusion was reached by Useche et al. [38], whose study supports the influence of gender in predicting risky road accidents resulting in deaths.

From model C, the interaction effect, with a p value of 0.983, is not significant in predicting the probability of road traffic fatality. Furthermore, a likelihood ratio test comparing model C, with interaction, to model B, without interaction, gave a test statistic of 0.00047 with a p value of 0.9827. Since the p value > 0.05, we conclude that model C, with interaction, is not significantly more accurate than model B, without interaction. Thus, we would choose to use model B instead.
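The sequence of nested comparisons (model A against B, then B against C) can be reproduced in one call with `anova()`. The R sketch below does so on simulated data. The data frame `rtf_sim`, the sample size, and the coefficient values are assumptions. The simulation includes no interaction effect, so its results illustrate the procedure only and do not reproduce the values reported in Table 3.

```r
set.seed(4)

# Simulated stand-in for the casualty data (hypothetical values)
n <- 2000
rtf_sim <- data.frame(X1 = sample(1:8, n, replace = TRUE),
                      X2 = sample(1:2, n, replace = TRUE))
rtf_sim$Y <- rbinom(n, 1, plogis(-2 + 0.15 * rtf_sim$X1 - 0.3 * rtf_sim$X2))

# Nested logistic models: A (age group), B (adds gender), C (adds interaction)
model_A <- glm(Y ~ X1,              family = binomial(link = "logit"), data = rtf_sim)
model_B <- glm(Y ~ X1 + X2,         family = binomial(link = "logit"), data = rtf_sim)
model_C <- glm(Y ~ X1 + X2 + X1:X2, family = binomial(link = "logit"), data = rtf_sim)

# Sequential likelihood ratio (deviance) tests: A vs. B, then B vs. C
lr_table <- anova(model_A, model_B, model_C, test = "Chisq")
print(lr_table)
```

Each row after the first reports the drop in deviance relative to the preceding model, its degrees of freedom, and the chi-square p value.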

In the estimation of the probit regression model, the objective was to predict the probability of road traffic fatality as a function of age group (x1) and gender (x2). We then compare it with logit model B, via a likelihood ratio test, to determine if there is any significant difference. The probit model was formulated as

$$\text{Model D:} \quad \Phi^{-1}(\pi) = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$

Appendix C presents an R function used in estimating the model parameters as given in Table 3. Results from Table 3 show that the model coefficients are all significant in predicting the probability of road traffic fatality across various age groups in Ghana. Moreover, a residual deviance of 54786 suggests a good fit of the probit model. A comparative analysis of probit model D with logit model B via the likelihood ratio test gave a test statistic value of 9.7934 with a p value of 0.001751, showing that model B is preferred. Finally, using the linear probability model, we fitted the identity regression model:

$$\text{Model E:} \quad \pi = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$

We fitted the identity regression model using the R code as specified in Appendix D. The coefficient estimates for model E and the corresponding standard errors together with the estimates of model B and model D are given in Table 3. Like the logit and probit regression models, the linear probability model is significant in predicting road traffic fatality in Ghana.

Comparing linear probability model E with logit model B via the likelihood ratio test, the value of the test statistic is computed as 53.69624 with a p value which is less than 0.001, indicating that the logistic regression model is preferred over the linear probability model. Table 4 shows the sample proportions of road traffic fatalities and the fitted values for the logistic, probit, and linear probability regression models across age groups and gender.

The fitted values for the logistic regression model, shown in Table 4, are similar to those for the linear probability and probit regression models. When the values are rounded to two decimal places, the probit and logistic regression models provide the same estimates.
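Fitted probabilities of the kind reported in Table 4 can be obtained with `predict(..., type = "response")`. The R sketch below fits logit and probit models to simulated data and tabulates the fitted probabilities for every combination of coded age group and gender. The data frame `sim` and its generating values are assumptions, so the numbers will not match Table 4. Because the two models are not nested, the AIC values printed at the end are offered only as a descriptive comparison.

```r
set.seed(5)

# Simulated stand-in for the casualty data (hypothetical values)
n <- 5000
sim <- data.frame(X1 = sample(1:8, n, replace = TRUE),
                  X2 = sample(1:2, n, replace = TRUE))
sim$Y <- rbinom(n, 1, plogis(-2 + 0.15 * sim$X1 - 0.3 * sim$X2))

logit_fit  <- glm(Y ~ X1 + X2, family = binomial(link = "logit"),  data = sim)
probit_fit <- glm(Y ~ X1 + X2, family = binomial(link = "probit"), data = sim)

# Fitted probabilities for each age-group code and gender code
grid <- expand.grid(X1 = 1:8, X2 = 1:2)
grid$logit    <- predict(logit_fit,  newdata = grid, type = "response")
grid$probit   <- predict(probit_fit, newdata = grid, type = "response")
grid$abs_diff <- abs(grid$logit - grid$probit)

print(round(grid, 3))
print(c(logit = AIC(logit_fit), probit = AIC(probit_fit)))
```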

5. Conclusion and Recommendation

In this study, we have formulated and estimated three generalized linear models that have shown strong statistical significance in predicting road traffic fatalities given some demographic factors in Ghana. First, the estimated logit regression model proved to be robust when fitted with the predictors gender and age group. However, the interaction of gender and age group did not show any significant effect in predicting the probability of road traffic fatality. Secondly, when compared to a probit regression model via a likelihood ratio test, the logit model gave better estimates. Finally, the logit model was preferred over a fitted linear probability model with the same predictor variables when a likelihood ratio test was conducted to compare the two models. These findings illustrate that a logistic regression model produced consistent probabilities of traffic fatalities which are very close to the actual probability values across age groups and gender. The findings imply that gender and age group have significant effects in predicting the probability of road traffic fatality. Specifically, the probability of fatality in an RTA was higher for males than for females. Also, the aged (over 65 years) were found to be more likely to suffer death in a road accident compared to other age categories.

Thus, it is recommended to road traffic regulators to consider the possibility of death based on victim’s gender and age group when formulating road safety interventions and regulations to reduce the growing deaths on the road. Road safety education campaigns through the media should be targeted to the male users of transport and the aged, particularly on the use of seat belts in vehicles, to reduce the possibility of death in any RTA.


A. R Codes Used for the Estimation of the Generalized Linear Models

# Model A: age group only
rtf.logit.Y_X1 <- glm(Y ~ X1, family = binomial(link = "logit"), data = rtf)
# Model B: age group and gender
rtf.logit.Y_X1_X2 <- glm(Y ~ X1 + X2, family = binomial(link = "logit"), data = rtf)
# Model C: age group, gender, and their interaction
rtf.logit.Y_X1_X2_X1X2 <- glm(Y ~ X1 + X2 + X1:X2, family = binomial(link = "logit"), data = rtf)

B. R Codes Used for the Estimation of the Model Parameters

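# Log-likelihoods of model A (age group only) and model B (age group and gender)
A <- logLik(rtf.logit.Y_X1)
B <- logLik(rtf.logit.Y_X1_X2)
# Likelihood ratio statistic and its chi-square p value (1 degree of freedom)
teststat <- -2 * (as.numeric(A) - as.numeric(B))
p.val <- pchisq(teststat, df = 1, lower.tail = FALSE)
p.val
# [1] 5.341424e-17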

C. R Functions Used in Estimating the Model Parameters Found in Table 3

rtf.probit.Y_X1_X2 <- glm(Y ~ X1 + X2, family = binomial(link = "probit"), data = rtf)

D. R Codes Used in Fitting the Identity Regression Model Parameters

rtf.identity.Y_X1_X2 <- glm(Y ~ X1 + X2, family = binomial(link = "identity"), data = rtf)

Data Availability

The data used to support the findings of this study are included within the article. However, upon a reasonable request from the corresponding author, the data used and/or analysed during the current study could be made available.

Conflicts of Interest

The authors declare that there are no conflicts of interest.