Abstract

This paper presents an investigation of the relationship between driver risk and factors indicating the vehicle’s speed and the driver’s acceleration behavior. The main objective is to examine whether GPS data and derivative indicators can be used to identify risky drivers by means of factor analysis. To this end, a real road driving experiment is conducted to collect data. Fifty drivers are asked to drive along a route which includes both rural highways and urban roads. The trajectories are recorded by GPS devices to calculate speed and derive acceleration measures. Drivers’ behavior is also recorded by cameras and analyzed by a separate group of volunteers to determine whether each driver is risky or not. The drivers are then classified into five groups with different levels of risk based on the scores obtained through factor analysis. The results are verified by the volunteers’ categorization and further evaluated by symbolic aggregate approximation. Finally, a binary logistic regression model is established for predicting high-risk drivers. The potential applications of this study include developing quantitative measures to identify risky drivers, especially for auto-insurance companies with usage-based insurance (UBI) applications, bus companies, and transport enterprises.

1. Introduction

With an ever-increasing number of cars on the roads, traffic safety has become a progressively more important research issue worldwide [1]. According to statistical data on road accidents in China [2], nearly 95% of road accidents are attributed to human factors and driver maneuvers. Among these factors, inappropriate speed control or speeding behavior is one of the riskiest factors contributing to road accidents [3]. Furthermore, factors like inappropriate speed control or speeding account for 3.62% of accidents, with death and injury rates of 5.59% and 3.34%, respectively [4]. Research has shown that there is a significant correlation between speed-related risk indices and historical accident data [5]. To this end, measures should be taken to monitor driver performance, particularly typical risky driving behaviors such as hard braking, abrupt acceleration, and speed-limit violation.

Usage-based insurance (UBI) or insurance telematics appears to have become increasingly popular in the insurance industry, especially among innovative auto-insurance companies worldwide [6]. Various UBI applications for personalized auto-insurance, including smartphone-based products and other on-board devices, are emerging to evaluate driving performance and determine the premiums or benefits according to the driving data collected during trips. Generally, the data used for UBI analysis include running period (time of day), location of vehicle, speed condition (e.g., average speed and speeding percentage), usage of gas pedal or brake pedal (e.g., the frequency of abrupt acceleration or hard braking), steering behavior, and total trip length [7, 8]. Furthermore, UBI schemes also provide an insight into other possible uses for the information gathered during the identification of risky driving behaviors. With the help of UBI schemes, drivers could receive performance feedback and understand factors that may stimulate or motivate them to behave aggressively. Some UBI schemes that have been implemented around the world are summarized in the work by Husnjak et al. [8] and Tselentis et al. [9].

The main contribution of this paper is to explore the relationship between driver risk and factors indicating the vehicle’s speed and the driver’s acceleration behavior in a real road driving study. The purpose is to examine whether GPS data and derivative indicators can be used to identify risky drivers through factor analysis. Categorization by volunteers and symbolic representation are both used to verify and evaluate the results of the case study. A binary logistic regression model is developed for predicting high-risk drivers. Moreover, for the purpose of identifying individual differences and comparing driving behavior among drivers, nearly the same trip length and driving period are selected for each participant.

The remainder of this paper is organized as follows. Section 2 discusses some previous work on the identification of risky driving behaviors through various monitoring systems or devices. Section 3 presents the variables extracted from GPS data and determines the threshold for abrupt acceleration and hard braking in the test. Section 4 describes the details of the real road driving experiment and introduces the rules for extracting risky events. Section 5 compares the risk scores obtained through factor analysis with the scores evaluated by volunteers and develops a binary logistic regression model for predicting high-risk drivers. In Section 6, symbolic aggregate approximation is used to evaluate the case study. Finally, Section 7 compares the methods and results described here with previous research and offers conclusions and some suggestions for future work.

2. Literature Review

Some previous research has been done on the identification of risky driving behaviors in normal driving situations, and some systems have been proposed that utilize smartphones, on-board data recorders, multiple sensors such as the Controller Area Network (CAN) bus, camera systems, 3-axis accelerometers, on-board diagnostics (OBD) readers, and GPS (Global Positioning System) receivers. Wang et al. [10] proposed a system with multi-sensors in a bid to detect risky driving events on the basis of vehicle dynamic parameters and driver physiological data. Imkamon et al. [11] developed a fuzzy logic system to infer the risk level of drivers with information collected from an OBD reader, accelerometer, and camera during a real road driving test. This method performed well when compared to passenger responses questionnaires, which were collected during the trips. A self-diagnosis system equipped with multi-sensors was proposed by Takeda et al. [12] to detect hazardous situations and potential risky events in terms of the rule-based and thresholding method. The number of risky driving behaviors was reduced by half following the use of this system.

The high cost and complexity of multi-sensor-based systems mean that it is not practical to deploy within a vehicle for daily users, and an efficient approach utilizing available sensors in smartphones or data recorders (including a GPS receiver or 3-axis accelerometer) is required. Unlike some costly methods, smartphones with some applications installed could also provide reliable information for identifying some typical risky driving behaviors, such as the percentage in excess of the legal speed limit, or the frequency of abrupt acceleration and hard braking [13]. Researchers Johnson and Trivedi [14] were able to determine driving style with smartphones by using the dynamic time warping algorithm. Zeeman and Booysen [15] used GPS data (speed and location) and acceleration information collected from minibus taxis to detect reckless driving and presented a reckless driving detection system to visualize the tracked vehicle and the driving behavior. Similarly, Fazeen et al. [16] utilized GPS data and acceleration information derived from smartphones to detect risky driving behaviors in different road conditions. The data used in this approach included vehicle position, maximum speed, maximum acceleration, and minimum deceleration. Results show that it is feasible to detect aberrant driving behaviors and identify risky situations automatically with smartphones, even in different road conditions.

Furthermore, Eboli et al. [17] investigated the relationship between drivers’ characteristics and driving behaviors by analyzing speed information derived from smartphones with embedded GPS chips through a real road driving test, and classified driving behaviors in terms of the travelling speed adopted by drivers. Also, Eboli et al. [18] utilized the information of longitudinal/lateral acceleration and speed derived from smartphones equipped with accelerometer and GPS receiver to identify the risk level of a driver, and made an overall judgment on the driving style based on objective measure (the percentage of external points at borderline of a safety domain) and subjective judgment (a questionnaire provided by the driver at the end of each survey). Considering that both speed and acceleration information can be derived from GPS data, devices with only a GPS receiver can be used for risky behavior identification. Therefore, this research will discuss the application of GPS-measured data for identifying some typical risky situations during a trip and determine the risk level of each driver with the help of video verification.

3. Variables and Data Analysis

In this test, GPS data and some derivative indicators are used to identify the risky events and evaluate the risk level of drivers. Therefore, a simple method that can detect typical risky situations efficiently is proposed here, and a data analysis is conducted in the following sections.

3.1. Average Speed

The instantaneous speed [19] is calculated between two consecutive GPS coordinates at a sampling frequency of 1 Hz, as shown in Eq. (1). The average speed is described in Eq. (2).

where the instantaneous speed and the average speed are in km/h, the distance d between two consecutive GPS coordinates is in meters, and the time t is recorded in seconds.
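As a minimal sketch of the speed calculation described above, the following assumes that the distance between consecutive fixes is taken as the great-circle (haversine) distance on a spherical Earth; the source does not state which distance formula was used. The function names and the Earth radius constant are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius (spherical model)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def instantaneous_speeds_kmh(fixes, dt_s=1.0):
    """Speed in km/h between consecutive fixes sampled every dt_s seconds.

    The factor 3.6 converts m/s to km/h.
    """
    return [3.6 * haversine_m(*a, *b) / dt_s for a, b in zip(fixes, fixes[1:])]


def average_speed_kmh(speeds):
    """Arithmetic mean of the instantaneous speeds."""
    return sum(speeds) / len(speeds)
```

With 1 Hz sampling, `dt_s` is 1 s, so each instantaneous speed is simply 3.6 times the distance in meters covered in that second.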

3.2. Average Acceleration and Acceleration Standard Deviation

Acceleration is the representation of the driver’s speed control operating the gas pedal or brake pedal and is closely associated with characteristic driving behavior. If a driver applies the gas pedal or brake pedal heavily and frequently, they are more likely to be considered an aggressive driver [20]. Furthermore, although the second derivative of the vehicle’s speed, or ‘jerk’, has been recognized as a risk indicator [21], it is not included in this paper. The average and standard deviation of the positive acceleration and negative acceleration (deceleration) are calculated based on the following equations:

where the acceleration at sample i (i = 1, 2, ···, n) is the derivative of the vehicle’s speed in m/s² and the sampling time interval is set to 1 s. In addition, the acceleration sequence is split into a positive sequence and a negative sequence, keeping only values whose absolute value is above 0.01 m/s².
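The acceleration measures described above can be sketched as follows. This is an illustration under stated assumptions: acceleration is taken as a simple finite difference of consecutive speed samples, and the population standard deviation is used (the source does not say whether the sample or population form was applied). The function names are hypothetical.

```python
from statistics import mean, pstdev


def accelerations(speeds_kmh, dt_s=1.0):
    """Finite-difference acceleration in m/s^2 from speeds given in km/h."""
    return [(v2 - v1) / 3.6 / dt_s for v1, v2 in zip(speeds_kmh, speeds_kmh[1:])]


def split_stats(acc, eps=0.01):
    """Mean and standard deviation of positive and negative acceleration.

    Values with |a| <= eps (0.01 m/s^2 in the text) are ignored.
    pstdev (population form) is an assumption of this sketch.
    """
    pos = [a for a in acc if a > eps]
    neg = [a for a in acc if a < -eps]

    def stats(xs):
        return (mean(xs), pstdev(xs)) if xs else (0.0, 0.0)

    return {"acceleration": stats(pos), "deceleration": stats(neg)}
```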

3.3. Event Rate and Duration of Acceleration or Deceleration Exceeding a Certain Value

Generally, the more frequently abrupt acceleration or hard braking occurs, the riskier the driver is [22]. The thresholds reported in the literature for abrupt acceleration or hard braking vary [23]. The threshold is taken to be ±0.3 g by Mortimer et al. [24] and Anderson & Baldock [25], while it is set at ±0.5 g by Fazeen et al. [16]. Deceleration (in absolute value) exceeding 0.49 g [26] or greater than 0.4 g in the euroFOT project [27] is regarded as hard braking, while deceleration above 0.15 g is considered a moderate risk level [28]. Bergasa et al. [29] used a threshold of 0.4 g in both longitudinal and lateral directions to identify risky driving behavior and utilized the frequency of risky events at different levels to determine the risk score of drivers. In contrast, the threshold for abrupt acceleration or hard braking used in some studies is 0.1 g [30], 0.0875 g [31], or 0.15 g [32]. In addition, according to the findings in Zeeman and Booysen [33], Eboli et al. [34], and Eboli et al. [35], speed or road condition should be considered when characterizing drivers’ behavior, and the acceleration threshold is found to decrease as speed increases. Also, the threshold range for positive acceleration on city roads in Brussels, as defined in Vlieger et al. [36], is 0.85–1.10 m/s² for aggressive driving. Based on this, thresholds of 0.12 g, 0.1 g, and 0.0875 g are used here for abrupt acceleration or hard braking (in absolute value) on three different kinds of routes: urban road, city ring road, and expressway, respectively.

Correspondingly, the average duration for which the acceleration or deceleration threshold is violated is considered along with the event rate in this paper. An event of abrupt acceleration (or hard braking) is defined as follows: only if the acceleration is above (or the deceleration is below) the threshold identified for the given kind of route and remains so for a continuous period of time T (T ≥ 2 s) is it counted as one abrupt acceleration (or hard braking) event. The cumulative numbers of abrupt acceleration and hard braking events are recorded separately. Likewise, the average durations of acceleration and deceleration exceeding the threshold on the different kinds of routes are recorded separately. Furthermore, the event rate is calculated as the average number of acceleration or deceleration events exceeding the threshold on the different kinds of routes per 10 km traveled:

where the two cumulative-time terms represent the cumulative time of acceleration and of deceleration exceeding the threshold on the different kinds of routes.
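The event definition above (threshold exceeded for a continuous period of at least 2 s, rate per 10 km) can be sketched as a run-length scan over a 1 Hz acceleration series. This is an illustrative sketch, not the authors' implementation: the function names, the value g = 9.81 m/s², and the convention that a run of k consecutive samples lasts k times the sampling interval are all assumptions.

```python
G = 9.81  # m/s^2, assumed value of standard gravity

# Thresholds from the text, in g, per route type (dictionary keys are illustrative).
THRESHOLDS_G = {"urban": 0.12, "ring": 0.10, "expressway": 0.0875}


def exceedance_events(acc_ms2, threshold_g, min_duration_s=2.0, dt_s=1.0):
    """Durations (s) of runs where acceleration stays above the threshold.

    A run is counted as one event only if it lasts at least min_duration_s.
    For hard braking, pass the negated series so deceleration is positive.
    """
    limit = threshold_g * G
    durations, run = [], 0
    for a in list(acc_ms2) + [0.0]:  # trailing 0.0 closes a run at the end
        if a > limit:
            run += 1
        else:
            if run * dt_s >= min_duration_s:
                durations.append(run * dt_s)
            run = 0
    return durations


def event_rate_per_10km(n_events, distance_km):
    """Number of events per 10 km traveled."""
    return 10.0 * n_events / distance_km


def mean_duration(durations):
    """Average event duration in seconds (0.0 if there are no events)."""
    return sum(durations) / len(durations) if durations else 0.0
```

For example, two qualifying events over 25 km of urban road give an event rate of 0.8 per 10 km.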

3.4. Maximum Speed, Acceleration, and Deceleration

The maximum speed, acceleration, and deceleration (in absolute value) are selected from durations lasting for at least three seconds to avoid a sudden extreme value.

The values of the GPS data and derivative indicators mentioned above are provided in Table 1.
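One possible reading of the three-second rule above is that a value counts as the maximum only if the signal stays at or above it for three consecutive 1 Hz samples. The sketch below implements that interpretation, which is an assumption; the source does not spell out the exact procedure, and the function name is hypothetical.

```python
def sustained_max(values, window=3):
    """Largest level held for `window` consecutive samples.

    For each window the minimum is the level sustained throughout it;
    the result is the largest such level, so isolated spikes are ignored.
    """
    if len(values) < window:
        raise ValueError("series shorter than the window")
    return max(min(values[i:i + window]) for i in range(len(values) - window + 1))
```

Applied to deceleration, the series would first be converted to absolute values so that larger numbers mean harder braking.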

4. Methodology

4.1. Participants and Routes

A total of 50 drivers (three of them female) aged 27 to 50 years were recruited to participate in the experiment. The average driving experience of the participants was 14.0 years with a standard deviation of 8.3 years. All participants were asked to drive as they usually do in everyday driving, and the test session lasted for 90–110 minutes for each participant. The dataset used here was collected using an instrumented vehicle in real traffic between Huzhou and Changxing and between Yanta and Lantian in China, with similar road conditions at both test sites. The participants mainly drove three kinds of routes at both test sites: urban road, city ring road, and expressway, with speed limits ranging from 40 to 50 km/h, 70 to 90 km/h, and 100 to 110 km/h, respectively. The test routes between Yanta and Lantian are provided as an example in Figure 1 using map-matching, where urban road, city ring road, and expressway are marked in green, yellow, and red, respectively. In addition, traffic jams and congestion rarely occurred on the expressway section, but several traffic lights and crossroads were present on the city ring road.

4.2. Apparatus

Tests were conducted to evaluate the accuracy of speed and acceleration obtained from GPS devices, as outlined by Zito et al. [37]. Modern GPS devices, however, can obtain the position and speed of a vehicle with much higher accuracy. In this test, the speed of the subject vehicle was extracted using a Differential Global Positioning System (DGPS) with an accuracy of 0.1 km/h (resolution of 0.01 km/h), and the data were recorded at a sampling rate of 1 Hz. This test also provides an opportunity to explore the validity of GPS measurements, since acceleration data from an inertial sensor are available along with the GPS data. Figure 2 shows the GPS and inertial measurements from a single trip over time. Visually, these two data streams seem to correlate well. In addition, six cameras installed on the instrumented vehicle (Figure 3) were used to record the front view, rear view, two side views, the driver’s face, and the pedal area. The videos recorded by these cameras were mainly used to pick out typical risky events (the relevant extraction rules are introduced in Section 4.3). Note that the purpose of this research is to detect characteristic risky events and identify risky drivers with the data extracted from GPS; in addition to GPS data and derivative indicators, inertial measurements (for the comparison of accuracy) and radar data (for extracting risky situations), together with video verification, were used to assist the GPS identification.

4.3. Risky Event Extraction

According to the findings of a risky driving report [38] on the 100-Car Naturalistic Driving Study (lasting for 12 to 13 months with 109 drivers participating), the number of crashes and near crashes was 69 and 761, respectively. In comparison, crash and near crash events rarely exist in the dataset of this paper. Also in many cases, crashes or near crashes may not actually happen due to the prompt reaction of traffic participants from both sides to avoid them (e.g., pedestrian stopping first before entering the roadway or another vehicle taking evasive actions). Therefore, it is also of great importance to detect and record those potential risky situations or events to improve the overall traffic safety, as frequent risky maneuvers or events may lead to crashes or near crashes at any time.

For the purpose of extracting potential risky situations, one data reductionist watched videos from the front-view camera and manually extracted all risky events from the 50 participants according to basic rules. Risky events include: a vehicle cutting in suddenly with a longitudinal gap lower than 10 meters while the subject vehicle’s speed is greater than 20 km/h; frequently following too closely to the lead vehicle in urban areas; running on the line; taking no evasive action or reacting too late; other vehicles bursting into view at poor-visibility intersections; driving through an intersection without traffic lights; failing to notice someone at a pedestrian crossing; failing to give way to vehicles with priority; zigzag driving; and abrupt turning or hard braking [21, 38–40].

After extracting the risky event video segments from the 50 drivers, ten volunteers (five professors and five PhD students) mainly engaged in traffic safety research were recruited to evaluate the severity level of those risky events. They were asked to complete a questionnaire after viewing the video segments extracted from the 50 drivers; the evaluation took nearly four hours for each volunteer to complete (two sessions were carried out on consecutive days). The risk level ranged from 1 to 3, with 1 indicating the least risky driving behavior and 3 indicating the most risky driving behavior in the test. Note that although some near crashes and one minor crash occurred in this test, the authors believe that most of the risky events in level 3 corresponded to the severity of incidents described in the risky driving report [38]. Following this, the output from the questionnaires was compared with the risk scores calculated through factor analysis. The factor analysis was conducted with SPSS 19.0 software.

5. Results

5.1. Factor Extraction and Analysis

Before conducting a factor analysis, it is necessary to check whether this method is appropriate for the dataset used here. Note that the measure of Kaiser–Meyer–Olkin (KMO) could verify the sampling adequacy for the factor analysis. In this analysis, KMO = .774 (Table 2), which is greater than .70 and well above the acceptable threshold of .5 [41]. In addition, the results of the sphericity test (Table 2) illustrate that the significance level is close to zero, showing that it is suitable to conduct a factor analysis. The Pearson correlation coefficients among all pairs of variables are provided in Table 3. One can see that these variables are not completely independent of each other and some variables are even strongly correlated with other features. Therefore, it is appropriate to reduce the dimension of variables through factor analysis. The 12 original variables were thus reduced into a subset of principal components (PCs) for better interpretation by using their combinations.

Communality indicates the proportion of common variance within each variable that could be explained by the extracted factors. According to the factor analysis results, the communalities of the 12 variables are all greater than .7, so the factors extracted through factor analysis could reflect most of the information about the original variables, and Kaiser’s criterion, the SPSS software default for extracting factors, is suitable [41].

In addition, considering the eigenvalues represent the amount of variation that could be explained by a factor, in theory a larger eigenvalue should be retained and Kaiser [42] recommends retaining factors with eigenvalues greater than 1. Kaiser’s criterion is based on the idea that an eigenvalue of 1 represents a substantial amount of variation, and the same criterion is used in this analysis. Figure 4 shows the eigenvalues in the descending order through a scree plot, and the contribution ratio of several principal components (or factors) together with a cumulative curve is shown in Figure 5. Considering the eigenvalues should be greater than 1 and the cumulative contribution was above 80 percent within three factors, the three factors were retained. Furthermore, each contribution ratio of the first three factors before and after rotation is depicted in Table 4.
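The retention rule described above (keep factors whose eigenvalue exceeds 1 and check their cumulative contribution) can be sketched in a few lines. The eigenvalues in the usage note are hypothetical placeholders chosen only to sum to 12 for 12 standardized variables; they are not the values behind Figure 4, and the function name is an assumption.

```python
def kaiser_retain(eigenvalues):
    """Apply Kaiser's criterion to a list of eigenvalues.

    Returns the number of factors with eigenvalue > 1 and the share of
    total variance that those retained factors explain together.
    """
    total = sum(eigenvalues)
    retained = [ev for ev in sorted(eigenvalues, reverse=True) if ev > 1.0]
    return len(retained), sum(retained) / total
```

For instance, with the hypothetical eigenvalues `[5.6, 2.6, 1.6, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]`, three factors are retained and together they explain a little over 80 percent of the variance.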

One can observe that although the same cumulative contribution is achieved for the principal component analysis and factor analysis, the contribution ratio of each component became more balanced after rotation. Note that rotation components (RCs) obtained by orthogonal rotation were more explanatory than PCs and could explain most of the original variables from different perspectives, which is shown in Table 5. A factor loading matrix obtained from the principal component analysis and factor analysis is also included in Table 5. According to the factor loading matrix, the variables that clustered on the same factor suggest that factor RC1 mainly represents acceleration information, factor RC2 represents deceleration information, and factor RC3 denotes speed-related information.

5.2. Comparison of Driving Risk Scores and Outputs from Questionnaires

In this section, the factor scores are calculated using the formula shown in Eq. (12).

where n is the number of initial variables, which is 12 in this test. The number of principal factors retained is m, which is 3 in this study. In addition, the input variables are the standardized values of the GPS data or derivative indicators (listed in Table 1), and the weights are the entries of the component score coefficient matrix calculated automatically by the SPSS software, which is useful in understanding how the factor score has been computed. The factor scores were recorded in three new columns of data (one for each rotation factor RC) labelled F1, F2, and F3, respectively. The risk score was calculated (Table 6) based on the following equation:

where the three weights are the contribution ratios of variance after rotation, with values of .4658, .2184, and .1359 for F1, F2, and F3, respectively (Table 4). The higher the sum score is, the riskier the driver is. Figure 6 shows the distribution of the risk score obtained through factor analysis. A nonparametric test indicates that the risk score follows a normal distribution (Table 7). One can see from Table 7 that the significance value (.595) is greater than 0.05. Therefore, the null hypothesis that the risk score conforms to the normal distribution cannot be rejected.
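As a minimal sketch of the weighting step above, the risk score can be computed as a weighted sum of the three factor scores using the rotated contribution ratios quoted in the text. Whether the original equation also divides by the sum of the weights is not recoverable from this version, so the plain weighted sum here is an assumption, as is the function name.

```python
# Contribution ratios of variance after rotation, as quoted in the text (Table 4).
WEIGHTS = (0.4658, 0.2184, 0.1359)


def risk_score(f1, f2, f3, weights=WEIGHTS):
    """Weighted sum of the factor scores F1, F2, and F3 (assumed unnormalized)."""
    return sum(w * f for w, f in zip(weights, (f1, f2, f3)))
```

Because normalizing by the sum of the weights would only rescale every driver's score by the same constant, the ranking of drivers is the same under either form.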

In addition, as mentioned in Section 4.3, ten volunteers (five professors and five PhD students) mainly engaged in traffic safety research were recruited to evaluate the severity level of risky events from the 50 participants by viewing video segments. A risk level was assigned to each event, with a score ranging from 1 to 3. The scores of risky events evaluated by volunteers are shown in Table 8. Kendall’s coefficient of concordance (also known as Kendall’s W) is a measure of agreement among judges. In the Kendall’s W test, the value of Kendall’s W was 0.615 and was significant (significance of .000), which means that there is good agreement among the scores evaluated by volunteers [43]. In order to reduce the variance, for each risky event an average score was taken from the ten volunteers after removing the maximum and minimum scores and was then rounded to the nearest whole number. The number of risky events at each level (score) and the sum score are shown in Table 9.
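The per-event aggregation described above (drop the highest and lowest of the ten ratings, average the rest, round to a whole level) can be sketched as follows. The tie-breaking rule for values ending in .5 is not given in the source; rounding half up is an assumption of this sketch, and the function name is hypothetical.

```python
def consensus_level(scores):
    """Trimmed-mean risk level for one event from the volunteers' ratings.

    Removes one maximum and one minimum rating, averages the remainder,
    and rounds to the nearest whole number (half rounds up, by assumption).
    """
    if len(scores) < 3:
        raise ValueError("need at least three ratings to trim both extremes")
    trimmed = sorted(scores)[1:-1]
    avg = sum(trimmed) / len(trimmed)
    return int(avg + 0.5)
```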

To check whether the score acquired through factor analysis and the score evaluated by volunteers are highly correlated, both the risk score obtained through factor analysis (Figure 7) and the score evaluated by volunteers are normalized to a range from 1 to 5 (the larger the value, the riskier the driver), without changing the original distribution of either score. The correlation between the volunteers’ score and the score obtained through factor analysis is illustrated in Figure 8, with a Pearson’s correlation coefficient of 0.52. This result indicates that the risk score obtained through factor analysis has a relatively high correlation with the score evaluated by volunteers. Furthermore, when the score evaluated by volunteers is taken as a benchmark, it can be observed that the score obtained through factor analysis is more consistent with the volunteers’ score at the high-risk level (≥4) than at the low-risk level (≤3). Considering that the aim and focus of similar research is to identify high-risk drivers rather than low-risk ones, it is feasible to determine risky drivers by using the approach described here.

5.3. Driving Risk Model Establishment

After the risk level of each driver was identified through factor analysis and then verified by the volunteer’s categorization, a binary logistic regression model was developed to evaluate the probability of being a high-risk driver based on the RC factors.

Research has shown that there is a significant correlation between speed-related risk indices and historical accident data [5]. To this end, the accident record of each driver during the last three years was considered as the dependent variable. The model setup is as follows. Define Yi as the accident-record indicator for driver i.

The observed Yi has two categories: driver with and without a record of car accident during last 3 years, which is assumed to follow a Bernoulli distribution.

Let pi be the probability of being a high-risk driver for driver i. This probability is associated with a set of continuous covariates by a logit link function,

where the coefficient vector contains the regression parameters and the covariate vector contains the RC factors (predictors) for driver i. A driver will be predicted as a high-risk one if pi is above a predefined threshold p0 (0.5 by default).
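For reference, the standard form of a binary logistic regression with a logit link, consistent with the setup described above, can be written as follows. The symbols β0 to β3 and the inclusion of an intercept are notational assumptions of this sketch; the fitted coefficient values reported in Table 10 are not reproduced here.

```latex
Y_i \sim \mathrm{Bernoulli}(p_i), \qquad
\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1\,\mathrm{RC}_{1,i} + \beta_2\,\mathrm{RC}_{2,i} + \beta_3\,\mathrm{RC}_{3,i},
\qquad
p_i = \frac{1}{1 + \exp\!\bigl[-(\beta_0 + \beta_1\,\mathrm{RC}_{1,i} + \beta_2\,\mathrm{RC}_{2,i} + \beta_3\,\mathrm{RC}_{3,i})\bigr]}
```

Under this form, exp(βj) is the odds ratio for a one-unit increase in factor RC_j, and driver i is classified as high-risk when p_i exceeds p0.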

The null hypothesis of the model is that the RC factors have no effect on model outputs, and the model outputs are listed in Table 10. One can see from Table 10 that all three RC factors improve the model significantly at a significance level of 0.1, and variables ‘RC_1’ and ‘RC_2’ are significant predictors at a significance level of 0.05. Note that the best effect size to use in the context of logistic regression is the odds ratio. One can also see from Table 10 that the odds ratios of the three RC factors are all greater than 1, which indicates that as the predictor increases, the odds of the outcome occurring increase. Finally, the predictive model is developed as follows:

where RC_1 represents acceleration information, RC_2 represents deceleration information, and RC_3 denotes speed-related information.

6. Experimental Evaluation

In this section, the method of Symbolic Aggregate Approximation (SAX) proposed by Lin et al. [44] is used to evaluate and validate the results of the case study. As a time-series analysis method, SAX is commonly used to reduce dimensions by converting the continuous values to Piecewise Aggregate Approximation (PAA) representations. Furthermore, SAX also supports data visualization and allows a distance measure defined on the symbolic representation that lower-bounds the corresponding distance measure on the original time series [45].

To this end, the speed and acceleration data used in this paper were analyzed again with the SAX algorithm, and the major notation concerning this algorithm is summarized in Table 11. Following the steps of the SAX algorithm, a time series L of length n must be transformed into a string of length w (w < n, typically w << n). The time-series data of length n were first normalized to a mean of zero and a standard deviation of one before being converted into a symbolic representation. Next, the standardized time series was reduced to a w-dimensional vector on the basis of the PAA method. The kth element of this vector was calculated by the following equation:

In other words, to reduce an original time series L to w dimensions, the data are divided into w equal-sized segments and each segment is replaced by its mean value. Breakpoints that divide the area under the Gaussian curve into equal-sized regions are then determined, and the PAA representation is mapped into symbols (a discrete string) with reference to the breakpoint lookup table (see Table 12). Further analysis can be conducted on the symbolic representation instead of the original series.
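The three SAX steps described above (normalization, PAA, symbol mapping) can be sketched as follows for an alphabet of four letters. This is an illustrative sketch under stated assumptions: the series length is divisible by w, the population standard deviation is used for normalization, and the breakpoints are the commonly tabulated Gaussian values for an alphabet size of four rather than values copied from Table 12.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Commonly tabulated equiprobable Gaussian breakpoints for a 4-letter alphabet.
BREAKPOINTS_4 = [-0.67, 0.0, 0.67]


def znorm(series):
    """Normalize to zero mean and unit standard deviation.

    Raises ZeroDivisionError for a constant series; handling that case
    is left out of this sketch.
    """
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal segments."""
    n = len(series)
    if n % w:
        raise ValueError("this sketch assumes len(series) is divisible by w")
    seg = n // w
    return [mean(series[k * seg:(k + 1) * seg]) for k in range(w)]


def sax(series, w, breakpoints=BREAKPOINTS_4):
    """Map a numeric series to a SAX word of length w."""
    letters = (chr(ord("a") + bisect_right(breakpoints, v)) for v in paa(znorm(series), w))
    return "".join(letters)
```

A steadily increasing series such as `[1, 1, 2, 2, 3, 3, 4, 4]` with `w=4` maps to the word `abcd`, since each successive segment mean falls into the next Gaussian region.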

In this approach, both speed and acceleration data with a length of 1000 s were taken as an example to briefly illustrate the SAX algorithm. After normalization, the time series of speed and acceleration were converted from [0, 70] km/h to [−2, 3] and from [−2, 1] m/s² to [−6, 4], respectively. The normalized speed and acceleration data were then transformed into 10 and 20 segments based on the PAA method. Finally, these segments were replaced with four letters (a, b, c, and d), and the original speed and acceleration data were reduced to symbols with the representation of aaccbacddd and ccbbcbbcbcbbcbccbbcc (Figure 9).

As such, the speed and acceleration data from the 50 drivers were sorted and encoded as a whole with a sequence from 1 to 50. The length n of each original time series is nearly 1.5 h to 2 h. Given that the dataset from 50 drivers is too large to display with the original time-series representation depicted in Figure 9, the distribution of the speed and acceleration data from the drivers was displayed instead (Figure 10). In addition, a discrete string with an alphabet size of nine was selected for each distribution with the SAX algorithm.

To the authors’ knowledge, a driver is more likely to be considered a high-risk driver when applying the gas pedal or brake pedal heavily and frequently, especially at a higher running speed. This study evaluates whether a driver is risky or not in this regard, although this does not mean that scenarios in which a driver is operating at lower speeds (e.g., on urban roads) are considered safe.

According to the symbolic representation of the speed and acceleration data from 50 drivers depicted in Figure 10, a vehicle’s speed greater than 88 km/h, with the representation of h or i, together with acceleration or deceleration (in absolute value) exceeding 0.5 m/s2 with representation of a or i, was taken into consideration; that is, the symbolic representation of ha, hi, ia, or ii was considered as a risky driving event. Analysis was then conducted for each driver in the same approach, and the number of risky events created by the 50 drivers was calculated. The results show that a total of 2,585 risky driving events were generated from the driving data of 50 drivers, with an average number of 51.7 for each participant. Note that the number of risky driving events from drivers (no. 3, 5, 9, 12, 19, 27, and 33) accounted for 46.1% (namely 1191), and these seven drivers were from the high- or medium-risk group. The number of risky driving events from drivers (no. 4, 6, 11, 13, 14, 17, and 24) only accounted for 1.2% (namely 31), and these seven drivers were all from the low-risk group. The results obtained on the basis of the SAX algorithm can also verify the case study by using the number of risky driving events described here, which provides a technical means to identify risky drivers from a different perspective.
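The counting rule above can be sketched as a scan over aligned speed and acceleration symbol strings: a position counts as a risky event when the speed symbol is h or i and the acceleration symbol is a or i. The function name and the assumption that the two strings are aligned position by position are illustrative; the shares quoted in the text are recomputed from the reported counts as a consistency check.

```python
def count_risky(speed_symbols, accel_symbols):
    """Count positions whose symbol pair is ha, hi, ia, or ii."""
    return sum(
        1
        for s, a in zip(speed_symbols, accel_symbols)
        if s in "hi" and a in "ai"
    )


# Consistency check on the figures reported in the text.
TOTAL_EVENTS = 2585
share_high = round(100 * 1191 / TOTAL_EVENTS, 1)  # drivers 3, 5, 9, 12, 19, 27, 33
share_low = round(100 * 31 / TOTAL_EVENTS, 1)     # drivers 4, 6, 11, 13, 14, 17, 24
mean_per_driver = TOTAL_EVENTS / 50
```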

7. Discussion and Conclusion

The methods and results generated in this study are compared with previous research in this section. In work by Wu et al. [46], GPS data of connected vehicles were collected, and factor analysis and cluster analysis were conducted to aggregate commercial vehicle drivers into different groups with four speed-related clusters. However, except for two drivers, the risk level of drivers in each cluster was not the same, and neither a prediction model for risky drivers nor validation research was presented. The SAX algorithm is commonly applied in motif discovery and anomaly detection, in areas such as medical scenarios and power consumption [47], but it is rarely used in the field of driving safety analysis. Additionally, according to the findings in Zeeman and Booysen [33], Eboli et al. [34], and Eboli et al. [35], speed or road condition should be considered when characterizing drivers’ behavior, and the acceleration threshold is found to decrease as speed increases. The threshold for abrupt acceleration or hard braking (in absolute value) is set here at 0.12 g/0.1 g/0.0875 g on the three different kinds of routes, which is in line with the thresholds described in Vlieger et al. [36], Boonmee and Tangamchit [31], and Paefgen et al. [30]. However, the threshold is taken to be 0.3 g by Mortimer et al. [24] and Anderson & Baldock [25], 0.4 g in the euroFOT project [27], and 0.5 g by Fazeen et al. [16]. This difference could be attributed to the limited dataset (only 1.5 h to 2 h of driving data) for each participant used in this test when compared with other larger datasets (e.g., the euroFOT project).

This paper explored the relationship between driver risk and factors indicating the vehicle’s speed and the driver’s acceleration behavior, and examined whether risky drivers can be identified by using GPS data and derived indicators collected through a real road driving experiment. Fifty drivers were asked to drive a route that included both rural highways and urban roads, and their behaviors were recorded by cameras. All potentially risky situations from the 50 drivers were extracted by one data reductionist using some basic rules, and the video segments were then viewed by another group of volunteers to evaluate the risk level of each event and determine whether the driver was risky or not. The risk scores obtained through factor analysis were then compared with those assigned by the volunteers, and the SAX algorithm was used to validate the case study through normalization, discretization, and symbolization. The results demonstrated that three factors closely associated with speed and acceleration measures were extracted by reducing the dimensions, and the drivers were classified into five groups with different levels of risk based on the risk score obtained through factor analysis. In addition, the risk score obtained through factor analysis had a relatively high correlation () with the score evaluated by volunteers. Drivers no. 3, 5, 9, 12, 19, 27, and 33 from the high- and medium-risk groups accounted for 46.1% of the total risky driving events, while drivers no. 4, 6, 11, 13, 14, 17, and 24 from the low-risk group accounted for only 1.2%. This kind of research could provide a technical means to identify risky drivers by using signals easily obtained through conventional sensors, especially for auto-insurance companies with UBI applications, bus companies, and transport enterprises.

It is acknowledged that the small dataset of 50 drivers, with only 1.5 h to 2 h of driving data for each participant, is a limitation of this study. Furthermore, although the same threshold for rapid acceleration or hard braking was used in research by Vlieger et al. [36], Boonmee and Tangamchit [31], and Paefgen et al. [30], the threshold value used here is still relatively low compared with other studies. Note that the threshold also depends on the dataset size to some extent, which is why a lower threshold was used here. However, it was still possible to identify risky drivers because nearly the same trip length and running time were selected for each participant. Almost all participants recruited in this test were male drivers, which means that the findings may only represent this specific driver population.

To expand on this study, future work should investigate how the identification of risky driving is affected by the driver’s experience, gender, or type of instrumented vehicle (e.g., car vs. bus). Examining the variability within individual driver data using a larger dataset is also necessary. Furthermore, identifying individual differences in more specific areas such as car-following and lane-changing situations will support the development of advanced driver assistance systems that adapt to driving styles.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China [2018YFB1600501], the Changjiang scholars and innovative project team development plan [IRT_17R95], the National Natural Science Foundation of China [51775053], and the Natural Science Foundation of Shaanxi Province of China [2018JM5158].