#### Abstract

Analyzing driving style is useful for developing intelligent vehicles. Previous studies usually characterize driving styles either by statistical features of the measured driving data (e.g., the means and standard deviations of brake pressure) or by manually defining the number of patterns divided by behavior semantics. In this paper, we propose a driving style analysis that describes personalized driving styles from time-series driving data and estimates the levels from the data rather than specifying them in advance. First, range, range rate, and acceleration are selected as three feature variables to describe car-following scenarios. Then, the car-following data are normalized to reduce the influence of the different variable scales on the segmentation results. The hidden Markov model (HMM) and the finite mixture of hidden Markov models (MHMM) are adopted to extract behavior semantics. Compared with the HMM, the MHMM can identify the heterogeneity of the data and thus provides more reasonable primitive driving patterns. Based on the results, this study uses *K*-means clustering to label all the driving patterns semantically and identifies a total of 75 different driving patterns. We use normalized frequency distributions to describe personalized driving behavior characteristics and evaluate the similarity of driving styles using the Kolmogorov–Smirnov test. The proposed approach is useful for exploring the characteristics of driving habits.

#### 1. Introduction

Human factors play an important role in traffic safety: about 80% of traffic accidents are mainly caused by human factors [1], which include psychology, physiology, and driving styles [2]. Driving styles are the habitual manipulation behaviors of drivers [3], and they also influence the traffic flow. In addition, driving styles are closely related to fuel economy [4, 5], ride comfort [6], and vehicle risk [7, 8]. Some researchers focus on distinguishing and classifying driving styles, and most studies divide drivers into several categories based on statistical metrics of vehicle operation data. For example, Murphey et al. [9] applied an online style classification system to analyze the rate of change of acceleration and deceleration and then classified drivers into three categories: calm, normal, and aggressive. Meanwhile, Xu et al. [10] selected the steering wheel position to describe the variability of driving styles and built a driving style analysis model by improving neural networks. Such studies often simply assign all drivers to a few classes, which may neglect the potential heterogeneity of driving behaviors. Therefore, it is important to conduct a detailed analysis of personalized driving style. Personalized driving style analysis can capture the characteristics of drivers’ driving habits and help drivers clearly understand the risks of their driving behaviors so that they can improve those behaviors in a targeted way. Personalized driving style analysis is also a core research component in developing human-like intelligent driving technology.

To obtain a personalized driving model, many studies focus on statistical indicators of driving data (e.g., the means and standard deviations of brake pressure and throttle position). Shi et al. [11] used naturalistic driving data to train the parameters of a radial basis function and obtained a personalized driver model by using a locally designed neural network and real-world vehicle test data. Wang et al. [12] proposed a driving style recognition method based on a conditional kernel density function and Euclidean distance, which divides driving styles into seven classes from normal to aggressive and infers the driving style of a driver at each moment from probability values. These studies adopt statistical indicators of driving data to describe driving styles; such indicators easily capture static driving habits but cannot capture the dynamic driving decision-making process.

Grouping driving behaviors into simple primitive driving patterns can improve the precision of personalized driving style analysis. Nilsson et al. [13] presented a low-complexity lane-change maneuver algorithm to calculate and determine if, when, and how to perform lane-change maneuvers. Many studies define driving behavior semantics based on predefined templates. For example, Taylor et al. [14] used a dynamic time warping model to estimate the parameters of the car-following model and then described the heterogeneity of driving styles. Meanwhile, some studies defined the driving behavior semantics by the changes in indicators (e.g., standard deviation) or derived signals (e.g., hidden states of hidden Markov models). Agamennoni et al. [15] proposed a method for automatically finding the boundaries between driving sequences, inferred the parameters of a multivariate linear driving model based on the maximum likelihood, and segmented driving behavior sequences using sticky hierarchical Dirichlet process HMMs. In this paper, we introduce a specific mixture of HMMs to analyze time-series data. It avoids the traditional behavioral semantic segmentation method and characterizes the levels used to analyze these data better while adapting to the population. In addition, using several HMMs takes the dependency over time into account and thus improves on the traditional method based on cutoff points [16].

The rest of this paper is organized as follows. Section 2 introduces the hidden Markov model and the finite mixture of hidden Markov models. Section 3 describes the data and the preliminary data analysis. Section 4 presents the results of behavioral semantic segmentation and personalized driving style analysis. Finally, Section 5 provides conclusions and future work for this research.

#### 2. Methodology

##### 2.1. HMM Model

The hidden Markov model (HMM) is a doubly stochastic process in which the state process is hidden. In general, the observation sequence is used to infer the unobservable state sequence [17].

As shown in Figure 1, node $s_t$ is the hidden state at time $t$ and node $o_t$ is the corresponding observation. $A$ is the state transition matrix of the hidden states; $B$ is the probability distribution between the state sequence and the observation sequence. The HMM usually satisfies the following three requirements. First, the number of states in the HMM is limited. Second, the hidden states evolve dynamically according to the state transition matrix. Third, the observation sequence only depends on the state sequence.

The initial state probability vector, the state transition probability matrix, and the observation probability matrix can be described as follows:

$$\pi = (\pi_i)_{1 \times N}, \qquad A = [a_{ij}]_{N \times N}, \qquad B = [b_j(k)]_{N \times M}, \tag{1}$$

where $\pi$ is the initial state probability vector; $a_{ij}$ is the probability that state $i$ changes to state $j$; $b_j(k)$ is the probability of observing the $k$-th possible value at time $t$ when the hidden state is $j$; $N$ is the number of hidden states; and $M$ is the number of all possible values of the observed variable.

An HMM is generally defined by the notation $\lambda = (\pi, A, B)$. The observation probability matrix is the most important part since it directly contributes to the observed states [18]. For this paper, the most important steps are computing the observation probability matrix and extracting the valid semantics, which help segment drivers’ behavior semantics in the car-following scene.
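To illustrate how a hidden state sequence can be recovered from observations, the following is a minimal sketch of Viterbi decoding for a discrete HMM. It is not the implementation used in this study: the function name `viterbi`, the argument names, and the toy parameters in the usage note are assumptions made for illustration, and the observations are assumed to be integer symbol indices.

```python
import math

def viterbi(obs, pi, A, B):
    """Return the most likely hidden-state path for a discrete HMM.

    obs: list of observation indices
    pi[i]: initial probability of state i
    A[i][j]: probability of moving from state i to state j
    B[i][k]: probability that state i emits symbol k
    """
    n_states = len(pi)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # delta[i]: best log-probability of any path that ends in state i
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        step, delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log(A[i][j]))
            step.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        backpointers.append(step)
    # Trace the best path back from the best final state
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]
```

With a "sticky" transition matrix such as `[[0.9, 0.1], [0.1, 0.9]]`, a single deviating observation is typically absorbed into the current state rather than producing a new segment, which is the behavior one wants from a segmentation model.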

##### 2.2. MHMM Model

For the driving behavior data analyzed in this study, the observation sequence is continuous and contains a large amount of data from a heterogeneous population of drivers. To address this issue, the finite mixture of hidden Markov models (MHMM) allows the heterogeneity of the population to be taken into account. Moreover, when the model parameters are known, the probability of making an error in the partition estimation decreases exponentially with the length of the observation sequence. The MHMM is a combination of the hidden Markov model and the finite mixture model. Compared with other clustering methods, the finite mixture model (FMM) has the advantage that hidden groups in the data can be analyzed [19]. To split data with different characteristics, the FMM clusters the data into multiple homogeneous subdatasets. Each subdataset has its own characteristic parameters and distribution type, which is especially suitable for the continuous data analyzed in this paper. An FMM with $K$ subgroups is defined in the following equation:

$$h(y \mid x, \Theta) = \sum_{k=1}^{K} w_k \, f(y \mid x, \theta_k), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1, \tag{2}$$

where $x$ is the independent variable in vector form; $y$ is the dependent variable; $h$ is the conditional density; $w_k$ is the mixing ratio of group $k$; and $\Theta = (w_1, \ldots, w_K, \theta_1, \ldots, \theta_K)$ collects all parameters in vector form. If $f$ in Equation (2) is a member of the exponential family, we can combine the FMM with the HMM to obtain an MHMM. The maximum likelihood estimation approach is adopted in this study to estimate the parameters of the MHMM.
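The way a mixture of HMMs assigns a sequence to a component can be sketched as follows. This is a simplified illustration under stated assumptions, not the estimation procedure of this study: it uses discrete observations, takes the component parameters as already known, and the names `hmm_log_likelihood` and `component_posteriors` are hypothetical. Each component likelihood is computed with the scaled forward algorithm, and the mixing weights are combined with those likelihoods by Bayes' rule.

```python
import math

def hmm_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | pi, A, B) for a discrete HMM."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit obs[t]
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

def component_posteriors(obs, weights, hmms):
    """Posterior probability that obs was generated by each HMM component.

    Assumes all weights are positive and at least one component gives the
    sequence a nonzero likelihood.
    """
    scores = [math.log(w) + hmm_log_likelihood(obs, *hmm)
              for w, hmm in zip(weights, hmms)]
    top = max(scores)
    unnorm = [math.exp(s - top) for s in scores]  # log-sum-exp trick
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

In a full maximum likelihood fit, posteriors of this kind would typically serve as the responsibilities in an EM iteration; that step is omitted here.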

#### 3. Data Description

##### 3.1. Data Source

All car-following data in this study are from the Safety Pilot Model Deployment (SPMD) database. From 2012 to 2014, SPMD recorded driving behavior data from thousands of equipped vehicles in the USA. The equipment on each vehicle includes a real-time data collection system and an electronic eye, and all data were collected at a frequency of 10 Hz. The database therefore provides continuous observations of driving behavior over an extended period. The observation equipment is hidden from drivers, which avoids disturbing their behaviors [20].

##### 3.2. Data Extraction and Preprocess

On the one hand, a single variable cannot characterize a driver’s personalized driving style well. On the other hand, the number of variables should not be too large, since that leads to a complicated model with redundant information [21]. The variables selected from car-following events, as demonstrated in [22], are the subject vehicle acceleration, the relative range between the subject and lead vehicles, and the relative range rate.

To handle dropped frames in the data, if the missing duration with the same lead vehicle is less than 1 second, we use linear interpolation to impute the missing data. The conditions for extracting the car-following events are as follows: (1) the subject and lead vehicles are in the same lane; (2) the relative range is <120 m (a car-following event ends when the relative range falls below 5 m); (3) the subject vehicle speed is >18 km/h; (4) there is no overtaking; (5) the duration is >50 s [23].
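The preprocessing rules above can be sketched in a few lines. This is only an illustration under assumptions: missing samples are represented by `None`, a gap of at most 9 samples at 10 Hz is treated as "less than 1 second", each sample is a hypothetical `(same_lane, range_m, speed_kmh)` tuple, and the no-overtaking check is assumed to be handled upstream. The function names are not from the original study.

```python
def fill_short_gaps(values, max_gap=9):
    """Linearly interpolate interior runs of None that are at most max_gap long."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < len(out) and out[i] is None:
            i += 1
        gap = i - start
        # Only fill gaps that have valid neighbors on both sides
        if start > 0 and i < len(out) and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out

def extract_events(samples, hz=10, min_duration_s=50.0):
    """Return (start, end) index pairs of runs that meet the extraction rules.

    A sample qualifies if the vehicles share a lane, the range is in
    [5 m, 120 m), and the speed exceeds 18 km/h; a run is kept only if it
    lasts longer than min_duration_s. The end index is exclusive.
    """
    events, start = [], None
    for i, (same_lane, range_m, speed_kmh) in enumerate(samples):
        ok = same_lane and 5.0 <= range_m < 120.0 and speed_kmh > 18.0
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if (i - start) / hz > min_duration_s:
                events.append((start, i))
            start = None
    if start is not None and (len(samples) - start) / hz > min_duration_s:
        events.append((start, len(samples)))
    return events
```

Longer gaps are deliberately left as `None` so that they can split an event instead of being filled with interpolated values.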

##### 3.3. Variable Segmentation and Threshold Selection

When training and testing a model, the different value scales of the three feature variables can affect the behavioral semantic segmentation results [24]. To avoid this, we normalized the feature variables for each driver so that the mean was 0 and the standard deviation was 1. Then, we used the HMM and the MHMM presented in Section 2 to segment drivers’ behavior semantics in the car-following scenario and to analyze the personalized car-following behavior characteristics further.
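A per-driver normalization of this kind can be sketched as below. The data layout (a dictionary from driver ID to rows of range, range rate, and acceleration) and the function name `zscore_by_driver` are assumptions for illustration, and the population standard deviation is used because the source does not say which estimator was applied.

```python
import statistics

def zscore_by_driver(records):
    """Standardize each feature to mean 0 and standard deviation 1 per driver.

    records: dict mapping driver_id -> list of (range, range_rate, acceleration)
    """
    normalized = {}
    for driver, rows in records.items():
        columns = list(zip(*rows))
        means = [statistics.fmean(c) for c in columns]
        stds = [statistics.pstdev(c) for c in columns]
        normalized[driver] = [
            # A constant feature has zero spread; map it to 0.0 instead of dividing by zero
            tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds))
            for row in rows
        ]
    return normalized
```

Normalizing within each driver removes differences in scale between the variables, but it also removes differences in each driver's overall level, so absolute thresholds need to be applied to the unnormalized values.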

According to the physiological and psychological perception thresholds of drivers, all characteristic variables are divided into different levels to facilitate the semantic interpretation of the primitive driving patterns. In this way, the same driving behavior semantics extracted from different drivers can characterize the same driving patterns.

Based on the quantile characteristics of the feature variables and drivers’ comfort thresholds, the relative distance is classified into three classes [25] (close, normal, and long distance), and the relative speed and the acceleration are each classified into five classes [26]. More specifically, to give the primitive driving patterns a clear semantic explanation, each variable is divided into levels based on drivers’ physical and psychological perception thresholds and the corresponding statistical features. The variable segmentation is shown in Table 1.

#### 4. Modeling Result Analysis

##### 4.1. Segmentation Results and Comparisons

The behavioral semantic segmentation results are obtained from the hidden state sequences of the HMM and the MHMM. Because of space limitations, we show only the segmentation results for a representative trajectory of driver #1 (Figure 2). The horizontal axis shows time in units of 0.1 s, the background color blocks are the extracted driving behavior sequence units, and each color represents a different type of driving behavior semantics. We then discuss the segmentation results of the two models as follows.

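Figure 2: Behavioral semantic segmentation results for the representative trajectory of driver #1, panels (a) and (b).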

Figure 2 shows an example of the segmentation results using these two methods. The HMM can divide the driving data into different behavioral semantic units according to the data distribution characteristics, but it is too sensitive to data fluctuation, which leads to many behavioral semantic units with a duration of less than 1.0 s, for example, at time 5.9 s, 26.2 s, 26.5 s, 45.2 s, 49.2 s, and 53.9 s with durations of 0.1 s, 0.3 s, 0.5 s, 0.5 s, 0.5 s, 0.4 s, and 0.1 s, respectively. These units do not correspond to real driving conditions, because drivers hardly adjust their driving maneuvers within such a short time (e.g., accelerating for 0.2 s and then braking) [27].

The MHMM can also divide the driving data into different behavioral semantic patterns, and its segmentation results accurately extract effective behavioral semantic sequences based on the data distribution characteristics. At the same time, the MHMM effectively avoids the very short behavioral semantic segments (i.e., durations of less than 1.0 s) that result from random noise in the data. This shows that the MHMM not only divides time-series driving data into segments but also keeps each behavioral semantic pattern within a reasonable duration. Therefore, in the next section, the personalized driving styles are interpreted based on the MHMM results.
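The comparison above relies on the duration of each behavioral semantic unit. A minimal sketch of how such durations can be read off a decoded state sequence is given below; it assumes one state label per 0.1 s sample (matching the 10 Hz data), and the function names `segments` and `short_segments` are illustrative.

```python
def segments(states, dt=0.1):
    """Run-length encode a state sequence into (state, start_s, duration_s) tuples."""
    runs = []
    start = 0
    for i in range(1, len(states) + 1):
        # Close the current run at the end of the data or when the state changes
        if i == len(states) or states[i] != states[start]:
            runs.append((states[start],
                         round(start * dt, 6),
                         round((i - start) * dt, 6)))
            start = i
    return runs

def short_segments(states, dt=0.1, min_duration=1.0):
    """Return the segments that last less than min_duration seconds."""
    return [seg for seg in segments(states, dt) if seg[2] < min_duration]
```

Counting the output of `short_segments` for each model gives a simple numerical check of how often a segmentation produces units shorter than 1.0 s.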

##### 4.2. Labeling Behavior Semantics

In this paper, the MHMM is used to segment the driving behavior semantics of car-following events. Different car-following events occur under distinct conditions, so driving behavior semantics characterized by the same states may be extracted from multiple drivers. Therefore, we label the behavior semantics according to the definitions in Table 1, based on the MHMM segmentation results and the mean value of the feature variables in each behavior semantic segment. To describe the semantic features concisely and capture how the behavior semantics change over time, each semantic category label concatenates three abbreviations that denote the semantic labels of relative range, relative range rate, and acceleration, respectively (e.g., ND-FB-GA); the meaning and value range of each label are the same as shown in Table 1.

Based on the results of MHMM, some car-following driving patterns are easily labeled with behavior semantics, while some others are not. To clearly label the states, we find common characteristics in the car-following driving patterns and then cluster the data of each segment into a single point, which can represent this car-following driving pattern. We choose the *K*-means clustering method to cluster the driving data, which is widely used [28–30]. As one of the methods to analyze and extract feature parameters from a large amount of raw data, clustering methods aim to classify the data objectively and stably. In this case, objective means that we can obtain the same results in each set of driving data by using the same methods. Stable means that the classification process remains constant across various drivers.

Based on the behavioral semantic segmentation results of the car-following event in Figure 2, Table 2 shows the central values of the *K*-means clustering results for driver #1 and the standard deviations of the three variables for each segment. Each semantic segment can be labeled according to the clustering results. The results in Table 2 further demonstrate the advantages of the MHMM for the following reasons: (1) the standard deviation of each car-following segment is relatively small; (2) feature variables of the same behavioral semantic category take values close to each other; (3) there are significant differences in the values of feature variables taken by different behavioral semantic categories.
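As an illustration of the clustering step, the following is a plain Lloyd-style *K*-means sketch in which each point stands for one segment (for example, the mean range, range rate, and acceleration of that segment). The initialization, the stopping rule, and the names `kmeans` and `squared_distance` are assumptions; the source does not describe the specific *K*-means implementation or settings it used.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels
```

Fixing the seed makes repeated runs on the same data return the same partition, which is one way to obtain the objectivity described above.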

##### 4.3. Normalized Frequency Distribution of Driving Style

To analyze personalized driving styles, we aggregate the driving behavior semantics by driver. In this way, we can not only establish the mapping between the car-following driving states and driving patterns but also obtain each driver’s behavior preferences and the transition rules of driving patterns by aggregate analysis.

In this paper, we describe personalized driving behavior characteristics by using the normalized frequency distribution of driving behavior semantics rather than traditional statistical indicators such as the mean and standard deviation of feature variables, because the normalized frequency distribution helps us analyze driving styles intuitively. Based on the labels of relative range, the semantic segments are divided into three datasets, and the normalized probability of each pattern is computed by the following equation:

$$p_{d,s} = \frac{n_{d,s}}{\sum_{d'} \sum_{s'} n_{d',s'}}, \tag{3}$$

where $d$ is the relative distance label; $s$ is the combination of the 5 relative range rate labels and the 5 acceleration labels, a total of 25 types; $n_{d,s}$ is the number of points whose semantic label is $s$ and whose relative distance label is $d$; and $p_{d,s}$ is the normalized probability of the driving behavior semantics with semantic label $s$ and relative distance label $d$. The normalized probabilities of the driving behavior semantics sum to 1 for each driver.
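A normalized frequency distribution of this kind can be computed directly from the labeled points. The sketch below assumes that the probabilities are normalized over all patterns of one driver, so that they sum to 1 for that driver; normalizing within each distance class instead would only change the denominator. The function name and the `(distance_label, pattern_label)` input format are hypothetical, and the label `"CD"` in the usage note is a placeholder rather than a label taken from Table 1.

```python
from collections import Counter

def normalized_frequencies(labels):
    """Map each (distance_label, pattern_label) pair to its share of all points.

    labels: one (distance_label, pattern_label) pair per data point of a driver
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}
```

For example, three points labeled `("ND", "FB-GA")` and one point labeled `("CD", "FB-GA")` give frequencies of 0.75 and 0.25.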

Table 3 shows the normalized frequency distribution of driving behavior semantics for driver #1. The meaning of semantic labels is provided in Table 1. As shown in Table 3, we can find that the trend of relative range changes has a clear impact on driving patterns. When the relative range rapidly increases, drivers prefer to accelerate, and the frequency of acceleration is higher than deceleration; when the relative range rapidly decreases, drivers prefer to decelerate, and the frequency of deceleration is higher than acceleration.

For example, when following a lead vehicle at a normal distance, driver #1 prefers to accelerate gently while falling behind (ND-FB-GA), as shown in Table 3. The same holds for the other situations (long distance and close distance), which indicates that the dynamic changes of relative distance have a direct effect on the driver.

##### 4.4. Similarity Evaluation of Driving Style

Similarity evaluation of driving style plays an important role in personalized driving style analysis. To reduce the difficulty of personalizing a driving assistance system, we need to compute the similarity of different driving styles reasonably and allow drivers with similar driving styles to share one class of driving assistance pattern. However, the randomness and dynamics of driving styles make similarity evaluation challenging. Furthermore, driving behavior is usually a highly nonlinear process; thus, it is not easy to directly evaluate the similarity between two drivers with continuous observation sequences. The common approaches calculate statistical indicators directly [31], but they cannot evaluate the similarity of driving styles in terms of randomness and dynamics. Therefore, we use the Kolmogorov–Smirnov (KS) test [32, 33] to compute the similarity of two drivers’ driving styles and to illustrate the differences between drivers. The steps of the KS test are as follows:

*Step 1*. Assume that $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independent and identically distributed samples from the population distribution functions $F(x)$ and $G(x)$, respectively, and that all the samples are independent.

*Step 2*. The null and alternative hypotheses are defined as follows:

$$H_0: F(x) = G(x) \ \text{for all } x, \qquad H_1: F(x) \neq G(x) \ \text{for some } x. \tag{4}$$

*Step 3*. Construct the test statistic. The empirical distribution functions of the two samples are $F_m(x)$ and $G_n(x)$, respectively. Since the empirical distribution function of a sample is a good estimate of the population distribution, $|F_m(x) - G_n(x)|$ should be small or tend to 0 for each value of $x$ when the null hypothesis holds. The test statistic is computed by

$$T = \sup_x \left| F_m(x) - G_n(x) \right|. \tag{5}$$

The statistic measures the largest difference between $F_m(x)$ and $G_n(x)$, and the rejection region has the form $T > c$ for a critical value $c$.

*Step 4*. Set the significance level $\alpha$ and determine the rejection region. The distribution of $T$ can be divided into three cases: small samples with unequal sizes, small samples with equal sizes, and large samples. In the first two cases of small samples, the critical value of the test can be obtained from a table of KS critical values. In the case of large samples, according to the sample sizes $m$ and $n$ and the significance level $\alpha$, the quantile can be obtained by

$$T_\alpha = c(\alpha) \sqrt{\frac{m + n}{mn}}, \tag{6}$$

where $c(\alpha)$ is the coefficient related to $\alpha$; normally $\alpha$ is 0.05 or 0.01, and the corresponding $c(\alpha)$ is 1.36 or 1.63. The rejection region of the test is $T > T_\alpha$.

*Step 5*. Calculate the value of the statistic $T$. If $T > T_\alpha$, we conclude that the sample data $X$ and $Y$ come from different population distributions, and a significant difference can be found between $F(x)$ and $G(x)$. If $T \le T_\alpha$, we cannot reject the null hypothesis, which means that there is no significant evidence that the two population distributions differ.
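The steps of the two-sample KS test can be sketched as follows. This is an illustration on generic numeric samples: how the normalized frequency distributions of two drivers are passed to the test is not specified in the source, and the function names are assumptions. The coefficient for the large-sample critical value is computed from the standard asymptotic formula $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$ rather than read from a table.

```python
import math

def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < m and j < n:
        t = min(xs[i], ys[j])
        # Advance both samples past every value equal to t (handles ties)
        while i < m and xs[i] == t:
            i += 1
        while j < n and ys[j] == t:
            j += 1
        d = max(d, abs(i / m - j / n))
    return d

def ks_critical_value(m, n, alpha=0.05):
    """Large-sample critical value c(alpha) * sqrt((m + n) / (m * n))."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((m + n) / (m * n))

def ks_reject(x, y, alpha=0.05):
    """True if the null hypothesis of equal distributions is rejected."""
    return ks_statistic(x, y) > ks_critical_value(len(x), len(y), alpha)
```

The large-sample critical value is an approximation; for small samples, exact critical values from a table are the safer choice.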

Figure 3 presents a heat map of the KS test results between the normalized distributions of driving patterns for every pair of drivers. Red represents a large difference between the driving styles of the two drivers, and purple means that the two drivers are quite similar to each other. Cells in which the value of the statistic equals 0 are drawn in dark purple.

Figure 3 shows that the occurrence probability of behavior semantics for driver #21 is significantly different from that of many other drivers, especially drivers #3, #10, #23, #28, #29, and #30. In contrast, driver #13 is similar to the other drivers, especially drivers #4, #6, #7, #8, and #10. Finally, drivers #15 to #23 are similar to each other, as they form blocks of similar color. In summary, the KS test can compare the degree of similarity or dissimilarity between the driving styles of two drivers.

#### 5. Conclusions

In this paper, we select the MHMM to label behavior semantics. The normalized frequency distribution of behavior semantics is adopted to analyze driving style. Feature variables are divided into different behavioral semantic patterns, and the similarity of different driving styles is also compared through this model. The main findings can be summarized as follows:

(1) Compared with the conventional HMM, the MHMM can better recognize the driving patterns. The MHMM can identify the heterogeneity of data and divide the car-following data into different behavioral semantics. The MHMM can also provide reasonable behavioral semantic segmentation results.

(2) We can adopt the normalized frequency distribution of driving behavior semantics, instead of statistical indicators, to analyze personalized driving styles, which allows us to analyze driving styles intuitively. The normalized frequency distribution of driving behavior semantics is consistent with the results of previous studies and requires less computational time.

(3) The KS test can quantitatively evaluate the similarity and dissimilarity of driving styles for different drivers. This method can be useful for identifying personalized driving styles.

For future work, we consider expanding the sample size to study the impact of different factors (e.g., gender, age, and vehicle type) on driving styles. In addition, we should consider more complicated traffic scenarios to further analyze the driving styles.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant no. 71971160), the Shanghai Science and Technology Committee (Grant no. 19210745700), and the Fundamental Research Funds for the Central Universities (Grant no. 2120210009).