Abstract

To overcome the drawbacks of the maximum speed limit information of expressways (i.e., long update cycle and great complexity of information recognition), in this work, an Electronic Toll Collection (ETC) gantry data-based method for dynamically identifying the maximum speed limit information of expressways is proposed. Firstly, the characteristics of the ETC gantry data are analyzed, and then data are cleaned and reconstructed, after which an algorithm is proposed for constructing a vehicle travel speed data set. Secondly, the speed feature vector model of the road section is established by taking the relationship among the speed distribution feature, time domain feature, and the maximum speed limit of the road section into consideration. Then, a data supplement algorithm is constructed to solve the problem of the imbalance of data samples. Finally, the combined GC-XGBoost classification algorithm is used to train and learn the potential speed limit features, and it is verified through the Fujian Provincial Expressway ETC data and the speed limit information provided by the Fujian Traffic Police. The result shows that the accuracy of the method in the recognition of the maximum limited speed information of the expressway is 97.5%. Compared with the traditional limited speed information recognition and extraction methods, the proposed approach can identify the maximum limited speed information of each section of the expressway more efficiently. It can also accurately identify the dynamic change of the maximum limited speed information, which is able to provide data support for intelligent expressway management systems and map providers.

1. Introduction

In recent years, China’s expressway ETC technology has developed rapidly, and more and more vehicles are equipped with ETC devices. These vehicles interact with ETC gantries while driving, generating massive amounts of ETC data. At present, the cumulative number of ETC users has exceeded 220 million, and the utilization rate among vehicle owners is 78% [1]. Moreover, ETC gantries can also interact with users of the Manual Toll Collection (MTC) system. Therefore, the ETC gantry system collects the traffic information of almost all vehicles on the expressway and reflects its overall traffic situation, which can provide strong support for the informatization construction, vehicle-infrastructure cooperation, and automatic driving [2] of smart expressways. Obtaining the maximum speed limit of each expressway section is an important part of intelligent expressway management [3]; it provides drivers with speed limit information [4,5] that helps avoid traffic accidents caused by speeding, and it supports reliable perception and driving speed decision-making for autonomous vehicles. However, the maximum speed limit is dynamic. The relevant management departments adjust the speed limit of a road section according to road traffic flow, road maintenance conditions, and the number of traffic accidents [6-8]. At present, speed limit information is mainly collected manually and then uploaded to the system for periodic updating. This method has two disadvantages. First, it requires professionals to travel along the expressway to collect speed limit information, which consumes considerable manpower and material resources. Second, it has a long update cycle, so drivers cannot obtain the latest speed limit information, which creates safety hazards while driving and reduces the traffic efficiency of the road. Therefore, it is worthwhile to study how to collect speed limit information automatically and identify the maximum speed limit on the road dynamically and in real time.

Traffic flow prediction and travel time prediction are research hotspots in the field of transportation. Like speed limit recognition, most of these methods are supervised learning approaches based on machine learning algorithms. The difference is that speed limit recognition is a classification problem, whereas traffic flow prediction and travel time prediction are regression problems. The recognition of the road maximum speed limit mainly relies on image recognition technology [9-12] and floating car trajectory data mining technology. Image recognition technology obtains the speed limit of each road by recognizing the speed limit shown on its traffic signs. Machine learning is widely used in a variety of research fields [13]. Support Vector Machine (SVM) [14], Extreme Learning Machine (ELM) [15], and multitask convolutional neural network (MTCNN) [16] have been used to learn the features of speed limit signs and thereby recognize the maximum road speed limit. Although these methods achieve good recognition performance, they require surveyors to collect pictures of speed limit signs on the road, which consumes a lot of resources. In addition, the collection period is long, so the maximum speed limit cannot be recognized dynamically and in real time. In floating car trajectory data mining, each floating car is equipped with a global positioning system that records the time, location, and other information of the vehicle, so that the driving speed features of all floating cars on the road can be obtained [17]. A machine learning algorithm [18] can then learn the maximum speed limit features contained in the vehicle speed information of the road and recognize the maximum speed limit. However, floating cars account for only a small proportion of all vehicles and therefore cannot fully reflect the speed of the vehicles on the expressway. Therefore, maximum speed limit recognition based on floating car data still has certain defects.

In view of the high cost of speed limit sign recognition and the shortcomings of trajectory data recognition, this study proposes a method that uses real-time traffic data collected by the ETC gantry system to identify the maximum speed limit of expressways dynamically, which addresses the high cost of manual information collection and the incompleteness of vehicle data. First, a road section speed set construction algorithm and a section driving speed outlier filtering algorithm are designed to ensure the integrity and reliability of the sample data. Then, a speed feature vector model is constructed to mine the speed limit features of the vehicle speed from different aspects. Finally, the maximum speed limit information of 534 expressway sections in Fujian Province is taken as the sample set, and the multivoting ensemble algorithm is used to perform supervised classification training and cross-validation on the road speed features. The test results show that this method can identify the maximum speed limit well and can recognize dynamic changes of the maximum speed limit on the road.

The contributions of this paper can be summarized as follows. First, an algorithm is proposed for constructing the speed sets of road sections. It solves the problem that the speed of a road section cannot be calculated when transaction records of ETC gantries are missing, and it obtains the speeds of vehicles on each road section accurately and completely. Second, the features of the road section speed are extracted from different aspects to construct the road section speed feature vector model and to mine the potential correlation between the speed of vehicles on the expressway and the road speed limit. Third, a dynamic recognition method for the maximum speed limit of expressways is proposed. Its validity is verified against the real maximum speed limit information, and its soundness is verified by comparison with a large number of prediction algorithms.

This paper is organized as follows. Section 1 introduces the research methods of road speed limit recognition. Section 2 defines the related concepts in this work. Section 3 describes each part of the dynamic method of expressway maximum speed limit. Section 4 shows the experimental results and analysis. Section 5 draws the conclusion and future work.

2. Relevant Definitions

Definition 1. Each ETC gantry of the expressway is called a node $N$. Two adjacent nodes on the road constitute an expressway section, denoted $Q = \langle N_s, N_e, L \rangle$ and shown in Figure 1, where $N_s$ is the start point of the road section, $N_e$ is the end point of the road section, and $L$ is the actual distance of the road section.

Definition 2. The expressway network is formed by all the expressway sections considered in this work and is referred to as $G$.

Definition 3. The set of ETC gantries that a vehicle passes while driving on the expressway forms a sequence of nodes in chronological order, called the trajectory $D = \{D_1, D_2, \ldots, D_E\}$. $D_i = (N_i, T_i)$ is a trajectory point with a node property $N_i$ and a time property $T_i$, where $N_i$ is the label of the ith node passed by the vehicle and $T_i$ is the information interaction time when the vehicle passes through node $N_i$. $D_1$ is the start point of the trajectory, and $D_E$ is the end point of the trajectory.

Definition 4. The average speed of a vehicle passing through a certain road section is called the road section speed. It is calculated as follows:

$$v = \frac{L}{t_e - t_s},$$

where $L$ is the actual length of the road section, $t_s$ is the time when the vehicle passes the start point of the road section, and $t_e$ is the time when the vehicle passes the end point of the road section.

Definition 5. The dispersion of the speed of the road section measures how widely the average speeds of vehicles passing through the road section are spread. The section speeds of vehicles on the expressway within a certain period of time constitute the speed set of the section. After sorting the speed values, the speed at the 85th percentile is $v_{85}$ and the speed at the 15th percentile is $v_{15}$. The speed dispersion index can be expressed as

$$\Delta v = v_{85} - v_{15}.$$

The larger this value range is, the higher the dispersion of the speed information.
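Definitions 4 and 5 can be illustrated with a minimal sketch. It assumes that times are given in seconds and distances in kilometres, that percentiles are obtained by linear interpolation, and that the dispersion index is the difference between the 85th and 15th percentile speeds; the function names are hypothetical and do not come from the original work.

```python
def percentile(sorted_values, p):
    """Percentile of an ascending list, by linear interpolation (an assumption)."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def section_speed(length_km, t_start_s, t_end_s):
    """Road section speed in km/h: section length divided by travel time."""
    elapsed_h = (t_end_s - t_start_s) / 3600.0
    if elapsed_h <= 0:
        raise ValueError("end time must be later than start time")
    return length_km / elapsed_h


def speed_dispersion(speeds):
    """Assumed dispersion index: 85th percentile speed minus 15th percentile speed."""
    ordered = sorted(speeds)
    return percentile(ordered, 85) - percentile(ordered, 15)


# A vehicle covering a 10 km section in 360 s travels at about 100 km/h.
speed = section_speed(10.0, 0, 360)
spread = speed_dispersion([92.0, 98.0, 101.0, 104.0, 110.0, 118.0])
```

A different percentile convention (for example, nearest rank) would change the dispersion value slightly but not its interpretation.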

Definition 6. The speed limit includes the minimum speed limit and the maximum speed limit. The speed limit value is generally an integer multiple of 10. In this paper, we only discuss the maximum speed limit.

3. Methods

3.1. ETC Data Preprocessing
3.1.1. ETC Data Cleaning

The ETC gantry system can generate a large amount of transaction data in a short period. System errors, interruptions of information exchange, and severe weather conditions can produce abnormal data that affect the results. To reduce this interference, the data need to be preprocessed, mainly with respect to the following aspects.

Data Redundancy. The transaction information of each vehicle passing through an ETC gantry should be unique. However, problems in data acquisition, transmission, storage, and other intermediate links can cause data to be uploaded repeatedly, resulting in duplicated records. These redundant data need to be removed.

Data Error. Some data records do not conform to normal driving rules. Examples include the same vehicle being recorded at the same time by two ETC gantries that control different driving directions, and different passing records of the same vehicle carrying the same time. These data need to be filtered out or deleted.
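The two cleaning rules can be sketched as follows. This is only an illustration: the record layout `(vehicle_id, gantry_id, timestamp)` is a hypothetical simplification of the real transaction attributes, and the sketch discards every record involved in a same-time conflict, whereas the original work may keep one of them.

```python
from collections import defaultdict


def clean_transactions(records):
    """Remove duplicated records and records that place one vehicle at several gantries at once.

    records: iterable of (vehicle_id, gantry_id, timestamp) tuples (hypothetical layout).
    """
    # Data redundancy: keep the first copy of each identical record, preserving order.
    unique = list(dict.fromkeys(records))

    # Data error: collect the gantries that recorded each vehicle at each timestamp.
    gantries_at = defaultdict(set)
    for vehicle, gantry, ts in unique:
        gantries_at[(vehicle, ts)].add(gantry)

    # Keep a record only if its vehicle was seen at exactly one gantry at that time.
    return [r for r in unique if len(gantries_at[(r[0], r[2])]) == 1]


raw = [
    ("A", "g1", 0), ("A", "g1", 0),      # duplicated upload
    ("A", "g2", 100), ("A", "g3", 100),  # same vehicle, same time, two gantries
    ("B", "g1", 5),
]
cleaned = clean_transactions(raw)
```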

3.1.2. Vehicle Speed Recognition Algorithm in Road Section

In order to calculate the speed distribution of a road section, it is necessary to obtain the transaction data of all vehicles at each gantry. However, gantry transaction data may be missing. Therefore, all traffic data and road network data need to be checked and supplemented to ensure the integrity of the gantry transaction data. After the transaction data of the ETC gantry system have been initially cleaned, the trajectory of each vehicle is constructed in chronological order from the transaction data of each gantry. Each pair of adjacent ETC gantries in the trajectory is then traversed one by one to check whether the road section formed by the two gantries belongs to the expressway road network G. If the road section belongs to G, the speed of the vehicle passing through the section is generated directly from the section distance and the two transaction times. The speed data of a road section for a certain time period then consist of the average speeds of all vehicles that passed through the section within that period.

If the road section does not belong to the expressway road network G, the transaction data of one or more intermediate gantries are missing, and a path search over G needs to be performed to fill in the missing gantry transaction data. As shown in Figure 2, if the road section formed by two consecutively recorded gantries cannot be found in the road network G, these two gantries are used as the base nodes. A feasible path of four nodes between them is obtained through the path search; the two intermediate nodes are the supplementary nodes, and the average speed between the two base nodes is taken as the speed of each of the three road sections along the path.

To ensure the reliability of this average speed, the minimum speed for expressway driving is set to Vmin = 30 km/h and the maximum speed to Vmax = 160 km/h [19]. If the average speed over all road sections between the two base nodes is not in the range [Vmin, Vmax], it is deleted as abnormal data. The specific process of the section speed data construction algorithm is shown in Algorithm 1.

Algorithm 1: Section speed data construction.
 Input: trajectory data of a car D, expressway road network data G
 Output: speed data of the road section
function Sections(D) // Divide the vehicle trajectory data into per-section records
  D = {D1, D2, …, DE}, Di = {Ni, Ti}
  for i = 1 to E-1 do
    Nodei, Nodei+1 ← Di.N, Di+1.N // Extract the node information of two adjacent data points
    Timei, Timei+1 ← Di.T, Di+1.T // Extract the time information of two adjacent data points
    delta ← Timei+1 - Timei // Time difference between two adjacent data points
    Ri.Q ← (Nodei, Nodei+1) // Start and end nodes of the section passed by the vehicle
    Ri.T ← (Timei, Timei+1) // Start and end times of the section passed by the vehicle
    Ri ← (Ri.Q, Ri.T, delta)
  end for
  Sec ← {R1, R2, …, RE-1} // Encapsulate into Sec
  return Sec
end function
Sec = {R1, R2, …, Rm} ← Sections(D)
for each Rj in Sec (j = 1, 2, …, m) do // Rj = (Qj, Tj, deltaj)
  if Qj in G then
    Distancek ← Gk.Distance // Section distance from the expressway network, where k = Qj
    t ← Rj.delta // Time required for the vehicle to pass through the road section
    V ← Distancek / t // Speed of the vehicle passing through the road section
    Rj.V ← V // Add the speed attribute
  else // Qj is not in the network: uncollected nodes lie between its two nodes
    {N1, N2, …, NZ} ← shortest_path(G, Qj) // Shortest path between the two nodes, where Qj = (N1, NZ)
    A = {A1, A2, …, AZ-1} ← {N1, N2, …, NZ} // Convert the path node set to a road section set
    path = { }
    for Ai in A do
      pathi ← Gk.Distance // Section distance from the expressway network, where k = Ai
    end for
    A.Distance ← Sum(path)
    VA ← A.Distance / Rj.delta
    if VA ≥ Vmin and VA ≤ Vmax then
      t1, t2 ← Rj.T // Times of passing through the two recorded nodes
      for Ai in A do
        ti ← (path1 + … + pathi) / VA // Cumulative travel time up to the end of Ai
        if i = 1 then
          Ai.tq ← t1 // Time when the vehicle enters Ai
          Ai.th ← t1 + ti // Time when the vehicle leaves Ai
          Ai.delta ← ti // Time of passing through the road section
        else
          Ai.tq ← t1 + ti-1 // Time when the vehicle enters Ai
          Ai.th ← t1 + ti // Time when the vehicle leaves Ai
          Ai.delta ← ti - ti-1 // Time of passing through the road section
        end if
        Ai.T ← (Ai.tq, Ai.th)
        Ai.V ← VA
      end for
      A ← {Q, T, delta, V} // Corrected section information: nodes, times, and section speed
      Rj ← A // A replaces the original Rj
    end if
  end if
end for
speed_data ← {R1, R2, …, Rc} // Generate the speed data of the road sections

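The section speed construction can be sketched in Python as follows. This is an illustrative sketch under several assumptions rather than the authors' implementation: the road network is represented as a hypothetical adjacency dictionary `{node: {next_node: distance_km}}`, the path search is a plain Dijkstra search, times are in seconds, and the 30 to 160 km/h bounds are applied to every section, including directly observed ones.

```python
import heapq

V_MIN, V_MAX = 30.0, 160.0  # km/h, plausibility bounds stated in the text


def shortest_path(graph, src, dst):
    """Dijkstra search; returns the node list from src to dst, or None if unreachable."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def section_speeds(trajectory, graph):
    """trajectory: chronological list of (node, time_s).

    Returns a list of ((start, end), t_in, t_out, speed_kmh), one entry per section.
    """
    out = []
    for (n1, t1), (n2, t2) in zip(trajectory, trajectory[1:]):
        hours = (t2 - t1) / 3600.0
        if hours <= 0:
            continue
        if n2 in graph.get(n1, {}):
            hops = [(n1, n2)]               # section exists in the network
        else:
            path = shortest_path(graph, n1, n2)
            if path is None:
                continue
            hops = list(zip(path, path[1:]))  # fill in the missing gantries
        speed = sum(graph[a][b] for a, b in hops) / hours
        if not (V_MIN <= speed <= V_MAX):
            continue                        # implausible speed: drop as abnormal
        t_in = t1
        for a, b in hops:
            t_out = t_in + graph[a][b] / speed * 3600.0
            out.append(((a, b), t_in, t_out, speed))
            t_in = t_out
    return out


network = {"A": {"B": 10.0}, "B": {"C": 20.0}, "C": {}}
filled = section_speeds([("A", 0), ("C", 1080)], network)  # gantry B was not recorded
```

In this toy network the vehicle is recorded only at A and C, so the sketch inserts B and assigns the same average speed to both reconstructed sections, as in the else branch of Algorithm 1.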
3.1.3. Outlier Information Detection Algorithm for Road Section

To better analyze the road section speed distribution of each section, a noise data cleaning model is constructed to detect and eliminate outliers in the data. The basic idea of the model is to use the upper and lower limits of the speed boxplot to detect abnormal points and to determine the threshold interval for filtering abnormal speed data. When a large amount of expressway ETC transaction data is collected, the road section speed data set is assumed to be approximately normally distributed. For a normal distribution, the upper and lower limits of the boxplot lie about 2.7σ from the mean, close to the 3σ interval, which supports the use of boxplot analysis for outlier detection and filtering. As shown in Figure 3, there are 6 element points in the boxplot, among which $Q_1$ is the first quartile; $Q_2$ is the median; $Q_3$ is the third quartile; and $IQR$ is the distance between $Q_1$ and $Q_3$. There are also an upper limit and a lower limit. Here, $Q_1$ represents the speed value greater than that of 25% of the traffic flow, $Q_2$ represents the speed value greater than that of 50% of the traffic flow, and $Q_3$ represents the speed value greater than that of 75% of the traffic flow. Thus, the upper and lower limits of the noise data cleaning threshold model can be obtained, expressed as follows:

$$V_{upper} = Q_3 + 1.5\,IQR, \qquad V_{lower} = Q_1 - 1.5\,IQR, \qquad IQR = Q_3 - Q_1.$$

Then, the threshold range of velocity filtering is obtained as follows:

$$[V_{lower},\ V_{upper}].$$

The speed data of the road section within this range are retained, and the outlier data are deleted.
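A minimal sketch of the boxplot filter is given below. It assumes the conventional whisker factor of 1.5 and quartiles obtained by linear interpolation; both are assumptions, and the function names are hypothetical.

```python
def percentile(sorted_values, p):
    """Percentile of an ascending list, by linear interpolation (an assumption)."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def boxplot_filter(speeds, whisker=1.5):
    """Return (retained speeds, (lower limit, upper limit)) for one road section."""
    ordered = sorted(speeds)
    q1, q3 = percentile(ordered, 25), percentile(ordered, 75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    kept = [v for v in speeds if lower <= v <= upper]
    return kept, (lower, upper)


# The 20 km/h and 200 km/h readings fall outside the limits and are removed.
kept, limits = boxplot_filter([90, 95, 100, 105, 110, 20, 200])
```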

3.2. Feature Vector Model of Expressway Speed

Vehicles driving on the expressway have different speeds at different times and on different road sections. Through statistical analysis of the traffic speed features of a road section, the potential connection between the vehicle speed and the road speed limit can be obtained, and the road section speed feature vector model is then constructed. The feature vector is divided into three categories: the frequency-speed percentile feature, the road section speed evaluation feature, and the road section speed time domain feature.

3.2.1. Road Section Frequency-Speed Percentile Feature

The road section frequency-speed percentile feature reflects the distribution of the section speed at different times. It comprises the speed values at the 50th percentile, the lower and upper 25th percentiles, the lower and upper 15th percentiles, and the 95th percentile of the speed set of the road section, which are converted into a multidimensional feature vector $F_1$. It can be expressed as follows:

$$F_1 = (v_{15}, v_{25}, v_{50}, v_{75}, v_{85}, v_{95}),$$

where $v_{15}, v_{25}, v_{50}, v_{75}, v_{85}, v_{95}$ are, respectively, the 15th, 25th, 50th, 75th, 85th, and 95th percentiles of the total section speed distribution, which describe the overall distribution of the speed in the road section.

3.2.2. Road Section Speed Evaluation Feature

The road section speed evaluation feature is described by the relevant evaluation indexes in the frequency domain, including the average speed, speed standard deviation, and speed dispersion, which are converted into a multidimensional feature vector $F_2$. It is expressed as follows:

$$F_2 = (v_m, \bar{v}, \sigma, \Delta v),$$

where $v_m$ is the mode of the section speed, representing the general level of the vehicle speed; $\bar{v}$ and $\sigma$ are the overall average speed of the road section and its standard deviation, respectively; and $\Delta v$ is the speed dispersion index, which reflects the range of variation and the dispersion of the speed data.

3.2.3. Road Section Speed Time Domain Feature

The road section speed time domain feature reflects how the speed of the traffic flow evolves on different road sections under different speed limits. If the section speed data were analyzed by day without considering the features of different periods, the result would be easily affected by road congestion and other factors in individual periods and could not reflect the speed evolution of the road. Therefore, it is necessary to fully integrate the speed feature information of roads in different periods. The whole day is divided into 24 time periods, denoted as 0, 1, ..., 23, respectively. Then, the speed information of each road section in each period is mined and counted to find the speed change law of each road section. As shown in Figure 3, the multidimensional velocity time domain feature vector $F_3$ is constructed. It is expressed as follows:

$$F_3 = (u_1, u_2, \ldots, u_n),$$

where $u_k$ is the kth largest average road section speed among the 24 time periods of the day in the data sample; that is, the 24 period averages are sorted from large to small and the first $n$ values are taken. Here, we take the first 6 values to avoid the disturbance caused by relatively low road section speeds due to traffic congestion or road maintenance in some periods.
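The three feature groups can be assembled into the 16-dimensional vector as in the following sketch. It is a simplified illustration: the input layout `(hour_of_day, speed_kmh)`, the rounding of speeds to whole km/h before taking the mode, the use of the population standard deviation, and the 85th minus 15th percentile form of the dispersion index are all assumptions not fixed by the text.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def percentile(sorted_values, p):
    """Percentile of an ascending list, by linear interpolation (an assumption)."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (rank - lo) * (sorted_values[hi] - sorted_values[lo])


def speed_feature_vector(samples, top_n=6):
    """samples: list of (hour_of_day, speed_kmh) for one road section on one day."""
    speeds = sorted(v for _, v in samples)

    # Frequency-speed percentile feature (6 values).
    pct = [percentile(speeds, p) for p in (15, 25, 50, 75, 85, 95)]

    # Speed evaluation feature (4 values): mode, mean, standard deviation, dispersion.
    mode_speed = float(Counter(round(v) for v in speeds).most_common(1)[0][0])
    evaluation = [mode_speed, mean(speeds), pstdev(speeds), pct[4] - pct[0]]

    # Time domain feature (top_n values): largest hourly average speeds.
    by_hour = defaultdict(list)
    for hour, v in samples:
        by_hour[hour].append(v)
    if len(by_hour) < top_n:
        raise ValueError("need speed records in at least top_n time periods")
    hourly = sorted((mean(vs) for vs in by_hour.values()), reverse=True)

    return pct + evaluation + hourly[:top_n]


# Toy data: one vehicle per hour, slightly faster each hour.
fv = speed_feature_vector([(h, 100.0 + h) for h in range(24)])
```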

3.3. Sample Imbalance Processing

The road speed limit classification values used in this paper conform to the 80 km/h, 100 km/h, 110 km/h, and 120 km/h values specified in the “Road Speed Limit Sign Design Specification” (JTG/T 3381-02-2020) and the “Expressway Engineering Technical Standard” (JTG B01-2003). Most of the collected data correspond to 100 km/h, so the number of 100 km/h samples is far larger than that of the other three types, 80 km/h, 110 km/h, and 120 km/h. This creates an imbalance among sample categories. There are two common processing methods for unbalanced data samples: oversampling and undersampling [20]. Oversampling copies the minority samples multiple times to expand their data volume. Because it duplicates preexisting sample data, it leads to a certain degree of overfitting during model training. Undersampling randomly removes part of the data from the majority samples or selects part of the samples in this category according to a certain proportion. This causes the model to learn only part of the rules of the sample data, so it cannot effectively reflect the complete pattern of the samples in this category. To alleviate these problems, an improved random oversampling method, SMOTE [21], is utilized, which analyzes the minority samples and uses their similarity in feature space to add simulated new samples to the data set. The number of minority samples in the original data set is expanded, and the disparity between categories is reduced; therefore, the imbalance problem is alleviated. The process of SMOTE can be divided into the following steps:
Step 1. Select the speed feature vector sets of the minority sample categories with speed limit values of 80, 110, and 120 km/h
Step 2. For each category of sample set, use the Euclidean distance as the metric in the feature space and iteratively calculate the distance between each pair of samples in the sample set to determine the k-nearest neighbor sample points
Step 3. Perform random linear interpolation on the line connecting each sample point and the selected s neighboring sample points to generate new samples
Step 4. Repeat Step 2 and Step 3 until the categories of the expressway speed feature vector data set reach a balance
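Steps 2 and 3 can be sketched as follows. This is a minimal illustration of the SMOTE interpolation idea in pure Python, not the implementation used in the experiments; the function name, the default k, and the fixed random seed are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples for one minority class.

    minority: list of feature vectors (lists of floats), at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 2: k nearest neighbors of x by Euclidean distance, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: random point on the segment between x and one chosen neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy two-dimensional minority class; real inputs would be 16-dimensional vectors.
minority_class = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_samples = smote(minority_class, n_new=10, k=2)
```

Because each synthetic sample is an interpolation between two existing samples of the same class, it always lies within the region already spanned by that class, which is why the method is less prone to exact duplication than plain random oversampling.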

3.4. Maximum Speed Limit Recognition Classification Model

The acquisition of speed limit information on expressways is an important factor that affects driving safety. Different road sections correspond to different speed limits, and the differences in speed limits directly affect the state of the vehicles, which makes the relevant data show a certain pattern. Using a strong learner to train on the related data can achieve high-precision recognition results. XGBoost is an ensemble learning method based on a boosting algorithm [22]. Its base learner is usually a decision tree, and each newly generated tree learns the residual between the true value and the current prediction of all existing trees. The results of all trees are then accumulated as the final result to obtain a better classification accuracy [23-25]. By using the XGBoost algorithm as a classifier for identifying the maximum speed limit on expressways, the maximum speed limit can be determined accurately.

A sample data set is constructed by extracting 16-dimensional speed feature vectors from the expressway section data with known speed limit information. Suppose the data set is $\{(x_i, y_i)\}_{i=1}^{n}$. $x_i$ is the feature vector of the ith sample, also known as the input value, that is, the constructed 16-dimensional expressway speed feature vector. $y_i$ is the output value of the ith sample, that is, the road speed limit classification label corresponding to $x_i$. Assuming that the XGBoost ensemble learning model integrates a total of $K$ regression trees, the prediction result of the XGBoost algorithm can be expressed as in the following equation:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where $K$ is the number of trees, $f_k$ corresponds to the kth regression tree with structure $q$ and leaf weight $w$, $F$ is the integrated classifier composed of all regression trees, and $f_k(x_i)$ corresponds to the predicted score of the kth regression tree on the sample $x_i$.

The objective function of XGBoost consists of a loss function and a regular term, expressed as follows:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

where $l$ is the error function and $\Omega$ is the regularization term. The regular term can be expressed as follows:

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,$$

where $\gamma$ represents the penalty coefficient of the model, and the value range is [0,1]. $T$ represents the number of leaves of the kth tree; $\lambda$ is the regular term coefficient.

The XGBoost algorithm adopts an additive step-by-step integration strategy in the training process. First, the first tree is optimized, and then the second tree, until the $K$th tree is optimized; the loss function is continuously reduced during the optimization process. By adding an incremental function in the iterative process to optimize the objective function, the prediction accuracy can be improved, and the calculation method can be expressed as in the following equation:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C,$$

where $C$ is a constant term and $\hat{y}_i^{(t-1)}$ represents the predicted value in the (t − 1)th iteration on the ith sample. Then, the second-order Taylor expansion is carried out and the constant term is discarded in order to reduce the running time of the model, expressed as follows:

$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T,$$

where $I_j$ represents the sample set of leaf $j$, $w_j$ is the weight of leaf $j$, and $g_i$ and $h_i$ are the first derivative and the second derivative of the loss function, respectively.

The objective function is converted into a quadratic function of $w_j$ to find the minimum value, and then the optimal prediction score of each leaf node and the optimal value of the objective function are obtained as follows:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T,$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$.
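The closed-form leaf score can be checked numerically with a small sketch. It uses the standard XGBoost expressions, in which the optimal weight of a leaf is the negative sum of first derivatives divided by the sum of second derivatives plus the regular term coefficient. As an illustrative assumption, the squared error loss is used, for which the first derivative is the prediction minus the label and the second derivative is 1; the function names are hypothetical.

```python
def leaf_weight(grads, hess, lam):
    """Optimal leaf weight: -G / (H + lambda), with G and H the leaf's derivative sums."""
    return -sum(grads) / (sum(hess) + lam)


def structure_score(leaves, lam, gamma):
    """Optimal objective: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T.

    leaves: list of (grads, hess) pairs, one pair per leaf.
    """
    quality = sum(sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return -0.5 * quality + gamma * len(leaves)


# Squared error loss with current predictions of 0: g_i = prediction - label, h_i = 1.
labels = [1.0, 2.0, 3.0]
grads = [0.0 - y for y in labels]   # [-1, -2, -3]
hess = [1.0] * len(labels)

w = leaf_weight(grads, hess, lam=1.0)                        # 6 / 4
score = structure_score([(grads, hess)], lam=1.0, gamma=0.5)  # -0.5 * 36 / 4 + 0.5
```

With the regular term coefficient set to 0, the leaf weight would equal the mean label (2.0); the regularization shrinks it to 1.5.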

After that, the optimization of the XGBoost parameters mainly includes the following 4 steps:
Step 1. Choose a relatively high learning rate, set reasonable initial values for the booster parameters, and use cross-validation in each iteration to obtain the ideal number of decision trees
Step 2. With the learning rate and the number of decision trees determined in Step 1, use cross-validation and grid search to optimize the parameters of each boosting machine
Step 3. In the same way as in Step 2, adjust the regularization parameters based on the given data to reduce overfitting
Step 4. Appropriately reduce the learning rate to determine the final ideal parameter combination of the model
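The combination of grid search and cross-validation used in Steps 2 and 3 can be sketched generically as follows. The sketch is library-independent: `score_fn` is a hypothetical callback that would train a classifier with the given parameters on the training indices and return its accuracy on the validation indices, and the fold construction and parameter values are assumptions for illustration only.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return (best_params, best_mean_score) over all combinations in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Toy scoring function standing in for "train the model and measure accuracy".
def toy_score(params, train_idx, valid_idx):
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}
best, best_score = grid_search(grid, toy_score, n_samples=100, k=5)
```

Tuning the parameters in stages, as in Steps 1 to 4, keeps each grid small; searching all parameters jointly would multiply the number of combinations and of cross-validation runs.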

3.5. Maximum Speed Limit Recognition Model

The problem of identifying the maximum speed limit on expressways is a classification problem. The framework of the identification model is shown in Figure 4. Dynamic identification of the expressway speed limit is realized through the following steps. First, data cleaning is applied to the ETC gantry transaction data to remove duplicated data and erroneous data. Then, the vehicle speed recognition algorithm is used to find the missing records in the ETC gantry transaction data and to accurately restore the gantry distribution on expressways. The speed of each road section is obtained by calculating the speed of the vehicle between the gantries. However, the road section speeds contain some very large or very small outliers, so a boxplot is used to remove them. Next, the speed of each driving section is analyzed, and the frequency-speed percentile feature, the road section speed evaluation feature, and the road section speed time domain feature are constructed. Since the numbers of samples of the various speed limit types differ greatly, the oversampling algorithm is used to expand the minority samples and obtain balanced data. Finally, the data are divided into training data and test data. The training data are input into the XGBoost algorithm for training and learning; the training process is shown as process 1 in Figure 4. At the same time, grid search and cross-validation are used to find the optimal parameters of each boosting machine in XGBoost; the optimization process is shown as process 2 in Figure 4.

4. Experiments and Results

4.1. Introduction of Experimental Data

The ETC gantry system is one of the main components of the expressway ETC system and is used for real-time supervision and recording of vehicle driving information, vehicle path identification, toll data fitting, and other functions [14]. The experimental data mainly include three categories. The first is the ETC transaction data collected by the ETC gantries on various sections of the expressways in Fujian Province over the 9 days from September 3 to September 11, 2020. It covers 50 expressways, including Fuyin Expressway, Xiazhang Expressway, and Longchang Expressway, with 534 sections and about 100 million records. The average section length is 8.9 km, 85% of the sections are shorter than 16 km, and the maximum section length is 30 km; the distribution is shown in Figure 5. These data are sourced from Fujian Provincial Expressway Information Technology Co., Ltd. The main attributes of the data are shown in Table 1. The second category is the road speed limit data, including the name of each road section and its maximum speed limit value, which is derived from the online announcements of the Fujian traffic police and is used for model learning, training, and testing. The third category is the distance of each expressway section obtained from Amap, including the gantry node pair of each section and the actual road section distance.

4.2. Experimental Results and Analysis
4.2.1. ETC Data Preprocessing

By matching the initially cleaned ETC data with the road network topology data, the road section speed of each vehicle is calculated, and the expressway road section speed data set is constructed. Table 2 shows the main characteristics of the data. Due to the influence of random factors, the data may contain a certain number of outliers; the outliers of each road section can be detected through the noise data filtering model. After the noise data are eliminated, the preprocessed road section speed data are obtained. Figure 6 shows the road section speed data of one road section from September 3, 2020, to September 11, 2020. The abscissa denotes the date, and the ordinate represents the magnitude of the road section speed. Each box represents the overall distribution of the road section speed on that day, and the black dots represent the data that need to be deleted. The original road section speed data comprise about 12.29 million records, of which about 1.19 million (9.68%) are abnormal, leaving approximately 11.1 million records after preprocessing.

4.2.2. Road Section Velocity Feature Vector

After the preprocessed section speed data set is obtained, the road section speed feature vector model is constructed from a day-by-day statistical analysis of the expressway section speed. The resulting data set contains three types of features, which together form a 16-dimensional feature vector, plus the class label of each sample. Tables 3–5 list the feature vectors and the model output obtained after feature extraction. Each sample is identified by its road section (for example, the section from ETC gantry 340507 to ETC gantry 351C03) and by the date on which the traffic was observed. The frequency-speed percentile features are the driving speeds at percentiles between 15% and 95% on each section; the speed evaluation features are the mode, average, standard deviation, and dispersion of the vehicle speed; the time domain features are the first 6 values after sorting the average section speeds of the 24 time periods of the day; and the output is the maximum speed limit value.
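A minimal sketch of how such a 16-dimensional vector could be assembled for one section on one day is given below. Several details are assumptions made only for illustration: the six percentile levels, dispersion taken as the coefficient of variation, the mode computed on speeds rounded to 1 km/h, and the time domain values sorted in descending order. The paper's own definitions are those in Tables 3–5.

```python
from collections import defaultdict
from statistics import mean, multimode, pstdev, quantiles

# Assumed percentile levels; the text only states the 15% to 95% range.
PERCENTILES = (15, 35, 55, 75, 85, 95)


def speed_feature_vector(records):
    """Build a 16-value feature vector.

    records: list of (hour_of_day, speed_kmh) for one section on one day.
    Layout: 6 percentile features, 4 evaluation features, 6 time domain features.
    """
    speeds = [v for _, v in records]

    # Percentile features: cuts[p - 1] is the p-th percentile.
    cuts = quantiles(speeds, n=100, method="inclusive")
    pct = [cuts[p - 1] for p in PERCENTILES]

    # Evaluation features: mode, average, standard deviation, dispersion.
    avg = mean(speeds)
    std = pstdev(speeds)
    mode = min(multimode(round(v) for v in speeds))  # smallest mode on ties
    evaluation = [mode, avg, std, std / avg]  # dispersion = std / mean (assumed)

    # Time domain features: six largest hourly mean speeds (assumed ordering).
    by_hour = defaultdict(list)
    for hour, v in records:
        by_hour[hour].append(v)
    if len(by_hour) < 6:
        raise ValueError("need speeds in at least 6 time periods")
    hourly = sorted((mean(vs) for vs in by_hour.values()), reverse=True)

    return pct + evaluation + hourly[:6]


# Hypothetical day: one vehicle per hour, speeds rising from 90 to 113 km/h.
records = [(h, 90 + h) for h in range(24)]
vec = speed_feature_vector(records)
print(len(vec), vec)
```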

4.2.3. Balance Analysis of Sample Data

The road section speed feature vector data set contains 5,081 samples, of which those with speed limits of 80 km/h, 100 km/h, 110 km/h, and 120 km/h account for 5.31%, 87.24%, 9.39%, and 2.83%, respectively. The classes are seriously unbalanced, which adversely affects the recognition performance of the model. Therefore, SMOTE is used to oversample the minority classes (speed limits of 80, 110, and 120 km/h) so that the classes become relatively balanced. In the experiment, the data set produced by the SMOTE algorithm is used as the model input and is divided into training and testing samples.
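The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea under simplifying assumptions (Euclidean distance, uniform random choice of base sample, a fixed seed); the function name, parameters, and sample values are hypothetical, and the paper does not state which implementation was used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority feature vectors.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base sample within the minority class.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical two-feature samples for one minority speed limit class.
minority = [[80.0, 5.0], [82.0, 6.0], [79.0, 4.5], [81.0, 5.5]]
new_samples = smote(minority, n_new=6, k=2)
print(new_samples)
```

To balance the classes, n_new for each minority class would be set to the difference between the majority class size and that class's size.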

4.2.4. The Result of the Model’s Performance

The parameter settings of the XGBoost algorithm strongly affect the performance of the model. To improve the accuracy, a set of sensitivity experiments is conducted. First, four boosting parameters that have a significant impact on the model are identified: n_estimators, learning_rate, max_depth, and min_child_weight. Second, grid search combined with cross-validation (GC) is used to obtain the optimal parameters, following the method of Section 3.4. The search range, step length, and optimized value of each parameter are shown in Table 6.
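The search procedure can be sketched as follows: every parameter combination in the grid is scored by k-fold cross-validation, and the combination with the best mean score is kept. To stay self-contained, the sketch uses a toy nearest-neighbour scorer as a stand-in for XGBoost training; the helper names, the parameter n_neighbors, the fold count, and the data are all illustrative assumptions. In practice a library routine such as scikit-learn's GridSearchCV wrapped around an XGBoost classifier would typically play this role.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(fit_and_score, param_grid, X, y, k=5):
    """Return (best_params, best_mean_score) over the full parameter grid."""
    names = list(param_grid)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            scores.append(fit_and_score(params, X, y, train, held_out))
        mean_score = sum(scores) / k
        if mean_score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, mean_score
    return best_params, best_score


def knn_fit_and_score(params, X, y, train_idx, test_idx):
    """Toy stand-in for model training: nearest-neighbour vote on 1-D inputs."""
    correct = 0
    for i in test_idx:
        nearest = sorted(train_idx, key=lambda j: abs(X[j] - X[i]))
        votes = [y[j] for j in nearest[: params["n_neighbors"]]]
        pred = max(set(votes), key=votes.count)
        correct += pred == y[i]
    return correct / len(test_idx)


# Hypothetical 1-D data: class 0 near 80 km/h, class 1 near 120 km/h.
X = [78, 121, 82, 118, 80, 122, 79, 119, 81, 120]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search_cv(
    knn_fit_and_score, {"n_neighbors": [1, 3]}, X, y, k=5
)
print(best_params, best_score)
```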

The model is established through the above processing, and its effectiveness is verified on the test data; the confusion matrix is shown in Table 7. Of the 3,295 test samples, 3,212 were identified correctly, an accuracy of 97.5%. The recognition accuracy for the 80 km/h class is 100%, because data with a speed limit of 80 km/h differ markedly from the other classes and are easily distinguished. In contrast, the 100 km/h and 110 km/h classes are very similar and are easily confused. Of the 824 samples with a speed limit of 100 km/h, 759 were identified correctly and 47 were misidentified as 110 km/h, which lowers the accuracy. For the same reason, the accuracy for the 110 km/h class is also lower than that of the 80 km/h and 120 km/h classes.
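The figures quoted above can be checked with a few lines of arithmetic. The sketch uses only the counts stated in the text and assumes that the 47 samples are 100 km/h samples that were predicted as 110 km/h; the variable names are illustrative.

```python
# Counts stated in the text.
total_samples, total_correct = 3295, 3212
n_100, correct_100, predicted_as_110 = 824, 759, 47

overall_accuracy = total_correct / total_samples
accuracy_100 = correct_100 / n_100

# Share of the 100 km/h errors that went to the 110 km/h class.
errors_100 = n_100 - correct_100
share_to_110 = predicted_as_110 / errors_100

print(f"overall accuracy: {overall_accuracy:.1%}")
print(f"100 km/h class accuracy: {accuracy_100:.1%}")
print(f"100 km/h errors assigned to 110 km/h: {share_to_110:.1%}")
```

Under this reading, most of the 100 km/h errors fall into the neighbouring 110 km/h class, which is consistent with the explanation that these two classes are hard to separate.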

4.2.5. Comparison and Analysis

(1) Impact Analysis of Data Equalization. To verify the influence of SMOTE oversampling on the model, the original data set and the data set processed by the SMOTE algorithm are each used for training. The other steps are kept identical, yielding two classifiers. The classification results are compared in Table 8, where the first category is the model trained on the SMOTE-processed data set and the second is the model trained on the original data set. The following can be seen from Table 8:
(1) After the data are oversampled by the SMOTE algorithm, the accuracy, recall rate, and F1-score of all categories improve greatly.
(2) The 100 km/h class has the most samples and is not expanded during oversampling, yet its evaluation indexes still improve. This indicates that the SMOTE algorithm not only greatly improves the recognition accuracy of the minority speed limit classes but also effectively improves that of the majority class.
(3) The SMOTE algorithm improves the prediction accuracy for the 110 km/h and 120 km/h classes, and their recall rate and F1-score also improve greatly. It has little effect on the prediction accuracy of the 80 km/h class but a large effect on its recall rate and F1-score.
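For reference, the per-class indexes compared in this analysis can be computed from a confusion matrix as sketched below, using the standard definitions of precision, recall, and F1-score. The function name and the small two-class matrix are illustrative only and are not the values of Table 7 or Table 8.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    out = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum: predicted as class i
        actual = sum(cm[i])                    # row sum: truly class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = (precision, recall, f1)
    return out


# Toy two-class matrix (hypothetical counts).
toy_cm = [
    [8, 2],  # true class A: 8 predicted A, 2 predicted B
    [1, 9],  # true class B: 1 predicted A, 9 predicted B
]
metrics = per_class_metrics(toy_cm, ["A", "B"])
for label, (p, r, f) in metrics.items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
```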

(2) Comparison and Analysis of Feature Vector Model. The effectiveness of the different feature types in the expressway section speed feature vector model is verified by changing only the input features while keeping the other steps the same. Seven experiments are set up to examine single features and feature combinations. The first three models each use a single feature type: the frequency-velocity percentile feature, the road section velocity evaluation feature, and the road section velocity time domain feature, respectively. The next three models use pairs of feature types: the percentile and evaluation features; the percentile and time domain features; and the evaluation and time domain features. The seventh model uses all three feature types. The experimental results are shown in Figure 7, where A1–A7 denote these seven models in the order listed. The following can be seen:
(1) With a single feature type, the frequency-velocity percentile feature gives the best prediction, followed by the velocity evaluation feature and then the velocity time domain feature.
(2) With two feature types, the prediction improves compared with a single feature type, and with all three feature types it is the best.
(3) Ranked by contribution to the prediction model from largest to smallest, the features of the expressway section speed feature vector model are the road section speed-frequency percentile feature, the road section speed time domain feature, and the road section speed evaluation feature; the contribution of each component of the feature vector is shown in Figure 8.
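The seven experiments correspond to all non-empty combinations of the three feature types (2^3 - 1 = 7). A minimal sketch that enumerates them in the same order as A1–A7 is shown below; the group names are illustrative labels, not identifiers from the paper.

```python
from itertools import combinations

# Illustrative labels for the three feature types.
GROUPS = ("percentile", "evaluation", "time_domain")


def feature_subsets(groups):
    """All non-empty combinations of feature groups, smallest first."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


subsets = feature_subsets(GROUPS)
for index, combo in enumerate(subsets, start=1):
    print(f"A{index}: {' + '.join(combo)}")
```

Each combination would select the matching columns of the feature vector before training, with every other step of the pipeline left unchanged.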

(3) Comparison of Classification Models. To further illustrate the advantages of the model, we compare GBDT, KNN, SVM, AdaBoost, and Logistic Regression (LR) with our method. The experimental results are shown in Table 9. Among the six classification methods in Table 9, the SVM, AdaBoost, and LR classifiers perform poorly in terms of accuracy, recall rate, and F1-score. GC-XGBoost, GBDT, and KNN all achieve high recognition accuracy for the expressway maximum speed limit information. Of these, GC-XGBoost outperforms GBDT and KNN, with the highest accuracy of 97.5%.

5. Conclusion

This paper proposes a method for identifying expressway speed limit information based on ETC data mining. First, the abnormal ETC gantry data are processed, and an algorithm for constructing the road section speed data set is proposed. The section speed data are constructed, and the outlier samples of each road section are eliminated by boxplot analysis to ensure that the ETC data are accurate. Then, the SMOTE algorithm is used to oversample the minority speed limit categories so that the speed limit classes are balanced. Finally, the oversampled training samples are input into the proposed GC-XGBoost (grid search + cross-validation + XGBoost) algorithm for training, and the result is compared with several similar algorithms. The experimental results show the following:
(1) Ranked by contribution to the prediction model from largest to smallest, the features of the expressway section speed feature vector model are the speed-frequency percentile feature, the time domain feature, and the speed evaluation feature. All three categories of features improve the prediction model, and the frequency-speed percentile feature improves it the most.
(2) On the test samples, the recognition accuracy for the 80 km/h, 100 km/h, 110 km/h, and 120 km/h classes is 100%, 92.1%, 97.9%, and 99.9%, respectively, and the overall accuracy is 97.5%. The 100 km/h and 110 km/h classes are very similar, so their recognition accuracy is relatively low.
(3) GC-XGBoost achieves a speed limit recognition accuracy of 97.5%, with a precision of 0.98, a recall of 0.97, and an F1-score of 0.97. These results are significantly better than those of the other five algorithms, showing that the method can accurately identify the maximum speed limit information of expressways.

This paper considers the speed features of mixed vehicle types, which is suitable for identifying the maximum speed limit information of expressways. However, this work still has some limitations:
(1) The recognition of the 100 km/h and 110 km/h speed limits is less effective. More speed limit features could be considered to explore the differences between the two classes and improve their recognition.
(2) This study does not consider different speed limit values for different lanes of the same road. In the future, vehicle classification and lane number could be used to analyze the speed limit information of different lanes on the same road and to construct a more complete expressway speed limit recognition model.

Data Availability

The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (41971340), the Special Funds for the Central Government to Guide Local Scientific and Technological Development (2020L3014), the 2020 Fujian Province “the Belt and Road” Technology Innovation Platform (2020D002), and the Provincial Candidates for the Hundred, Thousand and Ten Thousand Talent of Fujian (GY-Z19113).