This study developed and verified a travel speed prediction model based on travel speed and work zone data collected in real time from the advanced traffic management system (ATMS) in Daegu, South Korea. A clustered K-nearest neighbors (CKNN) algorithm was used to predict travel speed, yielding an average mean absolute percentage error (MAPE) of 6.9% using the data from 1,815 work zones. Furthermore, the road network impact of road work was calculated by comparing the predicted travel speeds with the historical speed data. The predicted work zone travel speeds generated in this study are expected to help drivers select better routes and to support traffic management strategies for operating work zones efficiently.

1. Introduction

A downside of road works is the reduction in road capacity, which leads to traffic congestion and inconveniences drivers. In Daegu Metropolitan City, South Korea, an average of 20 road works are carried out per day, with only limited information, such as construction schedules, provided in advance. Other essential details, including the expected traffic congestion and the estimated travel time under road works, are not disclosed. Therefore, drivers traveling through or near a work zone may experience significantly longer travel times than expected because of the limited notice. Predicting the network impact of road construction can give drivers the opportunity to choose detours [1, 2] and give road managers data for establishing traffic operation strategies in case of traffic congestion. To reduce congestion caused by frequent road works, a system is needed that predicts the impact of work on traffic flow and provides this information to drivers and road managers ahead of time. In addition, an algorithm for predicting the speed of neighboring road links after road work and a method for understanding the effect of road construction on the network should be developed.

Daegu Metropolitan City operates the advanced traffic management system (ATMS) to provide traffic condition information on urban roads. However, the data generated by the system does not fully reflect real-time traffic information, so inconsistencies with the actual road conditions arise. The system’s users should therefore be given estimated travel speeds or times so that they can change their travel plans appropriately. Meanwhile, numerous studies have been conducted to predict traffic conditions in urban networks [3], but studies on predicting traffic conditions in urban construction sections in particular remain relatively scarce [4]. This study aims to develop an algorithm to predict traffic speed after road work in an urban area and to present a method for determining whether the road network is affected. The travel speed prediction model for the work area was developed with the clustered K-nearest neighbors (CKNN) algorithm using traffic statistics collected from Daegu ATMS and work data gathered from the urban traffic information system (UTIS). Furthermore, the road network impact of road construction was calculated by comparing the predicted travel speeds with the historical speed data.

This study is composed of five sections. Section 2 reviews previous studies related to this study, while Section 3 describes the data used for this study and its preprocessing. Section 4 describes the design and verification of the CKNN model, and Section 5 discusses the research results and follow-up studies.

1.1. Literature Review

Predicting travel speed or travel time has been an active research topic for decades, and various predictive models have been developed as a result [5]. In early studies, parametric methods such as autoregressive moving average (ARMA), Kalman filter [6], autoregressive integrated moving average (ARIMA) [7, 8], and seasonal autoregressive integrated moving average (SARIMA) [9] were utilized [10, 11]. However, parametric methods were difficult to implement in real-time traffic systems because of problems with model calibration, validation, and computation [5]. In addition, they have been shown to perform poorly compared to nonparametric methods in unstable traffic conditions and complex road settings [12, 13]. Neural network (NN), K-nearest neighbors (KNN) [14], Bayesian network (BN) [15], and support vector machine (SVM) [10, 16] are representative nonparametric algorithms [17]. Such approaches are advantageous because they are free of assumptions regarding the underlying model formulation and the uncertainty in estimating the model parameters [18]. Recently, studies using deep learning techniques have been conducted to improve the prediction accuracy of traffic conditions [17, 19]. These include long short-term memory (LSTM) [20, 21], deep belief network (DBN) [22], stacked autoencoder (SAE) [23, 24], and convolutional neural network (CNN) [25], which have been widely used and have achieved good results in predicting traffic conditions [26]. Nevertheless, nonparametric methods and deep learning strategies require a large amount of traffic data, which increases algorithm execution times and makes it difficult to present prediction results in real time [27, 28].

This study considered two factors when selecting the travel speed prediction algorithm. First, the algorithm must be implementable in traffic information and management systems. Second, the algorithm should be easily understandable at a traffic manager’s level of knowledge and experience. Based on these two points, parametric methods were unsuitable for this study because they are complicated to implement in such systems. Conversely, nonparametric methods are easier to apply and provide better prediction performance than parametric methods in unstable traffic conditions. In particular, neural networks and KNN algorithms have long been used in related studies because of their excellent prediction results. However, neural network models are difficult for analysts and managers to understand since they use numerous neurons, complex structures, and nonlinear functions [29–31]. Therefore, a KNN model, which can be implemented in a real-time traffic system and easily understood by traffic managers, was considered more appropriate.

KNN algorithms can make relatively accurate predictions as the amount of data increases, but the computation time also becomes longer. Liu et al. [32] recognized this problem and used a clustering method to address it. Clustering is a procedure for grouping data with similar characteristics, and when used with the KNN algorithm, it shortens prediction times while maintaining good performance [33]. Hence, a clustering method was applied to the KNN algorithm in this study to shorten prediction times and improve accuracy.

Most previous studies on travel speed prediction in work zones were conducted on highways [34–38]. Some prior studies focused on work zones on urban arterial roads, but these were limited to specific links or routes [4, 39]. Based on the cited studies emphasizing the efficacy and potential of the KNN algorithm, this study aims to develop an algorithm that predicts traffic speeds under the changing traffic conditions caused by work zones on urban roads. It also aims to present a method to understand their effect on the road network. This study is not limited to a specific link or route but predicts travel speed for all arterial roads in Daegu Metropolitan City. In addition, this study differs from previous work in that few studies have suggested a method for judging the effect of construction on urban arterial road networks.

2. Data Description and Preparation

2.1. Standard Node Link Data

Standard node link is Korea’s standard transportation network database with a unified identification (ID) system. Among them, link data includes various road information (link ID, number of lanes, road name, speed limit, etc.), as shown in Table 1. In Daegu, the two major systems that collect, process, and provide traffic information are UTIS and ATMS, which efficiently use standard node link-based link IDs to match data between systems. The calculation time of the travel speed prediction algorithm in a work zone was shortened, and the accuracy of the prediction result was improved by clustering 1,672 links according to their attributes’ similarities.

2.2. ATMS Data

Daegu Metropolitan City provides real-time traffic information to road users through ATMS, a part of its intelligent transportation systems (ITS). ATMS collects individual vehicle travel information, such as vehicle IDs and detection times, through dedicated short-range communication (DSRC) when a vehicle equipped with an onboard unit (OBU) passes a roadside equipment (RSE) installation point. Traffic speed data was generated from the vehicle detection times and the distance between the roadside devices, and was then aggregated in units of five minutes for each road link. As shown in Table 2, the ATMS data fields include the standard node link-based ID (STD_LINK_ID), the aggregation time (GENERATEDDATE), and the speed (SPEED) calculated by aggregating the data collected over five minutes. In this study, the traffic speed of the work zone was predicted using ATMS data collected over a total of eight months, from November 2018 to June 2019.
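The aggregation described above can be illustrated with a minimal Python sketch. It is not the ATMS implementation: the function name `five_minute_link_speeds`, the input layout (pairs of upstream and downstream detection times for one link), the use of an arithmetic mean within each bin, and the choice to bin by the downstream detection time are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def five_minute_link_speeds(passages, link_length_m):
    """Aggregate per-vehicle speeds on one link into five-minute bins.

    passages: iterable of (upstream_time, downstream_time) datetimes, one
    pair per vehicle detected at both roadside devices (hypothetical layout).
    Returns {bin_start: mean speed in km/h}, ordered by time.
    """
    bins = defaultdict(list)
    for t_up, t_down in passages:
        travel_s = (t_down - t_up).total_seconds()
        if travel_s <= 0:
            continue  # skip inconsistent detections
        speed_kmh = link_length_m / travel_s * 3.6
        # Assumption: a vehicle belongs to the bin of its downstream detection.
        bin_start = t_down.replace(
            minute=t_down.minute - t_down.minute % 5, second=0, microsecond=0
        )
        bins[bin_start].append(speed_kmh)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}


# Three hypothetical vehicles on a 500 m link.
passages = [
    (datetime(2019, 2, 15, 10, 1, 0), datetime(2019, 2, 15, 10, 1, 30)),
    (datetime(2019, 2, 15, 10, 3, 0), datetime(2019, 2, 15, 10, 3, 45)),
    (datetime(2019, 2, 15, 10, 6, 0), datetime(2019, 2, 15, 10, 6, 36)),
]
speeds = five_minute_link_speeds(passages, link_length_m=500)
```

Each key of `speeds` plays the role of GENERATEDDATE and each value the role of SPEED for a single STD_LINK_ID.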

2.3. UTIS Data

UTIS contains information on events that occur on the road, such as traffic accidents, road construction, public events, and weather conditions. Table 3 shows the UTIS data, which includes the event ID, the link ID based on the standard node link, the event start and end dates, the event information, and the location of occurrence. After the UTIS data from November 2018 to June 2019 was collected, the road work records (1,815 cases) were used to extract the ATMS data at the time of road work through link ID matching.

2.4. Data Preprocessing

This study used travel speed data on arterial roads (1,672 links) collected through ATMS and road work records (1,815 cases) for eight months (November 2018 to June 2019). For each link, the ATMS speed data was classified into days with and without road works; on some links, road work was performed twice or more within the eight months. Since the day of the week is one of the many factors that affect travel speed, the data was classified by day, from Sunday to Saturday, so that the characteristics of each day could be reflected in the travel speed prediction model for the work zones. The network data with road works in progress was extracted and used as training data, and the data on the networks under normal conditions without road works was used to analyze the network impact caused by these construction or maintenance activities.

Because traffic data contains random noise in the measured values owing to its stochastic characteristics, the noise in the historical speed data must be removed through smoothing [40]. Thus, a moving average was applied over five-minute time intervals (the travel speeds for ten minutes before and after, including the speed at the $t$-th time), as in equation (1). The historical speed data was transformed into a smoother time series with outliers removed, as shown in Figure 1.

$$\bar{v}_{t} = \frac{1}{5} \sum_{i=-2}^{2} v^{h}_{t+i} \tag{1}$$

In equation (1), $\bar{v}_{t}$ is the speed at time $t$ after the smoothing operation was performed, $v^{h}_{t+i}$ is the historical speed data, and $i$ indexes the five-minute intervals.
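The smoothing step can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `smooth_speeds` is hypothetical, and the handling of the series boundaries (averaging over whatever part of the window exists) is an assumption, since the text does not say how the first and last intervals were treated.

```python
def smooth_speeds(speeds, half_window=2):
    """Centered moving average over five-minute speed values.

    half_window=2 covers ten minutes before and after each interval,
    i.e. five values in total. Near the ends of the series the window is
    truncated (an assumption, not taken from the paper).
    """
    n = len(speeds)
    smoothed = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        window = speeds[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical five-minute speeds (km/h) with one outlier at index 2.
raw = [50, 52, 20, 51, 49, 50, 48]
smoothed = smooth_speeds(raw)
```

In this example the outlier of 20 km/h is replaced by the average of its five-value window, which pulls it back toward the surrounding speeds.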

3. Methodology

3.1. Cluster Analysis

Cluster analysis refers to grouping data that has a similar pattern [33]. In this study, cluster analysis was performed to improve the accuracy of the prediction results and to reduce the computation time required for prediction. Travel speed is affected by various factors such as the road environment (e.g., speed limit, number of lanes), so it can be predicted more accurately when links with similar road environments are clustered, because the noise from inconsistent data is removed [33]. Moreover, grouping the data improves the prediction speed of the KNN algorithm, which otherwise deteriorates as the number of samples increases [32, 41].

The k-means clustering algorithm, a partitional clustering method, was applied because its concept is relatively simple, making it easier for traffic managers to understand, and because its calculation time is short, making it easy to use in a real-time information system [42]. The number of lanes and the speed limit were used as input variables for k-means clustering. As seen in Table 1, the link data contains numerous pieces of information, but the variables suitable for k-means clustering are limited. For example, since the road grade or road type describes the hierarchy of roads (expressway/general road, highway/urban road/rural road, etc.) rather than the characteristics of an individual link, it is difficult to use them to cluster similar links. Conversely, the number of lanes and the speed limit affect the capacity of the construction section network [43, 44], and network capacity is related to the traffic speed [45]. Therefore, similar links were grouped using the number of lanes and the speed limit as the k-means clustering input variables.

Next, the optimal number of clusters (k) was determined. Methods for determining k include the elbow method, gap statistic, silhouette coefficient, and canopy [42]. This study used the elbow method, which is the most frequently used method for determining k. It selects k as the point at which the cluster variability (within-cluster sum of squares) levels off as the number of clusters increases [46]. Accordingly, k = 3, the inflection point of the graph illustrated in Figure 2, was selected. The three resulting clusters can be characterized as follows: Cluster 1 contains roads with a speed limit of 50 km/h or lower and three lanes or fewer, Cluster 2 contains roads with a speed limit of 60 km/h or higher and four lanes or more, and Cluster 3 contains roads with a speed limit of 60 km/h or higher and three lanes or fewer. The cluster analysis result was used to find the most similar past travel speed pattern when predicting the travel speed of the work zone through the KNN algorithm.
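The elbow procedure can be sketched with a small, dependency-free k-means in Python. This is a minimal sketch under stated assumptions, not the study's implementation: the helper names (`sq_dist`, `kmeans`, `elbow_curve`), the random initialization with a fixed seed, the unscaled input features, and the nine example links are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, wcss


def elbow_curve(points, k_max):
    """Within-cluster sum of squares for k = 1..k_max."""
    return {k: kmeans(points, k)[1] for k in range(1, k_max + 1)}


# Hypothetical links described by (number of lanes, speed limit in km/h).
links = [(2, 50), (3, 50), (2, 40), (4, 60), (5, 70),
         (4, 80), (2, 60), (3, 60), (3, 70)]
curve = elbow_curve(links, k_max=5)
```

Plotting `curve` against k and picking the point where the decrease levels off corresponds to the elbow method. In practice the two features would likely be scaled first, because the speed limit otherwise dominates the distance.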

3.2. CKNN Algorithm

The KNN algorithm is a nonparametric method used for classification or regression. It makes predictions by referring to the K training data that are most similar to the input data [33]. Similarity is usually measured with the Euclidean distance. The KNN algorithm is preferred for predicting short-term traffic conditions because its basic model is simple and its calculation time is short, with data matching based on simple similarity. In particular, it predicts well for complex nonlinear problems and can reflect traffic conditions involving incidents or traffic jams.

The KNN algorithm used in this study predicted the speed over the forecast duration by referring to the K training data that were most similar to the travel speed pattern during the lag duration before the road work’s starting time (t). The detailed analysis procedure of the KNN algorithm is presented in Figure 3. When designing the KNN algorithm, 1,453 of the 1,815 cases were used as training data for designing the predictive model, and the remaining 362 cases were used as test data for finding K and verifying the algorithm.

To find the most similar past travel speed pattern, data with the same day of the week and link cluster number as the input data is first filtered from the historical travel speed data. This study refers to the method as the CKNN algorithm because the cluster results were used to find similar travel speed patterns [33]. Then, equation (2) was used to calculate the Euclidean distance between the travel speeds over the lag duration, and the K training data with the smallest Euclidean distances were selected.

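$$d = \sqrt{\sum_{i=1}^{l} \left( v^{r}_{t-i} - v^{h}_{t-i} \right)^{2}} \tag{2}$$

Here, $d$ is the Euclidean distance, $v_{t-i}$ is the speed data $i$ intervals before the road work start time $t$, $v^{r}$ is the real-time speed data, $v^{h}$ is the historical speed data, and $l$ is the lag duration.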

Finally, the travel speed for each period (in five-minute intervals) since the start of road work of the most similar K training data was reflected in equation (3) to predict travel speed after the current road work commenced. The method for pattern matching using Euclidean distance is shown in Figure 4.

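$$\hat{v}_{t+p} = \frac{1}{K} \sum_{k=1}^{K} v^{h}_{k,\,t+p} \tag{3}$$

In equation (3), $\hat{v}_{t+p}$ is the predicted travel speed, $v^{h}$ is the historical speed data, $v^{h}_{k,\,t+p}$ is the $k$-th nearest neighbor speed, $K$ is the number of nearest neighbors, and $p$ is the prediction horizon (every five minutes).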
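The filtering, matching, and averaging steps can be put together in a short Python sketch. It is a sketch under assumptions rather than the study's code: the function name `cknn_predict`, the record layout (dictionaries with `day`, `cluster`, `lag`, and `future` keys), and the use of an unweighted mean over the K neighbors are hypothetical choices for illustration.

```python
import math


def cknn_predict(recent, history, day, cluster, k=5, horizon=12):
    """Predict five-minute speeds after a road work start time.

    recent:  speeds over the lag duration before the start (oldest first).
    history: list of dicts with keys 'day', 'cluster', 'lag' (same length
             as recent) and 'future' (at least `horizon` speeds).
    The record layout and the unweighted mean are assumptions.
    """
    # Step 1: keep only records with the same day of week and link cluster.
    candidates = [h for h in history
                  if h["day"] == day and h["cluster"] == cluster]
    if not candidates:
        raise ValueError("no historical pattern for this day and cluster")
    # Step 2: rank candidates by Euclidean distance over the lag duration.
    ranked = sorted(candidates, key=lambda h: math.dist(recent, h["lag"]))
    neighbors = ranked[:k]
    # Step 3: average the neighbors' speeds for each horizon step.
    return [sum(n["future"][p] for n in neighbors) / len(neighbors)
            for p in range(horizon)]


# Hypothetical historical patterns (20-minute lag = four values).
history = [
    {"day": "Fri", "cluster": 1, "lag": [40, 40, 40, 40], "future": [38, 36]},
    {"day": "Fri", "cluster": 1, "lag": [41, 41, 39, 41], "future": [36, 34]},
    {"day": "Fri", "cluster": 1, "lag": [60, 60, 60, 60], "future": [55, 55]},
    {"day": "Mon", "cluster": 1, "lag": [40, 41, 39, 40], "future": [10, 10]},
]
prediction = cknn_predict([40, 41, 39, 40], history,
                          day="Fri", cluster=1, k=2, horizon=2)
```

Filtering before the distance calculation is what keeps the search small: only patterns from the same day of the week and link cluster are ever compared with the input.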

3.3. Selection of Optimal K and Appropriate Lag Duration (in CKNN Algorithm)

When designing the CKNN algorithm, it is essential to determine the lag duration required for the Euclidean distance calculation, the optimal K, and the forecast duration. The lag duration and K can be selected using the mean absolute percentage error (MAPE), mean absolute error (MAE), and root mean square error (RMSE), which are measures for evaluating the predictive power of a model. In this study, predictions were performed on the test data while varying the lag duration and the K value, and the results were compared to select the most suitable lag duration and K for the model. The lag duration and the optimal K were chosen based on the prediction accuracy of the CKNN algorithm according to the three error criteria in equations (4)–(6).

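$$\mathrm{MAPE} = \frac{100}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left| \frac{v_{i,t} - \hat{v}_{i,t}}{v_{i,t}} \right| \tag{4}$$

$$\mathrm{MAE} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left| v_{i,t} - \hat{v}_{i,t} \right| \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( v_{i,t} - \hat{v}_{i,t} \right)^{2}} \tag{6}$$

Here, $v_{i,t}$ is the actual speed data, $\hat{v}_{i,t}$ is the predicted travel speed, the index $i$ denotes the $i$-th work zone link speed, $N$ is the number of data on the road works, and $T$ is the number of time intervals for the forecast duration.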
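The three error criteria can be computed with a short Python function. This is a sketch using the standard definitions of MAPE, MAE, and RMSE; the function name `error_metrics` and the nested-list input layout (one list of speeds per work zone) are assumptions, and actual speeds are assumed to be nonzero.

```python
import math


def error_metrics(actual, predicted):
    """Return (MAPE in %, MAE, RMSE) over all work zones and time intervals.

    actual, predicted: lists with one inner list of speeds per work zone,
    aligned by five-minute interval. Actual speeds must be nonzero.
    """
    pairs = [(a, p)
             for row_a, row_p in zip(actual, predicted)
             for a, p in zip(row_a, row_p)]
    n = len(pairs)
    mape = 100 / n * sum(abs(a - p) / a for a, p in pairs)
    mae = sum(abs(a - p) for a, p in pairs) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in pairs) / n)
    return mape, mae, rmse


# Two hypothetical work zones with two forecast intervals each.
actual = [[50, 40], [20, 25]]
predicted = [[45, 44], [22, 25]]
mape, mae, rmse = error_metrics(actual, predicted)
```

MAPE is scale-independent, whereas MAE and RMSE are expressed in km/h, which is why the criteria can favor different values of K.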

3.4. Determination of the Impact of Road Work on the Network and Travel Speed Degradation

Road work in an urban area may or may not reduce the network traffic speed, depending on the scale or type of work. Thus, a probability distribution model was applied to determine whether the road work actually causes a decrease in travel speed. The impact of road work on the network was determined by comparing the speed under normal network conditions with the speed predicted by the CKNN algorithm and checking whether the 95% confidence level is met. Assuming that speeds under normal conditions follow a normal distribution, the road work was judged to have no impact on the network when the predicted value satisfied the 95% confidence level of the average speed under normal conditions.

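$$z_{t} = \frac{\hat{v}_{t} - \mu_{t}}{\sigma_{t}} \tag{7}$$

In equation (7), $\hat{v}_{t}$ is the predicted travel speed in time $t$, $\mu_{t}$ is the average speed in time $t$, and $\sigma_{t}$ is the standard deviation of the normal speed in time $t$.

If the z-score calculated through equation (7) exceeds the critical value of the 95% confidence interval of the average travel speed under normal conditions, the road work is judged to have affected the network. When calculating the predicted network degradation caused by road work, equation (8) is applied to determine the speed degradation relative to the average speed.

$$D_{t} = \frac{\mu_{t} - \hat{v}_{t}}{\mu_{t}} \times 100 \tag{8}$$

In equation (8), $D_{t}$ is the speed degradation (%) in time $t$.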
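The impact test and the degradation calculation can be sketched together in Python. This is an illustrative sketch only: the function name `work_zone_impact` is hypothetical, the critical value of 1.96 assumes a two-sided 95% interval of the standard normal distribution, and comparing the absolute z-score with it is an assumption about how the test was applied.

```python
def work_zone_impact(predicted, normal_mean, normal_std, z_crit=1.96):
    """Compare predicted work zone speeds with normal-condition speeds.

    All inputs are lists aligned by five-minute interval.
    Returns one (z_score, affected, degradation_percent) tuple per interval.
    z_crit=1.96 assumes a two-sided 95% interval (not stated in the paper).
    """
    results = []
    for v_hat, mu, sigma in zip(predicted, normal_mean, normal_std):
        z = (v_hat - mu) / sigma
        affected = abs(z) > z_crit            # outside the 95% interval
        degradation = (mu - v_hat) / mu * 100  # speed drop relative to normal
        results.append((z, affected, degradation))
    return results


# Hypothetical speeds (km/h) for two intervals after the road work starts.
impact = work_zone_impact(predicted=[40, 34],
                          normal_mean=[41, 40],
                          normal_std=[2, 2])
```

In this example the first interval stays within the interval around the normal speed, while the second is flagged as affected with a degradation of 15% relative to the normal average.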

3.5. Case Study

In this section, the forecast accuracy of the prediction algorithm was evaluated by selecting the optimal K and lag duration when predicting the travel speed of the work zone. The predicted speed during road work and the normal speed without road work were determined, including the work zone’s impact on network and travel speed degradation.

First, the CKNN algorithm and 362 test data were used to select the optimal K and lag duration. Based on the three error criteria presented above (equations (4)–(6)), the accuracy of the CKNN algorithm was analyzed according to the increase in K and lag duration for each prediction time (one hour, two hours, three hours), as illustrated in Figure 5.

Second, the prediction accuracy based on the three error criteria was highest when the lag duration was 20 minutes. Thus, the speed pattern for the previous 20 minutes should be used when designing the CKNN algorithm. Prediction accuracy decreased as the forecast duration became longer, so a forecast duration of one hour was identified as the most suitable. Lastly, to find the optimal K, the predictive power was evaluated for different K values with the lag duration set to 20 minutes and the forecast duration set to one hour (Figure 6).

Table 4 shows the values of MAPE, MAE, and RMSE when K has values from 1 to 10. Figure 6 shows that when K is ten or more, the value of each criterion continues to increase, so no minimum can be found in that range. MAPE reached its minimum value when K was 5, MAE reached its minimum value when K was 2, and RMSE reached its minimum value when K was 2 and 5. Since the difference between the MAE values at K = 2 and K = 5 was small and MAE is a scale-dependent error, K was set to 5, which minimized MAPE.

To verify the model for predicting travel speed in a work zone, the test data was used as the input data, and the travel speed for one hour after the start of the road work was predicted. The result is presented in Figure 7.

As a result of predicting the test set using the CKNN algorithm, the average MAPE, 6.9%, exhibited excellent predictive power, as indicated in Table 5. In some cases, the MAPE for the predicted value exceeded 15%, but most of them were predictable within 10%. Thus, the model accuracy was considered high [47].

Table 6 specifies the results of analyzing the network impact due to the road work carried out at 10:25 am on Friday, February 15, 2019. The network was classified as Cluster 1. Using the CKNN algorithm, the travel speed one hour after road work was predicted and compared with the network under normal conditions.

Consequently, there was no difference between the speed predicted by CKNN and the normal speed at the beginning of the road work. It was found that the network effect occurred from about 25 minutes after the start of the work, and the speed decreased by about 11%–17% compared to normal conditions, suggesting that road or traffic managers need to establish a strategy to reduce congestion about 30 minutes after starting. Predicting the speed and judging the network impact can also forecast congestion intensity by time, enabling more active and preemptive traffic and congestion management.

4. Conclusion

To prevent congestion caused by road works, it is crucial to predict the travel speed after road work begins and to prepare an appropriate traffic management strategy for the expected congestion level. This study developed a model that predicts the travel speed of the work zone using the CKNN algorithm. Furthermore, a method was presented for determining how much the traffic speed decreases due to road work by comparing the predicted speed with the normal speed pattern.

Most methodologies for short-term speed prediction proposed in existing studies predict speed under normal road conditions. Since roads in a work zone are entirely or partially blocked, the speed pattern differs from that under normal road conditions. The case study showed that the proposed methodology can accurately predict the speed from the start of road construction up to an hour later. Furthermore, it was also feasible to provide useful information for preemptive traffic congestion management by detecting the timing of link speed degradation caused by the capacity reduction due to road work.

However, this study had limitations that need improvement through future studies. First, the established model for predicting travel speeds in a work zone filtered the data using the day of the week and link clusters classified according to road characteristics. Still, it is necessary to use work type as a filter. For example, road works that block or occupy roads largely affect traffic conditions. However, work conducted on drains or sidewalks will only slightly influence traffic conditions. Therefore, better results can be achieved if the travel speed of the work zone can be predicted by considering the work type.

Second, the prediction model was developed using eight months of data for major arterial roads equipped with traffic information collection devices. Although the amount of data was not small, it was still insufficient for securing historical patterns similar to the input data.

Third and last, this study used only the CKNN algorithm for speed prediction. The appropriateness of the proposed methodology still needs to be evaluated by comparing it with the results predicted by other methods such as support vector machines, random forests, and neural networks.

Data Availability

The data used to support the findings of this study are not publicly made available according to the data security policy of Daegu Metropolitan City.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the 2021 Research Fund of the University of Seoul.