#### Abstract

Macroscopic Fundamental Diagram (MFD) reveals the relationship between network accumulation and flow at the macroscopic level. The network traffic flow state analysis is a fundamental problem for the MFD-based applications. Theoretical and experimental investigations have provided insights into the dynamics and characters of traffic flow states. Although many empirical studies had been conducted in the field of MFD, few studies were dedicated to investigate the network traffic flow states with field data. This study aims to develop a data-driven method based on time series analysis of MFD state points to characterize critical transition state (CTS) of network traffic flow using field data. The proposed method was tested in a real network of Kunshan City, China. The test results showed that the CTS points can be well captured by the proposed method. The identified CTS points distinguished the traffic states between free-flow state and optimal accumulation state, and the optimal accumulation state was characterized. The day-to-day pattern of CTS points was investigated by the Gaussian Mixture Model-based clustering model. An extended application of real-time identification of CTS points was also discussed. The proposed method is helpful to understand the temporal evolution process of network traffic flow and provides potentials for developing more reliable network traffic flow management strategies, such as optimizing traffic signal plans and developing strategies for congestion tooling.

#### 1. Introduction

Macroscopic Fundamental Diagram (MFD) reveals the relationship between network accumulation and flow at the macroscopic level, and can be readily obtained with traffic flow data from existing detectors (e.g., loop detectors). The core property of MFD is that the network flow is maximized when network traffic stays at an optimal accumulation state. This property can be used to develop effective applications such as traffic signal gating control [1–3], congestion pricing [4], route guidance [5] and traffic safety analysis [6]. To the successes of these applications, one of the important bases is to understand the characteristic of network traffic flow states in MFD, including the state transitions in the temporal evolution process.

To characterize the network traffic flow states with MFD, analytical methods were proposed in theoretical studies. Daganzo and Gayah [7] illustrated that bifurcations occurred when the network accumulation came up to a certain value. In light of their work, a method based on the phase portraits of the dynamic Equations was developed to determine the boundaries of the stable and unstable regions of states in MFD [8]. Gayah and Daganzo [9] found that the clockwise hysteresis phenomenon during the congestion recovery period was caused by the instability of network traffic flow. Daganzo and Geroliminis [10] developed an analytical approximation method to derive the upper bound of the average flow conditional on average density. The derived network flow was interpreted as the theoretical capacity of network. Following their work, Geroliminis and Boyac [11] investigated the effects of different link lengths and signal timing on network capacity. Although these studies have revealed important properties of macroscopic network traffic flow states, the theoretical methods were not ready to apply in practice due to limitations such as the overestimation of network capacity [12, 13].

More practically, empirical studies using field data gained more attentions. In literature, most studies focused on the estimation of MFD with different sources of data, including loop detector data [14, 15], probe vehicle data [15–17] and connected vehicle data [18]. However, few studies dedicated to develop methods for the analysis of network traffic flow states, such as the network traffic flow state identification and state evolution pattern recognition. Saberi and Mahmassani [19] conducted a comparison analysis to investigate the capacity drop phenomena using loop detector data from three different freeway networks. The optimal accumulation states were typically identified by observation [20, 21]. A recent study by Ampountolas and Kouvelas [22] proposed a Kalman Filter based method to identify the maximum network flow according to the changes of slopes between consecutive points in MFD.

In contrast to the limited analysis of network traffic flow states, the analysis of link traffic flow states was extensively discussed using field data in the past decades. The equilibrium relationship among the traffic flow, density, and speed on links of urban roads and freeways was obtained by fitting the scatter state points in link traffic flow fundamental diagram [23–25]. Further, clustering methods were applied to the scatter state points to identify the traffic flow states. The methods include nested clustering [26], fuzzy c-means clustering [27, 28], online agglomerative clustering [29], K-means [30], multivariate clustering [31], and DBSCAN clustering [32]. These studies intended to answer the questions that how many traffic states could be described, and what the pattern of traffic flow is within each state. However, the temporal evolution patterns of traffic states are not specifically addressed in the clustering methods. Thus, in another perspective, analysis methods were developed to identify the transition state on the time series of the traffic flow states. The key procedure of these studies is to identify the breakpoints in a time continuous traffic flow state series. Banks [33] measured the significant changes of the slopes between consecutive flow-occupancy state points and identified the transition states between a free-flow state and a synchronized flow state. Vlahogianni et al. [34] used the Recurrence Plot (RP) to identify the point where a shift or abrupt change in the traffic volume series taken place. Kamarianakis et al. [35] used a smooth-transition regression to identify an asymmetric transition behaviour between free-flow and congestion state. Tang et al. [36] and Yan et al. [37] reconstructed the time series of traffic flow, occupancy, and speed as a network graph and identified different traffic flow states within the time series based on a complex network method. Based on the analysis of the time series of traffic flow states, Blandin et al. [38] proposed a modified LWR model to describe the temporal transition characteristic of traffic flow states. Despite the similarity of the problems between the link and network traffic flow state analysis, the characteristic of traffic flow states on network could be very different from that on a single link.

The problem addressed in this study is to identify and characterize the critical transition state (CTS) point in MFD using field data. The CTS point is defined as the point where significant changes occurs in a MFD time series, meaning that the network traffic flow state before and after a CTS point is significantly different. First, the similarity of two subseries before and after a point in a MFD time series was measured. Second, the CTS points were identified based on the similarity measurement. Third, the day-to-day patterns were investigated based on the identified CTS points. The proposed method was designed to analyse network traffic flow states based on historical data. In addition, an extended application of real-time identification of CTS points was also discussed in the end. The proposed method is helpful to better understand the evolution process of network traffic flow and offers opportunities to develop more reliable and responsive network traffic flow management strategies. For example, with the identification of CTS point, it is helpful to determine when to use the peak control strategy to improve traffic flow conditions in real time. Besides, it also can be used to determine to the best period to implement congestion tooling policy before the urban traffic flow becomes too congested.

#### 2. Methodology

##### 2.1. Framework

The framework of the methodology involves three parts, as illustrated in Figure 1. The first part is to measure the similarity between two subseries around each point in a MFD time series. The two subseries are extracted by a sliding window structure, and the Dynamic Time Wrapping (DTW) distance [39] is used as the similarity measurement. The second part is to find the most significant DTW distance through a peak searching process to identify the CTS point. The Locally Weighted Regression (LOESS) [40] is initially used to smooth the DTW distance series extracted in the first part. Then, the peaks of the smoothed DTW distance series are determined based on a root-finding method [41]. The third part is to analyse the day-to-day patterns of the CTS points identified from MFD time series of different days. The patterns are recognized by a Gaussian Mixture Model (GMM) clustering method [42].

##### 2.2. Measuring Pattern Similarity with Sliding Windows

Let be a state point in a MFD time series for the *th *interval in a day, and is a two-dimensional vector defined as the network occupancy and network flow . The network occupancy and network flow are the average occupancy over each link and average volume weighted by link length in a network [14].

The two subseries before and after are defined in two sliding windows with the length , as and , respectively. The similarity measurement between the two subseries is notated as and attributed to . and are normalized to ensure the two subseries are comparable. The Dynamic Time Wrapping (DTW) distance is used to calculate . The reason that DTW distance is adopted as the similarity measurement of the two subseries rather than Euclidean distance is that the DTW distance is able to compare the trends (e.g., increasing or decreasing trends) between two series. Let and be the *th* and th element of and , respectively.

The schematic diagram for calculation of the DTW distance is illustrated in Figure 2. The DTW distance is calculated in two steps. First, calculate the local cost matrix. Each item of the matrix with index represents a local cost between and . Second, calculate the accumulated cost matrix or global cost matrix, as shown in (1). Third, find the shortest path with the global cost matrix shown as the path in Figure 2. The global cost is the DTW distance between the two series. The DTW distance can be measured for two series with different lengths. In this study, the lengths of two series are the same as . Thus, the cost matrix is . The ) is calculated as the Euclidean distance between and . To effectively find the shortest path, dynamic programming algorithm is used.where is the global cost from item ) to item . By iteratively solving (1), the shortest path can be obtained.

##### 2.3. Peak Searching for Critical Transition State Point

Based on the DTW distance, the CTS points could be identified as the points with significantly large DTW distances. Intuitively, a threshold-based method could be used. Thus, all points with DTW distances larger than a predetermined threshold would be regarded as the CTS points. However, the selection of the thresholds could be difficulty and arbitrary because different networks may have distinct thresholds. Basically, the identification of the CTS points is an extreme value analysis problem over a series. Therefore, a more sophisticated algorithm was developed in this study. The algorithm was described as follows.

*Step 1. *Smooth the distance series with Locally Weighted Regression (LOESS) [40]. The smoothing process is initially conducted in order to filter the occasionally and locally fluctuated points in the DTW distance series. Let be the smoothed value of the* jth* item in the DTW distance series which were divided into equally spaced intervals.

*Step 2. *Determine the ranges that contain the CTS points. The identification of CTS is to find the extreme point (EP) of DTW series. To determine the range of each extreme point (EP), the approximate extreme point (AEP) is captured first. The AEP locates at a small scale that an extreme point was contained. Notate and as the first and second order difference of . The monotonicity of is determined by comparing the sign of which were expressed in (2) and (3).

If , the monotonicity of changed compared with , and there is an AEP around interval. Based on the determined AEPs, the ranges that contain EPs are determined by the centres of neighbouring AEPs.

*Step 3. *Identify the CTS points using Brend’s method. The identification for CTS (EP) point can be viewed as rooting-finding of the first derivative of within each range. Brend’s method [41] is used to find the extreme point with local maximum distance within each range. The method is a root-finding algorithm combining the bisection method, the secant method, and inverse quadratic interpolation which has the advantage of fast-converging. Corresponding to , the critical state point is achieved.

##### 2.4. Day-to-Day Pattern Recognition of the CTS Points

Due to the stochastic nature of urban traffic flow, it is assumed that the day-to-day distribution of the CTS points is subject to a Gaussian Mixture Distribution. In this study, a Gaussian Mixture Model (GMM) is used to cluster the CTS points extracted from different days. Compared to other clustering methods such as the K-means, one appealing feature of the GMM clustering method is that the probability of a state point belonging to a cluster is available. The GMM clustering method is described as follows.

In GMM, the data is assumed to be generated from a mixture of a finite number of Gaussian distributions. Each Gaussian distribution is regarded as a component of the GMM, and the components are linearly combined by mixing weights. In fact, the GMM clustering is equivalent to estimate the unknown mixing weights and parameters of each Gaussian distribution. In this study, in addition to the network flow and occupancy, the temporal feature (e.g., time index during a day) of the CTS points is also included to the clustering process. Let represents the set of the identified CTS points from multiple days. which represents for CTS point is a vector formed by network flow-occupancy and the time of day for the th element in . is the Probability Density Function (PDF) of* kth* component. The GMM model is shown in where presents the* kth* component in the GMM, is the total number of the components, , and are the mean and covariance of the* kth* component, and is the mixing weight of the* kth* component.

As described in (5), the GMM-based clustering method can be viewed as a process to find the optimal parameters and to maximize the likelihood function of GMM.

and were iteratively estimated using the Expectation-Maximization (EM) algorithm [42]. The E-step which is expressed in (6) aims at getting the posterior probability () of belonging to component with initialled and . The M-step is to get the new estimate of and through maximizing with posterior probability achieved in E-step. The estimation of parameters is shown in (7)-(9).where is the total number of all critical transition state points. is a latent variable that determines the component from which the observation originates. The EM Algorithm is shown in Algorithm 1, is the initial value for , and is the th literation. is the difference between two neighbouring iterations, and is a small real number, and is the optimal value. It is worth mentioning that the BIC value is used to avoid overfitting and determine the optimal model parameters and the number of clusters [43].

#### 3. Implementation

##### 3.1. Data Collection

A real network with 13 signalized intersections and 26 links in Kunshan City, China, was selected as the test site, as shown in Figure 3. All intersections operated under pretimed signal control with four different signal plans in a day. The four arterials in the network had significant traffic volume during the day time. Oversaturated and spillover conditions were frequently observed during morning and afternoon peaks. Microwave-based detectors were installed on links except the ones shown as the dash link in Figure 3. All detectors were located approximately at the middle of the links and collected traffic counts and occupancy in every 5 minutes. Two months of data collected from August 1, 2015, to September 30, 2015, were used in this study. In addition, for validation purpose, GPS probe vehicle data from 1526 taxi cabs during the same period were also collected to calculate the network average speed. The space-mean speed of the network was simply calculated as the ratio between the total distance travelled and the total time spent by the taxi cabs in the test site for every 5 minutes.

##### 3.2. Identification of the CTS Points

Before calculating DTW distance for the two subseries in the sliding windows, it is important to select a reasonable window length since the two subseries are expected to reflect the traffic evolution patterns. However, due to the different OD patterns and signal operations in different networks, the window length is more likely network-dependent. In general, the window length should be shorter than the peak duration. Besides, the window length should not be too short to filter random fluctuations in data series and not too long to reflect traffic dynamics. Basically, determining a desirable length is a trade-off between reliability and accuracy.

In our case, the morning peak is from 6:00 to 9:00, and the afternoon peak is from 16:30 to 19:30. Therefore, the window length should be shorter than 3 hours. Based on that, the impacts of different window lengths on the measurement of the DTW distance were investigated. An index of Average Change Rate (ACR) was used to describe the fluctuation in the DTW distance series. The ACR is calculated as where is the normalized DTW distance associated with the* ith* state point in a MFD time series and is the number of state points.

Figure 4(a) shows the measurement of DTW distance calculated from the MFD time series on August 10, 2015, with window length ranging from 15 minutes to 3 hours. It is observed that the curves of the DTW distance become smoother with the increasing of window length. The curves with window lengths of 1 hour and 1.5 hours are observed as the relatively balanced cases where the trends are clear and informative short-term changes are retained. In addition, the ACR was calculated using data in the two months, as shown in Figure 4(b). Each point in Figure 4(b) represents the ACR of the DTW distance series of a day with a particular window length. When the window length is lower than 1 hour, the ACR decrease significantly from about 0.11 to 0.01. A turning point can be clearly spotted at the window length of 1 hour. Based on the investigation, the window length was selected as 1 hour for our network.

**(a)**

**(b)**

With the measurement of DTW distance, a peak searching method was applied. Figure 5 illustrates the identified peaks on the DTW distance curve of August 11, 2015. The red and blue marks represent the local maximum and minimum extreme points extracted by the proposed method. The local maximum extreme points are regarded as the CTS points in this study. As shown in Figure 5, the CTS points (see the points located at the high peaks of the curves) appear during the morning and afternoon peaks. Although the most critical transition state points are successfully identified, suspicious CTS points are also found at the local peaks on the curves. Some suspicious points (shown in the blue circle in Figure 5) are obviously invalid since the corresponding DTW distances are so small. These points could be readily excluded by a lower bound threshold. In this study, any state points with DTW distance less than 15 were considered as invalid points.

##### 3.3. Validation of the CTS Points

To validate the identified CTS points, the locations of the CTS points at the MFD time series, the correlation between the CTS points, and network space-mean speed were investigated. For ease of illustration, the morning and afternoon peak were separately considered. The identified CTS points on August 10, Monday, 2015, and August 12, Wednesday, 2015, are illustrated in Figure 6. Figures 6(a), 6(d), and 6(g) show the CTS points during the morning peaks of the three days, and Figures 6(b), 6(e), and 6(h) show the CTS points during the afternoon peaks of the three days. Figures 6(c), 6(f), and 6(i) illustrate the correlation between the CTS points and network space-mean speed for the three days. The red markers represent the identified CTS points. The arrows indicate the time sequence within the MFD time series.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

**(g)**

**(h)**

**(i)**

For the locations at the MFD time series, the identified CTS points approximately locate at the boundaries of the two distinguished traffic states; see Figures 6(a), 6(d), 6(g), 6(b), 6(e), and 6(h). The state points on the left side of the CTS points show a positive correlation between network flow and occupancy, while the correlation of state points on the right side is vague and the slopes between the consecutive state points are much more fluctuated than the left side. A flat shape of the right side state points can be observed, and the network flow fluctuates within a small range from 75 to 90 veh/5min. Based on this, it is reasonable to infer that the left and right side of the identified CTS points are corresponding to the free-flow and optimal accumulation state, respectively. In other words, the identified CTS points well captured the shifts of network traffic states between the free-flow and optimal accumulation state.

By incorporating the network speed, in all these three days, the CTS points are identified at the times when the network congestion just appeared and nearly recovered; see Figures 6(c), 6(f), and 6(i). Moreover, we can find that only the CTS points during the congestion formation stage are identified in the morning peaks of the three days. The reason may be that the traffic flow decreases slowly during the congestion recovery stage at the end of the morning peaks, and the traffic state after the morning peaks does not experience a significant change. According to the paired CTS points in afternoon peaks, the optimal accumulation state can be approximately estimated as (0.12, 73veh/5min) by averaging over the three days.

##### 3.4. Day-to-Day Pattern Analysis

The empirical distribution of the identified CTS points based on two-month weekday data is visualized in as shown Figure 7. A clear bimodality can be observed in the distribution. In fact, the bimodality is caused by the different distributions of the CTS points in the morning and afternoon peaks.

Considering the most congested afternoon peaks in our case, day-to-day patterns of the CTS points from the afternoon peaks in 38 weekdays were investigated with the GMM clustering. Two distinct clusters are found, as shown in Figure 8. The cluster centre for Cluster 1 (in blue) is (17:18, 0.13, 72.8 (veh/5min)), and for Cluster 2 (in red) is (18:10, 0.10, 72.2(veh/5min)). 51.6% and 48.4% of the total identified CTS points are in Cluster 1 and Cluster 2, respectively. Figures 8(a) and 8(b) illustrate the distribution of flow-occupancy and time in the two clusters, respectively. According to the time distribution, Cluster 1 could be the critical transition state indicating a transition from the free-flow state to optimal accumulation state, while Cluster 2 is the critical transition state from the optimal accumulation state to free-flow state. In terms of the flow-occupancy distribution, it is found that the CTS points in Cluster 1 are much more scattered than those in Cluster 2. This indicates that the transition from the optimal accumulation state to free-flow state is much more predictable, and the transition from the free-flow to optimal accumulation state has higher uncertainty.

**(a)**

**(b)**

##### 3.5. Extended Application of Real-Time CTS Point Identification

In addition to the analysis of the CTS points using historical data, an extended application of real-time CTS point identification is also discussed. Based on the proposed method, the MFD state points can be labelled as three classes: the CTS points of Cluster 1, the CTS points of Cluster 2, and non-CTS points. Thus, a probabilistic discriminant analysis [43] was conducted using the labelled dataset. The discriminant analysis, also known as supervised classification, assumes that the observations in each class are generated by a distribution specific to that class. Let be the probability distribution of the* kth* class which is Gaussian distribution in this study, be the proportion of observations of class k in the population, and be a real-time observation of MFD state points. The posterior probability that belongs to the* kth* class is expressed in

In this study, the labelled MFD state points in the afternoon peaks of 45 days are used as the training set, and another 15 days as the test set. Each MFD state point was identified as an observation of the class with the highest probability that belongs to. According to the test results, the error of the discriminant method is about 8.1%.

The discrimination results for three typical weekdays are illustrated in Figure 9. The black, red, and blue dots in Figure 9 represent the non-CTS points, the CTS points of Cluster 1, and the CTS point of Cluster 2, respectively. The probabilistic discriminant analysis performed well in identifying the CTS points. All the identified CTS points are located at the boundaries between the free-flow and optimal accumulation state, as shown in Figures 9(a), 9(c), and 9(e), and illustrate a strong correlation to the changes of network speed during the formation and recovery period of the network congestion.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

#### 4. Conclusions

This study proposed a data-driven method to identify and characterize the critical transition state in MFD using field data. The identification of the critical transition state is based on the analysis of the MFD time series. The DTW distance is used to measure the similarity of subseries before and after each state point in a MFD time series, and a peak searching method is developed to identify the critical transition state points based on the DTW distance series. The GMM clustering is used to characterize the day-to-day pattern of the identified critical transition state points. The probabilistic discriminant analysis is performed to identify the CTS points.

The identification and validation of CTS points indicated that the proposed method well identified the network transition states. By analysis, the identified CTS points well captured the shifts of network traffic states between the free-flow and optimal accumulation state. The optimal accumulation state can be approximately estimated as (0.12, 73veh/5min) by averaging over the three days. A clear bimodality can be observed by exploring the probability density of identified CTS points, and the bimodality is caused by different distributions of the CTS points in the morning and afternoon peaks. Moreover, the CTS points were classified as two categories. One cluster represents the transition from free-flow state to optimal accumulation state, and the other represents the transition from optimal accumulation state to free-flow state. Besides, the probabilistic discriminant analysis performed well in identifying the CTS points. All the identified CTS points were located at the boundaries between the free-flow and optimal accumulation state. The proposed method is helpful to better understand the evolution process of network traffic flow and offers opportunities to develop more reliable and responsive network traffic flow management strategies.

In future, the impact factors of critical transition states in actual network will be investigated through comparative analysis under certain conditions. Besides, it is encouraged to test the proposed method with different sources of data (e.g., probe vehicle or vehicle license plate reidentification data). Also, more field validation works are needed to further interpret the critical transition states considering real traffic behaviours.

#### Data Availability

Partial of the detector data used to support the findings of this study are included within the supplementary information file.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The work was funded by the National Natural Science Foundation of China (no. 71871055) and the Key Research Program of Jiangsu Province Science and Technology Department (no. BE2017027).

#### Supplementary Materials

The supplementary material file is partial of the detector data used to derive the MFD for test site. The data are valid, and they were formed by STATIONID (the ID number of specific link), CALENDAR_KEY (the date of traffic flow data for specific link), FIVEMINX (the time of traffic flow data for specific date), OCCUPANCY (the occupancy/5min for specific link), and VOLUME (the flow/5min for specific link) from August 1, 2015, to September 30, 2015.* (Supplementary Materials)*