Simultaneous Generation of Optimum Pavement Clusters and Associated Performance Models

Khadka, Mukesh; Paz, Alexander; Arteaga, Cristian; Hale, David K.

doi:https://doi.org/10.1155/2018/2159865

Mathematical Problems in Engineering

On this page

Abstract Introduction Results Conclusions Data Availability Conflicts of Interest Acknowledgments References Copyright Related Articles

Research Article | Open Access

Volume 2018 | Article ID 2159865 | https://doi.org/10.1155/2018/2159865

Simultaneous Generation of Optimum Pavement Clusters and Associated Performance Models

Mukesh Khadka,¹Alexander Paz,²Cristian Arteaga,¹and David K. Hale³

Academic Editor: A. M. Bastos Pereira

Received07 Jun 2018

Accepted26 Nov 2018

Published12 Dec 2018

Abstract

With regard to developing pavement performance models (PPMs), the existing state-of-the-art proposes Clusterwise Linear Regression (CLR) to determine the pavement clusters and associated PPMs simultaneously. However, the approach does not determine optimal clustering to minimize error; that is, the number of clusters and explanatory variables are prespecified to determine the corresponding coefficients of the PPMs. In addition, existing formulations do no address issues associated with overfitting as there is no limit to include parameters in the model. In order to address this limitation, this paper proposes a mathematical program within the CLR approach to determine simultaneously an optimal number of clusters, assignment of segments into clusters, and regression coefficients for all prespecified explanatory variables required to minimize the estimation error. The Bayesian Information Criteria is proposed to limit the number of optimal clusters. A simulated annealing coupled with ordinary least squares was used to solve the mathematical program.

1. Introduction

Typically, pavement performance models (PPMs) are developed using a two-step approach. First, pavement segments with similar characteristics are grouped into clusters using a few critical factors, such as pavement type, age, and traffic volume. Then, PPMs for each of the clusters are developed using statistical techniques. The objective of clustering is to group the pavement segments that perform similarly over time. However, in practice, the performances of pavement segments within a cluster differ significantly because clusters are formed using only a few critical factors [1, 2]. A major challenge is to select characteristics that define clusters and the corresponding segments associated with them [3]. If inappropriate characteristics are used, clusters may include homogeneous segments with different performance behavior or heterogeneous segments with similar performance behavior [4]. The prediction accuracy of PPMs can be improved by subdividing the pavement segments into more uniform clusters. However, this subdivision is not always possible due to limited information [1].

Figures 1(a) and 1(b) provide an example of heterogeneous performance behavior for two segments, each grouped within the same cluster (the Prioritization Category), using the two-step approach and actual data from the Pavement Management System (PMS) of the Nevada Department of Transportation (NDOT). Segments “SR445 (SB), MP: 40-39” and “SR445 (NB), MP: 36-37” were assigned to one cluster, Prioritization Category 4. Segments “SR156 (WB), MP: 3-2” and “SR892 (SB), MP: 25-24” were assigned into Prioritization Category 5.

(a)

(b)

(c)

(d)

In contrast, Figure 1(c) illustrates that segments “SR445 (SB), MP: 40-39” and “SR156 (WB), MP: 3-2” had homogeneous performance behavior. Similarly, segments “SR892 (SB), MP: 25-24” and “SR445 (NB), MP: 36-37” showed a consistent behavior (see Figure 1(d)). This suggests that factors other than the Prioritization Category are critical in causing the differences in performance behavior. Influencing factors could include subgrade type, traffic loading characteristics, or any hidden effects.

To address the limitations of the two-step approach, Spath [5] proposed using Clusterwise Linear Regression (CLR) to determine clusters and associated regression models simultaneously. CLR assigns pavement segments that have similar regression effects on the pavement performance such that the overall sum of squared errors is minimal. Hence, CLR minimizes the overall prediction error by simultaneously determining the explanatory variables’ coefficients and assigning each pavement segment into an appropriate cluster.

In the context of pavement management, CLR first was used by Luo and Chou [1] to model the deterioration of pavement conditions. First, the pavement network was clustered by using a few critical pavement characteristics. The subdivision was performed at the data-point level; that is, data points collected over various years for a pavement segment could be assigned to multiple clusters. Hence, there was a chance of pavement segments being associated with multiple performance models. An additional step was proposed to predict performance using the results from multiple models. Later, Luo and Yin [2] expanded their research, using CLR to formulate the development of pavement distresses. Both studies [1, 2] used pavement age as the only explanatory variable.

To address some of the limitations present in the studies completed by Luo and Chou [1] and Luo and Yin [2], Zhang and Durango-Cohen [6] developed CLR models using multiple explanatory variables. Data used in the study were collected during the AASHO Road Test [7], conducted in the late 1950s in Ottawa, Illinois. The data were collected from a single site in a relatively controlled environment; clearly, the site characteristics were not representative of all other locations. For example, the data did not represent the varieties of soil and climatic conditions. In addition, construction techniques and materials have changed substantially since this test was conducted. In this study, the number of clusters was determined manually using an “elbow” criterion. Experiments were run only for a number of clusters equal to 1, 5, 10, 15, 20, and 25. One possible reason for not examining all the possible number of clusters might have been large amount of required computational time given that an exchange algorithm was utilized with 100 instances with various initial assignments.

The previous CLR methods used to estimate PPMs do not test the explanatory power of variables in both cluster and regression analyses. That is, all the explanatory variables used in regression models are assumed to be significant. The effects of any insignificant explanatory variables on the dependent variable are also accounted for during assignment of pavement segments into clusters and estimation of the regression coefficients. The presence of any insignificant explanatory variables distorts the underlying regression effects of other significant explanatory variables. This may lead to an incorrect assignment of pavement segments into clusters and estimation of PPMs. Khadka and Paz [8] developed PPMs using CLR while testing the significance of the explanatory variables and obtained superior results relative to the case when significance was assumed.

Another key limitation of the existing CLR is the need to prespecify the number of clusters. In order to avoid prespecifying the number of required clusters, this study proposes a mathematical program to simultaneously determine an optimal number of clusters, the assignment of segments into clusters, and regression coefficients for all explanatory variables. This mathematical program is flexible enough to handle multiple explanatory variables, multiple observations per pavement segments, and user-defined constraints on cluster characteristics.

In previous studies using CLR for pavement management [1, 2, 6], the objective function was to minimize the sum of squared errors of prediction (SSE). It is intuitive that SSE decreases monotonically as the number of clusters increases. That is, for a given dataset, the optimum number of clusters always is the total number of data points [9]. An optimum number of clusters are equal to the total number of pavement segments, and each pavement segment is the sole member of its own cluster. Such clustering structure is unlikely to provide statistical reliable models. In addition, SSE always decreases when a new explanatory variable is added to the model [10]. Usually, this leads to an overfitting problem [11]. Therefore, SSE is not the best objective function to use for searching an optimal number of clusters.

Even though SSE decreases as the number of clusters increases, the rate of improvement diminishes significantly after a relative optimal [12] number of clusters known as the elbow point. Increasing the number of clusters beyond the elbow point provides a very small reduction in SSE. An SSE versus the number of cluster curve might not exhibit an elbow point distinctly in all cases. Hence, it could be very challenging to choose the right number of clusters.

To address these limitations, this study extended the existing CLR framework by simultaneously (i) incorporating the Bayesian Information Criteria (BIC) [13] as the objective function, (ii) finding the optimal number of clusters, and (iii) finding the maximum number of clusters as required for model estimation. The BIC penalizes more for the inclusion of additional parameters relative to the Akaike Information Criteria (AIC) [14]. The BIC selects simpler models than the AIC when the sample size is greater than eight [15]. Hence, the optimal number of clusters found using BIC provides a balance between model complexity and goodness of fit [16]. On the other hand, several studies showed that the number of parameters in a model selected using AIC was overestimated [14, 17–19]. The search process for the best model specification using BIC has the property of consistency, which asymptotically selects this model with a probability of “1” [20–23]. Galimberti et al. [24] used BIC to propose a unified framework for model-based clustering, linear regression, and multiple cluster structure detection.

In this study, the data limitations in the existing literature were addressed using actual field data collected across a variety of environmental, traffic, design, construction, and maintenance conditions. Pavement data collected for the past 12 years over the entire State of Nevada were used. This data included significant variations across a large range of characteristics, e.g., pavement segments exposed to either extreme desert heat or to very low winter temperatures in the mountains.

2. Methodology

2.1. Problem Formulation

This section includes the definition of a pavement sample, notation and terms, proposed mathematical program, and a procedure to find upper bound of the range of the feasible number of clusters.

2.1.1. Definition of a Pavement Sample

The condition of a pavement segment improves when intervention occurs by applying an M&R treatment. Such intervention alters the physical characteristics of the pavement. Hence, the performance of a pavement before and after the intervention differs, even though all other contributing factors remain constant. In this circumstance, the same pavement segment before and after intervention should be treated as two different samples. Given that the physical location of a pavement segment is the same, a different identifier is required to distinguish the set of consecutive observations before and after the intervention. In this study, the term pavement sample is used as an identifier to uniquely represent the set of consecutive observations that accounts for historical interventions made on a pavement segment. Figure 2 provides a simplified depiction of how consecutive observations of a pavement segment are divided into two pavement samples.

In this study, a pavement sample, instead of pavement segment, was considered as a single entity during cluster analysis. Hence, if a pavement segment consists of two or more pavement samples, these samples could be assigned to different clusters.

2.1.2. Notation and Terms

The following notation and terms are used in describing the proposed model: = Subscript for a pavement sample in the network = Number of pavement samples in the network, indexed = Subscript for an observation period for a pavement sample = Number of observation periods for a pavement sample , indexed = Total number of observations = = Subscript for an explanatory variable = Number of explanatory variables, indexed = Subscript for a cluster = Maximum number of clusters = Number of optimum clusters = Measurement of an explanatory variable for a sample at observation period that is assigned to a cluster = Measurement of dependent variable for a sample at observation period that is assigned to a cluster = Minimum number of observations required in a cluster = Set of pavement samples that are assigned to cluster = Cluster membership of a pavement sample to a cluster , and . = Intercept for regression model in cluster = Slope coefficient for explanatory variable ) in cluster

2.1.3. Mathematical Program

A major aspect of model development is to identify variables that explain an actual physical process. Pavement performance models (PPMs) should only include explanatory variables that affect pavement deterioration. In this study, explanatory variables from the existing literature were selected. For example, pavement age, traffic and environmental conditions, and structural and material properties of pavement were considered as factors affecting pavement performance [25–28].

PSI was chosen as the dependent variable, . PSI serves as a unified standard and is widely accepted for evaluating pavement performance, especially in terms of ride quality [29–31]. In addition, PSI reflects the human rider’s response and is understood by highway users and legislators [32]. The adopted functional form for the regression model is expressed as

This study proposes a mixed-integer, nonlinear mathematical program to determine an optimal number of clusters, assignment of segments into clusters, and regression coefficients for all prespecified explanatory variables. The problem was defined by the optimum number of clusters, ; the number of predefined explanatory variables, ; the number of pavement samples to be clustered, ; and the number of observation periods, , associated with each pavement sample. The formulation partitions pavement samples into an optimum number of clusters, with a PPM model fit to each cluster.

The objective function involved minimization of the BIC across clusters as expressed by

Given that BIC is an increasing function of SSE and free parameters (the number of clusters and regression coefficients) to be estimated, a clustering with the lowest BIC provides the optimal solution. Minimizing BIC reduces unexplained variation in the dependent variable [13]. The total SSE was calculated usingEach cluster was associated with a linear regression model with predefined explanatory variables. Deviations of predicted statistics from actual data were calculated separately for each cluster and summed to obtain the total SSE.

Decision variables to be determined were the coefficients for all prespecified explanatory variables, and ; the optimum number of clusters, ; and the cluster membership, .

The following constraints were imposed to describe the proposed problem:

Equations (4) and (5) ensure that each pavement sample was assigned to exactly one cluster. The indicator equals 1 if and only if a pavement sample belongs to cluster . Otherwise, it takes a value of zero.

Equation (6) describes a minimum-size constraint, which is imposed to ensure sufficient observations in each cluster for a statistically reliable estimation of coefficients. The total number of observations in a cluster is required to be no less than the minimum number of observations, .

Equations (7) and (8) are imposed to determine the maximum number of potential clusters. If a pavement sample has more than observations, regression over these observations could generate statistically reliable estimates of the coefficients. Hence, a cluster could be formed with only one pavement sample that has more than observations. If all pavement samples have more than observations, each pavement sample could form a cluster. In this case, the maximum number of clusters would be the total number of pavement samples, . However, in reality, it is possible that none of the pavement samples would have more than observations. In this case, samples would be grouped to form a cluster at the sample level, but not at the observation level; that is, observations of a sample must not be assigned to more than one cluster. Equation (8) determines the maximum number of potential clusters in both cases. Function F in this constraint represents the following algorithm to determine . A flowchart for this algorithm is provided in Figure 3.

2.1.4. Proposed Algorithm to Determine

Step 1. If the total number of observations, , is less than the minimum number of observations required to form a cluster, , then set = zero and go to Step 6. Otherwise, create a matrix, , of size ( x 2) with the following elements, where is maximum number of observations of a pavement segment(s) in the dataset:(a)The first column of includes all integer values from 1 to .(b)The second column includes the number of segments associated with the number of observations.(c)If no segments have a particular number of observations in the dataset, then set the second column of the matrix to zero.

Step 2. If any segment has a number of observations greater than or equal to ), then (a)calculate (b)update with 0 for Otherwise, go to Step 3 to find the maximum number of clusters that could be formed.

Step 3. If the matrix M has all zeros in its second column , then return and go to Step 6. Otherwise,(a)update M by removing all rows that have the number of segments equal to zero ()(b)initialize two indices as = = number of rows in M(c)make a copy of M and let it be represented by (d)if the remaining total number of observations is less than , then, , and go to Step 6. Otherwise, initialize with the value of and

Step 4. Repeat the following steps until .

Step 4.1. If (), then . Otherwise, go to Step 4.3.

Step 4.2. If ( = 0), then(i)(ii)set = number of rows of M(iii)set (iv)set (v)go to Step 4.6

Otherwise, go to Step 4.3.

Step 4.3. If , then(i)(ii)(iii)

Otherwise, go to Step 4.6.

Step 4.4. If ( = 0), then(i)(ii)set = number of rows of M(iii)set (iv)set

Otherwise, go to Step 4.6.

Step 4.5. If (, then(i)update both indices and with the number of rows of (ii)set (iii)go to Step 4.6

Otherwise,(i)set (ii)set (iii)go to Step 4.6

Step 4.6. Update with and with .

Step 5. Set , and go to Step 3.

Step 6. Return the current value of and stop.

2.2. Solution to the Mathematical Program

Simulated annealing (SA) coupled with an ordinary least square (OLS) algorithm was implemented to solve the above mathematical program. SA was used to cluster the dataset estimate, . For each accepted neighborhood cluster, OLS was utilized to estimate the regression coefficients, and . The fitting linear models (lm) function, available in the statistical software, R, was used to estimate these coefficients [33]. DeSarbo et al. [34] successfully implemented such an algorithm to solve the CLR problem.

The algorithm utilized to solve the clusterwise multiple linear regression is described as follows, and illustrated in Figure 4.

Step 1. Initialization:
Step 1.1. Set , and .
Step 1.2. Set values of initial temperature , final minimum temperature (), cooling rate , and the maximum number of neighbors to be generated at each temperature level. Set the iterator .

Step 2. Calculate maximum number of potential clusters, , utilizing function F as described above, as part of Constraint 8.

Step 3. Initial estimation of regression coefficients:
Step 3.1. For a given number of clusters, , randomly assign cluster memberships to all pavement samples.
Step 3.2. Count the number of observations of all pavement samples assigned to each cluster. If all clusters have at least observations, then go to Step 4; otherwise, reassign the cluster memberships until all clusters have at least observations. Let be the valid initial clusters.
Step 3.3. Estimate and for all clusters using OLS.

Step 4. Evaluate objective function, using (2).

Step 5. Generate a set of neighborhood clusters near to the previous one, using the following steps.
Step 5.1. Randomly select a prespecified number of pavement samples to change their memberships.
Step 5.2. For each of the samples selected, assign a new membership by generating a random number . If the new membership is same as the previous one, regenerate a random number until it is different. Repeat this process until the memberships of all the selected pavement samples are different from those that were previously assigned.
Step 5.3. Count the total number of observations of all pavement samples assigned to each cluster.
Step 5.4. If all clusters have at least observations, go to Step 6; otherwise, repeat Steps 5.1., 5.2., and 5.3. until all clusters have at least observations. Let be a new set of valid neighborhood clusters.

Step 6. Search for a solution:
Step 6.1. For , estimate new and for all clusters using OLS.
Step 6.2. Evaluate using (1).
Step 6.3. Calculate .
Step 6.4. Check the following two conditions:(a)If , accept the current set of clusters, , and corresponding and . Go to Step 7; otherwise, go to Step (b).(b)Generate a random number . Calculate the acceptance probability, , where is a Boltzmann’s constant. If , accept the current set of clusters, , and corresponding and . Go to Step 7; otherwise, return to Step 5.

Step 7. Counter and temperature update:
Step 7.1. Repeat Steps 5 and 6 for times.
Step 7.2. If , stop the algorithm. Otherwise, reduce the temperature by multiplying the current temperature by the prespecified cooling rate, , set , and go to Step 5.

Step 8. Stopping criteria:
Step 8.1. Update with the smallest between the one obtained in Step 7 and the current . Set equal to .
Step 8.2. Repeat Steps 3 to 7 for times.

This algorithm seeks solutions using a probabilistic approach. The algorithm starts with a high temperature, , and a high probability of accepting a worse solution, . This enables occasional “uphill” moves, which help escape from the local minima. The algorithm builds up a rough view of the search space by moving with large step lengths. As drops, decreases to behave more closely as a greedy algorithm, with small step lengths slowly focusing on the most promising solution space. Theoretical studies have shown that, with infinitely slow cooling, the algorithm converges to a global minimum [35].

3. Experiment and Results

3.1. Data Resources

Data used in this study were extracted from the PMS database of NDOT [36]. The data consisted of various classes, location data, segment data, contract data, environmental data, traffic data, and pavement condition data, collected for flexible pavement in the entire State of Nevada.

A detailed data analysis was performed to check for inconsistent and missing information in the dataset, and some were found. Some of the missing data were synthesized based on associated information available in the dataset. In preparing the PMS data, the following filters were applied:(i)Only one-mile segments were selected for consistency.(ii)Only pavement segments with the most recent maintenance contracts awarded in 2001 or later were used in the study.(iii)PSI of a pavement should deteriorate over time if no M&R treatment occurred. If PSI of a segment in any year increased by 0.1 or more points from the previous year without any M&R treatment, all observations for that year were excluded from the analysis. However, if an increase in PSI in any year was less than 0.1 from the previous year, it was assumed to be a random error during the process of pavement evaluation or data processing. Therefore, those observations were included in the analysis.(iv)If the PSI of any year decreased by one or more points from the previous year, all observations for that year were excluded from the analysis.(v)In practice, the PSI range is between 4.5 and 1.5. Therefore, if a pavement segment had a PSI beyond these limits in any year, it was considered an outlier, and all observations for that year were excluded.(vi)Only PSI values used were within the interval of the mean, minus three standard deviations to the mean plus three standard deviations.(vii)Pavement samples that did not consist of data regarding conditions for at least two consecutive years were excluded.(viii)Data analysis showed that an improvement in PSI was seen one or two years after the contract award date. Hence, the age of the pavement sample was set to 0 when the actual improvement occurred rather than when the contract was awarded.

After data preparation was completed, 4,138 flexible pavement samples with 17,642 observations were available. For CLR modelling, 14,637 observations, collected from 2001 to 2010, were used; the remaining 3,005 observations, collected in 2011 and 2012 (about 17% of the total number), were used as test dataset to check the accuracy of the CLR models. Tables 1 and 2 provide the descriptive statistics of continuous and categorical variables used in this paper, respectively.

3.2. Parameters of the Algorithm

Performance of the SA algorithm generally depends on the values of the optimization parameters utilized for a given problem. To ensure proper initialization and search for optimal solutions, selection of the most appropriate parameter values is critical [37, 38]. A body of literature exists regarding various methodologies for finding the most appropriate values for annealing parameters in SA [37, 39–43].

If an SA algorithm is allowed to run for a sufficiently long time by setting a high initial temperature with a slow cooling rate, the algorithm performs well, as shown by Anily and Federgruen [44]. In such a cooling scheme, the selection of the most appropriate parameter values may not be critical. However, computation time cannot always be ignored. Hence, the algorithm has to find a good solution in a reasonable amount of time [39].

Effective values to be assigned to the optimization parameters depend on the type and complexity of the problem. These values may not be obvious to determine, but rather might be determined by trial and error for a given problem [40]. In this study, values assigned to the optimization parameters were determined using experience gained from previous research [45–51] that involved SA and other comparable algorithms. Table 3 lists the parameter values used in this study.

The minimum number of observations required in a cluster, , is one of the parameters to be defined by the analyst for each dataset and application as it is required for any other statistical analysis regarding the minimum sample size. That is, the proposed methodology and contributions are not restricted or affected by the sample size. The analyst must have available sufficient data for each cluster to be able to obtain reliable estimates. A too small will result in statistically unreliable models and a potentially time-consuming search process as the maximum number of feasible clusters; will be very large. In contrast, a too large will result in insufficient number of clusters required to provide the optimum goodness of fit. Hence, sensitivity analysis is required to achieve balance between reliable estimates and .

3.3. Results and Discussion

Given the constraints for feasible partitions defined in the problem formulation and the minimum number of observations required in a cluster, , the proposed algorithm determined 16 as the maximum number of potential clusters. The algorithm searched for the optimum number of clusters from 2 to 16. Seven-cluster CLR models provided the optimum solution with the lowest BIC. The estimated regression coefficients for the CLR models are presented in Table 4.

Figure 5(a) shows the smallest BIC for each of the clusters ( to 16) considered in this experiment. Figure 5(b) shows the trajectory of the objective function, BIC, when the CLR models were used. The initial value of BIC was 8,502. After 1,360 iterations, the BIC decreased to 3,008. This change was equivalent to an improvement of 65%.

(a)

(b)

(c)

(d)

It was observed that not all coefficients had associated p values less than 0.05. In this study, the significance level was considered to be 5%. As expected, coefficients differed in magnitude and sign across the clusters, which indicated that the deterioration patterns of pavement samples varied among the clusters. However, seven explanatory variables had the same sign across all clusters.

Different clusters had different numbers of significant explanatory variables. For example, Cluster 2 had 10 insignificant explanatory variables. In addition, among all seven clusters, five variables, age, adt, rut_depth, category=4, and category=5, were significant. However, four variables, trucks, elevation, precip, and category=2, were not significant in four different clusters.

Trucks play a key role in pavement deterioration because they transfer heavy loads to the pavement [52]. Hence, it is expected that variable “trucks” be significant. There could be various reasons precluding the statistical significance of this variable. Two of them include (i) multicollinearity effects, which are not addressed in the existing CLR literature and motivates the expansion of the framework as recommend in this paper under future research and (ii) different lanes mistakenly used to collect pavement performance and truck traffic data. The data used in this study does not include the lane used to collect the information.

The performance of the proposed CLR approach was compared with that of the existing CLR for pavement management. That is, in this section models estimated using the proposed approach were compared with those estimated using the existing CLR [6] described in the Introduction. Experiments using the existing CLR approach were run for all feasible clusters ( to 16). Figure 5(c) shows the smallest SEE for each of these clusters. As expected, SSE decreased with an increasing number of clusters, but at a very small rate after . In this case, Figure 5(c) does not exhibit a clear elbow point. Hence, an optimum number of clusters needed to be decided by visual inspection while considering the trade-off between goodness of fit and model complexity (i.e., of the number of models and the explanatory variables). This inherent subjectivity when choosing an optimum number of clusters is a major drawback for the existing state-of-the-art CLR approach.

After careful assessment, 11-cluster CLR models were selected as the optimum solution. Figure 5(d) shows the trajectory of SSE, when 11-cluster CLR models were used, and Table 5 provides the corresponding regression coefficients. Similar to the results obtained from the proposed CLR approach, the coefficients differed in terms of magnitude and sign. In addition, some coefficients had p values larger than 0.05.

The BIC for these models are provided in Tables 4 and 5. To compare the goodness of fit, overall BIC values were calculated. The overall BIC for the seven-cluster models obtained from the proposed approach was 3,008, whereas the BIC for 11-cluster models, obtained from the existing approach, was 3,171. This difference was the result of similar or better explanatory power provided by the proposed approach with seven clusters versus 11. That is, the more clusters, the more coefficients for explanatory variables needed to be estimated for a similar goodness of fit; thus, the BIC increased.

It is observed that the individual BIC for the 11-cluster are slightly smaller than those for the seven-cluster models. Similar to SSE, the more the clusters are used, the smaller the individual model BIC is. However, this does not necessarily translate into a smaller overall BIC.

3.4. Model Assessment and Validation

3.4.1. Potential Overfitting

Brusco et al. [53] noted that clusterwise regression models have great potential for overfitting. Often, a variation in the response variable is governed by clustering. Hence, they recommend investigating the potential presence of overfitting in CLR models. This study adopted a procedure proposed by Brusco et al. [53] and the test dataset to diagnose overfitting. For the optimum seven-cluster models, the total sum of squares (TSS) was 2,419, the between-clusters sum of squares (BCSS) was 30, the within-clusters sum of squares (WCSS) was 2,389, the sum of squares due to regression (SSR) was 1,456, and the SSE was 933. The BCSS was around 1% of TSS and SSR was 62% of WCSS. These results indicated that there was no overfitting, as most of the variation in PSI was explained by within-cluster regressions. SSE accounted for 38% of TSS, which suggests that the models still have a relatively high rate of errors. A nonlinear functional form should be investigated to reduce the existing errors.

3.4.2. Model Accuracy

The accuracy of the models obtained from both approaches was assessed by calculating the overall root-mean-square error (RMSE), as follows:where, = the observed PSI = the predicted PSI = the number of predictions

Both models were applied to the test dataset. Memberships of the pavement samples were assigned by mapping the sample IDs and memberships determined by the CLR models. Associated regression models and observed data were used to estimate the PSIs. Predicted PSIs then were compared with the observed PSIs, as shown in Figure 6. Results indicate that both CLR models overestimated the PSI. A possible reason might be the existence of multicollinearity among explanatory variables.

(a)

(b)

The RMSE for 2011 and 2012 were calculated for models obtained using both approaches. The RMSE were, respectively, 0.429 and 0.439 for the 11-cluster models estimated using the existing state-of-the-art and the seven-cluster models estimated using the proposed CLR approach. The prediction accuracy of models estimated by the existing state-of-the-art was slightly higher than that of the model estimated by the proposed approach. However, this difference in accuracy is very small. The additional four clusters required 100 parameters corresponding to twenty-five explanatory variables including the intercept. Hence, seven-cluster models estimated using the proposed approach were more parsimonious and preferred over the 11-cluster models.

4. Conclusions

This study proposed and implemented a clusterwise multiple linear regression to develop pavement performance models. A mixed-integer nonlinear mathematical program was formulated to explain the problem. The CLR approach simultaneously divided pavement samples into an optimum number of clusters, and estimated a PPM for each cluster.

In the experiments, various environmental factors were considered as potential explanatory variables, including elevation, annual precipitation, average minimum and maximum temperatures, the number of wet days, and freeze and thaw cycles. The proposed approach enabled consideration of other types of variables, such as economic and social factors. Formulation of the mathematical program developed in this study supports a number of explanatory variables, multiple observations per pavement segment, and user-defined constraints on cluster characteristics.

A simulated annealing coupled with OLS was used to solve the mathematical program. For the data used in the experiments, the algorithm found that 7-cluster models provided the optimum solution. Results obtained from the proposed CLR models were compared with results obtained from the state-of-the-art approach. This comparison showed that the proposed CLR approach performed better than the state-of-the-art approach in predicting the PSI of pavement samples.

The analysis showed that overfitting was not an issue for the resulting clusters and regression models. As expected, the use of the BIC as an objective function to determine the best model specification provided a more parsimonious structure compared with that obtained using SSE. This was a consequence of the consistency property of the BIC [10, 20–23].

5. Future Work

This study did not address all limitations of the state-of-the-art CLR approach that were discussed in the Introduction of this paper. The following are potential extensions.

One limitation in this study is related to the error that likely is caused by including insignificant explanatory variables during clustering and regression analyses. The proposed formulation needs to be extended to include only significant variables. Hence, the CLR models would not be restricted only to prespecified explanatory variables. Instead, the models could include cluster-specific significant explanatory variables. In addition, the multicollinearity among explanatory variables should be investigated to exclude highly correlated variables during the model estimation process.

The proposed mathematical formulation was limited to a linear functional form for PPMs. Luo and Chau [1] implemented a CLR approach using an exponential form. However, their study used only pavement age as an explanatory variable. An interesting aspect worthy of investigation would be to explore utilizing the proposed CLR approach by allowing nonlinear relationships between pavement performance measures and multiple explanatory variables.

The proposed SA with the OLS algorithm was designed to search for a global minimum. However, a large amount of computational time was required. Hence, another avenue for future research would be to develop faster and more efficient combinatorial algorithms that could guarantee global optimality.

Data Availability

The Nevada Department of Transportation (NDOT) provided the data that was used for analysis. These data is confidential and should be requested to NDOT.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

Many thanks are due to the Nevada Department of Transportation for providing the data that was used for analysis and to our Technical Writer at UNLV’s Howard R. Hughes College of Engineering, Julie Longo, for her help reviewing this manuscript. Special thanks are due to Mr. Antoine Bretecher and Fabien Sortais for their help with the implementation of the solution algorithm.

References

Z. Luo and E. Y. J. Chou, “Pavement condition prediction using clusterwise regression,” Transportation Research Record, no. 1974, pp. 70–77, 2006.
View at: Google Scholar
Z. Luo and H. Yin, “Probabilistic analysis of pavement distress ratings with the clusterwise regression method,” Transportation Research Record, no. 2084, pp. 38–46, 2008.
View at: Google Scholar
M. Steinbach, L. Ertoz, and V. Kumar, “The challenges of clustering high dimensional data,” in New directions in statistical physics, pp. 273–309, Springer, Berlin, 2004.
View at: Google Scholar | MathSciNet
H. Pulugurta, Development of Pavement Condition Forecasting Models [PhD Dissertation], University of Toledo, 2007.
H. Späth, “Algorithm 39 Clusterwise linear regression,” Computing, vol. 22, no. 4, pp. 367–373, 1979.
View at: Publisher Site | Google Scholar | MathSciNet
W. Zhang and P. L. Durango-Cohen, “Explaining Heterogeneity in Pavement Deterioration: Clusterwise Linear Regression Model,” Journal of Infrastructure Systems, vol. 20, no. 2, p. 04014005, 2014.
View at: Publisher Site | Google Scholar
“Highway Research Board, The AASHO road test, Special Rep. No. 61A-E,” Tech. Rep., National Academy of Science, National Research Council, Washington, 1962.
View at: Google Scholar
M. Khadka and A. Paz, “Comprehensive Clusterwise Linear Regression for Pavement Management Systems,” Journal of Transportation Engineering, Part B: Pavements, vol. 143, no. 4, 2017.
View at: Publisher Site | Google Scholar
T. M. Kodinariya and P. R. Makwana, “Review on determining number of cluster in kmeans clustering,” International Journal of Advanced Research in Computer Science and Management Studies, vol. 1, no. 6, pp. 90–95, 2013.
View at: Google Scholar
J. M. Wooldridge, Introductory econometrics: A modern approach, Mason, OH: Thomson/South-Western, 2006.
J. Wu, Model-based clustering and model selection for binned data [PhD Dissertation], SUPELEC, 2014.
R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society B: Statistical Methodology, vol. 63, no. 2, pp. 411–423, 2001.
View at: Publisher Site | Google Scholar | MathSciNet
G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
View at: Publisher Site | Google Scholar | MathSciNet
J. B. Kadane and N. A. Lazar, “Methods and criteria for model selection,” Journal of the American Statistical Association, vol. 99, no. 465, pp. 279–290, 2004.
View at: Publisher Site | Google Scholar | MathSciNet
M. Forster and E. Sober, “Why likelihood?” in The nature of scientific evidence, pp. 153–190, Univ. Chicago Press, Chicago, IL, 2004.
View at: Google Scholar | MathSciNet
C. Fraley and A. E. Raftery, “How many clusters? Which clustering methods? Answers via model-based cluster analysis,” Computer Journal, vol. 41, pp. 578–588, 1998.
View at: Google Scholar
J. Geweke and R. Meese, “Estimating regression models of finite but unknown order,” International Economic Review, vol. 22, no. 1, pp. 55–70, 1981.
View at: Publisher Site | Google Scholar | MathSciNet
R. W. Katz, “On some criteria for estimating the order of a Markov chain,” Technometrics. A Journal of Statistics for the Physical, Chemical and Engineering Sciences, vol. 23, no. 3, pp. 243–249, 1981.
View at: Publisher Site | Google Scholar | MathSciNet
A. B. Koehler and E. S. Murphree, “A comparison of the Akaike and Schwarz criteria for selecting model order,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 37, no. 2, pp. 187–195, 1988.
View at: Publisher Site | Google Scholar | MathSciNet
C. R. Rao and Y. H. Wu, “A strongly consistent procedure for model selection in a regression problem,” Biometrika, vol. 76, no. 2, pp. 369–374, 1989.
View at: Publisher Site | Google Scholar | MathSciNet
Y. Yang, “Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation,” Biometrika, vol. 92, no. 4, pp. 937–950, 2005.
View at: Publisher Site | Google Scholar | MathSciNet
A. Maydeu-Olivares and C. García-Forero, “Goodness-of-fit testing,” International Encyclopedia of Education, pp. 190–196, 2010.
View at: Google Scholar
S. I. Vrieze, “Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC),” Psychological Methods, vol. 17, no. 2, pp. 228–243, 2012.
View at: Publisher Site | Google Scholar
G. Galimberti, A. Manisi, and G. Soffritti, A unified framework for model-based clustering, linear regression and multiple cluster structure detection, 2015, https://arxiv.org/abs/1510.03245v1.
A. J. Hand, P. E. Sebaaly, and J. A. Epps, “Development of performance models based on department of transportation pavement management system data,” Transportation Research Record, no. 1684, pp. 215–222, 1999.
View at: Google Scholar
F. Hong and J. A. Prozzi, “Estimation of pavement performance deterioration using Bayesian approach,” Journal of Infrastructure Systems, vol. 12, no. 2, pp. 77–86, 2006.
View at: Publisher Site | Google Scholar
S. W. Haider, K. Chatti, N. Buch, R. W. Lyles, A. S. Pulipaka, and D. Gilliland, “Effect of design and site factors on the long-term performance of flexible pavements,” Journal of Performance of Constructed Facilities, vol. 21, no. 4, pp. 283–292, 2007.
View at: Publisher Site | Google Scholar
H. Khraibani, T. Lorino, P. Lepert, and J.-M. Marion, “nonlinear mixed-effects model for the evaluation and prediction of pavement deterioration,” Journal of Transportation Engineering, vol. 138, no. 2, pp. 149–156, 2011.
View at: Publisher Site | Google Scholar
S. N. Shoukry, D. R. Martinelli, and J. A. Reigle, “Universal pavement distress evaluator based on fuzzy sets,” Transportation Research Record, no. 1592, pp. 180–186, 1997.
View at: Google Scholar
S. Terzi, “Modeling the pavement present serviceability index of flexible highway pavements using data mining,” Journal of Applied Sciences, vol. 6, no. 1, pp. 193–197, 2006.
View at: Publisher Site | Google Scholar
N. Attoh-Okine and S. Mensah, “MEMS Application in Pavement Condition Monitoring-Challenges,” in Proceedings of the 19th International Symposium on Automation and Robotics in Construction, Washington, DC, USA, September 2002.
View at: Publisher Site | Google Scholar
W. R. Hudson, R. Haas, and E. Perrone, “Measures of Pavement Performance must consider the Road User,” in Proceedings of the 9th International Conference on Managing Pavement Assets, Alexandria, VA, 2015.
View at: Google Scholar
R Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
W. S. DeSarbo, R. L. Oliver, and A. Rangaswamy, “A simulated annealing methodology for clusterwise linear regression,” Psychometrika, vol. 54, no. 4, pp. 707–736, 1989.
View at: Publisher Site | Google Scholar
P. Román-Román, D. Romero, M. A. Rubio, and F. Torres-Ruiz, “Estimating the parameters of a Gompertz-type diffusion process by means of simulated annealing,” Applied Mathematics and Computation, vol. 218, no. 9, pp. 5121–5131, 2012.
View at: Publisher Site | Google Scholar | MathSciNet
Nevada Department of Transportation, Pavement Management System Overview, Material Division, Nevada Department of Transportation, USA, 2011.
M.-W. Park and Y.-D. Kim, “A systematic procedure for setting parameters in simulated annealing algorithms,” Computers & Operations Research, vol. 25, no. 3, pp. 207–217, 1998.
View at: Publisher Site | Google Scholar
S. Babajanzade Roshan, M. Behboodi Jooibari, R. Teimouri, G. Asgharzadeh-Ahmadi, M. Falahati-Naghibi, and H. Sohrabpoor, “Optimization of friction stir welding process of AA7075 aluminum alloy to achieve desirable mechanical properties using ANFIS models and simulated annealing algorithm,” The International Journal of Advanced Manufacturing Technology, vol. 69, no. 5-8, pp. 1803–1818, 2013.
View at: Publisher Site | Google Scholar
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
View at: Publisher Site | Google Scholar | MathSciNet
N. E. Collins, R. W. Eglese, and B. L. Golden, “Simulated annealing---an annotated bibliography,” American Journal of Mathematical and Management Sciences, vol. 8, no. 3-4, pp. 209–307, 1988.
View at: Publisher Site | Google Scholar | MathSciNet
J. Rose, W. Klebsch, and J. wolf, “Temperature Measurement and Equilibrium Dynamics of Simulated Annealing Placements,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 9, no. 3, pp. 253–259, 1990.
View at: Publisher Site | Google Scholar
S. Z. Selim and K. Al-Sultan, “A simulated annealing algorithm for the clustering problem,” Pattern Recognition, vol. 24, no. 10, pp. 1003–1008, 1991.
View at: Publisher Site | Google Scholar | MathSciNet
J. Q. Guo and L. Zheng, “A modified simulated annealing algorithm for estimating solute transport parameters in streams from tracer experiment data,” Environmental Modeling and Software, vol. 20, no. 6, pp. 811–815, 2005.
View at: Publisher Site | Google Scholar
S. Anily and A. Federgruen, “Simulated annealing methods with general acceptance probabilities,” Journal of Applied Probability, vol. 24, no. 3, pp. 657–667, 1987.
View at: Publisher Site | Google Scholar | MathSciNet
C. Cobos, C. Daza, C. Martínez et al., “Calibration of microscopic traffic flow simulation models using a memetic algorithm with solis and wets local search chaining (MA-SW-Chains),” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface, vol. 10022, pp. 365–375, 2016.
View at: Google Scholar
C. Cobos, C. Erazo, J. Luna et al., “Multi-objective memetic algorithm based on NSGA-II and simulated annealing for calibrating CORSIM micro-simulation models of vehicular traffic flow,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface, vol. 9868, pp. 468–476, 2016.
View at: Google Scholar
M. Khadka and A. Paz, “Limitations of Existing Pavement Performance Models and a Potential Solution,” in Proceedings of the World Conference on Pavement and Asset Management, Baveno, Italy, 2017.
View at: Google Scholar
M. Khadka and A. Paz, “Estimation of optimal pavement performance models for highways,” in Proceedings of the The 10th International Conference on the Bearing Capacity of Roads, Railways and Airfields (BCRRA 2017), pp. 1489–1494, Athens, Greece.
View at: Publisher Site | Google Scholar
A. Paz, V. Molano, and J. Sanchez-Medina, “Holistic Calibration of Microscopic Traffic Flow Models: Methodology and Real World Application Studies,” in Engineering and Applied Sciences Optimization, vol. 38 of Computational Methods in Applied Sciences, pp. 33–52, Springer International Publishing, Cham, 2015.
View at: Publisher Site | Google Scholar
A. Paz, V. Molano, E. Martinez, C. Gaviria, and C. Arteaga, “Calibration of traffic flow models using a memetic algorithm,” Transportation Research Part C: Emerging Technologies, vol. 55, pp. 432–443, 2015.
View at: Publisher Site | Google Scholar
M. Khadka, Generalized CLUsterwise Regression for Simultaneous Estimation of Optimal Pavement CLUsters and PERformance Models, ProQuest LLC, Ann Arbor, MI, 2017.
View at: MathSciNet
H. K. Salama, K. Chatti, and R. W. Lyles, “Effect of heavy multiple axle trucks on flexible pavement damage using in-service pavement performance data,” Journal of Transportation Engineering, vol. 132, no. 10, pp. 763–770, 2006.
View at: Publisher Site | Google Scholar
M. J. Brusco, J. D. Cradit, D. Steinley, and G. L. Fox, “Cautionary remarks on the use of clusterwise regression,” Multivariate Behavioral Research, vol. 43, no. 1, pp. 29–49, 2008.
View at: Publisher Site | Google Scholar

Copyright

Copyright © 2018 Mukesh Khadka et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

678

Downloads

718

Citations