Abstract

A nonparametric simulation model (-nearest neighbor resampling, KNNR) for water quality analysis involving geographic information is suggested to overcome the drawbacks of parametric models. Geographic information is, however, not appropriately handled in the KNNR nonparametric model. In the current study, we introduce a novel statistical notion, called a “depth function,” in the classical KNNR model to appropriately manipulate geographic information in simulating stormwater quality. An application is presented for a case study of the total suspended solids throughout the entire United States. The stormwater total suspended solids concentration data indicated that the proposed model significantly improves the simulation performance compared with the existing KNNR model.

1. Introduction

Human activities in urban areas create a large number of pollutants. These pollutants are carried by stormwater into inland water bodies such as streams, rivers, and lakes, endangering the local ecosystems. During the last few decades, governments and communities have developed strategies to reduce urban stormwater pollution. To meet these objectives, a number of approaches for water quality analysis and modeling such as the environmental probability plot, the box-whisker plot, and the - model method have been developed; see [15]. For instance, in 1983, the United States Environmental Protection Agency (US EPA) established the national pollutant discharge elimination system (NPDES), imposing water quality requirements for urban storm sewer systems to secure the environment around water bodies. However, stormwater quality data are inherently difficult to collect and analyze due to their uncertain nature in both the time and space domains; see [6, 7]. Furthermore, the modeling of stormwater quality generally involves the difficult task of organizing and processing large amounts of spatially referenced data; see [2, 8]. It is thus essential for modelers and decision-makers to take into consideration the uncertainty in the data.

Monte Carlo simulation (MCS) has frequently been employed in the literature to determine the uncertainty in stormwater pollutant concentrations; see [5, 911]. The general MCS procedure is to fit a probability density function (pdf), for example, the log-normal distribution, to the observed data and then generate “samples” from the fitted pdf model. This traditional approach has a number of drawbacks, such as the limited number of feasible pdfs for stormwater quality data, the large effects of outliers, the limited choice of distributions (other than the normal distribution) for more than one variable, and the bias induced by the normality assumption, especially for the multivariate case; see [4]. Furthermore, the limited stormwater quality records hinder the application of the traditional MCS approach, especially in stormwater management.

Towler et al. [4] adapted the -nearest neighbor resampling (KNNR) method (see [1214]) to simulate influent concentration scenarios using the information collection rule (ICR) database of the US EPA. The KNNR method applied by Towler et al. [4] for wastewater quality simulation takes into account extensive spatial data to overcome the common limits of temporal concentration datasets. In this method, geographical information (GI) is included as variables along with the concentration variable. The results of Towler et al. [4] indicated that the KNNR model is a good alternative to the traditional parametric MCS approach for regulatory, treatment, and risk assessments regarding concentrations.

However, the way GI is handled as a general variable in Towler et al. [4] might lead to the underperformance of the KNNR model. The primary reason for this is that, as the modeling dimension increases from the insertion of the GI, the model becomes more intricate and subtle. Consequently, the model loses its focus on the water quality concentration variable. Second, the variability of the GI is quite different from the variability of the concentration variables because of the differences in their characteristics. The GI changes only spatially, whereas the pollutant concentration variable varies both temporally and spatially. Third, adding other GI variables such as the altitude and the area of the watershed is not advised in KNNR due to dimensionality problems.

To alleviate this problem, we propose a novel resampling approach based on depth functions (see [15, 16]), which adapts a different GI from the target stormwater quality variable in the KNNR simulation model. The proposed algorithm involves the combination of KNNR and a depth function and is denoted as “depth-neighbor resampling” (DNR). It is tested with a stormwater quality dataset in this study.

The study is organized as follows. The background of the KNNR model of Towler et al. [4] and depth functions are presented. The overall model procedure is described in the following methodology section. In the application section, the DNR model and the existing KNNR model are applied to the stormwater quality model, and their results are compared. Finally, the conclusions and final remarks are presented.

2. Methodology

The KNNR is a simple analogous model for fitting the conditional distribution and then generating simulations from it in a data-driven manner. The KNNR model for water quality simulation was proposed by Towler et al. [4]. The generation of an ensemble of the variable of interest (e.g., pollutant concentration for a given month), conditioned on a feature vector of explanatory variables, is based on the evaluation of a conditional distribution, that is, . Then, the distance between the current feature vector and the feature vectors constructed from observed data is measured, and one of the -nearest neighbors is selected. Finally, the corresponding pollutant concentration value of the selected neighbor is assigned as the simulation value. Towler et al. [4] employed three explanatory variables (i.e., ): pollutant concentration, latitude, and longitude. Note that the GI is included as explanatory variables through the latitude and longitude.

However, some drawbacks are expected, as discussed in the introduction section, when handling the GI in the form of explanatory variables. A special solution to handle the GI is proposed in the present study by employing the statistical notion of a “depth function.” Starting from the half-space depth proposed by Tukey [15], a number of depth functions have been formulated in the literature; see [1722]. Although depth functions have been widely used in statistics and econometrics (see [20, 23, 24]), they have just recently begun to be used in the environmental and hydrological fields; see [16, 25].

A depth function is a statistical notion for providing an outward ordering of points in a multivariate framework. The key properties of the depth function are (a) affine invariance, (b) maximality at the center, (c) monotonicity relative to the deepest point, and (d) vanishing at infinity; see [19].

For a given cumulative distribution function on , the corresponding depth function is any bounded and nonnegative function, denoted as , which provides an -based center-outward ordering of a point z in to satisfy the properties mentioned above. A number of depth functions can be described, among which the following two are commonly used.(a)The half-space depth (see [15]) is defined with respect to a probability, Pr, associated with a distribution on as (b)The Mahalanobis depth (MHD) (see [26]) is defined on the basis of the Mahalanobis distance between two points, and , with respect to a positive definite matrix . The Mahalanobis depth is given by where and are, respectively, any location and covariance measures corresponding to .

Employing the above depth functions, the basic idea of the proposed model in the current study is to convert the GI (location and longitude) into a real-valued weight vector. This process involves reducing the dimension of the GI variables. In doing so, the information is appropriately handled through the distance measurement in the KNNR model without any increase in the dimension. In other words, the elements of the GI (latitude and longitude) are mapped using a depth function. The MHD (2) is employed in the current study because it is easy to evaluate and flexible to adapt to the current situation and converted to the weight vector for the distance measurement of the KNNR model.

The overall procedure of the depth-based KNNR model denoted as “depth-neighbor resampling” (DNR) as a surrogate of for the stormwater quality simulation is as follows.(1)The user-specified feature vector is defined as , which is known in advance for the current time . Note that the element of the feature vector is now only with the stormwater quality variable (i.e., ) unlike the one in Towler et al. [4] as (i.e., ), where LAT and LON represent the latitude and longitude at the location where is measured. In the current proposed model, the GI (latitude and longitude) is taken into account through the depth function shown below in step .(2)The observed feature vectors of all pollutant concentration available sources are constructed as where represents the pollutant concentration, which is the variable of interest, and denotes all the available records of the database employed. The ICR database of the USEPA for was employed by Towler et al. [4] (see the introduction section). The employed database () for the current study is described in the next section.(3)The depth function is evaluated by in (2), where , at location, and is the covariance matrix of . Note that is the GI where the measurement was taken and are the GI for the measurements in .(4)The weights for are computed where is a known weight function. Some examples of weight functions are presented below.(5)The distances between the user-specified feature vector and the observed feature vectors are computed as (6)The distances are arranged in ascending order, and the first values are selected. Next, one of these values is randomly assigned with the selection probability given by , where . This probability gives a higher chance of selection to the nearest neighbor and a lower chance to the farthest neighbors. Suppose the corresponding index of the selected value is assigned as . The number of nearest neighbors is estimated using the heuristic approach (i.e., ) with its theoretical justification; see [12, 27].(7)The simulation of the stormwater quality variable can be performed by selecting the stormwater quality variable of the corresponding index as []; that is, .(8)The steps above (all steps 1–7 except step 2) are repeated to generate the desired number of data points ().

To obtain in step 4 and (4), a weight function that is positive and monotonically increasing is employed. In the current study, two common weight functions are tested: the one proposed by Lin and Chen [22] and the simple linear weight function (see [16]). The weight function of Lin and Chen [22] is given by with coefficients , and , where defines the support of the weight function and represents the slope of decay to zero. If , then it is equivalent to the weight function of Zuo et al. [28]. The simple linear weight function mentioned in Chebana and Ouarda [16] is expressed as where and are the coefficients with .

A trial-and-error approach was applied in the current study for the parameter estimation of the weight functions. The above two weight functions and are illustrated in Figure 1 using strategically selected coefficients to show their roles. Figure 1 reveals that the weight function decays to zero quickly with high values of (solid line with triangle). Extremely high values of (solid line) lead this function to jump from zero to one. An extremely high value of was used in Chebana and Ouarda [16] and other studies. Furthermore, it is evident that and are the lower and upper limits in (see the dotted line with circles in Figure 1) beyond which the weights 0 and 1, respectively, are assigned.

3. Application

The stormwater quality datasets were obtained from the international stormwater best management practice (BMP) database (http://www.bmpdatabase.org/), which has been assembled since 1996 by the American society of civil engineers and the USEPA, as shown in Figure 2. The database was established to foster a better understanding of factors influencing urban stormwater quality; see [3, 5]. The stormwater quality data as the event mean concentration of total suspended solids (TSS) was retrieved with (see (4)), as shown in Table 1, because it contains a relatively large number of records (approximately 800). The latitude and longitude were also retrieved as the GI.

As mentioned in the introduction, the KNNR is a simple model based on the conditional distribution . In Towler et al. [4], the annual average of wastewater influent concentration, longitude, and latitude were used for the conditional variables to simulate the corresponding monthly wastewater influent concentration variables. The stormwater quality data used in the present study, however, is rarely available for a continuous full year. Therefore, we use the immediate preceding monthly TSS value as the conditional variable instead of the annual average as in Towler et al. [4]. In other words, the conditional variable to simulate the stormwater quality for a certain month , denoted as , is the stormwater quality of the preceding month (i.e., ). Subsequently, in (3) consists of all the stormwater TSS data in month . In the current study, the TSS value of the current month is simulated from the proposed DNR model by treating the GI with the depth function.

Among other coefficient sets, with [, , and ] and with [ and ] performed comparably well in the application of the current study. Therefore, the results using these coefficient sets for each weight function are presented. The models and parameter sets used in this application are:(1)the KNNR model with the depth function ( in (2)) and the weight function in (5) with [, , and ]: DNRLC;(2)the KNNR model with the depth function () and the weight function in (6) with [ and ]: DNRCO.

Each TSS value is simulated 500 times with the three models above. Their performances are evaluated through the mean absolute log error (MALE) and the root mean square log error (RMSLE), given, respectively, as: where is the number of simulated times (here, is used) and is the data generated as a surrogate of the observed data at the time. Note that (i) a low value of MALE or RMSLE indicates a better reproduction of the characteristics of the observed data; (ii) RMSLE is more sensitive to outliers than MALE due to the squared formulation; and (iii) the statistics in (7) and (8) are the log-scale version of the mean absolute error (MAE) and the root mean square error (RMSE), respectively, because stormwater quality data are commonly analyzed with log-scaling; see [1, 4].

In Figure 3, parts of the generated sequences (20 stations among , as shown in Table 1 and Figure 2) are illustrated using boxplots along with the corresponding observed TSS data. The typical KNNR model often generates values that are much lower or higher than those corresponding to the observed data. The generated TSS quantities from the proposed KNNR models with depth function (DNRCO and DNRLC) show better agreement with the observed TSS values than the ones from the typical KNNR model. The remaining generated values (data not shown) have a behavior similar to that described above.

To check the agreement between the observed and generated data, using all the TSS data from 118 stations, as shown in Figure 4, the MALE and RMSLE in (7) and (8) are employed. These statistics are presented in Figures 5-6 and Table 2. The MALE statistics of the TSS-generated data at each month are illustrated in Figure 5. The DNRCO shows consistently better performance than KNNR during the summer and fall months. The DNRLC presents the best performance compared with the other models, except in July when DNRCO is slightly better. The overall superiority of DNRLC with the average MALE metric over all the simulated values is briefly presented in the second row of Table 1 (the smallest MALE value among the three different models is illustrated).

The overall simulation performance of the three models is illustrated through RMSLE, as shown in Figure 6. It shows that the RMSLEs of KNNR and DNRCO are similar, whereas the RMSLE of DNRLC is significantly lower than the former two. The average over all the simulated values of the RMSLE (summarized in the third row of Table 1) supports the results of Figure 6, indicating that the DNRLC model has the lowest RMSLE, whereas KNNR and DNRCO lead to similar performances. This implies that the DNRLC is superior to the other models.

An extensive sensitivity analysis of the coefficients for the weight function in (5) was performed with different parameter sets of and while fixing , as favored in Chebana and Ouarda [16]. This analysis is performed to further investigate the role of the coefficients of the weight function . The analysis of the coefficients of ( and ) is omitted because their role is obvious. The average RMSLE of each parameter set ( and ) from 500 simulations is shown in Figure 7. Note that is attributed to the shape of the weight function and is the location parameter beyond which the weight is assigned as the value 1.0 (see Figure 1). Figure 7 illustrates that the RMSLE increases as increases for all values of the parameter and the RMSLE decreases as increases for high values of , whereas the decrease of RMSLE is minimal for low values of . The results indicate that has a high impact on the performance of the model, whereas has little impact. Therefore, it is concluded that a low value of the parameter and a high value of the parameter are preferred for the currently applied TSS data.

4. Conclusion

The uncertainty in stormwater quality data has hindered the accurate analysis and physical modeling of water quality. Fitting a parametric pdf model to stormwater quality data is sometimes unfavorable primarily due to the lack of data. As an alternative, the nonparametric KNNR model was recently applied for simulating the ensemble of pollutant concentration data. In the current study, we adapted a statistical approach involving “depth functions” to improve the simulation performance of the KNNR for stormwater quality data. Different weight functions are incorporated with the depth function. The results illustrate that the integration of the depth function in the KNNR model, with an appropriate choice of the weight function and its coefficients, leads to an improved performance in the simulation of stormwater quality data. The weight function is superior to in the simulation performance.

Automatic models for the identification of the optimal weight function and the estimation of its coefficients still remain to be developed. One feasible option to find the optimal weight function and its coefficients is to build an objective function to optimize (such as minimizing MALE and RMSLE) and then using an optimization technique such as a genetic algorithm (see [29]) or an adaptive metropolis algorithm (see [30]) to solve the optimization problem.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean government (MEST) (no. 2013-0362).