Research Article  Open Access
Taesam Lee, Taha B. M. J. Ouarda, Fateh Chebana, Daeryong Park, "Evaluation of a DepthBased Multivariate Nearest Neighbor Resampling Method with Stormwater Quality Data", Mathematical Problems in Engineering, vol. 2014, Article ID 404198, 7 pages, 2014. https://doi.org/10.1155/2014/404198
Evaluation of a DepthBased Multivariate Nearest Neighbor Resampling Method with Stormwater Quality Data
Abstract
A nonparametric simulation model (nearest neighbor resampling, KNNR) for water quality analysis involving geographic information is suggested to overcome the drawbacks of parametric models. Geographic information is, however, not appropriately handled in the KNNR nonparametric model. In the current study, we introduce a novel statistical notion, called a “depth function,” in the classical KNNR model to appropriately manipulate geographic information in simulating stormwater quality. An application is presented for a case study of the total suspended solids throughout the entire United States. The stormwater total suspended solids concentration data indicated that the proposed model significantly improves the simulation performance compared with the existing KNNR model.
1. Introduction
Human activities in urban areas create a large number of pollutants. These pollutants are carried by stormwater into inland water bodies such as streams, rivers, and lakes, endangering the local ecosystems. During the last few decades, governments and communities have developed strategies to reduce urban stormwater pollution. To meet these objectives, a number of approaches for water quality analysis and modeling such as the environmental probability plot, the boxwhisker plot, and the  model method have been developed; see [1–5]. For instance, in 1983, the United States Environmental Protection Agency (US EPA) established the national pollutant discharge elimination system (NPDES), imposing water quality requirements for urban storm sewer systems to secure the environment around water bodies. However, stormwater quality data are inherently difficult to collect and analyze due to their uncertain nature in both the time and space domains; see [6, 7]. Furthermore, the modeling of stormwater quality generally involves the difficult task of organizing and processing large amounts of spatially referenced data; see [2, 8]. It is thus essential for modelers and decisionmakers to take into consideration the uncertainty in the data.
Monte Carlo simulation (MCS) has frequently been employed in the literature to determine the uncertainty in stormwater pollutant concentrations; see [5, 9–11]. The general MCS procedure is to fit a probability density function (pdf), for example, the lognormal distribution, to the observed data and then generate “samples” from the fitted pdf model. This traditional approach has a number of drawbacks, such as the limited number of feasible pdfs for stormwater quality data, the large effects of outliers, the limited choice of distributions (other than the normal distribution) for more than one variable, and the bias induced by the normality assumption, especially for the multivariate case; see [4]. Furthermore, the limited stormwater quality records hinder the application of the traditional MCS approach, especially in stormwater management.
Towler et al. [4] adapted the nearest neighbor resampling (KNNR) method (see [12–14]) to simulate influent concentration scenarios using the information collection rule (ICR) database of the US EPA. The KNNR method applied by Towler et al. [4] for wastewater quality simulation takes into account extensive spatial data to overcome the common limits of temporal concentration datasets. In this method, geographical information (GI) is included as variables along with the concentration variable. The results of Towler et al. [4] indicated that the KNNR model is a good alternative to the traditional parametric MCS approach for regulatory, treatment, and risk assessments regarding concentrations.
However, the way GI is handled as a general variable in Towler et al. [4] might lead to the underperformance of the KNNR model. The primary reason for this is that, as the modeling dimension increases from the insertion of the GI, the model becomes more intricate and subtle. Consequently, the model loses its focus on the water quality concentration variable. Second, the variability of the GI is quite different from the variability of the concentration variables because of the differences in their characteristics. The GI changes only spatially, whereas the pollutant concentration variable varies both temporally and spatially. Third, adding other GI variables such as the altitude and the area of the watershed is not advised in KNNR due to dimensionality problems.
To alleviate this problem, we propose a novel resampling approach based on depth functions (see [15, 16]), which adapts a different GI from the target stormwater quality variable in the KNNR simulation model. The proposed algorithm involves the combination of KNNR and a depth function and is denoted as “depthneighbor resampling” (DNR). It is tested with a stormwater quality dataset in this study.
The study is organized as follows. The background of the KNNR model of Towler et al. [4] and depth functions are presented. The overall model procedure is described in the following methodology section. In the application section, the DNR model and the existing KNNR model are applied to the stormwater quality model, and their results are compared. Finally, the conclusions and final remarks are presented.
2. Methodology
The KNNR is a simple analogous model for fitting the conditional distribution and then generating simulations from it in a datadriven manner. The KNNR model for water quality simulation was proposed by Towler et al. [4]. The generation of an ensemble of the variable of interest (e.g., pollutant concentration for a given month), conditioned on a feature vector of explanatory variables, is based on the evaluation of a conditional distribution, that is, . Then, the distance between the current feature vector and the feature vectors constructed from observed data is measured, and one of the nearest neighbors is selected. Finally, the corresponding pollutant concentration value of the selected neighbor is assigned as the simulation value. Towler et al. [4] employed three explanatory variables (i.e., ): pollutant concentration, latitude, and longitude. Note that the GI is included as explanatory variables through the latitude and longitude.
However, some drawbacks are expected, as discussed in the introduction section, when handling the GI in the form of explanatory variables. A special solution to handle the GI is proposed in the present study by employing the statistical notion of a “depth function.” Starting from the halfspace depth proposed by Tukey [15], a number of depth functions have been formulated in the literature; see [17–22]. Although depth functions have been widely used in statistics and econometrics (see [20, 23, 24]), they have just recently begun to be used in the environmental and hydrological fields; see [16, 25].
A depth function is a statistical notion for providing an outward ordering of points in a multivariate framework. The key properties of the depth function are (a) affine invariance, (b) maximality at the center, (c) monotonicity relative to the deepest point, and (d) vanishing at infinity; see [19].
For a given cumulative distribution function on , the corresponding depth function is any bounded and nonnegative function, denoted as , which provides an based centeroutward ordering of a point z in to satisfy the properties mentioned above. A number of depth functions can be described, among which the following two are commonly used.(a)The halfspace depth (see [15]) is defined with respect to a probability, Pr, associated with a distribution on as (b)The Mahalanobis depth (MHD) (see [26]) is defined on the basis of the Mahalanobis distance between two points, and , with respect to a positive definite matrix . The Mahalanobis depth is given by where and are, respectively, any location and covariance measures corresponding to .
Employing the above depth functions, the basic idea of the proposed model in the current study is to convert the GI (location and longitude) into a realvalued weight vector. This process involves reducing the dimension of the GI variables. In doing so, the information is appropriately handled through the distance measurement in the KNNR model without any increase in the dimension. In other words, the elements of the GI (latitude and longitude) are mapped using a depth function. The MHD (2) is employed in the current study because it is easy to evaluate and flexible to adapt to the current situation and converted to the weight vector for the distance measurement of the KNNR model.
The overall procedure of the depthbased KNNR model denoted as “depthneighbor resampling” (DNR) as a surrogate of for the stormwater quality simulation is as follows.(1)The userspecified feature vector is defined as , which is known in advance for the current time . Note that the element of the feature vector is now only with the stormwater quality variable (i.e., ) unlike the one in Towler et al. [4] as (i.e., ), where LAT and LON represent the latitude and longitude at the location where is measured. In the current proposed model, the GI (latitude and longitude) is taken into account through the depth function shown below in step .(2)The observed feature vectors of all pollutant concentration available sources are constructed as where represents the pollutant concentration, which is the variable of interest, and denotes all the available records of the database employed. The ICR database of the USEPA for was employed by Towler et al. [4] (see the introduction section). The employed database () for the current study is described in the next section.(3)The depth function is evaluated by in (2), where , at location, and is the covariance matrix of . Note that is the GI where the measurement was taken and are the GI for the measurements in .(4)The weights for are computed where is a known weight function. Some examples of weight functions are presented below.(5)The distances between the userspecified feature vector and the observed feature vectors are computed as (6)The distances are arranged in ascending order, and the first values are selected. Next, one of these values is randomly assigned with the selection probability given by , where . This probability gives a higher chance of selection to the nearest neighbor and a lower chance to the farthest neighbors. Suppose the corresponding index of the selected value is assigned as . The number of nearest neighbors is estimated using the heuristic approach (i.e., ) with its theoretical justification; see [12, 27].(7)The simulation of the stormwater quality variable can be performed by selecting the stormwater quality variable of the corresponding index as []; that is, .(8)The steps above (all steps 1–7 except step 2) are repeated to generate the desired number of data points ().
To obtain in step 4 and (4), a weight function that is positive and monotonically increasing is employed. In the current study, two common weight functions are tested: the one proposed by Lin and Chen [22] and the simple linear weight function (see [16]). The weight function of Lin and Chen [22] is given by with coefficients , and , where defines the support of the weight function and represents the slope of decay to zero. If , then it is equivalent to the weight function of Zuo et al. [28]. The simple linear weight function mentioned in Chebana and Ouarda [16] is expressed as where and are the coefficients with .
A trialanderror approach was applied in the current study for the parameter estimation of the weight functions. The above two weight functions and are illustrated in Figure 1 using strategically selected coefficients to show their roles. Figure 1 reveals that the weight function decays to zero quickly with high values of (solid line with triangle). Extremely high values of (solid line) lead this function to jump from zero to one. An extremely high value of was used in Chebana and Ouarda [16] and other studies. Furthermore, it is evident that and are the lower and upper limits in (see the dotted line with circles in Figure 1) beyond which the weights 0 and 1, respectively, are assigned.
3. Application
The stormwater quality datasets were obtained from the international stormwater best management practice (BMP) database (http://www.bmpdatabase.org/), which has been assembled since 1996 by the American society of civil engineers and the USEPA, as shown in Figure 2. The database was established to foster a better understanding of factors influencing urban stormwater quality; see [3, 5]. The stormwater quality data as the event mean concentration of total suspended solids (TSS) was retrieved with (see (4)), as shown in Table 1, because it contains a relatively large number of records (approximately 800). The latitude and longitude were also retrieved as the GI.

As mentioned in the introduction, the KNNR is a simple model based on the conditional distribution . In Towler et al. [4], the annual average of wastewater influent concentration, longitude, and latitude were used for the conditional variables to simulate the corresponding monthly wastewater influent concentration variables. The stormwater quality data used in the present study, however, is rarely available for a continuous full year. Therefore, we use the immediate preceding monthly TSS value as the conditional variable instead of the annual average as in Towler et al. [4]. In other words, the conditional variable to simulate the stormwater quality for a certain month , denoted as , is the stormwater quality of the preceding month (i.e., ). Subsequently, in (3) consists of all the stormwater TSS data in month . In the current study, the TSS value of the current month is simulated from the proposed DNR model by treating the GI with the depth function.
Among other coefficient sets, with [, , and ] and with [ and ] performed comparably well in the application of the current study. Therefore, the results using these coefficient sets for each weight function are presented. The models and parameter sets used in this application are:(1)the KNNR model with the depth function ( in (2)) and the weight function in (5) with [, , and ]: DNR_{LC};(2)the KNNR model with the depth function () and the weight function in (6) with [ and ]: DNR_{CO}.
Each TSS value is simulated 500 times with the three models above. Their performances are evaluated through the mean absolute log error (MALE) and the root mean square log error (RMSLE), given, respectively, as: where is the number of simulated times (here, is used) and is the data generated as a surrogate of the observed data at the time. Note that (i) a low value of MALE or RMSLE indicates a better reproduction of the characteristics of the observed data; (ii) RMSLE is more sensitive to outliers than MALE due to the squared formulation; and (iii) the statistics in (7) and (8) are the logscale version of the mean absolute error (MAE) and the root mean square error (RMSE), respectively, because stormwater quality data are commonly analyzed with logscaling; see [1, 4].
In Figure 3, parts of the generated sequences (20 stations among , as shown in Table 1 and Figure 2) are illustrated using boxplots along with the corresponding observed TSS data. The typical KNNR model often generates values that are much lower or higher than those corresponding to the observed data. The generated TSS quantities from the proposed KNNR models with depth function (DNR_{CO} and DNR_{LC}) show better agreement with the observed TSS values than the ones from the typical KNNR model. The remaining generated values (data not shown) have a behavior similar to that described above.
(a)
(b)
(c)
To check the agreement between the observed and generated data, using all the TSS data from 118 stations, as shown in Figure 4, the MALE and RMSLE in (7) and (8) are employed. These statistics are presented in Figures 56 and Table 2. The MALE statistics of the TSSgenerated data at each month are illustrated in Figure 5. The DNR_{CO} shows consistently better performance than KNNR during the summer and fall months. The DNR_{LC} presents the best performance compared with the other models, except in July when DNR_{CO} is slightly better. The overall superiority of DNR_{LC} with the average MALE metric over all the simulated values is briefly presented in the second row of Table 1 (the smallest MALE value among the three different models is illustrated).

The overall simulation performance of the three models is illustrated through RMSLE, as shown in Figure 6. It shows that the RMSLEs of KNNR and DNR_{CO} are similar, whereas the RMSLE of DNR_{LC} is significantly lower than the former two. The average over all the simulated values of the RMSLE (summarized in the third row of Table 1) supports the results of Figure 6, indicating that the DNR_{LC} model has the lowest RMSLE, whereas KNNR and DNR_{CO} lead to similar performances. This implies that the DNR_{LC} is superior to the other models.
An extensive sensitivity analysis of the coefficients for the weight function in (5) was performed with different parameter sets of and while fixing , as favored in Chebana and Ouarda [16]. This analysis is performed to further investigate the role of the coefficients of the weight function . The analysis of the coefficients of ( and ) is omitted because their role is obvious. The average RMSLE of each parameter set ( and ) from 500 simulations is shown in Figure 7. Note that is attributed to the shape of the weight function and is the location parameter beyond which the weight is assigned as the value 1.0 (see Figure 1). Figure 7 illustrates that the RMSLE increases as increases for all values of the parameter and the RMSLE decreases as increases for high values of , whereas the decrease of RMSLE is minimal for low values of . The results indicate that has a high impact on the performance of the model, whereas has little impact. Therefore, it is concluded that a low value of the parameter and a high value of the parameter are preferred for the currently applied TSS data.
4. Conclusion
The uncertainty in stormwater quality data has hindered the accurate analysis and physical modeling of water quality. Fitting a parametric pdf model to stormwater quality data is sometimes unfavorable primarily due to the lack of data. As an alternative, the nonparametric KNNR model was recently applied for simulating the ensemble of pollutant concentration data. In the current study, we adapted a statistical approach involving “depth functions” to improve the simulation performance of the KNNR for stormwater quality data. Different weight functions are incorporated with the depth function. The results illustrate that the integration of the depth function in the KNNR model, with an appropriate choice of the weight function and its coefficients, leads to an improved performance in the simulation of stormwater quality data. The weight function is superior to in the simulation performance.
Automatic models for the identification of the optimal weight function and the estimation of its coefficients still remain to be developed. One feasible option to find the optimal weight function and its coefficients is to build an objective function to optimize (such as minimizing MALE and RMSLE) and then using an optimization technique such as a genetic algorithm (see [29]) or an adaptive metropolis algorithm (see [30]) to solve the optimization problem.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean government (MEST) (no. 20130362).
References
 US EPA, Results of the Nationwide Urban Runoff Program, Water Planning Division, Washington, DC, USA, 1983.
 K. M. Wong, E. W. Strecker, and M. K. Stenstrom, “Gis to estimate stormwater pollutant mass loadings,” Journal of Environmental Engineering, vol. 123, no. 8, pp. 737–745, 1997. View at: Google Scholar
 E. W. Strecker, M. M. Quigley, B. R. Urbonas, J. E. Jones, and J. K. Clary, “Determining urban storm water BMP effectiveness,” Journal of Water Resources Planning and Management, vol. 127, no. 3, pp. 144–149, 2001. View at: Publisher Site  Google Scholar
 E. Towler, B. Rajagopalan, C. Seidel, and R. S. Summers, “Simulating ensembles of source water quality using a Knearest neighbor resampling approach,” Environmental Science & Technology, vol. 43, no. 5, pp. 1407–1411, 2009. View at: Publisher Site  Google Scholar
 D. Park, J. C. Loftis, and L. A. Roesner, “Performance modeling of storm water best management practices with uncertainty analysis,” Journal of Hydrologic Engineering, vol. 16, no. 4, pp. 332–344, 2011. View at: Publisher Site  Google Scholar
 R. A. Corbit, Standard Handbook of Environmental Engineering, McGrawHill, New York, NY, 1989.
 J. W. Delleur, “The evolution of urban hydrology: past, present, and future,” Journal of Hydraulic EngineeringAsce, vol. 129, no. 8, pp. 563–573, 2003. View at: Google Scholar
 J. Vaze and F. H. S. Chiew, “Comparative evaluation of urban storm water quality models,” Water Resources Research, vol. 39, no. 10, pp. SWC51–SWC510, 2003. View at: Google Scholar
 J. J. Warwick and J. S. Wilson, “Estimating uncertainty of stormwater runoff computations,” Journal of Water Resources Planning and Management, vol. 116, no. 2, pp. 187–204, 1990. View at: Google Scholar
 R. J. Charbeneau and M. E. Barrett, “Evaluation of methods for estimating stormwater pollutant loads,” Water Environment Research, vol. 70, no. 7, pp. 1295–1302, 1998. View at: Publisher Site  Google Scholar
 G. Freni, G. Mannina, and G. Viviani, “Uncertainty in urban stormwater quality modelling: the effect of acceptability threshold in the GLUE methodology,” Water Research, vol. 42, no. 89, pp. 2061–2072, 2008. View at: Publisher Site  Google Scholar
 U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resources Research, vol. 32, no. 3, pp. 679–693, 1996. View at: Publisher Site  Google Scholar
 T. S. Lee, , Stochastic simulation of hydrologic data based on nonparametric approaches [Ph.D. dissertation], Colorado State University, Fort Collins, Colo, USA, 2008.
 T. Lee, J. D. Salas, and J. Prairie, “An enhanced nonparametric streamflow disaggregation model with genetic algorithm,” Water Resources Research, vol. 46, no. 8, Article ID W08545, 2010. View at: Publisher Site  Google Scholar
 J. W. Tukey, “Mathematics and the picturing of data,” in Proceedings of the International Congress of Mathematicians, vol. 2, pp. 523–531, Vancouver, Canada, 1975. View at: Google Scholar
 F. Chebana and T. B. M. J. Ouarda, “Depth and homogeneity in regional flood frequency analysis,” Water Resources Research, vol. 44, no. 11, Article ID W11422, 2008. View at: Publisher Site  Google Scholar
 R. Y. Liu, “On a notion of data depth based on random simplices,” Annals of Statistics, vol. 18, pp. 405–414, 1990. View at: Google Scholar
 R. Y. Liu and K. Singh, “A quality index based on data depth and multivariate rank tests,” Journal of the American Statistical Association, vol. 88, no. 421, pp. 252–260, 1993. View at: Google Scholar
 Y. Zuo and R. Serfling, “General notions of statistical depth function,” Annals of Statistics, vol. 28, no. 2, pp. 461–482, 2000. View at: Google Scholar
 I. Mizera and C. H. Müller, “Locationscale depth,” Journal of the American Statistical Association, vol. 99, no. 468, pp. 949–966, 2004. View at: Publisher Site  Google Scholar
 Y. Zuo and H. Cui, “Depth weighted scatter estimators,” Annals of Statistics, vol. 33, no. 1, pp. 381–413, 2005. View at: Publisher Site  Google Scholar
 L. Lin and M. Chen, “Robust estimating equation based on statistical depth,” Statistical Papers, vol. 47, no. 2, pp. 263–278, 2006. View at: Publisher Site  Google Scholar
 A. Caplin and B. Nalebuff, “Aggregation and social choice: a mean voter theorem,” Econometrica, vol. 59, no. 1, pp. 1–23, 1991. View at: Google Scholar
 A. K. Ghosh and P. Chaudhuri, “On maximum depth and related classifiers,” Scandinavian Journal of Statistics, vol. 32, no. 2, pp. 327–350, 2005. View at: Publisher Site  Google Scholar
 F. Chebana and T. B. M. J. Ouarda, “Multivariate extreme value identification using depth functions,” Environmetrics, vol. 22, no. 3, pp. 441–455, 2011. View at: Publisher Site  Google Scholar
 P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings National Institute of Science of India, vol. 2, no. 1, pp. 49–55, 1936. View at: Google Scholar
 K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, Calif, USA, 1990.
 Y. Zuo, H. Cui, and X. He, “On the StahelDonoho estimator and depthweighted means of multivariate data,” Annals of Statistics, vol. 32, no. 1, pp. 167–188, 2004. View at: Publisher Site  Google Scholar
 D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, AddisonWesley, Reading, Mass, USA, 1989.
 H. Haario, E. Saksman, and J. Tamminen, “An adaptive Metropolis algorithm,” Bernoulli, vol. 7, no. 2, pp. 223–242, 2001. View at: Google Scholar
Copyright
Copyright © 2014 Taesam Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.