Abstract

Precipitation is the main factor that drives hydrologic modeling; therefore, missing precipitation data can cause malfunctions in hydrologic modeling. Although interpolation of missing precipitation data is recognized as an important research topic, only a few methods follow a regression approach. In this study, daily precipitation data were interpolated using five different kernel functions, namely, Epanechnikov, Quartic, Triweight, Tricube, and Cosine, to estimate missing precipitation data. This study also presents an assessment that compares estimation of missing precipitation data through k-nearest neighborhood (k-NN) regression to the five different kernel estimations and their performance in simulating streamflow using the Soil and Water Assessment Tool (SWAT) hydrologic model. The results show that the kernel approaches provide higher quality interpolation of precipitation data compared with the k-NN regression approach, in terms of both statistical data assessment and hydrologic modeling performance.

1. Introduction

Precipitation data are a key input in hydrologic modeling for estimating rainfall-runoff mechanisms [1]. Malfunctions in running hydrologic models can occur due to noncontinuous precipitation input time series. In light of this important issue, estimation of missing precipitation data is a challenging task for hydrologic modeling. Many hydrologic models require interpolation of missing precipitation data [2], meteorological data series completion [3], or imputation of meteorological data [4]. To estimate missing precipitation, researchers should consider spatiotemporal variations in precipitation (rainfall and snowfall) values and the related physical processes. However, accounting for spatiotemporal variation and physical processes can be difficult if there is a lack of equipment for measuring precipitation. Thus, statistical approaches have emerged as widely used methods for filling in missing precipitation data [5].

Many studies have investigated filling in missing streamflow data with several statistical approaches [5], but there are limited studies on the interpolation of incomplete precipitation and temperature data [6–10]. Recently, artificial neural networks (ANNs [11]), a more advanced statistical approach, have been proposed to estimate missing precipitation data [12]. ANNs can learn from training data to reconstruct a nonlinear relationship and obtain values for missing data. Pisoni et al. [13] investigated the interpolation of missing data for sea surface temperature (SST) satellite images using the ANN method; they found that the results from the ANN approach show better accuracy than the results from an interpolation system, as suggested by Seze and Desbois (1987). Nevertheless, ANNs are still under dispute because their neuron systems cannot provide clear relationships between data [14].

The American Society of Civil Engineers (ASCE) Task Committee [15] noted that although the performance of ANNs for estimating missing precipitation data has already been verified, an alternative solution should be suggested for cases in which the available data are insufficient, because ANNs rely on high data quality and quantity. Additionally, ANNs have other limitations, such as a lack of physical concepts and relations, and their outcomes depend on the experience and preferences of those using, studying, and training the networks [15–17]. Since ANNs are regarded as black-box models [18], it is difficult to use this method to obtain explicit relationships between data, even though ANNs can achieve convergence for almost any problem [17]. Thus, for real mechanisms in hydrologic models, in which linear relationships exist between series of weather inputs, the solution is less explicit [19].

Generally, a regression or a distance-weighted method is most commonly used for estimating missing precipitation for hydrologic modeling [20]. Daly et al. [21] also proposed a variety of regression models to incorporate spatial variation in weather data. However, Creutin et al. [22] found that even though simple linear regression or interpolation approaches show satisfactory serial correlation of daily or monthly streamflow, precipitation patterns do not show proper correlation when these approaches are used. Furthermore, if a regression method is used to estimate missing precipitation and produce a refined precipitation time series, a small data sample would not follow the normal distribution assumed by the basic theory of linear regression.

Another approach for estimating missing precipitation data is to use neighboring data weighted by distance. Xia et al. [23] used the closest station to reconstruct missing precipitation data through a geometrical distance weight; Willmott et al. [24] used arithmetic averaging of neighboring data to fill in missing precipitation; and Teegavarapu and Chandramouli [25] used an inverse distance weight method on neighboring data to estimate missing precipitation data. Smith [26], Simanton and Osborn [27], and Salas [28] suggest traditional weighting and data-driven methods, namely, distance-based weighting methods, for interpolating missing precipitation data. Distance weight approaches for estimating missing precipitation data have also been combined with linear regression and the median distribution of regression [29, 30]. Young [31] and Filippini et al. [32] suggested spatially interpolating the correlation to define the weight of each station.

Estimation of missing precipitation data is also possible when other data are available for the same location. Linacre (1992) investigated the interpolation of missing precipitation data by using the mean value of a data series at the same location, and Lowry [33] suggested simple interpolation between available data series. Acock and Pachepsky [34] used data from several days before and after missing precipitation data points to estimate the incomplete precipitation data. k-nearest neighborhood (k-NN) regression is a basic method for estimating missing precipitation data that considers vicinity. However, the method has weaknesses when the data contain outliers or a nonlinear trend exists around the missing data. Whereas k-NN regression rests on the assumption that the data follow a normal distribution, which is statistically unsound for small samples, the kernel method uses a weighted mean, which can overcome this weakness through the kernel weighting scheme. By incorporating neighboring data through a kernel function, the kernel method can overcome the weakness of k-NN regression even when the data show a nonlinear trend.

The objective of this study was to reconstruct daily precipitation data by using five different kernel functions (Epanechnikov, Quartic, Triweight, Tricube, and Cosine) to estimate missing precipitation data. This study also presents an assessment that compares estimation of missing precipitation data through k-NN regression to the five different kernel estimations and their performance in simulating streamflow using the Soil and Water Assessment Tool (SWAT) hydrologic model. The remainder of this paper is organized as follows. Section 2 provides a description of the study area and the hydrologic model. In Section 3, the methodology of the five different kernel methods is presented. Section 4 presents the results of the interpolation of the missing daily precipitation data and the hydrologic model simulation. Finally, conclusions are presented in Section 5.

2. Study Area and Hydrologic Model

The Imha watershed (Figure 1) was selected as the test bed for this study. The Imha watershed is a tributary of the Nakdong River basin and is located in the upper part of the Nakdong River basin in South Korea. It is a predominantly mountainous area; approximately 79.8% of the total area of 1,361 km2 is mountainous. Slopes of 40% to 60% cover 655 km2, or 33% of the total watershed area. The elevation of the Imha watershed ranges from 80 to 1,215 m. The average annual precipitation, minimum temperature, maximum temperature, humidity, and wind speed for the Imha watershed are 1,050 mm, 7°C, 18.8°C, 65%, and 1.6 m/s, respectively (Water Management Information System (WAMIS), http://www.wamis.go.kr/). Since the climate conditions in this area are warm, there is no precipitation in the form of snow; all precipitation falls as rainfall. For this evaluation of precipitation data interpolation and hydrologic model performance, the precipitation and streamflow gauges shown in Figure 1 were selected, and precipitation and streamflow data were sourced from the Water Management Information System (http://www.wamis.go.kr/).

This study selected the SWAT model for analysis. SWAT has a GIS extension, ArcSWAT, which allows the use of various GIS-based datasets to model the geomorphology of a given basin. The SWAT model was developed through research by the United States Department of Agriculture (USDA), Agricultural Research Service (ARS). Major data inputs for SWAT include temperature (maximum and minimum), daily precipitation, solar radiation, relative humidity, wind speed, and geospatial data representing soil types, land cover, and elevation. A watershed is divided into smaller subbasins, which in turn are broken up into smaller units known as hydrologic response units (HRUs). Each HRU is characterized by uniform land use and soil type. SWAT can be used to accurately predict hydrologic patterns for extended periods of time [35]. Canopy interception is implicit in the curve number (CN) method and explicit in the Green-Ampt method. Infiltration is most accurately accounted for using the CN method in SWAT. An alternative method that may be used to account for infiltration is the Green-Ampt method. However, the Green-Ampt method has not been shown to increase accuracy over the CN method; thus, the CN method was used in this study.
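Although the paper does not reproduce it, the standard SCS-CN excess rainfall relation on which the CN option is based can be stated here for reference (a textbook form, with depths in mm, not an equation taken from this study):

$Q_{\mathrm{surf}} = \dfrac{(P - I_a)^2}{P - I_a + S}$ for $P > I_a$ (and $Q_{\mathrm{surf}} = 0$ otherwise), with $I_a \approx 0.2\,S$ and $S = \dfrac{25400}{\mathrm{CN}} - 254$,

where $Q_{\mathrm{surf}}$ is the surface runoff depth, $P$ is the rainfall depth, $I_a$ is the initial abstraction, and $S$ is the retention parameter derived from the curve number CN.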

3. Methodology

This study used five kernel functions, Epanechnikov, Quartic, Triweight, Tricube, and Cosine, as weights to predict missing values. The Tricube kernel assigns large weights near the target point. Although the Tricube weights are similar to the Triweight weights, the rate at which the weight decreases with distance from the target point is smaller than for Triweight. The next highest weight around the target point is given by the Quartic kernel, whose rate of weight decrease is similar to that of Triweight. Both the Epanechnikov and Cosine kernels have flatter profiles, so their weights vary comparatively little between near and far neighboring values. A brief description of the five kernel functions and their application for reconstructing the missing values is presented in the following, and the specific kernel functions are described in Appendix A.

3.1. Epanechnikov

The Epanechnikov kernel is the most often used kernel function. In this study, the Epanechnikov kernel assigns zero weight to observations that lie a distance of four, six, or eight points away from the reference point; these values correspond to the choice of the interval width, often called the choice of smoothing parameter or bandwidth selection. The main characteristic of the Epanechnikov kernel is that, even when the distance from the target value (the missing value in this research) is large, its estimate remains smooth. A brief description is given by the following:

$K(u) = \tfrac{3}{4}\,(1 - u^{2})$ for $|u| \le 1$, and $K(u) = 0$ otherwise,

where $K(u)$ is the kernel function and $u$ is the normalized distance to the surrounding nearest values used as the independent variable.

3.2. Quartic

The second kernel function used in this research was the Quartic kernel, which has greater weight sensitivity to the distance from the missing value. Since the applied weight differs considerably between near and far data points, it is more strongly influenced by the surrounding data. It consists of a fourth-order equation, which is more sensitive to distance than a second-order equation. It is described by the following:

$K(u) = \tfrac{15}{16}\,(1 - u^{2})^{2}$ for $|u| \le 1$, and $K(u) = 0$ otherwise.

3.3. Triweight

The third kernel function used in this research was the Triweight kernel, which consists of a sixth-order equation. It is the most sensitive to distance because the sixth-order equation estimates the missing value based on the difference in distance with a weighted function, as shown by the following:

$K(u) = \tfrac{35}{32}\,(1 - u^{2})^{3}$ for $|u| \le 1$, and $K(u) = 0$ otherwise.

3.4. Tricube

The fourth kernel function used in this research was the Tricube kernel, which uses absolute values. Since it uses absolute values, it presents a smoother pattern near the nearest values than the Triweight kernel. However, as the values move further away from the nearest values, it shows a steep decline. The Tricube kernel has the most sensitivity in terms of weighted distance because it consists of a ninth-order equation, as shown in the following:

$K(u) = \tfrac{70}{81}\,(1 - |u|^{3})^{3}$ for $|u| \le 1$, and $K(u) = 0$ otherwise.

3.5. Cosine

The fifth kernel function used in this research was the Cosine kernel. It is widely applied in various fields because it has a constant curvature. Its shape is similar to the Epanechnikov kernel, even though it uses a cosine function, as shown in the following:

$K(u) = \tfrac{\pi}{4}\cos\!\left(\tfrac{\pi u}{2}\right)$ for $|u| \le 1$, and $K(u) = 0$ otherwise.
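For illustration, the five kernels above can be written compactly in code. The sketch below is an illustrative implementation by the editor (the function names are not from the paper); each function takes a normalized distance u and returns the corresponding kernel weight, which is zero outside the unit interval.

```python
import numpy as np

# Standard textbook forms of the five kernels used in this study.
# Each takes a normalized distance u and returns a weight; the weight
# is zero for |u| > 1.

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def quartic(u):
    return np.where(np.abs(u) <= 1, (15 / 16) * (1 - u**2)**2, 0.0)

def triweight(u):
    return np.where(np.abs(u) <= 1, (35 / 32) * (1 - u**2)**3, 0.0)

def tricube(u):
    return np.where(np.abs(u) <= 1, (70 / 81) * (1 - np.abs(u)**3)**3, 0.0)

def cosine(u):
    return np.where(np.abs(u) <= 1, (np.pi / 4) * np.cos(np.pi * u / 2), 0.0)
```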

3.6. Calculation of the Missing Value

After using a kernel function to calculate the weights, estimation of the missing value is performed using the following:

$\hat{y}_0 = \dfrac{\sum_{i=-k/2,\, i \neq 0}^{k/2} K(u_i)\, y_i}{\sum_{i=-k/2,\, i \neq 0}^{k/2} K(u_i)},$

where $\hat{y}_0$ is the missing value, $k$ is the number of nearest neighbors, and $y_i$ is the $i$th nearest value corresponding to the normalized distance $u_i$ (positive $i$ means the right side and negative $i$ means the left side). The kernel function should have bilateral symmetry about zero. If, for example, the four nearest neighbors are used to estimate the missing value, the neighborhood values used will be two from the right side and two from the left side. The specific equation for this example is

$\hat{y}_0 = \dfrac{K(u_{-2})\, y_{-2} + K(u_{-1})\, y_{-1} + K(u_{1})\, y_{1} + K(u_{2})\, y_{2}}{K(u_{-2}) + K(u_{-1}) + K(u_{1}) + K(u_{2})},$

and an example calculation is described in Appendix B.
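A minimal sketch of this weighted-mean estimator is given below. It assumes that the signed day offsets are normalized by half the neighborhood size plus one (a reading consistent with the raw weights reported in Appendix C); the function and variable names are the editor's, not the paper's.

```python
import numpy as np

def kernel_weighted_mean(neighbors, kernel, half_width=None):
    """Estimate a missing daily value from k symmetric neighbors.

    neighbors: dict mapping signed day offsets (e.g. -2, -1, 1, 2) to the
               observed precipitation on those days.
    kernel:    one of the kernel functions sketched above.
    half_width: normalizing distance; assumed here to be k/2 + 1.
    """
    offsets = sorted(neighbors)                              # signed day offsets
    values = np.array([neighbors[i] for i in offsets], dtype=float)
    u = np.array(offsets, dtype=float)
    if half_width is None:
        half_width = np.max(np.abs(u)) + 1                   # assumption: k/2 + 1
    weights = kernel(u / half_width)
    return float(np.sum(weights * values) / np.sum(weights))

# Example: 4-NN estimate with the Epanechnikov kernel.
# The neighbor values below are hypothetical, for illustration only.
estimate = kernel_weighted_mean({-2: 0.0, -1: 4.0, 1: 8.0, 2: 2.0}, epanechnikov)
```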

3.7. Statistical Tests

A normality test is required to evaluate the methods for filling in interpolation data. The Shapiro-Wilk [36] normality test was used with nineteen samples to determine whether the average difference is normally distributed. The test statistic is as shown in the following:

$W = \dfrac{\left(\sum_{i=1}^{n} a_i\, x_{(i)}\right)^{2}}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^{2}},$

where $x_{(i)}$ is the $i$th order statistic, namely, the $i$th smallest value in the sample, $\bar{x}$ is the mean of the $x_i$, and $a_i$ is a constant given by the ordered data. The null hypothesis of the Shapiro-Wilk normality test is that the sample is normally distributed; if the significance probability is less than 5%, the null hypothesis is rejected, meaning the sample does not satisfy a normal distribution. Since the significance probability for the entire group (Table 1) is below 5%, the null hypothesis is rejected. This study should, therefore, use a nonparametric test for the subsequent analysis.
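In practice, the Shapiro-Wilk statistic and its significance probability can be obtained from a standard statistics library; a minimal sketch with a made-up sample of nineteen differences (not data from this study) is shown below.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 19 average differences (observation minus
# interpolated value), used only to illustrate the call.
diffs = np.array([0.4, -1.2, 0.8, 2.1, -0.3, 0.0, 1.5, -2.4, 0.7, 0.2,
                  -0.9, 1.1, 3.0, -0.6, 0.5, -1.8, 0.3, 0.9, -0.1])

W, p_value = stats.shapiro(diffs)
# If p_value < 0.05, reject the null hypothesis of normality and fall
# back to nonparametric tests, as done in this study.
```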

The Friedman test [37], which is a k-sample test that can detect differences between paired values, was selected as the nonparametric test. This method evaluates a small sample for differences by ranking the values within each block. The null hypothesis of the Friedman test is that there is no average difference among the groups; if the significance probability is less than 5%, the null hypothesis is rejected, leading to the conclusion that an average difference exists among the groups. A brief description of the Friedman test statistic is as follows:

$Q = \dfrac{SS_t}{SS_e},$

where $SS_t$ and $SS_e$ are the sum of squares of the treatments and the sum of squares of the error, respectively.
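A minimal sketch of the Friedman test using a library routine is shown below; the error values for three methods over five cases are hypothetical, chosen only to illustrate the call.

```python
from scipy import stats

# Hypothetical absolute interpolation errors of three methods on the
# same five missing-data cases (illustrative values only).
knn_reg = [2.1, 3.4, 1.9, 4.2, 2.8]
tricube_k = [1.2, 2.0, 1.1, 2.5, 1.6]
epan_k = [1.0, 1.8, 1.3, 2.2, 1.5]

statistic, p_value = stats.friedmanchisquare(knn_reg, tricube_k, epan_k)
# p_value < 0.05 indicates that at least one method differs from the
# others in average rank, so the groups are treated as distinct.
```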

The null hypothesis in this instance was rejected because the significance probability was less than 5% in each case, and this study concluded that each interpolation method has an average difference, which is why each method is considered independent, even though this study used five different kernel methods. For example, with four reference points, the average rank decreases in the order k-NN regression, Tricube, Quartic, Cosine, Triweight, and Epanechnikov (Table 2). For six reference points, the order was k-NN regression, Tricube, Triweight, Quartic, Cosine, and Epanechnikov (Table 2). For eight reference points, the order was k-NN regression, Triweight, Quartic, Cosine, and Epanechnikov (Table 2). As shown in Table 2, k-NN regression has the largest average rank and Epanechnikov has the smallest average rank for all of the reference point cases. This result demonstrates the dissimilarity of these methods.

To determine which methods are dissimilar to the others, this study performed the Wilcoxon signed rank test [38]. The basic feature of the Wilcoxon signed rank test is that data samples that come from the same population are paired; it is detailed in the following:

$W = \sum_{i=1}^{N} \operatorname{sgn}\!\left(x_{2,i} - x_{1,i}\right) R_i,$

where $N$ is the sample size, $x_{2,i}$ is the $i$th value of the second data set, $x_{1,i}$ is the $i$th value of the first data set, and $R_i$ is the rank of $\left|x_{2,i} - x_{1,i}\right|$. If the p value is less than 5%, it means that a different mechanism operates on the sample data or method. Table 3 shows that the p value for k-NN regression is less than 5% for all cases. Accordingly, this signifies that k-NN regression is completely dissimilar to the other methods. Although the five different kernel methods for data interpolation exhibit similarity or dissimilarity to each other depending on the number of reference points, all of the kernel methods can be distinguished from k-NN regression using the Wilcoxon signed rank test.
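A minimal sketch of the paired Wilcoxon test with a library routine is given below; the paired error values are hypothetical, for illustration only.

```python
from scipy import stats

# Hypothetical paired absolute errors (k-NN regression vs. a kernel
# method) on the same missing-data cases; illustrative values only.
knn_errors = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]
kernel_errors = [1.2, 2.0, 1.1, 2.5, 1.6, 1.9, 1.4, 2.3]

statistic, p_value = stats.wilcoxon(knn_errors, kernel_errors)
# p_value < 0.05 indicates the two methods behave differently on the
# paired samples, as found for k-NN regression against every kernel.
```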

4. Results

Since Epanechnikov has the smallest average rank, which signifies a small difference between the observed value and the interpolated value for all reference points in Table 2, the interpolation data obtained from the Epanechnikov method are the best among the studied methods. Figure 2 shows that the data filled in by k-NN regression have a large difference at both four and six reference points. Interpolated data from the kernel methods are close to zero for both the average and the median at four reference points, meaning that the interpolated data are similar to the observed data. On the other hand, more than 75% of the interpolated data from k-NN regression exhibit a difference from zero. When the interpolated data are evaluated at six reference points in Figure 2, the median value from k-NN regression is far from zero. At eight reference points, k-NN regression is close to zero for both the average and median values; however, it is difficult to conclude that this is an ideal method because outlying maximum values affect the average and median values.

This study on precipitation data interpolation also evaluated the simulation of the interpolated data using the SWAT hydrologic model. In SWAT, surface runoff is estimated by considering excess precipitation together with abstractions and infiltration through the Soil Conservation Service Curve Number (SCS-CN) method. The Green-Ampt (GA) infiltration method is another way to calculate surface runoff in SWAT. A previous study shows that both methods give reasonable results and that no significant advantage is observed in using one over the other; however, the GA method appears to have more limitations in modeling seasonal variability than the SCS-CN method. Hence, the SCS-CN method is used for infiltration in this study. An SCS curve number based simulation needs time-step updated information as the soil water content changes. The excess rainfall equation in the SCS-CN method was derived from the historical relationship between the curve number and the hydrologic response observed over more than 20 years. Throughout the surface runoff calculation, infiltration is updated over time according to the soil type. Other abstractions, such as evapotranspiration and soil and snow evaporation, are calculated by the Penman-Monteith method and meteorological statistics. Finally, the kinematic storage model is used to compute groundwater storage and seepage. Flow simulated by SWAT is routed from the HRUs to the watershed outlet.

Figure 3 shows the calibration of the model simulation as the initial step, and the specific parameters are described in Table 4. After the calibration of the SWAT model, the six different interpolated precipitation datasets, with three different reference ranges for each (a total of twenty-four interpolated precipitation datasets), were used to assess the performance of interpolated precipitation data for hydrologic model simulation. Streamflow simulations were run for three years, from 2008 to 2010. To evaluate the model performance under the different interpolated precipitation datasets, this study used NSE (Nash-Sutcliffe efficiency coefficient), R-square (coefficient of determination), and RMSE (root mean square error). Table 5 and Figure 4 show that the simulation results from k-NN regression exhibit low SWAT simulation performance for streamflow estimation, with an average NSE of 0.54, R-square of 0.74, and RMSE of 23.78 m3/s. All of the kernel functions, on the other hand, exhibit good performance for hydrologic simulations with interpolated precipitation data (Table 5 and Figure 4); the average NSE, R-square, and RMSE are (1) 0.83, 0.86, and 14.03 m3/s for Epanechnikov; (2) 0.84, 0.88, and 13.03 m3/s for Quartic; (3) 0.93, 0.93, and 9.30 m3/s for Triweight; (4) 0.94, 0.95, and 8.13 m3/s for Tricube; and (5) 0.93, 0.94, and 9.00 m3/s for Cosine, respectively.
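The three performance measures used above can be computed as follows. This is a generic sketch with the editor's variable names, not code from the SWAT tool chain.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 indicates a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim)**2) / np.sum((obs - obs.mean())**2)

def r_squared(obs, sim):
    """Coefficient of determination (squared Pearson correlation)."""
    return float(np.corrcoef(obs, sim)[0, 1]**2)

def rmse(obs, sim):
    """Root mean square error, in the units of the data (here m3/s)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim)**2)))
```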

5. Conclusions

Five different kernel functions were applied to the Imha watershed to evaluate the performance of each weighted method for estimating missing precipitation data, and the use of the interpolated data for hydrologic simulations was assessed. The following conclusions can be drawn from this research.

(1) To estimate missing precipitation data points, exploratory procedures should consider the spatiotemporal variations of precipitation. Because accounting for these variations is difficult, statistical methods for estimating missing precipitation data are commonly used.

(2) Although ANNs are an advanced approach for estimating missing data, their mechanisms are unclear because the neuron system is ultimately a black-box model. Thus, regression methods are widely used for estimating missing data, even though regression methods are limited in that a small sample cannot be assumed to follow a normal distribution.

(3) When kernel functions are used as a weighting method, the estimated missing data better satisfy the normal distribution assumption, which is more statistically sound. Kernel methods can also overcome the weakness of k-NN regression when the data have outliers and/or a nonlinear trend around the missing data points, because they estimate a weighted mean value.

(4) This study assessed the five kernel functions, Epanechnikov, Quartic, Triweight, Tricube, and Cosine, as weights for predicting missing values. In comparison with the k-NN regression method, this study demonstrates that the kernel approaches provide higher quality interpolated precipitation data than the k-NN regression approach. In addition, the kernel function results better conform to statistical standards.

(5) Furthermore, higher quality interpolated precipitation data result in better performance for hydrologic simulations, as exemplified in this study. All of the statistical analyses of the streamflow simulations showed that the simulations using the interpolated precipitation data from the kernel functions provide better results than those using k-NN regression.

(6) Use of a kernel distribution is a more effective method than regression when the precipitation data have an upward or downward trend. However, if the precipitation data have a nonlinear trend, it is difficult to reconstruct the missing values effectively. For further research, a time series analysis or a random walk model using a stochastic process is a possible method for estimating missing data where there is a nonlinear trend.

Appendices

A. Kernel Functions

Kernel density estimation is an unsupervised learning procedure, which historically precedes kernel regression. It also leads naturally to a simple family of procedures for nonparametric classification.

A.1. Kernel Density Estimation

Suppose we have a random sample $x_1, \ldots, x_N$ drawn from a probability density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$. For simplicity we assume for now that $X \in \mathbb{R}$. Arguing as before, a natural local estimate has the form

$\hat{f}_X(x_0) = \dfrac{\#\{x_i \in \mathcal{N}(x_0)\}}{N\lambda}, \qquad (\text{A.1})$

where $\#\{x_i \in \mathcal{N}(x_0)\}$ is the number of observations falling in $\mathcal{N}(x_0)$, a small metric neighborhood around $x_0$ of width $\lambda$. This estimate is bumpy, and the smooth Parzen estimate is preferred,

$\hat{f}_X(x_0) = \dfrac{1}{N\lambda}\sum_{i=1}^{N} K_\lambda(x_0, x_i), \qquad (\text{A.2})$

because it counts observations close to $x_0$ with weights that decrease with distance from $x_0$. In this case a popular choice for $K_\lambda$ is the Gaussian kernel $K_\lambda(x_0, x) = \phi(|x - x_0|/\lambda)$. Letting $\phi_\lambda$ denote the Gaussian density with mean zero and standard deviation $\lambda$, then (A.2) has the form

$\hat{f}_X(x) = \dfrac{1}{N}\sum_{i=1}^{N} \phi_\lambda(x - x_i) = (\hat{F} \star \phi_\lambda)(x), \qquad (\text{A.3})$

the convolution of the sample empirical distribution $\hat{F}$ with $\phi_\lambda$. The distribution $\hat{F}$ puts mass $1/N$ at each of the observed $x_i$ and is jumpy; in $\hat{f}_X(x)$ we have smoothed $\hat{F}$ by adding independent Gaussian noise to each observation $x_i$.

The Parzen density estimate is the equivalent of the local average, and improvements have been proposed along the lines of local regression (on the log scale for densities). We will not pursue these here. In $\mathbb{R}^p$ the natural generalization of the Gaussian density estimate amounts to using the Gaussian product kernel in (A.3),

$\hat{f}_X(x_0) = \dfrac{1}{N\,(2\lambda^2\pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\left(\|x_i - x_0\|/\lambda\right)^2}. \qquad (\text{A.4})$
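A minimal numerical illustration of the Parzen/Gaussian estimate in (A.2)-(A.3), using standard library routines rather than anything specific to this paper, is shown below.

```python
import numpy as np
from scipy.stats import norm

def parzen_gaussian_density(x0, sample, lam):
    """Smooth Parzen estimate of f_X(x0) with a Gaussian kernel of
    standard deviation lam (the bandwidth), as in (A.3)."""
    sample = np.asarray(sample, float)
    return float(np.mean(norm.pdf((x0 - sample) / lam)) / lam)

# Illustrative sample drawn from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
density_at_zero = parzen_gaussian_density(0.0, sample, lam=0.4)
```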

A.2. Kernel Density Classification

One can use nonparametric density estimates for classification in a straightforward fashion using Bayes' theorem. Suppose for a $J$-class problem we fit nonparametric density estimates $\hat{f}_j(X)$, $j = 1, \ldots, J$, separately in each of the classes, and we also have estimates of the class priors $\hat{\pi}_j$ (usually the sample proportions). Then

$\widehat{\Pr}(G = j \mid X = x_0) = \dfrac{\hat{\pi}_j \hat{f}_j(x_0)}{\sum_{k=1}^{J} \hat{\pi}_k \hat{f}_k(x_0)}. \qquad (\text{A.5})$

Where the data are sparse for both classes, and since the Gaussian kernel density estimates use metric kernels, the density estimates are low and of poor quality (high variance) in those regions. The local logistic regression method uses the tricube kernel with a k-NN bandwidth; this effectively widens the kernel in such regions and makes use of the local linear assumption to smooth out the estimate (on the logit scale).

If classification is the ultimate goal, then learning the separate class densities well may be unnecessary and can in fact be misleading. In learning the separate densities from data, one might decide to settle for a rougher, high-variance fit to capture features that are irrelevant for the purposes of estimating the posterior probabilities. In fact, if classification is the ultimate goal, then we need only estimate the posterior well near the decision boundary (for two classes, this is the set $\{x : \Pr(G = 1 \mid X = x) = 1/2\}$).

B. Procedures of Missing Precipitation

This step shows an example calculation of the kernel functions for the weighted mean, illustrating the weight in each situation. Since the kernel functions are all symmetric, the same weights are used for equal day distances on either side of the missing value. Table 6 shows the weights for day distances of 1, 2, 3, and 4. For example, to estimate the missing precipitation for 2010-02-12 (actual value is 6), follow the procedure below (3 steps) with the 4-NN Epanechnikov kernel (Table 7).

Step 1. Select the date for target interpolation data.

Step 2. Determine the k nearest days' precipitation and each kernel weight.

Step 3. Calculate the weighted average to estimate the missing value.

The rest of the kernel methods for estimating missing precipitation are described in Table 8.

C. Sample Calculations with Real Values

This section shows how to calculate missing precipitation with the kernel weighted mean function using actual numbers. The sample used daily data from 2008 to 2010, selecting points at random with a probability of 0.02; the locations of the selected data were then fixed. Zhang et al. [39] noted that kernel based nonparametric multiple imputation has better performance than general linear regression when the sample data are small or limited.

Table 9 shows the procedure for the kernel weights in each function. The data from Feb. 10 through Feb. 14 were used to estimate the missing value on Feb. 12. Among the kernels, the Epanechnikov kernel gives the farthest data point the highest weight (0.417), whereas the Triweight kernel gives the farthest point the lowest weight (0.188). The highest weight at the nearest value is given by the Tricube kernel and the lowest by the Epanechnikov kernel. In general, the Tricube kernel, with its high weight, tends to overestimate the missing precipitation.
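The weights quoted above can be checked directly. Under our reading of the setup (four neighbors at day distances 1 and 2, normalized by a width of 3), the farthest point corresponds to u = 2/3, where the raw Epanechnikov weight is about 0.417 and the raw Triweight weight is about 0.188, matching Table 9. A short check, reusing the kernel functions sketched in Section 3:

```python
# Raw (unnormalized) kernel weights at the farthest of four neighbors,
# assuming day distances 1 and 2 are normalized by a width of 3.
u_far = 2 / 3
print(round(float(epanechnikov(u_far)), 3))  # ~0.417
print(round(float(triweight(u_far)), 3))     # ~0.188
```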

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.