Stochastic Temporal Data Upscaling Using the Generalized k-Nearest Neighbor Algorithm
Three methods of temporal data upscaling, which may collectively be called the generalized k-nearest neighbor (GkNN) method, are considered. The accuracy of the GkNN simulation of month-by-month yield is examined (where the term yield denotes the dependent variable). The notion of an eventually well-distributed time series is introduced, and on the basis of this assumption some properties of the average annual yield and its variance for a GkNN simulation are computed. The total yield over a planning period is determined. A general framework for considering the GkNN algorithm, based on the notion of stochastically dependent time series, is described, and it is shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology is given for the problem of simulating the yield of a rainwater tank given monthly climatic data.
1. Introduction
The $k$-nearest neighbor method has its origins in the work of Mack, Yakowitz and Karlsson, and others, e.g., [3, 4]. In this work an estimate for $E[Y \mid X = x]$, given an independently and identically distributed (i.i.d.) sequence of random vectors $(X_i, Y_i)$ with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$ (where $\mathbb{R}$ denotes the set of real numbers), is obtained on the basis of $(X_1, Y_1), \ldots, (X_n, Y_n)$ by taking the average of $Y_i$ over the set $i \in I(x)$, where $I(x)$ is the set of indices of the vectors $X_i$ which form the $k$ nearest neighbors of $x$, in which $1 \leq k \leq n$.
In later work by Lall and Sharma and Rajagopalan and Lall a related method, also called the $k$-nearest neighbor method, was used for simulating hydrological stochastic time series. In this method the next value in the simulated time series is chosen randomly according to a probability distribution over the set of indices of the $k$ nearest neighbors of the current state in the historical record.
In the present paper we derive some general results about the $k$-nearest neighbor algorithm and related methods, which we group together as a general class called the generalized $k$-nearest neighbor method (GkNN method). We do not assume that the time series are i.i.d., null-recurrent Markov, or Harris recurrent Markov chains. We introduce the natural notion of a time series being eventually well distributed from which, if satisfied, some properties of the GkNN algorithm can be deduced.
The generalized $k$-nearest neighbor (GkNN) algorithm is described in Section 2. Section 3 investigates the problem of predicting the month-by-month yield (where we use the term "yield" to denote the value of the dependent variable), while Section 4 considers the computation of the average annual yield. Section 5 computes the variance of the average annual yield, while Section 6 considers the behavior of the total yield. Section 7 describes a general framework for viewing the GkNN algorithm and conditions under which this framework is applicable in practice. Section 8 presents the particular example of the problem of simulating rainwater tank yield. The paper concludes in Section 9.
2. The Generalized k Nearest Neighbor (GkNN) Method
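In the GkNN method we are given a time series $\{v_t \in V : t = 1, \ldots, T\}$ of predictor vectors which may be obtained from, for example, a stochastic simulation of climatic data. Here $V$ denotes the space of predictor vectors. We are also given training data $\{(u_i, x_i) \in V \times \mathbb{R} : i = 1, \ldots, n\}$.
We want to assign yields $\hat{y}_t$ for $t = 1, \ldots, T$ in a meaningful way. We are given a metric $\rho : V \times V \to [0, \infty)$. We are also given a probability distribution $p = (p_1, \ldots, p_n)$ on $\{1, \ldots, n\}$. In the GkNN method the yield time series $\hat{y}_t$ for $t = 1, \ldots, T$ is computed as follows.
For each $t \in \{1, \ldots, T\}$,
(1) Compute the metric values $\rho(v_t, u_i)$ for $i = 1, \ldots, n$ and sort from lowest to highest. Let $\pi_t$ be the resulting permutation of $\{1, \ldots, n\}$, so that $u_{\pi_t(1)}$ is the training predictor nearest to $v_t$.
(2) Randomly choose $i \in \{1, \ldots, n\}$ according to the distribution $p$. Denote it by i_selected.
(3) Return $\hat{y}_t = x_{\pi_t(\text{i\_selected})}$.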
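The three steps above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name gknn_simulate, the representation of the training data as (predictor, yield) pairs, and the example values are all assumptions made here.

```python
import random

def gknn_simulate(predictors, training, metric, probs, rng=None):
    """Return one simulated yield per predictor vector (Steps 1 to 3).

    training is a list of (predictor, yield) pairs and probs[i] is the
    probability of selecting the record of rank i (rank 0 is the nearest).
    """
    rng = rng or random.Random()
    ranks = range(len(probs))
    yields = []
    for v in predictors:
        # Step (1): sort the training records by distance to v, nearest first.
        ordered = sorted(training, key=lambda rec: metric(v, rec[0]))
        # Step (2): draw a rank according to the given distribution.
        i_selected = rng.choices(ranks, weights=probs, k=1)[0]
        # Step (3): return the yield of the record holding that rank.
        yields.append(ordered[i_selected][1])
    return yields

# Hypothetical one-component predictors (e.g. monthly rainfall) and yields.
training = [((1.0,), 10.0), ((2.0,), 20.0), ((5.0,), 50.0)]
abs_metric = lambda a, b: abs(a[0] - b[0])
# Putting all probability on rank 0 reduces GkNN to the nearest neighbor rule.
print(gknn_simulate([(1.2,), (4.0,)], training, abs_metric, [1.0, 0.0, 0.0]))
# prints [10.0, 50.0]
```

Setting the trailing entries of probs to zero restricts the choice to the first few ranks, which is how a k-nearest-neighbor selection fits into the same scheme.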
3. Prediction of the Month by Month Yield by GkNN Simulation
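We want to determine by either theoretical calculation or computational experiment how well the GkNN method predicts yields, or at least to find some sense in which it can be said that the GkNN method is predicting yields accurately. Suppose that we have a training set $\{(u_i, x_i) : i = 1, \ldots, n\}$. Let $\{(v_t, y_t) : t = 1, \ldots, T\}$ be a given climatic time series and associated (unknown) yields. The GkNN method is a stochastic method for generating a yield time series. Suppose that we run it $R$ times, resulting in a yield time series $\{\hat{y}_t^{(r)} : t = 1, \ldots, T\}$ for run $r$, where $r = 1, \ldots, R$.
We will first work out how well the GkNN predicted yield approximates the actual yield for any given month. A measure of the error of the predicted yield compared to the actual yield for month $t$ and run $r$ is the square of the deviation, i.e. $(\hat{y}_t^{(r)} - y_t)^2$. The expected error for the GkNN computation of the yield for month $t$ is
$$E_t = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \left(\hat{y}_t^{(r)} - y_t\right)^2.$$
We will show that this expected error exists and is positive. Let $\bar{y}_t$ denote the expected value of the GkNN prediction of the yield for month $t$. We will show that $\bar{y}_t$ exists. Let $i_{t,r}$ denote the index i_selected chosen in Step (2) of the GkNN algorithm for month $t$ and run $r$, let $\pi_t$ denote the sorting permutation of Step (1), and let $p_i$ denote the selection probabilities. By definition
$$\bar{y}_t = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \hat{y}_t^{(r)} = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} x_{\pi_t(i_{t,r})} = \sum_{i=1}^{n} p_i \, x_{\pi_t(i)}.$$
Thus $\bar{y}_t$ exists. Now
$$\left(\hat{y}_t^{(r)} - y_t\right)^2 = \left(\hat{y}_t^{(r)} - \bar{y}_t\right)^2 + 2\left(\hat{y}_t^{(r)} - \bar{y}_t\right)\left(\bar{y}_t - y_t\right) + \left(\bar{y}_t - y_t\right)^2.$$
Averaging over runs, the cross term vanishes in the limit. Therefore
$$E_t = \mathrm{Var}(\hat{y}_t) + \left(\bar{y}_t - y_t\right)^2.$$
Now the variance $\mathrm{Var}(\hat{y}_t)$ is given by
$$\mathrm{Var}(\hat{y}_t) = \lim_{R \to \infty} \frac{1}{R} \sum_{r=1}^{R} \left(\hat{y}_t^{(r)} - \bar{y}_t\right)^2 = \sum_{i=1}^{n} p_i \left(x_{\pi_t(i)} - \bar{y}_t\right)^2.$$
The expected error is the sum of two nonnegative terms. The first term can only be zero if all the points in the neighborhood (those ranks $i$ with $p_i > 0$) have associated yields equal to $\bar{y}_t$, and this is seldom the case. The greater the spread of yields in the neighborhood, the greater the first term will be and hence the greater $E_t$ will be. Thus the expected error is positive and the error in the prediction of the yield during month $t$ for any given run is likely to be positive.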
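The split of the expected error into a variance term and a squared-bias term can be checked numerically for a single month. The sketch below is illustrative only: the function name, the ranked neighbor yields, the selection probabilities, and the actual yield are hypothetical values chosen here, not data from the paper.

```python
def expected_error_terms(neighbor_yields, probs, actual_yield):
    """Exact mean, variance, squared bias and expected squared error
    of a single GkNN draw.

    neighbor_yields[i] is the training yield at rank i (nearest first)
    and probs[i] is the probability of selecting that rank.
    """
    pairs = list(zip(probs, neighbor_yields))
    mean = sum(p * x for p, x in pairs)
    variance = sum(p * (x - mean) ** 2 for p, x in pairs)
    bias_sq = (mean - actual_yield) ** 2
    expected_error = sum(p * (x - actual_yield) ** 2 for p, x in pairs)
    return mean, variance, bias_sq, expected_error

mean, variance, bias_sq, expected_error = expected_error_terms(
    [10.0, 14.0, 18.0], [0.5, 0.3, 0.2], actual_yield=12.0)
# The two nonnegative terms add up to the expected squared error.
print(variance + bias_sq, expected_error)
```

With these values the variance term dominates, which matches the observation that a wide spread of yields among the neighbors makes the month-by-month prediction inaccurate even when the mean prediction is close to the actual yield.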
A measure of the total error of the GkNN prediction of yield over the total simulation period for run isand its expected value isWe may writewhereandWe haveNow define byand let, for . Thenwhere is defined by may be called the base error map. We will show that E is bounded over the predictor vector space as follows:where .
4. Prediction of the Annual Average Yield by GkNN Simulation
Thus the GkNN method does not make accurate detailed month by month predictions of the yield. We would like to determine some way in which the GkNN method gives useful information about the system behavior. We will show that under certain assumptions the GkNN method gives an accurate prediction of the average annual yield and the accuracy of the prediction increases as the total time period of the simulation increases.
Given a permutation let . Let denote the set of all permutations of . Suppose that the simulation is carried out over years, so . The average annual yield for run isTherefore the average of the average annual yield over runs is given byas . Therefore the expected value of the predicted average annual yield is given byIf is a topological space and is a time series in then we will say that is eventually well distributed if(Borel denotes the sigma algebra generated by the set of open sets in .) This is a natural property for a time series to have. If is eventually well distributed define its distribution to be the mapping defined byIt is straightforward to show that is finitely additive and .
If the climatic time series is eventually well distributed with distribution then the average annual yield converges to a limit as the number of years in the simulation increases given by
5. Variance of the Average Annual Yield Predicted by GkNN Simulation
We will now compute the variance of the average annual yield and show that it tends to zero as the number of years in the simulation increases. We haveWe may computeNow for ,as (assuming that the index selection at Step of the GkNN algorithm at time is independent of its selection at time ). ThereforeAlso we computeIt follows thatThereforewhereand so the variance of the predicted annual average annual yield as computed by the GkNN method tends to zero as the total number of years in the simulation increases. If the time series is eventually well distributed with distribution then
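The decay of this variance can be illustrated with a short calculation. The sketch assumes, as in the text, that the index selections at different months are independent, and it further assumes that the average annual yield is the total simulated yield divided by the number of years. The function name and the monthly variance values are hypothetical.

```python
def variance_of_average_annual_yield(monthly_variances):
    """Variance of (total yield / number of years) for independent months.

    monthly_variances[t] is the variance of the GkNN draw for month t.
    """
    n_months = len(monthly_variances)
    if n_months == 0 or n_months % 12 != 0:
        raise ValueError("expected a whole number of years of monthly values")
    n_years = n_months // 12
    # Independence makes the variance of the sum equal to the sum of variances.
    return sum(monthly_variances) / n_years ** 2

# One hypothetical year of per-month variances, repeated for longer runs.
one_year = [4.0, 1.0, 0.5, 2.0, 3.0, 1.5, 0.0, 2.5, 1.0, 0.5, 3.5, 0.5]
for n_years in (1, 10, 100):
    print(n_years, variance_of_average_annual_yield(one_year * n_years))
```

For this repeating pattern the variance falls in proportion to one over the number of years, so a tenfold longer simulation gives a tenfold smaller variance.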
6. Prediction of the Total Yield by GkNN Simulation
Thus the computation of average annual yield using GkNN seems to be well behaved. However it is perhaps of greater interest to consider the total yield at any month starting from the beginning of the simulation period. The total yield over a simulation period of years is given bywhere is the average annual yield. Therefore the variance of the total yield is given byIf the time series is eventually well distributed with distribution then Var(, whereas . This limit will be positive for practical applications. Thus, in this case, the variance of becomes unbounded as .
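The contrast between the total yield and the average annual yield can be seen in a small Monte Carlo experiment. This is a sketch under simplifying assumptions made here: every month uses the same ranked neighbor yields and selection probabilities (a stationary predictor), draws are independent, and the names and numbers are hypothetical.

```python
import random
import statistics

def total_yield_runs(neighbor_yields, probs, n_years, n_runs, seed=0):
    """Total yield over 12 * n_years months for n_runs independent runs."""
    rng = random.Random(seed)
    return [sum(rng.choices(neighbor_yields, weights=probs, k=12 * n_years))
            for _ in range(n_runs)]

def total_and_average_variance(neighbor_yields, probs, n_years, n_runs=2000):
    """Sample variance across runs of the total and the average annual yield."""
    totals = total_yield_runs(neighbor_yields, probs, n_years, n_runs)
    averages = [total / n_years for total in totals]
    return statistics.pvariance(totals), statistics.pvariance(averages)

for n_years in (5, 50):
    var_total, var_average = total_and_average_variance(
        [10.0, 14.0, 18.0], [0.5, 0.3, 0.2], n_years)
    print(n_years, round(var_total, 1), round(var_average, 3))
```

In this setting the variance of the total grows roughly in proportion to the number of years, while the variance of the average shrinks at the same rate.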
7. A General Framework for GkNN
Let be a time series which may be a realization of some stochastic process and let be a topological space. A stochastic process will be said to be stochastically dependent on if there exists a continuous kernel such thatThe condition that is a continuous kernel means that for all the mapping taking to is a probability measure and for all the mapping taking to is continuous. Equation (35) means that if for are a collection of runs (replicates) of the stochastic process thenConsider the GkNN process defined by training data . In this case the space is the space . We will show that the process is stochastically dependent on the time series . In fact we have where for , denotes the Dirac measure concentrated on defined byIt follows that the GkNN process is stochastically dependent on with kernel defined byNow suppose that is a time series and is a stochastic process which is stochastically dependent on with kernel where is defined by a continuous functional kernel , i.e.,Let be a realization (replicate) of and let be the kernel associated with the GkNN process with training set and probabilitiesfor which as but as . An example of a sequence satisfying this is . is given byTherefore for an interval Now let be defined byLet be defined by for . Then is a uniformly distributed sequence of random numbers and . Thusas , assuming that is small for all for large enough (this will follow, if is eventually well distributed with positive distribution, given that is a uniformly distributed sequence).
Thus the GkNN kernel equals the kernel of the dependent process in the sense defined above as long as the training set for the GkNN process is large enough.
8. Example of Temporal Upscaling of (Rainwater) Tank Data
We would like to estimate the month by month yield of a rainwater tank (RWT) given monthly climatic data. This is not straightforward because a monthly time step is too coarse for the RWT simulation model. To obtain reasonably accurate results a daily time step must be used for the RWT simulation [13, 14].
The monthly climatic data arises from the water supply headworks (WSH) model and is usually stochastically generated with a very large time span (e.g., 1,000,000 years). The problem of temporal upscaling would not arise if the climatic data for the WSH model had a daily time step (and if the RWT simulation algorithm could be executed sufficiently fast).
Temporal downscaling has been used extensively in studying the short term effects of long-term climate models such as models of climate change [16–19]. However in the present paper we are considering the problem of upscaling relatively short records of daily data to generate long term records of monthly data.
In each of the upscaling methods considered here the RWT month-by-month yield associated with a WSH climatic time series is estimated using a comparatively short (e.g., 140 years) historical record of daily climatic data. In each case the RWT simulation model or, more generally, the Allotment Water Balance model is run on this daily historical record for various RWT parameter settings. To do this it is necessary to have a demand model, which is either a simulation or, less commonly, a historical record. The demand simulation will take into account the climatic variables, in particular the temperature.
The upscaling methods can be described in terms of the following general format. Each of the upscaling methods aggregates the daily RWT yields and climatic variables obtained from running the RWT simulation on the historical record into monthly time steps. They then generate a list of records of the form
$$(\text{month\_label}_i,\ \text{climatic\_variable\_1}_i,\ \ldots,\ \text{climatic\_variable\_}m_i,\ \text{RWT\_yield}_i), \quad i = 1, \ldots, n,$$
where $n$ is the number of months in the historical record. The month label is a number in $\{1, \ldots, 12\}$ determined from the month corresponding to the record. For the Coombes method, $m = 3$ and climatic_variable_1 = average_temperature, climatic_variable_2 = number_of_rainfall_days, climatic_variable_3 = rainfall_depth.
For Kuczera's bootstrap method and the kNN method as currently implemented, $m = 1$ and climatic_variable_1 = rainfall_depth.
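Now for all three upscaling methods we are given a sequence of monthly records $c_1, \ldots, c_T$ coming from the WSH model, where each $c_t$ consists of a month label and the climatic variables for that month, with no associated RWT yield.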
For each monthly WSH record $c_t$ we want to select an RWT yield to associate with $c_t$. The NN method does this by finding the record in the training list which is closest to $c_t$ as measured by the metric (a variant of the Manhattan metric) given by
$$\rho(a, b) = \sum_{j=1}^{m} w_j \, |a_j - b_j|,$$
where $m$ is the record length (e.g. 2) and $w_j$ are weights which were chosen to be 1 in the Coombes method. The NN method is deterministic.
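A deterministic nearest-neighbor lookup of this kind can be sketched as follows. The weighted absolute-difference form of the metric, the function names, and the example records (average temperature, number of rainfall days, rainfall depth, paired with a yield) are assumptions made for illustration rather than the published implementation.

```python
def weighted_manhattan(a, b, weights=None):
    """Sum of weighted absolute component differences (weights default to 1)."""
    weights = weights or [1.0] * len(a)
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def nn_yield(query, records, weights=None):
    """Return the yield of the training record closest to the query.

    records is a list of (predictor_tuple, rwt_yield) pairs.
    """
    nearest = min(records,
                  key=lambda rec: weighted_manhattan(query, rec[0], weights))
    return nearest[1]

# Hypothetical monthly records: (avg temperature, rain days, rain depth).
records = [((18.0, 5, 40.0), 3.2), ((25.0, 2, 10.0), 1.1)]
print(nn_yield((24.0, 3, 12.0), records))  # prints 1.1
```

Because the components have different units, unit weights let the component with the largest numerical range dominate the distance, which is one reason a weighted or scaled metric may be preferred.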
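The kNN method is a stochastic method in which the following steps are carried out.
(1) Evaluate the distance from each training record $a_i$ to the WSH record $c_t$ using the following metric (a variant of the Euclidean metric):
$$\rho(a, b) = \sqrt{\sum_{j=1}^{m} \left(\frac{a_j - b_j}{\sigma_j}\right)^2},$$
where $\sigma_j$ is the standard deviation of the $j$th component over the training records.
(2) Sort the metric values.
(3) Choose the top (closest) $k$ values $\rho_{(1)}, \ldots, \rho_{(k)}$.
(4) Assign a probability to each of the selected values proportional to $1/j$ for $j = 1, \ldots, k$.
(5) Randomly select an index $j$ according to the assigned probabilities and return the RWT yield of the $j$th closest record as the RWT yield corresponding to $c_t$.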
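The kNN steps can be sketched in Python as below. Two points are assumptions made here rather than details confirmed by the text: the metric divides each component difference by that component's standard deviation over the training records, and the selection weights are proportional to 1/j for the j-th closest record (the kernel used by Lall and Sharma). The function names and example records are hypothetical.

```python
import math
import random
import statistics

def scaled_euclidean(a, b, sds):
    """Euclidean distance after dividing each component by its std deviation."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sds)))

def knn_yield(query, records, k, rng):
    """Draw one yield from the k nearest records with weights 1/j."""
    predictors = [rec[0] for rec in records]
    # Per-component standard deviation; fall back to 1.0 for a constant column.
    sds = [statistics.pstdev(col) or 1.0 for col in zip(*predictors)]
    # Steps (1) to (3): rank records by distance and keep the k closest.
    ranked = sorted(records,
                    key=lambda rec: scaled_euclidean(query, rec[0], sds))
    nearest = ranked[:k]
    # Step (4): weight 1/j for the j-th closest (assumed kernel).
    weights = [1.0 / j for j in range(1, len(nearest) + 1)]
    # Step (5): random selection according to those weights.
    return rng.choices(nearest, weights=weights, k=1)[0][1]

# Hypothetical records: ((month label, rainfall depth), RWT yield).
records = [((1, 30.0), 2.0), ((1, 80.0), 5.0), ((7, 35.0), 2.5), ((7, 5.0), 0.4)]
print(knn_yield((1, 32.0), records, k=2, rng=random.Random(0)))
```

random.choices normalizes the weights internally, so the 1/j values do not need to be divided by their sum.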
The bootstrap method is a stochastic method in which a scatter plot of RWT yield against rainfall depth is created. The domain of the plot is divided up into bands of 50 samples per band. Then, given a WSH climatic record, the corresponding RWT yield is obtained by finding the band containing its rainfall depth, randomly choosing a sample in that band, and returning its RWT yield value.
The bootstrap method of Kuczera can be modified by taking the set of samples associated with any given rainfall value to be the set of samples whose rainfall values are the 50 closest values to the given rainfall value rather than using predefined bands of 50 rainfall values. It can be argued that the modified bootstrap method is superior to the bootstrap method because the closest values are the most appropriate values to use and, for example, if the given rainfall value falls near the boundary of one of the predefined bands then the predicted yield using the bootstrap method will be biased towards the values near the centre of the band.
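The difference between the two schemes near a band boundary can be seen in a small sketch. The way bands are formed here (consecutive groups of samples ordered by rainfall, with a query assigned to the first band whose largest rainfall is at least the query value) is an assumption, as are the function names and the toy data; a band size of 4 stands in for the 50 used in the text.

```python
import random

def banded_bootstrap_yield(rainfall, samples, band_size, rng):
    """Original scheme: predefined bands of band_size samples by rainfall."""
    ordered = sorted(samples)  # samples are (rainfall, rwt_yield) pairs
    bands = [ordered[i:i + band_size]
             for i in range(0, len(ordered), band_size)]
    # First band whose largest rainfall covers the query, else the last band.
    band = next((b for b in bands if rainfall <= b[-1][0]), bands[-1])
    return rng.choice(band)[1]

def modified_bootstrap_yield(rainfall, samples, n_closest, rng):
    """Modified scheme: sample uniformly from the n_closest rainfall values."""
    closest = sorted(samples, key=lambda s: abs(s[0] - rainfall))[:n_closest]
    return rng.choice(closest)[1]

# Toy data: rainfall 1..8 with yield equal to ten times the rainfall.
samples = [(float(r), 10.0 * r) for r in range(1, 9)]
rng = random.Random(0)
print(banded_bootstrap_yield(4.4, samples, 4, rng))
print(modified_bootstrap_yield(4.4, samples, 4, rng))
```

For a query rainfall of 4.4 the banded scheme draws from the samples with rainfall 5 to 8, whereas the modified scheme draws from rainfall 3 to 6, which is centred on the query.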
The modified bootstrap method, the Coombes method, and the kNN method are all examples of the GkNN method. For the modified bootstrap method the predictor vectors have one component, the rainfall. For the Coombes method the predictor vectors have three components: the average temperature, the number of rainfall days, and the rainfall depth. For the kNN method the predictor vectors have two components: the month label (an integer in $\{1, \ldots, 12\}$) and the rainfall depth. The training data is obtained by running the RWT simulation model using a daily time step over a relatively short period of time (e.g., 100 years) and then upscaling to a monthly time step by aggregation. The GkNN metric may be the modified Manhattan metric of the Coombes method or the modified Euclidean metric of the kNN method.
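For the bootstrap method the probability distribution on the set of nearest neighbors is given by
$$p_i = \begin{cases} 1/50, & i = 1, \ldots, 50, \\ 0, & i > 50. \end{cases}$$
For the kNN method the distribution is given by
$$p_i = \begin{cases} \dfrac{1/i}{c}, & i = 1, \ldots, k, \\ 0, & i > k, \end{cases}$$
where $c = \sum_{j=1}^{k} 1/j$.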
9. Conclusions
A generalization of three methods of temporal data upscaling, which we have called the generalized k-nearest neighbor (GkNN) method, has been considered. The accuracy of the GkNN simulation of month-by-month yield has been examined. The notion of an eventually well-distributed time series has been introduced, and on the basis of this assumption some properties of the average annual yield and its variance for a GkNN simulation have been computed. The behavior of the total yield over a planning period has been described. A general framework for considering the GkNN algorithm, based on the notion of stochastically dependent time series, has been described, and it has been shown that for a sufficiently large training set the GkNN simulation has the same statistical properties as the training data. An example of the application of the methodology has been given for the problem of simulating the yield of a rainwater tank given monthly climatic data.
Data Availability
The work of the paper is a theoretical study. The author did not implement any code or generate any data relating to the work. Therefore no data were used to support this study.
Conflicts of Interest
The author declares that there are no conflicts of interest.
Acknowledgments
The work described in this paper was partially funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO, Australia). Also the author would like to thank Fareed Mirza, Shiroma Maheepala, and Yong Song for very helpful discussions.
References
S. Yakowitz and M. Karlsson, “Nearest neighbour methods for time series, with application to rainfall/runoff prediction,” in Stochastic Hydrology, J. B. Macneill and G. J. Umphrey, Eds., pp. 149–160, Reidel Publishing Company, 1987.
P. R. Halmos, Measure Theory, Springer, New York, NY, USA, 1974.
J. Mashford, S. Maheepala, L. Neumann, and E. Coultas, “Computation of the expected value and variance of the average annual yield for a stochastic simulation of rainwater tank clusters,” in Proceedings of the 2011 International Conference on Modeling, Simulation and Visualization Methods, pp. 303–309, Las Vegas, Nev, USA, 2011.
G. Kuczera, “Urban water supply drought security: a comparative analysis of complimentary centralised and decentralised storage systems,” in Proceedings of the Water Down Under 2008, pp. 1532–1543, 2008.