Abstract

Weather generators are tools used to downscale monthly to seasonal climate forecasts, from numerical climate models to daily values for use as inputs for crop and other environmental models. One main limitation of most of weather generators is that they do not incorporate neither the spatial/temporal correlations between/within sites nor the cross-correlations between variables, characteristics specially important when aggregating, for example, simulated crop yields, freeze events, or heat waves in a watershed or region. Three models were developed to generate realization of daily maximum and minimum temperatures for multiple sites. The first model incorporates only spatial correlation, whereas temporal correlation using a 1-day lag and cross-correlation between variables were added to model one, respectively, by the other two models. Vectors of correlated random numbers were rescaled to temperature values by multiplying each element with the standard deviation and adding the mean of the corresponding weather station. An extension of Crout's algorithm was developed to enable the factorization of nonpositive definite matrices. Monthly spatial correlations of generated daily maximum and minimum temperatures between all pairs of weather stations closely matched their observed counterparts. Performance was analyzed by comparing the root mean squared error, temporal semivariograms, correlation/cross-correlation matrices, multiannual monthly means, and standard deviations.

1. Introduction

Because of the geospatially homogeneous nature of daily temperatures, few publications have investigated the generation of daily temperatures while reproducing the observed spatial variability [1, 2]. Most of these methods [3, 4], including those presented in the current paper, are based on Wilks' parametric model [5]. Another common approach found in the literature is generally based on resampling methods [68]. The applicability of Wilks' methods is potentially limited if the correlation matrix cannot be factorized, such as the case when a matrix is nonpositive definite. The probability of facing nonpositive definite matrix increases exponentially for large number of locations or the use of more complex matrices, which includes lag time and cross-correlations. Thus, the methodological requirements to factorize matrices cannot be overlooked.

When regional analysis is required, it is important to generate geospatially and temporally correlated maximum and minimum temperatures while taking into account the covariance between them and between other variables, such as rainfall [9, 10]. For instance, a water balance for a given region must be based on multisite estimations of daily evapotranspiration based on temperatures (among other variables) following the observed spatial structure and its relationship with rainfall occurrence [11]. Failing to correctly generate daily differences between maximum and minimum temperatures will cause over- or underestimation of water availability in a region [12]. Inaccuracies can increase even more, if multisite estimations of daily evapotranspiration use temperature-based methods to estimate solar radiation [13, 14].

Over the last decade, there has been increasing use of weather generators to stochastically downscale the increasingly skilled monthly forecasts/projections from global and regional circulation models (G/RCMs) to forecast seasonal climate and to project climate change [15, 16]. Together with other available methods, such as the delta change method [17] and bootstrapping [18], weather generators are necessary to apply because the daily outputs from G/RCMs are mean values within the geographical domain of a grid cell, and the direct use of these daily outputs in environmental and agricultural models has shown large bias and errors in the final outcomes [19, 20].

The objectives of this study are as follows: (i) to develop a stochastic daily maximum and minimum temperature generator capable of reproducing both the daily spatial correlation between weather stations and the monthly statistics of each individual weather station, accounting for temporal correlation from a previous day and/or cross-correlation between maximum and minimum temperatures, characteristics that current weather generators are not capable to reproduce and (ii) to extend Crout's algorithm to transform a non-positive definite correlation matrix into a positive definite matrix, thus allowing its factorization.

Several questions have arisen while considering how to incorporate spatial correlations in generating maximum and minimum temperatures. For example, what is the best way to factorize a nonpositive definite correlation matrix? Are observed monthly spatial correlation patterns accurately reproduced by generating daily spatially correlated maximum and minimum temperatures? Is it possible to replicate the daily spatial covariance between maximum temperatures and minimum temperatures? The models developed in this paper generate daily maximum and minimum temperatures for weather station locations rather than continuous temperatures over space. Weather data generated by a single-site weather generator or bootstrapping methods should not be spatially interpolated because of the assumption of independence among stations; however, data generated by the algorithms developed in this paper can be spatially interpolated. For the same reason, outcomes produced by crop, hydrological, and environmental models using weather data generated without taking into account its spatial structure should not be spatially aggregated.

2. Methods

The present approach is based on the assumption of spatial-temporal covariance stationarity, which implies that the autocovariance functions of the data series, the spatial correlations between the data series, and the cross-correlation between variables of the data series do not change during the time period considered. From a temporal point of view, this allows one to characterize a time series of a given variable, like temperatures, as samples from a probability distribution at each weather station. It also allows one to characterize spatial relationships between all pairs of weather stations in a selected area with a correlation and cross-correlation matrices. This assumption allows one to perturb the main statistics of the geospatial weather generator for climate variability and climate change applications. The assumption of spatial-temporal covariance stationarity applies when the global and regional atmospheric circulation patterns across the study area remain unchanged within the range of the interannual variability. In some cases, extreme events like ENSO for instance, can significantly change these spatial and temporal patterns [21, 22]. Under these circumstances, it is recommended to split the dataset accordingly and to apply the assumption to each of the new categories (i.e., El Niño, La Niña, and neutral years). Applying the present approach to future climate projections must use the same rationale. Because future climate projections are generated using global and regional circulation models, these analyses are possible to perform.

Daily historical records of maximum and minimum temperatures and rainfall are needed for the model described in this paper because it uses statistics of observed weather data within and between locations for autocalibration. Generation of daily rainfall events and amounts used in the present paper follows the method described by Baigorria and Jones [23].

2.1. Models

Three models for generating spatially and temporally correlated maximum and minimum temperatures were developed (Table 1). The covariance between maximum temperature and rainfall occurrences was taken into account in all three models [24]. To account for this covariance, two Gaussian distributions were fitted for maximum temperatures for each location: (a) using maximum temperatures during rainy days only and (b) using maximum temperatures during nonrainy days.

The first and second models generate daily maximum and minimum temperatures independently, whereas the third model generates both variables at the same time based on the spatial correlation of the same variable and the cross-correlation between variables. The first and third models generate daily temperatures without taking into account the previous day’s generated values; the second model does take into account previously generated values.

The rationale of developing, evaluating, and comparing three models instead of a unified model is to emphasize the advantages and disadvantages of spatial, temporal, and cross-correlations approaches alone. This facilitates the understanding of each step of each model, as well as the uncertainties and errors of each approach alone. This makes it possible for the readers to implement their own methods through the combination of the proposed models according to their own necessities.

The proposed models to generate maximum and minimum temperatures, together with previously published methods for generating rainfall [23], were implemented in a user-friendly software named Geo Spatial and Temporal weather generator (GiST-wg). The software can be downloaded at http://www.agenviromodeling.com/.

2.1.1. First Model: Model with Spatial Correlation

The first model generates maximum temperatures independently of minimum temperatures (Figure 1). This model only takes into account the spatial correlations between weather stations; it does not take into account temperatures generated on the previous day (Table 1).

Using historical data from all weather stations in the selected area, Pearson’s correlation coefficients (), which are the elements of the correlation matrix, were calculated using the following equation (see Table 2 for variable definitions):

According to the results of the Filliben correlation test for Gaussian distribution (a computationally simple variant of the Shapiro-Wilk test [25]) (results not shown), the daily variability of maximum and minimum temperatures tended to follow a Gaussian distribution in our study regions. (Null hypotheses that the observed data were drawn from a fitted hypothetical Gaussian distribution were not rejected at levels ranging from 0.5% to 10% depending on month and weather station.) This is a requirement of the method [26]. If the data is not normally distributed, a previous transformation is recommended.

Taking into account the pairwise correlations between locations, a vector of independent standard Gaussian random numbers () is transformed into a vector (). The elements of are correlated random numbers () that follow the observed correlations between stations [27]. To obtain , a vector , with elements and with length equal to the number of locations, is matrix-multiplied by the Toeplitz-Cholesky factorization matrix (; [2831]), which is calculated, according to the case, based on the correlation matrix of maximum () or minimum () temperatures (An alternative to the Toeplitz-Cholesky factorization matrix is the eigendecomposition matrix () [32, 33]. According to our experience, both methods perform equally well.) (see Section 2.2).

Afterward, the elements in are rescaled according to the Gaussian distributed percentages of the population being included (e.g., from −1 to 1 for 68% or −1.96 to 1.96 for 95%). Equation (3) is then used to generate correlated maximum temperatures () and minimum temperatures () using previously generated vector of correlated random numbers from a Gaussian distribution () at each location (the variables are defined in Table 2):

After the first set of maximum or minimum temperatures is generated for the first day of the time series, a new is then produced for each new generated day. Because is calculated for each month, only needs to be calculated once per month, and it is the same for all the generated years.

2.1.2. Second Model: Model with Spatial and Temporal Correlation

As with the previous model, the second model generates maximum temperatures independently of the minimum temperatures. The main difference is that this model takes into account the spatial correlations between weather stations during the day being generated () and the previous day () (Table 1; Figure 2).

Taking into account for both days: (a) the pairwise correlations between locations and (b) the pairwise 1-day lag autocorrelations between locations, a vector of independent standard Gaussian random numbers () with length equal to twice the number of locations is transformed into a vector () whose elements are correlated random numbers () that follow the observed correlations and the autocorrelations between stations. To obtain , a vector of random numbers () is matrix-multiplied by the Toeplitz-Cholesky factorization matrix () calculated based on the correlation matrix of maximum () or minimum () temperature. In this model, however, the correlation matrix is composed of two correlation matrices: the 0-day lag () and the 1-day lag (), and two autocorrelation matrices: and , as shown in (4), where and

Elements in are then rescaled according to the Gaussian distributed percentages of the population being included. Using (3), elements of this vector are used to finally obtain for all locations for two consecutive days. By producing two days at a time, the 1-day lag autocorrelation is considered in the process. However, if we proceed in the same manner to produce days 3 and 4 of the time series, an inconsistency between days 2 and 3 will be created. Then, only one day at a time can be added to the time series. Because the temperature (maximum or minimum) has already been generated for day 2, it is impossible to numerically solve the set of equations because half of the correlated random numbers resulting from the multiplication are already known from the initial calculation (days 1 and 2). To solve this set of equations, an iterative process is performed. Here, a given number of are generated, forming an ensemble. Half of the vector's elements of each are used to generate day 2 (), and the other half are used to generate day 3 (). Those elements corresponding to day 2 are then compared to their counterparts previously calculated when generating days 1 and 2. For the comparison, the root mean square error (RMSE) is calculated. After the RMSEs are calculated for every ensemble member, the one that produces the lower RMSE is selected. This is then rescaled and used in (2) to generate the temperatures of day 3 for all locations. The same procedure is repeated to generate the complete required time series.

2.1.3. Third Model: Model with Spatial and Cross-Correlation

The third model generates maximum and minimum temperatures simultaneously, taking into account the existing cross-correlations between these variables (Figure 3). This model takes into account the spatial correlations between weather stations, but not temperatures generated the previous day (Table 1).

Using historical data from all weather stations in the selected area, Pearson’s correlation coefficients are calculated using (1). The cross-correlations between maximum and minimum temperatures (), which are the elements of the cross-correlation matrix, are calculated using:

As in the second model, two sets of temperatures are calculated simultaneously. In this model, these sets correspond to the maximum and minimum temperatures of the same day. The correlation matrix in this case is composed of two correlation matrices and the cross-correlation matrix (6):

Similar to the second model, a vector of independent standard Gaussian random numbers (), with length equal to twice the number of locations, is transformed into a vector () whose elements are correlated random numbers () that follow the observed correlations between stations and cross-correlations between maximum and minimum temperatures within the same station and between stations. To obtain , a vector of random numbers () is matrix-multiplied by the Toeplitz-Cholesky factorization matrix () calculated based on the newly composed correlation matrix produced by (6).

The elements in are then rescaled according to the Gaussian distributed percentages of the population being included. Using (3), the elements of this vector are used to finally obtain and for all locations.

2.2. Factorization of the Correlation Matrix

To factorize a correlation matrix using Toeplitz-Cholesky factorization or eigendecomposition, the matrix must be a positive-definite matrix. Thus, the resulting upper triangular matrix after applying an LU-decomposition must strictly contain only positive diagonal entries. When this is the case, the factorization is a straightforward process.

When the size of the correlation matrix increases (due to a large number of weather stations or the use of more complex matrices) and/or the correlation values within the matrix are homogeneous (which is the case for temperatures), the correlation matrix may become nonpositive definite. To solve this problem, Rebonato and Jäckel [34] proposed two iterative methods: (a) the hyperspheric decomposition and (b) the spectral decomposition. Both methods are based on redefining the correlation matrix prior to the factorization to guarantee a positive-definite matrix. These time-consuming methods are adequate for relatively small matrices with heterogeneous correlation values within the matrix. For applications with a large number of variables, these methods are not guaranteed to produce a positive-definite matrix.

The solution proposed and tested in this paper is to apply an LU decomposition to the correlation matrix using an extension of Crout's algorithm [35, 36] (see Figure 9). This algorithm produces a total of equations for the unknown values of the - and -triangular matrices (the diagonal being represented twice). Because the number of unknowns is greater than the number of equations, the standard procedure is to set the of the unknowns arbitrarily to one and then to try to solve for the others. For this application, the procedure is repeated iteratively by slightly increasing the arbitrary values (steps of 0.01 in our case) until all the diagonal values of the -triangular matrix are positive (see the Appendix). After reaching this condition, the - and -triangular matrices are multiplied to produce the new correlation matrix. This new matrix guarantees the positive-definite matrix necessary for the factorization.

2.3. Case Study
2.3.1. Data

The 10 selected weather stations (Table 3) used in this study are located in the state of North Carolina between 36°35′N, 84°19′W and 33°50′N, 75°28′W. The daily observed data of maximum and minimum temperatures and rainfall from these weather stations were obtained from the NOAA/NWS/National Climate Data Center (http://nndc.noaa.gov/?home.shtml/). The distance between weather stations in North Carolina ranges from approximately 34 to 164 km. The period from 1974 to 2004 was selected to avoid the effects of temporal climatic shifts detected over time [9, 37].

2.3.2. Evaluation

As stated by Ferraris et al. [38], referring to the evaluation of downscaling models, the stochastic geospatial weather generator models presented in the current paper must be able to reproduce the statistics on which they have been optimized (observed spatial-, temporal- and/or cross-correlations; monthly mean and standard deviations), as well as being able to reproduce other statistics of the data not used in model development (e.g., RMSE, temporal semivariograms). To satisfy both requirements and to compare the three previously described models, one thousand year-long replications of daily maximum and minimum temperatures, along with rainfall events and amounts, were generated for the 10 selected weather stations. Correlation coefficients and root mean square errors (RMSEs), between the observed and the generated daily maximum temperature, were compared among the three models. The calculations were performed using the monthly correlations between all pairs of locations. The same comparison was performed for daily minimum temperatures. Correlation coefficients and root mean square errors (RMSEs) calculated based on the monthly correlations between all pairs of locations calculated from the observed and generated daily maximum (minimum) temperatures were compared among the three models. RMSEs were used to represent differences between spatial correlations of the historical records and spatial correlations of the generated temperatures. Cross-correlations between daily values of maximum and minimum temperatures for each month were calculated from the generated data and compared with their observed counterparts. Observed and generated multiannual monthly means of daily maximum and daily minimum temperatures were compared. For every month, a comparison between observed and generated interannual variance of daily maximum and minimum temperatures was performed by comparing standard deviations. The correlation coefficients and RMSEs were calculated from the mean and the standard deviations of the three models. Temporal semivariograms calculated from generated daily maximum and minimum temperatures of the three models were compared with the corresponding semivariograms calculated from the historical record.

3. Results and Discussion

3.1. Generation of Maximum and Minimum Temperatures

Figure 4 shows the comparisons between observed and generated correlations of daily maximum and minimum temperatures among all pairs of locations for each month. The three models reproduced the monthly observed correlations of maximum and minimum temperatures with correlation coefficients ranging from 0.876 to 0.989. For maximum temperatures, RMSE values (as a percentage) relative to the mean observed correlation among all locations were 2%, 10%, and 4% for the three models, respectively; for the minimum temperatures, relative RMSEs were 3%, 11%, and 7%, respectively.

As expected, only the third model reproduced the monthly observed cross-correlations between daily maximum and minimum temperatures. The correlation coefficient (relative RMSE) for the third model was 0.920 (10%), whereas the correlation coefficients (relative RMSEs) for the first and second models were 0.765 (57%) and 0.738 (67%), respectively (Figure 5).

Figure 6 shows the temporal structure of the observed and generated daily maximum and minimum temperatures from the weather station at Asheboro. Only one location is shown because all temporal semivariograms followed the same trend. The weather station was randomly selected. There were differences in the observed sill and range parameters of the temporal semivariograms. As expected, the second model adequately represented the sill parameter; however, the range parameter was 2 days shorter than the observed range. This behavior could be because only a 1-day lag was included in the correlation matrix; future research should explore the extention of the input matrix of the second model to 2-day or even 3-day lag. The third model overestimated the sill parameter but adequately reproduced the range parameter (better in the summer than in the winter). A hybrid model between the second and third models could potentially better represent both the sill and the range parameters of the temporal semivariogram. This should be the focus of future research. The first model produced a range of 2 days shorter than the second model, and it overestimated the sill parameter as did the third model.

Comparisons between observed and generated multiannual monthly means of daily maximum temperatures and daily minimum temperatures are shown in Figures 7(a), 7(b), 7(c), 8(a), 8(b), and 8(c), respectively. For both variables and for the three models, correlations were 1.000. Relative RMSEs for maximum temperature were 1%, 0.7%, and 0.2% for models from one to three, respectively. For minimum temperature, relative RMSEs were 1%, 2%, and 0.3%, respectively. The interannual variability of maximum temperatures (Figures 7(d), 7(e), and 7(f)) and minimum temperatures (Figures 8(d), 8(e), and 8(f)) were well reproduced by the three models. For maximum temperatures, correlation coefficients for the three models were 0.896, 0.900, and 0.965, respectively; for minimum temperatures, they were 0.901, 0.837, and 0.996, respectively. For maximum and minimum temperatures, relative RMSEs were 13% and 13% for the first model, 16% and 20% for the second model, and 27% and 14% for the third model, respectively. These comparisons were made between the observed and generated interannual variance of daily maximum and minimum temperatures, and they are directly related to the overdispersion problem (i.e., the underestimation of the interannual variability) reported in other weather generators [39]. All three models presented in this paper were able to reproduce the interannual variability of the observed temperatures; however all models, especially the third model, slightly underestimated the standard deviation.

3.2. Factorization of Nonpositive Definite Correlation Matrices

For this paper, only 10 weather stations were used to demonstrate the performance of the described models; however, the factorization method was tested with 65 weather stations (results not published). In the third model, the size of the correlation matrix reached elements. All twelve monthly correlation matrices were nonpositive definite. To factorize these matrices, the hyperspherical decomposition and the spectral decomposition methods [34] were applied, but without success. The extended Crout's algorithm needed less than 10 iterations to solve the set of equations. In the current study, the new n-unknown values assigned to solved the Crout's algorithms were less than or equal to 1.10.

4. Conclusions

The three proposed stochastic models for generating spatially and temporally correlated daily maximum and minimum temperature data reproduced the main statistics of the observed historical record of each individual weather station as well as the spatial correlations between pairs of weather stations for each month, the temporal correlations within sites, and/or the cross-correlations between variables. The proposed models did not show the overdispersion problem on the interannual variability of maximum and minimum temperatures. Among the presented models, the final selection will depend on the specific user’s necessities regarding the relative importance of taking into account spatial, temporal, and/or cross correlations.

The method for transforming a non-positive definite matrix into a positive definite matrix to satisfy the matrix factorization requirements produced satisfactory results.

The methods proposed in this paper can be used in the same manner in any other region (larger area, different density of the station network, complex topography, etc.). After some simple transformation to fit other non-Gaussian distributions, the methods proposed can be applied to different time step duration (e.g., hourly), other variables [10], or being applied outside the atmospheric sciences to datasets showing spatial-, temporal- and/or cross-correlated structure. Also, the use of semivariograms based on the spatial correlation could help if one wants to apply the model on a region where the length of the dataset is too short to permit a proper calculation of the correlation matrices. To follow the World Meteorological Organization’s guidelines, a minimum of 30 years of daily historical records are necessary to calibrate the statistical models; however, to consider the effects of climate change, specific analyses must be performed to determine the best historical record length [9].

The idea of mixing the second (spatial and temporal correlations) and the third (spatial and cross-correlations) models will be carried out in future researches. The expectations of this hybrid model is to surpass its predecessor, generating data that is spatial-, temporal- and cross-correlated, closely matching the observed historical record; however, it is also expected that the necessary computational resources will increase.

Appendix

Extension of Crout's algorithm

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research project was supported in part through grants from: the Agriculture and Good Research Initiative Competitive Grant no. 2011-00828 from the USDA National Institute of Food and Agriculture; the National Atmospheric and Space Administration—Earth Science Division Applied Sciences Program (NNH08ZDA001N-DECISIONS and NNH09ZDA001N-GULF), and it was developed under the auspices of AgMIP: The Agricultural Model Intercomparison and Improvement Project, and the CropClimate Initiative. The views expressed in this paper are those of the author and do not necessarily reflect the views of NASA, USDA, AgMIP, or any of its subagencies. The author wishes to thank two anonymous reviewers for their observations.