#### Abstract

In the current context of climate change discussions, predictions of future weather and climate scenarios are crucial for generating information of interest to the global community. Because the atmosphere is a chaotic system, errors are systematically observed in such predictions. Numerous techniques have therefore been tested in order to generate more reliable predictions, and two have stood out: dynamic downscaling, through regional models, and ensemble prediction, which combines the outputs of different climate model runs through the arithmetic average, in other words, a kind of postprocessing of the output data. This paper proposes a method of postprocessing the outputs of regional climate models. The method uses multiple linear regression by principal components to combine different simulations obtained by dynamic downscaling with the regional climate model RegCM4. Tests for the Amazon and Northeast regions of Brazil (South America) showed that, in terms of average daily rainfall for the analyzed period, the method provided a more realistic prediction than the ensemble prediction obtained through the arithmetic average of the simulations. The method also captured extreme events (outliers) that the prediction by averaging missed. Data from the Tropical Rainfall Measuring Mission (TRMM) were used to evaluate the method.

#### 1. Introduction

General circulation models (GCMs) have been used for climate prediction over Brazil [1, 2]. Although these models represent the influence of synoptic-scale weather systems and aspects of the general circulation, they have limitations in representing mesoscale processes, such as squall lines and meteorological systems formed by complex topography, watersheds, and other features [3]. Thus, given the large size of the Brazilian territory and the complexity of its terrain and biomes, GCMs are limited in their representation of regional aspects of the climate of Brazil.

A solution to this problem is the use of the downscaling technique through regional climate models (RCMs). Over South America, several studies [1, 4–7] have shown the skill of RCMs relative to GCMs. The effectiveness and suitability of this technique are due to the possibility of using physical parameterizations more appropriate for the mesoscale, owing to the increased spatial resolution. These characteristics are important because, in regions such as Brazil, mesoscale forcing regulates the spatial and temporal distribution of atmospheric variables; representing it reduces the errors of GCMs, which are run at low spatial resolution [8].

An important regional model is the regional climate model (RegCM), which was originally developed at the National Center for Atmospheric Research (NCAR) during the 1980s [9, 10]. Owing to the contributions of many researchers, there are six versions of RegCM: RegCM1, RegCM2, RegCM2.5, RegCM3, RegCM4, and RegCM4.1. It is widely used because its code is public and open source and because it has good computational performance.

Despite the progress achieved in regional modeling in recent years, many aspects remain to be explored, evaluated, and improved before RCMs can represent the climate substantially better than GCMs.

The systematic errors that regional models, including RegCM, exhibit in different regions, especially over the tropics, are due to poorly adjusted physical parameterizations, especially those of cumulus convection and grid-scale precipitation [11, 12].

The most widely used technique to overcome the lack of adjustment in the parameterizations and to reduce forecasting errors is ensemble prediction, which consists of combining multiple simulations, performed with different initial conditions or parameterizations, for the same period and region. Studies have shown that this method produces results more consistent with observations.

For South America (SA), and consequently for Brazil, the situation is no different: there is a need for studies that aim to improve regional models and the techniques for treating model output. However, several studies have focused on simulations with the standard model configuration [3, 8, 15–17] and, for ensemble prediction, on the arithmetic mean. Therefore, this paper investigates the possibility of improving RegCM4 regional simulations through proper adjustment of the physical parameterizations and the use of appropriate statistical methods to combine multiple simulations. To this end, we use multiple linear regression with principal components to combine different RegCM4 simulations. To test the method, we apply it to the period from February to June 1998, an atypical, El Niño year.

The work is organized as follows. Section 2 briefly presents the physical parameters of the regional climate model RegCM4 that most influence the simulated rainfall, together with the input data of the model and the data used to verify the proposed method. Section 3 presents the method of multiple linear regression using principal components to combine different simulations. Section 4 presents the results. Finally, Section 5 presents the conclusions and discussion.

#### 2. Model Used and Numerical Experiments Performed

##### 2.1. Regional Climate Model (RegCM4)

RegCM4 [18] is the fifth version of RegCM, originally developed by the National Center for Atmospheric Research (NCAR) [19] and based on the mesoscale model MM5. It is a limited-area model discretized on an Arakawa B grid, with sigma coordinates in the vertical. The primitive equations, which form the dynamical core of the model, are those for a compressible hydrostatic fluid [20].

The physical processes are represented in the model by a series of parameterizations. The radiative transfer scheme is the same as that used in the global Community Climate Model version 3 (CCM3). This scheme calculates the interaction of gases (H_{2}O, CO_{2}, O_{3}, CH_{4}, N_{2}O, and CFC) and aerosols with radiation in the infrared and ultraviolet. For the soil-vegetation-atmosphere interaction, RegCM4 offers the biosphere-atmosphere transfer scheme (BATS) and the community land model (CLM, version 3.5). The full description of the model and of its parameterization options is given in [18].

The model has three options of convective schemes: (i) the Kuo scheme, the simplest, which is activated when the moisture convergence exceeds a threshold value; (ii) the Grell scheme [20], which treats the cloud as an entraining plume model composed of an updraft and a downdraft; and (iii) the MIT-Emanuel parameterization [21], which triggers convection when the level of free convection is higher than the cloud base. In the Grell scheme, the interaction with the atmosphere, via entrainment of air, occurs only at the top and at the base of the cloud, and convection is activated when the updraft reaches the moist adiabat. This scheme is most sensitive to the precipitation efficiency (PEFF) parameter, which quantifies the portion of precipitation that evaporates before reaching the ground; therefore, high PEFF values decrease precipitation. Two types of closure can be used with the Grell scheme: Arakawa and Schubert (all the potential energy available for convection is adjusted at each time step [13]) and Fritsch and Chappell (1980) (convective adjustment on a time scale of the order of 30 minutes [14]).

For stratiform precipitation, RegCM4 uses the subgrid explicit moisture scheme (SUBEX), which was developed by [22]. The formulation for the autoconversion of cloud water into precipitation is as follows:

$$P = C_{ppt}\left(\frac{Q_c}{FC} - Q_c^{th}\right) FC, \tag{1}$$

where $P$ is the conversion rate, $Q_c$ is the amount of water present, $Q_c^{th}$ is the minimum amount of water that must remain in the cloud, and $C_{ppt}$ is the conversion factor of water present into precipitation. The value of $FC$, the fractional cloud cover, depends on the minimum relative humidity ($RH_{min}$) for cloud formation, according to the equation

$$FC = \sqrt{\frac{RH - RH_{min}}{RH_{max} - RH_{min}}}. \tag{2}$$

The value of $RH_{max}$ is 101%, $RH_{min}$ may vary between 1 and 100%, and $RH$ is the local relative humidity. The threshold amount of water in the cloud is given by

$$Q_c^{th} = C_{acs}\, 10^{-0.49 + 0.013\,T}, \tag{3}$$

where $T$ is the temperature in degrees Celsius and $C_{acs}$ is a scaling factor.
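As a rough numerical illustration of the autoconversion step, the sketch below evaluates the SUBEX relations in the form usually quoted from [22]: $P = C_{ppt}(Q_c/FC - Q_c^{th})FC$, $FC = \sqrt{(RH - RH_{min})/(RH_{max} - RH_{min})}$, and $Q_c^{th} = C_{acs}\,10^{-0.49+0.013T}$. The function names and the default values of `rh_min`, `c_ppt`, and `c_acs` are hypothetical, chosen only for illustration; they are not the settings used in the experiments, and units are left abstract.

```python
import math

RH_MAX = 1.01  # maximum relative humidity (101%), as stated in the text


def cloud_fraction(rh, rh_min):
    """Fractional cloud cover FC from local RH and threshold RH_min (as fractions)."""
    if rh <= rh_min:
        return 0.0  # no grid-scale cloud below the threshold
    fc = math.sqrt((rh - rh_min) / (RH_MAX - rh_min))
    return min(fc, 1.0)


def threshold_cloud_water(temp_c, c_acs):
    """Threshold cloud water Q_c^th = C_acs * 10**(-0.49 + 0.013 * T), T in deg C."""
    return c_acs * 10.0 ** (-0.49 + 0.013 * temp_c)


def autoconversion_rate(q_c, rh, temp_c, rh_min=0.8, c_ppt=5.0e-4, c_acs=0.4):
    """Rate P = C_ppt * (Q_c / FC - Q_c^th) * FC, floored at zero.

    The default parameter values are illustrative assumptions only.
    """
    fc = cloud_fraction(rh, rh_min)
    if fc == 0.0:
        return 0.0
    excess = q_c / fc - threshold_cloud_water(temp_c, c_acs)
    return max(0.0, c_ppt * excess * fc)
```

In this form, raising `rh_min` lowers the cloud fraction for a given humidity, which is one way the "dry" and "wet" SUBEX settings can differ.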

##### 2.2. Numerical Experiments

###### 2.2.1. Data

The initial and boundary conditions of the atmosphere (wind, temperature, surface pressure, and water vapor) used in the simulations are from the ERA-Interim reanalysis. ERA-Interim is a global atmospheric dataset produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), with a horizontal grid spacing of 1.5° by 1.5° and a frequency of six hours (00:00, 06:00, 12:00, and 18:00 UTC) [23]. The topography and land cover are from the United States Geological Survey (USGS) and the Global Land Cover Characterization (GLCC), with a horizontal grid spacing of 60 minutes [24].

The sea surface temperature (SST) dataset used was produced by the National Oceanic and Atmospheric Administration (NOAA) from in situ and satellite data, through optimal interpolation (OI) [25]. The data are weekly, centered on Wednesday, and available from 1989 to the present day, with a resolution of 1.0° by 1.0°.

The simulated precipitation is compared with data from the Tropical Rainfall Measuring Mission (TRMM), product 3B42-V7. These data are obtained from satellite infrared channels, with a resolution of 0.25° by 0.25° in latitude and longitude [26].

###### 2.2.2. Configuration of the Experiments

Seven simulations were performed for the austral autumn, beginning at 00:00 UTC on February 15, 1998, and ending at 00:00 UTC on June 1 of the same year. February was discarded as the adjustment (spin-up) period of the model.
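The analysis window implied by this setup can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes daily values with the spin-up removed by dropping every day before March 1.

```python
from datetime import date, timedelta

run_start = date(1998, 2, 15)      # simulations start at 00:00 UTC on this day
run_end = date(1998, 6, 1)         # simulations end at 00:00 UTC on this day
analysis_start = date(1998, 3, 1)  # February is treated as spin-up

# Full simulated days: February 15 through May 31.
all_days = [run_start + timedelta(days=i) for i in range((run_end - run_start).days)]

# Discard the spin-up period and keep March, April, and May.
analysis_days = [d for d in all_days if d >= analysis_start]

print(len(all_days), len(analysis_days))
```

Under these assumptions the analysis period covers March 1 to May 31, 1998.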

The model grid spacing is 50 km, with 18 vertical levels and the top at 5 hPa. The domain and the topography are shown in Figure 1. Two regions are analyzed: the Amazon (AM) and the Northeast (NE) region of Brazil, as indicated in Figure 1. Table 1 summarizes the settings of the experiments, which varied according to the convective scheme (Grell or MIT-Emanuel), the minimum relative humidity for cloud formation at grid scale ($RH_{min}$), and the dynamic control (closure) of the Grell scheme (Arakawa-Schubert or Fritsch-Chappell); in addition, different PEFF values were used when the convective scheme was Grell.

#### 3. Multiple Linear Regression Using Principal Components

To minimize the error in climate forecasts, predictions with several different configurations are generated and combined. This method is called ensemble prediction [27]. Usually the ensemble prediction is made via a simple arithmetic average (AA) of different simulations or models, or via weighting by measures of dispersion.
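The arithmetic-average ensemble is simple enough to state in code. The sketch below assumes each ensemble member is a list of daily values of equal length; the function name is illustrative.

```python
from statistics import fmean


def arithmetic_average_ensemble(members):
    """Average the members day by day.

    members: list of equal-length daily series, one per simulation.
    Returns one series with the mean across members for each day.
    """
    return [fmean(day_values) for day_values in zip(*members)]


# Two hypothetical members with three daily values each.
print(arithmetic_average_ensemble([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
```

Every member receives the same weight, regardless of how much of its information is shared with the other members.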

In this paper we compare the usual method with multiple linear regression using principal components (here called the PCR method) to produce a combination of the seven experiments described in Section 2.2.2.

Multiple linear regression is a multivariate technique that consists in finding a linear relationship between a dependent variable (the response variable), in this case the observed data, and more than one independent variable (the predictor variables) that describe the system; here, these are the outputs of the climate model RegCM4.

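The following equation shows this relationship:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \tag{4}$$

where $y$ is the variable to be estimated, $x_1, \ldots, x_k$ are the predictor variables, $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_k$ are the coefficients of the multiple linear regression, to be estimated by the least squares method [28], and $\varepsilon$ is the error. This method consists in finding the solution that minimizes the sum of squared residuals, a residual being the difference between the observed and the predicted (estimated) value.

The problem of multiple linear regression is to find the coefficients that relate the independent variables to the dependent variable; this step can be called the calibration of the regression model. To find this solution we rewrite (4) in matrix form for $n$ observations, taking the $\mathbf{Y}$-matrix with the dependent variable, the $\mathbf{X}$-matrix with the independent variables, the $\boldsymbol{\beta}$-matrix with the coefficients, and the $\boldsymbol{\varepsilon}$-matrix with the errors:

$$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix},\quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \tag{5}$$

Rewriting the problem in matrix form, we have

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \tag{6}$$

Multiplying the $\mathbf{X}$-matrix by the $\boldsymbol{\beta}$-matrix and adding the $\boldsymbol{\varepsilon}$-matrix, we obtain (4) again, but in matrix form:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{7}$$

The least squares method is used to find the coefficients of the multiple linear regression under the condition that the sum of the squares of the errors be a minimum. For this, we isolate the error in (4) for each observation $i$, getting

$$\varepsilon_i = y_i - \left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\right). \tag{8}$$

Then the sum of squared errors (SSE, shown in (9) and in matrix form in (10)) is minimized by setting its derivative with respect to the $\boldsymbol{\beta}$-matrix equal to zero, as shown in (11). By isolating the $\boldsymbol{\beta}$-matrix (step not shown), we obtain the solution of the multiple linear regression in (12):

$$\mathrm{SSE} = \sum_{i=1}^{n} \varepsilon_i^{2} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik}\right)^{2}, \tag{9}$$

$$\mathrm{SSE} = \boldsymbol{\varepsilon}^{T}\boldsymbol{\varepsilon} = \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)^{T}\left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right), \tag{10}$$

$$\frac{\partial\,\mathrm{SSE}}{\partial \boldsymbol{\beta}} = -2\,\mathbf{X}^{T}\mathbf{Y} + 2\,\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta} = \mathbf{0}, \tag{11}$$

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{Y}. \tag{12}$$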
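The least squares solution can be sketched numerically with NumPy. This is an illustration on synthetic data, not the authors' code: the function name `fit_ols` and the demo data are assumptions, and the normal equations are solved directly rather than by forming the inverse explicitly.

```python
import numpy as np


def fit_ols(x, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of ones that carries the intercept b0.
    design = np.column_stack([np.ones(len(x)), x])
    # Solve the normal equations (X'X) b = X'y.
    return np.linalg.solve(design.T @ design, design.T @ y)


# Synthetic data: y = 1 + 2*x1 - 3*x2 plus small noise (illustrative only).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * x_demo[:, 0] - 3.0 * x_demo[:, 1] + 0.01 * rng.normal(size=50)
beta_hat = fit_ols(x_demo, y_demo)
print(beta_hat)
```

The call to `np.linalg.solve` fails or becomes numerically unreliable when the cross-product matrix is singular or nearly so, which is the multicollinearity problem.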

A possible obstacle to finding the solution of (12) is that the matrix $\mathbf{X}^{T}\mathbf{X}$ cannot be inverted. In other words, it can be a singular matrix, which happens when some predictor variables are linear combinations of others, so that the independent variables are correlated. When this occurs, there is multicollinearity and the least squares estimators of the parameters are not unique. For climate ensemble prediction, the simulations with different configurations of a single climate model are correlated. Thus, to avoid multicollinearity we use the principal components of the simulations. This technique aims to explain the variance and covariance structure of a random vector by constructing linear combinations of the original variables, which, for this problem, are the predictor variables of the multiple linear regression. These linear combinations are called principal components and are uncorrelated [29]. Therefore, the principal components of the explanatory variables are a new set of variables with the same information as the original variables, but uncorrelated, which eliminates multicollinearity. The use of principal components to fit a multiple linear regression model was initially proposed by [30]. This technique is called multiple linear regression using principal components.
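A small synthetic experiment illustrates both points: strongly correlated predictors make the cross-product matrix ill-conditioned, while their principal components are uncorrelated. The three "simulations" below are hypothetical series that share most of their signal; nothing here comes from the RegCM4 runs.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=200)
# Three hypothetical simulations: a common signal plus small independent noise.
sims = np.column_stack([base + 0.05 * rng.normal(size=200) for _ in range(3)])

centered = sims - sims.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # columns of eigvecs are eigenvectors
pcs = centered @ eigvecs                # principal components

corr_sims = np.corrcoef(sims, rowvar=False)        # near 1 off the diagonal
cov_pcs = np.cov(pcs, rowvar=False)                # diagonal up to rounding
cond_sims = np.linalg.cond(centered.T @ centered)  # large: nearly singular

print(np.round(corr_sims, 3))
print(np.round(cov_pcs, 6))
print(cond_sims)
```

The covariance matrix of the components is diagonal, with the eigenvalues on the diagonal, so no component is a linear combination of the others.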

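The first step is to find the principal components (PCs), the $\mathbf{Z}$-matrix, of the matrix of predictor variables $\mathbf{X}$ (that is, its predictor columns, excluding the column of ones), where the relationship between them is given by

$$\mathbf{Z} = \mathbf{X}\mathbf{V},$$

where $\mathbf{V}$ is the orthogonal matrix of dimension $k \times k$ ($k$ is the number of predictor variables) consisting of the eigenvectors of the covariance matrix or correlation matrix of $\mathbf{X}$. Thus, (7) and (12) can be rewritten in the forms

$$\mathbf{Y} = \mathbf{Z}\boldsymbol{\alpha} + \boldsymbol{\varepsilon},$$

$$\hat{\boldsymbol{\alpha}} = \left(\mathbf{Z}^{T}\mathbf{Z}\right)^{-1}\mathbf{Z}^{T}\mathbf{Y},$$

where $\mathbf{Z}$ is augmented with a column of ones for the intercept. Once the $\mathbf{V}$-matrix, with the weights of each simulation, and the $\hat{\boldsymbol{\alpha}}$-matrix, with the regression coefficients, have been found, the regression model is calibrated; these matrices should then be used as the setting for a new ensemble prediction. The eigenvectors in the $\mathbf{V}$-matrix, which provide the weights of each predictor variable, are used to find the new matrix of principal components $\mathbf{Z}_{new}$ of the new simulations $\mathbf{X}_{new}$, given by

$$\mathbf{Z}_{new} = \mathbf{X}_{new}\mathbf{V}.$$

After the principal components have been found, the ensemble prediction is obtained with the coefficient $\hat{\boldsymbol{\alpha}}$-matrix by the relation

$$\hat{\mathbf{Y}} = \mathbf{Z}_{new}\hat{\boldsymbol{\alpha}}.$$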
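The calibration and prediction steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the predictors are centered before projection, the covariance (rather than correlation) matrix is used, all components are retained, and the function names are hypothetical.

```python
import numpy as np


def calibrate_pcr(x, y):
    """Calibrate a principal component regression.

    x: (n_days, k) simulated precipitation; y: (n_days,) observations.
    Returns the column means of x, the eigenvector matrix V, and the
    regression coefficients alpha (intercept first).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = x.mean(axis=0)
    # Eigenvectors of the covariance matrix of the predictors.
    _, v = np.linalg.eigh(np.cov(x - x_mean, rowvar=False))
    z = (x - x_mean) @ v  # principal components
    design = np.column_stack([np.ones(len(z)), z])
    alpha, *_ = np.linalg.lstsq(design, y, rcond=None)
    return x_mean, v, alpha


def predict_pcr(x_new, x_mean, v, alpha):
    """Project new simulations onto the calibrated PCs and apply alpha."""
    z_new = (np.asarray(x_new, dtype=float) - x_mean) @ v
    return alpha[0] + z_new @ alpha[1:]
```

When all components are kept, as here, the fitted values coincide with those of an ordinary least squares fit on the original predictors; the benefit of the principal components lies in the uncorrelated, well-conditioned predictors and in the option of discarding components.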

Multiple linear regression using principal components can work with all the PCs obtained from the original data, or only with the components that have the highest correlation with the response variable [31]. In the latter case, the errors can be minimized.
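One way to implement the second option is to rank the components by the absolute value of their correlation with the response and keep the top few. The sketch below is only an illustration of that idea; the function name and the choice of `n_keep` are assumptions, and the text does not specify the selection rule used in [31].

```python
import numpy as np


def select_components(z, y, n_keep):
    """Rank principal components by |correlation| with the response.

    z: (n, k) matrix of principal components; y: (n,) response.
    Returns the sorted indices of the n_keep best components and the
    correlation of every component with y.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.array([np.corrcoef(z[:, j], y)[0, 1] for j in range(z.shape[1])])
    order = np.argsort(-np.abs(corr))  # strongest correlation first
    return np.sort(order[:n_keep]), corr
```

The regression is then fitted on the selected columns of the component matrix only.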

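To analyze the results, the Bias, mean error (ME), mean absolute error (MAE), and root mean square error (RMSE) were calculated according to (17), (18), (19), and (20), respectively, where $O_i$ is the observed precipitation, $P_i$ is the predicted precipitation, and $n$ is the number of data:

$$\mathrm{Bias}_i = P_i - O_i, \tag{17}$$

$$\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right), \tag{18}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right|, \tag{19}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^{2}}. \tag{20}$$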
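These error measures are straightforward to compute. The sketch below assumes the standard definitions of ME, MAE, and RMSE, with the error taken as predicted minus observed; the function names are illustrative.

```python
import math


def mean_error(obs, pred):
    """Mean of (predicted - observed); positive values mean overestimation."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def mean_absolute_error(obs, pred):
    """Mean magnitude of the error, ignoring its sign."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def root_mean_square_error(obs, pred):
    """Square root of the mean squared error; penalizes large errors more."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


# Hypothetical daily precipitation values.
observed = [2.0, 4.0, 6.0]
predicted = [3.0, 3.0, 9.0]
print(mean_error(observed, predicted))
print(mean_absolute_error(observed, predicted))
print(root_mean_square_error(observed, predicted))
```

Errors of opposite sign cancel in the ME but not in the MAE or RMSE, which is why a near-zero ME alone does not imply a small error.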

#### 4. Results

##### 4.1. The Regression Model via Principal Component

The TRMM data, which are the dependent variable $\mathbf{Y}$, were obtained as the average daily precipitation, from March 1 to May 31, over the Amazon and Northeast regions shown in Figure 1. The independent variables, the simulated precipitation of the seven experiments ($\mathbf{X}$-matrix), were obtained in the same way. Preliminary tests indicated that a larger number of simulations improves the ensemble prediction.

The first step was to find the seven principal components of the $\mathbf{X}$-matrix, which compose the $\mathbf{Z}$-matrix (Section 3). Although the cumulative explained variance reaches 86% and 96% at the fourth principal component for the AM and NE regions, respectively (see Tables 2 and 3), all seven PCs (not shown here) were used in the implementation of the PCR method, because each one captures a different parameter of the RegCM4 configuration, except for the first component, which is a measure of the intensity of the rain. PC[2] separates the effect of the different cumulus parameterizations used, Grell and MIT-Emanuel; PC[3] differentiates PEF/Wet and PEF/Dry associated with the Grell scheme; PC[4] captures the difference between SUBEX/Dry and SUBEX/Wet associated with the Grell scheme; PC[5] distinguishes the different PEFF values associated with the different cloud closures; PC[6] differentiates the cloud closure used for the Grell parameterization; and PC[7] captures the difference between the association of the Emanuel parameterization with SUBEX/Dry and with SUBEX/Wet. Finally, to run the PCR, the regression equations (21) give the regression coefficients that associate each principal component with the precipitation, for the Amazon (Prec_{AM}) and Northeast (Prec_{NE}) regions of Brazil, respectively. These equations allow us to estimate the average daily precipitation for the period analyzed from new simulations, using the same coefficients.

For the regression model to be appropriate, three requirements must be satisfied: (i) the residuals must be randomly distributed around a mean of zero, (ii) the residuals must have a normal distribution, and (iii) the variance must be homogeneous. The residuals in Figures 2(a) and 2(d), for the Amazon and Northeast of Brazil, respectively, do not appear to present any particular pattern or trend.

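**Figure 2:** Regression diagnostics for the Amazon (a, b, c) and the Northeast (d, e, f): (a, d) residuals; (b, e) QQ-plots of the residuals; (c, f) square root of the normalized residuals versus predicted values.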

The plots in Figures 2(b) and 2(e) show the quantiles of the residuals versus the quantiles of the normal distribution (QQ-plots) for the Amazon and Northeast, respectively. They are used to verify the assumption of normality of the residuals: the closer the points are to a straight line, the closer the residuals are to a normal distribution. Figures 2(c) and 2(f) show that the square root of the normalized residuals is randomly distributed against the predicted values, indicating homogeneity of variance. Therefore, we conclude that the model satisfies the three conditions.
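The points of a QQ-plot can be computed with the Python standard library alone. The sketch below is an illustration, not the procedure used to produce Figure 2: it standardizes the residuals and pairs them with standard normal quantiles at the plotting positions $(i - 0.5)/n$, which is one common convention among several.

```python
from statistics import NormalDist, fmean, pstdev


def qq_points(residuals):
    """Pairs (theoretical normal quantile, standardized sample quantile).

    Requires residuals that are not all equal (nonzero standard deviation).
    """
    n = len(residuals)
    mu, sigma = fmean(residuals), pstdev(residuals)
    sample = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    theory = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theory, sample))


for theoretical, observed_q in qq_points([-2.0, -1.0, 0.0, 1.0, 2.0]):
    print(round(theoretical, 3), round(observed_q, 3))
```

If the residuals are close to normal, the pairs fall near the line where both coordinates are equal.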

##### 4.2. The Performance of Regression Model

For the AA method, we calculated the arithmetic mean of the seven simulations. To compare the performance of the PCR and AA methods in representing the daily rainfall, Figure 3 plots the TRMM data against the simulated values for both methods and both regions (Amazon and Northeast), with the PCR method in Figures 3(a) and 3(c) and the AA method in Figures 3(b) and 3(d). The PCR ensemble shows a better correlation with the TRMM data than the AA ensemble, especially in the Amazon region.

**Figure 3:** TRMM data versus simulated daily rainfall for the Amazon and Northeast regions: (a, c) PCR method; (b, d) AA method.

Although the North and Northeast of Brazil are both located in the tropics, climate model simulations respond differently in the two regions. Overall, the simulations for the Northeast converge to the observations, presenting a smaller bias than those for the northern region. This is due to differences in topography, distance to the ocean, diversity of vegetation types, forms of land use, and other factors. Therefore, the gain from the PCR method is more pronounced in the region with the largest bias.

From the boxplots of the TRMM data and of the PCR and AA ensembles in Figure 4, we find that the median and interquartile range of the AA ensemble diverge significantly from those of the TRMM data. The PCR ensemble, in contrast, is similar to the TRMM data in median precipitation and variance. The PCR method slightly underestimates the intense events and overestimates the weak events. Moreover, it is able to capture two extreme events (outliers), in accordance with the TRMM data.

**Figure 4:** Boxplots of the TRMM data and of the PCR and AA ensembles, panels (a) and (b).
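The quantities read from a boxplot (median, interquartile range, outliers) can be computed as in the sketch below. It assumes the common convention that points farther than 1.5 times the interquartile range from the quartiles are drawn as outliers; the text does not state which convention Figure 4 uses, and the function name and sample values are illustrative.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Median, interquartile range, and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": median, "iqr": iqr, "outliers": outliers}


# Hypothetical daily rainfall values with one extreme event.
print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30]))
```

Applying the same summary to the observations and to each ensemble gives a compact way to compare medians, spreads, and the extreme events each one flags.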

The variability of the observations explained by the simulations was approximately 40% for the PCR ensemble. This value is higher than that obtained with the AA method, which was 28% (see Table 4). The $F$-statistic, also shown in Table 4, is higher than the tabulated $F$-value, which for a confidence level of 95% is 2.214. The probability of obtaining this result by chance is measured by the $p$-value, which was very low for the PCR method.
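The explained variability and the test statistic can be sketched as follows, assuming the usual coefficient of determination and the overall $F$-statistic of a least squares regression with an intercept. The function name is illustrative, and the degrees of freedom behind the values in Table 4 are not stated in the text.

```python
import numpy as np


def r2_and_f(y, y_hat, n_predictors):
    """Coefficient of determination and overall F-statistic of a fit.

    Assumes y_hat differs from y for at least one point (R2 < 1).
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)     # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - sse / sst
    f_stat = (r2 / n_predictors) / ((1.0 - r2) / (n - n_predictors - 1))
    return r2, f_stat
```

The regression is judged significant when the computed statistic exceeds the tabulated value for the chosen confidence level and degrees of freedom.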

Table 5 shows the mean error (ME), mean absolute error (MAE), and root mean square error (RMSE), calculated according to (18), (19), and (20), respectively, for the PCR and AA ensembles. As expected, the ME was approximately zero for the PCR method in both regions, since Figures 2(a) and 2(d) show the residuals uniformly distributed around zero. This shows that the method has no trend of underestimation or overestimation. The MAE indicates the magnitude of the error. For the Amazon region, the MAE of the AA method was approximately twice the value obtained with the PCR method. For the Northeast region, the MAE was 8% lower with the PCR method. The RMSE results were similar to those of the MAE.

#### 5. Final Comments

Errors and uncertainties in weather and climate forecasting will always exist, owing to the several sources of error present in a simulation. These can be classified into two classes: incomplete or erroneous atmospheric initial conditions and inadequacy of the numerical model.

The errors in the initial conditions are due to instrumental limitations in data collection and to observations that are discrete and irregularly spaced, which increases the difficulty of interpolating them to the grid structure. In the case of limited-area models, the artificial boundary conditions increase the errors and uncertainties.

The inadequacy of the numerical model lies in the difficulty of representing the influence of all physical, chemical, and biological factors on the state of the atmosphere and its evolution in time.

With ensemble prediction by varying the physical parameterizations, the error due to the inadequacy of the model is minimized, since several possible representations of the state of the atmosphere are reproduced and a solution is generated from them. This decreases the probability of being surprised by extremes that a particular setting or parameter could not represent in the forecast.

By comparing the routinely used prediction method (AA) with the method presented here, we found that combining simulations that are correlated, in other words, simulations that bring the same information or contribution to the final solution, does not improve the prediction. A treatment is needed to remove the redundant information from the simulations, that is, a principal component analysis, followed by the assignment of specific weights to this new set of variables using multiple linear regression.

The PCR method performed better in the Amazon region, where the individual forecasts diverged more from the observations. For the Northeast region, where the bias was close to zero, the result was comparable to the average of the simulations. A significant advantage of the PCR method was its ability to capture extreme events (outliers) in both regions, since the prediction of these events is of interest to the community.

Further studies are still needed. Besides checking the effectiveness of the method for other regions and periods, it is necessary to apply it grid point by grid point to obtain a spatial distribution of precipitation, refining the process, instead of using the average over a region, as was done here for the purpose of a preliminary analysis.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.