#### Abstract

Two nonlinear regression methods, Bayesian neural network (BNN) and support vector regression (SVR), and linear regression (LR), were used to forecast the tropical Pacific sea surface temperature (SST) anomalies at lead times ranging from 3 to 15 months, using sea level pressure (SLP) and SST as predictors. Datasets for 1950–2005 and 1980–2005 were studied, with the latter period having the warm water volume (WWV) above the isotherm integrated across the equatorial Pacific available as an extra predictor. The forecasts indicated that the nonlinear structure is mainly present in the second PCA (principal component analysis) mode of the SST field. Overall, improvements in forecast skills by the nonlinear models over LR were modest. Although SVR has two structural advantages over neural network models, namely (a) no multiple minima in the optimization process and (b) an error norm robust to outliers in the data, it did not give better overall forecasts than BNN. Addition of WWV as an extra predictor generally increased the forecast skills slightly; however, the influence of WWV on SST anomalies in the tropical Pacific appears to be linear.

#### 1. Introduction

The El Niño-Southern Oscillation (ENSO) phenomena, the strongest climatic fluctuation on time scales ranging from a few months to several years, is characterized by interannual variations of the tropical Pacific sea surface temperatures (SST), producing warm (El Niño) and cold (La Niña) episodes [1]. As ENSO affects not only the tropical climate but also the extratropical climate [2], the successful prediction of ENSO is of great importance.

Since the early 1980s, much effort has been devoted to forecasting the tropical Pacific SST anomalies. ENSO forecast models can be categorized into three types: dynamical models, statistical models, and hybrid (statistical-dynamical) models [3, 4].

Most of the statistical models used for ENSO forecast have been linear models, for example, multivariate linear regression (LR) and canonical correlation analysis (CCA) [5]. However, ENSO has nonlinear features—for instance, the cold SST anomalies during La Niña are centred much further west than the warm anomalies during El Niño. This asymmetry can be explained by nonlinear physics [6].

To better model the nonlinear features of ENSO, neural network (NN) models, originally from the field of computational intelligence, have been used for nonlinear regression [7, 8] and nonlinear CCA [9, 10]. In Wu et al. [8], nonlinear regression by NN was used to predict each of the five leading principal components (PCs) of the tropical Pacific SST anomalies separately, and then the predicted PCs were combined with the corresponding eigenvectors (also called empirical orthogonal functions, EOFs) to yield the forecast of SST anomalies over the tropical Pacific.

NN methods, generally regarded as forming the first wave of breakthrough in machine learning, became popular in the late 1980s, whereas kernel methods (e.g., support vector regression SVR) arrived in a second wave in the second half of the 1990s [11, 12]. SVR has two advantages over NN models—it avoided the multiple minima problem associated with nonlinear optimization used in NN models, and robust error norms are used in SVR instead of the nonrobust mean squared error (MSE) norm.

The warm water volume (WWV) above the C isotherm integrated across the equatorial Pacific basin has been shown to lead eastern equatorial Pacific SST anomalies by about 7 months [13]. This relation is consistent with the recharge-discharge oscillator theory of Jin [14].

In this paper, we test if the forecasting of tropical Pacific SST anomalies by Wu et al. [8] can be improved in two ways: (a) can SVR forecast better than NN, and (b) will adding WWV as an extra predictor improve the forecasts linearly or nonlinearly? Section 2 describes the data sets and Section 3 the data analysis methods. Results are given in Section 4, followed by summary and conclusion in Section 5.

#### 2. Data

The warm water volume (WWV) index is defined as the volume integral of water above C isotherm between N––8 by Meinen and McPhaden [13]. The WWV index (for the period of January to December ) was obtained from the TAO Project Office of NOAA/PMEL, where ocean analysis data from the Australian Bureau of Meteorology Research Center (BMRC) were used to compute the index.

Monthly SST data at resolution were obtained from Version of the NOAA Extended Reconstructed SST (ERSST.v2). Data for the tropical Pacific region (–7–2) during the periods 1950–2005 and 1980–2005 were considered. Monthly sea level pressure (SLP) data available on a global grid were downloaded from NCEP/NCAR Reanalysis [15]. The data used here extend from January to December (first set) and from January to December (second set) over the tropical Pacific Ocean (–7–2).

The reason two data sets of SST and SLP data were used, one covering the period 1950–2005 and the other 1980–2005, is because the WWV data were unavailable prior to , hence the first data set did not contain WWV.

Removal of the climatological seasonal cycle and smoothing by a 3-month running mean were performed on the data to obtain the anomalies. Principal component analysis (PCA) was performed on the SST anomalies, with the five leading principal components (PC) retained. The first five PCs account for , , , , and , respectively, of the variance for the 1950–2005 SST data set, and , , , , and for the 1980–2005 data set. The PC time series for 1950–2005 are shown in Figure 1, while the corresponding spacial patterns, that is, the EOFs, are displayed in Figure 2. The first five SST PCs are the predictands for our nonlinear regression models; that is, a separate nonlinear regression model is built to predict each of the five SST PCs into the future by (the lead time).

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(a) EOF 1**

**(b) EOF 2**

**(c) EOF 3**

**(d) EOF 4**

**(e) EOF 5**

Our choice on the number of PCs to use for predictors and for predictands was based on Wu et al. [8], where A. Wu (pers. comm.) tested for the optimal number of PCs to use for the LR model (corresponding tests with NN models would be prohibitively expensive). For the predictor PCs, the number of PCs was varied between 1 to 20 for SST, and separately, 1 to 20 for SLP. For the number of predictand SST PCs, using more than 5 decreased the forecast skills of the LR model.

The predictors used are the first and PCs of SST and SLP, respectively, with the SLP PCs also supplied at time lags of , , and months to provide information on the past evolution of the SLP PCs. Hence there is a total of predictors. For NN and SVR, the 37 predictor time series were further compressed to 12 by a second PCA, also known as extended empirical orthogonal function (EEOF) analysis, or singular spectrum analysis (SSA) [8]. The first SSA PCs were taken to be the final predictors in the NN and SVR models. The forecast lead time is the time interval between the centre of the most recent predictor to the centre of the predictand period.

#### 3. Nonlinear Regression Models

Following Hsieh [12], we outline the basic ideas behind the nonlinear generalization of linear regression (LR) by neural networks (NN) and support vector regression (SVR). First consider the simple linear regression problem with a single predictand variable :

with being the predictor variables, and and being the regression parameters. In the NN approach to nonlinear regression, nonlinear *adaptive* basis functions (also called hidden neurons) () are introduced, so the linear regression is between and :

with typically

Since depends on nonlinearly (due to the nonlinear function tanh), the resulting minimization of the error function (basically the MSE) involves nonlinear optimization, with multiple minima in general, a major disadvantage of the NN approach.

For the NN model in (2) and (3), the model parameters , , and are grouped into a single weight vector . The optimal is obtained through minimization of the regularized error function:

where the first term is the MSE scaled by the factor , with denoting observed data, and the second term is a weight penalty or regularization term scaled by the factor . A relative large would suppress the use of large weights, thereby making the NN model less nonlinear, hence avoiding overfitting to the noise in the data. In this paper, the hyperparameters and are chosen based on the Bayesian framework, hence the name Bayesian NN (BNN) model [11]. The MATLAB code “trainbr” from the MATLAB Neural Network toolbox was used for the BNN model.

Finally, the following procedure was used to optimally select the number of hidden neurons (). The data set was partitioned into ten segments of which nine were used for building the model, and one was reserved to provide independent data for testing the model forecasts. The nine segments for model building will be referred to as D9, and the single testing segment, D1. D9 was further partitioned into two sets, DT for training and DV for validation. Two-thirds of the data was chosen for training, and the remaining third for validation, with the data chosen at random in blocks of at least months due to autocorrelation in the data.

With the value of , and a given random pick of DT and DV, the error function (4) was optimized a total of times from random initial . Since the error function contains many local minimum, there were potentially different models. The model with the lowest MSE over the training set DT was kept, with the error named . This model was then applied to the corresponding validation set DV and the MSE over DV (named ) was also computed. The set () characterized this particular choice of DT, DV, and . Next, using the same DT and DV, the whole process was repeated with to obtain (). This process was carried out for ranging from to , such that in the end there were different sets of pairs (); the corresponding to the lowest was stored as . A different pair of DT and DV was randomly chosen from the same D and the process was repeated until was determined. This was done a total of times which implies the set . Finally, the mode of the set (i.e., the most commonly occurring value in the set ) was chosen as the optimal number of hidden neurons that represents the data set D. With this choice of , a total of models were trained using the entire D set and tested on D. The forecast performance on D1 was evaluated in terms of the ensemble average of the 50 models. This entire process was repeated until all 10 segments had been tested.

Next, instead of adaptive basis functions in (3), we use *nonadaptive* basis functions ; that is, the basis functions do not have adjustable parameters , with

For instance, Taylor series expansion with 2 predictors would have The advantage of using nonadaptive basis functions is that does not depend on any parameter nonlinearly, and so only linear optimization is involved without the multiple minima problem. The disadvantage is that one generally needs a very large (or even an infinite) number of nonadaptive basis functions compared to relatively few adaptive basis functions to model a nonlinear relation, hence the dominance of NN models despite the local minima problem.

The high-dimensionality problem has finally been solved with the kernel trick, that is, although is a very high (or even infinite) dimensional vector function, as long as the solution of the problem can be formulated to involve only inner products like (with superscript T denoting the transpose), then a kernel function can be introduced:

Instead of the unmanageable , one now works only with the very manageable kernel function. From Bishop [11] or Hsieh [12], the nonlinear regression solution can be written in the following form:

where there are data points in the training data set. If a Gaussian kernel function is used, that is,

where controls the width of the Gaussian function, then is simply a linear combination of Gaussian functions. These kernel methods have become popular for nonlinear regression, classification, and so forth, For nonlinear regression, the most common kernel method used is probably the support vector regression (SVR) method.

As mentioned earlier, SVR improves on NN methods by avoiding multiple local minima. It also uses a more robust error norm, the -insensitive error:

Note that for outliers, that is, large , this error function behaves like the robust mean absolute error (MAE) instead of the nonrobust MSE used in NN models.

The SVR problem is solved by minimizing the following function:

where contains the regression parameters in (5), and is an (inverse) regularization parameter. Details for solving the SVR problem are given in [11, 12]. The SVR code used was that developed by Chang and Lin [16].

The SVR model has three hyperparameters , and (controlling the width of the Gaussian kernel in (8)), which need to be determined by a grid search. SVR forecasts were tested by a tenfold cross validation as in the BNN and linear regression models; that is, the data set D1 was reserved for model forecast testing while D9 was used for model development. A sevenfold cross validation was used on D9, that is, 1/7 of D9, was reserved for model validation while 6/7 of D9 was used for training the model for given , and values. By rotating the training and validation data, the MSE was computed over validation data for all of D9, for each set of , and values, and the lowest validated MSE gave the optimal values for the three hyperparameters. The search for the hyperparameters was exhaustive; that is, a coarse grid search was first used to identify the region of minimum validated MSE, and then a finer grid search was used to pinpoint the optimal , and values.

#### 4. Forecast Results

Individual forecasts of the first five PCs of SST were conducted using LR, BNN, and SVR. Forecasts were made at lead times of , , , and months, and the performance of these methods was compared. The analysis was applied to two records, 1950–2005 and 1980–2005, with latter data set containing the warm water volume (WWV) index. Due to space limitation, we will concentrate on the latter data set. Details on the first data set are given in Aguilar-Martinez [17].

For the 1980–2005 record, from the training and validation procedure for BNN, out of a total of 250 cases (as there were 5 predictand PCs, 5 lead times and 10 validation segments, giving cases), 1 hidden neuron was selected in 230 cases, 2 hidden neurons were selected in 19 cases, and for only 1 case, more than 2 hidden neurons were selected (see Table 4.2 in [17]). In summary, two hidden neurons were predominantly selected only for lead times of and months for PC2 (but one hidden neuron was predominant at other lead times), while all other PCs predominantly selected only one hidden neuron. While one hidden neuron allows the BNN to model a nonlinear relation, more complex nonlinearity needs more than one hidden neuron.

The correlation skills evaluated over testing data for the SST PCs for the 1980–2005 record are shown in Figure 3. Two correlation scores are provided for each method, with the bar on the left being the score prior to the inclusion of the WWV index as a predictor. For PC1, before WWV was included, BNN had slightly higher correlation scores than SVR at lead times of , and months, while SVR had marginally higher scores than BNN and LR at short lead times ( and months), (with hyperparameters chosen for the SVR given in Table 4.3 of [17]). Inclusion of the WWV index in the forecast of PC generally increased the correlation scores for all three models, while at the same time reducing or eliminating the differences between them. This increase in the correlation scores is expected given that the WWV along with SST is the main variables in most theoretical discharge/recharge oscillator models [14, 18, 19].

**(a) PC1**

**(b) PC2**

**(c) PC3**

**(d) PC4**

**(e) PC5**

The nonlinearity of ENSO variability is manifested through PC. The correlation skills for PC at lead times of 3–9 months are generally superior for the nonlinear methods, with SVR attaining the best performance (Figure 3). In general, adding the WWV index did not improve the scores for PC2. The scores for PCs 3–5 also showed that the inclusion of the WWV index did not significantly affect the forecast skills of these PCs.

To find out if the SST response to WWV could be governed by nonlinear dynamics, the performance of the models was tested using the WWV index as the sole predictor. The correlation scores for PC are displayed adjacent to the values obtained using the full set of predictors in Figure 4(a), with the score bar on the right obtained from the single predictor (WWV) models. For these models with only the WWV predictor the correlation scores provided by linear regression were higher than nonlinear regression at all lead times except for months, indicating a rather linear relation between SST and WWV, in accordance with Meinen and McPhaden [13] and McPhaden et al. [20]. As PC predictions of these models show relatively low scores at very short lead times (3 months), WWV is therefore a useful predictor mainly at intermediate lead times. The forecast of PC using WWV as the single predictor yielded low correlation scores for all models, as indicated in Figure 4(b), indicating that WWV was not useful for predicting PC2, the mode showing the largest degree of nonlinearity in Figure 3.

**(a) PC1**

**(b) PC2**

SST anomaly fields were reconstructed by summing the five predicted PCs multiplied by their corresponding EOF spatial patterns. SST anomalies averaged over the Niño (E–W, S–N), Niño 3.4 (W–W, S–N), Niño 3 (W–W, S–N), and Niño 1+2 (W–W, S–) regions were then computed. The correlation skills and root mean squared error (RMSE), before and after WWV was introduced into the predictor set, are given in Figures 5 and 6, respectively, for the various Niño regions. Before the WWV index was added, BNN provided a better correlation skill than SVR at lead times of 9–15 months for the Niño and regions. Incidentally, correlation skills for the Niño region show the largest observed discrepancies among the different models, with BNN winning at longer lead times (12–15 months) and SVR at shorter lead times ( and months). Incorporating the WWV tended to reduce the differences in correlation among the methods due to the mainly linear coupling between WWV and SST [18]. The advantage of the nonlinear methods is the most apparent in the region of largest variability, Niño , over short lead times of 3–9 months. However, the correlation scores for LR show improvements at longer lead times relative to the nonlinear methods. This is consistent with the forecast results of PC (Figure 4(a)), where LR compares well with the nonlinear methods at 12–15 months lead times.

(a) Niño |

(b) Niño |

(c) Niño |

(d) Niño |

**(a) Niño 4**

**(b) Niño 3.4**

**(c) Niño 3**

(d) Niño |

Contour maps of the correlation scores between the predicted and observed anomalies for the nonlinear models are provided in Figures 7 and 8. The right column in these figures shows the difference in the correlation scores between the nonlinear model and LR. In Figures 7(d), 7(f), and 7(h), the region where BNN has slightly higher correlation skill than LR shifts westward along the equator as the lead time increases from 6 to 9 to 12 months. At lead times of 6 and 9 months, SVR attains correlation scores slightly higher than those for LR off the equator near the dateline (Figures 8(d) and 8(f)).

**(a) BNN (lead = 3 mon)**

(b) BNN LR |

**(c) BNN (lead = 6 mon)**

(d) BNN LR |

**(e) BNN (lead = 9 mon)**

(f) BNN LR |

**(g) BNN (lead = 12 mon)**

(h) BNN LR |

**(a) SVR (lead = 3 mon)**

(b) SVR LR |

**(c) SVR (lead = 6 mon)**

(d) SVR LR |

**(e) SVR (lead = 9 mon)**

(f) SVR LR |

**(g) SVR (lead = 12 mon)**

(h) SVR LR |

Although adding the WWV index as a predictor slightly enhanced the forecast skills, it reduced the difference in prediction skill between the nonlinear and linear models in some cases. For instance, the slight prediction advantage at lead times of 9–12 months of BNN relative to LR was reduced when WWV was added, as can be seen by comparing Figures 7(f) and 9(f) and Figures 7(h) and 9(h). The correlation score of SVR with WWV added is given in Figure 10. Comparing LR against itself after WWV was added reveals a slight increase in correlation skill emerging at long lead times of 9–15 months concentrated on the central-eastern equatorial region of the tropical Pacific (see Figure 4.15 in [17]).

**(a) BNN(+WWV) (3 mon)**

(b) (BNN LR)(+WWV) |

**(c) BNN(+WWV) (6 mon)**

(d) (BNN LR)(+WWV) |

**(e) BNN(+WWV) (9 mon)**

(f) (BNN LR)(+WWV) |

**(g) BNN(+WWV) (12 mon)**

(h) (BNN LR)(+WWV) |

**(a) SVR(+WWV) (3 mon)**

(b) (SVR LR)(+WWV) |

**(c) SVR (+WWV) (6 mon)**

(d) (SVR LR)(+WWV) |

**(e) SVR (+WWV) (9 mon)**

(f) (SVR LR)(+WWV) |

**(g) SVR (+WWV) (12 mon)**

(h) (SVR LR)(+WWV) |

The analysis was repeated for the longer record of 1950–2005. Compared with the results for the 1980–2005 period, the results from 1950–2005 showed an even smaller difference between the nonlinear and linear methods. This is not surprising since the ENSO episodes during 1950–1979 were weaker and less nonlinear than those from 1980 onward [21]. Hence there is less advantage in using nonlinear methods in the pre-1980 period.

#### 5. Summary and Conclusion

The forecast results of the leading PCs of SST indicate some evidence of nonlinear structure in the ENSO cycle. The nonlinear structure is mainly present in the second mode of the SST field (Figure 3). Overall, the improvement in forecast skills by the nonlinear models over LR was modest, as found by Tang et al. [22] and Wu et al. [8], with the nonlinear methods being most advantageous in the eastern region of Niño 1+2 at 3–9-month lead (Figure 5). As forecast lead time increased from 6- to 12-months, the region where forecast correlation skill for BNN appeared higher than that for LR shifted westward, from the eastern equatorial Pacific to the central and western equatorial Pacific. While SVR did not show significant overall forecast skill advantage relative to LR over BNN, the regions where SVR showed improved skill over LR were more off-equatorial and further west (near the dateline) than for BNN, at 6 and 9 month leads.

Although SVR has two structural advantages over NN models, namely, (a) no multiple minima as there is no nonlinear optimization involved, and (b) an error norm robust to outliers in the data, it did not give overall better forecasts than BNN. Presumably (a) the use of an ensemble average in BNN to deal with multiple solutions from multiple minima was quite adequate, and (b) the data set did not contain drastic outliers to utilize the advantage of the robust SVR model.

Addition of WWV as an extra predictor generally increased the forecast skills slightly; however, the influence of WWV on SST anomalies along the central-eastern tropical Pacific region appears to be linear as the difference between nonlinear and linear models often diminished when WWV was included. As the WWV embodies the large-scale low-frequency dynamics [20], the response of the SST field to the WWV index comes with a particular time delay, at about to months. This observation is in agreement with our results, where the maximum correlation between the SST field and WWV was recorded at around 6–12-month lead time.

#### Acknowledgments

The authors were supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Foundation for Climate and Atmospheric Sciences. The authors are grateful to Dr. Aiming Wu for his helpful comments. The WWV index of Meinen and McPhaden [13] was kindly provided by the TAO Project Office of NOAA/PMEL.