#### Abstract

The Indian monsoon is an important climatic phenomenon and a global climatic marker. Both statistical and numerical prediction schemes for the Indian monsoon have been widely studied in the literature. Statistical schemes are mainly based on regression or neural networks. However, the variability of the monsoon over the years is significant, and a single model is often inadequate. Meteorologists revise their models from year to year based on prevailing global climatic events such as El-Niño. These events often have a degree of severity associated with them. In this paper, we cluster the monsoon years based on their fuzzy degree of association with these climatic event patterns. Next, we develop individual prediction models for the year clusters. A weighted ensemble of these individual models is used to obtain the final forecast. The proposed method performs competitively with existing forecast models.

#### 1. Introduction

Monsoon is a complex phenomenon of the climatic system. It is influenced by multiple climatic parameters and sea-atmosphere interactions. Prediction of monsoon is challenging because of the large variability in its patterns. The Indian Meteorological Department (*IMD*) has issued forecasts of Indian summer monsoon rainfall (*ISMR*) since 1886. Indian monsoon forecasting was initiated by Blanford [1] as early as 1882. The success of the forecasts during 1882–1885 encouraged Blanford to design an operational long-range forecast model for monsoon in 1886. Subsequently, Walker [2] developed models based on the statistical correlations between rainfall and different global climate parameters. Thapliyal and Kulshrestha [3] introduced a regression model for predicting south-west Indian monsoon rainfall. Gowariker et al. [4] proposed a power regression model for long-term forecast of monsoon, which provided accurate forecasts for a long period but failed to predict the extreme condition of 2002. In 2004, Rajeevan et al. [5] reassessed different climatic parameters and introduced four new parameters to design a statistical model for issuing long-range forecasts of Indian monsoon. In 2007, Rajeevan et al. [6] built models using ensemble multiple regression and projection pursuit regression to forecast Indian rainfall, which proved superior to past *IMD* models. Schewe and Levermann [7] explain the change in distribution of Indian rainfall and the reasons behind the failure of monsoon in certain years. Wu et al. [8] propose a linear Markov model to predict short-term climate variability of the East Asian monsoon. Fan et al. [9] develop two statistical prediction schemes for seasonal forecast of the East Asian summer monsoon. The schemes take the direct outputs of existing models and give better prediction of the summer monsoon.

Artificial neural networks (*ANN*) [10] are widely used to model the nonlinearity present in the monsoon process. Sahai et al. [11] use *ANN* techniques with error backpropagation to forecast Indian summer monsoon rainfall. Hong [12] predicts Indian summer monsoon using a recurrent neural network and also demonstrates successful use of support vector machines in solving nonlinear regression and time series problems. S. Chattopadhyay and G. Chattopadhyay [13] use three different backpropagation learning rules, namely, momentum learning, conjugate gradient descent learning, and Levenberg-Marquardt learning, in a comparative study of neural network methods for predicting rainfall time series.

The large variability in monsoon patterns makes it difficult for a single model to predict its distribution. A number of uncertainties, including boundary condition, parameter, and structural uncertainties, are involved in the construction of these models. Thus, it remains fundamentally challenging to rely on a single model for prediction. Multimodel ensembles, which combine the outcomes of different models, have been proposed to overcome the weakness of a single model [14, 15]. In addition, monsoon shows different characteristics over the years. There exist groups of years in which the variation of climatic parameters and the pattern of rainfall are similar. We use fuzzy clustering to group similar years together and model them separately. The motivation behind using fuzzy clustering is that each year manifests a mixture of physical climatic events. We cannot hard cluster a year into a specific group; each year has a degree of membership in every cluster. Fuzzy clustering is used to capture the characteristics of the different events related to a year of study. We use the same set of climatic parameters as the predictor set for every cluster but frame a different model for each cluster.

A number of prediction models, namely, multiple regression (*MR*), multilayer perceptron (*MLP*), recurrent neural network (*RNN*), and generalized regression neural network (*GRNN*) models, are used for prediction of Indian monsoon for the year clusters. There are sound reasons for using neural networks such as *MLP*, *RNN*, and *GRNN* for modelling: (i) Indian monsoon is a complex process, which cannot be adequately modelled by linear models; (ii) nonlinearity in the time-series pattern can be well captured by neural network learning; and (iii) climatic events are more closely related to parameter disturbances in nearby years than in distant years, and a neural network can weight the parameters of each year accordingly.

In this work, climatic parameters that are strongly correlated with Indian monsoon are identified at the outset, followed by fuzzy clustering of years into groups, with a degree of belongingness of each year to each cluster. We then model each cluster with four types of models, namely, *MR*, *MLP*, *RNN*, and *GRNN*, to forecast rainfall. The weighted ensemble of the forecasts given by the respective cluster models is taken as the final predicted rainfall. Analysis and comparisons are performed on aggregate Indian rainfall, and finally a meteorological interpretation of the obtained clusters is presented.

The paper is organised in the following manner. We discuss the details of the data and the predictor climatic parameters in Sections 2 and 3. The proposed clustering-based approach, prediction models, and ensemble technique are presented in Section 4, with experimental results in Section 5. Meteorological significance is discussed in Section 6 and, finally, conclusions are provided in Section 7.

#### 2. Data Sets Used

We consider the annual Indian summer monsoon rainfall (*ISMR*), occurring in the four months of June, July, August, and September. Annual *ISMR* is considered for the period 1948–2013 in our study. The long period average (*LPA*) (1948–2013) of *ISMR* is 891.8 mm. *ISMR* is expressed as a percentage of the *LPA* value. The data is obtained from Indian Institute of Tropical Meteorology, Pune (http://www.imdpune.gov.in/research/ncc/longrange/data/data.html) [16].

Predictor parameters sea level pressure (*SLP*) (http://www.esrl.noaa.gov/psd/gcos_wgsp/Gridded/data.noaa.erslp.html) and sea surface temperature (*SST*) (http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.ersst.html) data are provided by the NOAA/OAR/ESRL/PSD, at spatial resolution of [17]. Surface pressure (*SP*) and zonal wind velocity (*WV*) data are collected from *NCEP* Reanalysis Derived data provided by the NOAA/OAR/ESRL PSD (http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.derived.surface.html) [18], available at resolution of . Finally, Niño 3.4 data, which is the sea surface temperature anomaly for the spatial coverage of 5°S to 5°N and 170°W to 120°W in the Pacific Ocean region, is acquired from National Center for Atmospheric Research (http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ensoyears.shtml) [19]. All the above monthly data are considered for the period 1948–2013 in our study and analysis.

#### 3. Global Climatic Parameters Influencing Indian Monsoon

Indian monsoon is strongly influenced by several global climatic parameters, occurring at places distant from the Indian subcontinent. Identification of predictor parameters relies on a physical understanding of the monsoon event and wind flow patterns. We select the climatic parameters from those used by the Indian Meteorological Department's models [5, 6], studying their correlation with Indian summer monsoon rainfall (*ISMR*) during our period of study (1948–2013). In the data preprocessing phase, climatic anomaly data are computed as the deviation of each parameter value from the long-term average of that parameter, separately for each month, followed by a correlation study between *ISMR* and the climatic parameters for lags of zero to twelve months. For each parameter, we retain the lagged predictor month having the highest correlation with *ISMR*. The predictor climatic parameters and their correlation values with Indian monsoon are shown in Table 1. Figure 1 shows the geographic location of the climatic parameters influencing Indian monsoon.
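The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the data layout (one row of 12 monthly values per year), the choice of May as the reference month from which lags are counted, and the use of the largest absolute correlation as the selection rule are all assumptions made for the example.

```python
import math


def monthly_anomalies(series):
    """series: list of yearly rows, each holding 12 monthly values.
    Returns deviations from the long-term mean of each calendar month."""
    n_years = len(series)
    month_means = [sum(row[m] for row in series) / n_years for m in range(12)]
    return [[row[m] - month_means[m] for m in range(12)] for row in series]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_series(anoms, lag, ref_month=4):
    """Predictor value `lag` months before the reference month (assumed: May,
    index 4) of each year; lags past January reach into the previous year,
    so the first year of the record is skipped."""
    out = []
    for t in range(1, len(anoms)):
        idx = ref_month - lag
        year = t if idx >= 0 else t - 1
        out.append(anoms[year][idx % 12])
    return out


def best_lag(anoms, ismr, max_lag=12):
    """Return (lag, correlation) with the largest absolute correlation."""
    best = None
    for lag in range(max_lag + 1):
        r = pearson(lagged_series(anoms, lag), ismr[1:])
        if best is None or abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

Because the Pearson coefficient is unaffected by subtracting a constant, the anomaly step does not change the correlation for a fixed month; it matters when values from different calendar months are compared or combined.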

*Predictor Sets of Climatic Parameters*. Based on the correlation with Indian monsoon, we have built five predictor sets for forecasting. Different combinations of the identified climatic parameters (Table 1) form the predictor sets. The predictor sets are shown in Table 2.

#### 4. Methodology

We propose fuzzy clustering of monsoon years into groups, followed by building a separate model for each group, and finally predicting Indian summer monsoon rainfall (*ISMR*) as a weighted ensemble of the forecasts provided by the cluster models. The block diagram of the proposed fuzzy clustering-based approach to prediction of *ISMR* is shown in Figure 2. Detailed steps are described in the following subsections.

##### 4.1. Motivation: Variability of Monsoon Patterns

Trends and distributions of monsoon vary to a large extent over the years. It is thus necessary to group the years into clusters that have similar patterns of the predictor climatic parameters affecting monsoon. Clustering the years is effective because we can build a separate model for each cluster. These cluster models are expected to be more accurate because the variation within a cluster is smaller. Finally, an ensemble of the forecasts of these cluster models results in better prediction of Indian monsoon. As an example, consider two clusters of years corresponding to strong El-Niño and North Atlantic Oscillation, respectively. A drought year may be correlated with both events and hence might have a significant degree of belongingness to both clusters.

##### 4.2. Fuzzy Clustering of Monsoon Years

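Fuzzy $c$-means clustering is used for grouping the similar years together. Fuzzy $c$-means (*FCM*) is a method of clustering which allows one instance of input to belong to more than one cluster with some membership of belongingness. *FCM* attempts to partition a set of $n$ elements $X = \{x_1, \dots, x_n\}$ into a collection of $c$ fuzzy clusters and a partition matrix $U = [u_{ij}]$, $i = 1, \dots, n$, $j = 1, \dots, c$, where $u_{ij} \in [0, 1]$ gives the degree of belongingness of element $x_i$ to cluster $j$ with center $v_j$.

*FCM* aims to minimize the objective function of (1). The updates of the partition matrix and the centers occur in accordance with (2) and (3), respectively:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - v_j \rVert^{2}, \tag{1}$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert} \right)^{2/(m-1)}}, \tag{2}$$

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}, \tag{3}$$

where $m > 1$ denotes the level of cluster fuzziness.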
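The alternating updates referred to as (2) and (3) can be sketched in plain Python. This is a minimal illustration of the standard fuzzy $c$-means iteration under stated assumptions, not the authors' implementation: the random initialization, the stopping tolerance, the default fuzziness `m=2.0`, and the function name `fcm` are all choices made for the example.

```python
import math
import random


def fcm(points, c, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Fuzzy c-means on a list of equal-length numeric tuples.
    Returns (u, centers): u[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Random initial partition matrix; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(n_iter):
        # Center update: membership-weighted mean of the points, as in (3).
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][k] for i in range(n)) / total for k in range(d)]
            )
        # Membership update from relative distances to the centers, as in (2).
        new_u = []
        for i in range(n):
            dist = [
                max(math.sqrt(sum((points[i][k] - centers[j][k]) ** 2
                                  for k in range(d))), 1e-12)
                for j in range(c)
            ]
            new_u.append([
                1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ])
        change = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        if change < tol:
            break
    return u, centers
```

In the setting of this paper, each point would be the vector of predictor anomalies for one year, and each row of `u` would give that year's memberships in the clusters. Distances are floored at a small constant so that a point coinciding with a center does not cause a division by zero.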

##### 4.3. Prediction Models

Multiple regression and three models of artificial neural networks (*ANN*), namely, multilayer perceptron, recurrent neural network, and generalized regression neural network, are used to design prediction models for each cluster exclusively. Forecast of annual *ISMR* is provided by each cluster model separately and also by the ensemble of all the cluster models' forecasts. We describe below the models used.

###### 4.3.1. Multiple Regression (*MR*)

Multiple regression model is used to learn the relationship between several independent predictor variables ($x$s) and a dependent variable ($y$). A multiple regression model having $p$ independent variables is given by

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

where $x_{ij}$ is the $i$th observation of the $j$th independent variable, the first independent variable takes the value $1$ for all $i$, and $\varepsilon_i$ represents the residual.
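As a small illustration of fitting such a model, the sketch below estimates the coefficients by ordinary least squares through the normal equations. The source does not state which fitting procedure was used, so least squares is an assumption here, and the function names `fit_mr` and `predict_mr` are hypothetical.

```python
def fit_mr(X, y):
    """Least-squares coefficients for y = X b.
    Each row of X must already start with the constant 1 (intercept column)."""
    p = len(X[0])
    # Normal equations A b = g, with A = X^T X and g = X^T y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    g = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        g[col], g[piv] = g[piv], g[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            g[r] -= f * g[col]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (g[i] - sum(A[i][k] * b[k] for k in range(i + 1, p))) / A[i][i]
    return b


def predict_mr(b, x_row):
    """Prediction for one observation (x_row also starts with 1)."""
    return sum(bi * xi for bi, xi in zip(b, x_row))
```

The normal equations are adequate for a handful of predictors, as in the predictor sets used here; for strongly collinear predictors a QR- or SVD-based solver would be numerically safer.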

###### 4.3.2. Multilayer Perceptron Neural Network (*MLP*)

Multilayer perceptron neural network is a class of *ANN* where connections between the neurons do not form a directed cycle. In this network, the information propagates in only one direction, from the input nodes, through the hidden nodes, to the output nodes. The independent and dependent variables constitute the input and output layers, respectively. The number of hidden layers and the number of nodes in each must be determined empirically for each prediction task. Four different parameter sets, shown in Table 3, are considered for the models designed to forecast *ISMR*.

###### 4.3.3. Recurrent Neural Network (*RNN*)

Recurrent neural network is a class of *ANN* which creates an internal state of the network to exhibit dynamic temporal behaviour. Climatic changes or events occurring in the same or nearby time periods are highly correlated. Similarly, rainfall patterns are more strongly correlated with influencing factors in nearby years than in distant years. This behaviour is well captured by *RNN*, which, during training of the network, gives weights in decreasing order to the values from near to distant years. Thus, it helps to model the system dynamics in a more natural manner. The same parameter sets as for the *MLP* network (Table 3) are considered, with a delay span of *2* units.

###### 4.3.4. Generalized Regression Neural Network (*GRNN*)

Generalized regression neural network is a variant of radial basis function network. *GRNN* has three layers of artificial neurons: input, hidden, and output. The hidden layer has radial basis neurons, while neurons in the output layer have linear transfer function. Output of radial basis neurons is the input scaled by the spread factor. Given $n$ input-output pairs $(x_i, y_i)$, $i = 1, \dots, n$, with $p$ input variables, $h_i$ represents the output from each hidden unit. The *GRNN* output for a test point, $x$, is described by

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} h_i \, y_i}{\sum_{i=1}^{n} h_i},$$

where

$$h_i = \exp\left(-\frac{D_i^{2}}{2\sigma^{2}}\right), \qquad D_i^{2} = (x - x_i)^{T}(x - x_i),$$

and $\sigma$ is the spread factor. The reasons behind modelling using *GRNN* are (i) only one tunable design parameter (spread factor), (ii) one-pass algorithm (less time consuming), and (iii) accurate approximation of functions from sparse data.
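A one-pass *GRNN* prediction can be sketched in a few lines. The Gaussian kernel used below is the standard formulation of generalized regression networks and is assumed here rather than taken from the source; the function name `grnn_predict` and the fallback for fully underflowed kernels are illustrative choices.

```python
import math


def grnn_predict(train_x, train_y, x, spread=1.0):
    """Kernel-weighted average of the training outputs for a test point x.
    train_x: list of input vectors, train_y: list of outputs."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))  # squared distance
        weights.append(math.exp(-d2 / (2.0 * spread ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed: fall back to the plain mean (assumption).
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

A small spread makes the prediction follow the nearest training year, while a large spread pulls it toward the mean of all training outputs, which is why the spread factor is the single design parameter to tune.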

The optimal training period is ascertained for the *MR* and *GRNN* models by varying the number of training years from 5 to 30 and selecting the value that gives the least absolute error in prediction during the validation period (1984–1993). A training period of $t$ years specifies that, for predicting the rainfall of a given year, the $t$ preceding years available in a particular cluster are used for training.

##### 4.4. Ensemble of Predictors

Complexity in monsoon process makes it difficult for a single model to predict rainfall accurately. We design separate models for each cluster of years obtained by fuzzy clustering using the four predictors described in Section 4.3. Finally, annual *ISMR* is presented as a weighted ensemble of the forecasts of the models designed for each cluster. The weight is taken as the fuzzy membership of belongingness of the test year in the different clusters:

$$P = \sum_{j=1}^{c} u_j \, P_j,$$

where $P_j$ represents the prediction given by a model for cluster $j$, $u_j$ is the fuzzy membership of the test year to cluster $j$, and $c$ is the total number of clusters.
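The membership-weighted combination is short enough to show directly. In this sketch the function name and the example numbers are hypothetical; the division by the sum of memberships is a safeguard added for the example and has no effect when the memberships already sum to 1, as they do for fuzzy $c$-means.

```python
def ensemble_forecast(memberships, cluster_forecasts):
    """Weighted ensemble of per-cluster forecasts.
    memberships: fuzzy memberships of the test year in each cluster.
    cluster_forecasts: forecast from each cluster's model, in the same order."""
    total = sum(memberships)
    weighted = sum(u * p for u, p in zip(memberships, cluster_forecasts))
    return weighted / total


# Hypothetical test year: memberships in three clusters and the three
# cluster-model forecasts, expressed as percent of LPA.
example = ensemble_forecast([0.6, 0.3, 0.1], [95.0, 102.0, 88.0])
```

A year that belongs almost entirely to one cluster is therefore forecast almost entirely by that cluster's model, while a year with mixed memberships receives a blend.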

##### 4.5. Validation of Proposed Approach

The study is performed on data for the period 1948–2013. Fuzzy clustering is performed over the period to cluster it into* three* groups. The number of clusters is decided based on cluster quality. Separate prediction models are designed for all three clusters and ensemble of forecasts of these models is provided as predicted Indian summer monsoon rainfall. Test period 2001–2013 is considered to evaluate the forecasting skills of our proposed approach.

The forecast models for annual *ISMR* are chiefly evaluated in terms of mean absolute error. Other error statistics, namely, root mean square error, prediction yields, Pearson correlation, and Willmott index of agreement, are also evaluated to judge the efficacy of our proposed approach for prediction. They are described below.

(i) *Mean Absolute Error* (MAE). Mean absolute error for prediction of annual *ISMR* is calculated in the following way:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert A_i - P_i \rvert,$$

where $A_i$ and $P_i$ are the actual and predicted *ISMR* series for the test period and $n$ denotes the total number of test years.

(ii) *Root Mean Square Error* (RMSE). Root mean square error measures the differences between model predicted output and actual values. It is a good measure for comparing the forecasting errors of various models:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (A_i - P_i)^{2}}.$$

(iii) *Prediction Yield* (PY). Prediction yields are evaluated at three different error categories (5%, 10%, and 15% errors) to assess the overall prediction results by judging the percentage of predicted years within each allowed range of errors.

(iv) *Pearson Correlation Coefficient* (PC). Pearson correlation coefficient measures the strength of linear association between actual and predicted values, where the value of 1 means a perfect positive correlation and the value of −1 means a perfect negative correlation:

$$\mathrm{PC} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^{2}} \, \sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^{2}}},$$

where $A_i$ and $P_i$ are the actual and predicted *ISMR* series for the test period and $\bar{A}$ and $\bar{P}$ are their corresponding means.

(v) *Willmott Index of Agreement* (WI). Willmott index of agreement is a standardized measure of the degree of model prediction error. It varies between 0 and 1, with higher values indicating a better fit of the model for prediction:

$$\mathrm{WI} = 1 - \frac{\sum_{i=1}^{n} (A_i - P_i)^{2}}{\sum_{i=1}^{n} \left( \lvert P_i - \bar{A} \rvert + \lvert A_i - \bar{A} \rvert \right)^{2}}.$$
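The five verification statistics can be computed with a few short functions. This is a sketch using the standard definitions of each measure; the function names are hypothetical, and the prediction yield is assumed to count years whose absolute error, in percentage points of *LPA*, does not exceed the allowed error.

```python
import math


def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def prediction_yield(actual, pred, allowed_error):
    """Percent of years with |actual - pred| <= allowed_error (assumed rule)."""
    hits = sum(1 for a, p in zip(actual, pred) if abs(a - p) <= allowed_error)
    return 100.0 * hits / len(actual)


def pearson(actual, pred):
    ma, mp = sum(actual) / len(actual), sum(pred) / len(pred)
    num = sum((a - ma) * (p - mp) for a, p in zip(actual, pred))
    den = math.sqrt(sum((a - ma) ** 2 for a in actual)) * \
        math.sqrt(sum((p - mp) ** 2 for p in pred))
    return num / den


def willmott(actual, pred):
    """Willmott index of agreement (original form, range 0 to 1)."""
    ma = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, pred))
    den = sum((abs(p - ma) + abs(a - ma)) ** 2 for a, p in zip(actual, pred))
    return 1.0 - num / den
```

All inputs are expected as rainfall in percent of *LPA*, so MAE and RMSE come out in the same percentage units reported in the results.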

#### 5. Experimental Results and Analysis

In this section we present the evaluation of our proposed fuzzy clustering-based approach. We first present the results of fuzzy clustering of the monsoon years for different predictor sets. Forecasting skills are evaluated for all cluster models and the ensemble model in terms of mean absolute error for the test period 2001–2013. In addition, other measures, namely, root mean square error in prediction, correlation between predicted and actual rainfall, prediction yields, and the agreement index between actual and predicted rainfall, are also estimated to establish the efficiency of our proposed approach to prediction of Indian summer monsoon rainfall.

##### 5.1. Clustering of Monsoon Years

Fuzzy clustering is performed over the period 1948–2013 to cluster the data into *three* clusters. We perform an $\alpha$-cut on the membership values to assign the data instances to the clusters. The value of $\alpha$ is ascertained empirically such that the distribution of elements within clusters is regular. A data instance can be assigned to more than one cluster simultaneously. The cluster sizes for the various predictor sets are shown in Table 4.
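The assignment of years to clusters by thresholding memberships can be sketched as below. The function name, the membership values, and the threshold of 0.35 are purely illustrative and are not the values used in the study.

```python
def alpha_cut(u, alpha):
    """Assign each instance to every cluster in which its membership
    is at least alpha. u[i][j] is the membership of instance i in cluster j.
    Returns one list of instance indices per cluster."""
    n_clusters = len(u[0])
    clusters = [[] for _ in range(n_clusters)]
    for i, row in enumerate(u):
        for j, membership in enumerate(row):
            if membership >= alpha:
                clusters[j].append(i)
    return clusters


# Hypothetical memberships of three years in three clusters.
memberships = [
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
    [0.1, 0.3, 0.6],
]
assigned = alpha_cut(memberships, alpha=0.35)
```

Here the second year lands in two clusters at once, which is the behaviour the text describes. A threshold above 0.5 would make the assignment exclusive, and a very high threshold could leave some years unassigned.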

##### 5.2. Prediction Accuracy

We predict annual rainfall for all five predictor sets (Table 2) separately using four models, namely, *MR*, *MLP*, *RNN*, and *GRNN*. The test period is 2001 to 2013.

###### 5.2.1. Multiple Regression Model (*MR*)

Multiple regression models are built for every cluster by ascertaining the optimal training period for each predictor set. The optimal training period is found by varying the number of training years and selecting the value with the least absolute error in prediction during the validation period (1984–1993). Individual cluster-based models as well as weighted ensemble models are considered for prediction. Table 5 gives the mean absolute error of the individual cluster-based and ensemble models for the test period 2001–2013. The model provides a mean absolute error of 6.2% for *PredSet4* (Table 2). It is observed that the ensemble model outperforms all the single cluster models for every predictor set. Figure 3 shows the interannual variability of actual and ensemble predicted rainfall as percent of the long period average (*LPA*).

###### 5.2.2. Multilayer Perceptron Neural Network Model (*MLP*)

Multilayer perceptron neural network model is designed with the four different sets of parameters described in Table 3. Mean absolute errors of all cluster and ensemble models are shown in Table 6. The *MLP* model reports an error of 4.0% for *PredSet4* (Table 2) with *MLP* parameters *ParSet1* (Table 3). The actual rainfall and the rainfall predicted by the cluster models and the ensemble model are shown in Figure 4. Ensemble predicted rainfall closely follows actual rainfall.

###### 5.2.3. Recurrent Neural Network Model (*RNN*)

Mean absolute errors for prediction of annual rainfall by the recurrent neural network model for the test period 2001–2013 are presented in Table 7. *PredSet3* (Table 2) with *RNN* parameters *ParSet1* (Table 3) gives an error of 5.1%. *RNN* weights the training years in decreasing order of their distance from the test year, so that nearer years contribute more. The pattern of actual and ensemble predicted rainfall in terms of percentage of *LPA* is shown in Figure 5.

###### 5.2.4. Generalized Regression Neural Network Model (*GRNN*)

Generalized regression neural network ensemble and individual cluster models' errors in terms of mean absolute errors are presented in Table 8. The model reports an error of 6.1% for *PredSet3* (Table 2). Figure 6 shows the interannual variations of ensemble forecast of rainfall by the *GRNN* ensemble model along with the actual rainfall pattern in terms of percentage of *LPA* for the period 2001–2013. It is observed that the predicted values are close to actual rainfall patterns. Prediction by models designed for clusters is shown by different symbols.

##### 5.3. Statistical Measures for Validation of Proposed Approach

Next, we validate the models in terms of other accuracy measures besides mean absolute error. Table 9 shows different forecast verification statistics for ensemble models during test period 2001–2013. We summarize the observations below.

(i) *Root Mean Square Error (RMSE)*. *MLP* ensemble model gives *RMSE* of 5.3%, followed by *RNN* ensemble model with 6.4%. *GRNN* and *MR* models give *RMSE* of 7.4% and 8.4%, respectively.

(ii) *Prediction Yield (PY)*. *PY* for 5% error category of *MR*, *MLP*, *RNN*, and *GRNN* ensemble models is 46%, 69%, 53%, and 46%, respectively. They give prediction yield of 76%, 92%, 92%, and 84% for allowed error of 10% category. Finally, at error category of 15%, *MR*, *MLP*, *RNN*, and *GRNN* ensemble models give yield of 92%, 100%, 92%, and 100%, respectively. Thus, none of the predicted years show abrupt deviation from corresponding actual rainfall pattern.

(iii) *Pearson Correlation (PC)*. *PC* of 0.61, 0.81, 0.71, and 0.49 is observed for prediction by *MR*, *MLP*, *RNN*, and *GRNN* ensemble models, respectively. It is noticed that predicted rainfall by *MLP* ensemble model is highly correlated to actual values, while correlation for *GRNN* forecast is least.

(iv) *Willmott Index of Agreement (WI)*. *WI* for *MR*, *MLP*, *RNN*, and *GRNN* ensemble models is 0.71, 0.89, 0.81, and 0.62, respectively. The index shows that the agreement between actual and predicted rainfall is high for *MLP* and *RNN* ensemble models.

All of the mentioned statistical measures (Table 9) as well as mean absolute error (Table 6) in prediction of monsoon ascertain *MLP* model to be the best among all four proposed models.

##### 5.4. Comparison of Results

###### 5.4.1. Comparison with State-of-the-Art Methods

The proposed fuzzy clustering-based ensemble prediction models are compared with the models used by the Indian Meteorological Department (*IMD*), namely, the existing 16-parameter power regression model [4] and the 8- and 10-parameter models of Rajeevan et al. [5]. A test period of seven years, from 1996 to 2002, is considered. The *IMD* models give root mean square errors of 10.8%, 7.6%, and 6.4%, respectively. The *MR*, *MLP*, *RNN*, and *GRNN* ensemble models give root mean square errors of 6.0%, 3.4%, 4.4%, and 5.5%, respectively, outperforming all three *IMD* models. The results are shown as a bar graph in Figure 7.

###### 5.4.2. Improvement of Cluster-Based Models over Conventional Models

The error of the ensemble model, obtained by combining the outputs of all cluster models, is compared with the error of the same model (with the same parameters) trained on the whole dataset without clustering. The mean absolute errors for the various combinations of models and predictor sets are shown in Table 10. The results show the improvement in prediction of the clustering and ensemble method over the conventional nonclustered method.

##### 5.5. Prediction of the Year 2014

Annual Indian summer monsoon rainfall for the year of 2014 is 781.7 mm, which is 87.8% of *LPA* value. Proposed clustering-based ensemble *MR*, *MLP*, *RNN*, and *GRNN* models predict rainfall of 2014 as 96.1%, 80.3%, 80.0%, and 95.3% of *LPA*, respectively. Thus, proposed models show absolute error of 7.0% for forecasting rainfall of 2014.

#### 6. Meteorological Analysis

Next, we try to visualize each cluster in terms of physical climatic events. The clusters obtained by fuzzy clustering are physically interpreted as being characterized by some global climatic events. The climatic events considered and studied during the time period 1948 to 2013 (period considered for clustering in our work) are El-Niño, La-Niña (http://ggweather.com/enso/oni.htm), positive and negative Indian ocean dipole (http://bom.gov.au/climate/IOD), drought, and flood, shown in Table 11.

Figure 8 shows the El-Niño and La-Niña years associated with drought, normal, and excess rainfall years during 1948–2013. The years having rainfall 10% above *LPA* are excess rainfall years and years having rainfall 10% below *LPA* are drought years. The El-Niño and La-Niña years are shown by color codes (*light green and green*) in the figure. The chart helps to visualize the cooccurrence of El-Niño and La-Niña events with extremities of *ISMR*.

##### 6.1. Measuring Association between Climatic Events and *ISMR*

Support and confidence measures are considered to relate physical climatic event to the clusters generated by fuzzy clustering. They are defined below.

(i) *Support.* Support is defined as the percentage of the total number of years in the cluster that correspond to the climatic event:

$$\text{Support} = \frac{n_e}{N_c} \times 100,$$

where $n_e$ denotes the number of years associated with a specific climatic event in the cluster and $N_c$ is the total count of years in the cluster.

(ii) *Confidence*. Confidence is defined as the percentage of years associated with the climatic event in the cluster relative to the total number of such event years:

$$\text{Confidence} = \frac{n_e}{N_e} \times 100,$$

where $N_e$ is the number of years associated with the climatic event during the period 1948–2013.
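Both measures reduce to counting the years shared by a cluster and an event. The sketch below follows the verbal definitions above; the function name is hypothetical and the year lists are invented for illustration, not taken from Table 11.

```python
def support_confidence(cluster_years, event_years):
    """Return (support, confidence) in percent for one cluster and one event.
    support    = shared years / years in the cluster
    confidence = shared years / years in which the event occurred"""
    cluster, event = set(cluster_years), set(event_years)
    shared = len(cluster & event)
    support = 100.0 * shared / len(cluster)
    confidence = 100.0 * shared / len(event)
    return support, confidence


# Invented example: a cluster of five years and an event seen in four years.
cluster_example = [1951, 1952, 1953, 1954, 1955]
event_example = [1952, 1953, 1954, 1960]
support, confidence = support_confidence(cluster_example, event_example)
```

Support asks how much of the cluster the event explains, while confidence asks how much of the event the cluster captures, which is why both are required to pass a threshold before a cluster is linked to an event.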

We relate a cluster to a physical climatic event described in Table 11 if both the support and confidence measures attain their corresponding thresholds. The thresholds are chosen such that 50% of the years of study remain under consideration. A low threshold weakens the significance of relating a climatic event to a particular cluster; conversely, retaining fewer years requires high threshold values, which in turn leaves out most of the clusters. Therefore, as a compromise between the two extremes, 50% of the years are considered. Figure 9 shows histograms with confidence and support as bins of year-count before and after the thresholding process, respectively, for predictor set *PredSet1* (Table 2). The threshold values obtained for the predictor sets are presented in Table 12. For each predictor set, we associate the clusters with physical climatic events if they satisfy both the support and confidence thresholds. The climatic events corresponding to each cluster are shown in Table 13. The results establish the coexistence of *La-Niña* and *flood* events. They also highlight the high probability of *El-Niño*, *drought*, and *positive IOD* events occurring simultaneously.

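Figure 9: Histograms of year-count with confidence and support as bins for *PredSet1*: **(a)** before thresholding; **(b)** after thresholding.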

#### 7. Conclusion

Monsoon is an important phenomenon for the economic development of an agrarian country like India. The large variability of monsoon over the years makes prediction of rainfall a challenging task. This paper addresses the problem by clustering the years into similar groups and providing a multimodel ensemble forecast for Indian summer monsoon rainfall.

Different climatic parameters, each at its best correlated month, are identified, and five different predictor sets are built for prediction of Indian monsoon. Four different models, namely, *MR*, *MLP*, *RNN*, and *GRNN*, are designed for each cluster exclusively. The final forecast is provided by a weighted ensemble of the forecasts of each cluster's model, where the weight is the fuzzy membership of belongingness in each cluster. The multilayer perceptron ensemble model provides a mean absolute error of 4.0% for prediction of annual rainfall, which is appreciable for forecasting the complex monsoon process. The proposed fuzzy clustering-based ensemble approach surpasses the conventional nonclustered approach. The performance of the proposed clustering-based ensemble models is superior to that of the existing *IMD* models [4, 5]. The error statistics also confirm the superiority of the multilayer perceptron model over the other three proposed models. Lastly, in a meteorological context, the clusters are linked with global climatic events.

In the future, a larger number of climatic parameters influencing Indian monsoon can be explored, and different predictor sets can be used for different clusters of years to provide even better forecasting accuracy.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This work is supported by RBU project through RESPOND program of ISRO through KCSTC, IIT Kharagpur.