Abstract

Accurate rainfall estimation is important for the effective use of water resources and the optimal planning of water structures. For this purpose, models were developed to estimate rainfall in Isparta using the data-mining process. Different input combinations with 1, 2, 3, and 4 input parameters were tried using the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations in Isparta. Among the models developed with various data-mining algorithms, multilinear regression was determined to be the most appropriate. The input parameters of the multilinear regression model were the monthly rainfall values of the Senirkent, Uluborlu, and Eğirdir stations. The relative error of this model was calculated as 0.7%. It was shown that the data-mining process can be used to estimate missing rainfall values.

1. Introduction

Meteorological events permanently affect human life. Because meteorological phenomena cannot be controlled and have important consequences for human life, accurate estimation and analysis of these variables are very important. Precipitation, which generates flow, is an important parameter. Extreme rainfall over a short time causes events such as floods that significantly affect human life, whereas insufficient rainfall over a long period leads to drought. Thus, rainfall estimation is very important in terms of its effects on human life, water resources, and water usage areas. However, rainfall is affected by geographical and regional variations and features and is therefore very difficult to estimate. Many studies have applied artificial intelligence methods to rainfall estimation [1–7]. Partal et al. [8] developed rainfall estimation models using artificial neural networks and wavelet transform methods. Bodri and Čermák [9] evaluated the applicability of neural networks for precipitation prediction. Chang et al. [10] applied a modified method, combining the inverse distance method and fuzzy theory, to precipitation interpolation. They used a genetic algorithm to determine the parameters of the fuzzy membership functions, which represent the relationship between a location without rainfall records and its surrounding rainfall gauges, and used this optimization process to minimize the precipitation estimation error.

One of the aims of storing data in databases and collecting data from many sources is to convert raw data into information. This process of converting data into information is called data-mining (DM). In recent years, the use of the data-mining process in hydrology has been increasing, and studies using the DM process have been performed in many areas [11–13]. Keskin et al. [14] developed an integrated evaporation model using the DM process for three lakes in Turkey. Terzi [15] developed models to forecast the flow of the Kızılırmak River from rainfall and flow parameters with the DM process. Terzi et al. [16] proposed various solar radiation models with the DM process using air temperature, relative humidity, wind speed, and air pressure parameters and evaluated the performance of the models. Teegavarapu [17] evaluated the use of association rule mining (ARM) in conjunction with a spatial interpolation technique to estimate missing precipitation data and to overcome one of the major limitations of spatial interpolation techniques. Solomatine and Dulal [18] compared the performance of artificial neural networks (ANNs) and model trees (MTs) in rainfall-runoff transformation. They determined that both ANNs and MTs produce excellent results for 1-h ahead prediction, acceptable results for 3-h ahead prediction, and conditionally acceptable results for 6-h ahead prediction. The two techniques performed almost identically for 1-h ahead prediction of runoff, but the ANN was slightly better than the MT for higher lead times. Keskin et al. [19] applied the data-mining process to river flow prediction and determined that it could be used for this purpose. Teegavarapu and Chandramouli [6] developed a model that uses artificial neural network concepts and a stochastic interpolation technique and tested it for the estimation of missing precipitation data.

The aim of this study is to evaluate the use of the data-mining process to estimate the rainfall of Isparta in Turkey. The study uses rainfall data from the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations in Isparta.

2. Data-Mining Process

Knowledge discovery is a process that extracts implicit, potentially useful or previously unknown information from the data. The knowledge discovery process is described in Figure 1.

Let us examine the knowledge discovery process shown in Figure 1 in detail.
(i) Data coming from a variety of sources is integrated into a single data store called target data.
(ii) The data is then preprocessed and transformed into a standard format.
(iii) The data-mining algorithms process the data into output in the form of patterns or rules.
(iv) Those patterns and rules are then interpreted into new or useful knowledge or information.

The ultimate goal of the knowledge discovery and data-mining process is to find the patterns hidden in huge sets of data and interpret them into useful knowledge and information. As the process diagram in Figure 1 shows, data-mining is a central part of the knowledge discovery process.

Data-mining is defined as “the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions” [20]. This definition has a business flavor and is aimed at business environments. However, data-mining is a process that can be applied to any type of data, in areas ranging from weather forecasting and electric load prediction to product design, among others.

Data-mining can also be defined as the computer-aided process that digs through and analyzes enormous sets of data and then extracts knowledge or information from them. By its simplest definition, data-mining automates the detection of relevant patterns in a database [21].

The emergence of knowledge discovery in databases (KDD) as a new technology has been brought about with the fast development and broad application of information and database technologies. The process of KDD is defined as an iterative sequence of four steps: defining the problem, data preprocessing (data preparation), data-mining, and postdata-mining.

2.1. Defining the Problem

The goals of a knowledge discovery project must be identified. The goals must be verified as actionable. For example, if the goals are met, a business organization can then put the newly discovered knowledge to use. The data to be used must also be identified clearly.

2.2. Data Preprocessing

Data preparation comprises the techniques concerned with analyzing raw data so as to yield quality data, mainly including data collection, data integration, data transformation, data cleaning, data reduction, and data discretization.

2.3. Data-Mining

Given the cleaned data, intelligent methods are applied in order to extract data patterns. Patterns of interest are searched for, including classification rules or trees, regression, clustering, sequence modeling, dependency, and so forth.

2.4. Postdata-Mining

Postdata-mining consists of pattern evaluation, model deployment, maintenance, and knowledge presentation.

The KDD process is iterative. For example, while cleaning and preparing data, it might be discovered that data from a certain source is unusable, or that data from a previously unidentified source is required to be merged with the other data under consideration. Often, the first time through, the data-mining step will reveal that additional data cleaning is required [22].

3. Study Region and Data

In this study, the data used to develop the rainfall estimation models are the monthly rainfall data of the Isparta, Senirkent, Uluborlu, Eğirdir, and Yalvaç stations. Isparta is located in the Lakes Region in the north of the Mediterranean Region, between 30°20′ and 31°33′ east longitudes and 37°18′ and 38°30′ north latitudes. Isparta has a surface area of 8933 km² and an average altitude of 1050 m. The average annual total rainfall of Isparta is 440.3 kg/m². Most of the rainfall (72.69%) occurs in the winter and spring months. The summer and autumn months are quite dry (29.31% of total rainfall). In winter, precipitation in the region usually falls as rain and occasionally as snow, whereas in the spring and summer months it occurs in the form of rainstorms. The study region and the locations of the rain gauges are shown in Figure 2.

The monthly rainfall data for the years 1964–2005 used in this study were obtained from the Turkish State Meteorological Service. Various rainfall estimation models were developed for Isparta using the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations as input parameters. The data were first checked for missing values, and mean values were substituted for any missing values. The training dataset, covering the years 1964–1996, was used to develop the models. The trained models were then run on the testing dataset covering the years 1997–2005.
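The preparation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: the `(year, month, value)` record layout, the use of `None` to mark a missing value, the function names, and the toy numbers are all hypothetical, and the source does not say which mean (overall, monthly, or otherwise) was substituted, so a simple overall mean is assumed here.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries (assumed rule)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def split_by_year(records, last_training_year=1996):
    """Split (year, month, value) records into training and testing sets by year."""
    train = [r for r in records if r[0] <= last_training_year]
    test = [r for r in records if r[0] > last_training_year]
    return train, test

# Hypothetical toy series: one missing monthly value marked with None.
records = [(1995, 1, 60.0), (1996, 1, None), (1997, 1, 80.0), (1998, 1, 40.0)]
filled = fill_missing_with_mean([value for _, _, value in records])
records = [(year, month, value) for (year, month, _), value in zip(records, filled)]

# Years up to 1996 go to training, later years to testing, as in the text.
train, test = split_by_year(records)
print(len(train), len(test))
```

In this sketch the fill value is computed from the whole series for brevity; computing it from the training years only would keep the testing period fully independent.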

4. Model Performance Criteria

In the model assessment stage, after a set of models had been built using different algorithms, the models were evaluated in terms of accuracy. There are a few popular criteria for evaluating the quality of a model. The coefficient of determination ($R^2$) and the root mean-squared error (RMSE), which are the best known and most commonly used performance criteria [23–25], were chosen. $R^2$ is the proportion of variability in a dataset that is accounted for by the statistical model. The RMSE is valuable because it indicates error in the units (or squared units) of the constituent of interest, which aids in analysis of the results. The coefficient of determination based on the rainfall estimation errors is calculated as

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(P_{i(\text{rainfall})} - P_{i(\text{model})}\right)^2}{\sum_{i=1}^{n}\left(P_{i(\text{rainfall})} - P_{\text{mean}}\right)^2}, \tag{1}$$

where $n$ is the number of observed data, $P_{i(\text{rainfall})}$ and $P_{i(\text{model})}$ are the monthly rainfall measurement and the result of the developed model, respectively, and $P_{\text{mean}}$ is the mean of the rainfall measurements.

The root mean square error represents the error of the model and is defined as

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{i(\text{rainfall})} - P_{i(\text{model})}\right)^2}, \tag{2}$$

where the parameters are as defined above.
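As a minimal sketch, the two criteria in equations (1) and (2) can be computed as follows. The function names and the four-value toy series are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination, equation (1)."""
    p_mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - p_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error, equation (2), in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical measured and modelled monthly rainfall values (mm).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(round(r_squared(observed, predicted), 3))  # 1 - 26/500 = 0.948
print(round(rmse(observed, predicted), 3))       # sqrt(26/4) ≈ 2.55
```

For the toy series the squared errors sum to 26 and the total sum of squares about the mean is 500, which gives the values noted in the comments.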

5. Rainfall Estimation Models

For rainfall estimation, the Decision Table, KStar, Multilinear Regression, M5’Rules, Multilayer Perceptron, RBF Network, Random Subspace, and Simple Linear Regression algorithms were used in this study. Fifteen models were developed using different input combinations of the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations to estimate the rainfall of the Isparta station. The models with 1, 2, 3, and 4 input parameters are given in Tables 1, 2, 3, and 4, respectively.
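The count of fifteen models is consistent with taking every non-empty subset of the four input stations (4 + 6 + 4 + 1 = 15). That reading is an inference from the numbers in the text, not something the source states explicitly. Under that assumption, the input combinations can be enumerated as in the sketch below; the variable names are hypothetical.

```python
from itertools import combinations

stations = ["Senirkent", "Uluborlu", "Eğirdir", "Yalvaç"]

# Every non-empty subset of the four stations, grouped by number of inputs.
input_sets = [
    combo
    for k in range(1, len(stations) + 1)
    for combo in combinations(stations, k)
]

# Number of candidate models for each input count (1 to 4 inputs).
counts = {k: sum(1 for c in input_sets if len(c) == k) for k in range(1, 5)}

print(len(input_sets))  # 15
print(counts)           # {1: 4, 2: 6, 3: 4, 4: 1}
```

Each tuple in `input_sets` would then define the predictor columns for one candidate model per algorithm.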

First, the relationships between the rainfall data of the Isparta station and those of the other stations (Senirkent, Uluborlu, Eğirdir, and Yalvaç) were investigated using statistical analyses. In order of their influence on the Isparta station, the stations were ranked as Senirkent, Uluborlu, Eğirdir, and Yalvaç. The performance criteria of the models developed with 1 input parameter are given in Table 1 for the testing set.
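The source does not say which statistical analyses were used to rank the stations. One common way to obtain such a ranking is to sort the candidate stations by their Pearson correlation with the target series, and the sketch below assumes that approach purely for illustration. The station labels and the short series are hypothetical, not values from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical target series and three hypothetical neighbouring stations.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
neighbours = {
    "station_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "station_b": [1.0, 3.0, 2.0, 5.0, 4.0],
    "station_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# Rank stations from most to least correlated with the target.
ranking = sorted(neighbours, key=lambda name: pearson(target, neighbours[name]),
                 reverse=True)
print(ranking)
```

A ranking of this kind only suggests which single inputs are likely to be informative; the tables of model results remain the actual basis for choosing inputs.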

Examining the models given in Table 1, the highest $R^2$ value (0.745) and the lowest RMSE value (48.44 mm) were obtained for the models developed using the Multilinear Regression (MLR), M5’Rules, and Simple Linear Regression algorithms with the rainfall data of the Senirkent station. These models have the same $R^2$ and RMSE values. The worst model, with the highest RMSE (141.50), was developed with the Decision Table algorithm. Among the models developed using the MLR, M5’Rules, and Simple Linear Regression algorithms, the input parameter of the best performing model was the rainfall of the Senirkent station, followed in general by the Uluborlu, Eğirdir, and Yalvaç stations. The $R^2$ and RMSE values of the models developed with 2 input parameters are given in Table 2.

As seen from Table 2, the highest $R^2$ (0.807) and lowest RMSE (36.65) values were obtained for the MLR and M5’Rules models developed using the rainfall values of the Eğirdir and Uluborlu stations. Table 2 shows that increasing the number of model input parameters improved the performance of the models. While the $R^2$ value of the best model with one input parameter was 0.745, that of the best model with two input parameters was 0.807. The models with 3 input parameters are shown in Table 3.

Table 3 shows that the $R^2$ values of the models with 3 input parameters (Senirkent-Uluborlu-Eğirdir) were 0.813 and 0.808 for the MLR and M5’Rules models, respectively. The MLR (Senirkent-Uluborlu-Eğirdir) model showed the best performance. The models developed with the Senirkent, Uluborlu, and Eğirdir stations, which were ranked highest in the statistical analysis, generally showed better performance. The model with the worst performance was the Radial Basis Function (RBF) network model. The models with 4 input parameters are shown in Table 4.

Table 4 shows that the $R^2$ value of the MLR model with 4 input parameters was 0.806. When the Yalvaç station was added to the best 3-input model, the performance of the resulting 4-input model decreased slightly. Among all the DM algorithms, MLR and M5’Rules generally gave the best results and had almost the same performance, except for the 4-input model. While the RBF network, one of the artificial neural network algorithms, showed the worst performance among all the DM models, MLR had relatively good results. Considering all the DM models, the MLR model with 3 input parameters ($R^2 = 0.813$) showed the best performance. In terms of RMSE, this model (Senirkent-Uluborlu-Eğirdir) also had the lowest error. Thus, the monthly rainfall results of the MLR model (Senirkent-Uluborlu-Eğirdir) are shown in Figures 3 and 4 as a comparison plot and a time series for the testing dataset. Figure 3 shows that the points of the MLR model comparison plot were uniformly distributed around the 45° straight line, implying that there were no bias effects. A good relationship between the estimated and measured rainfall values was apparent. The relative error between the measured values and the values of the developed MLR model was calculated as 0.7%.
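To make the form of a 3-input multilinear regression model concrete, the sketch below fits an intercept and three coefficients by ordinary least squares using the normal equations. It is a self-contained illustration, not the software or the fitted coefficients from the study: the function names, the six-row toy dataset, and the coefficient values used to generate it (5, 0.5, 0.3, 0.2) are all hypothetical.

```python
def fit_mlr(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                f = aug[k][col] / aug[col][col]
                aug[k] = [a - f * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict(coeffs, X):
    """Apply intercept plus weighted sum of inputs to each row."""
    return [coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row)) for row in X]

# Hypothetical monthly rainfall (mm) at three input stations, one row per month.
X = [[10, 20, 30], [40, 10, 20], [30, 50, 10],
     [20, 30, 60], [50, 40, 40], [60, 10, 50]]
# Hypothetical target generated from known coefficients so the fit can be checked.
y = [5.0 + 0.5 * a + 0.3 * b + 0.2 * c for a, b, c in X]

coeffs = fit_mlr(X, y)
print([round(c, 6) for c in coeffs])
```

Because the toy target is an exact linear function of the inputs, the fit recovers the generating coefficients; with real rainfall data the residuals would be non-zero and would be summarized by $R^2$ and RMSE on the testing set.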

It was shown that, for the Isparta region, the developed MLR model gave the best results for estimating rainfall. Because the MLR models were developed for the Isparta region, they cannot be used to estimate the rainfall of another region. For a different region, the models need to be re-established or calibrated according to the data of the new region. In the future, when more data are obtained, the developed models will need to be revised. Other methods may give better results than MLR when more data are added or when a model is developed for a different region.

6. Conclusions

Rainfall, which is an important factor in the use of water resources, is a difficult variable to estimate. In this study, the data-mining process was used to estimate the monthly rainfall values of Isparta. The monthly rainfall data of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations were used to develop rainfall estimation models. When the developed models were compared with measured values, the multilinear regression model gave more appropriate results than the other models developed in this study. The input parameters of the best model were the rainfall values of the Senirkent, Uluborlu, and Eğirdir stations. Consequently, it was shown that the data-mining process, which produces a solution more quickly than traditional methods, can be used to complete missing data in rainfall estimation.