Abstract

Accurately estimating evaporation loss is necessary for scheduling irrigation and calculating irrigation water requirements. In this study, four machine learning (ML) modeling approaches, extreme learning machine (ELM), gradient boosting machine (GBM), quantile random forest (QRF), and Gaussian process regression (GPR), were developed to estimate the monthly evaporation loss at two stations located in Iraq. Monthly climatic parameters were used as input variables for simulating the evaporation rate. Several statistical measures (e.g., mean absolute error (MAE), root mean square error (RMSE), correlation coefficient (R), mean absolute percentage error (MAPE), and modified index of agreement (Md)), as well as graphical inspection, were used to compare the performances of the applied models. The results showed that the GBM model performed considerably better than the other applied models in predicting monthly evaporation at both stations. For the first case study (Diyala), GBM reduced the MAE and RMSE by 7.17% and 21.01%, 16.51% and 15.74%, and 23.14% and 26.64% compared to ELM, GPR, and QRF, respectively. For the second case study (Erbil), GBM reduced the MAE and RMSE by 10.88% and 9.24%, 15.24% and 5%, and 16.06% and 15.76% compared to ELM, GPR, and QRF, respectively. The results of the proposed GBM model can therefore assist local stakeholders in the management of water resources.

1. Introduction

Evaporation plays a major role in the hydrological cycle; therefore, monitoring evaporation is important for managing water resources, optimizing irrigation schedules, and modeling agricultural production [1,2]. In addition, the evaporation rate is important in studying climate change and global warming because evaporation dissipates a large proportion of global precipitation [3–5]. Evaporation loss is influenced primarily by the vapor pressure gradient and the available heat energy, which are determined by weather variables such as air temperature, relative humidity, wind speed, and solar radiation [6–8]. These variables are strongly associated with other factors such as the season, time of day, geographical location, and type of climate [9,10]. The evaporation process is therefore extremely nonlinear and complex.

There are two procedures for computing and evaluating evaporation: direct and indirect [11]. Pan evaporation is a well-known direct method used extensively for estimating the evaporation rate. However, evaporimeters cannot be placed everywhere, especially in inaccessible regions where precise instrumentation is not possible [12]. Furthermore, installing and maintaining this evaporation equipment in several regions is expensive [13]. The indirect method, by contrast, relies on empirical equations for estimating the evaporation rate [14]. These empirical equations can be established using meteorological and hydrological parameters such as temperature, sunshine hours, wind speed, humidity, and rainfall [15,16]. Precise measurement of some of these meteorological factors requires advanced tools and skilled labor [17]. Often, instrument malfunctions, improper maintenance, and harsh weather conditions make it difficult to gauge these data without errors, yet error-free data are essential for predicting evaporation via empirical equations [18]. It is therefore problematic to estimate evaporation when these factors are gauged incorrectly [19].

Indirect methods of estimating evaporation with empirical equations therefore depend on data and are also influenced by different assumptions. In other words, these approaches are data-sensitive procedures, and the accuracy of prediction depends mainly on the validity of the data [20]. Additionally, such climatic data are generally scarce or hard to find at a particular hydrological station, and they tend to be discontinuous in certain places [21]. Evaporation is difficult to model through empirical techniques because of its extremely complex and nonlinear physical nature. In addition, an empirical model designed for a specific scenario might not perform well in another scenario and may require recalibration of its coefficients before use. Many researchers have developed empirical models of evaporation loss [22]. The selection of the predictors is one of the main challenges for the nonlinear regression process. Therefore, creating a robust predictive model using empirical procedures is very difficult.

Many studies have addressed different water-resource problems using artificial intelligence (AI) approaches such as random forest (RF), support vector machine (SVM), extreme learning machine (ELM), feed-forward neural network (FFNN), extra-tree, Gaussian process regression (GPR), gradient boosting model (GBM), and quantile regression forest (QRF) [23–29]. Goyal et al. [30] presented a study to estimate the daily evaporation loss over subtropical areas using different AI modeling approaches. The study used six meteorological parameters to establish the applied models. The findings illustrated that the Adaptive Neurofuzzy Inference System (ANFIS) and least square support vector regression (LS-SVR) provided the best accuracy compared to the other models used. Another study [31] estimated the evaporation loss of Beysehir lake, located in the southern part of Turkey. This study employed several machine learning approaches coupled with a cross-validation technique to predict the monthly evaporation over that case study, which is characterized as an arid and semiarid area. The study found that both ANN and SVR had good prediction accuracy. Qasem et al. [32] developed a more complex model that combined ML models such as SVR and ANN with wavelet transforms (WT) for modeling the monthly rate of evaporation in arid and humid climates. The obtained results showed that the WT did not significantly enhance the prediction accuracy in some cases, whereas the standard model (ANN) showed satisfactory accuracy in predicting the evaporation rates. Because ANN performed well in predicting evaporation loss, it is worthwhile to compare ANN with other machine learning methods such as RF and ELM. The study in [33] compared the performances of ANN and random forest in the prediction of evaporation. Its results showed that RF performed better than ANN and provided very accurate estimates. Furthermore, Althoff et al. [34] used different ML approaches to estimate the evaporation loss of small dams in Brazil. The findings of the study illustrated that the performance of RF was very satisfactory in the prediction of evaporation loss over small dams. Several other studies demonstrated the contribution of AI models in simulating catchment evaporation processes [35–37]. Recently, kernel-based models, fuzzy algorithms, and their hybrids with other algorithms have been successfully used for predicting evaporation [38]. However, newly developed gradient boosting models have rarely been applied in modeling reference evapotranspiration worldwide. To our knowledge, no study has focused on evaluating and comparing the capability of newly developed gradient boosting models for evaporation estimation in the arid to semiarid climate zones of Iraq. Therefore, it is of interest to evaluate the performance of GBM and compare it with reliable AI models such as extreme learning machine (ELM), quantile regression forest (QRF), and Gaussian process regression (GPR) for estimating the evaporation rate (Ep) in the arid to semiarid climate zones of Iraq.

The contribution of this study is to determine the efficiency of the gradient boosting model (GBM) in estimating the evaporation rate (Ep) using data collected from two meteorological stations located in Iraq. The performance of GBM was compared with those of reliable AI models such as extreme learning machine (ELM), quantile regression forest (QRF), and Gaussian process regression (GPR). Furthermore, this is the first time that the GBM model has been used to predict the monthly evaporation loss at stations located in Iraq.

2. Data and Case Study

Iraq is located in the Middle East and has essentially two major climate zones, semiarid in the south and semihumid in the north [39]. The Iraqi region lacks sufficient water resources and suffers from droughts [40,41]. As temperatures rise in Iraq, surface water availability decreases and groundwater levels in aquifers decline. Iraq's hydrological cycle has been affected severely by evaporation, which currently depletes about 61% of its total precipitation [16,42]. Thus, it is very important to accurately predict the evaporation loss in Iraq. In this study, two case studies are selected to estimate the evaporation rate. The first station is in Diyala state, while the second station is in Erbil state (see Figure 1). Diyala is located in the central part of the region, while Erbil is located in the northern region. The evaporation rate was predicted as a function of six meteorological parameters: sunshine hours, minimum and maximum temperature, wind speed, rainfall, and relative humidity.

3. Methodology

3.1. Gaussian Process Regression

Rasmussen and Williams were the first to introduce Gaussian process regression (GPR) [43]. This approach is a well-known nonparametric method used for solving classification and regression problems. Furthermore, the GPR model has been commonly employed to address several water resources concerns [44–47]. GPR combines Bayesian learning and kernel machines to form a principled, probabilistic approach to building a regression model. The uncertainty of a model prediction can be output directly alongside the predicted value [48].

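In general, a GPR is fully specified by its mean and kernel functions [49]. According to this definition, GPR is a collection of random variables representing the value of the function $f(x)$ at the given location $x$. It can be expressed as follows:

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \tag{1}$$

where $f(x)$ follows the prior distribution of the regression function, and $m(x)$ and $k(x, x')$ are the mean and kernel functions, respectively. Considering that the training set includes a finite number of inputs in matrix form $X = [x_1, x_2, \ldots, x_n]$, the joint distribution of GPR is defined as follows:

$$f(X) \sim \mathcal{N}\big(m(X), K(X, X)\big), \tag{2}$$

where $K(X, X)$ is the covariance matrix with entries $k(x_i, x_j)$ and $m(X)$ collects the values of the mean function, which is defined as follows:

$$m(x) = \mathbb{E}[f(x)]. \tag{3}$$

Moreover, the kernel function of the applied model can be determined from the mean function $m(x)$ as follows:

$$k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]. \tag{4}$$

In this study, the mean function is set to zero for simplicity, which produces a widely used GPR prior. This choice has been widely used in previous studies [43,50]. Finally, (1) can be rewritten as follows:

$$f(x) \sim \mathcal{GP}\big(0, k(x, x')\big). \tag{5}$$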
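As a minimal sketch of how a zero-mean GPR prior of this kind yields predictions, the following NumPy example computes the posterior mean and variance at new inputs. The squared-exponential kernel, its hyperparameters (`length_scale`, `signal_var`), the noise variance `noise_var`, and all function names are illustrative assumptions; the study does not state which kernel or settings were used.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel k(x, x') (assumed; not specified in the study)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gpr_predict(x_train, y_train, x_test, noise_var=1e-6, **kernel_args):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    k_tt = rbf_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train, **kernel_args)
    k_ss = rbf_kernel(x_test, x_test, **kernel_args)
    chol = np.linalg.cholesky(k_tt)                       # K + noise*I = L L^T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha                                   # K_* (K + noise*I)^{-1} y
    v = np.linalg.solve(chol, k_st.T)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)            # predictive variance
    return mean, var
```

The variance returned alongside the mean is the prediction uncertainty mentioned above: it is close to zero near training inputs and approaches `signal_var` far away from them.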

3.2. Extreme Learning Machine

Extreme learning machine (ELM) is a single hidden layer feed-forward neural network (FFNN) with good global search ability, a simple structure, fast learning speed, and excellent generalization ability [51]. There are two types of weights in the ELM: the input weights of the hidden layer, which are assigned randomly, and the output weights, which are obtained analytically [52]. In other words, unlike traditional neural networks, the ELM does not require iterative learning [53]. The output weights of the ELM can be computed directly from the generalized inverse of the hidden-layer output matrix. The structure of the ELM is greatly simplified by this process. The training process of ELM is summarized in the following steps:
(i) Input the training dataset, and select the ELM’s structure (number of hidden nodes) and the activation function of the hidden layer (see Figure 2).
(ii) Calculate the H matrix (output of the hidden layer) as follows:
$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \tag{6}$$
where $g(\cdot)$ is the activation function, $N$ is the number of training samples, $L$ is the number of hidden nodes, and $(w_j, b_j)$ are the hidden node parameters, which are randomly assigned.
(iii) Determine the output weight matrix ($\beta$):
$$\beta = H^{\dagger} T, \tag{7}$$
where $T$ is the actual label vector of the training dataset and $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.
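The three training steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the sigmoid activation, the standard normal initialization of the hidden parameters, the number of hidden nodes, and the names `elm_fit` and `elm_predict` are assumptions, not the configuration used in the study.

```python
import numpy as np


def elm_fit(x, t, n_hidden=20, seed=0):
    """Train a single-hidden-layer ELM; returns (w, b, beta)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random hidden biases, never trained
    h = 1.0 / (1.0 + np.exp(-(x @ w + b)))        # step (ii): hidden-layer output matrix H
    beta = np.linalg.pinv(h) @ t                  # step (iii): beta = pinv(H) @ T
    return w, b, beta


def elm_predict(x, w, b, beta):
    """Apply the fixed hidden layer, then the fitted output weights."""
    h = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return h @ beta
```

Because the only fitted quantity is `beta`, training reduces to one least-squares solve, which is the reason no iterative learning is needed.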

3.3. Quantile Random Forest (QRF)

Random forest (RF) is a supervised ensemble learning algorithm introduced by Breiman [54]. The core concept of this approach is to combine multiple trees through ensemble learning. RF is a modified version of the bagging algorithm: new datasets are drawn from the original dataset by sampling with replacement (bootstrap sampling), and a tree is trained on each of them separately. The CART decision tree is employed in RF as the weak learner; for each tree that is generated, the required number of features is selected randomly from the original feature set. In a regression problem, the outputs of the weak learners are averaged to obtain the final model output. The averaging approach of RF is important for reducing the variance of the prediction, while the random selection of features reduces the correlation between trees [23].

The quantile random forest (QRF) is an improved version of RF that applies quantile regression (QR) instead of averaging when calculating the final form of the target [55]. Furthermore, the QRF is a nonparametric approach supported by a solid theoretical foundation [56]. The conditional distribution of the QRF can be mathematically expressed as follows:

$$F(y \mid X = x) = P(Y \le y \mid X = x) = \mathbb{E}\big(\mathbf{1}_{\{Y \le y\}} \mid X = x\big). \tag{8}$$

In RF, the conditional mean $\mathbb{E}(Y \mid X = x)$ is estimated by a weighted mean of the observations $Y_i$. In QRF, $\mathbb{E}(\mathbf{1}_{\{Y \le y\}} \mid X = x)$ in (8) is analogously estimated by the weighted average of the observations $\mathbf{1}_{\{Y_i \le y\}}$:

$$\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}_{\{Y_i \le y\}}, \tag{9}$$

where $w_i(x)$ is the weight of observation $i$ and $n$ is the number of observations.

The steps below illustrate the QRF algorithm:
(i) The $M$ decision trees $T(\theta_t)$, $t = 1, \ldots, M$, are grown as in random forests, while keeping track of all the observations in each leaf node of every tree.
(ii) For a given $X = x$, $x$ is dropped down all decision trees, and the observations that share its leaf are determined for each tree. The weight $w_i(x)$ of each observation is then calculated by averaging its weights over the trees.
(iii) For all $y$, calculate the estimate of the distribution function with (9) by using the weights obtained in step (ii).

Figure 3 presents the flowchart of the QRF model.
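To make the weighting and distribution-function steps of QRF concrete, the sketch below assumes the forest has already been grown and that the leaf index of every training observation in every tree is available as an array (for example, from a fitted forest's leaf-assignment routine). The array layout and the names `qrf_weights` and `qrf_quantile` are hypothetical and used for illustration only.

```python
import numpy as np


def qrf_weights(train_leaves, query_leaves):
    """Observation weights for one query point.

    train_leaves: (n_train, n_trees) leaf index of each training observation.
    query_leaves: (n_trees,) leaf index reached by the query point in each tree.
    """
    same_leaf = train_leaves == query_leaves[None, :]
    leaf_size = same_leaf.sum(axis=0, keepdims=True)   # observations sharing the leaf, per tree
    per_tree = same_leaf / np.maximum(leaf_size, 1)    # per-tree weights sum to 1
    return per_tree.mean(axis=1)                       # average over trees


def qrf_quantile(y_train, weights, q):
    """Smallest y whose weighted empirical distribution function reaches q."""
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])                    # weighted distribution function
    idx = min(np.searchsorted(cdf, q), len(order) - 1)
    return y_train[order][idx]
```

Calling `qrf_quantile` with `q = 0.5` gives a conditional median, and two further calls (e.g., `q = 0.05` and `q = 0.95`) give a prediction interval, which is the information a plain RF average discards.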

3.4. Gradient Boosting Machine

Gradient boosting machine (GBM) is a widely used supervised learning algorithm introduced as a robust technique for solving classification and regression problems [57]. A single decision tree is fast to train, but it suffers from instability, and GBM was introduced to overcome this problem [58–60]. GBM combines the advantages of decision trees and boosting algorithms [61]. It formulates boosting as gradient descent on a loss function and is therefore useful for both classification and regression problems [62]. Boosting is a constructive scheme of ensemble formation in which new weak base models are added successively, each trained on the error of the whole ensemble built so far; an individual base learner needs to achieve only a slightly lower error rate than random guessing. In other words, the learning mechanism fits new models sequentially to produce a more precise estimate of the response variable. Figure 4 shows the structure of the gradient boosting machine regression model.

The approach of the GBM model can be illustrated in several steps as follows:
(i) The GBM is initialized with the constant value that minimizes the loss function.
(ii) In each training iteration, the negative gradient of the loss function is computed for the current model and used as the residual.
(iii) A new regression tree is trained to fit the residual obtained in the second step.
(iv) The current regression tree is added to the previous model, and the residual is updated.
(v) The GBM is trained iteratively until the maximum number of iterations (selected by the user) is reached.

The mathematical expressions and a brief description of the applied GBM algorithm are given in Algorithm 1 [63].
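The five GBM steps can be sketched as follows for the squared-error loss, for which the negative gradient is simply the residual. The use of depth-1 trees (stumps) as base learners, the shrinkage factor `learning_rate`, and the function names are assumptions made for brevity; the study does not report its base-learner depth or hyperparameters.

```python
import numpy as np


def fit_stump(x, r):
    """Depth-1 regression tree fitted to residuals r: (feature, threshold, left, right)."""
    best = (np.inf, 0, -np.inf, r.mean(), r.mean())
    for j in range(x.shape[1]):
        for thr in np.unique(x[:, j])[:-1]:              # candidate split points
            left = x[:, j] <= thr
            lv, rv = r[left].mean(), r[~left].mean()     # leaf values minimizing squared loss
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, thr, lv, rv)
    return best[1:]


def stump_predict(stump, x):
    j, thr, lv, rv = stump
    return np.where(x[:, j] <= thr, lv, rv)


def gbm_fit(x, y, n_iter=100, learning_rate=0.1):
    f0 = y.mean()                                        # step (i): best constant for squared loss
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(n_iter):                              # step (v): stop at the iteration limit
        residual = y - pred                              # step (ii): negative gradient
        stump = fit_stump(x, residual)                   # step (iii): tree fitted to residuals
        pred = pred + learning_rate * stump_predict(stump, x)  # step (iv): update the model
        stumps.append(stump)
    return f0, learning_rate, stumps


def gbm_predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

With a different loss function, only the residual line and the leaf values would change; the sequential structure stays the same.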

3.5. Statistical Evaluation Metrics

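The four applied models were compared and assessed to select the best model for predicting monthly evaporation. Five statistical criteria, root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (R), mean absolute percentage error (MAPE), and modified index of agreement (Md), were used to assess the models’ performances in the training and testing phases. The mathematical expressions of these criteria are given below [64]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(E_i^{\mathrm{obs}} - E_i^{\mathrm{pre}}\right)^2},$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|E_i^{\mathrm{obs}} - E_i^{\mathrm{pre}}\right|,$$

$$R = \frac{\sum_{i=1}^{N} \left(E_i^{\mathrm{obs}} - \bar{E}^{\mathrm{obs}}\right)\left(E_i^{\mathrm{pre}} - \bar{E}^{\mathrm{pre}}\right)}{\sqrt{\sum_{i=1}^{N} \left(E_i^{\mathrm{obs}} - \bar{E}^{\mathrm{obs}}\right)^2 \sum_{i=1}^{N} \left(E_i^{\mathrm{pre}} - \bar{E}^{\mathrm{pre}}\right)^2}},$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left|\frac{E_i^{\mathrm{obs}} - E_i^{\mathrm{pre}}}{E_i^{\mathrm{obs}}}\right|,$$

$$\mathrm{Md} = 1 - \frac{\sum_{i=1}^{N} \left|E_i^{\mathrm{obs}} - E_i^{\mathrm{pre}}\right|}{\sum_{i=1}^{N} \left(\left|E_i^{\mathrm{pre}} - \bar{E}^{\mathrm{obs}}\right| + \left|E_i^{\mathrm{obs}} - \bar{E}^{\mathrm{obs}}\right|\right)}.$$

In the above equations, $E_i^{\mathrm{obs}}$ and $E_i^{\mathrm{pre}}$ are the actual and predicted monthly evaporation values at the $i$th record, respectively, $\bar{E}^{\mathrm{obs}}$ and $\bar{E}^{\mathrm{pre}}$ are the mean observed and predicted monthly evaporation values, and $N$ is the number of records.

Algorithm 1: GBM algorithm.
 Input: Training data $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$,
 Loss function $L(y, f(x))$
 Output: regression tree $\hat{f}(x)$
(i) Initialization: $f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$
(ii) For $m = 1$ to $M$:
(a) For $i = 1$ to $N$, calculate the residuals: $r_{mi} = -\left[\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$
(b) Train a regression tree to fit the computed residuals ($r_{mi}$) and obtain the leaf node regions $R_{mj}$, $j = 1, \ldots, J$, of the $m$th tree
(c) For $j = 1$ to $J$, compute $c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L\big(y_i, f_{m-1}(x_i) + c\big)$
(d) Update the current model: $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$
(iii) Obtain the final additive model: $\hat{f}(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$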
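The five evaluation criteria of Section 3.5 can be computed with a short helper such as the one below. It is a sketch under two assumptions: MAPE is expressed as a fraction rather than a percentage (consistent with the magnitudes reported in the results, e.g., 0.095), and Md follows the common absolute-value form of the index of agreement. The function name `evaluation_metrics` is hypothetical.

```python
import numpy as np


def evaluation_metrics(obs, pred):
    """RMSE, MAE, MAPE (as a fraction), R, and modified index of agreement Md."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = obs - pred
    obs_mean = obs.mean()
    return {
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / obs)),          # undefined if any observation is zero
        "R": np.corrcoef(obs, pred)[0, 1],
        "Md": 1.0 - np.sum(np.abs(err))
        / np.sum(np.abs(pred - obs_mean) + np.abs(obs - obs_mean)),
    }
```

For instance, `evaluation_metrics([100, 200, 300], [110, 190, 300])` uses made-up numbers, not data from the study, and returns an MAE of about 6.67, a MAPE of 0.05, and an Md of 0.95.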

4. Results and Discussion

In this study, four machine learning modeling approaches were developed to select the best model for predicting monthly evaporation. The four models (QRF, ELM, GBM, and GPR) were trained and validated using climate data collected from two different locations in Iraq. About seventy percent of the available data were used for calibration, and the remaining thirty percent were used for validating the predictive models. The models were assessed using different statistical criteria as well as graphical presentations.

For the first case study, the performances of the applied models in the training phase are summarized in Table 1. The statistics showed that all the models except GPR (R = 0.938 and Md = 0.967) provided good agreement between the observed and predicted evaporation values. Furthermore, the GBM generated lower forecast errors than the other models (MAE = 14.170, RMSE = 23.092, MAPE = 0.095, R = 0.987, and Md = 0.993). The performances of the ELM and QRF models were very similar, with a slight advantage in favor of the ELM model, which provided smaller values of MAE and MAPE than the QRF model. Table 2 summarizes the models’ performances in the training phase for the second case study. Based on the obtained results, the GBM model showed an excellent ability to predict the monthly evaporation, providing the lowest estimated errors (MAE = 13.645, RMSE = 20.509, and MAPE = 0.058) and the highest prediction accuracy (R = 0.994, Md = 0.997). The second and third best models were ELM and QRF, respectively. The GPR was the worst-performing model because it gave the highest values of RMSE, MAPE, and MAE. It can be concluded that, in the training stage, the GPR had poor accuracy for both case studies, whereas the GBM model had a robust performance in simulating the evaporation rate for both case studies according to the obtained statistical parameters.

To assess the prediction accuracy of the applied models for the two case studies, boxplot diagrams were produced to show visually the similarity of the predicted values to the observed evaporation rates. The performances of the four applied models in predicting the monthly evaporation rate for the two case studies are illustrated in Figures 5 and 6, respectively. The clearest observation was the inability of the GPR model to generate evaporation estimates of acceptable accuracy. Moreover, this model could not provide a satisfactory prediction, especially for the higher and lower values of evaporation. By contrast, both figures illustrated that the GBM was superior because the calculated median for that model was very close to the actual value. Additionally, it simulated the higher and lower values of evaporation more successfully than the other models.

Although the GBM model succeeded in estimating the monthly evaporation during the training phase, it is essential to evaluate the proposed model with the testing dataset. As is well known, the training results may provide a misleading assessment because the model is trained using known inputs and their corresponding targets [65]. In addition, the testing phase is crucial in assessing the quality of the predictive models because it reveals their ability to generalize and to avoid overfitting [66].

The assessment of the applied models in the testing phase for the first case study (Diyala state) is presented in Table 3. The superiority of the GBM model in estimating the monthly evaporation compared to the other models is evident in the table. More specifically, the GBM model produced a satisfactory estimate with an RMSE of 28.478, MAE of 21.541, MAPE of 0.181, R of 0.976, and Md of 0.987. The QRF provided the worst prediction accuracy among the applied models. For the second case study (Erbil state), the performance of the GBM according to Table 4 was also superior, with lower estimated errors (MAE = 26.368, RMSE = 35.345, and MAPE = 0.130) as well as higher values of R (0.985) and Md (0.989).

The reported results for both case studies showed that the GBM significantly outperformed the other machine learning models. The superiority of this model can be measured by its capacity for reducing the MAE and RMSE for both stations during the testing phase (see Figure 7). For the first case study, GBM reduced the MAE and RMSE by 7.17% and 21.01%, 16.51% and 15.74%, and 23.14% and 26.64% compared to ELM, GPR, and QRF, respectively. For the second case study in Erbil state, GBM reduced the MAE and RMSE by 10.88% and 9.24%, 15.24% and 5%, and 16.06% and 15.76% compared to ELM, GPR, and QRF, respectively.
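The text does not state how the percentage enhancements were calculated. A common convention, assumed here, is the relative reduction of the error metric with respect to the model being compared against; the function name `percent_reduction` and the numbers in the usage note are hypothetical.

```python
def percent_reduction(error_baseline, error_gbm):
    """Relative reduction (%) of an error metric achieved by GBM over a baseline model."""
    return 100.0 * (error_baseline - error_gbm) / error_baseline
```

For example, with made-up values, a baseline MAE of 25.0 and a GBM MAE of 20.0 give `percent_reduction(25.0, 20.0) == 20.0`, i.e., a 20% enhancement.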

The visual assessment presented in Figures 8 and 9 showed that the monthly evaporation rates estimated by GBM for both stations in the testing phase were very close to the observed values. Moreover, statistical parameters such as the median and the highest and lowest values were very similar to the actual values. However, these figures showed that the GPR model had a poor performance in both case studies compared to the other models.

For further assessment, Taylor diagrams were created using the predicted values obtained from the four models for both stations (see Figures 10 and 11). The advantage of the Taylor diagram is that it compares the models with the actual data using three statistical parameters (standard deviation, root mean square error, and correlation coefficient) at once. The evaporation rates obtained from each model and the actual values are plotted on a polar diagram. It can be seen from the figures for both stations that the GBM model was located closer to the actual values than the other compared models.
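As a worked relation behind the diagram, and assuming the standard construction of the Taylor diagram (in which the root mean square error shown is the centered, i.e., bias-removed, root mean square difference $E'$), the three statistics are linked by the law of cosines:

$$E'^2 = \sigma_o^2 + \sigma_p^2 - 2\, \sigma_o\, \sigma_p\, R,$$

where $\sigma_o$ and $\sigma_p$ are the standard deviations of the observed and predicted evaporation (symbols introduced here for illustration) and $R$ is the correlation coefficient. This is why a single point per model can encode all three quantities: its radial distance gives $\sigma_p$, its angle gives $R$, and its distance from the observation point gives $E'$.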

5. Conclusions

The evaporation rate is a significant element in the hydrological cycle, and its process in nature is very complicated and stochastic. In this paper, the capability of artificial intelligence models, namely ELM, QRF, GBM, and GPR, was evaluated in the prediction of monthly evaporation at two stations located in Diyala and Erbil states, Iraq. The input parameters include meteorological data such as sunshine hours, minimum and maximum temperature, wind speed, rainfall, and relative humidity. The models were assessed using different statistical criteria as well as graphical plots. The findings of this study revealed that the GBM modeling approach has an excellent performance in the prediction of the monthly rate of evaporation at the two stations, with minimum forecasting errors. However, the QRF model showed the poorest performance compared with the other applied models. All in all, the achieved results showed that the suggested predictive model (GBM) is a promising technique for these regions; thus, it may assist local stakeholders in the management of water resources.

6. Recommendations

The recommendations for future research are as follows:
(i) This study recommends using the adopted GBM model to estimate the monthly evaporation rates at several stations located in the middle and southern parts of Iraq. This study showed that the GBM model had good prediction accuracy in areas located in the eastern and northern parts of Iraq. Thus, it is important to investigate the ability of this model to estimate evaporation in other regions.
(ii) The application of a feature selection tool is important for choosing the most appropriate input variables, thus reducing the model complexity [13,67].
(iii) The GBM model could be integrated with novel bioinspired algorithms to enhance its prediction performance, thereby producing more accurate predictions [68–70].

Data Availability

The data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.