Abstract

The precise estimation of solar radiation is of great importance for the installation and capacity planning of solar energy applications. Various computer-based and experimental methods and techniques are employed to build estimation models for selected target locations. In the present study, four Machine Learning (ML) algorithms, namely the Multilayer Feed-Forward Neural Network (MFFNN), k-Nearest Neighbors (k-NN), a Library for Support Vector Machines (LibSVM), and M5 rules, were used to estimate the hourly average solar radiation of two geographic locations on the same latitude. The input variables with the greatest impact on solar radiation were identified and grouped on the basis of 29 different applications developed with 6 different feature selection methods in the Waikato Environment for Knowledge Analysis (WEKA) software. Estimation models were developed for each target location by using both the selected data groups and all input variables. The results show that the models developed with feature selection were more successful for the target locations and that the radiation potentials of the two locations were similar. The performance of the estimation models was evaluated by comparing the models with one another on different statistical indicators and with previous studies. According to the RMSE, MAE, $R^2$, and SMAPE statistical scales, the results of the most successful estimation models, which were developed with MFFNN, were 0.0508-0.0536, 0.0341-0.0352, 0.9488-0.9656, and 7.77%-7.79%, respectively.

1. Introduction

Energy is a key factor in the development of countries, and the demand for it is increasing rapidly with industrialization, technological advances, and population growth. Not every country has adequate energy resources to meet this demand, and the rapid increase in energy consumption forces countries to turn to alternative sources of supply. For this reason, countries prefer renewable energy sources such as solar, wind, hydro, bio, hydrogen, geothermal, and tidal energy over conventional energy sources to meet their energy needs [1]. Solar energy, which plays an increasingly critical role in electricity generation, has become one of the promising renewable energy sources attracting the attention of countries because it is clean, unlimited, and sustainable compared to fossil fuels. As a result, investments in solar energy for electricity generation have increased rapidly in recent years, driven by technological advances in solar energy, global climate change, dependence on other countries, and other environmental factors. In this context, photovoltaic (PV) technology, one of the application areas of solar energy, is used intensively to produce electricity [2, 3].

PV, which is used reliably in electricity generation, has been growing rapidly worldwide for more than 40 years, and the installed capacity of PV power plants has reached 480 GW [4]. Before a PV system is designed and modelled for a selected geographical area, solar radiation (SR) data must be measured as the most important input value, because the investment feasibility of the design is evaluated according to these data. This value is necessary not only in PV design but is also the most important parameter in many scientific and engineering works on solar energy [5]. For this reason, the most accurate approach is to obtain long-term measured data for the selected geographic area. However, measuring SR everywhere is often not possible, as it requires costly, long, and precise processes. In addition, radiation values cannot be measured accurately in most countries because measurements can only be made in certain areas. For this reason, experimental, statistical, and Artificial Intelligence- (AI-) based estimation methods have been developed to calculate the value of SR worldwide [6–8]. ML algorithms, which form a subfield of AI, are among the most common methods used in estimation studies.

Many studies based on ML algorithms have been conducted in recent years to estimate SR in different geographic areas of the world. In these studies, algorithms including Artificial Neural Network (ANN), Support Vector Machines (SVMs)/Support Vector Regression (SVR), k-NN, Linear/Nonlinear Regression, M5, and Random Forests have been used frequently [9]. However, the estimation models in these studies were developed for a specific geographical area of a country or for different geographical locations within a country [10]. Before estimation models are developed for a selected geographic location, it must be decided whether hourly, daily, or monthly average values of the global radiation falling onto a horizontal surface will be used [11]. Notton et al. [12] recommended monthly average data for preliminary modelling or draft design and daily average data for a more comprehensive design. However, they also indicated that hourly average or shorter-scale data are necessary for more precise and result-oriented designs. Zhang et al. [13] explained that estimation with hourly data is more difficult and complex than estimation with daily or monthly data. For this reason, estimation models based on hourly data are less common. After the input data are determined according to the type of work to be carried out, the SR values of the target area can be estimated by using one [14], multiple [15], or hybrid [16] ML algorithms. In the models developed, different solutions to Global Solar Radiation (GSR) estimation problems have been sought by changing the functional structure and architecture of a single ML algorithm, by comparing multiple algorithms, or by combining two or more AI methods.

Studies that use ML methods for GSR estimation can be classified into three categories according to the measurement time interval of SR: Monthly Average Global Solar Radiation (MAGSR), Daily Average Global Solar Radiation (DAGSR), and Hourly Average Global Solar Radiation (HAGSR). HAGSR- [17, 18], DAGSR- [19–22], and MAGSR- [23–26] based estimation models have been developed by using a single ML algorithm, and the ANN algorithm has been used more frequently than other algorithms because of its flexible structure and accuracy. On the other hand, studies in which multiple ML algorithms are analyzed and used together are increasing rapidly. Such studies give a clear idea of the effectiveness of each ML algorithm on the dataset used and allow the most successful models to be compared and evaluated. In this context, Pang et al. [27] estimated GSR comparatively by using ANN and Recurrent Neural Network (RNN) ML algorithms at 10-, 30-, and 60-minute time intervals. Li et al. [28] developed estimation models with the Multivariate Adaptive Regression Spline (MARS) ML algorithm to estimate HAGSR from seven years of measured hourly data in Hong Kong and compared the results with those of the ANN and logistic regression algorithms. They reported that ANN achieved superior performance compared to the other algorithms. Khosravi et al. [29] developed estimation models for HAGSR with two different network groups on the Iranian island of Abu Musa by using MFFNN, Radial Basis Function Neural Network (RBFNN), SVR, Fuzzy Inference System (FIS), and Adaptive Neuro-Fuzzy Inference System (ANFIS) ML algorithms. The first network was designed with five inputs and the second with a single input, and it was reported that SVR reached higher estimation accuracy than the other algorithms on both networks. Lotfinejad et al. [30] investigated the DAGSR of different cities of Iran by using Bat Neural Network (BNN), Generalized Regression Neural Network (GRNN), and Neuro-Fuzzy (NF) algorithms. They reported that the models developed with the recommended BNN algorithm performed better than those developed with the other algorithms. Meenal and Selvakumar [31] compared DAGSR estimation models based on SVM, ANN, and experimental models by identifying the most suitable input variables among nine input data from four different cities of India and showed that SVM was more successful than the other approaches. Loutfi et al. [32] developed ten different HAGSR estimation models for the city of Fes, Morocco, from nine different input variables covering the five years from 2010 to 2014 by using Multilayer Perceptron (MLP) and Neural Autoregressive with Exogenous Inputs (NARX) algorithms. They contended that the most successful of these models was the one developed with NARX. Lazzaroni et al. [33] used three years of hourly data from Milan, Italy, to compare GSR estimation models developed for hourly, daily, and monthly time intervals with the SVR and Extreme Learning Machine (ELM) ML algorithms against the k-NN algorithm. Long et al. [34] investigated the estimation of DAGSR by using ML-based ANN, k-NN, SVM, and Multivariate Linear Regression (MLR) algorithms and made a comparative analysis of data-driven algorithms. Ozgoren et al. [35] compared estimation models developed with the ANN and Multi-Nonlinear Regression (MNLR) algorithms to estimate MAGSR in 31 cities of Turkey by using five years of input data collected between 2002 and 2006. Moghaddamnia et al. [36] estimated DAGSR from different meteorological parameters of Britain's Brue Basin by using the Local Linear Regression (LLR), NARX, MLP, Elman Network, and ANFIS ML algorithms.

The purpose of the present study was to analyze comparatively the HAGSR of two provinces located on the same latitude in the Mediterranean Region by using four different ML algorithms: MFFNN, k-NN, M5 rules, and the SVR-based LibSVM library. Another purpose was to use the WEKA software to determine the features of the input data that have the most impact on SR. To this end, the best features were determined in five groups by developing twenty-nine different applications with six feature selection functions. The final models of the ML algorithms used in the study were developed according to the output groups of the feature selection functions. The HAGSR estimation models were evaluated against one another and by algorithm, and the results were then compared with similar studies. In addition, unlike previous studies, the present study developed estimation models and evaluated their performance with both the classical SVR algorithm and the LibSVM software, which are similar to each other. The framework outlining how the data mining processes and the four ML algorithms are used in this study to evaluate the solar radiation potential of two provinces on the same latitude is shown in Figure 1. The WEKA software was used for data mining processes such as data preprocessing and feature selection, and Matlab R2017b software was used in the modelling studies developed with the ML algorithms for SR estimation.

The rest of the study is organized as follows. Section 2 defines the provinces for which the models were developed and the meteorological and categorical dataset used in the study. The same section also explains the feature selection processes used to determine the most appropriate input data groups, the methodologies of the MFFNN, SVR-LibSVM, k-NN, and M5 rules algorithms, the architectural and functional structures of the developed models, and the methods applied. Section 3 presents the results and comparative analyses of the estimation models developed with each ML algorithm for each input data group. The HAGSR estimation performance for the two provinces, which are located on the same latitude, is evaluated with multiple statistical error measures and compared with previous similar studies. The results of this study and its contribution to the literature are summarized in Section 4.

2. Materials and Methods

In this section, the main characteristics of the two selected provinces and the preparation of the data used in the ML models are explained. Then, the procedures for selecting the most effective input groups through feature selection are described. At the end of the selection process, the input data were arranged in five different groups, and the development of the best ML models is explained for each group. In addition, the structural characteristics of the ML algorithms used in the comparative estimation of HAGSR and the statistical scales used to evaluate model performance are discussed in detail.

2.1. Study Area and Preparation of Database

The provinces of Kahramanmaras and Isparta were selected as the study areas by considering their climatic characteristics, elevation, other geographical characteristics, and, in particular, latitude. The selected provinces are located in the Mediterranean Region and have a high solar power potential, with an average annual sunshine duration of 2956 hours and an average annual solar energy of 1390 kWh/m² [37]. The locations of the selected provinces in the Mediterranean Region and the latitude coordinates of the meteorological measurement stations are given in Figure 2.

Radiation data are the most important parameter used in solar energy-based systems. However, radiation cannot be measured at every measuring station across the country; it is measured only at a limited number of stations. SR is measured for certain locations by the Turkish State Meteorological Service (TSMS), a government agency with a large network of stations in Turkey. In this study, the data collected for the target provinces consisted of meteorological data measured by TSMS between 2002 and 2006. The data were hourly averages of measurements taken every 5 minutes at the measuring stations and comprised Hourly Pressure, Hourly Sunshine Duration (HSD), Hourly Humidity, Hourly Temperature, Hourly Wind Speed (WS), and Hourly Solar Irradiance (HSI). 3D plots showing the monthly and seasonal variation of SR over the measurement years for Kahramanmaras and Isparta are given in Figures 3 and 4. These charts show the annual distributions of the measured SR values in detail on an hourly basis. The specific geographical and meteorological characteristics of the target provinces are given in Table 1.

Data preprocessing, which improves the quality of the raw data used in a study, is one of the most important processes and has a direct positive effect on the performance of all computer science-related algorithms [38]. Since ML algorithms are generally data-driven, operations such as cleaning, scaling, reduction, and normalization have significant effects on the accuracy of the estimation [39]. In the present study, four categorical variables were added to the meteorological dataset: year of measurement (year), month of the year (month), day of the month (day), and hour of the day (hour). Geographical data were not used since the effect of latitude was being evaluated. The measurement time intervals of the other meteorological data were determined according to the HSI measurement time intervals. The data between 06:00 and 17:00 for January and February; between 06:00 and 18:00 for March, April, October, November, and December; and between 07:00 and 19:00 from May to September were selected. Factors such as differences in measurement times between years, the variability of the measured time intervals in each month, and the change between winter time and summer time were decisive in selecting the time ranges. Missing values were filled with the arithmetic average of the data in the same time frames of the previous and following years, and the values calculated in this way constituted approximately 4% of all data. A total of 23442 SR data were obtained for each province. After the raw data were arranged, min-max normalization was applied to scale them. The normalization formula is given in equation (1), in which each input value ($x$) is normalized ($x_n$) linearly to the range between 0 and 1 by using the minimum ($x_{\min}$) and maximum ($x_{\max}$) values of the raw dataset [40].

$$x_n = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
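The min-max scaling described above can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the handling of constant columns, and the sample temperature values are assumptions made for illustration; they are not taken from the study's Matlab or WEKA workflow.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly temperature readings in degrees Celsius (not TSMS data).
temperature = [4.0, 10.0, 16.0, 28.0]
print(min_max_normalize(temperature))
```

In practice the minimum and maximum would be computed once per input variable over the raw dataset and then reused for every record of that variable.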

When a fixed range of years is used to train a model and the remaining years are used to test it, different estimation results are obtained for each choice of years. For this reason, instead of year-based training and test sets, all available hourly data were allocated randomly to the training and test sets with a specifically coded program. The aim was to keep the output of the estimation models from being affected by the data selection by providing a homogeneous distribution of the input data across years. The number and basic characteristics of the hourly training and test data determined for each province are given in Table 2.
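A random, record-level allocation of this kind could be sketched as follows. The helper `random_split`, the test fraction of 0.2, and the fixed seed are hypothetical choices for illustration only; the actual training and test counts are those reported in Table 2.

```python
import random


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle hourly records and split them into training and test sets."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(records)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (year, month, day, hour) keys standing in for full rows.
records = [(year, 6, 15, hour) for year in range(2002, 2007) for hour in range(7, 20)]
train, test = random_split(records)
print(len(train), len(test))
```

Because the shuffle ignores the year field, each year is expected to contribute to both sets, which is the homogeneity the paragraph above aims for.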

2.2. Selection of the Best Input Data Groups with WEKA

Since the data pool used in ML-based GSR estimation studies is quite extensive, the characteristics of the data and their relations with each other affect the output performance of the models. Some data have positive effects on the output, whereas others have negative effects or no effect at all. For this reason, determining the most effective data features prior to modelling decreases the dimensionality of the data, facilitates the interpretation of the estimation, shortens the modelling process, and increases the estimation accuracy [10, 41]. The methods and techniques used to select the features that affect SR the most, together with the methods and applications used in the current study, are compared with similar ML-based studies in Table 3. In commonly used feature selection methods, the features that have the greatest effect on the SR data are found and a new input dataset is determined. Unlike previous studies, this study created different input data groups by evaluating the input parameters affecting SR with multiple applications developed for each selection method.

In this study, the open-source WEKA program was used to select the features that most affect SR. This program was developed at Waikato University in the Java programming language. Feature selection methods based on two approaches, wrapper and filter, were used in the program to select the features that most influenced the SR data. Whereas the filter approach uses simple, fast, and scalable methods, the wrapper approach processes the data by using classifier-based techniques. In the selection processes, the relations between the features selected in each application and the classifier models were evaluated [43].

Instead of processing the data with a single feature selection method, the effect of the input variables on SR was evaluated by developing multiple applications with each of six different feature selection functions based on the filter and wrapper approaches. Three of these are wrapper-based functions, and the other three are filter-based functions. Classifier Subset Evaluator (CSE), Wrapper Subset Evaluator (WSE), and Classifier Feature Evaluator (ClassAE) are based on the wrapper approach, whereas Correlation-based Feature Selection Subset Evaluator (CfsSE), Correlation Feature Evaluator (CorrAE), and Relief Feature Evaluator (RAE) are filter-based selection functions. In addition, two basic search strategies (i.e., random and comprehensive search) were used depending on the type of feature selection function. Some selection functions support multiple search methods, and others support only one. The search methods used were Best-First (BF), Evolutionary Search (ES), Firefly Search (FS), Elephant Search (ELS), Ant Search (AE), Linear Forward Selection (LFS), Greedy Stepwise (GS), and Ranker. Furthermore, ten different ML algorithms, namely Multi-Layer Perceptron (MLP), Linear Regression (LR), Simple Linear Regression (SLR), M5 rules, M5P, Decision Table (DT), Random Forest (RD), Additive Regression (AR), Elastic Net Regularization (ENR), and k-Nearest Neighbors Classifier (IBk), were used as classifiers in the wrapper-based feature selection functions.
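To make the filter approach concrete, the sketch below ranks input variables by the absolute value of their Pearson correlation with the target, which is the idea behind a correlation-based ranker such as CorrAE. It is a simplified stand-in and not WEKA's implementation; the function names `pearson` and `rank_features` and the toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relation with the target
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, |r|) pairs sorted from most to least correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, already normalized toy data (not the TSMS measurements).
features = {
    "HSD": [0.0, 0.2, 0.6, 1.0, 0.8],
    "WS": [0.5, 0.1, 0.4, 0.2, 0.6],
    "hour": [0.0, 0.25, 0.5, 0.75, 1.0],
}
radiation = [0.05, 0.25, 0.60, 0.95, 0.70]
ranking = rank_features(features, radiation)
for name, score in ranking:
    print(name, round(score, 3))
```

A wrapper-type function differs in that it scores candidate feature subsets by training a classifier on each of them, which is slower but accounts for interactions between features.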

In the selection process of the most effective input variables, 29 different applications were developed by using a total of six different feature selection functions. A 10-fold cross-validation method was used in all applications. The screenshot of the application developed by using the CfsSE feature selection function with the BF search method is given in Table 4. It is seen in the table that the year, , and HSD data were most effective on SR, and the other data had no effect.

Six data groups, one for each feature selection function, were created from the variables that most affected SR. The input variables that affected SR the most according to each selection function are given in Tables 5 and 6 for the provinces for which the models were developed. Where more than one application was run for a function, the most effective features were determined by evaluating the number of applications in which each variable was selected and its total impact. A feature was not included in the data group if its impact on SR was negative, neutral, or very low. Some selection functions produced similar feature variables and similar numbers of inputs. In particular, the results of the CSE and WSE feature selection functions were similar for Isparta, and the results of the CSE and RAE feature selection functions were similar for Kahramanmaras.

The final data groups obtained from the feature selection processes for each province, together with their numbers of features, are given in Table 7. These groups were used in the estimation models developed with the ML algorithms.

2.3. MFFNN Algorithm

ANN is an ML algorithm inspired by the nerve cells of the human brain. It is regarded as a computer-modelled version of the biological and intellectual structure of the brain and is used frequently for nonlinear estimation problems that cannot be solved with classical calculation methods, as well as for time series problems, pattern recognition, and classification [44]. Over the past 50 years, many neural network architectures based on Feed-Forward and Recurrent Networks have been developed for various purposes and fields. Not every architecture reaches the same level of success on a given set of input data [45]. For this reason, the MFFNN ML algorithm, which is based on the Feed-Forward architecture, suits the available data structure, and exhibits high performance, was used. Because MFFNN is trained with the backpropagation learning algorithm to minimize error, its learning success is greatly increased [25]. The Matlab R2017b software was used to develop and model this network. The architectural structure and working principles of the MFFNN used in the modelling studies are given in Figure 5.

GR1-GR5 represent the input data groups selected at the end of the feature selection process, and GR6 represents all input data without any selection. The neural network was built with three layers, and for each hidden-layer size from 1 to 50 neurons the network was trained five times. No significant improvement in performance was observed with more than 50 neurons, while the running time became considerably longer. In the developed MFFNN models, each input ($x_i$) connected to a neuron was multiplied by a weight value ($w_{ij}$), the bias ($b_j$) was added, and the net input value ($net_j$) was calculated. The formula of the net input is given in equation (2):

$$net_j = \sum_{i=1}^{n} w_{ij}\, x_i + b_j \tag{2}$$

Once the net input is calculated, it is passed through a transfer function [46]. The hyperbolic tangent sigmoid transfer function (Tansig) was used both between the input and hidden layers and between the hidden and output layers. Tansig scales the net input values to the range from -1 to +1. When the transfer function was being selected, the logistic sigmoid (Logsig) or Tansig function was found to be suitable for the hidden layer, and the Tansig or Linear (Purelin) function for the output layer. Choosing any other function significantly reduced the performance. The formula of the Tansig transfer function is given in equation (3), where $net_j$ denotes the net input of neuron $j$:

$$\operatorname{tansig}(net_j) = \frac{2}{1 + e^{-2\,net_j}} - 1 \tag{3}$$
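The net input and Tansig activation described above can be traced with a small forward pass in plain Python. This is only a sketch under stated assumptions: the layer sizes, weights, and biases are hypothetical values, not parameters of the trained Matlab models, and training with Trainlm is not shown.

```python
import math


def tansig(n):
    """Hyperbolic tangent sigmoid; mathematically identical to tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def layer_forward(inputs, weights, biases):
    """Return tansig(sum_i w_i * x_i + b) for every neuron in one layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(tansig(net))
    return outputs


# Hypothetical network: 3 normalized inputs, 2 hidden neurons, 1 output neuron.
x = [0.2, 0.7, 0.5]
hidden = layer_forward(x, weights=[[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]], biases=[0.05, -0.1])
output = layer_forward(hidden, weights=[[1.2, -0.7]], biases=[0.0])
print(hidden, output)
```

Each row of `weights` holds the incoming weights of one neuron, so adding hidden neurons only means adding rows and matching bias entries.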

The Levenberg-Marquardt Backpropagation (Trainlm) training function was used in the MFFNN. Other training functions such as Trainbr (Bayesian Regularization Backpropagation) and Traincgb (Conjugate Gradient Backpropagation) were also tested. However, since the best performance was provided with Trainlm, this training function was selected.

2.4. SVR Algorithm

SVM is an ML algorithm that was developed by Vapnik and is commonly used in classification problems. With SVM, the smallest subset of the training data is used to find the best prediction model between two classes [47]. However, since SVM was not adequate for multiclass estimation problems, the SVM-based SVR method was developed. SVR addresses regression problems by calculating a linear regression function in a multidimensional feature space [48]. The architectural structure of the SVR used in the modelling studies is given in Figure 6.

In the SVR algorithm, the margin between the data is kept wide, the optimal position of the regression function is found, and errors are minimized. In a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i$ represents the input vector, $y_i$ represents the corresponding output value, and $N$ represents the total number of elements [49]. The formula of the SVR linear function is given in equation (4):

$$f(x) = w \cdot \varphi(x) + b \tag{4}$$

In the SVR linear function, $\varphi(x)$ represents the nonlinear mapping function that maps the input data into a high-dimensional feature space, $w$ represents the weight vector, and $b$ represents the bias. The error function is given in equation (5). The constant $C$ and the value $\varepsilon$ are determined by the user and govern the estimation accuracy on the training data.
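As a point of reference, one common textbook form of the regularized $\varepsilon$-insensitive error function is sketched below for $N$ training pairs $(x_i, y_i)$ and estimates $f(x_i)$. The notation and the scaling of the penalty term are assumptions here and may differ from the formulation the authors used for equation (5).

$$R(C) = \frac{1}{2}\lVert w \rVert^2 + C\,\frac{1}{N}\sum_{i=1}^{N} L_\varepsilon\bigl(y_i, f(x_i)\bigr), \qquad L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl(0,\; \lvert y - f(x)\rvert - \varepsilon\bigr)$$

In this form, deviations smaller than $\varepsilon$ incur no penalty, and $C$ sets the trade-off between the flatness of the function and the training error.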

The function that minimizes the error function is given in equation (6), where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. Training vectors whose multipliers are nonzero are referred to as support vectors, and they are the critical elements of the SVR algorithm [50]. $K(x_i, x_j)$ is called the kernel function and transforms the input data into the required form. Different types of kernel functions are used in SVR. Three types of kernels, i.e., Polynomial (POL), Normalized Polynomial (NOR-P), and Gaussian Radial Basis Function (RBF), were used in the models developed with SVR, and the formulas of these functions are given in equations (7)–(9), respectively.
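The three kernel types and the kernel-based SVR estimate can be sketched in plain Python as follows. The kernel definitions follow common textbook forms, and the parameter names (`degree`, `gamma`), the offset of 1 in the polynomial kernel, and all numeric values are assumptions for illustration; the parameterization used in the study may differ. The helper `svr_predict` assumes the dual coefficients (the differences of the Lagrange multipliers) are already known and does not perform training.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, degree=2):
    """Polynomial kernel (assumed form): (u . v + 1) ** degree."""
    return (dot(u, v) + 1.0) ** degree


def normalized_poly_kernel(u, v, degree=2):
    """Polynomial kernel rescaled so that K(u, u) equals 1."""
    return poly_kernel(u, v, degree) / math.sqrt(
        poly_kernel(u, u, degree) * poly_kernel(v, v, degree)
    )


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i c_i * K(x_i, x) + b, where c_i = alpha_i - alpha_i*."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical support vectors and coefficients (not fitted to any data).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias=0.1, kernel=rbf_kernel)
print(prediction)
```

Passing the kernel as a function argument makes it straightforward to compare POL, NOR-P, and RBF on the same support vectors.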

In addition to the classic SVR algorithm, estimation models were developed with LibSVM, an SVM/SVR-based software library that supports single-class SVMs, two-class and multiclass SVMs, and SVRs [51]. LibSVM was preferred because it is used quite frequently in academic studies but rarely in SR prediction studies. Two different SVR types, Epsilon-SVR (E-SVR) and Nu-SVR, were used in the estimation models developed with LibSVM, and RBF was used as the kernel. All prediction models were developed in Matlab R2017b by using the LibSVM library interface plugin.

2.5. KNN Algorithm

This algorithm is widely preferred in classification problems; in the present study, however, its regression-based form was used. KNN is a nonparametric, lazy learning ML algorithm that produces an estimate by searching for the closest neighbors in the training dataset. KNN's nonparametric equation is given in equation (10), where $N_k(x)$ denotes the set of the $k$ nearest neighbors of the data point $x$ and $y_i$ represents the target output of each training data point.

$$\hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \tag{10}$$

With KNN, each new data point to be estimated is examined in the neighborhood of its $k$ nearest points among the previous data. The distance between the new data value and all values in the training dataset is calculated, and the $k$ nearest training data values are determined. The estimate is the average of the target output values of these neighbors [52, 53]. The Euclidean function was used to calculate the distance; its formula for a query point $x$ and a training point $x_i$ with $m$ features is given in equation (11). Care should be taken in choosing the value of $k$; small values should be used because the estimation accuracy of the model deteriorates if the selected value is too high [54]. In the present study, $k$ was taken as 1, 2, 3, 4, 6, and 10, and six different KNN models were developed for each data group. Values of $k$ above 10 were not used because model performance deteriorated. The linear nearest neighbor search (LinearNN), which is a brute-force search algorithm, was also used in the study. With this method, the distances between the pairs of points in the dataset were computed.

$$d(x, x_i) = \sqrt{\sum_{j=1}^{m} \left(x_j - x_{ij}\right)^2} \tag{11}$$
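The regression form of KNN with a brute-force Euclidean search can be written in a few lines of Python. This is a minimal sketch: the function names `euclidean` and `knn_regress` and the one-dimensional toy data are hypothetical, and no tie-breaking rule beyond sorting is implied by the study.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_regress(query, train_x, train_y, k=3):
    """Average the targets of the k training points closest to the query."""
    # Brute-force search: the distance to every training point is computed.
    distances = sorted((euclidean(query, x), y) for x, y in zip(train_x, train_y))
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical normalized training data with a single input feature.
train_x = [[0.0], [0.1], [0.5], [0.9], [1.0]]
train_y = [0.0, 0.2, 0.5, 0.8, 1.0]
print(knn_regress([0.05], train_x, train_y, k=2))
print(knn_regress([0.9], train_x, train_y, k=1))
```

Because the model stores the training set and defers all computation to query time, prediction cost grows with the size of the training data, which is what the term lazy learning refers to.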

2.6. Rule-Based M5 Algorithm

The M5 algorithm was developed by Quinlan as an advanced version of the Classification and Regression Tree (CART) [55]. It is based on a binary decision tree that relates the dependent and independent variables and fits a linear regression model at each leaf to estimate the value of the samples reaching that leaf. The algorithm thus rests on two structures, the decision tree and linear regression. The best leaf is determined as the rule in the M5 algorithm, and the tree is built in two stages, dividing and pruning. In the dividing operation, the dataset at hand is divided into subsets to create a decision tree. In this process, continuous numerical values are estimated by using a linear regression function in the leaf nodes [56]. The standard deviation of the values reaching a node is used as the error measure at that node, and the expected reduction in this error is calculated for each feature. The division ends if there is little change in the values of the samples that reach a node or if the number of samples becomes too small [57]. The Standard Deviation Reduction (SDR) formula is given in equation (12), where $T$ is the set of values reaching the node, $T_i$ is the subset of values assigned to the $i$th branch of the divided node, and std is the standard deviation [57].

$$SDR = \operatorname{std}(T) - \sum_{i} \frac{\lvert T_i \rvert}{\lvert T \rvert}\, \operatorname{std}(T_i) \tag{12}$$
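The SDR criterion can be checked numerically with a short Python sketch. The use of the population standard deviation and the toy target values are assumptions for illustration; the function names `std` and `sdr` are hypothetical and not part of WEKA.

```python
import math


def std(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def sdr(parent, subsets):
    """Standard deviation of the parent minus the size-weighted std of the subsets."""
    return std(parent) - sum(len(s) / len(parent) * std(s) for s in subsets)


# Hypothetical normalized SR values reaching a node.
parent = [0.1, 0.2, 0.8, 0.9]
good_split = sdr(parent, [[0.1, 0.2], [0.8, 0.9]])  # separates low from high values
poor_split = sdr(parent, [[0.1, 0.9], [0.2, 0.8]])  # mixes low and high values
print(good_split, poor_split)
```

The split that groups similar target values yields the larger reduction, so it would be the one preferred at that node.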

A rule-based variant of the M5 algorithm was used in the present study. In this method, which is also known as M5 rules, a series of M5 trees is created. In each cycle, the best leaf is kept as a rule, and the samples covered by that rule are removed from the training dataset before the next tree is built. While the M5 algorithm creates one single decision tree, M5 rules create a complete tree in each cycle. M5 rules develop a series of rules based on the M5 algorithm by using the Partial and Regression Tree (PART) algorithm [58].

2.7. Performance Analysis of ML Algorithm Models

Widely used statistical error measures and analysis methods were employed to evaluate the performance of the models developed with the ML algorithms in predicting SR, both individually and relative to one another. The Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Symmetric Mean Absolute Percentage Error (SMAPE) are the error measurement statistics used in the study. Two statistical analysis measures, the Correlation Coefficient ($R$) and the Coefficient of Determination ($R^2$), were also used. The formulas of these statistical scales are given in equations (13)–(18), respectively. In the formulas, $y_i$, $\hat{y}_i$, $\bar{y}$, and $\bar{\hat{y}}$ are the measured value, the estimated value, the average of the measurements, and the average of the estimates, respectively.
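The six statistical scales can be sketched in Python as shown below. The definitions follow widely used conventions, but SMAPE and $R^2$ have several variants in the literature, so the exact forms here are assumptions and may differ from the study's equations (13)–(18). In particular, the rule that a pair with two zero values contributes no SMAPE error is a choice made for this sketch, and the sample values are hypothetical.

```python
import math


def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def smape(y, y_hat):
    """Symmetric MAPE in percent; pairs where both values are zero add no error."""
    total = 0.0
    for a, b in zip(y, y_hat):
        denom = (abs(a) + abs(b)) / 2.0
        if denom > 0:
            total += abs(a - b) / denom
    return 100.0 * total / len(y)


def pearson_r(y, y_hat):
    """Correlation coefficient between measured and estimated values."""
    mean_y, mean_h = sum(y) / len(y), sum(y_hat) / len(y_hat)
    cov = sum((a - mean_y) * (b - mean_h) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_h = sum((b - mean_h) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_h)


def r_squared(y, y_hat):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical normalized measured and estimated SR values.
measured = [0.0, 0.2, 0.5, 0.9]
estimated = [0.0, 0.25, 0.45, 0.85]
for metric in (mse, rmse, mae, smape, pearson_r, r_squared):
    print(metric.__name__, metric(measured, estimated))
```

The first pair in the sample data is zero in both series, a case in which an ordinary percentage error would require division by zero.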

Percentage errors are widely used to compare estimation performance across datasets. The Mean Absolute Percentage Error (MAPE), a scale-independent measure of estimation error, gives misleading results when the measured values are zero or very close to zero, because each absolute error is divided by the measured value [59]. The SMAPE percentage scale was used to overcome this problem since the measurement and estimation results in this study included values that were zero or quite close to zero.
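The difference between the two percentage scales near zero can be seen in a small numerical sketch. The Python code below is illustrative only: the function names `mape` and `smape`, the SMAPE variant (absolute error divided by the mean of the absolute measured and estimated values), the convention of returning infinity when a measured value is exactly zero, and the toy values are all assumptions, not the study's implementation or data.

```python
import math


def mape(measured, estimated):
    """Mean absolute percentage error in percent.

    Assumption: returns infinity if any measured value is exactly zero,
    since the ratio is undefined there.
    """
    terms = []
    for m, e in zip(measured, estimated):
        if m == 0:
            return math.inf
        terms.append(abs(m - e) / abs(m))
    return 100 * sum(terms) / len(terms)


def smape(measured, estimated):
    """Symmetric MAPE in percent, bounded between 0 and 200.

    Assumption: a pair in which both values are zero contributes no error.
    """
    terms = []
    for m, e in zip(measured, estimated):
        denom = (abs(m) + abs(e)) / 2
        terms.append(0.0 if denom == 0 else abs(m - e) / denom)
    return 100 * sum(terms) / len(terms)


# Toy normalized radiation values: one near-zero reading and one midday reading.
measured = [0.001, 0.5]
estimated = [0.011, 0.45]
print(mape(measured, estimated))   # dominated by the near-zero reading
print(smape(measured, estimated))  # stays within the 0-200% range
```

In this toy case a small absolute error on the near-zero reading inflates MAPE to several hundred percent, whereas SMAPE remains bounded, which is the behaviour that motivates the choice of SMAPE for hourly data containing zero or near-zero radiation values.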

3. Results and Discussion

Although SR estimation studies use input data from many areas and locations, it is not common to evaluate SR at locations that lie on the same latitude and have similar geographical characteristics. Based on the effect of latitude on sunshine duration and on the angle of incidence of solar rays, Darhmaoui and Lahjouji [60] calculated that the annual solar radiation values were at similar levels at points on the same latitude of a geographical area, with a strong relationship between optimum tilt angle and target latitude. Ahlgren et al. [61] emphasized the relationship between annual yield and latitude because there was a directly proportional relationship between latitude and direct normal radiation at the sites where parabolic trough collectors were located. For this reason, places on the same latitude were selected in the target area, and ML algorithms were employed for high-accuracy GSR estimation. The estimated results of the data groups were compared by using statistical error measurement and analysis methods, including SMAPE, MAE, RMSE, and the correlation-based statistic, to evaluate the training and testing estimation performance of the developed models. The closer the error measures between the measured and estimated values are to 0, and the closer the analysis statistics are to 1, the higher the estimation accuracy of the developed models [40]. The flowchart of the HAGSR estimation processes of both provinces is given in Figure 7.

Different features were used for each data group to estimate SR with the MFFNN algorithm by employing the data groups GR1-GR6. During the training process, the number of hidden layer neurons was varied from 1 to 50 with five iterations for each value, so that 250 different models were developed for each input group and 1500 different models in total for each province. The network performance plots of the models that reached the best estimation results in the training process of both provinces, over the range of 0 to 1000 epochs, are shown in Figure 8. The training, validation, and testing SR estimates of each neural network model were evaluated statistically, and the results of the most successful MFFNN estimation models are given in Table 8.
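The model search described above (hidden layer sizes from 1 to 50, five iterations per size, six data groups) can be sketched as a simple enumeration. The Python code below only illustrates the bookkeeping, not the network training itself: `search_models` and the `train_and_score` callback are hypothetical names, and the callback is assumed to train one network and return an error score for which lower is better (for example, a validation RMSE).

```python
from itertools import product


def search_models(groups, train_and_score, neurons=range(1, 51), repeats=range(1, 6)):
    """Enumerate every (group, hidden neurons, iteration) combination.

    `train_and_score(group, n_hidden, repeat)` is a hypothetical callback that
    trains one model and returns an error score (lower is better).
    Returns (number of models trained, best result per group), where each best
    result is a tuple (score, n_hidden, repeat).
    """
    best = {}
    n_models = 0
    for group, n_hidden, repeat in product(groups, neurons, repeats):
        score = train_and_score(group, n_hidden, repeat)
        n_models += 1
        if group not in best or score < best[group][0]:
            best[group] = (score, n_hidden, repeat)
    return n_models, best


# Stand-in scorer used only to exercise the loop; it is not a real training routine.
def dummy_score(group, n_hidden, repeat):
    return abs(n_hidden - 40) + 0.1 * repeat


groups = ["GR1", "GR2", "GR3", "GR4", "GR5", "GR6"]
n_models, best = search_models(groups, dummy_score)
print(n_models)  # 1500 = 6 groups x 50 hidden layer sizes x 5 iterations
```

Keeping only the best score per group mirrors the way the most successful model of each data group is reported, while the counter reproduces the 250 models per group and 1500 models per province stated in the text.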

As seen in Table 8, the most successful estimation models were obtained by using the GR3 input data for Isparta and the GR2 input data for Kahramanmaras. In Isparta, the most successful MFFNN model had 48 hidden layer neurons and was found in the first iteration, whereas in Kahramanmaras the most successful model had 40 hidden layer neurons and was found in the second iteration. The training performance of the most successful models that were developed for Isparta and Kahramanmaras was found to be and 0.0023 and and 0.9845, respectively. When the best estimates were compared with the actual values measured by using the test data, the correlation-based statistic, MAE, and SMAPE values for Isparta were 0.9488, 0.0352, and 7.77%, respectively; these values were 0.9656, 0.0341, and 7.79%, respectively, for Kahramanmaras. The estimation performance of the two target areas was evaluated with different scales, and very similar results were reached for both the training and test data.

Boxplots of the measured and estimated values for the selected provinces are given in Figure 9. These plots show the average statistical error between the test input data and the estimated values for each province. Scatter plots of the measured and estimated values of the most successful model developed for each data group are given in Figure 10. Both sets of plots show that a high level of correlation was achieved for the GR2, GR3, and GR6 data groups in Isparta, and a similarly strong relationship was reached for the GR2, GR3, GR5, and GR6 data groups in Kahramanmaras.

Another ML algorithm that was employed in estimating HAGSR is SVR. The accuracy obtained with the classic SVR-based estimation models was quite low. Therefore, it was decided to evaluate LibSVM, another SVR-based method, alongside the classic SVR estimation results. LibSVM was preferred because it is well known in the academic literature. In both methods, numerous models were created to determine the most suitable combinations of the user-defined parameters C (the complexity or cost parameter), epsilon (the error parameter), and Nu (the parameter used in Nu-SVR in place of epsilon), and the most successful estimation models were developed with these combinations. The performance results of 18 different models that were developed for each province by using the POL, NOR-P, and RBF kernel functions with SVR are shown in Figure 11. According to the correlation-based statistic, the estimates ranged from 0.6786 to 0.8596 for Kahramanmaras and from 0.5273 to 0.7969 for Isparta. A total of 12 different estimation models were developed with LibSVM for the data groups in each province. The statistical results of the SR estimation models that were developed by using the RBF kernel function for two different regression-based SVR algorithms are given in Table 9. The models developed with Nu-SVR were more successful than those developed with E-SVR. In the Kahramanmaras target area, the model developed with the GR2 data group had the most successful estimation performance, with 0.0675, 0.0501, 12.14%, and 0.9394 according to the RMSE, MAE, SMAPE, and correlation-based statistic, respectively. Similarly, the most successful model in Isparta was developed with the GR3 data group, with 0.0752, 0.0573, 12.11%, and 0.8995, respectively. Comparative scatter plots of the most successful models developed with the two SVR methods for the selected provinces are given in Figure 12.

As seen in Figure 12, the best SR estimation results of the models developed with LibSVM were more successful than those of the classic SVR, although the two methods are similar. For this reason, the LibSVM estimation results were used in the comparative evaluation of the ML algorithms.

With the KNN ML algorithm, a total of 36 different estimation models were developed for each province by using the selected input data and six different $k$ (number of neighbors) values between 1 and 10. The estimation performance results of the two most successful models that were developed in each data group with user-defined parameters are given in Table 10. The $k$ parameter was found to be a decisive factor in the estimation models, but performance did not always improve as $k$ increased. No significant performance increase was detected in any of the models developed with $k$ values above 10, and modelling time was extended. In the relevant table, the most successful model for Kahramanmaras, developed with the GR2 data group, estimated SR with 0.0605, 0.0419, 0.9511, and 8.84% according to the RMSE, MAE, correlation-based statistic, and SMAPE scales, respectively. Similarly, the most successful model for Isparta, developed with the GR3 data group, resulted in 0.0646, 0.0438, 0.9261, and 8.88%, respectively. The scatter plots of the estimation results are given in Figure 13. The SMAPE values show that the SR estimations for the two provinces are very close.
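The role of the neighbor count can be illustrated with a minimal nearest-neighbor regression sketch. The Python code below is not the WEKA configuration used in the study: the function names `knn_predict` and `sweep_k`, the Euclidean distance, the unweighted average of the neighbors, and the toy data are assumptions chosen for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict a target as the unweighted mean of the k nearest training targets.

    Assumption: Euclidean distance between feature tuples.
    """
    ranked = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def sweep_k(train_x, train_y, test_x, test_y, ks):
    """Return {k: RMSE on the test set} for each candidate neighbor count."""
    results = {}
    for k in ks:
        preds = [knn_predict(train_x, train_y, q, k) for q in test_x]
        mse = sum((p - t) ** 2 for p, t in zip(preds, test_y)) / len(test_y)
        results[k] = math.sqrt(mse)
    return results


# Toy one-feature example with normalized targets.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [0.0, 0.1, 0.2, 0.9]
print(sweep_k(train_x, train_y, [(2.9,)], [0.9], ks=[1, 2, 4]))
```

In this toy example the error grows as more distant neighbors are averaged in, which matches the observation that a larger neighbor count does not automatically improve the estimate and that the value has to be selected by comparing models.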

Six different rule-based estimation models were developed for each province with the M5 rules algorithm by using the selected data groups of the targeted cities. The estimation performance of the developed models is given in Table 11. Unlike the other ML algorithms employed in the study, the best model for Kahramanmaras was obtained with the GR5 data group. For Isparta, on the other hand, the best model was again obtained with the GR3 data group. According to the RMSE, MAE, correlation-based statistic, and SMAPE statistical scales, the values of 0.0610, 0.0418, 0.9506, and 9.01%, respectively, were obtained for the best Kahramanmaras model. Similarly, 0.0650, 0.0441, 0.9254, and 8.42% were obtained for the province of Isparta. Scatter plots of the most successful models developed for the target cities are given in Figure 14. The plots show that the data distributions and performance metrics of the target provinces were very close to each other.

In addition to the trial runs carried out with all ML algorithms to increase estimation accuracy and to select the most successful models in each data group, 3000, 72, 12, and 24 different estimation models were developed for the target provinces with the MFFNN, KNN, M5 rules, and SVR-based LibSVM algorithms, respectively. In Kahramanmaras, the models developed with the GR2 data group (five inputs, including month, hour, and HSD) and the GR5 data group (six inputs, including month, hour, WS, and HSD), both determined at the end of the feature selection process, were more successful. In Isparta, however, the models developed with the GR3 data group (seven inputs, including year, month, day, hour, and HSD) showed higher performance. The statistical comparisons of the best-performing models for each ML algorithm used in the SR estimations of both provinces are given in Table 12. Based on the statistical scales employed in the study, the MFFNN algorithm estimated SR more accurately in both provinces than the other algorithms. The KNN and M5 rules algorithms achieved similar estimation results in each province, and the lowest performance values were detected in the SVR models developed with LibSVM. According to SMAPE, the SR estimation results achieved in Kahramanmaras and Isparta were 7.79% and 7.77%, respectively, with the MFFNN algorithm; 8.84% and 8.88% with the KNN algorithm; and 12.14% and 12.11% with the LibSVM algorithm. The closeness of the SMAPE results for the two provinces is associated with their shared latitude and some similar geographical characteristics.

In the HAGSR estimation studies, the models developed with the GR6 data group, which uses all the available input data, performed worse than the models developed with the GR2, GR3, and GR5 data groups, which were created at the end of the feature selection process. MAE and SMAPE performance plots of all data groups for the four ML algorithms and the target provinces are given in Figures 15 and 16. It is clearly seen that the feature selection processes contribute positively to the performance of the developed estimation models.

The HAGSR estimation models developed for Kahramanmaras and Isparta estimated the solar energy resource of the target areas quite well in general. However, the GR1 and GR4 data groups had the lowest estimation performance in both provinces. It was therefore concluded that the CfsSE and CorrAE feature selection functions, applied to the meteorological and categorical input datasets, were inadequate in determining the best input data. The most successful feature selection functions were ClassAE and WSE for Kahramanmaras and CSE and WSE for Isparta. The SR estimations of the best models developed with the four ML algorithms, using the seven inputs determined with the CSE and WSE feature selection functions for Isparta, are compared with the real measurements in Figure 17. Similarly, the comparisons of the best models developed with the five inputs of the ClassAE feature selection function and the six inputs of the WSE feature selection function for Kahramanmaras are given in Figure 18. Five days of hourly input data, selected randomly from the test data and covering July 5-9, 2004, were used for the comparisons. The HAGSR estimation models developed for Kahramanmaras are seen to be slightly more successful in estimating SR than those for Isparta, where the test data time periods were selected randomly for each day.

The comparison of the HAGSR estimation models that were developed by using ML algorithms in the literature, and the most successful model developed in this study, is given in Table 13. The most successful models that were developed in previous studies were commonly based on a neural network, as in this study. It is understood that the accuracy of the proposed estimation model is better than, or similar to, previous studies.

4. Conclusion

In the present study, a comparative evaluation was made by developing models based on four different ML algorithms (MFFNN, KNN, SVR-based LibSVM, and M5 rules) to predict the HAGSR of the provinces of Kahramanmaras and Isparta, which are located on the same latitude in the Mediterranean Region. The most suitable input features were determined from meteorological and categorical input data by developing 29 different applications based on six different feature selection functions with WEKA, and the input data were arranged in five different selection groups (GR1-GR5). With the addition of the GR6 data group, which contains all input data, six different input datasets were used in modelling. The most successful estimation models were developed with the MFFNN algorithm in Kahramanmaras and Isparta by using the GR2 and GR3 data groups, respectively. The five GR2 inputs, which include month, hour, and HSD, were the most effective features for the estimation models in Kahramanmaras, whereas the seven GR3 inputs, which include year, month, day, hour, and HSD, were the most effective in Isparta. HSD is clearly the most effective input for SR in all selected data groups. The results show that, for models developed with the data groups created at the end of the selection process, predictive accuracy increased, modelling time decreased, and the models became easier to interpret.

According to the data groups, the performance of the KNN and M5 rules models was quite similar in each province. This held for the models developed with the two algorithms for the GR2 data group in Kahramanmaras and for the GR3 data group in Isparta. The lowest performance was obtained for the GR1 and GR4 data groups in each province.

The best SR estimation performance for the two provinces was achieved with the MFFNN algorithm. When the results were evaluated in statistical terms, very close values were obtained in Kahramanmaras and Isparta. The MAE of the most successful MFFNN model was 0.0341 in Kahramanmaras and 0.0352 in Isparta. Similarly, the SMAPE of the most successful model was 7.79% in Kahramanmaras and 7.77% in Isparta. Although the other ML algorithms used in the study performed less well, they also gave similar results for the two provinces. These results show that these two cities, which are very far from each other, have similar SR estimation potentials and that latitude and other geographical characteristics have significant effects on these similarities. As a result of the present study, the HAGSR potential of both cities was estimated successfully, with accuracy better than, or similar to, that of previous studies in this field. In future studies, the performance of various ML algorithms should be evaluated for different parts of Turkey and the world and for different time intervals.

Data Availability

The data used to support the findings of this study are available from the corresponding author or Turkey General Directorate of Meteorology Meteorological Data Information Sales and Presentation System (MEVBIS) website upon request; website address: https://mevbis.mgm.gov.tr/mevbis/ui/index.html#/Workspace.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

The authors are grateful to the MEVBIS staff for their assistance during the research.