Abstract

Energy supply, together with data management, is one of the key challenges of our century. In particular, mitigating climate change while energy demand rises day by day poses a serious dilemma, one that can be reconciled through innovative data management in (renewable) energy technologies. The new environmentally friendly planning methods and investments discussed by researchers, governments, NGOs, and companies will provide the most important variables in shaping the future. We use modern data mining methods (SOM and K-Means) and official governmental statistics to cluster cities according to their consumption similarities, level of welfare, and growth rate, and we compare the clusters with the cities’ renewable resource potential with the help of Rapid Miner 5.1 and MATLAB. Data mining was chosen to reveal hidden relations among the variables that are not apparent at first sight. We aim to assess how well the chosen algorithms, and the software used to run them, perform in the validation process. Additionally, we aim to offer decision-makers and stakeholders an innovative approach for determining which renewable resource is most suitable for a given region by considering several variables at the same time.

1. Introduction

Today, concerns about the environment, energy security, and economic prosperity are growing. The International Energy Agency (2013-2014) estimated that global demand for primary energy will increase by 55 percent between 2005 and 2030. In developing countries, where economies and populations are expected to grow fast, primary energy demand is projected to grow by 74 percent during the same period, and fossil fuels are expected to remain the dominant source of primary energy, accounting for 85 percent of the overall increase in global demand [1]. The question is how long this unsustainable de facto situation can continue.

Since the 1990s, the effort to convert production systems into clean production systems has no longer been the aim of companies alone; it has been implemented through joint operations among industry, government bodies, and the public [2]. In addition to being a key criterion of sustainable development, integrating social and ecological challenges into the decision-making process from the beginning, by managing data in an innovative manner, is of growing importance for businesses and states in minimizing future environmental risks and the associated economic costs [3–5].

From this perspective, this research focuses on the verification and experimental comparison of two data mining options (and of data mining itself) for new solar energy investment decisions. Section 2 gives the basic background on multivariate cluster analysis, including the Self-Organizing Map and K-Means, together with the software tools used for comparison, and explains why these methods were chosen. Our data on the cities were gathered from governmental and specialized official statistics institutions via mail and/or face-to-face conversations. Section 3 comments on, compares, and discusses the graphics on solar radiation and consumption values created with Rapid Miner 5.1 and MATLAB; both programs were used so that their results could be compared. Section 4 summarizes the results and presents our suggestions on pattern recognition algorithms for validating the investment decision. As a result, in addition to presenting a new approach for assessing (future) solar power investments, we suggest the better algorithm and software for the decision-making and optimization process of stakeholders.

2. Material and Methods

2.1. Data Mining and Clustering

The objective of data mining (DM) is to identify valid, novel, potentially useful, and understandable correlations and patterns in existing data; finding useful patterns in data is known by different names in different communities [8, 9]. Chung and Gray and Liao and Triantaphyllou reviewed recent developments in data mining algorithms across different disciplines and a diverse range of applications [10, 11]. From the ecology point of view, various successful studies have also been collected as examples of different applications [12]. While many data mining tasks follow a traditional, hypothesis-driven data analysis approach, it is commonplace to employ an opportunistic, data-driven approach that encourages the pattern detection algorithms to find useful trends, patterns, and relationships [6]. Accordingly, as shown in Table 1, we selected neural networks as the best solution for our study, since relationships among the variables are central to it. Because modern analyses are complex and their results are hard to confirm (or validate), it is better to apply more than one method to the same variables.

The similar trilemma of energy, data, and optimization has been used to cluster 316 European regions according to their cultural background [13]. Elsewhere, Seidler and Adderley used a data mining approach to develop proactive measures against criminal networks [14]. Additionally, Santillana et al. [15] presented a methodology based mainly on data mining and machine learning techniques, not only for observing but also for estimating and evaluating the effects of influenza.

Clustering, the main methodology of this study, is simply the division of data into groups of similar objects [8]. The literature offers many methods and algorithms for this purpose. Two clustering algorithms were chosen to investigate the data of this study and to compare their results: Self-Organizing Maps and K-Means. Sobkowicz et al. (2012) [16] similarly used these algorithms successfully for modeling, simulating, and forecasting opinions on the web and in social media. The general reasons for selecting these two algorithms are their popularity, flexibility, applicability, and ability to handle high dimensionality [17].

Additionally, Figure 1 shows the total electricity cost, including construction, production, and decommissioning, for renewable energy. Published in September 2014, the chart shows that electricity from hydropower is already more cost effective than electricity from other fuels. The price of electricity from wind power is also becoming competitive, in some cases with natural gas and almost with coal. The price of power from solar panels is still high, but the cost difference is shrinking day by day, mainly owing to more efficient panels, mass production, and unstable fossil fuel costs. Many studies indicate that the transition cost of the energy system will be far lower than the long-term cost of continuing to use fossil fuels. Since solar is still the most expensive energy source, the investment decision must be analyzed very carefully from various perspectives. In this study, we chose solar energy values because solar is the most widely available resource and can be installed nearly anywhere in the world. We used clustering to expose hidden relationships among the consumption similarities of cities in order to support cost-effective investments. We therefore expect to obtain new neighborhood relations, based on the new clustered structure, that will form the basis of new renewable source investments.

2.2. Self-Organizing Maps

Self-Organizing Maps (SOM) were introduced at the beginning of the 1980s by the Finnish academician Kohonen. The main idea of SOM is to map the data patterns onto an n-dimensional grid of units or neurons [18]; it is usually used for mapping high-dimensional data into one-, two-, or three-dimensional feature maps (simple geometric relationships) to increase intelligibility [19, 20]. Additionally, the SOM is an adaptive display method that is particularly suitable for the representation of structured/normalized statistical data. The mapping represents a data set in an ordered form, whereby mutual similarities of data samples are visualized as geometric relations of the images of the samples on the map [21]. SOM procedures are used in a range of applications, and they have had a major impact in the fields of data exploration and data mining [22].

Traditional multivariate statistical approaches are often confounded by data sets with nonlinear variable relationships, by data distributions that are non-normal (typically with multiple populations), and by data sets that are disparate or sparsely filled (contain “nulls”) and mix continuous and discontinuous numeric data with text [23]. The SOM, an ordered vector-quantization approach, can overcome many of these problems [24], and owing to its layer-based structure the effects of independent variables can be seen separately. Recently, Miche et al. [25] also successfully adapted SOM clustering with a priori knowledge in five different examples.
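The mapping idea described in this section can be illustrated with a short sketch. The following is a minimal one-dimensional SOM written in plain Python for illustration only; it is not the MATLAB or Rapid Miner implementation used in this study. The function names (`train_som`, `best_matching_unit`), the linearly decaying learning rate and neighborhood width, and all default parameter values are assumptions made for this example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_som(data, n_units, epochs=200, lr0=0.5, seed=0):
    """Train a 1-D chain of n_units on data (a list of equal-length tuples).

    Assumed schedule: learning rate and Gaussian neighborhood width both
    shrink linearly over the epochs (an illustrative choice).
    """
    rng = random.Random(seed)
    sigma0 = max(n_units / 2.0, 1.0)
    # Initialize every unit at a randomly chosen data sample.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.01)
        order = list(data)
        rng.shuffle(order)
        for x in order:
            bmu = best_matching_unit(weights, x)
            for j in range(n_units):
                # Units near the winner on the chain are pulled more strongly.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[j] = [w + lr * h * (v - w) for w, v in zip(weights[j], x)]
    return weights
```

After training, `best_matching_unit(weights, x)` gives the map unit (and hence the cluster, if each unit is read as one group) of a sample `x`; with five units this would correspond to the five groups used later in the study.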

2.3. K-Means Algorithm

The K-Means algorithm, one of the most widely used and best-known clustering algorithms, is classified as a partitioning (nonhierarchical) clustering method. K-Means is typically used with the Euclidean metric for computing the distance between points and cluster centers [26]. It quickly converges to a local minimum of its cost function [27]. In K-Means, because the centers are connected to each other, the data can be projected into one- or two-dimensional space [28]. Additionally, K-Means has been used successfully to address the accuracy challenge in large-scale image classification as part of a hybrid model, which shows how adaptable the algorithm is to different fields; another such example [29] is cited by Elssied et al. [30] from a soft computing perspective.

Briefly, the primary objective of applying K-Means is to produce groups whose members are close to each other (have a high degree of similarity) and well separated from other groups. In the clustering process, there are no predefined classes and no examples showing what kind of relations should hold among the data, so it is natural to ask about the validity and quality of the results obtained [31]. At every step, the center of each cluster is recalculated as the average vector of the objects assigned to that cluster, and every object can belong to only one cluster [32].
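The assignment and center-update steps described above can be sketched as follows. This is a minimal plain-Python illustration of the standard K-Means loop with squared Euclidean distance, not the MATLAB `kmeans` or Rapid Miner operator used in this study. The function names, the random choice of initial centers, and the cap of 1000 iterations (chosen here to echo the iteration count mentioned in Section 2.4) are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=1000, seed=0):
    """Cluster points (a list of tuples) into k groups.

    Returns (labels, centers). Initial centers are k distinct random
    samples, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to exactly one nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the average vector of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Because the result depends on the initial centers, the procedure only reaches a local minimum of its cost function, which is one reason to repeat the run or to cross-check the outcome with a second method.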

2.4. System Architecture

As mentioned in many studies, focusing on consumption values alone is not enough to analyze the energy challenge; other influencing factors should be considered as well. Accordingly, we first grouped the cities according to their electricity consumption, which comprises seven variables. In total, we compared 81 cities across seven subcategories of total consumption (i.e., government, industry, agriculture, city lighting, housing, trade, and others) using data from between 2004 and 2014 (5670 values in total). An example data set with an explanation is presented in Appendix A. We ran 1000 iterations to obtain more reliable results. We used the SOM and K-Means (as a representative of the nonhierarchical methods) algorithms. Because we primarily wanted to see the sensitivity of the algorithms with limited data, we set the K value of SOM and K-Means equal and compared their performance when K = 5. (We picked 5 according to the first observable outputs of randomly chosen data sets which were handled via MATLAB and Rapid Miner 5.1.)

Later, we normalized the data and clustered them with SOM and then K-Means in both software programs: MATLAB and Rapid Miner. In addition to MATLAB, which is a well-known tool especially in engineering and mathematical disciplines, there are several data mining tools, such as R, WEKA, Orange, Rapid Miner, Tanagra, and KNIME. We chose Rapid Miner as an alternative to MATLAB mainly because it is user friendly and offers a modern set of components free of charge. Rapid Miner also has the most important connectors: databases, Excel, txt, html, CSV, and so forth. It provides easy-to-use wizards for setting up data sources and a graphical environment for processing data flows. Rapid Miner can use the algorithms of the other programs and offers innovative ways of suggesting possible solutions and pointing out mistakes. Although WEKA appears to be the most successful data mining tool according to the study of Borges et al. (2013) [33], they did not consider Rapid Miner as a candidate in their research. In general, Rapid Miner, R, WEKA, and KNIME have most of the desired characteristics of a fully functional platform [34], and Piatetsky [35] reported that in 2013 Rapid Miner was the most used open source tool in real projects.

First, we processed our (prepared) data with K-Means and SOM in MATLAB. However, we reduced the variables from seven to three, mainly to keep run time and stability in MATLAB manageable. Reduced data can be visualized more easily in two dimensions, and the reduction has no meaningful effect on the results when one of the following methods is used: data aggregation, dimension reduction, data compression, or discretization [36]. While reducing the variables, we identified the most effective inputs (with standard deviations) via principal component analysis.
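The reduction step can be illustrated with a small sketch of principal component analysis. The text does not state whether three of the original variables or three component scores were kept, so the following plain-Python example simply computes the leading components of the covariance matrix by power iteration with deflation and projects the data onto them. The function names, the use of power iteration, and the iteration count are assumptions of this sketch and are not taken from the MATLAB workflow of the study.

```python
import math
import random


def column_means(rows):
    """Mean of every column of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    mu = column_means(rows)
    centered = [[r[j] - mu[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_component(cov, iters=500, seed=0):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigval = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return v, eigval


def pca(rows, n_components):
    """Return (components, variances) for the first n_components."""
    cov = covariance_matrix(rows)
    d = len(cov)
    components, variances = [], []
    for _ in range(n_components):
        v, lam = top_component(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components, variances


def project(rows, components):
    """Scores of the centered rows on the given components."""
    mu = column_means(rows)
    return [
        [sum((r[j] - mu[j]) * comp[j] for j in range(len(mu))) for comp in components]
        for r in rows
    ]
```

For a table with seven consumption columns, `pca(rows, 3)` would return three directions together with the variance each explains, and the square roots of those variances are the standard deviations along the components.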

Figure 2 summarizes the first clustering results. The MATLAB code that was used and the detailed results are presented in Appendix B.

Next, the same data were processed with Rapid Miner 5.1, in addition to MATLAB, for comparison. The table in Appendix C shows the clustering results by consumption values, separately for K-Means and SOM. From one perspective, it shows the relationship between investment requirements and the clusters.

After observing the clusters of similarity according to consumption (independent of geographical neighborhood), we decided to look for similarities in solar power potential. The detailed and sensitive data were gathered from national and local meteorology authorities, in addition to the archives of the Ministry of Energy. A summary of the data, the MATLAB code that was used, and the results are presented in Appendix D. Furthermore, the Rapid Miner 5.1 results for SOM and K-Means are in Appendix E.

These two data sets (solar radiation and consumption) are suitable for comparing which algorithm is more accurate and sensitive. Figure 3 illustrates the clustering schema of the cities according to SOM. Based on the profiles of the clusters identified by the SOM, we can say that SOM grouped the clusters successfully but not very sensitively. The yellow part covers a remarkably large area, whereas the dark yellow and orange parts are represented by only a small area. Still, five different groups can be observed from the spread of the colors, which is presented in detail in Appendix F.

So far, we have analyzed which cities are similar to each other in terms of electricity consumption and potential solar power, and which algorithm gives better results. For a better decision support system (the other aim of the study), however, we should also know which city or cluster will need more power in the future. Accordingly, we must know at least the population growth and the rate of industrialization. For example, if a city has rich solar power potential but steadily loses population or falls back in the welfare indexes for any reason, this “strongly” suggests that extra investment will not be necessary for that city.

Because K-Means provides more sensitive results than SOM, we continued with K-Means in Rapid Miner 5.1 from this point on. Figure 4 shows the clustering results according to population forecasting. The cities are clustered according to their similarity in population growth ratio, and the results are visualized by color codes. The data were supplied by the Turkish Statistical Institute and can be found in detail in Appendix F. As can be seen, every cluster contains cities from any geographical region, so it is possible to find “sister cities/regions” by focusing on the reasons in further studies. Additionally, there are always some exceptions, such as megacities (Istanbul, Ankara, and Izmir in our example), whose values are far from the rest of the data and which can form separate research areas of their own.

Similarly, Figure 5 shows the clustering of cities with color codes (and the numbers on the graph) according to their welfare index values. The index values were provided by the Ministry of Development and can be found in Appendix G. The colors represent the cities and the bubble sizes represent the index value: smaller bubbles show more developed cities and bigger bubbles show less developed ones. It is easy to notice that there is a very large difference between the two extremes.

Therefore, it is possible to compare the results of cities’ future energy demand indicators with their consumption-similarity neighborhood and their solar power neighborhood clusters. Decision-makers can see the basic decision support components in the integrity of a cluster.
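One way to picture this combined view is to place the cluster labels from the separate analyses side by side for each city. The sketch below is only an illustration of that bookkeeping step in plain Python; the function name `combine_cluster_labels`, the indicator names, and the placeholder cities and labels in the usage note are hypothetical and are not results of this study.

```python
from collections import defaultdict


def combine_cluster_labels(labelings):
    """Group cities by their tuple of cluster labels across several analyses.

    labelings maps an indicator name (for example "consumption") to a
    dict of city -> cluster label. Only cities present in every analysis
    are kept. Returns (indicator_names, groups), where groups maps each
    label tuple to the sorted list of cities sharing it.
    """
    names = sorted(labelings)
    common = set.intersection(*(set(labelings[n]) for n in names))
    groups = defaultdict(list)
    for city in sorted(common):
        profile = tuple(labelings[n][city] for n in names)
        groups[profile].append(city)
    return names, dict(groups)
```

For instance, with the hypothetical input `{"consumption": {"city_a": 1, "city_b": 1}, "solar": {"city_a": 3, "city_b": 3}}`, both placeholder cities fall under the profile `(1, 3)`, meaning that they are neighbors in both the consumption and the solar clustering.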

3. Results and Discussion

As mentioned by the World Bank and by the World Energy Council in the 2015 Energy Trilemma Index [37, 38], energy-efficiency improvements are developing very positively, but they are still slow. Today’s geopolitical as well as geoeconomic reality is dominated by urbanized areas, mainly organized as cities. The sustainability of cities, in particular megalopolises of over 10 million inhabitants, depends on regional and global networks for their basic needs in food, water, and energy. Security of supply and quality of life strongly depend on the functioning of these networks. However, the present technology for producing electricity from renewable resources is still more expensive than conventional technology, despite technological development and various governmental supports. It is therefore important to consider a city’s or region’s renewable richness not only by itself but also together with other factors, which should at least include population forecasting, industry, agriculture, the transformation rate and possibilities, climate, and consumption routines. For this aim, both the K-Means and SOM algorithms are suitable because they work with large and small data sets, large and small numbers of clusters, and ideal and random data sets, and because they are available in different software alternatives. For nonoverlapping situations all the methods had good performance.

In our study, both algorithms were run on the same data for a similar number of iterations. Our results show that K-Means is more sensitive and easier to implement. Additionally, when K becomes greater, the performance of SOM decreases, and when the data set is not large, K-Means produces more successful outputs. In this respect our findings are similar to those of Mingoti and Lima (2006) [23] and Abbas (2008) [39]. Likewise, Shahapurkar and Sundareshan [40] presented the results of applying the K-Means technique to a data set with 798 observations in 6 dimensions. According to their outcomes, the K-Means clustering algorithm can provide various advantages over the SOM in terms of point density, accuracy, topology preservation, and computational requirements. Also, de Castro Leão et al. [41] showed that K-Means can ultimately reach results as successful as those of SOM but in a faster and simpler manner, which is partly similar to Bação et al. (2005) [42]. The paper thus shows that K-Means can reach meaningful solutions in a simple way without requiring huge amounts of data. Additionally, with the same data, Rapid Miner reached more sensitive clustering and forecasting results than MATLAB in this paper.
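When two algorithms or two software tools cluster the same cities, their outputs can also be compared numerically. The paper does not state which agreement measure, if any, was used; the sketch below uses the Rand index purely as an assumed example of such a measure. It counts the share of city pairs on which two labelings agree (together in both, or apart in both) and does not depend on how the clusters are numbered. The function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by two clusterings.

    A value of 1.0 means the two partitions are identical up to a
    renaming of the cluster numbers.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Applied to the K-Means and SOM label columns of the same 81 cities, a value close to 1 would indicate that the two methods produce nearly the same grouping, while a low value would indicate that they disagree on many city pairs.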

Besides, we showed that data mining methods can be used as a successful tool in renewable energy investment planning, as in many other areas. This study also illustrated, in an innovative way, that data mining methods can expose the hidden relationship between requirements and assets simultaneously, by covering the welfare ratio and the forecast energy requirements of cities according to their population growth and development. For instance, focusing on the clusters in Figures 4 and 5, we found that cities in the same cluster according to population potential often fall in the same welfare index range. On the other hand, for a specific city or region, one can see its solar energy potential, future need, and welfare situation at a glance. If a city will need more electricity in the future and has rich solar potential, it does not necessarily mean that we should invest in a solar-based production facility there. (Generally, renewable resources are more suitable for home-based usage than for industrial usage because of their investment costs and inherent restrictions.) Briefly, the intersection(s) of these indicators can give a clue for more accurate and efficient investment decisions.

4. Conclusion

In this paper, we compared two clustering algorithms in two different software programs to see which is more successful under certain circumstances, with sufficient (limited) data, for use in new renewable energy investment decisions. To compare the algorithms, we utilized four different types of data: electricity consumption, solar power potential, population forecasting, and welfare indexes of the cities. Although both software programs and both algorithms are adequate for the analysis, the K-Means method and the Rapid Miner tool proved more competent from the time, ease-of-use, and workforce perspectives. This can especially help social science researchers and decision-makers to check their theories more easily.

Additionally, we presented a validation system for administrators in both government and the private energy management sector to keep track of suitable areas for energy investment according to different factors such as region, population, and welfare. Visualizing and analyzing the dynamics between the humanities and energy is particularly innovative. This will offer new ways to improve insight in creating optimal strategies for sustainable smart energy systems. Thus, administrators can decide which city or cluster needs more energy and, more importantly, new energy investment. In future studies, other factors affecting energy consumption could be included for more specific results. This can be a starting point for studies on nonconventional programming methods that are urban centric rather than state centric.

Appendix

A. Example Data Set (for 2004)

See Table 2.

B. Clustering Algorithm for MATLAB and Results

The Algorithm

    [D, txt] = xlsread('inputs.xls');
    % idx_som will give the group indices of each element for SOM clustering
    [idx_som, s] = som_clust(D, 'maxclust', 5, );
    idx_som
    % IDX_kmeans will contain the group numbers of cities
    % group_centers will contain the group averages
    [IDX_kmeans, group_centers] = kmeans(D', 5);
    figure;
    bar(hist(IDX_kmeans));
    title('K-means clustering histogram');
    IDX_kmeans
    % city names start in the third row of the first text column
    cities_list = txt(3:end, 1);
    xlswrite('output_som.xls', cities_list, 81, 'A1:A81');
    xlswrite('output_som.xls', idx_som, 81, 'B1:B81');
    xlswrite('output_kmeans.xls', cities_list, 81, 'A1:A81');
    xlswrite('output_kmeans.xls', IDX_kmeans, 81, 'B1:B81');

Results

Cluster 1 Cities. “İstanbul”

Cluster 2 Cities. “Adana”, “Denizli”, “Gaziantep”, “Hatay”, “Kayseri”, “Konya”, “Manisa”, “Mersin”, “Tekirdağ”, “Zonguldak”, “Çanakkale”.

Cluster 3 Cities. “Bursa”, “Kocaeli”, “İzmir”

Cluster 4 Cities. “Adıyaman”, “Aksaray”, “Amasya”, “Artvin”, “Aydın”, “Balıkesir”, “Batman”, “Bayburt”, “Bolu”, “Burdur”, “Bilecik”, “Bingöl”, “Bitlis”, “Düzce”, “Diyarbakır”, “Edirne”, “Elazığ”, “Erzurum”, “Erzincan”, “Eskişehir”, “Gümüşhane”, “Giresun”, “Isparta”, “Iğdır”, “Karabük”, “Karaman”, “Kars”, “Kastamonu”, “Kırıkkale”, “Kırklareli”, “Kırşehir”, “Kütahya”, “Kilis”, “Malatya”, “Mardin”, “Muğla”, “Muş”, “Nevşehir”, “Niğde”, “Osmaniye”, “Sakarya”, “Samsun”, “Sinop”, “Sivas”, “Siirt”, “Tokat”, “Trabzon”, “Tunceli”, “Uşak”, “Van”, “Yalova”, “Yozgat”, “Çankırı”, “Çorum”, “Şırnak”

Cluster 5 Cities. “Ankara”, “Antalya”

C. Clustering via Rapid Miner and Results

See Table 3.

D. Clustering Results according to Solar Power Potential with Algorithm for MATLAB

See Table 4.

    [num, txt, raw] = xlsread('SSAPMA.xls', 'SSAPMAS');
    % city names start in the third row of the first text column
    cities = txt(3:end, 1);
    % cluster the solar potential values into 5 groups
    [idx, c] = kmeans(num, 5);
    for cluster = 1:5
        disp('==================')
        disp(['Category ' num2str(cluster)])
        disp('==================')
        disp(cities(idx == cluster))
        disp(['Center of cluster = ' num2str(c(cluster))])
    end

Results

Category 1 (center of cluster = 2.54 (the mean of the values in the same cluster)). “Agri”, “Ankara”, “Aydin”, “Balikesir”, “Bartin”, “Bilecik”, “Bolu”, “Burdur”, “Bursa”, “Cankiri”, “Corum”, “Denizli”, “Duzce”, “Erzincan”, “Gaziantep”, “Karabuk”, “Kastamonu”, “Kilis”, “Kirikkale”, “Kocaeli”, “Kutahya”, “Mugla”, “Osmaniye”, “Sakarya”, “Usak”, “Zonguldak”

Category 2 (center of cluster = 1.7922). “Ardahan”, “Artvin”, “Rize”, “Trabzon”

Category 3 (center of cluster = 2.6791). “Adana”, “Afyonkarahisar”, “Bingol”, “Canakkale”, “Eskisehir”, “Isparta”, “Izmir”, “Karaman”, “Kirsehir”, “Konya”, “Manisa”, “Nigde”, “Sivas”, “Yalova”, “Yozgat”

Category 4 (center of cluster = 2.8492). “Adiyaman”, “Aksaray”, “Batman”, “Bitlis”, “Diyarbakir”, “Edirne”, “Elazig”, “Istanbul”, “Kahramanmaras”, “Kayseri”, “Kirklareli”, “Malatya”, “Mardin”, “Mus”, “Nevsehir”, “Sanliurfa”, “Siirt”, “Sirnak”, “Tekirdag”, “Tunceli”

Category 5 (center of cluster = 2.27). “Amasya”, “Antalya”, “Bayburt”, “Erzurum”, “Giresun”, “Gumushane”, “Hakkari”, “Hatay”, “Igdir”, “Kars”, “Mersin”, “Ordu”, “Samsun”, “Sinop”, “Tokat”, “Van”

E. Clustering Results according to Solar Power Potential with Rapid Miner (by K-Means and SOM)

See Table 5.

F. Population and Population Forecasting of the Cities

See Table 6.

G. Welfare Indexes

See Table 7.

Disclosure

The project “Saving industrial water with using solar power within the framework of artificial intelligence”, which the corresponding author prepared by applying the same methods to water management, won the winner’s prize in the UNDP Every Drop Matters contest in 2014 [43].

Competing Interests

The authors declare that they have no competing interests.