Abstract

Accurate prediction of electricity consumption is essential for sound management decisions and company strategy. This study presents a hybrid machine learning model that integrates dimensionality reduction and feature selection algorithms with a backpropagation neural network (BPNN) to predict electricity consumption in Thailand. The predictive models are developed and tested using an actual dataset with related predictor variables from public sources. Open geospatial data gathered from a real service, together with geographical, climatic, industrial, and household information, are used to train, evaluate, and validate these models. Principal component analysis (PCA), stepwise regression (SWR), and random forest (RF) are used to determine the significant predictor variables. The predictive models are constructed using the BPNN with all available variables as a baseline for comparison and with the variables selected by the dimensionality reduction and feature selection methods. Along with creating a predictive model, the most relevant predictors of energy consumption are also identified. In the comparison, the hybrid model of RF with BPNN consistently outperforms the other models. Thus, the proposed hybrid machine learning model can predict electricity consumption for planning and managing energy demand.

1. Introduction

Due to the pandemic crisis in 2020, total energy demand during 2020–2030 is likely to be higher than the International Energy Agency (IEA) forecast. Electricity consumption has become one of the critical issues in most countries. Therefore, the accurate prediction of electricity consumption has an essential role in achieving efficient energy utilization. Assessment of the electricity consumption in advance will improve operation strategies and management of energy storage system and planning activities for future power plants [1, 2]. One of the largest electricity consumers is the business sector. Generally, socioeconomic and environmental factors contribute to electricity consumption. Socioeconomic factors include industrial and household information, while environmental factors include geographical and climatic information. Determining the significant relation of different factors related to electricity consumption could provide guidelines for electricity authority management to carry out the planning and strategies in an efficient manner.

Over the past several decades, many statistical and computational intelligence methods have been applied to prediction problems. Previous studies are mainly limited to small datasets of independent variables and rely on time-series forecasting, regression analysis, and clustering methods [3, 4]. Statistical methods impose restrictions on the linearity, normality, and independence of variables [5]. Computational intelligence methods, such as artificial intelligence algorithms, have been primarily implemented in prediction [6]. Recent research shows that greater computing power makes handling and analyzing huge volumes of data more efficient and effective. Machine learning algorithms, a subset of artificial intelligence, exhibit superior performance in handling large volumes of data [7].

The majority of research in electricity consumption modeling uses various machine learning methods, in particular artificial neural networks (ANN), decision trees, and clustering. Platon et al. [1] studied ANN for the hourly prediction of electricity consumption. Walker et al. [8] predicted the energy consumption of buildings using random forest (RF) and ANN. Pérez-Chacón et al. [9] proposed a methodology for finding patterns of electricity consumption using k-means clustering. Shi et al. [10] employed an echo state network to predict building energy demand. Furthermore, literature reviews on machine learning algorithms in energy research have been provided by Mohandes et al. [11] and Lu et al. [12].

Moreover, current research on predicting future electricity demand with hybrid models that combine two or more different machine learning algorithms was reviewed and analyzed by Deb et al. [13] and Mamun et al. [14]. The hybrid models show excellent results because individual algorithms have different advantages. Zekić-Sušac et al. [4] integrated variable selection with ANN in developing a predictive model for the energy cost of public buildings in Croatia. Zekić-Sušac et al. [15] integrated clustering and ANN to improve the accuracy of modeling energy efficiency. Muralitharan et al. [16] proposed a convolutional neural network based optimization approach for predicting future energy demand. Pérez-Chacón et al. [17] used a big data time-series and experimental method with decision tree, gradient boosting machine (GBM), pattern sequence-based forecasting, ARIMA, and ANN to forecast electricity demand. Zekić-Sušac et al. [18] used variable reduction procedures with the RPart regression tree, RF, and deep neural networks to construct a predictive model for the energy demand of public buildings in Croatia. Basurto et al. [19] employed a hybrid intelligent system based on ANN and a clustering algorithm to predict solar energy in Spain. Therefore, it is appropriate to exploit machine learning algorithms in electricity consumption prediction.

This study has demonstrated the prediction of electricity consumption. The proposed procedure has several phases: data collection and data preprocessing, dimensionality reduction and feature selection, and prediction. In the data collection and data preprocessing phase, the data is collected from publicly available sources and processed to handle the missing values and outliers. In the dimensionality reduction and feature selection phase, the techniques including principal component analysis (PCA), stepwise regression (SWR), and RF are applied. The prediction phase is implemented by using the backpropagation neural networks (BPNN) algorithm. The selected important predictor variables from PCA, SWR, and RF have been used as the inputs for the BPNN algorithm to predict the electricity consumption. Besides creating a predictive model, the subset of relevant variables is also selected and compared. Six metrics evaluate the effectiveness of the predictive models. The model with the highest accuracy in the test evaluation has been selected.

The other sections of this paper are as follows. Section 2 introduces the machine learning algorithms employed in this study and exploration of the relevant literature. Section 3 outlines the architecture of the proposed predictive models, experimental dataset, and the evaluation method. Section 4 describes the experimental outcome and some statistical results. Finally, Section 5 gives the paper achievement as well as the conclusions and future works.

2. Literature Review

Machine learning constructs empirical models from a dataset and is categorized as data-driven modeling, which requires a sufficient quantity of historical data to predict future demand reliably [8]. Machine learning algorithms extract essential information from large amounts of recorded data, thereby achieving better performance and accuracy [7, 20]. Systematic literature reviews of artificial intelligence and machine learning algorithms are provided by Duan et al. [21], Borges et al. [22], and Dwivedi et al. [23]. It is accepted that these techniques bring a significant impact and new research frameworks to industries such as finance, medicine, and manufacturing and to various government, public sector, and business domains. For example, machine learning has been employed in the prediction of crime [24], future prices of agricultural products [25], natural gas consumption [26], commercial bank performance [27], landslide displacement [28], and seawater evaporation [29]. Prediction algorithms have thus been extensively investigated in several sectors.

Constructing predictive models from a large number of variables is not straightforward in practice. Not all of these variables can be collected completely in a real-world situation, and using all of them results in a more complex model. The importance of dimensionality reduction and feature selection for modeling variables has been broadly demonstrated in [30]. The PCA is the most common feature extraction method for reducing the dimensionality of a large dataset to a smaller one that retains most of the information [31]. The PCA uses eigenvalues and eigenvectors to project the high-dimensional dataset onto a lower-dimensional space. It converts a set of correlated variables into a set of principal components. Only the principal components that retain most of the original variance are extracted [32, 33]. Dimensionality reduction with the PCA is applied in many domains such as electricity consumption [1], finance [31], engineering [34], and agriculture [35, 36]. The disadvantage of PCA is that the transformed predictor variables become less interpretable and have no corresponding physical meaning, which makes it more challenging to determine the predictor variables that are important in the predictive model [35, 37].
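
The eigenvalue-based projection described above can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not the implementation used in this study; the function name `pca_eig` and the toy matrix are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, n_components):
    """Project the rows of X onto the leading principal components."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices (ascending order)
    order = np.argsort(eigvals)[::-1]        # reorder by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()      # share of total variance per component
    scores = Xc @ eigvecs[:, :n_components]  # coordinates in the reduced space
    return scores, explained

# Synthetic data: three correlated columns built from two latent factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * base[:, 1], base[:, 1]])

scores, explained = pca_eig(X, n_components=2)
print(explained.round(4))
```

Because the three toy columns span only two directions, the first two components carry essentially all of the variance, which is the situation in which PCA compresses correlated inputs without losing information.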

Feature selection is a process that selects the features containing the most useful information while discarding redundant features that contain little to no information. Wrapper feature selection algorithms such as SWR and RF depend on the accuracy of the subsequent feature selection criterion. The SWR is a statistical method of fitting regression models and has the advantage of evading collinearity [38]. It defines appropriate variable subsets and evaluates variable priorities [39]. Selection of predictor variables is performed automatically by assessing the relative importance of the variables based on prespecified criteria such as the F-test, the t-test, the adjusted R2, and the Akaike information criterion (AIC).

The RF is a supervised machine learning algorithm that is effective and efficient for both classification and prediction. It is based on a decision tree algorithm and classified as a bagging ensemble learning method [24, 40]. The RF structure is composed of multiple decision trees, and the trees run in parallel with one another. During the variable randomization in each iteration, a variable importance index and the Gini index can be given [41]. The final value is evaluated by aggregating the results from all leaves of each tree [35, 42]. The RF is also one of the best algorithms for estimating the importance of variables and is applied in various fields [20, 43–45]. Furthermore, the RF is an excellent prediction algorithm and has the advantages of good generalization and a good balance of error [11, 46, 47]. Some analysts employed both SWR and RF to select the input variables or analyze the importance of variables in domains such as the electronic industry [5], geographical poverty [45], reservoir characterisation [48], and soil carbon [49]. They stated that RF is better than SWR in identifying nonlinear relationships between variables.

The ANN is classified as a supervised learning method and is often compared with other machine learning techniques in terms of prediction performance [50–55]. Among ANN methods, the BPNN is one of the most widely used techniques to optimize the feedforward neural network. The backpropagation algorithm is a standard method for training the weights in a multilayer feedforward neural network through the chain rule [56]. The weights of the network are adjusted based on the loss in the previous iteration. This lowers the error rate and makes the model more reliable by improving its generalization. Researchers have applied BPNN in many classification and prediction tasks. For example, BPNN is employed in predicting agricultural product sales [57], crude oil future price [58], and hybrid cement [59], and its prediction performance has also been compared with that of other machine learning techniques [50–52].
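
The chain-rule weight update described above can be illustrated with a small NumPy network. This is a didactic sketch on synthetic data; the layer sizes, the learning rate, and the plain gradient-descent update are assumptions for the example, not the configuration used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                       # synthetic inputs
y = (X @ np.array([1.0, -2.0, 0.5]))[:, None]      # synthetic target, shape (64, 1)

# One hidden layer with 8 ReLU units and a linear output node.
W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.01                                          # illustrative learning rate
losses = []

for _ in range(500):
    # Forward pass
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)                        # ReLU activation
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))        # mean squared error

    # Backward pass: apply the chain rule layer by layer
    d_pred = 2.0 * err / len(X)                    # dLoss/dpred
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T                            # propagate to the hidden layer
    d_z1 = d_h * (z1 > 0)                          # derivative of ReLU
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # Adjust the weights against the gradient of the loss
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(losses[0], losses[-1])
```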

3. Materials and Methods

The outline of the proposed framework is shown in Figure 1. The hybrid predictive model comprises three stages conducted in sequence. The first stage performs exploratory data analysis. The second stage uses the PCA, SWR, and RF methods to select suitable predictor variables. The third stage establishes the predictive model by constructing the BPNN. The developed models are trained with 10-fold cross-validation and evaluated. All models have been implemented and tested on an Intel® Core™ i7-8550U CPU @ 1.80 GHz (1.99 GHz) running a 64-bit Windows 10 operating system with 8 GB RAM. The Scikit-learn machine learning package for the Python programming language is used to implement the models with Python version 3.9.4. Many algorithm configuration parameters are set to the defaults of Scikit-learn version 0.23.0, Numpy version 1.20.2, and Pandas version 1.2.3.

3.1. Data Collection

The real data samples were collected from publicly available online sources from the beginning of 2018 to the end of 2019. The dataset contained 884,736 records of monthly electricity consumption. The 21 predictor variables were grouped into five categories, namely, geospatial, geographical, climatic, industrial, and household factors. The electricity consumption and geospatial factor were obtained from the official website of the Provincial Electricity Authority of Thailand [60]. The geographical, industrial, and household factors were obtained from the official website of the National Statistical Office of Thailand [61]. The climatic factor was obtained from the official website of the Thai Meteorological Department [62].

3.2. Data Preprocessing

The geospatial factor represented four categorical variables. Firstly, the electrical substation was the type of electrical distribution substation covering the four regions of Thailand (South, North, Northeast, and Central). Each region has three areas; thus, this variable can take on 12 different values (0–11). Secondly, the business type belonged to eight types of business (0–7): small residential houses, large residential houses, small business, medium business, large business, specific activities, government, and agriculture. Thirdly, the time of use was categorized into three periods: peak day (2:00 p.m.–7:00 p.m.), semipeak (5:00 a.m.–2:00 p.m., 7:00 p.m.–12:00 a.m.), and offpeak (12:00 a.m.–5:00 a.m.) as suggested by Yang et al. [63]. Finally, periods in Thailand are grouped into three seasons: summer (February–May), rainy (June–October), and winter (November–January).
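
The time-of-use and season groupings above can be expressed as two small lookup functions. This is a sketch only: the function names are hypothetical, and treating each period as a half-open interval (so that 2:00 p.m. exactly counts as peak and 7:00 p.m. exactly as semipeak) is an assumption, since the boundary handling is not stated.

```python
def time_of_use(hour: int) -> str:
    """Map an hour of the day (0-23) to the tariff period listed in the text."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if 14 <= hour < 19:          # 2:00 p.m. to 7:00 p.m.
        return "peak"
    if hour < 5:                 # 12:00 a.m. to 5:00 a.m.
        return "offpeak"
    return "semipeak"            # 5:00 a.m. to 2:00 p.m. and 7:00 p.m. to 12:00 a.m.


def season(month: int) -> str:
    """Map a calendar month (1-12) to the three seasons used in the text."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if 2 <= month <= 5:          # February to May
        return "summer"
    if 6 <= month <= 10:         # June to October
        return "rainy"
    return "winter"              # November to January


print(time_of_use(15), time_of_use(3), season(12))
```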

Data quality is a key success factor in model development since poor data quality can negatively impact model accuracy. To perform the analysis, it is vital to identify outliers, which are measurement results far from other values and hence not representative of the majority of the data. The outliers here are the minimum amounts of electricity consumption in the range of 0.00–0.09 W from the specific activities, government, and agriculture sectors, and they are removed from the dataset. Only 231 records of the available data are considered outliers, and the final modeling dataset contains 884,505 records. The geographical, climatic, industrial, and household factors are numerical values on a monthly basis in each province. According to the geospatial factor, each variable is averaged monthly. The 21 predictor variables and one target variable are shown in Table 1 with their descriptive statistics.
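
The outlier rule above can be sketched as a Pandas filter. The column name `Consumption`, the sector codes 5, 6, and 7 (taken from the order in which the business types are listed), and the inclusive bounds are assumptions for illustration; the toy records are invented.

```python
import pandas as pd

# Assumed codes for specific activities, government, and agriculture.
LOW_CONSUMPTION_SECTORS = {5, 6, 7}

def drop_low_outliers(df, low=0.0, high=0.09):
    """Remove near-zero consumption records from the three listed sectors."""
    is_outlier = (
        df["Usage_Type"].isin(LOW_CONSUMPTION_SECTORS)
        & df["Consumption"].between(low, high)   # inclusive on both ends
    )
    return df.loc[~is_outlier].reset_index(drop=True)

# Invented toy records, not taken from the dataset.
toy = pd.DataFrame({
    "Usage_Type": [0, 5, 6, 7, 7],
    "Consumption": [0.05, 0.05, 0.20, 0.00, 0.09],
})
clean = drop_low_outliers(toy)
print(len(toy) - len(clean), "records removed")
```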

3.3. Reduction of Modeling Inputs

Reducing the number of input variables is critical to a successful predictive model. Theoretically, a model should be built with a small number of relevant inputs to achieve an acceptable level of predictive accuracy [1].

3.3.1. Principal Component Analysis

All 21 predictor variables are processed by the PCA to find the principal components for dimensionality reduction. The StandardScaler function in the Scikit-learn Python library is used to standardize the 21 predictor variables onto a unit scale. Therefore, all the standardized predictor variables have a mean of zero and a standard deviation of one.
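
The standardization step can be checked on a small example. The two synthetic columns below are assumptions chosen only to show variables on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic columns on very different scales (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[30.0, 1000.0], scale=[5.0, 200.0], size=(100, 2))

X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6))  # close to 0 for every column
print(X_std.std(axis=0).round(6))   # close to 1 for every column
```

When a held-out test set is used, fitting the scaler on the training subsample only and reusing it on the test subsample avoids leaking test statistics into the model.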

3.3.2. Stepwise Regression

After the correlated predictor variables are dropped, 15 predictor variables are retained. The remaining variables are entered into or removed from the regression equation of the SWR model one by one. Each time a predictor variable is entered, the selection is assessed based on the AIC so that redundant variables are removed. This process is repeated until no further predictor variable is entered into or removed from the regression equation.
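
A bidirectional stepwise search driven by the AIC can be sketched as follows. This is one possible reading of the procedure, using ordinary least squares with the Gaussian AIC up to an additive constant; the function names and the synthetic data are assumptions, and the original implementation may differ.

```python
import numpy as np

def aic_ols(X, y, cols):
    """AIC (up to a constant) of an OLS fit with intercept on the given columns."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

def stepwise_aic(X, y):
    """Add or drop one variable at a time while the AIC keeps decreasing."""
    selected = []
    best = aic_ols(X, y, selected)
    improved = True
    while improved:
        improved = False
        # Candidate moves: enter one unused variable or remove one selected variable.
        moves = [selected + [c] for c in range(X.shape[1]) if c not in selected]
        moves += [[s for s in selected if s != c] for c in selected]
        scored = [(aic_ols(X, y, m), m) for m in moves]
        cand_aic, cand = min(scored, key=lambda t: t[0])
        if cand_aic < best - 1e-9:
            best, selected, improved = cand_aic, cand, True
    return sorted(selected), best

# Synthetic data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

selected, best_aic = stepwise_aic(X, y)
print(selected)
```

The two informative columns are always retained in this toy case; an uninformative column may occasionally enter as well, because the AIC penalty admits a variable whenever the fit improves by slightly more than chance.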

3.3.3. Random Forest

The RF is used for evaluating the importance of predictor variables. The RF model is implemented using an ensemble of 1000 trees, with the number of trees determined by trial and error. The split criterion is the mean square error (MSE) between the target and predicted output in a node. A maximum tree depth of 8 is used in model construction.
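
The configuration above maps onto Scikit-learn's `RandomForestRegressor` roughly as sketched below. The helper name `rf_importance` and the synthetic data are assumptions. The split criterion is left at the library default, which is the squared-error criterion for regression, and the demonstration uses 50 trees only to keep it fast. Note that `feature_importances_` is an impurity-based measure; a permutation-based measure such as `sklearn.inspection.permutation_importance` is closer in spirit to %IncMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_importance(X, y, n_trees=1000, max_depth=8, seed=0):
    """Fit a random forest and return its impurity-based variable importances."""
    rf = RandomForestRegressor(
        n_estimators=n_trees,   # 1000 trees in the text
        max_depth=max_depth,    # maximum depth of 8 in the text
        random_state=seed,
    )
    rf.fit(X, y)
    return rf.feature_importances_

# Synthetic data: column 0 dominates the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

importances = rf_importance(X, y, n_trees=50)
print(importances.round(3))
```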

3.4. BPNN Predictive Model Development

Before developing the models, the data are divided into training and testing subsamples. The training subsample is used for constructing the model, while the testing subsample is used to determine the model efficiency. Table 2 gives the number of samples in the training group (70% of samples) and the testing group (30% of samples). This division ratio is recommended by Zekić-Sušac et al. [4, 18]. In the training stage, 10-fold cross-validation is also applied to reduce overfitting and to provide more reliable and unbiased models. The training dataset is divided randomly into 10 folds, and all models are evaluated 10 times. The cross-validation technique is implemented because it makes the model more reliable for new unseen data [33]. Lastly, the final predictive accuracy is evaluated with the remaining unseen 30% of data in the testing group.
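
The 70/30 split followed by 10-fold cross-validation on the training subsample can be sketched with Scikit-learn as follows. The synthetic data and the linear model standing in for the BPNN are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# 10 random folds drawn from the training subsample only.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = LinearRegression()  # stand-in for the BPNN
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

# Final assessment on the untouched 30%.
test_r2 = model.fit(X_train, y_train).score(X_test, y_test)
print(cv_r2.mean().round(4), round(test_r2, 4))
```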

A feedforward ANN and the backpropagation training method are chosen for developing the predictive model for electricity consumption. The result of this predictive model is the value of the target variable, which indicates the forecasted electricity consumption. The six BPNN models are developed using all available variables and the selected variables previously reduced by the PCA, SWR, and RF methods. This study tests BPNN architectures with two or three hidden layers, as suggested by Zekić-Sušac et al. [4, 18]. The structure of the six BPNN predictive models is indicated in Table 3. All six BPNN models have only one node in the output layer, that is, the electricity consumption value obtained from the predictive model. The rectified linear unit function (ReLU) is utilized as the activation function to define the output of that node. The BPNN is fed with the training subsample, and the learning rate is 0.001. Training stops when either the number of epochs reaches 1,000 or the training goal is reached.
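
A network with these settings could be set up in Scikit-learn roughly as follows. The hidden layer sizes `(16, 8)`, the Adam solver, the tolerance used as the training goal, and the helper name `build_bpnn` are assumptions, since the layer sizes are given only in Table 3. In `MLPRegressor`, the `activation` argument applies to the hidden layers, while the single output node uses the identity function.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_bpnn(hidden_layers=(16, 8), seed=0):
    """Feedforward network trained by backpropagation (layer sizes are hypothetical)."""
    return make_pipeline(
        StandardScaler(),
        MLPRegressor(
            hidden_layer_sizes=hidden_layers,  # two hidden layers in this sketch
            activation="relu",                 # ReLU, as in the text
            solver="adam",                     # assumed; the solver is not stated
            learning_rate_init=0.001,          # learning rate from the text
            max_iter=1000,                     # at most 1,000 epochs
            tol=1e-4,                          # assumed stand-in for the training goal
            random_state=seed,
        ),
    )

# Synthetic data (for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=300)

model = build_bpnn().fit(X, y)
train_r2 = model.score(X, y)
print(round(train_r2, 4))
```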

3.5. Performance Evaluation Metrics

For assessing the performance of all predictive models, the following statistical indicators have been computed: coefficient of determination (R2), root mean square error (RMSE), mean absolute percentage error (MAPE), and predictive accuracy (Acc) according to Walker et al. [8], Pérez-Chacón et al. [17], Qiao et al. [26], Chen et al. [43], and Li et al. [46]. Lower RMSE and MAPE and higher R2 indicate better accuracy. These metrics can be formulated as follows:
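
In their commonly used forms, and writing $y_i$ for the target output, $\hat{y}_i$ for the calculated output, $\bar{y}$ and $\bar{\hat{y}}$ for their respective averages, and $n$ for the number of measurements, three of these metrics read as shown below. The notation and the choice of the squared-correlation form of R2 (suggested by the use of both averages) are assumptions; the alternative form $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ is also widely used. No formula is given for Acc because its definition cannot be inferred from the surrounding text.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}} \\
\mathrm{MAPE} &= \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \\
R^{2} &= \left(\frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}\right)^{2}
\end{align}
```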

The other evaluation metrics are symmetric mean absolute percentage error (SMAPE) and normalized root mean square error (NRMSE) according to Zekić-Sušac et al. [4, 18] and Janković et al. [53]. A model with lower NRMSE and SMAPE indicates greater accuracy in prediction. These metrics can be formulated as follows:
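
With $y_i$ the target output, $\hat{y}_i$ the calculated output, and $n$ the number of measurements, one common formulation of these two metrics is shown below. Both metrics have several variants in the literature; the SMAPE denominator and the normalization of the RMSE by the range of the target are assumptions, and normalization by the mean of the target is an equally common choice.

```latex
\begin{align}
\mathrm{SMAPE} &= \frac{100\%}{n}\sum_{i=1}^{n}
\frac{\left|\hat{y}_i - y_i\right|}{\left(\left|y_i\right| + \left|\hat{y}_i\right|\right)/2} \\
\mathrm{NRMSE} &= \frac{1}{y_{\max} - y_{\min}}
\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
\end{align}
```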

All parameters are explained as follows: $y_i$ is the target (real) output, $\hat{y}_i$ is the calculated output, $\bar{y}$ is the average of the target output, $\bar{\hat{y}}$ is the average of the calculated output, and $n$ is the total number of measurements.
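
A small worked example makes the scale of each metric concrete. The sketch below computes five of the six metrics with NumPy under assumed conventions: R2 as the squared Pearson correlation, NRMSE normalized by the range of the target, and SMAPE with the mean of the absolute values in the denominator. Acc is omitted because its definition is not stated. The function name and the two-point example are illustrative only.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAPE, SMAPE, NRMSE, and R2 under the assumed conventions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mape = float(100.0 * np.mean(np.abs(err / y_true)))   # requires nonzero targets
    smape = float(
        100.0 * np.mean(2.0 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred)))
    )
    nrmse = rmse / float(y_true.max() - y_true.min())     # range normalization (assumed)
    r2 = float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)    # squared correlation (assumed)
    return {"RMSE": rmse, "MAPE": mape, "SMAPE": smape, "NRMSE": nrmse, "R2": r2}

# Targets 2 and 4 predicted as 1 and 5: both errors have magnitude 1.
m = regression_metrics([2.0, 4.0], [1.0, 5.0])
print(m)
```

In this example the RMSE is 1, the MAPE is 37.5% (the mean of 50% and 25%), and the NRMSE is 0.5, while R2 equals 1 because the predictions are perfectly correlated with the targets despite being biased. This shows why correlation-based and error-based metrics are reported together.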

4. Results and Discussion

4.1. Correlation and Multicollinearity Analysis

The Pearson correlation coefficient (rp) is used to analyze the correlation between two numerical variables, and Spearman’s rank correlation coefficient (rs) is utilized to evaluate the correlation in categorical variables. The correlation coefficients, which indicate the strength and direction of association between all electricity consumption variables, are shown in Figure 2.

For the multicollinearity analysis, the variance inflation factor (VIF) of all predictor variables is also computed. The presence of multicollinearity implies that a variable provides redundant information already contained in other variables [3]. The variables available for the modeling stage are determined by considering the correlation and VIF between pairs of variables. As a result, six predictor variables, namely, Electricity_Substation, Season, Population_N, Mean_Minimum_Temperature, Agriculturists, and Expenditure, are removed from the 21 predictor variables, and the 15 remaining predictor variables are used as the input of SWR.
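
The VIF of a variable is $1/(1 - R_j^2)$, where $R_j^2$ is obtained by regressing that variable on all the others. A minimal NumPy sketch on synthetic data follows; the function name and the toy columns are assumptions, and the cutoff applied in this study is not stated here.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = []
    for j in range(p):
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data: columns 0 and 1 are nearly collinear, column 2 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.1 * rng.normal(size=500)
x2 = rng.normal(size=500)
v = vif(np.column_stack([x0, x1, x2]))
print(v.round(2))
```

The two nearly collinear columns receive large VIF values, while the independent column stays close to 1, so one of the collinear pair would be a candidate for removal.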

4.2. Selecting Predictor Variables

4.2.1. Principal Component Analysis

The 21 predictor variables are used as the input of PCA. Since PCA works on numerical variables, the four categorical variables are converted into numerical variables using the one-hot encoding technique. This study runs two experiments, namely, PCA 1 and PCA 2, which differ in the cumulative contribution rate of the principal components, as shown in Figure 3. For PCA 1 and PCA 2, principal components are retained until the cumulative contribution rate reaches 95% and 99%, respectively. Therefore, the first 9 and 13 principal components are considered significant for PCA 1 and PCA 2, respectively, and are used as variables in the predictive modeling stage.
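
Choosing the number of components from the cumulative contribution rate can be sketched as follows. The helper name and the synthetic data are assumptions; on the real dataset the same rule would be applied to the encoded predictor matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for(X, threshold):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    X_std = StandardScaler().fit_transform(X)
    cum = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    k = int(np.searchsorted(cum, threshold)) + 1   # first index where cum >= threshold
    return min(k, len(cum)), cum

# Synthetic data: six observed columns mixed from three latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 6)) + 0.05 * rng.normal(size=(300, 6))

k95, cum = n_components_for(X, 0.95)
k99, _ = n_components_for(X, 0.99)
print(k95, k99)
```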

4.2.2. Stepwise Regression

The 15 predictor variables from the correlation and multicollinearity analysis stage are used as the input for the SWR. Since the Usage_Type and TOU variables are categorical, they are encoded as dummy variables so that the SWR algorithm can analyze them correctly. Consequently, the input to the SWR parameter estimation consists of 22 variables. Variables selected by SWR are summarized in Table 4, with the VIF metrics calculated per variable. According to the VIF value, the 11 selected variables are essential in modeling the electricity consumption. Geospatial variables are highly correlated with electricity consumption, among which TOU is the most important variable, followed by Usage_Type.
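
One dummy coding that is consistent with the count of 22 input variables drops the first level of each categorical variable: the 13 numerical predictors plus 7 dummies for the eight business types plus 2 dummies for the three time-of-use periods. Whether the first level was dropped in this study is an assumption. The sketch below uses Pandas with a single numerical column and invented values.

```python
import pandas as pd

# Invented toy records; only one of the 13 numerical predictors is shown.
df = pd.DataFrame({
    "Usage_Type": pd.Categorical([0, 3, 7, 1], categories=list(range(8))),
    "TOU": pd.Categorical(
        ["peak", "offpeak", "semipeak", "peak"],
        categories=["peak", "semipeak", "offpeak"],
    ),
    "Household": [1200, 800, 950, 1100],
})

# drop_first=True removes one reference level per categorical variable.
encoded = pd.get_dummies(df, columns=["Usage_Type", "TOU"], drop_first=True)

n_numeric = 13  # numerical predictors remaining among the 15 inputs
n_inputs = n_numeric + (8 - 1) + (3 - 1)
print(encoded.shape[1], n_inputs)
```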

4.2.3. Random Forest

The importance of the 21 predictor variables given by the RF model is listed in Table 5. From the results, the four most important variables according to the percentage increase in mean squared error (%IncMSE) are Usage_Type, Area, Mean_Station_Pressure, and Industrial_Labor. Two experiments are defined from the RF importance values: RF 1 with the 18 predictor variables having importance over 0.01 and RF 2 with the 11 variables having importance over 0.02.
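
The two thresholds can be applied with a simple filter over the importance scores. The variable names and scores below are invented placeholders, not values from Table 5.

```python
def select_by_importance(importances, threshold):
    """Keep the variables whose importance exceeds the threshold."""
    return [name for name, score in importances.items() if score > threshold]

# Placeholder scores for illustration only.
toy_importances = {"var_a": 0.350, "var_b": 0.025, "var_c": 0.015, "var_d": 0.004}

rf1_vars = select_by_importance(toy_importances, 0.01)  # looser cutoff, as in RF 1
rf2_vars = select_by_importance(toy_importances, 0.02)  # stricter cutoff, as in RF 2
print(rf1_vars, rf2_vars)
```

Every variable kept under the stricter cutoff is also kept under the looser one, so the RF 2 variables form a subset of the RF 1 variables.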

4.3. Prediction Evaluation

The results of six machine learning models are shown in Table 6. The predictive performance is evaluated using six metrics, namely, RMSE, MAPE, NRMSE, SMAPE, R2, and Acc. The single model is developed with all available 21 predictor variables (BPNN). The hybrid models are constructed with the principal components extracted by the PCA method (PCA 1 + BPNN and PCA 2 + BPNN) and with the reduced variables selected by the SWR and RF methods (SWR + BPNN, RF 1 + BPNN, and RF 2 + BPNN).

It can be seen from Table 6 that the most successful model is the RF 1 + BPNN with 18 selected variables, producing an R2 of 0.9932 and an Acc of 0.9926. The RF 2 + BPNN has a slightly lower R2 of 0.9630 and Acc of 0.9923, while the BPNN has an R2 of 0.9638 and an Acc of 0.9631. In summary, the hybrid model is the most effective in electricity consumption prediction, in line with Deb et al. [13] and Mamun et al. [14].

In the SMAPE assessment, using all available variables (BPNN) results in a SMAPE of 21.0117%. The PCA 1 + BPNN and PCA 2 + BPNN produce higher SMAPE values of 28.0458% and 22.0321%, respectively, while the lowest SMAPE is obtained by the SWR + BPNN (21.0112%). Although this lowest SMAPE of 21.0112% is not as good as the results of Walker et al. [8] and Zekić-Sušac et al. [18], whose models obtained errors below 20%, it is better than that of the model obtained by Zekić-Sušac et al. [4] (SMAPE of 22.3555%).

In the case of the dimensionality reduction method, the PCA 1 + BPNN produced an RMSE of 0.0601 and the PCA 2 + BPNN produced an RMSE of 0.0405. The RMSE therefore decreases as the number of principal components increases; equivalently, fewer principal components lead to lower predictive accuracy, as stated by Zhang [34]. The error values show a slight improvement over the BPNN only when the 13 principal components are used (RMSE = 0.0405 in PCA 2 + BPNN compared with RMSE = 0.0415 in BPNN). The other hybrid models do not improve on the BPNN in terms of RMSE.

The RF selects the significant modeling inputs among the 21 available inputs. As a result, 18 and 11 significant variables are selected and used in the RF 1 + BPNN and RF 2 + BPNN models, respectively. The RF 1 + BPNN model slightly outperforms the RF 2 + BPNN. This means that using more relevant variables generally yields superior prediction accuracy [35]. Moreover, using the smaller number of variables does not make a significant difference to the prediction performance while minimizing the computational cost. A comprehensive comparison shows that the highest accuracy comes from the RF 1 + BPNN (0.9926), followed by the RF 2 + BPNN (0.9923) and the BPNN (0.9631). This result suggests that the hybrid model using fewer inputs than the baseline is more accurate. It also suggests that irrelevant variables bring meaningless information to the input dataset, add unnecessary variability and noise, and hinder accurate modeling [1]. The RF can model nonlinear relations between variables and has high accuracy, as investigated by Richardson et al. [49]. The results of this study indicate that the proposed hybrid RF + BPNN model performs excellently in terms of predictive accuracy.

Comparing the hybrid models based on dimensionality reduction and feature selection, the RF 1 + BPNN model gives the best performance. In this study, the two PCA + BPNN experiments do not achieve good performance. This is likely due to the correlation and multicollinearity among predictor variables. The SWR + BPNN performs comparably to the BPNN but worse than the RF + BPNN. Smith [38] suggested that the SWR is less effective in the case of a large number of possible predictor variables. The results indicate that the RF has higher accuracy than the SWR and is more powerful in identifying the nonlinear relationships between target and predictor variables, as reported by Liu et al. [45].

The results of the experiments suggest that the PCA and SWR cause a loss of predictive accuracy. In addition, the predictive models developed using the RF-selected variables outperform the model using all available variables. This reveals that the RF has sufficient complexity to reduce a high-dimensional dataset.

Another focus of this study is to investigate the abilities of the feature selection methods. The variables selected by SWR, RF 1, and RF 2 are listed in Table 7. The selected variables cover the geospatial, geographical, climatic, industrial, and household factors. This can be interpreted to mean that all groups of variables play an essential role in predicting electricity consumption. There are some differences between the variables selected in each variable group. Even if the redundant and irrelevant variables are discarded, the correlation between the rest of the variables still exists.

The variables identified by all of SWR, RF 1, and RF 2 are Usage_Type, Mean_Maximum_Temperature, Mean_Relative_Humidity, Industrial_Labor, and Household. Therefore, these variables are considered to be the critical variables in explaining electricity consumption. Among the industrial factor group of variables, all three models have selected Industrial_Labor, while the SWR and RF 1 have also selected Industrial_Plant. The largest number of selected variables belongs to the climatic factor group. This result confirms the findings of the previous studies by Walker et al. [8], Deb et al. [13], and Zekić-Sušac et al. [18]. Among all variables of the climatic factor, the temperature is reported to be the most significant variable influencing electricity consumption [3].

The hybrid predictive model in this study successfully integrates existing machine learning algorithms, and the selection of predictor variables for electricity consumption is demonstrated. The predictive model is constructed from the actual electricity consumption data during 2018–2019 from the Provincial Electricity Authority of Thailand. The derived hybrid machine learning model can assist in determining key variables to support decisions on the management of electricity consumption. Based on this successful demonstration with authentic data, the Provincial Electricity Authority can deploy the hybrid predictive model to achieve efficient energy demand planning and management. Future studies to improve the predictive accuracy may include other machine learning techniques for feature selection or deep learning.

5. Conclusions

This study investigates machine learning algorithms under different models to predict electricity consumption. The most effective prediction will likely come from using a more relevant dataset. The number of predictor variables is an important factor, directly related to the performance and accuracy of the predictive models. Influential predictor variables must be selected to achieve a robust and accurate predictive model and reduce the computation time. The PCA, SWR, and RF algorithms are employed to select the predictor variables according to their importance. This study successfully selected 11 predictor variables out of the original 21 predictor variables by the RF. Both SWR and RF select the same set of 5 predictor variables, and the remaining selections differ slightly between the models.

The BPNN with 21 predictor variables serves as a benchmark for performance comparison in this study. The hybrid models are constructed by combining the PCA, SWR, and RF with the BPNN. The 10-fold cross-validation technique is also employed to obtain unbiased, reliable, and accurate predictive models. Performance is compared using the RMSE, MAPE, NRMSE, SMAPE, R2, and accuracy metrics. Comparison results confirm that the RF is superior to the PCA and SWR. Based on the experiments, integrating the RF with the BPNN algorithm could effectively improve the accuracy of the electricity consumption prediction.

Data Availability

The data used to support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest

All authors state that there are no conflicts of interest.

Authors’ Contributions

WK contributed to conceptualization, methodology, writing the original draft, visualization, validation, and investigation. YS contributed to conceptualization, supervision, writing the original draft, review and editing, validation, investigation, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

The authors would like to thank the government data solution project team for their collaborative support and guidance during this study. This research was financially supported by the new strategic research project (P2P), Walailak University, Thailand.