Power load prediction is significant for a sustainable power system and is key to the economic operation of the energy system. Accurate power load predictions provide a reliable basis for power system planning decisions. However, it is challenging to predict the power load with a single model, especially for multistep prediction, because the time series load data contain multiple periods. This paper presents a deep hybrid model with a serial two-level decomposition structure. First, the power load data are decomposed into components; then, a gated recurrent unit (GRU) network, with its parameters tuned by Bayesian optimization, is used as the subpredictor for each component. Last, the predictions of the different components are fused to obtain the final prediction. The power load data of American Electric Power (AEP) were used to verify the proposed predictor. The results show that the proposed method effectively improves the accuracy of power load prediction.

1. Introduction

With the rapid development of society, electric power is applied in all aspects of production and life. To ensure normal production and living needs, power enterprises always produce more energy than needed. However, because electric power cannot be stored at scale, excess energy wastes resources, and excessive operation also affects the safety of power equipment [1–3]. Therefore, power load prediction is of considerable significance to power enterprises. The benefits of load prediction include effective planning of the annual power supply, reduction of power waste and costs, and development of operation plans. An accurate prediction can provide a reliable basis for operational decisions and ensure the power system's sustainable development.

However, in reality, many factors make accurate power load prediction challenging. The power load is a very complex nonlinear time series, which makes it very difficult to predict accurately. For example, the weather affects the cost of power [4]; other factors, such as differences in regional development levels and unpredictable natural disasters [5], also cause various changes in the power load.

The power load data generally contain the following four components:

(1) Trend component: it reflects the main trend of the power load data, which is either upward or downward. The trend component is the basic level of the power load over a long time: if the power load increases, the component trends upward; if the power load decreases, it trends downward.

(2) Daily period component: the power load data have a distinct daily period, that is, a high power load in the daytime and a low power load at night.

(3) Annual period component: the power load data have another period over a year, with the power load changing from month to month.

(4) Residual component: this part is what remains after removing the trend and periodic components from the original data; it contains complex nonlinear data and noise.

Figure 1(a) shows the power load data of American Electric Power (AEP) from January 1, 2017, to January 1, 2020. The horizontal coordinate in the graph indicates that the sampling interval is one hour. Figure 1(b) shows the trend component separated from the original data. Figure 1(c1) shows the full daily period component, while Figure 1(c2) shows an excerpt from July 2018 as a closer view; the daily period component clearly has a distinct period. Figure 1(d) shows the annual period component: the power load is high in winter and summer, while it is low in spring and autumn.

In recent years, accurate prediction of time series has become a focus of researchers. Time series usually have nonlinear, nonstationary, and complex period characteristics [6, 7]. Existing methods for predicting time series include statistical prediction, machine learning, and combined prediction [8–10]. The statistical prediction methods are usually based on mathematical models [11, 12]. Such prediction models often include the regression analysis method [13], the gray model method [14, 15], the support vector machine (SVM) [16, 17], the autoregressive integrated moving average (ARIMA) [18], and the artificial neural network (ANN) [19]. These methods often struggle to obtain accurate predictions when dealing with complex nonlinear data.

Unlike the methods described earlier, deep learning does not require prior information and has stronger learning and prediction abilities. For example, Tang et al. [20] presented a multilayer bidirectional recurrent neural network based on LSTM and GRU; Gao et al. [21] used GRU to build models for short-term load prediction. Guo et al. [22] presented an integrated deep learning method that combines multiple LSTM networks to model large-span cycles using LSTM's nonlinear modeling and similar-day methods. Kollia and Kollias [23] proposed deep convolutional-recurrent neural networks to process time series or two-dimensional data to improve prediction accuracy. Yin et al. [24] proposed a three-state energy model, with generator, power load, and closed status as the three states, and a scalable deep learning approach for real-time economic generation scheduling and control based on the three-state energy of the future smart grid. Zhang et al. [25] presented a prediction model based on the restricted Boltzmann machine and Elman. He et al. [26] proposed a deep belief network (DBN) embedded in a parametric copula model. Although the accuracy of these deep learning methods is improved compared with the traditional methods, it is still difficult to learn sufficient feature representations from nonstationary data.

Based on the above research, the latest studies combine decomposition methods with deep learning to achieve better prediction results. These methods decompose the original data into components, use different prediction methods for the decomposed components, and fuse the predicted results. For example, the seasonal-trend decomposition procedure based on loess (STL) can obtain trend, seasonal, and residual components of complex data [27] and has been used in a hybrid predictor in the authors' former research on weather forecasting [8, 28]. Using another decomposition method, wavelet decomposition, Wang et al. [29] decomposed the original time series and constructed a predictor for each subsignal. Li et al. [30] proposed an extreme learning machine (ELM) combined with variational mode decomposition (VMD). Guo et al. [31] proposed decomposing the original sequence by empirical mode decomposition (EMD) and selecting different models (including AR, MA, and ARMA) based on the characteristics of the subcomponents. The authors have used EMD to decompose PM2.5 time series data to obtain more accurate forecasts [32, 33]. Compared with VMD and EMD, STL decomposition guarantees a known number of components (three: trend, seasonal, and residual) and is particularly suitable for sequential data with periodicity.

Comparing statistical and machine learning methods, the results show that prediction accuracy can be improved by decomposing the data into multiple components and modeling each separately. Moreover, hyperparameters are found to have a significant impact on prediction performance, so hyperparameter optimization methods have been used. For example, [34–36] decomposed the original data with a wavelet algorithm and predicted the components with a particle swarm optimization (PSO) neural network. Another optimization method, the fruit fly optimization algorithm (FOA), was used to select parameters for the generalized regression neural network (GRNN) [37]. In contrast, He et al. [38] used a Bayesian optimization algorithm based on a Parzen estimator to optimize the hyperparameters of a quantile regression forest (QRF) predictor. FOA and PSO are population-based optimization algorithms; they are not well suited to model hyperparameter tuning because they need many initial sample points, and their optimization efficiency is low. For the training process of deep learning models, however, we want to evaluate as few samples as possible. Therefore, the Bayesian optimization algorithm is widely used with deep learning models, as it can approach the global optimum with the fewest sampling points.

This paper uses a serial two-level decomposition structure to improve prediction performance, motivated by the multiple-period complexity of power load data. Furthermore, the Bayesian optimization algorithm is applied to optimize the hyperparameters of the model. The main contributions of this paper are as follows:

(1) The Bayesian sequential model-based optimization (SMBO) algorithm is used to optimize model hyperparameters to improve prediction performance.

(2) According to the double-period characteristics of power load data, the original data are decomposed with a serial two-level decomposition structure. Four GRUs are used for the trend component, daily period component, annual period component, and residual component.

The rest of this paper is arranged as follows. Section 2 introduces the proposed serial two-level decomposition and the prediction model in detail. Section 3 gives the experimental results of the SMBO optimization algorithm and compares them with other methods. Finally, Section 4 summarizes the conclusions.

2. Serial Two-Level Decomposition Optimization Model

The model consists of decomposition, prediction, and fusion processes. The framework of the prediction model is shown in Figure 2: the two-level decomposition structure is applied first, decomposing the data into four components. In the training phase, four GRUs are trained, one on each component; in the prediction phase, the trained GRUs predict the four components. Finally, the results of the submodels are fused to obtain the final prediction.

2.1. Serial Two-Level Decomposition

The time-series data of the original electric load are decomposed with two levels. Figure 3 shows the detailed information of the decomposition node of Figure 2. After the first-level decomposition, the three components of trend, period, and residual are obtained. However, the residual data still contain trend and periodic information. Therefore, the residual obtained by the first decomposition is decomposed again. Similarly, three sets of data, new trends, new periods, and new residuals resulting from the second decomposition, can be obtained.

After the first-level decomposition of the original electric power load data, the trend T1(t), the period Pd(t), and the residual R1(t) are obtained. The second decomposition, applied to the residual R1(t), yields the trend T2(t), the period Py(t), and the residual R2(t). Finally, the components with the same characteristics are combined to obtain the final decomposition results in three categories: the trend T(t) = T1(t) + T2(t), the periods Pd(t) and Py(t), and the residual R2(t).

2.1.1. First‐Level Decomposition

Power load data form a discrete time series of length N, x(t), t = 1, 2, …, N, so the three sets of data, trend, period, and residual, can be represented discretely as

x(t) = T1(t) + Pd(t) + R1(t)

where Pd(t), R1(t), and T1(t) are the period component, the residual component, and the trend component, respectively. The detailed decomposition steps are as follows:

(1) In the first-level decomposition, the power load of a day has a potential period, so the decomposition period is set to 1 day; for the hourly data used here, the period length is 24. The number of periods is n1 = ⌈N/24⌉, where ⌈·⌉ rounds its input up to the nearest integer.

(2) Using the average regression method, the trend component T1(t), which expresses the overall trend of the time series, is extracted from the original data x(t).

(3) The following two steps give the period component Pd(t) of the original data x(t): (a) calculate the initial value of the periodic component as P′(t) = x(t) − T1(t); (b) because ⌈·⌉ rounds up, 24·n1 and N may not be equal, so instead of using all the data, select P′(1) to P′(24(n1 − 1)), superpose the points occurring at the same time of day, and divide by n1 − 1 to get a 24-point periodic curve; this curve is duplicated n1 times (and truncated to N points) so that the periodic component Pd(t) with the same N points is obtained.

(4) The raw data minus the period and trend data give the residual, R1(t) = x(t) − T1(t) − Pd(t).

2.1.2. Second‐Level Decomposition

Because the first-level decomposition does not completely decompose the original data, and the residual still contains rich periodic and trend information, a second-level decomposition of the residual R1(t) obtained from the first level is carried out in a similar manner. This yields a second set of data representing the annual trend, the annual period, and the residual:

R1(t) = T2(t) + Py(t) + R2(t)

where Py(t), R2(t), and T2(t) are the annual period component, the residual component, and the trend component, respectively. The detailed decomposition steps are as follows:

(1) The period of the second decomposition is set to 1 year, that is, 8760 hours. The number of periods is n2 = ⌈N/8760⌉, where ⌈·⌉ rounds up.

(2) Using the average regression method, the trend component T2(t), which expresses the overall trend of the time series, is extracted from the residual data R1(t).

(3) The following three steps give the period component Py(t) of the residual data R1(t): (a) calculate the initialization periodic component as P′(t) = R1(t) − T2(t); (b) select P′(1) to P′(8760(n2 − 1)), superpose the points occurring at the same time of year, and divide by n2 − 1 to get a periodic curve, which is duplicated n2 times (and truncated to N points) so that a periodic component with the same N points is obtained; (c) smooth this curve with a moving window of 24 hours, using the mean value in place of the original data, to obtain the new periodic component Py(t) of length N.

(4) The residual data minus the period and trend data give the residual, R2(t) = R1(t) − T2(t) − Py(t).
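To make the serial two-level procedure concrete, the sketch below decomposes a series with a daily (24 h) and an annual (8760 h) period. This is an illustrative reading of the steps above, not the authors' implementation: the "average regression" trend extraction is approximated here by a centered moving average, and the function names are ours.

```python
import numpy as np

def decompose(x, period, trend_window):
    """One decomposition level: split x into trend, periodic, and residual parts.

    The trend is a centered moving average (a simple stand-in for the paper's
    "average regression"); the periodic part is the mean profile over the
    complete periods of the detrended series, tiled back to length N.
    """
    n = len(x)
    kernel = np.ones(trend_window) / trend_window
    trend = np.convolve(x, kernel, mode="same")        # centered moving average
    detrended = x - trend
    n_full = (n // period) * period                    # keep only complete periods
    profile = detrended[:n_full].reshape(-1, period).mean(axis=0)
    periodic = np.tile(profile, n // period + 1)[:n]   # duplicate and truncate to N
    residual = x - trend - periodic                    # exact by construction
    return trend, periodic, residual

def serial_two_level(x, daily=24, yearly=8760):
    """Level 1 extracts the daily period; level 2 decomposes the residual again."""
    t1, p_day, r1 = decompose(x, daily, trend_window=daily + 1)
    t2, p_year, r2 = decompose(r1, yearly, trend_window=yearly + 1)
    return t1 + t2, p_day, p_year, r2                  # trend, daily, annual, residual
```

By construction, the four returned components sum back to the original series, mirroring the additive decomposition used in the paper.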

2.2. Subpredictor

After the two-level decomposition, five components are obtained: the residual R2(t), the trends T1(t) and T2(t), and the periods Pd(t) and Py(t). T1(t) and T2(t) both represent the linear trend of the data, so they are combined into T(t) = T1(t) + T2(t). Therefore, four groups of subdata are finally obtained, and four GRU networks are trained as predictors, one for each group.

The GRU network is a development of the LSTM model. It simplifies the model structure, reduces the number of network parameters to be trained, and inherits LSTM's ability to handle long-term dependencies. Hence, GRU is a good model structure for prediction. The GRU cell consists of two gates, the update gate and the reset gate, and each GRU cell is structured as shown in Figure 4.

The update gate adjusts the information transmitted from the previous moment to the current moment: the smaller its value, the less information is carried from the last moment to the present one. The reset gate adjusts how much information from the previous moment is ignored: the larger its value, the less information is ignored, so that the new input can be fused with more of the stored information.

The forward propagation of input data through each GRU cell is as follows:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t

where x_t is the input; z_t, r_t, h̃_t, and h_t are the update gate, the reset gate, the candidate state of the hidden node at the t-th time point, and the output state of the hidden node at the t-th time point, respectively; W and U are the weights in the model; b is the bias; ⊙ denotes element-wise multiplication; and σ and tanh are the activation functions used in the cell, with

σ(x) = 1/(1 + e^(−x)),  tanh(x) = (e^x − e^(−x))/(e^x + e^(−x))

According to the structure of the GRU cell above and the relationship between the input and output data, a GRU network is built, as shown in Figure 5. The network contains multiple GRU cells arranged in two layers. In Figure 5, the network input and output are connected through the stacked GRU cells, and m is the number of GRU cells in each layer.
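The gate equations above can be sketched as a single numpy GRU cell. This is an illustrative forward pass with small random weights (the names Wz, Uz, etc. are ours), not the trained two-layer Keras network used in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Forward pass of one GRU cell: update gate, reset gate, candidate state."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        # One input weight (W) and one recurrent weight (U) per gate.
        self.Wz, self.Wr, self.Wh = (rng.uniform(-s, s, (hidden_size, input_size)) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (rng.uniform(-s, s, (hidden_size, hidden_size)) for _ in range(3))
        self.bz, self.br, self.bh = (np.zeros(hidden_size) for _ in range(3))

    def step(self, x, h_prev):
        z = sigmoid(self.Wz @ x + self.Uz @ h_prev + self.bz)            # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h_prev + self.br)            # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h_prev) + self.bh)  # candidate state
        # A small z keeps little of the past; (1 - z) admits the candidate state.
        return z * h_prev + (1.0 - z) * h_cand
```

Stacking m such cells per layer, two layers deep, and feeding 24-step windows matches the network shape described around Figure 5.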

2.3. Sequential Model-Based Optimization (SMBO)

Before training a deep-learning model, we need to initialize the hyperparameters of the model, and good choices can improve the model's prediction performance. For the two subpredictors of the trend component and the daily period component, the traditional parameter selection method, using the network's default initialization parameters, achieves adequate prediction. However, the annual period component and the residual data are complicated, so here we use a Bayesian optimization method, the SMBO algorithm [39].

SMBO needs an objective function and then updates the posterior distribution of the objective over the parameter space. Here, the objective function is the root mean square error (RMSE), that is,

RMSE(λ_k) = sqrt( (1/n) Σ_{i=1}^{n} (ŷ_i(λ_k) − y_i)² )

where λ_k is the k-th input hyperparameter group, n is the number of samples, ŷ_i(λ_k) is the prediction obtained by the model using hyperparameter combination λ_k, and y_i is the real value. Then, for the SMBO algorithm, we have

λ* = argmin_{λ ∈ Λ} RMSE(λ)

where λ* is the best parameter combination determined by the SMBO algorithm, λ is a set of input hyperparameters, and Λ is the multidimensional hyperparameter space.

The update of the parameter space includes two steps: the Gaussian process (GP) and hyperparameter selection. In the Gaussian process step, the algorithm models and fits the objective function and obtains the posterior distribution corresponding to the input λ; in the hyperparameter selection step, "exploitation" and "exploration" are used to find the optimal parameter at minimum cost. "Exploration" finds appropriate parameters in the unsampled hyperparameter space, which often leads to the globally optimal parameter combination. "Exploitation" searches near the last set of hyperparameters according to the posterior probability distribution. The objective function is assumed to follow a Gaussian process:

f(λ) ~ GP(μ(λ), K)

where μ(λ) is the mean function, K is the covariance matrix of the sampled points, GP denotes a Gaussian process, and f is the objective function. The mean is initialized as

μ(λ) = 0

During the parameter search of the SMBO algorithm, the covariance matrix of the Gaussian process changes with the number of iterations. If the hyperparameter group entered in step i + 1 is λ_{i+1}, the covariance matrix can be expressed as

K_{i+1} = [ K_i, k ; kᵀ, k(λ_{i+1}, λ_{i+1}) ]

where k = [k(λ_{i+1}, λ_1), …, k(λ_{i+1}, λ_i)]ᵀ; then, the posterior probability of the objective function can be obtained successively:

P(f_{i+1} | D_{1:i}, λ_{i+1}) = N(μ_i(λ_{i+1}), σ_i²(λ_{i+1}))

where D_{1:i} is the observation data, μ_i is the posterior mean at step i, σ_i² is the posterior variance at step i, P(f_{i+1} | D_{1:i}, λ_{i+1}) is the probability of the objective function given the first i observations and the parameter group λ_{i+1}, and N is the normal distribution. The next step is to find the best parameter through hyperparameter selection after the posterior probability is obtained. The upper confidence bound (UCB) acquisition function is used for selection in this paper:

λ_{i+1} = argmax_λ H(λ) = argmax_λ [ μ_i(λ) + κ σ_i(λ) ]

where κ is a constant, H is the UCB acquisition function, and λ_{i+1} is the hyperparameter group selected in step i + 1.

The SMBO algorithm is shown in Algorithm 1.

Input: RMSE(·) is the objective (the root mean square error of the proposed model), k is the number of selected hyperparameter groups, H is the UCB acquisition function, D is the observation data set, M is the proposed model, and λ is an input hyperparameter group.
Output: the optimal hyperparameter group λ*.
for i = 1 to k do
Model the objective function and calculate the posterior probability.
Select the parameter group λ_i using the UCB acquisition function H.
Train the network M with hyperparameter group λ_i and evaluate RMSE(λ_i) on its prediction.
Update the data set D with (λ_i, RMSE(λ_i)).
end for
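Algorithm 1 can be sketched as a small, self-contained loop: a Gaussian process posterior over the observed (hyperparameter, score) pairs, plus a UCB acquisition step. The RBF kernel, the discrete candidate grid, the value of the constant kappa, and the toy one-dimensional objective are all our assumptions; in the paper, the objective is the RMSE of a trained GRU, not a closed-form function. Since UCB maximizes, the loop maximizes the negated loss.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between 1-D point sets a (n,) and b (m,)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean/std at candidates Xs given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))         # jitter keeps K invertible
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)                   # K^{-1} Ks
    mu = sol.T @ y
    var = np.diag(rbf(Xs, Xs)) - np.sum(Ks * sol, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def smbo_minimize(objective, candidates, n_iter=20, kappa=2.0):
    """SMBO loop: fit a GP to the negated loss and pick the next point by UCB."""
    X = [candidates[0], candidates[-1]]            # two initial samples at the ends
    y = [-objective(x) for x in X]                 # maximize the negated loss
    for _ in range(n_iter):
        mu, sd = gp_posterior(np.array(X), np.array(y), candidates)
        nxt = candidates[np.argmax(mu + kappa * sd)]   # UCB acquisition
        X.append(nxt)
        y.append(-objective(nxt))
    return X[int(np.argmax(y))]                    # best sampled point
```

On a smooth toy loss, the loop first explores the unsampled regions (where the posterior standard deviation is large) and then concentrates samples near the optimum, which is the exploration/exploitation trade-off described above.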

3. Experiment Results and Discussion

In this study, the electric power load data are from American Electric Power (AEP) and include 26,280 hourly samples from January 1, 2017, to January 1, 2020. In the experiment, the first 70% of the data were used as the training set and the remaining 30% as the test set. The data are first decomposed, and each subcomponent is then normalized separately. In the training process, the trend and daily period components are put directly into their subpredictors for training, while the annual period and residual components first undergo hyperparameter search with the SMBO algorithm and are then used to train their subpredictors. In the testing process, the components are put into the trained subpredictors, and the outputs of the subpredictors are fused to obtain the final prediction. The power load prediction can then be used to plan the precise supply of power and to develop operation plans. The overall system flow diagram is shown in Figure 6.

3.1. Experimental Setup

The predictor uses Keras to build the learning model. All models were trained and tested on a PC server with an Intel Core i7 CPU (2.21 GHz) and 32 GB RAM. In deep learning, many hyperparameters need to be set (for example, the number of network layers, the weight initialization, and the learning rate). The GRU network structure is set to two layers.

For the complex components, namely the annual periodic component and the residual component, this paper uses the SMBO algorithm to find the optimal values of some network hyperparameters. The remaining parameters use the Keras default initialization, and the model parameters are obtained by optimizing the predefined objective function.

For the daily period component and trend component, the GRU model uses the Nadam optimizer, and all parameters use the Keras defaults. The activation functions in the network are tanh and ReLU. The learning and prediction steps are both set to 24; that is, the model uses the previous day's power load to predict the next day's power load. Predicting one day ahead helps the relevant departments anticipate the next day's power load and make appropriate plans based on it.
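The 24-step setup can be sketched as a windowing helper that pairs each day's 24 hourly loads with the next day's. The non-overlapping day-to-day pairing and the function name are our assumptions about the setup described above.

```python
import numpy as np

def make_windows(series, steps=24):
    """Pair each day's `steps` hourly loads (input) with the next day's (target)."""
    X, Y = [], []
    for i in range(0, len(series) - 2 * steps + 1, steps):
        X.append(series[i:i + steps])               # previous day's 24 loads
        Y.append(series[i + steps:i + 2 * steps])   # next day's 24 loads
    return np.array(X), np.array(Y)
```

For the 26,280-hour AEP series (1,095 days), this pairing yields 1,094 input/target day pairs before the train/test split.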

In this study, five indexes are used to evaluate the model's performance: the root mean square error (RMSE), the normalized root mean square error (NRMSE), the mean absolute error (MAE), the symmetric mean absolute percentage error (SMAPE), and the Pearson correlation coefficient (R). The smaller the first four indicators, the more accurate the model's prediction; the larger the fifth indicator (R), the better the fit between the observed and predicted values. The five indicators are calculated as

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² )
NRMSE = RMSE / ȳ
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|
SMAPE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i| / ((|y_i| + |ŷ_i|)/2)
R = Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ȳ_p) / sqrt( Σ_{i=1}^{n} (y_i − ȳ)² · Σ_{i=1}^{n} (ŷ_i − ȳ_p)² )

where n is the number of samples, y_i is the ground-truth value of the power load, ȳ is the average of the ground-truth values, ŷ_i is the predicted value, and ȳ_p is the average of the predictions.
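The five indexes can be computed as follows. The normalizer used for NRMSE here (the mean of the ground truth) is one common convention and an assumption on our part, as is the use of numpy's `corrcoef` for R.

```python
import numpy as np

def metrics(y_true, y_pred):
    """RMSE, NRMSE, MAE, SMAPE, and Pearson R for one prediction run."""
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    nrmse = rmse / np.mean(y_true)                 # assumed normalizer: mean load
    mae = np.mean(np.abs(err))
    smape = np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))
    r = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation
    return {"RMSE": rmse, "NRMSE": nrmse, "MAE": mae, "SMAPE": smape, "R": r}
```

Because SMAPE divides by the average magnitude of the two values, it stays bounded even when individual loads are small, which suits load data better than plain MAPE.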

3.2. Hyperparameter Selection Based on Bayesian Optimization

Table 1 shows the hyperparameter space of the SMBO algorithm. The selected hyperparameters include the number of neurons in the first layer, the batch size, and the optimizer. All hyperparameter groups are evaluated in the model over 100 epochs, and the optimized group of hyperparameters is finally obtained for subsequent network training.

Table 2 compares the prediction performance of the Bayesian parameter optimization and the no-optimization method for the annual periodic component and the residual component. It can be observed that Bayesian optimization significantly improves the effectiveness of the model. For example, the RMSE of the annual period component declined by 25.9%, from 471.25 to 349.0855. Similarly, the RMSE of the residual component fell 5.5%, from 606.5918 to 572.7338.

The comparison between the predicted and real power load for December 12 to 23, 2019, is shown in Figure 7. We can see that the weekend power load is lower, while the weekday power load is higher. Within a day, the load is higher in the morning and afternoon and lower at noon. Therefore, it is reasonable that the proposed method decomposes the original power load data twice, taking both the daily and annual periodicity into account.

3.3. Comparison of Prediction Results with Different Models

In the setup experiment, we compare the performance of the proposed method with seven models, i.e., the recurrent neural network (RNN) [40], long short-term memory (LSTM) [41], GRU [42], STL-RNN (RNN based on STL), STL-LSTM (LSTM based on STL) [43], STL-GRU (GRU based on STL) [8], and wavelet-LSTM (W-LSTM) [44]. The hourly power load data used are from February 6, 2019, to December 31, 2019. The partial prediction results of each model are shown in Figure 8, from which it can be seen that the proposed model performs best.

Figures 9 and 10 show the comparison results of the five indicators. From Table 3, we can see that the RMSEs of RNN, LSTM, and GRU are 905.4590, 835.8678, and 805.0688, respectively, and the MAEs are 677.6044, 630.6519, and 599.9143, respectively. With STL decomposition, the RMSEs of STL-RNN, STL-LSTM, and STL-GRU are 851.9837, 771.5973, and 747.3044, decreasing by 5.9%, 7.6%, and 7.2%, respectively; the MAEs are 664.7142, 579.3349, and 554.4263, decreasing by 1.9%, 8.1%, and 7.6%, respectively. Therefore, the decomposition method is effective in improving the prediction performance.

Moreover, the results show that the GRU network has the best prediction performance. For example, compared with the RMSEs of RNN and LSTM, the RMSE of GRU is reduced by 11.1% and 3.7%, respectively. With the decomposition method, compared with STL-RNN and STL-LSTM, the RMSE of STL-GRU is decreased by 12.3% and 3.1%, respectively. This validates the choice of GRU as the subpredictor in this paper.

Furthermore, we find that the serial two-level decomposition is rational, and the proposed model works best: it obtains the lowest RMSE (676.6433), MAE (486.0197), and SMAPE (0.0328), the highest R (0.9575), and the second-lowest NRMSE (0.0572). We believe that the original data contain nonlinear information; after two serial decompositions, the complex periodic information and trend information are predicted separately, which fits the data better, and fusing the results yields a prediction with better performance. The deep-learning prediction models proposed in this paper can be combined with parameter estimation algorithms [45–51], such as the iterative algorithms [52–57] and the recursive algorithms [58–64], to study new modeling and prediction approaches for different engineering application problems [65–69], such as system modeling, information processing, and transportation communication systems.

4. Conclusions

More accurate power load predictions enable power generation and operation companies to better control their operating status, facilitate market regulation, save costs, and prevent pollution. This study uses a serial two-level decomposition structure to decompose the electric power load time series according to its different periods, which reduces the complex nonlinear relationships in the original data. The overall trend component indicates that the electric power load changes slowly, which can be understood as the power load remaining at a certain level for a long time. The daily period component shows the daily variation, higher in the daytime and slightly lower at night; over a year, the load is higher in winter and summer and lower in spring and autumn, all of which corresponds to the actual use of electricity.

After decomposing the raw power load sequence, a GRU is used to build the prediction model for each component. The predictions from the subpredictors are then fused to obtain a more accurate prediction. After the two-level decomposition of the data, the trend information and the multiple-period information in the original complex time series are separated into subsequences, and prediction models are built for the subsequences based on their different characteristics to obtain the final prediction results. The prediction methods proposed in this paper can also be applied to other studies [70–76] for different purposes. In future research, to further improve the model's performance, new network structures will be adopted, and other decomposition or combination methods will be tried. The model proposed in this study can be applied not only to power load prediction but also to other data that contain multiple-period information.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (nos. 61673002, 61903009, and 61903008), the Beijing Municipal Education Commission (nos. KM201910011010 and KM201810011005), the Young Teacher Research Foundation Project of BTBU (no. QNJJ2020-26), and the Beijing Excellent Talent Training Support Project for Young Top-Notch Team (no. 2018000026833TD01).