#### Abstract

Establishing an accurate and robust short-term load forecasting (STLF) model is both necessary and beneficial for the safe operation and rational dispatching of a power system. Although deep long short-term memory (LSTM) networks have been widely used in load forecasting applications, they still have problems to be solved, such as unstable network performance and long optimization time. This study proposes an adaptive step size self-organizing migration algorithm (AS-SOMA) to improve the predictive performance of LSTM. First, an optimization model for LSTM prediction is developed, which divides the search for the LSTM structure into two stages. The first stage optimizes the number of hidden layers, and the second optimizes the number of neurons, time step, learning rate, epochs, and batch size. Then, a logistic chaotic mapping and an adaptive step size method are proposed to overcome the slow convergence of SOMA and its tendency to become stuck in local optima. Comparison experiments with SOMA, PSO, CPSO, LSOMA, and OSOMA on test function sets show the advantages of the improved algorithm. Finally, the AS-SOMA-LSTM network prediction model is used to solve the STLF problem to verify the effectiveness of the proposed algorithm. Simulation experiments show that AS-SOMA exhibits higher accuracy and convergence speed on the standard test function set and has strong prediction ability in the STLF application with LSTM.

#### 1. Introduction

Electric load forecasting has a great impact on the dispatching work and production schemes of the power system [1, 2]. Accurate STLF is not only necessary for the steady and safe functioning of the power grid but also provides significant economic benefits to power corporations [3, 4]. Research shows that reducing the prediction error by 1% can save 1.6 million dollars per year for a 10 GW power plant [5]. Moreover, as a key to intelligent power system research, intelligent load forecasting is of great significance for promoting the construction of smart cities [6]. The STLF problem is to predict future power consumption by analyzing the power consumption of a past period [7]. The traditional forecasting method relies on chart analysis, which is greatly affected by the weather and has a low degree of accuracy. With the development of statistical software and artificial intelligence technology, several prediction methods with higher accuracy have appeared. Statistical methods mainly include time series methods such as the autoregressive integrated moving average (ARIMA) [8]. Their basic idea is to use the temporal nature of the historical load data to forecast, so they achieve higher prediction accuracy on data with strong temporal regularity. However, their predictive ability is limited for data with many nonlinear relationships, and as power systems develop, the amount of data grows so large that its nonlinear relationships become more complex. Intelligent algorithms are mainly machine learning methods represented by the support vector machine (SVM) [9], random forest (RF) [10], and artificial neural network (ANN) [11]. Most of these algorithms require the time features to be set manually, and the temporal correlation of the data must be fully considered. For long data series, their predictive ability is limited.

The deep LSTM network, a recent time series forecasting model, is widely used in STLF problems [12]. According to their composition, methods that use an LSTM network to solve STLF problems are classified as mixed models (M-model) or optimization models (O-model). In the M-model, an LSTM network learns the nonlinear relationships after features have been extracted from the STLF data. Examples include the CNN-LSTM [13] model, which combines a convolutional neural network (CNN) with three one-dimensional convolution and pooling layers and a three-layer LSTM; a model composed of a CNN without a pooling layer and a deep LSTM [14]; and the CNN-SEQ2SEQ-ATT [15] model, which adds an attention mechanism. M-model predictions have been proven to be more effective than LSTM alone. The O-model uses an optimization algorithm to adjust the network structure of the LSTM, which can achieve more accurate load prediction. The optimization algorithms are mainly drawn from evolutionary computation (EC), as in IBA-LSTM [16], IGOA-LSTM [17], and EMD-PSO-LSTM [18]. In these models, the original evolutionary algorithm is usually modified to overcome its tendency to fall into local optima, and its convergence speed and convergence accuracy are also improved to obtain more accurate load predictions. Nevertheless, deep LSTM neural networks still face numerous challenges; for example, network performance is unstable and parameter optimization takes a long time [19, 20].

EC and its variants, such as the genetic algorithm (GA), differential evolution (DE), particle swarm optimization (PSO), artificial bee colony (ABC), and grey wolf optimization (GWO), have provided many solutions to the abovementioned NP-hard problems [21]. In an LSTM model for flood prediction based on rainfall-runoff observation data [22], where parameter optimization had relied on practitioner experience, PSO was proposed to optimize the batch size, time step, and number of hidden layer neurons of the LSTM. The comparison experiment showed that the PSO-LSTM model can effectively improve flood prediction accuracy over different lead times. For the prediction of shear wave velocity (Vs) [23], an adaptive subgroup division and particle updating method was proposed to improve PSO and keep the PSO-LSTM model from falling into local optima. The batch size, time step, and number of hidden layer neurons in the LSTM are then optimized iteratively with the improved PSO. Compared with other algorithms, this method has higher prediction accuracy and robustness. These methods have achieved good results. However, owing to their complexity, it is not easy to maintain a balance between convergence speed, convergence accuracy, and robustness [24, 25].

As a classic EC method, the self-organizing migrating algorithm (SOMA) [26] was proposed by Zelinka and Lampinen and performs well in engineering applications [27]. Compared with other algorithms, SOMA is easy to parallelize and has fewer tuning parameters. It has been widely used in dynamic constraint optimization [28], path planning [29], image processing [27], and other fields. Existing improvements of SOMA have been effective and provide a sound theoretical basis for its application in engineering. SOMGA [30] is a combination of GA and SOMA. In its initial phase, individuals compete in tournaments, and new individuals are created through single-point crossover and bit mutation. This algorithm is more robust than GA and SOMA on 25 test functions. mNM-SOMA [31] uses the NM crossover operator to find the optimal leader, and it outperforms SOMA, GA, and PSO. The CCMA-ES-SOMA [28] model uses the CCMA-ES algorithm to detect the feasible region of the optimization problem quickly. SOMA has obvious advantages in solving NP-hard problems. However, there is little literature on using it for the structural optimization of neural networks. Therefore, this study proposes a SOMA improvement scheme to solve the LSTM structure optimization problem.

#### 2. Materials and Methods

##### 2.1. LSTM and Parameter Optimization

###### 2.1.1. LSTM Memory Unit

LSTM addresses the exploding and vanishing gradient problems of the RNN. In an RNN, as time passes and the number of network layers increases, later nodes perceive earlier nodes more weakly and appear to forget the earlier information. In short, LSTM performs better than a normal RNN on longer sequences. Figure 1 shows the operation of the LSTM memory unit. Compared with RNN, LSTM adds a memory unit specifically for saving historical information. The input, forget, and output gates control how the historical information in the network is updated.

The forward propagation process of LSTM can be expressed as formulas (1)–(5), where $W$ is the weight matrix, $b$ is the bias corresponding to the weights, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product. The input and output vectors of the hidden layer of the LSTM at time step *t* are $x_t$ and $h_t$. The memory unit is $c_t$, and the input gate $i_t$ controls how much of the network's current input data flows into the memory cell, in other words, how much of $x_t$ can be saved to $c_t$. Its value is expressed as follows:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right). \tag{1}$$

The forget gate $f_t$ is a key component of the LSTM unit, which controls what information is retained and what is forgotten. It helps to avoid the vanishing and exploding gradient problems triggered when the gradient propagates backwards in time. The forget gate determines what historical information will be discarded, that is, how much the information in the memory cell of the previous moment $c_{t-1}$ affects the current memory cell $c_t$:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \tag{2}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right). \tag{3}$$

The output gate $o_t$ controls the effect of the memory cell $c_t$ on the current output value $h_t$, that is, which part of the memory cell is output at time step *t*:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \tag{4}$$

$$h_t = o_t \odot \tanh(c_t). \tag{5}$$
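The gate computations described above can be illustrated with a minimal sketch of one forward step of an LSTM memory unit. The sketch uses the conventional LSTM formulation; the weight layout (each gate acting on the concatenation of the previous output and the current input), the function names, and the toy weights are assumptions made for illustration and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM memory unit.

    params maps a gate name ('i', 'f', 'c', 'o') to a (W, b) pair acting on
    the concatenation [h_prev, x_t]; this layout is an assumption.
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate
    g_t = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate memory
    # New memory: keep part of the old cell, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy usage: 1 input feature, 2 hidden units, all weights 0.1, all biases 0.
hidden, features = 2, 1
W = [[0.1] * (hidden + features) for _ in range(hidden)]
b = [0.0] * hidden
params = {gate: (W, b) for gate in "ifco"}
h, c = lstm_step([1.0], [0.0, 0.0], [0.0, 0.0], params)
```

Because the previous memory cell is zero in this toy call, the forget gate has no effect and the new cell value is simply the input gate times the candidate memory.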

Because of its excellent performance, LSTM is used for a large number of sequence learning tasks, such as robot control [32], speech recognition [33], time series prediction [34, 35], and market prediction [36, 37].

###### 2.1.2. Deep LSTM

The classical LSTM model is composed of an input layer, hidden layer, and output layer. A deep LSTM can be formed by stacking multiple (≥2) hidden layers.

Each layer solves a portion of the task before passing it on to the next layer, until the final layer provides the output. Graves et al. constructed a deep LSTM by stacking LSTM hidden layers [38] and applied it to the speech recognition problem, where it showed obvious advantages in benchmark tests. The deep LSTM architecture is defined as a model that consists of multiple LSTM layers. The upper LSTM hidden layer provides a sequence output, rather than a single value, to the lower layer. This suggests that, when building the model, the depth of the network is more important than the number of LSTM cells in a given layer. The structure of the time-expanded deep LSTM recurrent neural network is shown in Figure 2.

###### 2.1.3. Optimization of Deep LSTM Parameters

The LSTM recurrent neural network is a stable technique for time series processing, and the number of hidden layers in the LSTM changes the level of abstraction at which the input observations are represented. However, adding layers increases the training time and memory cost exponentially. Meanwhile, vanishing gradients between layers weaken network performance, and this phenomenon becomes more significant when there are many layers. The hidden layers closer to the input layer then update slowly, the effectiveness and efficiency of convergence decline sharply, and the network falls into local minima more easily. As shown in Figure 3, the input to the LSTM is a three-dimensional tensor consisting of batch size, time step, and features, which are described below.

The current mainstream training method for DNNs is gradient descent: the model makes predictions from its inputs, the predicted values are compared with the actual values to estimate the error, the error measured by the specified loss function is used to update the model weights, and the process is repeated. The time step is the number of consecutive observations fed to the model in one sample, so its size determines the overall size of a single input to the model. It also affects how the model's prediction results map onto time.
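The role of the time step can be illustrated with a small sliding-window sketch that cuts a one-dimensional series into input and target windows. The function name `make_windows` and the placeholder series are hypothetical; the window lengths of 14 and 7 follow the fourteen-day input and seven-day forecast used in the hidden layer experiment of this paper.

```python
def make_windows(series, time_step, horizon):
    """Split a 1-D series into (input window, target window) pairs."""
    samples = []
    for start in range(len(series) - time_step - horizon + 1):
        x = series[start:start + time_step]                         # model input
        y = series[start + time_step:start + time_step + horizon]   # values to predict
        samples.append((x, y))
    return samples

daily_load = list(range(30))  # placeholder values, not real load data
pairs = make_windows(daily_load, time_step=14, horizon=7)
```

Stacking the input windows of a batch then gives the three-dimensional input of shape (batch size, time step, features), with one feature in this univariate case.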

The network model weights are updated based on a subset of the training data, called a batch. Batch size is the number of samples from the training data set used to estimate the error gradient, where a sample is the set of feature values over the time steps described above. Because the error gradient is a statistical estimate, the more training data used to compute it, the more accurate the estimate, and the more likely the weights of the network are to be tuned in a way that improves the model's performance. However, a more accurate estimate requires the model to make more predictions before each update, which raises the cost of estimating the error gradient. Batch size is therefore an important hyperparameter that affects the dynamics of the learning algorithm.

The learning rate is a hyperparameter used by the network model. It controls how much the model weights are changed in response to the estimated error and thus the speed at which the model adapts to the problem. Lower learning rates require more training cycles for adequate training, while larger learning rates lead to rapid changes in the network model weights and require fewer training cycles. However, if the learning rate is too high, the model may converge quickly to a suboptimal solution, while if the learning rate is too low, the process may stagnate.

Epochs defines the number of times the learning algorithm passes over the entire training data set. One epoch, which consists of one or more batches, means that each sample in the training data set has been used once for prediction by the model. The number of epochs directly affects how many features the model learns. Too many epochs lead to overfitting of the trained model, and too few lead to underfitting.
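The interaction between batch size and epochs can be made concrete with a short calculation of how many weight updates a training run performs. This is only an illustrative sketch: the function name and the figure of 1000 training samples are hypothetical, while the batch size of 16 and the 500 epochs follow the settings quoted later for the hidden layer experiment.

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Total number of weight updates: batches per epoch times epochs."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# 1000 training samples is a hypothetical figure chosen for illustration.
updates_small_batch = weight_updates(1000, batch_size=16, epochs=500)
updates_large_batch = weight_updates(1000, batch_size=64, epochs=500)
```

With the same number of epochs, the smaller batch size yields roughly four times as many weight updates, each based on a noisier gradient estimate.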

##### 2.2. AS-SOMA Model

###### 2.2.1. SOMA

The SOMA has three main phases: initialization, migration cycle, and end of migration. In the first phase, the fitness value of each particle of the initial population is evaluated according to the specific problem, and the best particle is chosen as the leader. The key to the SOMA is the second phase, in which the migration cycle is composed of multiple cumulative migration updates. Each particle repeatedly makes small jumps toward the leader with a specific step size $\mathit{Step}$, guided by a randomly initialized perturbation vector $\mathit{PRTVector}$. $\mathit{PRTVector}$ is a constraint variable that controls the dimensions in which the particle moves, and it is defined as follows:

$$\mathit{PRTVector}_j = \begin{cases} 1, & \mathit{rand}_j < \mathit{PRT}, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $\mathit{rand}_j$ is a random value in [0, 1], and $\mathit{PRTVector}$ is regulated by setting the coefficient $\mathit{PRT}$. The particle migration can be expressed as follows:

$$x_{i,j}^{ML+1} = x_{i,j}^{ML} + \left(x_{L,j}^{ML} - x_{i,j}^{ML}\right) \cdot t \cdot \mathit{PRTVector}_j, \tag{7}$$

where $x_{i}^{ML}$ and $x_{L}^{ML}$, respectively, represent the particle to be migrated and the leader in the *ML*th migration cycle, and $x_{i}^{ML+1}$ is the new position obtained by the particle migration. $t$ advances in intervals of length $\mathit{Step}$ as follows:

$$t = 0, \mathit{Step}, 2\,\mathit{Step}, \ldots, \mathit{PathLength}. \tag{8}$$

The jumps of a particle stop when its accumulated migration reaches the maximum $\mathit{PathLength}$. The selected leader then leads the particles through the next round of migration, and the migration ends when the algorithm's stop condition is met.
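The migration of a single particle toward the leader can be sketched as follows. This is a minimal illustration of the conventional SOMA jump rule rather than the authors' implementation: the function names, the path length of 3.0, the perturbation coefficient of 0.3, the regeneration of the perturbation vector at every jump, and the sphere test function are all assumptions, and minimization is assumed.

```python
import random

def migrate(particle, leader, fitness, step=0.21, path_length=3.0, prt=0.3, rng=random):
    """Move one particle toward the leader in jumps of `step`; keep the best position found."""
    best_pos, best_fit = list(particle), fitness(particle)
    t = step
    while t <= path_length:
        # Perturbation vector: 1 lets a dimension move, 0 freezes it for this jump.
        prt_vector = [1 if rng.random() < prt else 0 for _ in particle]
        candidate = [x + (l - x) * t * m
                     for x, l, m in zip(particle, leader, prt_vector)]
        cand_fit = fitness(candidate)
        if cand_fit < best_fit:  # minimization
            best_pos, best_fit = candidate, cand_fit
        t += step
    return best_pos, best_fit

def sphere(x):
    """Simple test function with its minimum at the origin."""
    return sum(v * v for v in x)

rng = random.Random(0)
start = [4.0, -3.0]
new_pos, new_fit = migrate(start, leader=[0.5, 0.5], fitness=sphere, rng=rng)
```

Because the starting position is kept as the initial best, a migration can never return a position worse than the one the particle started from.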

###### 2.2.2. Shortcomings of SOMA

SOMA has significant flaws in the initialization and migration cycle phases due to its design principles. For example, the initialization scheme relies on idealized randomness and the update mode uses a fixed step, which lead to slow convergence and a tendency to fall into local optima during optimization. The details are as follows.

To follow the law of biological populations, SOMA adopts a random scheme in the initialization stage. However, this random initialization may cause problems: the particles may lie only in a locally optimal region, the population density may be too high, and the spacing between individuals may be too small. Slow convergence, local optima, and premature convergence then occur in the optimization process.

In the migration cycle stage, because the migration step is fixed, the particles cannot migrate fully and effectively. Steps that are too large or too small impair the execution efficiency and convergence ability of the algorithm to varying degrees. SOMA selects the particle with the best fitness as the leader of each migration, which can effectively guide the population in a better direction. However, this selection lacks diversity of guidance and may even give incorrect guidance in the early stage of algorithm execution when the leader is only locally optimal.

###### 2.2.3. AS-SOMA

In this paper, the logistic chaotic mapping and the adaptive step size method are used to improve SOMA in the initialization stage and the migration cycle stage, respectively. With chaotic initialization, the particles can be evenly distributed over the whole search space, which promotes the balance between exploitation and exploration during the update process. The logistic map is one of the simplest chaotic maps, and it is described by formula (9):

$$z_{i+1} = \mu z_i (1 - z_i), \tag{9}$$

where $z_i$ represents the *i*th chaotic number, $z_i \in (0, 1)$, and $\mu$ is an adjustable parameter.

In the initialization stage of AS-SOMA, multiple chaotic sequences are generated by changing the initial value $z_0$ of the logistic chaotic map. Each chaotic number is used as a coefficient and multiplied by the extent of the search space in the corresponding dimension to obtain particles evenly distributed over the whole search space. The specific calculation is as follows:

$$x_{i,j} = lb_j + z_{i,j}\,(ub_j - lb_j), \tag{10}$$

where $ub_j$ and $lb_j$ are the upper and lower bounds of the particles in the *j*th dimension, respectively. As shown in Figure 4, the chaotic sequence leaves no blind area when $\mu = 4$. Therefore, the value of $\mu$ is 4 in this paper, $z$ is evenly distributed in the interval $(0, 1)$, and the particles are uniformly distributed in the search space $[lb_j, ub_j]$.
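The chaotic initialization described above can be sketched in a few lines. The sketch assumes one logistic sequence per dimension, each started from a different initial value; the function names, the initial values 0.13 and 0.37, and the bounds of ±100 are hypothetical choices for illustration. Initial values such as 0.25, 0.5, and 0.75 should be avoided because the map with a parameter of 4 collapses to a fixed point from them.

```python
def logistic_sequence(z0, length, mu=4.0):
    """Generate `length` chaotic numbers with the logistic map z <- mu * z * (1 - z)."""
    seq, z = [], z0
    for _ in range(length):
        z = mu * z * (1.0 - z)
        seq.append(z)
    return seq

def chaotic_population(pop_size, lower, upper, seeds):
    """Scale one chaotic sequence per dimension into the box [lower, upper]."""
    dims = len(lower)
    columns = [logistic_sequence(seeds[j], pop_size) for j in range(dims)]
    return [[lower[j] + columns[j][i] * (upper[j] - lower[j]) for j in range(dims)]
            for i in range(pop_size)]

population = chaotic_population(
    5, lower=[-100.0, -100.0], upper=[100.0, 100.0], seeds=[0.13, 0.37]
)
```

Each chaotic number lies between 0 and 1, so every generated particle stays inside the bounds of its dimension.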

The adaptive step method replaces the fixed step of the original algorithm in the AS-SOMA migration cycle. The method creates an archive of success information $S_{\mathit{Step}}$ that stores the step sizes of successfully migrated particles; the archived values are accumulated to form the mean $\mu_{\mathit{Step}}$ of a normal distribution. The next migration of the particles is guided by a random step drawn from a normal distribution with mean $\mu_{\mathit{Step}}$ and standard deviation $\sigma$:

$$\mathit{Step}_i = \mathrm{randn}\left(\mu_{\mathit{Step}}, \sigma\right). \tag{11}$$

In this paper, $\mu_{\mathit{Step}}$ is initialized to 0.21 for better results [39] and updated at the end of each migration cycle as follows:

$$\mu_{\mathit{Step}} = (1 - c)\,\mu_{\mathit{Step}} + c \cdot \mathrm{mean}\left(S_{\mathit{Step}}\right), \tag{12}$$

where $c$ is a constant between 0 and 1, and $S_{\mathit{Step}}$ represents the step size archive of successfully migrated particles.
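A minimal sketch of the adaptive step mechanism is given below. It assumes that the mean is updated as a weighted average of its current value and the arithmetic mean of the success archive; the standard deviation of 0.1, the constant of 0.1, the redraw of non-positive steps, and the function names are assumptions made for illustration, and only the initial mean of 0.21 comes from the text.

```python
import random

def sample_step(mu_step, sigma=0.1, rng=random):
    """Draw a step from a normal distribution; redraw until it is positive."""
    step = rng.gauss(mu_step, sigma)
    while step <= 0.0:
        step = rng.gauss(mu_step, sigma)
    return step

def update_mu_step(mu_step, successful_steps, c=0.1):
    """Blend the current mean with the mean of the archived successful steps."""
    if not successful_steps:
        return mu_step  # nothing learned in this migration cycle
    mean_success = sum(successful_steps) / len(successful_steps)
    return (1.0 - c) * mu_step + c * mean_success

mu = 0.21                                      # initial mean step from the text
step = sample_step(mu, rng=random.Random(1))   # step used for one migration
mu = update_mu_step(mu, [0.15, 0.18, 0.12])    # hypothetical success archive
```

In this example the archive mean is 0.15, so the updated mean moves slightly below 0.21, which mirrors the gradual shrinking of the step once smaller steps start to succeed.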

##### 2.3. AS-SOMA-LSTM Optimization Model

###### 2.3.1. LSTM Training

The training model of the deep LSTM is shown in Figure 5. The parameters to be optimized include the number of hidden layers, number of neurons, time step, learning rate, epochs, and batch size. To make the network weights converge more stably during training, this paper applies exponential decay to the learning rate hyperparameter, so that it decreases exponentially as the number of LSTM training epochs increases. The optimization process therefore only searches for its initial value.
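The exponential decay of the learning rate can be sketched as follows. The decay rate of 0.96 and the decay interval of 10 epochs are hypothetical values, since the text does not state the schedule's constants; the initial learning rate of 0.03414 is the optimized value reported later in this paper.

```python
def decayed_learning_rate(initial_lr, epoch, decay_rate=0.96, decay_epochs=10):
    """Exponentially decayed learning rate after `epoch` training epochs."""
    return initial_lr * decay_rate ** (epoch / decay_epochs)

# Learning rate at the start, after 100 epochs, and after 338 epochs.
schedule = [decayed_learning_rate(0.03414, e) for e in (0, 100, 338)]
```

Only the initial value enters the search space of the optimizer; the rest of the schedule follows from the fixed decay constants.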

###### 2.3.2. STLF Problem Based on the AS-SOMA-LSTM Optimization Model

As shown in Figure 6, AS-SOMA, based on the above LSTM training model, encodes each particle as $X_i = (n_{i,1}, n_{i,2}, ts_i, lr_i, ep_i, bs_i)$, where $X_i$ is the *i*th particle; $n_{i,1}$ and $n_{i,2}$ are the numbers of neurons in the first and second hidden layers, respectively; $ts_i$ is the time step of the model input; and $lr_i$, $ep_i$, and $bs_i$ are the initial learning rate, the number of training epochs, and the batch size in model training, respectively. The particles are decoded into parameters of the LSTM network model for training and prediction, and the optimal parameters are obtained through continuous iteration of AS-SOMA.
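Decoding a particle into LSTM hyperparameters might look like the sketch below. The class and function names are hypothetical, and rounding the continuous coordinates to the nearest integer (with a lower limit of 1) is an assumption, since the text does not state how integer-valued parameters are obtained. The example particle is chosen so that it decodes to the optimal configuration reported later in this paper.

```python
from dataclasses import dataclass

@dataclass
class LSTMConfig:
    units_layer1: int
    units_layer2: int
    time_step: int
    learning_rate: float
    epochs: int
    batch_size: int

def decode(particle):
    """Map a six-dimensional particle to LSTM hyperparameters."""
    n1, n2, ts, lr, ep, bs = particle

    def as_count(value):
        # Integer-valued parameters: round and keep at least 1 (assumed rule).
        return max(1, round(value))

    return LSTMConfig(as_count(n1), as_count(n2), as_count(ts),
                      float(lr), as_count(ep), as_count(bs))

config = decode([47.3, 85.8, 11.2, 0.03414, 338.4, 23.6])
```

The decoded configuration would then be used to build and train one LSTM, and the resulting test error would be returned to the optimizer as the fitness of that particle.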

The procedure for solving the STLF problem with the AS-SOMA-LSTM optimization model is shown in Figure 7. First, AS-SOMA initializes the population to generate individuals and generates the corresponding $\mathit{PRTVector}$ by setting the coefficient $\mathit{PRT}$. Subsequently, each individual in the population is decoded, and the resulting parameters are configured in the LSTM to construct the corresponding prediction model. The model is trained and tested to calculate the root mean square error (RMSE), which is returned to AS-SOMA as the fitness function value. AS-SOMA sorts the fitness function values of the population and selects the individual with the minimum value as the leader to guide the next population migration. The population migration is repeated until the maximum number of iterations set by AS-SOMA is reached. The leader of the last generation gives the hyperparameters of the optimal model.

#### 3. The Performance of the AS-SOMA

##### 3.1. Experimental Parameters

MATLAB is used in this paper to conduct comparative experiments on the fifteen test functions of CEC2015 [40]. According to their basic characteristics, the CEC2015 functions can be divided into four categories: F1 and F2 are single-peak functions; F3, F4, and F5 are basic multipeak functions; F6, F7, and F8 are mixed functions; and F9–F15 are synthetic functions. AS-SOMA is compared with the mainstream swarm intelligence algorithm PSO and its modified version CPSO, and the parameter settings of each algorithm are given in Table 1.

##### 3.2. Algorithm Evaluation

To evaluate the convergence accuracy of AS-SOMA, the experiments are assessed by the mean and standard deviation of the best results over multiple runs on each function, reported as $\text{Mean} \pm \text{Std}$. AS-SOMA is then compared with the other algorithms on these multi-run results. The mathematical expression is as follows:

$$\text{Mean} = \frac{1}{N}\sum_{i=1}^{N} f_i^{*}, \qquad \text{Std} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(f_i^{*} - \text{Mean}\right)^{2}},$$

where $f_i^{*}$ is the global optimal value obtained by the algorithm on the function in the *i*th run and $N$ is the number of runs. Wilcoxon's rank-sum test is performed on the obtained experimental data to verify the significance of the performance differences between individual algorithms. If the $p$ value is less than 0.05 ($p < 0.05$), the results of the current algorithm are significantly different from those of AS-SOMA. The convergence speed is compared by observing how quickly the fitness value of each benchmark function converges.
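The evaluation procedure can be sketched with the standard library alone. The rank-sum function below uses the normal approximation with average ranks for ties and no tie correction, which is a simplification; in practice a statistics package such as SciPy (`scipy.stats.ranksums`) would normally be used. The two result lists are synthetic values invented for illustration, not results from the paper.

```python
import math
import statistics

def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    combined = sorted((v, idx) for idx, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                              # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2.0))

# Synthetic best-of-run values for two hypothetical algorithms (10 runs each).
results_a = [0.1 * k for k in range(1, 11)]
results_b = [5.0 + 0.1 * k for k in range(1, 11)]

mean_a, std_a = statistics.mean(results_a), statistics.pstdev(results_a)
p_value = rank_sum_p_value(results_a, results_b)
significant = p_value < 0.05
```

Here the two samples do not overlap at all, so the test reports a significant difference at the 0.05 level.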

##### 3.3. Experimental Analysis of AS-SOMA Performance

In the experiment, the population size of PSO, CPSO, SOMA, LSOMA, OSOMA, and AS-SOMA is 100, the maximum number of function evaluations on each CEC2015 benchmark function is 300,000, each function is tested independently 25 times, and the dimension is $D = 30$.

###### 3.3.1. The Solving Accuracy

Table 2 shows the experimental results of each algorithm on the CEC2015 benchmark function set. “+”, “=”, and “−” indicate that the accuracy of the corresponding algorithm is better than, similar to, and worse than that of AS-SOMA, respectively. Bold entries indicate the best mean and the best standard deviation.

As shown in Table 2, AS-SOMA achieves the best experimental results among the six algorithms on both single-peak functions. Wilcoxon's rank-sum test shows that AS-SOMA is significantly better than all comparison algorithms on function *F*_{2}. On function *F*_{1}, AS-SOMA has convergence performance similar to that of SOMA and is significantly better than the other four comparison algorithms. On the basic multimodal functions, AS-SOMA achieves the best result on two of the three functions. Wilcoxon's rank-sum test shows that AS-SOMA is significantly better than PSO on (*F*_{4}, *F*_{5}), CPSO on (*F*_{4}, *F*_{5}), SOMA on (*F*_{3}), LSOMA on (*F*_{3}, *F*_{4}, and *F*_{5}), and OSOMA on (*F*_{3}, *F*_{5}). On the mixed functions, the results of AS-SOMA on function *F*_{6} are clearly better than those of CPSO, SOMA, LSOMA, and OSOMA, and on function *F*_{8} AS-SOMA is significantly better than PSO, CPSO, SOMA, and LSOMA. On function *F*_{7}, AS-SOMA is competitive with basic SOMA in obtaining the best result. On the composite functions *F*_{9}, *F*_{11}, *F*_{12}, *F*_{13}, and *F*_{14}, AS-SOMA is clearly superior to all comparison algorithms.

To sum up, AS-SOMA can solve the synthetic functions effectively. The excellent results are mainly attributed to the step size adaptive mechanism and the population initialization mechanism based on the logistic chaotic mapping. The interaction between the two mechanisms improves the solving accuracy of the algorithm.

###### 3.3.2. The Convergence Speed

To further verify the optimization effect of the algorithm during the search process, Figure 8 shows the average convergence curves of the comparison algorithms over 25 runs on the 15 benchmark functions of the CEC2015 set. The ordinate is the natural logarithm of the average result of the 25 independent runs of each algorithm. The horizontal coordinate represents the sampling point.

As the experimental results in Figure 8 show, AS-SOMA achieves the best performance on *F*_{1}, *F*_{2}, *F*_{5}, *F*_{6}, *F*_{8}, *F*_{10}, *F*_{11}, *F*_{13}, and *F*_{14} while being competitive on *F*_{4}, *F*_{7}, *F*_{9}, *F*_{12}, and *F*_{15}. In addition, CPSO has the best convergence speed on *F*_{3}, OSOMA has the best convergence speed on *F*_{4}, and SOMA has the best convergence speed on *F*_{7}. PSO and LSOMA do not achieve the best convergence speed on any benchmark function.

The experimental results in Figures 8(e) and 8(k) show that AS-SOMA converges faster than the comparison algorithms. Combined with the experimental data in Table 2, it can be seen that AS-SOMA performs excellently on *F*_{5} and *F*_{11} in terms of both solving accuracy and convergence speed. It can be concluded that AS-SOMA has good convergence in solving basic multimodal functions and synthetic functions.

###### 3.3.3. Search Behavior Analysis

To explain the convergence performance of AS-SOMA, this paper analyzes its search behavior by examining the population evolution of AS-SOMA on function *F*_{4}. Function *F*_{4} has a global minimum that is far from another local optimum, which makes it a typical deceptive problem, so it is hard to escape from the local optimum. Figure 9 shows the search process performed by AS-SOMA on *F*_{4} when the population size is 100 and the evolutionary generation *G* is 1, 3, 5, and 10, respectively. It is obvious that function *F*_{4} has a very complex landscape. Figure 9(b) shows the result after three generations of evolutionary search, starting from a population of 100 individuals initialized uniformly and randomly in the solution space. It can be observed that the population migrates toward the global optimum while maintaining diversity.

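Figure 9: Search process of AS-SOMA on *F*_{4} with a population size of 100 at generations (a) *G* = 1, (b) *G* = 3, (c) *G* = 5, and (d) *G* = 10.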

Figure 10 shows how the migration step of AS-SOMA individuals changes during the search on function *F*_{4}; the abscissa represents the evolution time. The step clearly decreases as the number of evaluations increases. In the early stages of the search, the population explores the search space tentatively; the archive holds little successful step information, so the population has little to learn from, and the larger step ensures the search speed of the algorithm. After a period of searching, the success information in the archive gradually increases, and the step begins to decrease steadily under the adaptive regulation mechanism. This reduces the probability of particles falling into local optima.

#### 4. Structural Optimization of LSTM by AS-SOMA

##### 4.1. STLF Data

This paper uses minute-level data from the UCI machine learning repository. The data set contains measurements of electricity consumption in various parts of a household from December 2006 to November 2010. The observed values include total active power (TAP) (kW), total reactive power (TRP) (kW), average voltage (AV) (V), average current (AC) (A), kitchen power (KP) (kWh), washing machine power (WMP) (kWh), and air conditioning system power (ASP) (kWh). The original total active power data are preprocessed by replacing null and outlier values and by accumulating the data. As shown in Table 3, the statistical unit is expanded from hours to days.

The target problem is the prediction of the daily electricity load over the next week. This paper focuses on the prediction of the TAP, which is the sum of the power consumption of the appliances [44–46]. The data set is partitioned at a ratio of 3 : 1, as shown in Figure 11(a): the first three years of data are used to train the model and the fourth year is used to test it. Figure 11(b) gives a descriptive statistical analysis of the TAP in terms of maximum, minimum, mean, median, range, standard deviation, skewness, and kurtosis.
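The preprocessing and partitioning steps can be sketched as follows. Summing the readings of each day is one plausible reading of the data accumulation step and is an assumption, as are the function names and the synthetic constant readings; the 3 : 1 chronological split follows the text.

```python
def daily_totals(minute_values, minutes_per_day=1440):
    """Sum consecutive minute readings into one value per complete day."""
    n_days = len(minute_values) // minutes_per_day
    return [sum(minute_values[d * minutes_per_day:(d + 1) * minutes_per_day])
            for d in range(n_days)]

def chronological_split(series, train_parts=3, test_parts=1):
    """Split a series in time order, keeping the earliest part for training."""
    cut = len(series) * train_parts // (train_parts + test_parts)
    return series[:cut], series[cut:]

minute_values = [1.0] * (1440 * 8)  # eight synthetic days of constant readings
days = daily_totals(minute_values)
train, test = chronological_split(days)
```

Splitting in time order rather than at random keeps the test year strictly after the training years, which matches how a forecasting model would be used in practice.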

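Figure 11: (a) Partition of the data set into training and test sets at a ratio of 3 : 1; (b) descriptive statistics of the TAP.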

##### 4.2. Model Evaluation

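The mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percent error (MAPE) are the metrics used to evaluate the predictive performance of the models. The MSE, RMSE, MAE, and MAPE are defined as follows:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2},$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}},$$

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$

$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|,$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, $n$ is the number of samples, and the value range of RMSE is $[0, +\infty)$.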
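The four metrics can be computed with a few short functions, shown here in their conventional form. The function names and the three-point example are illustrative only; the MAPE is expressed in percent and assumes that no true value is zero.

```python
import math

def mse(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percent error in percent; true values must be nonzero."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only, not data from the paper.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
scores = {
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

In this example the absolute errors are 10, 10, and 0, so the MAE is about 6.67 while the MAPE is 5%, which shows how the MAPE weights errors relative to the size of the true value.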

##### 4.3. Prediction Experiment Results and Analysis

To reduce random error, each algorithm is run independently ten times, and the maximum, median, minimum, mean, and standard deviation of its results are recorded. Wilcoxon's rank-sum test is used to compare the performance of the algorithms.

###### 4.3.1. Hidden Layer Optimization

Structure optimization of the LSTM is divided into two stages. The first stage optimizes the number of hidden layers, which has a small search space and takes only integer values. Determining the number of hidden layers is the premise of the rest of the structural optimization. The other hyperparameters, namely the number of hidden layer neurons, learning rate, training epochs, batch size, and time step, are set to 200, 0.001, 500, 16, and 14, respectively. The data of the previous fourteen days are used to predict the data of the next 7 days, and the Adam optimizer [47] is used to update the weights. The prediction results are compared to determine the optimal number of hidden layers.

Table 4 shows the prediction performance of LSTM with different numbers of layers on the STLF data. The network performs best with two hidden layers, and the difference from the one-layer and four-layer networks is significant. A model with too few hidden layers has insufficient capacity to express the problem. Conversely, when there are too many hidden layers, vanishing gradients between layers weaken network performance. In the prediction model optimization described below, the number of hidden layers of the LSTM is therefore fixed at two.

###### 4.3.2. Optimization of Other Parameters

The second stage optimizes the other parameters, including the number of neurons in the hidden layers, time step, learning rate, epochs, and batch size. To make the model converge more stably during training, this paper applies exponential decay to the learning rate hyperparameter.

Table 5 shows the search space of each parameter when AS-SOMA is used to optimize the network model. The mutation probability of GA among the comparison algorithms is 0.001. The parameter settings of the other algorithms are consistent with Table 1. As a control algorithm, AS-SOMA_1 implements the improvement in the initialization phase but not the one in the migration phase. The population size of all algorithms is 10, and each algorithm terminates after 500 fitness function evaluations.

For the final AS-SOMA-LSTM optimization model, the best network structure has 47 hidden nodes in the first layer and 86 hidden nodes in the second layer, with a time step of 11, a batch size of 24, a learning rate of 0.03414, and 338 epochs. The comparison between the AS-SOMA-LSTM model prediction results and the original data is shown in Figure 12.

Table 6 shows that AS-SOMA achieves the best results on each metric and outperforms the state-of-the-art methods by 0.44%, 0.37%, 0.117%, and 0.005% in terms of mean MSE, mean RMSE, mean MAE, and mean MAPE.

Table 7 reports RMSE statistics for the different algorithms on STLF with the LSTM optimization model. AS-SOMA achieves the best maximum, median, and average among the eight algorithms, and it is better than AS-SOMA_1 in both average and standard deviation. Wilcoxon’s rank-sum test shows that, relative to the other algorithms, the performance of these two algorithms does not differ significantly. This suggests that the two SOMA improvement schemes contribute comparably to the final performance and jointly improve AS-SOMA.
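Statistics of this kind can be gathered from repeated runs as sketched below. The helper `summarize_runs` is an illustrative name, and the use of the sample standard deviation is an assumption, since the section does not say which estimator was used.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(rmse_per_run: Sequence[float]) -> Dict[str, float]:
    """Summarize the final RMSE of independent optimization runs."""
    if len(rmse_per_run) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return {
        "min": min(rmse_per_run),
        "max": max(rmse_per_run),
        "median": statistics.median(rmse_per_run),
        "mean": statistics.fmean(rmse_per_run),
        "std": statistics.stdev(rmse_per_run),  # sample standard deviation (assumed)
    }


if __name__ == "__main__":
    # Placeholder values, not results from the study.
    print(summarize_runs([1.0, 2.0, 3.0, 4.0]))
```

A low minimum combined with a large standard deviation indicates an algorithm that occasionally finds a very good solution but is unreliable across runs.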

AS-SOMA ranks fourth out of seven algorithms on the minimum index. One reason is the randomness of the algorithms: each has a certain probability of reaching the optimal value of the problem. A second reason is that the adaptive step mechanism that AS-SOMA adds to SOMA improves the convergence ability and stability of the algorithm but also reduces its exploration ability. OSOMA, which improves SOMA with a reverse learning mechanism, ranks first on the minimum index. Reverse learning makes a particle migrate, with a certain probability, in the direction opposite to the best particle, which greatly enhances the exploration ability of the algorithm. However, the statistics show that its results are inconsistent across repeated experiments. Judging by the mean and standard deviation together in Table 7, AS-SOMA has a clear advantage in solving the STLF problem with LSTM.

Figure 13 shows the average convergence curves of the eight algorithms on the STLF problem with LSTM, in panels (a) to (d).

Figure 13(a) shows the poor performance of GA at the initial stage of the search. In one search, the initial fitness value of GA reaches 1893.852 because random initialization produced a particle that encodes a poor set of solutions. When such a particle is decoded into the LSTM, the model is not trained adequately, and the prediction performance of the network is poor.

Figure 13(b) is obtained by trimming the original convergence curve. AS-SOMA holds an advantage throughout the search. OSOMA, based on reverse learning, performs worst among the eight algorithms, which suggests that the reverse learning scheme is not suitable for LSTM structure optimization.

Figure 13(c) shows the eight algorithms in the early stage of the search. CPSO, A-MsPSO, and AS-SOMA, which use logistic chaotic mapping initialization, converge faster in this stage. In the initialization stage, the whole population is distributed evenly over the entire search space, which ensures the exploration ability and diversity of the population early on.
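A population initialization driven by the logistic map can be sketched as follows. This assumes the standard map `x_{k+1} = mu * x_k * (1 - x_k)` with `mu = 4` and one chaotic sequence per dimension; the control parameter, the seeding, and the function name are assumptions, not details confirmed by this section.

```python
import random
from typing import List, Optional, Sequence


def logistic_chaotic_population(pop_size: int, lower: Sequence[float],
                                upper: Sequence[float], mu: float = 4.0,
                                seed: Optional[int] = None) -> List[List[float]]:
    """Generate `pop_size` individuals by iterating a logistic map per dimension."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    rng = random.Random(seed)
    # Random starting point in (0, 1) for each dimension's chaotic sequence.
    x = [rng.uniform(0.01, 0.99) for _ in lower]
    population = []
    for _ in range(pop_size):
        # One logistic-map iteration per dimension; values stay within [0, 1] for mu = 4.
        x = [mu * xi * (1.0 - xi) for xi in x]
        # Scale the chaotic value into the search range of each dimension.
        population.append([lo + xi * (hi - lo) for xi, lo, hi in zip(x, lower, upper)])
    return population


if __name__ == "__main__":
    for individual in logistic_chaotic_population(5, [1, 1e-4], [128, 0.1], seed=1):
        print(individual)
```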

Figure 13(d) shows the eight algorithms in the late stage of the search, where AS-SOMA converges quickly. The reason is that the improved scheme based on an adaptive step size effectively solves the poor convergence caused by a fixed step size. In addition, dynamically adjusting the step size balances the exploration and exploitation of the algorithm.

In the early stage of the search, PSO is slightly ahead of SOMA; in the late stage, SOMA gradually accelerates and converges ahead of PSO. The reason is that the movement of each particle in PSO is guided by the best individual of the population, and this guidance pulls the particle closer to the best individual in every dimension through vector subtraction. In SOMA, the movement of each particle is likewise influenced by the leader of the population. However, not every dimension of a particle migrates, because a perturbation mechanism interferes in the migration process. Therefore, in LSTM structure optimization, the full-dimensional movement of PSO lets the population approach good network parameters quickly in the early stage. In the late stage, however, when the population is concentrated near the optimal solution, this multidimensional motion tends to make the population skip over the optimal parameters. SOMA can hold certain dimensions fixed while migrating others, so it approaches the optimal solution gradually by controlling the direction of movement.
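The dimension masking described above can be sketched with the classic SOMA migration of one individual toward the leader. This is the standard fixed-step form, not the adaptive step variant of AS-SOMA, and the defaults for `step`, `path_length`, and `prt` are commonly used values assumed here for illustration rather than the settings of Table 1.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def soma_migrate(individual: Sequence[float], leader: Sequence[float],
                 fitness: Callable[[Sequence[float]], float],
                 step: float = 0.11, path_length: float = 3.0, prt: float = 0.1,
                 rng: Optional[random.Random] = None) -> Tuple[List[float], float]:
    """Move one individual toward the leader in jumps; return the best point found."""
    rng = rng or random.Random()
    dim = len(individual)
    best_pos, best_fit = list(individual), fitness(individual)
    t = step
    while t <= path_length:
        # Perturbation vector: 1 lets a dimension migrate, 0 holds it fixed.
        mask = [1.0 if rng.random() < prt else 0.0 for _ in range(dim)]
        if not any(mask):
            mask[rng.randrange(dim)] = 1.0  # always move at least one dimension
        candidate = [x + (l - x) * t * m for x, l, m in zip(individual, leader, mask)]
        cand_fit = fitness(candidate)
        if cand_fit < best_fit:  # minimization
            best_pos, best_fit = candidate, cand_fit
        t += step
    return best_pos, best_fit


if __name__ == "__main__":
    sphere = lambda v: sum(x * x for x in v)
    print(soma_migrate([3.0, 4.0], [0.5, 0.5], sphere, rng=random.Random(0)))
```

Setting `prt = 1.0` removes the masking, so every dimension moves at once, which resembles the full-dimensional movement attributed to PSO in the text.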

Figure 14 verifies the convergence process of AS-SOMA through the variation of the LSTM hyperparameters. In the early stage of the search, the time step, epochs, and batch size of the LSTM prediction model fluctuate. The hyperparameters stabilize quickly after the 100th optimization and gradually approach their optimal values. Figure 14(b) shows that the learning rate remains relatively steady during the convergence of AS-SOMA.

Figure 15 shows the degree of correlation among time step, learning rate, epochs, and batch size, with absolute values of 0.37, 0.026, 0.066, and 0.029, respectively. The highest correlation, 0.37, is between time step and batch size. Therefore, the constraints among hyperparameters should be analysed during combinatorial optimization in order to train the network better.
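A correlation analysis of this kind can be reproduced from the hyperparameter values visited during the search. The sketch below assumes the Pearson coefficient, which this section does not confirm, and the names `pearson` and `abs_correlations` are illustrative.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def abs_correlations(history: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Absolute pairwise correlation between every two recorded hyperparameters."""
    return {(a, b): abs(pearson(history[a], history[b]))
            for a, b in combinations(history, 2)}


if __name__ == "__main__":
    # Toy search history for illustration only, not data from the study.
    history = {"time_step": [8, 10, 11, 12], "batch_size": [16, 24, 24, 32],
               "epochs": [300, 340, 320, 338]}
    print(abs_correlations(history))
```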

This paper constructs the AS-SOMA-LSTM optimization model to solve the STLF problem. The original SOMA is improved with a logistic chaotic mapping and an adaptive step size method, and the solution accuracy and convergence speed of the improved algorithm are verified. An LSTM optimization model based on AS-SOMA is then proposed and applied to the practical problem of STLF. First, the optimal number of hidden layers is determined by exhaustive search. Then, the remaining five hyperparameters (the number of neurons, time step, learning rate, epochs, and batch size) are optimized with AS-SOMA. Simulation experiments show that the AS-SOMA-LSTM model has significant advantages over the other methods.

#### Data Availability

The dataset used to support the findings of this study can be accessed upon request.

#### Consent

Not applicable.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Authors’ Contributions

Chang Wang, Zijian Cao, and Xiaofeng Rong conceptualized the study; Xiaofeng Rong and Zijian Cao developed the methodology; Chang Wang and Hanghang Zhou carried out the investigation; Hanghang Zhou, Chang Wang, and Linjuan Fan wrote the original draft of the manuscript; Xiaofeng Rong and Zijian Cao reviewed and edited the article and worked to polish the whole study; the authors acquired the funding. All authors have read and agreed to the published version of the manuscript.

#### Acknowledgments

This research was partially funded by the Shaanxi Natural Science Basic Research Project (2020JM-565) and the Science and Technology Plan Project Industry-University-Research Collaborative Innovation Plan of Weiyang District, Xi’an City, Shaanxi Province (K20200167).