Haze has become a serious problem in modern society, and one important remedy is the large-scale introduction of renewable energy. How to ensure that the power system can adapt to the integration and consumption of new energy has therefore become a scientific issue. This study explores a smart generation control method, hierarchical and distributed control based on a virtual wolf pack strategy. The proposed method is based on the multiagent system stochastic consensus game principle and incorporates a new win-lose judgment criterion and an eligibility trace. Simulations on a modified IEEE two-area load frequency control power system model and on the Hubei power grid model in China demonstrate that the proposed method can obtain the optimal collaborative control of AGC units in a given regional power grid. Compared with several other smart methods, the proposed method improves closed-loop system performance, reduces carbon emissions, converges faster, and is more robust.

1. Introduction

Thermal power generation has aggravated environmental pollution, especially air pollution. Therefore, more and more clean energy sources such as wind and photovoltaics are being integrated into the strongly coupled interconnected power grid [1]. However, this brings new problems, such as voltage limit violations, power fluctuations, and frequency instability [2–4], and the safe operation of the power grid is also affected. Because energy sources are becoming more dispersed, the traditional centralized automatic generation control (AGC) cannot match the control performance of decentralized AGC. Research on decentralized AGC is therefore an inevitable trend for the future smart grid.

In recent years, many scholars have devoted themselves to optimal control strategies for decentralized AGC [5–13]. The authors of [6] put forward the concept of optimal AGC by using the original dual transformation method, which is based on optimal control theory. They showed that the dynamic equation and the constructed AGC control strategy of the interconnected system could realize multiarea decentralized optimal AGC control. However, the optimal AGC controller needed feedback of all the state variables, which are difficult to obtain directly in an actual system. In [8], a new method was proposed based on model predictive control. It focused on a decentralized optimal AGC control strategy for a cooperative synchronous power grid. However, the stability and robustness of the multivariable predictive control method, including its application in an actual AGC system, needed further study, and the method was computationally intensive and time-consuming. Yu et al. [11] demonstrated that an optimal AGC can be achieved when the number of agents is small; the algorithm is therefore applicable only to systems with few agents, which limits its application. The authors also studied decentralized control in earlier work, namely, decentralized correlated equilibrium Q(λ)-learning (DCEQ(λ)) [12] based on multiagent (MA) systems. It can handle the complex stochastic dynamic characteristics and the optimal coordination control of AGC after the access of distributed energy. Nevertheless, as the number of agents increases, the time needed to search for the MA equilibrium solution grows geometrically, which limits the application of DCEQ(λ) in larger systems. Therefore, the decentralized win or learn fast policy hill-climbing(λ) (DWoLF-PHC(λ)) [13] based on MA was developed, which uses an average mixed strategy instead of an equilibrium strategy. Thus, the dynamic characteristics of the system are effectively improved, and the dynamic optimization control of the total power is also obtained. However, DWoLF-PHC(λ) still suffers from a multisolution problem, which results in system instability when the number of agents increases sharply.

The studies above are limited in that they focus only on the control strategy for the total power in AGC and do not address the dynamic optimal allocation of that total power. In fact, the modern power grid has gradually developed into a hierarchical and distributed control (HDC) structure that integrates large-scale new energy. For this reason, a single control strategy can hardly meet the requirements of the control performance standards (CPS). Therefore, a hierarchical and distributed control based on virtual wolf pack strategy (HDC-VWPS) is proposed to attenuate the stochastic disturbances caused by the massive integration of new energy into the power grid. The proposed strategy is based on the multiagent system stochastic consensus game (MAS-SCG) and is divided into two parts. The first part is an AGC optimal control method that combines a new win-lose judgment criterion, the policy hill-climbing (PHC) algorithm [14], and an eligibility trace [15]. The new win-lose judgment criterion is named policy dynamics-based WoLF (PDWoLF) [16], and the resulting control method, PDWoLF-PHC(λ), is based on multiagent system stochastic game (MAS-SG) theory. The second part is the collaborative consensus (CC) algorithm [17], which is based on multiagent system collaborative consensus game (MAS-CC) theory and is used to distribute the total power dynamically and optimally. Consequently, AGC control and power distribution are combined, and intelligence is obtained from the whole system down to each branch. The significant difference between smart generation control (SGC) and AGC is that the original proportional-integral (PI) control in AGC is replaced by smart control in SGC.

The rest of the paper is organized as follows. Section 2 proposes the SGC framework based on the HDC structure. Section 3 expounds the HDC-VWPS. Section 4 presents the AGC design based on HDC-VWPS. Section 5 covers the case study, and Section 6 concludes the paper.

2. SGC Framework Based on HDC Structure

Hierarchical reinforcement learning (HRL) [18] is a hierarchical control method that can effectively solve the “curse of dimensionality” problem of traditional reinforcement learning. A new method, namely, HDC-VWPS, is put forward to obtain the optimal total power and its optimal dispatch dynamically. The term “virtual wolf pack” refers to the generator set group (GSG) of a certain control area. PDWoLF-PHC(λ), which has the win or learn fast (WoLF) attribute and is based on heterogeneous MAS-SG theory, is adopted to obtain the total power of each GSG. Meanwhile, the ramp time CC algorithm based on homogeneous MAS-CC theory is used to distribute the total power to each unit dynamically in order to achieve the optimal coordination control of each GSG. The “leader” of a virtual wolf pack refers to a new dispatcher that is responsible for communicating and cooperating with the leaders of the other GSGs and for sending instructions to each unit in its own GSG. Each GSG has only one leader. The SGC framework based on the HDC structure is shown in Figure 1, which indicates the tie-line exchange power, the interconnected power grid frequency error, the total power of GSGi, and the regulation power of the uth unit in GSGi.


3. HDC-VWPS

An HDC-VWPS is designed to coordinate and optimize the operation of GSGs in the SGC system with HDC structure through the integration of MAS-SG and MAS-CC.

3.1. MAS-SG Framework

Based on the MAS-SG framework, a PDWoLF-PHC(λ) algorithm is applied to the game among GSGs to obtain the total power command of each GSG.

The WoLF principle can meet the convergence requirement by changing the learning rate without sacrificing rationality, namely, by learning quickly when losing and cautiously when winning [14]. However, in games larger than 2 × 2, the players cannot accurately calculate the win-lose criterion and can only rely on an estimate. Therefore, an improved WoLF version, PDWoLF, whose judgment criterion can be computed accurately in games larger than 2 × 2, was explored in [16]. It can also converge to a Nash equilibrium in games with more than two actions.

It is shown in [14] that the PHC algorithm meets the rationality requirement. Therefore, PDWoLF-PHC satisfies the convergence and rationality requirements at the same time, and it converges faster with a higher learning rate ratio [16]. PDWoLF-PHC(λ) is an extension of classical Q-learning [19]. It incorporates the multistep backtracking idea of SARSA(λ) [15] to search for the optimal action-value function dynamically through continuous trial and error. The parameter λ refers to the use of an eligibility trace, which solves the temporal credit assignment problem of time-delayed reinforcement learning. The optimal value function and strategy are as follows:

$$V^{*}(s) = \max_{a \in A} Q(s, a), \qquad \pi^{*}(s) = \arg\max_{a \in A} Q(s, a), \tag{1}$$

where $A$ is the set of possible actions under state $s$.

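The eligibility trace is updated by

$$e_{k+1}(s, a) = \begin{cases} \gamma \lambda e_k(s, a) + 1, & (s, a) = (s_k, a_k), \\ \gamma \lambda e_k(s, a), & \text{otherwise}, \end{cases} \tag{2}$$

where $e_k(s, a)$ denotes the eligibility trace at the kth step iteration under state $s$ and action $a$, $\gamma$ is the discount factor, and $\lambda$ is the trace-attenuation factor.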

The Q function is iteratively updated according to (3), which involves the Q-learning rate, the reward function value obtained when moving from one state to the next under the selected action, the value function Q(s, a) for executing action a in state s (stored in a look-up table), and the greedy action. After sufficient trial-and-error iterations, the state-value function converges to the optimal Q matrix with probability one. Finally, an optimal control strategy, represented by the optimal Q function (Q matrix), is obtained.
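The interplay of the eligibility trace and the Q update can be illustrated with a minimal tabular sketch. It is not the paper's exact update rule: it assumes an accumulating trace, a single temporal-difference error formed with the greedy action in the next state, and hypothetical names (`q_lambda_step`, `alpha`, `gamma`, `lam`) and default values chosen only for illustration.

```python
from collections import defaultdict


def q_lambda_step(Q, e, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, lam=0.9):
    """One tabular Q(lambda)-style update with accumulating eligibility traces.

    Q and e are look-up tables keyed by (state, action). All names and
    default values are illustrative assumptions, not taken from the paper.
    """
    # Greedy action in the next state.
    a_g = max(actions, key=lambda b: Q[(s_next, b)])
    # Temporal-difference error for the visited state-action pair.
    td = r + gamma * Q[(s_next, a_g)] - Q[(s, a)]
    # Decay every stored trace, then reinforce the visited pair.
    for key in list(e):
        e[key] *= gamma * lam
    e[(s, a)] += 1.0
    # Spread the error over all recently visited pairs.
    for key, trace in e.items():
        Q[key] += alpha * td * trace
    return td


Q = defaultdict(float)
e = defaultdict(float)
actions = [0, 1]
td_first = q_lambda_step(Q, e, s=0, a=1, r=1.0, s_next=1, actions=actions)
td_second = q_lambda_step(Q, e, s=1, a=0, r=0.0, s_next=0, actions=actions)
```

After the second step, the earlier pair (0, 1) also receives part of the new error through its decayed trace, which is the multistep backtracking effect that the trace-attenuation factor controls.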

The win-lose criterion of PDWoLF-PHC(λ) is determined for a given agent by two parameters, the decision space slope value and the decision change rate. The mixed strategy of an agent is updated according to (4) for the current state-action pair by adding the variable quantity of the updating strategy, whose updating rule is given in (5) and (6).

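In (6), the number of possible actions appears together with the variable learning rate δ, which takes a larger value when the agent is losing than when it is winning; the ratio of the losing rate to the winning rate is defined as the variable learning rate ratio. The variable learning rate δ is selected by (7) from the decision space slope value and the decision change rate at the kth step iteration, and these two quantities are in turn updated by (8).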
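The win-lose test and the hill-climbing step can be sketched as follows. The sketch follows the general PDWoLF idea from the policy-dynamics literature, not necessarily the paper's exact equations: it assumes an agent is "winning" for a state-action pair when the product of the stored slope and change-rate values is negative, and the names `pdwolf_phc_update`, `lr_win`, and `lr_lose` and their values are hypothetical.

```python
def pdwolf_phc_update(pi, Q, slope, change_rate, s, actions, lr_win=0.01, lr_lose=0.04):
    """One PDWoLF-PHC-style mixed-strategy update for state s (illustrative sketch).

    pi, Q, slope, and change_rate are dicts keyed by (state, action).
    """
    n = len(actions)
    greedy = max(actions, key=lambda a: Q[(s, a)])
    # Per-action step size: cautious when winning, fast when losing.
    steps = {}
    for a in actions:
        winning = slope[(s, a)] * change_rate[(s, a)] < 0
        lr = lr_win if winning else lr_lose
        # Never remove more probability than the action currently holds.
        steps[a] = min(pi[(s, a)], lr / (n - 1))
    for a in actions:
        if a == greedy:
            delta = sum(steps[b] for b in actions if b != greedy)
        else:
            delta = -steps[a]
        # Track the policy dynamics used by the next win-lose test.
        change_rate[(s, a)] = delta - slope[(s, a)]
        slope[(s, a)] = delta
        pi[(s, a)] += delta


actions = [0, 1, 2]
pi = {(0, a): 1.0 / 3.0 for a in actions}
Q = {(0, 0): 0.0, (0, 1): 0.5, (0, 2): 1.0}
slope = {(0, a): 0.0 for a in actions}
change_rate = {(0, a): 0.0 for a in actions}
pdwolf_phc_update(pi, Q, slope, change_rate, 0, actions)
```

Probability mass moves from the non-greedy actions to the greedy one, so the mixed strategy remains a valid distribution after each update.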

3.2. MAS-CC Framework

The MAS-CC framework is introduced into the HDC-VWPS to dynamically allocate the total power command to each unit.

3.2.1. Graph Theory

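The topology of a MAS can be expressed as a directed graph $G = (V, E, A)$ with a set of nodes $V = \{v_1, v_2, \ldots, v_n\}$, a set of edges $E \subseteq V \times V$, and a weighted adjacency matrix $A = [a_{ij}]$. Here, $v_i$ denotes the ith agent, an edge represents the relationship between agents, and the constant $a_{ij} \geq 0$ is the weight factor between $v_i$ and $v_j$. If any two vertices are connected, the graph is called a strongly connected graph. The Laplacian matrix $L = [l_{ij}]$ of graph $G$ can be written as follows:

$$l_{ii} = \sum_{j=1,\, j \neq i}^{n} a_{ij}, \qquad l_{ij} = -a_{ij}, \quad i \neq j, \tag{9}$$

where the matrix $L$ reflects the topology of the MA network.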
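As a small illustration of the graph description above, the following sketch builds a Laplacian matrix from a weighted adjacency matrix using the standard definition (row sums on the diagonal, negated weights elsewhere). The function name `laplacian` and the three-agent ring topology are assumptions made for the example.

```python
def laplacian(adj):
    """Build the graph Laplacian from a weighted adjacency matrix (list of lists)."""
    n = len(adj)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                lap[i][j] = -adj[i][j]   # off-diagonal: negated edge weight
                lap[i][i] += adj[i][j]   # diagonal: sum of the row's edge weights
    return lap


# Hypothetical directed ring of three agents: 1 -> 2 -> 3 -> 1.
ring = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
L_ring = laplacian(ring)
```

Each row of the resulting matrix sums to zero, which is a quick sanity check on any adjacency matrix supplied by hand.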

3.2.2. Collaborative Consensus

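In a MAS, the process by which each agent interacts with its adjacent agents to reach agreement is usually called collaborative consensus (CC) [20]. In a MAS consisting of $n$ autonomous agents, each agent is regarded as a node of a directed graph. The purpose of CC is to reach a consensus among the agents, with each agent updating its state in real time after communicating with its neighbors. Due to the communication delay among agents, the first-order CC algorithm of a discrete system is chosen as follows:

$$x_i[k+1] = \sum_{j=1}^{n} d_{ij}[k]\, x_j[k], \tag{10}$$

where $x_i$ is the state of the ith agent, $k$ represents the discrete time series, and $d_{ij}[k]$ denotes the $(i, j)$ entry of the row stochastic matrix $D$ at discrete time $k$. With $l_{ij}$ the $(i, j)$ entry of the Laplacian matrix of the graph, $d_{ij}[k]$ is given by

$$d_{ij}[k] = \frac{|l_{ij}|}{\sum_{j=1}^{n} |l_{ij}|}, \quad i = 1, 2, \ldots, n. \tag{11}$$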

The CC algorithm reaches consensus if and only if the directed graph is strongly connected, under the condition of continuous communication and a constant gain.
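A first-order discrete consensus iteration of this kind can be sketched in a few lines. The sketch assumes the row stochastic matrix is obtained by normalizing the absolute values of each Laplacian row, and it uses a hypothetical three-agent directed ring; the function names are illustrative.

```python
def row_stochastic(lap):
    """Normalize the absolute values of each Laplacian row so that the row sums to one.

    Assumes every node has at least one edge, so no row sum is zero.
    """
    result = []
    for row in lap:
        total = sum(abs(v) for v in row)
        result.append([abs(v) / total for v in row])
    return result


def consensus_step(D, x):
    """One synchronous update: each agent takes a weighted average of agent states."""
    return [sum(d * xj for d, xj in zip(row, x)) for row in D]


# Laplacian of a hypothetical strongly connected directed ring of three agents.
lap = [[1, -1, 0], [0, 1, -1], [-1, 0, 1]]
D = row_stochastic(lap)
x = [1.0, 2.0, 6.0]
for _ in range(100):
    x = consensus_step(D, x)
```

For this particular ring the matrix is also column stochastic, so the agents agree on the average of their initial states; for a general strongly connected graph they still agree, but on a weighted average.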

3.2.3. Ramp Time Collaborative Consensus

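The ramp time is chosen as the consensus variable among all units in a GSG, so that a unit with a higher ramp rate is assigned a larger share of the disturbance. The ramp time of the uth unit in GSGi is obtained as

$$t_{iu} = \frac{\Delta P_{iu}}{\Delta P_{iu}^{\mathrm{rate}}}, \tag{12}$$

where $\Delta P_{iu}$ is the regulation power of the uth unit in GSGi and $\Delta P_{iu}^{\mathrm{rate}}$ is the ramp rate of the unit, calculated as

$$\Delta P_{iu}^{\mathrm{rate}} = \begin{cases} \mathrm{UR}_{iu}, & \Delta P_{iu} > 0, \\ \mathrm{DR}_{iu}, & \Delta P_{iu} < 0, \end{cases} \tag{13}$$

where $\mathrm{UR}_{iu}$ and $\mathrm{DR}_{iu}$ are the upper and lower bounds of the ramp rate, respectively.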
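The ramp time calculation can be sketched as below. The sketch assumes that the ramp time is the regulation power divided by the applicable ramp rate, that the upper bound applies when power is raised and the lower bound when it is lowered, and that the lower bound is given as a signed (negative) rate; the function name and the numbers are hypothetical.

```python
def ramp_time(regulation_power, up_rate, down_rate):
    """Time a unit needs to deliver its regulation power at its ramp-rate bound.

    up_rate is assumed positive and down_rate negative (signed bounds),
    so the returned ramp time is non-negative in both directions.
    """
    if regulation_power == 0:
        return 0.0
    rate = up_rate if regulation_power > 0 else down_rate
    return regulation_power / rate


# Hypothetical unit: +10 MW/min up, -15 MW/min down.
t_up = ramp_time(30.0, 10.0, -15.0)
t_down = ramp_time(-30.0, 10.0, -15.0)
```

With equal ramp times across units, a unit with a higher ramp rate delivers proportionally more regulation power, which is why the ramp time is a convenient consensus variable.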

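The ramp time of the uth unit in GSGi can be updated according to (10) as follows:

$$t_{iu}[k+1] = \sum_{v=1}^{n_i} d_{uv}[k]\, t_{iv}[k], \tag{14}$$

where $t_{iv}$ is the ramp time of the vth unit in GSGi, $n_i$ is the total number of units in GSGi, and $D = [d_{uv}]$ is the row stochastic matrix.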

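Then the ramp time of the GSGi leader can be updated as follows:

$$t_{iu}[k+1] = \sum_{v=1}^{n_i} d_{uv}[k]\, t_{iv}[k] + \mu_i\, \Delta P_{\mathrm{error}\text{-}i}, \tag{15}$$

where $t_{iv}$ is the ramp time of the vth unit in GSGi, $d_{uv}$ is an entry of the row stochastic matrix, $n_i$ is the number of units in GSGi, and $\mu_i$ represents the GSGi's adjustment factor of the power error. $\Delta P_{\mathrm{error}\text{-}i}$ denotes the power error between the GSGi total power $\Delta P_{\Sigma i}$ and the total power of all units. It is obtained from

$$\Delta P_{\mathrm{error}\text{-}i} = \Delta P_{\Sigma i} - \sum_{u=1}^{n_i} \Delta P_{iu}, \tag{16}$$

where $\Delta P_{iu}$ is the regulation power of the uth unit in GSGi.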

When the total power command is positive, the ramp time needs to be increased if the power error is positive and reduced otherwise. When the total power command is negative, the adjustment is made in the opposite direction.
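The leader-follower ramp time consensus can be sketched as a synchronous iteration. This is an illustration under several assumptions rather than the paper's implementation: each unit's power is taken as its ramp rate times its ramp time, only the leader adds a correction proportional to the power error, power limits are ignored, and the function name, the row stochastic matrix `D`, the ramp rates, and the adjustment factor `mu` are hypothetical values picked so that the iteration converges.

```python
def ramp_time_consensus(total_cmd, rates, D, mu, leader=0, iters=200):
    """Distribute total_cmd among units by driving their ramp times to consensus.

    rates: ramp rate of each unit; D: row stochastic matrix; mu: leader's
    power-error adjustment factor. All values are illustrative assumptions.
    """
    n = len(rates)
    t = [0.0] * n
    for _ in range(iters):
        powers = [r * ti for r, ti in zip(rates, t)]
        error = total_cmd - sum(powers)           # power still unallocated
        new_t = [sum(D[u][v] * t[v] for v in range(n)) for u in range(n)]
        new_t[leader] += mu * error               # only the leader sees the error
        t = new_t
    return t, [r * ti for r, ti in zip(rates, t)]


D = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
rates = [10.0, 20.0, 30.0]                        # hypothetical ramp rates, MW/min
times, powers = ramp_time_consensus(120.0, rates, D, mu=0.01)
```

In this example all units settle on the same ramp time of 2 min, so the 120 MW command is shared as 20, 40, and 60 MW in proportion to the ramp rates. Too large an adjustment factor can make the iteration oscillate or diverge, which is one reason its value has to be tuned.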

Because the ramp time CC algorithm is applied among the units, the power of some units may exceed their maximum power; the smaller a unit's maximum ramp time, the sooner its power limit is reached. When the power limit is reached, the power and ramp time of the uth unit are set according to (17), which involves the maximum and minimum reserve capacity of the uth unit in GSGi. Furthermore, if the power of the uth unit exceeds its limit, the corresponding weight factors in the weighted adjacency matrix of GSGi are modified according to (18).

4. AGC Design Based on HDC-VWPS

4.1. Reward Function Selection

The impact of the energy management system (EMS) on the environment is considered by introducing carbon emission (CE) into the reward function. Meanwhile, in load frequency control (LFC), each regional power grid controls the generator sets in its area according to its own area control error (ACE), with the main purpose of driving the ACE to zero at steady state. Therefore, the weighted sum of CE and ACE is taken as the objective function in the reward function. The reward function in GSGi is defined in (19) in terms of the actual output power of the uth unit in GSGi at the kth iteration, the instantaneous value of ACE at the kth iteration, and the weight factors of the controlled area's CE and ACE; here, the weight factor equals 0.5. The CE intensity coefficient of the uth unit in GSGi is expressed in kg/kWh, and the capacity of the uth unit in GSGi has upper and lower bounds. The CE intensity coefficients for each type of generator set are given in (20), which involves the regulation power of the uth unit of GSGi in MW.
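A reward of this kind can be sketched as the negative of a weighted sum of an ACE term and a CE term. The exact form used in the paper is not reproduced here: the squared ACE, the complementary weights, the absence of unit scaling, and the names `reward`, `w_ace`, and `ce_coeffs` are all assumptions made for illustration.

```python
def reward(ace, unit_powers, ce_coeffs, w_ace=0.5):
    """Illustrative reward: penalize both the ACE and the carbon emission.

    ace: instantaneous area control error; unit_powers: output of each unit;
    ce_coeffs: CE intensity coefficient of each unit. The squared ACE and
    the weighting scheme are assumptions, not the paper's exact formula.
    """
    w_ce = 1.0 - w_ace
    emission = sum(c * p for c, p in zip(ce_coeffs, unit_powers))
    return -(w_ace * ace ** 2 + w_ce * emission)


# Hypothetical two-unit group: one emitting unit and one emission-free unit.
r_large_ace = reward(2.0, [100.0, 50.0], [0.9, 0.0])
r_small_ace = reward(0.5, [100.0, 50.0], [0.9, 0.0])
```

Because both terms enter with a negative sign, an agent maximizing this reward is pushed toward a small ACE and toward dispatching units with low CE intensity.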

4.2. Parameter Setting

A reasonable setting of six parameters is required in the design of the control system.

The trace-attenuation factor λ allocates the credits among state-action pairs. The parameter usually lies between 0 and 1. It determines the convergence rate and the non-Markov decision process (MDP) effects for large time-delay systems. Generally, the factor λ can be interpreted as a time scaling element in the backtracking. For Q-function errors, a small λ means that little credit is given to the historical state-action pairs, while a large λ means that much credit is assigned. Through trial and error, 0.7 < λ < 0.95 is found to be acceptable. Here, is selected.

The discount factor γ lies between 0 and 1 and discounts the future rewards in the Q-functions. A value close to 1 should be chosen because the latest rewards are the most important in the thermal-dominated LFC process. Experiments demonstrate that 0.6 < γ < 0.95 is proper. Here, is chosen.

The Q-learning rate is set between 0 and 1 and trades off the convergence rate of the Q-functions against algorithm stability. A larger learning rate accelerates learning, while a smaller one enhances system stability. In the prelearning process, the initial value of the learning rate is chosen to be 0.1 to obtain an overall search. After that, it is reduced linearly in order to gradually increase the stability of the system.
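A linearly decreasing learning rate schedule can be sketched as follows. Only the initial value of 0.1 comes from the text; the final value, the number of steps over which the rate is reduced, and the function name `linear_learning_rate` are assumptions.

```python
def linear_learning_rate(step, total_steps, initial=0.1, final=0.0):
    """Learning rate that falls linearly from initial to final over total_steps."""
    if step >= total_steps:
        return final
    fraction = step / total_steps
    return initial + (final - initial) * fraction


rate_start = linear_learning_rate(0, 100)
rate_mid = linear_learning_rate(50, 100)
rate_end = linear_learning_rate(150, 100)
```

A nonzero final value could be used instead if some adaptation should remain possible after prelearning.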

The variable learning rate δ lies between 0 and 1 and is used to derive an optimal policy by maximizing the action value. In particular, the algorithm degrades into Q-learning if δ equals 1, because the maximum-value action is then always executed in every iteration. For a fast convergence rate, the greedy strategy with a variable learning rate ratio is selected in a stochastic game. Through trial and error, it shows that can obtain stable control characteristics.

The value of the power error adjustment factor in GSGi is related to the total power of GSGi in MW, as shown in (21).

4.3. HDC-VWPS Procedure

The overall HDC-VWPS procedure is described in Algorithm 1.

Initialize and for all ;
Set parameters and decision time;
Give the initial state ;
(1) Choose an exploration action based on the mixed strategy set ;
(2) Execute the exploration action to AGC units and run LFC system for the next sec;
(3) Observe a new state via CPS1 and ACE;
(4) Obtain a short-term reward using Eq. (19);
(5) Update eligibility trace according to Eq. (2);
(6) Update Q function using Eq. (3);
(7) Select variable learning rate δ with Eq. (7);
(8) Compute by Eq. (5) and Eq. (6);
(9) Calculate and according to Eq. (8);
(10) Update the mixed strategy according to Eq. (4);
(11) Obtain the total power of the GSGi;
(12) Determine the ramp rate according to Eq. (13);
(13) Execute CC algorithm according to Eq. (14) and Eq. (15);
(14) Calculate the uth unit power in GSGi;
(15) If the power limit is not exceeded, then execute step 17;
(16) Calculate and according to Eq. (17). And update using Eq. (9), Eq. (11) and Eq. (18);
(17) Calculate the power error according to Eq. (16);
(18) If is not satisfied, execute step 13;
(19) Output the uth unit power ;
(20) Set , and return to step 1.

5. Case Study

5.1. The Modified Model with Two-Area LFC Power System in IEEE

To test the control performance of the proposed strategy, a modified IEEE two-area LFC power system model [21] is selected as the simulation object; its framework is shown in Figure 2. The system parameters are taken from [22], and those of GSG1 and GSG2 are provided in Table 1.

The work cycle of the AGC is set to 4 s. Note that HDC-VWPS has to undergo sufficient prelearning through off-line trial and error before the final online implementation. This includes extensive exploration of the CPS state space for the optimization of the Q-functions and state-value functions [23]. Figure 3 presents the prelearning of each area under a continuous 10 min sinusoidal disturbance. The HDC-VWPS converges to the optimal strategy in each GSG with a qualified 10 min average CPS1 and 10 min average ACE.

Furthermore, the 2-norm of the Q-function difference matrix is used as the criterion for terminating the prelearning of an optimal strategy [24], with a specified positive constant as the threshold. Both the Q values and the look-up table are automatically saved after the prelearning, so that HDC-VWPS can be applied to a real power system. The convergence of the Q-function differences obtained in each GSG during the prelearning is given in Figure 4, which shows that HDC-VWPS accelerates the convergence rate by nearly 26.7%~40% over that of Q(λ).

To evaluate the robustness of each algorithm, the control performances of DWoLF-PHC(λ), Q(λ), and Q-learning are compared with that of HDC-VWPS under a step and a stochastic load disturbance in GSG1. The simulation results under a step load disturbance are shown in Figure 5. Figure 5(a) shows that the overshoots are around 6.3758%, 4.907%, 7.2614%, and 13.0435%, respectively. Figure 5(b) shows that the average values of ACE are 0.1261 MW, 1.0682 MW, 1.2216 MW, and 1.0438 MW, respectively. Figure 5(c) shows that the minimum CPS1 is 189.6487%, 186.7696%, 189.6426%, and 190.1703%, respectively. The simulation results under a stochastic load disturbance are given in Figure 6. Figure 6(a) demonstrates that HDC-VWPS has the strongest robustness. Figure 6(b) shows that the average values of ACE are 22.7175 MW, 45.1846 MW, 66.6484 MW, and 75.7486 MW, respectively. Figure 6(c) shows that the minimum CPS1 is 167.7471%, 159.4400%, 150.6757%, and 127.3168%, respectively. Therefore, HDC-VWPS can provide better control performance for AGC units.

Stochastic white noise is used as the load disturbance after the prelearning process, and the control performance of each algorithm obtained in each GSG is summarized in Figure 7. CE, the average frequency deviation, the average 1 min ACE, and CPS1 are averaged over 24 h. Figure 7 shows that, compared with the other methods, HDC-VWPS reduces CE by 1.21%~1.51%, the average frequency deviation by $4.5647 \times 10^{-4}$~$7.5851 \times 10^{-4}$ Hz, and the average ACE by 5.79%~44.22%, and increases CPS1 by 0.0007%~0.02%.

5.2. Four-Area Model of Hubei Power Grid

Four-area model of Hubei power grid is shown in Figure 8. As shown in Figure 9, an AC/DC hybrid Hubei power grid model, which consists of totally 43 units of four GSGs, is analyzed in the paper. The control performance is CPS, and the work cycle of AGC is set to be 4 s. The of Hubei power grid model is 118 MW. is the governor output, and is the turbine output. At the same time, is the time constant of the governor, is the time constant of the turbine, and is the equivalent function of AC frequency response, respectively. Related parameters are set as follows. , , , , , , and . Generation rate constraint (GRC) is the in this study. GRC and all the other system parameters are given in Table 1.

The system includes coal-fired power plants, hydropower plants, and pumped storage power plants. The output of each plant depends on its own governor, and the set point of AGC is obtained according to the optimal dispatch. The long-term AGC control performance based on MA is evaluated by a statistical experiment with a 30-day stochastic load disturbance. Four types of controllers are simulated, that is, Q-learning, Q(λ), DWoLF-PHC(λ), and HDC-VWPS. The statistical results obtained under impulsive perturbations and under stochastic white noise load fluctuation are shown in Figures 10 and 11, respectively. In these figures, the frequency deviation and ACE are reported as average values, and CPS1, CPS2, and CPS are the monthly compliance percentages. The same weight of HDC-VWPS is chosen in each GSG, which yields more effective joint cooperation than the other policies. As a result, higher scalability and self-learning efficiency can be achieved.

Figure 10 shows that the HDC-VWPS in GSG1, compared with the other methods, reduces CE by 11.48%~29.45%, the average frequency deviation by 0.237~0.0325 Hz, and ACE by 8.57%~90.37%, and increases CPS1 by 1.03%~29.9%.

Figure 11 shows that the HDC-VWPS in GSG1, compared with the other methods, reduces CE by 0.17%~20.24%, the average frequency deviation by 0.003~0.078 Hz, and ACE by 45%~94%, and increases CPS1 by 0.03%~4%. Similar results can be obtained in the other GSGs.

The simulation results show that HDC-VWPS has stronger adaptability and better control performance than the other three methods. In each GSG area, the win-lose criterion of a unit depends on the sign of the product of the decision space slope value and the decision change rate. Once an agent is judged to be "losing" or "winning", the corresponding variable learning rate is selected, and the optimal Q function is obtained by updating the Q value dynamically. Meanwhile, the variable quantity is determined in the mixed strategy update. Finally, the optimal mixed strategy is obtained through continuous dynamic updating. The results also demonstrate that the proposed strategy can effectively reduce the CE and improve the utilization rate of new energy.

6. Conclusion

Based on the MAS-SCG theory, a novel HDC-VWPS method with new win-lose judgment criterion and eligibility trace is proposed to dynamically obtain the optimal total power and its optimal dispatch. Also, it can attenuate the stochastic disturbance caused by massive integration of new energy to the power grid.

Based on MAS-SG, a PDWoLF-PHC(λ) algorithm is proposed to solve the universality problem that agents under the traditional MAS-SG system usually require a strict knowledge system. It also solves the problem that agents cannot accurately calculate the judgment criterion and converge slowly to the Nash equilibrium in games larger than 2 × 2. Based on MAS-CC theory, the ramp time CC algorithm is used to allocate the total power command to each unit dynamically.

The simulation results on the modified IEEE two-area LFC power system model and the Hubei power grid model in China verify the effectiveness of the proposed strategy. Compared with the other three smart methods, the proposed one can satisfy the CPS requirements and improve the performance of the closed-loop system. It can also reduce the CE and maximize the utilization rate of energy.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (51707102 and 61603212).