Bio-Inspired Learning and Adaptation for Optimization and Control of Complex Systems
Research Article, Open Access
Research on Hierarchical and Distributed Control for Smart Generation Based on Virtual Wolf Pack Strategy
Abstract
Haze has become a serious problem in society, and one significant countermeasure is the large-scale introduction of renewable energy. Ensuring that the power system can accommodate the integration and consumption of this new energy has therefore become an important scientific issue. This study explores a smart generation control scheme, namely, hierarchical and distributed control based on a virtual wolf pack strategy. The proposed method is based on the multiagent system stochastic consensus game principle and incorporates a new win-lose judgment criterion and an eligibility trace. Simulations conducted on a modified IEEE two-area load frequency control power system model and on a model of the Hubei power grid in China demonstrate that the proposed method can achieve optimal collaborative control of the automatic generation control (AGC) units in a given regional power grid. Compared with several other smart methods, the proposed one improves closed-loop system performance and reduces carbon emissions, while also achieving faster convergence and stronger robustness.
1. Introduction
In recent years, thermal power generation has aggravated environmental pollution, especially air pollution. Consequently, more and more clean energy sources, such as wind and photovoltaics, are being integrated into the strongly coupled interconnected power grid [1]. However, this integration brings new problems, such as voltage limit violations, power fluctuations, and frequency instability [2–4], which also affect the safe operation of the power grid. Because energy sources are becoming more dispersed, the traditional centralized automatic generation control (AGC) cannot match the control performance of decentralized AGC. Research on decentralized AGC is therefore an inevitable trend for the future smart grid.
In recent years, many scholars have devoted themselves to optimal control strategies for decentralized AGC [5–13]. The authors of [6] put forward the concept of optimal AGC using the original dual transformation method, which is based on optimal control theory. They showed that the dynamic equation and the constructed AGC control strategy of the interconnected system could realize multiarea decentralized optimal AGC. However, the optimal AGC controller required feedback of all the state variables, which are difficult to obtain directly in an actual system. In [8], a new method based on model predictive control was proposed, focusing on a decentralized optimal AGC strategy for a cooperative synchronous power grid. However, the stability and robustness of the multivariable predictive control method, including its application in an actual AGC system, needed further study, and the method was computationally intensive and time-consuming. Yu et al. [11] demonstrated that optimal AGC can be achieved when the number of agents is small; the algorithm is therefore applicable only to systems with few agents, which limits its application. Similarly, the authors previously studied decentralized control based on multiagent (MA) systems, namely, decentralized correlated equilibrium Q(λ)-learning (DCEQ(λ)) [12]. It can handle the complex stochastic dynamic characteristics and the optimal coordinated control of AGC after distributed energy is connected. Nevertheless, as the number of agents increases, the time needed to search for the MA equilibrium solution grows geometrically, which limits the application of DCEQ(λ) in larger systems. Therefore, the decentralized win or learn fast policy hill-climbing(λ) (DWoLF-PHC(λ)) algorithm [13] based on MA was developed, which uses an average mixed strategy instead of an equilibrium strategy. As a result, the dynamic characteristics of the system are effectively improved, and dynamic optimization control of the total power is obtained. However, DWoLF-PHC(λ) still suffers from a multisolution problem, which results in system instability when the number of agents increases sharply.
The studies above are limited in that they address only the control strategy for the total power in AGC and do not consider the dynamic optimal allocation of that total power. In fact, the modern power grid has gradually developed into a hierarchical and distributed control (HDC) structure that integrates large-scale new energy, so a single control strategy can hardly meet the control performance standards (CPS). Therefore, a hierarchical and distributed control based on virtual wolf pack strategy (HDCVWPS) is proposed to attenuate the stochastic disturbances caused by the massive integration of new energy into the power grid. The proposed strategy is based on the multiagent system stochastic consensus game (MASSCG) and consists of two parts. The first part is an AGC optimal control method that combines a new win-lose judgment criterion, the policy hill-climbing (PHC) algorithm [14], and an eligibility trace [15]. The new win-lose judgment criterion is called policy dynamics-based WoLF (PDWoLF) [16], and the resulting control method, PDWoLF-PHC(λ), is based on multiagent system stochastic game (MASSG) theory. The second part is the collaborative consensus (CC) algorithm [17], which is based on multiagent system collaborative consensus game (MASCC) theory and is used to distribute the total power dynamically and optimally. In this way, AGC control and power distribution are combined, and intelligence is achieved from the system level down to the individual units. The significant difference between smart generation control (SGC) and AGC is that the original proportional-integral (PI) control in AGC is replaced by smart control in SGC.
The rest of the paper is organized as follows. Section 2 proposes the SGC framework based on the HDC structure. Section 3 describes the HDCVWPS. Section 4 presents the AGC design based on HDCVWPS. Section 5 covers the case study, and Section 6 concludes the paper.
2. SGC Framework Based on HDC Structure
Hierarchical reinforcement learning (HRL) [18] is a hierarchical control method that can effectively solve the “curse of dimensionality” in traditional reinforcement learning. Here, a new method, HDCVWPS, is put forward to dynamically obtain the optimal total power and its optimal dispatch. The term “virtual wolf pack” refers to the generator set group (GSG) of a certain control area. PDWoLF-PHC(λ), which has the win or learn fast (WoLF) attribute and is based on heterogeneous MASSG theory, is adopted to obtain the total power of each GSG. Meanwhile, the ramp time CC algorithm, based on homogeneous MASCC theory, is used to distribute the total power to each unit dynamically in order to achieve optimal coordinated control of each GSG. The “leader” of a virtual wolf pack is a new dispatcher responsible for communicating and cooperating with the leaders of the other GSGs and for sending instructions to each unit in its own GSG. Each GSG has only one leader. The SGC framework based on the HDC structure is shown in Figure 1, where $\Delta P_{\text{tie}}$ is the tie-line exchange power, $\Delta f$ is the frequency error of the interconnected power grid, $\Delta P_{\Sigma i}$ is the total power of GSGi, and $\Delta P_{iu}$ is the regulation power of the uth unit in GSGi.
3. HDCVWPS
The HDCVWPS is designed to coordinate and optimize the operation of GSGs in the SGC system with an HDC structure by integrating MASSG and MASCC.
3.1. MASSG Framework
Based on the MASSG framework, a PDWoLF-PHC(λ) algorithm is proposed for the game among GSGs to obtain the total power command of each GSG.
The WoLF principle can meet the convergence requirement by changing the learning rate without sacrificing rationality, namely, by learning quickly when losing and cautiously when winning [14]. However, in games larger than 2 × 2, the players cannot calculate the win-lose criterion accurately and must rely on estimation. Therefore, an improved version of WoLF, PDWoLF, whose judgment criterion can be computed accurately in games larger than 2 × 2, was explored in [16]. It can also converge to a Nash equilibrium in games with more than two actions.
It is shown in [14] that the PHC algorithm meets the rationality requirement. Therefore, PDWoLF-PHC can satisfy the convergence and rationality requirements at the same time. It also converges faster with a higher learning rate ratio [16]. PDWoLF-PHC(λ) is an extension of classical Q-learning [19]. It incorporates the multistep backtracking idea of SARSA(λ) [15] to search for the optimal action-value function dynamically through continuous trial and error. The parameter λ refers to the use of an eligibility trace, which addresses the temporal credit assignment problem of time-delayed reinforcement learning. The optimal value function and strategy are as follows:

$$V^{\pi^*}(s) = \max_{a \in A} Q(s, a), \qquad \pi^*(s) = \arg\max_{a \in A} Q(s, a), \tag{1}$$

where $A$ is the set of possible actions under state $s$.

The eligibility trace is updated by

$$e_{k+1}(s, a) = \begin{cases} \gamma\lambda e_k(s, a) + 1, & (s, a) = (s_k, a_k), \\ \gamma\lambda e_k(s, a), & \text{otherwise}, \end{cases} \tag{2}$$

where $e_k(s, a)$ denotes the eligibility trace at the kth step iteration under state $s$ and action $a$, $\gamma$ is the discount factor, and $\lambda$ is the trace-attenuation factor.

The Q function is iteratively updated according to

$$\begin{aligned} \delta_k &= R(s_k, s_{k+1}, a_k) + \gamma Q_k(s_{k+1}, a_g) - Q_k(s_k, a_k), \\ Q_{k+1}(s, a) &= Q_k(s, a) + \alpha \delta_k e_k(s, a), \end{aligned} \tag{3}$$

where $\alpha$ is the Q-learning rate, $R(s_k, s_{k+1}, a_k)$ is the reward function value from state $s_k$ to $s_{k+1}$ under the selected action $a_k$, $Q(s, a)$ is the value function when executing action $a$ in state $s$, which is stored in a look-up table, and $a_g$ is the greedy action. After sufficient trial-and-error iterations, the Q function converges to the optimal $Q^*$ matrix with probability one. Finally, an optimal control strategy, represented by the optimal Q function ($Q^*$ matrix), is obtained.
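As an illustration, the eligibility trace and Q-function updates described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in this study: it uses a plain look-up table, an accumulating trace, and a single temporal-difference error against the greedy action in the next state. The function name `q_lambda_step`, the toy state and action counts, and the default values of `alpha`, `gamma`, and `lam` are all hypothetical.

```python
def q_lambda_step(Q, E, s, a, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One simplified tabular Q(lambda) update; Q and E are modified in place."""
    # Temporal-difference error against the greedy action in the next state
    delta = r + gamma * max(Q[s_next]) - Q[s][a]
    # Decay every trace by gamma * lambda, then mark the visited pair
    for i in range(len(E)):
        for j in range(len(E[i])):
            E[i][j] *= gamma * lam
    E[s][a] += 1.0
    # Credit every state-action pair in proportion to its trace
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            Q[i][j] += alpha * delta * E[i][j]
    return delta


# Toy problem (assumed sizes): 3 states, 2 actions, all values start at zero
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
E = [[0.0] * n_actions for _ in range(n_states)]
d1 = q_lambda_step(Q, E, s=0, a=1, r=1.0, s_next=1)
d2 = q_lambda_step(Q, E, s=1, a=0, r=1.0, s_next=2)
```

After the second step, the pair visited first also receives credit for the second reward through its decayed trace, which is the multistep backtracking effect that λ controls.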
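The win-lose criterion of PDWoLF-PHC(λ) is determined by two parameters, $\Delta(s_k, a_k)$ and $\Delta^2(s_k, a_k)$, for a given agent $i$. The mixed strategy $U(s_k, a_k)$ of an agent is updated according to (4) for the state-action pair $(s_k, a_k)$:

$$U(s_k, a_k) \leftarrow U(s_k, a_k) + \Delta_{s_k a_k}, \tag{4}$$

where $\Delta_{s_k a_k}$ is the variable quantity of the updating strategy. The updating rule is described as follows:

$$\Delta_{s_k a_k} = \begin{cases} -\varphi_{s_k a_k}, & a_k \neq \arg\max_{a} Q(s_k, a), \\ \sum_{a_j \neq a_k} \varphi_{s_k a_j}, & \text{otherwise}, \end{cases} \tag{5}$$

$$\varphi_{s_k a_k} = \min\left(U(s_k, a_k), \frac{\varphi_i}{|A_i| - 1}\right). \tag{6}$$

In (6), $|A_i|$ is the number of possible actions. $\varphi_i$ is the variable learning rate, with $\varphi_{\text{lose}} > \varphi_{\text{win}}$. Also, $\varphi_{\text{lose}}/\varphi_{\text{win}}$ is defined as the variable learning rate ratio. $\varphi_i$ is updated by

$$\varphi_i = \begin{cases} \varphi_{\text{win}}, & \Delta(s_k, a_k)\,\Delta^2(s_k, a_k) < 0, \\ \varphi_{\text{lose}}, & \text{otherwise}, \end{cases} \tag{7}$$

where $\Delta^2(s_k, a_k)$ is the decision space slope value and $\Delta(s_k, a_k)$ is the decision change rate at the kth step iteration. Meanwhile, $\Delta^2(s_k, a_k)$ and $\Delta(s_k, a_k)$ are updated by

$$\Delta^2(s_k, a_k) \leftarrow \Delta_{s_k a_k} - \Delta(s_k, a_k), \qquad \Delta(s_k, a_k) \leftarrow \Delta_{s_k a_k}. \tag{8}$$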
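The mixed-strategy update with the PDWoLF win-lose test can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the win-lose decision is taken from the executed state-action pair only, one learning rate is then applied to every action of that state so that the probabilities stay normalized, and the values of `phi_win` and `phi_lose` are hypothetical. The names `pdwolf_phc_update`, `delta`, and `delta2` are introduced here for illustration.

```python
def pdwolf_phc_update(U, Q, delta, delta2, s, a, phi_win=0.01, phi_lose=0.06):
    """Update the mixed strategy U[s] in place and return the learning rate used."""
    n = len(U[s])
    # "Win" when the decision change rate and its slope have opposite signs
    winning = delta[s][a] * delta2[s][a] < 0
    phi_i = phi_win if winning else phi_lose
    # Step size per action, capped so that no probability becomes negative
    phi = [min(U[s][j], phi_i / (n - 1)) for j in range(n)]
    greedy = max(range(n), key=lambda j: Q[s][j])
    for j in range(n):
        if j == greedy:
            # The greedy action collects what the other actions give up
            step = sum(phi[m] for m in range(n) if m != j)
        else:
            step = -phi[j]
        # Track the slope and the change rate for the next win-lose test
        delta2[s][j] = step - delta[s][j]
        delta[s][j] = step
        U[s][j] += step
    return phi_i


# One state with two actions; action 0 currently has the larger Q value
U = [[0.5, 0.5]]
Q = [[1.0, 0.0]]
delta = [[0.0, 0.0]]
delta2 = [[0.0, 0.0]]
rate = pdwolf_phc_update(U, Q, delta, delta2, s=0, a=0)
```

In this toy call the product of the two criterion terms is zero, so the agent is treated as losing and the larger learning rate is applied; probability mass moves toward the greedy action while the strategy still sums to one.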
3.2. MASCC Framework
The MASCC framework is introduced into the HDCVWPS to dynamically allocate the total power command to each unit.
3.2.1. Graph Theory
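The topology of a MAS can be expressed as a directed graph $G = (V, E, A)$ with a set of nodes $V = \{v_1, v_2, \ldots, v_n\}$, a set of edges $E \subseteq V \times V$, and a weighted adjacency matrix $A = [a_{ij}] \in \mathbb{R}^{n \times n}$. Here, $v_i$ denotes the ith agent, an edge represents the relationship between two agents, and the constant $a_{ij} \geq 0$ is the weight factor between $v_i$ and $v_j$. If there is a path between any two vertices, the graph is called strongly connected. The Laplacian matrix $L = [l_{ij}] \in \mathbb{R}^{n \times n}$ of graph $G$ can be written as follows:

$$l_{ii} = \sum_{j = 1, j \neq i}^{n} a_{ij}, \qquad l_{ij} = -a_{ij}, \quad \forall i \neq j, \tag{9}$$

where the matrix $L$ reflects the topology of the MA network.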
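For concreteness, the construction of a graph Laplacian from a weighted adjacency matrix can be sketched in Python, using the standard definition in which each diagonal entry is the sum of the weights in its row and each off-diagonal entry is the negated weight. The function name `laplacian` and the three-agent chain topology are assumptions made for illustration.

```python
def laplacian(A):
    """Build the Laplacian of a graph from its weighted adjacency matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                L[i][j] = -A[i][j]   # off-diagonal: negated edge weight
                L[i][i] += A[i][j]   # diagonal: sum of the weights in row i
    return L


# Assumed topology: three agents connected in a chain 1 - 2 - 3
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L = laplacian(A)
```

Every row of the resulting matrix sums to zero, which is a quick sanity check on any Laplacian built this way.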
3.2.2. Collaborative Consensus
In a MAS, the process by which an agent interacts with its neighbors to reach agreement is usually called collaborative consensus (CC) [20]. Each of the $n$ autonomous agents in the MAS is regarded as a node in the directed graph $G$. The purpose of CC is for every agent to reach a consensus and to update its state in real time after communicating with neighboring agents. Because of the communication delay among agents, the first-order CC algorithm of a discrete system is chosen as follows:

$$x_i[k+1] = \sum_{j=1}^{n} d_{ij}[k]\, x_j[k], \tag{10}$$

where $x_i$ is the state of the ith agent, $k$ represents the discrete time series, and $d_{ij}[k]$ denotes the $(i, j)$ entry of the row stochastic matrix $D = [d_{ij}] \in \mathbb{R}^{n \times n}$ at discrete time $k$. $d_{ij}[k]$ is given by

$$d_{ij}[k] = \frac{|l_{ij}|}{\sum_{j=1}^{n} |l_{ij}|}, \quad i = 1, 2, \ldots, n. \tag{11}$$

Under continuous communication and constant gain $a_{ij}$, consensus can be achieved if and only if the directed graph is strongly connected.
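The first-order discrete consensus iteration can be sketched as follows. This is an illustrative example only: it assumes that the row stochastic matrix is obtained by normalizing the absolute values of each Laplacian row, and it uses a hypothetical three-agent chain topology with arbitrary initial states. The function names are introduced here for illustration.

```python
def row_stochastic(L):
    """Normalize the absolute values of each Laplacian row so the row sums to one."""
    D = []
    for row in L:
        total = sum(abs(v) for v in row)
        D.append([abs(v) / total for v in row])
    return D


def consensus_step(D, x):
    """Each agent replaces its state with a weighted average of its neighbours' states."""
    return [sum(d * xj for d, xj in zip(row, x)) for row in D]


# Laplacian of an assumed three-agent chain 1 - 2 - 3
L = [[1.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
D = row_stochastic(L)

x = [1.0, 2.0, 6.0]   # arbitrary initial states
for _ in range(200):
    x = consensus_step(D, x)
```

With this connected topology the three states approach a common value, which is a weighted average of the initial states determined by the network structure.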
3.2.3. Ramp Time Collaborative Consensus
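The ramp time is chosen as the consensus variable among all units in a GSG. A unit with a higher ramp rate is allocated a larger share of the disturbance. The ramp time of the uth unit in GSGi can be obtained as follows:

$$t_{iu} = \frac{\Delta P_{iu}}{\Delta P_{iu}^{\text{rate}}}, \tag{12}$$

where $\Delta P_{iu}$ is the regulation power of the uth unit in GSGi. $\Delta P_{iu}^{\text{rate}}$ is the ramp rate of the unit and is calculated as follows:

$$\Delta P_{iu}^{\text{rate}} = \begin{cases} UR_{iu}, & \Delta P_{\Sigma i} > 0, \\ DR_{iu}, & \Delta P_{\Sigma i} < 0, \end{cases} \tag{13}$$

where $UR_{iu}$ and $DR_{iu}$ are the upper and lower bounds of the ramp rate, respectively.

The ramp time of the uth unit in GSGi can be updated according to (10) as follows:

$$t_{iu}[k+1] = \sum_{v=1}^{m_i} d_{uv}[k]\, t_{iv}[k], \tag{14}$$

where $m_i$ is the total number of units in GSGi and $D = [d_{uv}]$ is the row stochastic matrix.

Then the ramp time of the GSGi leader can be updated as follows:

$$t_{iu}[k+1] = \sum_{v=1}^{m_i} d_{uv}[k]\, t_{iv}[k] + \mu_i\, \Delta P_{\text{error-}i}, \tag{15}$$

where $\mu_i$ represents the GSGi adjustment factor of the power error and $\Delta P_{\text{error-}i}$ denotes the power error between the GSGi total power and the total power of all units. It is obtained from

$$\Delta P_{\text{error-}i} = \Delta P_{\Sigma i} - \sum_{u=1}^{m_i} \Delta P_{iu}. \tag{16}$$

When the total power command $\Delta P_{\Sigma i} > 0$, the ramp time needs to be increased if $\Delta P_{\text{error-}i} > 0$ and reduced otherwise. Conversely, when $\Delta P_{\Sigma i} < 0$, the ramp time is adjusted in the opposite direction.

Because the ramp time CC algorithm is applied among the units, the power assigned to some units may exceed their limits; the smaller a unit's maximum ramp time, the sooner its power limit is reached. When the power limit is reached, the uth unit's power and ramp time are as follows:

$$\Delta P_{iu} = \begin{cases} \Delta P_{iu}^{\max}, & \Delta P_{iu} > \Delta P_{iu}^{\max}, \\ \Delta P_{iu}^{\min}, & \Delta P_{iu} < \Delta P_{iu}^{\min}, \end{cases} \qquad t_{iu} = \frac{\Delta P_{iu}}{\Delta P_{iu}^{\text{rate}}}, \tag{17}$$

where $\Delta P_{iu}^{\max}$ and $\Delta P_{iu}^{\min}$ are the maximum and minimum reserve capacity of the uth unit in GSGi, respectively. Furthermore, if the power of the uth unit exceeds its limit, the weight factor becomes

$$a_{uv} = 0, \quad v = 1, 2, \ldots, m_i, \tag{18}$$

where $A_i = [a_{uv}]$ is the weighted adjacency matrix of GSGi.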
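The ramp time consensus dispatch can be illustrated with the following Python sketch. It is a simplified example under stated assumptions and not the authors' implementation: all units are taken to be fully connected, only the leader corrects for the power error, capacity limits are not enforced, and the ramp rates, the total power command, the adjustment factor `mu`, and the iteration count are hypothetical values chosen so that the iteration converges.

```python
def ramp_time_dispatch(total_power, ramp_rates, D, mu=0.01, iterations=200, leader=0):
    """Return (unit powers, ramp times) after a leader-corrected consensus iteration."""
    t = [0.0] * len(ramp_rates)
    for _ in range(iterations):
        # Power error between the command and the sum of the unit powers
        error = total_power - sum(r * ti for r, ti in zip(ramp_rates, t))
        # Every unit averages the ramp times of its neighbours
        t = [sum(d * tj for d, tj in zip(row, t)) for row in D]
        # Only the leader adjusts its ramp time by the scaled power error
        t[leader] += mu * error
    powers = [r * ti for r, ti in zip(ramp_rates, t)]
    return powers, t


# Assumed fully connected three-unit group (row stochastic weights)
D = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
ramp_rates = [10.0, 20.0, 30.0]   # MW/min, hypothetical
powers, times = ramp_time_dispatch(120.0, ramp_rates, D)
```

Once the ramp times agree, each unit's share of the command is proportional to its ramp rate, so the fastest unit takes the largest share. An adjustment factor that is too large can make the iteration diverge, which is why a small value is used in this sketch.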
4. AGC Design Based on HDCVWPS
4.1. Reward Function Selection
The impact of the energy management system (EMS) on the environment is considered, and carbon emission (CE) is introduced as part of the reward function. Meanwhile, in load frequency control (LFC), each regional power grid controls the generator sets in its area according to its own area control error (ACE), with the main purpose of driving the ACE to zero in the steady state. Therefore, the weighted sum of CE and ACE is taken as the objective function in the reward function. The reward function in GSGi is defined in terms of the following quantities: the actual output power of the uth unit in GSGi at the kth iteration; the instantaneous value of ACE at the kth iteration; the weight factors of the controlled area's CE and ACE, where the weight factor is set to 0.5 here; the CE intensity coefficient of the uth unit in GSGi, in kg/kWh; and the upper and lower bounds of the uth unit's capacity in GSGi. The CE intensity coefficient depends on the type of generator set and on the regulation power of the uth unit of GSGi, in MW.
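A reward of this kind can be sketched in Python as a negative weighted sum of a CE term and an ACE term. The exact functional form below is an assumption made for illustration: the ACE is squared, the CE term is the sum of each unit's CE intensity coefficient multiplied by its output power, no unit scaling is applied, and the capacity bounds are not enforced. The function name `reward` and the sample values are hypothetical; only the equal weights of 0.5 follow the text.

```python
def reward(ace, unit_power, ce_intensity, w_ce=0.5, w_ace=0.5):
    """Negative weighted sum of carbon emission and squared ACE (assumed form)."""
    # Carbon emission term: intensity coefficient times output power, per unit
    ce = sum(d * p for d, p in zip(ce_intensity, unit_power))
    # Larger emissions or larger control error give a more negative reward
    return -(w_ce * ce + w_ace * ace ** 2)


# Hypothetical two-unit example
r = reward(ace=2.0, unit_power=[10.0, 20.0], ce_intensity=[0.5, 1.0])
```

Because both terms enter with a negative sign, an agent that maximizes this reward is pushed toward small control errors and toward units with low CE intensity.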
4.2. Parameter Setting
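The design of the control system requires a reasonable setting of six parameters, including the trace-attenuation factor, the discount factor, the Q-learning rate, the variable learning rate, and the power error adjustment factor, as discussed below.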
The trace-attenuation factor λ allocates credit among state-action pairs. It lies between 0 and 1 and determines the convergence rate and the non-Markov decision process (MDP) effects for large time-delay systems. Generally, λ can be interpreted as a time-scaling element in the backtracking. For Q-function errors, a small λ means that little credit is given to historical state-action pairs, while a large λ means that much credit is assigned. Trial and error shows that 0.7 < λ < 0.95 is acceptable, and a value in this range is selected here.
The discount factor γ lies between 0 and 1 and discounts future rewards in the Q-functions. A value close to 1 should be chosen, because the latest rewards are the most important in the thermal-dominated LFC process. Experiments demonstrate that 0.6 < γ < 0.95 is appropriate, and a value in this range is chosen here.
The Q-learning rate α is set between 0 and 1 and trades off the convergence rate of the Q-functions against algorithm stability. A larger α accelerates learning, while a smaller α enhances system stability. In the pre-learning process, the initial value of α is chosen to be 0.1 to obtain an overall search. After that, it is reduced linearly in order to gradually increase the stability of the system.
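A linearly decreasing learning rate of this kind can be sketched as follows. Only the initial value of 0.1 comes from the text; the floor `alpha_min`, the length of the schedule, and the function name `linear_alpha` are assumptions made for illustration.

```python
def linear_alpha(k, total_steps, alpha0=0.1, alpha_min=0.01):
    """Learning rate at step k, decreasing linearly from alpha0 to alpha_min."""
    # Fraction of the schedule completed, clipped to the interval [0, 1]
    frac = min(max(k / total_steps, 0.0), 1.0)
    return alpha0 - (alpha0 - alpha_min) * frac


# Hypothetical 100-step pre-learning schedule
schedule = [linear_alpha(k, 100) for k in range(0, 101, 25)]
```

The rate starts at its initial value to favor broad exploration and then shrinks steadily, so that later updates disturb the learned Q-functions less.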
The variable learning rate φ lies between 0 and 1 and derives an optimal policy by maximizing the action value. In particular, the algorithm degrades into Q-learning if φ equals 1, because the maximum-value action is then always executed in every iteration. For a fast convergence rate, the greedy strategy with a variable learning rate ratio is selected in a stochastic game. The values used here were chosen through trial and error to obtain stable control characteristics.
The value of the power error adjustment factor in GSGi depends on the total power of GSGi, in MW.
4.3. HDCVWPS Procedure
The overall HDCVWPS procedure is described in Algorithm 1.

5. Case Study
5.1. The Modified IEEE Two-Area LFC Power System Model
To test the control performance of the proposed strategy, a modified IEEE two-area LFC power system model [21] is selected as the simulation object; its framework is shown in Figure 2. The system parameters are taken from [22], and those of GSG1 and GSG2 are provided in Table 1.

The work cycle of the AGC is set to 4 s. Note that HDCVWPS must undergo sufficient pre-learning through offline trial and error before the final online implementation, including extensive exploration of the CPS state space to optimize the Q-functions and state-value functions [23]. Figure 3 presents the pre-learning of each area under a continuous 10 min sinusoidal disturbance. HDCVWPS converges to the optimal strategy in each GSG with a qualified 10 min average CPS1 and 10 min average ACE.
Figure 3: Pre-learning of each area under a continuous 10 min sinusoidal disturbance. (a) The average of 10 min CPS1. (b) The average of 10 min ACE. (c) The HDCVWPS controller output.
Furthermore, the 2-norm of the Q-function difference matrix, $\|Q_k(s, a) - Q_{k-1}(s, a)\|_2 \leq \varsigma$, is used as the criterion for terminating the pre-learning of an optimal strategy [24], where $\varsigma$ is a specified positive constant. Both the Q-value and the look-up table are saved automatically after the pre-learning, so that HDCVWPS can be applied to a real power system. The convergence of the Q-function differences in each GSG during the pre-learning is given in Figure 4, which shows that HDCVWPS accelerates the convergence rate by nearly 26.7%~40% over that of Q(λ).
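A termination test of this kind can be sketched in Python. As a labeled simplification, the sketch uses the Frobenius norm of the difference between successive Q tables in place of the matrix 2-norm; since the Frobenius norm is never smaller than the 2-norm, the check is conservative. The function name `q_converged` and the tolerance values are hypothetical.

```python
import math


def q_converged(Q_new, Q_old, tol):
    """True when the Frobenius norm of the Q-table difference is within tol."""
    squared = sum((a - b) ** 2
                  for row_new, row_old in zip(Q_new, Q_old)
                  for a, b in zip(row_new, row_old))
    return math.sqrt(squared) <= tol


# Two successive look-up tables that differ slightly in one entry
Q_prev = [[1.0, 2.0], [0.5, 0.0]]
Q_curr = [[1.0, 2.05], [0.5, 0.0]]
done = q_converged(Q_curr, Q_prev, tol=0.1)
```

In a pre-learning loop, the Q table from the previous iteration would be copied before each update and the loop stopped once this test returns true.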
Figure 4: Convergence of the Q-function differences during pre-learning. (a) The convergence result of Q(λ). (b) The convergence result of HDCVWPS.
To evaluate the robustness of each algorithm, the control performances of DWoLF-PHC(λ), Q(λ), and Q-learning are compared with that of HDCVWPS under a step load disturbance and a stochastic load disturbance in GSG1. The simulation results under the step load disturbance are shown in Figure 5. Figure 5(a) shows that the overshoots are around 6.3758%, 4.907%, 7.2614%, and 13.0435%, respectively. Figure 5(b) shows that the average values of ACE are 0.1261 MW, 1.0682 MW, 1.2216 MW, and 1.0438 MW, respectively, and Figure 5(c) shows that the minimum CPS1 values are 189.6487%, 186.7696%, 189.6426%, and 190.1703%, respectively. The simulation results under the stochastic load disturbance are shown in Figure 6. Figure 6(a) demonstrates that HDCVWPS has the strongest robustness. Figure 6(b) shows that the average values of ACE are 22.7175 MW, 45.1846 MW, 66.6484 MW, and 75.7486 MW, respectively, and Figure 6(c) shows that the minimum CPS1 values are 167.7471%, 159.4400%, 150.6757%, and 127.3168%, respectively. Therefore, HDCVWPS can provide better control performance for AGC units.
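Figure 5: Simulation results under a step load disturbance. (a) Controller output. (b) ACE. (c) CPS1.
Figure 6: Simulation results under a stochastic load disturbance. (a) Controller output. (b) ACE. (c) CPS1.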
Stochastic white noise is used as the load disturbance after the pre-learning process, and the control performance of each algorithm in each GSG is summarized in Figure 7. CE, the average frequency deviation, the average 1 min ACE, and CPS1 are averaged over 24 h. Figure 7 shows that, compared with the other methods, HDCVWPS reduces CE by 1.21%~1.51%, the average frequency deviation by 4.5647 × $10^{-4}$~7.5851 × $10^{-4}$ Hz, and the average 1 min ACE by 5.79%~44.22% and increases CPS1 by 0.0007%~0.02%.
5.2. Four-Area Model of Hubei Power Grid
The four-area model of the Hubei power grid is shown in Figure 8. As shown in Figure 9, an AC/DC hybrid Hubei power grid model consisting of 43 units in four GSGs is analyzed in this paper. Control performance is assessed using CPS, and the work cycle of the AGC is set to 4 s. The $L_{10}$ of the Hubei power grid model is 118 MW. The model represents the governor output and the turbine output of each unit, together with the time constant of the governor, the time constant of the turbine, and the equivalent function of the AC frequency response. The generation rate constraint (GRC) is taken into account in this study; GRC and all the other system parameters are given in Table 1.
The system includes coal-fired power plants, hydropower plants, and pumped storage power plants. The output of each plant depends on its own governor, and the AGC set point is obtained according to the optimal dispatch. The long-term AGC control performance based on MA is evaluated by a statistical experiment with a 30-day stochastic load disturbance. Four types of controllers are simulated, that is, Q-learning, Q(λ), DWoLF-PHC(λ), and HDCVWPS. The statistical results obtained under impulsive perturbations and under stochastic white noise load fluctuation are shown in Figures 10 and 11, respectively. Here, the reported frequency deviation and ACE are average values, and CPS1, CPS2, and CPS are the monthly compliance percentages. The same weights are chosen for HDCVWPS in each GSG, which yields more effective joint cooperation than the other policies. As a result, higher scalability and self-learning efficiency can be achieved.
Figure 10 shows that, compared with the other methods, HDCVWPS in GSG1 reduces CE by 11.48%~29.45%, the average frequency deviation by 0.237~0.0325 Hz, and the average ACE by 8.57%~90.37% and increases CPS1 by 1.03%~29.9%.
Figure 11 shows that, compared with the other methods, HDCVWPS in GSG1 reduces CE by 0.17%~20.24%, the average frequency deviation by 0.003~0.078 Hz, and the average ACE by 45%~94% and increases CPS1 by 0.03%~4%. Similar results can be obtained in the other GSGs.
The simulation results show that HDCVWPS has stronger adaptability and better control performance than the other three methods. In each GSG area, the win-lose criterion of a unit depends on the sign of the product of the decision change rate and the decision space slope value. Once an agent is judged to “lose” or “win”, the corresponding variable learning rate is selected, and the optimal Q function is obtained by updating the Q value dynamically. Meanwhile, the variable quantity is determined in the mixed strategy update. Finally, the optimal mixed strategy is obtained through continuous dynamic updating. The results also demonstrate that the proposed strategy can effectively reduce CE and improve the utilization rate of new energy.
6. Conclusion
Based on MASSCG theory, a novel HDCVWPS method with a new win-lose judgment criterion and an eligibility trace is proposed to dynamically obtain the optimal total power and its optimal dispatch. It can also attenuate the stochastic disturbances caused by the massive integration of new energy into the power grid.
Based on MASSG, a PDWoLF-PHC(λ) algorithm is proposed to overcome the limited generality of traditional MASSG methods, which usually require agents to have strict knowledge of the system. It also solves the problem that, in games larger than 2 × 2, agents cannot calculate the judgment criterion accurately and converge to the Nash equilibrium slowly. Based on MASCC theory, the ramp time CC algorithm is used to allocate the total power command to each unit dynamically.
The simulation results on the modified IEEE two-area LFC power system model and on the Hubei power grid model in China verify the effectiveness of the proposed strategy. Compared with the other three smart methods, the proposed one satisfies the CPS requirements and improves the performance of the closed-loop system. It also reduces CE and improves the utilization rate of new energy.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (51707102 and 61603212).
References
[1] O. Dobzhanskyi, P. Gottipati, E. Karaman, X. Luo, E. A. Mendrela, and A. M. Trzynadlowski, “Multilayer-winding versus switched-flux permanent-magnet AC machines for gearless applications in clean-energy systems,” IEEE Transactions on Industry Applications, vol. 48, no. 6, pp. 2296–2302, 2012.
[2] A. Milicua, G. Abad, and M. A. Rodriguez Vidal, “Online reference limitation method of shunt-connected converters to the grid to avoid exceeding voltage and current limits under unbalanced operation - part I: theory,” IEEE Transactions on Energy Conversion, vol. 30, no. 3, pp. 852–863, 2015.
[3] A. Noruzi, T. Banki, O. Abedinia, and N. Ghadimi, “A new method for probabilistic assessments in power systems, combining Monte Carlo and stochastic-algebraic methods,” Complexity, vol. 21, no. 2, 110 pages, 2015.
[4] Y. A.-R. I. Mohamed, “Suppression of low- and high-frequency instabilities and grid-induced disturbances in distributed generation inverters,” IEEE Transactions on Power Electronics, vol. 26, no. 12, pp. 3790–3803, 2011.
[5] M. Mohammadi, A. Danandeh, H. Nasir Aghdam, and N. Ojaroudi, “Wavelet neural network based on islanding detection via inverter-based DG,” Complexity, vol. 21, no. 2, 324 pages, 2015.
[6] O. Elgerd and C. Fosha, “Optimum megawatt-frequency control of multiarea electric energy systems,” IEEE Transactions on Power Apparatus and Systems, vol. PAS-89, no. 4, pp. 556–563, 1970.
[7] N. Ghadimi, “An adaptive neuro-fuzzy inference system for islanding detection in wind turbine as distributed generation,” Complexity, vol. 21, no. 1, 20 pages, 2015.
[8] N. Atic, D. Rerkpreedapong, A. Hasanovic, and A. Feliachi, “NERC compliant decentralized load frequency control design using model predictive control,” in 2003 IEEE Power Engineering Society General Meeting (IEEE Cat. No.03CH37491), vol. 559, Toronto, Ontario, Canada, July 2003.
[9] O. Abedinia, N. Amjady, and A. Ghasemi, “A new metaheuristic algorithm based on shark smell optimization,” Complexity, vol. 21, no. 5, 116 pages, 2016.
[10] A. Ahadi, N. Ghadimi, and D. Mirabbasi, “An analytical methodology for assessment of smart monitoring impact on future electric power distribution system reliability,” Complexity, vol. 21, no. 1, 113 pages, 2015.
[11] T. Yu, J. Liu, K. W. Chan, and J. J. Wang, “Distributed multistep Q(λ) learning for optimal power flow of large-scale power grids,” International Journal of Electrical Power & Energy Systems, vol. 42, no. 1, pp. 614–620, 2012.
[12] T. Yu, L. Xi, B. Yang, Z. Xu, and L. Jiang, “Multiagent stochastic dynamic game for smart generation control,” Journal of Energy Engineering, vol. 142, no. 1, article 04015012, 2016.
[13] L. Xi, T. Yu, B. Yang, and X. Zhang, “A novel multiagent decentralized win or learn fast policy hill-climbing with eligibility trace algorithm for smart generation control of interconnected complex power grids,” Energy Conversion and Management, vol. 103, pp. 82–93, 2015.
[14] M. Bowling and M. Veloso, “Multiagent learning using a variable learning rate,” Artificial Intelligence, vol. 136, no. 2, pp. 215–250, 2002.
[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[16] B. Banerjee and J. Peng, “Adaptive policy gradient in multiagent learning,” in Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems - AAMAS '03, pp. 686–692, New York, NY, USA, July 2003.
[17] X. Zhang, H. Xu, T. Yu, B. Yang, and M. Xu, “Robust collaborative consensus algorithm for decentralized economic dispatch with a practical communication network,” Electric Power Systems Research, vol. 140, pp. 597–610, 2016.
[18] M. Ghavamzadeh, S. Mahadevan, and R. Makar, “Hierarchical multi-agent reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 13, no. 2, pp. 197–229, 2006.
[19] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[20] S. K. Gupta, K. Kar, S. Mishra, and J. T. Wen, “Collaborative energy and thermal comfort management through distributed consensus algorithms,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 4, pp. 1285–1296, 2015.
[21] G. Ray, A. N. Prasad, and G. D. Prasad, “A new approach to the design of robust load-frequency controller for large scale power systems,” Electric Power Systems Research, vol. 51, no. 1, pp. 13–22, 1999.
[22] O. I. Elgerd and H. H. Happ, “Electric energy systems theory: an introduction,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-2, no. 2, pp. 296–297, 1972.
[23] D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control: reinforcement learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435, 2004.
[24] T. Yu, B. Zhou, K. W. Chan, L. Chen, and B. Yang, “Stochastic optimal relaxed automatic generation control in non-Markov environment based on multi-step Q(λ) learning,” IEEE Transactions on Power Systems, vol. 26, no. 3, pp. 1272–1282, 2011.
Copyright
Copyright © 2018 Lei Xi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.