Abstract

In a spot wholesale electricity market with strategic bidding interactions among wind power producers and other participants such as fossil generation companies and distribution companies, the randomly fluctuating nature of wind power hinders not only the modeling and simulation of the dynamic bidding process and market equilibrium but also the ability of the independent system operator to maintain economy and reliability in market clearing (economic dispatch). The gradient descent continuous actor-critic algorithm has been demonstrated to be an effective method for Markov decision-making problems with continuous state and action spaces, and the robust economic dispatch model can optimize the permitted real-time wind power deviation intervals based on wind power producers' bidding power output. Therefore, in this paper, considering bidding interactions among wind power producers and other participants, we propose an hour-ahead electricity market modeling approach based on the gradient descent continuous actor-critic algorithm, with the robust economic dispatch model embedded. Simulations on the IEEE 30-bus test system verify, to some extent, the market operation economy and the robustness against wind power fluctuations achieved by the proposed modeling approach.

1. Introduction

Wind power is one of the fastest growing renewable power resources [1]. In a spot electricity market (EM) with wind power penetration, the fluctuating and random nature of this intermittent resource hinders the integration of wind power into the EM and the operation of power systems. Moreover, the strategic interactions among wind power producers (WPPs) and other market participants, such as fossil generation companies (GenCOs) and distribution companies (DisCOs), have increased the complexity of EM modeling, which is a necessary tool for market analysis, design, bidding decision-making, and market modification [2].

The objective of every participant bidding in the EM is to maximize its own profit. Wind power and some other renewable power resources often participate in the spot EM as "price takers" because of their low marginal costs. Therefore, the only bidding parameter a WPP needs to determine is its production level [3]. On the one hand, the limited predictability of wind power means that WPPs often fail to meet the production levels they bid, which increases the probability of system imbalances [4]. Regulators in many countries have designed various penalty mechanisms to financially punish WPPs for the deviations of their real-time productions from their bidding ones. Hence, if the marginal cost of wind power is neglected [5], maximizing a WPP's profit means simultaneously minimizing the deviation cost and maximizing the bidding revenue. On the other hand, the fluctuating and random nature of wind power forces other EM participants to bid in this stochastically fluctuating EM environment in order to maximize their own profits, which in turn affects the bidding revenues of WPPs, mainly through the locational marginal prices (LMPs) cleared by the independent system operator (ISO). Therefore, in this more complicated situation, developing fast and reliable market modeling approaches that capture the bidding interactions among all kinds of participants has become considerably more important than before. One aim of this paper is to apply a new reinforcement learning algorithm based on the gradient descent continuous actor-critic (GDCAC) algorithm to double-sided hour-ahead EM modeling containing strategic bidding interactions among WPPs and other market participants such as GenCOs and DisCOs.

Generally speaking, the literature relevant to our research can be divided into two categories: optimal wind power (or other renewable power) bidding in an EM with wind penetration, and EM modeling with or without wind and other renewable power penetration. Regarding optimal wind power bidding, methods for finding the optimal bidding strategy of a WPP have been introduced by many researchers. Vilim and Botterud [3] proposed two stochastic bidding models based on kernel density estimation (KDE) for a WPP to obtain the optimal day-ahead bidding strategy. Ravnaas et al. [6] proposed a seasonal autoregressive integrated moving average (SARIMA) algorithm for the same purpose. Sharma et al. [5] studied the behaviors of strategic WPPs in markets dominated by wind generators using the Cournot game model. In [7], Matevosyan et al. proposed an imbalance cost minimization bidding strategy for a WPP by forecasting the wind power probability distribution functions. Li and Shi [8] proposed a stochastic bidding model for a WPP based on the Roth–Erev reinforcement learning algorithm. Laia et al. [9] considered the uncertainty of the electricity price through a set of exogenous scenarios and solved the bidding problem of a price-taker thermal-wind power producer using a stochastic mixed-integer linear programming approach. In [10], Chaves-Ávila et al. analyzed the impact of different balancing rules (penalty mechanisms) on wind power short-term bidding strategies through a stochastic optimization model. Based on the Stackelberg game model, Xiao et al. [11] presented a closed-form analysis of a WPP's optimal bidding strategy in a day-ahead EM involving large-scale wind power. Lei et al. [12] studied, using a stochastic bilevel model, the optimal bidding decision for a WPP participating in a day-ahead EM that employs stochastic market clearing and energy and reserve cooptimization, in which only the wind generation uncertainty is considered. Similar research on the optimal bidding strategy of a WPP can also be found in [11, 13–18].

However, the authors in [3, 5–18] only studied how to find the optimal bidding strategy for a WPP within the EM environment, and the modeling methods of those works are either static game models (Cournot and Stackelberg game models) or bilevel stochastic optimization models, which cannot simulate the impact of wind power on the dynamic bidding process of the other participants (GenCOs and DisCOs) in a spot EM with wind power penetration.

To overcome the deficiencies listed above, spot EM modeling methods, with or without wind and other renewable power penetration, have been proposed in many works.

In general, the main purpose of EM modeling approaches is to regard the EM as a whole system in which the interactions among all market participants are investigated and the bidding process or the equilibrium result is simulated. EM modeling approaches mostly fall into two categories [2]: game-based models and agent-based models. In [2], Salehizadeh and Soltaniyan summarized why game-based EM models are inferior to agent-based models: (1) some game-based models result in a set of nonlinear equations which cannot be easily solved or might yield no solution; (2) some game-based models need to repeatedly solve multilevel mathematical programs to depict the dynamic bidding process in the EM, and this computational complexity limits the ability to simulate large EM systems; and (3) almost all game-based models assume that the probability distribution function of the market clearing price (MCP) or of the competitors' bidding strategies is common knowledge, an assumption that is no longer applicable in realistic situations [19]. Hence, many applications of agent-based methods to EM modeling have been proposed recently. Rahimiyan and Rajabi Mashhadi [19] modeled and simulated the EM bidding process using the multiagent Q-learning algorithm with discrete state and action sets and a game-based approach, respectively. Their comparison of the agent-based model with the game-based model confirms the superiority of the agent-based model in this setting. Santos et al. [20] proposed an agent-based wholesale EM test bed (called MASCEM: multiagent simulator of competitive electricity markets) in which the variant Roth–Erev reinforcement learning (VRERL) algorithm was used to model the bidding behavior of the GenCO agents.
Similar research on agent-based EM modeling can also be found in [21–28], but none of the works in [19–28] considers wind or other renewable power penetration.

Shafie-khah et al. [29] proposed a multiagent EM model based on a heuristic dynamic algorithm to help analyze the market power of GenCOs in an EM considering wind power uncertainty. Dallinger and Wietschel [30], based on an agent-based EM equilibrium model, studied the impact of plug-in electric vehicles on an EM with renewable power penetration. Reeg et al. [31] studied the policy design problem of fostering the integration of renewable energy sources into the EM using an agent-based approach. Zamani-Dehkordi et al. [32] studied the impact of a proposed wind farm project on wholesale and retail electricity prices using EM models based on nonparametric regression algorithms. In [33], using the Q-learning algorithm, Haring et al. proposed a multiagent EM approach to analyze the effects of renewable power uncertainty on the spot EM bidding process. Salehizadeh and Soltaniyan [2] modified the multiagent EM approach through the fuzzy Q-learning algorithm, with which the effects of renewable power uncertainty on the spot EM bidding process were also studied within a continuous market state (wind power) space but discrete action spaces. Paschen [34] analyzed the dynamic behavior of day-ahead EM prices in Germany due to structural shocks in wind and solar power using a dynamic structural vector autoregressive model. Similar studies can also be found in [35, 36]. However, the works in [29–36] regard wind power or other renewable powers as an exogenous random variable, so the strategic bidding behaviors of wind or other renewable power producers, as well as the impact of the EM bidding process on WPPs, are neglected.

So far as we know, no existing research contains the following three points simultaneously:
(1) A multiagent-based EM model which contains not only the impact of WPPs' uncertain output on the strategic bidding behaviors of other market participants but also the impact of the EM bidding process on WPPs' bidding decision-making
(2) A multiagent-based EM model in which both the EM environment state space and the bidding strategy (action) spaces of all kinds of market participants, such as WPPs, GenCOs, and DisCOs, are continuous
(3) A multiagent-based EM model in which the market clearing model of the ISO promotes the wind power accommodation capacity of the power system, which is another aim of this paper

This paper applies a new modified reinforcement learning algorithm, namely, the GDCAC algorithm, to hour-ahead EM modeling. In our proposed EM approach, all kinds of participants, such as WPPs, GenCOs, and DisCOs, are regarded as interactively strategic bidding agents who, during the bidding process, must select their optimal bidding strategies from continuous strategy spaces based on the EM environment state they observe within a continuous state space, without suffering from the "curse of dimensionality." The market clearing model of the ISO in our approach is a robust economic dispatch model (REDM) [37] which can optimize the permitted real-time wind power deviation intervals based on WPPs' bidding power output. Using our proposed approach, the dynamic interactions among all kinds of participants as well as the Nash equilibrium (NE) results of the EM can be simulated and obtained. On the one hand, our proposed approach can provide a bidding decision-making tool for WPPs, GenCOs, and DisCOs to earn more profit in the EM. On the other hand, it can also provide an economic and operational analysis tool for promoting the development of renewable resources. Moreover, in our simulation, the proposed approach is implemented on the IEEE 30-bus test system. Besides testing and verifying the feasibility and rationality of our proposed approach, such as reaching NE results after enough iterations and being superior to other agent-based approaches, a comparison of our proposed market clearing model with that in [12] under the same GDCAC-based bidding approach is also implemented, which indicates the necessity of adopting the REDM for promoting wind power accommodation in the EM.

The rest of this paper is organized as follows: in Section 2, the multiagent double-sided hour-ahead EM model containing strategic bidding interactions among WPPs, GenCOs, and DisCOs is explained. Sections 3 and 4 describe the detailed procedure of applying the GDCAC algorithm to EM modeling. Section 5 conducts the simulations and comparisons. Section 6 concludes the paper.

2. Multiagent Hour-Ahead EM Modeling

2.1. Participants’ Bidding Models

In our proposed double-side hour-ahead wholesale EM model, we consider every WPP, GenCO, and DisCO as an agent. An agent has the ability of learning through its bidding experiences in order to maximize its own profit. For the sake of simplicity and without the loss of generality, we assume that every WPP and GenCO has only one generation unit. In each hour, every GenCO and DisCO solves its own bidding problem and sends its price-quantity bid curve for the next hour to the ISO. Moreover, every WPP, because of its “price taker” role in EM, solves its own bidding problem and sends its bidding power output to the ISO. ISO, after receiving all bid curves from GenCOs and DisCOs as well as all bidding power outputs from WPPs, performs the process of robust economic dispatch management and sends the scheduled power results as well as LMPs to all market participants (WPPs, GenCOs, and DisCOs).

For WPP i (i = 1, 2, …, Nw), the only bidding parameter for hour t is its planned (bidding) power output . WPP i can adjust its bid by changing this parameter. In the power systems of many countries, wind power is given priority to be scheduled by the ISO compared with other nonrenewable resources [37], which is to say the prior-scheduled wind power for hour t, namely, , is equal to . However, because of the high variability and random nature of this intermittent resource, the (predicted) real-time output power of WPP i for hour t, namely, , which is actually a random variable [12], usually tends to deviate from the scheduled one, which is harmful to the secure operation of the power system and tends to cause system imbalance. Hence, penalty mechanisms to financially punish WPPs for the deviations of their real-time productions from their bidding ones must be involved. Taking the penalty method of [12] into consideration, the expected profit of WPP i for hour t can be described as follows, where represents the hour-ahead nodal price (LMP) for hour t at the bus connecting WPP i. is a random variable, which is used to describe the scenarios of wind power uncertainty. represents the envelope space of wind power scenarios. represents the probability of occurrence of the scenario . and represent the (predicted) real-time power output and penalty price of WPP i for hour t in scenario , respectively. In this paper, we assume that the penalty price of WPP i is related to the (predicted) real-time LMP at the bus connecting WPP i [12].
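The profit structure just described (bidding revenue minus the probability-weighted deviation penalty over wind scenarios) can be sketched as follows. All names and numbers are illustrative assumptions, not the exact formulation of equation (1):

```python
def expected_wpp_profit(bid_mw, lmp, scenarios):
    """Expected hourly profit of a price-taking WPP: bidding revenue minus
    the probability-weighted deviation penalty over wind scenarios.
    scenarios: list of (probability, realtime_mw, penalty_price)."""
    revenue = lmp * bid_mw
    expected_penalty = sum(
        prob * penalty * abs(rt_mw - bid_mw)   # pay per MW of deviation
        for prob, rt_mw, penalty in scenarios
    )
    return revenue - expected_penalty

# Example: a 40 MW bid at an LMP of 30 $/MWh under three wind scenarios.
profit = expected_wpp_profit(
    40.0, 30.0,
    [(0.2, 30.0, 12.0), (0.5, 40.0, 12.0), (0.3, 48.0, 12.0)],
)
```

Raising the bid increases the revenue term but also increases the expected penalty whenever real-time output falls short, which is exactly the trade-off the WPP agent must learn.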

Moreover, there is a difference between the (predicted) real-time power output and the (predicted) natural power output (namely, ) of WPP i in hour t. WPP i can determine whether its (predicted) real-time power output is equal to the natural one by conducting pitch control or using storage equipment [37]. The functional relationship between these two random variables can be formulated as follows [37], where represent the permitted upper and lower bounds of the power output of WPP i that can be accepted by the system for hour t. In this paper, we consider the (predicted) real-time natural wind power outputs of all WPPs as common knowledge.
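The relationship in equation (2) amounts to clipping the natural output to the permitted interval: the WPP tracks its natural output when it lies inside the bounds and is held at the nearest bound otherwise. A minimal sketch, with assumed names:

```python
def realtime_output(natural_mw, lower_mw, upper_mw):
    # The WPP tracks its natural output while it lies inside the permitted
    # interval; otherwise pitch control or storage holds it at the bound.
    return min(max(natural_mw, lower_mw), upper_mw)
```

For a permitted interval of [30, 50] MW, a natural output of 42 MW passes through unchanged, while 55 MW is curtailed to 50 MW and 20 MW is raised (e.g., from storage) to 30 MW.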

For GenCOj (j = 1, 2, …, ), the formulation of its bid curve for the next hour t is a supply function based on its real marginal cost function [28], where represent the power production (MW) and bidding strategy ratio of GenCOj for hour t, respectively. GenCOj can adjust its bid curve by changing its parameter .

The marginal cost function of GenCOj is as follows, where and represent the slope and intercept parameters of GenCOj's marginal cost function, respectively.
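A supply-function bid of this kind can be sketched as the true linear marginal cost scaled by the strategy ratio; all parameter names below are illustrative assumptions:

```python
def marginal_cost(g_mw, slope, intercept):
    """True marginal cost a*g + b of a GenCO ($/MWh)."""
    return slope * g_mw + intercept

def genco_bid_price(g_mw, slope, intercept, k):
    """Supply-function bid: strategy ratio k times the true marginal cost.
    k > 1 bids above cost; k = 1 reveals the true marginal cost."""
    return k * marginal_cost(g_mw, slope, intercept)
```

With a marginal cost of 0.1·g + 20 $/MWh, a ratio k = 1.2 at g = 50 MW yields a bid price of 30 $/MWh versus a true marginal cost of 25 $/MWh; the single scalar k is the only knob the GenCO agent adjusts each iteration.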

Moreover, we assume every GenCO is an AGC (automatic generation control [37]) unit which can automatically undertake the real-time power imbalance of the system with a certain proportion (namely, ). Therefore, the expected profit of GenCOj can be described as follows, where represents the hour-ahead nodal price (LMP) for hour t at the bus connecting GenCOj, represents the (predicted) real-time nodal price (LMP) for hour t at the bus connecting GenCOj in scenario , and represents GenCOj's hour-ahead scheduled output power for hour t.

For DisCOm (m = 1, 2, …, Nd), the formulation of its bid curve for the next hour t is a demand function based on its real marginal revenue function [28], where represent the power demand (MW) and bidding strategy ratio of DisCOm for hour t, respectively. DisCOm can adjust its bid curve by changing its parameter .

The marginal revenue function of DisCOm is as follows, where and represent the slope and intercept parameters of DisCOm's marginal revenue function, respectively.

The profit of DisCOm can be described as follows, where is the hour-ahead nodal price (LMP) for hour t at the bus connecting DisCOm and represents DisCOm's hour-ahead scheduled power demand (load) for hour t.

2.2. ISO’s Market Clearing Model

In the traditional dispatch mode with wind power penetration, the ISO sends the scheduled values of wind power to WPPs, and WPPs are required to strictly follow these scheduled values within their generation capacities. This traditional mode has two obvious defects [37]:
(1) When the precision of wind power prediction is low, the traditional dispatch mode is not conducive to wind power accommodation. It can lead to extreme operating conditions, which may seriously threaten system security when the wind power fluctuates violently.
(2) It may lead to frequent pitch control when wind turbines strictly track the scheduled output values, which shortens the lifetimes of the wind turbines.

The main reason for these two defects is that the traditional dispatch mode does not take the uncertainty of wind power into account. Hence, the ISO does not know the maximum permitted wind power fluctuation range that still ensures system security and cannot optimize the wind power accommodation capacity of the grid. Therefore, increasing attention has been paid to the REDM [37], which aims to promote wind power accommodation under wind power uncertainty. According to [37], the robust hour-ahead economic dispatch model for hour t can be described mathematically as follows, where and in equation (8) represent the deviation penalty coefficients of the permitted upper and lower bounds of the wind power output of WPP i, and equations (9)–(15) represent the hour-ahead system constraints, including the power balance constraint (equation (9)), the DC power flow constraints in each transmission line l (equations (11)–(13)), and the load and power production of every DisCO and GenCO (equations (14) and (15)). The hour-ahead LMPs of the system can be calculated from the dual variables of equations (9)–(13); the formulations for the hour-ahead LMPs are given in Appendix A. Equations (16)–(19) represent the (predicted) real-time system constraints, including the power balance constraint (equation (16)), the DC power flow constraints in each transmission line l (equations (17) and (18)), and the power production of every WPP (equation (19)).
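To convey the core idea of the REDM without the full network model of [37], the toy sketch below considers a single bus with one WPP and one AGC generator: the operator chooses the permitted wind interval around the WPP's bid, penalizing any shrinkage of the interval relative to the forecast envelope, subject to the generator being able to balance both wind extremes. All names, the grid-search solver, and the numbers are illustrative assumptions, not the dispatch model of equations (8)–(19):

```python
def robust_dispatch(load, w_bid, w_cap, g_lim, cost, pen_up, pen_lo, step=1.0):
    """Toy single-bus robust dispatch: pick the permitted wind interval
    [w_lo, w_hi] by grid search. w_cap is the forecast envelope,
    g_lim the AGC generator's (min, max) capacity."""
    w_cap_lo, w_cap_hi = w_cap
    g_min, g_max = g_lim
    grid = [i * step for i in range(int(load / step) + 1)]
    best = None
    for w_lo in grid:
        for w_hi in grid:
            # Interval must contain the bid and stay inside the envelope.
            if not (w_cap_lo <= w_lo <= w_bid <= w_hi <= w_cap_hi):
                continue
            # AGC unit must cover both wind extremes within its capacity.
            if load - w_hi < g_min or load - w_lo > g_max:
                continue
            g_sched = load - w_bid
            obj = (cost * g_sched
                   + pen_up * (w_cap_hi - w_hi)   # penalise shrinking upper bound
                   + pen_lo * (w_lo - w_cap_lo))  # penalise raising lower bound
            if best is None or obj < best[0]:
                best = (obj, w_lo, w_hi, g_sched)
    return best

# 100 MW load, 40 MW wind bid, forecast envelope [10, 70] MW,
# generator limits [20, 90] MW, 25 $/MWh cost, 5 $/MW interval penalties.
result = robust_dispatch(100.0, 40.0, (10.0, 70.0), (20.0, 90.0), 25.0, 5.0, 5.0)
```

In this instance, the optimizer widens the permitted interval all the way to the envelope, because the generator's capacity can absorb both extremes; tightening the generator limits would force a narrower interval, illustrating how the REDM trades wind accommodation against security.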

From equations (16)–(18), it is obvious that the (predicted) real-time DC power flow in each transmission line l is a linear function of the (predicted) real-time power output of every WPP. The (predicted) real-time power output of WPP i (i = 1, 2, …, ) must satisfy the bounds in equation (2), which is to say we can solve the abovementioned REDM by replacing with and , respectively (Appendix B) [37], generating new (predicted) real-time balancing and transmission constraints as follows:

The (predicted) real-time LMPs (RTLMP1s) of the system when the (predicted) real-time power output of every WPP increases to its (scheduled) permitted upper bound can be calculated from the dual variables of equations (9) and (21)–(23), and the (predicted) real-time LMPs (RTLMP2s) of the system when the (predicted) real-time power output of every WPP decreases to its (scheduled) permitted lower bound can be calculated from the dual variables of equations (9) and (24)–(26). Therefore, RTLMP1s and RTLMP2s represent two extreme real-time dispatch results caused by the real-time wind power deviations of all WPPs. For the sake of simplicity and without loss of generality, we approximate the (predicted) real-time LMP at bus z by the mean of RTLMP1 and RTLMP2 at bus z and neglect the impact of different on the (predicted) real-time LMPs.

3. Agent-Learning Mechanism

For an agent in our proposed approach, all the other agents together constitute the EM environment it faces. Therefore, interactions between an agent and all the other agents are equivalent to interactions between this agent and the EM environment it faces. An agent has the ability to learn through repeated interactions with the EM environment to find its optimal action (bidding strategy or bidding power output), which maximizes its (expected) profit whatever the EM environment state is. In this paper, in order to clearly describe our proposed approach, we use the following definitions:
(1) Iteration. Since the market is assumed to be cleared on an hour-ahead basis, we define each market round as an iteration.
(2) State Variable. For WPP i in iteration t, the hour-ahead and (predicted) real-time LMPs at the bus connecting WPP i calculated in iteration t − 1, namely, , , are defined as the EM environment state variables; for GenCOj, the hour-ahead and (predicted) real-time LMPs at the bus connecting GenCOj calculated in iteration t − 1, namely, and , are defined as the EM environment state variables. For DisCOm, the hour-ahead LMP at the bus connecting DisCOm calculated in iteration t − 1, namely, , is defined as the EM environment state variable. Hence, the state vectors and scalar for WPP i, GenCOj, and DisCOm can be formulated as follows [28], where , , and are continuous, closed, and bounded state spaces for WPP i, GenCOj, and DisCOm, respectively.
(3) Action Variable. For WPP i, the hour-ahead bidding power output, namely, , is defined as the action variable of this agent in iteration t. For GenCOj or DisCOm, the hour-ahead bidding strategy ratio, namely, or , is defined as the action variable of GenCOj or DisCOm in iteration t. Hence, the action scalars for WPP i, GenCOj, and DisCOm can be formulated as follows:

Obviously, from equations (28)–(30), we can see that the action spaces for WPP i, GenCOj, and DisCOm are continuous, closed, and bounded intervals.
(4) Reward. In iteration t, similar to [28], every agent learns from the state of the EM environment () and then selects its action, which in turn forms its bidding power output or curve to be sent to the ISO. After receiving all bidding outputs and curves, the ISO determines the hour-ahead LMPs, the permitted upper and lower bounds of the (predicted) real-time power outputs of WPPs, and the hour-ahead power supply and demand schedules using our REDM represented by equations (8)–(19). The rewards of WPP i, GenCOj, and DisCOm can be expressed as equations (1), (5), and (8), respectively.

Based on the rewards received over enough iterations, an agent in the EM can gradually learn to take the corresponding optimal hour-ahead action, which brings the most profit in face of any state () of the EM environment. Hence, and (i = 1, 2, …, ; j = 1, 2, …, ; m = 1, 2, …, Nd) change dynamically over iterations and may or may not become constant after enough iterations.

4. Methodology

Inspired by the studies in [19–26], the dynamic bidding process in a spot EM can be realized via table-based reinforcement learning algorithms (TBRLAs) such as the Q-learning, fuzzy Q-learning, Roth–Erev learning, and SARSA algorithms. As mentioned in [28, 38], TBRLAs can only rapidly solve Markov decision-making problems with discrete state and action spaces. When the state or action space becomes continuous, the so-called "curse of dimensionality" arises, and the learning speed of TBRLAs becomes so slow that an agent cannot find its optimal action under any given environment state within a practical number of iterations.

As mentioned in Section 3, both the state and action spaces of every agent in the EM are actually continuous, closed, and bounded spaces or intervals, which guarantees the process of global optimization. Therefore, it is improper to model and simulate the dynamic bidding process in our proposed hour-ahead EM, containing strategic bidding interactions among WPPs, GenCOs, and DisCOs, using TBRLAs. The method in this paper is to apply a modified reinforcement learning algorithm, called the GDCAC algorithm [28, 38], for modeling and simulating our proposed EM.

Because the mathematical principle and pseudocode of the GDCAC algorithm have been described in [28], we only present the step-by-step procedure of implementing the GDCAC algorithm for hour-ahead EM modeling containing strategic bidding interactions among WPPs, GenCOs, and DisCOs:
(1) Input. For the whole EM, input common knowledge such as every WPP's reduced (predicted) real-time wind power output scenarios (WPOSs) with corresponding probabilities and all WPPs' joint real-time WPOSs with corresponding probabilities. For WPP i (), input the basis function : for formulating its value function and its optimal policy function , and the time step length parameter series and , where and and and . For GenCOj (), input the basis function : for formulating its value function and its optimal policy function , and the time step length parameter series and , where and and and . For DisCOm (), input the basis function : for formulating its value function and its optimal policy function , and the time step length parameter series and , where and and and . Moreover, input the discount and standard deviation parameters as well as the maximum training and decision-making iteration counts, namely, , and T1 and T2, for every WPP, GenCO, and DisCO.
(2) Set t = 0.
(3) Initialize the linear parameter vectors and for WPP i, and for GenCOj, and and for DisCOm.
(4) If t < T1, then in iteration t, WPP i selects and implements an action () from state , GenCOj selects and implements an action () from state , and DisCOm selects and implements an action () from state . If T1 ≤ t < T1 + T2, then in iteration t, WPP i selects and implements an action () from state , GenCOj selects and implements an action () from state , and DisCOm selects and implements an action () from state . After every agent selects its action and sends it to the ISO, the ISO implements the REDM represented by equations (8)–(19), by which the EM environment state vector variables are updated from to and the immediate rewards , , and are generated.
(5) WPP i observes the immediate reward by using equation (1) and the new EM environment state ; GenCOj observes the immediate reward by using equation (5) and the new EM environment state ; and DisCOm observes the immediate reward by using equation (8) and the new EM environment state .
(6) Learning. In this step, and for WPP i, and for GenCOj, and and for DisCOm are updated by using the TD(0) error (namely, , , and ) and the gradient descent method.

WPP i:

GenCOj:

DisCOm:
(7) Set t = t + 1.
(8) If t < T1 + T2, return to step (4).
(9) Output. For WPP i, and and . For GenCOj, and and . For DisCOm, and and .

According to [28, 38], we choose Gaussian radial basis functions as , , and .
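The learning step above can be illustrated with a deliberately minimal actor-critic agent: a linear critic and a linear Gaussian-policy actor over Gaussian radial basis features, updated by gradient descent on the TD(0) error. This is a sketch of the GDCAC idea on a one-dimensional toy problem (a single fixed "LMP" state and a reward peaked at action 0.5), not the exact algorithm of [28, 38]; all parameter values are assumptions:

```python
import math
import random

def rbf(s, centers, width):
    """Gaussian radial basis features over a 1-D state (e.g. last LMP)."""
    return [math.exp(-(s - c) ** 2 / (2 * width ** 2)) for c in centers]

class ContinuousActorCritic:
    """Minimal linear actor-critic with Gaussian exploration."""

    def __init__(self, centers, width, alpha, beta, gamma, sigma):
        self.centers, self.width = centers, width
        self.alpha, self.beta = alpha, beta     # critic / actor step lengths
        self.gamma, self.sigma = gamma, sigma   # discount / exploration std
        self.theta = [0.0] * len(centers)       # critic weights: V(s) = theta . phi(s)
        self.w = [0.0] * len(centers)           # actor weights: mu(s) = w . phi(s)

    def value(self, s):
        return sum(t * p for t, p in zip(self.theta, rbf(s, self.centers, self.width)))

    def act(self, s, greedy=False):
        mu = sum(wi * p for wi, p in zip(self.w, rbf(s, self.centers, self.width)))
        return mu if greedy else random.gauss(mu, self.sigma)  # explore in training

    def learn(self, s, a, r, s_next):
        phi = rbf(s, self.centers, self.width)
        delta = r + self.gamma * self.value(s_next) - self.value(s)  # TD(0) error
        mu = sum(wi * p for wi, p in zip(self.w, phi))
        for i, p in enumerate(phi):
            self.theta[i] += self.alpha * delta * p        # critic gradient step
            self.w[i] += self.beta * delta * (a - mu) * p  # actor gradient step

# Toy illustration: one fixed state, reward peaked at action 0.5.
random.seed(1)
agent = ContinuousActorCritic([0.0, 0.5, 1.0], 0.5, 0.1, 0.1, 0.0, 0.3)
for _ in range(5000):
    a = agent.act(0.5)
    agent.learn(0.5, a, -(a - 0.5) ** 2, 0.5)
```

After training, the greedy action approaches the reward-maximizing value; during training, the Gaussian exploration noise plays the role of the exploration-exploitation balance mentioned in Section 5, and the greedy policy corresponds to the decision-making phase.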

5. Simulation Results and Discussions

5.1. Data and Assumptions

In this section, our proposed approach is implemented on the IEEE 30-bus test system with 2 WPPs, 6 GenCOs, and 20 DisCOs [2]. The schematic structure of this test system is shown in Figure 1. The output powers of the WPPs connected to buses 7 (marked as WPP 1) and 10 (marked as WPP 2) lie within the ranges [0, 80] MW and [0, 50] MW, respectively. According to [39, 40], we assume that the real-time wind power outputs of these two WPPs independently follow Weibull distributions. The (predicted) real-time WPOSs of these two WPPs are then generated using the Monte Carlo method, and the real-time WPOS reduction method is that of [39, 40]. Table 1 shows the 10 reduced (predicted) real-time WPOSs and their corresponding probabilities for these two WPPs, which are used as exogenous parameters in our proposed approach.

Based on Table 1, the number of joint WPOSs corresponding to combinations of the (predicted) real-time power outputs of WPP1 and WPP2 is 100 (10 × 10), which is too many for the subsequent calculations. Hence, in this paper, the 100 joint WPOSs are further reduced to 10 using the tabu search algorithm proposed in [40]. Table 2 shows the 10 reduced (predicted) real-time joint WPOSs and their corresponding probabilities.
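The combination step can be sketched as follows. The joint-scenario construction (a product of independent scenario sets) is straightforward; the reduction shown here simply keeps the most probable joint scenarios and renormalizes, which is a crude stand-in for the tabu search reduction of [40], and all scenario values are made-up numbers:

```python
from itertools import product

def joint_scenarios(s1, s2):
    """Combine two independent scenario sets given as [(prob, value), ...]."""
    return [(p1 * p2, (v1, v2)) for (p1, v1), (p2, v2) in product(s1, s2)]

def crude_reduce(scenarios, k):
    # Keep the k most probable joint scenarios and renormalise the
    # probabilities; a crude stand-in for the tabu-search reduction of [40].
    top = sorted(scenarios, key=lambda sv: -sv[0])[:k]
    total = sum(p for p, _ in top)
    return [(p / total, v) for p, v in top]

# Example: two small independent scenario sets (probability, MW output).
wpp1 = [(0.5, 10.0), (0.3, 20.0), (0.2, 30.0)]
wpp2 = [(0.6, 5.0), (0.4, 15.0)]
reduced = crude_reduce(joint_scenarios(wpp1, wpp2), 3)
```

A distance-based reduction such as that in [40] additionally reassigns the probability mass of discarded scenarios to the nearest retained ones, which preserves the distribution better than simple renormalization.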

Moreover, parameters of GenCOs’ and DisCOs’ bid functions are shown in Tables 3 and 4 [2], respectively.

In order to verify the following three points: (1) our proposed EM approach can reach dynamic stability and Nash equilibrium (NE) after enough training and decision-making iterations; (2) our proposed EM approach is superior to approaches based on TBRL algorithms (e.g., the Q-learning algorithm) in terms of participants' (expected) profits and the expected social welfare (SW), which can be calculated as the sum of the (expected) profits of all participants [2]; and (3) different market clearing methods (e.g., the REDM and the stochastic economic dispatch model (SEDM) [12]) affect the bidding stability results under strategic interactions among WPPs and other participants, three corresponding simulations are carried out one by one using Matlab R2014a, as follows.

5.2. Testing the Ability of Our Proposed EM Approach to Reach Dynamic Stability and NE

In this section, we assume that every WPP, GenCO, and DisCO in the market is a GDCAC-based agent with continuous state and action spaces, and the dynamic interactions among all GDCAC-based agents constitute our proposed GDCAC-based EM approach. The related parameters of the GDCAC algorithm are listed in Table 5.

In our simulation and comparisons (the same as in the subsequent sections), every agent goes through a training process of 3000 iterations in which all agents' action selection policies balance exploration and exploitation [28]. After the training process, a decision-making process of 500 iterations is implemented by all agents, in which only the greedy policy is adopted when selecting actions in face of any market state [28]. Moreover, at the beginning of the first training iteration, because every agent has no experience in strategy selection, we randomly set the hour-ahead bidding outputs of WPPs and the bidding strategies of GenCOs and DisCOs within their respective intervals.

During the decision-making process, the dynamic adjustment of the EM environment state and of every agent's bidding strategy (output) may become constant, which means the market reaches dynamic stability. Whether our proposed GDCAC-based approach reaches dynamic stability after 3000 training iterations is shown in Figures 2–4, respectively.

From Figures 2–4, we can see that the hour-ahead LMPs, the (expected) profit of every agent, and the (predicted) real-time LMPs at the buses connecting WPPs (i.e., the penalty prices charged to WPPs) in our proposed GDCAC-based approach remain constant throughout the 500 decision-making iterations. It has been verified in [28] that the other adjustment processes in the EM, such as those of the expected SW and every agent's bidding strategy, also become constant once the adjustment process of the LMPs does. Therefore, we conclude that our proposed GDCAC-based approach reaches dynamic stability after 3000 training iterations. However, dynamic stability is not equivalent to an NE. Hence, to examine whether the bidding strategies obtained after the 3000 training iterations and 500 decision-making iterations constitute an NE, we observe each agent's (expected) profit when it changes its bidding strategy while the other agents' strategies are held fixed after the 3500 iterations. The combination of obtained bidding strategies is an NE when no agent can increase its (expected) profit while the other agents' bidding strategies remain unchanged. We define a Nash index [2] which equals 1 when an NE is reached and 0 otherwise. Figure 5 shows the adjustment process of the Nash index over the 3500 iterations of our proposed GDCAC-based approach.
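The Nash-index check described above amounts to a unilateral-deviation test. The function below is a minimal sketch; the `profit_fns` interface and the finite candidate strategy sets are illustrative assumptions (the paper's agents have continuous strategy spaces, over which the deviation search would be continuous):

```python
def nash_index(strategies, profit_fns, candidate_sets, tol=1e-6):
    """Return 1 if no agent can raise its profit by unilaterally switching
    to another candidate strategy (an approximate NE check), else 0."""
    for i, profit in enumerate(profit_fns):
        current = profit(strategies)
        for alt in candidate_sets[i]:
            trial = list(strategies)
            trial[i] = alt                    # unilateral deviation by agent i
            if profit(trial) > current + tol: # profitable deviation found
                return 0
    return 1
```

On a two-player game with a known equilibrium, the index is 1 at the equilibrium profile and 0 at any profile admitting a profitable deviation.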

Figure 5 shows that our proposed GDCAC-based EM approach can successfully generalize agents' experiences from adjacent state points to any given state point and reach an NE after enough training and decision-making iterations. Moreover, by the same method, the comparative Q-learning-based approach (introduced in Section 5.3) can also be verified to reach dynamic stability and an NE after the same number of iterations; this is not demonstrated here for reasons of length.

The hour-ahead LMPs, RTLMP1s, and RTLMP2s obtained at the 30 buses after 3500 iterations of our GDCAC-based EM approach are depicted in Figure 6.

Figure 6 shows that the hour-ahead LMPs of the 30 buses are equal to each other after 3500 iterations; that is, the hour-ahead dispatch causes no congestion on any transmission line of the test system. In addition, the RTLMP1s and RTLMP2s differ across the 30 buses, with respect to both the permitted upper and lower bounds of the WPPs' power outputs. We explain these results as follows. When the (predicted) real-time outputs of WPPs deviate from their hour-ahead scheduled values, the power output at each generator-connected bus and the power flow on each transmission line are redistributed. For the system to tolerate the (predicted) real-time wind power deviations to a certain degree, the hour-ahead REDM must not only make each GenCO maintain a certain reserve capacity but also reserve some additional transmission capacity on each line to cope with the (predicted) real-time power flow changes.
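The need to hold back transmission capacity for real-time wind deviations can be illustrated with a small PTDF-style calculation. The PTDF row, bus indexing, and function name below are illustrative assumptions, not quantities taken from the paper's model:

```python
def line_flow_and_wind_margin(ptdf_row, base_injections,
                              wind_buses, dev_lo, dev_hi):
    """Return the hour-ahead flow on one line and the extra capacity it
    must reserve so that no combination of wind deviations inside the
    permitted intervals can overload it (illustrative sketch)."""
    # Hour-ahead flow from the scheduled nodal injections.
    base_flow = sum(f * p for f, p in zip(ptdf_row, base_injections))
    # Worst-case additional flow: each wind bus is pushed to whichever
    # deviation bound raises the flow through this line.
    margin = sum(max(ptdf_row[b] * lo, ptdf_row[b] * hi)
                 for b, lo, hi in zip(wind_buses, dev_lo, dev_hi))
    return base_flow, margin
```

Enforcing `base_flow + margin <= line_capacity` in the hour-ahead schedule is what allows a robust dispatch to tolerate any real-time deviation inside the permitted intervals.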

5.3. Comparison of Our Proposed Approach and TBRL-Based Approach

In this section, for the purpose of comparison, our proposed GDCAC-based EM approach and a Q-learning-based EM approach are both implemented on the test system. Three learning scenarios (LSNs) are set up for simulation and comparison. LSN.1 assumes that every WPP, GenCO, and DisCO in the market is a GDCAC-based agent with continuous state and action spaces, which is the same as our proposed GDCAC-based approach in Section 5.2. LSN.2 assumes that WPP1 is a Q-learning-based agent with discrete state and action spaces while the other agents are the same as in LSN.1. LSN.3 assumes that every WPP, GenCO, and DisCO in the market is a Q-learning-based agent with discrete state and action spaces, i.e., the comparative Q-learning-based EM approach. Table 6 presents the related information for LSN.2 and LSN.3, respectively. The parameters of the comparative Q-learning algorithm [19, 28], which uses a policy that balances exploration and exploitation during the 3000 training iterations and the greedy policy during the 500 decision-making iterations, are also listed in Table 6.
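For reference, the comparative tabular method can be sketched as a standard one-step Q-learning loop over discretized states and actions. The parameter values and the `reward_fn`/`transition_fn` interface below are placeholders for illustration, not the settings of Table 6:

```python
import random
from collections import defaultdict

def q_learning_bid(states, actions, reward_fn, transition_fn,
                   n_train=3000, alpha=0.1, gamma=0.9, eps=0.1):
    """Minimal tabular Q-learning loop for a discretized bidding problem
    (illustrative sketch of the comparative TBRL approach)."""
    Q = defaultdict(float)
    s = states[0]
    for _ in range(n_train):
        # Exploration-exploitation balance during training.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        r, s_next = reward_fn(s, a), transition_fn(s, a)
        # One-step temporal-difference update.
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
    # Greedy policy used in the decision-making phase.
    return {st: max(actions, key=lambda x: Q[(st, x)]) for st in states}
```

Because the table is indexed by discrete state-action pairs, refining the discretization enlarges the table exponentially, which is the curse of dimensionality that the continuous GDCAC representation avoids.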

After 3500 iterations, the (expected) profits of all agents and the expected SWs in the 3 LSNs are listed in Table 7.

From Table 7, the following can be inferred:

(1) After the same number of iterations, WPP1's (expected) profit in LSN.1 is higher than that in LSN.2. This, to some extent, indicates that an agent can earn more profit by bidding in the EM with our proposed GDCAC-based method than with the Q-learning-based one under the same conditions (namely, the same parameter values, number of iterations, and adaptive learning mechanisms of the other agents).

(2) After the same number of iterations, the expected SW in LSN.1 is higher than that in LSN.2, which in turn is higher than that in LSN.3. This, to some extent, indicates that the expected SW improves as the number of agents using our proposed GDCAC-based method increases.

In conclusion, with respect to both the (expected) profit of a specific agent and the expected SW, our proposed GDCAC-based approach clearly outperforms the comparative Q-learning-based one. The main reasons are as follows: (1) the state and action spaces in the comparative Q-learning approach must be kept discrete, since continuous spaces would cause the curse of dimensionality in a tabular method, whereas the GDCAC-based approach works with fully continuous state and action spaces; (2) with discrete state and action spaces it is harder to find the globally optimal action for any given state than with continuous ones [28].

5.4. Comparison of Different Market Clearing Models in Our Proposed EM Approach

In this section, two market clearing models embedded in our proposed GDCAC-based EM approach are compared on the test system. One is the REDM described in Section 2.2; the other is the SEDM of [12]. Under the SEDM, we still assume that the scheduling of the hour-ahead bidding outputs of WPPs takes priority in the system. Moreover, based on the 10 joint real-time WPOSs listed in Table 2, the SEDM takes maximizing the expected SW as its objective function [12] and simultaneously considers hour-ahead and (predicted) real-time transmission constraints, among others, to obtain the optimal hour-ahead scheduled power outputs and demands of all GenCOs and DisCOs. For comparison, we adopt the expected SW, the bidding power outputs, and the permitted (predicted) real-time upper and lower output bounds of WPP1 and WPP2 obtained after 3500 iterations. The values of these indices under the two dispatch models in our proposed EM approach are listed in Table 8.

From Table 8, the following can be inferred:

(1) After 3500 iterations, the hour-ahead bidding outputs of WPPs under the REDM-embedded EM approach are significantly higher than those under the SEDM-embedded EM approach. We explain this result as follows. Although both dispatch models contain an endogenous penalty mechanism for wind power output deviations, which affects the dynamic adjustment of the WPPs' bidding outputs, the permitted upper and lower bounds under the REDM-embedded approach are dynamically adjusted to fit the hour-ahead bidding output of each WPP, whereas in each iteration of the SEDM-embedded approach, the hour-ahead bidding output of each WPP must satisfy the (predicted) real-time transmission constraints corresponding to the 10 WPOSs listed in Table 2. Therefore, WPPs in the REDM-embedded approach can adjust their bidding outputs to relatively high levels, while those in the SEDM-embedded one are more inclined to adjust their bidding outputs toward the average level of the 10 WPOSs in Table 2 in order to avoid the risk of (expected) profit decline caused by larger power deviations.

(2) After 3500 iterations, the expected SW obtained from the REDM-embedded EM approach is significantly higher than that obtained from the SEDM-embedded one. We explain this result as follows: to meet all (predicted) real-time transmission constraints corresponding to the 10 clearly different WPOSs in Table 2, the SEDM requires more reserve transmission capacity on each line, which may crowd out more scheduled power outputs and demands of GenCOs and DisCOs than the REDM does under the same bidding outputs of WPPs.

(3) Moreover, besides the scheduled hour-ahead power outputs and demands of all GenCOs and DisCOs, the REDM also schedules the permitted upper and lower bounds of the (predicted) real-time power output of each WPP. If a WPP's (predicted) natural power output falls outside the permitted interval defined by these scheduled bounds, its (predicted) real-time output can be adjusted to the adjacent bound by pitch control or by using storage equipment [37]. This characteristic of the REDM means that arbitrary continuous changes of a WPP's real-time output within its permitted interval cannot cause congestion on any transmission line in the system. By contrast, the SEDM schedules only the hour-ahead power outputs and demands of all GenCOs and DisCOs. Although an SEDM schedule meets all (predicted) real-time transmission constraints corresponding to the 10 WPOSs in Table 2, it cannot guarantee that real-time WPP outputs other than those WPOSs will not cause congestion, and WPPs do not know the permitted power deviation intervals according to which they could adjust their natural outputs by pitch control or storage. Hence, the SEDM-embedded EM approach is less conducive to wind power accommodation than the REDM-embedded one.
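The bound-clipping behavior of a WPP's real-time output described above can be sketched directly; the function and variable names are illustrative, not the paper's notation:

```python
def settle_wind_output(natural_output, lower_bound, upper_bound):
    """Adjust a WPP's (predicted) natural real-time output onto the
    permitted interval scheduled by the REDM (illustrative sketch)."""
    if natural_output > upper_bound:
        return upper_bound    # curtail surplus via pitch control
    if natural_output < lower_bound:
        return lower_bound    # cover the shortfall, e.g., from storage
    return natural_output     # inside the interval: deliver as produced
```

Because every delivered value lies inside the scheduled interval, the hour-ahead REDM schedule remains feasible for whatever the wind actually does.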

Therefore, with respect to both economy and reliability, the REDM has clear advantages over the SEDM when embedded in the EM modeling approach.

6. Conclusion

In this paper, considering strategic interactions among WPPs, GenCOs, and DisCOs, we have proposed a GDCAC-based EM modeling approach with the REDM embedded. Simulation results have verified the feasibility and soundness of the proposed approach, and the following conclusions can be drawn:

(1) With our proposed GDCAC-based EM approach, the simulated bidding process reaches dynamic stability after enough training and decision-making iterations, and this stable point has been verified to be an NE.

(2) Our simulation on the IEEE 30-bus test system with 28 participants takes only 1.17 minutes to reach the final result. That is, the time complexity of the proposed approach is relatively low, so it can be extended to the modeling and simulation of more realistic and more complex EM systems.

(3) Our proposed GDCAC-based EM approach is superior to the TBRL- (Q-learning-) based approach in terms of both the profit of a specific agent and the expected SW. The main reason is that the TBRL algorithm can only handle Markov decision-making problems with discrete state and action spaces.

(4) The obtained bidding results also reveal that, on the premise of maintaining a relatively high wind power accommodation ability of the system, the overall SW can be improved by using the REDM instead of the SEDM as the market clearing model. This, to some extent, verifies the robustness against wind power fluctuations, the reliability of the scheduling results, and the operational economy of our proposed EM approach with the REDM embedded.

Moreover, our proposed approach can serve, on the one hand, as a bidding decision-making tool with which WPPs, GenCOs, and DisCOs can earn higher profits in the EM and, on the other hand, as an economic and operational analysis tool for promoting the development of renewable resources.

Appendix

A. Formulations for Hour-Ahead LMP

The hour-ahead LMP for energy credit and load payment at bus Gz (or Dz) can be calculated as follows, where the three dual variables correspond to equations (9), (12), and (13), respectively, and L represents the generalized Lagrange function of the model (equations (8)–(19)).
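Because the inline symbols were lost in this passage, a generic decomposition of the same form may clarify the structure. With $\lambda$ the dual of the system power-balance equation and $\mu_l^{\pm}$ the duals of the upper and lower limits of line $l$ (hypothetical notation, not the paper's, with the sign pattern depending on how the line limits are written), the envelope theorem gives a sketch of the form

```latex
\mathrm{LMP}_z
  \;=\; \frac{\partial L}{\partial P_z}
  \;=\; \lambda \;+\; \sum_{l}\bigl(\mu_l^{-}-\mu_l^{+}\bigr)\,\mathrm{GSF}_{l,z},
```

where $\mathrm{GSF}_{l,z}$ is the generation shift factor of bus $z$ on line $l$; with no congestion, all $\mu_l^{\pm}=0$ and the LMPs of all buses coincide, as observed in Figure 6.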

B. Discussion on the Reformulation of Constraints (16)–(18) into (21)–(25)

From equations (16)–(18), it is obvious that the constrained quantities increase as each WPP's real-time power output rises and decrease as it falls. That is to say, the violation of the real-time constraints is most likely to happen when the real-time outputs reach their permitted upper or lower bounds. Hence, for the purpose of maintaining robustness, we can solve the abovementioned REDM by replacing the real-time wind power outputs with their permitted upper and lower bounds, respectively, generating the following new (predicted) real-time balancing and transmission constraints:
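The monotonicity argument can be stated compactly. For a constraint function $g$ that is monotone in each WPP's real-time output $w_i$ over its permitted interval $[\underline{w}_i, \overline{w}_i]$ (generic notation, not the paper's), the worst case over the whole box of intervals is attained at a vertex:

```latex
\max_{w_i \in [\underline{w}_i,\,\overline{w}_i]\;\forall i} g(w_1,\dots,w_n)
  \;=\; g\bigl(w_1^{\ast},\dots,w_n^{\ast}\bigr),
\qquad
w_i^{\ast} =
\begin{cases}
  \overline{w}_i, & g \text{ nondecreasing in } w_i,\\
  \underline{w}_i, & g \text{ nonincreasing in } w_i,
\end{cases}
```

so enforcing the reformulated constraints only at the bound profiles guarantees feasibility for every real-time output inside the permitted intervals, without enumerating interior points.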

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the fund project of the Central University of North China Electric Power University under Grant 2017XS113.