Abstract

With the development of the Internet of Things and smart grid technologies, modern electricity markets seamlessly connect demand response to the spot market through price-responsive loads, in which the trading strategy of load aggregators plays a crucial role in capturing profit. In this study, we propose a deep reinforcement learning-based strategy for purchasing and selling electricity based on real-time electricity prices and real-time demand data in the spot market, which maximizes the revenue of load aggregators. The deep deterministic policy gradient (DDPG) algorithm is combined with a bidirectional long short-term memory (BiLSTM) network that extracts the market state features used to make trading decisions. By introducing the bidirectional LSTM structure into the actor-critic network, hidden states of the partially observable Markov environment are learned through memory-based inference, and the effectiveness of the method is validated on datasets from the New England and Australian electricity markets. Comparative experiments show that the method yields higher returns.

1. Introduction

The basic feature of the electricity market is that prices follow demand, and price changes affect the quantity demanded [1]. The economic operation of the electricity market helps to reduce the cost of electricity use and is an effective way of enhancing the security of the electricity system through economics [2]. Studying response characteristics in terms of timing, trading rules, and related factors can enhance the flexibility of electricity markets and improve the accuracy of forecasting and decision-making [3–5]. In recent years, with the development of the Internet of Things and smart grid technologies, and especially the advancement of ambient intelligence, the widespread deployment of smart meters has equipped more customers with two-way communication capability, making price-responsive loads possible. Price-responsive demand (PRD) unifies the original price-based and incentive-based demand-side responses: it turns the previously uncontrollable price-based demand response into a controllable resource and expresses the incentive-based demand response as a response to price.

For system operators, PRD is a reliable real-time resource that can be described as a price-adjusted load, providing a new means and tool for dispatch; for consumers, PRD reduces electricity bills and improves energy use efficiency. The EcoGrid EU trial results show that residential load could be considered price-sensitive on certain test days [6]. Moreover, results from the California ISO, PJM, and Alstom Grid show that PRD helped improve the efficiency of market operations and significantly increase system reliability [7]. Price-responsive mechanisms also facilitate the integration of new flexible energy sources and reduce grid operating costs [8].

Load aggregators can consolidate demand response customer resources so that price-responsive loads act as a single large customer. To a certain extent, this eliminates uncertainty in user response behavior and allows small and medium loads to participate in the electricity market according to their own load control characteristics; aggregated demand response resources can be flexibly managed to improve response efficiency based on forecast or current electricity spot market prices. In Liu et al. [9], a hybrid stochastic/robust optimisation approach that minimizes the expected net cost was proposed for distributed generation (DG), storage, dispatchable DG, and price-sensitive load bidding strategies in the day-ahead market. The results show that wind power output was negatively correlated with the price-based demand response, and this correlation could reduce the system operating cost and improve the economy of system dispatch. In Geng et al. [10], a two-stage stochastic power purchase model with DR resources was constructed to minimize the energy purchase cost of integrated energy service providers in different types of markets, and the impact of flexible heating load on their power purchase strategy was presented. In the day-ahead market, a multi-time-scale stochastic optimal scheduling model for electric vehicle (EV) charging stations with demand response was proposed with the objective of minimizing the daily operating cost, introducing price-based demand response to optimise the net load curve of charging stations [11]. Combining price-based demand response measures, an optimisation was proposed with the objectives of maximizing the revenue of EV load aggregators and minimizing load fluctuation [12].

Heuristic algorithms, meta-heuristics, and intelligent evolutionary algorithms are used in various fields for the optimal solution of decision problems. In Zhao and Zhang [13], a learning-based algorithm is proposed that improves generalisation by adjusting the evolutionary strategy according to feedback information gathered during the optimisation process for the actual problem. Pasha et al. [14] present an integrated optimisation model whose objective is to maximize the total turnover profit generated by the transport business and solve the proposed model through a decomposition-based heuristic algorithm. Kavoosi et al. [15] propose an evolutionary algorithm to solve their mathematical model, implemented through an enhanced adaptive parameter control strategy that effectively varies the algorithm parameters throughout the search process. Dulebenets [16] proposes a new adaptive polyploid memetic algorithm to solve a transport scheduling problem and to help operators in proper operational planning. Rabbani et al. [17] present a mixed integer linear programming model to find the optimal route sequence and minimize time consumption through the non-dominated sorting genetic algorithm II and multi-objective particle swarm optimisation.

In recent years, deep reinforcement learning, with its autonomous perception and decision-making capabilities, has been successfully applied in the energy sector [18–20]. The feasibility of using it for grid regulation has also been demonstrated [21, 22], and the requirements associated with demand response can be met [23]. Reinforcement learning is a mathematical model of learning through rewarded, repeated trial and error; it is rooted in the psychological concept of operant conditioning, which takes its name from the increased frequency of reinforced autonomous behavior. A customer agent model applying Q-learning was proposed in [24] for predicting price-sensitive load reductions. A pricing strategy for charging station operators based on noncooperative games and deep reinforcement learning was investigated in [25], and the effectiveness of the proposed framework was validated with real data from cities. Moreover, a real-time pricing technique based on multi-agent reinforcement learning was proposed in [26], and it worked well in consumer-driven applications of mini smart grids. The researchers behind [27] considered thermostatically controlled loads, energy storage systems (ESS), and price-responsive loads for flexible demand-side dispatch of microgrids based on deep reinforcement learning, which significantly reduced input costs. The researchers of [28] gave a dynamic pricing strategy based on DDPG that considers the historical behavior data of electric vehicles, the peak-valley time-of-use tariff, and the demand-side response pattern to guide customer tariff behavior and exploit the economic potential of the electricity market. Considering the cooperation between wind farms and electric vehicles, an intelligent pricing decision for EV load aggregators based on deep reinforcement learning was proposed in [29] to increase overall economic benefit. To maximize the long-term revenue of electricity sellers in the electricity spot market, the researchers of [30] proposed a dynamic optimisation scheme for demand response using reinforcement learning. For the price difference between the day-ahead and real-time markets in the electricity spot market, the researchers of [31] obtained an effective optimal bidding strategy based on deep reinforcement learning. Further, an improved deep deterministic policy gradient algorithm was proposed in [32] as a building-level control strategy to improve the demand response capability of distributed electric heating loads. A dual-DQN agent was proposed in [33] to evaluate the resilience of power systems. Other research [34] combined the cross-entropy method (CEM), the maximum mean discrepancy (MMD), and the twin delayed deep deterministic policy gradient algorithm (TD3) within evolutionary strategies to propose diversity evolutionary policy deep reinforcement learning (DEPRL).

In summary, load aggregators, acting on behalf of small and medium electricity consumers in price-responsive load trading, face the problem of how to purchase electricity from the market and sell it to consumers, and they need to optimise their decisions on both purchases and sales in order to maximize profits. It is therefore necessary to study the buying and selling strategies that load aggregators can carry out for price-responsive loads in dynamic trading in the electricity spot market. It is also necessary to overcome the slow training convergence of reinforcement learning when the input dimension is too large. Based on the above problems, this study proposes a BiLSTM-based deep reinforcement learning method for load aggregators to purchase and sell electricity, with the objective of maximizing load aggregator revenue under the price-responsive load mechanism. The contributions of this study are as follows.

We propose a BiLSTM-DDPG model to construct the trading strategy for load aggregators. We describe the trading process as a partially observable Markov decision process (POMDP). The bidirectional LSTM neural network processes state information along both directions of the time axis and generates bidirectionally encoded information to cope with the dynamic changes of an uncertain environment. The proposed BiLSTM-DDPG method integrates time-domain processing with autonomous recognition and decision-making capabilities: BiLSTM extracts features and temporal relationships while mitigating vanishing and exploding gradients, and DDPG allows more accurate recognition and optimal decision-making in complex electricity spot market environments.

2. Materials and Methods

2.1. BiLSTM Model

The recurrent neural network (RNN) is a neural network that processes temporal data by feeding its own output back as input. In a single computational unit, the input xt at time t and the hidden output ht−1 from time t − 1 are used as inputs; in addition to the output yt, the unit also generates ht, which is passed on to the next time step (t + 1) for the next computation. The RNN based on this design has predictive capability. LSTM is an improved RNN; compared with the RNN, LSTM adds a forget gate and implements the forgetting function through a cell state parameter (c). The LSTM structure is shown in Figure 1.

The LSTM cell contains a forget gate, an input gate, and an output gate. The forget gate (ft) selectively forgets the information of the previous cell, as shown in equation (1); it takes the output of the previous cell and the current input as inputs and, through the sigmoid function, outputs a value between 0 and 1 that represents the proportion of information retained. The input gate controls the proportion of the current cell's input information, as shown in (2); (3) gives the candidate information to be retained; and (4) weights the retained information and the new information to form the current cell state Ct. The output gate determines how much information is output, and (5) and (6) pass part of the information from the current cell to later cells [35–37]. The DDPG algorithm with LSTM added stores and passes on information about the trend of the hidden state of the environment in the time domain. The gate computations are given by the following equations:
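A standard formulation of these gates, written here in the usual notation (σ is the sigmoid function, W and b are weights and biases, ⊙ is the elementwise product; the symbols are conventional assumptions rather than reproduced from the source), is

$$f_t = \sigma\big(W_f [h_{t-1}, x_t] + b_f\big), \quad (1)$$
$$i_t = \sigma\big(W_i [h_{t-1}, x_t] + b_i\big), \quad (2)$$
$$\tilde{C}_t = \tanh\big(W_C [h_{t-1}, x_t] + b_C\big), \quad (3)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad (4)$$
$$o_t = \sigma\big(W_o [h_{t-1}, x_t] + b_o\big), \quad (5)$$
$$h_t = o_t \odot \tanh(C_t). \quad (6)$$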

The BiLSTM propagates the hidden-layer state along the time axis in both the "past to future" and "future to past" directions, as shown in Figure 2. The BiLSTM thus captures the transformation pattern of features on a bidirectional time axis. In the figure, LSTM1 and LSTM2 are the forward and backward LSTM models, respectively. The output ht at time t can be expressed as follows:
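In one common formulation (the combination weights $w_t$, $v_t$, and $b_t$ are conventional assumptions), the forward hidden state produced by LSTM1 and the backward hidden state produced by LSTM2 are combined as

$$h_t = w_t\, \overrightarrow{h}_t + v_t\, \overleftarrow{h}_t + b_t,$$

or, as in many implementations, simply concatenated as $h_t = [\overrightarrow{h}_t;\ \overleftarrow{h}_t]$.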

2.2. Reinforcement Learning

The mathematical basis for reinforcement learning is the Markov decision process (MDP), which consists of a state space, an action space, a state transition matrix, a reward function, and a discount factor. The MDP seeks a policy that allows the system to obtain the maximum cumulative reward. The state is a summary of the current environment; the state space is the set of all possible states, denoted S; an action is the decision made; the action space is the set of all possible actions, denoted A; the agent is the subject performing the action; and the policy function is the rule that controls the agent's action based on the observed state.

Agent-environment interaction (AEI) is the process in which the agent observes the state of the environment (s) and takes an action (a); the action changes the state of the environment, and the environment returns a reward (r) and a new state (s′) to the agent, as shown in Figure 3.

In this study, the MDP can be expressed as (S, O, A, P, r, γ, ρ0), where S is a set of continuous states and A is a set of continuous actions. P: S × A × S ⟶ ℝ is the transition probability function, r: S × A ⟶ ℝ is the reward function, γ is the discount factor, ρ0 is the initial state distribution, and O is the set of continuous partial observations corresponding to the states in S. In training, s0 is obtained by sampling from the initial state distribution ρ0. At each time step t, the agent observes the current environment state st ∈ S, takes the action at = π(st) according to the policy π: S ⟶ A, obtains the reward r(st, at), and reaches the new environment state st+1.

The goal of the agent is to maximize the expected return, as follows:
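In standard notation, with trajectories generated by the policy π from the initial state distribution ρ0 (the expectation form here is the conventional one), the objective is

$$J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right].$$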

The return is the discounted sum of future rewards, as follows:
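In the same notation, the return at time t is

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r(s_{t+k}, a_{t+k}).$$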

The Q function is defined as follows:
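Following the usual definition, the action-value function under policy π is

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[R_t \mid s_t, a_t\right] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\, r(s_{t+k}, a_{t+k}) \,\middle|\, s_t, a_t\right].$$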

In the partially observable case, the agent acts on partial observations, at = π(ot), where ot is the partial observation corresponding to the complete state st.

2.3. DDPG Model

The DDPG algorithm incorporates ideas from DQN and uses a deterministic policy function so that it performs well in high-dimensional continuous action spaces. The learning framework for deterministic policies follows the actor-critic approach, where the actor is the action policy and the critic is the evaluator, which estimates the value function using function approximation. The network structure of DDPG is shown in Figure 4.

DDPG uses two neural networks to represent the deterministic policy a = πθ(s) and the value function Qμ(s, a), with network parameters θ and μ. The policy network updates the behavioral policy of the agent, corresponding to the actor network in the actor-critic structure; the value network approximates the value function and provides gradient information for updating the policy network, corresponding to the critic network in the actor-critic structure. DDPG searches for an optimal policy πθ that maximizes the expected return, as follows:
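A common way to write this objective, assuming the state distribution ρβ induced by the behavior policy (the symbol ρβ is an assumption here), is

$$J(\theta) = \mathbb{E}_{s \sim \rho^{\beta}}\!\left[Q_{\mu}\big(s, \pi_{\theta}(s)\big)\right].$$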

The parameters of the policy network are updated along the policy gradient, as follows:
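By the deterministic policy gradient theorem, this update takes the standard form

$$\nabla_{\theta} J(\theta) \approx \mathbb{E}_{s \sim \rho^{\beta}}\!\left[\nabla_{a} Q_{\mu}(s, a)\big|_{a = \pi_{\theta}(s)}\; \nabla_{\theta} \pi_{\theta}(s)\right].$$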

The expected return after taking action a in state s and thereafter following policy π is as follows:
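Written in the Bellman expectation form (a standard identity rather than a formula reproduced from the source),

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma\, Q^{\pi}\big(s_{t+1}, \pi(s_{t+1})\big)\right].$$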

The value network is updated with the value-network-update method of DQN; namely, the loss function L(μ) is minimized to update the value network parameters, as shown in the following equations, where θ′ and μ′ denote the parameters of the target actor network and the target critic network, respectively. DDPG uses an experience replay mechanism to obtain training samples [38–41]. The gradient of the Q-value function with respect to the agent's action is passed from the critic network to the actor network, and the policy network is updated in the direction that increases the Q-value according to (16).
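A standard form of the target value and the critic loss, written in the usual DQN/DDPG notation over a minibatch of N samples (the indexing here is an assumption), is

$$y_i = r_i + \gamma\, Q_{\mu'}\big(s_{i+1}, \pi_{\theta'}(s_{i+1})\big), \qquad L(\mu) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q_{\mu}(s_i, a_i)\big)^{2}.$$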

2.4. BiLSTM-DDPG-Based Trading Strategy for Load Aggregators on PRD

The description of the variables of the BiLSTM-DDPG-based trading strategy for load aggregators on PRD is shown in Table 1. The BiLSTM-DDPG model processing steps for power markets are shown in Figure 5. The DDPG deep reinforcement learning with BiLSTM structure is based on the actor-critic network structure, shown in Figure 6.

For load aggregators, the main objective of participating in price demand response is to maximize the benefit of energy trading. The total benefit received by the load aggregator in the real-time market is given below, where RRT is the revenue from electricity sales in the real-time market, CRT is the cost of electricity purchased by the electricity seller in the real-time market, and CDA is the cost of electricity purchased in the day-ahead market. Each of these terms is computed from the corresponding price and quantity: the selling price and the amount of electricity sold in the real-time market at time t, the purchase price and the amount of electricity purchased in the real-time market at time t, and the purchase price and the amount of electricity purchased in the day-ahead market at time t.
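With illustrative notation for the prices λ and quantities q just described (these symbols are assumptions introduced here for readability, not the paper's original notation), the benefit can be written as

$$F = R_{RT} - C_{RT} - C_{DA},$$
$$R_{RT} = \sum_{t}\lambda^{\mathrm{RT,sell}}_{t}\, q^{\mathrm{sell}}_{t}, \qquad C_{RT} = \sum_{t}\lambda^{\mathrm{RT,buy}}_{t}\, q^{\mathrm{RT,buy}}_{t}, \qquad C_{DA} = \sum_{t}\lambda^{\mathrm{DA}}_{t}\, q^{\mathrm{DA}}_{t}.$$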

The input of the neural network is the state, and the output is the action value. The network consists of three fully connected layers; the first two layers are activated by the rectified linear unit (ReLU) function, and the third layer is a linear layer. The agent is built according to the logic of the pseudo-code: it obtains the reward values, iterates through the Bellman equation, and then performs gradient descent on the difference between the target network and the online network, where the target network is updated using the soft-update method. The parameters of the DDPG algorithm used in the case study are described in Table 2.
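To make the layer structure concrete, the following is a minimal sketch in TensorFlow/Keras (the framework used in the experiments below); the hidden-layer width and the helper names build_actor and build_critic are illustrative assumptions, not identifiers taken from the paper's code.

import tensorflow as tf

def build_actor(state_dim: int, action_dim: int, hidden: int = 64) -> tf.keras.Model:
    """Two ReLU-activated fully connected layers followed by a linear output layer."""
    inputs = tf.keras.Input(shape=(state_dim,))
    x = tf.keras.layers.Dense(hidden, activation="relu")(inputs)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    action = tf.keras.layers.Dense(action_dim, activation=None)(x)  # linear output
    return tf.keras.Model(inputs, action)

def build_critic(state_dim: int, action_dim: int, hidden: int = 64) -> tf.keras.Model:
    """Q(s, a): state and action are concatenated and passed through the same stack."""
    state_in = tf.keras.Input(shape=(state_dim,))
    action_in = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([state_in, action_in])
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    q_value = tf.keras.layers.Dense(1, activation=None)(x)  # linear output
    return tf.keras.Model([state_in, action_in], q_value)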

The DDPG variant that introduces the BiLSTM needs the temporal order of states during training, so the corresponding experience-pool data are saved as whole-episode sequences to provide experience data for subsequent updates of the actor and critic networks. The sequence of saved data is given below, where T is the number of steps per episode. When the number of time steps is a multiple of T, the historical data records are cleared and the empirical data are recorded anew. Observable history information and full-state history information can then be reconstructed from the empirical data, as follows:
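Following the usual recurrent actor-critic convention (the symbols below are assumptions consistent with the description above), the stored sequence and the reconstructed histories can be written as

$$\big(o_1, a_1, r_1, \ldots, o_T, a_T, r_T;\; s_1, \ldots, s_T\big),$$
$$h^{o}_{t} = (o_1, a_1, \ldots, a_{t-1}, o_t), \qquad h^{s}_{t} = (s_1, a_1, \ldots, a_{t-1}, s_t).$$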

The critic and actor networks are updated separately. Because BiLSTM is a time-series-based RNN, the critic and actor updates are performed with backpropagation through time (BPTT), as follows:
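One common way to write these updates over N sampled episodes of T steps each (the averaging and indexing here are assumptions consistent with the replay scheme described above) is

$$L(\mu) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\Big(y^{i}_{t} - Q_{\mu}\big(h^{i}_{t}, a^{i}_{t}\big)\Big)^{2}, \qquad \nabla_{\theta} J \approx \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_{a} Q_{\mu}\big(h^{i}_{t}, a\big)\Big|_{a = \pi_{\theta}(h^{i}_{t})}\, \nabla_{\theta}\pi_{\theta}\big(h^{i}_{t}\big).$$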

The pseudo-code of BiLSTM-DDPG is as follows:

Initialize the critic network Qμ(st, at) with parameters μ and the actor network πθ(ht) with parameters θ;
Initialize the target networks with μ′ ← μ and θ′ ← θ;
Initialize the experience replay buffer R;
for episode = 1, ..., M do
  Clear the history information h0 and c0;
  for t = 1, ..., T do
    Get the observation ot and the full state st from the environment;
    Update the history information ht;
    Generate the action at = πθ(ht);
  end for
  Store the empirical sequence into the experience pool R;
  Sample N episodes of experience from the experience pool R;
  Construct the partially observable history;
  Construct the full-state history;
  Calculate the target value of each sample;
  Update the critic network by minimizing L(μ);
  Update the actor network along the policy gradient;
  Update the target networks by soft update;
end for
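As a companion to the pseudo-code above, the following is a minimal TensorFlow sketch of one update step and of the soft update of the target networks; the function names, the hyperparameters tau and gamma, and the assumption that h is the BiLSTM-encoded history are illustrative choices, not the paper's implementation.

def soft_update(target: tf.keras.Model, online: tf.keras.Model, tau: float = 0.005) -> None:
    """Polyak-average the online weights into the target network."""
    for t_var, o_var in zip(target.variables, online.variables):
        t_var.assign(tau * o_var + (1.0 - tau) * t_var)

def train_step(actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, h, a, r, h_next, gamma=0.99):
    """One DDPG-style update on a batch; r is expected to have shape (batch, 1)."""
    # Critic target: reward plus the discounted target-critic value of the
    # target-actor's action for the next (history-encoded) state.
    target_q = r + gamma * target_critic([h_next, target_actor(h_next)])
    with tf.GradientTape() as tape:
        q = critic([h, a])
        critic_loss = tf.reduce_mean(tf.square(target_q - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))
    # Actor update: ascend the critic's estimate of the actor's own action.
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([h, actor(h)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))
    # After each step, the target networks track the online networks:
    # soft_update(target_actor, actor); soft_update(target_critic, critic)
    return critic_loss, actor_loss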

3. Results and Discussion

3.1. Experimental Settings

The experimental environment is as follows: Python 3.6.2, TensorFlow 2.0.0a GPU, Intel(R) Core(TM) CPU at 2.70 GHz, 64 bit, 8 GB of RAM, and NVIDIA GeForce 940MX.

The first dataset consists of hourly data from the New England electricity market (ISO-NE) in the United States, selected for the Connecticut region [42]. Real-time electricity price data are collected for 1,917 consecutive days from January 1, 2016, to March 31, 2022, at a frequency of once per hour, for a total of 46,008 time points.

The second dataset consists of half-hourly data from the Australian Energy Market Operator (AEMO), selected for the New South Wales region [43]. Real-time electricity price data are collected for 1,186 consecutive days from January 1, 2018, to September 30, 2021, at a frequency of once per half hour, for a total of 56,928 time points.

3.2. Experiment 1: Hourly Load Aggregator Trading Strategy in ISO-NE

In this experiment, the task is to predict a three-day trading strategy. Data from January 1, 2016, to March 28, 2022, are used as the training set; data from March 29, 2022, to March 30, 2022, as the validation set; and data from April 1, 2021, to April 3, 2021, as the test set.

The comparison of the profit curves for the hourly load aggregator trading strategy in ISO-NE is shown in Figure 7; the performance of buying and selling is shown in Table 3; the trading strategies from April 1, 2021, 0:00, to April 2, 2021, 4:30, in ISO-NE are shown in Figure 8. The overall evaluation from April 1, 2021, 0:00, to April 2, 2021, 4:30, in ISO-NE is shown in Table 4. The results demonstrate that the proposed method is more economical than DNN-DDPG, RNN-DDPG, and LSTM-DDPG, indicating that it has better convergence ability.

3.3. Experiment 2: Load Aggregator’s Trading Strategy Every Half Hour for 2 Days in AEMO

In this experiment, the task is to predict a half-hourly trading strategy over two days. Data from January 1, 2018, to September 27, 2021, are used as the training set; data from September 28, 2021, to September 29, 2021, as the validation set; and data from September 30, 2021, as the test set.

The comparison of the profit curves for the half-hourly load aggregator trading strategy in AEMO is shown in Figure 9; the performance of buying and selling is shown in Table 5. The trading strategies from April 1, 2021, 0:00, to April 2, 2021, 4:30, in AEMO are shown in Figure 10. The overall evaluation from April 1, 2021, 0:00, to April 2, 2021, 4:30, in AEMO is shown in Table 6. The results again demonstrate that the proposed method is more economical than DNN-DDPG, RNN-DDPG, and LSTM-DDPG, indicating that it has better convergence ability.

4. Conclusions

This study investigates deep reinforcement learning for load aggregators' trading strategies in the real-time electricity spot market. The proposed improved DDPG algorithm can be used for load aggregators' real-time load purchase and sale transactions in the electricity spot real-time market. The main work is as follows: (1) an improved BiLSTM-DDPG with better convergence ability is proposed to solve the problem that DDPG does not easily converge when the input dimension is too large; (2) deep reinforcement learning is introduced into the analysis of power purchase and sale strategies in the electricity spot market so that load aggregators can participate in demand response with better results; and (3) in the ISO-NE and AEMO cases, it is shown that, under the strategy produced by the proposed method, the load aggregator's participation in the price-responsive load is more economical than under DNN-DDPG, RNN-DDPG, and LSTM-DDPG.

The proposed algorithm can be applied to electricity market scenarios with large data volumes and stringent timeliness requirements, offering an approach for studying such optimisation problems. This study focuses on the load aggregator's purchase and sale model but does not study point-to-point users. Future research will combine transfer learning and federated learning to achieve distributed peer-to-peer transaction optimisation in the electricity retail market.

Data Availability

The data of the models and algorithms used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2016YFB0900100).