Abstract

Analysing the training process of a Reinforcement Learning (RL) system and deciding when to terminate it have always been key issues in training an RL agent. In this paper, a new approach based on State Entropy and Exploration Entropy is proposed to analyse the training process. State Entropy denotes the uncertainty of action selection at each state that the agent traverses, while Exploration Entropy denotes the action selection uncertainty of the whole system. The action selection uncertainty of a certain state or of the whole system reflects the degree of exploration and the stage of the learning process of the agent. Exploration Entropy therefore provides a new criterion to analyse and manage the training process of RL. Theoretical analysis and experimental results illustrate that the Exploration Entropy curve contains more information than the existing analytical methods.

1. Introduction

Reinforcement learning (RL) has become an efficient solution for Markov Decision Processes (MDPs), such as traffic self-adaptive control [1], smart grid prediction [2], and activity-travel patterns in a city [3]. Along with the development of artificial neural networks [4], RL has achieved great success in nonlinear and large-scale decision problems. At the same time [5], the learning performance of RL and the judgment of convergence for a given RL algorithm have always been core problems in training an agent. An agent learns its action selection strategy by interacting with the environment through trial and error and the acquisition of rewards [6]. The rewards corresponding to all state-action pairs are used to update the value function, from which the action selection probabilities are generated in each training round. During the training process, the state transition probability characterizes the environment, while the action selection probability keeps changing and therefore characterizes the learning process of the agent [7]. In order to obtain more information about the training process and to understand the learning problem more deeply, more effective analysis methods are needed. In particular, a new criterion is necessary to judge when to terminate the training process or whether one algorithm is better and faster than another.

Meanwhile, some studies have analysed the training process of RL based on the algorithm parameters and other factors that are important for the convergence rate [8]. The influence of the major parameters, algorithmic complexity, reward design, and training data on the convergence of RL is analysed in [9]. The storage complexity and exploration complexity are defined to analyse the complexity of RL for some complex problems, especially for quantum control systems [10]. However, these studies have only paid attention to the change of the convergence results introduced by parameter variation, instead of the learning process of the agent.

On the other hand, the concept of entropy, which describes the randomness or uncertainty of a physical system, has been introduced into information theory [11] as Information Entropy, which solved the problem of information quantification and is used to measure the amount of information in an information source. It has been successfully applied in many fields such as information processing, artificial intelligence, and statistics [12]. In the field of MDPs [8], entropy has been used to optimize decision results. In the field of RL, some entropy-based methods have also been studied in depth. In [13], the maximum causal entropy framework was extended to the discounted reward setting in inverse RL. A maximum entropy-based RL method [14] was also applied to speech recognition for telephone speech. Ramicic and Bonarini used an entropy-based prioritized sampling method to optimize experience replay for deep Q-learning [15]. Nevertheless, as far as we know, no results have been reported on analysing the learning process of MDPs based on entropy.

In addition, causal sparse Tsallis entropy regularization was applied to sparse Markov decision processes with RL [16]. The computation and estimation of general entropy functions, including the classical Shannon and Rényi entropies, in Markov chains are introduced in [17]. Estimators for the entropy of ergodic homogeneous Markov chains with countable state spaces were constructed in [18]. Girardin and Limnios defined an entropy rate and extended the Shannon-McMillan-Breiman theorem to a class of countable discrete-time semi-Markov processes [19]. Inspired by these studies, we focus on the change of the transition probability and the action selection probability during the whole training process of RL. The concept of entropy will be used to define a new measure that illustrates the training process of an RL agent.

Considering the value function or the action selection probability in RL, the agent has no knowledge about the environment before learning. The value function of each state cannot reflect real and accurate information about that state, and the action selection probability at each state is essentially random. The agent may find more than one possible path to the target, although most of them are not optimal for its mission. During the learning process, as the number of trial-and-error interactions increases, the value function reflects the environment and the mission more and more accurately. For each state, the action selection becomes more and more certain. Finally, the agent obtains one or more optimal policies for the mission through the training process. This means that the value function of an RL system carries potential information through the change of the uncertainty of action selection, which has not been exploited clearly until now. In [20–22], a strategy entropy-based algorithm was introduced to accelerate learning through self-adaptive learning rates, but the relationship between the strategy entropy and the learning process has not been explained clearly. In this paper, we define the concept of Exploration Entropy (EE), corresponding to each state and to the whole system, to explain the learning process of RL.

The rest of the paper is organized as follows. Section 2 introduces the concepts of RL and exploration strategies. In Section 3, the definition of Exploration Entropy for RL is introduced, and related issues are analysed. In Section 4, EE is applied to two typical RL systems to demonstrate its use in analysing the learning process. Conclusions are given in Section 5.

2. Preliminaries

2.1. Reinforcement Learning

An RL agent learns a mapping between the environment state space and the action space through interaction with the environment, which includes observing the system's state, selecting and executing actions, and receiving numerical rewards [23]. The mathematical basis of RL is the discrete-time finite-state MDP [24]. In general, a five-element tuple $\langle S, A(s), P(s'|s,a), r(s,a), V \rangle$ is used to define the MDP model for RL [25, 26]:
(i) $S$ is the state space
(ii) $A(s)$ is the action space for state $s$
(iii) $P(s'|s,a)$ is the probability of transition to state $s'$ when executing action $a$ at state $s$
(iv) $r(s,a)$ is the reward obtained from the environment after executing action $a$ at state $s$
(v) $V$ is a criterion function or objective function over all the rewards in the whole process

The whole process of an MDP consists of a series of state-action pairs $(s_0, a_0), (s_1, a_1), \ldots$, which is determined by the state transition probabilities. The state transition probability matrix, consisting of the entries $P(s'|s,a)$, reflects the inherent properties of the system under the executed actions. The policy $\pi$ of the agent is a sequence of action choices, i.e., a strategy for action selection based on the action selection probability $\pi(a|s)$. In RL, $\pi(a|s)$ is computed from a criterion function, such as a value function matrix. The criterion function, which encodes the knowledge and experience accumulated during the learning process, is used to select future actions.

The goal of RL is to learn an optimal policy $\pi^{*}$ with which the agent obtains the maximal accumulated reward from the initial state to the target state in an episode. In RL, the accumulated reward, called the value function [27], can be defined formally as
$V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s\right]$,
where $\gamma \in [0, 1)$ is the discount factor, which reflects the degree of influence of future states on the current action. According to the Bellman equation, the value function can be written as
$V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left[r(s,a) + \gamma V^{\pi}(s')\right]$,
where $V^{*}(s)$ is the state value function for the optimal policy $\pi^{*}$:
$V^{*}(s) = \max_{a} \sum_{s'} P(s'|s,a)\left[r(s,a) + \gamma V^{*}(s')\right]$.

During the training process, taking the temporal difference (TD) method as an example, the value function is updated as
$V(s) \leftarrow V(s) + \alpha\left[r + \gamma V(s') - V(s)\right]$,
where $\alpha \in (0, 1]$ is the learning rate.

In a general RL training process, when the system, including the environment and the agent, is at a certain state $s$, the agent obtains the state information by observing the environment and then chooses an action $a$ based on $\pi(a|s)$. The system transits to a new state $s'$ according to $P(s'|s,a)$ as a result of the executed action $a$. The agent receives a numerical reward $r$ which indicates how useful $a$ is for reaching the agent's target, and then updates $V$ using the reward and the other parameters.
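For readers who prefer code, the following Python sketch mirrors this interaction loop for one step, using the TD(0) update above; the environment interface env.step, the tabular value function V, and the policy table are hypothetical names introduced here for illustration only.

import random

def td_step(env, V, policy, s, alpha=0.1, gamma=0.9):
    """One interaction step: sample an action, observe (r, s'), update V(s).

    Assumes a hypothetical environment with env.step(s, a) -> (s_next, r, done)
    and a policy dict mapping s -> {action: probability}.
    """
    actions, probs = zip(*policy[s].items())
    a = random.choices(actions, weights=probs, k=1)[0]   # sample action from pi(.|s)
    s_next, r, done = env.step(s, a)                      # environment transition
    # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return s_next, done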

2.2. Exploration Strategy

The performance of an agent's policy can only be evaluated at the end of an episode, and the value function describes the long-term accumulated reward of all the actions that have been executed. The agent should try every action at every state in order to find the best action by comparing the accumulated rewards. More generally, the reward of an action at a certain state follows some probability distribution, so the agent should execute the action many times to estimate its average reward. How to select an action at a certain state based on the value function is therefore a key problem in the training process [28].

There is an exploration-exploitation dilemma in action selection based on the value function. In a training process, if an agent prefers the action with the maximal value for the current state, which is called exploitation, the training process may converge quickly; however, the agent may end up with a locally optimal policy rather than the globally optimal one. On the contrary, a preference for exploration increases the number of training episodes. To balance exploration and exploitation [26], several action selection strategies have been applied to RL, such as $\epsilon$-greedy and softmax. In the $\epsilon$-greedy method, with $\epsilon \in (0, 1)$, the actions at a certain state are divided into the one with the maximal value and the others: the agent selects the action with the maximal value with probability $1 - \epsilon$ and selects the other actions with probability $\epsilon$. In the softmax method, the action selection probabilities are computed from the value function so that actions with larger values receive larger probabilities. However, all of these methods set the action selection probabilities directly and simply from the value function without considering how the system changes.
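As a concrete illustration (not the exact formulas used in the cited work), the sketch below shows one common way to turn a row of Q-values into $\epsilon$-greedy or softmax action probabilities; the function names and default parameter values are our own choices.

import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Greedy action with probability 1 - epsilon; the remaining epsilon
    is shared uniformly among the other actions (one common variant)."""
    m = len(q_row)
    probs = np.full(m, epsilon / max(m - 1, 1))
    probs[np.argmax(q_row)] = 1.0 - epsilon
    return probs

def softmax_probs(q_row, tau=1.0):
    """Boltzmann/softmax: probabilities proportional to exp(Q / tau)."""
    z = np.asarray(q_row, dtype=float) / tau
    z -= z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()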

Therefore, whether the strategy converges and how fast it converges rely heavily on the action selection strategy and the parameters of RL, such as the discount factor $\gamma$, the learning rate $\alpha$, the reward values, and so on. However, these parameters are usually set according to experts' experience or ad hoc tricks that are rarely based on any mathematical principle. In this paper, we try to reveal the characteristics of the convergence process of RL by taking advantage of the Exploration Entropy defined below, which may provide a new angle for improving RL algorithms.

3. Exploration Entropy for RL

RL has been proved to be an effective method for MDPs [7]. However, little research has studied the training process itself [9]. In this section, we first give the definition of Exploration Entropy (EE) and formulate the reinforcement learning procedure with EE for Q-learning and probabilistic Q-learning. Then, we present general performance analysis methods for RL based on EE, including convergence analysis and termination conditions. Finally, EE is applied to the measurement of multiple optimal solutions of a given RL problem.

3.1. Exploration Entropy

The optimal policy of RL is represented by the probability distribution of actions at each state [29]. As a trial-and-error process, the rewards obtained by interacting with the environment are used to update the value function, and the probability distribution of actions, which represents the uncertainty of action selection, is calculated from the value function. The learning process is therefore essentially a process of reducing the uncertainty about which actions should be chosen at each state. Hence, the performance of RL and the balance between exploration and exploitation can also be described by the degree of uncertainty of action selection [22]. Here, we introduce the new notion of EE to measure this degree of uncertainty. Shannon entropy (i.e., the Shannon measure of uncertainty) has been widely used in information theory [11], where the amount of uncertainty is measured by a probability distribution function $p$ on a finite set [30]:
$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$,
where $X$ is the universal set, $x$ is an element of the finite set $X$, and $p(x)$ is the probability distribution function on $X$.

Similar to Shannon entropy, the concept of State Entropy (SE) is defined on the probability distribution of action selection to measure the uncertainty of action selection at a certain state $s$. The resulting function is [22]
$H(s) = -\sum_{a \in A(s)} \pi(a|s) \log_2 \pi(a|s)$.

For an RL system, the global uncertainty of action selection can be described by the Exploration Entropy $H_{E}$:
$H_{E} = \frac{1}{|S|} \sum_{s \in S} H(s)$,
where $S$ is the universal state set and $|S|$ is the number of states.
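A direct reading of these two definitions is given by the sketch below, which computes $H(s)$ for every state from an action-probability table and averages the results; the normalisation by the number of states follows our reconstruction of equation (7) and should be checked against the original formula.

import numpy as np

def state_entropy(action_probs):
    """H(s) = -sum_a pi(a|s) * log2 pi(a|s); zero-probability actions contribute 0."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def exploration_entropy(policy):
    """Average State Entropy over all states (policy: dict state -> prob vector)."""
    return sum(state_entropy(p) for p in policy.values()) / len(policy)

# Example: a 2-state policy, one uniform over 4 actions, one nearly deterministic.
policy = {"s0": [0.25, 0.25, 0.25, 0.25], "s1": [0.97, 0.01, 0.01, 0.01]}
print(exploration_entropy(policy))   # a value between 0 and 2 for 4 actions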

For an RL agent with $m$ actions at each state, it can be proved [11] that the SE (the uncertainty of action selection at a certain state) reaches its maximum $\log_2 m$ when all the action probabilities are equal to $1/m$. Moreover, when every SE is maximal, the EE is also maximal, because the denominator in (7) stays constant during the training of an RL system. The maximal EE means that the action selection probabilities of all states are uniform, which is the typical situation at initialization, without any prior knowledge about the environment. As the learning process proceeds, $H_{E}$ tends to decrease and reaches its minimum when the learning process converges and yields the optimal policy.
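For completeness, the bound quoted above follows from Jensen's inequality applied to the concave function $\log_2$:
$H(s) = \sum_{i=1}^{m} \pi_i \log_2 \frac{1}{\pi_i} \;\le\; \log_2\!\left(\sum_{i=1}^{m} \pi_i \cdot \frac{1}{\pi_i}\right) = \log_2 m$,
with equality if and only if $\pi_i = 1/m$ for all $i$.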

For example, when there are only four alternative actions at state $s$, the action probabilities are $(\pi_1, \pi_2, \pi_3, \pi_4)$. The maximum State Entropy of state $s$ is 2, obtained when all four action selection probabilities are equal to $1/4$. This indicates that when all the probabilities are equal, the uncertainty of action selection is at its maximum. The conclusion generalizes to states with $m$ action choices: for a state $s$ with $m$ action choices, the maximum SE is $\log_2 m$ and is obtained when all the action probabilities are equal to $1/m$. Formally,
$\max H(s) = -\sum_{i=1}^{m} \frac{1}{m} \log_2 \frac{1}{m} = \log_2 m$.

In this paper, the Exploration Entropy is computed in two fundamental RL algorithms, Q-learning (QL) with softmax and probabilistic Q-learning (PQL), shown in Algorithms 1 and 2, to analyse the RL training process. The softmax strategy is used to select actions in Algorithm 1. At a certain state, each action has a variable probability that is related to its Q-value and is used to calculate $H(s)$ and $H_{E}$. As the training process proceeds, the agent traverses nearly every state; as a result, a stable Q-table and a convergent $H_{E}$ are generated. The other algorithm, probabilistic Q-learning, is shown in Algorithm 2. The difference between the two algorithms lies in how the action selection probabilities are computed and updated. It is worth pointing out that the Exploration Entropy-based method does not work effectively with the standard $\epsilon$-greedy strategy, because the action selection probability is completely determined by the parameter $\epsilon$, which is set according to experts' experience or other tricks. The same phenomenon appears in Q-learning with the pure greedy algorithm.

Algorithm 1: Q-learning with softmax exploration.
Initialize $Q(s, a)$ arbitrarily
Initialize the policy $\pi(a|s)$
repeat (for each episode)
 Initialize $s$ and the parameters $\alpha$, $\gamma$, $\tau$
repeat (for each step of the episode)
  $a \leftarrow$ action selected with probability $\pi(a|s)$ for state $s$
  Take action $a$, observe reward $r$ and next state $s'$
  $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
  Update $\pi(a|s)$ with the softmax strategy
  Compute $H(s)$ and $H_{E}$
  $s \leftarrow s'$
until $s$ is the destination
until the learning process ends

Algorithm 2: Probabilistic Q-learning (PQL).
Initialize $Q(s, a)$ arbitrarily
Initialize the policy $\pi(a|s)$
repeat (for each episode)
 Initialize $s$ and the parameters $\alpha$, $\gamma$, $k$
repeat (for each step of the episode)
  $a \leftarrow$ action selected with probability $\pi(a|s)$ for state $s$
  Take action $a$, observe reward $r$ and next state $s'$
  $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
  Update $\pi(a|s)$ with the probabilistic update rule (step size $k$)
  Normalize $\pi(\cdot|s)$
  Compute $H(s)$ and $H_{E}$
  $s \leftarrow s'$
until $s$ is the destination
until the learning process ends
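As an executable counterpart of Algorithm 1, the following Python sketch runs tabular Q-learning with softmax exploration and records the Exploration Entropy after every episode. The environment interface (reset() returning an integer state index and step(a) returning (next_state, reward, done)), the hyperparameter values, and the temperature schedule are illustrative assumptions rather than the settings used in the experiments below.

import numpy as np

def softmax(q_row, tau):
    z = q_row / tau
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def state_entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def q_learning_with_ee(env, n_states, n_actions, episodes=1000,
                       alpha=0.1, gamma=0.9, tau=5.0, tau_min=0.1, tau_decay=0.995):
    """Tabular Q-learning with softmax exploration; returns Q and the EE curve."""
    Q = np.zeros((n_states, n_actions))
    ee_curve = []
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns an integer state index
        done = False
        while not done:
            probs = softmax(Q[s], tau)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)    # assumed step interface
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
        tau = max(tau_min, tau * tau_decay)  # gradually reduce the temperature
        # Exploration Entropy: average State Entropy over all states
        ee = np.mean([state_entropy(softmax(Q[i], tau)) for i in range(n_states)])
        ee_curve.append(ee)
    return Q, ee_curve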
3.2. Performance Analysis Using Exploration Entropy

In this subsection, the Exploration Entropy is used to analyse the training process of RL, which reflects some important characteristics of the system and of the exploration strategy.

3.2.1. Convergence Analysis in Training Process

It is known that the policy of the agent is made up of a series of state-action pairs, and an action $a$ is selected according to the state $s$ and the knowledge of the environment. At the beginning of the training process, for any state $s$ the value function is 0 and the reward of any state-action pair is unknown; the uncertainty of the policy is maximal because the agent has to select actions haphazardly. As training proceeds, more states are traversed, more actions are executed, and more rewards are obtained, and the value function is updated by equation (4). The agent can then choose actions that are clearly better than the others. The uncertainty of the policy decreases with the uncertainty of action selection: SE reflects how well a single state is understood, and EE, which aggregates all the SEs, reflects the degree of convergence of the policy.

3.2.2. Termination Conditions in Training Process

In the field of RL algorithms [9], the convergence rate is a major concern. The key indicators used to compare algorithms are the accumulated reward and the number of steps. However, in most algorithms with exploration strategies such as $\epsilon$-greedy or softmax [10], these two indicators are strongly influenced by the parameters of the action selection strategy, such as $\epsilon$, $\tau$, and so on. At the same time, for some complex MDPs it is almost impossible to obtain the optimal policy in limited time, and the aim of the RL algorithm is then to obtain a near-optimal solution. In such cases, curves based only on these two indicators cannot always reflect the true behaviour of the convergence process during training.

On the other hand, as the value function keeps being updated [7], the EE of the system is also constantly updated. As a key indicator of system uncertainty, EE makes it easy to estimate or identify the convergence time of a given RL algorithm.
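One simple way to turn EE into a stopping rule, sketched below under our own assumptions (the window length and tolerance are arbitrary), is to terminate training once the EE curve has stayed within a narrow band for a fixed number of episodes:

def ee_converged(ee_curve, window=50, tol=1e-3):
    """Return True when the last `window` EE values vary by less than `tol`."""
    if len(ee_curve) < window:
        return False
    recent = ee_curve[-window:]
    return (max(recent) - min(recent)) < tol

# inside the training loop:
#     if ee_converged(ee_curve):
#         break   # terminate training based on the Exploration Entropy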

3.3. Measurement of Multiple Optimal Solutions Using Exploration Entropy

Normally, there may be multiple optimal solutions for a given MDP. However, to the best of the authors' knowledge, no RL algorithm has addressed this issue. The Exploration Entropy provides an angle from which to analyse this problem.

For an RL system, if the system converges, the value function updating process converges to the Bellman equation, which means that the value function of each state no longer changes, and the agent can obtain the optimal policy by selecting actions greedily with respect to this static value function. If there is only one path from the start state to the target state, the agent selects, at every state, the single action with the maximal accumulated reward; according to the previous definitions, the SE of every state and the EE will both be 0 in this one-optimal-solution situation under ideal conditions. On the other hand, if there is more than one path to the target state, the agent may have several (take 2 as an example) equivalent actions with the same accumulated reward at some states. Based on the greedy principle, each of the two actions with the same maximal value is chosen with probability 0.5, and all other actions are selected with probability 0. The SE of such a state is 1, and the EE is obviously greater than 0. In short, an EE value greater than 0 after convergence indicates multiple optimal solutions.
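For instance, with two equally good actions of probability 0.5 each and all other actions at probability 0, the State Entropy of that state evaluates to
$H(s) = -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1$,
whereas a state with a single optimal action gives $H(s) = -1\cdot\log_2 1 = 0$; hence any strictly positive EE after convergence signals that at least one state retains more than one optimal action.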

Consider the simple grid MDP shown in Figure 1, where each square in the grid represents a state. At each state the agent has four actions to select: east, west, south, and north. The agent moves one step by executing an action and receives a reward at every state except state A; all actions at state A take the agent to state B and yield a special reward, as shown in Figure 1(a). After training the agent with a random strategy, in which the agent selects every action with the same probability at every state, the optimal policy and the convergent value function satisfying the Bellman equation are obtained, as shown in Figures 1(b) and 1(c).

In this simple path planning problem, it is obvious that there is more than one path for the agent to reach state A from an arbitrary state, as shown in Figure 1(b). With respect to the value function, taking state B as an example, the agent has two optimal actions, west and north, each selected with probability 0.5, so the SE of state B is 1. On the contrary, the SE of state 1 is 0 because the agent has only the single action east to execute. Summed over all the states, the Exploration Entropy is therefore greater than 0. Based on the previous analysis, when the value function converges, a nonzero Exploration Entropy reflects the existence of multiple optimal paths for the MDP.

4. Experiment Using Exploration Entropy

To test the proposed EE and related analysis methods, we carry out several groups of experiments using two typical RL control examples, i.e., a classical control problem of indoor robot navigation and a quantum control problem of two-level quantum systems.

4.1. Indoor Robot Navigation

Problem description: As shown in Figure 2, the environment is a maze. Each square represents a state belonging to the state set S, and the finite action set A contains 4 directions: up, down, left, and right. The agent should reach the Goal point from the Start point without passing through the Obstacle points, denoted G, S, and O in Figure 2. If the agent falls into an Obstacle, it receives a negative reward PUNISH, and if it reaches the destination, it receives a positive REWARD, 11 for PQL and 10 for Softmax. The experiment settings for all the algorithms are as follows: all Q-values are initialized to 0, and fixed values of the discount factor $\gamma$ and the learning rate $\alpha$ are used. For Q-learning with softmax, $\tau$ is initialized to 5.2 and then gradually reduced. For PQL, a fixed probability update step $k$ is used. In this section, the entropy-based analytical method is explained on the path planning problem with the two strategies shown in Algorithms 1 and 2.
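To make the setup concrete, a minimal grid-maze environment compatible with the earlier Q-learning sketch could look like the following; the grid size, the start/goal/obstacle coordinates, and the reward magnitudes below are placeholders rather than the values of Maze A or Maze B.

import numpy as np

class GridMaze:
    """Minimal maze: integer states on an H x W grid, 4 actions (up/down/left/right)."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, height=10, width=10, start=(0, 0), goal=(9, 9),
                 obstacles=frozenset(), reward=10.0, punish=-10.0, step_cost=0.0):
        self.h, self.w = height, width
        self.start, self.goal = start, goal
        self.obstacles = obstacles
        self.reward, self.punish, self.step_cost = reward, punish, step_cost

    def reset(self):
        self.pos = self.start
        return self.pos[0] * self.w + self.pos[1]          # integer state index

    def step(self, a):
        dr, dc = self.MOVES[a]
        r = min(max(self.pos[0] + dr, 0), self.h - 1)      # stay inside the grid
        c = min(max(self.pos[1] + dc, 0), self.w - 1)
        self.pos = (r, c)
        if self.pos in self.obstacles:
            return r * self.w + c, self.punish, True        # fell into an obstacle
        if self.pos == self.goal:
            return r * self.w + c, self.reward, True        # reached the goal
        return r * self.w + c, self.step_cost, False        # ordinary move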

After training for 1000 episodes, the experimental results are shown below. Figures 3 and 4 show the change of the Exploration Entropy and of the number of steps. To demonstrate the effectiveness of our method, we ran the experiment 10 times and then computed the average steps and average Exploration Entropy. As shown in Figure 3, there are two stages for the softmax strategy. The first stage runs from the first episode to the 200th episode, where the Q-table keeps updating and gradually converges, and the number of steps behaves in the same way. In the second stage, after the 200th episode, the Exploration Entropy and the number of steps are both basically stable, which indicates that the Q-table is basically stable and the convergence process is essentially complete. The PQL strategy behaves similarly, with the first stage lasting from the start to the 100th episode and the second stage beginning after the 100th episode. Similar results for Maze B are shown in Figure 4.

On the other hand, the convergence process of every state is shown intuitively in Figures 5–8 for Maze A and Maze B, respectively. Taking Figures 5 and 6 as an example, the depth of the grey colour of each square is proportional to the State Entropy value. Black squares represent a large SE value, which means that the action selection of the corresponding states is highly uncertain, while nearly white squares represent a relatively small entropy value, which means the action selection of these states is relatively clear. Obviously, as the iterations proceed, more and more squares become closer to white, which means their action selection becomes clearer and clearer, and finally the figures become stable. It is worth explaining why the Exploration Entropy is not 0 and why the two algorithms end the two experiments with different EE values (this does not affect the conclusion that EE can reflect the degree of convergence and act as a termination condition of the training process):
(a) There are obviously multiple optimal paths; therefore, the EE can only approach 0 theoretically.
(b) The number of training episodes may not be sufficient, so the entropy value may still be relatively large. Besides, the agent cannot traverse every state enough times in practice, and the number of times a certain state is explored differs between the two algorithms.
(c) The final experimental EE can be influenced to some extent by parameter settings such as the REWARD and PUNISH values. It is also affected by the way the action selection probability is computed.

It is noticeable that the Exploration Entropy and the number of steps have almost the same variation tendency; therefore, EE can reflect the stage of the training process. Besides, the entropy and the steps converge at nearly the same time, so EE can be used as a termination condition.

4.2. Quantum Control
4.2.1. Problem Description

Here, we consider the control problem of finite-dimensional (N-level) quantum systems [31]. Denote the eigenstates of the free Hamiltonian $H_0$ of an N-level quantum system as the set $D = \{|\phi_i\rangle\}_{i=1}^{N}$. The evolving state of a controlled quantum system can be expanded in these eigenstates as
$|\psi(t)\rangle = \sum_{i=1}^{N} c_i(t) |\phi_i\rangle$,
where the complex numbers $c_i(t)$ satisfy $\sum_{i=1}^{N} |c_i(t)|^2 = 1$. Introducing a control $u(t)$ acting on the system via a time-independent interaction Hamiltonian $H_1$, the state $|\psi(t)\rangle$ evolves according to the Schrödinger equation [32]:
$i\hbar \frac{d}{dt}|\psi(t)\rangle = \left[H_0 + u(t) H_1\right]|\psi(t)\rangle$,
where $\hbar$ is the reduced Planck constant and the matrices $A$ and $B$ correspond to $H_0$ and $H_1$, respectively.

The propagator is a unitary operator $U(t, t_0)$ such that, for any state $|\psi_0\rangle$, the state $U(t, t_0)|\psi_0\rangle$ is the solution at time $t$ of the Schrödinger equation with the initial condition $|\psi(t_0)\rangle = |\psi_0\rangle$. For convenience of representation, the propagator is written in a simplified form in the following. Furthermore, a control set is given in which each control corresponds to a unitary operator $U_j$. Then, a performance function is defined [33] in terms of the trace operator, the initial state, the target state, and the adjoint $U^{\dagger}$ of the propagator. Thus, the task of the learning control system can be transformed into finding a globally optimal control policy.

The learning control problem of a two-level quantum system is simple yet representative, which makes it an important benchmark [34–37]. Here, we focus on the spin-1/2 system, a typical two-level quantum system of both theoretical and practical importance. The state of a spin-1/2 quantum system can be written as
$|\psi\rangle = \cos\frac{\theta}{2}|0\rangle + e^{i\varphi}\sin\frac{\theta}{2}|1\rangle$,
where $\theta \in [0, \pi]$ and $\varphi \in [0, 2\pi)$. For this quantum control problem, the permitted controls are no control input, a positive pulse control, and a negative pulse control [28]. Figure 9 shows the effect of one-step control on the evolution of the quantum system. More specifically, each permitted control corresponds to a one-step unitary propagator acting on the state.
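Although the explicit propagator matrices are not reproduced here, a generic way to build one-step propagators for a spin-1/2 system is the matrix exponential of the drift-plus-control Hamiltonian. The sketch below assumes $\hbar = 1$, a $\sigma_z$ drift, a $\sigma_x$ control coupling, and an arbitrary time step, all of which are illustrative choices rather than the paper's exact model.

import numpy as np
from scipy.linalg import expm

# Pauli matrices (hbar = 1 for simplicity)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)

def propagator(u, dt=0.1, h0=0.5 * SZ, h1=0.5 * SX):
    """One-step unitary U = exp(-i (H0 + u * H1) dt) for control value u."""
    return expm(-1j * (h0 + u * h1) * dt)

# Three permitted controls: no input, positive pulse, negative pulse
U0, U_plus, U_minus = propagator(0.0), propagator(+1.0), propagator(-1.0)

psi0 = np.array([1.0, 0.0], dtype=complex)      # initial state |0>
psi1 = U_plus @ psi0                            # state after one positive pulse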

4.2.2. RL Control of Spin-1/2 Systems

The learning control objective is to drive the spin-1/2 system from the initial state to the target state while minimizing the number of control steps, using three-switch control with all the propagators and Bang-Bang control with only the pulse propagators, as shown in Figure 9. To solve the quantum control problem using RL, we make the following assumption: the state of the spin-1/2 system can be discretized into a finite state set, and a finite action (propagator) set is available. More specifically, the system state set $S$ is obtained by discretizing the two angular coordinates into a $30 \times 30$ grid, with a given initial state and target state. The three-switch control method uses all three propagators, while Bang-Bang control uses only the positive and negative pulse propagators.
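One possible discretisation consistent with this description (our interpretation of the $30 \times 30$ grid, not the original code) maps the two Bloch angles of the state to a single integer state index:

import numpy as np

def state_index(theta, phi, n_theta=30, n_phi=30):
    """Map Bloch angles (theta in [0, pi], phi in [0, 2*pi)) to one of n_theta*n_phi bins."""
    i = min(int(theta / np.pi * n_theta), n_theta - 1)
    j = min(int(phi / (2 * np.pi) * n_phi), n_phi - 1)
    return i * n_phi + j

def angles_from_state(psi):
    """Recover (theta, phi) from a normalised two-level state [c0, c1]."""
    c0, c1 = psi
    theta = 2 * np.arccos(np.clip(abs(c0), 0.0, 1.0))
    phi = (np.angle(c1) - np.angle(c0)) % (2 * np.pi)
    return theta, phi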

We apply Q-learning [7, 38] and PQL [28] with the two control methods (three-switch control and Bang-Bang control). The goal of all the RL algorithms is to find an optimal policy $\pi^{*}$, which corresponds to the optimal control in (12). The experiment settings for all these algorithms are as follows: all Q-values are initialized to 0, and fixed values of the discount factor $\gamma$ and the learning rate $\alpha$ are used. If the agent reaches the target state, it receives a positive reward; otherwise, it receives a smaller reward. For Q-learning with the $\epsilon$-greedy strategy, $\epsilon$ is initialized to 0.5 and then gradually reduced to 0. For PQL, a fixed probability update step $k$ is used.

4.2.3. Experimental Results of Three-Switch Control

Figure 10 illustrates the control effect of the RL algorithms (Q-learning and PQL): the two-level quantum system can be driven from the initial state to the target state by a learned control strategy consisting of a certain number of control steps, which demonstrates the effectiveness of RL algorithms in solving such quantum control problems.

For Q-learning, the learning process converges after about 200 episodes, while PQL requires about 150 episodes. In addition, the Exploration Entropy plot shows that, as the number of learning episodes increases, the EE exhibits a downward trend. Moreover, the EE curves of Q-learning and PQL converge at approximately 200 and 150 learning episodes, respectively, which is almost the same as the number of episodes at which the step curves of the two learning algorithms converge. As for the reason why the EE does not eventually converge to 0 (the theoretical value), it is probably because the agent has learned the control strategy without exploring all the states. It is also worth noting that the EE curve not only indicates the end of the agent's training (the convergence of the RL algorithm) from another angle, but also visually shows the agent's degree of convergence, which is mainly reflected in the change of the agent's exploration strategy.

4.2.4. Experimental Results of Bang-Bang Control

The learning control performance for RL algorithms is shown in Figure 11. We can see that RL algorithms are effective for solving this kind of quantum control problem because the two-level quantum system can be transferred from the initial state to the target state with a learned control strategy.

The learning process converges after about 150 episodes using Q-learning, while PQL needs about 120 episodes to find an optimal control sequence. Similarly, both the step convergence curve and the Exploration Entropy curve capture the learning process under the Bang-Bang control method, with the EE curve showing an overall downward trend. The reason why the EE curve does not converge to 0 at the end is the same as for the three-switch method: the agent learns the optimal control strategy without exploring all the states. It is not difficult to see that the EE curve not only indicates the degree of convergence during the agent's training process, but also provides a termination condition for the training.

5. Conclusion

In this paper, a new approach to analysing the training process of reinforcement learning is proposed, based on State Entropy and Exploration Entropy. The SE reflects the training level of a single state of the agent over a whole training process, while the EE carries more important information about the agent's training, including the degree of convergence and the termination condition of training. The experimental results on two typical problems, i.e., indoor navigation and quantum control in a simulation environment, illustrate the advantages of the proposed approach when it is used to analyse RL algorithms. In addition, it could be used to accelerate RL training, because the entropy-based method can judge the end of the training process more precisely. Our future work will focus on the extension of EE to continuous RL with deep neural networks and to multiagent RL systems [39].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (no. 2016YFD0702100), the National Natural Science Foundation of China (no. 71732003), and the Fundamental Research Funds for the Central Universities (no. 011814380035).