Abstract

Solving reinforcement learning problems in continuous spaces with function approximation is currently a research hotspot of machine learning. When dealing with continuous-space problems, the classic Q-iteration algorithms based on lookup tables or function approximation converge slowly and have difficulty deriving a continuous policy. To overcome these weaknesses, we propose an algorithm named DFR-Sarsa(λ), based on double-layer fuzzy reasoning, and prove its convergence. In this algorithm, the first reasoning layer uses fuzzy sets of the state to compute continuous actions; the second reasoning layer uses fuzzy sets of the action to compute the components of the Q-value. These two fuzzy layers are then combined to compute the Q-value function over the continuous action space. Besides, the algorithm uses the membership degrees of the activation rules in the two fuzzy reasoning layers to update the eligibility traces. Applying DFR-Sarsa(λ) to the Mountain Car and Cart-pole Balancing problems, experimental results show that the algorithm not only can be used to obtain a continuous action policy, but also achieves better convergence performance.

1. Introduction

Reinforcement learning is a class of machine learning methods in which an agent obtains the maximum cumulative reward by interacting with the environment [1, 2]. If a reinforcement learning problem can be modeled as a Markov decision process (MDP), methods such as dynamic programming (DP), Monte Carlo (MC) methods, and temporal difference (TD) learning can be used to obtain an optimal policy.

Classic reinforcement learning methods are generally used for discrete state and action space problems, where each state value or state-action value is stored in a lookup table. This kind of method can effectively solve simple tasks, but not large, continuous-space problems. At present, the most common approach to this problem is to use function approximation methods to approximate the state value or action value function. The approximate function can generalize the learned experience from a subset of the state space to the entire state space, and an agent can choose the best action sequence through the function approximation [3, 4]. A variety of function approximation methods are currently applied to reinforcement learning problems. Sutton et al. proposed the gradient TD (GTD) learning algorithm [5], which combines TD algorithms with linear function approximation and introduces a new objective function related to Bellman errors. Sherstov and Stone proposed a linear function approximation algorithm based on online adaptive tile coding, whose effectiveness was verified experimentally [6]. Heinen and Engel used an incremental probabilistic neural network to approximate the value function in reinforcement learning, which can solve continuous state space problems well [7].

Reinforcement learning algorithms with the function approximation methods mentioned above usually converge slowly and generally can only be used to obtain discrete action policies [5–9]. By introducing prior knowledge, reinforcement learning algorithms based on fuzzy inference systems (FIS) not only can effectively accelerate the convergence rate, but also may obtain continuous action policies [10–12]. Horiuchi et al. put forward fuzzy interpolation-based Q-learning, which can solve continuous-space problems [13]. Glorennec and Jouffe combined FIS and Q-learning, using prior knowledge to build the global approximator, which can effectively speed up the convergence rate; however, the algorithm cannot be used to obtain a continuous action policy [14]. Fuzzy Sarsa, proposed by Tokarchuk et al., can effectively reduce the scale of the state space and accelerate the convergence rate, but it easily causes the “curse of dimensionality” when applied to multidimensional state-space problems [15]. Type-2 fuzzy Q-learning, proposed by Hsu and Juang, has strong robustness to noise, but its time complexity is relatively high and its convergence cannot be guaranteed [12].

Though the classic Q-iteration algorithms based on only one fuzzy inference system can be used for solving continuous action space problems, there is still a reason for their slow convergence: at each iteration step in the learning process, there might exist a state-action pair that corresponds to different Q-values due to the structure of the FIS. If the next iteration step needs to use the Q-value of this state-action pair to update the value function, the algorithm simply selects one Q-value at random, since there is no criterion for choosing the best one among the different Q-values, which hurts the learning speed. Because this situation may happen many times during learning, it greatly slows down the convergence rate.

To address the problem that classic Q-iteration algorithms based on lookup tables or a single fuzzy inference system converge slowly and cannot obtain continuous action policies, this paper proposes DFR-Sarsa(λ), namely Sarsa(λ) based on double-layer fuzzy reasoning, and proves its convergence theoretically. The algorithm has two fuzzy reasoning layers. Firstly, it takes states as the input of the first fuzzy reasoning layer and obtains continuous actions as the output. Secondly, the second fuzzy reasoning layer uses the actions obtained from the first layer as input and computes the Q-value component of each activation rule of the first layer. Finally, through the combination of the two fuzzy reasoning layers, the Q-values of the input states are obtained. Moreover, a new eligibility trace based on gradient descent is defined, which depends on the membership degrees of the activation rules in the two fuzzy reasoning layers. Applying DFR-Sarsa(λ) and other algorithms to the Mountain Car and Cart-pole Balancing problems, the results show that DFR-Sarsa(λ) not only can obtain a continuous action policy, but also has better convergence performance.

2. Backgrounds

2.1. Markov Decision Process

In the reinforcement learning framework, the process of interacting with the environment can be modeled as an MDP [16], and the MDP can be described as a quadruple $\langle S, A, R, P \rangle$, where (1) $S$ is the state set and $s_t \in S$ is the state at time $t$; (2) $A$ is the action set and $a_t \in A$ is the action that the agent takes at time $t$; (3) $R: S \times A \times S \to \mathbb{R}$ is the reward function; that is, after the agent takes action $a_t$ at time $t$, the current state transfers from $s_t$ to $s_{t+1}$, and the agent receives an immediate reward $r_{t+1}$ at the same time, where $r_{t+1}$ represents a random reward generated from a distribution with mean $R(s_t, a_t, s_{t+1})$; (4) $P: S \times A \times S \to [0,1]$ is the state transition function, where $P(s' \mid s, a)$ represents the probability of reaching $s'$ after taking action $a$ in state $s$.

The policy $\pi$ is a mapping from the state space $S$ to the action space $A$, $\pi: S \times A \to [0,1]$, which represents the probability that the agent selects action $a$ in state $s$. The policy $\pi$ is used to define the state value function ($V$-function) and the action value function ($Q$-function). The $V$-function satisfies
$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P(s' \mid s,a)\bigl[R(s,a,s') + \gamma V^{\pi}(s')\bigr], \tag{1}$$
and the $Q$-function satisfies
$$Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a)\Bigl[R(s,a,s') + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a')\Bigr]. \tag{2}$$

The objective of reinforcement learning is to find the optimal policy $\pi^*$, which satisfies $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all policies $\pi$. Under the optimal policy $\pi^*$, the optimal $V$-function and the optimal $Q$-function satisfy (3) and (4), respectively:
$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s,a)\bigl[R(s,a,s') + \gamma V^{*}(s')\bigr], \tag{3}$$
$$Q^{*}(s,a) = \sum_{s'} P(s' \mid s,a)\bigl[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\bigr]. \tag{4}$$

If $R$ and $P$ are known, DP is a good choice for obtaining the optimal action policy. However, if $R$ and $P$ are unknown, TD algorithms such as Q-learning or Sarsa can be used instead. Sarsa is an on-policy algorithm, and when the eligibility trace mechanism is introduced, it becomes the more efficient Sarsa(λ), which can effectively deal with the temporal credit assignment problem. Besides, Sarsa(λ) can be combined with function approximation to solve continuous state space problems.

Definition 1 is a constraint on bounded MDPs (mainly concerning the state space, action space, reward, and value function). Note that all algorithms in this paper satisfy this definition.

Definition 1 (bounded MDP). $S$ and $A$ are known finite sets; let $X$ represent the state-action set, that is, $X = S \times A$; then $X$ is also a finite set. The reward function satisfies $|R(s,a,s')| \le R_{\max}$. The bound factor of the MDP is $V_{\max} = R_{\max}/(1-\gamma)$, where $\gamma \in [0,1)$ is the discount factor. For all $s \in S$ and for all $a \in A$, $|V(s)| \le V_{\max}$ and $|Q(s,a)| \le V_{\max}$ hold.

2.2. Fuzzy Inference System

FIS is a system that can handle fuzzy information. Typically, it mainly consists of a set of fuzzy rules whose design and interaction are crucial to the FIS’s performance.

There are many types of fuzzy inference systems at present [17]; a simple type, named TSK-FIS, is described as follows:
$$R^{i}:\ \text{IF } x_1 \text{ is } A_1^{i} \text{ and } \dots \text{ and } x_n \text{ is } A_n^{i},\ \text{THEN } y^{i} = f^{i}(x), \tag{5}$$
where the first part is called the antecedent and the second part is called the consequent. $R^{i}$ denotes the $i$th rule in the rule base. $x = (x_1, \dots, x_n)$ is an $n$-dimensional input variable. $A_j^{i}$ is the fuzzy set in the $i$th fuzzy rule that corresponds to the $j$th dimension of the input variable; a membership function $\mu_{A_j^{i}}(\cdot)$ is usually used to describe it. $f^{i}(x)$ is a polynomial function of the input variable $x$. If the input is a vector, the output is also a vector. When $f^{i}(x)$ is a constant, the FIS is called a zero-order FIS.

When the FIS receives an exact input value $x = (x_1, \dots, x_n)$, we can calculate the firing strength of the $i$th rule (using the product T-norm) as
$$\varphi^{i}(x) = \prod_{j=1}^{n} \mu_{A_j^{i}}(x_j). \tag{6}$$
The firing strength is used to calculate the output of the FIS: taking each firing strength as a weight, multiplying it by the corresponding consequent, and summing up, we obtain the final output as
$$y = \frac{\sum_{i} \varphi^{i}(x)\, f^{i}(x)}{\sum_{i} \varphi^{i}(x)}. \tag{7}$$
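To make the inference step concrete, the following is a minimal sketch of a zero-order TSK-FIS in Python, assuming triangular membership functions and the product T-norm of (6)-(7); the names (`triangular`, `ZeroOrderTSK`) are illustrative, not taken from the paper.

```python
import numpy as np

def triangular(x, left, center, right):
    """Triangular membership: 1 at the core `center`, 0 outside (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

class ZeroOrderTSK:
    """Zero-order TSK-FIS: each rule has one fuzzy set per input dimension
    and a constant consequent theta[i]; see Eqs. (5)-(7)."""

    def __init__(self, rule_sets, theta):
        # rule_sets[i][j] = (left, center, right) of the set for dimension j of rule i
        self.rule_sets = rule_sets
        self.theta = np.asarray(theta, dtype=float)

    def firing_strengths(self, x):
        # Eq. (6): product T-norm over the input dimensions.
        return np.array([np.prod([triangular(xj, *sets_j)
                                  for xj, sets_j in zip(x, rule)])
                         for rule in self.rule_sets])

    def output(self, x):
        # Eq. (7): firing-strength-weighted average of the constant consequents.
        phi = self.firing_strengths(x)
        return float(phi @ self.theta / max(phi.sum(), 1e-12))
```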

TSK-FIS can be used for function approximation, approximating the objective function by updating the consequents of the fuzzy rules. In general, the approximation error is measured by the mean square error (MSE). When the FIS reaches its optimal approximation performance, the vector $\theta$, which consists of all rule consequents, satisfies
$$\theta^{*} = \arg\min_{\theta}\ \mathbb{E}\Bigl[\bigl(f(x) - \hat{f}(x;\theta)\bigr)^{2}\Bigr], \tag{8}$$
where $f(x)$ is the objective function and $\hat{f}(x;\theta)$ is its approximation.

3. DFR-Sarsa(λ)

3.1. The Update of Q-Value

Under the MDP framework, a two-layer fuzzy inference structure is constructed to approximate the Q-function. Figure 1 shows the framework using two-layer fuzzy reasoning to approximate the Q-function, where the inputs of FIS1 are states and the outputs are the continuous actions obtained by FIS1 through fuzzy reasoning; the inputs of FIS2 are the continuous actions obtained from FIS1, and the outputs are the components of the Q-value of those continuous actions. The two FIS layers are then combined to obtain the approximate Q-function of the continuous action $a$.

The main structure of the two-layer FIS is described as follows.

(1) The rule of FIS1 is given as follows:
$$R_s^{i}:\ \text{IF } s \text{ is } S^{i},\ \text{THEN } a = a_1^{i} \text{ with } q_1^{i} \text{ or } \dots \text{ or } a = a_m^{i} \text{ with } q_m^{i}, \tag{9}$$
where $s$ is the state and $a_j^{i}$ is the $j$th discrete action in the $i$th fuzzy rule. The action space is divided into $m$ discrete actions. $q_j^{i}$ is the component of the Q-value corresponding to the $j$th discrete action in the $i$th fuzzy rule. When the state is $s$, the firing strength of the $i$th rule is
$$\varphi^{i}(s) = \mu_{S^{i}}(s) = \prod_{j} \mu_{S_j^{i}}(s_j). \tag{10}$$
If $\varphi^{i}(s) > 0$, we call the $i$th rule an activation rule.

In the activation rule $R_s^{i}$, we select an action from the $m$ discrete actions by the $\epsilon$-greedy action selection policy according to the values $q_j^{i}$. The selected action is called the activation action, denoted by $a^{i}$. Therefore, by multiplying each activation action selected from FIS1 by its firing strength and summing up, we get the continuous action as follows:
$$a = \sum_{i} \varphi^{i}(s)\, a^{i}. \tag{11}$$

We call $a$ a continuous action because $a$ changes smoothly with the state $s$, which does not mean that any action in the action space can be selected in state $s$. To simplify (11), we normalize the firing strength as follows:
$$\bar{\varphi}^{i}(s) = \frac{\varphi^{i}(s)}{\sum_{k} \varphi^{k}(s)}, \tag{12}$$
so (11) can be written as
$$a = \sum_{i} \bar{\varphi}^{i}(s)\, a^{i}. \tag{13}$$
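As an illustration of how FIS1 turns per-rule discrete choices into one continuous action, here is a hedged sketch of (10)-(13); `phi` is assumed to come from a firing-strength routine like the one sketched in Section 2.2, and the array names (`q`, `actions`) are illustrative.

```python
import numpy as np

def fis1_continuous_action(phi, q, actions, epsilon, rng):
    """Select one activation action per activated rule with epsilon-greedy on q[i, :],
    then blend them with normalized firing strengths, Eqs. (12)-(13).

    phi     : (N,) firing strengths of the N rules for the current state, Eq. (10)
    q       : (N, m) consequent Q-value components q_j^i
    actions : (m,) NumPy array of the m discrete actions shared by every rule
    """
    phi_bar = phi / max(phi.sum(), 1e-12)          # Eq. (12)
    chosen = np.zeros(len(phi), dtype=int)
    for i in np.flatnonzero(phi):                  # only activation rules matter
        if rng.random() < epsilon:
            chosen[i] = rng.integers(len(actions))
        else:
            chosen[i] = int(np.argmax(q[i]))
    a = float(np.sum(phi_bar * actions[chosen]))   # Eq. (13)
    return a, chosen
```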

(2) The rule of FIS2 is given as follows:
$$R_a^{ij}:\ \text{IF } a \text{ is } A_j^{i},\ \text{THEN } q = q_j^{i}. \tag{14}$$

The construction of the fuzzy set $A_j^{i}$ depends on FIS1. The core of the fuzzy set $A_j^{i}$ is the $j$th action $a_j^{i}$ of the $i$th rule in FIS1, and its membership function is denoted by $\mu_{A_j^{i}}(a)$; the value $q_j^{i}$ in the consequent part of the rule equals the value $q_j^{i}$ in FIS1.

Setting the continuous action $a$ obtained from FIS1 as the input of FIS2 activates rules of FIS2. Through the fuzzy reasoning of FIS2, we get the Q-value component of the $i$th rule in FIS1 as follows:
$$Q^{i}(a) = \sum_{j=1}^{m} \mu_{A_j^{i}}(a)\, q_j^{i}. \tag{15}$$

In the same way as in (12), we normalize the membership function in (15):
$$\bar{\mu}_{A_j^{i}}(a) = \frac{\mu_{A_j^{i}}(a)}{\sum_{k=1}^{m} \mu_{A_k^{i}}(a)}, \tag{16}$$
so (15) can be written as
$$Q^{i}(a) = \sum_{j=1}^{m} \bar{\mu}_{A_j^{i}}(a)\, q_j^{i}. \tag{17}$$

From (17), we can get $Q^{i}(a)$, the Q-value component obtained by the $i$th activation rule of FIS1. Thus, when taking the continuous action $a$ in state $s$, the Q-value over all activation rules of FIS1 is given as follows:
$$Q(s,a) = \sum_{i} \bar{\varphi}^{i}(s)\, Q^{i}(a) = \sum_{i} \bar{\varphi}^{i}(s) \sum_{j=1}^{m} \bar{\mu}_{A_j^{i}}(a)\, q_j^{i}. \tag{18}$$

From (18) we can see that the Q-value depends on the fuzzy sets of the two-layer FIS and their shared consequent parameters $q_j^{i}$. Since the fuzzy sets are specified in advance from prior knowledge, they are not changed by the algorithm. In order to obtain a convergent Q-value, the consequent parameters of the FISs require updating until convergence.
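The two reasoning layers combine into a single Q-value as in (16)-(18). The sketch below assumes triangular action fuzzy sets whose cores are the discrete actions and assumes the same action partition is shared by all FIS1 rules; the helper names are hypothetical.

```python
import numpy as np

def action_memberships(a, actions):
    """Normalized membership of the continuous action a in the FIS2 sets A_j^i,
    assuming triangular sets centered on the discrete actions (cores), Eq. (16)."""
    mu = np.zeros(len(actions))
    for j, c in enumerate(actions):
        left = actions[j - 1] if j > 0 else c
        right = actions[j + 1] if j < len(actions) - 1 else c
        if left <= a <= c:
            mu[j] = 1.0 - (c - a) / max(c - left, 1e-12)
        elif c < a <= right:
            mu[j] = 1.0 - (a - c) / max(right - c, 1e-12)
    return mu / max(mu.sum(), 1e-12)

def q_value(phi_bar, mu_bar, q):
    """Eq. (18): Q(s,a) = sum_i phi_bar_i(s) * sum_j mu_bar_ij(a) * q_ij."""
    return float(phi_bar @ (q @ mu_bar))
```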

In order to minimize the approximation error of the FIS, that is, to make the parameter vector $\theta$ satisfy (8), the algorithm uses the gradient descent method to update the parameter vector as follows:
$$\theta_{t+1} = \theta_t + \alpha\bigl[r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)\bigr]\nabla_{\theta_t} Q_t(s_t, a_t), \tag{19}$$
where the bracketed part in (19) is the TD error. Setting $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$ and combining with the backward TD(λ) algorithm [1], we get
$$\theta_{t+1} = \theta_t + \alpha\,\delta_t\, e_t, \tag{20}$$
where $\alpha$ is a step-size parameter and $e_t$ is the eligibility trace vector at time $t$, which corresponds to the parameter vector $\theta_t$. It is updated as follows:
$$e_t = \gamma\lambda\, e_{t-1} + \nabla_{\theta_t} Q_t(s_t, a_t). \tag{21}$$
Equation (21) is a kind of accumulating trace [1], where $\gamma$ is the discount factor and $\lambda$ is the decay factor. $\nabla_{\theta_t} Q_t(s_t, a_t)$ represents the gradient vector obtained by taking the partial derivative of the Q-function with respect to each dimension of the parameter vector $\theta_t$ at time $t$ [1]. According to (18), the gradient value for each dimension $q_j^{i}$ of $\theta_t$ at time $t$ is
$$\frac{\partial Q_t(s_t, a_t)}{\partial q_j^{i}} = \bar{\varphi}^{i}(s_t)\,\bar{\mu}_{A_j^{i}}(a_t), \tag{22}$$
so (21) can be further expressed as
$$e_t(i,j) = \gamma\lambda\, e_{t-1}(i,j) + \bar{\varphi}^{i}(s_t)\,\bar{\mu}_{A_j^{i}}(a_t). \tag{23}$$
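A minimal sketch of the update in (19)-(23), assuming `phi_bar` and `mu_bar` are the normalized degrees produced as above for the visited state-action pair; the array names are illustrative.

```python
import numpy as np

def dfr_sarsa_update(q, e, phi_bar, mu_bar, delta, alpha, gamma, lam):
    """One backward-view TD(lambda) update of the consequent table q, Eqs. (20)-(23).

    q, e    : (N, m) consequent parameters q_j^i and their eligibility traces
    phi_bar : (N,) normalized FIS1 firing strengths for s_t, Eq. (12)
    mu_bar  : (m,) normalized FIS2 memberships of the taken action a_t, Eq. (16)
    delta   : TD error r + gamma * Q(s', a') - Q(s, a)
    """
    grad = np.outer(phi_bar, mu_bar)   # Eq. (22): dQ/dq_ij = phi_bar_i * mu_bar_ij
    e[:] = gamma * lam * e + grad      # Eq. (23): accumulating trace
    q[:] = q + alpha * delta * e       # Eq. (20): gradient-based TD update
    return q, e
```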

3.2. The Learning Process of DFR-Sarsa(λ)

In this section, DFR-Sarsa(λ) is proposed based on the algorithm Sarsa in literature [1] and the content of MDP in Section 2.1. DFR-Sarsa(λ) not only can solve reinforcement learning problems with continuous state and discrete action space, but also can solve problems with continuous state and continuous action space. Algorithm 1 describes the general process of DFR-Sarsa(λ).

(1) Initialize parameter vector $\theta \leftarrow 0$, eligibility trace vector $e \leftarrow 0$, discount factor $\gamma$, step-size parameter $\alpha$
(2) Repeat (for every episode):
(3)  $s \leftarrow$ initial state
(4)  According to (10), compute $\varphi^{i}(s)$, $i = 1, \dots, N$
(5)  According to the $\epsilon$-greedy policy, select activation action $a^{i}$, $i = 1, \dots, N$
(6)  According to (13), select action $a$ when state is $s$
(7)  According to (16), compute $\bar{\mu}_{A_j^{i}}(a)$, $i = 1, \dots, N$, $j = 1, \dots, m$
(8)  According to (17) and (18), compute $Q(s,a)$
(9)  Repeat (for each step of episode)
(10)   Update eligibility trace: $e(i,j) \leftarrow \gamma\lambda\, e(i,j) + \bar{\varphi}^{i}(s)\,\bar{\mu}_{A_j^{i}}(a)$, $i = 1, \dots, N$, $j = 1, \dots, m$
(11)   Take action $a$, receive next state $s'$ and reward $r$
(12)   $\delta \leftarrow r - Q(s,a)$
(13)   According to the $\epsilon$-greedy policy, select activation action $a^{i}$, $i = 1, \dots, N$
(14)   According to (13), select action $a'$ when state is $s'$
(15)   According to (16), compute $\bar{\mu}_{A_j^{i}}(a')$, $i = 1, \dots, N$, $j = 1, \dots, m$
(16)   According to (10), compute $\varphi^{i}(s')$, $i = 1, \dots, N$
(17)   According to (17) and (18), compute $Q(s',a')$
(18)   $\delta \leftarrow \delta + \gamma\, Q(s',a')$
(19)   $\theta \leftarrow \theta + \alpha\,\delta\, e$
(20)   $s \leftarrow s'$, $a \leftarrow a'$, $Q(s,a) \leftarrow Q(s',a')$
(21) Until $s$ is the terminal state
(22) Until the preset episode number or another terminal condition is met
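For readers who prefer code to pseudocode, the following skeleton mirrors Algorithm 1, reusing the hypothetical helpers sketched in Section 3.1 (`fis1_continuous_action`, `action_memberships`, `q_value`); `firing_strengths` is any routine implementing (10), and `env` is assumed to expose a `reset()`/`step(a)` interface, which is not part of the paper.

```python
import numpy as np

def run_dfr_sarsa(env, firing_strengths, n_rules, actions, episodes=1000,
                  alpha=0.05, gamma=0.99, lam=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    actions = np.asarray(actions, dtype=float)
    q = np.zeros((n_rules, len(actions)))            # consequents q_j^i (theta)
    for _ in range(episodes):                        # line (2)
        e = np.zeros_like(q)                         # fresh traces each episode
        s = env.reset()                              # line (3)
        phi = firing_strengths(s)                    # line (4), Eq. (10)
        phi_bar = phi / max(phi.sum(), 1e-12)
        a, _ = fis1_continuous_action(phi, q, actions, epsilon, rng)  # lines (5)-(6)
        mu_bar = action_memberships(a, actions)      # line (7)
        Q_sa = q_value(phi_bar, mu_bar, q)           # line (8)
        done = False
        while not done:                              # line (9)
            e = gamma * lam * e + np.outer(phi_bar, mu_bar)           # line (10)
            s_next, r, done = env.step(a)            # line (11)
            delta = r - Q_sa                         # line (12)
            phi = firing_strengths(s_next)           # line (16)
            phi_bar = phi / max(phi.sum(), 1e-12)
            a_next, _ = fis1_continuous_action(phi, q, actions, epsilon, rng)
            mu_bar = action_memberships(a_next, actions)              # line (15)
            Q_next = q_value(phi_bar, mu_bar, q)     # line (17)
            if not done:
                delta += gamma * Q_next              # line (18); no bootstrap at terminal
            q += alpha * delta * e                   # line (19)
            a, Q_sa = a_next, Q_next                 # line (20)
    return q
```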

3.3. Convergence Analysis

In the literature [18, 19], the convergence of on-policy TD(λ) with linear function approximation is analyzed in detail: when such an algorithm meets certain assumptions and lemmas, it converges with probability 1. Since DFR-Sarsa(λ) is exactly such an on-policy TD(λ) algorithm, its convergence can be established by verifying the assumptions and lemmas of literature [18]; therefore, only the key steps of the convergence proof are given here.

Assumption 2. The state transition function and reward function of MDP follow stable distributions.

Lemma 3. The Markov chain that DFR-Sarsa(λ) depends on is irreducible and aperiodic, and the reward and value function are bounded.

Proof. Firstly, we prove its irreducibility. According to the properties of Markov processes, a Markov process is irreducible if any two of its states can be reached from each other [20]. DFR-Sarsa(λ) is used for solving reinforcement learning problems that satisfy the MDP framework, and the MDP meets Definition 1. Thus, for any state $s$ of the MDP, there exists a state-action pair $(s', a)$ satisfying $P(s \mid s', a) > 0$, which indicates that state $s$ can be visited infinitely often. Therefore, each state can be reached from any other state, and the Markov chain of DFR-Sarsa(λ) is irreducible.
Secondly, we prove that it is aperiodic. For an irreducible Markov chain, if one of its states is aperiodic, the entire chain is aperiodic. In addition, if a state of the Markov chain has the property of autoregression (it can transition back to itself), that state is aperiodic [20]. For a state $s$ of the MDP, there must exist a state transition satisfying $P(s \mid s, a) > 0$, which indicates that state $s$ is autoregressive. From the above analysis, we conclude that the MDP is aperiodic. Therefore, the Markov chain that DFR-Sarsa(λ) depends on is aperiodic.
Finally, we prove that the reward and value function are bounded. Literature [1] shows that the value function is a discounted accumulated reward, satisfying $V^{\pi}(s) = E_{\pi}\bigl[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s\bigr]$. By Definition 1, the reward function is bounded and satisfies $|r_t| \le R_{\max}$, where $R_{\max}$ is a constant. Hence
$$|V^{\pi}(s)| \le \sum_{k=0}^{\infty} \gamma^{k} R_{\max} = \frac{R_{\max}}{1-\gamma}. \tag{24}$$
By inequality (24), we conclude that the value function is bounded.
In summary, Lemma 3 is proved.

Condition 1. For each membership function $\mu_{S^{i}}$, there exists a unique state $s^{i}$ such that $\mu_{S^{i}}(s^{i}) \ge \mu_{S^{i}}(s)$ for all $s \in S$, while the other membership functions are 0 at state $s^{i}$; that is, $\mu_{S^{k}}(s^{i}) = 0$ for all $k \ne i$.

Lemma 4. The basis functions of DFR-Sarsa(λ) are bounded, and the basis function vector is linearly independent.

Proof. Firstly, we prove that the basis functions are bounded. From $0 \le \bar{\varphi}^{i}(s) \le 1$ and $0 \le \bar{\mu}_{A_j^{i}}(a) \le 1$, we get
$$\|\psi_{ij}(s,a)\|_{\infty} = \bigl\|\bar{\varphi}^{i}(s)\,\bar{\mu}_{A_j^{i}}(a)\bigr\|_{\infty} \le 1, \tag{25}$$
where $\|\cdot\|_{\infty}$ represents the infinite norm. Since the basis function of DFR-Sarsa(λ) is $\psi_{ij}(s,a) = \bar{\varphi}^{i}(s)\,\bar{\mu}_{A_j^{i}}(a)$, from (25) we get that the basis functions of DFR-Sarsa(λ) are bounded.
Secondly, we prove the basis function vector is linearly independent. In order to make the basis function vector linearly independent, let the basis functions meet Condition 1 [21], where the function form is shown in Figure 4. From literature [21] we know that, when Condition 1 is met, the basis function vector is linearly independent.
The requirement in Condition 1 can be relaxed appropriately by making the membership degrees of the other fuzzy sets at state $s^{i}$ small values, for example, by using Gaussian membership functions with smaller standard deviations. Applying such membership functions to DFR-Sarsa(λ), experimental results show that DFR-Sarsa(λ) still converges, although its convergence cannot yet be proved theoretically in that case.
In summary, Lemma 4 is proved.

Lemma 5. The step-size parameter α of DFR-Sarsa(λ) satisfies
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty. \tag{26}$$

Proof. Set the step-size parameter of DFR-Sarsa(λ) to $\alpha_t = 1/t$, where $t$ is the time step. By the Newton power series expansion of the harmonic series, we get
$$\sum_{t=1}^{n} \alpha_t = \sum_{t=1}^{n} \frac{1}{t} = \ln n + C + \varepsilon_n, \tag{27}$$
where $C$ is Euler's constant and $\varepsilon_n \to 0$. Because $\ln n$ is an increasing and unbounded function, $\sum_{t=1}^{n} \alpha_t \to \infty$ when $n \to \infty$.
Consider the inequality
$$\sum_{t=1}^{n} \alpha_t^{2} = \sum_{t=1}^{n} \frac{1}{t^{2}} \le 2 - \frac{1}{n} < 2. \tag{28}$$
The inequality part in (28) can be proven by induction; thus $\sum_{t=1}^{\infty} \alpha_t^{2} < \infty$ is met when $n \to \infty$.
By (27) and inequality (28), we get that the step-size parameter of DFR-Sarsa(λ) satisfies (26); thus Lemma 5 is proved.

Theorem 6. Under the condition of Assumption 2, if DFR-Sarsa(λ) satisfies Lemma 3 to Lemma 5, the algorithm converges with probability 1.

Proof. Literature [18] gives the related conclusion that, under the condition of Assumption 2, when on-policy TD(λ) algorithms with linear function approximation meet certain conditions (Lemma 3 to Lemma 5), the algorithms converge with probability 1. DFR-Sarsa(λ) is just such an algorithm and it meets Assumption 2 and Lemma 3 to Lemma 5. So we get that DFR-Sarsa(λ) converges with probability 1.

4. Experiments

In order to verify DFR-Sarsa(λ)’s performance about the convergence rate, iteration steps after convergence, and the effectiveness of continuous action policy, we take two problems as experimental benchmarks: Mountain Car and Cart-pole Balancing. These two problems are classic episodic tasks with continuous state and action spaces in reinforcement learning, which are shown in Figures 2 and 3, respectively.

4.1. Mountain Car

Mountain Car is a representative problem with continuous state space, as shown in Figure 2. The underpowered car cannot accelerate directly up to the top of the right side, so it has to move back and forth more than once to get there. Modeling the task as an MDP, the state is a two-dimensional variable consisting of position and velocity; that is, $s = (x, v)$. The action $a$ is the force that drives the car horizontally, which is bounded in $[-1, 1]$. In this problem, the system dynamics are described as follows:
$$v_{t+1} = v_t + 0.001\,a_t - g\cos(3x_t), \qquad x_{t+1} = x_t + v_{t+1}, \tag{29}$$
where $x \in [-1.2, 0.5]$, $v \in [-0.07, 0.07]$, and $g = 0.0025$ is a constant related to gravity. In addition, the time step is 0.1 s and the reward function is as follows:
$$r_t = \begin{cases} 0, & x_{t+1} \ge 0.5, \\ -1, & \text{otherwise}. \end{cases} \tag{30}$$
Equation (30) is a punishment reward function, where $r_t$ denotes the reward received at time $t$.
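For reference, here is a sketch of one Mountain Car transition in the commonly used continuous-force formulation of [1]; the constants and bounds are the standard textbook values and should be treated as assumptions rather than as settings taken verbatim from this paper.

```python
import math

def mountain_car_step(x, v, a):
    """One transition of the continuous-action Mountain Car, cf. Eqs. (29)-(30).
    Constants follow the standard benchmark formulation (an assumption here)."""
    a = max(-1.0, min(1.0, a))                        # force bounded in [-1, 1]
    v = v + 0.001 * a - 0.0025 * math.cos(3.0 * x)    # velocity update
    v = max(-0.07, min(0.07, v))
    x = max(-1.2, min(0.5, x + v))                    # position update
    if x <= -1.2:
        v = 0.0                                       # inelastic left wall
    done = x >= 0.5                                   # goal reached
    reward = 0.0 if done else -1.0                    # punishment reward, Eq. (30)
    return (x, v), reward, done
```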

In the simulation, the number of episodes is set to 1000, and the maximum number of time steps in each episode is also set to 1000. The initial state of the car is $x_0 = -0.5$, $v_0 = 0$. When the car arrives at the destination ($x \ge 0.5$) or the number of time steps exceeds 1000, the episode is finished and a new one begins. The experiment ends after 1000 episodes.

In order to show the effectiveness of DFR-Sarsa(λ), we compare the algorithm with Fuzzy Sarsa proposed by Tokarchuk et al. [15], GD-Sarsa(λ) proposed by Sutton et al. [3], and Fuzzy Q(λ) proposed by Zajdel [22]. Additionally, the effect of eligibility trace on the convergence performance is also tested.

At present, there is no established way to select the parameters that give each of the four algorithms its best performance. In order to make the comparison more reasonable, the parameters shared by all four algorithms are set to the same values, while the parameters that are not shared by all four algorithms are set to the values used in the work where they first appeared.

We first set the parameters of DFR-Sarsa(λ): 20 triangular fuzzy sets with equidistant cores are used to partition each state variable, which results in 400 fuzzy rules. Similarly, eight triangular fuzzy sets with equidistant cores are used to partition the continuous action space, where the number of action fuzzy rules is 8. Set the other parameters, , , and . The form of the fuzzy partition in Fuzzy Sarsa is the same as in DFR-Sarsa(λ); its other parameters are set to , , and . GD-Sarsa(λ) uses 10 tilings of to divide the state space, where the parameters are set to the best experimental values given in literature [1]: , , , and . The form of the fuzzy partition in Fuzzy Q(λ) is also the same as in DFR-Sarsa(λ); its other parameters are set in accordance with literature [22] to , , , and .

DFR-Sarsa(λ), Fuzzy Sarsa, GD-Sarsa(λ), and Fuzzy Q(λ) are applied to Mountain Car. Figure 5 shows the average result over 30 independent simulation experiments. The x-coordinate indicates the number of episodes, and the y-coordinate represents the average number of time steps the car takes to drive from the initial state to the target. As can be seen from Figure 5, the convergence performance of DFR-Sarsa(λ) is better than those of the other three algorithms.

The detailed performance of the four algorithms is shown in Table 1 (the benchmark time is the average time of a single iteration of DFR-Sarsa(λ)).

In order to test the effectiveness of the proposed eligibility trace, DFR-Sarsa(λ) with eligibility trace and DFR-Sarsa without eligibility trace are both applied in Mountain Car. Figure 6 shows the convergence performance of these two algorithms. It can be seen that these two algorithms converge in the same average time steps, but the convergence speed of DFR-Sarsa(λ) is better than that of DFR-Sarsa.

4.2. Cart-Pole Balancing

Figure 3 shows a Cart-pole Balancing system, in which the cart can move left or right on a horizontal plane. A pole is hinged to the cart and can rotate freely within a certain angle range. The task is to move the cart horizontally so as to keep the pole standing within that range. Similarly modeling the task as an MDP, the state is a two-dimensional variable represented by the angle of the pole from the vertical, $\theta$, and the angular velocity of the pole, $\dot{\theta}$; that is, $s = (\theta, \dot{\theta})$. These two state variables are bounded (in rad and rad/s, resp.). The action is the force $F$ exerted on the cart, which ranges from −50 N to 50 N. In addition, the force is perturbed by a uniformly distributed noise force $\eta$. The system dynamics are described as
$$\ddot{\theta} = \frac{g\sin\theta - \alpha m l \dot{\theta}^{2}\sin(2\theta)/2 - \alpha\cos\theta\,(F + \eta)}{4l/3 - \alpha m l\cos^{2}\theta}, \tag{31}$$
where $g = 9.8$ m/s² is the acceleration of gravity, $m$ (kg) is the mass of the pole, $M$ (kg) is the mass of the cart, $l$ (m) is the length of the pole, and the constant $\alpha = 1/(m + M)$. The change of reward depends on the change of state. At each time step (0.1 s), when the angle of the pole from the vertical direction stays within the allowed range, a reward of 0 is received; when the angle exceeds that range, the reward is −1 and the episode ends.
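A hedged sketch of one Euler step of the pole dynamics in (31), assuming the widely used inverted-pendulum formulation with uniform force noise; the physical constants below are typical benchmark values and are assumptions, not the paper's exact settings.

```python
import math, random

def cart_pole_step(theta, theta_dot, force, dt=0.1,
                   g=9.8, m=2.0, M=8.0, l=0.5, noise=10.0):
    """One Euler step of the pole-angle dynamics, Eq. (31); the cart position is
    ignored, matching the two-dimensional state s = (theta, theta_dot)."""
    force = max(-50.0, min(50.0, force)) + random.uniform(-noise, noise)
    alpha = 1.0 / (m + M)
    theta_acc = (g * math.sin(theta)
                 - 0.5 * alpha * m * l * theta_dot ** 2 * math.sin(2.0 * theta)
                 - alpha * math.cos(theta) * force) \
                / (4.0 * l / 3.0 - alpha * m * l * math.cos(theta) ** 2)
    theta_dot = theta_dot + dt * theta_acc
    theta = theta + dt * theta_dot
    done = abs(theta) > math.pi / 2                   # pole has fallen
    reward = -1.0 if done else 0.0
    return (theta, theta_dot), reward, done
```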

The parameter settings in this example are similar to those in Section 4.1, so we only give the differences here: 12 equidistant triangular fuzzy sets are used to partition each state variable, which results in 144 fuzzy rules.

DFR-Sarsa(λ) and GD-Sarsa(λ) are each run in 30 independent simulations on Cart-pole Balancing; the results are shown in Figure 7, where the x-coordinate represents the number of episodes and the y-coordinate represents the average number of time steps. As can be seen from Figure 7, the convergence performance of DFR-Sarsa(λ) is also better than that of GD-Sarsa(λ).

The detailed performance of the two algorithms is shown in Table 2 (the benchmark time is the average time of a single iteration of DFR-Sarsa(λ)).

Figure 8 shows the results of GD-Sarsa(λ) and DFR-Sarsa(λ) on the Cart-pole Balancing task. GD-Sarsa(λ) is based on discrete action policies, while DFR-Sarsa(λ) is based on continuous action policies. From Figure 8 we can see that the continuous action policy obtained by DFR-Sarsa(λ) keeps the pole's angle within only a small range, while the discrete action policy obtained by GD-Sarsa(λ) makes the pole's angle vary over a large range. This indicates that the policies obtained by DFR-Sarsa(λ) are much more stable than those of GD-Sarsa(λ). Thus, DFR-Sarsa(λ) is more suitable for applications that require more stable policies.

5. Conclusions

To address the problem that classic reinforcement learning algorithms based on lookup tables or function approximation converge slowly and have difficulty obtaining continuous action policies, this paper presents DFR-Sarsa(λ), an algorithm with eligibility traces based on double-layer fuzzy reasoning. Firstly, the algorithm constructs two fuzzy reasoning layers to approximate the Q-function, which are associated with the state, the action, and the Q-value. Then, it uses the gradient descent method to update the eligibility traces and the consequents of the fuzzy rules in the two FISs. Applying the proposed algorithm and three other relatively recent algorithms to Mountain Car and the Cart-pole Balancing system, experimental results show that, compared with reinforcement learning algorithms using only one fuzzy inference system, our algorithm requires fewer episodes to converge, though it increases the time complexity; compared with algorithms based on lookup tables or other function approximation methods, DFR-Sarsa(λ) has better convergence performance and can obtain a continuous action policy.

The performance of DFR-Sarsa(λ) relies on the two-layer fuzzy inference systems, while the performance of fuzzy inference system mainly depends on the fuzzy sets and fuzzy rules. In this paper, the type of fuzzy sets and the number of rules are given as prior knowledge, and they are no longer changed during the learning process. In order to achieve a much better convergence performance, we will focus on using appropriate optimization algorithms to optimize the membership functions and adjust the fuzzy rules adaptively.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61070223, 61103045, 61070122, 61272005, and 61170314), the Natural Science Foundation of Jiangsu Province (BK2012616), the Foundation of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172012K04).