Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 4824072, 15 pages

http://dx.doi.org/10.1155/2016/4824072

## Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning

^{1}School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215000, China

^{2}School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500, China

^{3}Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu 210000, China

^{4}Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

^{5}College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215000, China

Received 29 May 2016; Revised 28 July 2016; Accepted 16 August 2016

Academic Editor: Leonardo Franco

Copyright © 2016 Shan Zhong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

To improve the convergence rate and the sample efficiency, two efficient learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with regularization), are proposed by combining the actor-critic algorithm with hierarchical model learning and planning. The hierarchical models consist of a local model and a global model, which are learned simultaneously with the value function and the policy and are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both the local model and the global model are applied to generate samples for planning: the former is used at each time step, but only if the state-prediction error does not surpass a threshold, while the latter is utilized at the end of each episode. Taking both models improves the sample efficiency and accelerates the convergence rate of the whole algorithm by fully utilizing the local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that they perform best in terms of convergence rate and sample efficiency.

#### 1. Introduction and Related Work

Reinforcement Learning (RL) [1–4], a framework for solving the Markov Decision Process (MDP) problem, targets generating the optimal policy by maximizing the expected accumulated rewards. The agent interacts with its environment and receives information about the current state at each time step. After the agent chooses an action according to the policy, the environment transitions to a new state while emitting a reward. RL can be divided into two classes: online and offline. Online methods learn by interacting with the environment, which easily incurs inefficient use of data and stability issues. Offline or batch RL [5], as a subfield of dynamic programming (DP) [6, 7], can avoid the stability issue and achieve high sample efficiency.

DP aims at solving optimal control problems, but it is implemented backward in time, making it offline and computationally expensive for complex or real-time problems. To avoid the curse of dimensionality in DP, approximate dynamic programming (ADP) has received much attention as a way to obtain approximate solutions of the Hamilton-Jacobi-Bellman (HJB) equation by combining DP, RL, and function approximation [8]. Werbos [9] introduced an approach for ADP also called adaptive critic designs (ACDs). ACDs consist of two neural networks (NNs), one for approximating the critic and the other for approximating the actor, so that DP can be solved approximately forward in time. Synonyms for ADP and ACDs include approximate dynamic programming, asymptotic dynamic programming, heuristic dynamic programming, and neurodynamic programming [10, 11].

The iterative nature of the ADP formulation makes it natural to design the optimal discrete-time controllers. Al-Tamimi et al. [12] established a heuristic dynamic programming algorithm based on value iteration, where the convergence is proved in the context of general nonlinear discrete systems. Dierks et al. [13] solved the optimal control of nonlinear discrete-time systems by using two processes, online system identification and offline optimal control training, without the requirement of partial knowledge about the system dynamics. Wang et al. [14] focused on applying iterative ADP algorithm with error boundary to obtain the optimal control law, in which the NNs are adopted to approximate the performance index function, compute the optimal control policy, and model the nonlinear system.

Extensions of ADP to continuous-time systems face the challenge of proving stability and convergence while ensuring that the algorithm remains online and model-free. To approximate the value function and improve the policy for continuous-time systems, Doya [15] derived a temporal difference (TD) error-based algorithm in the framework of HJB. Under a quadratic input performance measure, Murray et al. [16] developed a stepwise ADP algorithm in the context of HJB. Hanselmann et al. [17] put forward a continuous-time ADP formulation, where Newton's method is used in the second-order actor adaptation to achieve convergence of the critic. Recently, Bhasin et al. [18] built an actor-critic-identifier (ACI), an architecture that represents the actor, critic, and model by taking NNs as nonlinearly parameterized approximators, with the parameters of the NNs updated by a least-squares method.

All the aforementioned ADP variants utilize NNs as the function approximator; however, linearly parameterized approximators are usually preferred in RL, because they make it easier to understand and analyze the theoretical properties of the resulting RL algorithms [19]. Moreover, most of the above works do not learn a model online to accelerate the convergence rate and improve the sample efficiency. The actor-critic (AC) algorithm was first introduced in [20]; since then, many variants approximating the value function and the policy by linear function approximation have been widely used in continuous-time systems [21–23]. By combining model learning with AC, Grondman et al. [24] proposed an improved learning method called Model Learning Actor-Critic (MLAC), which approximates the value function, the policy, and the process model by LLR. In MLAC, the gradient of the next state with respect to the current action is computed to update the policy gradient, with the goal of improving the convergence rate of the whole algorithm. In their later work [25], LFA takes the place of LLR as the approximation method for the value function, the policy, and the process model. However, enormous numbers of samples are still required when such a process model is used only to update the policy gradient. Afterward, Costa et al. [26] derived an AC algorithm with a Dyna structure, called Dyna-MLAC, which approximates the value function, the policy, and the model by LLR as MLAC does. The difference is that Dyna-MLAC applies the model not only in updating the policy gradient but also in planning [27]. Though planning can improve the sample efficiency to a large extent, the model learned by LLR is only a local model, so the global information in the samples is still neglected.

Though the above works learn a model while learning the value function and the policy, only the local information in the samples is utilized. If the global information in the samples can be utilized reasonably, the convergence performance will improve further. Inspired by this idea, we establish two novel AC algorithms called AC-HMLP and RAC-HMLP (AC-HMLP with regularization). AC-HMLP and RAC-HMLP consist of two models, a global model and a local model. Both models incorporate the state transition function and the reward function for planning. The global model is approximated by LFA while the local model is represented by LLR. The local and global models are learned simultaneously at each time step. The local model is used for planning only if its prediction error does not surpass a threshold, while the global planning process is started at the end of each episode, so that the local and global information can be kept and utilized uniformly.

The main contributions of our work on AC-HMLP and RAC-HMLP are as follows:

(1) We develop two novel AC algorithms based on hierarchical models. Distinguishing them from previous works, AC-HMLP and RAC-HMLP learn a global model, in which the reward function and the state transition function are approximated by LFA. Meanwhile, unlike existing model learning methods [28–30], which represent a feature-based model, we directly establish a state-based model to avoid the error introduced by inaccurate features.

(2) As MLAC and Dyna-MLAC do, AC-HMLP and RAC-HMLP also learn a local model by LLR. The difference is that we design a useful error threshold to decide whether to start the local planning process. At each time step, the real next state is computed according to the system dynamics, whereas the predicted next state is obtained from LLR. The error between them is defined as the state-prediction error. If this error does not surpass the error threshold, the local planning process is started.

(3) The local model and the global model are used for planning uniformly. The local and global models produce local and global samples to update the same value function and policy; as a result, the number of real samples required decreases dramatically.

(4) Experimentally, the convergence performance and the sample efficiency are thoroughly analyzed. The sample efficiency, defined as the number of samples required for convergence, is analyzed. RAC-HMLP and AC-HMLP are also compared with S-AC, MLAC, and Dyna-MLAC in convergence performance and sample efficiency. The results demonstrate that RAC-HMLP performs best and AC-HMLP performs second best, and both of them outperform the other three methods.
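The error-gated local planning described in contribution (2) can be sketched as follows. The function names and the Euclidean error metric are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def state_prediction_error(predicted_next, real_next):
    """Euclidean distance between the model-predicted and observed next state."""
    return float(np.linalg.norm(np.asarray(predicted_next, dtype=float)
                                - np.asarray(real_next, dtype=float)))

def should_plan_locally(predicted_next, real_next, threshold):
    """Gate for the local planning process: plan only while the local
    model's prediction stays within the error threshold."""
    return state_prediction_error(predicted_next, real_next) <= threshold
```

With a threshold of 0.1, a prediction off by 0.05 would trigger local planning, while a prediction off by 1.0 would not.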

This paper is organized as follows: Section 2 reviews some background knowledge concerning MDP and the AC algorithm. Section 3 describes the hierarchical model learning and planning. Section 4 specifies our algorithms—AC-HMLP and RAC-HMLP. The empirical results of the comparisons with the other three representative algorithms are analyzed in Section 5. Section 6 concludes our work and then presents the possible future work.

#### 2. Preliminaries

##### 2.1. MDP

RL can solve problems modeled by an MDP. An MDP can be represented as a four-tuple (S, A, R, P):

(1) S is the state space; s_t ∈ S denotes the state of the agent at time step t.

(2) A represents the action space; a_t ∈ A is the action which the agent takes at time step t.

(3) R denotes the reward function. At time step t, the agent located at state s_t takes an action a_t, resulting in the next state s_{t+1} while receiving a reward r_t.

(4) P is defined as the transition function; P(s_{t+1} | s_t, a_t) is the probability of reaching the next state s_{t+1} after executing a_t at the state s_t.
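As a toy illustration of the four-tuple, the following sketch samples one transition from a hand-made two-state MDP; the states, actions, and tables are invented purely for illustration:

```python
import random

# Toy tabular MDP: P maps (state, action) to a list of
# (next_state, probability) pairs; R maps (s, a, s') to a reward.
P = {('s0', 'go'): [('s1', 1.0)], ('s1', 'go'): [('s0', 1.0)]}
R = {('s0', 'go', 's1'): 1.0, ('s1', 'go', 's0'): 0.0}

def step(state, action):
    """Sample the next state from P and look up the reward R(s, a, s')."""
    u, acc = random.random(), 0.0
    for nxt, p in P[(state, action)]:
        acc += p
        if u < acc:
            return nxt, R[(state, action, nxt)]
    nxt = P[(state, action)][-1][0]          # numerical safety fallback
    return nxt, R[(state, action, nxt)]
```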

A policy π: S → A is the mapping from the state space S to the action space A, where the exact form of the mapping depends on the specific domain. The goal of the agent is to find the optimal policy π* that maximizes the cumulative rewards. The cumulative rewards are the sum or the discounted sum of the received rewards, and here we use the latter.

Under the policy π, the value function V^π(s) denotes the expected cumulative rewards, which is shown as

V^π(s) = E_π[ ∑_{t=0}^{∞} γ^t r_t | s_0 = s ],

where γ ∈ [0, 1) represents the discount factor and s is the current state.
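The discounted cumulative reward underlying this value function can be computed directly from a finite reward sequence; this helper is a minimal sketch for illustration, not part of the paper:

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward: sum over t of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, three unit rewards discounted by gamma = 0.5 yield 1 + 0.5 + 0.25 = 1.75.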

The optimal state-value function is computed as

V*(s) = max_π V^π(s).

Therefore, the optimal policy at state s can be obtained by

π*(s) = arg max_a [ R(s, a) + γ ∑_{s'} P(s' | s, a) V*(s') ].

##### 2.2. AC Algorithm

The AC algorithm mainly contains two parts, the actor and the critic, which are stored separately. The actor and the critic are also called the policy and the value function, respectively. Actor-only methods approximate the policy and then update its parameters along the direction of performance improvement, with the possible drawback of large variance resulting from policy estimation. Critic-only methods estimate the value function by approximating a solution to the Bellman equation; the optimal policy is found by maximizing the value function. Unlike actor-only methods, critic-only methods do not search for the optimal policy in the policy space. They only estimate the critic to evaluate the performance of the actor; as a result, the near-optimality of the resulting policy cannot be guaranteed. By combining the merits of the actor and the critic, AC algorithms were proposed in which the value function is approximated and used to update the policy.

The value function and the policy are parameterized as V(s; θ) and π(s; ϑ), where θ and ϑ are the parameters of the value function and the policy, respectively. At each time step t, the parameter θ is updated as

θ_{t+1} = θ_t + α_c δ_t φ(s_t),

where δ_t = r_t + γV(s_{t+1}) − V(s_t) denotes the TD-error of the value function, φ(s_t) represents the feature of the value function, and α_c is the learning rate of the value function.
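A minimal sketch of this TD update for a linearly parameterized value function might look as follows (the function and argument names are illustrative):

```python
import numpy as np

def td_update(theta, phi_s, phi_s_next, reward, gamma, alpha_c):
    """One TD(0) step for a linear value function V(s) = theta . phi(s)."""
    delta = reward + gamma * (theta @ phi_s_next) - (theta @ phi_s)  # TD-error
    theta = theta + alpha_c * delta * phi_s   # gradient step along the feature
    return theta, delta
```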

Eligibility is a trick to improve convergence by assigning credit to previously visited states. At each time step t, the eligibility can be represented as

e_t = γλ e_{t−1} + φ(s_t),

where λ ∈ [0, 1] denotes the trace-decay rate.
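The trace update can be sketched as follows, assuming an accumulating trace over the value-function features:

```python
import numpy as np

def update_trace(e, phi_s, gamma, lam):
    """Accumulating eligibility trace: e <- gamma * lambda * e + phi(s)."""
    return gamma * lam * e + phi_s
```

With gamma = 0.9 and lambda = 0.8, a feature visited one step ago keeps a weight of 0.72 in the trace.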

By introducing the eligibility, the update for θ in (4) can be transformed into

θ_{t+1} = θ_t + α_c δ_t e_t.

The policy parameter ϑ can be updated by

ϑ_{t+1} = ϑ_t + α_a δ_t Δu_t ψ(s_t),

where ψ(s_t) is the feature of the policy, Δu_t is a random exploration term conforming to a zero-mean normal distribution, and α_a is the learning rate of the policy.
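A hedged sketch of this exploration-driven actor update follows; it assumes, as in standard AC variants of this family, that the TD-error scales the exploration term, so exploration that improved the outcome is reinforced:

```python
import numpy as np

def actor_update(vartheta, psi_s, delta, du, alpha_a):
    """Move the policy parameters along the exploration direction du,
    scaled by the TD-error delta and the policy feature psi(s)."""
    return vartheta + alpha_a * delta * du * psi_s
```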

S-AC (Standard AC algorithm) serves as a baseline for comparison with our methods and is shown in Algorithm 1. In Algorithm 1, the value function and the policy are approximated linearly, with TD used as the learning algorithm.