Mathematical Problems in Engineering

Volume 2015, Article ID 760459, 14 pages

http://dx.doi.org/10.1155/2015/760459

## Two-Phase Iteration for Value Function Approximation and Hyperparameter Optimization in Gaussian-Kernel-Based Adaptive Critic Design

^{1}School of Automation, China University of Geosciences, Wuhan, Hubei 430074, China

^{2}School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China

Received 7 January 2015; Accepted 26 May 2015

Academic Editor: Simon X. Yang

Copyright © 2015 Xin Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Adaptive Dynamic Programming (ADP) with a critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as the critic network, kernel-based adaptive critic design (ACD) was developed recently. Two essential issues arise for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It estimates the values of samples without requiring them to be revisited infinitely often, while the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundant samples. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of system performance by using a varying sample set.

#### 1. Introduction

Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems. It is well known as an important learning method in unknown or dynamic environments. Different from supervised and unsupervised learning, RL interacts with the environment through a trial-and-error mechanism and modifies its action policies to maximize the payoffs [1]. From a theoretical point of view, it is strongly connected with direct and indirect adaptive optimal control methods [2].

Traditional RL research focused on discrete state/action systems, in which states and actions take only a finite number of prescribed discrete values. The learning space grows exponentially as the number of states and the number of allowed actions increase, leading to the so-called curse of dimensionality (CoD) [2]. To mitigate the CoD problem, function approximation and generalization methods [3] are introduced to store the optimal value and the optimal control as functions of the state vector. Generalization methods based on parametric models, such as neural networks [4–6], have become a popular means of solving RL problems in continuous environments.

Research on RL in continuous environments for constructing learning systems for nonlinear optimal control has attracted considerable attention in the control community, because such systems can modify their policies based only on the value function, without knowing the model structure or parameters in advance. A family of RL techniques known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)) has received more and more research interest [7, 8]. ADP is based on the actor-critic structure, in which a critic assesses the value of the action or control policy applied by an actor, and the actor modifies its action based on that assessment. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].

ADP research has typically adopted multilayer perceptron neural networks (MLPNNs) for the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control that used neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].

Besides the benefits brought by NNs, ADP methods suffer from some problems related to the design of NNs. On one hand, the learning control performance depends greatly on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].

As we can see, it is difficult to evaluate the effectiveness of a parametric model when knowledge of the model's order or of the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need the model structure to be set in advance. Hence, kernel machines have been widely studied for nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both works focused on discrete state/action spaces and more or less lacked theoretical results on the obtained policies.

In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs in temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm addresses only value approximation, so it must be combined with actor-critic methods or other policy iteration methods to solve learning control problems.

An alternative approach employing GPs in RL is model-based value iteration or policy iteration, in which a GP model is used to model the system dynamics and represent the value function [25]. In [26], an approximate value-function-based RL algorithm named Gaussian process dynamic programming (GPDP) was presented, which builds a dynamic transition model, a value function model, and an action policy model, each using GPs. In this way, the sample set is adjusted to a reasonable shape, with high sample density near the equilibria or where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is bearable, is that the states in the sample set need to be revisited again and again in order to update their value functions. Since this condition is impractical in real implementations, it diminishes the appeal of the method.

Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on approximate linear dependence (ALD) analysis was used to sparsify the kernel machines. This method can overcome the difficulty of presetting the model structure in parametric models and realizes actor-critic learning online [27]. However, sample selection based on ALD analysis is an offline procedure that does not consider the distribution of the value function. Therefore, the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.

We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP can be viewed as a linear combination of the covariances between the new point and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, provided the values of the samples are known or at least can be assessed numerically. The sample set can then be adjusted online during critic-actor learning.

The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed to obtain the optimal control policy for a system whose dynamics are unknown a priori.

#### 2. Description of the Problem

In general, ADP is an actor-critic method that approximates the value functions and policies to achieve generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence, we give a brief introduction to both kernel-based ACDs and GPs, in order to derive a clear description of the theoretical problem.

##### 2.1. Kernel-Based ACDs

Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic, constructed by a kernel machine, is used to approximate the value function or its derivatives. The output of the critic is then used in the training process of the actor so that policy gradients can be computed. Once the actor converges, the optimal action policy mapping states to actions is described by this actor.

A traditional neural network based on a kernel machine and samples serves as the model of the value function, as the following equation shows, while a recursive algorithm such as KLSTD [29] serves as the value function approximation:

$$\hat{Q}(x) = \sum_{i=1}^{L} \alpha_i\, k(x, x_i), \quad (1)$$

where $x$ represents a state-action pair $(s, a)$; $\alpha_i$, $i = 1, \ldots, L$, are the weights; $x_i$ represents the selected state-action pairs in the sample set; and $k(\cdot, \cdot)$ is a kernel function.

The key to critic learning is the update of the weight vector $\alpha$. Value function modeling in a continuous state/action space is a regression problem. From the viewpoint of the model, the NN structure is a reproducing kernel Hilbert space spanned by the samples, in which the value function is expressed as a linear regression function. Since the samples form the basis of this Hilbert space, the selection of samples determines the VC dimension of the identification, as well as the performance of the value function approximation.
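
As a concrete illustration of (1), the following minimal Python sketch evaluates a Gaussian-kernel critic over a sample set. The isotropic kernel width and all names here are assumptions of the sketch; in practice the weight vector $\alpha$ would be produced by a recursion such as KLSTD [29].

```python
import numpy as np

def gaussian_kernel(x, xi, sigma_f=1.0, lam=1.0):
    # Squared-exponential kernel k(x, x_i) between two state-action vectors.
    d = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return sigma_f**2 * np.exp(-0.5 * (d @ d) / lam**2)

def critic_value(x, samples, alpha, sigma_f=1.0, lam=1.0):
    # Critic output of (1): Q(x) = sum_i alpha_i * k(x, x_i).
    return sum(a * gaussian_kernel(x, xi, sigma_f, lam)
               for a, xi in zip(alpha, samples))
```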

If only ALD-based kernel sparsification is used, sample selection considers only the approximate linear independence of the samples. It is then hard to evaluate how good the sample set is, because the selection does not consider the distribution of the value function, and the performance of the ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.

If the hyperparameters can be optimized online and the values w.r.t. the samples can be evaluated by an iterative algorithm, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values, there is a direct way to evaluate the validity of the sample set and to regulate it online. Thus, in this paper we turn to Gaussian processes to construct the criteria for samples and hyperparameters.

##### 2.2. GP-Based Value Function Model

For an MDP, the data samples $x_i$ and the corresponding values $Q_i$ can be collected by observing the MDP. Here $x$ is the state-action pair $(s, a)$, and $Q(x)$ is the value function defined as

$$Q(x_k) = E\left[ \sum_{i=0}^{\infty} \gamma^{i} r_{k+i} \right],$$

where $0 < \gamma < 1$ is the discount factor.

Given a sample set collected from a continuous dynamic system, $\{(x_i, Q_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{D}$ and $Q_i \in \mathbb{R}$, Gaussian regression with the covariance function shown in the following equation is a well-known modeling technique for inferring the underlying function:

$$k(x_p, x_q) = \sigma_f^{2} \exp\left( -\tfrac{1}{2} (x_p - x_q)^{\mathrm T} \Lambda^{-1} (x_p - x_q) \right),$$

where $\Lambda = \operatorname{diag}(\lambda_1^{2}, \ldots, \lambda_D^{2})$.

Assuming additive independent identically distributed Gaussian noise with variance $\sigma_n^{2}$, the prior on the noisy observations becomes

$$\operatorname{cov}(Q_p, Q_q) = k(x_p, x_q) + \sigma_n^{2}\, \delta_{pq},$$

where $\delta_{pq}$ is the Kronecker delta, so that in matrix form $K_y = K + \sigma_n^{2} I$ with $K_{pq} = k(x_p, x_q)$.

The parameters $\sigma_f$, $\lambda_1, \ldots, \lambda_D$, and $\sigma_n$ are the hyperparameters of the kernel and are collected in the vector $\theta$.

For an arbitrary input $x$, the predictive distribution of the function value is Gaussian with mean and variance given by

$$\bar{Q}(x) = k(x)^{\mathrm T} (K + \sigma_n^{2} I)^{-1} Q, \qquad \operatorname{var}(x) = k(x, x) - k(x)^{\mathrm T} (K + \sigma_n^{2} I)^{-1} k(x), \quad (6)$$

where $k(x) = [k(x_1, x), \ldots, k(x_n, x)]^{\mathrm T}$ and $Q = [Q_1, \ldots, Q_n]^{\mathrm T}$.
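
To make (6) concrete, the following sketch implements standard GP prediction; `np.linalg.solve` replaces the explicit inverse for numerical stability, and the function names are illustrative only.

```python
import numpy as np

def gp_predict(X, Q, x_star, kernel, sigma_n):
    # Predictive mean and variance at x_star from samples (X, Q); cf. (6).
    n = len(X)
    K = np.array([[kernel(X[p], X[q]) for q in range(n)] for p in range(n)])
    k_star = np.array([kernel(xi, x_star) for xi in X])
    Ky = K + sigma_n**2 * np.eye(n)
    alpha = np.linalg.solve(Ky, np.asarray(Q, dtype=float))  # (K + sigma_n^2 I)^{-1} Q
    mean = k_star @ alpha
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(Ky, k_star)
    return mean, var
```

Identifying `alpha` here with the weight vector of (1) is exactly the observation made next.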

Comparing (6) to (1), we find that if we let $\alpha = (K + \sigma_n^{2} I)^{-1} Q$, the neural network can also be regarded as a Gaussian regression. In other words, the critic network can be constructed on a Gaussian kernel machine, provided the following conditions are satisfied.

*Condition 1.* The hyperparameters of the Gaussian kernel are known.

*Condition 2.* The values w.r.t. all sample states are known.

With a Gaussian-kernel-based critic network, the sample state-action pairs and corresponding values are known. A criterion such as the comprehensive utility proposed in [26] can then be set up to refine the sample set online. At the same time, it is convenient to optimize the hyperparameters, so that the value function is approximated more accurately. Thus, besides the advantages brought by the kernel machine, a critic based on Gaussian kernels is better at approximating value functions.

Consider Condition 1. If the values w.r.t. the samples are known (note that this is indeed Condition 2), the common way to obtain the hyperparameters is evidence maximization, where the log-evidence is given by

$$L(\theta) = \log p(Q \mid X, \theta) = -\frac{1}{2} Q^{\mathrm T} K_y^{-1} Q - \frac{1}{2} \log |K_y| - \frac{n}{2} \log 2\pi.$$

It requires the calculation of the derivative of $L(\theta)$ w.r.t. each $\theta_i$, given by

$$\frac{\partial L}{\partial \theta_i} = \frac{1}{2} \operatorname{tr}\left( (\alpha \alpha^{\mathrm T} - K_y^{-1}) \frac{\partial K_y}{\partial \theta_i} \right),$$

where $\operatorname{tr}(\cdot)$ denotes the trace and $\alpha = K_y^{-1} Q$.
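
A short sketch of these two formulas follows; it assumes the caller supplies $K_y$ and the derivative matrix $\partial K_y / \partial \theta_i$ for one hyperparameter, since those depend on the chosen kernel parameterization.

```python
import numpy as np

def log_evidence_and_grad(Ky, dKy_dtheta_i, Q):
    # Log-evidence of the sample values Q and its gradient w.r.t. one
    # hyperparameter, given K_y = K + sigma_n^2 I and dK_y/dtheta_i.
    Q = np.asarray(Q, dtype=float)
    n = len(Q)
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ Q
    log_ev = (-0.5 * (Q @ alpha)
              - 0.5 * np.linalg.slogdet(Ky)[1]
              - 0.5 * n * np.log(2.0 * np.pi))
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - Ky_inv) @ dKy_dtheta_i)
    return log_ev, grad
```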

Consider Condition 2. For an unknown system, if $\theta$ is known, that is, the hyperparameters are known (note that this is indeed Condition 1), the update of the critic network reduces to the update of the values w.r.t. the samples by value iteration.

According to this analysis, the two conditions are interdependent: the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and hyperparameter optimization simultaneously. A direct way would be to update them alternately. Unfortunately, this is not reasonable, because the two processes are tightly coupled. For example, temporal-difference errors drive the value approximation, but a change in the weights simultaneously changes the hyperparameters; it is then difficult to tell whether a temporal-difference error is induced by the observations or by changes in the Gaussian regression model.

To solve this problem, a two-phase value iteration for the critic network is presented in the next section, and its convergence conditions are analyzed.

#### 3. Two-Phase Value Iteration for Critic Network Approximation

First, a proposition is given to describe the relationship between the hyperparameters and the sample value function.

*Proposition 1.* The hyperparameters $\theta$ are optimized by evidence maximization according to the samples and their Q values, where the log-evidence is given by

$$L(\theta) = -\frac{1}{2} Q^{\mathrm T} K_y^{-1} Q - \frac{1}{2} \log |K_y| - \frac{n}{2} \log 2\pi, \quad (10)$$

with $K_y = K + \sigma_n^{2} I$. It can be proved that, for arbitrary hyperparameters $\theta$ at which the Hessian of (10) is nonsingular, the stationarity condition of (10) defines an implicit, continuously differentiable function

$$\theta = \theta(Q). \quad (11)$$

The two-phase value iteration for the critic network is then described by the following theorem.

*Theorem 2.* Given the following conditions: (1) the system is bounded-input bounded-output (BIBO) stable; (2) the immediate reward $r$ is bounded; (3) the learning rates satisfy the standard stochastic approximation (Robbins-Monro) conditions; then the iteration process (12), in which phase 1 updates the hyperparameters by evidence maximization and phase 2 updates the sample values by temporal-difference learning, is convergent, where $k(x) = [k(x_1, x), \ldots, k(x_n, x)]^{\mathrm T}$ and $k^{-1}(x)$ is a kind of pseudoinverse of $k(x)$; that is, $k^{-1}(x)\, k(x) = 1$.

*Proof.* From (12), it is clear that the two phases comprise the update of the hyperparameters in phase 1, which is viewed as the update of the generalization model, and the update of the samples' values in phase 2, which is viewed as the update of the critic network.

The convergence of the iterative algorithm is proved using a stochastic approximation Lyapunov method.

With the notation defined in (13), equation (12) is rewritten as (14). Define the approximation errors as in (15), and further define the quantities in (16), so that (16) is reexpressed as (17). With these substitutions, the two-phase iteration takes the form of a stochastic approximation, namely (19). Define the Lyapunov candidate (20), in which the scale of the hyperparameters will be specified later; a shorthand for this functional is used in what follows.

Obviously (20) is positive definite; hence it is a Lyapunov functional. At time $t$, the conditional expectation of (20) is given by (21). It is easy to compute the first-order Taylor expansion of (21) as (22), in which the first term on the right of (22) is (23). Substituting (23) into (22) yields (24). Consider the last two terms of (24) first. If the immediate reward is bounded and an infinite-horizon discounted reward is applied, the value function is bounded.

If the system is BIBO stable, the states and actions remain within bounded subsets of the state and action spaces, whose dimensions are finite; hence the policy space is bounded.

According to Proposition 1, when the policy space is bounded, there exists a constant such that (26) holds. In addition, (27) holds. From (26) and (27), we obtain (28), and similarly (29). According to Lemma 5.4.1 in [30], the corresponding error terms vanish as the iteration proceeds.

Now we focus on the first two terms on the right of (24).

The first term is computed as (30). For the second term, when the state transition function is time invariant, (31) holds, and then we have (32); this inequality holds because the quantities involved are positive. Defining a weighted norm on the derivative of the vector field, we obtain (33). On the other hand, (34) holds, where the residual term is the value function error caused by the estimated hyperparameters.

Substituting (33) and (34) into (24) yields (35). Obviously, if inequality (36) is satisfied, there exists a positive constant such that the first two terms on the right of (24) are negative semidefinite. According to Theorem 5.4.2 in [30], the iterative process is convergent.

*Remark 3.* Let us examine the final convergence point. The equilibrium of the critic network satisfies the fixed-point condition of the value iteration over the sample set, and the equilibrium of the hyperparameters is the solution of the evidence maximization, $\partial L(Q, \theta) / \partial \theta = 0$.

*Remark 4.* It is clear that the selection of the samples is one of the key issues. Since all samples now have values according to the two-phase iteration, it is convenient, using an information-based criterion, to evaluate the samples and refine the set by placing relatively more samples near the equilibria or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.
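
The concrete criterion in the paper is the comprehensive utility of [26], which is not reproduced here; the sketch below only illustrates the general idea of refining the set with a predictive-variance score, and the `predict_var` helper and both thresholds are hypothetical.

```python
def refine_sample_set(samples, candidates, predict_var, tol_add=0.5, tol_del=1e-3):
    # Add candidate points that the current set explains poorly (high
    # predictive variance), e.g. near equilibria or steep value regions.
    for x in candidates:
        if predict_var(x, samples) > tol_add:
            samples.append(x)
    # Drop a sample if the remaining samples already predict it well.
    return [x for x in samples
            if predict_var(x, [s for s in samples if s is not x]) > tol_del]
```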

*Remark 5.* Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism that clamps the output of the system, even though the system is then no longer smooth.

Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on the ACD and, more importantly, the update process of the actor network does not affect the convergence of the critic network (though it may induce premature convergence or a local optimum), a gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to obtain the optimal actions w.r.t. the sample states, and an actor network based on Gaussian kernels is then generated from these optimal state-action pairs.

Up to now, we have built a critic-actor architecture, named Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span $N$ is introduced so that the hyperparameters are updated once every $N$ updates of the sample values. If $N > 1$, the two phases are asynchronous. From the viewpoint of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but reduces the computational cost of hyperparameter learning.
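
Algorithm 1 itself is not reproduced here; the skeleton below sketches the control flow implied by the text, with a phase-2 value update at every step and a phase-1 hyperparameter update every $N$ steps. The helper callables (`greedy_action`, `td_update`, `evidence_grad`) and the environment interface are assumptions of this sketch, standing in for the paper's equations.

```python
def gk_adp(env, samples, Q, theta, greedy_action, td_update, evidence_grad,
           episodes=100, steps=200, N=10, gamma=0.95, lr=0.01):
    # Two-phase GK-ADP loop; the phases are asynchronous when N > 1.
    for _ in range(episodes):
        x = env.reset()
        for t in range(steps):
            a = greedy_action(x, samples, Q, theta)  # optimum seeking on the critic
            x_next, r = env.step(a)
            Q = td_update(Q, samples, theta, (x, a), r, x_next, gamma)  # phase 2
            if t % N == 0:                           # phase 1, every N-th step
                theta = theta + lr * evidence_grad(samples, Q, theta)
            x = x_next
    return Q, theta
```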