Abstract

Adaptive Dynamic Programming (ADP) with a critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as the critic network, kernel-based adaptive critic design (ACD) was developed recently. Two essential issues arise for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It is able to estimate the values of samples without requiring them to be revisited infinitely often, while the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundant ones. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of system performance by using a varying sample set.

1. Introduction

Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems. It is well known as an important learning method in unknown or dynamic environments. Different from supervised learning and unsupervised learning, RL interacts with the environment through a trial-and-error mechanism and modifies its action policies to maximize the payoffs [1]. From a theoretical point of view, it is strongly connected with direct and indirect adaptive optimal control methods [2].

Traditional RL research focused on discrete state/action systems, in which the state and action only take on a finite number of prescribed discrete values. The learning space grows exponentially as the numbers of states and allowed actions increase, which leads to the so-called curse of dimensionality (CoD) [2]. In order to mitigate the CoD problem, function approximation and generalization methods [3] are introduced to store the optimal value and the optimal control as functions of the state vector. Generalization methods based on parametric models, such as neural networks [4-6], have become a popular means of solving RL problems in continuous environments.

Currently, RL in continuous environments for constructing learning systems for nonlinear optimal control has attracted the attention of researchers and scholars in the control domain, because such a system can modify its policy based only on the value function, without knowing the model structure or its parameters in advance. A family of new RL techniques known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)) has received more and more research interest [7, 8]. ADPs are based on the actor-critic structure, in which a critic assesses the value of the action or control policy applied by an actor, and the actor modifies its action based on the assessment of values. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].

ADP research often adopts multilayer perceptron neural networks (MLPNNs) as the critic. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control which made use of neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].

Besides the benefits brought by NNs, ADP methods often suffer from problems arising in the design of NNs. On one hand, the learning control performance greatly depends on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, due to local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].

As we can see, it is difficult to evaluate the effectiveness of a parametric model when knowledge of the model's order or of the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need the model structure to be set in advance. Hence, kernel machines have been widely studied to realize nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both works focused on discrete state/action spaces and more or less lacked theoretical results on the obtained policies.

In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs to temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm only addresses value approximation, so it has to be combined with actor-critic methods or other policy iteration methods to solve learning control problems.

An alternative approach employing GPs in RL is the model-based value iteration or policy iteration method, in which GP models are used to model the system dynamics and represent the value function [25]. In [26], an approximated value-function-based RL algorithm named Gaussian process dynamic programming (GPDP) was presented, which builds a dynamic transition model, a value function model, and an action policy model, respectively, using GPs. In this way, the sample set is adjusted to a reasonable shape with high sample densities near the equilibria or the places where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is endurable, is that the states in the sample set need to be revisited again and again in order to update their values. Since this requirement is impractical in real implementations, it diminishes the appeal of employing this method.

Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on approximately linear dependence (ALD) analysis was used to sparsify the kernel machines. Obviously, this method can overcome the difficulty of presetting the model structure in parametric models and realize actor-critic learning online [27]. However, the selection of samples based on ALD analysis is carried out offline, without considering the distribution of the value function. Therefore, the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.

We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP can be viewed as a linear combination of the covariances between the new point and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, provided the values of the samples are known or at least can be assessed numerically. The sample set can then be adjusted online during critic-actor learning.

The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed in order to obtain the optimal control policy for a system whose dynamics are unknown a priori.

2. Description of the Problem

In general, ADP is an actor-critic method which approximates the value functions and policies in order to achieve generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence, we give a brief introduction to both kernel-based ACDs and GPs, in order to arrive at a clear description of the theoretical problem.

2.1. Kernel-Based ACDs

Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic constructed by the kernel machine is used to approximate the value functions or their derivatives. The output of the critic is then used in the training process of the actor so that policy gradients can be computed. When the actor finally converges, it describes the optimal policy mapping states to actions.

A neural network based on a kernel machine and a set of samples serves as the model of the value function, as the following equation shows, and a recursive algorithm such as KLSTD [29] serves as the value function approximation:
$$\hat{Q}(x) = \sum_{i=1}^{L} w_i k(x, x_i), \tag{1}$$
where $x$ and $x_i$ represent state-action pairs, $w_i$, $i = 1, \ldots, L$, are the weights, $x_i$, $i = 1, \ldots, L$, represent the selected state-action pairs in the sample set, and $k(\cdot, \cdot)$ is a kernel function.
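As a concrete illustration (not the authors' implementation), the following Python sketch evaluates a critic of the form (1), assuming a Gaussian kernel and toy samples; all names and numbers are placeholders.

```python
import numpy as np

def gaussian_kernel(x, xi, length_scales, sigma_f):
    """Squared-exponential kernel k(x, xi) with per-dimension length scales."""
    d = (x - xi) / length_scales
    return sigma_f ** 2 * np.exp(-0.5 * np.dot(d, d))

def critic_value(x, samples, weights, length_scales, sigma_f):
    """Kernel expansion of the value function: Q(x) = sum_i w_i * k(x, x_i)."""
    return sum(w * gaussian_kernel(x, xi, length_scales, sigma_f)
               for w, xi in zip(weights, samples))

# toy usage: three sample state-action pairs in a 2-D state, 1-D action space
samples = [np.array([0.1, 0.0, 0.5]), np.array([-0.2, 0.3, -1.0]), np.array([0.0, -0.1, 0.2])]
weights = [0.8, -0.3, 0.5]
print(critic_value(np.array([0.05, 0.1, 0.0]), samples, weights, np.ones(3), 1.0))
```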

The key to critic learning is the update of the weight vector $w$. The value function modeling in a continuous state/action space is a regression problem. From the viewpoint of the model, the NN structure is a reproducing kernel space spanned by the samples, in which the value function is expressed as a linear regression function. Obviously, since the samples serve as the basis of this Hilbert space, the sample selection determines the VC dimension of the identification, as well as the performance of the value function approximation.

If only ALD-based kernel sparsification is used, sample selection considers only the approximate linear independence of the samples. It is therefore hard to evaluate how good the sample set is, because the selection does not take the distribution of the value function into account; moreover, the performance of ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.

If the hyperparameters can be optimized online and the value function w.r.t. the samples can be evaluated by an iteration algorithm, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover, with approximated sample values, there is a direct way to evaluate the validity of the sample set, in order to regulate the set online. Thus, in this paper we turn to Gaussian processes to construct the criteria for samples and hyperparameters.

2.2. GP-Based Value Function Model

For an MDP, the data samples $x_i$ and the corresponding values $Q(x_i)$ can be collected by observing the MDP. Here $x_i$ is the state-action pair $(s_i, a_i)$, $i = 1, \ldots, L$, and $Q(\cdot)$ is the value function defined as
$$Q(x) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, x_0 = x\right],$$
where $0 < \gamma < 1$ is the discount factor and $r_t$ is the immediate reward.

Given a sample set collected from a continuous dynamic system, $S = \{(x_i, Q(x_i)) \mid i = 1, \ldots, L\}$, where $x_i \in \mathbb{R}^{d}$, Gaussian regression with the covariance function shown in the following equation is a well-known modeling technique to infer the function:
$$k(x_i, x_j) = \sigma_f^{2} \exp\!\left(-\tfrac{1}{2}(x_i - x_j)^{T} \Lambda^{-1} (x_i - x_j)\right),$$
where $\Lambda = \operatorname{diag}(l_1^{2}, \ldots, l_d^{2})$.

Assuming additive independent identically distributed Gaussian noise with variance $\sigma_n^{2}$, the prior on the noisy observations becomes
$$\operatorname{cov}(Q) = K(X, X) + \sigma_n^{2} I,$$
where $K(X, X)$ is the Gram matrix with entries $K_{ij} = k(x_i, x_j)$.

The parameters $l_1, \ldots, l_d$, $\sigma_f$, and $\sigma_n$ are the hyperparameters of the covariance function and are collected in the vector $\theta$.

For an arbitrary input $x_*$, the predictive distribution of the function value is Gaussian with mean and variance given by
$$\bar{Q}(x_*) = k(x_*, X)\bigl(K(X, X) + \sigma_n^{2} I\bigr)^{-1} Q, \tag{6}$$
$$\operatorname{var}\bigl(Q(x_*)\bigr) = k(x_*, x_*) - k(x_*, X)\bigl(K(X, X) + \sigma_n^{2} I\bigr)^{-1} k(X, x_*). \tag{7}$$
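For concreteness, the following sketch computes the predictive mean and variance (6) and (7) for a toy data set; the squared-exponential kernel and all numbers are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, length_scales, sigma_f):
    """Squared-exponential covariance matrix between row-wise inputs A and B."""
    A_s, B_s = A / length_scales, B / length_scales
    d2 = np.sum(A_s**2, 1)[:, None] + np.sum(B_s**2, 1)[None, :] - 2 * A_s @ B_s.T
    return sigma_f**2 * np.exp(-0.5 * d2)

def gp_predict(X, Q, x_star, length_scales, sigma_f, sigma_n):
    """Predictive mean and variance of the GP value model at x_star."""
    K = se_kernel(X, X, length_scales, sigma_f) + sigma_n**2 * np.eye(len(X))
    k_star = se_kernel(X, x_star[None, :], length_scales, sigma_f)   # shape (n, 1)
    alpha = np.linalg.solve(K, Q)                                    # (K + sigma_n^2 I)^{-1} Q
    mean = k_star.T @ alpha
    var = sigma_f**2 - k_star.T @ np.linalg.solve(K, k_star)         # k(x*, x*) = sigma_f^2
    return float(mean), float(var)

# toy usage: 4 samples in a 3-D state-action space
rng = np.random.default_rng(0)
X, Q = rng.normal(size=(4, 3)), rng.normal(size=4)
print(gp_predict(X, Q, np.zeros(3), np.ones(3), 1.0, 0.1))
```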

Comparing (6) to (1), we find that if we let $w = \bigl(K(X, X) + \sigma_n^{2} I\bigr)^{-1} Q$, the neural network can also be regarded as Gaussian regression. In other words, the critic network can be constructed based on a Gaussian kernel machine, if the following conditions are satisfied.

Condition 1. The hyperparameters of the Gaussian kernel are known.

Condition 2. The values w.r.t. all sample states are known.

With a Gaussian-kernel-based critic network, the sample state-action pairs and the corresponding values are known. Then a criterion such as the comprehensive utility proposed in [26] can be set up, in order to refine the sample set online. At the same time it is convenient to optimize the hyperparameters, in order to approximate the value function more accurately. Thus, besides the advantages brought by the kernel machine, a critic based on Gaussian kernels is better at approximating value functions.

Consider Condition 1. If the values $Q$ w.r.t. the samples $X$ are known (note that this is indeed Condition 2), the common way to obtain the hyperparameters is evidence maximization, where the log evidence is given by
$$\mathcal{L}(\theta) = \log p(Q \mid X, \theta) = -\tfrac{1}{2} Q^{T} K_Q^{-1} Q - \tfrac{1}{2} \log\lvert K_Q \rvert - \tfrac{L}{2} \log 2\pi,$$
with $K_Q = K(X, X) + \sigma_n^{2} I$. It requires the calculation of the derivative of $\mathcal{L}(\theta)$ w.r.t. each $\theta_i$, given by
$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta_i} = \tfrac{1}{2} \operatorname{tr}\!\left(\bigl(v v^{T} - K_Q^{-1}\bigr)\frac{\partial K_Q}{\partial \theta_i}\right),$$
where $\operatorname{tr}(\cdot)$ denotes the trace and $v = K_Q^{-1} Q$.
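A minimal sketch of this evidence computation is given below; it evaluates the log evidence and its gradient w.r.t. two of the (log) hyperparameters for a toy data set, assuming the squared-exponential kernel of Section 2.2 (all names and numbers are illustrative).

```python
import numpy as np

def log_evidence_and_grad(X, Q, length_scales, sigma_f, sigma_n):
    """Log marginal likelihood of the GP value model and its gradient
    w.r.t. log(sigma_f) and log(sigma_n); length-scale gradients are analogous."""
    n = len(X)
    A_s = X / length_scales
    d2 = np.sum(A_s**2, 1)[:, None] + np.sum(A_s**2, 1)[None, :] - 2 * A_s @ A_s.T
    K = sigma_f**2 * np.exp(-0.5 * d2)
    Ky = K + sigma_n**2 * np.eye(n)
    Ky_inv = np.linalg.inv(Ky)
    v = Ky_inv @ Q
    logev = -0.5 * Q @ v - 0.5 * np.linalg.slogdet(Ky)[1] - 0.5 * n * np.log(2 * np.pi)
    # d logev / d theta_i = 0.5 * tr((v v^T - Ky^{-1}) dKy/dtheta_i)
    W = np.outer(v, v) - Ky_inv
    grad_log_sf = 0.5 * np.trace(W @ (2 * K))                       # dKy/dlog(sigma_f) = 2K
    grad_log_sn = 0.5 * np.trace(W @ (2 * sigma_n**2 * np.eye(n)))  # dKy/dlog(sigma_n)
    return logev, grad_log_sf, grad_log_sn

rng = np.random.default_rng(1)
X, Q = rng.normal(size=(5, 3)), rng.normal(size=5)
print(log_evidence_and_grad(X, Q, np.ones(3), 1.0, 0.1))
```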

Consider Condition 2. For an unknown system, if $\theta$ is known, that is, the hyperparameters are known (note that this is indeed Condition 1), the update of the critic network is transferred to the update of the values w.r.t. the samples by using value iteration.

According to the analysis above, the two conditions are interdependent: the update of the critic depends on known hyperparameters, and the optimization of the hyperparameters depends on accurate sample values.

Hence we need a comprehensive iteration method to realize value approximation and hyperparameter optimization simultaneously. A direct way is to update them alternately. Unfortunately, this is not reasonable, because the two processes are tightly coupled. For example, temporal-difference errors drive the value approximation, but a change of the weights simultaneously causes a change of the hyperparameters; then it is difficult to tell whether a given temporal-difference error is induced by the observation or by the change of the Gaussian regression model.

To solve this problem, a two-phase value iteration for the critic network is presented in the next section, and the conditions for its convergence are analyzed.

3. Two-Phase Value Iteration for Critic Network Approximation

First a proposition is given to describe the relationship between hyperparameters and the sample value function.

Proposition 1. Suppose the hyperparameters $\theta$ are optimized by evidence maximization according to the samples $X$ and their $Q$ values, where the log evidence is given by
$$\mathcal{L}(\theta, Q) = -\tfrac{1}{2} Q^{T} K_Q^{-1} Q - \tfrac{1}{2} \log\lvert K_Q \rvert - \tfrac{L}{2} \log 2\pi, \tag{10}$$
with $K_Q = K(X, X) + \sigma_n^{2} I$. It can be proved that, for arbitrary hyperparameters $\theta$, if the Hessian $\partial^{2}\mathcal{L}/\partial\theta\,\partial\theta^{T}$ is nonsingular, then (10) defines an implicit, continuously differentiable function
$$\theta = \theta(Q).$$

Then the two-phase value iteration for critic network is described as the following theorem.

Theorem 2. Given the following conditions: (1) the system is bounded-input bounded-output (BIBO) stable; (2) the immediate reward $r$ is bounded; (3) the learning rates of the two phases satisfy the standard stochastic approximation conditions; then the two-phase iteration process (12), in which phase 1 updates the hyperparameters along the gradient of the log evidence and phase 2 updates the sample values through a kind of pseudoinverse of the kernel matrix, is convergent.

Proof. From (12), it is clear that the two phases include the update of the hyperparameters in phase 1, which is viewed as the update of the generalization model, and the update of the samples' values in phase 2, which is viewed as the update of the critic network.
The convergence of the iterative algorithm is proved based on the stochastic approximation Lyapunov method.
Define that Equation (12) is rewritten as Define approximation errors as Further define that Thus (16) is reexpressed as Let , and the two-phase iteration is in the shape of stochastic approximation; that is, Define where is the scale of the hyperparameters and will be defined later. Let represent for short.
Obviously (20) is a positive definite matrix, and , . Hence, is a Lyapunov functional. At moment , the conditional expectation of the positive definite matrix is It is easy to compute the first-order Taylor expansion of (21) as in which the first term on the right of (22) is Substituting (23) into (22) yields where , Consider the last two terms of (24) first. If the immediate reward is bounded and the infinite-horizon discounted reward is applied, the value function is bounded. Hence, .
If the system is BIBO stable, , , are the dimensions of the state space and action space, respectively, , , the policy space is bounded.
According to Proposition 1, when the policy space is bounded and , , there exists a constant , so that In addition, From (26) and (27), we know that where . Similarly According to Lemma  5.4.1 in [30], , and and .
Now we focus on the first two terms on the right-hand side of (24).
The first item is computed as For the second item , when the state transition function is time invariant, it is true that , . Then we have This inequality holds because of positive , , . Define the norm , where is the derivative matrix norm of the vector 1 and . Then Hence On the other hand, where is the value function error caused by the estimated hyperparameters.
Substituting (33) and (34) into yields Obviously, if the following inequality is satisfied, there exists a positive constant , such that the first two terms on the right of (24) satisfy According to Theorem 5.4.2 in [30], the iterative process is convergent; namely, .

Remark 3. Let us check the final convergence point. The equilibrium of the critic network satisfies the fixed point of the value iteration in (12), and the equilibrium of the hyperparameters is the solution of the evidence maximization of (10).

Remark 4. It is clear that the selection of the samples is one of the key issues. Since all samples now have values obtained by the two-phase iteration, it is convenient to evaluate the samples according to an information-based criterion and refine the set by placing relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.

Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system will then no longer be smooth.

Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on the ACD and, more importantly, the update of the actor network does not affect the convergence of the critic network (although it may induce premature or locally optimal solutions), a gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to obtain the optimal actions w.r.t. the sample states, and then an actor network based on Gaussian kernels is generated from these optimal state-action pairs, as sketched below.
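The following sketch shows one way such optimum seeking could be realized, assuming a finite grid over the bounded action space and a generic critic function; the dummy critic and grid resolution are purely illustrative.

```python
import numpy as np

def best_actions(sample_states, critic_q, action_grid):
    """For each sample state, pick the action that maximizes the critic value."""
    best = []
    for s in sample_states:
        q_vals = [critic_q(np.concatenate([s, np.atleast_1d(a)])) for a in action_grid]
        best.append(action_grid[int(np.argmax(q_vals))])
    return np.array(best)

# toy usage with a dummy critic that prefers actions close to -0.5 * (first state dim)
critic_q = lambda x: -(x[-1] + 0.5 * x[0]) ** 2
states = [np.array([0.1, 0.0]), np.array([-0.2, 0.3])]
actions = np.linspace(-3.0, 3.0, 61)          # bounded action space, step 0.1
optimal_pairs = list(zip(states, best_actions(states, critic_q, actions)))
print(optimal_pairs)
# these (state, best action) pairs would then serve as training data for a
# Gaussian-kernel actor network, e.g. via the GP regression sketched in Section 2.2
```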

Up to now we have built a critic-actor architecture, which is named Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP for short) and shown in Algorithm 1. A span $N$ is introduced, so that the hyperparameters are updated once every $N$ value updates. If $N > 1$, the two phases are asynchronous. From the viewpoint of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but reduces the computational cost of hyperparameter learning.

Initialize:
    theta: hyperparameters of the Gaussian kernel model
    S: sample set
    pi: initial policy
    alpha, beta: learning step sizes
Let t = 0;
Loop:
    For k = 1 to N
        t = t + 1;
        Apply an action selected by the exploration policy
        Get the reward
        Observe the next state
        Update the sample values Q according to (12)
        Update the policy according to optimum seeking
    End For
    Update the hyperparameters theta according to (12)
Until the termination criterion is satisfied
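For readability, the control flow of Algorithm 1 can be sketched as follows; the environment interface, the update callbacks standing in for the two halves of (12), and the exploration routine are placeholders rather than the paper's exact implementation.

```python
def gk_adp(env, theta, q_values, alpha, beta, span_N, n_outer,
           phase2_update, phase1_update, explore_action):
    """Skeleton of the asynchronous two-phase loop of Algorithm 1.

    phase2_update and phase1_update stand in for the two halves of iteration (12):
    phase 2 adjusts the sample values Q, phase 1 adjusts the hyperparameters theta.
    """
    state = env.reset()
    for _ in range(n_outer):
        for _ in range(span_N):                        # phase 2: N value updates
            action = explore_action()                  # random exploration during learning
            next_state, reward = env.step(state, action)
            transition = (state, action, reward, next_state)
            q_values = phase2_update(q_values, theta, transition, alpha)
            state = next_state
        theta = phase1_update(q_values, theta, beta)   # phase 1: hyperparameter update
    return q_values, theta
```

Only the loop structure and the asynchrony controlled by span_N are illustrated here; the actual updates of the sample values and the hyperparameters are supplied by the two callbacks.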

4. Simulation and Discussion

In this section, we present numerical simulations of continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, the particular properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.

Before further discussion, we first give the common setup used in all simulations:
(i) The span N is fixed across simulations.
(ii) To keep the output bounded, once a state goes out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is not considered here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, with its kernel parameters and threshold set empirically (a minimal sketch of the ALD test is given after this list).
(v) The sampling time and control interval are set to 0.02 s.
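For reference, a minimal sketch of the ALD test used to build the initial sample set is shown below (cf. item (iv)); the Gaussian kernel, the threshold nu, and the random data stream are illustrative assumptions rather than the settings of [15].

```python
import numpy as np

def ald_sparsify(candidates, kernel, nu):
    """Approximate-linear-dependence test: keep a candidate only if it cannot be
    approximated (within tolerance nu) by a linear combination of kept samples.
    This naive version rebuilds the Gram matrix at every step, for clarity only."""
    dictionary = [candidates[0]]
    for x in candidates[1:]:
        K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_vec = np.array([kernel(a, x) for a in dictionary])
        c = np.linalg.solve(K + 1e-9 * np.eye(len(dictionary)), k_vec)
        delta = kernel(x, x) - k_vec @ c      # residual of the kernel-space projection
        if delta > nu:
            dictionary.append(x)
    return dictionary

kernel = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
rng = np.random.default_rng(2)
stream = rng.uniform(-1, 1, size=(200, 3))    # stand-in for visited state-action pairs
print(len(ald_sparsify(list(stream), kernel, nu=0.1)))
```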

4.1. The Necessity of Two-Phase Learning

The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we propose a simulation to show how the learning rates in phases 1 and 2 affect the performance of the learning.

Consider a simple single homogeneous inverted pendulum system, whose dynamics are given by (38), where $\alpha$ and $\dot{\alpha}$ represent the pole angle and its angular speed, respectively, the pole mass (kg) and length (m) are model constants, and $u$ is a horizontal force acting on the pole.
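The exact dynamics and constants of (38) are not restated here. Purely as an illustration of how such an experiment can be organized with the 0.02 s sampling time and the random-reset mechanism of the common setup, the following sketch integrates an assumed textbook pendulum model; the dynamics inside pendulum_step and all numerical values are placeholders, not the model (38).

```python
import numpy as np

def pendulum_step(state, u, dt=0.02, m=1.0, l=0.5, g=9.8):
    """One Euler step of a generic pendulum model (assumed stand-in for (38))."""
    alpha, alpha_dot = state
    alpha_ddot = (g * np.sin(alpha) + u * np.cos(alpha) / m) / l   # assumed dynamics
    alpha_dot += dt * alpha_ddot
    alpha += dt * alpha_dot
    return np.array([alpha, alpha_dot])

def rollout(policy, steps=200, angle_bound=np.pi):
    """Run one test episode; reset randomly if the state leaves its bound (cf. Section 4)."""
    state = np.random.uniform([-0.2, -0.2], [0.2, 0.2])
    for _ in range(steps):
        state = pendulum_step(state, policy(state))
        if abs(state[0]) > angle_bound:
            state = np.random.uniform([-0.2, -0.2], [0.2, 0.2])
    return state

print(rollout(lambda s: -10.0 * s[0] - 2.0 * s[1]))   # a hand-tuned linear policy, for illustration
```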

We test the success rate under different learning rates. Since, during learning, the action is always selected randomly, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to evaluate the actor network. Thus, a successful learning run means that in the independent test the pole can be swung up and maintained within the prescribed angle range (in rad) for more than 200 iterations.

In the simulation, 60 state-action pairs are collected to serve as the sample set. The phase-2 learning rate $\alpha$ is fixed, while the phase-1 learning rate $\beta$ varies from 0 to 0.12.

The learning is repeated 50 times in order to obtain the average performance, where the initial states of each run are randomly selected within given bounds on the angle and angular-speed dimensions. The action space is bounded, and, in each run, the hyperparameters are initialized to the same values.

Figure 1 shows the resulting success rates. Clearly, with $\alpha$ fixed, different values of $\beta$ affect the performance significantly. In particular, without the learning of hyperparameters, that is, with $\beta = 0$, there are only 6 successful runs out of 50. As $\beta$ increases, the success rate rises until it reaches 1, but as $\beta$ goes on increasing, the performance becomes worse.

Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates of the two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 and not to some kind of boundary where the actor always executes the maximal or minimal action.

If all samples' values at each iteration are summed up and the cumulative values are depicted in series over the iterations, we obtain Figure 2. The evolution process w.r.t. the best parameter setting is marked by the darker diamonds. It is clear that, even with different $\beta$, the evolution processes of the samples' value learning are similar. That means that, due to the BIBO property, the samples' values must finally be bounded and convergent. However, as mentioned above, this does not mean that the learning process converges to the proper equilibrium.

4.2. Value Iteration versus Policy Iteration

As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its particular learning characteristics, a comparison between GK-ADP and KHDP is carried out.

The controlled plant in this simulation is a one-dimensional inverted pendulum, where a single-stage inverted pendulum is mounted on a cart that is able to move linearly, as Figure 3 shows.

The mathematical model is given by (39), where the state consists of the pole angle, the angular speed, the linear displacement, and the linear speed of the cart, $u$ represents the force acting on the cart, and the remaining notation and constants are the same as in (38).

Thus the state-action space is a 5-D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:
(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms, the state-action pairs are limited to bounded ranges in every dimension of the state and the action.
(iii) In GK-ADP, the learning rates $\alpha$ and $\beta$ are fixed.
(iv) The KHDP algorithm in [15] is chosen as the comparison, where the critic network uses Gaussian kernels. To find proper Gaussian parameters, a series of kernel widths $\sigma$, namely 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4, is tested in the simulation. The elements of the weights are initialized randomly, the forgetting factor in RLS-TD(0) is fixed, and the learning rate in KHDP is set to 0.3.
(v) All experiments are run 100 times to obtain the average performance, and in each run there are 10000 iterations to learn the critic network.

Before further discussion, it should be noted that it makes little sense to argue about which algorithm has the better learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration affecting learning performance. So the aim of this simulation is to illustrate the learning characteristics of GK-ADP.

However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to obtain similar evolution processes. Figure 4 shows the evolution processes of GK-ADP and the KHDPs under different $\sigma$, where the left $y$-axis represents the cumulative weights of the critic network in the kernel ACD and the right $y$-axis represents the cumulative values of the samples in GK-ADP. It implies that, although measured in different units, the learning processes under both algorithms converge at nearly the same speed.

The success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different $\sigma$. Leaving the superiority argument aside, we find that such a fixed $\sigma$ in KHDP plays a role very similar to fixed hyperparameters, which need to be set properly in order to obtain a higher success rate. Unfortunately, there is no mechanism in KHDP to regulate $\sigma$ online. On the contrary, the two-phase iteration introduces the update of the hyperparameters into the critic network, which is able to drive the hyperparameters to better values, even from not-so-good initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic kernel parameters when a Gaussian kernel is used.

To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GK-ADP and KHDP in Figures 6(a) and 6(b), respectively, where both tests start from the same initial state.

Apparently the transition time of GK-ADP is much smaller than that of KHDP. We think that, besides possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.

The critic learning depends only on the exploration-exploitation balance and not on the convergence of the actor learning. If the exploration-exploitation balance is designed without using the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and then there are alternatives to gradient learning for actor network optimization, for example, the direct optimum-seeking Gaussian regression actor network in GK-ADP.

Such direct optimum seeking may result in a nearly nonsmooth actor network, as the force output depicted in the second plot of Figure 6(a) shows. To explain this phenomenon, a clue can be found in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction always follows the pole angle and opposes the cart displacement. Clearly the transition interval from the negative limit −3 to the positive limit 3 is very short. Thus the output of the actor network, resulting from Gaussian kernels, tends to control the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.

Obviously GK-ADP achieves high efficiency but also carries potential risks, such as the impact on actuators. Hence, how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.

Finally we check the sample values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are shown. Due to the definition of the immediate reward, the closer a state is to the equilibrium, the smaller its value.

If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, as Figure 9 shows, a shortcoming of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.

To solve this problem and improve performance, besides optimizing the learning parameters, Remark 4 implies that another and perhaps more efficient way is to refine the sample set based on the samples' values. We investigate this in the following subsection.

4.3. Online Adjustment of Sample Set

Since the samples' values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the expected utility proposed in [26], given by (40), in which a normalization function over the sample set and two weighting coefficients are involved.

Let $N_{\mathrm{add}}$ represent the number of samples added to the sample set, $L_{\max}$ the size limit of the sample set, and $\{x_t\}_{t=1}^{T}$ the state-action pair trajectory recorded during GK-ADP learning, where $T$ denotes the final iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.

For l = 1 to T
    Calculate the predictive mean and variance of x_l using (6) and (7)
    Calculate the expected utility of x_l using (40)
End For
Sort all utilities in ascending order
Add the first N_add state-action pairs to the sample set to get the extended set
If L + N_add > L_max
    Calculate the hyperparameters based on the extended set
    For l = 1 to L + N_add
        Calculate the predictive mean and variance of x_l using (6) and (7)
        Calculate the expected utility of x_l using (40)
    End For
    Sort all utilities in descending order
    Delete the first L + N_add − L_max samples to get the refined set
End If
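To make the flow of Algorithm 2 concrete, the following sketch implements the add-then-prune mechanics; the utility callback is a stand-in for the expected utility (40), and the toy usage at the end is purely illustrative.

```python
import numpy as np

def refine_sample_set(sample_set, trajectory, utility, n_add, l_max):
    """Sketch of the refinement in Algorithm 2.

    utility(x, current_set) -> float stands in for the expected utility (40).
    Candidates with the smallest utility are added; if the extended set exceeds
    l_max, the samples with the largest utility are removed (after the
    hyperparameters would be re-optimized on the extended set)."""
    scored = sorted((utility(x, sample_set), i) for i, x in enumerate(trajectory))
    extended = list(sample_set) + [trajectory[i] for _, i in scored[:n_add]]

    if len(extended) > l_max:
        # re-optimizing the hyperparameters on the extended set would happen here
        scored = sorted(((utility(x, extended), i) for i, x in enumerate(extended)),
                        reverse=True)
        drop = {i for _, i in scored[:len(extended) - l_max]}
        extended = [x for i, x in enumerate(extended) if i not in drop]
    return extended

# toy usage: a purely illustrative "novelty" utility (negative distance to the set)
utility = lambda x, S: -min(np.linalg.norm(x - s) for s in S)
S0 = [np.zeros(2), np.ones(2)]
traj = [np.array([0.5, 0.5]), np.array([2.0, 2.0]), np.array([0.1, 0.1])]
print(len(refine_sample_set(S0, traj, utility, n_add=2, l_max=3)))
```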

Since in simulation 2 we carried out 100 runs of the GK-ADP experiment, 100 sets of sample values w.r.t. the same sample states were obtained. Now let us apply Algorithm 2 to these resulting sample sets, where the candidates are picked from the history trajectory of state-action pairs in each run and the coefficients are chosen such that $N_{\mathrm{add}} = 10$ samples are added to the sample set.

We repeat all 100 runs of GK-ADP again, with the same learning rates $\alpha$ and $\beta$. Compared with the 96 successful runs in simulation 2, 94 runs now succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution of the sample values over the 100 runs, where the first 10000 iterations display the learning process of simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly the cumulative value shows a sudden increase after the samples are added and remains almost steady afterwards. This implies that, even with more samples added, there is not much left to learn for the sample values.

However, if we check the cart movement, we find the enhancement brought by the change of the sample set. For the $i$th run, we have two actor networks resulting from GK-ADP before and after adding samples. Using these actors, we carry out the control test, respectively, and collect the absolute values of the cart displacement, denoted by $d_i^{\mathrm{before}}$ and $d_i^{\mathrm{after}}$, at the moment when the pole has been swung up for 100 instants. We define the enhancement as the reduction from $d_i^{\mathrm{before}}$ to $d_i^{\mathrm{after}}$.

Obviously a positive enhancement indicates that the actor network after adding samples behaves better. When all enhancements w.r.t. the 100 runs are illustrated together, as Figure 10(b) shows, almost all cart displacements at the moment the pole is swung up are enhanced more or less, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.

Finally, let us check which state-action pairs are added to the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. Considering only the relationship between the values and the pole angle, as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the angle boundary, due to the ratio between the two utility coefficients.

5. Conclusions and Future Work

ADP methods are among the most promising approaches to RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GK-ADP with a two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.

The theoretical analysis reveals that, with proper learning rates, the two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples' values.

A series of simulations is carried out to verify the necessity of two-phase learning and to illustrate the properties of GK-ADP. Finally, the numerical tests support the viewpoint that the assessment of samples' values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.

However, there are some issues to be considered in the future. The first is how to guarantee condition (36) during learning, which is currently handled in an empirical way. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead the two-phase iteration not only to slow convergence but also to failure.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.