Abstract
Adaptive Dynamic Programming (ADP) with a critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in designing a neural network that serves as the critic network, kernel-based adaptive critic design (ACD) was developed recently. There are two essential issues for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It is able to estimate the values of samples without revisiting them infinitely often, and the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundant samples. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GKADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of system performance obtained by using a varying sample set.
1. Introduction
Reinforcement learning (RL) is an interactive machine learning method for solving sequential decision problems. It is well known as an important learning method in unknown or dynamic environments. Unlike supervised and unsupervised learning, RL interacts with the environment through a trial-and-error mechanism and modifies its action policies to maximize the payoffs [1]. From a theoretical point of view, it is strongly connected with direct and indirect adaptive optimal control methods [2].
Traditional RL research focused on discrete state/action systems, in which the state and action take only a finite number of prescribed discrete values. The learning space grows exponentially as the number of states and the number of allowed actions increase. This leads to the so-called curse of dimensionality (CoD) [2]. To mitigate the CoD problem, function approximation and generalization methods [3] are introduced to store the optimal value and the optimal control as functions of the state vector. Generalization methods based on parametric models such as neural networks [4–6] have become a popular means of solving RL problems in continuous environments.
Currently, research on RL in continuous environments for constructing learning systems for nonlinear optimal control has attracted the attention of researchers in the control domain, because RL can modify its policy based only on the value function, without knowing the model structure or the parameters in advance. A family of new RL techniques known as Approximate or Adaptive Dynamic Programming (ADP) (also known as Neurodynamic Programming or Adaptive Critic Designs (ACDs)) has received increasing research interest [7, 8]. ADPs are based on the actor-critic structure, in which a critic assesses the value of the action or control policy applied by an actor, and the actor modifies its action based on that assessment. In the literature, ADP approaches are categorized into the following main schemes: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [9, 10].
ADP research commonly adopts multilayer perceptron neural networks (MLPNNs) as the critic design. Vrabie and Lewis proposed an online approach to continuous-time direct adaptive optimal control which used neural networks to parametrically represent the control policy and the performance of the control system [11]. Liu et al. solved the constrained optimal control problem of unknown discrete-time nonlinear systems based on the iterative ADP algorithm via the GDHP technique with three neural networks [12]. In fact, different kinds of neural networks (NNs) play important roles in ADP algorithms, such as radial basis function NNs [3], wavelet basis function NNs [13], and echo state networks [14].
Besides the benefits brought by NNs, ADP methods often suffer from problems related to the design of the NNs. On one hand, the learning control performance depends greatly on the empirical design of the critic network, especially the manual setting of the hidden layer or the basis functions. On the other hand, because of local minima in neural network training, how to improve the quality of the final policies is still an open problem [15].
As we can see, it is difficult to evaluate the effectiveness of a parametric model when knowledge of the model’s order or of the nonlinear characteristics of the system is insufficient. Compared with parametric modeling methods, nonparametric modeling methods, especially kernel methods [16, 17], do not need the model structure to be set in advance. Hence, kernel machines have been widely studied to realize nonlinear and nonparametric modeling. Engel et al. proposed the kernel recursive least-squares algorithm to construct minimum mean-squared-error solutions to nonlinear least-squares problems [18]. As popular kernel machines, support vector machines (SVMs) have also been applied to nonparametric modeling problems. Dietterich and Wang combined linear programming with SVMs to find value function approximations [19]. Similar research was published in [20], in which least-squares SVMs were used. Nevertheless, both focused on discrete state/action spaces and lacked theoretical results on the policies obtained.
In addition to SVMs, Gaussian processes (GPs) have become an alternative generalization method. GP models are powerful nonparametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multilayer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to nonconvex optimization routines [21, 22]. In [23], Engel et al. first applied GPs in temporal-difference (TD) learning for MDPs with stochastic rewards and deterministic transitions. They derived a new GPTD algorithm in [24] that overcame the limitation of deterministic transitions and extended GPTD to the estimation of state-action values. The GPTD algorithm addresses only value approximation, so it has to be combined with actor-critic methods or other policy iteration methods to solve learning control problems.
An alternative approach to employing GPs in RL is the model-based value iteration or policy iteration method, in which a GP model is used to model the system dynamics and represent the value function [25]. In [26], an approximate value-function-based RL algorithm named Gaussian process dynamic programming (GPDP) was presented, which built the dynamic transition model, the value function model, and the action policy model, respectively, using GPs. In this way, the sample set is adjusted to a reasonable shape, with high sample densities near the equilibria or in places where the value function changes dramatically. Thus it is good at controlling nonlinear systems with complex dynamics. A major shortcoming, even if the relatively high computational cost is tolerable, is that the states in the sample set need to be revisited again and again in order to update their value functions. Since this condition is impractical in real implementations, it diminishes the appeal of this method.
Kernel-based methods have also been introduced to ADP. In [15], a novel framework of ACDs with sparse kernel machines was presented by integrating kernel methods into the critic network. A sparsification method based on approximately linear dependence (ALD) analysis was used to sparsify the kernel machines. This method can overcome the difficulty of presetting the model structure in parametric models and realize actor-critic learning online [27]. However, the selection of samples based on the ALD analysis is performed offline and does not consider the distribution of the value function. Therefore, the data samples cannot be adjusted online, which makes the method more suitable for control systems with smooth dynamics, where the value function changes gently.
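The ALD test mentioned above can be illustrated with a minimal sketch. It follows the usual form of the test from the kernel recursive least-squares literature; the function names, the isotropic Gaussian kernel with width `sigma`, and the threshold name `mu` are assumptions made for illustration and are not taken from [15].

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Isotropic Gaussian kernel between two vectors (assumed form)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def ald_sparsify(samples, mu=0.1, sigma=1.0):
    """Return the indices of the samples kept by the ALD test.

    A sample is kept when its kernel feature cannot be approximated,
    within squared error mu, by a linear combination of the features
    of the samples already kept.
    """
    kept = [0]  # the first sample always enters the dictionary
    for t in range(1, len(samples)):
        K = np.array([[gaussian_kernel(samples[i], samples[j], sigma)
                       for j in kept] for i in kept])
        k_t = np.array([gaussian_kernel(samples[i], samples[t], sigma)
                        for i in kept])
        # A small jitter keeps the linear solve well conditioned.
        coeffs = np.linalg.solve(K + 1e-10 * np.eye(len(kept)), k_t)
        delta = gaussian_kernel(samples[t], samples[t], sigma) - k_t @ coeffs
        if delta > mu:
            kept.append(t)
    return kept
```

Note that the test uses only the kernel and the sample locations; no value information enters, which is the limitation discussed in the text.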
We think GPDP and ACDs with sparse kernel machines are complementary. As indicated in [28], the prediction of a GP can be viewed as a linear combination of the covariances between the new point and the samples. Hence it seems reasonable to introduce a kernel machine with GPs to build the critic network in ACDs, if the values of the samples are known or at least can be assessed numerically. The sample set can then be adjusted online during critic-actor learning.
The major problem here is how to realize value function learning and GP model updating simultaneously, especially under the condition that the samples of the state-action space can hardly be revisited infinitely often in order to approximate their values. To tackle this problem, a two-phase iteration is developed in order to obtain the optimal control policy for a system whose dynamics are unknown a priori.
2. Description of the Problem
In general, ADP is an actor-critic method which approximates the value functions and policies to achieve generalization in MDPs with large or continuous spaces. The critic design plays the most important role, because it determines how the actor optimizes its action. Hence, we give a brief introduction to both kernel-based ACDs and GPs, in order to derive a clear description of the theoretical problem.
2.1. Kernel-Based ACDs
Kernel-based ACDs mainly consist of a critic network, a kernel-based feature learning module, a reward function, an actor network/controller, and a model of the plant. The critic, constructed from a kernel machine, is used to approximate the value functions or their derivatives. The output of the critic is then used in the training process of the actor so that policy gradients can be computed. Once the actor converges, it describes the optimal action policy mapping states to actions.
A traditional neural network based on a kernel machine and samples serves as the model of the value function, as the following equation shows, and a recursive algorithm such as KLSTD [29] performs the value function approximation:
$$Q(x) = \sum_{i=1}^{L} \alpha_i k(x, x_i), \quad (1)$$
where $x$ and $x_i$ represent state-action pairs $(s, a)$ and $(s_i, a_i)$; $\alpha_i$, $i = 1, 2, \ldots, L$, are the weights; $(s_i, a_i)$ represents the selected state-action pairs in the sample set; and $k(x, x_i)$ is a kernel function.
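As a minimal sketch of the kernel expansion in (1), the critic output is a weighted sum of kernel evaluations between the query state-action pair and the stored samples. The names `samples`, `alpha`, and `sigma`, and the isotropic Gaussian kernel, are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def critic_value(x, samples, alpha, sigma=1.0):
    """Evaluate sum_i alpha_i * k(x, x_i) for one or more query pairs x."""
    return gaussian_kernel_matrix(x, samples, sigma) @ np.asarray(alpha, dtype=float)
```

In this form, learning the critic amounts to adjusting `alpha`, while the quality of the approximation also depends on which rows are stored in `samples` and on `sigma`.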
The key to critic learning is the update of the weight vector. Value function modeling in a continuous state/action space is a regression problem. From the viewpoint of the model, the NN structure is a reproducing kernel space spanned by the samples, in which the value function is expressed as a linear regression function. Since the samples form the basis of this Hilbert space, the way they are selected determines the VC dimension of the identification, as well as the performance of value function approximation.
If only ALD-based kernel sparsification is used, only the linear independence of the samples is considered in sample selection. It is therefore hard to evaluate how good the sample set is, because the sample selection does not consider the distribution of the value function, and the performance of the ALD analysis is seriously affected by the hyperparameters of the kernel function, which are predetermined empirically and fixed during learning.
If the hyperparameters can be optimized online and the value function w.r.t. samples can be evaluated by iteration algorithms, the critic network will be optimized not only by value approximation but also by hyperparameter optimization. Moreover with approximated sample values, there is a direct way to evaluate the validity of sample set, in order to regulate the set online. Thus in this paper we turn to Gaussian processes to construct the criterion for samples and hyperparameters.
2.2. GP-Based Value Function Model
For an MDP, the data samples and the corresponding value can be collected by observing the MDP. Here is the state-action pairs , , , and is the value function defined as where .
Given a sample set collected from a continuous dynamic system, , where , , Gaussian regression with covariance function shown in the following equation is a well-known modeling technique to infer the function: where .
Assuming additive independent identically distributed Gaussian noise with variance , the prior on the noisy observations becomes where
The parameters , , and are the hyperparameters of the and collected within the vector .
For an arbitrary input , the predictive distribution of the function value is Gaussian distributed with mean and variance given by
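For illustration, the standard GP regression predictor can be sketched as follows. The notation is assumed rather than taken from the paper: a squared-exponential covariance with per-dimension length scales `lengthscales`, signal variance `v1`, and noise variance `v0`; `X` holds the sample state-action pairs row by row and `q` their values.

```python
import numpy as np

def se_kernel(A, B, lengthscales, v1):
    """Squared-exponential covariance between the rows of A and B."""
    A = np.atleast_2d(np.asarray(A, dtype=float)) / lengthscales
    B = np.atleast_2d(np.asarray(B, dtype=float)) / lengthscales
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return v1 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_predict(X, q, x_star, lengthscales, v1, v0):
    """Predictive mean and variance of the latent function at x_star.

    X must be a 2-D array (one sample per row). Also returns
    alpha = (K + v0 I)^{-1} q, so the mean is a kernel expansion.
    """
    K = se_kernel(X, X, lengthscales, v1) + v0 * np.eye(len(X))
    k_star = se_kernel(X, x_star, lengthscales, v1)       # shape (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, q))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = v1 - np.sum(v ** 2, axis=0)
    return mean, var, alpha
```

The predictive mean is a linear combination of kernel evaluations weighted by `alpha`, which is the structural link between Gaussian regression and a kernel-machine critic.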
Comparing (6) to (1), we find that if we let , the neural network can also be regarded as a Gaussian regression model. In other words, the critic network can be constructed from a Gaussian kernel machine if the following conditions are satisfied.
Condition 1. The hyperparameters of Gaussian kernel are known.
Condition 2. The values w.r.t. all sample states are known.
With a Gaussian-kernel-based critic network, the sample state-action pairs and the corresponding values are known. A criterion such as the comprehensive utility proposed in [26] can then be set up in order to refine the sample set online. At the same time, it is convenient to optimize the hyperparameters in order to approximate the value function more accurately. Thus, besides the advantages brought by the kernel machine, a critic based on Gaussian kernels will be better at approximating value functions.
Consider Condition 1. If the values w.r.t. are known (note that this is exactly Condition 2), the common way to obtain the hyperparameters is evidence maximization, where the log-evidence is given by It requires the calculation of the derivative of w.r.t. each , given by where denotes the trace, .
Consider Condition 2. For an unknown system, if is known, that is, the hyperparameters are known (note that this is exactly Condition 1), the update of the critic network reduces to the update of the values w.r.t. the samples by value iteration.
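A minimal sketch of evidence maximization for a Gaussian kernel is given below, using the standard GP log marginal likelihood and its trace-form gradient. The parameterization is an assumption for illustration: a single length scale `ell`, signal variance `v1`, and noise variance `v0`, with `q` holding the sample values.

```python
import numpy as np

def kernel_and_grads(X, ell, v1, v0):
    """Noisy covariance K and its derivatives w.r.t. (ell, v1, v0)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    E = np.exp(-0.5 * sq / ell ** 2)
    K = v1 * E + v0 * np.eye(len(X))
    dK = [v1 * E * sq / ell ** 3,  # dK / d ell
          E,                       # dK / d v1
          np.eye(len(X))]          # dK / d v0
    return K, dK

def log_evidence(X, q, ell, v1, v0):
    """Log marginal likelihood and its gradient w.r.t. (ell, v1, v0)."""
    K, dK = kernel_and_grads(X, ell, v1, v0)
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ q
    _, logdet = np.linalg.slogdet(K)
    value = -0.5 * q @ alpha - 0.5 * logdet - 0.5 * len(q) * np.log(2.0 * np.pi)
    # d/d theta_j = 0.5 * tr((alpha alpha^T - K^{-1}) dK/d theta_j)
    grad = np.array([0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ D)
                     for D in dK])
    return value, grad
```

A gradient-ascent step on this quantity moves the hyperparameters toward values that explain the current sample values better, which is why the step is only meaningful when those values are themselves reasonably accurate.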
According to the analysis, both conditions are interdependent. That means the update of critic depends on known hyperparameters, and the optimization of hyperparameters depends on accurate sample values.
Hence we need a comprehensive iteration method to realize value approximation and optimization of hyperparameters simultaneously. A direct way is to update them alternately. Unfortunately, this way is not reasonable because the two processes are tightly coupled. For example, temporal-difference errors drive value approximation, but a change of the weights simultaneously causes a change of the hyperparameters; it is then difficult to tell whether a temporal-difference error is induced by the observation or by the change of the Gaussian regression model.
To solve this problem, a two-phase value iteration for the critic network is presented in the next section, and its convergence conditions are analyzed.
3. Two-Phase Value Iteration for Critic Network Approximation
First a proposition is given to describe the relationship between hyperparameters and the sample value function.
Proposition 1. The hyperparameters are optimized by evidence maximization according to the samples and their Q values, and the log-evidence is given by where . It can be proved that, for arbitrary hyperparameters , if , (10) defines an implicit function or a continuously differentiable function as follows:
Then the two-phase value iteration for the critic network is described by the following theorem.
Theorem 2. Given the following conditions, (1) the system is bounded-input bounded-output (BIBO) stable, (2) the immediate reward is bounded, (3) for , , , the following iteration process is convergent: where and is a kind of pseudoinverse of ; that is, .
Proof. From (12), it is clear that the two phases include the update of hyperparameters in phase 1, which is viewed as the update of generalization model, and the update of samples’ value in phase 2, which is viewed as the update of critic network.
The convergence of iterative algorithm is proved based on stochastic approximation Lyapunov method.
Define that Equation (12) is rewritten as Define approximation errors as Further define that Thus (16) is reexpressed as Let , and the twophase iteration is in the shape of stochastic approximation; that is, Define where is the scale of the hyperparameters and will be defined later. Let represent for short.
Obviously (20) is a positive definite matrix, and , . Hence, is a Lyapunov functional. At moment , the conditional expectation of the positive definite matrix is It is easy to compute the first-order Taylor expansion of (21) as in which the first item on the right of (22) is Substituting (23) into (22) yields where , Consider the last two items of (24) first. If the immediate reward is bounded and an infinite discounted reward is applied, the value function is bounded. Hence, .
If the system is BIBO stable, , , are the dimensions of the state space and action space, respectively, , , the policy space is bounded.
According to Proposition 1, when the policy space is bounded and , , there exists a constant , so that In addition, From (26) and (27), we know that where . Similarly According to Lemma 5.4.1 in [30], , and and .
Now we focus on the first two items on the right of (24).
The first item is computed as For the second item , when the state transition function is time invariant, it is true that , . Then we have This inequality holds because of positive , , . Define the norm , where is the derivative matrix norm of the vector 1 and . Then Hence On the other hand, where is the value function error caused by the estimated hyperparameters.
Substituting (33) and (34) into yields Obviously, if the following inequality is satisfied, there exists a positive constant , such that the first two items on the right of (24) satisfy According to Theorem 5.4.2 in [30], the iterative process is convergent; namely, .
Remark 3. Let us check the final convergence position. Obviously, , . This means the equilibrium of the critic network meets , where . And the equilibrium of hyperparameters is the solution of evidence maximization that , where .
Remark 4. It is clear that the selection of the samples is one of the key issues. Since all samples now have values according to the two-phase iteration, it is convenient to evaluate the samples according to an information-based criterion and to refine the set by placing relatively more samples near the equilibrium or where the gradient of the value function is large, so that the sample set better describes the distribution of the value function.
Remark 5. Since the two-phase iteration belongs to value iteration, the initial policy of the algorithm does not need to be stable. To ensure BIBO stability in practice, we need a mechanism to clamp the output of the system, even though the system will no longer be smooth.
Theorem 2 gives the iteration principle for critic network learning. Based on the critic network, with a proper objective defined, such as minimizing the expected total discounted reward [15], an HDP or DHP update can be applied to optimize the actor network. Since this paper focuses on ACD and, more importantly, the update process of the actor network does not affect the convergence of the critic network (though it may induce premature or local optimization), the gradient update of the actor is not necessary. Hence a simple optimum seeking is applied to obtain the optimal actions w.r.t. the sample states, and an actor network based on Gaussian kernels is then generated from these optimal state-action pairs.
Up to now we have built a critic-actor architecture, which is named Gaussian-kernel-based Approximate Dynamic Programming (GKADP for short) and shown in Algorithm 1. A span is introduced, so that hyperparameters are updated every times of update. If , two phases are asynchronous. From the viewpoint of stochastic approximation, this asynchronous learning does not change the convergence condition (36) but reduces the computational cost of hyperparameter learning.
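The direct optimum seeking described above can be sketched as a one-dimensional search over the bounded action space. This is a sketch under stated assumptions: the action is scalar, the search is a uniform grid, the critic is any callable `critic(state, action)`, and smaller critic values are better, consistent with a cost-like value. The function name and `n_grid` are hypothetical.

```python
import numpy as np

def best_actions(critic, sample_states, action_low, action_high, n_grid=61):
    """For each sample state, pick the grid action with the smallest critic value."""
    actions = np.linspace(action_low, action_high, n_grid)
    best = []
    for s in sample_states:
        values = np.array([critic(s, a) for a in actions])
        best.append(actions[int(np.argmin(values))])
    return np.array(best)
```

The resulting state-action pairs could then serve as training data for a Gaussian regression actor. Because each action is chosen independently per state, nothing in this step enforces smoothness of the resulting policy.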
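The role of the span can be sketched as a scheduling loop. Only the scheduling is made concrete here: the two update rules are passed in as callables because their exact form is given by (12), and the names `update_hyper`, `update_values`, and `span`, as well as the choice to run phase 1 before phase 2 on the steps where it is due, are assumptions made for illustration.

```python
def two_phase_loop(theta, q_values, transitions,
                   update_hyper, update_values, span=1):
    """Run phase 2 on every transition and phase 1 once every `span` transitions.

    update_hyper(theta, q_values) -> new theta           (phase 1)
    update_values(theta, q_values, transition) -> new q  (phase 2)
    """
    for t, transition in enumerate(transitions):
        if t % span == 0:
            theta = update_hyper(theta, q_values)
        q_values = update_values(theta, q_values, transition)
    return theta, q_values
```

With `span=1` both phases run on every transition; a larger span performs fewer hyperparameter updates, each of which requires the comparatively expensive evidence gradient.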

4. Simulation and Discussion
In this section, we present numerical simulations of continuous control to illustrate the special properties and the feasibility of the algorithm, including the necessity of two-phase learning, its specific properties compared with traditional kernel-based ACDs, and the performance enhancement resulting from online refinement of the sample set.
Before further discussion, we first give the common setup for all simulations:
(i) The span .
(ii) To keep the output bounded, once a state is out of the boundary, the system is reset randomly.
(iii) The exploration-exploitation tradeoff is left out of account here. During the learning process, all actions are selected randomly within the limited action space. Thus the behavior of the system is totally unordered during learning.
(iv) The same ALD-based sparsification as in [15] is used to determine the initial sample set, in which , , and empirically.
(v) The sampling time and control interval are set to 0.02 s.
4.1. The Necessity of Two-Phase Learning
The proof of Theorem 2 shows that properly determined learning rates of phases 1 and 2 guarantee condition (36). Here we present a simulation to show how the learning rates in phases 1 and 2 affect the learning performance.
Consider a simple single homogeneous inverted pendulum system: where and represent the angle and its speed, kg, , m, respectively, and is a horizontal force acting on the pole.
We test the success rate under different learning rates. Since the action is always selected randomly during learning, we have to verify the optimal policy after learning; that is, an independent policy test is carried out to test the actor network. A learning run is counted as successful if, in the independent test, the pole can be swung up and maintained within rad for more than 200 iterations.
In the simulation, 60 state-action pairs are collected to serve as the sample set, and the learning rates are set to , , where is varying from 0 to 0.12.
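The success criterion can be expressed as a short check on the recorded angle trajectory of the independent test. The function name and the `bound` argument are hypothetical, and the sketch assumes that the required iterations must be consecutive.

```python
def is_successful(angles, bound, hold_steps=200):
    """True if |angle| stays within `bound` for more than `hold_steps` consecutive steps."""
    run = 0
    for angle in angles:
        run = run + 1 if abs(angle) <= bound else 0
        if run > hold_steps:
            return True
    return False
```

The success rate over repeated runs is then the fraction of runs for which this check returns `True`.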
The learning is repeated 50 times in order to get the average performance, where the initial states of each run are randomly selected within the bound and w.r.t. the dimensions of and . The action space is bounded within . And, in each run, the initial hyperparameters are set to .
Figure 1 shows the result of success rates. Clearly, with fixed, different ’s affect the performance significantly. In particular, without the learning of hyperparameters, that is, , there are only 6 successful runs over 50 runs. With the increasing of the success rate increases till 1 when . But as goes on increasing, the performance becomes worse.
Hence both phases are necessary if the hyperparameters are not initialized properly. It should be noted that the learning rates w.r.t. two phases need to be regulated carefully in order to guarantee condition (36), which leads the learning process to the equilibrium in Remark 3 but not to some kind of boundary where the actor always executed the maximal or minimal actions.
If all samples’ values on each iteration are summed up and all cumulative values w.r.t. iterations are depicted in series, then we have Figure 2. The evolution process w.r.t. the best parameter is marked by the darker diamond. It is clear that, even with different , the evolution processes of the samples’ value learning are similar. That means, due to BIBO property, the samples’ values must be finally bounded and convergent. However, as mentioned above, it does not mean that the learning process converges to the proper equilibrium.
4.2. Value Iteration versus Policy Iteration
As mentioned in Algorithm 1, the value iteration in critic network learning does not depend on the convergence of the actor network, and, compared with HDP or DHP, the direct optimum seeking for the actor network seems a little aimless. To test its performance and discuss its special learning characteristics, a comparison between GKADP and KHDP is carried out.
The controlled plant in the simulation is a one-dimensional inverted pendulum, in which a single-stage inverted pendulum is mounted on a cart that is able to move linearly, as Figure 3 shows.
The mathematical model is given as where to represent the angle, angular speed, linear displacement, and linear speed of the cart, respectively, represents the force acting on the cart, kg, m, , and other denotations are the same as (38).
Thus the state-action space is a 5D space, much larger than that in simulation 1. The configurations of both algorithms are listed as follows:
(i) A small sample set with 50 samples, determined by ALD-based sparsification, is adopted to build the critic network.
(ii) For both algorithms, the state-action pair is limited to , , , , w.r.t. to and .
(iii) In GKADP, the learning rates are and .
(iv) The KHDP algorithm in [15] is chosen for comparison, where the critic network uses Gaussian kernels. To get proper Gaussian parameters, a series of as 0.9, 1.2, 1.5, 1.8, 2.1, and 2.4 are tested in the simulation. The elements of the weights are initialized randomly in , the forgetting factor in RLSTD(0) is set to , and the learning rate in KHDP is set to 0.3.
(v) All experiments are run 100 times to get the average performance, and in each run there are 10000 iterations to learn the critic network.
Before further discussion, it should be noted that it makes little sense to argue about which algorithm is better in learning performance, because, besides the debate of policy iteration versus value iteration, there are too many parameters in the simulation configuration that affect learning performance. So the aim of this simulation is to illustrate the learning characteristics of GKADP.
However, to make the comparison as fair as possible, we regulate the learning rates of both algorithms to get similar evolution processes. Figure 4 shows the evolution processes of GKADP and KHDPs under different , where the axis on the left represents the cumulated weights, , of the critic network in kernel ACD, and the other axis represents the cumulated values of the samples, , in GKADP. It implies that, although the units differ, the learning processes under both learning algorithms converge at nearly the same speed.
Then the success rates of all algorithms are depicted in Figure 5. The left six bars represent the success rates of KHDP with different . Leaving the question of superiority aside, we find that such a fixed in KHDP is very similar to the fixed hyperparameters , which need to be set properly in order to get a higher success rate. Unfortunately, there is no mechanism in KHDP to regulate online. In contrast, the two-phase iteration introduces the update of hyperparameters into the critic network, which is able to drive the hyperparameters to better values, even from poor initial values. In fact, this two-phase update can be viewed as a kind of kernel ACD with dynamic , when the Gaussian kernel is used.
To discuss the performance in depth, we plot the test trajectories resulting from the actors optimized by GKADP and KHDP in Figures 6(a) and 6(b), respectively, where the start state is set to .
(a) The resulted control performance using GKADP
(b) The resulted control performance using KHDP
Apparently the transition time of GKADP is much shorter than that of KHDP. We think that, besides the possibly well-regulated parameters, an important reason is the nongradient learning of the actor network.
The critic learning depends only on the exploration-exploitation balance and not on the convergence of actor learning. If the exploration-exploitation balance is designed without the actor network output, the learning processes of the actor and critic networks are relatively independent of each other, and there are then alternatives to gradient learning for actor network optimization, for example, the direct optimum seeking for the Gaussian regression actor network in GKADP.
Such direct optimum seeking may result in a nearly nonsmooth actor network, like the force output depicted in the second plot of Figure 6(a). A clue to this phenomenon can be found in Figure 7, which illustrates the best actions w.r.t. all sample states according to the final actor network. It is reasonable that the force direction is always the same as that of the pole angle and opposite to the cart displacement. Clearly, the transition interval from the negative limit −3 to the positive limit 3 is very short. Thus the output of the actor network, resulting from Gaussian kernels, tends to correct the angle bias as quickly as possible, even though such quick control sometimes makes the output force a little nonsmooth.
It is obvious that GKADP is highly efficient but also carries potential risks, such as the impact on actuators. Hence how to design the exploration-exploitation balance to satisfy Theorem 2 and how to design actor network learning to balance efficiency and engineering applicability are two important issues for future work.
Finally we check the sample values, which are depicted in Figure 8, where only the dimensions of pole angle and linear displacement are depicted. Due to the definition of the immediate reward, the closer a state is to the equilibrium, the smaller its value is.
If we execute the 96 successful policies one by one and record all final cart displacements and linear speeds, as Figure 9 shows, a shortcoming of this sample set is uncovered: even with good convergence of the pole angle, the optimal policy can hardly drive the cart back to the zero point.
To solve this problem and improve performance, besides optimizing learning parameters, Remark 4 implies that another and maybe more efficient way is refining sample set based on the samples’ values. We will investigate it in the following subsection.
4.3. Online Adjustment of Sample Set
Because the samples’ values are learned, it is possible to assess whether the samples are chosen reasonably. In this simulation we adopt the following expected utility [26]: where is a normalization function over sample set and , are coefficients.
Let represent the number of samples added to the sample set, the size limit of the sample set, and the state-action pair trajectory during GKADP learning, where denotes the end iteration of learning. The algorithm to refine the sample set is presented as Algorithm 2.
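One way such a refinement step could be organized is sketched below. It is not a transcription of Algorithm 2: the utilities are taken as precomputed inputs, and the rule of dropping the lowest-utility existing samples when the size limit would be exceeded is an assumption made for illustration, as are all names.

```python
import numpy as np

def refine_sample_set(sample_utils, candidate_utils, n_add, size_limit):
    """Return (indices of kept samples, indices of added candidates).

    Adds the n_add candidates with the highest utility; if the set would
    exceed size_limit, drops the existing samples with the lowest utility.
    """
    sample_utils = np.asarray(sample_utils, dtype=float)
    candidate_utils = np.asarray(candidate_utils, dtype=float)
    added = np.argsort(candidate_utils)[::-1][:n_add]
    overflow = max(0, len(sample_utils) + len(added) - size_limit)
    order = np.argsort(sample_utils)      # ascending: least useful first
    kept = np.sort(order[overflow:])
    return kept, added
```

Here the candidates would be state-action pairs taken from the learning trajectory, and their utilities would be computed from the critic's predictions at those pairs.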

Since in simulation 2 we have carried out 100 runs of experiment for GKADP, 100 sets of sample values w.r.t. the same sample states are obtained. Now let us apply Algorithm 2 to these resulting sample sets, where is picked from the history trajectory of state-action pairs in each run, and , , in order to add 10 samples to the sample set.
We repeat all 100 runs of GKADP again, with the same learning rates and . Based on the 96 successful runs in simulation 2, 94 runs succeed in obtaining control policies that swing up the pole. Figure 10(a) shows the average evolution process of sample values over 100 runs, where the first 10000 iterations display the learning process in simulation 2 and the following 8000 iterations display the learning process after adding samples. Clearly, the cumulative value increases suddenly after the samples are added and remains almost steady afterwards. This implies that, even with more samples added, there is not much left to learn for the sample values.
(a) The average evolution process of values before and after adding samples
(b) The enhancement of the control performance about cart displacement after adding samples
However, if we check the cart movement, we find the enhancement brought by the change of the sample set. For the th run, we have two actor networks resulting from GKADP before and after adding samples. Using these actors, we carry out the control test, respectively, and collect the absolute values of cart displacement, denoted by and , at the moment that the pole has been swung up for 100 instants. We define the enhancement as
Obviously, a positive enhancement indicates that the actor network after adding samples behaves better. With all enhancements w.r.t. the 100 runs illustrated together, as Figure 10(b) shows, almost all cart displacements as the pole is swung up are improved to some extent, except for 8 failures. Hence, with a proper principle based on the sample values, the sample set can be refined online in order to enhance performance.
Finally let us check which state-action pairs are added to the sample set. We put all added samples over the 100 runs together and depict their values in Figure 11(a), where the values are projected onto the dimensions of pole angle and cart displacement. Considering only the relationship between values and pole angle, as Figure 11(b) shows, we find that the refinement principle tends to select samples a little away from the equilibrium and near the boundary , due to the ratio between and .
(a) Projecting on the dimensions of pole angle and cart displacement
(b) Projecting on the dimension of pole angle
5. Conclusions and Future Work
ADP methods are among the most promising research directions on RL in continuous environments for constructing learning systems for nonlinear optimal control. This paper presents GKADP with two-phase value iteration, which combines the advantages of kernel ACDs and GP-based value iteration.
The theoretical analysis reveals that, with proper learning rates, the two-phase iteration makes the Gaussian-kernel-based critic network converge to the structure with optimal hyperparameters and approximate all samples’ values.
A series of simulations are carried out to verify the necessity of two-phase learning and illustrate the properties of GKADP. Finally, the numerical tests support the viewpoint that the assessment of samples’ values provides a way to refine the sample set online, in order to enhance the performance of the critic-actor architecture during operation.
However, some issues need to be addressed in the future. The first is how to guarantee condition (36) during learning, which is now done empirically. The second is the balance between exploration and exploitation, which is always an open question but seems more notable here, because a bad exploration-exploitation principle will lead the two-phase iteration to failure, not merely to slow convergence.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61473316 and 61202340 and the International Postdoctoral Exchange Fellowship Program under Grant no. 20140011.