By combining with sparse kernel methods, least-squares temporal difference (LSTD) algorithms can construct the feature dictionary automatically and obtain better generalization ability. However, previous kernel-based LSTD algorithms do not consider regularization, and their sparsification processes are batch or offline, which hinders their widespread application to online learning problems. In this paper, we combine the following five techniques and propose two novel kernel recursive LSTD algorithms: (i) online sparsification, which can cope with unknown state regions and be used for online learning, (ii) $\ell_2$ and $\ell_1$ regularization, which can avoid overfitting and eliminate the influence of noise, (iii) recursive least squares, which can eliminate matrix-inversion operations and reduce computational complexity, (iv) a sliding-window approach, which can avoid caching all history samples and reduce the computational cost, and (v) fixed-point subiteration and online pruning, which can make $\ell_1$ regularization easy to implement. Finally, simulation results on two 50-state chain problems demonstrate the effectiveness of our algorithms.

1. Introduction

Least-squares temporal difference (LSTD) learning may be the most popular approach for policy evaluation in reinforcement learning (RL) [1, 2]. Compared with standard temporal difference (TD) learning, LSTD uses samples more efficiently and eliminates all step-size parameters. However, LSTD also has some drawbacks. First, LSTD requires a matrix-inversion operation at each time step. To reduce computational complexity, Bradtke and Barto proposed a recursive LSTD (RLSTD) algorithm [1], and Xu et al. proposed an RLSTD($\lambda$) algorithm [3]. But these two algorithms still require many features, especially for highly nonlinear RL problems, since the RLS approximator assumes a linear model [4]. Second, when the number of features is larger than the number of training samples, LSTD is prone to overfitting. To overcome this problem, Kolter and Ng proposed an $\ell_1$-regularized LSTD algorithm called LARS-TD for feature selection [5], but it is only applicable to batch learning and its implementation is complicated. On this basis, Chen et al. proposed an $\ell_2$-regularized RLSTD algorithm [6]. In contrast with LARS-TD, it has an analytical solution, but it cannot obtain a sparse solution. Third, LSTD requires users to design the feature vector manually, and poor design choices can result in estimates that diverge from the optimal value function [7].

In the last two decades, kernel methods have been studied intensively and extensively in supervised and unsupervised learning [8]. The basic idea behind kernel methods can be summarized as follows: by using a nonlinear transform, the original input data can be mapped into a high-dimensional feature space, and an inner product in this space can be interpreted as a Mercer kernel function. Thus, as long as a linear algorithm can be formulated in terms of inner products, there is no need to perform computations in the high-dimensional feature space [9]. Recently, there have also been many research works on kernelizing least-squares algorithms [9–13]. Here, we only review some works related to our proposed algorithms. One typical work is the sparse kernel recursive least-squares (SKRLS) algorithm with the approximate linear dependency (ALD) criterion [11]. Compared with traditional RLS algorithms, it not only has a good nonlinear approximation ability but also can construct the feature dictionary automatically. Similarly, Chen et al. proposed an $\ell_2$-regularized SKRLS algorithm with online vector quantization [12]. Besides having the good properties of SKRLS-ALD, it can avoid overfitting. In addition, Chen et al. proposed an $\ell_1$-regularized SKRLS algorithm with the fixed-point subiteration [13], which can yield a much sparser dictionary.
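As a small illustration of the kernel idea described above, the following sketch checks numerically that a kernel evaluation equals an inner product in an explicit feature space. The degree-2 polynomial kernel and the function names `poly2_features` and `poly2_kernel` are illustrative choices, not taken from the paper, which uses a different Mercer kernel.

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for k(x, y) = (x . y)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, y):
    """Same quantity computed directly in the input space."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = poly2_kernel(x, y)
feature_inner_product = float(np.dot(poly2_features(x), poly2_features(y)))
print(kernel_value, feature_inner_product)
```

For kernels such as the Gaussian kernel the feature space is infinite-dimensional, so only the kernel-side computation is available in practice.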

Intuitively, we can also bring the benefits of kernel machine learning to LSTD algorithms. In fact, kernel-based RL algorithms have become increasingly popular in recent years [14–22], and several works have kernelized LSTD algorithms. In an earlier paper, Xu proposed a sparse kernel-based LSTD (SKLSTD) algorithm with the ALD criterion [19]. Although this algorithm avoids selecting features manually, it is only applicable to batch learning and its derivation is complicated. After that, Xu et al. proposed an incremental version of the SKLSTD algorithm for policy iteration [20], but this algorithm still requires a matrix-inversion operation at each time step. Moreover, the feature dictionary must be constructed offline, so this algorithm can approximate the value function correctly only in the region of the state space covered by the training samples. Recently, Jakab and Csató proposed a sparse kernel RLSTD (SKRLSTD) algorithm by using a proximity graph sparsification method [21]. Unfortunately, its sparsification process is also offline. In addition, none of these algorithms considers regularization, whereas many real problems exhibit noise and the high expressiveness of the kernel matrix can result in overfitting [22].

In this paper, we propose two online SKRLSTD algorithms with $\ell_2$ and $\ell_1$ regularization, called OSKRLSTD-$\ell_2$ and OSKRLSTD-$\ell_1$, respectively. Compared with the derivation of SKLSTD, our derivation uses the Bellman operator along with the projection operator and is thus simpler. To cope with unknown state-space regions and avoid overfitting, our algorithms use online sparsification and regularization techniques. Besides, to reduce computational complexity and avoid caching all history samples, our algorithms also use recursive least squares and the sliding-window technique. Moreover, different from LARS-TD, OSKRLSTD-$\ell_1$ uses the subiteration and online pruning to find the fixed point. These techniques make our algorithms more suitable for online RL problems with a large or continuous state space. The rest of this paper is organized as follows. In Section 2, we present preliminaries and review the LSTD algorithm. Section 3 contains the main contribution of this paper: we derive the OSKRLSTD-$\ell_2$ and OSKRLSTD-$\ell_1$ algorithms in detail. In Section 4, we demonstrate the effectiveness of our algorithms on two 50-state chain problems. Finally, we conclude the paper in Section 5.

2. Background

In this section, we introduce the basic definitions and notations, which will be used throughout the paper without any further mention. We also review the LSTD algorithm, which is needed to establish our algorithms described in Section 3.

2.1. Preliminaries

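In RL and dynamic programming (DP), an underlying sequential decision-making problem is often modeled as a Markov decision process (MDP). An MDP can be defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, d)$ [5], where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P$ is a state transition probability function where $P(s, a, s')$ denotes the probability of transitioning to state $s'$ when taking action $a$ in state $s$, $R$ is a reward function, $\gamma$ is the discount factor, and $d$ is an initial state distribution. For simplicity of presentation, we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite. Given an MDP and a policy $\pi$, the sequence $s_1, r_1, s_2, r_2, \ldots$ is a Markov reward process $(\mathcal{S}, P^{\pi}, R^{\pi}, \gamma, d)$, where $P^{\pi}(s, s') = P(s, \pi(s), s')$ and $R^{\pi}(s) = R(s, \pi(s))$.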

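RL and DP often use the state-value function $V^{\pi}(s)$ to evaluate how good it is for the agent to be in state $s$ under the policy $\pi$. For an MDP, $V^{\pi}$ can be defined as $V^{\pi}(s) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s, \pi\right]$, which must obey the Bellman equation [23],
$$V^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in \mathcal{S}} P^{\pi}(s, s') V^{\pi}(s'), \tag{1}$$
or be expressed in vector form,
$$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}. \tag{2}$$
If $P^{\pi}$ and $R^{\pi}$ are known, $V^{\pi}$ can be solved analytically; that is,
$$V^{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}, \tag{3}$$
where $I$ is the identity matrix.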
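When the transition matrix and reward vector of the Markov reward process are known, the analytic solution mentioned above amounts to solving one linear system. The following is a minimal sketch on a hypothetical 3-state process; the matrix `P`, the vector `R`, and the function name `solve_state_values` are illustrative and not from the paper.

```python
import numpy as np

def solve_state_values(P, R, gamma):
    """Solve V = R + gamma * P @ V, i.e. (I - gamma * P) V = R."""
    n = P.shape[0]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 3-state Markov reward process; state 3 is absorbing with reward 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

V = solve_state_values(P, R, gamma)
print(V)
```

For the absorbing state the value reduces to the geometric series 1 / (1 - gamma), which gives a quick sanity check on the solver.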

However, unlike the case in DP, $P^{\pi}$ and $R^{\pi}$ are unknown in RL. The agent has to estimate $V^{\pi}$ by exploring the environment. Furthermore, many real problems have a large or continuous state space, which makes $V^{\pi}$ hard to express explicitly. To overcome this problem, we often resort to linear function approximation; that is,
$$\hat{V}(s) = \phi(s)^{T} w \quad \text{or} \quad \hat{V} = \Phi w, \tag{4}$$
where $w$ is a parameter vector, $\phi(s)$ is the feature vector of state $s$, and $\Phi$ is a feature matrix. Unfortunately, when approximating $V^{\pi}$ in this manner, there is usually no way to satisfy the Bellman equation exactly, because $R^{\pi} + \gamma P^{\pi} \Phi w$ may lie outside the span of $\Phi$ [5].

2.2. LSTD Algorithm

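The LSTD algorithm presents an efficient way to find $w$ such that $\Phi w$ "approximately" satisfies the Bellman equation [5]. By solving the least-squares problem $\min_{u} \| \Phi u - V^{\pi} \|_{D}^{2}$, we can find a closest approximation $\Phi u$ in the span of $\Phi$ to replace $V^{\pi}$. Then, from (2) and (4), we can use $R^{\pi} + \gamma P^{\pi} \Phi w$ for approximating $V^{\pi}$. That means we can find $w$ by solving the fixed-point equation:
$$w = \arg\min_{u} \| \Phi u - (R^{\pi} + \gamma P^{\pi} \Phi w) \|_{D}^{2}, \tag{5}$$
where $D$ is a nonnegative diagonal matrix indicating a distribution over states. Nevertheless, since $P^{\pi}$ and $R^{\pi}$ are unknown and since $\Phi$ is too large to form anyway in a large or continuous state space, we cannot solve (5) exactly. Instead, given a trajectory $s_1, r_1, s_2, r_2, \ldots, s_t, r_t, s_{t+1}$ following policy $\pi$, LSTD uses $\tilde{\Phi} = [\phi(s_1), \ldots, \phi(s_t)]^{T}$, $\tilde{\Phi}' = [\phi(s_2), \ldots, \phi(s_{t+1})]^{T}$, and $\tilde{R} = [r_1, \ldots, r_t]^{T}$ to replace $\Phi$, $P^{\pi} \Phi$, and $R^{\pi}$, respectively. Then, (5) can be approximately rewritten as
$$w = \arg\min_{u} \| \tilde{\Phi} u - (\tilde{R} + \gamma \tilde{\Phi}' w) \|^{2}. \tag{6}$$
Setting the gradient of the objective with respect to $u$ to zero, we have
$$\tilde{\Phi}^{T} (\tilde{\Phi} u - \tilde{R} - \gamma \tilde{\Phi}' w) = 0. \tag{7}$$
Thus, with $u = w$, the fixed point can be found by
$$w = \left( \tilde{\Phi}^{T} (\tilde{\Phi} - \gamma \tilde{\Phi}') \right)^{-1} \tilde{\Phi}^{T} \tilde{R}. \tag{8}$$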
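The batch LSTD solution reviewed above can be sketched in a few lines. This is an illustrative sketch of the standard linear LSTD estimate, not the kernelized algorithms proposed later; the function name `lstd`, the one-hot features, and the two-state example are assumptions made for the demonstration.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """Batch LSTD: solve phi^T (phi - gamma * phi_next) w = phi^T rewards.

    phi      : (t, k) features of the visited states
    phi_next : (t, k) features of the successor states
    rewards  : (t,)   observed rewards
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Hypothetical deterministic 2-state cycle: s0 -> s1 -> s0, reward 1 in s0.
gamma = 0.9
phi = np.eye(2)                      # one-hot features of s0, s1
phi_next = np.array([[0.0, 1.0],     # successor of s0 is s1
                     [1.0, 0.0]])    # successor of s1 is s0
rewards = np.array([1.0, 0.0])

w = lstd(phi, phi_next, rewards, gamma)
print(w)
```

With one-hot features the weights are the state values themselves, so the result can be compared against the closed form V(s0) = 1 / (1 - gamma^2) and V(s1) = gamma * V(s0).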

3. Regularized OSKRLSTD Algorithms

To overcome the weaknesses of the previous kernel-based LSTD algorithms, we propose two regularized OSKRLSTD algorithms in this section.

3.1. OSKRLSTD-$\ell_2$ Algorithm

Now, we use $\ell_2$ regularization and online sparsification to derive the first OSKRLSTD algorithm, which is called OSKRLSTD-$\ell_2$.

First, we use the kernel trick to kernelize (6). Suppose the feature dictionary , and let denote the corresponding feature matrix. By the Representer Theorem [24], and can be expressed as follows:where and are the coefficient vector of and , respectively. Then, from (6), we haveBy the Mercer Theorem [24], the inner product of two feature vectors can be calculated by . Thus, we can define , , and . On this basis, (10) can be rewritten as

Second, we derive the $\ell_2$-regularized solution of (11). Add an $\ell_2$-norm penalty into (11); that is, where is a regularization parameter. Let ; we have Since , we easily have from (9). Then, the above equation can be rewritten as where is the identity matrix. Thus, can be analytically solved as where and denote where and .

Third, we derive the recursive formulas of and . Under online sparsification, there are two cases: () , , , , and ; () , , , , where , and is expanded aswhere is the dimensional zero vector.

For the first case, (16) can be rewritten as follows:Applying the matrix-inversion lemma [25] for , we getThus, plugging (19) and (20) into (15), we obtain
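The matrix-inversion lemma used in this step can be illustrated in its rank-one (Sherman-Morrison) form. The sketch below is generic and does not reproduce the paper's update (20); it assumes the matrix changes by a single outer product of two vectors `u` and `v`, which need not be equal (in recursive TD methods the two vectors typically differ, so the matrix is not symmetric). All names are illustrative.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} from A_inv = A^{-1} without a new inversion."""
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    denom = 1.0 + v @ Au           # scalar 1 + v^T A^{-1} u, must be nonzero
    return A_inv - np.outer(Au, vA) / denom

# Hypothetical example values.
A = 2.0 * np.eye(3)
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.5, 1.0, 0.0])

updated_inverse = sherman_morrison_update(np.linalg.inv(A), u, v)
direct_inverse = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(updated_inverse, direct_inverse))
```

The recursive form costs a number of operations quadratic in the matrix size per step, compared with cubic cost for a fresh inversion.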

For the second case, (16) can be rewritten as follows:where and are the same as the updated and when the feature dictionary keeps unchanged, , , , and . However, computing , , , and requires caching all history samples, and the computational cost will become more and more expensive as increases. Inspired by the work of Van Vaerenbergh et al. [26], we introduce a sliding window to deal with these problems. Let , where is the window size. We only use the samples in to evaluate , , , and ; that is,Then, similar to those in the first case, and can be derived as follows:where and is the same as the updated when the dictionary keeps unchanged.
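The sliding-window idea can be sketched with a fixed-length buffer: only the most recent transitions are kept, and the kernel values needed when a new dictionary element is added are evaluated on those samples alone. This is a simplified sketch, not the paper's update (23); the class `SlidingWindow`, the scalar-state Gaussian kernel, and its width are assumptions.

```python
from collections import deque
import math

def gaussian_kernel(x, y, width=1.0):
    """Hypothetical Gaussian kernel on scalar states."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))

class SlidingWindow:
    """Keeps only the most recent `size` transitions (state, next_state, reward)."""

    def __init__(self, size):
        # A deque with maxlen discards the oldest sample automatically.
        self.transitions = deque(maxlen=size)

    def add(self, state, next_state, reward):
        self.transitions.append((state, next_state, reward))

    def kernel_column(self, center, kernel):
        """Kernel values between a new dictionary element and the windowed states."""
        return [kernel(center, s) for s, _, _ in self.transitions]

window = SlidingWindow(size=5)
for t in range(1, 9):                      # transitions 1->2, 2->3, ..., 8->9
    window.add(float(t), float(t + 1), 0.0)

column = window.kernel_column(center=8.0, kernel=gaussian_kernel)
print(len(window.transitions), column)
```

Memory and per-step cost then depend on the window size rather than on the length of the whole trajectory.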

Finally, we summarize the whole algorithm in Algorithm 1.

Algorithm 1: The OSKRLSTD-$\ell_2$ algorithm.
Input: to be evaluated, , , ,
for    do
  if    then
    Initialize ,
    Take given by , and observe ,
    Initialize
    Initialize
    Initialize
  else
    Take given by , and observe ,
    Update ,
    Update , by (21) and (20)
    if   satisfies the sparsification condition then
      , ,
      Compute , , and by (23)
      Update , by (25) and (24)
    end if
  end if
end for

Remark 1. Here, we do not restrict the OSKRLSTD-$\ell_2$ algorithm to a specific online sparsification method. It can therefore be combined with many popular sparsification methods, such as the novelty criterion (NC) [27] and the ALD criterion.
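As an illustration of one such sparsification method, the following sketch implements an ALD-style test in its commonly used form: a new state is added to the dictionary only if the squared residual of approximating its feature vector by the current dictionary exceeds a threshold. The Gaussian kernel, its width, the threshold value, and the small `jitter` term are assumptions; the sparsification condition actually used in the simulations may differ.

```python
import numpy as np

def gaussian_kernel(x, y, width=1.0):
    """Hypothetical Gaussian kernel; the width is an assumed parameter."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * width ** 2)))

def ald_residual(dictionary, x, kernel, jitter=1e-10):
    """Squared residual k(x, x) - k_D(x)^T K_D^{-1} k_D(x)."""
    K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
    k = np.array([kernel(a, x) for a in dictionary])
    # The jitter keeps the solve stable when dictionary elements are close.
    coeffs = np.linalg.solve(K + jitter * np.eye(len(dictionary)), k)
    return kernel(x, x) - float(k @ coeffs)

def build_dictionary(states, kernel, threshold):
    """Scan states in order and keep those that fail the linear-dependency test."""
    dictionary = [states[0]]
    for x in states[1:]:
        if ald_residual(dictionary, x, kernel) > threshold:
            dictionary.append(x)
    return dictionary

states = [[0.0], [0.01], [3.0], [3.02], [6.0]]
dictionary = build_dictionary(states, gaussian_kernel, threshold=0.1)
print(dictionary)
```

This version rebuilds the kernel matrix at every test for clarity; an online implementation would update its inverse recursively instead.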

Remark 2. Although the OSKRLSTD-$\ell_2$ algorithm is designed for infinite-horizon tasks, it can be modified for episodic tasks. When is an absorbing state, it only requires setting temporarily and setting as the start state of the next episode.

Remark 3. Our simulation results show that a large sliding window does not improve the convergence performance of the OSKRLSTD-$\ell_2$ algorithm. Thus, to save memory and reduce the computational cost, the sliding-window size should be set to a small integer.

3.2. OSKRLSTD-$\ell_1$ Algorithm

In this subsection, we use $\ell_1$ regularization and online sparsification to derive the second OSKRLSTD algorithm, which is called OSKRLSTD-$\ell_1$.

First, we derive the $\ell_1$-regularized solution of (11). Add an $\ell_1$-norm penalty into (11); that is, where is a regularization parameter. However, is not differentiable. Similar to Painter-Wakefield and Parr in [28], we resort to the subdifferential of ; that is, where is the set-valued function defined component-wise as Let , so that Since , we also have from (9). Then, the above equation can be rewritten as where has the same meaning as . To avoid the singularity of and further reduce the complexity of the subsequent derivation, we introduce into both sides; that is, where is a regularization parameter. Obviously, the left-hand side of (31) is the same as that of (14). Thus, from (16), the above equation can be rewritten as Then, we have the following fixed-point equation: where denotes Unfortunately, here, cannot be solved analytically.

Second, we investigate how to find the fixed point of (33). In $\ell_1$-regularized LSTD algorithms [5, 29], researchers often used the LASSO method to tackle this problem. However, the LASSO method is inherently a batch method and is unsuitable for online learning. Instead, we resort to the fixed-point subiteration method introduced in [13]. We first use the sign function to replace in (33). Then, we can construct the following subiteration: where denotes the th subiteration and is initialized to since the fixed point will be close to if and are small. If the subiteration number reaches a preset value or is less than or equal to a preset threshold , the subiteration will stop. From (32) and (28), if , should be 0. Obviously, the replacement of makes lose the ability to select features. To remedy this situation, after the whole subiteration, we remove the weakly dependent elements from according to the magnitude of ; that is, where denotes the operation to remove the elements indexed by the set , which is determined by where is a preset threshold. Note that we do not remove the last element of , since is probably very small, especially when is just added to . Similarly, we perform and to remove the weakly dependent coefficients. From (16), also requires removing some rows and columns. Unfortunately, we cannot use the method in [30] to do this like Chen et al. in [13], since is not a symmetric matrix. Considering that will remove the corresponding elements if is pruned, we directly perform to remove the rows and columns indexed by . Although this method may bring some bias into , our simulation results show that it is feasible and effective. The whole fixed-point subiteration and online pruning algorithm is summarized in Algorithm 2.
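The subiteration and pruning steps described above can be illustrated with a generic sketch. It does not reproduce updates (35) to (37); it assumes a fixed-point equation of the generic form alpha = alpha0 - lam1 * M_inv @ sign(alpha), where `alpha0` plays the role of the starting solution and `M_inv` stands in for the recursively maintained matrix. The stopping tolerance and pruning threshold defaults mirror the values reported in the simulation settings, while `max_iters` and the example data are hypothetical.

```python
import numpy as np

def subiterate_and_prune(M_inv, alpha0, lam1, max_iters=10, tol=0.1, prune_below=0.4):
    """Fixed-point subiteration with the sign function, then magnitude pruning.

    Assumed generic form: alpha <- alpha0 - lam1 * M_inv @ sign(alpha).
    """
    alpha = alpha0.copy()                       # start from the unpenalized solution
    for _ in range(max_iters):
        new_alpha = alpha0 - lam1 * (M_inv @ np.sign(alpha))
        converged = np.linalg.norm(new_alpha - alpha) <= tol
        alpha = new_alpha
        if converged:
            break
    keep = np.abs(alpha) >= prune_below         # drop weakly dependent coefficients
    keep[-1] = True                             # never prune the newest element
    return alpha[keep], np.flatnonzero(keep)

# Hypothetical data.
M_inv = 0.5 * np.eye(4)
alpha0 = np.array([1.0, -0.8, 0.3, 0.5])

alpha, kept = subiterate_and_prune(M_inv, alpha0, lam1=0.3)
print(alpha, kept)
```

Because the sign function is piecewise constant, the iteration stops changing once the signs settle, which is why a small cap on the number of subiterations is usually sufficient; when a coefficient is close to zero the signs can oscillate, and the cap then acts as the stopping rule.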

Algorithm 2: Fixed-point subiteration and online pruning.
Input: , , , , , , ,
for    to    do
  Update by (35)
  if    then
    Break out of the loop
  end if
end for
Determine the index set by (37)
Perform , , and

Remark 4. Our simulation results show that Algorithm 2 converges in a few iterations. Thus, Algorithm 2 does not become the computational bottleneck of the OSKRLSTD-$\ell_1$ algorithm, and the maximum subiteration number can be set to a small positive integer.

Third, we derive the recursive formulas of and . Although the dictionary can be pruned by using Algorithm 2, it still risks growing rapidly if new samples are allowed to be added continually. Thus, the conventional sparsification method must also be considered here. Similar to Section 3.1, there are two cases under online sparsification. Since and have the same definitions as and in the OSKRLSTD-$\ell_2$ algorithm, we can directly use (20) and (24) for updating and rewrite (21) and (25) for updating . If does not satisfy the sparsification condition, will be updated by Otherwise, will be updated by where , , , and are also calculated by (23). Since , , and will be pruned by Algorithm 2 after the update, it is important to note that and in (39) denote and updated by but unpruned by . Likewise, when (24) is used here, has the same meaning.

Finally, we summarize the whole algorithm in Algorithm 3. For episodic tasks, the modification is the same as Remark 2. In addition, similar to Remark 3, the sliding-window size should also be set to a small integer.

Algorithm 3: The OSKRLSTD-$\ell_1$ algorithm.
Input: to be evaluated, , , , , , , ,
for    do
  if    then
    Initialize ,
    Take given by , and observe ,
    Initialize
    Initialize
    Initialize
    Perform Algorithm 2
  else
    Take given by , and observe ,
    Update ,
    Update , by (38) and (20)
    ,
    Perform Algorithm 2
    if   satisfies the sparsification condition  then
      , ,
      Compute , , and by (23)
      Update , by (39) and (24)
      Perform Algorithm 2
    end if
  end if
end for

Remark 5. By pruning the weakly dependent features, the OSKRLSTD-$\ell_1$ algorithm can yield a much sparser solution than the OSKRLSTD-$\ell_2$ algorithm.

4. Simulations

In this section, we use a nonnoise chain and a noise chain [2, 20, 31] to demonstrate the effectiveness of our proposed algorithms. For comparison purposes, RLSTD [1] and SKRLSTD [21] algorithms are also tested in the simulations. To analyze the effect of regularization and online pruning on the performance of our algorithms, the OSKRLSTD- algorithm with and the OSKRLSTD- algorithm with (called OSKRLSTD-0 and OSKRLSTD-, resp.) are tested here, too. In addition, the effects of the sliding-window size on the performance of our algorithms and OSKRLSTD- are evaluated as well.

4.1. Simulation Settings

As shown in Figure 1, in both chain problems, each chain consists of 50 states, which are numbered from 1 to 50. For each state, there are two actions available, that is, “left” (L) and “right” (R). Each action succeeds with probability 0.9, changing the state in the intended direction, and fails with probability 0.1, changing the state in the opposite direction. The two boundaries of each chain are dead-ends, and the discount factor of each chain is set to 0.9. For the nonnoise chain, the reward is 1 only in states 10 and 41, whereas, for the noise chain, the reward is corrupted by an additive Gaussian noise . Due to the symmetry, the optimal policy for both chains is to go right in states 1–9 and 26–41 and left in states 10–25 and 42–50. Here, we use it as the policy to be evaluated. Note that the state transition probabilities are available only for solving the true state-value functions , and they are assumed to be unknown for all algorithms compared here.
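The nonnoise chain and its true state values can be reproduced with a short sketch. Two points are assumptions because the text does not fix them: the reward is treated as a function of the current state, and a move off either end of the chain leaves the state unchanged (the "dead-end" behavior). The constant and function names are illustrative.

```python
import numpy as np

N_STATES, GAMMA, P_SUCCESS = 50, 0.9, 0.9

def chain_policy(state):
    """Evaluated policy: right (+1) in states 1-9 and 26-41, left (-1) otherwise."""
    return 1 if (1 <= state <= 9 or 26 <= state <= 41) else -1

def build_chain():
    """Transition matrix and reward vector under the evaluated policy."""
    P = np.zeros((N_STATES, N_STATES))
    R = np.zeros(N_STATES)
    for s in range(1, N_STATES + 1):
        step = chain_policy(s)
        # Assumption: moving past a boundary keeps the agent in place.
        intended = min(max(s + step, 1), N_STATES)
        opposite = min(max(s - step, 1), N_STATES)
        P[s - 1, intended - 1] += P_SUCCESS
        P[s - 1, opposite - 1] += 1.0 - P_SUCCESS
    R[[9, 40]] = 1.0               # assumption: reward 1 while in states 10 and 41
    return P, R

P, R = build_chain()
V_true = np.linalg.solve(np.eye(N_STATES) - GAMMA * P, R)
print(np.round(V_true[:12], 3))
```

Under these assumptions the chain and the policy are mirror-symmetric about the midpoint, so the computed values for states s and 51 - s coincide, which is a convenient check before using `V_true` as the reference for error curves.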

In the implementations of all tested algorithms for both chain problems, the settings are summarized as follows: (i) For all OSKRLSTD algorithms, the Mercer kernel is defined as , the sparsification condition is defined as , and the sliding-window size is set to 5. Besides, for the OSKRLSTD- algorithm, the regularization parameters and are set to 0.8 and 0.3, the maximum subiteration number is set to , the precision threshold is set to 0.1, and the pruning threshold is set to 0.4; for the OSKRLSTD- algorithm, , , and are the same as those in the OSKRLSTD- algorithm; for the OSKRLSTD- algorithm, is set to 1. (ii) For the SKRLSTD algorithm, the Mercer kernel and the sparsification condition are the same as those in each OSKRLSTD algorithm. (iii) For the RLSTD algorithm, the feature vector consists of 19 Gaussian radius basis functions (GRBFs) plus a constant term 1, resulting in a total of 20 basis functions. The GRBF has the same definition as the Mercer kernel used in each OSKRLSTD algorithm, and the centers of GRBFs are uniformly distributed over . In addition, the variance matrix of RLSTD is initialized to 0.4, where is the 20 × 20 identity matrix. (iv) In the simulations, each algorithm performs 50 runs, each run includes 100 episodes, and each episode is truncated after 100 time steps. In particular, the SKRLSTD algorithm requires an extra run for offline sparsification before each regular run.

4.2. Simulation Results

We first report the comparison results of all tested algorithms with the simulation settings described in Section 4.1. Their learning curves are shown in Figure 2. At each episode, the root mean square error (RMSE) of each algorithm is calculated by , where is solved by (1) and is the approximate value of the th run. From Figure 2, we can observe that (i) OSKRLSTD- and OSKRLSTD- can obtain the similar performance as RLSTD and converge much faster than SKRLSTD. (ii) Without regularization, the performance of OSKRLSTD-0 becomes very poor, especially in the noise chain. In contrast, OSKRLSTD- and OSKRLSTD- still perform well. (iii) The performance of OSKRLSTD- is only slightly better than that of OSKRLSTD-, which indicates that online pruning has little effect on the performance. Figure 3 illustrates approximated by all tested algorithms at the final episode. Clearly, OSKRLSTD-0 has lost the ability to approximate of the noise chain. Figure 4 shows the dictionary growth curves of all tested algorithms. Compared with RLSTD and SKRLSTD, all OSKRLSTD algorithms can construct the dictionary automatically, and OSKRLSTD- yields a much sparser dictionary. Figure 5 shows the average subiterations per time step in OSKRLSTD- and OSKRLSTD-. As episodes increase, the subiterations decline gradually. In addition, online pruning can reduce the subiterations significantly. Even in the noise chain, the subiterations are small. Finally, the main simulation results of all tested algorithms at the final episode are summarized in Table 1.

Next, we evaluate the effect of the sliding-window size on our proposed algorithms and OSKRLSTD- with . The logarithmic RMSEs of each algorithm at the final episode are illustrated in Figure 6. Note that the parameter settings of these algorithms are the same as those described in Section 4.1 except for . From Figure 6, OSKRLSTD- and OSKRLSTD- obviously become worse rather than better as the window size increases, and OSKRLSTD- has a strong adaptability to different window sizes. The reason for this result is analyzed as follows: From the derivation of our algorithms, the influence of the window size is mainly manifest in . Since here is calculated by recursive update instead of matrix inversion and samples are used one by one, using too many history samples together may increase the calculation error. In OSKRLSTD-, a moderate regularization parameter can relieve the influence of this error. In contrast, in OSKRLSTD- and OSKRLSTD-, the subiteration may expand the influence. Especially for OSKRLSTD-, online pruning can introduce the new error, which further worsens the convergence performance. To verify the above analysis, we reset , , and for OSKRLSTD- and OSKRLSTD- and reevaluate the effect of the window size. The new results are illustrated in Figure 7. As expected, OSKRLSTD- and OSKRLSTD- can also adapt to . Nevertheless, there is still no proof that a big window size can help improve the convergence performance of OSKRLSTD- and OSKRLSTD-. Thus, as stated in Remark 3, is suggested to be set to a small integer in practice.

5. Conclusion

As an important approach for policy evaluation, LSTD algorithms can use samples more efficiently and eliminate all step-size parameters. But they require users to design the feature vector manually and often require many features to approximate state-value functions. Recently, several works have addressed these issues by combining LSTD with sparse kernel methods. However, these works do not consider regularization, and their sparsification processes are batch or offline. In this paper, we propose two online sparse kernel recursive least-squares TD algorithms with $\ell_2$ and $\ell_1$ regularization, that is, OSKRLSTD-$\ell_2$ and OSKRLSTD-$\ell_1$. By using the Bellman operator along with the projection operator, our derivation is simpler. By combining online sparsification, $\ell_2$ and $\ell_1$ regularization, recursive least squares, a sliding window, and the fixed-point subiteration, our algorithms not only can construct the feature dictionary online but also can avoid overfitting and eliminate the influence of noise. These advantages make them more suitable for online RL problems with a large or continuous state space. In particular, compared with the OSKRLSTD-$\ell_2$ algorithm, the OSKRLSTD-$\ell_1$ algorithm can yield a much sparser dictionary. Finally, we illustrate the performance of our algorithms and compare them with the RLSTD and SKRLSTD algorithms in several simulations.

There are also some interesting topics to be studied in future work: (i) How to select proper regularization parameters should be investigated. (ii) A more thorough simulation analysis is needed, including an extension of our algorithms to learning control problems. (iii) Eligibility traces could be incorporated to further improve the performance of our algorithms. (iv) The convergence and prediction error bounds of our algorithms should be analyzed theoretically.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.


Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant nos. 61300192 and 11261015, the Fundamental Research Funds for the Central Universities under Grant no. ZYGX2014J052, and the Natural Science Foundation of Hainan Province, China, under Grant no. 613153.