The 8th Joint Conference on Mathematics and Computer Science, July 14–17, 2010, Komárno, Slovakia
NON-PARAMETRIC VALUE FUNCTION APPROXIMATION
IN ROBOTICS
HUNOR JAKAB, BOTOND BÓCSI, AND LEHEL CSATÓ
Abstract. The use of non-parametric function approximators for value function representation, in conjunction with robotic control and reinforcement learning, is a promising approach toward adaptive algorithms in complex applications. Due to their non-parametric and probabilistic nature, Gaussian process-based algorithms have greater flexibility than parametric function approximation methods. In this paper we present several ways of using Gaussian process regression for the approximation of value functions. We combine the non-parametric approximation both with traditional value-function based reinforcement learning and with policy gradient algorithms. We test the performance of the presented algorithms on two benchmark control tasks. The results show that the proposed non-parametric function approximations can be used efficiently to extend reinforcement learning methods to continuous domains: when applied within policy gradient methods, they lead to an increase both in convergence speed and in peak performance.
1. Introduction
Function approximation in reinforcement learning (RL) algorithms has
been studied widely in the literature: attempts to generalize RL methods to
continuous state spaces were described in a number of references [6, 2, 22, 4].
The majority of these methods used mean squared error minimization as a learning rule for the function approximation; however, theoretical convergence could be proven only for specific cases – e.g. non-bootstrapping methods, or linear function approximation with carefully chosen features [14]. The state-of-the-art algorithms, such as Q-learning, state-action-reward-state-action (SARSA), temporal-difference (TD) learning [20], and the use of eligibility traces within the TD methods, exhibit unexpected behaviour when applied
to approximations of the action-value or state-value function. Recent research has shown that temporal difference-based RL methods with linear and smooth non-linear function approximation provably converge when, in place of the conventional Bellman error used in classical algorithms, a different objective function – the projected Bellman error – is used [21, 14].
The majority of methods exploiting an approximation of the value function use parametric functions [6, 22]. By applying non-parametric approximations, we extend the value function representation to more complex, high-dimensional, and possibly continuous state-action spaces. In this paper we use Gaussian process regression (GPR) [17] as the non-parametric model to approximate the action-value function. We prefer GPR over other non-parametric methods, e.g. support vector regression [18] or decision-tree based methods, since GPR also provides an appropriate treatment of uncertainty. A second favourable property is that one can choose a kernel that provides an appropriately smooth approximation to the unknown value function. Furthermore, the analytical tractability of the Gaussian distribution leads to the analytical tractability of the posterior distribution within a Bayesian inference scheme with additive Gaussian noise. The main advantage of Gaussian processes is that no hard assumptions are made about the function we approximate, and that they have larger expressive power than parametric approaches. The non-parametric property becomes a drawback when the dataset is too large; we address this issue by employing a sparse representation of the resulting process. In this paper we apply GPR value-function approximation in conjunction with policy gradient algorithms and with value-function based RL algorithms. In the RL framework data are processed in an online manner – data originate from rewards observed along a trajectory – therefore an online extension of GP learning is desired. We present online updates for Gaussian processes based on [8]. We study the drawbacks and benefits of non-parametric approximation by testing the aforementioned methods on robot
control tasks.
The remainder of this paper is structured as follows: in Section 2, we
introduce online Gaussian process regression and sketch two possibilities of
how it can be used to approximate value functions. Details about GPR for value-function approximation, in conjunction with both value-function based RL and policy gradient methods, are presented in Section 3. In Section 4, we show a number of simulated control tasks which we use as a testbed for our algorithms, and we also present the results of the experiments. Conclusions are drawn in Section 5.
2. Online Gaussian process regression
Gaussian processes (GP) are non-parametric methods used in machine learning for regression and classification [17]. They inherit and extend the probabilistic nature of the Gaussian random variable, and therefore provide not only point estimates but also the uncertainty of those estimates. Gaussian processes can be viewed from two perspectives. The first one, the function-space view, defines a GP as a stochastic process denoted by f(x). Although stochastic processes are usually defined over time, GPs can be defined over a more general input space – e.g. x ∈ R^l, with l the dimension of the space. In any case a GP is completely specified by its mean function m(x) and its covariance function k(x, x'), denoted by f(x) ∼ GP(m(x), k(x, x')):
\[
m(x) = \mathbb{E}[f(x)], \qquad
k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].
\]
The function-space view of the GPs does not provide an algorithmic tool for
inference. Hence we follow the second definition, where a GP is a collection
of random variables, any finite number of which have a joint Gaussian distribution [17]. With this definition we use GPs for regression: let us consider a training set D = {(x_t, Q̂_t)}_{t=1}^{n} having n elements, with x_t and Q̂_t the training locations and the labels respectively.¹ Given the training set D, we want to find f(x_*) ≐ f_{n+1} corresponding to a test point x_*, knowing that
\[
\begin{bmatrix} \hat{Q} \\ f_{n+1} \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} m(x) \\ m(x_*) \end{bmatrix},
\begin{bmatrix} k(x, x) & k(x, x_*) \\ k(x_*, x) & k(x_*, x_*) \end{bmatrix}
\right),
\]
where x = [x_1, ..., x_n]^T, Q̂ = [Q̂_1, ..., Q̂_n]^T, and k(·, ·) is a covariance function. The solution, based on the properties of the conditional Gaussian distribution, is [5]:²
\[
f_{n+1} \mid D, x_* \sim \mathcal{N}\!\left( k(x_*, x)\, k(x, x)^{-1} \hat{Q},\; k(x_*, x_*) - k(x_*, x)\, k(x, x)^{-1} k(x, x_*) \right).
\]
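This conditional can be written down directly as a few matrix operations. Below is a minimal batch-GP prediction sketch in Python/NumPy; the squared-exponential kernel, the noise level, and all function names are illustrative assumptions of ours, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two sets of inputs (rows)."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, Q_hat, X_star, noise_var=1e-2):
    """Posterior mean and variance of f(x_*) given training data (X, Q_hat),
    following the conditional-Gaussian formula above (zero prior mean)."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # k(x, x) plus observation noise
    K_star = rbf_kernel(X_star, X)                      # k(x_*, x)
    K_ss = rbf_kernel(X_star, X_star)                   # k(x_*, x_*)
    K_inv = np.linalg.inv(K)                            # a Cholesky solve is preferable in practice
    mean = K_star @ K_inv @ Q_hat
    cov = K_ss - K_star @ K_inv @ K_star.T
    return mean, np.diag(cov)

# toy usage: noisy samples of a one-dimensional function
X = np.linspace(0, 5, 20).reshape(-1, 1)
Q_hat = np.sin(X).ravel() + 0.1 * np.random.randn(20)
mu, var = gp_predict(X, Q_hat, np.array([[2.5]]))
```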
Since the mean function m(x) is usually zero, the only unspecified element is the covariance function. A wide range of covariance functions exists; the only requirement is that they be positive definite [18]. Covariance functions can incorporate prior knowledge – e.g. smoothness, periodicity – about the learning task. For example, when the inputs of the covariance function are state-action pairs, as they are in RL, it makes sense to construct a joint kernel from the composition of two kernel functions, each of which appropriately captures the covariance properties of states and actions respectively, as suggested by [12]. For details on covariance functions see [19].

¹The unusual notation Q̂ for the labels is explained in Section 3.
²Note that if we want to calculate the labels of multiple points – x_* is vector valued – the same calculations apply.
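As a concrete illustration of such a composite kernel, the sketch below multiplies a kernel over states with a kernel over discrete actions. The particular choice of a Gaussian state kernel and a Kronecker-delta action kernel is our own assumption, meant only to show the construction suggested in [12]; the product of two positive definite kernels is again positive definite.

```python
import numpy as np

def state_kernel(s1, s2, length_scale=1.0):
    """Squared-exponential kernel on the continuous state part."""
    return np.exp(-0.5 * np.sum((s1 - s2) ** 2) / length_scale ** 2)

def action_kernel(a1, a2):
    """Kronecker-delta kernel on a discrete action set."""
    return 1.0 if a1 == a2 else 0.0

def joint_kernel(x1, x2):
    """k_q((s, a), (s', a')) = k_s(s, s') * k_a(a, a') -- a valid product kernel."""
    (s1, a1), (s2, a2) = x1, x2
    return state_kernel(np.asarray(s1), np.asarray(s2)) * action_kernel(a1, a2)

# example: two mountain-car style state-action pairs (position, velocity, action id)
pair_a = (np.array([-0.5, 0.01]), 1)
pair_b = (np.array([-0.4, 0.02]), 1)
print(joint_kernel(pair_a, pair_b))
```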
We now define online Gaussian process regression [7] and present the inference process. The key idea is to perform Bayesian inference where the prior is the regression model defined by all the points processed in the past. We then calculate the posterior distribution given the new data point; since this posterior is Gaussian, no approximation is needed.
We introduce the following notation: k_q denotes a covariance function – the q subscript indicates that k_q(·, ·) operates on states or state-action pairs, whose values are usually denoted as Q-values. Let K_q be the kernel matrix given by K_q(i, j) = k_q(x_i, x_j). Assuming an online – or sequential – setting, we suppose that |D| = n inputs – called training points – have already been processed, and we denote a new data point by x_{n+1}. We define the vector k_{n+1} of covariances between the new point and the training points from D as
\[
k_{n+1} = [\,k_q(x_1, x_{n+1}), \ldots, k_q(x_n, x_{n+1})\,].
\]
Given the dataset D with n previous observations, the covariance function k_q, and a new data point x_{n+1}, the joint probability distribution of the previous labels Q̂ and the new predicted label f_{n+1} is
\[
\begin{bmatrix} \hat{Q} \\ f_{n+1} \end{bmatrix}
\sim \mathcal{N}\!\left( 0,\;
\begin{bmatrix} K_q + \Sigma_n & k_{n+1}^T \\ k_{n+1} & k_q(x_{n+1}, x_{n+1}) \end{bmatrix}
\right),
\]
where Σ_n is the covariance matrix of the observation noise. Conditioned on the previous points, f_{n+1} is a Gaussian random variable with predictive mean and predictive variance given by eq. (1):
\[
f_{n+1} \mid D, x_{n+1} \sim \mathcal{N}\!\left( \mu_{n+1}, \sigma_{n+1}^2 \right), \qquad
\mu_{n+1} = k_{n+1} \alpha_n, \qquad
\sigma_{n+1}^2 = k_q(x_{n+1}, x_{n+1}) - k_{n+1} C_n k_{n+1}^T, \tag{1}
\]
where α_n and C_n are the parameters of the GP – for details see [17] – with the following form:
\[
\alpha_n = [K_q + \Sigma_n]^{-1} \hat{Q}, \qquad
C_n = [K_q + \Sigma_n]^{-1}. \tag{2}
\]
The expressions in eq. (2) contain matrix inversions of order n. These inversions have a high computational complexity of O(n^3). To get rid of this costly operation, an iterative update of α_n and C_n has been introduced [8]. At every time-step the parameters α_{n+1} and C_{n+1} depend only on the previous values α_n, C_n, and the new data point x_{n+1}. After the update, we add x_{n+1} to the training set D. The update procedure is:
\[
\begin{aligned}
\alpha_{n+1} &= \alpha_n + q_{n+1}\, s_{n+1}, \\
C_{n+1} &= C_n + r_{n+1}\, s_{n+1} s_{n+1}^T, \\
s_{n+1} &= C_n k_{n+1}^T + e_{n+1}, \\
e_{n+1} &= [0, 0, \ldots, 0, 1]^T \in \mathbb{R}^{n+1},
\end{aligned} \tag{3}
\]
where α_n, C_n, and C_n k_{n+1}^T are implicitly extended with zeros to match the new dimension n+1. To perform the update, we have to evaluate q_{n+1} and r_{n+1} from eq. (4), for which we need the marginal predictive distribution of the new data point. This is a Gaussian with mean µ_{n+1} and variance σ_{n+1}^2 given by eq. (1). We assume additive Gaussian observation noise with zero mean and variance σ_0^2:
\[
q_{n+1} = \frac{\hat{Q}_{n+1} - \mu_{n+1}}{\sigma_0^2 + \sigma_{n+1}^2}
\qquad \text{and} \qquad
r_{n+1} = -\frac{1}{\sigma_0^2 + \sigma_{n+1}^2}. \tag{4}
\]
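The whole recursion fits in a few lines of code. The sketch below is a simplified version in Python/NumPy: the class name, the fixed squared-exponential kernel, and the noise variance are our own illustrative choices, hyper-parameter optimization and the sparsification of [9] are omitted, and we use the parametrization of [8], in which C is stored with the opposite sign to eq. (2), so the predictive variance reads k_q(x, x) + k C k^T.

```python
import numpy as np

class OnlineGP:
    """Minimal online GP regression in the spirit of eqs. (1)-(4); no sparsification."""
    def __init__(self, kernel, noise_var=0.1):
        self.kernel, self.noise_var = kernel, noise_var   # noise_var plays the role of sigma_0^2
        self.X = []                                       # processed training inputs
        self.alpha = np.zeros(0)                          # alpha_n
        self.C = np.zeros((0, 0))                         # C_n (negative-sign convention of [8])

    def predict(self, x):
        """Predictive mean and variance of f at x, cf. eq. (1)."""
        k = np.array([self.kernel(xi, x) for xi in self.X])
        mu = k @ self.alpha
        var = self.kernel(x, x) + k @ self.C @ k          # sign flipped because C is stored negated
        return mu, var

    def update(self, x, q_label):
        """Absorb one new observation (x, q_label), cf. eqs. (3)-(4)."""
        mu, var = self.predict(x)
        q = (q_label - mu) / (self.noise_var + var)       # q_{n+1}
        r = -1.0 / (self.noise_var + var)                 # r_{n+1}
        k = np.array([self.kernel(xi, x) for xi in self.X])
        s = np.append(self.C @ k, 1.0)                    # s_{n+1}, with the implicit zero-padding
        self.alpha = np.append(self.alpha, 0.0) + q * s
        n = len(self.X)
        C_ext = np.zeros((n + 1, n + 1))                  # zero-padded C_n
        C_ext[:n, :n] = self.C
        self.C = C_ext + r * np.outer(s, s)
        self.X.append(x)

# toy usage with a squared-exponential kernel
rbf = lambda a, b: np.exp(-0.5 * np.sum((np.asarray(a) - np.asarray(b)) ** 2))
gp = OnlineGP(rbf)
for x_t in np.random.randn(20, 2):
    gp.update(x_t, np.sin(x_t[0]))
print(gp.predict(np.zeros(2)))
```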
Note that the hyper-parameters of the Gaussian process – the parameters of the kernel function – do not have to be set a priori; rather, they can be optimized [17]. We mention that the presented method can be extended by applying online sparsification [9] in order to avoid the continual increase in the number of parameters, a useful extension for real-time applications.
To show that online GPR can be used in conjunction with RL, we approximated the value function on a simple RL learning task, pole balancing – see Section 4. Figure 1 shows a comparison between the two value functions estimated for a fixed policy by fixed-resolution dynamic programming and by GPR, respectively. Although the two value functions are not exactly the same, they are similar. Moreover, since in value-function based RL methods the absolute values are not as important as the relative values, the performance of the algorithms will not be influenced negatively, as confirmed in Section 4. Next we describe the use of GPR in RL.
3. Combining Gaussian process regression and reinforcement
learning
Recently, GPs have gained remarkable interest in the RL framework due to their advantageous properties, and they have been used with success in several contexts. Temporal-difference learning has been extended to the case where the value function is modeled by a GP [11]. This extension offers a Bayesian treatment of the RL problem; an online algorithm, called GPTD, has been introduced and used to solve a two-dimensional maze navigation problem. Later, the same authors presented a SARSA-based extension of GPTD, named GPSARSA, which allows the selection of actions and the gradual improvement of policies without requiring a world model [12].

Figure 1. Value functions for the pole balancing task evaluated on a fixed proportional-derivative control policy. Value function approximations using (a) tabular dynamic programming and (b) Gaussian process regression.

In [17] the value function and the system dynamics are modeled with GPs, and a policy iteration algorithm is proposed. Only a batch version of the algorithm has been developed and applied to the
mountain car problem. The main contribution of that work is showing that GPs are feasible in the RL framework. [13] applied GPs in a policy gradient framework: a GP is used to approximate the gradient of the expected return, which allows a fully Bayesian view of the gradient estimation. The algorithm has been applied to the bandit problem. [10] modeled both the value function and the action-value function with GPs, and proposed a dynamic programming [3] solution using these approximations, called GPDP. A different GP is used for every possible action; therefore, the method is feasible only when the action space is small. GPDP has been applied only with binary actions, namely to solve a swing-up control problem.
Next we present two methods to approximate the state-action value function with GPR. The approximation is used with traditional value-function based learning – in Section 3.1 – and with gradient-based RL algorithms – in Section 3.2.
3.1. Temporal difference learning. Value-function based RL methods are known to exhibit convergence issues when paired with function approximation. The main problem is the ε-greedy action selection. This type of action selection is based on the assumption that after each policy iteration step the resulting policy is better than the previous one, hence the selected actions will improve in time. This condition cannot be guaranteed when we use function approximation: small changes in the value function can cause drastic changes in the resulting policy, which can lead to the divergence of the learning algorithm. Using methods such as Q-learning, SARSA, or TD(λ) with function approximation can also cause divergence in the off-policy setting. Off-policy methods follow one policy when executing actions and another when evaluating the current policy; the underlying stationary state distribution will then not be the same as the one that should be used to weight the individual backups – the one that results from the ε-greedy policy. Details about the divergence of off-policy methods can be found in [20]. Despite these disadvantages, parametric function approximation has been used successfully in practice in many value-function based RL methods. We now present a GPR approximation of the state-action value function and a corresponding Q-learning based algorithm.
Let us consider an episode τ consisting of the state-action pairs {(s_t, a_t)}_{t=1}^{H}. For each state-action pair, we calculate a Monte Carlo estimate of the action-value function – a.k.a. the Q-value function – based on the sum of discounted rewards until the end of the episode:
\[
Q^m(s_t, a_t) = \sum_{i=0}^{m-1} \gamma^i R(s_{t+i}, a_{t+i}) + \gamma^m \max_a Q_{pred}(s_{t+m}, a), \tag{5}
\]
where R(s_t, a_t) denotes the reward observed at time t, γ ∈ (0, 1] is a discount factor, and Q_pred is the approximated value of the action-value function using GP inference. One can see that if m is set to 1, we get the simple one-step TD return on the right-hand side of eq. (5). If m is chosen to be H, where H is the length of the episode, then we get the full Monte Carlo estimate of the corresponding Q-value. Note that by definition we assume that Q_pred(s_t, a_t) = 0 for every t > H. Setting m = 1 and defining α as a learning rate, this formulation of the target value is equal to the standard Q-learning update rule [20]:
\[
Q^1(s_t, a_t) = (1 - \alpha)\, Q_{pred}(s_t, a_t) + \alpha \left( R(s_t, a_t) + \gamma \max_a Q_{pred}(s_{t+1}, a) \right). \tag{6}
\]
Further, we combine the Q-learning update rule – eq. (6) – and the online Gaussian process update – eq. (3) and eq. (4) – to obtain the exact update for the GPR when it is used in an RL framework. Notice that the only unclear quantity is the expression of eq. (4). Inserting eq. (6) into eq. (4), we completely specify the GPR update:
\[
q_{n+1}
= \frac{\hat{Q}^1(s_t, a_t) - Q_{pred}(s_t, a_t)}{\sigma_0^2 + \sigma_{n+1}^2}
= \frac{\alpha \left( R(s_t, a_t) + \gamma \max_a Q_{pred}(s_{t+1}, a) - Q_{pred}(s_t, a_t) \right)}{\sigma_0^2 + \sigma_{n+1}^2}.
\]
Using the above expression to incorporate the TD error into future predictions and into the expanded covariance matrix corresponds to the stochastic averaging that takes place in tabular Q-learning. We mention that, based on the same principles, value function approximation with GPR can be used in conjunction with many other value-based RL algorithms as well; we have chosen to use it with Q-learning.
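To make the combination concrete, the following sketch shows how the one-step target of eq. (6) can be fed into an online GP over state-action pairs. The function name, the ε-greedy threshold, the learning rate, and the environment interface (`env.reset`, `env.step`) are illustrative assumptions of ours, not the authors' implementation; any regressor with `predict`/`update` methods, such as the online GP sketch in Section 2, fits the `gp` argument.

```python
import numpy as np

def gp_q_learning_episode(env, gp, actions, alpha=0.5, gamma=0.95, horizon=200):
    """One episode of Q-learning where the tabular Q-values are replaced by a GP.

    `gp` exposes predict(x) -> (mean, var) and update(x, target) over state-action inputs;
    `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done).
    """
    def q_pred(s, a):
        mean, _ = gp.predict(np.append(s, a))   # concatenated state-action input
        return mean

    s = env.reset()
    for t in range(horizon):
        # epsilon-greedy action selection on the GP-predicted Q-values
        if np.random.rand() < 0.1:
            a = np.random.choice(actions)
        else:
            a = max(actions, key=lambda act: q_pred(s, act))
        s_next, reward, done = env.step(a)

        # eq. (6): blend the old prediction with the one-step TD target
        td_target = reward + gamma * max(q_pred(s_next, act) for act in actions)
        q_hat = (1.0 - alpha) * q_pred(s, a) + alpha * td_target

        # the GP update of eqs. (3)-(4) then absorbs the scaled TD error
        gp.update(np.append(s, a), q_hat)

        s = s_next
        if done:
            break
```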
3.2. Gradient-based policy improvement. A different approach for solving the RL problem is to model the policy directly and update it during the learning process. Parametric function approximators – e.g. neural networks – are used to model the policy π_θ, and its parameters are updated along the steepest ascent direction of the objective [1]:
\[
\theta_{t+1} \leftarrow \theta_t + \alpha \nabla_\theta J(\theta_t).
\]
Different policy gradient algorithms use different approaches to approximate the gradient ∇_θ J(θ_t). Here, J(θ) is the objective function and expresses the expected cumulative discounted return induced by the policy π_θ.
Gradient-based policy optimization algorithms, such as REINFORCE [23], finite difference methods [16], vanilla policy gradients [15], and the natural actor-critic [15], have many advantages over traditional value-function based methods when control policies of complex, continuous, and high-dimensional systems have to be learned. However, the majority of these methods suffer from high gradient variance as a result of using Monte Carlo estimates of the action-value function, $Q^{\pi}(s, a) = \sum_{j=0}^{H} \gamma^j r_j$, in the calculation of ∇_θ J(θ). It has been shown that by replacing this estimate with a function approximation of the value function on the state-action space, the gradient variance can be reduced significantly [22]. The gradient ∇_θ J(θ) can be expressed as [16]
\[
\nabla_\theta J(\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t \mid s_t)\, R(\tau) \right], \tag{7}
\]
where the expectation over τ can be approximated by averaging over a number of controller output histories [23]. R(τ) denotes the cumulated discounted reward along trajectory τ; it is a sample from the Monte Carlo estimation of the action-value function Q(s_t, a_t). Although R(τ) is an unbiased estimator
of the true Q-value function, it has high variance. In order to reduce the gradient variance, we replace this term in the policy gradient formula – eq. (7) – using our GP function approximator. We can choose to use solely the GP predicted mean instead of the term R(τ), or we can combine an arbitrary number of immediate returns from the episode with the GP prediction, similarly to the idea explained in Section 3.1. Thus,
\[
Q^m(s_t, a_t) = \sum_{i=0}^{m-1} \gamma^i R(s_{t+i}, a_{t+i}) + \gamma^m\, \mathbb{E}[f(s_{t+m}, a_{t+m})],
\]
where E[f(s_{t+m}, a_{t+m})] is the predicted mean of the GPR approximation and is given by eq. (1). Using the previous expression leads to the following form of the gradient:
\[
\nabla_\theta J(\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t \mid s_t)\, Q^m(s_t, a_t) \right].
\]
When m is greater than the number of remaining time-steps in the respective episode, the full Monte Carlo estimate from eq. (5) is used. By choosing m = 0 we get the pure GP prediction-based version, while m = H recovers the original Monte Carlo policy gradient algorithm. In the rest of this paper, we apply the GPR extension to the REINFORCE algorithm.
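A sketch of the resulting gradient estimator is given below. The Gaussian policy, its score function, and the trajectory/GP interfaces are our own assumptions used only for illustration; the point is that R(τ) in the REINFORCE estimator of eq. (7) is replaced by the partly GP-predicted Q^m(s_t, a_t). The returned gradient would then be plugged into the update θ ← θ + α ∇_θ J(θ) stated at the beginning of this section.

```python
import numpy as np

def gp_reinforce_gradient(episodes, gp, theta, sigma=0.5, gamma=0.95, m=1):
    """Estimate grad_theta J(theta) using the GP-augmented target Q^m(s_t, a_t).

    `episodes` is a list of trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})].
    The policy is assumed Gaussian, a ~ N(theta^T s, sigma^2), so the score function is
    grad_theta log pi(a|s) = (a - theta @ s) * s / sigma^2.
    `gp` predicts the action-value of a state-action pair: gp.predict(x) -> (mean, var).
    """
    grad = np.zeros_like(theta)
    for traj in episodes:
        H = len(traj)
        for t, (s, a, _) in enumerate(traj):
            # Q^m: m observed discounted rewards, then the GP mean as a tail estimate
            q_m = sum(gamma**i * traj[t + i][2] for i in range(min(m, H - t)))
            if t + m < H:
                tail_mean, _ = gp.predict(np.append(traj[t + m][0], traj[t + m][1]))
                q_m += gamma**m * tail_mean
            score = (a - theta @ s) * s / sigma**2
            grad += score * q_m
    return grad / len(episodes)
```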
4. Experimental Results
In order to measure the performance of the presented methods, we used two different simulated robot control tasks: the mountain car and the pole balancing problems. As a baseline for our methods, we implemented Q-learning for the first task and episodic REINFORCE for the second. We then extended these methods with the GPR action-value approximation.

We experimented with the GP approximation of the action-value function using the mountain car problem, a standard RL benchmark task, see Figure 2.(a).³ The task is as follows: a car is placed at the bottom of a valley and it has to reach the top of one of the surrounding hills – a yellow star marks the desired location of the car in Figure 2.(a). The car does not have the power necessary to overcome gravity, therefore it has to swing itself back and forth to gain momentum and climb the hill. This is a nonlinear control task. The mountain car problem has a two-dimensional state space: the position and the velocity of the car. We used a discretized action space where three actions were possible: apply right force, apply left force, and do not apply any force.
³We used a MATLAB implementation of the mountain car robot by Jose Antonio Martin – http://www.dia.fi.upm.es/~jamartin/ – extended with additional learning algorithms beyond the SARSA algorithm originally present.
Figure 2. Illustration of the (a) mountain car and (b) pole balancing problems.
For Q-learning, we used a 20×12 lookup table with 20 possible values for the position and 12 values for the velocity. GPs handle continuous inputs well; therefore, no discretization was needed in that case.
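For reference, the discretization used for the tabular baseline can be sketched as below; the state bounds are the standard mountain-car ranges and, like the function names, are assumptions on our part.

```python
import numpy as np

# standard mountain-car state bounds (assumed): position in [-1.2, 0.6], velocity in [-0.07, 0.07]
POS_BINS, VEL_BINS = 20, 12
Q_table = np.zeros((POS_BINS, VEL_BINS, 3))   # 3 discrete actions: left, none, right

def discretize(position, velocity):
    """Map a continuous state onto the 20 x 12 lookup table used by tabular Q-learning."""
    i = int(np.clip((position + 1.2) / (0.6 + 1.2) * POS_BINS, 0, POS_BINS - 1))
    j = int(np.clip((velocity + 0.07) / (0.07 + 0.07) * VEL_BINS, 0, VEL_BINS - 1))
    return i, j

i, j = discretize(-0.5, 0.01)
print(Q_table[i, j])   # Q-values of the three actions in this cell
```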
Another standard RL benchmark task is the pole balancing problem – see Figure 2.(b) – where the goal is to balance a vertical pole. The state space has two state variables: the angle θ and the angular velocity ω of the pole. The dynamics of the system were implemented in MATLAB based on the following differential equation:
\[
\ddot{\theta}(t) = \frac{-\mu \dot{\theta}(t) + m g l \sin(\theta(t)) + u(t)}{m l^2},
\]
where µ is the friction coefficient, m is the mass of the pole, l is the length of the pole, g is the gravitational acceleration, and u(t) is the control torque applied by the policy. The state variables are normalized so that θ is always positive, lying in the range [0, 2π].
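For completeness, a simple Euler integration of this dynamics can be written as below; the physical constants and the step size are illustrative values, not the ones used in the experiments.

```python
import numpy as np

MU, M, L, G = 0.05, 1.0, 1.0, 9.81   # friction, mass, length, gravity (assumed values)
DT = 0.01                            # Euler integration step (assumed)

def pole_step(theta, omega, u):
    """One Euler step of the pendulum dynamics above, given control torque u."""
    theta_ddot = (-MU * omega + M * G * L * np.sin(theta) + u) / (M * L ** 2)
    omega_new = omega + DT * theta_ddot
    theta_new = (theta + DT * omega_new) % (2.0 * np.pi)   # keep theta in [0, 2*pi)
    return theta_new, omega_new

# start hanging down, apply no torque
theta, omega = np.pi, 0.0
for _ in range(100):
    theta, omega = pole_step(theta, omega, u=0.0)
```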
For the mountain car problem, we obtained the value functions shown in Figure 3.(a) and Figure 3.(b), based on 400 runs. Experiments show that simple Q-learning converges faster if we do not use GP function approximation – 61 episodes were needed by simple Q-learning, whereas 146 episodes were needed with the GP extension. This is not a surprising result, since by using function approximation we always lose some accuracy. It is much more important to note that better results – fewer steps until the goal state (marked by a star in Figure 2.(a)) was reached – were obtained in the GP case. To find an average solution, the Q-learning algorithm took 358 steps, whereas only 171 steps were taken by the GP extension. Results show – see Figure 3.(c) – that using GPs for the approximation of the action-value function in the mountain car task might lead to better solutions; however, a longer time is needed until convergence.

Figure 3. Value functions for the mountain car problem using (a) tabular Q-learning and (b) Q-learning with GPR. (c) After several iterations the GPR extension outperforms standard Q-learning.
We used the pole balancing task to test the performance of the episodic REINFORCE algorithm and its GP-augmented version. Episodic REINFORCE is known to suffer from high gradient variance and consequently converges slowly. To reduce the gradient variance, we used GPR as described in Section 3.2. For each gradient update step we performed 4 episodes. The state-action pairs visited during these episodes and the corresponding Monte Carlo returns – shown in Figure 4.(a) – provided the training points and labels for the GPR. In Figure 4.(b) we have plotted the evolution of the average reward during the learning phase for both the simple and the GP-augmented versions of the algorithm. From this plot we can see that the use of GP-approximated state-action value functions has a beneficial effect on the convergence of the algorithm. Although both versions of the algorithm asymptotically converge, the learning curve in the initial phase is much steeper when the
GPR approximation has been used.

Figure 4. (a) Value function of the pole balancing task using the GPR approximation, together with the state-action pairs sampled from the experiment (red dots). (b) Comparison of the standard episodic REINFORCE algorithm and the GPR extension – our method outperforms the episodic REINFORCE algorithm.

Similar performance with the simple REINFORCE algorithm can only be reached if we increase the number of episodes between two gradient update steps – e.g. from 4 to 20. However, in the case of a real-life learning task, where the number of trials may be strongly limited by physical constraints, this can be very costly.
5. Discussion
In this paper we investigated the possibility of using Gaussian processes
in reinforcement learning. Although the basic setup is not new, we consider
our approach a novel one: we approximated the action-value function within
the learning process with a Gaussian process. We then applied the approximation to Q-learning in order to extend it to continuous state spaces. We also employed the GPR approximation in conjunction with the REINFORCE algorithm to reduce the variance of the estimated gradient in a policy gradient setup. We extended the original GP inference algorithm with an online version, since in practice any RL problem requires online treatment – both for efficiency and for memory reasons. Experiments were performed on
simulated robot control tasks. Results show that the GPR approximation of
the action-value function does not lead to worse performance. On the contrary,
it provides better policies or faster convergence. The performance of the value function approximation scheme presented here can be further improved by using geodesic-distance-based kernel functions and by maintaining the continuity of our value function approximation between gradient update steps. These are the main directions of our future research.
Acknowledgements
The authors wish to acknowledge the financial support provided through the program "Investing in people! PhD scholarship", a project co-financed by the European Social Fund, Sectoral Operational Programme Human Resources Development 2007–2013, contracts POSDRU 6/1.5/S/3 and POSDRU 88/1.5/S/60185 – "Innovative doctoral studies in a knowledge based society" – and by grant PNCD II 11-039 of the Romanian Ministry of Education and Research.
References
[1] K. E. Atkinson. An Introduction to Numerical Analysis. Wiley, New York, 1978.
[2] L. Baird, A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1998.
[3] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont,
MA, 1995.
[4] D.P. Bertsekas, J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] J. Boyan, A. Moore. Generalization in reinforcement learning: Safely approximating the
value function. In G. Tesauro, D.S. Touretzky, and T.K. Lee, editors, Neural Information
Processing Systems 7, pages 369–376, Cambridge, MA, 1995. The MIT Press.
[7] L. Csató. Gaussian Processes – Iterative Sparse Approximation. PhD thesis, Neural
Computing Research Group, March 2002.
[8] L. Csató, E. Fokoué, M. Opper, B. Schottky, O. Winther. Efficient approaches to Gaussian process classification. In NIPS, volume 12, pages 251–257. The MIT Press, 2000.
[9] L. Csató, M. Opper. Sparse representation for Gaussian process models. In Todd K.
Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS, volume 13, pages 444–
450. The MIT Press, 2001.
[10] M.P. Deisenroth, C.E. Rasmussen, J. Peters. Gaussian process dynamic programming.
Neurocomputing, 72(7-9):1508–1524, 2009.
[11] Y. Engel, S. Mannor, R. Meir. Bayes meets Bellman: The Gaussian process approach to
temporal difference learning. In Proc. of the 20th International Conference on Machine
Learning, pages 154–161, 2003.
[12] Y. Engel, S. Mannor, R. Meir. Reinforcement learning with Gaussian processes. In ICML
’05: Proceedings of the 22nd international conference on Machine learning, pages 201–
208, New York, NY, USA, 2005. ACM.
[13] M. Ghavamzadeh, Y. Engel. Bayesian policy gradient algorithms. In B. Schölkopf,
J. Platt, and T. Hoffman, editors, NIPS ’07: Advances in Neural Information Processing Systems 19, pages 457–464, Cambridge, MA, 2007. MIT Press.
[14] H. Maei, Cs. Szepesvari, S. Bhatnagar, D. Precup, D. Silver, R. Sutton. Convergent
temporal-difference learning with arbitrary smooth function approximation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances
in Neural Information Processing Systems 22, pages 1204–1212. 2009.
[15] J. Peters, S. Schaal. Policy gradient methods for robotics. In IROS, pages 2219–2225.
IEEE, 2006.
[16] J. Peters, S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural
Networks, 21(4):682–697, 2008.
[17] C.E. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.
[18] B. Schoelkopf, A.J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA,
2002.
[19] J. Shawe-Taylor, N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[20] R.S. Sutton, A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[21] R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, E.
Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear
function approximation. In ICML ’09: Proceedings of the 26th Annual International
Conference on Machine Learning, pages 993–1000, New York, NY, USA, 2009. ACM.
[22] R.S. Sutton, D.A. McAllester, S.P. Singh, Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. In Sara A. Solla, Todd K. Leen,
and Klaus-Robert Müller, editors, NIPS, pages 1057–1063, 1999.
[23] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
Faculty of Mathematics and Computer Science, Babes-Bolyai University,
Cluj-Napoca
E-mail address: {jakabh,bboti,csatol}@cs.ubbcluj.ro