Reinforcement learning

Based on [Kaelbling et al., 1996, Bertsekas, 2000]
Bert Kappen
Reinforcement learning
"Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment." (Kaelbling et al., 1996).
Approaches:
- search space of behaviors: genetic algorithms
- dynamic programming
The action a changes the world, i is an indication of the current state s, and r is the reinforcement signal. The behaviour should be chosen so as to increase the long-run sum of values of r.
Formally:
- a discrete set of environment states S
- a discrete set of agent actions A
- a set of scalar reinforcement signals, typically {0, 1} or real-valued

Find the optimal policy π : S → A.

The environment is
- non-deterministic: taking the same action in the same state may yield a different next state
- stationary: p(s'|s, a) is independent of time.
Models of optimality
The finite horizon model:

$$\left\langle \sum_{t=0}^{h} r_t \right\rangle$$
Does not consider at t = 0 what happens after t = h.
Two uses:
- Fixed horizon: take the h-step optimal action, then the (h−1)-step optimal action, ..., then the 1-step optimal action.
- Receding horizon: always take the h-step optimal action.
Models of optimality
The infinite horizon discounted model:
$$\left\langle \sum_{t=0}^{\infty} \gamma^t r_t \right\rangle, \qquad 0 \le \gamma < 1$$
γ is the "probability to live another step", or a mathematical trick to bound the infinite sum.
Models of optimality
The average reward model:
$$\left\langle \lim_{h \to \infty} \frac{1}{h} \sum_{t=0}^{h} r_t \right\rangle$$
This is the limit of the discounted model for γ → 1.

A problem with this model is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and one of which does not.
Models of optimality

There is only a single action, with three choices from the start state (the upper left circle is t = 0).
Different criteria yield different optimal solutions:
The finite horizon h = 4 model prefers the first choice: $\sum_{t=0}^{4} r_t = (6, 0, 0)$.

The discounted reward γ = 0.9 model prefers the second choice:

$$\sum_{t=0}^{\infty} \gamma^t r_t = \left( 2\sum_{t=2}^{\infty} \gamma^t,\; 10\sum_{t=5}^{\infty} \gamma^t,\; 11\sum_{t=6}^{\infty} \gamma^t \right) = \left( 2\gamma^2,\, 10\gamma^5,\, 11\gamma^6 \right)\frac{1}{1-\gamma} = (16.2,\, 59.0,\, 58.5)$$

Average reward prefers the third choice:

$$\frac{1}{h}\sum_{t=0}^{h} r_t = \frac{1}{h}\left( \sum_{t=2}^{h} 2,\; \sum_{t=5}^{h} 10,\; \sum_{t=6}^{h} 11 \right) = \left( 2\,\frac{h-1}{h},\; 10\,\frac{h-4}{h},\; 11\,\frac{h-5}{h} \right) \to (2, 10, 11) \quad \text{as } h \to \infty$$
Where we used $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}$.

Proof: Define $S = \sum_{t=0}^{T} \gamma^t$. Then

$$(1-\gamma)S = \sum_{t=0}^{T} \gamma^t - \sum_{t=0}^{T} \gamma^{t+1} = \sum_{t=0}^{T} \gamma^t - \sum_{t=1}^{T+1} \gamma^t = 1 - \gamma^{T+1}$$

so that

$$\sum_{t=0}^{\infty} \gamma^t = \lim_{T\to\infty} S = \lim_{T\to\infty} \frac{1 - \gamma^{T+1}}{1-\gamma} = \frac{1}{1-\gamma}$$
Models of optimality
What happens when h = 1000 and γ = 0.2?
Models of optimality

What happens when h = 1000 and γ = 0.2?

The finite horizon h = 1000 model and the average reward model are independent of γ. Both prefer the third choice (long-term reward):

$$\sum_{t=0}^{1000} r_t = \left( \sum_{t=2}^{1000} 2,\; \sum_{t=5}^{1000} 10,\; \sum_{t=6}^{1000} 11 \right) = (1998,\, 9960,\, 10945)$$
The discounted reward γ = 0.2 model prefers the first choice (short-term reward):

$$\sum_{t=0}^{\infty} \gamma^t r_t = \left( 2\sum_{t=2}^{\infty} \gamma^t,\; 10\sum_{t=5}^{\infty} \gamma^t,\; 11\sum_{t=6}^{\infty} \gamma^t \right) = \left( 2\gamma^2,\, 10\gamma^5,\, 11\gamma^6 \right)\frac{1}{1-\gamma} = (0.1,\, 0.004,\, 0.0009)$$
Optimal policy depends strongly on horizon time or discount factor.
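These preference reversals are easy to check numerically. Below is a minimal Python sketch (an illustration, not part of the original slides) that assumes, as in the example, a reward of 2 per step from t = 2 for the first choice, 10 per step from t = 5 for the second, and 11 per step from t = 6 for the third, and evaluates the three criteria.

```python
# Reward streams of the three choices: 2 per step from t=2, 10 from t=5, 11 from t=6.
def reward(choice, t):
    start, r = {1: (2, 2), 2: (5, 10), 3: (6, 11)}[choice]
    return r if t >= start else 0

def finite_horizon(choice, h):
    return sum(reward(choice, t) for t in range(h + 1))

def discounted(choice, gamma, tmax=2000):
    # truncate the infinite sum; the tail is negligible for gamma < 1
    return sum(gamma**t * reward(choice, t) for t in range(tmax))

def average(choice, h):
    return finite_horizon(choice, h) / h

print([finite_horizon(c, 4) for c in (1, 2, 3)])          # [6, 0, 0]
print([finite_horizon(c, 1000) for c in (1, 2, 3)])       # [1998, 9960, 10945]
print([round(discounted(c, 0.9), 1) for c in (1, 2, 3)])  # [16.2, 59.0, 58.5]
print([round(discounted(c, 0.2), 4) for c in (1, 2, 3)])  # [0.1, 0.004, 0.0009]
print([round(average(c, 10**5), 2) for c in (1, 2, 3)])   # [2.0, 10.0, 11.0]
```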
Discrete time control
Consider the control of a discrete time deterministic dynamical system:
$$x_{t+1} = x_t + f(t, x_t, u_t), \qquad t = 0, 1, \ldots, T-1$$

$x_t$ describes the state and $u_t$ specifies the control or action at time t.

Given $x_{t=0} = x_0$ and $u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}$, we can compute $x_{1:T}$.

Define a cost for each sequence of controls:

$$C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t)$$

The problem of optimal control is to find the sequence $u_{0:T-1}$ that minimizes $C(x_0, u_{0:T-1})$.
Dynamic programming
Find the minimal cost path from A to J.
C(J) = 0, C(H) = 3, C(I) = 4
C(F) = min(6 + C(H), 3 + C(I))
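Only a fragment of the route graph is quoted above (C(H) = 3, C(I) = 4 and the two edges out of F); the full A..J graph is in the accompanying figure. The hypothetical sketch below runs the same backward recursion on just that fragment.

```python
# Backward dynamic programming on the quoted fragment of the graph.
# edges[node] = list of (successor, step cost); the remaining A..J edges
# from the figure would be added here in the same format.
edges = {
    "F": [("H", 6), ("I", 3)],
    "H": [("J", 3)],
    "I": [("J", 4)],
    "J": [],
}

C = {"J": 0.0}  # cost-to-go of the goal is zero

def cost_to_go(node):
    """C(node) = min over successors of (step cost + C(successor))."""
    if node not in C:
        C[node] = min(c + cost_to_go(nxt) for nxt, c in edges[node])
    return C[node]

print(cost_to_go("H"), cost_to_go("I"))  # 3.0 4.0
print(cost_to_go("F"))                   # min(6 + 3, 3 + 4) = 7.0
```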
Discrete time control
The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

$$J(t, x_t) = \min_{u_{t:T-1}} \left[ \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right]$$

which solves the optimal control problem from an intermediate time t until the fixed end time T, for all intermediate states $x_t$.

Then,

$$J(T, x) = \phi(x), \qquad J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})$$
Discrete time control
One can recursively compute J(t, x) from J(t + 1, x) for all x in the following way:

$$\begin{aligned}
J(t, x_t) &= \min_{u_{t:T-1}} \left[ \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right] \\
&= \min_{u_t} \left[ R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right] \\
&= \min_{u_t} \left[ R(t, x_t, u_t) + J(t+1, x_{t+1}) \right] \\
&= \min_{u_t} \left[ R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right]
\end{aligned}$$
This is called the Bellman Equation.
Computes u as a function of x, t for all intermediate t and all x.
Discrete time control
The algorithm to compute the optimal control $u^*_{0:T-1}$, the optimal trajectory $x^*_{1:T}$ and the optimal cost is given by:

1. Initialization: $J(T, x) = \phi(x)$

2. Backwards: For $t = T-1, \ldots, 0$ and for all x compute

$$u^*_t(x) = \arg\min_{u} \left[ R(t, x, u) + J(t+1, x + f(t, x, u)) \right]$$
$$J(t, x) = R(t, x, u^*_t) + J(t+1, x + f(t, x, u^*_t))$$

3. Forwards: For $t = 0, \ldots, T-1$ compute

$$x^*_{t+1} = x^*_t + f(t, x^*_t, u^*_t(x^*_t))$$

NB: the backward computation requires $u^*_t(x)$ for all x.
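As an illustration, the sketch below implements the backward and forward passes for a toy scalar problem; the dynamics f, the costs R and φ, the control set and the state grid are illustrative assumptions (not taken from the slides), and the continuous state is discretized so that J(t, x) can be tabulated.

```python
import numpy as np

T = 20
xs = np.linspace(-2.0, 2.0, 81)          # state grid
us = np.array([-0.2, 0.0, 0.2])          # allowed controls

f   = lambda t, x, u: u                  # x_{t+1} = x_t + f(t, x_t, u_t)
R   = lambda t, x, u: 0.1 * x**2 + u**2  # running cost (assumed)
phi = lambda x: 10.0 * x**2              # end cost (assumed)

def nearest(x):                          # index of the grid point closest to x
    return int(np.abs(xs - x).argmin())

J = np.zeros((T + 1, len(xs)))
ustar = np.zeros((T, len(xs)), dtype=int)

# 1. Initialization and 2. backward pass
J[T] = phi(xs)
for t in range(T - 1, -1, -1):
    for i, x in enumerate(xs):
        q = [R(t, x, u) + J[t + 1, nearest(x + f(t, x, u))] for u in us]
        ustar[t, i] = int(np.argmin(q))
        J[t, i] = min(q)

# 3. Forward pass from x_0 = 1
x = 1.0
for t in range(T):
    x = x + f(t, x, us[ustar[t, nearest(x)]])
print("final state:", x, " optimal cost J(0, x_0):", J[0, nearest(1.0)])
```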
Stochastic case
$$x_{t+1} = x_t + f(t, x_t, u_t, w_t), \qquad t = 0, \ldots, T-1$$

At time t, $w_t$ is a random value drawn from a probability distribution p(w). For instance,

$$x_{t+1} = x_t + w_t, \qquad x_0 = 0, \qquad w_t = \pm 1, \qquad p(w_t = 1) = p(w_t = -1) = 1/2$$

$$x_t = \sum_{s=0}^{t-1} w_s$$

Thus, $x_t$ is a random variable.
Stochastic case
$$C(x_0) = \left\langle \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t, \xi_t) \right\rangle = \sum_{w_{0:T-1}} \sum_{\xi_{0:T-1}} p(w_{0:T-1})\, p(\xi_{0:T-1}) \left[ \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t, \xi_t) \right]$$

with $\xi_t, x_t, w_t$ random. Closed loop control: find functions $u_t(x_t)$ that minimize the remaining expected cost when in state x at time t. $\pi = \{u_0(\cdot), \ldots, u_{T-1}(\cdot)\}$ is called a policy.

$$x_{t+1} = x_t + f(t, x_t, u_t(x_t), w_t)$$

$$C_\pi(x_0) = \left\langle \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t(x_t), \xi_t) \right\rangle$$

$\pi^* = \arg\min_\pi C_\pi(x_0)$ is the optimal policy.
Stochastic Bellman Equation
$$J(t, x_t) = \min_{u_t} \left\langle R(t, x_t, u_t, \xi_t) + J(t+1, x_t + f(t, x_t, u_t, w_t)) \right\rangle$$

$$J(T, x) = \phi(x)$$

$u_t$ is optimized for each $x_t$ separately. $\pi = \{u_0, \ldots, u_{T-1}\}$ is an optimal policy.
Inventory problem
• $x_t$ = 0, 1, 2: stock available at the beginning of period t.
• $u_t$: stock ordered at the beginning of period t. Maximum storage is 2: $u_t \le 2 - x_t$.
• $w_t$ = 0, 1, 2: demand during period t with p(w = 0, 1, 2) = (0.1, 0.7, 0.2); excess demand is lost.
• $u_t$ is the cost of purchasing $u_t$ units; $(x_t + u_t - w_t)^2$ is the cost of the stock at the end of period t.

$$x_{t+1} = \max(0, x_t + u_t - w_t)$$

$$C(x_0, u_{0:T-1}) = \left\langle \sum_{t=0}^{2} \left( u_t + (x_t + u_t - w_t)^2 \right) \right\rangle$$
Planning horizon T = 3.
Apply Bellman Equation
$$J_t(x_t) = \min_{u_t} \left\langle R(x_t, u_t, w_t) + J_{t+1}(f(x_t, u_t, w_t)) \right\rangle$$

$$R(x, u, w) = u + (x + u - w)^2, \qquad f(x, u, w) = \max(0, x + u - w)$$

Start with $J_3(x_3) = 0$ for all $x_3$.
Dynamic programming in action
Assume we are at stage t = 2 and the stock is $x_2$. The cost-to-go depends on what we order, $u_2$, and how much stock we have left at the end of period t = 2:

$$J_2(x_2) = \min_{0 \le u_2 \le 2 - x_2} \left\langle u_2 + (x_2 + u_2 - w_2)^2 \right\rangle = \min_{0 \le u_2 \le 2 - x_2} \left[ u_2 + 0.1\,(x_2 + u_2)^2 + 0.7\,(x_2 + u_2 - 1)^2 + 0.2\,(x_2 + u_2 - 2)^2 \right]$$

$$J_2(0) = \min_{0 \le u_2 \le 2} \left[ u_2 + 0.1\, u_2^2 + 0.7\,(u_2 - 1)^2 + 0.2\,(u_2 - 2)^2 \right]$$

$u_2 = 0$: rhs $= 0 + 0.7 \cdot 1 + 0.2 \cdot 4 = 1.5$
$u_2 = 1$: rhs $= 1 + 0.1 \cdot 1 + 0.2 \cdot 1 = 1.3$
$u_2 = 2$: rhs $= 2 + 0.1 \cdot 4 + 0.7 \cdot 1 = 3.1$

Thus, $u_2(x_2 = 0) = 1$ and $J_2(x_2 = 0) = 1.3$.
Inventory problem
The computation can be repeated for x2 = 1 and x2 = 2, completing stage 2 and
subsequently for stage 1 and stage 0.
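The whole table can be generated with a few lines of code. The sketch below implements the Bellman recursion for this inventory problem and reproduces the stage-2 result above ($u_2(0) = 1$, $J_2(0) = 1.3$).

```python
# Inventory problem: states x in {0,1,2}, orders u <= 2 - x,
# demand w with p(w=0,1,2) = (0.1, 0.7, 0.2), stage cost u + (x+u-w)^2,
# horizon T = 3 and J_3(x) = 0.
p_w = {0: 0.1, 1: 0.7, 2: 0.2}
T = 3

J = {T: {x: 0.0 for x in range(3)}}
policy = {}

for t in range(T - 1, -1, -1):
    J[t], policy[t] = {}, {}
    for x in range(3):
        best_u, best_cost = None, float("inf")
        for u in range(0, 2 - x + 1):                       # storage limit x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2            # expected stage cost ...
                            + J[t + 1][max(0, x + u - w)])  # ... plus cost-to-go
                       for w, p in p_w.items())
            if cost < best_cost:
                best_u, best_cost = u, cost
        policy[t][x], J[t][x] = best_u, best_cost

print(policy[2][0], round(J[2][0], 2))            # 1 1.3, as computed above
print({x: round(J[0][x], 2) for x in range(3)})   # stage-0 costs-to-go
```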
Exploitation versus Exploration: The Single-State Case
The k-armed bandit problem:
The agent is in a room with a collection of k gambling machines (each called a "one-armed bandit"). The agent is permitted a fixed number of pulls, h. Any arm may be pulled on each turn. The machines do not require a deposit to play; the only cost is in wasting a pull playing a suboptimal machine. When arm i is pulled, machine i pays off 1 or 0, with unknown probability $p_i$. What should the agent's strategy be?
Trade-off between
exploration: try many new arms
exploitation: stick with a good arm
The bandit problem is an RL problem with a single state.
Bayesian model
We assume prior distributions over the parameters pi.
Consider the Beta distribution over 0 ≤ x ≤ 1 parametrized by α, β > 0 integers:
$$P(x|\alpha, \beta) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)!\,(\beta - 1)!}\, x^{\alpha-1} (1-x)^{\beta-1}, \qquad \langle x \rangle = \frac{\alpha}{\alpha + \beta}$$

$P(p_i|\alpha = \beta = 1)$ can be used as a flat prior over $p_i$ to model ignorance of the value of $p_i$.

When pulling arm i $n_i$ times, giving $w_i$ times a payoff of 1, we can compute the posterior distribution over $p_i$ as

$$P(p_i|n_i, w_i) \propto \text{"likelihood to observe } w_i \text{ in } n_i \text{ trials given } p_i\text{"} \times \text{"prior"} \propto p_i^{w_i} (1-p_i)^{n_i - w_i} \propto P(p_i|\alpha = w_i + 1, \beta = n_i - w_i + 1)$$

$$\langle p_i \rangle = \frac{w_i + 1}{n_i + 2}$$

NB: if you pull once and win ($n_i = w_i = 1$), the expected return is 2/3.
Dynamic programming solution
Although the agent has only one state, the knowledge (or belief) of the agent changes while playing. This is the notion of a belief state. If arm i is pulled $n_i$ times, yielding a positive payoff $w_i$ times, the belief state is

$$\{n_1, w_1, \ldots, n_k, w_k\}$$

We write $V^*(n_1, w_1, \ldots, n_k, w_k)$ for the expected remaining payoff, given that a total of h pulls are available and the remaining pulls are used optimally.
Dynamic programming solution
If $\sum_i n_i = h$ there are no remaining pulls and $V^*(n_1, w_1, \ldots, n_k, w_k) = 0$.

If we know $V^*$ for all states with t pulls remaining, we can compute $V^*$ for any belief state with t + 1 pulls remaining:

$$\begin{aligned}
V^*(n_1, w_1, \ldots, n_k, w_k) &= \max_i \big(\text{agent takes action } i \text{ and acts optimally for the remaining pulls}\big) \\
&= \max_i \big(\rho_i\,(\text{arm } i \text{ returns } 1) + (1 - \rho_i)\,(\text{arm } i \text{ returns } 0)\big) \\
&= \max_i \big(\rho_i + \rho_i\, V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k) + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k)\big)
\end{aligned}$$

$$\rho_i = \langle p_i \rangle = \frac{w_i + 1}{n_i + 2}$$
NB: Error in Kaelbling formula
Linear in the number of belief states times actions and thus exponential in the
horizon.
Example
h = 4, two bandits. Notation: we abbreviate $V^*(n_1, w_1, n_2, w_2)$ as $(n_1 w_1 n_2 w_2)$.

Use the Bellman equation to compute all values backwards:

• If $n_1 + n_2 = 4$: $V^*(n_1, w_1, n_2, w_2) = 0$.
• Consider states with n1 + n2 = 3. For instance,
$$(0030) = \max(\rho_1, \rho_2) = \max\left(\tfrac{1}{2}, \tfrac{1}{5}\right) = \tfrac{1}{2}$$
$$(2211) = \max(\rho_1, \rho_2) = \max\left(\tfrac{3}{4}, \tfrac{2}{3}\right) = \tfrac{3}{4} = (1122)$$
$$(2111) = \max(\rho_1, \rho_2) = \max\left(\tfrac{2}{4}, \tfrac{2}{3}\right) = \tfrac{2}{3} = (1121)$$
...
• Consider states with n1 + n2 = 2. For instance,
$$(1111) = \max\left( \rho_1 + \rho_1 (2211) + (1 - \rho_1)(2111),\; \rho_2 + \rho_2 (1122) + (1 - \rho_2)(1121) \right) = \frac{2}{3}\left(1 + \frac{3}{4}\right) + \frac{1}{3}\cdot\frac{2}{3} = 1.39$$
Matlab results:
t= 3:
(0030)=0.50 (0031)=0.50 (0032)=0.60 (0033)=0.80 (1020)=0.33 (1021)=0.50 (1022)=0.75
(1120)=0.67 (1121)=0.67 (1122)=0.75 (2010)=0.33 (2011)=0.67 (2110)=0.50 (2111)=0.67
(2210)=0.75 (2211)=0.75 (3000)=0.50 (3100)=0.50 (3200)=0.60 (3300)=0.80
t= 2:
(0020)=1.00 (0021)=1.08 (0022)=1.50 (1010)=0.72 (1011)=1.33 (1110)=1.33 (1111)=1.39
(2000)=1.00 (2100)=1.08 (2200)=1.50
t= 1:
(0010)=1.53 (0011)=2.03 (1000)=1.53 (1100)=2.03
t= 0:
(0000)=2.28
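These numbers are easy to reproduce: the sketch below implements the belief-state recursion for h = 4 and two arms and reproduces the table above.

```python
from functools import lru_cache

H = 4  # total number of pulls

@lru_cache(maxsize=None)
def V(n1, w1, n2, w2):
    if n1 + n2 == H:                        # no pulls left
        return 0.0
    r1 = (w1 + 1) / (n1 + 2)                # posterior mean of p_1
    r2 = (w2 + 1) / (n2 + 2)                # posterior mean of p_2
    q1 = r1 + r1 * V(n1 + 1, w1 + 1, n2, w2) + (1 - r1) * V(n1 + 1, w1, n2, w2)
    q2 = r2 + r2 * V(n1, w1, n2 + 1, w2 + 1) + (1 - r2) * V(n1, w1, n2 + 1, w2)
    return max(q1, q2)

print(round(V(0, 0, 0, 0), 2))   # 2.28
print(round(V(1, 1, 0, 0), 2))   # 2.03
print(round(V(1, 1, 1, 1), 2))   # 1.39
```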
Example
$$\begin{aligned}
V^*(n_1, w_1, n_2, w_2) = \max \big(\; &\rho_1 + \rho_1 V^*(n_1+1, w_1+1, n_2, w_2) + (1-\rho_1)\, V^*(n_1+1, w_1, n_2, w_2), \\
&\rho_2 + \rho_2 V^*(n_1, w_1, n_2+1, w_2+1) + (1-\rho_2)\, V^*(n_1, w_1, n_2+1, w_2)\; \big), \qquad \rho_i = \frac{w_i+1}{n_i+2}
\end{aligned}$$

Use the values to compute the optimal strategy forwards:

• First step: Pull arm 1 and win. $\rho_1 = 2/3$, $\rho_2 = 1/2$.

• Second step: Optimal second pull from state (1100):

$$\arg\max \left( \rho_1 + \rho_1 (2200) + (1-\rho_1)(2100),\; \rho_2 + \rho_2 (1111) + (1-\rho_2)(1110) \right) = \arg\max \left( 2/3 + 2/3 \cdot 1.5 + 1/3 \cdot 1.08,\; 1/2 + 1/2 \cdot 1.39 + 1/2 \cdot 1.33 \right) = \arg\max(2.03,\, 1.86)$$
• ...
Ad-hoc strategies
Strategies that do not use the Bellman equation.
Optimism in the face of uncertainty:
- put a strong optimistic prior belief $P(p_i|n_i, w_i)$.
- For instance, use $w_i = n_i$ with $n_i$ a number of phantom plays: $P(p_i) \propto p_i^{n_i}$.

Randomized strategies:

$$P(a = i) = \frac{\exp(\beta \rho_i)}{\sum_{j=1}^{k} \exp(\beta \rho_j)}, \qquad \rho_i = \frac{w_i + 1}{n_i + 2}$$

$T = 1/\beta$ is a temperature, which is decreased over time to reduce exploration.

ε-greedy, when K actions are possible:
- choose the best action with probability $1 - \epsilon + \epsilon/K$
- choose any other action with probability $\epsilon/K$
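As a small illustration, here is a minimal sketch of the two randomized strategies, assuming statistics $(n_i, w_i)$ are kept for each of the K arms.

```python
import math, random

def posterior_means(n, w):
    # rho_i = (w_i + 1) / (n_i + 2)
    return [(wi + 1) / (ni + 2) for ni, wi in zip(n, w)]

def softmax_action(n, w, beta):
    """Pick arm i with probability exp(beta*rho_i) / sum_j exp(beta*rho_j)."""
    rho = posterior_means(n, w)
    weights = [math.exp(beta * r) for r in rho]
    return random.choices(range(len(rho)), weights=weights)[0]

def eps_greedy_action(n, w, eps):
    """Best arm with probability 1 - eps + eps/K, any other arm with eps/K."""
    rho = posterior_means(n, w)
    K = len(rho)
    if random.random() < eps:
        return random.randrange(K)          # uniform choice (includes the best arm)
    return max(range(K), key=lambda i: rho[i])

n, w = [3, 1, 0], [2, 0, 0]                 # example arm statistics
print(softmax_action(n, w, beta=5.0), eps_greedy_action(n, w, eps=0.1))
```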
Markov Decision Processes
A set of states S, set of actions A, reward function R : S × A → R.
A state transition function $T : S \times A \to \Pi(S)$, with $\Pi(S)$ the set of probability distributions over S. We denote $T(s'|s, a)$.

The model is first order Markov because the distribution over next states s' only depends on the current state and action s, a, and not on any previous history.

We define $\pi : S \to A$ as a policy. We define the optimal value of a state as

$$V^*(s) = \max_\pi \left\langle \sum_{t=0}^{\infty} \gamma^t r_t \right\rangle_{s_0 = s}$$
For the infinite-horizon discounted model, there exists an optimal deterministic stationary policy ([Bellman, 1957]).
Markov Decision Processes

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s'|s, a)\, V^*(s') \right]$$

The optimal policy is

$$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} T(s'|s, a)\, V^*(s') \right]$$
Value iteration
Value iteration converges to V ∗ ([Bellman, 1957])
Stopping criterion (Williams & Baird, 1993): if $\max_s |V_t(s) - V_{t-1}(s)| < \epsilon$, then $\max_s |V^{\pi_t}(s) - V^{\pi^*}(s)| \le 2\epsilon\gamma/(1-\gamma)$, where $\pi_t$ is the greedy policy with respect to $V_t$.

Computational complexity per iteration is $O(|S|^2 |A|)$, or $O(|S||A|)$ when there is a constant number of next states per state (sparse T).

The number of iterations is polynomial in $1/(1-\gamma)$.
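A minimal tabular value-iteration sketch; the two-state, two-action MDP used here is an illustrative assumption.

```python
import numpy as np

gamma, eps = 0.9, 1e-6
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # R[s, a]
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # T[s, a, s']

V = np.zeros(R.shape[0])
while True:
    Q = R + gamma * T @ V                  # Q[s,a] = R(s,a) + gamma * sum_s' T(s'|s,a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < eps:    # stopping criterion
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)                      # greedy policy with respect to V
print(V, pi)
```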
Policy iteration
Manipulates the policy directly, rather than indirectly through the value function:
Vπ is the value of policy π. The policy update is greedy with respect to Vπ.
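A minimal policy-iteration sketch on the same illustrative tabular MDP as in the value-iteration example: policy evaluation solves the linear system $V_\pi = R_\pi + \gamma T_\pi V_\pi$ exactly, and the policy is then made greedy with respect to $V_\pi$.

```python
import numpy as np

def policy_iteration(R, T, gamma=0.9):
    nS, nA = R.shape
    pi = np.zeros(nS, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * T_pi) V = R_pi
        T_pi = T[np.arange(nS), pi]            # T_pi[s, s'] = T(s'|s, pi(s))
        R_pi = R[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)
        # policy improvement: greedy with respect to V_pi
        pi_new = (R + gamma * T @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new

R = np.array([[0.0, 1.0], [2.0, 0.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
print(policy_iteration(R, T))
```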
Learning an Optimal Policy: Model-free Methods
Reinforcement learning is primarily concerned with how to obtain the optimal policy when $T(s'|s, a)$ and $R(s, a)$ are not known in advance.
Two approaches:
- Model free: Learn a controller without learning a model
- Model based: learn a model, and use it to derive a controller
Which is better?
- Model free learns a single task, no generalization, faster, simple tasks
- Model based can be task independent, more complex tasks, slower
Monte Carlo sampling
The simplest method is, for a given policy π, to run N sample trajectories of length h, always starting in state s:

$$V_\pi(s) = \left\langle \sum_{t=0}^{\infty} \gamma^t r_t \right\rangle \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{h} \gamma^t r_t^i$$
Repeat for each state s.
Inefficient:
- states reappear in multiple sample trajectories
- statistics starting from those states are lost
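A sketch of this Monte Carlo estimator; the `policy(s)` and `step(s, a)` functions stand for an arbitrary environment and are assumptions of the sketch, here a toy two-state chain.

```python
import random

def mc_value(s, policy, step, gamma=0.9, N=1000, h=50):
    """Average the discounted return of N length-h trajectories starting in s."""
    total = 0.0
    for _ in range(N):
        state, ret = s, 0.0
        for t in range(h + 1):
            r, state = step(state, policy(state))
            ret += gamma**t * r
        total += ret
    return total / N

def step(s, a):
    # toy chain: the action is ignored, reward 1 whenever state 1 is entered
    s_next = random.choice([0, 1])
    return (1.0 if s_next == 1 else 0.0), s_next

print(round(mc_value(0, policy=lambda s: 0, step=step), 2))  # roughly 0.5/(1-0.9) = 5
```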
Adaptive Heuristic Critic and TD(λ)
AHC is an adaptive version of policy iteration ([Barto et al., 1983]):
- Critic: compute an estimate of $V_\pi$ for the policy π used by the actor/RL component.
- Actor: optimise π' based on the (current estimate of) $V_\pi$.

NB: Only the version in which the 'inner loop' critic is run to full convergence for each fixed policy can be guaranteed to converge to the optimal policy.
Adaptive Heuristic Critic and TD(λ)
The critic is learned as follows:
- Consider an experience tuple (s, a, r, s') under policy π.

$$V(s) := V(s) + \alpha_t \left( r + \gamma V(s') - V(s) \right)$$

- This rule is called TD(0) and converges to the solution of policy evaluation, $V_\pi$.
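In code, the tabular TD(0) critic update is a single line per experience tuple; this is a minimal sketch assuming V is stored as a dictionary over states.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """V(s) := V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

# usage: V = {s: 0.0 for s in states}; then, for every observed transition,
# td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95)
```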
Multiplying by $T(s'|s, \pi(s))$ and summing over $s'$ yields the evaluation of $V_\pi$:

$$0 = \alpha \left[ R(s, \pi(s)) + \gamma \sum_{s'} T(s'|s, \pi(s))\, V(s') - V(s) \right]$$

This method is known as stochastic approximation, originally due to Robbins and Monro (1951):
- Solve $M(x) = a$ with $M(x) = \langle N(x, \xi) \rangle$.
- Iterate $x_{t+1} = x_t + \alpha_t \left( a - N(x_t, \xi_t) \right)$.
- Convergence requires

$$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$$

For instance $\alpha_t = 1/t$.¹

¹ Correspondence: $x \leftrightarrow V(s)$, $\xi \leftrightarrow \gamma V(s')$, $a \leftrightarrow R(s, \pi(s))$. Then

$$N(x, \xi) = x - \xi = V(s) - \gamma V(s'), \qquad M(x) = V(s) - \gamma \sum_{s'} T(s'|s, \pi(s))\, V(s')$$
TD(λ)
TD(0) converges but makes poor use of the data: only the immediate previous
state is updated.
TD(λ) updates every state according to discount 0 ≤ λ ≤ 1:
$$d_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

When $s_1 \to s_2$:
$$V(s_1) := V(s_1) + \alpha_{m_{s_1}} d_1$$

When $s_2 \to s_3$:
$$V(s_1) := V(s_1) + \alpha_{m_{s_1}} \lambda d_2, \qquad V(s_2) := V(s_2) + \alpha_{m_{s_2}} d_2$$

$m_s$ is the number of times state s has been visited.
TD(λ)
In general at iteration t:
$$d_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \epsilon_t(s) = \sum_{k=1}^{t} \lambda^{t-k} \delta_{s, s_k}$$

$$V(s) := V(s) + \alpha_{m_s}\, d_t\, \epsilon_t(s)$$

States are updated in proportion to their eligibility $\epsilon_t(s)$, which decays over time. For a trajectory visiting states 1, 2, 3, 1:

t = 1, state 1: $\epsilon = (\lambda^0, 0, 0)$
t = 2, state 2: $\epsilon = (\lambda^1, \lambda^0, 0)$
t = 3, state 3: $\epsilon = (\lambda^2, \lambda^1, \lambda^0)$
t = 4, state 1: $\epsilon = (\lambda^3 + \lambda^0, \lambda^2, \lambda^1)$
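A sketch of the TD(λ) update with one eligibility trace per state; for simplicity a constant learning rate α is used instead of the visit-count dependent $\alpha_{m_s}$ of the slides.

```python
def td_lambda_episode(V, episode, alpha, gamma, lam):
    """episode is a list of (s, r, s_next) transitions of one trajectory;
    V is a dict over all states (including terminal ones)."""
    elig = {s: 0.0 for s in V}
    for s, r, s_next in episode:
        d = r + gamma * V[s_next] - V[s]   # TD error d_t
        for x in elig:                     # decay all traces by lambda ...
            elig[x] *= lam
        elig[s] += 1.0                     # ... and mark the visited state
        for x in V:                        # update every state in proportion
            V[x] += alpha * elig[x] * d    # to its eligibility
    return V
```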
When
- $\alpha_m$ is a decreasing series satisfying the Robbins-Monro criteria (cf. 1/m), and
- all states are visited infinitely often,
TD(λ) converges to the optimal solution with probability 1 [Bertsekas and Tsitsiklis, 1996].
NB Error in Kaelbling formula
Q learning
The two components of AHC can be unified in the Q learning algorithm (Watkins
1989).
Denote Q(s, a) the optimal expected value of state s when taking action a and then
proceeding optimally. That is
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s'|s, a) \max_{a'} Q(s', a')$$

and $V^*(s) = \max_a Q(s, a)$.

Using stochastic approximation, we obtain:
- Generate s' from the environment $T(s'|s, a)$.
- Update

$$Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

- Generate a' either at random or as $\arg\max_{a'} Q(s', a')$.
Convergence under similar criteria as TD.
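A tabular Q-learning sketch with ε-greedy exploration; the environment function `env_step(s, a)`, returning (r, s'), is an assumption of the sketch.

```python
import random

def q_learning(env_step, states, actions, s0, episodes=1000, steps=100,
               alpha=0.1, gamma=0.95, eps=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = s0
        for _ in range(steps):
            if random.random() < eps:                     # explore
                a = random.choice(actions)
            else:                                         # exploit
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next = env_step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # Q-learning update
            s = s_next
    return Q
```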
Model based approaches
Model-free methods are data-inefficient. The simplest model-based approach is to:
- make an arbitrary division between a learning phase and an action phase
- gather data 'some way'.

Problems:
- Random exploration may be dangerous and/or inefficient.
- Blind to changes in the environment.
Dyna
Idea: combine model-based and model-free learning (Sutton, 1990).
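A minimal Dyna-Q sketch in this spirit: every real transition updates Q directly (the model-free step) and is stored in a simple deterministic model; N additional planning backups then replay randomly chosen remembered (s, a) pairs. The function and variable names are illustrative.

```python
import random

def dyna_q_update(Q, model, s, a, r, s_next, actions, alpha, gamma, N):
    """Q must be initialized for all (state, action) pairs."""
    def backup(s, a, r, s_next):
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    backup(s, a, r, s_next)                 # direct RL step on real experience
    model[(s, a)] = (r, s_next)             # learn a (deterministic) model
    for _ in range(N):                      # planning: N backups on model samples
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps_next)
```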
Dyna example
Maze:
- In each of the 46 states there are 4 actions (N, E, S, W), which take the agent to the corresponding neighbouring state. When movement is blocked by an obstacle, no movement results.
- The reward is zero for all states and transitions except into the goal state G.
- After reaching the goal state the episode ends and the agent returns to the start state S.
- γ = 0.95, α = 0.1, ε = 0.1.
Dyna example
Left) Average over 30 runs of the number of steps per episode. The first episode requires 1700 steps.
Right) Policies found by planning (N = 50) and non-planning (N = 0) Dyna halfway through the second episode. N = 0 Dyna (normal Q-learning) has only updated the policy for the next-to-goal state. N = 50 Dyna has learned an environment model from the first episode, which is used to learn policies for all states.
Dyna larger example
A 3277-state shortest path problem formulated as a discounted RL problem. The goal state has reward 1, all other states have reward 0. Dyna (and prioritised sweeping) used N = 200 backups per transition.
Generalizations
A shortcoming of the Dyna method is that the planning steps are done at random.
- An improvement can be made by prioritized sweeping (Moore & Atkeson, 1993), which updates the states with the highest priority.
Combining Dyna with Monte-Carlo tree search yields state-of-the-art performance
on 9 × 9 computer Go [Silver et al., 2012]
Linear function approximation
In the standard treatment of RL, such as the value iteration, policy iteration and Q-learning methods discussed in the paper by Kaelbling, the basic quantity is the value of a state V(s), and the RL rules update V(s) directly.

In the book of Dayan and Abbott, instead, the update rules for temporal difference learning (Eq. 9.6-9.10, 9.25-9.25) use a representation of the state in terms of features, $V(s) = \sum_k w_k \phi_k(s)$, and the RL rules update the $w_k$.

The relation can be understood by considering $V(s) = \sum_k w_k \phi_k(s)$ as an approximation to the true value function V(s), which is the solution of the Bellman equation.

Consider for example the online version of value iteration. Based on the experience tuple (s, a, r, s'), the value of V(s) is updated to

$$V^+(s) = V(s) + \alpha\left( r + \gamma V(s') - V(s) \right) = V(s) + \alpha\delta, \qquad \delta = r + \gamma V(s') - V(s)$$

The question is now how the vector $w_k$ is updated to realize the change from $V(s) \to V^+(s)$.

The answer is to adapt $w_k$ so as to reduce the quadratic difference between $V^+(s)$
and V(s):
$$C = \left( V^+(s) - \sum_k w_k \phi_k(s) \right)^2$$

$$\frac{\partial C}{\partial w_k} = -2\, \phi_k(s) \left( V^+(s) - \sum_l \phi_l(s)\, w_l \right) = -2\alpha\, \phi_k(s)\, \delta$$

$$w_k \leftarrow w_k - \beta \frac{\partial C}{\partial w_k} = w_k + \gamma\, \phi_k(s)\, \delta$$
with γ = 2βα, which is equivalent to the rules (Eq. 9.6-9.10, 9.25-9.25).
See [Geramifard et al., 2013] for further details.
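In code, the resulting update is a single gradient step on w; the learning rate is called `eta` here to avoid the clash with the discount factor γ.

```python
import numpy as np

def td_linear_update(w, phi_s, phi_s_next, r, eta, gamma):
    """w_k <- w_k + eta * phi_k(s) * delta, with delta = r + gamma*V(s') - V(s)."""
    delta = r + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + eta * phi_s * delta

# usage with a hypothetical 3-dimensional feature map
w = np.zeros(3)
w = td_linear_update(w, phi_s=np.array([1.0, 0.0, 0.5]),
                     phi_s_next=np.array([0.0, 1.0, 0.0]),
                     r=1.0, eta=0.1, gamma=0.9)
print(w)
```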
References
[Barto et al., 1983] Barto, A., Sutton, R. S., and Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man,
and Cybernetics, 13(5):835–846.
[Bellman, 1957] Bellman, R. (1957). Dynamic programming. Princeton University Press.
[Bertsekas, 2000] Bertsekas, D. (2000). Dynamic Programming and optimal control. Athena Scientific, Belmont, Massachusetts. Second edition.
[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific, Belmont, Massachusetts.
[Geramifard et al., 2013] Geramifard, A., Walsh, T., Tellex, S., Chowdhary, G., Roy, N., and How, J.
(2013). A tutorial on linear function approximators for dynamic programming and reinforcement
learning. Foundations and Trends in Machine Learning, 6:375–454.
[Kaelbling et al., 1996] Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement learning:
a survey. Journal of Artificial Intelligence research, 4:237–285.
[Silver et al., 2012] Silver, D., Sutton, R. S., and Müller, M. (2012). Temporal-difference search in
computer go. Machine learning, 87(2):183–219.