Reinforcement Learning: Eligibility Traces

Reinforcement Learning
Eligibility Traces
Lecturer: 虞台文
Content
n-step TD Prediction
Forward View of TD(λ)
Backward View of TD(λ)
Equivalence of the Forward and Backward Views
Sarsa(λ)
Q(λ)
Eligibility Traces for Actor-Critic Methods
Replacing Traces
Implementation Issues
Reinforcement Learning
Eligibility Traces
n-Step
TD Prediction
Reinforcement Learning
Eligibility Traces
n-Step
TD Prediction
Elementary Methods
[Diagram: the elementary solution methods, Dynamic Programming, Monte Carlo methods, and TD(0).]
Monte Carlo vs. TD(0)
Monte Carlo vs. TD(0)
Monte Carlo
– observe the rewards for all steps in an episode:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
TD(0)
– observe one step only:
$R_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$
n-Step TD Prediction
[Backup diagrams: TD (1-step), 2-step, 3-step, …, n-step, …, Monte Carlo, with corresponding returns $R_t^{(1)}, R_t^{(2)}, R_t^{(3)}, \ldots, R_t^{(n)}, \ldots, R_t$.]
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
n-Step TD Prediction
$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
$R_t^{(3)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V_t(s_{t+3})$
$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$  (the corrected n-step truncated return)
$\Delta V_t(s) = \begin{cases} \alpha \left[ R_t^{(n)} - V_t(s) \right] & s = s_t \\ 0 & s \neq s_t \end{cases}$
Backups
Monte Carlo:
$V_{t+1}(s_t) = V_t(s_t) + \alpha \left[ R_t - V_t(s_t) \right]$
TD(0):
$V_{t+1}(s_t) = V_t(s_t) + \alpha \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] = V_t(s_t) + \alpha \left[ R_t^{(1)} - V_t(s_t) \right]$
n-step TD:
$V_{t+1}(s_t) = V_t(s_t) + \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$
or, written as an increment,
$\Delta V_t(s) = \begin{cases} \alpha \left[ R_t^{(n)} - V_t(s) \right] & s = s_t \\ 0 & s \neq s_t \end{cases}$
n-Step TD Backup
Online:
$V_{t+1}(s) = V_t(s) + \Delta V_t(s)$
Offline:
$V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$
When offline, the new V(s) will be used for the next episode.
Error Reduction Property
For both the online update $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$ and the offline update $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$:
$\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
The left-hand side is the maximum error using the n-step return; the right-hand side is $\gamma^n$ times the maximum error using the current value estimate V.
Example (Random Walk)
[Figure: 5-state random walk with states A, B, C, D, E; all rewards are 0 except a reward of 1 at the right terminal. True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6.]
Consider 2-step TD, 3-step TD, …
Which n is optimal?
Example (19-State Random Walk)
[Figure: 19-state random walk; average RMSE over the first 10 trials, plotted for various n, for online and offline n-step TD.]
Exercise (Random Walk)
[Figure: grid world with terminal rewards +1 and −1; standard moves.]
1. Evaluate the value function for the random policy.
2. Approximate the value function using n-step TD (try different n's and α's), and compare their performance.
3. Find the optimal policy.
Reinforcement Learning
Eligibility Traces
The Forward View of
TD()
Averaging n-step Returns
We are not limited to simply using the n-step TD returns.
For example, we could take an average of n-step TD returns, such as
$R_t^{\mathrm{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
The weights must sum to 1, and the result still counts as one backup.
TD()  -Return

TD() is a method for
averaging all n-step
backups
–
–
weight by n1 (time since
visitation)
Called -return

Rt  (1   ) 

n 1
R
n 1

(n)
t
Backup using -return:
Vt ( st )    Rt  Vt ( st ) 
w1
w2
w3
w
n
1
wTt 1
TD()  -Return

TD() is a method for
averaging all n-step
backups
–
–
weight by n1 (time since
visitation)
Called -return

Rt  (1   ) 

n 1
R
n 1

(n)
t
Backup using -return:
Vt ( st )    Rt  Vt ( st ) 
w1
(1   )
w2
(1   )
w3
(1   ) 3
w
n
1
 T t 1
wTt 1
TD()  -Return

TD() is a method for
averaging all n-step
backups
–
–
weight by n1 (time since
visitation)
Called -return

Rt  (1   ) 

n 1
R
n 1

(n)
t
Backup using -return:
Vt ( st )    Rt  Vt ( st ) 
w1
(1   )
w2
(1   )
w3
(1   ) 3
w
n
1
 T t 1
wTt 1
How about if 0?
How about if 1?
TD()  -Return

TD() is a method for
averaging all n-step
backups
–
–
weight by n1 (time since
visitation)
Called -return

Rt  (1   ) 

n 1
R
n 1

(n)
t
Backup using -return:
Vt ( st )    Rt  Vt ( st ) 
w1
(1   )
w2
(1   )
w3
(1   ) 3
w
n
1
 T t 1
wTt 1
How about if 0?
How about if 1?
TD()  -Return
TD() is a method for
0:
TD(0)
averaging
all n-step
backups
1:
Monte Carlo

–
–
weight by n1 (time since
visitation)
Called -return

Rt  (1   ) 

n 1
R
n 1

(n)
t
Backup using -return:
Vt ( st )    Rt  Vt ( st ) 
w1
(1   )
w2
(1   )
w3
(1   ) 3
w
n
1
 T t 1
wTt 1
Forward View of TD()
A theoretical view
TD() on the Random Walk
Reinforcement Learning
Eligibility Traces
The Backward View of
TD()
Why Backward View?
Forward view is acausal
– not implementable
Backward view is causal
– implementable
– in the offline case, it achieves the same result as the forward view
Eligibility Traces
Each state is associated with an additional memory variable, its eligibility trace, defined by:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$
Eligibility ≡ Recency of Visiting
At any time, the traces record which states have recently been visited, where "recently" is defined in terms of $\gamma\lambda$.
The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur.
Reinforcing Event
The reinforcing events are the moment-by-moment 1-step TD errors
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
which drive the value updates
$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$
TD()
Eligibility Traces
st  s
et 1 (s)
et (s)  
et 1 (s)  1 st  s
Reinforcing Events
 t  rt 1   Vt (st 1 )  Vt (st )
Value updates
Vt (s)   t et (s)
Online TD()
Initialize V ( s ) arbitrarily and e( s )  0, for all s  S
Repeat (for each episode):
Initialize s
Repeat (for each step of episode):
a  action given by  for s
Take action a, observe reward, r , and next state s
  r   V ( s)  V ( s )
e( s )  e( s )  1
For all s:
V (s )  V ( s )   e( s )
e( s )   e( s )
s  s
Until s is terminal
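A runnable Python sketch of the tabular online TD(λ) algorithm above, assuming a simple environment interface with reset() and step(a) returning (next_state, reward, done) and a policy function mapping states to actions (these interfaces are assumptions, not from the slides):

import numpy as np

def online_td_lambda(env, policy, n_states, alpha=0.1, gamma=1.0, lam=0.9, episodes=100):
    """Tabular online TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                  # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # action given by pi for s
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V[s_next] # terminal states have value 0
            delta = r + gamma * v_next - V[s]   # 1-step TD error (the reinforcing event)
            e[s] += 1.0                         # bump the trace of the visited state
            V += alpha * delta * e              # update every state in proportion to its trace
            e *= gamma * lam                    # decay all traces
            s = s_next
    return V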
Backward View of TD()
Backwards View vs. MC & TD(0)


Set  to 0, we get to TD(0)
Set  to 1, we get MC but in a better way
–
–
Can apply TD(1) to continuing tasks
Works incrementally and on-line (instead of
waiting to the end of the episode)
How about 0 <  < 1?
Reinforcement Learning
Eligibility Traces
Equivalence of the
Forward and
Backward Views
Offline TD()’s
Offline Forward TD()  -Return

Rt  (1   )  n 1 Rt( n )
n 1



[
R
s  st
f
t  Vt ( s )]
Vt ( s)  
s  st
0
Offline Backward TD()
st  s
et 1 (s)
et (s)  
et 1 (s)  1 st  s
 t  rt 1   Vt (st 1 )  Vt (st )
Vt b ( s )   t et ( s )
I sst
1 s  st

0 s  st
Forward View = Backward View
$\sum_{t=0}^{T-1} \Delta V_t^b(s) = \sum_{t=0}^{T-1} \Delta V_t^f(s_t)\, I_{ss_t}$
The left-hand side is the sum of the backward updates; the right-hand side is the sum of the forward updates. (See the proof at the end of these slides.)
TD() on the Random Walk
Offline -return
(forward)
Average
RMSE
Over First
10 Trials
Online TD()
(backward)
Reinforcement Learning
Eligibility Traces
Sarsa()
Sarsa()

TD() 
–

Use eligibility traces for policy evaluation
How can eligibility traces be used for
control?
–
Learn Qt(s, a) rather than Vt(s).
Sarsa()
Eligibility
Traces
(st , at )  ( s, a)
et 1 (s, a)
et ( s, a)  
et 1 (s, a)  1 ( st , at )  ( s, a)
Reinforcing
Events
 t  rt 1   Qt (st 1 , at 1 )  Qt (st , at )
Updates
Qt 1 (s, a)  Qt ( s, a)   t et (s, a)
Sarsa()
Initialize Q ( s, a ) arbitrarily and e( s, a )  0, for all s, a
Repeat (for each episode):
Initialize s, a
Repeat (for each step of episode):
Take action a, observe r , s
Choose a from s using policy derived from Q (e.g.  -greedy)
  r   Q( s, a)  Q( s, a )
e(s,a)  e(s,a)  1
For all s,a:
Q ( s, a )  Q ( s, a )   e( s, a )
e( s, a)   e( s, a )
s  s; a  a
Until s is terminal
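A runnable Python sketch of tabular Sarsa(λ) with accumulating traces, mirroring the pseudocode above; the environment interface and the ε-greedy helper are assumptions, not part of the slides:

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.9,
                 lam=0.9, epsilon=0.05, episodes=500, seed=0):
    """Tabular Sarsa(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                         # traces reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            q_next = 0.0 if done else Q[s_next, a_next]
            delta = r + gamma * q_next - Q[s, a]     # Sarsa TD error
            e[s, a] += 1.0                           # accumulating trace for (s, a)
            Q += alpha * delta * e                   # update all state-action pairs
            e *= gamma * lam                         # decay all traces
            s, a = s_next, a_next
    return Q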
Sarsa() 
Traces in Grid World

With one trial, the agent has much more information
about how to get to the goal
–

not necessarily the best way
Considerably accelerate learning
Reinforcement Learning
Eligibility Traces
Q()
Q-Learning
An off-policy method
– it breaks from time to time to take exploratory actions
– a simple time trace cannot be easily implemented
How to combine eligibility traces and Q-learning?
Three methods:
– Watkins's Q(λ)
– Peng's Q(λ)
– Naïve Q(λ)
Watkins's Q()
Estimation policy
(e.g., greedy)
Behavior policy
(e.g., -greedy)
Non-Greedy
Path
First
non-greedy
action
Greedy
Path
How to define the eligibility traces?
Backups ≡ Watkins's Q(λ)
Two cases:
1. Both the behavior and estimation policies take the greedy path.
2. The behavior path takes a non-greedy action before the episode ends.
[Backup diagrams for Case 1 and Case 2.]
Watkins's Q()

1   et 1 ( s, a )


et ( s, a )  0



 et 1 ( s, a )
( s, a )  ( st , at ) and
Qt 1 ( st , at )  max a Qt 1 ( st , a )
( s, a )  ( st , at ) and
Qt 1 ( st , at )  max a Qt 1 ( st , a)
otherwise
 t  rt 1   max a Qt (st 1 , a)  Qt (st , at )
Qt 1 (s, a)  Qt ( s, a)   t et (s, a)
Watkins's Q(λ)
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← arg max_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
        δ ← r + γ Q(s′, a*) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            If a′ = a*, then e(s, a) ← γλ e(s, a)
            else e(s, a) ← 0
        s ← s′; a ← a′
    until s is terminal
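A runnable Python sketch of Watkins's Q(λ) following the pseudocode above, reusing the epsilon_greedy helper from the Sarsa(λ) sketch; when the chosen action ties with the greedy value it is treated as greedy, as the pseudocode specifies:

import numpy as np

def watkins_q_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.9,
                     lam=0.9, epsilon=0.05, episodes=500, seed=0):
    """Tabular Watkins's Q(lambda): traces are cut after exploratory actions."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)       # helper from the Sarsa(lambda) sketch
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            a_star = int(np.argmax(Q[s_next]))       # greedy action in s'
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                      # if a' ties for the max, treat it as greedy
            q_star = 0.0 if done else Q[s_next, a_star]
            delta = r + gamma * q_star - Q[s, a]     # Q-learning TD error
            e[s, a] += 1.0
            Q += alpha * delta * e
            if a_next == a_star:
                e *= gamma * lam                     # greedy step: decay traces as usual
            else:
                e[:] = 0.0                           # exploratory step: cut all traces
            s, a = s_next, a_next
    return Q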
Peng's Q()



Cutting off traces loses much of the
advantage of using eligibility traces.
If exploratory actions are frequent, as
they often are early in learning, then only
rarely will backups of more than one or
two steps be done, and learning may be
little faster than 1-step Q-learning.
Peng's Q() is an alternate version of Q()
meant to remedy this.
Backups ≡ Peng's Q(λ)
Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3).
– Never cut traces
– Back up the max action except at the end
– The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ)
– Disadvantage: difficult to implement
Peng's Q(λ)
See Peng and Williams (1996) for the notation and update equations.
Naïve Q()

Idea: Is it really a problem to backup
exploratory actions?
–
–


Never zero traces
Always backup max at current action (unlike
Peng or Watkins’s)
Is this truly naïve?
Works well is preliminary empirical studies
Naïve Q()

1   et 1 ( s, a )


et ( s, a )  0



 et 1 ( s, a )
( s, a )  ( st , at ) and
Qt 1 ( st , at )  max a Qt 1 ( st , a )
( s, a )  ( st , at ) and
Qt 1 ( st , at )  max a Qt 1 ( st , a)
otherwise
 t  rt 1   max a Qt (st 1 , a)  Qt (st , at )
Qt 1 (s, a)  Qt ( s, a)   t et (s, a)
Comparisons
McGovern, Amy and Sutton, Richard S. (1997). Towards a better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop.
Deterministic gridworld with obstacles:
– 10x10 gridworld
– 25 randomly generated obstacles
– 30 runs
– α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05
– accumulating traces
Comparisons
[Figure: comparison of the Q(λ) variants on the gridworld task.]
Convergence of the Q()’s

None of the methods are proven to
converge.
–



Much extra credit if you can prove any of
them.
Watkins’s is thought to converge to Q*
Peng’s is thought to converge to a mixture
of Q and Q*
Naïve - Q*?
Reinforcement Learning
Eligibility Traces
Eligibility Traces for
Actor-Critic Methods
Actor-Critic Methods
[Diagram: the actor (policy) selects actions; the critic (value function) receives the state and reward from the environment and emits the TD error that trains both the critic and the actor.]
Critic: on-policy learning of V. Use TD(λ) as described before.
Actor: needs eligibility traces for each state-action pair.
Policy Parameter Updates
Method 1:
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\, \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
Generalized with eligibility traces:
$p_{t+1}(s,a) = p_t(s,a) + \beta\, \delta_t\, e_t(s,a)$
Policy Parameter Updates
Method 2:
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \beta\, \delta_t \left[ 1 - \pi_t(s_t, a_t) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
Generalized with eligibility traces:
$p_{t+1}(s,a) = p_t(s,a) + \beta\, \delta_t\, e_t(s,a)$
where
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
Reinforcement Learning
Eligibility Traces
Replacing
Traces
Accumulating/Replacing Traces
Accumulating traces:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$
Replacing traces:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ 1 & s = s_t \end{cases}$
Why Replacing Traces?
Using accumulating traces, frequently visited states can have eligibilities greater than 1
– this can be a problem for convergence
Replacing traces can significantly speed learning.
They can make the system perform well for a broader set of parameters.
Accumulating traces can do poorly on certain types of tasks.
Example (19-State Random Walk)
Extension to Action-Values
When you revisit a state, what should you do with the traces for the other actions?
Singh and Sutton (1996) suggest setting the traces of all the other actions from the revisited state to 0:
$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \neq a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \neq s_t \end{cases}$
Reinforcement Learning
Eligibility Traces
Implementation
Issues
Implementation Issues
For practical use we cannot compute every trace down to the last one.
Dropping very small trace values is recommended and encouraged.
If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).
Use with neural networks and backpropagation generally only causes a doubling of the needed computational power.
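One common way to follow the "drop very small trace values" advice is to keep only the non-negligible traces in a dictionary; a Python sketch (the threshold value is an arbitrary choice):

def decay_sparse_traces(e, gamma, lam, threshold=1e-4):
    """Decay a sparse dict of traces {state: trace} and drop negligible entries."""
    for s in list(e.keys()):
        e[s] *= gamma * lam
        if e[s] < threshold:
            del e[s]       # stop tracking states whose trace has become negligible
    return e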
Variable λ
Can generalize to a variable λ:
$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
Here λ_t is a function of time, e.g. $\lambda_t = \lambda(s_t)$.
Reinforcement Learning
Eligibility Traces
Proof
T 1
 V
Proof
t 0
An accumulating eligibility
trace can be written explicitly
(non-recursively) as
T 1
T 1
t
t 0
t 0
k 0
t
T 1
k 0
t k
T 1
T 1
t 0
k t
b
( s )   Vt ( st )I sst
f
t 0
t
et ( s)   ( )
b
t k

V
(
s
)


(

)
I ssk
 t
 t
T 1
T 1
   I ssk  ( )t  k  t
   I sst  ( ) k t  k
k 0
t k
I ssk
0  k  t  T 1
0  k  t  T 1
k t
T 1
 V
Proof
t
t 0
T 1
T 1
t
t 0
t 0
k 0
T 1
b
( s )   Vt ( st )I sst
f
t 0
b
t k

V
(
s
)


(

)
I ssk
 t
 t
1


Vt ( st )  Rt  Vt ( st )  Vt ( st )  (1   )  n 1 Rt( n )
f

n 1
 Vt ( st )  (1   ) [rt 1   Vt ( st 1 )]
0
(1   ) 1[rt 1   rt  2   2Vt ( st  2 )]
(1   ) 2 [rt 1   rt  2   2 rt 3   3Vt ( st 3 )]
T 1
 V
Proof
t 0
T 1
T 1
t
t 0
t 0
k 0
t
T 1
b
( s )   Vt ( st )I sst
f
t 0
b
t k

V
(
s
)


(

)
I ssk
 t
 t
1

0


V
(
s
)

(

)
[rt 1   Vt ( st 1 )  Vt ( st 1 )]
Vt ( st )
t
t
f
( )1[rt  2   Vt ( st  2 )  Vt ( st  2 )]
( ) 2 [rt 3   Vt ( st 3 )  Vt ( st 3 )]

( ) [rt 1   Vt ( st 1 )  Vt ( st )]
( )0  t
( )1[rt  2   Vt ( st  2 )  Vt ( st 1 )]
( )1 t 1
( ) 2 [rt 3   Vt ( st 3 )  Vt ( st  2 )]
( ) 2  t  2
0
T 1
 V
Proof
t 0
T 1
T 1
t
t 0
t 0
k 0
t
T 1
b
( s )   Vt ( st )I sst
f
t 0
b
t k

V
(
s
)


(

)
I ssk
 t
 t
1


T 1
Vt ( st )   ( )  k   ( ) k t  k
k t
f
k t
k t
T 1
T 1
T 1
k t


(

)
 k I sst

V
(
s
)
I
 t t sst  
f
t 0
t 0
k t
T 1
k
k 0
t 0
T 1
T 1
t
t 0
t 0
k 0
   k  ( ) k t I sst
f
t k

V
(
s
)
I


(

)
I ssk
 t t sst  t 
0  t  k  T 1
0  t  k  T 1
k t