Computational Modeling Lab (COMO)

Reinforcement Learning: an introduction
part 4
Ann Nowé
[email protected]
http://como.vub.ac.be
Wednesday 18 June 2003
By Sutton and Barto
Backup diagrams in DP
State-value function for policy π

[Backup diagram: root state s with value V(s); actions a1, a2, a3; successor states with values V(s1), V(s1'), V(s2), V(s2'), V(s3), V(s3')]

V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
Action-value function for policy π

[Backup diagram: root state-action pair (s,a) with value Q(s,a); rewards r1, r2; successor states s1, s2 with action values Q(s1,a1), Q(s1,a2), Q(s2,a1), Q(s2,a2)]

V(s) = \max_a Q(s,a)

Q^\pi(s,a) = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
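As a concrete illustration of these two backups, here is a minimal Python sketch; the function name, the toy model arrays P, R and pi, and the default discount factor are my own assumptions, not taken from the slides. It iterates the state-value backup and then derives the action values with the second equation.

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, iters=1000):
    # P[s, a, s2]: transition probability, R[s, a, s2]: expected reward,
    # pi[s, a]: probability that the policy selects action a in state s.
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    for _ in range(iters):
        # Q(s,a) = sum_s' P^a_{ss'} [ R^a_{ss'} + gamma V(s') ]
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        # V(s) = sum_a pi(s,a) Q(s,a)
        V = np.einsum('ij,ij->i', pi, Q)
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return V, Q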
Dynamic Programming, model based
V^\pi(s) = E_\pi[ R_t \mid s_t = s ]
         = E_\pi[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s ]

[Full-backup diagram (DP): from state s_t, all actions and all possible successor states s_{t+1}, with rewards r_{t+1}, contribute to the backup]
Recall Value Iteration in DP
[Slide figure: the value-iteration backup applied to Q(s,a)]
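Since only the Q(s,a) backup survives from this slide, here is a minimal sketch of value iteration on action values, under the same assumed model representation as above (an illustration, not the slide's own algorithm).

import numpy as np

def q_value_iteration(P, R, gamma=0.9, iters=1000):
    # Q(s,a) <- sum_s' P^a_{ss'} [ R^a_{ss'} + gamma max_a' Q(s',a') ]
    nS, nA, _ = P.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = Q.max(axis=1)   # greedy value of each successor state
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return Q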
RL, model free
[Sample-backup diagram (model-free RL): only a single experienced transition s_t, r_{t+1}, s_{t+1} is used]
Q-Learning, a value iteration approach
One-step Q-learning:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]
Q-learning is off-policy
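A minimal tabular implementation of this update could look as follows; the defaultdict Q-table, the step size alpha = 0.1 and the discount gamma = 0.9 are illustrative assumptions, not values from the slides.

from collections import defaultdict

Q = defaultdict(float)      # Q[(s, a)] -> current estimate, defaults to 0
alpha, gamma = 0.1, 0.9     # assumed step size and discount factor

def q_update(s, a, r, s_next, actions):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])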
Example

[Figure: example MDP with states 1 to 6 and actions a, b, c, d; some transitions are stochastic (probabilities 0.2/0.8 and 0.3/0.7), others deterministic (probability 1); the transition rewards are R = 1, 1, 2, 4, 5 and 10]

Observed epochs (visited states):
Epoch 1: 1, 2, 4
Epoch 2: 1, 6
Epoch 3: 1, 3
Epoch 4: 1, 2, 5
Epoch 6: 2, 5
Some convergence issues
Q-learning is guaranteed to converge in a Markovian setting.

Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16:185-202, 1994.
Proof by Tsitsiklis, cont.
On the convergence of Q-learning
One-step Q-learning:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]

Can be written as:

Q(s_t,a_t) \leftarrow Q(s_t,a_t)
    + \alpha [ E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} - Q(s_t,a_t) ]
    + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} ]
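The reason the last bracket can be treated as noise is that, conditioned on the current state-action pair (and the current estimates Q), its expectation is zero; written out as a small intermediate step not shown on the slide:

w_t = r_{t+1} + \gamma \max_a Q(s_{t+1},a) - E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \mid s_t, a_t \},
\qquad E\{ w_t \mid s_t, a_t \} = 0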
Proof by Tsitsiklis
On the convergence of Q-learning

Stochastic approximation:

q = (q_1, ..., q_n)      (for Q-learning, each component q_i corresponds to one entry Q(s,a))

q_i(t+1) = q_i(t) + \alpha_i(t) [ F_i(q^i(t)) - q_i(t) + w_i(t) ]

w_i(t): noise term
\alpha_i(t): "learning factor", with
    \sum_{t=0}^{\infty} \alpha_i(t) = \infty   and   \sum_{t=0}^{\infty} \alpha_i^2(t) < \infty

q^i(t): the q vector, but with possibly outdated components:
    q^i(t) = ( q_1(\tau^i_1(t)), ..., q_n(\tau^i_n(t)) ),   with 0 \le \tau^i_j(t) \le t

F is a contraction mapping: \| F(q) - F(q') \| \le \beta \| q - q' \| for some 0 \le \beta < 1

Proof by Tsitsiklis, cont.
Stochastic approximation, as a vector
[Figure: the iterate drawn in the (q_i, q_j) plane over time t; each step follows the direction of F_i plus noise]
Proof by Tsitsiklis, cont.
Relating Q-learning to stochastic approximation

Q(s_t,a_t) \leftarrow Q(s_t,a_t)
    + \alpha [ E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} - Q(s_t,a_t) ]
    + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} ]

where:
- Q(s_t,a_t) is the i-th component (one component per state-action pair),
- the learning factor \alpha can vary in time,
- E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} is the contraction mapping (Bellman operator) applied to Q,
- the last bracket is the noise term.

This matches the stochastic-approximation iteration:

Q_i(t+1) = Q_i(t) + \alpha_i(t) [ F_i(Q^i(t)) - Q_i(t) + w_i(t) ]
Sarsa: On-Policy TD Control
Qst ,at   Qst ,at    rt 1   st 1,at 1  Qst ,at 

When is Sarsa = Q-learning?
Q-Learning versus SARSA
One-step Q-learning:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]
Q-learning is off-policy

Sarsa:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ]
Sarsa is on-policy
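To make the off-policy/on-policy distinction concrete, here is a sketch of the two bootstrap targets for one observed transition; the Q-table layout, the action set and the chosen next action a_next are placeholders, not from the slides.

def q_learning_target(Q, r, s_next, actions, gamma):
    # Off-policy: bootstrap on the greedy action in s_next, whatever is actually executed.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap on the action a_next the behaviour policy actually selects in s_next.
    return r + gamma * Q[(s_next, a_next)]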
Cliff Walking example
Actions: up, down, left, right
Reward: cliff -100, goal 0, default -1.
Action selection: ε-greedy, with ε = 0.1
Sarsa takes exploration into account: it learns values for the ε-greedy policy it actually follows
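The ε-greedy selection with ε = 0.1 used here can be sketched as follows; the Q-table layout and the tie-breaking rule are my own assumptions.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore uniformly with probability epsilon, otherwise act greedily with respect to Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])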
Q-learning for CAC (Call Admission Control)
Acceptance criterion: maximize network revenue.

[Figure: two call classes (Class-1, Class-2) and example states S1 = (2,4), S2 = (3,4), S3 = (3,3); on a Class-1 arrival in s1 the agent compares Q(s1,A1) (accept) with Q(s1,R1) (reject), and on a Class-2 arrival in s3 it compares Q(s3,A2) with Q(s3,R2)]
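One plausible way to encode this accept/reject decision in code is sketched below; the state encoding (number of ongoing calls per class) and the Q-table layout are assumptions made purely for illustration.

from collections import defaultdict

Q = defaultdict(float)   # Q[(state, call_class, action)] with action in {"accept", "reject"}

def admission_decision(state, call_class):
    # Compare the Q-value of accepting with the Q-value of rejecting the arriving call,
    # e.g. Q(s1, A1) versus Q(s1, R1) for a Class-1 arrival in state s1 = (2, 4).
    if Q[(state, call_class, "accept")] >= Q[(state, call_class, "reject")]:
        return "accept"
    return "reject"

decision = admission_decision((2, 4), 1)   # hypothetical Class-1 arrival in state (2, 4)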
Continuous Time Q-learning for CAC


Q(k 1) (x,a)  Q(k ) (x,a)   k IR(x,a)  e  . max Q(k ) (y,a')  Q(k ) (x,a)
a'
[Bradtke]
[Timeline: at t_0 = 0 the system is in state x; call arrivals and call departures occur at t_1, t_2, ..., t_n; at the next decision epoch, after sojourn time \tau, the system is in state y]

IR(x,a) = \int_0^{t_1} e^{-\beta s} \rho_0(x,a) ds + \int_{t_1}^{t_2} e^{-\beta s} \rho_1(x,a) ds + \int_{t_2}^{t_3} e^{-\beta s} \rho_2(x,a) ds + \dots + \int_{t_n}^{\tau} e^{-\beta s} \rho_n(x,a) ds

with \rho_i(x,a) the reward rate in effect between successive events.
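A rough numerical sketch of this continuous-time update follows; the piecewise-constant reward rates, event times, sojourn time tau and the parameters alpha and beta are illustrative placeholders, not values from the slides.

import math

def immediate_reward(rates, intervals, beta):
    # IR(x,a) = sum_i  integral over [t_i, t_{i+1}] of e^{-beta s} * rho_i ds
    #         = sum_i  rho_i * (e^{-beta t_i} - e^{-beta t_{i+1}}) / beta
    total = 0.0
    for rho, (t_start, t_end) in zip(rates, intervals):
        total += rho * (math.exp(-beta * t_start) - math.exp(-beta * t_end)) / beta
    return total

def smdp_q_update(Q, x, a, y, rates, intervals, tau, actions, alpha, beta):
    # Q(x,a) <- Q(x,a) + alpha [ IR(x,a) + e^{-beta tau} max_a' Q(y,a') - Q(x,a) ]
    ir = immediate_reward(rates, intervals, beta)
    best_next = max(Q[(y, a_next)] for a_next in actions)
    Q[(x, a)] += alpha * (ir + math.exp(-beta * tau) * best_next - Q[(x, a)])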