Computational Modeling Lab

Reinforcement Learning: an introduction, part 4
Ann Nowé
[email protected]
http://como.vub.ac.be
Wednesday 18 June 2003
Based on the book by Sutton and Barto


Backup diagrams in DP

State-value function for policy \pi:

    V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]

Action-value function for policy \pi:

    Q^\pi(s,a) = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]

and for the optimal value function:

    V^*(s) = \max_a Q^*(s,a)

[Figure: backup diagrams. For V(s): root s branches over actions a1, a2, a3 to successor values V(s1), V(s1'), V(s2), V(s2'), V(s3), V(s3'). For Q(s,a): root (s,a) branches over rewards r1, r2 to successors with Q(s1,a1), Q(s1,a2), Q(s2,a1), Q(s2,a2).]


Dynamic Programming, model based

    V^\pi(s) = E_\pi\{ R_t | s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) | s_t = s \}

[Figure: full DP backup from s_t over r_{t+1} and s_{t+1}, expanding every action and successor state down to terminal states T.]


Recall Value Iteration in DP

In terms of action values Q(s,a), each sweep backs up all state-action pairs using the model:

    Q_{k+1}(s,a) = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q_k(s',a') ]


RL, model free

Without a model, the transition probabilities P^a_{ss'} and expected rewards R^a_{ss'} are unknown; updates must be based on sampled experience (s_t, r_{t+1}, s_{t+1}).

[Figure: sample backups along single trajectories from s_t via r_{t+1} to s_{t+1}, each ending in a terminal state T, in contrast with the full-width DP backup.]


Q-Learning, a value iteration approach

One-step Q-learning:

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]

Q-learning is off-policy.


Example

[Figure: six-state example MDP with actions a, b, c, d; stochastic transitions with probabilities 0.2/0.8 (action a from state 1) and 0.3/0.7, and rewards R = 1, R = 2, R = 4, R = 5, R = 10 on the transitions.]

Sampled epochs (visited states):
Epoch 1: 1, 2, 4
Epoch 2: 1, 6
Epoch 3: 1, 3
Epoch 4: 1, 2, 5
Epoch 6: 2, 5


Some convergence issues

Q-learning is guaranteed to converge in a Markovian setting.

Tsitsiklis, J. N. Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16:185-202, 1994.


Proof by Tsitsiklis: on the convergence of Q-learning

One-step Q-learning:

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]

can be written as:

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} - Q(s_t,a_t) + w ]

with zero-mean noise term

    w = r_{t+1} + \gamma \max_a Q(s_{t+1},a) - E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \}


Proof by Tsitsiklis, cont.

Stochastic approximation: for q = (q_1, ..., q_n),

    q_i(t+1) = q_i(t) + \alpha_i(t) [ F_i(q^i(t)) - q_i(t) + w_i(t) ]

where
- w_i(t) is a noise term;
- \alpha_i(t) is the "learning factor", with \sum_t \alpha_i(t) = \infty and \sum_t \alpha_i^2(t) < \infty;
- q^i(t) is the vector q, but with possibly outdated components:
      q^i(t) = ( q_1(\tau^i_1(t)), ..., q_n(\tau^i_n(t)) ), with 0 \le \tau^i_j(t) \le t;
- F is a contraction mapping with fixed point q^*: || F(q(t)) - q^* || \le \beta || q(t) - q^* ||, 0 \le \beta < 1.


Proof by Tsitsiklis, cont.

[Figure: the stochastic approximation step as a vector in the (q_i, q_j) plane: the contraction F pulls q toward the fixed point, perturbed by the noise term.]


Proof by Tsitsiklis, cont.

Relating Q-learning to stochastic approximation:

    Q_i(t+1) = Q_i(t) + \alpha_i(t) [ F_i(Q(t)) - Q_i(t) + w_i(t) ]

- the i-th component corresponds to a state-action pair (s_t, a_t);
- the learning factor \alpha_i(t) can vary in time;
- F_i(Q(t)) = E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} is the Bellman operator, a contraction mapping;
- w_i(t) = r_{t+1} + \gamma \max_a Q(s_{t+1},a) - E\{ r_{t+1} + \gamma \max_a Q(s_{t+1},a) \} is the noise term.


Sarsa: On-Policy TD Control

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ]

When is Sarsa = Q-learning?


Q-Learning versus SARSA

One-step Q-learning:

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) ]

Q-learning is off-policy.

Sarsa:

    Q(s_t,a_t) ← Q(s_t,a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ]

Sarsa is on-policy.


Cliff Walking example

Actions: up, down, left, right.
Reward: cliff -100, goal 0, default -1.
Action selection: ε-greedy, with ε = 0.1.
Sarsa takes the exploration into account: it learns the longer but safer path away from the cliff, whereas Q-learning learns the values of the optimal path along the cliff edge.


Q-learning for CAC

Q-learning for Call Admission Control (CAC).
Acceptance criterion: maximize network revenue.

[Figure: admission decisions for two call classes; a state (i, j) apparently counts ongoing Class-1 and Class-2 calls. In S1 = (2,4) a Class-1 arrival is accepted or rejected by comparing Q(s1,A1) and Q(s1,R1), with acceptance leading to S2 = (3,4); in S3 = (3,3) a Class-2 arrival is evaluated via Q(s3,A2) and Q(s3,R2).]


Continuous Time Q-learning for CAC

    Q^{(k+1)}(x,a) = Q^{(k)}(x,a) + \alpha_k [ IR(x,a) + e^{-\beta t_n} \max_{a'} Q^{(k)}(y,a') - Q^{(k)}(x,a) ]    [Bradtke]

[Figure: timeline. At t_0 = 0 a call arrives with the system in state x; call departures occur at t_1, t_2, ...; the next call arrival at t_n finds the system in state y.]

IR(x,a) is the discounted reward accumulated between the two decision epochs:

    IR(x,a) = \int_0^{t_1} e^{-\beta s} \rho_0(x,a) ds + \int_{t_1}^{t_2} e^{-\beta s} \rho_1(x,a) ds + ... + \int_{t_{n-1}}^{t_n} e^{-\beta s} \rho_{n-1}(x,a) ds

where \rho_i(x,a) is the reward rate in effect between consecutive events.
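
To make the full value-iteration backup concrete, here is a minimal tabular sketch in Python. It is not from the lecture; the model encoding P[(s, a)] as a list of (probability, next state, reward) triples and the tiny two-state model in the usage line are assumptions for illustration.

    # Q-value iteration on a known model (DP is model based).
    # P[(s, a)] -> list of (prob, next_state, reward); 'T' marks a terminal state.
    def q_value_iteration(P, actions, gamma=0.9, sweeps=100):
        Q = {(s, a): 0.0 for s in actions for a in actions[s]}
        for _ in range(sweeps):
            new_Q = {}
            for (s, a), outcomes in P.items():
                total = 0.0
                for p, s2, r in outcomes:
                    # full backup: expectation over the model, max over next actions
                    future = 0.0 if s2 == 'T' else max(Q[(s2, a2)] for a2 in actions[s2])
                    total += p * (r + gamma * future)
                new_Q[(s, a)] = total
            Q = new_Q
        return Q

    # Hypothetical two-state model, purely for illustration:
    P = {(1, 'a'): [(1.0, 2, 0.0)], (2, 'b'): [(1.0, 'T', 1.0)]}
    actions = {1: ['a'], 2: ['b']}
    print(q_value_iteration(P, actions))  # Q(2,'b') -> 1.0, Q(1,'a') -> gamma * 1.0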
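
The one-step Q-learning update can be implemented directly from sampled transitions. A minimal sketch, reusing the P[(s, a)] encoding above but only as a simulator the learner samples from, never reads: the ε-greedy behaviour policy and all constants are illustrative choices, not taken from the slides.

    import random
    from collections import defaultdict

    def sample_step(P, s, a):
        # Draw (next_state, reward) from the simulator; the learner sees only the sample.
        u, acc = random.random(), 0.0
        for p, s2, rew in P[(s, a)]:
            acc += p
            if u <= acc:
                return s2, rew
        return P[(s, a)][-1][1:]

    def q_learning(P, actions, start, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
        # Assumes an episodic task: every policy eventually reaches terminal state 'T'.
        Q = defaultdict(float)
        for _ in range(episodes):
            s = start
            while s != 'T':
                acts = actions[s]
                # epsilon-greedy action selection
                a = random.choice(acts) if random.random() < eps else max(acts, key=lambda x: Q[(s, x)])
                s2, r = sample_step(P, s, a)
                # off-policy target: bootstrap on the greedy next action
                target = r if s2 == 'T' else r + gamma * max(Q[(s2, a2)] for a2 in actions[s2])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q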
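
The only difference between Sarsa and Q-learning is the bootstrap target, which also answers the slide's question "When is Sarsa = Q-learning?": with a fully greedy behaviour policy (ε = 0) the selected next action is the argmax, so the two targets coincide. A side-by-side sketch with illustrative function names:

    def q_learning_target(Q, r, s2, actions, gamma):
        # off-policy: max over next actions, independent of what the agent actually does
        return r + gamma * max(Q[(s2, a)] for a in actions[s2])

    def sarsa_target(Q, r, s2, a2, gamma):
        # on-policy: a2 is the action actually chosen by the behaviour policy (e.g. epsilon-greedy)
        return r + gamma * Q[(s2, a2)]

On cliff walking this is exactly why Sarsa takes exploration into account: occasional exploratory steps near the cliff depress the Q-values along the edge, so the on-policy target steers Sarsa onto the safer path, while Q-learning's max ignores them.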
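
With piecewise-constant reward rates, each piece of IR(x,a) has the closed form \int_{t_i}^{t_{i+1}} e^{-\beta s} \rho_i ds = (\rho_i / \beta)(e^{-\beta t_i} - e^{-\beta t_{i+1}}). A small sketch computing IR from a list of event times and rates; the function, its argument names, and the numbers in the usage line are illustrative assumptions, not from the slides.

    import math

    def integrated_reward(event_times, rates, beta):
        # event_times = [t0 = 0, t1, ..., tn]; rates[i] is the reward rate on [t_i, t_{i+1}).
        # Each piece integrates to (rho_i / beta) * (e^{-beta*t_i} - e^{-beta*t_{i+1}}).
        assert len(rates) == len(event_times) - 1
        total = 0.0
        for rho, t0, t1 in zip(rates, event_times, event_times[1:]):
            total += rho / beta * (math.exp(-beta * t0) - math.exp(-beta * t1))
        return total

    # e.g. two active calls until a departure at t = 1.0, then one call until the next arrival at t = 2.5:
    print(integrated_reward([0.0, 1.0, 2.5], [2.0, 1.0], beta=0.1))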