Q-Learning
Christopher Watkins and Peter Dayan

Presented by Noga Zaslavsky, The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679), November 1, 2015

Outline
1. Introduction: what is reinforcement learning? Modeling the problem; Bellman optimality
2. Q-Learning: algorithm; convergence analysis
3. Assumptions and limitations
4. Summary

Introduction: What is reinforcement learning?

Reinforcement Learning: Agent-Environment Interaction
(figure: the interaction loop between the agent and its environment)
- Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
- It is not supervised learning, and it is not unsupervised learning.
- There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
- The learning agent receives feedback from its environment; it can learn by trying different actions in different situations.
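The interaction loop described above can be sketched in a few lines. The `Env` and `Agent` classes below are hypothetical stand-ins, not part of the original presentation; the toy dynamics and reward are made up for illustration.

```python
import random

# A minimal sketch of the agent-environment interaction loop.
# Env and Agent are illustrative placeholders, not from the paper.
class Env:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # toy dynamics: the action shifts the state within 0..3;
        # reward 1.0 for reaching state 3, which ends the episode
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

class Agent:
    def act(self, state):
        return random.choice([-1, 1])   # explore by acting randomly

env, agent = Env(), Agent()
s, total = env.reset(), 0.0
for _ in range(100):
    a = agent.act(s)                    # agent acts on the environment
    s, r, done = env.step(a)            # environment feeds back state and reward
    total += r
    if done:
        break
```

The agent here only explores; the rest of the talk is about how to turn such experience into a good policy.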
- There is a tradeoff between exploration and exploitation.

Toy Problem
(figure: a toy problem with an initial state s_0, intermediate states s, and a final state s_f)

Introduction: Modeling the problem

Markov Decision Process (MDP)
- S ∈ 𝒮: the state of the system
- A ∈ 𝒜: the agent's action
- p(s'|s, a): the dynamics of the system
- r : 𝒮 × 𝒜 → ℝ: the reward function (possibly stochastic)

Definition. An MDP is the tuple ⟨𝒮, 𝒜, p, r⟩.

- π(a|s): the agent's policy

(figure: the MDP as a graphical model; the state S_t evolves through the dynamics p, the action A_t is drawn from the policy π, and the reward R_t depends on both)

The Value Function
- The undiscounted value function for any given policy π:
  V^π(s) = E[ Σ_{t=0}^T r(s_t, a_t) | s_0 = s ]
  where T is the horizon; it can be finite, episodic, or infinite.
- The discounted value function for any given policy π:
  V^π(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]
  where 0 < γ < 1 is the discounting factor.
- The goal: find a policy π* that maximizes the value for all s ∈ 𝒮:
  V*(s) = V^{π*}(s) = max_π V^π(s)

Bellman optimality: The Bellman Equation
- The value function can be written recursively:
  V^π(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ] = E_{a∼π(·|s), s'∼p(·|s,a)}[ r(s, a) + γ V^π(s') ]
- The optimal value satisfies the Bellman equation:
  V*(s) = max_π E_{a∼π(·|s), s'∼p(·|s,a)}[ r(s, a) + γ V*(s') ]
- π* is not necessarily unique, but V* is unique.

Bellman optimality: The Q-Function
- The value of taking action a at state s:
  Q(s, a) = r(s, a) + γ E_{s'∼p(·|s,a)}[ V(s') ]
- If we know V*, then an optimal policy is to decide deterministically:
  a*(s) = argmax_a Q*(s, a)
- This means that V*(s) = max_a Q*(s, a).
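Extracting the deterministic greedy policy from a Q-table is a one-liner per state. The Q values below are made up purely for illustration.

```python
import numpy as np

# A hypothetical learned Q-table: rows are states, columns are actions.
Q = np.array([[0.1, 0.5],    # state 0: action 1 is best
              [0.7, 0.2],    # state 1: action 0 is best
              [0.3, 0.3]])   # state 2: tie, argmax picks action 0

a_star = Q.argmax(axis=1)    # a*(s) = argmax_a Q(s, a)
V_star = Q.max(axis=1)       # V*(s) = max_a Q(s, a)
print(a_star)                # -> [1 0 0]
print(V_star)                # -> [0.5 0.7 0.3]
```

This is exactly why learning Q* suffices: no model of p or r is needed to act optimally once Q* is known.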
Conclusion: learning Q* makes it easy to obtain an optimal policy.

Q-Learning: Learning Q*
- If the agent knows the dynamics p and the reward function r, it can find Q* by dynamic programming (e.g. value iteration).
- Otherwise, it needs to estimate Q* from its experience.
- The experience of an agent is a sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, ...
- The n-th episode is (s_n, a_n, r_n, s_{n+1}).

Q-Learning Algorithm
Initialize Q_0(s, a) for all a ∈ 𝒜 and s ∈ 𝒮.
For each episode n:
- observe the current state s_n
- select and execute an action a_n
- observe the next state s_{n+1} and the reward r_n
- update:
  Q_n(s_n, a_n) ← (1 - α_n) Q_{n-1}(s_n, a_n) + α_n [ r_n + γ max_a Q_{n-1}(s_{n+1}, a) ]
where α_n ∈ (0, 1) is the learning rate.
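The update rule above translates directly into a tabular algorithm. The chain environment and the constants (γ, α, ε) below are hypothetical choices for illustration; ε-greedy action selection is one common way to gather experience.

```python
import random

# Tabular Q-learning on a small made-up chain MDP: states 0..3, actions
# move left/right, reward 1.0 on reaching state 3 (then reset to 0).
nS, nA = 4, 2
gamma, alpha, eps = 0.9, 0.1, 0.2
Q = [[0.0] * nA for _ in range(nS)]

def step(s, a):
    s2 = max(0, min(nS - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == nS - 1 else 0.0)

random.seed(0)
s = 0
for n in range(20000):
    # epsilon-greedy: explore with probability eps, else exploit Q
    if random.random() < eps:
        a = random.randrange(nA)
    else:
        a = max(range(nA), key=lambda b: Q[s][b])
    s2, r = step(s, a)
    # the Q-learning update from the slide above (constant learning rate)
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s2]))
    s = 0 if s2 == nS - 1 else s2    # reset at the goal
```

After training, the greedy policy argmax_a Q[s][a] moves right from every non-goal state, and Q[2][1] approaches the true value 1.0.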
Q-Learning: Exploration vs. Exploitation
- How should the agent gain experience in order to learn?
- If it explores too much, it might suffer from a low value.
- If it exploits the learned Q function too early, it might get stuck at bad states without even knowing about better possibilities (overfitting).
- One common approach is to use an ε-greedy policy.

Q-Learning: Convergence of the Q function
Theorem. Given bounded rewards |r_n| ≤ R and learning rates 0 ≤ α_n ≤ 1 such that
  Σ_{n=1}^∞ α_n = ∞ and Σ_{n=1}^∞ α_n^2 < ∞,
then Q_n(s, a) → Q*(s, a) as n → ∞, with probability 1.
The exploration policy should be such that each state-action pair is encountered infinitely many times.

Proof (main idea)
- Define the operator
  B[Q](s, a) = r(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'} Q(s', a')
- B is a contraction under the ‖·‖_∞ norm:
  ‖B[Q_1] - B[Q_2]‖_∞ ≤ γ ‖Q_1 - Q_2‖_∞
- It is easy to see that B[Q*] = Q*.
- We are not done yet, because the updates come from a stochastic process.

Assumptions and Limitations
What assumptions did we make?
- Q-learning is model-free.
- The state and reward are assumed to be fully observable; a POMDP is a much harder problem.
- The Q function is assumed to be representable by a look-up table; otherwise we need something else, such as function approximation, and this is where deep learning comes in!

Summary
- The basic reinforcement learning problem can be modeled as an MDP.
- If the model is known, we can solve it precisely by dynamic programming.
- Q-learning allows learning an optimal policy when the model is not known, without trying to learn the model.
- It converges asymptotically to the optimal Q if the learning rate decays neither too fast nor too slow, and if the exploration policy is satisfactory.

Further Reading
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
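As the summary notes, when the model (p, r) is known, Q* can be computed by dynamic programming. A minimal value-iteration sketch, on a randomly generated MDP used only for illustration:

```python
import numpy as np

# Value iteration: repeated application of the Bellman operator
# B[Q](s,a) = r(s,a) + gamma * sum_{s'} p(s'|s,a) max_{a'} Q(s',a').
# Since gamma < 1, B is a gamma-contraction, so the iteration converges
# to the unique fixed point Q*. The MDP here is a made-up example.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)       # normalize p(.|s, a)
r = rng.random((nS, nA))

Q = np.zeros((nS, nA))
for _ in range(2000):
    Q_new = r + gamma * (p @ Q.max(axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-12:
        break
    Q = Q_new

pi_star = Q.argmax(axis=1)              # an optimal deterministic policy
```

Q-learning replaces the expectation over p with sampled transitions, which is exactly what makes it work without the model.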