Reinforcement Learning
Monte Carlo Methods I

Subramanian Ramamoorthy
School of Informatics
12 October, 2009


The RL Problem

Main Elements:
• States, s
• Actions, a
• State transition dynamics – often stochastic & unknown
• Reward (r) process – possibly stochastic

Objective: Policy π(s,a) – probability distribution over actions given current state

Assumption: Environment is a finite-state MDP


Three Aspects of the RL Problem

• Planning: The MDP is known (states, actions, transitions, rewards). Find an optimal policy, π*!
• Learning: The MDP is unknown. You are allowed to interact with it. Find an optimal policy π*!
• Optimal learning: While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.


Monte Carlo Methods

• Learn value functions
• Discover optimal policies
• Do not assume knowledge of a model, as in DP
• Learn from experience: sample sequences of states, actions and rewards (s, a, r)
  – In simulated or real (e.g., physical robotic) worlds
  – Clearly, a simulator is a model, but not a full one in the sense of explicit probability distributions
• Eventually attain optimal behaviour


Returns

• The return is the (possibly discounted) cumulative reward following a time step; value functions are expected returns.
• How did we achieve this in the multi-arm bandit problem?


How do MC Methods Work? Preview

• Divide experience into episodes – all episodes must terminate (e.g. noughts-and-crosses)
• Keep estimates of value functions, policies
• Change estimates/policies at the end of each episode – must keep track of the whole trajectory s1, a1, r1, s2, a2, r2, …, sT−1, aT−1, rT−1, sT (terminal state)
• Incremental episode-by-episode, NOT step-by-step (cf. Dynamic Programming)
• Average complete returns – NOT partial returns


Monte Carlo Policy Evaluation

• Goal: estimate Vπ(s), the value of state s under policy π
• Given: some number of episodes under π which contain s
• Conceptual idea: maintain average returns after visits to s
• Every-visit MC: update based on every time s is visited in an episode
• First-visit MC: update based on only the first time s is visited in an episode
• In both cases, the approximations converge asymptotically


First-visit Monte Carlo Policy Evaluation

(A minimal code sketch of this algorithm appears after the Blackjack solution slide below.)


Example: Blackjack

• Goal: achieve a card sum greater than the dealer's without exceeding 21
• Player's options: Hit (take another card) or Stick (pass)
  – If the player goes over 21 – loss
• Dealer follows a simple rule: Stick if sum ≥ 17, else Hit
• Result: closest to 21 wins; equally close is a draw


Example: Blackjack

• Goal: achieve a card sum greater than the dealer's without exceeding 21
• State space (200 states):
  – Current sum (12 – 21)
  – Dealer's showing card (ace – 10)
  – Do I have a usable ace (can be counted as 11 without overshooting)?
• Reward: +1 for a win, 0 for a draw, −1 for a loss
• Action space: stick (no more cards), hit (receive another card)
• Policy: stick if sum is 20 or 21, else hit
  Note: This is the (arbitrary) policy that the evaluation algorithm works with


Solution (Vπ): Blackjack

Why is this more choppy? Does it matter if you use every-visit or first-visit MC?
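The first-visit MC policy evaluation referenced above can be summarised in code. The following is a minimal sketch, not the lecture's own listing: it assumes a caller-supplied generate_episode() function (a hypothetical name) that runs the fixed policy π to termination and returns the episode as a list of (state, action, reward) triples, as in the Blackjack setup above.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation (prediction).

    generate_episode() is assumed to return one terminated episode,
    produced by following the fixed policy pi being evaluated, as a
    list of (state, action, reward) triples.
    """
    returns_sum = defaultdict(float)   # cumulative returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)             # value-function estimate

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Walk backwards so G accumulates the complete return from each step onward.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: only update if s does not occur earlier in the episode.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

For the Blackjack example, the state would be the (current sum, dealer's showing card, usable ace) triple and the episodes would be generated by the stick-on-20-or-21 policy.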
Remarks on Blackjack Example

• Why does the value function jump up for the last two rows in the rear?
  – When sums correspond to 20 or 21, the policy is to stick; this is a good choice in this region of state space (elsewhere there could have been divergence between π and π*…)
• Why does it drop off for the whole last row on the left?
  – The dealer is showing an ace, which gives him extra flexibility (two chances to get close to 21)
• Why are the foremost values higher on the upper plots than the lower plots?
  – The player has a usable ace (more flexibility)


Remarks: MC vs. DP

• MC does not assume a model as DP does, but the convergence analysis of both methods follows similar principles
• DP policy evaluation requires distributions that may be hard to define in general (the MC requirement is more relaxed)
  – Blackjack example: when the player's sum is 14 and he has chosen to stick, what is the expected reward?
  – A priori computation of such information can be demanding
  – Contrast with MC, which instead requires many plays/realizations


Backup in MC

• Does the concept of a backup diagram make sense for MC methods?
• As in the figure, MC does not sample all transitions
  – The root node is to be updated, as before
  – Transitions are dictated by the policy
  – Samples are acquired along a sample path
  – There is a clear path from the eventual reward to the states along the way (credit assignment is easier)
• Estimates for different states are independent
  – Computational complexity is not a function of state dimensionality


On the Monte Carlo Method

• Old mathematics problem [Dirichlet]: given a function f that has values everywhere on the boundary of a region in Rn, is there a unique continuous function u, twice continuously differentiable in the interior and continuous on the boundary, such that u is harmonic in the interior and u = f on the boundary?
• In layman's terms: what is the shape of a soap bubble in a warped ring?


Two Approaches for the Soap Bubble Example

• Relaxation (e.g., finite-difference)
• Monte Carlo method: approximate from sample paths (from interior to boundary)


Monte Carlo Estimation of Action Values

• The model is not available, so we do not know how states and actions interact
  – We want Q*
• We can try to approximate Qπ(s,a) using the Monte Carlo method
  – Asymptotic convergence if every state-action pair is visited
• Explore various start states: equal chance of starting from any given state
  – What did this mean for the soap bubble analogy?


Monte Carlo Control

• Policy Evaluation: Monte Carlo method
• Policy Improvement: greedify with respect to the state-value or action-value function


Convergence of MC Control

• Policy improvement still works if evaluation is done with MC: greedify to get πk+1(s) = argmax_a Qπk(s, a)
• πk+1 ≥ πk by the policy improvement theorem
• Assumption: exploring starts and an infinite number of episodes for MC policy evaluation (i.e., the value function has stabilized)
• Things to do (as in DP):
  – update only to a given tolerance
  – interleave evaluation/improvement


Monte Carlo Exploring Starts

(A minimal code sketch of this algorithm appears after the Summary.)


Blackjack Example, again


Summary

• MC methods learn V and Q from examples – sample episodes
• Do not assume prior knowledge of dynamics, etc.
• Can learn from simulated experience
  – Could focus on parts of the state space
• As they do not bootstrap, could be more tolerant to violations of the Markov property
• Need for sufficient exploration

Next time: How to avoid the need for infinite visits? How to learn about one policy while following another? Incremental versions?
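As a companion to the Monte Carlo Exploring Starts slide, here is a minimal sketch of MC control with exploring starts. The helpers are hypothetical: start_pairs is a list of all (state, action) pairs, each equally likely to start an episode, and simulate(s, a, policy) begins an episode by taking action a in state s, thereafter follows the current greedy policy (falling back to some default action for states not yet in the policy table), and returns the episode as (state, action, reward) triples.

```python
import random
from collections import defaultdict

def mc_exploring_starts(start_pairs, simulate, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (sketch).

    start_pairs: all (state, action) pairs; one is drawn uniformly to
                 start each episode (the "exploring starts" assumption).
    simulate(s, a, policy): assumed to run one episode that begins with
                 action a in state s, then follows `policy`, returning a
                 list of (state, action, reward) triples.
    """
    Q = defaultdict(float)            # action-value estimates
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {}                       # greedy policy, filled in as states are seen

    for _ in range(num_episodes):
        s0, a0 = random.choice(start_pairs)        # exploring start
        episode = simulate(s0, a0, policy)
        visited = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:          # first visit to (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Policy improvement: greedify at s over the actions estimated so far.
                actions = [b for (s2, b) in Q if s2 == s]
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return Q, policy
```

Interleaving first-visit evaluation of Q with greedification after every episode mirrors the evaluation/improvement interleaving noted on the Convergence of MC Control slide.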