
Reinforcement Learning
Monte Carlo Methods I
Subramanian Ramamoorthy
School of Informatics
12 October, 2009
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics: often stochastic & unknown
• Reward (r) process: possibly stochastic
Objective: Policy π(s, a)
– probability distribution over actions given the current state
Assumption: Environment is a finite-state MDP
Three Aspects of the RL Problem
• Planning
The MDP is known (states, actions, transitions, rewards).
Find an optimal policy, π*!
• Learning
The MDP is unknown. You are allowed to interact with it.
Find an optimal policy π*!
• Optimal learning
While interacting with the MDP, minimize the loss due to not
using an optimal policy from the beginning.
Monte Carlo Methods
• Learn value functions
• Discover optimal policies
• Do not assume knowledge of a model, as in DP
• Learn from experience: sample sequences of states, actions and rewards (s, a, r)
– In simulated or real (e.g., physical robotic) worlds
– A simulator is, of course, a model, but not a full one in the sense of explicit probability distributions
• Eventually attain optimal behaviour
Returns
How did we achieve this in the multi-armed bandit problem?
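The return formula itself is not in the extracted text; presumably it is the standard discounted sum of rewards used earlier in the course, restated here (as an assumption) with this deck's indexing, where reward r_t follows the visit to s_t:

$$ R_t \;=\; r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots + \gamma^{T-1-t}\, r_{T-1}, \qquad 0 \le \gamma \le 1, $$

for an episode terminating at time T. In the bandit case each "episode" is a single pull, so the return reduces to the immediate reward and value estimates are simply averages of sampled rewards.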
How do MC Methods Work? Preview
• Divide experience into episodes
– all episodes must terminate (e.g. noughts-and-crosses)
• Keep estimates of value functions, policies
• Change estimates/policies at the end of each episode
– must keep track of s1, a1, r1, s2, a2, r2, … sT−1, aT−1, rT−1, sT (terminal state)
• Incremental episode-by-episode, NOT step-by-step (cf. Dynamic Programming)
• Average complete returns – NOT partial returns (see the formula below)
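Written out (the standard first-visit estimate, stated here for reference; R^(i)(s) and N(s) are notation introduced for this note, not from the slide), the value estimate is just the sample mean of complete returns observed after visits to s:

$$ V^{\pi}(s) \;\approx\; \frac{1}{N(s)} \sum_{i=1}^{N(s)} R^{(i)}(s), $$

where $R^{(i)}(s)$ is the complete return following the $i$-th qualifying visit to $s$ and $N(s)$ counts those visits.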
Monte Carlo Policy Evaluation
• Goal: learn the state-value function V^π(s) for a given policy π
• Given: some number of episodes generated under π which contain s
• Conceptual idea: maintain the average of returns observed after visits to s
• Every-visit MC: update based on every time s is visited in an episode
• First-visit MC: update based on only the first time s is visited in an episode
• In both cases, the estimates converge asymptotically to V^π(s)
First-visit Monte Carlo Policy Evaluation
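The slide's algorithm box is not reproduced in this text version. Below is a minimal Python sketch of first-visit MC policy evaluation; the `generate_episode` callable and its `(state, reward)` episode format are assumptions for illustration, not part of the original slide.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, n_episodes=10_000, gamma=1.0):
    """Estimate V(s) for the policy baked into generate_episode(), which is
    assumed to return one terminating episode as a list of (state, reward)
    pairs, where each reward follows the visit to its state."""
    V = defaultdict(float)        # value estimates, default 0
    N = defaultdict(int)          # number of first visits per state

    for _ in range(n_episodes):
        episode = generate_episode()

        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards accumulating complete returns; update first visits only.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                N[s] += 1
                V[s] += (G - V[s]) / N[s]      # incremental running average

    return V
```

Averaging complete returns once the episode has ended is exactly the "NOT partial returns" point from the preview slide.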
Example: Blackjack
• Goal: achieve a card sum greater than the dealer's without exceeding 21
• Player's options: Hit (take another card) or Stick (pass)
– If the player exceeds 21, he goes bust and loses
• Dealer follows a simple fixed rule: Stick if sum ≥ 17, else Hit
• Result: closest to 21 wins; equally close is a draw
Example: Blackjack
• Goal: achieve a card sum greater than the dealer's without exceeding 21
• State space (200 states):
– Current sum (12–21)
– Dealer's showing card (ace–10)
– Do I have a usable ace (one that can count as 11 without going bust)?
• Reward: +1 for a win, 0 for a draw, -1 for a loss
• Action space: stick (no more cards), hit (receive another card)
• Policy: stick if the sum is 20 or 21, else hit
Note: this is the (arbitrary) policy that the algorithm evaluates; a simulation sketch follows below
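To make the setup concrete, here is a simplified simulation sketch (infinite deck, no naturals/splitting/doubling; all function names are illustrative, not from the slides):

```python
import random

def draw_card():
    """Infinite deck: 1 = ace, 2-9 face value, 10 counts 10/J/Q/K."""
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (total, usable_ace), counting one ace as 11 when it does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy):
    """Play one hand; return the visited (state, action) pairs and the final reward."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    trajectory = []

    # Player's turn, following the given policy; sums below 12 are always hit.
    while True:
        total, usable = hand_value(player)
        if total < 12:
            player.append(draw_card())
            continue
        state = (total, dealer[0], usable)   # (sum, dealer's showing card, usable ace)
        action = policy(state)
        trajectory.append((state, action))
        if action == "hit":
            player.append(draw_card())
            if hand_value(player)[0] > 21:
                return trajectory, -1        # player busts: loss
        else:
            break

    # Dealer's turn: fixed rule, stick on 17 or more.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total, dealer_total = hand_value(player)[0], hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return trajectory, +1
    if player_total == dealer_total:
        return trajectory, 0
    return trajectory, -1

# The (arbitrary) policy evaluated here: stick only on 20 or 21.
stick_on_20_or_21 = lambda state: "stick" if state[0] >= 20 else "hit"
```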
Solution (V^π): Blackjack
[Figure: Monte Carlo estimates of the state-value function for the stick-on-20-or-21 policy, with and without a usable ace]
Why is this more choppy?
Does it matter if you use every-visit or first-visit MC?
Remarks on Blackjack Example
• Why does the value function jump up for the last two rows in
the rear?
– When the sum is 20 or 21, the policy is to stick; this is a good choice in this region of state space (could have had divergence between π and π* …)
• Why does it drop off for the whole last row on the left?
– Dealer is showing an ace, which gives him extra flexibility (two
chances to get close to 21)
• Why are the foremost values higher on upper plots than lower
plots?
– Player has usable ace (more flexibility)
Remarks: MC vs. DP
• MC does not assume a model, as DP does, but the convergence analysis of both methods follows similar principles
• DP policy evaluation requires distributions that may be hard
to define in general (MC requirement is more relaxed)
– Blackjack example: when the player's sum is 14 and he has chosen to stick, what is the expected reward?
– A priori computation of such information can be demanding
– Contrast with MC, which requires many plays/realizations
Backup in MC
• Does the concept of backup diagram make sense for MC
methods?
• As in the figure (backup diagram), MC does not sample all transitions
– Root node to be updated as before
– Transitions are dictated by policy
– Acquire samples along a sample path
– Clear path from eventual reward to states
along the way (credit assignment easier)
• Estimates for different states are independent
– Computational complexity not a function of state dimensionality
On the Monte Carlo Method
• Old mathematics problem [Dirichlet]: Given function f that
has values everywhere on the boundary of a region in R^n, is
there a unique continuous function u twice continuously
differentiable in the interior and continuous on the boundary,
such that u is harmonic in the interior & u = f on boundary?
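In symbols (the standard statement of the Dirichlet problem, added here for reference; Ω denotes the region):

$$ \Delta u = 0 \ \text{in the interior } \Omega, \qquad u = f \ \text{on the boundary } \partial\Omega. $$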
• In layman’s terms:
What is the shape of a soap
bubble in a warped ring?
Two Approaches for Soap Bubble Example
• Relaxation (e.g., finite-difference): iteratively sweep the grid, replacing each interior value by the average of its neighbours
• Monte Carlo method: approximate u from sample paths, i.e., random walks from interior points to the boundary
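A minimal sketch of the sample-path idea on a grid (the domain, boundary values, and function names below are illustrative assumptions): the discrete harmonic function's value at an interior point equals the expected boundary value reached by a simple random walk started there.

```python
import random

def estimate_u(start, is_boundary, f, n_walks=10_000):
    """Monte Carlo estimate of the discrete harmonic function u at `start`:
    average the boundary value f reached by random walks from `start`."""
    total = 0.0
    for _ in range(n_walks):
        x, y = start
        while not is_boundary(x, y):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += f(x, y)                  # boundary value hit by this walk
    return total / n_walks

# Example: 10x10 grid, f = 1 on the top edge, 0 on the other edges.
on_boundary = lambda x, y: x in (0, 10) or y in (0, 10)
f = lambda x, y: 1.0 if y == 10 else 0.0
print(estimate_u((5, 5), on_boundary, f))    # ~0.25 at the centre, by symmetry
```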
Monte Carlo Estimation of Action Values
• Model is not available, so we do not know how states and
actions interact
– We want Q*
• We can try to approximate Q^π(s, a) using the Monte Carlo method
– Asymptotic convergence if every state-action pair is visited (infinitely often, in the limit)
• Exploring starts: give every state-action pair a nonzero chance of being selected as the start of an episode
– What did this mean for the soap bubble analogy?
Monte Carlo Control
• Policy Evaluation: Monte Carlo method
• Policy Improvement: greedify with respect to the state-value or action-value function
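The alternation can be pictured as the usual generalized policy iteration chain (standard notation, not reproduced from the slide's figure; E = evaluation, I = improvement):

$$ \pi_0 \xrightarrow{\;E\;} Q^{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} Q^{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi^{*} \xrightarrow{\;E\;} Q^{*} $$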
Convergence of MC Control
• Policy improvement still works if evaluation is done with MC (see the derivation below):
• $\pi_{k+1} \ge \pi_k$ by the policy improvement theorem
• Assumption: exploring starts and an infinite number of episodes for MC policy evaluation (i.e., the value function has stabilized)
• Things to do (as in DP):
– update only to given tolerance
– interleave evaluation/improvement
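The derivation behind the second bullet (the standard argument, restated here since the slide's equation is not in the extracted text): greedifying with respect to the MC estimate of Q^{π_k} gives, for every state s,

$$ Q^{\pi_k}\big(s, \pi_{k+1}(s)\big) \;=\; \max_a Q^{\pi_k}(s, a) \;\ge\; Q^{\pi_k}\big(s, \pi_k(s)\big) \;=\; V^{\pi_k}(s), $$

so the policy improvement theorem yields $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for all s, i.e., $\pi_{k+1} \ge \pi_k$.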
Monte Carlo Exploring Starts
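The pseudocode box for MC-ES is not in the extracted text; the following Python sketch shows the idea under assumed interfaces (the `generate_episode` function, its `(state, action, reward)` episode format, and the finite `states`/`actions` lists are hypothetical):

```python
import random
from collections import defaultdict

def mc_es(states, actions, generate_episode, n_episodes=100_000, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit averaging).

    generate_episode(s0, a0, policy) is assumed to return a list of
    (state, action, reward) triples for one terminating episode that starts
    in s0, takes a0 first, and follows `policy` thereafter."""
    Q = defaultdict(float)                 # Q(s, a) estimates
    N = defaultdict(int)                   # visit counts for incremental averages
    policy = {s: random.choice(actions) for s in states}   # arbitrary initial policy

    for _ in range(n_episodes):
        # Exploring starts: every (state, action) pair can begin an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)

        # First occurrence of each (state, action) pair in this episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Walk backwards accumulating returns; update on first visits only.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # running average
                policy[s] = max(actions, key=lambda b: Q[(s, b)])  # greedify

    return Q, policy
```

Exploring starts sidestep the exploration problem; the "Next time" questions at the end of this deck are about removing that assumption.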
Blackjack Example, again
Summary
• MC Methods learn V and Q from examples – sample episodes
• Do not assume prior knowledge of dynamics, etc.
• Can learn from simulated experience
– Could focus on parts of the state space
• As they do not bootstrap, they can be more tolerant of violations of the Markov property
• Need for sufficient exploration
Next time:
How to avoid the need for infinite visits? How to learn about one policy while following another? Incremental versions?