
CS 188: Artificial Intelligence
Fall 2006
Lecture 10: MDPs II
9/28/2006
Dan Klein – UC Berkeley

Announcements
• Midterm prep page is up
• Project 2.1 will be up in a day or so
  • Due after the midterm
  • But start it now, because MDPs are on the midterm!
• Project 1.4 (Learning Pacman) and the Pacman contest will be after the midterm
• Review session TBA, check the web page
Recap: MDPs
• Markov decision processes (a minimal code sketch follows after this slide):
  • States S
  • Actions A
  • Transitions P(s’|s,a) (or T(s,a,s’))
  • Rewards R(s,a,s’)
  • Start state s0
• Examples:
  • Gridworld, High-Low, Pacman, N-Armed Bandit
  • Any process where the result of your action is stochastic
• Goal: find the “best” policy π
  • Policies are maps from states to actions
  • What do we mean by “best”?
  • This is like search – it’s planning using a model, not actually interacting with the environment
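To make the ingredients above concrete, here is a minimal sketch of how an MDP could be represented in Python. The interface (states, actions(s), T(s, a), R(s, a, s'), start, gamma) and the toy numbers are my own illustration, not code from the course; the same hypothetical interface is reused by the later sketches in these notes.

# Minimal MDP container: states S, actions A, transitions T(s,a,s'), rewards R(s,a,s'), start state s0.
class MDP:
    def __init__(self, states, actions, transitions, rewards, start, gamma=0.9):
        self.states = states              # set of states S
        self._actions = actions           # dict: state -> list of actions (empty if terminal)
        self._transitions = transitions   # dict: (s, a) -> list of (s', probability)
        self._rewards = rewards           # dict: (s, a, s') -> reward
        self.start = start                # start state s0
        self.gamma = gamma                # discount factor

    def actions(self, s):
        return self._actions.get(s, [])

    def T(self, s, a):
        return self._transitions[(s, a)]

    def R(self, s, a, s2):
        return self._rewards.get((s, a, s2), 0.0)

# Toy two-state example (numbers are illustrative only):
toy = MDP(
    states={'a', 'b', 'done'},
    actions={'a': ['go', 'stay'], 'b': ['go'], 'done': []},
    transitions={('a', 'stay'): [('a', 1.0)],
                 ('a', 'go'):   [('b', 0.8), ('a', 0.2)],
                 ('b', 'go'):   [('done', 1.0)]},
    rewards={('a', 'go', 'b'): 1.0,
             ('b', 'go', 'done'): 10.0},
    start='a',
)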
Example: Autonomous Helicopter
[Figure-only slide: autonomous helicopter demonstration]

Example: High-Low
[Figure: High-Low search tree rooted at the card 3; the actions High and Low lead to the q-states (3, High) and (3, Low), whose branches to the next cards 2, 3, and 4 are labeled with transition probabilities and rewards such as T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
• Can view an MDP as a branching search tree
  • s is a state
  • (s, a) is a q-state
  • (s,a,s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
[Figure: one layer of the tree – a state s, an action a, the q-state (s, a), a transition (s, a, s’), and a successor state s’]
Recap: MDP Quantities
• Return = sum of future discounted rewards in one episode (stochastic); see the definitions written out after this slide
• V: expected return from a state under a policy
• Q: expected return from a q-state under a policy

Discounting
• Typically discount rewards by γ < 1 each time step
  • Sooner rewards have higher utility than later rewards
  • Also helps the algorithms converge
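For reference, these quantities can be written out in the lecture's notation (standard definitions, supplied here as a sketch; the expectation is over the stochastic transitions):

  \text{Return of an episode } s_0, a_0, s_1, a_1, \dots : \quad \sum_{t \ge 0} \gamma^{t}\, R(s_t, a_t, s_{t+1})

  V^{\pi}(s) = \mathbb{E}\Big[ \sum_{t \ge 0} \gamma^{t} R(s_t, a_t, s_{t+1}) \;\Big|\; s_0 = s,\ a_t = \pi(s_t) \Big]

  Q^{\pi}(s, a) = \mathbb{E}\Big[ \sum_{t \ge 0} \gamma^{t} R(s_t, a_t, s_{t+1}) \;\Big|\; s_0 = s,\ a_0 = a,\ a_{t>0} = \pi(s_t) \Big]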
MDP Search Trees
• Problems:
  • We want to find the optimal policy π
  • This tree is usually infinite (why?)
  • The same states appear over and over (why?)
• Option 1: modified expectimax search:
  • Calculate estimates Vk*(s)
[Figure: expectimax-style tree layers – state s, action a, q-state (s, a), transition (s, a, s’), successor s’]
Solving MDPs
• Solution:
  • Compute to a finite depth (like expectimax)
  • Consider returns from sequences of increasing length
  • Cache values so we don’t repeat work

Value Estimates
• The estimate Vk*(s) is:
  • Not the optimal value of s!
  • The optimal value considering only the next k time steps (k rewards)
  • As k → ∞, it approaches the optimal value
• Why:
  • If discounting, distant rewards become negligible
  • If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  • Otherwise, we can get infinite expected utility, and then this approach actually won’t work

Memoized Recursion
• Recurrences: express Vk*(s) in terms of the depth-(k−1) values of the successor states (a code sketch follows below)
• Cache all function call results so you never repeat work
• What happened to the evaluation function?
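A minimal sketch of the memoized recursion, using the hypothetical MDP interface from the earlier sketch (actions(s), T(s, a) returning (s', prob) pairs, R(s, a, s'), gamma); states are assumed hashable. This is my illustration, not course code.

from functools import lru_cache

def make_value_estimator(mdp):
    """Return a function computing V_k*(s): the best expected return
    from s using only the next k rewards (depth-limited, like expectimax)."""

    @lru_cache(maxsize=None)                  # cache every (s, k) result so work is never repeated
    def V(s, k):
        if k == 0 or not mdp.actions(s):      # no steps left, or terminal state
            return 0.0
        # Max over actions of the expected one-step reward plus discounted future value.
        return max(
            sum(prob * (mdp.R(s, a, s2) + mdp.gamma * V(s2, k - 1))
                for s2, prob in mdp.T(s, a))
            for a in mdp.actions(s)
        )

    return V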
The Bellman Equations
• Definition of utility leads to a simple relationship amongst the optimal utility values:
  optimal rewards = maximize over the first action and then follow the optimal policy
• Formally: the Bellman equation (“That’s my equation!”), written out below
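In the lecture's notation, the standard form of the Bellman optimality equations is:

  V^*(s) = \max_a Q^*(s, a)

  Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right]

i.e. V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right].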
Value Iteration
• Problems with the recursive computation:
  • Have to keep all the Vk*(s) around all the time
  • Don’t know which depth πk(s) to ask for when planning
• Solution: value iteration
  • Calculate values for all states, bottom-up
  • Keep increasing k until convergence
Value Iteration
• Idea:
  • Start with V0*(s) = 0, which we know is right (why?)
  • Given Vi*, calculate the values for all states for depth i+1
  • This is called a value update or Bellman update (a code sketch follows below)
  • Repeat until convergence
• Theorem: will converge to unique optimal values
  • Basic idea: approximations get refined towards the optimal values
  • Policy may converge long before the values do
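A compact sketch of value iteration, again using the hypothetical MDP interface introduced earlier (my illustration, not the course's implementation):

def value_iteration(mdp, iterations=100, tol=1e-6):
    """Repeatedly apply Bellman updates: V_{i+1}(s) = max_a sum_{s'} T [R + gamma * V_i(s')]."""
    V = {s: 0.0 for s in mdp.states}          # V_0 = 0 everywhere
    for _ in range(iterations):
        newV = {}
        for s in mdp.states:
            acts = mdp.actions(s)
            if not acts:                      # terminal state
                newV[s] = 0.0
                continue
            newV[s] = max(
                sum(prob * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                    for s2, prob in mdp.T(s, a))
                for a in acts
            )
        if max(abs(newV[s] - V[s]) for s in mdp.states) < tol:
            return newV                       # converged: max-norm change below tolerance
        V = newV
    return V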
Example: Bellman Updates
[Worked figure: a single Bellman update on the grid example]

Example: Value Iteration
[Figure: grid value estimates V2 and V3]
• Information propagates outward from terminal states, and eventually all states have correct value estimates
[DEMO]

Convergence*
• Define the max-norm: the largest absolute value over states (written out below)
• Theorem: for any two approximations U and V, a value-iteration update brings them closer together in max-norm
  • I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
• Theorem: if an update changes the value estimates only a little, the estimates must be nearly correct
  • I.e. once the change in our approximation is small, it must also be close to correct
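In standard form (following the usual textbook treatment of value iteration, with B denoting one Bellman/value-iteration update; this is a sketch of the statements, not copied from the slides):

  \|V\| = \max_s |V(s)|  \qquad \text{(max-norm)}

  \|B U - B V\| \le \gamma\, \|U - V\|  \qquad \text{(the update is a contraction by } \gamma\text{)}

  \|V_{i+1} - V_i\| < \epsilon \;\Rightarrow\; \|V_{i+1} - V^*\| < \frac{\gamma\,\epsilon}{1 - \gamma}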
Policy Iteration
• Alternate approach:
  • Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)
  • Policy improvement: update the policy based on the resulting converged utilities
  • Repeat until the policy converges
Policy Iteration
• If we have a fixed policy π, use the simplified Bellman equation to calculate utilities (a code sketch follows below)
• For fixed utilities, it is easy to find the best action according to a one-step look-ahead
• This is policy iteration
  • Can converge faster under some conditions
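A sketch of policy iteration under the same hypothetical MDP interface as above: policy evaluation by iterating the fixed-policy Bellman equation, then greedy one-step improvement. This is my illustration, not course code.

def policy_iteration(mdp, eval_iters=100, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""

    def q_value(s, a, V):
        # Expected return of taking a in s, then following the values V.
        return sum(prob * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                   for s2, prob in mdp.T(s, a))

    # Start from an arbitrary policy: first available action in each non-terminal state.
    pi = {s: mdp.actions(s)[0] for s in mdp.states if mdp.actions(s)}

    while True:
        # Policy evaluation: V^pi(s) = sum_{s'} T(s, pi(s), s') [R + gamma * V^pi(s')].
        V = {s: 0.0 for s in mdp.states}
        for _ in range(eval_iters):
            newV = {s: (q_value(s, pi[s], V) if s in pi else 0.0) for s in mdp.states}
            done = max(abs(newV[s] - V[s]) for s in mdp.states) < tol
            V = newV
            if done:
                break

        # Policy improvement: one-step look-ahead with the evaluated utilities.
        stable = True
        for s in pi:
            best = max(mdp.actions(s), key=lambda a: q_value(s, a, V))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V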
Comparison
• In value iteration:
  • Every pass (or “backup”) updates both the utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)
• In policy iteration:
  • Several passes to update utilities with a frozen policy
  • Occasional passes to update the policy
• Hybrid approaches (asynchronous policy iteration):
  • Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
Example: Animal Learning
• RL studied experimentally for more than 60 years in psychology
  • Rewards: food, pain, hunger, drugs, etc.
  • Mechanisms and sophistication debated
• Example: foraging
  • Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
  • Bees have a direct neural connection from nectar intake measurement to the motor planning area
Reinforcement Learning
• Reinforcement learning:
  • Still have an MDP:
    • A set of states s ∈ S
    • A model T(s,a,s’)
    • A reward function R(s)
  • Still looking for a policy π(s)
  • New twist: don’t know T or R
    • I.e. don’t know which states are good or what the actions do
    • Must actually try actions and states out to learn
Example: Backgammon
• Reward only for win / loss in terminal states, zero otherwise
• TD-Gammon learns a function approximation to U(s) using a neural network
• Combined with depth-3 search, one of the top 3 players in the world
• You could imagine training Pacman this way…
  • … but it’s tricky!
Passive Learning
• Simplified task
  • You don’t know the transitions T(s,a,s’)
  • You don’t know the rewards R(s,a,s’)
  • You are given a policy π(s)
  • Goal: learn the state values (and maybe the model)
• In this case:
  • No choice about what actions to take
  • Just execute the policy and learn from experience
  • We’ll get to the general case soon
Example: Direct Estimation
[Figure: 4×3 gridworld with terminal rewards +100 and −100, x/y coordinates]
• Episodes (γ = 1, living reward R = −1):

  Episode 1:             Episode 2:
  (1,1) up    -1         (1,1) up    -1
  (1,2) up    -1         (1,2) up    -1
  (1,2) up    -1         (1,3) right -1
  (1,3) right -1         (2,3) right -1
  (2,3) right -1         (3,3) right -1
  (3,3) right -1         (3,2) right -1
  (3,2) up    -1         (4,2) right -100
  (3,3) right +100       (done)
  (done)

• Estimate values by averaging the observed returns (a code sketch follows below):
  U(1,1) ~ (93 + -105) / 2 = -6
  U(3,3) ~ (100 + 98 + -101) / 3 = 32.3
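A sketch of direct estimation: for each state, average the discounted returns observed from that state onward. This is my illustration; the episode format of (state, action, reward) triples is an assumption, not the course's data format.

from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Estimate U(s) as the average of observed returns following each visit to s.

    episodes: list of episodes, each a list of (state, action, reward) triples.
    """
    totals = defaultdict(float)   # sum of returns observed from each state
    counts = defaultdict(int)     # number of visits to each state

    for episode in episodes:
        # Compute the return from each time step by scanning backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        for (state, _, _), G_t in zip(episode, returns):
            totals[state] += G_t
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in counts}

# Usage: encode each episode above as (state, action, reward) triples, then
# values = direct_estimation([episode1, episode2], gamma=1.0)
# averages the observed returns per state, as in the U(1,1) and U(3,3) estimates.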
Model-Based Learning
• Idea:
  • Learn the model empirically (rather than the values)
  • Solve the MDP as if the learned model were correct
• Empirical model learning
  • Simplest case:
    • Count outcomes for each s,a
    • Normalize to give an estimate of T(s,a,s’)
    • Discover R(s,a,s’) the first time we experience (s,a,s’)
  • More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
Example: Model-Based Learning
[Figure: the same 4×3 gridworld with terminal rewards +100 and −100]
• Episodes (γ = 1), the same two as above:

  Episode 1:             Episode 2:
  (1,1) up    -1         (1,1) up    -1
  (1,2) up    -1         (1,2) up    -1
  (1,2) up    -1         (1,3) right -1
  (1,3) right -1         (2,3) right -1
  (2,3) right -1         (3,3) right -1
  (3,3) right -1         (3,2) right -1
  (3,2) up    -1         (4,2) right -100
  (3,3) right +100       (done)
  (done)

• Estimated transitions from the counts (a code sketch follows below):
  T(<3,3>, right, <4,3>) = 1 / 3
  T(<2,3>, right, <3,3>) = 2 / 2
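A sketch of the simplest empirical model learner described above: count outcomes for each (s, a), normalize to estimate T, and record R the first time each (s, a, s') is experienced. The input format of (s, a, s', r) tuples is my assumption.

from collections import defaultdict

def learn_model(transitions):
    """Learn T_hat(s,a,s') and R_hat(s,a,s') from observed (s, a, s', r) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
    R_hat = {}                                       # rewards, recorded on first experience

    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        R_hat.setdefault((s, a, s2), r)

    # Normalize the counts into transition probability estimates.
    T_hat = {
        sa: {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
        for sa, outcomes in counts.items()
    }
    return T_hat, R_hat

# With the episodes above, (3,3) with action 'right' is observed three times and
# only one of those observations reaches (4,3), so T_hat[((3,3), 'right')][(4,3)]
# comes out to 1/3, matching the slide's estimate.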