
CS 188: Artificial Intelligence
Fall 2006
Lecture 13: Advanced Reinforcement Learning
10/12/2006
Dan Klein – UC Berkeley

Today
• How advanced reinforcement learning works for large problems
• Some previews of fundamental ideas we'll see throughout the rest of the term
• Next class we'll start on probabilistic reasoning and reasoning about beliefs

Recap: Q-Learning
• Learn Q*(s,a) values from samples
  • Receive a sample (s, a, s', r)
  • On one hand, we have an old estimate of the return: Q(s,a)
  • But now we have a new estimate for this sample: r + γ max_a' Q(s', a')
  • Nudge the old estimate towards the new sample:
    Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_a' Q(s', a')]
  • Equivalently, average the samples over time

Q-Learning
• Q-learning produces tables of q-values (a code sketch of the update follows below):
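Below is a minimal sketch of the tabular update just recapped; the q-table is a plain dictionary, and all names (q_update, q_values, actions, alpha, gamma) are illustrative, not course code.

from collections import defaultdict

def q_update(q_values, s, a, s_prime, r, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update from a single sample (s, a, s', r)."""
    # New estimate for this sample: r + gamma * max_a' Q(s', a')
    sample = r + gamma * max(q_values[(s_prime, a2)] for a2 in actions)
    # Nudge the old estimate towards the new sample.
    q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * sample

# Usage with a toy gridworld-style sample:
q_values = defaultdict(float)          # the q-table: (state, action) -> value
q_update(q_values, (0, 0), 'E', (1, 0), -1.0, actions=['N', 'S', 'E', 'W'])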
Q-Learning
• In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to even hold the q-tables in memory
• Instead, we want to generalize:
  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar states
• This is a fundamental idea in machine learning, and we'll see it over and over again
Example: Pacman
• Let's say we discover through experience that this state is bad:
• In naïve q-learning, we know nothing about this state or its q-states:
• Or even this one!
Feature-Based Representations
• Solution: describe a state using a vector of features
• Features are functions from states to real numbers (often 0/1) that capture important properties of the state
• Example features:
  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (dist to dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
• Can also describe a q-state (s, a) with features (e.g. action moves closer to food); a sketch of such a feature extractor follows below
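A sketch of a feature extractor in the spirit of the examples above; the toy State record and helper names here are assumptions, not the actual Pacman code.

from collections import namedtuple

# Toy state record (an assumption): positions are (x, y) grid coordinates.
State = namedtuple("State", ["pacman", "ghosts", "dots"])

def manhattan(p, q):
    # Grid distance between two positions.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def features(state, action):
    """Describe a q-state (s, a) with a few real-valued features."""
    pos = state.pacman  # a real extractor might use the position after `action`
    dist_ghost = min(manhattan(pos, g) for g in state.ghosts)
    dist_dot = min(manhattan(pos, d) for d in state.dots)
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": float(dist_ghost),
        "1/(dist-to-dot)^2": 1.0 / (dist_dot ** 2 + 1.0),   # +1 avoids division by zero
        "num-ghosts": float(len(state.ghosts)),
    }

s = State(pacman=(1, 1), ghosts=[(3, 1)], dots=[(1, 4)])
print(features(s, "North"))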
Function Approximation
Linear Feature Functions
• Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)
  (a short code sketch follows below)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but be very different in value!
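With feature dictionaries like the one sketched above, the linear q-function is just a dot product of weights and features; a minimal sketch with illustrative names:

def q_value(weights, feats):
    # Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())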
Example: Q-Pacman
• Q-learning with linear q-functions: for each sample (s, a, s', r),
  difference = [r + γ max_a' Q(s', a')] − Q(s,a)
  wi ← wi + α · difference · fi(s,a)
  (a code sketch of this update follows below)
• Intuitive interpretation:
  • Adjust the weights of the active features
  • E.g. if something unexpectedly bad happens, disprefer all states with that state's features
• Formal justification: online least squares (much later)
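One way this update could look in code, reusing the hypothetical `features` and `q_value` sketches from above (a sketch under those assumptions, not the course's implementation):

def approx_q_update(weights, s, a, s_prime, r, actions, alpha=0.05, gamma=0.9):
    """Approximate Q-learning: adjust the weights of the active features."""
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    best_next = max(q_value(weights, features(s_prime, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - q_value(weights, features(s, a))
    # w_i <- w_i + alpha * difference * f_i(s, a)
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value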
Hierarchical Learning
Hierarchical RL
• Stratagus: example of a large RL task, from Bhaskara Marthi's thesis (with Stuart Russell)
• Stratagus is hard for reinforcement learning algorithms
  • > 10^100 states
  • > 10^30 actions at each point
  • Time horizon ≈ 10^4 steps
• Stratagus is hard for human programmers
  • Typically takes several person-months for game companies to write a computer opponent
  • Still, no match for experienced human players
  • Programming involves much trial and error
• Hierarchical RL
  • Humans supply high-level prior knowledge using a partial program
  • Learning algorithm fills in the details
Partial “Alisp” Program
(defun top ()
  (loop
    (choose
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
Hierarchical RL
• They then define a hierarchical Q-function, which learns a linear feature-based mini-Q-function at each choice point
• Very good at balancing resources and directing rewards to the right region
• Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect)
[DEMO]
Policy Search
Policy Search
• Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
  • E.g. your value functions from 1.3 were probably horrible estimates of future rewards, but they still produced good decisions
  • We'll see this distinction between modeling and prediction again later in the course
• Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
• This is the idea behind policy search, such as what controlled the upside-down helicopter
Policy Search
• Simplest policy search (sketched in code below):
  • Start with an initial linear value function or q-function
  • Nudge each feature weight up and down and see if your policy is better than before
• Problems:
  • How do we tell the policy got better?
  • Need to run many sample episodes!
  • If there are a lot of features, this can be impractical
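A sketch of the simple scheme above. `evaluate_policy(weights)` stands in for running many sample episodes and averaging their returns, which is exactly the expensive part noted in the list; all names are illustrative.

import random

def hill_climb(weights, evaluate_policy, step=0.1, iterations=100):
    """Naive policy search: nudge one weight at a time, keep changes that help."""
    best = evaluate_policy(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))            # pick a feature weight
        for delta in (+step, -step):                   # try nudging it up, then down
            candidate = dict(weights)
            candidate[name] += delta
            score = evaluate_policy(candidate)
            if score > best:                           # did the policy get better?
                weights, best = candidate, score
                break
    return weights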
Policy Search*
• Advanced policy search:
  • Write a stochastic (soft) policy π_w(a|s) (one possible form is sketched below)
  • It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them)
  • Take uphill steps, recalculate derivatives, etc.
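The lecture's soft policy isn't reproduced in these notes; one common choice is a softmax over the linear q-values from earlier, sketched here under that assumption (reuses the hypothetical `q_value` and `features`):

import math
import random

def soft_policy(weights, state, actions):
    """Stochastic (soft) policy: pi_w(a|s) proportional to exp(Q_w(s,a))."""
    scores = [math.exp(q_value(weights, features(state, a))) for a in actions]
    total = sum(scores)
    probs = [sc / total for sc in scores]
    # Because the policy is smooth in w, the expected return varies smoothly with w,
    # which is what lets us estimate derivatives and take uphill steps.
    return random.choices(actions, weights=probs, k=1)[0]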
Take a Deep Breath…
• We're done with search and planning!
• Next, we'll look at how to reason with probabilities
  • Diagnosis
  • Tracking objects
  • Speech recognition
  • Robot mapping
  • … lots more!
• Last part of course: machine learning