
Design Principles for Creating
Human-Shapable Agents
W. Bradley Knox, Ian Fasel,
and Peter Stone
The University of Texas at Austin
Department of Computer Sciences
Transferring human knowledge
through natural forms of
communication
Potential benefits over purely autonomous
learners:
• Decrease sample complexity
• Learn in the absence of a reward function
• Allow lay users to teach agents the policies
that they prefer (no programming!)
• Learn in more complex domains
Shaping
(Photo: LOOK magazine, 1952)
Def. - creating a desired behavior by reinforcing
successive approximations of the behavior
The Shaping Scenario
(in this context)
A human trainer observes an agent and
manually delivers reinforcement (a scalar
value), signaling approval or disapproval.
E.g., training a dog with treats as in the previous
picture
The Shaping Problem
(for computational agents)
Within a sequential decision making task,
how can an agent harness state
descriptions and occasional scalar human
reinforcement signals to learn a good task
policy?
Previous work on human-shapable
agents
• Clicker training for entertainment agents
(Blumberg et al., 2002; Kaplan et al., 2002)
• Sophie’s Kitchen (Thomaz & Breazeal, 2006)
– RL with reward = environmental (MDP) reward +
human reinforcement
• Social software agent Cobot in LambdaMOO
(Isbell et al., 2006)
– RL with reward = human reinforcement
MDP reward
vs.
Human reinforcement
• MDP reward (within reinforcement learning):
– Key problem: credit assignment from sparse
rewards
• Reinforcement from a human trainer:
– Trainer has long-term impact in mind
– Reinforcement is within a small temporal window
of the targeted behavior
– Credit assignment problem is largely removed
Teaching an Agent Manually via
Evaluative Reinforcement (TAMER)
• TAMER approach:
– Learn a model of human reinforcement
– Directly exploit the model to determine the policy
• If greedy:
a = argmax_a Ĥ(s, a), where Ĥ is the learned model of human reinforcement
Teaching an Agent Manually via
Evaluative Reinforcement (TAMER)
Learning from
targeted human reinforcement
is a supervised learning problem,
not a reinforcement learning problem.
Teaching an Agent Manually via
Evaluative Reinforcement (TAMER)
The Shaped Agent’s Perspective
• Each time step, agent:
– receives state description
– might receive a scalar human
reinforcement signal
– chooses an action
– does not receive an environmental reward
signal (if learning purely from shaping)
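A rough sketch (not from the original slides) of one such time step, with hypothetical names env, reinf_model, and feature_fn standing in for the environment interface, the learned model of human reinforcement, and the feature extractor:

    def tamer_time_step(env, reinf_model, feature_fn, actions, last_features):
        """One time step of an agent learning purely from human shaping."""
        state = env.get_state()                    # state description arrives

        h = env.get_human_reinforcement()          # scalar human signal, or None
        if h is not None and last_features is not None:
            reinf_model.update(last_features, h)   # supervised step toward h

        # Act greedily with respect to predicted human reinforcement;
        # no environmental reward signal is consulted anywhere.
        action = max(actions, key=lambda a: reinf_model.predict(feature_fn(state, a)))
        env.take_action(action)
        return feature_fn(state, action)           # remembered for the next update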
Tetris
• Drop blocks to make solid
horizontal lines, which then
disappear
• |state space| > 2^200
• Challenging, but slow-paced (infrequent time steps)
• 21 features extracted from (s, a)
• TAMER model:
– Linear model over features
– Gradient descent updates
• Greedy action selection
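A bare-bones sketch of such a model (an illustration, not the authors' implementation) that could plug into the time-step loop sketched earlier; the 21-feature extractor itself is assumed to exist separately:

    import numpy as np

    class LinearReinfModel:
        """Predicts human reinforcement as a linear function of (s, a) features."""

        def __init__(self, num_features, step_size=0.02):
            self.w = np.zeros(num_features)   # one weight per extracted feature
            self.step_size = step_size        # illustrative value, not from the slides

        def predict(self, features):
            return float(np.dot(self.w, features))

        def update(self, features, human_reinf):
            # Incremental gradient descent on the squared error between the
            # trainer's scalar signal and the current prediction.
            error = human_reinf - self.predict(features)
            self.w += self.step_size * error * features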
TAMER in action: Tetris
(Videos: before training, during training, and after training.)
TAMER Results: Tetris
(9 subjects)
TAMER Results: Mountain Car
(19 subjects)
Conjectures on how to create an
agent that can be interactively
shaped by a human trainer
1. For many tasks, greedily exploiting the human trainer’s reinforcement function yields a good policy.
2. Modeling a human trainer’s reinforcement is a supervised learning problem (not RL).
3. Exploration can be driven by negative reinforcement alone.
4. Credit assignment to a dense state-action history should …
5. A human trainer’s reinforcement function is not static.
6. Human reinforcement is a function of states and actions.
7. In an MDP, human reinforcement should be treated differently from environmental reward.
8. Human trainers reinforce predicted action as well as recent action.
the end.
Mountain Car
• Drive back and forth,
gaining enough
momentum to get to the
goal on top of the hill
• Continuous state space
– Velocity and position
• Simple but rapid actions
• Feature extraction:
– 2D Gaussian RBFs over velocity and position of car
– One “grid” of RBFs per action
• TAMER model:
– Linear model over RBF features
– Gradient descent updates
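One possible implementation of this feature extractor (a sketch with assumed grid size and RBF width, using the standard Mountain Car state ranges):

    import numpy as np

    def mountain_car_features(position, velocity, action, num_actions=3,
                              grid=8, sigma=0.15):
        """Gaussian RBF features for Mountain Car, one grid per action."""
        # Normalize to [0, 1] using the standard Mountain Car ranges.
        p = (position + 1.2) / 1.8       # position in [-1.2, 0.6]
        v = (velocity + 0.07) / 0.14     # velocity in [-0.07, 0.07]

        centers = np.linspace(0.0, 1.0, grid)
        # 2D Gaussian RBF activations over (position, velocity).
        act = np.exp(-((p - centers) ** 2)[:, None] / (2 * sigma ** 2)
                     - ((v - centers) ** 2)[None, :] / (2 * sigma ** 2)).ravel()

        # One "grid" of RBFs per action: only the chosen action's block is active.
        features = np.zeros(num_actions * act.size)
        features[action * act.size:(action + 1) * act.size] = act
        return features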
TAMER in action: Mountain Car
(Videos: before training, during training, and after training.)
TAMER Results: Mountain Car
(19 subjects)
HOW TO: Convert a basic TD-learning agent into a TAMER agent
(without temporal credit assignment)
1. The underlying function approximator must be a Q-function (i.e., over state-action values).
2. Set the discount factor (gamma) to 0.
3. Make action selection fully greedy.
4. Human reinforcement replaces environmental reward.
5. If no human input is received, do not update.
6. Remove any eligibility traces (setting the parameter lambda to 0 is enough).
7. Consider lowering alpha to 0.01 or less.
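A sketch of the resulting update (placeholder names; phi is a state-action feature vector, per step 1), shown next to a standard linear TD(0) update so the changes are explicit. Note that the converted version reduces to the same gradient step used by the linear model sketched earlier:

    import numpy as np

    def td0_update(w, phi, reward, phi_next, alpha=0.1, gamma=0.99):
        """Standard linear TD(0): bootstrapped target from the next step."""
        target = reward + gamma * np.dot(w, phi_next)
        return w + alpha * (target - np.dot(w, phi)) * phi

    def tamer_update(w, phi, human_reinf, alpha=0.005):
        """The converted update."""
        if human_reinf is None:                # step 5: no human input, no update
            return w
        # Steps 2 & 6: gamma = 0 and lambda = 0 remove the bootstrap term and the
        # eligibility trace; step 4: the target is the human reinforcement itself.
        return w + alpha * (human_reinf - np.dot(w, phi)) * phi   # step 7: small alpha

    def greedy_action(w, feature_fn, state, actions):
        """Step 3: fully greedy action selection over the learned model."""
        return max(actions, key=lambda a: np.dot(w, feature_fn(state, a)))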
HOW TO: Convert a TD-learning
agent into a TAMER agent (cont.)
With credit assignment (for tasks with more frequent time steps)
1. Save (features, human reinforcement) for each time step that falls within a window from 0.2 seconds to about 0.8 seconds before the human reinforcement signal.
2. Define a probability distribution function over the window (a uniform distribution is probably fine).
3. The credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair (see the sketch below).
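A small sketch of that credit computation (mine, not from the slides), assuming a uniform delay distribution over the window and timestamps in seconds:

    def credit_for_window(step_times, reinf_time, window=(0.2, 0.8)):
        """Distribute one human reinforcement signal over recent time steps.

        Each time step's credit is the integral of a uniform delay pdf over
        the interval between that step and the next more recent one.
        """
        lo, hi = window
        density = 1.0 / (hi - lo)                  # uniform pdf over the delay window
        times = sorted(step_times, reverse=True)   # most recent first
        credits = {}
        for i, t in enumerate(times):
            start = reinf_time - t                 # this step's delay before the signal
            end = reinf_time - times[i - 1] if i > 0 else 0.0
            overlap = max(0.0, min(start, hi) - max(end, lo))
            credits[t] = overlap * density         # probability mass assigned to this step
        return credits

For example, with time steps at 9.1, 9.3, 9.5, 9.7, and 9.9 s and a signal at 10.0 s, the most recent step receives zero credit (its 0.1 s delay falls outside the window) and the remaining credit is split across the older steps, with all credits summing to 1.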
• For the update, both reward prediction (in
place of state-action-value prediction) used to