
Reinforcement Learning and the Reward Engineering Principle
Daniel Dewey
[email protected]; AAAI Spring Symposium Series 2014
A modest aim:
What role do goals play in AI research?
…through the lens of reinforcement learning.
Outline:
• Reinforcement learning and AI
• Definitions: “control” and “dominance”
• The reward engineering principle
• Conclusions
RL and AI
“…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”
Stuart Russell, “Rationality and Intelligence”

Reinforcement learning provides a definition: maximize total rewards.
[Diagram: the standard RL loop. The agent sends actions to the environment; the environment returns a state and a reward.]
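To make the loop concrete, here is a minimal, self-contained Python sketch (mine, not from the talk): a two-armed bandit environment and an epsilon-greedy agent whose only aim is to maximize total reward. All names are illustrative.

import random

class BanditEnvironment:
    """Environment: maps an action to a reward (state is trivial here)."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs

    def step(self, action):
        # Pay out 1 with the chosen arm's probability, else 0.
        return 1.0 if random.random() < self.payout_probs[action] else 0.0

class GreedyAgent:
    """Agent: tracks the average reward of each action and usually
    picks the best one (epsilon-greedy exploration)."""
    def __init__(self, n_actions, epsilon=0.1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def observe(self, action, reward):
        # Incrementally update the running average reward for this action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

env = BanditEnvironment([0.3, 0.7])
agent = GreedyAgent(n_actions=2)
total = 0.0
for _ in range(1000):
    action = agent.act()
    reward = env.step(action)
    agent.observe(action, reward)
    total += reward
print(f"total reward over 1000 steps: {total}")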
Understand and exploit: inference, planning, learning, metareasoning, concept formation, etc.
Advantages:
• Simple and cheap (“worse is better”)
• Flexible and abstract
• Measurable
…and used in natural neural nets (brains!)
Outside the frame: some behaviours cannot be elicited (by any rewards!)
Key concepts: control and dominance.
As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.
Definitions: “control”
A user has control when the agent’s received rewards equal the user’s chosen rewards.
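As a toy illustration (mine, not the talk’s): think of the reward channel as carrying the user’s chosen reward only while the user actually holds the channel.

user_rewards = [1, 0, 0, 1]  # rewards the user chooses to send
env_rewards  = [1, 1, 1, 1]  # rewards the environment substitutes
in_control   = [True, True, False, False]  # who holds the channel each step

# The agent receives the user's reward only while the user has control.
received = [u if c else e
            for u, e, c in zip(user_rewards, env_rewards, in_control)]

print(received)                  # [1, 0, 1, 1]
print(received == user_rewards)  # False: control was lost at steps 3-4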
[Diagram: the agent-environment loop, with the reward channel drawn explicitly.]
[Diagram: the environment decomposed into Environment 1, the User, and Environment 2; actions, states, and rewards flow between the agent and these parts.]
[Diagram: the user chooses the reward: the user has control.]
[Diagram: the environment “chooses” the reward: control is lost.]
Definitions: “dominance”
Why does control matter?
Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour. These behaviours are dominated by other behaviours.
A “behaviour” (a sequence of actions) is a policy.

[Diagram: policy P1 takes actions a1…a8 and receives a reward after each; “?” marks a reward not yet fixed.]
P1: 1 ? 0 ? ? ? 0 ?
User-chosen rewards:
P1: 1 ? 0 ? ? ? 0 ?
Env.-chosen rewards (loss of control):
P1: 1 ? 0 ? ? ? 0 ?
P1: 1 ? 0 ? ? ? 0 ?
P2: 1 0 ? 1 ? ? 1 1
Can rewards make either better?
P1: 1 1 0 1 1 1 0 1 (all “?” set to 1: max. reward = 6)
P2: 1 0 0 1 0 0 1 1 (all “?” set to 0: min. reward = 4)
So rewards can make P1 better than P2.
P1: 1 0 0 0 0 0 0 0 (all “?” set to 0: min. reward = 1)
P2: 1 0 1 1 1 1 1 1 (all “?” set to 1: max. reward = 7)
…and rewards can also make P2 better than P1. Neither policy dominates the other.
Now compare P1 with a third policy, P3:
P1: 1 ? 0 ? ? ? 0 ?
P3: 1 1 1 1 1 ? 1 1
P1: 1 1 0 1 1 1 0 1 (all “?” set to 1: max. reward = 6)
P3: 1 1 1 1 1 0 1 1 (all “?” set to 0: min. reward = 7)
P1’s best case (6) falls below P3’s worst case (7): no rewards can make P1 come out ahead.
P1: 1 ? 0 ? ? ? 0 ? (dominated by P3)
P3: 1 1 1 1 1 ? 1 1 (dominates P1)
A dominates B if no possible assignment of rewards causes R(B) > R(A).
No series of rewards can prompt a dominated policy; dominated policies are unelicitable.
(A less obvious result: every unelicitable policy is dominated.)
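This definition is easy to check mechanically. Below is a small Python sketch (mine, not from the talk) that writes each policy’s rewards as a list with None standing in for “?”, and assumes each “?” can be set freely and independently to anything in [0, 1]; under that assumption, A dominates B exactly when B’s best-case total cannot exceed A’s worst-case total.

def dominates(a, b):
    """Return True if policy A dominates policy B: no assignment of
    the free ("?") rewards makes R(B) > R(A).
    Assumption: each None is independently assignable in [0, 1], so it
    suffices to compare B's best case against A's worst case."""
    min_a = sum(0 if r is None else r for r in a)  # A's worst-case total
    max_b = sum(1 if r is None else r for r in b)  # B's best-case total
    return max_b <= min_a

# The examples from the slides (None plays the role of "?"):
p1 = [1, None, 0, None, None, None, 0, None]  # max total = 6
p2 = [1, 0, None, 1, None, None, 1, 1]        # min total = 4, max = 7
p3 = [1, 1, 1, 1, 1, None, 1, 1]              # min total = 7

print(dominates(p3, p1))  # True: 6 <= 7, so P3 dominates P1
print(dominates(p1, p2))  # False
print(dominates(p2, p1))  # False: neither P1 nor P2 dominates the other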
Recap
Control is sometimes lost;
Loss of control enables dominance;
Dominance makes some policies unelicitable.
All of this is outside the “RL AI frame”, but it is clearly part of the AI problem (do the right thing!).
Additional factors
Generality: the range of policies an agent has reasonably efficient access to.
= better chance of finding dominant policies
Autonomy: ability to function in environments with little interaction from users.
= more frequent loss of control
Reward Engineering Principle
As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control.
…because general / autonomous RL AI has
• a better chance of finding dominant policies;
• more unelicitable policies;
• more significant effects.
RL AI users:
Heed the Reward Engineering Principle.
• Consider the existence of dominant policies
• Be as rigorous as possible in excluding them
• Remember what’s outside the frame!
AI Researchers:
Expand the frame! Make goal design a first-class citizen.
Consider alternatives: manually coded utility functions, preference learning, …?
Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic motivations).
Thank you!
Thanks to Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges for comments.
Work supported by the Alexander Tamas Research Fellowship.