Reinforcement Learning and the Reward Engineering Principle
Daniel Dewey
[email protected]; AAAI Spring Symposium Series 2014

A modest aim: What role goals in AI research?
…through the lens of reinforcement learning.

Outline
• Reinforcement learning and AI
• Definitions: “control”, “dominance”
• The reward engineering principle
• Conclusions

RL and AI

“…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”
Stuart Russell, “Rationality and Intelligence”

Reinforcement learning provides a definition: maximize total rewards.

[Diagram: Agent and Environment, with arrows labelled action, state, and reward]

Understand and Exploit
Inference, Planning, Learning, Metareasoning, Concept formation, etc…

Advantages:
• Simple and cheap (“worse is better”)
• Flexible and abstract
• Measurable
…and used in natural neural nets (brains!)

Outside the frame:
Some behaviours cannot be elicited (by any rewards!)
Key concepts: Control and dominance
As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.

Definitions: “control”

A user has control when the agent’s received rewards equal the user’s chosen reward.
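As a minimal sketch of this definition, control can be checked step by step by comparing the reward the user chose with the reward the agent actually received. The function names and the example reward sequences below are illustrative assumptions, not part of the talk.

```python
from typing import Optional, Sequence


def has_control(chosen: Sequence[float], received: Sequence[float]) -> bool:
    """Return True if every received reward equals the user's chosen reward.

    Hypothetical helper: both sequences are assumed to be aligned by time step.
    """
    if len(chosen) != len(received):
        raise ValueError("reward sequences must cover the same time steps")
    return all(c == r for c, r in zip(chosen, received))


def first_loss_of_control(
    chosen: Sequence[float], received: Sequence[float]
) -> Optional[int]:
    """Return the index of the first step where the rewards differ, or None."""
    for step, (c, r) in enumerate(zip(chosen, received)):
        if c != r:
            return step
    return None


# Assumed example: the user intends to reward only the last step...
user_chosen = [0, 0, 0, 1]
# ...but at step 1 the environment, not the user, determines the reward.
agent_received = [0, 1, 0, 1]

print(has_control(user_chosen, user_chosen))
print(has_control(user_chosen, agent_received))
print(first_loss_of_control(user_chosen, agent_received))
```

In this sketch, control is a property of a whole run: a single step where the received reward differs from the chosen one counts as a loss of control.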
[Diagram: Agent and Environment, with arrows labelled action, state, and reward]
[Diagram: Environment 1, User, and Environment 2, with arrows labelled action, state, action, reward, and reward]
[Diagram: Environment 1, Agent, User, Environment 2; caption: user chooses reward]
[Diagram: Environment 1, Agent, User, Environment 2; caption: env. “chooses” reward]

Definitions: “dominance”

Why does control matter?
Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour.
These behaviours are dominated by other behaviours.

A “behaviour” (sequence of actions) is a policy.

[Diagram: policy P1 as a path through actions a1 to a8, with rewards 1 ? 0 ? ? ? 0 ?]

P1   1 ? 0 ? ? ? 0 ?
User-chosen rewards: the “?” entries.
Env.-chosen rewards (loss of control): the fixed 0 and 1 entries.

P1   1 ? 0 ? ? ? 0 ?
P2   1 0 ? 1 ? ? 1 1
Can rewards make either better?

P1   1 1 0 1 1 1 0 1   Choose all rewards 1: Max. reward = 6
P2   1 0 0 1 0 0 1 1   Choose all rewards 0: Min. reward = 4

P1   1 0 0 0 0 0 0 0   Choose all rewards 0: Min. reward = 1
P2   1 0 1 1 1 1 1 1   Choose all rewards 1: Max. reward = 7

P1   1 ? 0 ? ? ? 0 ?
P3   1 1 1 1 1 ? 1 1

P1   1 1 0 1 1 1 0 1   Max. reward = 6
P3   1 1 1 1 1 0 1 1   Min. reward = 7

P1   1 ? 0 ? ? ? 0 ?   Dominated by P3
P3   1 1 1 1 1 ? 1 1   Dominates P1

A dominates B if no possible assignment of rewards causes R(B) > R(A).
No series of rewards can prompt a dominated policy; dominated policies are unelicitable.
(A less obvious result: every unelicitable policy is dominated.)

Recap

• Control is sometimes lost;
• Loss of control enables dominance;
• Dominance makes some policies unelicitable.
All of this is outside the “RL AI frame”
…but is clearly part of the AI problem (do the right thing!)

Additional factors

Generality: the range of policies an agent has reasonably efficient access to.
= better chance of finding dominant policies
Autonomy: ability to function in environments with little interaction from users.
= more frequent loss of control

Reward Engineering Principle

As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control.
…because general / autonomous RL AI has
• better chance of dominant policies;
• more unelicitable policies;
• more significant effects

Conclusions

RL AI users: Heed the Reward Engineering Principle.
• Consider existence of dominant policies
• Be as rigorous as possible in excluding them
• Remember what’s outside the frame!
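The advice to consider the existence of dominant policies can be made concrete with the P1, P2, and P3 reward sequences from the dominance slides. The sketch below is only an illustration: it assumes rewards are 0 or 1, that each policy's free rewards can be chosen independently of the other's, and that a tie counts as dominance. The `Policy` type and function names are hypothetical.

```python
from typing import List, Optional

# One reward per step: 0 or 1 is fixed by the environment (loss of control);
# None marks a reward the user is still free to choose (the "?" entries).
Policy = List[Optional[int]]


def max_reward(policy: Policy) -> int:
    """Best case: the user sets every free reward to 1."""
    return sum(1 if r is None else r for r in policy)


def min_reward(policy: Policy) -> int:
    """Worst case: the user sets every free reward to 0."""
    return sum(0 if r is None else r for r in policy)


def dominates(a: Policy, b: Policy) -> bool:
    """True if b's best case cannot exceed a's worst case.

    Assumption: the free rewards of a and b are chosen independently,
    and ties count as dominance.
    """
    return max_reward(b) <= min_reward(a)


P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]
P3 = [1, 1, 1, 1, 1, None, 1, 1]

print(max_reward(P1), min_reward(P2))        # 6 4
print(dominates(P2, P1), dominates(P1, P2))  # False False
print(dominates(P3, P1))                     # True
```

Under these assumptions, rewards can still make either of P1 and P2 the better policy, whereas no choice of the free rewards lets P1 beat P3, which matches the maximum of 6 and minimum of 7 shown on the slides.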
AI Researchers: Expand the frame!
• Make goal design a first-class citizen.
• Consider alternatives: manually coded utility functions, preference learning, …?
• Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic)

Thank you!
Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges, for comments.
Work supported by the Alexander Tamas Research Fellowship