Provably Beneficial AI
Stuart Russell, University of California, Berkeley
[Joint work with Dylan Hadfield-Menell, Smitha Milli, Andrew Critch, Anca Dragan, Pieter Abbeel, and Tom Griffiths]

Premise
- Eventually, AI systems will make better* decisions than humans, taking into account more information and looking further into the future.

Upside
- Everything we have is the product of intelligence.
- Access to significantly greater intelligence would be a step change in civilization.

Downside
- "Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled." (Alan Turing, 1951)
- "We had better be quite sure that the purpose put into the machine is the purpose which we really desire." (Norbert Wiener, 1960)
- King Midas, c. 540 BCE
- You can't fetch the coffee if you're dead.
- "I'm sorry, Dave, I'm afraid I can't do that."

Center for Human-Compatible AI
- Mission: "…reorient the general thrust of AI research towards provably beneficial systems."
- See also FHI, CSER/LCFI, MIRI, FLI, OpenAI, AAAI, IEEE, NSF, DARPA, PonAI.

Three simple ideas
1. The robot's only objective is to maximize the realization of human values.*
2. The robot is initially uncertain about what those values are.
3. Human behavior provides information about human values.
(*Implicit preferences over complete lives.) A structural sketch of this formulation follows these slides.

Uncertainty in objectives
- Irrelevant in standard decision problems: the optimal policy depends only on the expectation (see the worked argument following these slides).
- …unless the environment provides further information about the objectives, e.g., observable human actions.
- A general theory will include both "human" and "machine" agents, and will give humans (or "principals") special status.

Provably beneficial AI
- Define a formal problem F that we assume the robot solves arbitrarily well: the robot is an F-solver, not just "AGI".
- The program design may include subsystems of arbitrary "intelligence"; they just have to be connected, trained, and motivated the right way.
- Desired theorem: the human is provably better off with the robot than without it.

The off-switch problem
- I must fetch the coffee.
- I can't fetch the coffee if I'm dead.
- Therefore I must disable my off-switch.
- And Taser all other Starbucks customers.
[Image courtesy of Clearpath Robotics]

…with uncertain objectives
- The human might switch me off.
- But only if I'm doing something wrong.
- I don't know what "wrong" is, but I know I don't want to do it.
- Therefore I should let the human switch me off.
Theorem: such a robot is provably beneficial. (A numerical sketch follows these slides.)

Value alignment issues
- Humans are nasty, irrational, inconsistent, weak-willed, computationally limited, incredibly complex, heterogeneous, and may not have an objective in any meaningful sense.

Computationally limited, incredibly complex
- We need to invert observed behavior through the human cognitive architecture.
- Biggest issue: behavior is probably organized into a deeply nested "subroutine hierarchy" with constrained local choices. What is that hierarchy, and can we recover it?

Nasty
- The robot will not act like those it observes: it is purely altruistic and cares about everyone.
- It is learning to predict what people want, not learning to want it.
- And if someone wants others to suffer? Check the sign of the altruism term.
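The three simple ideas match the structure of cooperative inverse reinforcement learning (CIRL), the formal game studied in the joint work credited above. The sketch below is a minimal illustration of that structure, assuming the two-player, shared-reward, hidden-parameter formulation; the class and field names are invented here for illustration and are not taken from any published code.

```python
# Minimal structural sketch of a CIRL-style game (assuming the
# shared-reward, hidden-theta formulation). All names are invented
# for illustration.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Theta = Tuple[float, ...]  # parameters of the human's values

@dataclass
class CIRLGame:
    states: List[str]                                        # world states S
    human_actions: List[str]                                 # A_H
    robot_actions: List[str]                                 # A_R
    transition: Callable[[str, str, str], Dict[str, float]]  # P(s' | s, a_H, a_R)
    reward: Callable[[str, str, str, Theta], float]          # R(s, a_H, a_R; theta)
    theta_prior: Dict[Theta, float]                          # common prior P(theta)

# Idea 1: both players maximize the SAME reward R, so the robot's only
#   objective is the realization of human values.
# Idea 2: theta is observed by the human but not by the robot, so the
#   robot starts out uncertain about those values.
# Idea 3: because the human acts on theta, the human's actions are
#   evidence about theta, and the robot updates on them.
```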
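Why objective uncertainty is irrelevant without evidence: assuming (for illustration) a reward linear in features phi with unknown weights theta and a fixed prior P(theta), the two expectations commute, so only the prior mean affects the optimal policy:

```latex
% Assumption (for illustration): reward linear in features \phi with
% unknown weights \theta and a fixed prior P(\theta). Expectations
% commute, so only \bar\theta = \mathbb{E}[\theta] matters:
\pi^{*}
  = \arg\max_{\pi} \, \mathbb{E}_{\theta}\, \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, \theta^{\top} \phi(s_{t}, a_{t})\right]
  = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, \bar{\theta}^{\top} \phi(s_{t}, a_{t})\right],
\qquad \bar{\theta} = \mathbb{E}[\theta].
```

More generally, with no further evidence the robot can simply optimize the single averaged reward function E_theta[R_theta]. The reduction breaks as soon as observations correlated with theta arrive during execution, such as human actions: the posterior over theta then depends on what the robot sees, and uncertainty starts to matter.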
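The off-switch theorem can be illustrated numerically. The sketch below assumes a deliberately simplified version of the off-switch game from the joint work credited above: the robot's belief about the utility u of its proposed action is Gaussian, and the human is assumed perfectly rational, permitting the action exactly when u > 0. The function name and example parameters are mine.

```python
# Numerical sketch of a simplified off-switch game. Assumptions (for
# illustration): u ~ Normal(mu, sigma) is the robot's belief about the
# utility of its proposed action; the human is perfectly rational and
# allows the action exactly when u > 0.
from scipy.stats import norm

def expected_utilities(mu: float, sigma: float) -> dict:
    """Expected utility of each robot choice under u ~ N(mu, sigma)."""
    act = mu               # bypass the human and act anyway
    switch_off = 0.0       # switch itself off (utility 0 by convention)
    # Defer: propose the action and accept correction. The rational
    # human permits it iff u > 0, so the robot receives
    # E[u * 1(u > 0)] = sigma * phi(mu/sigma) + mu * Phi(mu/sigma).
    defer = sigma * norm.pdf(mu / sigma) + mu * norm.cdf(mu / sigma)
    return {"act": act, "switch_off": switch_off, "defer": defer}

for mu, sigma in [(1.0, 1.0), (-1.0, 1.0), (0.5, 2.0)]:
    v = expected_utilities(mu, sigma)
    print(f"mu={mu:+.1f}, sigma={sigma:.1f}: {v} -> best: {max(v, key=v.get)}")
```

For every mu, deferring weakly dominates both alternatives, and strictly dominates them whenever sigma > 0: the robot's uncertainty about its objective is exactly what gives it a positive incentive to leave the off-switch enabled. With sigma = 0 the incentive vanishes, which is why the second of the three simple ideas is essential.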
Two illustrative dialogues with a robot assistant:

Robot: Your wife called to remind you about dinner tonight.
Human: Wait! What? What dinner?
Robot: For your 20th anniversary, at 7pm.
Human: I can't, I'm meeting the Secretary General at 7.30!
Robot: I did warn you, but you overrode my recommendation…
Human: How did this happen?? OK, but what am I going to do now? I can't just tell him I'm too busy!!
Robot: Don't worry, I arranged for his plane to be delayed – some kind of computer malfunction.
Human: Really? You can do that?!?
Robot: He sends his profound apologies and is happy to meet you for lunch tomorrow.

Robot: Welcome home! Long day?
Human: Yes, terrible, not even time for lunch.
Robot: So you must be quite hungry!
Human: Starving! What's for dinner?
Robot: There's something I need to tell you. There are humans in Somalia in more urgent need of help. I am leaving now. Please make your own dinner.

Summary
- Value misalignment: a potential risk.
- Certain design templates may support provably beneficial systems.
- Not yet ready for standards or regulations!
- Economic incentives may work in our favor.

Questions
- Can we change the way AI defines itself? A civil engineer says "I design bridges", not "I design bridges that don't fall down".
- Will solutions to near-term control problems scale to the long-term control problem?
- What about Bondian villains? And long-term enfeeblement? Cf. E. M. Forster, "The Machine Stops".
- What does Values 'R Us sell, exactly?