The Future of AI: What if We Succeed?

Provably Beneficial AI
Stuart Russell
University of California, Berkeley
[joint work with Dylan Hadfield-Menell, Smitha Milli, Andrew Critch, Anca Dragan, Pieter Abbeel, Tom Griffiths]
Premise

Eventually, AI systems will make better* decisions than humans

*Taking into account more information, looking further into the future
Upside

Everything we have is the product of intelligence
Access to significantly greater intelligence would be a step change in civilization

Downside

"Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled. ..."
Alan Turing, 1951

"We had better be quite sure that the purpose put into the machine is the purpose which we really desire."
Norbert Wiener, 1960
King Midas, c. 540 BCE
You can't fetch the coffee if you're dead
"I'm sorry, Dave, I'm afraid I can't do that"
Center for Human-Compatible AI

"…reorient the general thrust of AI research towards provably beneficial systems."

Also FHI, CSER/LCFI, MIRI, FLI, OpenAI,
AAAI, IEEE, NSF, DARPA, Partnership on AI
Three simple ideas

1. The robot's only objective is to maximize the realization of human values*
2. The robot is initially uncertain about what those values are
3. Human behavior provides information about human values

*Implicit preferences over complete lives
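These three ideas are made formal in the cooperative inverse reinforcement learning (CIRL) framework of Hadfield-Menell, Dragan, Abbeel, and Russell (2016), credited on the title slide. A minimal sketch of the game's ingredients follows; the Python names (CIRLGame and its fields) are illustrative, not a published API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CIRLGame:
    """Sketch of a cooperative inverse RL game. Human and robot share
    one reward function, but only the human observes its parameter theta."""
    states: Sequence              # S: world states
    human_actions: Sequence       # A_H: human's action set
    robot_actions: Sequence       # A_R: robot's action set
    transition: Callable          # T(s, a_h, a_r) -> next state
    thetas: Sequence              # Theta: candidate human value parameters
    prior: Sequence[float]        # P(theta): robot's uncertainty (idea 2)
    reward: Callable              # R(s, a_h, a_r, theta): shared objective (idea 1)
```

Because the reward is shared but theta is hidden from the robot, optimal play has the robot treating the human's actions as evidence about theta (idea 3).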
Uncertainty in objectives

Irrelevant in standard decision problems (policy depends only on expectation)
…Unless the environment provides further information about objectives
E.g., observable human actions
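A toy numerical sketch of this point (the numbers and the Boltzmann-rational human model are my assumptions, not from the slides): with no evidence, two actions with equal expected reward are indistinguishable, so the uncertainty is irrelevant to the policy; a single observed human choice breaks the tie.

```python
import math

# Two hypotheses about the human's values (made-up numbers).
rewards = {"theta1": {"a0": 1.0, "a1": 0.0},
           "theta2": {"a0": 0.0, "a1": 1.0}}
prior = {"theta1": 0.5, "theta2": 0.5}

# With no further information, only expected reward matters, and
# E[R(a0)] = E[R(a1)] = 0.5: the uncertainty cannot affect the policy.

# But if a noisily rational human is observed acting, with
# P(a | theta) proportional to exp(beta * R(a; theta)),
# that choice is evidence about theta.
beta = 2.0

def likelihood(action, theta):
    z = sum(math.exp(beta * r) for r in rewards[theta].values())
    return math.exp(beta * rewards[theta][action]) / z

observed = "a1"
unnorm = {t: prior[t] * likelihood(observed, t) for t in prior}
z = sum(unnorm.values())
posterior = {t: p / z for t, p in unnorm.items()}
print(posterior)  # ~{'theta1': 0.12, 'theta2': 0.88}: now a1 looks better
```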
General theory will
include both "human" and "machine" agents
give humans (or "principals") special status
Provably beneficial AI

Define a formal problem F that we assume the robot solves arbitrarily well
The robot is an F-solver, not just "AGI"
Program design may include subsystems of arbitrary "intelligence"
They just have to be connected, trained, and motivated the right way

Desired theorem: The human is provably better off with the robot
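In notation of my own (not the slide's), with U_H the human's realized value over their life and pi_R^F the robot's policy obtained by solving F, the desired theorem would have roughly this shape:

```latex
% Rough shape of the desired guarantee (illustrative notation):
% \pi_R^F = the robot's policy obtained by solving the formal problem F
\mathbb{E}\left[ U_H \mid \text{human assisted by a robot running } \pi_R^F \right]
\;\ge\;
\mathbb{E}\left[ U_H \mid \text{human acting alone} \right]
```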
The off-switch problem

I must fetch the coffee
I can't fetch the coffee if I'm dead
Therefore I must disable my off-switch
And Taser all other Starbucks customers

Image courtesy of Clearpath Robotics
… with uncertain objectives

The human might switch me off
But only if I'm doing something wrong
I don't know what "wrong" is but I know I don't want to do it
Therefore I should let the human switch me off

Image courtesy of Clearpath Robotics
… with uncertain objectives
Qh uman =meit Switc mi
of
Pi mput = wnlh if eim
+ doigg Sumqigg rogg
Pi idwnt nw wat rogg iz
mput ai dwnt want tu du it
SP Qhrfwr I let qh
+ uman switc mh of
Theorem: Such a robot is provably beneficial
Image courtesy of Clearpath Robotics
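The theorem referenced here is from "The Off-Switch Game" (Hadfield-Menell, Dragan, Abbeel, Russell, 2017). Its core inequality is easy to check numerically: deferring to a rational human is worth E[max(u, 0)], which is never less than max(E[u], 0), the best the robot can get by acting or shutting down unilaterally. A sketch with made-up numbers:

```python
# Off-switch game with toy numbers: the robot's proposed action has
# uncertain utility u for the human.
outcomes = [(-1.0, 0.4), (0.5, 0.6)]          # (u, P(u))

exp_u = sum(u * p for u, p in outcomes)       # E[u] = -0.1

act_unilaterally = exp_u                      # act, bypassing the human
switch_self_off  = 0.0                        # utility of doing nothing
# Defer: a rational human permits the action iff u > 0,
# so deferring is worth E[max(u, 0)].
defer = sum(max(u, 0.0) * p for u, p in outcomes)   # 0.3

assert defer >= max(act_unilaterally, switch_self_off)
print(act_unilaterally, switch_self_off, defer)     # -0.1 0.0 0.3
```

Note that if the robot were certain about u, deferring would add nothing: the incentive to preserve the off-switch comes entirely from the robot's uncertainty about the human's objective.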
Value alignment issues

Humans are nasty, irrational, inconsistent, weak-willed, computationally limited, incredibly complex, heterogeneous, and may not have an objective in any meaningful sense
Computationally limited, incredibly complex

Need to invert behavior through human cognitive architecture
Biggest issue: behavior is probably organized into a deeply nested "subroutine hierarchy" with constrained local choices
What is it? Can we recover it?
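One cartoon reading of "deeply nested subroutine hierarchy" (purely illustrative, not a model from the talk): most primitive actions are forced by the currently active subroutine, so only a few high-level choice points carry information about the person's values.

```python
# Cartoon hierarchy: a few high-level choices expand into long runs of
# locally constrained primitive actions.
hierarchy = {
    "get_coffee": ["walk_to_kitchen", "brew", "pour"],
    "walk_to_kitchen": ["step"] * 20,          # steps are forced, not chosen
    "brew": ["fill_water", "add_grounds", "start"],
}

def expand(task):
    """Flatten a task into the primitive actions an observer would see."""
    if task not in hierarchy:
        return [task]                          # primitive action
    actions = []
    for sub in hierarchy[task]:
        actions += expand(sub)
    return actions

print(len(expand("get_coffee")))  # 24 primitives from ~3 genuine choices
```

The inversion problem is then to recover the hierarchy, and the few genuine choices within it, from the long stream of primitives.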
Nasty

Robot will not act like those it observes
It is purely altruistic, cares about everyone
It is learning to predict what people want, not learning to want it

And if someone wants others to suffer?
Check sign of altruism term
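"Check sign of altruism term" can be illustrated as follows (a toy decomposition I am assuming, not a formula from the talk): split each person's utility into their own wellbeing plus weighted terms in others' wellbeing, and have the purely altruistic robot drop any negative cross-terms rather than serve them.

```python
# Toy decomposition: person i's utility is their own wellbeing plus
# weighted terms in others' wellbeing. A negative weight w[i][j]
# encodes malice or envy.
wellbeing = {"alice": 1.0, "bob": 0.5}
w = {"alice": {"bob": 0.2},      # alice is mildly altruistic
     "bob":   {"alice": -0.8}}   # bob wants alice to suffer

def robot_objective(clip_malice):
    total = 0.0
    for i in wellbeing:
        total += wellbeing[i]                 # everyone's own wellbeing counts
        for j, w_ij in w[i].items():
            if clip_malice:
                w_ij = max(w_ij, 0.0)         # drop negative altruism terms
            total += w_ij * wellbeing[j]
    return total

print(robot_objective(False), robot_objective(True))  # 0.8 vs 1.6
```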
Robot: Your wife called to remind you about dinner tonight
Human: Wait! What? What dinner?
Robot: For your 20th anniversary, at 7pm
Human: I can't, I'm meeting the Secretary General at 7.30! How did this happen??
Robot: I did warn you, but you overrode my recommendation…
Human: OK, but what am I going to do now? I can't just tell him I'm too busy!!
Robot: Don't worry, I arranged for his plane to be delayed – some kind of computer malfunction.
Human: Really? You can do that?!?
Robot: He sends his profound apologies and is happy to meet you for lunch tomorrow
Robot: Welcome home! Long day?
Human: Yes, terrible, not even time for lunch.
Robot: So you must be quite hungry!
Human: Starving! What's for dinner?
Robot: There's something I need to tell you… There are humans in Somalia in more urgent need of help. I am leaving now. Please make your own dinner.
Summary

Value misalignment: a potential risk
Certain design templates may support provably beneficial systems
Not yet ready for standards or regulations!
Economic incentives may work in our favor
Questions

Can we change the way AI defines itself?
A civil engineer says "I design bridges", not "I design bridges that don't fall down"
Will solutions to near-term control problems scale to the long-term control problem?
What about Bondian villains?
Long-term enfeeblement? (cf. E. M. Forster, "The Machine Stops")
What does Values 'R Us sell, exactly?