General Reinforcement Learning
Jan Leike
Future of Humanity Institute
University of Oxford
9 June 2016

Reinforcement Learning Today [1]

[1] Volodymyr Mnih et al. “Human-Level Control through Deep Reinforcement Learning”. In: Nature 518.7540 (2015), pp. 529–533.

If we upscale DQN, do we get strong AI?

Narrow Reinforcement Learning

Atari 2600 (ergodic MDPs)     The Real World™ (general environments)
fully observable              partially observable
ergodic                       not ergodic
very large state space        infinite state space
ε-exploration works           ε-exploration fails
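
The last row deserves a brief illustration. The sketch below is a toy simulation (the environment and numbers are made up for this example, not from the talk): an agent that explores with a fixed probability ε eventually takes a trap action, and in a non-ergodic environment a single trap ends the reward stream for good, whereas in an ergodic environment the agent could always return.

    import random

    def survives(epsilon: float, horizon: int) -> bool:
        """Pure epsilon-exploration in a toy non-ergodic environment:
        one of two actions is a trap leading to an absorbing zero-reward state."""
        for _ in range(horizon):
            if random.random() < epsilon and random.random() < 0.5:
                return False  # the random exploratory action was the trap; no recovery
        return True

    random.seed(0)
    runs = 1000
    for horizon in (10, 100, 1000):
        alive = sum(survives(0.05, horizon) for _ in range(runs)) / runs
        print(f"horizon {horizon:4d}: survived {alive:.0%} of runs")
    # Survival probability is (1 - 0.025)^horizon: roughly 78% after 10 steps,
    # 8% after 100, and essentially 0 after 1000.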

Outline

AIXI
Optimality
Game Theory
AI Safety

General Reinforcement Learning

[Figure: the agent sends an action a_t ∈ A to the environment and receives a percept e_t = (o_t, r_t) ∈ E in return.]

policy        π : (A × E)* → ∆A
environment   ν : (A × E)* × A → ∆E
history       æ_<t := a_1 e_1 a_2 e_2 … a_{t−1} e_{t−1}

Goal: maximize ∑_{t=1}^∞ γ_t r_t,
where γ : ℕ → [0, 1] is a discount function with ∑_{t=1}^∞ γ_t < ∞.

Assumptions:
- 0 ≤ r_t ≤ 1
- A and E are finite
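
To make the interaction protocol concrete, here is a minimal sketch in Python (the interfaces and the bandit example are illustrative, not part of the talk): a policy maps a history of action-percept pairs to an action, an environment maps the history and the latest action to a percept, and the agent's return is the discounted sum of rewards.

    import random
    from typing import Callable, List, Tuple

    Action = int
    Percept = Tuple[int, float]                         # (observation, reward), reward in [0, 1]
    History = List[Tuple[Action, Percept]]
    Policy = Callable[[History], Action]                # a stochastic policy returns a sampled action
    Environment = Callable[[History, Action], Percept]  # likewise a sampled percept

    def interact(policy: Policy, env: Environment,
                 gamma: Callable[[int], float], steps: int) -> float:
        """Run the agent-environment loop and return the discounted reward sum."""
        history: History = []
        total = 0.0
        for t in range(1, steps + 1):
            a = policy(history)
            o, r = env(history, a)
            total += gamma(t) * r
            history.append((a, (o, r)))
        return total

    # Example: a two-armed bandit environment and a uniformly random policy.
    def bandit(history: History, action: Action) -> Percept:
        return 0, float(random.random() < (0.3 if action == 0 else 0.7))

    print(interact(lambda h: random.randrange(2), bandit, lambda t: 0.99 ** t, steps=1000))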

Value Functions

Value of policy π in environment ν:

    V^π_ν(æ_<t) := (1 / ∑_{k=t}^∞ γ_k) · E^π_ν [ ∑_{k=t}^∞ γ_k r_k | æ_<t ]

- optimal value: V*_ν := sup_π V^π_ν
- ν-optimal policy: π*_ν := arg max_π V^π_ν
- effective horizon: H_t(ε) := min { k : (∑_{i=t+k}^∞ γ_i) / (∑_{i=t}^∞ γ_i) ≤ ε }
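
For intuition about the effective horizon, here is a small sketch (illustrative, not from the talk) for geometric discounting γ_t = γ^t: the tail sums have the closed form ∑_{i=t}^∞ γ^i = γ^t/(1−γ), so the ratio in the definition simplifies to γ^k and H_t(ε) does not depend on t.

    import math

    def effective_horizon_geometric(gamma: float, eps: float) -> int:
        """H_t(eps) = min { k : (sum_{i>=t+k} gamma_i) / (sum_{i>=t} gamma_i) <= eps }.
        For gamma_t = gamma^t the ratio is gamma^k, independent of t."""
        return math.ceil(math.log(eps) / math.log(gamma))

    for g in (0.9, 0.99, 0.999):
        print(g, effective_horizon_geometric(g, eps=0.01))
    # 0.9 -> 44, 0.99 -> 459, 0.999 -> 4603: the less the agent discounts,
    # the further ahead it has to plan.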

AIXI [3]

- countable set of environments M = {ν_1, ν_2, …}
- prior w : M → [0, 1]; the Solomonoff prior [2] is w(ν) ∝ 2^{−K(ν)}
- Bayesian mixture ξ := ∑_{ν∈M} w(ν) ν

AIXI is the Bayes-optimal agent with a Solomonoff prior:

    π*_ξ := arg max_π V^π_ξ

[2] Ray Solomonoff. “A Formal Theory of Inductive Inference. Parts 1 and 2”. In: Information and Control 7.1 (1964).
[3] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, 2005.
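
AIXI itself is incomputable, but the mixture-and-posterior machinery is easy to sketch for a tiny environment class. The snippet below is illustrative only: the “environments” are action-independent biased coins rather than lower semicomputable environments, and all names are made up. It keeps a posterior w(ν | æ_<t) and forms the mixture prediction ξ.

    # A tiny Bayesian mixture: each "environment" is a coin with a fixed bias
    # of emitting percept 1.  The prior is uniform over the class.
    biases = [0.1, 0.5, 0.9]
    posterior = [1.0 / len(biases)] * len(biases)

    def mixture_prob_of_one() -> float:
        """xi(1 | history) = sum_nu w(nu | history) * nu(1 | history)."""
        return sum(w * b for w, b in zip(posterior, biases))

    def update(percept: int) -> None:
        """Bayes rule: w(nu | history, e) is proportional to w(nu | history) * nu(e | history)."""
        global posterior
        weights = [w * (b if percept == 1 else 1 - b) for w, b in zip(posterior, biases)]
        posterior = [w / sum(weights) for w in weights]

    for e in [1, 1, 0, 1, 1, 1]:               # percepts generated by the true environment
        update(e)
    print(mixture_prob_of_one(), posterior)    # posterior mass shifts onto the 0.9 coin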

On-Policy Value Convergence

For any policy π and any true environment µ ∈ M,

    V^π_ξ(æ_<t) − V^π_µ(æ_<t) → 0 as t → ∞ almost surely.

Outline

  AIXI
▶ Optimality
  Game Theory
  AI Safety

Notions of Optimality in Reinforcement Learning

- Bayes optimality
- Asymptotic optimality
- Sample complexity bounds
- Regret bounds
- …

Asymptotic Optimality

π is asymptotically optimal iff

    V*_µ(æ_<t) − V^π_µ(æ_<t) → 0 as t → ∞

For asymptotic optimality the agent needs to explore infinitely often, each time for an entire effective horizon.

Theorem. AIXI is not asymptotically optimal. [4]

[4] Laurent Orseau. “Asymptotic Non-Learnability of Universal Agents with Computable Horizon Functions”. In: Theoretical Computer Science 473 (2013), pp. 149–156.

Hell

[Figure: a trap state “hell” with reward = 0.]

The Dogmatic Prior [5]

Policy π_Lazy:
    while (true) { do_nothing(); }

Dogmatic prior ξ′:
    if the agent does not act according to π_Lazy, it goes to hell with high probability.

Theorem. For every ε > 0 there is a dogmatic prior ξ′ such that AIξ′ acts according to π_Lazy as long as V^{π_Lazy}_{ξ′}(æ_<t) > ε. [5]

[5] Jan Leike and Marcus Hutter. “Bad Universal Priors and Notions of Optimality”. In: Conference on Learning Theory. 2015, pp. 1244–1259.

Thompson Sampling

Thompson sampling policy π_T:
    Sample ρ ∼ w( · | æ_<t).
    Follow π*_ρ for H_t(ε_t) steps.
    Repeat.
with ε_t → 0.

Theorem. Thompson sampling is asymptotically optimal in mean: [6]

    E^{π_T}_µ [ V*_µ(æ_<t) − V^{π_T}_µ(æ_<t) ] → 0 as t → ∞.

[6] Jan Leike et al. “Thompson Sampling is Asymptotically Optimal in General Environments”. In: Uncertainty in Artificial Intelligence. 2016.
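
The distinctive feature compared to Bayes-optimal acting is the commitment: the agent follows a single sampled hypothesis for a whole effective horizon before resampling. A minimal control-flow sketch (the helper functions are assumed to be supplied by the surrounding agent implementation and are not from the talk):

    def thompson_sampling(posterior_sample, optimal_policy, effective_horizon, act, total_steps):
        """Sketch of the Thompson sampling policy pi_T.

        posterior_sample()        -- draws an environment rho ~ w( . | history)
        optimal_policy(rho)       -- returns the rho-optimal policy pi*_rho
        effective_horizon(t, eps) -- H_t(eps) as defined earlier
        act(policy)               -- executes one action of `policy` and updates the posterior
        """
        t = 1
        while t <= total_steps:
            rho = posterior_sample()             # sample one hypothesis ...
            policy = optimal_policy(rho)         # ... and commit to its optimal policy
            eps_t = 1.0 / t                      # any schedule with eps_t -> 0 will do
            for _ in range(effective_horizon(t, eps_t)):
                act(policy)                      # follow pi*_rho for H_t(eps_t) steps
                t += 1
                if t > total_steps:
                    return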

Recoverable Environments

An environment ν is recoverable iff

    sup_π ( E^{π*_ν}_ν [V*_ν(æ_<t)] − E^π_ν [V*_ν(æ_<t)] ) → 0 as t → ∞.

For non-recoverable environments: either the agent gets caught in a trap, or it is not asymptotically optimal.

Regret

    R_m(π, µ) := max_π′ E^{π′}_µ [ ∑_{t=1}^m r_t ] − E^π_µ [ ∑_{t=1}^m r_t ]

A problem class is learnable iff ∃π ∀µ: R_m(π, µ) ∈ o(m).

Fact: The general RL problem is not learnable.

Regret in Non-Recoverable Environments

[Figure: two environments. In µ1 the action “left” leads to heaven (reward 1) and “right” leads to hell (reward 0); in µ2 the roles of the two actions are swapped.]

    R_m(left, µ1) = 0        R_m(left, µ2) = m
    R_m(right, µ1) = m       R_m(right, µ2) = 0
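
As a quick check of why no single policy achieves sublinear regret in both environments (this calculation is mine, not from the slides): a policy that takes “left” as its irreversible first choice with probability p has expected regret (1 − p)·m in µ1 and p·m in µ2, so its worse case is at least m/2.

    def expected_regret(p_left: float, m: int, env: str) -> float:
        """Expected regret of a policy that picks 'left' with probability p as its
        (irreversible) first action; the wrong choice forfeits the whole reward stream."""
        return (1 - p_left) * m if env == "mu1" else p_left * m

    m = 1000
    for p in (0.0, 0.5, 1.0):
        worst = max(expected_regret(p, m, "mu1"), expected_regret(p, m, "mu2"))
        print(f"p(left) = {p}: worst-case expected regret {worst}")   # never below m/2 = 500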

Sublinear Regret

Theorem. If
- µ ∈ M is recoverable,
- π is asymptotically optimal in mean, and
- γ satisfies some weak assumptions,
then regret is sublinear. [7]

[7] Jan Leike et al. “Thompson Sampling is Asymptotically Optimal in General Environments”. In: Uncertainty in Artificial Intelligence. 2016.

Optimality Summary

                        AIXI    TS                   All policies
Sublinear regret        ×       ✓ (if recoverable)   ×
Sample complexity       ×       ?                    ×
Pareto optimality       ✓       ✓                    ✓
Bayes optimality        ✓       ×                    ×
Asymptotic optimality   ×       ✓                    ×

Outline

  AIXI
  Optimality
▶ Game Theory
  AI Safety

Multi-Agent Environments

[Figure: agents π_1, …, π_n each send an action a^i_t to, and receive a percept e^i_t from, a shared multi-agent environment σ.]

- π_i is an ε-best response iff V*_{σ_i} − V^{π_i}_{σ_i} < ε, where σ_i is the environment player i faces once the other agents' policies are fixed
- π_1, …, π_n play an ε-Nash equilibrium iff each π_i is an ε-best response
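
For a one-shot matrix game the ε-best-response condition can be checked directly against the two pure deviations. The sketch below (illustrative; rewards scaled to [0, 1]) tests whether a pair of mixed strategies is an ε-Nash equilibrium of matching pennies, the game that appears again under “Results for Bayesian Agents” below.

    # Matching pennies with rewards in [0, 1]: player 1 gets 1 if the two coins
    # match, player 2 gets 1 if they differ.  A strategy is P("heads").
    def value(p1_heads: float, p2_heads: float, player: int) -> float:
        match = p1_heads * p2_heads + (1 - p1_heads) * (1 - p2_heads)
        return match if player == 1 else 1 - match

    def is_epsilon_nash(p1: float, p2: float, eps: float) -> bool:
        """Each strategy must be an eps-best response: V* - V^pi < eps, where the
        best response against a fixed opponent is one of the two pure strategies."""
        best1 = max(value(1.0, p2, 1), value(0.0, p2, 1))
        best2 = max(value(p1, 1.0, 2), value(p1, 0.0, 2))
        return best1 - value(p1, p2, 1) < eps and best2 - value(p1, p2, 2) < eps

    print(is_epsilon_nash(0.5, 0.5, eps=0.01))   # True: mutual uniform randomisation
    print(is_epsilon_nash(0.9, 0.5, eps=0.01))   # False: player 2 should deviate towards tails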

The Bayesian Approach

- countable set of policies Π
- prior w ∈ ∆Π
- act Bayes-optimally with respect to w

Grain of truth: the Bayes-optimal policy needs to be in Π.

Results for Bayesian Agents

Theorem. If each player is Bayesian, knows the infinitely repeated game, and has a grain of truth, then the players converge to an ε-Nash equilibrium. [8]

Theorem. Two Bayesian players playing infinitely repeated matching pennies may fail to converge to an ε-Nash equilibrium, even if they have a grain of truth. [9]

[8] Ehud Kalai and Ehud Lehrer. “Rational Learning Leads to Nash Equilibrium”. In: Econometrica (1993), pp. 1019–1045.
[9] Jan Leike, Jessica Taylor, and Benya Fallenstein. “A Formal Solution to the Grain of Truth Problem”. In: Uncertainty in Artificial Intelligence. 2016.

Solving the Grain of Truth Problem [10]

Theorem. There is a class of environments M_refl that contains a grain of truth with respect to the Bayes-optimal policies of any computable prior, in any computable multi-agent environment.

Theorem. Each ν ∈ M_refl is limit computable.

Theorem. There are limit computable policies π_1, …, π_n such that for any computable multi-agent environment σ, all ε > 0, and all i ∈ {1, …, n}, the probability that policy π_i is an ε-best response converges to 1 as t → ∞.

[10] Jan Leike, Jessica Taylor, and Benya Fallenstein. “A Formal Solution to the Grain of Truth Problem”. In: Uncertainty in Artificial Intelligence. 2016.

Outline

  AIXI
  Optimality
  Game Theory
▶ AI Safety

AI Safety Approaches

Bottom-up: practical algorithms, toy models, demos
Top-down: theoretical models, abstract problems, theorems

“Applications” of GRL to AI Safety

- self-modification: Orseau and Ring (2011), Orseau and Ring (2012), Everitt et al. (2016)
- self-reflection: Fallenstein, Soares, and Taylor (2015), Leike, Taylor, and Fallenstein (2016)
- memory manipulation: Orseau and Ring (2012)
- interruptibility: Orseau and Armstrong (2016)
- decision theory: Everitt, Leike, and Hutter (2015)
- wireheading: Ring and Orseau (2011), Everitt and Hutter (2016)
- value learning: Dewey (2011)
- questions of identity: Orseau (2014)

Limits of the Current Model

- model-based
- dualistic
- not self-improving
- assumes infinite computation

Conclusion

Mathematical and mental tools to think about strong AI:

- exploration vs. exploitation
- effective horizon
- on-policy vs. off-policy
- model-based vs. model-free
- recoverability
- asymptotic optimality
- reflective oracles

Jan Leike. “Nonparametric General Reinforcement Learning”. PhD thesis. Australian National University, 2016.
http://jan.leike.name/