The Contextual Research Program

Contextual Decision Processes
Machine Learning the Future,
March 20
https://arxiv.org/abs/1610.09512
How to Learn?
Practice: Powerful modeling, simple exploration (e.g. Atari Deep Reinforcement Learning)
Theory: Sophisticated exploration in small-state MDPs (e.g. E^3, R-MAX algorithms)
Contextual Decision Processes
Repeatedly:
For h = 1 to H:
1. See features x
2. Choose action a in A
3. See reward r for action a in context x and history h
Goal: If F = { f : X × A → (−∞, ∞) }, compete with
Π_F = { π_f : f ∈ F }, where π_f(x) = argmax_a f(x, a).
Bad News: Ω( min{ |A|^H, |Π| } / ε^2 ) trajectories needed.
Lesson: Some notion of "few states" is required.
OLIVE: Optimism Led Iterative Value Elimination
Given: Set of value functions F = { f : X × A → (−∞, ∞) }
Repeatedly:
Pick the most optimistic f at h = 1
Roll out with f repeatedly
If (predicted value ≈ real value) then return f
Else find a horizon h of large disagreement
Roll out with f, except acting randomly at h
Eliminate all f with a large Bellman error at h
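The loop above can be rendered schematically in code. The sketch below is a hedged Python version in which the statistical pieces (the optimistic value estimate, the Monte-Carlo value of π_f, and the Bellman-error estimate at a horizon) are passed in by the caller; it captures only the optimism-plus-elimination control flow, and none of the names come from the talk.

```python
# Schematic rendering of the OLIVE loop above. The statistical estimates are
# supplied by the caller, so this only captures the optimism + elimination
# control flow. All names are illustrative assumptions.
def olive(F, optimism_score, estimate_value, bellman_error, H, eps):
    """
    F                   : finite list of candidate value functions f(x, a)
    optimism_score(f)   : f's predicted value at h = 1 (the optimistic estimate)
    estimate_value(f)   : Monte-Carlo estimate of the true value of pi_f
    bellman_error(g, f, h): Bellman error of g under roll-in policy pi_f at h
    """
    F = list(F)
    while F:
        # Pick the most optimistic surviving f at h = 1.
        f = max(F, key=optimism_score)
        # Roll out with pi_f repeatedly; stop if prediction matches reality.
        if abs(optimism_score(f) - estimate_value(f)) <= eps:
            return f
        # Otherwise some horizon has a large disagreement; find it.
        h = max(range(1, H + 1), key=lambda hh: abs(bellman_error(f, f, hh)))
        # Roll in with pi_f (acting randomly at h in the real algorithm) and
        # eliminate every g whose Bellman error at h is large.
        F = [g for g in F if abs(bellman_error(g, f, h)) <= eps]
    raise RuntimeError("all candidate value functions were eliminated")
```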
What is Bellman Error?
πœ€ 𝑓, πœ‹, β„Ž = E 𝑓 π‘₯β„Ž , π‘Žβ„Ž βˆ’ π‘Ÿβ„Ž βˆ’ 𝑓 π‘₯β„Ž+1 , π‘Žβ„Ž+1 ,
π‘Ž1 , … π‘Žβ„Žβˆ’1 ~ πœ‹
⋆
πœ€ 𝑓 , πœ‹, β„Ž = 0
π‘Žβ„Ž , π‘Žβ„Ž+1
πœ‹
πœ‹π‘“
𝑓
π‘“β‰ˆπ‘“
⋆
⋆
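The definition turns directly into a Monte-Carlo estimator: roll in with π for the first h−1 steps, switch to π_f at steps h and h+1, and average the one-step residual. The sketch below assumes the same reset/step environment interface as before and ignores the boundary case h = H, where the final term is absent.

```python
# Monte-Carlo estimate of eps(f, pi, h) as defined above: roll in with pi for
# steps 1..h-1, act with pi_f at steps h and h+1, and average
# f(x_h, a_h) - r_h - f(x_{h+1}, a_{h+1}). Env interface is an assumption.
def estimate_bellman_error(env, f, pi, actions, h, n_rollouts=1000):
    def pi_f(x):                          # greedy policy of f
        return max(actions, key=lambda a: f(x, a))

    total = 0.0
    for _ in range(n_rollouts):
        x = env.reset()
        for _step in range(1, h):         # a_1, ..., a_{h-1} ~ pi
            x, _ = env.step(pi(x))
        a_h = pi_f(x)                     # a_h ~ pi_f
        x_next, r_h = env.step(a_h)
        a_next = pi_f(x_next)             # a_{h+1} ~ pi_f
        total += f(x, a_h) - r_h - f(x_next, a_next)
    return total / n_rollouts
```

Wrapping this so that the roll-in argument is the greedy policy of another value function recovers the bellman_error(g, f, h)-style callable assumed in the OLIVE sketch above.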
Theorem: ∀ CDPs, ∀ F containing f* with Bellman rank B, with probability 1 − δ, OLIVE requires only
O( B^2 H^3 |A| log(|F|/δ) / ε^2 )
trajectories to find an ε-optimal f.
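For a rough sense of scale, the bound can be evaluated numerically; the parameter values below are invented purely for illustration, natural log is used, and the constants hidden in O(·) are ignored.

```python
# Plugging invented numbers into the O(B^2 H^3 |A| log(|F|/delta) / eps^2)
# bound above; hidden constants are ignored and natural log is used.
import math

B, H, num_actions, num_functions = 10, 50, 4, 10**6
eps, delta = 0.1, 0.05

bound = B**2 * H**3 * num_actions * math.log(num_functions / delta) / eps**2
naive = num_actions**H / eps**2           # scale of the |A|^H lower-bound term

print(f"OLIVE bound  : {bound:.1e} trajectories")   # ~8.4e+10
print(f"|A|^H / eps^2: {naive:.1e}")                # ~1.3e+32
```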
What is Bellman Rank?
Define the |F| × |F| matrix:
ε(F, h)_{f,g} = ε(f, π_g, h)
Bellman Rank = rank of that matrix.
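Given a way to evaluate or estimate ε(f, π_g, h) for every pair of functions, the definition reduces to forming a matrix and taking its numerical rank. In the sketch below the bellman_error callable and the tolerance are assumptions, and the max over horizons reflects the requirement that the rank bound hold at every h.

```python
# Forming the |F| x |F| matrix eps(F, h)_{f,g} = eps(f, pi_g, h) and taking its
# numerical rank. bellman_error(f, g, h) is an assumed evaluator (exact or
# Monte-Carlo, e.g. the estimator sketched earlier).
import numpy as np

def bellman_rank_at_h(F, bellman_error, h, tol=1e-8):
    """Rank of the matrix whose (i, j) entry is eps(F[i], pi_{F[j]}, h)."""
    M = np.array([[bellman_error(f, g, h) for g in F] for f in F])
    return np.linalg.matrix_rank(M, tol=tol)

def bellman_rank(F, bellman_error, H):
    """Smallest B such that the matrix has rank <= B at every horizon h."""
    return max(bellman_rank_at_h(F, bellman_error, h) for h in range(1, H + 1))
```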
Bellman Rank bounded by "states" in simple models
Model                         PAC Guarantees
Small-state MDPs              Known
Structured large-state MDPs   Extended
Reactive POMDPs               New
Reactive PSRs                 New
LQR (continuous actions)      Known
A far-from-complete theory of RL