The Contextual Research Program

Contextual Decision Processes
Machine Learning the Future,
March 20
https://arxiv.org/abs/1610.09512
How to Learn?
Practice: Powerful modeling, simple exploration (e.g. Atari Deep Reinforcement Learning)
Theory: Sophisticated exploration in small-state MDPs (e.g. E^3, R-MAX algorithms)
Contextual Decision Processes
Repeatedly:
For h = 1 to H:
1. See features x
2. Choose action a in A
3. See reward r for action a in context x and history h
Goal: If F = { f : X × A → (−∞, ∞) }, compete with
Π_F = { π_f : f ∈ F }, where π_f(x) = argmax_a f(x, a).
Bad News: Ω( min{ |A|^H, |Π| } / ε^2 ) trajectories needed.
Lesson: Some notion of "few states" is required.
OLIVE: Optimism Led Iterative Value Elimination
Given: Set of value functions F = { f : X × A → (−∞, ∞) }
Repeatedly:
Pick the most optimistic f at h = 1
Roll out with f repeatedly
If (predicted value ≈ real value) then return f
Else find a horizon h of large disagreement
Roll out with f, except acting randomly at h
Eliminate all f with a large Bellman error at h
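The loop above can be rendered schematically in code. The sketch below is a hedged Python version in which the statistical pieces (the optimistic value estimate, the Monte-Carlo value of π_f, and the Bellman-error estimate at a horizon) are passed in by the caller; it captures only the optimism-plus-elimination control flow, and none of the names come from the talk.

```python
# Schematic rendering of the OLIVE loop above. The statistical estimates are
# supplied by the caller, so this only captures the optimism + elimination
# control flow. All names are illustrative assumptions.
def olive(F, optimism_score, estimate_value, bellman_error, H, eps):
    """
    F                   : finite list of candidate value functions f(x, a)
    optimism_score(f)   : f's predicted value at h = 1 (the optimistic estimate)
    estimate_value(f)   : Monte-Carlo estimate of the true value of pi_f
    bellman_error(g, f, h): Bellman error of g under roll-in policy pi_f at h
    """
    F = list(F)
    while F:
        # Pick the most optimistic surviving f at h = 1.
        f = max(F, key=optimism_score)
        # Roll out with pi_f repeatedly; stop if prediction matches reality.
        if abs(optimism_score(f) - estimate_value(f)) <= eps:
            return f
        # Otherwise some horizon has a large disagreement; find it.
        h = max(range(1, H + 1), key=lambda hh: abs(bellman_error(f, f, hh)))
        # Roll in with pi_f (acting randomly at h in the real algorithm) and
        # eliminate every g whose Bellman error at h is large.
        F = [g for g in F if abs(bellman_error(g, f, h)) <= eps]
    raise RuntimeError("all candidate value functions were eliminated")
```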
What is Bellman Error?
πœ€ 𝑓, πœ‹, β„Ž = E 𝑓 π‘₯β„Ž , π‘Žβ„Ž βˆ’ π‘Ÿβ„Ž βˆ’ 𝑓 π‘₯β„Ž+1 , π‘Žβ„Ž+1 ,
π‘Ž1 , … π‘Žβ„Žβˆ’1 ~ πœ‹
⋆
πœ€ 𝑓 , πœ‹, β„Ž = 0
π‘Žβ„Ž , π‘Žβ„Ž+1
πœ‹
πœ‹π‘“
𝑓
π‘“β‰ˆπ‘“
⋆
⋆
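The definition turns directly into a Monte-Carlo estimator: roll in with π for the first h−1 steps, switch to π_f at steps h and h+1, and average the one-step residual. The sketch below assumes the same reset/step environment interface as before and ignores the boundary case h = H, where the final term is absent.

```python
# Monte-Carlo estimate of eps(f, pi, h) as defined above: roll in with pi for
# steps 1..h-1, act with pi_f at steps h and h+1, and average
# f(x_h, a_h) - r_h - f(x_{h+1}, a_{h+1}). Env interface is an assumption.
def estimate_bellman_error(env, f, pi, actions, h, n_rollouts=1000):
    def pi_f(x):                          # greedy policy of f
        return max(actions, key=lambda a: f(x, a))

    total = 0.0
    for _ in range(n_rollouts):
        x = env.reset()
        for _step in range(1, h):         # a_1, ..., a_{h-1} ~ pi
            x, _ = env.step(pi(x))
        a_h = pi_f(x)                     # a_h ~ pi_f
        x_next, r_h = env.step(a_h)
        a_next = pi_f(x_next)             # a_{h+1} ~ pi_f
        total += f(x, a_h) - r_h - f(x_next, a_next)
    return total / n_rollouts
```

Wrapping this so that the roll-in argument is the greedy policy of another value function recovers the bellman_error(g, f, h)-style callable assumed in the OLIVE sketch above.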
Theorem: ∀ CDPs, ∀ F containing f* with Bellman rank B, with probability 1 − δ, OLIVE requires only
O( B^2 H^3 |A| log(|F|/δ) / ε^2 )
trajectories to find an ε-optimal f.
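For a rough sense of scale, the bound can be evaluated numerically; the parameter values below are invented purely for illustration, natural log is used, and the constants hidden in O(·) are ignored.

```python
# Plugging invented numbers into the O(B^2 H^3 |A| log(|F|/delta) / eps^2)
# bound above; hidden constants are ignored and natural log is used.
import math

B, H, num_actions, num_functions = 10, 50, 4, 10**6
eps, delta = 0.1, 0.05

bound = B**2 * H**3 * num_actions * math.log(num_functions / delta) / eps**2
naive = num_actions**H / eps**2           # scale of the |A|^H lower-bound term

print(f"OLIVE bound  : {bound:.1e} trajectories")   # ~8.4e+10
print(f"|A|^H / eps^2: {naive:.1e}")                # ~1.3e+32
```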
What is Bellman Rank?
Define the |F| × |F| matrix:
ε(F, h)_{f,g} = ε(f, π_g, h)
Bellman Rank = rank of that matrix.
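Given a way to evaluate or estimate ε(f, π_g, h) for every pair of functions, the definition reduces to forming a matrix and taking its numerical rank. In the sketch below the bellman_error callable and the tolerance are assumptions, and the max over horizons reflects the requirement that the rank bound hold at every h.

```python
# Forming the |F| x |F| matrix eps(F, h)_{f,g} = eps(f, pi_g, h) and taking its
# numerical rank. bellman_error(f, g, h) is an assumed evaluator (exact or
# Monte-Carlo, e.g. the estimator sketched earlier).
import numpy as np

def bellman_rank_at_h(F, bellman_error, h, tol=1e-8):
    """Rank of the matrix whose (i, j) entry is eps(F[i], pi_{F[j]}, h)."""
    M = np.array([[bellman_error(f, g, h) for g in F] for f in F])
    return np.linalg.matrix_rank(M, tol=tol)

def bellman_rank(F, bellman_error, H):
    """Smallest B such that the matrix has rank <= B at every horizon h."""
    return max(bellman_rank_at_h(F, bellman_error, h) for h in range(1, H + 1))
```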
Bellman Rank bounded by "states" in simple models
Model                         PAC Guarantees
Small-state MDPs              Known
Structured large-state MDPs   Extended
Reactive POMDPs               New
Reactive PSRs                 New
LQR (continuous actions)      Known
A far-from-complete theory of RL