Contextual Decision Processes
Machine Learning the Future,
March 20
https://arxiv.org/abs/1610.09512
How to Learn?
Practice: powerful modeling, simple exploration (e.g., Atari Deep Reinforcement Learning).
Theory: sophisticated exploration in small-state MDPs (e.g., the $E^3$ and R-MAX algorithms).
Contextual Decision Processes
Repeatedly:
For h = 1 to H
1. See features x_h
2. Choose action a_h in A
3. See reward r_h for action a_h in context x_h and history h
Goal: If $F = \{f : X \times A \to (-\infty, \infty)\}$, compete with $\Pi_F = \{\pi_f : f \in F,\ \pi_f(x) = \arg\max_a f(x, a)\}$.
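To make the protocol concrete, here is a minimal sketch of one episode, assuming a hypothetical env object with reset()/step() methods and a value function f(x, a); these names are illustrative, not from the slides or the paper.

```python
def run_episode(env, f, actions, H):
    """Follow the greedy policy pi_f(x) = argmax_a f(x, a) for H steps.

    env, reset(), and step() are assumed interfaces: reset() returns the
    first context x_1 and step(a) returns (reward, next context).
    """
    total = 0.0
    x = env.reset()                              # 1. see features x_h
    for h in range(1, H + 1):
        a = max(actions, key=lambda a: f(x, a))  # 2. choose a_h = pi_f(x_h)
        r, x = env.step(a)                       # 3. see reward r_h
        total += r
    return total
```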
Bad News: in the worst case, $\Omega\!\left(\frac{\min\{|A|^H,\ |\Pi|\}}{\epsilon^2}\right)$ trajectories are required.
Lesson: Some notion of "few states" is required.
OLIVE: Optimism Led Iterative Value Elimination
Given: Set of value functions $F = \{f : X \times A \to (-\infty, \infty)\}$
Repeatedly:
Pick the most optimistic f at h = 1
Roll out with π_f repeatedly
If (predicted value ≈ real value) then return π_f
Else find a horizon h of large disagreement
Roll out with π_f, except acting randomly at h
Eliminate all f with a large Bellman error at h
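Here is a simplified, self-contained sketch of this loop, assuming the same hypothetical env/f interface as above. The Monte-Carlo estimators, sample count n, and the single tolerance tol are stand-ins chosen for illustration; they are not the thresholds or estimators specified in the paper.

```python
import random
from statistics import mean

def greedy(f, x, actions):
    """pi_f(x) = argmax_a f(x, a)."""
    return max(actions, key=lambda a: f(x, a))

def episode(env, f, actions, H, random_at=None):
    """Follow pi_f for H steps, acting uniformly at random at step random_at
    (if given); return the (x_h, a_h, r_h) triples and the list of contexts."""
    trans, xs = [], [env.reset()]
    for h in range(1, H + 1):
        x = xs[-1]
        a = random.choice(actions) if h == random_at else greedy(f, x, actions)
        r, x_next = env.step(a)
        trans.append((x, a, r))
        xs.append(x_next)
    return trans, xs

def real_value(env, f, actions, H, n):
    """Monte-Carlo estimate of the real value of pi_f."""
    return mean(sum(r for _, _, r in episode(env, f, actions, H)[0])
                for _ in range(n))

def bellman_error(g, data, h, actions, weighted):
    """Estimate eps(g, pi_f, h) from episodes collected by rolling in with pi_f.
    If the roll-in acted uniformly at random at step h, reweight each sample by
    |A| * 1{a_h = pi_g(x_h)} so that a_h is effectively drawn from pi_g."""
    vals = []
    for trans, xs in data:
        x_h, a_h, r_h = trans[h - 1]
        w = 1.0
        if weighted:
            w = float(len(actions)) if a_h == greedy(g, x_h, actions) else 0.0
        # Finite-horizon convention: the value after step H is 0.
        nxt = 0.0 if h == len(trans) else g(xs[h], greedy(g, xs[h], actions))
        vals.append(w * (g(x_h, a_h) - r_h - nxt))
    return mean(vals)

def olive_sketch(F, env, actions, H, n=200, tol=0.1):
    """Simplified OLIVE loop following the steps listed on the slide."""
    F = list(F)
    while F:
        # Pick the most optimistic f at h = 1: largest predicted E[max_a f(x_1, a)].
        def predicted(f):
            first_contexts = [env.reset() for _ in range(n)]
            return mean(max(f(x, a) for a in actions) for x in first_contexts)
        f = max(F, key=predicted)
        # Roll out pi_f repeatedly; if predicted value ~= real value, return it.
        if abs(predicted(f) - real_value(env, f, actions, H, n)) <= tol:
            return f
        # Else find a horizon h of large disagreement (on-policy Bellman error of f).
        on_policy = [episode(env, f, actions, H) for _ in range(n)]
        h = max(range(1, H + 1),
                key=lambda h: abs(bellman_error(f, on_policy, h, actions, False)))
        # Roll out with pi_f, acting randomly at h, and eliminate every g
        # whose estimated Bellman error at h is large.
        data = [episode(env, f, actions, H, random_at=h) for _ in range(n)]
        F = [g for g in F
             if abs(bellman_error(g, data, h, actions, True)) <= tol]
    return None
```

The uniform action at step h together with the importance weight lets one batch of rollouts score every surviving g at once; roughly speaking, that reuse plus a union bound is where the log|F| (rather than |F|) dependence in the theorem below comes from.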
What is Bellman Error?
π π, π, β = E π π₯β , πβ β πβ β π π₯β+1 , πβ+1 ,
π1 , β¦ πββ1 ~ π
β
π π , π, β = 0
πβ , πβ+1
π
ππ
π
πβπ
β
β
Theorem: for all CDPs and all $F$ containing $f^\star$, with Bellman rank $B$, with probability $1 - \delta$ OLIVE requires only

$\tilde{O}\!\left(\frac{B^2 H^3 |A| \log(|F|/\delta)}{\epsilon^2}\right)$

trajectories to find an $\epsilon$-optimal $\pi$.
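For a rough sense of scale, an illustrative comparison with made-up numbers (the tilde hides constants and additional log factors, so these are orders of magnitude only):

```python
import math

# Illustrative numbers only: compare the OLIVE-style bound with the |A|^H
# scaling from the lower bound, ignoring constants and hidden log factors.
B, H, A, F_size, eps, delta = 10, 20, 5, 10**6, 0.1, 0.05

olive_bound = B**2 * H**3 * A * math.log(F_size / delta) / eps**2
exhaustive = A**H / eps**2

print(f"B^2 H^3 |A| log(|F|/delta) / eps^2 : {olive_bound:.2e}")  # ~7e9
print(f"|A|^H / eps^2                      : {exhaustive:.2e}")   # ~1e16
```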
What is Bellman Rank?
Define the $|F| \times |F|$ matrix:
$E(F, h)_{f, f'} = \epsilon(f, \pi_{f'}, h)$
Bellman Rank = rank of that matrix (maximized over h).
Bellman Rank is bounded by the number of "states" in simple models.
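To see why, here is a sketch of the argument for an MDP over a small state space, where the context x_h reveals the state s_h (the decomposition follows the paper's examples; the notation s_h is chosen here for illustration):

$\epsilon(f, \pi_g, h) = \sum_{s} \Pr\big[s_h = s \mid a_1, \ldots, a_{h-1} \sim \pi_g\big] \cdot \mathbb{E}\big[f(x_h, a_h) - r_h - f(x_{h+1}, a_{h+1}) \,\big|\, s_h = s,\ a_h, a_{h+1} \sim \pi_f\big]$

The first factor depends only on $(g, h)$ and the second only on $(f, s, h)$, so each $E(F, h)$ factors through $\mathbb{R}^{|S|}$ and has rank at most the number of states $|S|$.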
Model                       | PAC Guarantees
Small-state MDPs            | Known
Structured large-state MDPs | Extended
Reactive POMDPs             | New
Reactive PSRs               | New
LQR (continuous actions)    | Known
A far from complete theory of RL
© Copyright 2026 Paperzz