
Bayesian Sparse Sampling
for On-line Reward Optimization

Presented at the Value of Information Seminar at NIPS 2005
Based on the ICML 2005 paper by Dale Schuurmans et al.
Background Perspective

Be Bayesian about reinforcement learning
- Ideal representation of uncertainty for action selection
- Computational barriers
- Exploration vs. Exploitation

Bayes decision theory
- Value of information measured by ultimate return in reward
- Choose actions to max expected value
- Exploration/exploitation tradeoff implicitly handled as a side effect
Bayesian Approach

conceptually clean but computationally disastrous
versus
conceptually disastrous but computationally clean
Overview

Efficient lookahead search for Bayesian RL
- Sparser sparse sampling
- Controllable computational cost
- Higher quality action selection than current methods:
  - Greedy
  - Epsilon-greedy
  - Boltzmann (Luce 1959)
  - Thompson Sampling (Thompson 1933)
  - Bayes optimal (Hee 1978)
  - Interval estimation (Lai 1987, Kaelbling 1994)
  - Myopic value of perfect info. (Dearden, Friedman, Andre 1999)
  - Standard sparse sampling (Kearns, Mansour, Ng 2001)
  - Péret & Garcia (Péret & Garcia 2004)
Sequential Decision Making
"Planning"

Requires model P(r,s'|s,a)

How to make an optimal decision?

[Figure: lookahead tree rooted at state s. Each action a leads to a Q(s,a) node; the expectation E_{r,s'|s,a} over rewards r and next states s' leads to V(s') nodes; at each successor state a MAX over a' is taken, and the pattern repeats with Q(s',a'), r', s'', ...]

This is the finite horizon, finite action, finite reward case.

General case: fixed point equations
V(s) = sup_a Q(s,a)
Q(s,a) = E_{r,s'|s,a}[ r + V(s') ]
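As a minimal illustration (not from the talk), the fixed point equations can be solved by backward induction when the model is known. The tabular model format below (a dict P mapping (s, a) to a list of (prob, r, s_next) triples) and the function names are assumptions:

# Sketch: finite-horizon backward induction for
#   V(s) = max_a Q(s,a),   Q(s,a) = E_{r,s'|s,a}[r + V(s')]
# Assumes horizon >= 1 and P[(s, a)] = list of (prob, r, s_next) triples.
def plan(P, states, actions, horizon):
    V = {s: 0.0 for s in states}                 # value with zero steps to go
    for _ in range(horizon):
        Q = {(s, a): sum(p * (r + V[s2]) for p, r, s2 in P[(s, a)])
             for s in states for a in actions}
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a, s=s: Q[(s, a)]) for s in states}
    return V, Q, policy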
Reinforcement Learning

Do not have model P(r,s'|s,a)

[Figure: the same lookahead tree of Q(s,a), r, V(s') nodes, but the transition model needed to expand it is unknown.]
Reinforcement Learning

Do not have model P(r,s'|s,a)
Cannot compute E_{r,s'|s,a}

[Figure: the same lookahead tree; without the model, the expectations at the chance nodes cannot be evaluated.]
Bayesian Reinforcement Learning

Prior P(θ) on model P(r,s'|s,a,θ)
Belief state b = P(θ); meta-level state (s, b)

Choose actions to maximize long-term reward.

Meta-level MDP:
- decision a from (s, b), outcome (r, s', b'); meta-level model P(r, s' b' | s b, a)
- decision a' from (s', b'), outcome (r', s'', b''); meta-level model P(r', s'' b'' | s' b', a')

Have a model for meta-level transitions!
- based on posterior update and expectations over base-level MDPs
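As a concrete example (not from the slides): for Bernoulli bandits with independent Beta priors, the belief state is a table of Beta parameters, the mean and sampled base-level models come from the posterior, and the meta-level transition b → b' is a conjugate update. The class and method names below are assumptions, reused in the later sketches:

# Illustrative sketch (assumed names): belief state and meta-level transition
# for independent Bernoulli arms with Beta priors.
import random

class BetaBelief:
    def __init__(self, n_arms, alpha=1.0, beta=1.0):
        self.params = [[alpha, beta] for _ in range(n_arms)]   # Beta(alpha, beta) per arm

    def predictive_mean(self, a):
        al, be = self.params[a]
        return al / (al + be)                  # expected reward of arm a under the mean model

    def sample_model(self):
        return [random.betavariate(al, be) for al, be in self.params]   # draw a base-level bandit

    def update(self, a, r):
        # meta-level transition b -> b': conjugate posterior update after reward r in {0, 1}
        b2 = BetaBelief(len(self.params))
        b2.params = [p[:] for p in self.params]
        b2.params[a][0] += r
        b2.params[a][1] += 1 - r
        return b2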
Bayesian RL Decision Making

How to make an optimal decision?
Solve the planning problem in the meta-level MDP:
- Optimal Q, V values: Q(s b, a), V(s b)
- Bayes optimal action selection

[Figure: lookahead tree over meta-level states, with V(s b) and Q(s b, a) at the root, rewards r on sampled outcomes, and successor values V(s' b').]

Problem: the meta-level MDP is much larger than the base-level MDP
- Impractical
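For reference, the Bayes optimal values above satisfy the meta-level analogue of the earlier fixed point equations:
V(s b) = max_a Q(s b, a)
Q(s b, a) = E_{r,s' b'|s b,a}[ r + V(s' b') ]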
Bayesian RL Decision Making

Current approximation strategies:
Consider the current belief state b.

[Figure: the greedy approach expands one-step (s, a, r, s') values under the mean model; the Thompson approach expands them under a sampled model.]

Greedy approach:
- current b → mean base-level MDP model
- → point estimate for Q, V
- → choose greedy action
- But doesn't consider uncertainty

Thompson approach:
- current b → sample a base-level MDP model
- → point estimate for Q, V
- → choose action proportional to the probability it is max Q
- ☺ Exploration is based on uncertainty
- But still myopic
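A minimal sketch of these two selection rules for the Bernoulli-bandit belief introduced earlier (BetaBelief and its methods are the assumed helpers from the previous sketch):

# Sketch: greedy vs. Thompson action selection from a Beta-Bernoulli belief.
def greedy_action(belief, n_arms):
    # mean base-level model -> point estimate of Q -> greedy action
    return max(range(n_arms), key=belief.predictive_mean)

def thompson_action(belief, n_arms):
    # sample a base-level model from the posterior and act greedily in it;
    # each action is chosen with the probability that it is the max-Q action
    model = belief.sample_model()
    return max(range(n_arms), key=lambda a: model[a])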
Their Approach

- Try to better approximate Bayes optimal action selection by performing lookahead
- Adapt “sparse sampling” (Kearns, Mansour, Ng)
- Make some practical improvements
Sparse Sampling
(Kearns, Mansour, Ng 2001)
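For orientation, a rough sketch of the standard sparse sampling recursion; a generative model simulate(s, a) -> (r, s_next) is assumed, and the function and parameter names are illustrative:

# Sketch of standard sparse sampling: C sampled outcomes per action,
# backed up through a depth-H lookahead tree.
def sparse_sample_Q(simulate, actions, s, depth, width, gamma=1.0):
    if depth == 0:
        return {a: 0.0 for a in actions}
    Q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):                       # C sampled outcomes per action
            r, s2 = simulate(s, a)
            Q2 = sparse_sample_Q(simulate, actions, s2, depth - 1, width, gamma)
            total += r + gamma * max(Q2.values())    # back up through the sampled child
        Q[a] = total / width
    return Q

def sparse_sample_action(simulate, actions, s, depth, width):
    Q = sparse_sample_Q(simulate, actions, s, depth, width)
    return max(Q, key=Q.get)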
Bayesian Sparse Sampling
Observation 1

Action value estimates are not equally important
- Need better Q value estimates for some actions, but not all
- Preferentially expand the tree under actions that might be optimal

Biased tree growth: use Thompson sampling to select actions to expand
Bayesian Sparse Sampling
Observation 2

Correct leaf value estimates to the same depth
- A leaf at depth t is scored with Q̂ · (N − t): the mean-MDP Q-value multiplied by the remaining depth

[Figure: sparse tree with leaves at depths t = 1, 2, 3 and effective horizon N = 3.]
Bayesian Sparse Sampling
Tree growing procedure

Repeat until the tree size limit is reached:
- Descend the sparse tree from the root:
  - Thompson sample actions (1. sample a model from the belief, 2. solve action values, 3. select the optimal action)
  - Sample the outcome (take the action, observe a sampled reward: (s, b) → (s, b, a) → (s', b'))
  - ... until a new node is added

Control computation by controlling tree size
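A rough sketch of this loop for the Bernoulli-bandit case, reusing the assumed BetaBelief and thompson_action helpers from the earlier sketches. The node bookkeeping, the cap on descents, and the final backup are simplifications for illustration, not the authors' implementation:

# Sketch: grow a sparse tree of belief states by Thompson-sampled descents,
# then back up values with the Observation 2 leaf correction.
import random

class Node:
    def __init__(self, belief, depth):
        self.belief = belief
        self.depth = depth
        self.children = {}                     # (action, sampled reward) -> child Node

def grow_tree(root_belief, n_arms, horizon, size_limit, max_descents=10000):
    root, size = Node(root_belief, 0), 1
    for _ in range(max_descents):
        if size >= size_limit:
            break
        node = root
        while node.depth < horizon:            # descend from the root
            a = thompson_action(node.belief, n_arms)                          # Thompson sample an action
            r = 1 if random.random() < node.belief.predictive_mean(a) else 0  # sample an outcome
            if (a, r) not in node.children:
                node.children[(a, r)] = Node(node.belief.update(a, r), node.depth + 1)
                size += 1
                break                          # stop this descent once a new node is added
            node = node.children[(a, r)]
    return root

def node_Q(node, horizon, n_arms):
    # average the backed-up returns of the sampled outcomes under each expanded action
    totals, counts = {}, {}
    for (a, r), child in node.children.items():
        totals[a] = totals.get(a, 0.0) + r + node_value(child, horizon, n_arms)
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

def node_value(node, horizon, n_arms):
    remaining = horizon - node.depth
    if not node.children or remaining == 0:
        # leaf correction (Observation 2): mean-MDP value times the remaining depth
        return remaining * max(node.belief.predictive_mean(a) for a in range(n_arms))
    return max(node_Q(node, horizon, n_arms).values())

# The action executed at the root is the argmax of node_Q(root, horizon, n_arms).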
Simple experiments

- 5 Bernoulli bandits a1, ..., a5
- Beta priors
- Sample a model from the prior
- Run the action selection strategies
- Repeat 3000 times
- Average accumulated reward per step
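A hedged sketch of this protocol, reusing the assumed helpers above; the uniform Beta(1, 1) prior and the particular strategy functions are illustrative assumptions (the exact settings are in the ICML 2005 paper):

# Sketch: sample a Bernoulli bandit from the Beta prior, run a strategy
# for a fixed horizon, and average the reward per step over many trials.
import random

def run_trial(strategy, n_arms, horizon):
    true_means = [random.betavariate(1, 1) for _ in range(n_arms)]   # sample model from prior
    belief = BetaBelief(n_arms)
    total = 0
    for _ in range(horizon):
        a = strategy(belief, n_arms)
        r = 1 if random.random() < true_means[a] else 0
        belief = belief.update(a, r)
        total += r
    return total / horizon

def average_reward(strategy, n_arms=5, horizon=20, trials=3000):
    return sum(run_trial(strategy, n_arms, horizon) for _ in range(trials)) / trials

# e.g. average_reward(thompson_action) or average_reward(greedy_action)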
Five Bernoulli Bandits

[Plot: average reward per step (y-axis, roughly 0.45 to 0.75) versus horizon (x-axis, 5 to 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]
Five Gaussian Bandits

[Plot: average reward per step (y-axis, roughly 0 to 0.9) versus horizon (x-axis, 5 to 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]