Bayesian Sparse Sampling for On-line Reward Optimization
Presented in the Value of Information Seminar at NIPS 2005
Based on a previous ICML 2005 paper by Dale Schuurmans et al.

Background Perspective
- Be Bayesian about reinforcement learning
- Ideal representation of uncertainty for action selection
- Computational barriers
- Exploration vs. exploitation

Bayes Decision Theory
- Value of information is measured by the ultimate return in reward
- Choose actions to maximize expected value
- The exploration/exploitation tradeoff is handled implicitly, as a side effect
- Bayesian approach: conceptually clean but computationally disastrous, versus conceptually disastrous but computationally clean

Overview
- Efficient lookahead search for Bayesian RL
- "Sparser" sparse sampling
- Controllable computational cost
- Higher quality action selection than current methods:
  - Greedy
  - Epsilon-greedy
  - Boltzmann (Luce 1959)
  - Thompson sampling (Thompson 1933)
  - Bayes optimal (Hee 1978)
  - Interval estimation (Lai 1987; Kaelbling 1994)
  - Myopic value of perfect information (Dearden, Friedman & Andre 1999)
  - Standard sparse sampling (Kearns, Mansour & Ng 2001)
  - Péret & Garcia (2004)

Sequential Decision Making ("Planning")
- Requires the model P(r, s' | s, a)
- How to make an optimal decision?
[Figure: lookahead tree alternating MAX over actions a, a' and expectations E_{r,s'|s,a}, E_{r',s''|s',a'} over outcomes, with values V(s), Q(s,a), V(s'), Q(s',a') at the nodes]
- This is the finite horizon, finite action, finite reward case
- General case, fixed point equations:
    V(s) = sup_a Q(s, a)
    Q(s, a) = E_{r,s'|s,a}[ r + V(s') ]

Reinforcement Learning
- Do not have the model P(r, s' | s, a)
- Therefore cannot compute E_{r,s'|s,a}
[Figure: the same lookahead tree, now with an unknown transition model]

Bayesian Reinforcement Learning
- Place a prior P(θ) on the model P(r, s' | s, a, θ)
- The belief state b = P(θ) becomes part of a meta-level state (s, b)
- Choose actions to maximize long-term reward
[Figure: meta-level tree of actions and rewards over states (s, b), (s', b'), (s'', b'')]
- Meta-level MDP: decision a has outcome (r, s', b') under the meta-level model P(r, s', b' | s, b, a); decision a' has outcome (r', s'', b'') under P(r', s'', b'' | s', b', a')
- We do have a model for meta-level transitions, based on the posterior update and expectations over base-level MDPs (a minimal illustration follows below)
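The slides point out that the meta-level transition model P(r, s', b' | s, b, a) is always available because it follows from the posterior update and the posterior predictive. As a minimal illustration (not code from the paper), here is a hypothetical Beta-Bernoulli belief state for a k-armed bandit; the class and method names are my own.

```python
import numpy as np

class BetaBernoulliBelief:
    """Belief state b = P(theta) for a k-armed Bernoulli bandit (illustrative)."""

    def __init__(self, k, alpha=1.0, beta=1.0):
        # Independent Beta(alpha, beta) priors on each arm's success probability.
        self.alpha = np.full(k, alpha, dtype=float)
        self.beta = np.full(k, beta, dtype=float)

    def predictive(self, a):
        # Posterior predictive P(r = 1 | b, a).
        return self.alpha[a] / (self.alpha[a] + self.beta[a])

    def update(self, a, r):
        # Posterior update b -> b' after observing reward r for action a;
        # this is the known, deterministic part of the meta-level transition.
        new = BetaBernoulliBelief(len(self.alpha))
        new.alpha, new.beta = self.alpha.copy(), self.beta.copy()
        if r == 1:
            new.alpha[a] += 1.0
        else:
            new.beta[a] += 1.0
        return new

    def sample_model(self, rng):
        # Draw a base-level model theta ~ P(theta | b), as in Thompson sampling.
        return rng.beta(self.alpha, self.beta)

    def mean_model(self):
        # Mean base-level model, used by the greedy strategy and the leaf correction.
        return self.alpha / (self.alpha + self.beta)
```

In this setting the update method is deterministic given (a, r), so the only randomness in the meta-level transition comes from the posterior predictive over rewards.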
Bayesian RL Decision Making
- How to make an optimal decision? Solve the planning problem in the meta-level MDP
[Figure: meta-level lookahead tree over V(s b), Q(s b, a), rewards r, and successor values V(s' b')]
- The optimal Q, V values give Bayes optimal action selection
- Problem: the meta-level MDP is much larger than the base-level MDP, so this is impractical

Bayesian RL Decision Making: Current Approximation Strategies
- Consider only the current belief state b
- Greedy approach: current b -> mean base-level MDP model -> point estimate for Q, V -> choose the greedy action. But this doesn't consider uncertainty.
- Thompson approach: current b -> sample a base-level MDP model -> point estimate for Q, V -> choose each action with probability equal to the probability that it is the max-Q action. Exploration is based on uncertainty, but it is still myopic.

Their Approach
- Try to better approximate Bayes optimal action selection by performing lookahead
- Adapt "sparse sampling" (Kearns, Mansour & Ng)
- Make some practical improvements

Sparse Sampling (Kearns, Mansour & Ng 2001)

Bayesian Sparse Sampling: Observation 1
- Action value estimates are not equally important
- Better Q-value estimates are needed for some actions, but not all
- Preferentially expand the tree under actions that might be optimal (biased tree growth)
- Use Thompson sampling to select which actions to expand

Bayesian Sparse Sampling: Observation 2
- Correct leaf value estimates to the same depth
- Use the mean-MDP Q-value multiplied by the remaining depth, Q_hat * (N - t)
[Figure: leaves at depths t = 1, 2, 3 in a tree with effective horizon N = 3]

Bayesian Sparse Sampling: Tree Growing Procedure
- Descend the sparse tree from the root, Thompson-sampling actions and sampling outcomes, until a new node is added
  - Thompson sampling of an action: 1. sample a model from the belief at the current node, 2. solve its action values, 3. select that model's optimal action
- Repeat until the tree size limit is reached; control computation by controlling tree size
- Then execute the chosen root action and observe the reward
[Figure: sparse tree over meta-level nodes (s, b), (s, b, a), (s', b')]
(A code sketch of this procedure appears after the results below.)

Simple Experiments
- Five Bernoulli bandits a1, ..., a5 with Beta priors
- Sample a model from the prior, run each action selection strategy, and repeat 3000 times
- Report the average accumulated reward per step (a sketch of this loop also appears below)
[Figure: Five Bernoulli Bandits. Average reward per step (about 0.45 to 0.75) vs. horizon (5 to 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]
[Figure: Five Gaussian Bandits. Average reward per step (about 0 to 0.9) vs. horizon (5 to 20) for the same strategies]
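To make the tree growing procedure concrete, here is a hedged sketch specialised to Bernoulli bandits. It reuses the hypothetical BetaBernoulliBelief class from the earlier sketch; the Node structure, the attempt cap, and all function names are my own additions rather than the paper's implementation.

```python
import numpy as np

class Node:
    """One node of the sparse lookahead tree: a belief plus its depth."""
    def __init__(self, belief, depth):
        self.belief = belief
        self.depth = depth
        self.children = {}              # (action, sampled reward) -> child Node

def thompson_action(belief, rng):
    # 1. Sample a model from the belief at this node, 2. solve its action
    # values (trivial for a bandit), 3. select that model's optimal action.
    return int(np.argmax(belief.sample_model(rng)))

def grow_tree(root_belief, horizon, max_nodes, rng):
    # Descend from the root, Thompson-sampling actions and sampling outcomes,
    # until a new node is added; repeat until the tree size limit is reached.
    root = Node(root_belief, depth=0)
    n_nodes, attempts = 1, 0
    while n_nodes < max_nodes and attempts < 10 * max_nodes:
        attempts += 1
        node = root
        while node.depth < horizon:
            a = thompson_action(node.belief, rng)
            r = int(rng.random() < node.belief.predictive(a))    # sample outcome
            key = (a, r)
            if key not in node.children:                         # new node added
                node.children[key] = Node(node.belief.update(a, r), node.depth + 1)
                n_nodes += 1
                break
            node = node.children[key]
    return root

def value(node, horizon):
    # Leaf correction (Observation 2): mean-MDP value times remaining depth.
    if not node.children:
        return (horizon - node.depth) * float(np.max(node.belief.mean_model()))
    q = {}
    for (a, r), child in node.children.items():
        q.setdefault(a, []).append(r + value(child, horizon))
    return max(np.mean(returns) for returns in q.values())

def select_action(root_belief, horizon, max_nodes, rng):
    # Grow the tree, back up Q-values at the root, and return the best action.
    root = grow_tree(root_belief, horizon, max_nodes, rng)
    q = {}
    for (a, r), child in root.children.items():
        q.setdefault(a, []).append(r + value(child, horizon))
    if not q:   # budget too small to expand the root: fall back to the mean model
        return int(np.argmax(root_belief.mean_model()))
    return max(q, key=lambda a: float(np.mean(q[a])))
```

In this sketch max_nodes is the tree size limit, so it directly controls the computational cost, which is the sense in which this is a "sparser" sparse sampling.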
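For completeness, a hedged sketch of the experiment loop from the Simple Experiments slide, again relying on the hypothetical helpers above. The number of repeats is reduced from the 3000 used on the slides, and only the Bayesian sparse sampling strategy is shown.

```python
import numpy as np

def run_trial(select, horizon, k=5, rng=None):
    # One trial: draw a true bandit from the Beta(1, 1) prior, act on-line for
    # `horizon` steps with the given selection strategy, and return the
    # average reward per step.
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.beta(np.ones(k), np.ones(k))      # true model sampled from the prior
    belief = BetaBernoulliBelief(k)               # hypothetical class sketched earlier
    total = 0.0
    for _ in range(horizon):
        a = select(belief, rng)
        r = int(rng.random() < theta[a])
        total += r
        belief = belief.update(a, r)
    return total / horizon

def bayes_sparse(belief, rng, horizon=10, max_nodes=50):
    # Bayesian sparse sampling selection (hypothetical helper sketched above);
    # the lookahead horizon is kept fixed here for simplicity.
    return select_action(belief, horizon, max_nodes, rng)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = [run_trial(bayes_sparse, horizon=10, rng=rng) for _ in range(300)]
    print("average reward per step:", float(np.mean(rewards)))
```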