Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning
Bachelor presentation
Marcel Neidinger <[email protected]>
Department of Mathematics and Computer Science, University of Basel
13. February 2017

What is Probabilistic Planning?
- Solve planning tasks with probabilistic transitions
- Modelled as a Markov Decision Problem M = ⟨V, s_0, A, T, R⟩ with:
  - a set of binary variables V, inducing the states S = 2^V
  - an initial state s_0 ∈ S
  - a set of applicable actions A
  - a transition model T : S × A × S → [0, 1]
  - a reward function R(s, a)
- Monte Carlo Tree Search algorithms solve MDPs

Monte Carlo Tree Search Algorithms
- Algorithmic framework to solve MDPs
- Used especially in computer Go
[Images: a Go board (1); Lee Sedol (2)]
(1) Source: https://commons.wikimedia.org/wiki/File:Go_board.jpg
(2) Source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/

Four phases - Two components
[Diagram: the four MCTS phases - selection, expansion, simulation, backpropagation - split between the two policy components]

Monte Carlo Tree node
- MCTS tree for an MDP M
- Important information in a tree node:
  - a state s ∈ S
  - a counter N^(i) for the number of visits
  - a counter N^(i)(s, a), ∀a ∈ A, for the number of times a was selected in s
  - a reward estimate Q^(i)(s, a) for action a in state s
- (A minimal code sketch of this bookkeeping follows after the overview below.)

Online Knowledge
- AlphaGo used neural networks for the two policies → domain-specific knowledge
- We want domain-independent enhancements

Overview
- Tree-Policy Enhancements
  - All Moves as First: α-AMAF, Cutoff-AMAF
  - Rapid Action Value Estimation
- Default-Policy Enhancements
  - Move-Average Sampling Technique
- Conclusion
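As a concrete illustration of the node contents above, here is a minimal Python sketch. It is not PROST's actual implementation; the class and method names are hypothetical, and the backup rule is the usual running-average update.

```python
from collections import defaultdict

class MCTSNode:
    """One search-tree node for an MDP state (illustrative sketch)."""

    def __init__(self, state):
        self.state = state                      # s ∈ S
        self.visits = 0                         # N(i): visits to this node
        self.action_visits = defaultdict(int)   # N(i)(s, a) per action
        self.q_value = defaultdict(float)       # Q(i)(s, a) per action
        self.children = {}                      # action -> successor node

    def backup(self, action, reward):
        """Backpropagation: update counters and the running-average reward."""
        self.visits += 1
        self.action_visits[action] += 1
        n = self.action_visits[action]
        self.q_value[action] += (reward - self.q_value[action]) / n
```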
What is a Tree Policy?
- Iterate through the known part of the tree and select an action given a node
- Use a Q value for a state-action pair to estimate an action's reward

UCT
- MCTS implementation first proposed in 2006
[Tree diagram: states s_1 to s_5 connected by actions m, m′, m′′; the sampled trial ends with reward 10]

UCT
Reward approximation, for parent node v_l and child node v_j:

    UCT(v_l, v_j) = Q^{(i)}(s_l, a_j) + 2 C_p \sqrt{\frac{2 \ln N^{(i)}(s_l)}{N^{(i+1)}(s_j)}}    (1)

From parent v_l, select the child node v^* that maximises the score:

    v^* = \arg\max_{v_j} UCT(v_l, v_j)    (2)

(A code sketch of this selection rule, together with the AMAF variants below, follows after the Cutoff-AMAF slides.)

All Moves as First - Idea
- The UCT score needs several trials to become reliable
- Idea: generalize the information extracted from trials
- Implementation: use an additional (node-independent) score that updates unselected actions as well
[Example: a trial through s_1, s_2, s_5 with reward 10, next to a table of (state, action, reward) entries that AMAF updates]

All Moves as First - α-AMAF
- Idea: combine the UCT and AMAF scores

    SCR = \alpha \cdot AMAF + (1 - \alpha) \cdot UCT    (3)

- Choose the action with the highest SCR

All Moves as First - α-AMAF - Results
[Bar chart: IPPC score per benchmark domain and in total, for α-AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]

All Moves as First - α-AMAF - Problems
- With more trials, UCT becomes more reliable
- The AMAF score has higher variance
- We want to discontinue using the AMAF score after some time

All Moves as First - Cutoff-AMAF
- Introduce a cutoff parameter K:

    SCR = \begin{cases} \alpha \cdot AMAF + (1 - \alpha) \cdot UCT & \text{for } i \le K \\ UCT & \text{else} \end{cases}    (4)

- Use the AMAF score only in the first K trials

All Moves as First - Cutoff-AMAF - Results
[Plot: total IPPC score against K ∈ [0, 50] (init: IDS, backup: MC), with raw UCT and plain α-AMAF as baselines]

All Moves as First - Cutoff-AMAF - Problems
- How to choose the parameter K?
- When is the UCT score reliable enough?
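The sketch below combines equations (1)-(4): a UCT score per child, blended with an AMAF score during the first K trials, and an argmax selection as in equation (2). All names are hypothetical; the defaults α = 0.2 and K = 25 are merely placeholder values from the evaluated ranges, and C_p = 1/√2 is the common textbook choice, which the talk does not fix.

```python
import math

def uct_score(q, n_parent, n_child, cp=1 / math.sqrt(2)):
    """Equation (1): reward estimate plus exploration bonus."""
    if n_child == 0:
        return float("inf")              # try unvisited actions first
    return q + 2 * cp * math.sqrt(2 * math.log(n_parent) / n_child)

def blended_score(q, amaf, n_parent, n_child, trial, alpha=0.2, k=25):
    """Equations (3)/(4): mix the AMAF score in for the first K trials only."""
    uct = uct_score(q, n_parent, n_child)
    return alpha * amaf + (1 - alpha) * uct if trial <= k else uct

def select_action(q_values, amaf_values, action_visits, n_parent, trial):
    """Equation (2): pick the action whose (blended) score is maximal."""
    return max(q_values, key=lambda a: blended_score(
        q_values[a], amaf_values[a], n_parent, action_visits[a], trial))
```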
Rapid Action Value Estimation - Idea
- First introduced in 2007 for computer Go
- Use a soft cutoff:

    \alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}    (5)

- Use UCT for often-visited nodes and the AMAF score for less-visited ones
  (a code sketch follows after the PRAVE slides below)

Rapid Action Value Estimation - Results
[Bar chart: IPPC score per benchmark domain and in total for UCT and RAVE with V ∈ {5, 15, 25, 50}]

All Moves as First - Conclusion
[Bar chart: IPPC score per domain for UCT, RAVE(25) and AMAF(α = 0.2)]

Rapid Action Value Estimation - Problems
- PROST uses a problem description with conditional effects and without preconditions
- The PROST description is therefore more general
[Grid-world figure: player, goal field, move path]
- In PROST the action is move_up; in e.g. computer chess it is move_a2_to_a3
- The same action name thus occurs in many unrelated states, which weakens AMAF-style generalization

Predicate Rapid Action Value Estimation
- A state has predicates that give some context
- Idea: use predicates to find similar states and use their score

    Q_{PRAVE}(s, a) = \frac{1}{N} \sum_{p \in P} Q_{RAVE}(p, a)    (6)

- and weight it with

    \alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}    (7)

All Moves as First - Conclusion - Revisited
[Bar chart: IPPC score per domain for UCT, PRAVE, RAVE(25) and AMAF(α = 0.2)]
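A compact sketch of the RAVE and PRAVE scores, under the assumptions that v(n) counts node visits, that V is the RAVE parameter (25 below, the value used in the charts), that N in equation (6) is the number of predicates |P|, and that per-predicate RAVE scores live in a dictionary keyed by (predicate, action). All names are hypothetical.

```python
def rave_weight(visits, v=25):
    """Equations (5)/(7): full AMAF weight for fresh nodes, fading to 0
    (pure UCT) once a node has been visited V times."""
    return max(0.0, (v - visits) / v)

def rave_score(uct, amaf, visits, v=25):
    """Soft cutoff: blend per node, rather than per trial as Cutoff-AMAF does."""
    alpha = rave_weight(visits, v)
    return alpha * amaf + (1 - alpha) * uct

def prave_q(predicates, q_rave, action):
    """Equation (6): average the RAVE score over the state's predicates."""
    return sum(q_rave[(p, action)] for p in predicates) / len(predicates)
```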
Overview
- Tree-Policy Enhancements
  - All Moves as First: α-AMAF, Cutoff-AMAF
  - Rapid Action Value Estimation
- Default-Policy Enhancements
  - Move-Average Sampling Technique
- Conclusion

What is a Default Policy?
[Diagram: the simulation phase of MCTS]
- Simulate the outcome of a trial
- Basic default policy: random walk

X-Average Sampling Technique
- Use tree knowledge to bias the default policy towards moves that are more goal-oriented

Move-Average Sampling Technique - Idea
[Sample game figure: player, goal field, move path]
- Introduce Q(a)
- Use moves that are good on average
- Choose the action according to

    P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b \in A} e^{Q(b)/\tau}}    (8)

  (a runnable sketch of this sampling rule follows after the last slide)

Move-Average Sampling Technique - Idea - Example
- Actions r, r, u, u, u: Q(r) = 1, N(r) = 2; Q(u) = 6, N(u) = 3
- Actions r, r, u, l, l: Q(r) = 2, N(r) = 4; Q(u) = 7, N(u) = 4; Q(l) = 3, N(l) = 2

Move-Average Sampling Technique - Idea - Example (2)
- Actions l, u, u, r, r: Q(r) = 7, N(r) = 6; Q(u) = 8, N(u) = 6; Q(l) = 2, N(l) = 3
- Actions r, r, r, u, u: Q(r) = 7, N(r) = 9; Q(u) = 9, N(u) = 8; Q(l) = 2, N(l) = 3

Move-Average Sampling Technique - Results
[Bar chart: IPPC score per benchmark domain for UCT(RandomWalk) and UCT(MAST)]

Overview
- Tree-Policy Enhancements
  - All Moves as First: α-AMAF, Cutoff-AMAF
  - Rapid Action Value Estimation
- Default-Policy Enhancements
  - Move-Average Sampling Technique
- Conclusion

Conclusion
- Tree-policy enhancements:
  - α-AMAF and RAVE perform worse than standard UCT
  - PRAVE performs slightly better, but still worse than standard UCT
- Default-policy enhancements:
  - MAST outperforms RandomWalk

Questions?
[email protected]
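As referenced at the MAST slides, here is a minimal runnable sketch of the Gibbs (Boltzmann) sampling rule in equation (8). The function names are hypothetical and τ = 2 is an arbitrary illustrative temperature.

```python
import math
import random

def mast_sample(actions, q, tau=1.0):
    """Equation (8): sample an action with probability proportional
    to exp(Q(a) / tau), favouring moves that are good on average."""
    weights = [math.exp(q[a] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# With the Q values from the first example slide (Q(r) = 1, Q(u) = 6),
# the default policy strongly prefers "u" but still occasionally tries "r".
print(mast_sample(["r", "u"], {"r": 1.0, "u": 6.0}, tau=2.0))
```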