
Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning
Bachelor's thesis presentation
Marcel Neidinger
<[email protected]>
Department of Mathematics and Computer Science,
University of Basel
13 February 2017
What is Probabilistic Planning?
Solve planning tasks with probabilistic transitions
Modelled as a Markov Decision Process (MDP) given by
M = ⟨V, s0 , A, T, R⟩
A set of binary variables V inducing the set of states S = 2^V
An initial state s0 ∈ S
A set of applicable actions A
A transition model T : S × A × S → [0; 1]
A reward function R(s, a)
Monte Carlo Tree Search algorithms solve MDPs
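As a rough illustration, the tuple can be written down as a small Python structure; every name below is a placeholder for this sketch, not PROST's actual API:

```python
# Hedged sketch of the MDP tuple M = <V, s0, A, T, R>; names are
# illustrative only.
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

State = FrozenSet[str]  # a state = the set of binary variables that are true

@dataclass
class MDP:
    variables: List[str]                               # V
    initial_state: State                               # s0
    actions: List[str]                                 # A
    transition: Callable[[State, str, State], float]   # T(s, a, s') in [0, 1]
    reward: Callable[[State, str], float]              # R(s, a)
```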
Monte Carlo Tree Search Algorithms
Algorithmic framework to solve MDPs
Used especially in computer Go
[Images: a Go board¹ and Lee Sedol²]
¹ Source: https://commons.wikimedia.org/wiki/File:Go_board.jpg
² Source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/
Four phases - Two components
Selection (tree policy)
Expansion (tree policy)
Simulation (default policy)
Backpropagation
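The loop below sketches one trial through these four phases; the Node interface and the two policies are assumptions made for illustration, not the thesis' actual implementation:

```python
# One MCTS trial (sketch): selection, expansion, simulation,
# backpropagation. Node methods are assumed to exist as named.
def mcts_trial(root, tree_policy, default_policy):
    path = [root]
    node = root
    # Selection: descend through the known tree via the tree policy
    while node.is_fully_expanded() and not node.is_terminal():
        node = tree_policy(node)
        path.append(node)
    # Expansion: attach one previously unvisited child
    if not node.is_terminal():
        node = node.expand()
        path.append(node)
    # Simulation: estimate the trial's reward with the default policy
    reward = default_policy(node.state)
    # Backpropagation: update statistics along the visited path
    for n in path:
        n.update(reward)
    return reward
```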
Monte Carlo Tree node
MCTS tree for an MDP M
Important information in a tree node
A state s ∈ S
A counter N^(i)(s) for the number of visits
A counter N^(i)(s, a) for each a ∈ A for the number of times a was selected in s
A reward estimate Q^(i)(s, a) for action a in state s
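A node carrying this information could look like the following sketch (field and method names are assumptions, not PROST's real data structures):

```python
# Illustrative MCTS tree node for an MDP.
from collections import defaultdict

class Node:
    def __init__(self, state):
        self.state = state                      # s in S
        self.children = {}                      # action -> child Node
        self.visits = 0                         # N(s): visits to this node
        self.action_visits = defaultdict(int)   # N(s, a) for every a in A
        self.q_value = defaultdict(float)       # Q(s, a): reward estimate

    def update(self, reward, action=None):
        """Backpropagation step: bump counters, update running average."""
        self.visits += 1
        if action is not None:
            self.action_visits[action] += 1
            n = self.action_visits[action]
            self.q_value[action] += (reward - self.q_value[action]) / n
```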
Online Knowledge
AlphaGo used neural networks for the two policies → domain-specific knowledge
We want domain-independent enhancements
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
What is a Tree Policy?
Traverse the known part of the tree, selecting an action at each node
Use the Q value of a state-action pair to estimate an action's reward
UCT
MCTS implementation first proposed in 2006
[Tree diagram: from root s1, actions m, m′, m′′ lead to states s2, s3, s5; from s3, action m′ leads to s4; the sampled trial yields reward 10]
UCT
Reward approximation for parent node v_l and child node v_j:

UCT(v_l, v_j) = Q^(i)(s_l, a_j) + 2 C_p √( 2 ln N^(i)(s_l) / N^(i+1)(s_j) )    (1)

From parent v_l, select the child node v* that maximises the UCT score:

v* = argmax_{v_j} UCT(v_l, v_j)    (2)
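A hedged sketch of this selection rule, reusing the node fields assumed earlier (children is taken to be a mapping from actions to child nodes):

```python
import math

def uct_select(parent, cp=1.0):
    """Return the child maximising the UCT score of Eq. (1)."""
    def uct(action, child):
        exploit = parent.q_value[action]
        explore = 2 * cp * math.sqrt(
            2 * math.log(parent.visits) / child.visits)
        return exploit + explore
    _, best_child = max(parent.children.items(),
                        key=lambda item: uct(item[0], item[1]))
    return best_child
```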
All Moves as First - Idea
UCT score needs several trials to become reliable
Idea: Generalize information extracted from trials
Implementation: Use an additional (node-independent) score that also updates unselected actions
[Figure: the example tree from the UCT slide, next to a table with columns State (s1, s1, s2, s5), Action (m, m′, m′′, …) and Reward; the trial reward 10 is credited to every recorded (state, action) pair]
All Moves as First - α-AMAF
Idea: Combine UCT and AMAF score
SCR = α · AMAF + (1 − α) · UCT    (3)

Choose the action with the highest SCR
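As a sketch, the blend is a one-liner; α is a tuning parameter:

```python
# alpha-AMAF (Eq. 3): fixed-weight mix of the trial-wide AMAF score
# and the node-local UCT score.
def alpha_amaf_score(amaf_score, uct_score, alpha=0.2):
    return alpha * amaf_score + (1 - alpha) * uct_score
```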
All Moves as First - α-AMAF - Results
[Bar chart: IPPC score (0–1) per benchmark domain (total, navigation, skill teaching, crossing traffic, game of life, recon, sysadmin, tamarisk, elevators, academic advising, triangle tireworld, wildfire) for AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]
All Moves as First - α-AMAF - Problems
With more trials, the UCT score becomes more reliable
The AMAF score has a higher variance
We want to stop using the AMAF score after some time
All Moves as First - Cutoff-AMAF
Introduce cutoff parameter K
SCR = { α · AMAF + (1 − α) · UCT   for i ≤ K
      { UCT                        else              (4)

Use the AMAF score only in the first K trials
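A minimal sketch of the cutoff rule, with K and α as assumed tuning parameters:

```python
# Cutoff-AMAF (Eq. 4): blend only during the first K trials, then
# fall back to pure UCT.
def cutoff_amaf_score(amaf_score, uct_score, trial, K=30, alpha=0.2):
    if trial <= K:
        return alpha * amaf_score + (1 - alpha) * uct_score
    return uct_score
```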
All Moves as First - Cutoff-AMAF - Results
[Line plot: total IPPC score (0.5–0.75) as a function of K ∈ [0, 50] for Cutoff-AMAF (init: IDS, backup: MC), with horizontal reference lines for raw UCT and plain α-AMAF]
All Moves as First - Cutoff-AMAF - Problems
How to choose the parameter K?
When is the UCT score reliable enough?
Rapid Action Value Estimation - Idea
First introduced in 2007 for computer Go
Use a soft cutoff:

α = max{ 0, (V − v(n)) / V }    (5)
Use UCT for often-visited nodes and the AMAF score for less-visited ones
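A sketch of this soft cutoff; V is the equivalence parameter and v(n) the node's visit count:

```python
# RAVE weight (Eq. 5): alpha decays linearly with the visit count and
# reaches 0 once the node has been visited V times.
def rave_alpha(visits, V=25):
    return max(0.0, (V - visits) / V)

def rave_score(amaf_score, uct_score, visits, V=25):
    a = rave_alpha(visits, V)
    return a * amaf_score + (1 - a) * uct_score
```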
Rapid Action Value Estimation - Results
[Bar chart: IPPC score (0–1) over the same benchmark domains for UCT and RAVE(V) with V ∈ {5, 15, 25, 50}]
All Moves as First - Conclusion
[Bar chart: IPPC score (0–1) per benchmark domain comparing UCT, RAVE(25), and AMAF(α = 0.2)]
Rapid Action Value Estimation - Problems
PROST uses a problem description with conditional effects
No action preconditions are given
The PROST description is therefore more general
[Figure: grid world with player, goal field, and move path]
In PROST: action move_up
In e.g. computer chess: action move_a2_to_a3
Predicate Rapid Action Value Estimation
A state has predicates that give some context
Idea: Use predicates to find similar states and use their scores
Q_PRAVE(s, a) = (1/N) · Σ_{p ∈ P} Q_RAVE(p, a)    (6)
and weight with

α = max{ 0, (V − v(n)) / V }    (7)
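A hedged sketch of the predicate average from Eq. (6), assuming N = |P| and a lookup table q_rave keyed by (predicate, action):

```python
# P-RAVE (Eq. 6): average the RAVE scores of all predicates p that
# hold in state s; q_rave is an assumed (predicate, action) -> score map.
def prave_score(predicates, action, q_rave):
    if not predicates:
        return 0.0
    return sum(q_rave[(p, action)] for p in predicates) / len(predicates)
```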
All Moves as First - Conclusion - Revisited
[Bar chart: IPPC score (0–1) per benchmark domain comparing UCT, PRAVE, RAVE(25), and AMAF(α = 0.2)]
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
What is a Default Policy?
[Figure: the four MCTS phases with the simulation phase highlighted]
Simulate the outcome of a trial
Basic default policy: random walk
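A minimal random-walk sketch, assuming helper callbacks for the applicable actions and for sampling a transition and reward:

```python
import random

# Random-walk default policy: pick applicable actions uniformly at
# random up to a fixed horizon and accumulate the sampled rewards.
def random_walk(state, applicable_actions, step, horizon=40):
    total_reward = 0.0
    for _ in range(horizon):
        action = random.choice(applicable_actions(state))
        state, reward = step(state, action)  # samples T(s, a, .) and R(s, a)
        total_reward += reward
    return total_reward
```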
X-Average Sampling Technique
Use tree knowledge to bias the default policy towards moves that are more goal-oriented
Move-Average Sampling Technique - Idea
Introduce a state-independent action value Q(a)
[Sample game: grid world with player, goal field, and move path]
Use moves that are good on average
Choose the action according to:
P(a) = e^(Q(a)/τ) / Σ_{b ∈ A} e^(Q(b)/τ)    (8)
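A sketch of this Gibbs sampling; τ is the temperature (small τ exploits high-Q moves, large τ keeps the walk closer to random):

```python
import math
import random

# MAST action sampling (Eq. 8): draw an action with probability
# proportional to exp(Q(a) / tau).
def mast_sample(actions, Q, tau=1.0):
    weights = [math.exp(Q[a] / tau) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```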
Move-Average Sampling Technique - Idea Example
Actions:
r,r,u,u,u
Q(r) = 1; N (r) = 2
Q(u) = 6; N (u) = 3
Actions:
r,r,u,l,l
Q(r) = 2; N (r) = 4
Q(u) = 7; N (u) = 4
Q(l) = 3; N (l) = 2
Move-Average Sampling Technique - Idea Example (2)
Actions:
l,u,u,r,r
Q(r) = 7; N (r) = 6
Q(u) = 8; N (u) = 6
Q(l) = 2; N (l) = 3
Actions:
r,r,r,u,u
Q(r) = 7; N (r) = 9
Q(u) = 9; N (u) = 8
Q(l) = 2; N (l) = 3
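Plugging these final values into Eq. (8) gives a quick sanity check of the resulting move distribution (τ = 1 is an assumed temperature):

```python
import math

Q = {'r': 7, 'u': 9, 'l': 2}   # values after the last example trial
tau = 1.0                      # assumed temperature
Z = sum(math.exp(q / tau) for q in Q.values())
P = {a: math.exp(q / tau) / Z for a, q in Q.items()}
# P(u) ~ 0.88, P(r) ~ 0.12, P(l) ~ 0.001: the walk now favours u and r.
```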
Move-Average Sampling Technique - Results
[Bar chart: IPPC score (0–0.5) per benchmark domain comparing UCT(RandomWalk) and UCT(MAST)]
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
Conclusion
Tree-policy enhancements
α-AMAF and RAVE perform worse than standard UCT
PRAVE performs slightly better but is still worse than standard UCT
Default-policy enhancements
MAST outperforms RandomWalk
Questions?
[email protected]