
Probabilistic Planning via
Determinization in Hindsight
FF-Hindsight
Sungwook Yoon
Joint work with
Alan Fern, Bob Givan and Rao Kambhampati
Sungwook Yoon – Probabilistic Planning via Determinization
Probabilistic Planning Competition
Client: participants send actions
Server: the competition host simulates the actions
The Winner was ……
• FF-Replan
– A replanner that uses FF
– The probabilistic domain is determinized
• Interesting contrast
– Many probabilistic planning techniques
• Work in theory but not in practice
– FF-Replan
• No theory
• Works in practice
The Paper’s Objective
A better determinization approach
(determinization in hindsight)
Theoretical analysis of the new
determinization (in hindsight)
A new view of FF-Replan
Experimental studies with determinization in
hindsight (FF-Hindsight)
Probabilistic Planning
(goal-oriented)
[Diagram: a two-step search tree from initial state I. Each action (A1, A2) has probabilistic outcomes at Time 1 and Time 2; left outcomes are more likely. Leaves include dead ends and goal states. Objective: maximize goal achievement.]
All Outcome Replanning (FFRA)
ICAPS-07
[Diagram: a probabilistic action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into two deterministic actions: Action 1 producing Effect 1 and Action 2 producing Effect 2.]
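The FFRA translation can be sketched in code. A minimal sketch, assuming a toy representation in which each probabilistic action maps to a list of (probability, effect) outcome pairs; the real planners operate on PPDDL, so the names and structures here are hypothetical:

```python
def all_outcome_determinize(actions):
    """All-outcome determinization (FFRA-style): split each
    probabilistic action into one deterministic action per outcome,
    discarding the outcome probabilities."""
    det = {}
    for name, outcomes in actions.items():
        for i, (_prob, effect) in enumerate(outcomes, start=1):
            det[f"{name}-{i}"] = effect  # probability is dropped
    return det

# A probabilistic action with two effects becomes two deterministic ones.
prob_actions = {"Action": [(0.7, "Effect 1"), (0.3, "Effect 2")]}
print(all_outcome_determinize(prob_actions))
# {'Action-1': 'Effect 1', 'Action-2': 'Effect 2'}
```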
Probabilistic Planning
All Outcome Determinization
[Diagram: the same search tree after all-outcome determinization. Each probabilistic action A1, A2 is replaced by its deterministic outcome-actions A1-1, A1-2, A2-1, A2-2, giving a purely deterministic tree in which the planner searches for any path to a goal state (Find Goal); dead ends and goal states marked.]
Problem of FF-Replan and a
better sampling alternative
FF-Replan’s static determinizations don’t
respect the outcome probabilities.
We need probabilistic and dynamic
determinization:
sample future outcomes and
determinize in hindsight.
Each sampled future becomes a
known-future deterministic problem.
Probabilistic Planning
(goal-oriented)
[Diagram: the original probabilistic search tree again (left outcomes are more likely), as the starting point for sampling. Objective: maximize goal achievement.]
Start Sampling
Note: sampling will reveal which action is better at state I, A1 or A2.
Hindsight Sample 1
[Diagram: the first sampled future of the tree (left outcomes are more likely); outcomes are fixed, so the future is a deterministic problem.]
Tally — A1: 1, A2: 0
Hindsight Sample 2
[Diagram: a second sampled future of the tree.]
Tally — A1: 2, A2: 1
Hindsight Sample 3
[Diagram: a third sampled future of the tree.]
Tally — A1: 2, A2: 1
Hindsight Sample 4
[Diagram: a fourth sampled future of the tree.]
Tally — A1: 3, A2: 1
Summary of the Idea:
The Decision Process
(Estimating the Q-Value, Q(s,a))
S: current state; A(S) → S’
1. For each action A, draw future samples
— each sample is a deterministic planning problem.
2. Solve the deterministic problems
— for goal-oriented problems, the solution length gives Q(s,A).
3. Aggregate the solutions for each action.
4. Select the action with the best aggregation: argmax_A Q(s,A).
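The steps above can be sketched as follows. A minimal sketch, assuming two hypothetical helpers: `sample_future(state)` draws one determinized future, and `solve(state, action, future)` returns a plan length when the goal is reachable under that future with `action` applied first, else `None` (the paper uses FF for this step). Aggregation here is the goal-achievement count:

```python
import random

def hindsight_action(state, actions, sample_future, solve, width):
    """Estimate Q(s, a) per action by sampling `width` futures and
    solving the resulting known-future deterministic problems."""
    q = {}
    for a in actions:
        solved = 0
        for _ in range(width):
            future = sample_future(state)       # step 1: draw a future
            plan_len = solve(state, a, future)  # step 2: deterministic solve
            if plan_len is not None:
                solved += 1                     # step 3: aggregate per action
        q[a] = solved
    best = max(q.values())                      # step 4: best aggregation,
    return random.choice([a for a in actions if q[a] == best])  # random tie-break
```

The random tie-break on the last line matters; the action-selection example later in the deck shows why.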
Mathematical Summary
of the Algorithm
• An H-horizon future FH for M = [S,A,T,R]
– maps state, action, and time (h < H) to a state: S×A×h → S
– each future is a deterministic problem
• Value of a policy π for FH: R(s,FH,π)
• VHS(s,H) = EFH [ maxπ R(s,FH,π) ] — the inner maximization is done by FF
• Compare this with the real value:
• V*(s,H) = maxπ EFH [ R(s,FH,π) ]
• VFFRa(s) = maxF V(s,F) ≥ VHS(s,H) ≥ V*(s,H)
• Q(s,a,H) = R(a) + EFH-1 [ maxπ R(a(s),FH-1,π) ]
– in our proposal, maxπ R(a(s),FH-1,π) is computed
approximately by FF [Hoffmann and Nebel ’01]
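Why VHS upper-bounds V*: swapping the maximization and the expectation can only increase the value. A short derivation of the slide's inequality:

```latex
\[
V^{*}(s,H) \;=\; \max_{\pi}\, \mathbb{E}_{F_H}\!\bigl[R(s,F_H,\pi)\bigr]
\;\le\; \mathbb{E}_{F_H}\!\Bigl[\max_{\pi} R(s,F_H,\pi)\Bigr]
\;=\; V_{HS}(s,H),
\]
since for every fixed policy $\pi'$ and every future $F_H$ we have
$R(s,F_H,\pi') \le \max_{\pi} R(s,F_H,\pi)$; taking expectations over
$F_H$ and then maximizing over $\pi'$ preserves the inequality.
```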
Key Technical Results
The importance of independent sampling across states, actions, and time
The necessity of random tie-breaking in decision making
We characterize FF-Replan in terms of hindsight
decision making: VFFRa(s) = maxF V(s,F)
Theorem 1
If there is a policy that achieves the goal with probability 1
within the horizon, the hindsight decision-making algorithm reaches
the goal with probability 1.
Theorem 2
A polynomial number of samples suffices with respect to
the horizon, the number of actions, and the minimum Q-value advantage.
Empirical Results
IPPC-04 Problems
(numbers are solved trials)

Problem          FFRa  FF-Hindsight
Blocksworld      270   158
Boxworld         150   100
Fileworld        29    14
R-Tireworld      30    30
ZenoTravel       30    0
Exploding BW     5     28
G-Tireworld      7     18
Tower of Hanoi   11    17

For ZenoTravel, using importance sampling improved FF-Hindsight’s solved trials to 26.
Empirical Results

Planner    Climber  River  Bus-Fare  Tire1  Tire2  Tire3  Tire4  Tire5  Tire6
FFRa       60%      65%    1%        50%    0%     0%     0%     0%     0%
Paragraph  100%     65%    100%      100%   100%   100%   3%     1%     0%
FPG        100%     65%    22%       100%   92%    60%    35%    19%    13%
FF-HS      100%     65%    100%      100%   100%   100%   100%   100%   100%

These domains were developed specifically to beat FF-Replan.
As expected, FF-Replan did poorly, but FF-Hindsight did very well,
showing probabilistic reasoning ability while retaining scalability.
Conclusion
[Diagram: determinization carries the scalability of deterministic planning over to probabilistic planning — classical planning to Markov decision processes, machine learning for planning to machine learning for MDPs, temporal planning to temporal MDPs, plus net-benefit optimization.]
Conclusion
• Devised an algorithm that brings the significant advances in
deterministic planning to probabilistic planning
• Made many deterministic planning techniques available to
probabilistic planning
– Most learning-for-planning techniques were developed solely
for deterministic planning
• Now these techniques are relevant to probabilistic planning too
– Advanced net-benefit planners can be used for reward-maximization-style
probabilistic planning problems
Discussion
• Mercier and Van Hentenryck analyzed the difference between
– V*(s,H) = maxπ EFH [ R(s,FH,π) ]
– VHS(s,H) = EFH [ maxπ R(s,FH,π) ]
• Ng and Jordan analyzed the difference between
– V*(s,H) = maxπ EFH [ R(s,FH,π) ]
– V^(s,H) = maxπ (1/m) Σi R(s,FHi,π), where m is the
number of sampled futures FHi
IPPC-2004 Results
Winner of IPPC-04: FFRs. Numbers: successful runs.
[Annotations: human control knowledge; learned knowledge; 2nd-place winners.]

Problem    NMRC  J1   Classy  NMR  mGPT  C   FFRs  FFRA
BW         252   270  255     30   120   30  210   270
Box        134   150  100     0    30    0   150   150
File       -     -    -       3    30    3   14    29
Zeno       -     -    -       30   30    30  0     30
Tire-r     -     -    -       30   30    30  30    30
Tire-g     -     -    -       9    16    30  7     7
TOH        -     -    -       15   0     0   0     11
Exploding  -     -    -       0    0     0   3     5

NMR: Non-Markovian Reward Decision Process Planner
Classy: Approximate Policy Iteration with a Policy Language Bias
mGPT: Heuristic Search Probabilistic Planning
C: Symbolic Heuristic Search
IPPC-2006 Results
Unofficial winner of IPPC-06: FFRa. Numbers: percentage of successful runs.

Problem     Paragraph  FFRs  FFRa  FPG  FOALP  sfDP
BW          86         63    100   29   0      77
Zenotravel  100        27    0     7    7      7
Random      100        65    0     0    5      73
Elevator    93         76    100   0    0      93
Exploding   52         43    24    31   31     52
Drive       71         56    0     0    9      0
Schedule    51         54    0     0    1      0
PitchCatch  54         23    0     0    0      0
Tire        82         75    82    0    91     69

FPG: Factored Policy Gradient Planner
FOALP: First Order Approximate Linear Programming
sfDP: Symbolic Stochastic Focused Dynamic Programming with Decision Diagrams
Paragraph: a Graphplan-based probabilistic planner
Sampling Problem
Time-dependency issue
[Diagram: a small MDP with states Start, S1, S2, S3, Goal, and Dead End; actions A and B lead from Start, and outcomes C and D occur with probabilities p and 1-p, reversed between the branches.]
Sampling Problem
Time-dependency issue
[Diagram: the same MDP with the sampled outcomes fixed.]
S3 is a worse state than S1, but under shared samples it looks as if there is always a path to the Goal.
We need to sample independently across actions.
Action Selection Problem
Random tie-breaking is essential
[Diagram: in Start, action A always stays in Start; actions B and C move toward the Goal with probability p and fail with probability 1-p.]
In the Start state, action C is definitely better, but A can be used to
wait until C’s to-the-Goal effect is realized.
Sampling Problem
Importance Sampling (IS)
[Diagram: from Start, B leads to S1 with very high probability and to the Goal with extremely low probability.]
- Sampling uniformly would find the problem unsolvable.
- Use importance sampling.
- Identifying the regions that need importance sampling is left for further study.
- In the benchmarks, Zenotravel needs the IS idea.
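The IS fix can be illustrated with a toy estimator. A minimal sketch (a hypothetical setup, not the paper's implementation): the rare "B reaches the Goal" outcome has true probability `p_rare`; we sample it from a proposal that makes it common and reweight each draw by the likelihood ratio p/q, which keeps the estimate unbiased:

```python
import random

def is_success_estimate(p_rare, q_proposal, trials, rng):
    """Estimate the probability of a rare outcome by sampling from a
    proposal distribution and reweighting with p/q (importance sampling)."""
    total = 0.0
    for _ in range(trials):
        success = rng.random() < q_proposal  # draw from the proposal
        if success:
            total += p_rare / q_proposal     # likelihood ratio * indicator
    return total / trials

rng = random.Random(0)
# True probability 1e-4; the proposal samples the rare outcome half the
# time, so successes are plentiful and the estimate concentrates near 1e-4.
est = is_success_estimate(1e-4, 0.5, 10_000, rng)
```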
Theoretical Results
• Theorem 1
– For goal-achieving probabilistic planning problems, if there is a
policy that solves the problem with probability 1 within a bounded
horizon, then hindsight planning solves it with probability 1. If no
such policy exists, hindsight planning returns a success ratio less than 1:
– if there is a future in which no plan achieves the goal, that
future can be sampled.
• Theorem 2
– The number of future samples w needed to correctly identify the
best action satisfies
– w > 4Δ⁻²T ln(|A|H/δ)
– Δ: the minimum Q-advantage of the best action over the other
actions; δ: the confidence parameter
– Follows from the Chernoff bound.
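Plugging numbers into Theorem 2's bound makes the sample requirement concrete. A minimal sketch; the formula is reconstructed from a garbled slide line, so treat the exact constants and the reading of T as the horizon as assumptions:

```python
import math

def hindsight_sample_bound(q_advantage, horizon, n_actions, delta):
    """Smallest integer w satisfying the Chernoff-style bound
        w > 4 * q_advantage**-2 * T * ln(n_actions * horizon / delta),
    taking T to be the horizon. q_advantage is the minimum Q-advantage
    of the best action; delta is the confidence parameter."""
    bound = 4 * q_advantage ** -2 * horizon * math.log(n_actions * horizon / delta)
    return math.floor(bound) + 1

# e.g. 10 actions, horizon 20, minimum Q-advantage 0.1, confidence 0.05
w = hindsight_sample_bound(0.1, 20, 10, 0.05)
```

The bound is polynomial in the horizon and logarithmic in the number of actions, matching the slide's "polynomial number of samples" claim.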
Probabilistic Planning
Expecti-max solution
[Diagram: the expecti-max tree for the same problem — Max nodes over actions alternate with Expectation nodes over probabilistic outcomes at each time step, down to the goal states. Objective: maximize goal achievement.]