Stochastic Safest and Shortest Path Problems
Florent Teichteil-Königsbuch
AAAI-12, Toronto, Canada
July 24-26, 2012

Outline: Motivations, S3P, Experiments, Conclusion
Path optimization under probabilistic uncertainties
I Problems that amount to searching for a shortest path in a probabilistic AND/OR cyclic graph
. OR nodes : branch choice (action)
. AND nodes : probabilistic outcomes of chosen branches
(actions’ effects)
I Problem statement : to compute a policy to go to the goal
with maximum probability or minimum expected cost-to-go
I Examples :
. Shortest path planning in probabilistic grid worlds
(racetrack)
. Minimum number of moves of blocks to build towers with
stochastic operators (exploding-blocksworld)
. Controller synthesis for critical systems, with maximum terminal availability and minimum energy consumption
(embedded systems, transportation systems, servers, etc.)
Mathematical framework : goal-oriented MDP
Goal-oriented Markov Decision Process
I S : finite set of states
I G ⊂ S : finite set of goals
I A : finite set of actions
I T : S × A × S → [0, 1], (s, a, s') ↦ Pr(s_{t+1} = s' | s_t = s, a_t = a) : transition function
I c : S × A × S → R : cost function associated with the transition function
I Absorbing goal states : ∀ g ∈ G, ∀ a ∈ A, T(g, a, g) = 1
I No costs paid from goal states : ∀ g ∈ G, ∀ a ∈ A, c(g, a, g) = 0
I app : S → 2^A : applicable actions in a given state
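A goal-oriented MDP like this can be encoded with a couple of dictionaries. The sketch below is a hypothetical toy instance (the names GOALS, T and app are ours, not from the talk); the later snippets reuse this encoding.

```python
# Minimal encoding of a goal-oriented MDP (hypothetical toy instance, not from the talk).
# T[s][a] is a list of (next_state, probability, cost) triples; goal states are
# absorbing and cost-free, as required by the definition above.
GOALS = {"G"}

T = {
    "I": {"go": [("X", 1.0, 0.0)]},                      # deterministic move to X
    "X": {"risky": [("G", 0.5, 1.0), ("I", 0.5, 1.0)],   # may fall back to I
          "safe":  [("G", 1.0, 3.0)]},                   # always reaches the goal
    "G": {"stay": [("G", 1.0, 0.0)]},                    # absorbing goal, zero cost
}

def app(s):
    """Applicable actions in state s (the 'app' function of the slide)."""
    return list(T[s].keys())
```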
Stochastic Shortest Path (SSP)
[Bertsekas & Tsitsiklis (1996)]
Optimization criterion : total (undiscounted) cost, or cost-to-go
I Find a Markovian policy π ∗ : S → A that minimizes the
expected total cost from any possible initial state
∀ s ∈ S, π*(s) = argmin_{π ∈ A^S} V^π(s), with V^π(s) = E[ ∑_{t=0}^{+∞} c_t | s_0 = s ]
I Value of π*, solution of the Bellman equation :
∀ s ∈ S, V^{π*}(s) = min_{a ∈ app(s)} ∑_{s' ∈ S} T(s, a, s') [ V^{π*}(s') + c(s, a, s') ]
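If Assumptions 1 and 2 (next slide) hold, this Bellman equation can be solved by plain value iteration. A minimal sketch on the toy encoding introduced above (identifiers are ours, not the author's code):

```python
def ssp_value_iteration(T, goals, eps=1e-8, max_iter=10000):
    """Value iteration for the SSP criterion:
    V(s) = min_a sum_{s'} T(s,a,s') * (V(s') + c(s,a,s'))."""
    V = {s: 0.0 for s in T}
    for _ in range(max_iter):
        delta = 0.0
        for s in T:
            if s in goals:
                continue                       # goal states keep value 0
            best = min(sum(p * (V[s2] + c) for s2, p, c in outs)
                       for outs in T[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {s: min(T[s], key=lambda a: sum(p * (V[s2] + c) for s2, p, c in T[s][a]))
              for s in T if s not in goals}
    return V, policy

# V, pi = ssp_value_iteration(T, GOALS)
# On the toy MDP: pi["X"] == "risky" (expected cost 2.0 < 3.0 for "safe").
```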
SSP : required theoretical and practical assumptions
Assumption 1
There exists at least one proper policy, i.e. : a policy that reaches
the goal with probability 1 regardless of the initial state.
Assumption 2
For every improper policy, the corresponding cost-to-go is infinite, i.e. : every cycle not leading to the goal accumulates strictly positive cost.
Implications if both assumptions 1 and 2 hold
I There exists a policy π such that V π is finite ;
I An optimal Markovian (stationary) policy can be obtained
using dynamic programming (Bellman equation).
Drawbacks of the SSP criterion
SSP assumptions not easy to check in practice
I Deciding whether assumptions 1 and 2 hold not obvious in
general
I Same complexity class as optimizing the cost-to-go criterion
Limited practical scope
I Limited to optimizing policies reaching the goal with probability
1, without nonpositive-cost cycles not leading to the goal
I Especially annoying in the presence of dead-ends or nonpositive-cost cycles
I In the absence of proper policies, no known method to
optimize both the probability of reaching the goal and
the corresponding total cost of those paths to the goal
(dual optimization)
Alternatives to the standard SSP criterion
Generalized SSP (GSSP) [Kolobov, Mausam & Weld (2011)]
I Constraint optimization : Find the minimal-cost policy among the ones reaching the goal with probability 1
I Includes many other problems, e.g. : Find the policy reaching the goal with maximum probability
Drawback : Assumes proper policies; cannot find minimal-cost policies among the ones reaching the goal with maximum probability
Discounted SSP (DSSP) [Teichteil, Vidal & Infantes (2011)]
Relaxed heuristics for the γ-discounted criterion :
∀ s ∈ S, V^{π*}(s) = min_{a ∈ app(s)} ∑_{s' ∈ S} T(s, a, s') [ γ V^{π*}(s') + c(s, a, s') ]
Drawback : Accumulates costs along paths not reaching the goal !
⇒ the optimized policy may potentially avoid the goal...
fSSPUDE/iSSPUDE [Kolobov, Mausam & Weld (2012)]
I fSSPUDE : goal MDPs with finite-cost unavoidable dead-ends
→ no dual optimization of goal-probability and goal-cost
I iSSPUDE : goal MDPs with infinite-cost unavoidable dead-ends
→ (required) dual optimization, limited to positive costs
S3P : new dual optimization criterion for goal MDPs
Goal-probability function
P_n^{G,π}(s) : probability of reaching the goal in at most n steps-to-go by executing π from s
Goal-cost function
C_n^{G,π}(s) : expected total cost in at most n steps-to-go by executing π from s, averaged only over paths to the goal
S3P optimization criterion : SSP ⊂ GSSP ⊂ S3P
Find an optimal (Markovian) policy π* that minimizes C_∞^{G,π} among all policies maximizing P_∞^{G,π} :
π*(s) ∈ argmin_{π : ∀ s' ∈ S, π(s') ∈ argmax_{π' ∈ A^S} P_∞^{G,π'}(s')} C_∞^{G,π}(s)
Example I : goal-oriented MDP with proper policy

[Figure : deterministic goal-oriented MDP with initial state I, an intermediate state s, and goal G. Action aI (p = 1, c = 0) leads from I to s ; from s, action a1 (p = 1, c = −2) reaches G and action a2 (p = 1, c = −1) returns to I ; aG (p = 1, c = 0) is the absorbing goal action.]

2 policies : π1 = (aI, a1, aG, aG, aG, ...), π2 = (aI, a2, aI, a2, aI, a2, ...)

                      π1      π2
V^π(I) [SSP]          −2      −∞
P_∞^{G,π} [S3P]        1       0
C_∞^{G,π} [S3P]       −2       0

Assumption 2 not satisfied ⇒ SSP criterion not well-defined !
Optimal S3P policy : π1
Example II : goal-oriented MDP without proper policy

[Figure : goal-oriented MDP with initial state I, goal G and an absorbing dead-end d (p = 1, c = 0). Actions a1 (c = 1), a2 (c = 2), a3 (c = −2) and the self-loop aI (p = 1, c = 1) are applicable in I ; their outcomes (probabilities 0.1, 0.5, 0.9) lead towards G or towards the dead-end d.]

4 Markovian policies depending on action in I :
a1 → π1, a2 → π2, a3 → π3, aI → π4

SSP well-defined but unsatisfied assumptions !
Optimal SSP policy : π3

From state I :
        V       P       C
π1     1.1    0.95    1.05
π2     2.1    0.95    1.85
π3    −1.9    0.05    −1
π4    +∞      0       0

Optimal S3P policy : π1
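Reading the table, the S3P selection rule can be mechanised: keep the policies of maximal goal-probability, then take the cheapest among them. A tiny illustration with the values copied from the table above (the dict layout is ours):

```python
# (goal-probability, goal-cost) of each policy from state I, copied from the table above.
policies = {"pi1": (0.95, 1.05), "pi2": (0.95, 1.85), "pi3": (0.05, -1.0), "pi4": (0.0, 0.0)}

p_max = max(p for p, _ in policies.values())           # 0.95: best achievable goal-probability
s3p_optimal = min((name for name, (p, c) in policies.items() if p == p_max),
                  key=lambda name: policies[name][1])  # cheapest among the P-maximising ones
print(s3p_optimal)                                     # -> "pi1", matching the slide
```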
Example III : goal-oriented MDP without proper policy

[Figure : goal-oriented MDP with initial state I, goal G, intermediate states s1, s2, s3 and absorbing dead-ends d1, d2, d3. From I : a1 (c = 1) reaches G with p = 0.1 and s1 with p = 0.9 ; a2 (c = 10) reaches G with p = 0.9 and s2 with p = 0.1 ; a3 (c = 100) reaches G with p = 0.5 and s3 with p = 0.5 ; aI (p = 1, c = 1000) is a self-loop on I. From s1 (resp. s2, s3), an action of the same cost reaches G with p = 0.5 (resp. 0.5, 0.9) and the dead-end otherwise ; dead-ends loop on themselves with costs 1, 10, 100 ; aG (p = 1, c = 0) is the goal self-loop.]

4 Markovian policies according to the action chosen in I :
a1 → π1, a2 → π2, a3 → π3, aI → π4

P_∞^{G,π1}(I) = 0.55    C_∞^{G,π1}(I) = 1.818
P_∞^{G,π2}(I) = 0.95    C_∞^{G,π2}(I) = 10.53
P_∞^{G,π3}(I) = 0.95    C_∞^{G,π3}(I) = 147.4
P_∞^{G,π4}(I) = 0       C_∞^{G,π4}(I) = 0

I Assumptions 1 and 2 both unsatisfied : ∀ i ∈ {1, 2, 3, 4}, V^{πi}(I) = +∞
I γ-discounted criterion : ∀ 0 < γ < 1, V_γ^{π4}(I) > V_γ^{π3}(I) > V_γ^{π2}(I) > V_γ^{π1}(I)
  ⇒ π1 is the only optimal DSSP policy for all 0 < γ < 1
I S3P criterion : π2 is the only optimal S3P policy !
Lessons learnt from these examples
I If you care about :
First reach the goal, then minimize costs along paths to
the goal
like in deterministic/classical planning !
I Then be aware that :
Existing criteria do not guarantee to produce such policies
I Without mentioning that :
Standard criteria are not even well-defined for many interesting goal-oriented MDPs, contrary to S3Ps
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S3Ps.
Policy evaluation for the S3P dual criterion

Theorem 1 (finite horizon H ∈ N, general costs)
For all steps-to-go 1 ≤ n ≤ H, all history-dependent policies π = (π_0, ..., π_{H−1}) and all states s ∈ S :

P_n^{G,π}(s) = ∑_{s' ∈ S} T(s, π_{H−n}(s), s') P_{n−1}^{G,π}(s'),
with : P_0^{G,π}(s) = 0, ∀ s ∈ S \ G, and P_0^{G,π}(g) = 1, ∀ g ∈ G   (1)

If P_n^{G,π}(s) > 0, C_n^{G,π}(s) is well-defined and satisfies :

C_n^{G,π}(s) = (1 / P_n^{G,π}(s)) ∑_{s' ∈ S} T(s, π_{H−n}(s), s') P_{n−1}^{G,π}(s') [ c(s, π_{H−n}(s), s') + C_{n−1}^{G,π}(s') ],
with : C_0^{G,π}(s) = 0, ∀ s ∈ S   (2)

The factor 1 / P_n^{G,π}(s) is the normalization factor for averaging only over the paths to the goal.
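These two recursions translate directly into a backward pass. A sketch for a stationary policy pi on the toy encoding from the framework slide (identifiers are ours; the theorem itself also covers history-dependent policies):

```python
def evaluate_goal_prob_cost(T, goals, pi, horizon):
    """Finite-horizon goal-probability P_n and goal-cost C_n of policy pi (eqs. (1)-(2))."""
    P = {s: (1.0 if s in goals else 0.0) for s in T}   # P_0
    C = {s: 0.0 for s in T}                            # C_0
    for _ in range(horizon):
        newP, newC = {}, {}
        for s in T:
            if s in goals:
                newP[s], newC[s] = 1.0, 0.0            # absorbing, cost-free goals
                continue
            outs = T[s][pi[s]]
            p = sum(prob * P[s2] for s2, prob, cost in outs)
            newP[s] = p
            if p > 0.0:
                # dividing by p averages the costs only over paths that reach the goal
                newC[s] = sum(prob * P[s2] * (cost + C[s2]) for s2, prob, cost in outs) / p
            else:
                newC[s] = 0.0                          # convention when the goal is unreachable
        P, C = newP, newC
    return P, C

# P, C = evaluate_goal_prob_cost(T, GOALS, {"I": "go", "X": "risky"}, horizon=50)
```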
Convergence of the S3 P dual criterion in infinite horizon
A fundamental difference with SSPs : demonstrate and use the
contraction property of the transition function over the
states reaching the goal with some positive probability
(for any Markovian policy, in infinite horizon, and general costs)
Lemma 1 (infinite horizon, general costs)
Generalizes/extends Bertsekas & Tsitsiklis’ SSP theoretical results to S3 Ps
Let M be a general goal-oriented MDP, π a stationary policy, T^π the transition matrix for policy π, and, for all n ∈ N, X_n^π = {s ∈ S \ G : P_n^{G,π}(s) > 0}. Then :
(i) for all s ∈ S, P_n^{G,π}(s) converges to a finite value as n tends to +∞ ;
(ii) there exists X^π ⊂ S such that X_n^π ⊆ X^π for all n ∈ N, and T^π is a contraction over X^π.
This new contraction property guarantees the well-foundedness of the S3P criterion and the existence of
optimal Markovian policies, without any assumption at all !
Evaluating and optimizing S3 P policies in infinite horizon
I The S3 P dual criterion is well-defined for any
goal-oriented MDP and any Markovian policy.
Theorem 2 (infinite horizon, general costs)
Let M be any general goal-oriented MDP, and π any stationary policy for M.
The evaluation equations of Theorem 1 converge to finite values P_∞^{G,π}(s) and C_∞^{G,π}(s) for any s ∈ S (by convention, C_n^{G,π}(s) = 0 if P_n^{G,π}(s) = 0, n ∈ N).
I There exists a Markovian policy solution to the S3 P
problem for any goal-oriented MDP.
Proposition (infinite horizon, general costs)
Let M be any general goal-oriented MDP.
1. There exists an optimal stationary policy π* that minimizes the infinite-horizon goal-cost function among all policies that maximize the infinite-horizon goal-probability function, i.e. π* is S3P-optimal.
2. The goal-probability P_∞^{G,π*} and goal-cost C_∞^{G,π*} functions have finite values.
But wait... Can we easily optimize S3 Ps ?
Greedy Markovian policies w.r.t. goal-probability or goal-cost
metrics do not need to be optimal !
[Figure : the goal-oriented MDP of Example I (actions aI, a1, a2, aG with p = 1 and costs 0, −2, −1, 0).]
P_n^*(s) = max_{a ∈ app(s)} ∑_{s' ∈ S} T(s, a, s') P_{n−1}^*(s'), with :
P_0^*(s) = 0, ∀ s ∈ S \ G ; P_0^*(g) = 1, ∀ g ∈ G
After 3 iterations : a_2 ∈ argmax_{a ∈ app(s)} ∑_{s' ∈ S} T(s, a, s') P_2^*(s')  (the maximum equals 1)
Whereas : P_∞^{G,π=(a_2,a_2,a_2,...)}(s) = 0 < 1 = P_∞^*(s) !
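The pitfall can be checked numerically. A sketch assuming the Example I topology as sketched earlier (I --aI--> s, s --a1--> G, s --a2--> I, all deterministic; the encoding is ours): after three backups of P_n, a2 ties with a1 at state s, although the stationary policy that always plays a2 never reaches the goal.

```python
# Hypothetical encoding of the Example I graph (deterministic transitions only).
T1 = {
    "I": {"aI": [("s", 1.0)]},
    "s": {"a1": [("G", 1.0)], "a2": [("I", 1.0)]},
    "G": {"aG": [("G", 1.0)]},
}

P = {"I": 0.0, "s": 0.0, "G": 1.0}       # P_0: 1 on the goal, 0 elsewhere
for _ in range(3):                       # three greedy backups of P_n
    P = {st: 1.0 if st == "G" else
             max(sum(p * P[s2] for s2, p in outs) for outs in T1[st].values())
         for st in T1}

q = {a: sum(p * P[s2] for s2, p in outs) for a, outs in T1["s"].items()}
print(q)  # {'a1': 1.0, 'a2': 1.0}: a2 is greedy after 3 iterations, yet the
          # stationary policy "always a2" cycles between s and I and never reaches G.
```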
But wait... Can we easily optimize S3 Ps ?
Greedy Markovian policies w.r.t. goal-probability or goal-cost
metrics do not need to be optimal !
[Figure : the MDP of Example I with positive costs : aI (p = 1, c = 3), a1 (p = 1, c = 1), a2 (p = 1, c = 2), aG (p = 1, c = 0).]

Solution
Implicitly eliminate non-optimal greedy policies w.r.t. the goal-probability criterion, by searching for greedy policies w.r.t. the goal-cost criterion...
...provided all costs are positive (except from the goal), so that "choosing a_2" has an infinite cost (since P_∞^{G,π}(s) = 0 for the policy π choosing a_2 in s, whereas T(s, a_2, I) P_∞^*(I) = 1).
Optimization of S3P Markovian policies in infinite horizon

Under positive costs : greedy policies w.r.t. the goal-probability and goal-cost functions are S3P-optimal

Theorem 3 (infinite horizon, positive costs only)
Let M be a goal-oriented MDP such that all transitions from non-goal states have strictly positive costs.

P_n^*(s) = max_{a ∈ app(s)} ∑_{s' ∈ S} T(s, a, s') P_{n−1}^*(s'),
with : P_0^*(s) = 0, ∀ s ∈ S \ G ; P_0^*(g) = 1, ∀ g ∈ G   (3)

The functions P_n^* converge to a finite-valued function P_∞^*.

C_n^*(s) = 0 if P_∞^*(s) = 0 ; otherwise, if P_∞^*(s) > 0 :

C_n^*(s) = min_{a ∈ app(s) : ∑_{s' ∈ S} T(s, a, s') P_∞^*(s') = P_∞^*(s)}  (1 / P_∞^*(s)) ∑_{s' ∈ S} T(s, a, s') P_∞^*(s') [ c(s, a, s') + C_{n−1}^*(s') ],
with : C_0^*(s) = 0, ∀ s ∈ S   (4)   (requires only positive costs)

The functions C_n^* converge to a finite-valued function C_∞^*, and any Markovian policy π* obtained from the previous equations at convergence is S3P-optimal.
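With strictly positive costs, equations (3) and (4) give a two-phase dynamic programme, which is the idea behind GPCI. A minimal sketch on the toy encoding of the framework slide (not the author's implementation; the synchronous updates and tolerances are our choices):

```python
def gpci(T, goals, eps=1e-8, max_iter=100000):
    """Sketch of Goal-Probability and goal-Cost Iteration (eqs. (3)-(4), positive costs assumed)."""
    # Phase 1: optimal goal-probability P* (equation (3)).
    P = {s: (1.0 if s in goals else 0.0) for s in T}
    for _ in range(max_iter):
        newP = {s: 1.0 if s in goals else
                   max(sum(p * P[s2] for s2, p, c in outs) for outs in T[s].values())
                for s in T}
        done = max(abs(newP[s] - P[s]) for s in T) < eps
        P = newP
        if done:
            break
    # Phase 2: minimal goal-cost C* restricted to P*-greedy actions (equation (4)).
    C = {s: 0.0 for s in T}
    for _ in range(max_iter):
        newC = {}
        for s in T:
            if s in goals or P[s] == 0.0:
                newC[s] = 0.0
                continue
            # keep only the actions whose expected successor goal-probability equals P*(s)
            greedy = [outs for outs in T[s].values()
                      if abs(sum(p * P[s2] for s2, p, c in outs) - P[s]) < 1e-9]
            newC[s] = min(sum(p * P[s2] * (c + C[s2]) for s2, p, c in outs) / P[s]
                          for outs in greedy)
        done = max(abs(newC[s] - C[s]) for s in T) < eps
        C = newC
        if done:
            break
    return P, C

# P, C = gpci(T, GOALS)  # toy MDP: P == 1 everywhere, C matches the SSP values (2.0 in I and X).
```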
Summarizing the new S3 P dual optimization criterion
I SSP ⊂ GSSP ⊂ S3 P
I Dual optimization : total cost minimization, averaged only
over paths to the goal, among all Markovian policies
maximizing the probability of reaching the goal
I Well-defined in finite or infinite horizon for any goal-oriented MDP (contrary to SSPs or GSSPs)
I But (at the moment) : optimization equations in the form
of dynamic programming only if all costs from non-goal
states are positive
. GPCI algorithm (Goal-Probability and -Cost Iteration)
. iSSPUDE sub-model (≙ S3P with positive costs) : efficient
ˆ S3 P with positive costs) : efficient
heuristic algorithms by Kolobov, Mausam & Weld (UAI 2012)
Experimental setup
I Tested problems
. without dead-ends (assumptions 1 and 2 of SSPs satisfied) :
blocksworld, rectangle-tireworld
⇒ Optimization with the standard SSP criterion, then
comparison with the S3 P dual criterion
. with dead-ends (assumptions 1 or 2 of SSPs unsatisfied) :
exploding-blocksworld, triangle-tireworld, grid
(gridworld variation on example III)
⇒ Optimization with the DSSP criterion for many values of γ, until P_∞^{G,π} (resp. C_∞^{G,π}) is maximized (resp. minimized) as well as possible ; γ_opt unknown in advance !
I Tested algorithms : VI, LRTDP (optimal for (D)SSPs), RFF, GPCI (optimal for S3Ps)
I Once optimized, all policies evaluated using S3 P criterion
I Systematic comparison with an optimal policy for S3 Ps
Analysis of the goal-probability function P_∞^{G,π}

[Plot : goal probability per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF ; DSSP results shown for γ > 0.99 ; GPCI is P_∞^G-optimal.]
DSSP-optimal policies (VI, LRTDP) do not maximize the
probability of reaching the goal, whatever the value of γ !
Analysis of the goal-cost function C_∞^{G,π}

[Plot : goal cost per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF ; DSSP results shown for γ > 0.99 ; GPCI minimizes C_∞^G among the P_∞^{G,*}-maximizing policies, i.e. it is S3P-optimal ; VI and LRTDP show smaller goal-costs but actually also smaller goal-probabilities !]
DSSP-optimal policies (VI, LRTDP) do not minimize the total cost averaged only over paths to the goal, whatever γ !
Comparison of computation times
[Plot : computation time in seconds (log scale, 0.01 to 100) per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF ; GPCI is as efficient as VI for problems that are not really probabilistically interesting (P_∞^{G,*} ≃ 1).]
GPCI faster than VI, LRTDP and even RFF for problems with dead-ends and complex cost structure
Conclusion and perspectives
I An original well-founded dual criterion for goal-oriented MDPs
. SSP ⊂ GSSP ⊂ S3 P
. S3 P dual criterion : minimum goal-cost policy among the
ones with maximum goal-probability
. S3 P dual criterion well-defined in infinite horizon for any
goal-oriented MDP (no assumptions required, contrary to
SSPs or GSSPs)
. If costs are positive : GPCI algorithm or heuristic algorithm
for the iSSPUDE sub-model [Kolobov, Mausam & Weld, 2012]
I Future work
. Unifying our general-cost model and the positive-cost
model of Kolobov, Mausam & Weld
. Algorithms for solving S3 Ps with general costs
. Domain-independent heuristics for estimating goal-probability
and goal-cost functions
Thank you for your attention !
I If you care about :
First reach the goal, then minimize costs along paths to
the goal
like in deterministic/classical planning !
I Then be aware that :
Existing criteria do not guarantee to produce such policies
I Without mentioning that :
Standard criteria are not even well-defined for many interesting goal-oriented MDPs, contrary to S3Ps
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S3Ps.