Improved Expected Running Time for MDP Planning
Shivaram Kalyanakrishnan [1]
Neeldhara Misra [2]
Aditya Gopalan [3]
1. Department of Computer Science and Engineering, Indian Institute of Technology Bombay
2. Department of Computer Science and Automation, Indian Institute of Science
3. Department of Electrical Communication Engineering, Indian Institute of Science
June 2015
Overview
1. MDP Planning
2. Solution strategies
Linear programming
Value iteration
Policy iteration
Our contribution: Planning by Guessing and Policy Improvement (PGPI)
3. PGPI algorithm
A total order on the set of policies
Guessing game
Algorithm
4. Discussion
MDP Planning
Markov Decision Problem: general abstraction of sequential decision making.
An MDP comprises a tuple (S, A, R, T), where
S is a set of states (with |S| = n),
A is a set of actions (with |A| = k),
R(s, a) is a bounded real number, ∀s ∈ S, ∀a ∈ A, and
T(s, a) is a probability distribution over S, ∀s ∈ S, ∀a ∈ A.
A policy π : S → A specifies an action from each state. The value of a policy π from state s is
V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_t = π(s_t), t = 0, 1, 2, … ],
where γ ∈ [0, 1) is a discount factor.
Planning problem: Given S, A, R, T, and γ, find a policy π⋆ from the set of all policies Π such that
V^{π⋆}(s) ≥ V^π(s), ∀s ∈ S, ∀π ∈ Π.
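
To make these definitions concrete, here is a minimal Python sketch (with numpy) that stores the rewards and transitions as arrays R[s, a] and T[s, a, s′] and estimates V^π(s) by truncated Monte Carlo rollouts. The array layout, the toy transition numbers, and the function name estimate_value are illustrative assumptions, not anything taken from the talk.

```python
import numpy as np

def estimate_value(R, T, policy, s0, gamma=0.9, episodes=2000, horizon=200, rng=None):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t * r_t]."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = R.shape[0]
    total = 0.0
    for _ in range(episodes):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite discounted sum
            a = policy[s]
            ret += disc * R[s, a]             # reward r_t = R(s_t, pi(s_t))
            disc *= gamma
            s = rng.choice(n, p=T[s, a])      # next state drawn from T(s, a)
        total += ret
    return total / episodes

# Toy 2-state, 2-action MDP (illustrative numbers only).
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                    # R[s, a]
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])      # T[s, a, s']
pi = np.array([1, 0])                         # a deterministic policy: pi(0)=1, pi(1)=0
print(estimate_value(R, T, pi, s0=0))
```

The truncation horizon controls the bias from cutting off the infinite discounted sum; with γ = 0.9 and a horizon of 200 steps the neglected tail is negligible.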
Linear Programming
The optimal value function V^{π⋆}, denoted V⋆, is the unique solution of Bellman’s Optimality Equations: ∀s ∈ S,
V⋆(s) = max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V⋆(s′) ).
V⋆ can be obtained by solving an equivalent linear program:
minimise Σ_{s∈S} V(s)
subject to V(s) ≥ R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V(s′), ∀s ∈ S, ∀a ∈ A.
n variables, nk constraints (or the dual, with nk variables and n constraints).
Solution time: poly(n, k, B), where B is the number of bits used to represent the MDP.
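
As a sketch of this formulation, the snippet below builds the constraint matrix row by row and hands it to scipy.optimize.linprog. The R[s, a]/T[s, a, s′] array layout and the helper name solve_mdp_lp are assumptions made here for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(R, T, gamma):
    """Solve the MDP linear program:
    minimise sum_s V(s)  subject to  V(s) >= R(s,a) + gamma * sum_s' T(s,a,s') V(s')."""
    n, k = R.shape
    A_ub, b_ub = [], []
    for s in range(n):
        for a in range(k):
            row = gamma * T[s, a].copy()
            row[s] -= 1.0                     # (gamma*T(s,a,.) - e_s) . V <= -R(s,a)
            A_ub.append(row)
            b_ub.append(-R[s, a])
    c = np.ones(n)                            # objective: sum_s V(s)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    V_star = res.x
    Q_star = R + gamma * np.einsum("sab,b->sa", T, V_star)
    return V_star, Q_star.argmax(axis=1)      # optimal values and a greedy optimal policy

R = np.array([[0.0, 1.0], [2.0, 0.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
V_star, pi_star = solve_mdp_lp(R, T, gamma=0.9)
print(V_star, pi_star)
```

Once V⋆ is available, an optimal policy is read off greedily from the corresponding action values.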
Value Iteration
Classical dynamic programming approach.
V_0 ← arbitrary, element-wise bounded, n-length vector.
t ← 0.
Repeat:
  For s ∈ S:
    V_{t+1}(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V_t(s′) ).
  t ← t + 1.
Until V_t = V_{t−1} = V⋆ (up to machine precision).
Convergence to V⋆ is guaranteed by a max-norm contraction argument.
Number of iterations: poly(n, k, B, 1/(1−γ)).
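
A minimal numpy sketch of this loop, using the same assumed R[s, a]/T[s, a, s′] array layout as above and a small tolerance in place of “machine precision”:

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-10):
    """Iterate the Bellman optimality operator until successive iterates
    agree to within tol (a stand-in for 'machine precision')."""
    n, k = R.shape
    V = np.zeros(n)                           # arbitrary bounded initial vector V_0
    while True:
        Q = R + gamma * np.einsum("sab,b->sa", T, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # V* (numerically) and a greedy policy
        V = V_new
```

It can be run on the toy R and T from the first sketch, e.g. V, pi = value_iteration(R, T, gamma=0.9).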
Bellman’s Equations and Policy Evaluation
Recall that
V^π(s) = E[ r_1 + γ r_2 + γ² r_3 + … | s_0 = s, a_i = π(s_i) ].
Bellman’s Equations (∀s ∈ S):
V^π(s) = Σ_{s′∈S} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ].
V^π is called the value function of π.
Define (∀s ∈ S, ∀a ∈ A):
Q^π(s, a) = Σ_{s′∈S} T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ].
Q^π is called the action value function of π.
V^π(s) = Q^π(s, π(s)).
The variables in Bellman’s equations are the V^π(s): |S| linear equations in |S| unknowns.
Thus, given S, A, T, R, γ, and a fixed policy π, we can solve Bellman’s equations efficiently to obtain V^π(s) and Q^π(s, a), ∀s ∈ S, ∀a ∈ A.
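
Because policy evaluation is just an |S| × |S| linear system, it can be sketched in a few lines of numpy. Note that the sketch uses the two-argument reward R(s, a) from the MDP definition slide rather than the R(s, a, s′) form above, and the helper name evaluate_policy is an assumption.

```python
import numpy as np

def evaluate_policy(R, T, policy, gamma):
    """Exact policy evaluation: solve the |S| linear Bellman equations
    (I - gamma * P_pi) V = r_pi, then read off Q^pi."""
    n = R.shape[0]
    idx = np.arange(n)
    P_pi = T[idx, policy]                     # P_pi[s, s'] = T(s, pi(s), s')
    r_pi = R[idx, policy]                     # r_pi[s]     = R(s, pi(s))
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sab,b->sa", T, V)       # Q^pi[s, a]
    return V, Q
```

For the toy MDP from the first sketch, evaluate_policy(R, T, np.array([1, 0]), 0.9) returns V^π and the full Q^π table.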
Policy Iteration
For a given policy π, define
I(π) := { s ∈ S : Q^π(s, π(s)) < max_{a∈A} Q^π(s, a) }.
Q^π and I(π) are easily derived from V^π (policy evaluation).
I(π) = ∅ iff π is an optimal policy.
Assume I(π) ≠ ∅. Let C(π) be an arbitrarily “chosen” non-empty subset of I(π). Define a policy π′ as follows:
π′(s) := argmax_{a∈A} Q^π(s, a) if s ∈ C(π); π′(s) := π(s) otherwise.
It can be shown that
(1) ∀s ∈ S : V^{π′}(s) ≥ V^π(s), and
(2) ∃s ∈ S : V^{π′}(s) > V^π(s). [Policy improvement]
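
A small Python sketch of this improvement step, taking Q^π as a numpy array Q[s, a] (as computed by the policy-evaluation sketch above); the function names improvable_states and improve are made up here for illustration.

```python
import numpy as np

def improvable_states(Q, policy):
    """I(pi): states where some action strictly beats pi(s) under Q^pi
    (exact comparison; a small tolerance may be needed in floating point)."""
    return [s for s in range(Q.shape[0]) if Q[s].max() > Q[s, policy[s]]]

def improve(Q, policy, chosen):
    """Policy improvement: switch to a greedy action on the chosen subset C(pi)."""
    new_policy = policy.copy()
    for s in chosen:
        new_policy[s] = int(Q[s].argmax())
    return new_policy
```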
Policy Iteration
π_0 ← arbitrary policy.
t ← 0.
Repeat:
  Evaluate π_t; derive I(π_t).
  If I(π_t) ≠ ∅, select C(π_t) ⊆ I(π_t) and improve π_t to π_{t+1}.
  t ← t + 1.
Until I(π_{t−1}) = ∅.
Howard’s Policy Iteration: C(π) = I(π).
Number of iterations: O(k^n / n).
Mansour and Singh’s Randomised Policy Iteration: C(π) chosen uniformly at random from among the non-empty subsets of I(π).
Expected number of iterations: O(2^{0.78n}) for k = 2; O( ((1 + 2/log(k)) · k/2)^n ) for general k.
References: Howard (1960), Mansour and Singh (1999), Hollanders et al. (2014).
Note that these bounds do not depend on B and γ!
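
Putting evaluation and improvement together gives the full loop. The sketch below supports both the Howard-style choice C(π) = I(π) and a uniformly random non-empty subset in the spirit of Mansour and Singh; the exact-arithmetic comparisons of the analysis are replaced by a floating-point tolerance, an assumption of this illustration.

```python
import numpy as np

def policy_iteration(R, T, gamma, howard=True, tol=1e-12, rng=None):
    """Policy iteration. howard=True switches every improvable state (C(pi) = I(pi));
    howard=False switches a uniformly random non-empty subset of I(pi)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, k = R.shape
    idx = np.arange(n)
    policy = np.zeros(n, dtype=int)           # arbitrary initial policy pi_0
    while True:
        P_pi = T[idx, policy]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R[idx, policy])
        Q = R + gamma * np.einsum("sab,b->sa", T, V)
        improvable = np.flatnonzero(Q.max(axis=1) > Q[idx, policy] + tol)
        if improvable.size == 0:              # I(pi) empty: pi is optimal
            return policy, V
        if howard:
            chosen = improvable
        else:
            mask = np.zeros(improvable.size, dtype=bool)
            while not mask.any():             # resample: uniform over non-empty subsets
                mask = rng.integers(0, 2, size=improvable.size).astype(bool)
            chosen = improvable[mask]
        policy[chosen] = Q[chosen].argmax(axis=1)
```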
Our Contribution

Algorithm            Computational complexity
Linear Programming   poly(n, k, B)   [“subexponential” bounds (in n, k) exist!]
Value Iteration      poly(n, k, B, 1/(1−γ))
Policy Iteration     poly(n) · O( ((1 + 2/log(k)) · k/2)^n )   (expected)
PGPI                 poly(n) · O( k^{n/2} )   (expected)

PGPI = Planning by Guessing and Policy Improvement.
Randomised algorithm.
Key ingredient: a total order on the set of policies.
Analysis involves basic probability and counting.
Even tighter bounds by combining PGPI with Randomised Policy Iteration!
A Total Order on the Set of Policies
For π ∈ Π, define
V(π) = Σ_{s∈S} V^π(s).
Let L be an arbitrary total ordering on policies for tie-breaking, for example the lexicographic ordering.
Total order ≻: for π_1, π_2 ∈ Π, define π_1 ≻ π_2 iff
V(π_1) > V(π_2), or
V(π_1) = V(π_2) and π_1 L π_2.
Observe that if policy improvement applied to π yields π′, then π′ ≻ π:
∀s ∈ S : V^{π′}(s) ≥ V^π(s) and ∃s ∈ S : V^{π′}(s) > V^π(s)
⟹ V(π′) > V(π)
⟹ π′ ≻ π.
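
A minimal sketch of comparing two policies under ≻, reusing exact policy evaluation for V(π). It compares floating-point sums directly for the tie-break, which is a simplification, and the names scalar_value and better are illustrative.

```python
import numpy as np

def scalar_value(R, T, policy, gamma):
    """V(pi) = sum_s V^pi(s), via exact policy evaluation."""
    n = R.shape[0]
    idx = np.arange(n)
    P_pi = T[idx, policy]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R[idx, policy])
    return V.sum()

def better(pi1, pi2, R, T, gamma):
    """The total order >: compare V(pi); break (exact) ties with a lexicographic order L."""
    v1, v2 = scalar_value(R, T, pi1, gamma), scalar_value(R, T, pi2, gamma)
    if v1 != v2:
        return v1 > v2
    return tuple(pi1) < tuple(pi2)            # tie-break: pi1 precedes pi2 under L
```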
Guessing Game
[Figure: a ladder of levels 1, 2, 3, …, N. INCREMENT moves from level i to level i + 1; GUESS jumps directly to a guessed level. The label N − O(√N) marks where the best of √N guesses lands.]
How many operations needed to reach N?
Pick the best of √N guesses, and then increment up to N.
Expected number of operations: O(√N).
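
A quick simulation of this strategy; treating the guesses as uniform over {1, …, N} is an assumption here, consistent with the PGPI step that draws policies uniformly at random.

```python
import numpy as np

def guess_then_increment(N, num_guesses, rng):
    """Take num_guesses uniform guesses over {1, ..., N}, keep the best,
    then INCREMENT up to N. Returns the total number of operations used."""
    best = int(rng.integers(1, N + 1, size=num_guesses).max())
    return num_guesses + (N - best)

rng = np.random.default_rng(0)
N = 10_000
m = int(np.sqrt(N))                           # sqrt(N) guesses
ops = [guess_then_increment(N, m, rng) for _ in range(1000)]
print(np.mean(ops), np.sqrt(N))               # mean is a small multiple of sqrt(N)
```

With N = 10,000 the mean comes out near 2√N ≈ 200: about √N guesses plus, in expectation, roughly √N increments from the best guess up to N.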
PGPI Algorithm
π ← arbitrary policy.
Repeat k^{αn} times:
  Draw π′ from Π uniformly at random.
  If π′ ≻ π, π ← π′.
While π is not optimal:
  π ← PolicyImprovement(π).
α = 0.5: expected number of iterations O(k^{n/2}).
k = 2, α = 0.46, randomised policy improvement: expected number of iterations O(2^{0.46n}).
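
A toy Python sketch of PGPI under several simplifying assumptions: the tie-breaking order L is dropped (policies are compared by V(π) alone), the improvement phase is Howard-style rather than randomised, and exact arithmetic is replaced by a floating-point tolerance. It is meant only to mirror the two phases, and is practical only for very small n and k, since the guessing phase draws k^(αn) random policies.

```python
import numpy as np

def pgpi(R, T, gamma, alpha=0.5, tol=1e-12, rng=None):
    """PGPI sketch: guess k^(alpha*n) random policies, keep the best by V(pi),
    then run (Howard-style) policy improvement to optimality."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, k = R.shape
    idx = np.arange(n)

    def evaluate(policy):
        P_pi = T[idx, policy]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R[idx, policy])

    # Guessing phase: compare k^(alpha*n) uniformly random policies under V(pi).
    policy = rng.integers(0, k, size=n)       # arbitrary initial policy
    best = evaluate(policy).sum()
    for _ in range(int(np.ceil(k ** (alpha * n)))):
        cand = rng.integers(0, k, size=n)
        v = evaluate(cand).sum()
        if v > best:
            policy, best = cand, v

    # Policy improvement phase.
    while True:
        V = evaluate(policy)
        Q = R + gamma * np.einsum("sab,b->sa", T, V)
        improvable = np.flatnonzero(Q.max(axis=1) > Q[idx, policy] + tol)
        if improvable.size == 0:
            return policy, V
        policy[improvable] = Q[improvable].argmax(axis=1)

R = np.array([[0.0, 1.0], [2.0, 0.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
print(pgpi(R, T, gamma=0.9))
```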
Discussion
PGPI: an improved complexity bound solely in terms of n and k.
Policy Iteration : MDP Planning :: Simplex : Linear Programming? Connections?
PGPI: no favourable experimental results yet!
Policy Iteration (Howard; Mansour and Singh) is very quick on “typical” MDPs.
Yet to find MDPs on which PGPI dominates Policy Iteration.
Currently known “lower bound” MDPs (Fearnley, 2010) are not suitable.
References.
R. A. Howard, 1960. Dynamic Programming and Markov Processes. MIT Press.
Yishay Mansour and Satinder Singh, 1999. On the Complexity of Policy Iteration. In Proc. UAI 1999, pp. 401–408. AUAI.
John Fearnley, 2010. Exponential Lower Bounds for Policy Iteration. In Proc. ICALP 2010, pp. 551–562. Springer.
Romain Hollanders, Balázs Gerencsér, Jean-Charles Delvenne, and Raphaël M. Jungers, 2014. About Upper Bounds on the Complexity of Policy Iteration. http://perso.uclouvain.be/romain.hollanders/docs/NewUpperBoundForPI.pdf
Thank you!