Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor

Thomas Dueholm Hansen (1,2), Peter Bro Miltersen (1), Uri Zwick (2)
1 Center for the Theory of Interactive Computation, Department of Computer Science, Aarhus University, Denmark
2 School of Computer Science, Tel Aviv University, Israel

January 29, 2013


The focus of this talk

Worst-case analysis of a classical algorithm:
- Markov decision processes (MDPs), introduced by Bellman (1957), are a special case of stochastic games (SGs), introduced by Shapley (1953).
- Howard's policy iteration algorithm (1960) solves MDPs.
- 2-player turn-based stochastic games (2TBSGs) are a generalization of MDPs and a special case of SGs.
- Strategy iteration, an extension of policy iteration, solves 2TBSGs.


Applications

- Solving Markov decision processes is an important problem in operations research and machine learning; it is, for instance, used to solve the dairy cow replacement problem.
- Solving parity games, a special case of 2TBSGs, is equivalent to µ-calculus model checking, an important problem in logic and formal methods.


The major open problems

- MDPs can be solved in polynomial time using linear programming, but it remains open how to solve MDPs in strongly polynomial time.
- Solving 2TBSGs, or even parity games, in polynomial time remains open. The corresponding decision problem is in NP ∩ coNP.
- Gärtner and Rüst (2005), and Jurdziński and Savani (2008): 2TBSGs can be solved via P-matrix linear complementarity problems.


Markov chains

[Figure: a 4-state Markov chain with its transition matrix P; a table traces the distribution b^T P^k for k = 0, 1, 2, ...]

An n-state Markov chain is defined by an n × n stochastic matrix P, with P_{i,j} being the probability of making a transition from state i to state j, i.e., Σ_j P_{i,j} = 1.

Let b ∈ R^n define a probability distribution for picking an initial state. b^T P^k is the vector of probabilities of being in each state after k transitions.
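To make the b^T P^k picture concrete, here is a minimal numpy sketch (not from the talk) that pushes an initial distribution through a small chain; the 4-state matrix below is a made-up example, not the chain from the slide's figure.

```python
import numpy as np

# Hypothetical 4-state stochastic matrix (rows sum to 1); not the chain from the slide.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
])

b = np.array([1.0, 0.0, 0.0, 0.0])  # start in state 1 with probability 1

# Distribution over states after k transitions: b^T P^k.
dist = b
for k in range(6):
    print(k, dist)
    dist = dist @ P
```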
Markov chains with costs

We refer to the act of leaving a state i as an action, and we associate each action with a cost c_i, described by a vector c ∈ R^n. Let γ < 1 be a discount factor. We are interested in the expected total discounted cost for some initial distribution b:

    Σ_{k=0}^{∞} b^T (γP)^k c

The value of state i is the expected total discounted cost when starting in state i with probability 1, i.e., b = e_i.
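A minimal numerical sketch of the quantity just defined, again with made-up data rather than the slides' example: it compares a truncated version of the sum with the closed form b^T (I − γP)^{-1} c, which the geometric series converges to since γ < 1. The i-th entry of the solved vector is the value of state i (the b = e_i case).

```python
import numpy as np

# Hypothetical data: a 4-state chain, per-state costs, discount factor, start distribution.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
])
c = np.array([2.0, -1.0, 3.0, 0.5])
gamma = 0.9
b = np.array([1.0, 0.0, 0.0, 0.0])

# Truncated version of the infinite sum  sum_k b^T (gamma P)^k c.
truncated = sum(b @ np.linalg.matrix_power(gamma * P, k) @ c for k in range(200))

# Closed form: the geometric series sums to (I - gamma P)^{-1}.
values = np.linalg.solve(np.eye(4) - gamma * P, c)  # values[i] = value of state i
closed = b @ values

print(truncated, closed)  # essentially equal
```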
Markov decision processes

[Figure: an example MDP with three states and six actions a1, ..., a6; each action has a cost and a probability distribution over successor states.]

A Markov decision process consists of a set of n states S, each state i ∈ S being associated with a non-empty set of actions A_i. Each action a is associated with a cost c_a and a probability distribution P_a ∈ R^{1×n} such that P_{a,j} is the probability of moving to state j when using action a.

A policy π is a choice of an action from each state. A policy π induces a Markov chain with costs, and val_π(i) = Σ_{k=0}^{∞} e_i^T (γP_π)^k c_π denotes the corresponding value of state i.

A policy π* is optimal iff it minimizes the values of all states:

    ∀π, i : val_{π*}(i) ≤ val_π(i)

Shapley (1953), Bellman (1957): There always exists an optimal policy. Solving an MDP means finding an optimal policy.


Values

The values can be found as the unique solution to the system of linear equations:

    ∀i : val_π(i) = c_{π(i)} + γ Σ_j P_{π(i),j} · val_π(j)

For the example in the figure: val_π(1) = 7 + γ ( (1/3) val_π(2) + (2/3) val_π(3) ).


Summing values

Since an optimal policy simultaneously minimizes the values of all states, it minimizes the sum of the values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π,    where e^T = (1, 1, ..., 1).

Let x_a be the number of times action a is used, i.e., the (discounted) sum of a column in the table of e^T (γP_π)^k. Then:

    Σ_{i∈S} val_π(i) = Σ_{i∈S} Σ_{a∈A_i} c_a x_a
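As a sketch of how the "Values" system above can be solved in practice, the snippet below evaluates one fixed policy of a small hypothetical MDP by solving (I − γP_π) val_π = c_π. The state/action data is invented for illustration (only the cost 7 and the probabilities 1/3, 2/3 echo the slide's example); it is not taken from the talk.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP: actions[i] is a list of (cost, transition-row) pairs for state i.
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],   # state 0
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],     # state 1
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])], # state 2
]

def evaluate(policy):
    """Solve (I - gamma * P_pi) v = c_pi for the values of a fixed policy."""
    n = len(actions)
    c = np.array([actions[i][policy[i]][0] for i in range(n)])
    P = np.array([actions[i][policy[i]][1] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P, c)

print(evaluate([0, 1, 0]))  # values of one particular policy
```

Evaluating a policy this way is the step each iteration of policy iteration performs before looking for improving switches.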
Linear program for solving MDPs

The following linear program simultaneously minimizes the values of all states:

    minimize    Σ_{i∈S} Σ_{a∈A_i} c_a x_a
    s.t.        ∀i ∈ S :            Σ_{a∈A_i} x_a = 1 + γ Σ_{j∈S} Σ_{a∈A_j} P_{a,i} x_a
                ∀i ∈ S, ∀a ∈ A_i :  x_a ≥ 0

The constraints ensure "flow conservation". In the example from the figure: x_1 = 1, x_2 = 6γ, x_3 = 4, x_4 = 2, and x_1 + x_2 = 1 + γ(x_3 + x_4).


Basic feasible solutions and policies

In a basic feasible solution exactly one variable (action) is non-zero for each state. Every basic feasible solution corresponds to a policy π. In the example: x_1 = 0, x_2 = 1 + 6γ, x_3 = 4, x_4 = 2, and x_1 + x_2 = 1 + γ(x_3 + x_4).


Improving switches and reduced costs

An action a originating at state i is an improving switch w.r.t. a policy π if making the switch to a improves the value of i:

    val_{π[a]}(i) < val_π(i)

For every policy π, state i ∈ S, and action a ∈ A_i, we define the reduced cost of a w.r.t. π as:

    c̄_a^π = c_a + γ Σ_j P_{a,j} · val_π(j) − val_π(i).

a is an improving switch w.r.t. π iff c̄_a^π < 0.


Multiple joint improving switches and optimality

Lemma (Howard (1960)). Let π' be obtained from π by jointly performing any non-empty set of improving switches. Then for all states i, val_{π'}(i) ≤ val_π(i), with strict inequality for at least one state.

Lemma (Howard (1960)). A policy π is optimal iff there are no improving switches.


Policy iteration

    Function PolicyIteration(π)
        while ∃ improving switch w.r.t. π do
            Update π by performing improving switches
        return π

Howard's algorithm (1960): For each state i, pick the improving switch with the smallest reduced cost:

    π(i) ← argmin_{a∈A_i} c̄_a^π

Dantzig's rule (1947): Perform the single improving switch with the most negative reduced cost.
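A compact sketch of policy iteration with Howard's rule on the same hypothetical MDP encoding as the evaluation sketch above; it is an illustration of the rule, not the authors' implementation.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP, encoded as before: actions[i] = list of (cost, transition-row) pairs.
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def evaluate(policy):
    # Values of the fixed policy: solve (I - gamma * P_pi) v = c_pi.
    c = np.array([actions[i][policy[i]][0] for i in range(n)])
    P = np.array([actions[i][policy[i]][1] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P, c)

def howard(policy):
    while True:
        v = evaluate(policy)
        new_policy = list(policy)
        for i, acts in enumerate(actions):
            # Reduced cost of action a at state i: c_a + gamma * P_a . v - v_i.
            reduced = [c + gamma * np.dot(p, v) - v[i] for (c, p) in acts]
            best = int(np.argmin(reduced))
            if reduced[best] < -1e-9:       # Howard's rule: switch to the argmin
                new_policy[i] = best        # whenever it is an improving switch
        if new_policy == policy:            # no improving switch: policy is optimal
            return policy, v
        policy = new_policy

print(howard([0, 0, 0]))
```

With Dantzig's rule one would instead perform only the single switch with the most negative reduced cost across all states.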
Non-discounted MDPs

Liggett and Lippman (1969) showed that for any MDP G there exists a discount factor γ_G < 1 such that the same policies are optimal for all discount factors γ' ∈ [γ_G, 1). Andersson and Miltersen (2009) showed that γ_G can be described with a number of bits that is polynomial in the bit complexity of G. A 2TBSG G is called non-discounted if it implicitly uses the discount factor γ_G.


Upper bounds for policy iteration

The number of iterations performed by Howard's policy iteration algorithm for an MDP with n states, m actions, discount factor γ < 1, and bit complexity B is at most:

    Meister and Holzbaur (1986):   O( (B/(1−γ)) log(1/(1−γ)) )
    Mansour and Singh (1999)*:     O( 2^n / n )
    Ye (2011):                     O( (mn/(1−γ)) log(n/(1−γ)) )
    We prove:                      O( (m/(1−γ)) log(n/(1−γ)) )

    (* assuming there are two actions per state)

Ye (2011): Policy iteration with Dantzig's rule uses at most O( (mn/(1−γ)) log(n/(1−γ)) ) iterations.

Post and Ye (2012): Policy iteration with Dantzig's rule uses at most O(n^3 m^2 log^2 n) iterations to solve deterministic MDPs.


Lower bounds for policy iteration

Friedmann (2009), Fearnley (2010): There exists a family of non-discounted MDPs for which the number of iterations performed by Howard's policy iteration algorithm is 2^{Ω(n)}.

Hansen and Zwick (2010): For every discount factor γ < 1, there exists a family of deterministic MDPs for which the number of iterations performed by Howard's policy iteration algorithm is Ω(n^2).


Extension to games

2-player turn-based stochastic games (2TBSGs) are a natural generalization of MDPs. The strategy iteration algorithm is a generalization of policy iteration that solves 2TBSGs.

The number of iterations performed by the strategy iteration algorithm using Howard's rule is at most:

    Littman (1996):               O( (B/(1−γ)) log(1/(1−γ)) )
    Mansour and Singh (1999)*:    O( 2^n / n )
    We prove:                     O( (m/(1−γ)) log(n/(1−γ)) )

    (* assuming there are two actions per state)

We provide the first known strongly polynomial time bound for an algorithm for solving 2TBSGs with a fixed discount factor.
2-player turn-based stochastic games (2TBSGs)

[Figure: the example MDP again, with its states partitioned between the two players.]

A 2TBSG is an MDP where the set of states is partitioned into two sets: S_1 ∪ S_2 = S. S_1 is controlled by player 1, the minimizer; S_2 is controlled by player 2, the maximizer.

A strategy π_1 (or π_2) for player 1 (or player 2) is a choice of an action for each state i ∈ S_1 (or i ∈ S_2).

A strategy profile π = (π_1, π_2) is a pair of strategies, defining a Markov chain with costs. We again use val_{π_1,π_2}(i) to denote the corresponding value of state i.


2TBSGs can model backgammon.


Optimality for 2TBSGs

Let π_1 be a strategy for player 1. A strategy π_2 for player 2 is an optimal counter strategy against π_1 iff:

    ∀π_2', i : val_{π_1,π_2}(i) ≥ val_{π_1,π_2'}(i)

I.e., π_2 is optimal, when maximizing, for the MDP resulting from fixing the choices of player 1 to π_1.

π_1* is optimal for player 1 iff:

    ∀π_1, i : max_{π_2} val_{π_1*,π_2}(i) ≤ max_{π_2} val_{π_1,π_2}(i)

Shapley (1953): Optimal strategies always exist.
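As a hedged sketch of the remark above: fixing player 1's choices turns the game into an MDP for player 2, which can then be solved for the maximizer. The snippet uses plain value iteration as a simple stand-in for an exact MDP solver, and the game data (which states belong to which player, and the actions) is hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2TBSG: states 0 and 1 belong to player 1 (minimizer),
# state 2 belongs to player 2 (maximizer); same action encoding as before.
player = [1, 1, 2]
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def optimal_counter_strategy(pi1):
    """Best response of the maximizer against a fixed player-1 strategy pi1
    (a dict state -> action index), computed by value iteration on the MDP
    obtained by fixing player 1's choices."""
    v = np.zeros(n)
    for _ in range(2000):                     # plenty of iterations for gamma = 0.9
        w = np.empty(n)
        for i in range(n):
            if player[i] == 1:                # player 1's choice is fixed
                c, p = actions[i][pi1[i]]
                w[i] = c + gamma * np.dot(p, v)
            else:                             # player 2 maximizes
                w[i] = max(c + gamma * np.dot(p, v) for (c, p) in actions[i])
        v = w
    # Read off a maximizing action at each player-2 state.
    pi2 = {i: int(np.argmax([c + gamma * np.dot(p, v) for (c, p) in actions[i]]))
           for i in range(n) if player[i] == 2}
    return pi2, v

print(optimal_counter_strategy({0: 0, 1: 0}))
```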
Improving switches and reduced costs

An action a originating at a state i ∈ S_1 is an improving switch for player 1 w.r.t. (π_1, π_2) if making the switch to a improves the value of state i for player 1:

    val_{π_1[a],π_2}(i) < val_{π_1,π_2}(i)

For every strategy profile (π_1, π_2), state i ∈ S, and action a ∈ A_i, we define the reduced cost of a w.r.t. (π_1, π_2) as:

    c̄_a^{π_1,π_2} = c_a + γ Σ_j P_{a,j} · val_{π_1,π_2}(j) − val_{π_1,π_2}(i).

a is an improving switch for player 1 w.r.t. (π_1, π_2) iff c̄_a^{π_1,π_2} < 0.


Optimality condition

Optimal counter strategies, optimal strategies, and improving switches are defined analogously for player 2, with reversed inequalities.

- π_2 is an optimal counter strategy against π_1 iff player 2 has no improving switches in S_2.
- π_1 is an optimal counter strategy against π_2 iff player 1 has no improving switches in S_1.
- (π_1*, π_2*) is optimal iff neither player has an improving switch, i.e., (π_1*, π_2*) is a Nash equilibrium.

Note that the decision problem corresponding to solving 2TBSGs is in NP ∩ coNP, since an optimal strategy profile is a witness for both yes and no instances.


StrategyIteration (Rao et al. (1973))

We let π_2(π_1) be an optimal counter strategy against π_1.

    Function StrategyIteration(π_1)
        while ∃ improving switch for player 1 w.r.t. (π_1, π_2(π_1)) do
            Update π_1 by performing improving switches
        return (π_1, π_2(π_1))

Howard's algorithm can be naturally extended to 2TBSGs by choosing, for all i ∈ S_1:

    π_1(i) ← argmin_{a∈A_i} c̄_a^{π_1,π_2(π_1)}
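A rough sketch of the StrategyIteration loop on the same hypothetical game as before: the inner step computes the maximizer's counter values (here approximately, via value iteration, as a stand-in for an exact optimal counter strategy π_2(π_1)), and the outer step performs Howard-style improving switches for the minimizer.

```python
import numpy as np

gamma = 0.9
player = [1, 1, 2]                 # hypothetical game, as in the previous sketch
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def counter_values(pi1):
    # Optimal counter values for player 2 against pi1 (value iteration).
    v = np.zeros(n)
    for _ in range(2000):
        v = np.array([
            actions[i][pi1[i]][0] + gamma * np.dot(actions[i][pi1[i]][1], v)
            if player[i] == 1
            else max(c + gamma * np.dot(p, v) for (c, p) in actions[i])
            for i in range(n)
        ])
    return v

def strategy_iteration(pi1):
    while True:
        v = counter_values(pi1)
        new_pi1 = dict(pi1)
        for i in range(n):
            if player[i] != 1:
                continue
            reduced = [c + gamma * np.dot(p, v) - v[i] for (c, p) in actions[i]]
            best = int(np.argmin(reduced))
            if reduced[best] < -1e-9:   # improving switch for the minimizer
                new_pi1[i] = best
        if new_pi1 == pi1:              # no improving switch: (pi1, pi2(pi1)) is optimal
            return pi1, v
        pi1 = new_pi1

print(strategy_iteration({0: 0, 1: 0}))
```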
Main observation

Theorem (Hansen, Miltersen, and Zwick (2011)). The strategy iteration algorithm with Howard's update rule terminates after O( (m/(1−γ)) log(n/(1−γ)) ) iterations.

Main observation: every O( (1/(1−γ)) log(n/(1−γ)) ) iterations, another action is eliminated, i.e., it will never be used again.

Let π* = (π_1*, π_2*) be optimal. The actions a with largest reduced cost c̄_a^{π*} w.r.t. π* are eliminated first.


Convergence towards the optimal solution

Lemma (Meister and Holzbaur (1986)). Let π^t be the strategy profile generated after t iterations when running the StrategyIteration algorithm starting from π^0. Then:

    ‖val_{π^t} − val_{π*}‖_∞ ≤ γ^t ‖val_{π^0} − val_{π*}‖_∞.

Proof sketch: The values generated by the StrategyIteration algorithm decrease at least as fast as the corresponding values generated by the ValueIteration algorithm, and ValueIteration is a contraction with Lipschitz constant γ.


When is an action eliminated?

Lemma. Let π^t = (π_1^t, π_2(π_1^t)) be the strategy profile generated after t iterations when running the StrategyIteration algorithm starting from π^0 = (π_1^0, π_2(π_1^0)). Let a = argmax_{a∈π_1^0} c̄_a^{π*}, and assume that a ∈ π_1^t. Then:

    ‖val_{π^t} − val_{π*}‖_1 ≥ ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1.

I.e., the action a does not appear after t iterations if:

    ‖val_{π^t} − val_{π*}‖_1 < ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1.

Combining the two lemmas gives:

    ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1 ≤ ‖val_{π^t} − val_{π*}‖_1
                                       ≤ n ‖val_{π^t} − val_{π*}‖_∞
                                       ≤ n γ^t ‖val_{π^0} − val_{π*}‖_∞
                                       ≤ n γ^t ‖val_{π^0} − val_{π*}‖_1

Hence, a can only appear after t iterations if:

    t ≤ log_{1/γ}( n^2/(1−γ) ) = O( (1/(1−γ)) log(n/(1−γ)) )


Flux vectors

Since an optimal policy for an MDP simultaneously minimizes the values of all states, it minimizes the sum of the values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π,    where e^T = (1, 1, ..., 1).

We define the flux vector x^π ∈ R^n such that x_i^π is the expected (discounted) number of times state i is visited, i.e., the discounted sum of a column in the table of e^T P_π^k:

    (x^π)^T = Σ_{k=0}^{∞} e^T (γP_π)^k
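A small numerical check of the flux vector just defined, with a hypothetical policy's cost vector and transition matrix: since Σ_k (γP_π)^k = (I − γP_π)^{-1}, the flux is obtained from one linear solve, and one can verify that e^T x^π = n/(1−γ) and that (x^π)^T c_π equals the sum of the values, as on the following slide.

```python
import numpy as np

gamma = 0.9

# Hypothetical fixed policy: its cost vector and transition matrix.
c_pi = np.array([7.0, 5.0, 2.0])
P_pi = np.array([
    [0.0, 1/3, 2/3],
    [0.0, 0.0, 1.0],
    [0.25, 0.5, 0.25],
])
n = len(c_pi)
e = np.ones(n)

# Flux vector: (x^pi)^T = e^T (I - gamma P_pi)^{-1}, i.e. x^pi solves (I - gamma P_pi)^T x = e.
x = np.linalg.solve((np.eye(n) - gamma * P_pi).T, e)

values = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)   # val_pi

print(e @ x, n / (1 - gamma))    # both equal n / (1 - gamma)
print(x @ c_pi, values.sum())    # (x^pi)^T c_pi equals the sum of the values
```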
Flux vectors (continued)

The flux vector provides an alternative way of summing values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π = (x^π)^T c_π,    where e^T = (1, 1, ..., 1).

Note that e^T x^π = n/(1−γ).

Lemma. Let π and π* be any two strategy profiles. Then:

    Σ_{i∈S} ( val_π(i) − val_{π*}(i) ) = ( (c̄^{π*})_π )^T x^π,

where (c̄^{π*})_π denotes the vector of reduced costs, w.r.t. π*, of the actions used by π.


A subexponential algorithm for 2TBSGs

Kalai (1992) and Matoušek, Sharir, and Welzl (1992) introduced the RandomFacet algorithm for solving linear programs, and more generally LP-type problems, in expected subexponential time. Halman (2007) showed that 2TBSGs are LP-type problems. Hence the RandomFacet algorithm can be used to solve 2TBSGs in expected time 2^{O(√(n log m))}. This is the best known bound expressed only in terms of n and m.


Overview of problems

[Diagram: a hierarchy of problems, from abstract to concrete: LP-type problems; turn-based stochastic games (2½ players); mean payoff games (2 players); parity games (2 players); linear programming; Markov decision processes (1½ players); deterministic MDPs (1 player).]


Concluding remarks

We have shown that StrategyIteration with Howard's update rule performs O( (m/(1−γ)) log(n/(1−γ)) ) iterations.

The best bound expressed only in terms of n and m is O(2^n/n), by Mansour and Singh (1999), for m = 2n. This is true even for deterministic MDPs.

The exponential lower bounds by Friedmann (2009) and Fearnley (2010) do not hold for deterministic MDPs. Here the best lower bound is Ω(n^2), by Hansen and Zwick (2010).

Open problem: close the gap. Hansen and Zwick (2010) conjecture that the number of iterations is at most m for deterministic MDPs.

Hansen and Zwick conjecture that F_{n+2}, the (n+2)-nd Fibonacci number, is an upper bound on the number of iterations performed by StrategyIteration with Howard's update rule.


The end

Thank you for listening!