Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor

Thomas Dueholm Hansen (1,2), Peter Bro Miltersen (1), Uri Zwick (2)
1 Center for the Theory of Interactive Computation, Department of Computer Science, Aarhus University, Denmark
2 School of Computer Science, Tel Aviv University, Israel

January 29, 2013


The focus of this talk

Worst-case analysis of a classical algorithm:
- Markov decision processes (MDPs), introduced by Bellman (1957), are a special case of stochastic games (SGs), introduced by Shapley (1953).
- Howard's policy iteration algorithm (1960) solves MDPs.
- 2-player turn-based stochastic games (2TBSGs) are a generalization of MDPs and a special case of SGs.
- Strategy iteration, an extension of policy iteration, solves 2TBSGs.


Applications

- Solving Markov decision processes is an important problem in operations research and machine learning; it is, for instance, used to solve the dairy cow replacement problem.
- Solving parity games, a special case of 2TBSGs, is equivalent to µ-calculus model checking, an important problem in logic and formal methods.


The major open problems

- MDPs can be solved in polynomial time using linear programming, but it remains open how to solve MDPs in strongly polynomial time.
- Solving 2TBSGs, or even parity games, in polynomial time remains open. The corresponding decision problem is in NP ∩ coNP.
- Gärtner and Rüst (2005), and Jurdziński and Savani (2008): 2TBSGs can be solved via P-matrix linear complementarity problems.


Markov chains

[Figure: a 4-state Markov chain with its transition matrix P; a table traces the distribution b^T P^k for k = 0, 1, 2, ...]

An n-state Markov chain is defined by an n × n stochastic matrix P, with P_{i,j} being the probability of making a transition from state i to state j, i.e., Σ_j P_{i,j} = 1.

Let b ∈ R^n define a probability distribution for picking an initial state. b^T P^k is the vector of probabilities of being in each state after k transitions.
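To make the b^T P^k picture concrete, here is a minimal numpy sketch (not from the talk) that pushes an initial distribution through a small chain; the 4-state matrix below is a made-up example, not the chain from the slide's figure.

```python
import numpy as np

# Hypothetical 4-state stochastic matrix (rows sum to 1); not the chain from the slide.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
])

b = np.array([1.0, 0.0, 0.0, 0.0])  # start in state 1 with probability 1

# Distribution over states after k transitions: b^T P^k.
dist = b
for k in range(6):
    print(k, dist)
    dist = dist @ P
```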
Markov chains with costs

We refer to the act of leaving a state i as an action, and we associate each action with a cost c_i, described by a vector c ∈ R^n. Let γ < 1 be a discount factor. We are interested in the expected total discounted cost for some initial distribution b:

    Σ_{k=0}^{∞} b^T (γP)^k c

The value of state i is the expected total discounted cost when starting in state i with probability 1, i.e., b = e_i.
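A minimal numerical sketch of the quantity just defined, again with made-up data rather than the slides' example: it compares a truncated version of the sum with the closed form b^T (I − γP)^{-1} c, which the geometric series converges to since γ < 1. The i-th entry of the solved vector is the value of state i (the b = e_i case).

```python
import numpy as np

# Hypothetical data: a 4-state chain, per-state costs, discount factor, start distribution.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
])
c = np.array([2.0, -1.0, 3.0, 0.5])
gamma = 0.9
b = np.array([1.0, 0.0, 0.0, 0.0])

# Truncated version of the infinite sum  sum_k b^T (gamma P)^k c.
truncated = sum(b @ np.linalg.matrix_power(gamma * P, k) @ c for k in range(200))

# Closed form: the geometric series sums to (I - gamma P)^{-1}.
values = np.linalg.solve(np.eye(4) - gamma * P, c)  # values[i] = value of state i
closed = b @ values

print(truncated, closed)  # essentially equal
```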
Markov decision processes

[Figure: an example MDP with three states and six actions a1, ..., a6; each action has a cost and a probability distribution over successor states.]

A Markov decision process consists of a set of n states S, each state i ∈ S being associated with a non-empty set of actions A_i. Each action a is associated with a cost c_a and a probability distribution P_a ∈ R^{1×n} such that P_{a,j} is the probability of moving to state j when using action a.

A policy π is a choice of an action from each state. A policy π induces a Markov chain with costs, and val_π(i) = Σ_{k=0}^{∞} e_i^T (γP_π)^k c_π denotes the corresponding value of state i.

A policy π* is optimal iff it minimizes the values of all states:

    ∀π, i : val_{π*}(i) ≤ val_π(i)

Shapley (1953), Bellman (1957): There always exists an optimal policy. Solving an MDP means finding an optimal policy.


Values

The values can be found as the unique solution to the system of linear equations:

    ∀i : val_π(i) = c_{π(i)} + γ Σ_j P_{π(i),j} · val_π(j)

For the example in the figure: val_π(1) = 7 + γ ( (1/3) val_π(2) + (2/3) val_π(3) ).


Summing values

Since an optimal policy simultaneously minimizes the values of all states, it minimizes the sum of the values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π,    where e^T = (1, 1, ..., 1).

Let x_a be the number of times action a is used, i.e., the (discounted) sum of a column in the table of e^T (γP_π)^k. Then:

    Σ_{i∈S} val_π(i) = Σ_{i∈S} Σ_{a∈A_i} c_a x_a
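As a sketch of how the "Values" system above can be solved in practice, the snippet below evaluates one fixed policy of a small hypothetical MDP by solving (I − γP_π) val_π = c_π. The state/action data is invented for illustration (only the cost 7 and the probabilities 1/3, 2/3 echo the slide's example); it is not taken from the talk.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP: actions[i] is a list of (cost, transition-row) pairs for state i.
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],   # state 0
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],     # state 1
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])], # state 2
]

def evaluate(policy):
    """Solve (I - gamma * P_pi) v = c_pi for the values of a fixed policy."""
    n = len(actions)
    c = np.array([actions[i][policy[i]][0] for i in range(n)])
    P = np.array([actions[i][policy[i]][1] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P, c)

print(evaluate([0, 1, 0]))  # values of one particular policy
```

Evaluating a policy this way is the step each iteration of policy iteration performs before looking for improving switches.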
Linear program for solving MDPs

The following linear program simultaneously minimizes the values of all states:

    minimize    Σ_{i∈S} Σ_{a∈A_i} c_a x_a
    s.t.        ∀i ∈ S :            Σ_{a∈A_i} x_a = 1 + γ Σ_{j∈S} Σ_{a∈A_j} P_{a,i} x_a
                ∀i ∈ S, ∀a ∈ A_i :  x_a ≥ 0

The constraints ensure "flow conservation". In the example from the figure: x_1 = 1, x_2 = 6γ, x_3 = 4, x_4 = 2, and x_1 + x_2 = 1 + γ(x_3 + x_4).


Basic feasible solutions and policies

In a basic feasible solution exactly one variable (action) is non-zero for each state. Every basic feasible solution corresponds to a policy π. In the example: x_1 = 0, x_2 = 1 + 6γ, x_3 = 4, x_4 = 2, and x_1 + x_2 = 1 + γ(x_3 + x_4).


Improving switches and reduced costs

An action a originating at state i is an improving switch w.r.t. a policy π if making the switch to a improves the value of i:

    val_{π[a]}(i) < val_π(i)

For every policy π, state i ∈ S, and action a ∈ A_i, we define the reduced cost of a w.r.t. π as:

    c̄_a^π = c_a + γ Σ_j P_{a,j} · val_π(j) − val_π(i).

a is an improving switch w.r.t. π iff c̄_a^π < 0.


Multiple joint improving switches and optimality

Lemma (Howard (1960)). Let π' be obtained from π by jointly performing any non-empty set of improving switches. Then for all states i, val_{π'}(i) ≤ val_π(i), with strict inequality for at least one state.

Lemma (Howard (1960)). A policy π is optimal iff there are no improving switches.


Policy iteration

    Function PolicyIteration(π)
        while ∃ improving switch w.r.t. π do
            Update π by performing improving switches
        return π

Howard's algorithm (1960): For each state i, pick the improving switch with the smallest reduced cost:

    π(i) ← argmin_{a∈A_i} c̄_a^π

Dantzig's rule (1947): Perform the single improving switch with the most negative reduced cost.
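A compact sketch of policy iteration with Howard's rule on the same hypothetical MDP encoding as the evaluation sketch above; it is an illustration of the rule, not the authors' implementation.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP, encoded as before: actions[i] = list of (cost, transition-row) pairs.
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def evaluate(policy):
    # Values of the fixed policy: solve (I - gamma * P_pi) v = c_pi.
    c = np.array([actions[i][policy[i]][0] for i in range(n)])
    P = np.array([actions[i][policy[i]][1] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P, c)

def howard(policy):
    while True:
        v = evaluate(policy)
        new_policy = list(policy)
        for i, acts in enumerate(actions):
            # Reduced cost of action a at state i: c_a + gamma * P_a . v - v_i.
            reduced = [c + gamma * np.dot(p, v) - v[i] for (c, p) in acts]
            best = int(np.argmin(reduced))
            if reduced[best] < -1e-9:       # Howard's rule: switch to the argmin
                new_policy[i] = best        # whenever it is an improving switch
        if new_policy == policy:            # no improving switch: policy is optimal
            return policy, v
        policy = new_policy

print(howard([0, 0, 0]))
```

With Dantzig's rule one would instead perform only the single switch with the most negative reduced cost across all states.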
Non-discounted MDPs

Liggett and Lippman (1969) showed that for any MDP G there exists a discount factor γ_G < 1 such that the same policies are optimal for all discount factors γ' ∈ [γ_G, 1). Andersson and Miltersen (2009) showed that γ_G can be described with a number of bits that is polynomial in the bit complexity of G. A 2TBSG G is called non-discounted if it implicitly uses the discount factor γ_G.


Upper bounds for policy iteration

The number of iterations performed by Howard's policy iteration algorithm for an MDP with n states, m actions, discount factor γ < 1, and bit complexity B is at most:

    Meister and Holzbaur (1986):   O( (B/(1−γ)) log(1/(1−γ)) )
    Mansour and Singh (1999)*:     O( 2^n / n )
    Ye (2011):                     O( (mn/(1−γ)) log(n/(1−γ)) )
    We prove:                      O( (m/(1−γ)) log(n/(1−γ)) )

    (* assuming there are two actions per state)

Ye (2011): Policy iteration with Dantzig's rule uses at most O( (mn/(1−γ)) log(n/(1−γ)) ) iterations.

Post and Ye (2012): Policy iteration with Dantzig's rule uses at most O(n^3 m^2 log^2 n) iterations to solve deterministic MDPs.


Lower bounds for policy iteration

Friedmann (2009), Fearnley (2010): There exists a family of non-discounted MDPs for which the number of iterations performed by Howard's policy iteration algorithm is 2^{Ω(n)}.

Hansen and Zwick (2010): For every discount factor γ < 1, there exists a family of deterministic MDPs for which the number of iterations performed by Howard's policy iteration algorithm is Ω(n^2).


Extension to games

2-player turn-based stochastic games (2TBSGs) are a natural generalization of MDPs. The strategy iteration algorithm is a generalization of policy iteration that solves 2TBSGs.

The number of iterations performed by the strategy iteration algorithm using Howard's rule is at most:

    Littman (1996):               O( (B/(1−γ)) log(1/(1−γ)) )
    Mansour and Singh (1999)*:    O( 2^n / n )
    We prove:                     O( (m/(1−γ)) log(n/(1−γ)) )

    (* assuming there are two actions per state)

We provide the first known strongly polynomial time bound for an algorithm for solving 2TBSGs with a fixed discount factor.
2-player turn-based stochastic games (2TBSGs)

[Figure: the example MDP again, with its states partitioned between the two players.]

A 2TBSG is an MDP where the set of states is partitioned into two sets: S_1 ∪ S_2 = S. S_1 is controlled by player 1, the minimizer; S_2 is controlled by player 2, the maximizer.

A strategy π_1 (or π_2) for player 1 (or player 2) is a choice of an action for each state i ∈ S_1 (or i ∈ S_2).

A strategy profile π = (π_1, π_2) is a pair of strategies, defining a Markov chain with costs. We again use val_{π_1,π_2}(i) to denote the corresponding value of state i.


2TBSGs can model backgammon.


Optimality for 2TBSGs

Let π_1 be a strategy for player 1. A strategy π_2 for player 2 is an optimal counter strategy against π_1 iff:

    ∀π_2', i : val_{π_1,π_2}(i) ≥ val_{π_1,π_2'}(i)

I.e., π_2 is optimal, when maximizing, for the MDP resulting from fixing the choices of player 1 to π_1.

π_1* is optimal for player 1 iff:

    ∀π_1, i : max_{π_2} val_{π_1*,π_2}(i) ≤ max_{π_2} val_{π_1,π_2}(i)

Shapley (1953): Optimal strategies always exist.
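As a hedged sketch of the remark above: fixing player 1's choices turns the game into an MDP for player 2, which can then be solved for the maximizer. The snippet uses plain value iteration as a simple stand-in for an exact MDP solver, and the game data (which states belong to which player, and the actions) is hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2TBSG: states 0 and 1 belong to player 1 (minimizer),
# state 2 belongs to player 2 (maximizer); same action encoding as before.
player = [1, 1, 2]
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def optimal_counter_strategy(pi1):
    """Best response of the maximizer against a fixed player-1 strategy pi1
    (a dict state -> action index), computed by value iteration on the MDP
    obtained by fixing player 1's choices."""
    v = np.zeros(n)
    for _ in range(2000):                     # plenty of iterations for gamma = 0.9
        w = np.empty(n)
        for i in range(n):
            if player[i] == 1:                # player 1's choice is fixed
                c, p = actions[i][pi1[i]]
                w[i] = c + gamma * np.dot(p, v)
            else:                             # player 2 maximizes
                w[i] = max(c + gamma * np.dot(p, v) for (c, p) in actions[i])
        v = w
    # Read off a maximizing action at each player-2 state.
    pi2 = {i: int(np.argmax([c + gamma * np.dot(p, v) for (c, p) in actions[i]]))
           for i in range(n) if player[i] == 2}
    return pi2, v

print(optimal_counter_strategy({0: 0, 1: 0}))
```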
Improving switches and reduced costs

An action a originating at a state i ∈ S_1 is an improving switch for player 1 w.r.t. (π_1, π_2) if making the switch to a improves the value of state i for player 1:

    val_{π_1[a],π_2}(i) < val_{π_1,π_2}(i)

For every strategy profile (π_1, π_2), state i ∈ S, and action a ∈ A_i, we define the reduced cost of a w.r.t. (π_1, π_2) as:

    c̄_a^{π_1,π_2} = c_a + γ Σ_j P_{a,j} · val_{π_1,π_2}(j) − val_{π_1,π_2}(i).

a is an improving switch for player 1 w.r.t. (π_1, π_2) iff c̄_a^{π_1,π_2} < 0.


Optimality condition

Optimal counter strategies, optimal strategies, and improving switches are defined analogously for player 2, with reversed inequalities.

- π_2 is an optimal counter strategy against π_1 iff player 2 has no improving switches in S_2.
- π_1 is an optimal counter strategy against π_2 iff player 1 has no improving switches in S_1.
- (π_1*, π_2*) is optimal iff neither player has an improving switch, i.e., (π_1*, π_2*) is a Nash equilibrium.

Note that the decision problem corresponding to solving 2TBSGs is in NP ∩ coNP, since an optimal strategy profile is a witness for both yes and no instances.


StrategyIteration (Rao et al. (1973))

We let π_2(π_1) be an optimal counter strategy against π_1.

    Function StrategyIteration(π_1)
        while ∃ improving switch for player 1 w.r.t. (π_1, π_2(π_1)) do
            Update π_1 by performing improving switches
        return (π_1, π_2(π_1))

Howard's algorithm can be naturally extended to 2TBSGs by choosing, for all i ∈ S_1:

    π_1(i) ← argmin_{a∈A_i} c̄_a^{π_1,π_2(π_1)}
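A rough sketch of the StrategyIteration loop on the same hypothetical game as before: the inner step computes the maximizer's counter values (here approximately, via value iteration, as a stand-in for an exact optimal counter strategy π_2(π_1)), and the outer step performs Howard-style improving switches for the minimizer.

```python
import numpy as np

gamma = 0.9
player = [1, 1, 2]                 # hypothetical game, as in the previous sketch
actions = [
    [(7.0, [0.0, 1/3, 2/3]), (-4.0, [1.0, 0.0, 0.0])],
    [(5.0, [0.0, 0.0, 1.0]), (1.0, [0.5, 0.5, 0.0])],
    [(2.0, [0.25, 0.5, 0.25]), (-10.0, [0.0, 0.0, 1.0])],
]
n = len(actions)

def counter_values(pi1):
    # Optimal counter values for player 2 against pi1 (value iteration).
    v = np.zeros(n)
    for _ in range(2000):
        v = np.array([
            actions[i][pi1[i]][0] + gamma * np.dot(actions[i][pi1[i]][1], v)
            if player[i] == 1
            else max(c + gamma * np.dot(p, v) for (c, p) in actions[i])
            for i in range(n)
        ])
    return v

def strategy_iteration(pi1):
    while True:
        v = counter_values(pi1)
        new_pi1 = dict(pi1)
        for i in range(n):
            if player[i] != 1:
                continue
            reduced = [c + gamma * np.dot(p, v) - v[i] for (c, p) in actions[i]]
            best = int(np.argmin(reduced))
            if reduced[best] < -1e-9:   # improving switch for the minimizer
                new_pi1[i] = best
        if new_pi1 == pi1:              # no improving switch: (pi1, pi2(pi1)) is optimal
            return pi1, v
        pi1 = new_pi1

print(strategy_iteration({0: 0, 1: 0}))
```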
Main observation

Theorem (Hansen, Miltersen, and Zwick (2011)). The strategy iteration algorithm with Howard's update rule terminates after O( (m/(1−γ)) log(n/(1−γ)) ) iterations.

Main observation: every O( (1/(1−γ)) log(n/(1−γ)) ) iterations, another action is eliminated, i.e., it will never be used again.

Let π* = (π_1*, π_2*) be optimal. The actions a with largest reduced cost c̄_a^{π*} w.r.t. π* are eliminated first.


Convergence towards the optimal solution

Lemma (Meister and Holzbaur (1986)). Let π^t be the strategy profile generated after t iterations when running the StrategyIteration algorithm starting from π^0. Then:

    ‖val_{π^t} − val_{π*}‖_∞ ≤ γ^t ‖val_{π^0} − val_{π*}‖_∞.

Proof sketch: The values generated by the StrategyIteration algorithm decrease at least as fast as the corresponding values generated by the ValueIteration algorithm, and ValueIteration is a contraction with Lipschitz constant γ.


When is an action eliminated?

Lemma. Let π^t = (π_1^t, π_2(π_1^t)) be the strategy profile generated after t iterations when running the StrategyIteration algorithm starting from π^0 = (π_1^0, π_2(π_1^0)). Let a = argmax_{a∈π_1^0} c̄_a^{π*}, and assume that a ∈ π_1^t. Then:

    ‖val_{π^t} − val_{π*}‖_1 ≥ ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1.

I.e., the action a does not appear after t iterations if:

    ‖val_{π^t} − val_{π*}‖_1 < ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1.

Combining the two lemmas gives:

    ((1−γ)/n) ‖val_{π^0} − val_{π*}‖_1 ≤ ‖val_{π^t} − val_{π*}‖_1
                                       ≤ n ‖val_{π^t} − val_{π*}‖_∞
                                       ≤ n γ^t ‖val_{π^0} − val_{π*}‖_∞
                                       ≤ n γ^t ‖val_{π^0} − val_{π*}‖_1

Hence, a can only appear after t iterations if:

    t ≤ log_{1/γ}( n^2/(1−γ) ) = O( (1/(1−γ)) log(n/(1−γ)) )


Flux vectors

Since an optimal policy for an MDP simultaneously minimizes the values of all states, it minimizes the sum of the values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π,    where e^T = (1, 1, ..., 1).

We define the flux vector x^π ∈ R^n such that x_i^π is the expected (discounted) number of times state i is visited, i.e., the discounted sum of a column in the table of e^T P_π^k:

    (x^π)^T = Σ_{k=0}^{∞} e^T (γP_π)^k
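A small numerical check of the flux vector just defined, with a hypothetical policy's cost vector and transition matrix: since Σ_k (γP_π)^k = (I − γP_π)^{-1}, the flux is obtained from one linear solve, and one can verify that e^T x^π = n/(1−γ) and that (x^π)^T c_π equals the sum of the values, as on the following slide.

```python
import numpy as np

gamma = 0.9

# Hypothetical fixed policy: its cost vector and transition matrix.
c_pi = np.array([7.0, 5.0, 2.0])
P_pi = np.array([
    [0.0, 1/3, 2/3],
    [0.0, 0.0, 1.0],
    [0.25, 0.5, 0.25],
])
n = len(c_pi)
e = np.ones(n)

# Flux vector: (x^pi)^T = e^T (I - gamma P_pi)^{-1}, i.e. x^pi solves (I - gamma P_pi)^T x = e.
x = np.linalg.solve((np.eye(n) - gamma * P_pi).T, e)

values = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)   # val_pi

print(e @ x, n / (1 - gamma))    # both equal n / (1 - gamma)
print(x @ c_pi, values.sum())    # (x^pi)^T c_pi equals the sum of the values
```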
Flux vectors (continued)

The flux vector provides an alternative way of summing values:

    Σ_{i∈S} val_π(i) = Σ_{k=0}^{∞} e^T (γP_π)^k c_π = (x^π)^T c_π,    where e^T = (1, 1, ..., 1).

Note that e^T x^π = n/(1−γ).

Lemma. Let π and π* be any two strategy profiles. Then:

    Σ_{i∈S} ( val_π(i) − val_{π*}(i) ) = ( (c̄^{π*})_π )^T x^π,

where (c̄^{π*})_π denotes the vector of reduced costs, w.r.t. π*, of the actions used by π.


A subexponential algorithm for 2TBSGs

Kalai (1992) and Matoušek, Sharir, and Welzl (1992) introduced the RandomFacet algorithm for solving linear programs, and more generally LP-type problems, in expected subexponential time. Halman (2007) showed that 2TBSGs are LP-type problems. Hence the RandomFacet algorithm can be used to solve 2TBSGs in expected time 2^{O(√(n log m))}. This is the best known bound expressed only in terms of n and m.


Overview of problems

[Diagram: a hierarchy of problems, from abstract to concrete: LP-type problems; turn-based stochastic games (2½ players); mean payoff games (2 players); parity games (2 players); linear programming; Markov decision processes (1½ players); deterministic MDPs (1 player).]


Concluding remarks

We have shown that StrategyIteration with Howard's update rule performs O( (m/(1−γ)) log(n/(1−γ)) ) iterations.

The best bound expressed only in terms of n and m is O(2^n/n), by Mansour and Singh (1999), for m = 2n. This is true even for deterministic MDPs.

The exponential lower bounds by Friedmann (2009) and Fearnley (2010) do not hold for deterministic MDPs. Here the best lower bound is Ω(n^2), by Hansen and Zwick (2010).

Open problem: close the gap. Hansen and Zwick (2010) conjecture that the number of iterations is at most m for deterministic MDPs.

Hansen and Zwick conjecture that F_{n+2}, the (n+2)-nd Fibonacci number, is an upper bound on the number of iterations performed by StrategyIteration with Howard's update rule.


The end

Thank you for listening!