Improved Expected Running Time for MDP Planning

Shivaram Kalyanakrishnan (1), Neeldhara Misra (2), Aditya Gopalan (3)
1. Department of Computer Science and Engineering, Indian Institute of Technology Bombay
2. Department of Computer Science and Automation, Indian Institute of Science
3. Department of Electrical Communication Engineering, Indian Institute of Science
June 2015

Overview

1. MDP Planning
2. Solution strategies
   Linear programming
   Value iteration
   Policy iteration
   Our contribution: Planning by Guessing and Policy Improvement (PGPI)
3. PGPI algorithm
   A total order on the set of policies
   Guessing game
   Algorithm
4. Discussion

MDP Planning

Markov Decision Problem: a general abstraction of sequential decision making.

An MDP comprises a tuple (S, A, R, T), where
   S is a set of states (with |S| = n),
   A is a set of actions (with |A| = k),
   R(s, a) is a bounded real number, ∀s ∈ S, ∀a ∈ A, and
   T(s, a) is a probability distribution over S, ∀s ∈ S, ∀a ∈ A.

A policy π : S → A specifies an action from each state. The value of a policy π from state s is

   V^π(s) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_t = π(s_t), t = 0, 1, 2, … ],

where γ ∈ [0, 1) is a discount factor.

Planning problem: given S, A, R, T, and γ, find a policy π* from the set of all policies Π such that

   V^{π*}(s) ≥ V^π(s), ∀s ∈ S, ∀π ∈ Π.
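The code sketches interspersed below are illustrative additions, not part of the talk. They assume a small hypothetical MDP encoded with numpy arrays: R as an n×k reward matrix and T as an n×k×n array of transition probabilities.

```python
import numpy as np

# Hypothetical toy MDP with n = 3 states and k = 2 actions, used by the
# sketches that follow.  R[s, a] is the bounded reward for taking action a
# in state s; T[s, a] is a probability distribution over next states s'.
n, k = 3, 2
gamma = 0.9

R = np.array([[ 0.0, 1.0],
              [ 2.0, 0.0],
              [-1.0, 3.0]])

T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]]])

assert np.allclose(T.sum(axis=2), 1.0)   # each T[s, a] sums to 1

# A deterministic policy pi : S -> A, stored as an action index per state.
pi = np.array([0, 1, 1])
```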
Linear Programming

The optimal value function V* (defined as V^{π*}) is the unique solution of Bellman's Optimality Equations: ∀s ∈ S,

   V*(s) = max_{a∈A} [ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') ].

V* can be obtained by solving an equivalent linear program:

   minimise    Σ_{s∈S} V(s)
   subject to  V(s) ≥ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s'),  ∀s ∈ S, ∀a ∈ A.

This LP has n variables and nk constraints (or a dual with nk variables and n constraints). Solution time: poly(n, k, B), where B is the number of bits used to represent the MDP.

Value Iteration

Classical dynamic programming approach.

   V_0 ← arbitrary, element-wise bounded, n-length vector. t ← 0.
   Repeat:
      For s ∈ S: V_{t+1}(s) ← max_{a∈A} [ R(s, a) + γ Σ_{s'∈S} T(s, a, s') V_t(s') ].
      t ← t + 1.
   Until V_t = V_{t−1} = V* (up to machine precision).

Convergence to V* is guaranteed by a max-norm contraction argument. Number of iterations: poly(n, k, B, 1/(1−γ)).
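A minimal sketch of the value-iteration loop above, assuming the toy arrays R, T, and gamma from the earlier sketch; the stopping rule uses a small tolerance in place of exact machine-precision equality.

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-10):
    """Iterate V_{t+1}(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V_t(s') ]."""
    V = np.zeros(R.shape[0])                 # arbitrary bounded initial vector
    while True:
        Q = R + gamma * (T @ V)              # Q[s, a] = R[s, a] + gamma * E[V(s') | s, a]
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)  # converged values and a greedy policy
        V = V_next

# Usage on the toy MDP: V_star, pi_star = value_iteration(R, T, gamma)
```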
Bellman's Equations and Policy Evaluation

Recall that V^π(s) = E[ r_1 + γ r_2 + γ² r_3 + … | s_0 = s, a_i = π(s_i) ].

Bellman's Equations (∀s ∈ S):

   V^π(s) = R(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V^π(s').

V^π is called the value function of π.

Define, ∀s ∈ S, ∀a ∈ A:

   Q^π(s, a) = R(s, a) + γ Σ_{s'∈S} T(s, a, s') V^π(s').

Q^π is called the action value function of π. Note that V^π(s) = Q^π(s, π(s)).

The variables in Bellman's equations are the V^π(s): |S| linear equations in |S| unknowns. Thus, given S, A, R, T, γ, and a fixed policy π, we can solve Bellman's equations efficiently to obtain V^π(s) and Q^π(s, a) for all s ∈ S, a ∈ A.
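Policy evaluation as described above is just a linear solve; a sketch, again assuming the toy numpy arrays and the R(s, a) reward convention used throughout.

```python
import numpy as np

def evaluate_policy(pi, R, T, gamma):
    """Solve (I - gamma * T_pi) V = R_pi for V^pi, then read off Q^pi."""
    n = R.shape[0]
    T_pi = T[np.arange(n), pi]      # T_pi[s, s'] = T(s, pi(s), s')
    R_pi = R[np.arange(n), pi]      # R_pi[s]     = R(s, pi(s))
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    Q = R + gamma * (T @ V)         # Q^pi(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V^pi(s')
    return V, Q
```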
Policy Iteration

For a given policy π, define

   I(π) = { s ∈ S : Q^π(s, π(s)) < max_{a∈A} Q^π(s, a) }.

Q^π and I(π) are easily derived from V^π (policy evaluation). I(π) = ∅ iff π is an optimal policy.

Assume I(π) ≠ ∅. Let C(π) be an arbitrarily "chosen" non-empty subset of I(π), and define a policy π′ as follows:

   π′(s) = argmax_{a∈A} Q^π(s, a)   if s ∈ C(π),
   π′(s) = π(s)                     otherwise.

It can be shown that
   (1) ∀s ∈ S: V^{π′}(s) ≥ V^π(s), and
   (2) ∃s ∈ S: V^{π′}(s) > V^π(s).   [Policy improvement]

The policy iteration algorithm:

   π_0 ← arbitrary policy. t ← 0.
   Repeat:
      Evaluate π_t; derive I(π_t).
      If I(π_t) ≠ ∅, select a non-empty C(π_t) ⊆ I(π_t) and improve π_t to π_{t+1}.
      t ← t + 1.
   Until I(π_{t−1}) = ∅.

Howard's Policy Iteration: C(π) = I(π). Number of iterations: O(k^n / n).

Mansour and Singh's Randomised Policy Iteration: C(π) chosen uniformly at random from among the non-empty subsets of I(π). Expected number of iterations: O(2^(0.78n)) for k = 2; O( ((1 + 2/log(k)) · k/2)^n ) for general k.

References: Howard (1960), Mansour and Singh (1999), Hollanders et al. (2014).

Note that these bounds do not depend on B and γ!
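A sketch of the policy-iteration loop, assuming evaluate_policy from the previous sketch. Setting randomised=False gives Howard's rule C(π) = I(π); randomised=True draws C(π) uniformly at random from the non-empty subsets of I(π) by rejection sampling, one way to realise Mansour and Singh's rule.

```python
import numpy as np

def policy_iteration(R, T, gamma, randomised=False, rng=None):
    """Policy iteration with either Howard's or a randomised switching rule."""
    rng = rng or np.random.default_rng()
    n, k = R.shape
    pi = np.zeros(n, dtype=int)                   # arbitrary initial policy
    while True:
        V, Q = evaluate_policy(pi, R, T, gamma)
        # I(pi): states where some action beats pi(s) (tolerance guards rounding noise)
        I_pi = np.flatnonzero(Q.max(axis=1) > Q[np.arange(n), pi] + 1e-12)
        if I_pi.size == 0:
            return pi, V                          # I(pi) empty  =>  pi is optimal
        if randomised:
            C = I_pi[rng.random(I_pi.size) < 0.5]
            while C.size == 0:                    # redraw until the subset is non-empty
                C = I_pi[rng.random(I_pi.size) < 0.5]
        else:
            C = I_pi                              # Howard: switch every improvable state
        pi = pi.copy()
        pi[C] = Q[C].argmax(axis=1)               # policy improvement on C(pi)
```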
Our Contribution

   Algorithm            Computational complexity
   Linear Programming   poly(n, k, B)   ["subexponential" bounds in (n, k) exist!]
   Value Iteration      poly(n, k, B, 1/(1−γ))
   Policy Iteration     poly(n) · O( ((1 + 2/log(k)) · k/2)^n )   (expected)
   PGPI                 poly(n) · O( k^(n/2) )   (expected)

PGPI = Planning by Guessing and Policy Improvement. It is a randomised algorithm. Key ingredient: a total order on the set of policies. The analysis involves basic probability and counting. Even tighter bounds follow by combining PGPI with Randomised Policy Iteration!

A Total Order on the Set of Policies

For π ∈ Π, define V(π) = Σ_{s∈S} V^π(s).

Let L be an arbitrary total ordering on policies for tie-breaking, for example the lexicographic ordering.

Total order ≻: for π_1, π_2 ∈ Π, define π_1 ≻ π_2 iff V(π_1) > V(π_2), or V(π_1) = V(π_2) and π_1 L π_2.

Observe that if policy improvement applied to π yields π′, then π′ ≻ π:

   ∀s ∈ S: V^{π′}(s) ≥ V^π(s) and ∃s ∈ S: V^{π′}(s) > V^π(s)  ⟹  V(π′) > V(π)  ⟹  π′ ≻ π.
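A small sketch of the order ≻, assuming evaluate_policy from above; the tie-breaking order L is instantiated here as lexicographic comparison of the action arrays. Exact float equality is used for brevity; a tolerance would be safer in practice.

```python
def better(pi1, pi2, R, T, gamma):
    """Return True iff pi1 ≻ pi2: higher V(pi) = sum_s V^pi(s), ties broken by a fixed order L."""
    v1 = evaluate_policy(pi1, R, T, gamma)[0].sum()
    v2 = evaluate_policy(pi2, R, T, gamma)[0].sum()
    if v1 != v2:
        return v1 > v2
    return tuple(pi1) > tuple(pi2)    # lexicographic tie-breaking plays the role of L
```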
Guessing Game

[Figure: a ladder of levels 1, 2, 3, …, N. An INCREMENT operation moves from level i to level i+1; a GUESS operation samples a level uniformly at random from {1, …, N}.]

How many operations are needed to reach N?

Pick the best of √N guesses, and then increment up to N. The best of √N uniform guesses is expected to land at level N − O(√N), so the remaining increments also cost O(√N) in expectation.

Expected number of operations: O(√N).
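A quick Monte Carlo check of this strategy, under the assumption that each GUESS is uniform over {1, …, N}: take the best of ⌈√N⌉ guesses, then increment the rest of the way. This simulation is an illustration added here, not part of the talk.

```python
import numpy as np

def guessing_game_cost(N, trials=2_000, rng=None):
    """Average number of operations (guesses + increments) to reach level N."""
    rng = rng or np.random.default_rng(0)
    m = int(np.ceil(np.sqrt(N)))
    best = rng.integers(1, N + 1, size=(trials, m)).max(axis=1)  # best of m uniform guesses
    return m + (N - best).mean()                                 # m guesses + remaining increments

for N in (100, 10_000, 1_000_000):
    print(N, round(guessing_game_cost(N), 1))   # grows roughly like 2 * sqrt(N)
```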
PGPI Algorithm

   π ← arbitrary policy.
   Repeat k^(αn) times:
      Draw π′ from Π uniformly at random. If π′ ≻ π, then π ← π′.
   While π is not optimal:
      π ← PolicyImprovement(π).

With α = 0.5, the expected number of iterations is O( k^(n/2) ).

With k = 2, α = 0.46, and randomised policy improvement, the expected number of iterations is O( 2^(0.46n) ).
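A sketch of the PGPI loop as stated above, assuming the better and evaluate_policy helpers from the earlier sketches; the policy-improvement phase here uses Howard-style switching, one admissible instantiation of PolicyImprovement.

```python
import numpy as np

def pgpi(R, T, gamma, alpha=0.5, rng=None):
    """Planning by Guessing and Policy Improvement: guess k^(alpha*n) policies, then improve."""
    rng = rng or np.random.default_rng()
    n, k = R.shape
    pi = rng.integers(0, k, size=n)                      # arbitrary initial policy
    for _ in range(int(np.ceil(k ** (alpha * n)))):      # guessing phase
        cand = rng.integers(0, k, size=n)                # draw pi' uniformly from Pi
        if better(cand, pi, R, T, gamma):                # keep the better policy under ≻
            pi = cand
    while True:                                          # policy-improvement phase
        V, Q = evaluate_policy(pi, R, T, gamma)
        I_pi = np.flatnonzero(Q.max(axis=1) > Q[np.arange(n), pi] + 1e-12)
        if I_pi.size == 0:
            return pi, V                                 # no improvable states: pi is optimal
        pi = pi.copy()
        pi[I_pi] = Q[I_pi].argmax(axis=1)                # Howard-style improvement step
```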
Discussion

PGPI: an improved complexity bound solely in terms of n and k.

Policy Iteration : MDP Planning :: Simplex : Linear Programming? Connections?

PGPI: no favourable experimental results yet!
   Policy Iteration (Howard; Mansour and Singh) is very quick on "typical" MDPs.
   Yet to find MDPs on which PGPI dominates Policy Iteration.
   Currently known "lower bound" MDPs (Fearnley, 2010) are not suitable.

References

R. A. Howard, 1960. Dynamic Programming and Markov Processes. MIT Press.
Yishay Mansour and Satinder Singh, 1999. On the Complexity of Policy Iteration. In Proc. UAI 1999, pp. 401–408. AUAI.
John Fearnley, 2010. Exponential Lower Bounds for Policy Iteration. In Proc. ICALP 2010, pp. 551–562. Springer.
Romain Hollanders, Balázs Gerencsér, Jean-Charles Delvenne, and Raphaël M. Jungers, 2014. About upper bounds on the complexity of Policy Iteration. http://perso.uclouvain.be/romain.hollanders/docs/NewUpperBoundForPI.pdf

Thank you!