Decision Theory: Markov Decision Processes
CPSC 322 Lecture 34
April 11, 2007
Textbook §12.5

Lecture Overview
1 Recap
2 Rewards and Policies
3 Value Iteration

Recap: Markov Decision Processes
An MDP is defined by:
- a set S of states
- a set A of actions
- P(S_{t+1} | S_t, A_t), which specifies the dynamics
- R(S_t, A_t, S_{t+1}), which specifies the reward
The agent receives a reward at each time step (rather than just a final reward). R(s, a, s') is the reward received when the agent is in state s, does action a, and ends up in state s'.

Decision Processes
A Markov decision process augments a stationary Markov chain with actions and values:
[Figure: dynamic decision network with states S0, S1, S2, S3, actions A0, A1, A2, and rewards R1, R2, R3]

Rewards and Values
Suppose the agent receives the sequence of rewards r_1, r_2, r_3, r_4, .... What value should be assigned to it?
- total reward: V = Σ_{i=1}^∞ r_i
- average reward: V = lim_{n→∞} (r_1 + · · · + r_n)/n
- discounted reward: V = Σ_{i=1}^∞ γ^{i−1} r_i, where γ is the discount factor, 0 ≤ γ ≤ 1

Policies
A stationary policy is a function π : S → A. Given a state s, π(s) specifies what action an agent following π will do.
An optimal policy is one with maximum expected value; we will focus on the case where value is defined as discounted reward.
For an MDP with stationary dynamics and rewards and an infinite or indefinite horizon, there is always an optimal stationary policy in this case.

Value of a Policy
Q^π(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following policy π.
V^π(s), where s is a state, is the expected value of following policy π in state s.
Q^π and V^π can be defined mutually recursively:
    V^π(s) = Q^π(s, π(s))
    Q^π(s, a) = Σ_{s'} P(s' | a, s) [r(s, a, s') + γ V^π(s')]

Value of the Optimal Policy
Q^*(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following the optimal policy.
V^*(s), where s is a state, is the expected value of following the optimal policy in state s.
Q^*, V^*, and the optimal policy π^* can be defined mutually recursively:
    Q^*(s, a) = Σ_{s'} P(s' | a, s) [r(s, a, s') + γ V^*(s')]
    V^*(s) = max_a Q^*(s, a)
    π^*(s) = arg max_a Q^*(s, a)
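The mutually recursive definitions above suggest a simple way to compute V^π for a fixed policy: start from an arbitrary estimate and repeatedly apply the two equations. The following is a minimal Python sketch of that idea (it is not from the lecture or the textbook); the function names and the nested-dict layout for P and R are assumptions made for illustration.

# A minimal sketch (not from the lecture) of evaluating a fixed stationary
# policy by repeatedly applying the recursive definitions
#   Q^pi(s, a) = sum_{s'} P(s'|a,s) [r(s,a,s') + gamma * V^pi(s')]
#   V^pi(s)    = Q^pi(s, pi(s))
# Assumed data layout: P[s][a] maps each successor s2 to P(s2|a,s), and
# R[s][a][s2] is the reward r(s,a,s2).

def q_value(s, a, V, P, R, gamma):
    """Expected value of doing action a in state s, then getting value V."""
    return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())

def evaluate_policy(policy, P, R, gamma, iters=1000):
    """Iterative policy evaluation: start from V = 0 and repeatedly apply
    V(s) <- Q(s, policy(s)) until the values (approximately) converge."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: q_value(s, policy[s], V, P, R, gamma) for s in P}
    return V

For γ < 1 each sweep is a contraction, so the iterates converge to V^π regardless of the starting estimate; a fixed iteration count is used here only to keep the sketch short.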
Value Iteration
Idea: given an estimate of the k-step lookahead value function, determine the (k + 1)-step lookahead value function.
- Set V_0 arbitrarily, e.g., to zeros.
- Compute Q_{i+1} and V_{i+1} from V_i:
    Q_{i+1}(s, a) = Σ_{s'} P(s' | a, s) [r(s, a, s') + γ V_i(s')]
    V_{i+1}(s) = max_a Q_{i+1}(s, a)
- Substituting the definition of Q_{i+1} into the second equation gives a single update equation for V:
    V_{i+1}(s) = max_a Σ_{s'} P(s' | a, s) [r(s, a, s') + γ V_i(s')]

Pseudocode for Value Iteration

procedure value_iteration(P, r, θ)
    inputs:
        P: state transition function specifying P(s' | a, s)
        r: reward function R(s, a, s')
        θ: a threshold, θ > 0
    returns:
        π[s]: approximately optimal policy
        V[s]: value function
    data structures:
        V_k[s]: a sequence of value functions
    begin
        for k = 1 : ∞
            for each state s
                V_k[s] = max_a Σ_{s'} P(s' | a, s) (R(s, a, s') + γ V_{k−1}[s'])
            if ∀s |V_k[s] − V_{k−1}[s]| < θ
                for each state s
                    π[s] = arg max_a Σ_{s'} P(s' | a, s) (R(s, a, s') + γ V_{k−1}[s'])
                return π, V_k
    end

Figure 12.13: Value Iteration for Markov Decision Processes, storing V (from the textbook, Chapter 12: Planning Under Uncertainty).
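The pseudocode translates almost directly into code. Below is a minimal Python sketch of value iteration under the same assumed data layout as the policy-evaluation sketch above (P[s][a] maps successors s2 to P(s2|a,s), and R[s][a][s2] = r(s,a,s2)); the names value_iteration, backup, gamma, and theta are illustrative, not the textbook's.

# One-step lookahead ("backup") value of doing a in s, bootstrapping from V.
def backup(s, a, V, P, R, gamma):
    return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())

def value_iteration(P, R, gamma, theta):
    """Return an approximately optimal policy and its value function."""
    states = list(P)
    V = {s: 0.0 for s in states}          # V_0 set arbitrarily (zeros)
    while True:
        # V_k[s] = max_a sum_{s'} P(s'|a,s) (R(s,a,s') + gamma * V_{k-1}[s'])
        V_new = {s: max(backup(s, a, V, P, R, gamma) for a in P[s])
                 for s in states}
        # Stop once every state's value has changed by less than theta
        if all(abs(V_new[s] - V[s]) < theta for s in states):
            # Extract a greedy policy, as in the final loop of the pseudocode
            pi = {s: max(P[s], key=lambda a: backup(s, a, V, P, R, gamma))
                  for s in states}
            return pi, V_new
        V = V_new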
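For concreteness, the sketch can be exercised on a tiny hand-made MDP. The two states, two actions, and all probabilities and rewards below are invented purely for illustration; they do not come from the lecture.

# A made-up two-state MDP: in state 'poor' the agent can 'work' (which is
# likely to lead to 'rich') or 'rest'; in 'rich' the only action is 'rest'.
P = {
    'poor': {'work': {'rich': 0.6, 'poor': 0.4}, 'rest': {'poor': 1.0}},
    'rich': {'rest': {'rich': 0.9, 'poor': 0.1}},
}
R = {
    'poor': {'work': {'rich': 10.0, 'poor': -1.0}, 'rest': {'poor': 0.0}},
    'rich': {'rest': {'rich': 5.0, 'poor': 0.0}},
}

pi, V = value_iteration(P, R, gamma=0.9, theta=1e-6)
print(pi)   # greedy policy, e.g. {'poor': 'work', 'rich': 'rest'}
print(V)    # converged value estimate for each state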