Markov Decision Processes
AIMA: 17.1, 17.2 (excluding 17.2.3), 17.3

From utility to optimal policy

• The utility function U(s) allows the agent to select the action that maximizes the expected utility of the subsequent state:

    \pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

The Bellman equation

• If the utility of a state is the expected sum of discounted rewards from that point onward, then there is a direct relationship between the utility of a state and the utilities of its neighbors: the utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

  This is the Bellman equation.

The value iteration algorithm

• For a problem with n states there are n Bellman equations in n unknowns; because of the max operator, however, the equations are NOT linear.
• Instead of solving them directly, turn the Bellman equation into an update rule (the Bellman update):

    U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

• Start with random U(s) and apply the update to every state, iteratively.
• Guaranteed to converge to the unique solution of the Bellman equations.
• Demo: http://people.cs.ubc.ca/~poole/demos/mdp/vi.html
• A runnable sketch of this algorithm appears at the end of these notes.

Policy iteration algorithm

• It is possible to get an optimal policy even when the utility function estimate is inaccurate.
• If one action is clearly better than all others, then the exact magnitudes of the utilities of the states involved need not be precise.

    Value iteration:   compute utilities of states, then compute the optimal policy from them
    Policy iteration:  compute utilities of states for a given policy, then compute a policy for the given state utilities

• Policy iteration alternates two steps. Policy improvement computes the greedy policy for the current utilities:

    \pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

• Policy evaluation computes the utilities of states under the fixed policy \pi_i; with the action fixed, the max disappears and the Bellman equation becomes a linear equation:

    U_i(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')

Policy evaluation

• For a problem with n states, policy evaluation is a system of n linear equations in n unknowns, solvable exactly in O(n^3) time; an iterative scheme can also be used (see the policy iteration sketch at the end of these notes).

Summary

• Markov decision processes
• Utility of state sequences
• Utility of states
• Value iteration algorithm
• Policy iteration algorithm
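
Code sketch: value iteration

The following is a minimal Python sketch of value iteration, intended only to make the Bellman update and the argmax policy rule concrete. The 3-state transition model, the rewards, and all names (P, R, GAMMA, EPSILON, value_iteration, greedy_policy) are illustrative assumptions, not part of the slides.

```python
# Hypothetical 3-state, 2-action MDP.  P[s][a] lists (probability, next_state)
# pairs; R[s] is the immediate reward for being in state s.
GAMMA = 0.9      # discount factor
EPSILON = 1e-6   # convergence threshold

P = {
    0: {'stay': [(0.9, 0), (0.1, 1)], 'go': [(1.0, 1)]},
    1: {'stay': [(0.9, 1), (0.1, 2)], 'go': [(1.0, 2)]},
    2: {'stay': [(1.0, 2)],           'go': [(1.0, 0)]},
}
R = {0: 0.0, 1: 0.0, 2: 1.0}

def value_iteration(P, R, gamma=GAMMA, epsilon=EPSILON):
    """Apply the Bellman update
    U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s)
    to every state until the largest change falls below epsilon."""
    U = {s: 0.0 for s in P}  # any initial values converge
    while True:
        U_next = {
            s: R[s] + gamma * max(sum(p * U[s2] for p, s2 in P[s][a]) for a in P[s])
            for s in P
        }
        if max(abs(U_next[s] - U[s]) for s in P) < epsilon:
            return U_next
        U = U_next

def greedy_policy(P, U):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) U(s') -- the first equation above."""
    return {s: max(P[s], key=lambda a: sum(p * U[s2] for p, s2 in P[s][a])) for s in P}

U = value_iteration(P, R)
print(U)                    # converged utilities
print(greedy_policy(P, U))  # optimal policy extracted from them
```

Stopping when the largest per-state change is below a small epsilon is a common practical criterion; the slides' convergence guarantee holds for the exact fixed point, which the iterates approach geometrically at rate gamma.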
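Code sketch: policy iteration

A matching sketch of policy iteration, assuming the model is given in matrix form: one n x n transition matrix per action and a length-n reward vector (again, the tiny model and all names are illustrative). Policy evaluation is done exactly by solving the linear system (I - gamma * P_pi) U = R with numpy, which is the O(n^3) route mentioned above; a few Bellman-style sweeps with the policy's action held fixed would give the iterative alternative.

```python
import numpy as np

GAMMA = 0.9
N = 3  # number of states

# Hypothetical transition matrices: P[a][s, s2] = P(s2 | s, a).
P = {
    'stay': np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.9, 0.1],
                      [0.0, 0.0, 1.0]]),
    'go':   np.array([[0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]]),
}
R = np.array([0.0, 0.0, 1.0])

def evaluate(pi):
    """Policy evaluation: with the action fixed to pi(s), solve the linear
    system U = R + gamma * P_pi U, i.e. (I - gamma * P_pi) U = R."""
    P_pi = np.array([P[pi[s]][s] for s in range(N)])  # row s follows action pi[s]
    return np.linalg.solve(np.eye(N) - GAMMA * P_pi, R)

def policy_iteration():
    pi = ['stay'] * N  # arbitrary initial policy
    while True:
        U = evaluate(pi)
        # Policy improvement: one-step greedy lookahead on the current U.
        new_pi = [max(P, key=lambda a: P[a][s] @ U) for s in range(N)]
        if new_pi == pi:  # policy unchanged: it is optimal
            return pi, U
        pi = new_pi

pi, U = policy_iteration()
print(pi, U)
```

Because R(s) here does not depend on the action taken, the improvement step only needs argmax_a sum_{s'} P(s'|s,a) U(s'), matching the improvement equation in the policy iteration slide.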