The value iteration algorithm

Markov Decision Processes
AIMA: 17.1, 17.2 (excluding 17.2.3), 17.3
From utility to optimal policy
• The utility function U(s) allows the agent to select the action that maximizes the expected utility of the subsequent state:

    \pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
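To make the rule concrete, here is a minimal Python sketch of greedy action selection from given utilities. The tabular representation and all names (P[s][a] as a dict of next-state probabilities, A[s] as the available actions) are illustrative assumptions, not from the slides:

```python
# Greedy action selection from utilities U(s) -- a sketch under an
# assumed tabular MDP representation: P[s][a] is a dict mapping next
# states s2 to probabilities P(s2 | s, a); U maps states to floats.

def expected_utility(s, a, P, U):
    """Expected utility of the successor state after doing a in s."""
    return sum(p * U[s2] for s2, p in P[s][a].items())

def greedy_action(s, P, U, A):
    """pi*(s): the action in A(s) maximizing expected successor utility."""
    return max(A[s], key=lambda a: expected_utility(s, a, P, U))
```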
The Bellman equation
• Now, if the utility of a state is the expected sum of discounted rewards from that point onwards, then there is a direct relationship between the utility of a state and the utilities of its neighbors:

    The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action.

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')    (the Bellman equation)
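As a small worked instance (the numbers are made up for illustration, though R(s) = -0.04 echoes the AIMA grid world): suppose \gamma = 1, R(s) = -0.04, and s has two actions, one reaching states of utility 0.9 and 0.7 with probabilities 0.8 and 0.2, the other reaching a utility-0.8 state with certainty. Then

    U(s) = -0.04 + 1 \cdot \max\{0.8 \cdot 0.9 + 0.2 \cdot 0.7,\; 1.0 \cdot 0.8\} = -0.04 + 0.86 = 0.82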
The value iteration algorithm
• The Bellman equation:

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

• For a problem with n states, there are n Bellman equations in n unknowns; however, the system is NOT linear (because of the max operator).
• Value iteration turns the Bellman equation into an update rule:

    U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

• Start with random U(s), update iteratively.
• Guaranteed to converge to the unique solution.
Demo:
http://people.cs.ubc.ca/~poole/demos/mdp/vi.html
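A minimal Python sketch of the update loop, reusing expected_utility from the earlier sketch and the same hypothetical tabular representation; the stopping test against a small threshold eps is an implementation choice, not from the slides:

```python
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    """Iterate the Bellman update until the utilities stop changing."""
    U = {s: 0.0 for s in S}                 # any initial U(s) works
    while True:
        U_new = {
            s: R[s] + gamma * max(expected_utility(s, a, P, U) for a in A[s])
            for s in S
        }
        delta = max(abs(U_new[s] - U[s]) for s in S)
        U = U_new
        if delta < eps:                     # converged to the fixed point
            return U
```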
Policy iteration algorithm
• It is possible to get an optimal policy even when the utility function estimate is inaccurate.
• If one action is clearly better than all the others, then the exact magnitudes of the utilities of the states involved need not be precise.
• Value iteration: compute the utilities of states, then compute the policy for the given state utilities.
• Policy iteration: alternate between computing the utilities of states for a given policy and computing the optimal policy for those utilities.
Policy iteration algorithm
• The Bellman equation (nonlinear, because of the max):

    U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

• Policy evaluation: with the policy \pi_i fixed, the max disappears and the equations become LINEAR:

    U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U(s')

• Policy improvement: compute the greedy policy for the resulting utilities:

    \pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
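A minimal sketch of the full loop, again under the hypothetical tabular representation used above and reusing expected_utility and greedy_action; policy evaluation is done iteratively here (the exact linear solve is sketched on the next slide):

```python
def policy_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    """Alternate policy evaluation and policy improvement until stable."""
    pi = {s: A[s][0] for s in S}            # arbitrary initial policy
    U = {s: 0.0 for s in S}
    while True:
        # Policy evaluation: iterate U(s) <- R(s) + gamma * E[U(s') | pi(s)]
        while True:
            U_new = {s: R[s] + gamma * expected_utility(s, pi[s], P, U)
                     for s in S}
            delta = max(abs(U_new[s] - U[s]) for s in S)
            U = U_new
            if delta < eps:
                break
        # Policy improvement: act greedily w.r.t. the evaluated utilities
        pi_new = {s: greedy_action(s, P, U, A) for s in S}
        if pi_new == pi:                    # no change: policy is optimal
            return pi, U
        pi = pi_new
```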
Policy evaluation
• For a problem with n states, policy evaluation gives n linear equations in n unknowns, solvable exactly in O(n^3) time; an iterative scheme can also be used.
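A sketch of the exact O(n^3) solve with NumPy: writing the system in matrix form as U = R + \gamma P_\pi U, i.e. (I - \gamma P_\pi) U = R, where P_pi[s, s'] = P(s' \mid s, \pi(s)). The matrix encoding is an assumption for illustration:

```python
import numpy as np

def evaluate_policy_exact(P_pi, R, gamma=0.9):
    """Solve (I - gamma * P_pi) U = R for U; np.linalg.solve is O(n^3)."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, np.asarray(R, float))
```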
Summary
• Markov decision processes
• Utility of state sequences
• Utility of states
• Value iteration algorithm
• Policy iteration algorithm