Decision Theory: Markov Decision Processes

CPSC 322 Lecture 34
April 11, 2007
Textbook §12.5
Lecture Overview
1. Recap
2. Rewards and Policies
3. Value Iteration
Markov Decision Processes
An MDP is defined by:
a set S of states;
a set A of actions;
P(S_{t+1} | S_t, A_t), which specifies the dynamics;
R(S_t, A_t, S_{t+1}), which specifies the reward. The agent gets a reward at each time step (rather than just a final reward); R(s, a, s′) is the reward received when the agent is in state s, does action a, and ends up in state s′.
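To make these components concrete, here is a minimal sketch of one way to represent S, A, P, and R in code; the two-state example (states healthy/sick, actions relax/work) and all of the numbers are invented for illustration and are not from the lecture.

```python
# A minimal sketch of an MDP as plain Python data structures.
# States, actions, and numbers are invented for illustration.

S = ["healthy", "sick"]      # set S of states
A = ["relax", "work"]        # set A of actions

# Dynamics: P[(s, a)][s2] = P(s2 | a, s)
P = {
    ("healthy", "relax"): {"healthy": 0.95, "sick": 0.05},
    ("healthy", "work"):  {"healthy": 0.80, "sick": 0.20},
    ("sick",    "relax"): {"healthy": 0.50, "sick": 0.50},
    ("sick",    "work"):  {"healthy": 0.10, "sick": 0.90},
}

# Reward: R[(s, a, s2)] = R(s, a, s2), received at each time step
R = {}
for s2 in S:
    R[("healthy", "relax", s2)] = 7
    R[("healthy", "work", s2)] = 10
    R[("sick", "relax", s2)] = 0
    R[("sick", "work", s2)] = 2
```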
Decision Processes
A Markov decision process augments a stationary Markov
chain with actions and values:
[Figure: the MDP unrolled over time — states S0, S1, S2, S3, actions A0, A1, A2, and rewards R1, R2, R3, where each state St and action At determine the next state St+1 and the reward Rt+1.]
Rewards and Values
Suppose the agent receives the sequence of rewards r_1, r_2, r_3, r_4, . . .. What value should be assigned to it?

total reward: V = Σ_{i=1}^∞ r_i

average reward: V = lim_{n→∞} (r_1 + · · · + r_n) / n

discounted reward: V = Σ_{i=1}^∞ γ^{i−1} r_i, where γ is the discount factor, 0 ≤ γ ≤ 1
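As a quick sanity check on these definitions, the sketch below computes all three values for a short, truncated reward sequence; the particular rewards and the choice γ = 0.9 are made up for illustration.

```python
# Compare the three ways of valuing a reward sequence, using a finite
# prefix r_1, ..., r_n to stand in for the infinite sequence.
rewards = [10, 0, 0, 5, 10]   # illustrative rewards r_1, ..., r_5
gamma = 0.9                   # discount factor, 0 <= gamma <= 1

total = sum(rewards)
average = sum(rewards) / len(rewards)
discounted = sum(gamma ** (i - 1) * r for i, r in enumerate(rewards, start=1))

print(total, average, discounted)   # 25 5.0 20.206 (approximately)
```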
Policies
A stationary policy is a function

π : S → A

Given a state s, π(s) specifies the action that an agent following π will take.
An optimal policy is one with maximum expected value; we'll focus on the case where value is defined as discounted reward.
For an MDP with stationary dynamics and rewards and an infinite or indefinite horizon, there is always a stationary policy that is optimal in this case.
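For a finite state space, a stationary policy can simply be stored as a lookup table. The sketch below reuses the illustrative two-state MDP from the earlier example and is not from the lecture.

```python
# A stationary policy: a function (here, a lookup table) pi : S -> A.
# The states and actions are the illustrative ones from the earlier MDP sketch.
pi = {"healthy": "work", "sick": "relax"}

def act(state):
    """Action chosen by an agent that follows the stationary policy pi."""
    return pi[state]

print(act("sick"))   # relax
```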
Value of a Policy
Q^π(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following policy π.
V^π(s), where s is a state, is the expected value of following policy π in state s.
Q^π and V^π can be defined mutually recursively:

V^π(s) = Q^π(s, π(s))

Q^π(s, a) = Σ_{s′} P(s′ | a, s) [r(s, a, s′) + γ V^π(s′)]
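One standard way to compute V^π numerically is iterative policy evaluation: start from an arbitrary estimate and repeatedly apply the two recursive equations until the values stop changing. The sketch below assumes the dict-based MDP representation from the earlier examples; it is an illustration, not the lecture's or the textbook's code.

```python
def q_pi(s, a, V, P, R, gamma):
    """Q^pi(s, a) = sum over s' of P(s' | a, s) * (r(s, a, s') + gamma * V^pi(s'))."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())

def evaluate_policy(pi, S, P, R, gamma=0.9, sweeps=1000):
    """Repeatedly apply V^pi(s) = Q^pi(s, pi[s]), starting from an arbitrary estimate."""
    V = {s: 0.0 for s in S}      # arbitrary initial estimate (zeros)
    for _ in range(sweeps):
        V = {s: q_pi(s, pi[s], V, P, R, gamma) for s in S}
    return V

# Usage with the illustrative two-state MDP and policy from earlier:
# V_pi = evaluate_policy(pi, S, P, R)
```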
Value of the Optimal Policy
Q∗(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following the optimal policy.
V∗(s), where s is a state, is the expected value of following the optimal policy in state s.
Q∗ and V∗ can be defined mutually recursively:

Q∗(s, a) = Σ_{s′} P(s′ | a, s) [r(s, a, s′) + γ V∗(s′)]

V∗(s) = max_a Q∗(s, a)

π∗(s) = arg max_a Q∗(s, a)
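Given a table of Q∗ values, V∗ and π∗ are just a max and an arg max over actions, exactly as the equations above say. The short sketch below assumes Q∗ is stored as a dict keyed by (state, action) pairs, which is an illustrative choice rather than anything from the lecture.

```python
def v_star(s, A, Q):
    """V*(s) = max over actions a of Q*(s, a)."""
    return max(Q[(s, a)] for a in A)

def pi_star(s, A, Q):
    """pi*(s) = arg max over actions a of Q*(s, a)."""
    return max(A, key=lambda a: Q[(s, a)])
```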
Value Iteration
Idea: given an estimate of the k-step lookahead value function, compute the (k + 1)-step lookahead value function.

Set V_0 arbitrarily, e.g., to all zeros.

Compute Q_{k+1} and V_{k+1} from V_k:

Q_{k+1}(s, a) = Σ_{s′} P(s′ | a, s) [r(s, a, s′) + γ V_k(s′)]

V_{k+1}(s) = max_a Q_{k+1}(s, a)

If we substitute the definition of Q_{k+1} into the expression for V_{k+1}, we get a single update equation for V:

V_{k+1}(s) = max_a Σ_{s′} P(s′ | a, s) [r(s, a, s′) + γ V_k(s′)]
Pseudocode for Value Iteration
procedure value_iteration(P, r, θ)
    inputs:
        P is the state transition function specifying P(s′ | a, s)
        r is a reward function R(s, a, s′)
        θ is a threshold, θ > 0
    returns:
        π[s] approximately optimal policy
        V[s] value function
    data structures:
        V_k[s] a sequence of value functions
    begin
        for k = 1 : ∞
            for each state s
                V_k[s] = max_a Σ_{s′} P(s′ | a, s) (R(s, a, s′) + γ V_{k−1}[s′])
            if ∀s |V_k[s] − V_{k−1}[s]| < θ
                for each state s
                    π[s] = arg max_a Σ_{s′} P(s′ | a, s) (R(s, a, s′) + γ V_{k−1}[s′])
                return π, V_k
    end
Figure 12.13: Value Iteration for Markov Decision Processes, storing V
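For reference, here is a runnable Python sketch of the same algorithm, assuming the dict-based representation of P and R used in the earlier examples; it follows the structure of the pseudocode above but is an illustrative implementation, not the textbook's.

```python
def value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Value iteration storing V, following the structure of the pseudocode above.

    P[(s, a)] is a dict {s2: P(s2 | a, s)}; R[(s, a, s2)] is the reward R(s, a, s').
    Returns an approximately optimal policy pi and the value function V.
    """
    def backup(s, a, V):
        # sum over s' of P(s' | a, s) * (R(s, a, s') + gamma * V[s'])
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in S}          # V_0 set arbitrarily (here: zeros)
    while True:
        V_new = {s: max(backup(s, a, V) for a in A) for s in S}
        converged = all(abs(V_new[s] - V[s]) < theta for s in S)
        V = V_new
        if converged:
            break

    # Extract a policy that is greedy with respect to the converged values.
    pi = {s: max(A, key=lambda a: backup(s, a, V)) for s in S}
    return pi, V
```

With the illustrative two-state MDP defined earlier, `pi, V = value_iteration(S, A, P, R)` returns the greedy policy and its value estimates.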