MAC 425/MAC 5739 Artificial Intelligence
Sequential Decision Making
Denis D. Mauá, IME-USP, 2016
Probability Theory in One Slide
- Variables X_1, ..., X_n take values in Ω_1, ..., Ω_n
- Factored possibility space Ω = Ω_1 × ··· × Ω_n
- An event is a subset of Ω; e.g. {X_1 = x_1, ..., X_n = x_n}
- A probability function maps each event α to a real value such that
  1. 0 ≤ P(α) ≤ 1
  2. P(Ω) = 1
  3. if α and β are disjoint then P(α ∪ β) = P(α) + P(β)
- A probability function is fully specified by the probability of the worlds (atomic events): P(ω), ∀ω ∈ Ω
Expectation
Let f be a function mapping worlds ω ∈ Ω to reals.

Expected value of f:

  E(f) = Σ_{ω ∈ Ω} f(ω) · P(ω)

Intuition: the "average" value of f. If Ω = {1, ..., 110} represents the "age" of individuals and f is the identity function, then

  E(f) = 1 · P(1) + 2 · P(2) + 3 · P(3) + ··· + 110 · P(110)
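A minimal sketch of this computation in Python; the small age distribution below is a hypothetical placeholder, only the formula E(f) = Σ_ω f(ω) · P(ω) comes from the slide.

    # Expected value of f under a probability function P over a finite set of worlds.
    def expectation(P, f):
        """E(f) = sum of f(w) * P(w) over all worlds w."""
        return sum(f(w) * P[w] for w in P)

    # Hypothetical age distribution over three worlds (probabilities sum to 1).
    P_age = {20: 0.3, 40: 0.5, 60: 0.2}
    identity = lambda w: w
    print(expectation(P_age, identity))  # 20*0.3 + 40*0.5 + 60*0.2 = 38.0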
Decision Making Theory in One Slide
A utility function U : Ω_S × Ω_A → R describes the desirability (gain, reward) of taking (compound) action a when the world is in state s (this is what we did in Search).

The environment is probabilistic (stochastic): an action a leads to a state s with probability P(s | a).

The Principle of Maximum Expected Utility
You evaluate actions by their expected utility E(U(S, A)) and choose one of highest value.
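A minimal sketch of the maximum expected utility rule over finite state and action sets; the two-state, two-action numbers below are hypothetical.

    # Choose the action with maximum expected utility:
    #   argmax_a  sum_s P(s | a) * U(s, a)
    def best_action(states, actions, P, U):
        """P[(s, a)] = probability of state s given action a; U[(s, a)] = utility."""
        def expected_utility(a):
            return sum(P[(s, a)] * U[(s, a)] for s in states)
        return max(actions, key=expected_utility)

    # Hypothetical example.
    states, actions = ["good", "bad"], ["act", "wait"]
    P = {("good", "act"): 0.7, ("bad", "act"): 0.3,
         ("good", "wait"): 0.4, ("bad", "wait"): 0.6}
    U = {("good", "act"): 10, ("bad", "act"): -20,
         ("good", "wait"): 5, ("bad", "wait"): 0}
    print(best_action(states, actions, P, U))  # 'wait': E[U] = 2.0 beats 'act': E[U] = 1.0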
Cancer Treatment
- States Ω_S: cancerous (c), healthy (h)
- Actions Ω_A: radiation (r), non-invasive test (t), surgery (s), nothing (n)
[Figure: one decision diagram per action (Radiation, Non-Invasive Test, Surgery, Nothing), each giving the outcome probabilities P(s | a) over {c, h} and the utilities
U(c, r) = −100, U(h, r) = −500, U(c, t) = −10, U(h, t) = −10,
U(c, s) = −50, U(h, s) = −125, U(c, n) = −50, U(h, n) = 0]
Sequential Decision Making (a.k.a. Sequential Decision Process)
- A set of possible actions Ω_A
- A set of potential states Ω_S
- Time steps t = 1, 2, ...
- The agent applies action A_t in state S_{t−1} and moves to state S_t
- A probabilistic transition model: P(S_t | S_0, A_1, ..., S_{t−1}, A_t)
- A utility function over sequences of actions and outcomes: U(S_0, A_1, S_1, ...)
Sequential Decision Making (a.k.a. Sequential Decision Process)
- Observable environment: the agent takes action A_t based on the observed history of previous states and actions S_0, A_1, ..., S_{t−1}
- A policy π_t specifies the agent's behavior:

  π_t : Ω_S^t × Ω_A^{t−1} → Ω_A

A rational agent adopts a policy that maximizes expected utility:

  argmax_{π_1, π_2, ...} E(U(S_0, π_1(S_0), S_1, π_2(S_0, π_1(S_0), S_1), ...))
Sequential Decision Process: Example
Markov Decision Process
- Sequential decision process: time steps t = 1, 2, ..., action space Ω_A, state space Ω_S
- Markovian assumption:

  P(S_t | S_0, A_1, ..., S_{t−1}, A_t) = P(S_t | S_{t−1}, A_t)

- The utility function is additive (see the code sketch after the diagram below):

  U(S_0, A_1, S_1, ...) = Σ_{t≥1} R_t(S_{t−1}, A_t, S_t)
[Diagram: dynamic decision network with states S_0, S_1, S_2, S_3, actions A_1, A_2, A_3 and rewards R_1, R_2, R_3]
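A minimal sketch, in Python, of how these ingredients could be packaged as a data structure; the field names and dictionary encoding are my own choices, not from the slides, and the reward is stored per state as in the stationary setting introduced a few slides ahead.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    State = Tuple[int, int]   # e.g. grid coordinates
    Action = str

    @dataclass
    class MDP:
        states: List[State]
        actions: List[Action]
        # P[(s, a)] maps each successor state s2 to P(s2 | s, a)
        P: Dict[Tuple[State, Action], Dict[State, float]]
        # R[s] = immediate reward received in state s
        R: Dict[State, float]
        gamma: float = 0.9    # discount factor (introduced on the stationarity slide)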
Markov Decision Process: Example
- Transition model P(S_t | A_t, S_{t−1}): [figure: grid-world transition model]
- Reward (see the sketch below):

  R_t(S_{t−1}, A_t, S_t) = ±1 if S_t ∈ {(4, 3), (4, 2)}, and −0.04 otherwise
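A small sketch of this reward function, assuming states are (x, y) grid coordinates; the slide only says ±1 for the two terminal cells, so assigning +1 to (4, 3) and −1 to (4, 2) is an assumption.

    # Reward for the 4x3 grid world: +/-1 at the two terminal cells, -0.04 elsewhere.
    def reward(state):
        if state == (4, 3):
            return 1.0    # assumed +1 terminal
        if state == (4, 2):
            return -1.0   # assumed -1 terminal
        return -0.04      # small step cost in every other cell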
Markov Decision Process: Optimal Policy
- History: H_t = S_0, A_1, S_1, ..., A_{t−1}, S_{t−1}
- Future: F_t = S_t, π*_{t+1}(S_t), S_{t+1}, ...
- Optimal agent behavior π*_t maps a history H_t to an action A_t:

  π*_t(H_t) = argmax_{A_t} Σ_{F_t} P(F_t | H_t, A_t) U(H_t, A_t, F_t)
            = argmax_{A_t} Σ_{F_t} P(F_t | S_{t−1}, A_t) Σ_{k≥1} R_k(S_{k−1}, A_k, S_k)
            = argmax_{A_t} Σ_{F_t} P(F_t | S_{t−1}, A_t) Σ_{k≥t} R_k(S_{k−1}, A_k, S_k)
            = π*_t(S_{t−1})
Optimal policy depends only on current state, not on whole history!
Markov Decision Process: Optimal Policy
Markovian policy:

  π_t : Ω_S → Ω_A

Optimal behaviour:

  argmax_{π_1, π_2, ...} E( Σ_{t≥1} R_t(S_{t−1}, A_t, S_t) )

[Diagram: dynamic decision network with states S_0, S_1, S_2, S_3, actions A_1, A_2, A_3 and rewards R_1, R_2, R_3]
Markov Decision Process: Stationarity
- Time-independent dynamics:

  P(S_t | S_{t−1}, A_t) = P(S_{t+k} | S_{t+k−1}, A_{t+k})

- Immediate reward: R : Ω_S → R
- Discounted utility: R_t(S_{t−1}, A_t, S_t) = γ^t R(S_t) (see the sketch below)
- Planning horizon:
  - Finite: t = 1, 2, ..., T with T < ∞
  - Infinite: t = 1, 2, ...
  - Undefined: there exists S such that P(S | S, A) = 1 and R(S) = 0 for all A; such an S is called an absorbing state
- Infinite horizon ⇒ stationary policy π : Ω_S → Ω_A
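A tiny sketch of the discounted utility of an observed trajectory of rewards R(S_1), R(S_2), ..., assuming the stationary reward and discount factor γ above.

    # Discounted utility of a trajectory: sum over t >= 1 of gamma^t * R(S_t).
    def discounted_utility(rewards, gamma=0.9):
        """rewards[t-1] = R(S_t) along the trajectory, t = 1, 2, ..."""
        return sum(gamma**t * r for t, r in enumerate(rewards, start=1))

    print(discounted_utility([-0.04, -0.04, 1.0], gamma=0.9))  # -0.04*0.9 - 0.04*0.81 + 1.0*0.729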
Markov Decision Process: Example
Optimum Stationary Policy
Markov Decision Process: Example
Risk and Reward
Expected Utility of a Policy
Expected utility of applying policy π when in state S_{t−1}:

  V^π(S_{t−1}) = Σ_{F_t} U(S_{t−1}, π(S_{t−1}), F_t) P(F_t | S_{t−1}, π(S_{t−1}))
               = Σ_{F_t} [R(S_{t−1}) + U(F_t)] P(F_t | S_{t−1}, π(S_{t−1}))
               = R(S_{t−1}) + γ Σ_{S_t} P(S_t | S_{t−1}, π(S_{t−1})) V^π(S_t)

Usually written as

  V^π(S) = R(S) + γ Σ_{S′} P(S′ | S, π(S)) V^π(S′)
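Since this fixed-point equation is linear in V^π, it can be solved directly; a minimal sketch with NumPy, assuming P_pi is the |Ω_S| × |Ω_S| transition matrix induced by π and R is the reward vector (the exact O(n³) solve is discussed on the next slide).

    import numpy as np

    # Solve V = R + gamma * P_pi @ V   <=>   (I - gamma * P_pi) V = R
    def evaluate_policy_exact(P_pi, R, gamma=0.9):
        """P_pi[i, j] = P(S' = j | S = i, pi(i)); R[i] = R(i)."""
        n = len(R)
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

    # Hypothetical two-state example.
    P_pi = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
    R = np.array([0.0, 1.0])
    print(evaluate_policy_exact(P_pi, R))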
Expected Utility of a Policy
  V^π(S) = R(S) + γ Σ_{S′} P(S′ | S, π(S)) V^π(S′)

This is a system of n = |Ω_S| linear equations in n variables.

It can be solved by standard techniques (Gaussian elimination) in O(n³) time.

Approximate iterative method (N iterations, each costing O(n²); sketched below):

  V^π_0(S) = 0
  V^π_i(S) = R(S) + γ Σ_{S′} P(S′ | S, π(S)) V^π_{i−1}(S′)
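A minimal sketch of the iterative approximation above, with the same NumPy conventions as the exact solve.

    import numpy as np

    # Iterative policy evaluation: V_0 = 0, V_i = R + gamma * P_pi @ V_{i-1}.
    def evaluate_policy_iterative(P_pi, R, gamma=0.9, n_iters=100):
        V = np.zeros(len(R))
        for _ in range(n_iters):
            V = R + gamma * P_pi @ V
        return V

Since the update is a γ-contraction, for γ < 1 the iterates converge geometrically to the exact V^π.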
Value Function
Bellman’s Equation: Expected utility of acting optimally:
  V*(S) = R(S) + γ max_A Σ_{S′} P(S′ | S, A) V*(S′)

For the grid world, e.g.:

  V*(1, 1) = −0.04 + γ max{ 0.8 V*(1, 2) + 0.1 V*(2, 1) + 0.1 V*(1, 1),
                            0.9 V*(1, 1) + 0.1 V*(1, 2),
                            0.9 V*(1, 1) + 0.1 V*(2, 1),
                            0.8 V*(2, 1) + 0.1 V*(1, 2) + 0.1 V*(1, 1) }
Markov Decision Process: Example
Value Function
Policy Iteration
The optimal policy acts greedily with respect to the optimal value function:

  π*(S) = argmax_A Σ_{S′} P(S′ | S, A) V*(S′)

(by construction, the optimal policy cannot improve on the optimal value function)

Greedy policy with respect to the current value function V^π (sketched below):

  π(S) = argmax_A Σ_{S′} P(S′ | S, A) V^π(S′)

(improves on the current value function)
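A minimal sketch of greedy policy extraction, assuming the same hypothetical dictionary encoding used earlier (P[(s, a)] maps successor states to probabilities).

    # Greedy policy with respect to a value function V:
    #   pi(s) = argmax_a sum_{s'} P(s' | s, a) * V(s')
    def greedy_policy(states, actions, P, V):
        def q(s, a):
            return sum(prob * V[s2] for s2, prob in P[(s, a)].items())
        return {s: max(actions, key=lambda a: q(s, a)) for s in states}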
Policy Iteration
Perform a greedy search in policy space (sketched below):

1. Start with an arbitrary policy π and repeat until convergence:
   1.1 Compute V^π
   1.2 Find the greedy policy π′ using V^π

Converges in a few iterations, but each iteration is slow due to the computation of the value function.
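A minimal sketch of the loop above, again assuming the hypothetical dictionary encoding (P[(s, a)] maps successors to probabilities, R maps states to immediate rewards); policy evaluation is done approximately with a fixed number of sweeps.

    def policy_iteration(states, actions, P, R, gamma=0.9, eval_iters=100):
        """Alternate policy evaluation and greedy improvement until the policy is stable."""
        pi = {s: actions[0] for s in states}   # arbitrary initial policy
        while True:
            # 1.1 Compute (an approximation of) V^pi.
            V = {s: 0.0 for s in states}
            for _ in range(eval_iters):
                V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
                     for s in states}
            # 1.2 Find the greedy policy using V^pi.
            def q(s, a):
                return sum(p * V[s2] for s2, p in P[(s, a)].items())
            new_pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
            if new_pi == pi:
                return pi, V
            pi = new_pi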
Value Iteration
Bellman's operator B maps a value function V to a new value function V′ = B(V):

  V′(S) = R(S) + γ max_A Σ_{S′} P(S′ | S, A) V(S′)

The optimal policy is

  π*(S) = argmax_A Σ_{S′} P(S′ | S, A) V*(S′)

The optimal value function is a fixed point of Bellman's operator:

  V*(S) = R(S) + γ max_A Σ_{S′} P(S′ | S, A) V*(S′)
Value Iteration
1. Compute the (optimal) value function V*
2. Find the greedy policy using V*

Repeat until convergence (sketched below):

  V_0(S) = 0
  V_i(S) = R(S) + γ max_A Σ_{S′} P(S′ | S, A) V_{i−1}(S′)

Then compute

  π*(S) = argmax_A Σ_{S′} P(S′ | S, A) V*(S′)
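A minimal sketch of value iteration using the stopping rule from the next slide (stop when δ_i < ε(1 − γ)/γ), with the same hypothetical dictionary encoding as before.

    def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-4):
        """Iterate the Bellman operator, then extract the greedy policy."""
        V = {s: 0.0 for s in states}
        while True:
            new_V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[(s, a)].items())
                                           for a in actions)
                     for s in states}
            delta = max(abs(new_V[s] - V[s]) for s in states)
            V = new_V
            if delta < eps * (1 - gamma) / gamma:
                break
        # Extract the greedy policy from the (near-)optimal value function.
        def q(s, a):
            return sum(p * V[s2] for s2, p in P[(s, a)].items())
        pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        return pi, V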
Value Iteration: Convergence
Error at the i-th iteration: δ_i = max_s |V_i(s) − V_{i−1}(s)|

Theorem. If δ_i < ε(1 − γ)/γ, then max_s |V_i(s) − V*(s)| < ε.

Theorem. If max_s |V_i(s) − V*(s)| < ε, then max_s |V^{π_i}(s) − V*(s)| < 2εγ/(1 − γ).
Value Iteration Vs. Policy Iteration
- Each iteration of VI takes O(|Ω_A| · |Ω_S|²) time
- VI converges in at most N = ⌈ log(2 max_s R(s) / (ε(1 − γ))) / log(1/γ) ⌉ iterations
- Each iteration of PI takes O(|Ω_A| · |Ω_S|² + |Ω_S|³) time
- Since there are at most |Ω_A|^|Ω_S| policies, PI finishes in at most that number of iterations; this bound is unrealistic, though
- PI runs in a worst-case polynomial number of iterations for fixed γ; in practice it converges much faster than VI
- The optimal policy is obtained even when the value function is still sub-optimal: this makes the VI stopping criterion excessively conservative