SYS3060 Stochastic Decision Models
Lecture 3
Quanquan Gu
Dept. of Systems and Information Engineering
University of Virginia
Recap
Markov Property
Markov Process
Markov Reward Process (MRP)
Markov Decision Process (MDP)
Returns
Value Function
Bellman Equation for value function in MRP
\[
v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} \, v(s')
\]
Proof.
Last time we proved
\[
v(s) = E[R_{t+1} \mid S_t = s] + \gamma E[v(S_{t+1}) \mid S_t = s].
\]
By definition of $R_s$ and conditional expectation we have
\begin{align*}
v(s) &= R_s + \gamma \sum_{s' \in S} v(s') \, P(S_{t+1} = s' \mid S_t = s) \\
&= R_s + \gamma \sum_{s' \in S} P_{ss'} \, v(s').
\end{align*}
Matrix form of Bellman equation for v(s) in MRP
Suppose S = {1, 2, . . . , n}, then the Bellman equation can be written in
the following matrix form
\[
\begin{bmatrix} v(1) \\ v(2) \\ \vdots \\ v(n) \end{bmatrix}
=
\begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_n \end{bmatrix}
+ \gamma
\begin{bmatrix}
P_{11} & P_{12} & \cdots & P_{1n} \\
P_{21} & P_{22} & \cdots & P_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nn}
\end{bmatrix}
\begin{bmatrix} v(1) \\ v(2) \\ \vdots \\ v(n) \end{bmatrix}.
\]
Take the second row for example,
\begin{align*}
v(2) &= R_2 + \gamma \, [P_{21}, P_{22}, \ldots, P_{2n}]
\begin{bmatrix} v(1) \\ v(2) \\ \vdots \\ v(n) \end{bmatrix} \\
&= R_2 + \gamma \sum_{s'=1}^{n} P_{2s'} \, v(s').
\end{align*}
Matrix form of Bellman equation for v(s) in MRP
Let
\[
\mathbf{v} = \begin{bmatrix} v(1) \\ v(2) \\ \vdots \\ v(n) \end{bmatrix}, \qquad
\mathbf{r} = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_n \end{bmatrix}, \qquad \text{and} \qquad
\mathbf{P} = \begin{bmatrix}
P_{11} & P_{12} & \cdots & P_{1n} \\
P_{21} & P_{22} & \cdots & P_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
P_{n1} & P_{n2} & \cdots & P_{nn}
\end{bmatrix},
\]
then the Bellman equation can be written in the more compact form
\[
\mathbf{v} = \mathbf{r} + \gamma \mathbf{P} \mathbf{v},
\]
from which we obtain the closed-form solution
\[
\mathbf{v} = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{r},
\]
where $\mathbf{I}$ is the identity matrix.
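As an illustration, this closed-form solution amounts to a single linear solve. Below is a minimal NumPy sketch; the function name `mrp_value` is a choice made here, and $\gamma$ is assumed to lie in $[0, 1)$ so that $\mathbf{I} - \gamma \mathbf{P}$ is invertible.

```python
import numpy as np

def mrp_value(P, r, gamma):
    """Solve the MRP Bellman equation v = r + gamma * P v in closed form.

    P     : (n, n) state-transition matrix, rows sum to 1
    r     : (n,)   expected immediate rewards R_1, ..., R_n
    gamma : discount factor in [0, 1), so that I - gamma * P is invertible
    """
    n = len(r)
    # Solving the linear system (I - gamma * P) v = r is numerically
    # preferable to forming the matrix inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P, r)
```

The solve costs $O(n^3)$ time, which is perfectly adequate for small state spaces like the examples in this lecture.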
Example (Calculation of $R_s$)
[Figure: two-state MRP with states start and stop; transition probabilities
P(start → start) = 0.1, P(start → stop) = 0.9, P(stop → start) = 0.5, P(stop → stop) = 0.5.]
\begin{align*}
R_{\text{start}} &= E[R_{t+1} \mid S_t = \text{start}] \\
&= -1 \cdot P(S_{t+1} = \text{start} \mid S_t = \text{start}) + 0 \cdot P(S_{t+1} = \text{stop} \mid S_t = \text{start}) \\
&= -1 \cdot 0.1 + 0 \cdot 0.9 = -0.1
\end{align*}
Similarly,
\[
R_{\text{stop}} = -2 \cdot 0.5 + 1 \cdot 0.5 = -0.5.
\]
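Both expected immediate rewards are probability-weighted sums, so they are easy to check numerically. A minimal sketch (the per-transition rewards −1, 0, −2, and 1 are read off the two calculations above):

```python
import numpy as np

# Transitions out of start: probabilities and the rewards received on them.
p_start = np.array([0.1, 0.9])     # P(start -> start), P(start -> stop)
r_start = np.array([-1.0, 0.0])    # rewards on those two transitions

# Transitions out of stop (rewards listed in the same order as the slide's sum).
p_stop = np.array([0.5, 0.5])
r_stop = np.array([-2.0, 1.0])

# R_s = E[R_{t+1} | S_t = s] is the probability-weighted average reward.
print(np.dot(p_start, r_start))    # -0.1
print(np.dot(p_stop, r_stop))      # -0.5
```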
Example (Calculation of $v(s)$)
Recall Rstart = −0.1 and Rstop = −0.5
Let 1:start, 2:stop
We can calculate v(1) and v(2) by
solving the linear system of equations:
\[
\begin{bmatrix} v(1) \\ v(2) \end{bmatrix}
=
\begin{bmatrix} -0.1 \\ -0.5 \end{bmatrix}
+ \gamma
\begin{bmatrix} 0.1 & 0.9 \\ 0.5 & 0.5 \end{bmatrix}
\begin{bmatrix} v(1) \\ v(2) \end{bmatrix}.
\]
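As a quick check, this $2 \times 2$ system can be solved numerically. A minimal sketch, assuming a discount factor $\gamma = 0.9$ (the lecture does not fix a value for $\gamma$ here):

```python
import numpy as np

gamma = 0.9                          # assumed discount factor (not specified in the lecture)
r = np.array([-0.1, -0.5])           # [R_start, R_stop]
P = np.array([[0.1, 0.9],            # P[i, j] = P(S_{t+1} = j | S_t = i), 1: start, 2: stop
              [0.5, 0.5]])

# Rearrange v = r + gamma * P v into (I - gamma * P) v = r and solve.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v)                             # [v(1), v(2)] = [v(start), v(stop)]
```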
Policy
Definition (Stochastic Policy)
A policy π is a distribution over actions given states:
\[
\pi(a \mid s) = P(A_t = a \mid S_t = s).
\]
A stochastic policy is called stationary if $\pi(a \mid s) = P(A_t = a \mid S_t = s)$ does not depend on the time step $t$.
Definition (Deterministic Policy)
A deterministic policy $\pi$ is a function from states to actions ($\pi : S \to A$):
\[
\pi(s) = a.
\]
A deterministic policy is a special case of a (stochastic) policy.
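As a small illustration of these two definitions, a stochastic policy over a finite MDP can be stored as a table of probabilities $\pi(a \mid s)$, and a deterministic policy as one chosen action per state. The state and action counts and all the numbers below are hypothetical, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2           # hypothetical MDP sizes

# Stochastic policy: pi[s, a] = pi(a | s); each row is a distribution over actions.
pi = np.array([[0.2, 0.8],
               [0.5, 0.5],
               [1.0, 0.0]])
assert np.allclose(pi.sum(axis=1), 1.0)

# Sample A_t ~ pi(. | s) in state s = 0.
a = rng.choice(n_actions, p=pi[0])

# Deterministic policy: a mapping pi(s) = a, stored as an array indexed by state.
pi_det = np.array([1, 0, 1])
# Viewed as a special case of a stochastic policy: all mass on the chosen action.
pi_det_as_stochastic = np.eye(n_actions)[pi_det]
```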
Value Function (State Value Function)
Definition
The state-value function $v_\pi(s)$ of an MDP is the expected return starting
from state $s$ under policy $\pi$:
\[
v_\pi(s) = E_\pi[G_t \mid S_t = s].
\]
Action Value Function
Definition
The action-value function $q_\pi(s, a)$ is the expected return starting from
state $s$, taking action $a$, and then following policy $\pi$:
\[
q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a].
\]
By the law of total expectation, $E[X \mid Y] = E\big[\, E[X \mid Y, Z] \mid Y \,\big]$, we can
show that
\[
v_\pi(s) = E_{A_t \sim \pi(\cdot \mid s)}\big[ q_\pi(s, A_t) \big] = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a).
\]
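A minimal numerical check of this identity for a single state, with a hypothetical policy and hypothetical action values (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical values for one state s with three available actions.
pi_s = np.array([0.5, 0.3, 0.2])   # pi(a | s) for a = 0, 1, 2
q_s = np.array([1.0, -2.0, 4.0])   # q_pi(s, a) for a = 0, 1, 2

# v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v_s = np.dot(pi_s, q_s)
print(v_s)                         # 0.5*1.0 + 0.3*(-2.0) + 0.2*4.0 = 0.7
```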
Bellman Equation for Value Function and Action Value
Function in MDP
Bellman equation for vπ (s):
\[
v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big[ R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s') \Big].
\]
Bellman equation for qπ (s, a):
\[
q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a').
\]
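These two equations give a direct way to compute $v_\pi$ and $q_\pi$ for a small finite MDP: averaging rewards and transitions over $\pi(a \mid s)$ reduces the MDP to an MRP, whose Bellman equation can then be solved in matrix form as on the earlier slides. The 2-state, 2-action MDP and the discount factor below are made-up placeholders, not from the lecture:

```python
import numpy as np

gamma = 0.9                                 # assumed discount factor
P = np.array([                              # P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    [[0.7, 0.3], [0.4, 0.6]],               # action 0
    [[0.1, 0.9], [0.8, 0.2]],               # action 1
])
R = np.array([[1.0, 0.0],                   # R[a, s] = R_s^a
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],                  # pi[s, a] = pi(a | s)
               [0.2, 0.8]])

# Average over the policy: R_pi[s] = sum_a pi(a|s) R_s^a,
#                          P_pi[s, s'] = sum_a pi(a|s) P^a_{ss'}.
R_pi = np.einsum('sa,as->s', pi, R)
P_pi = np.einsum('sa,ast->st', pi, P)

# Bellman equation for v_pi in matrix form: v_pi = R_pi + gamma * P_pi v_pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Bellman equation for q_pi, using sum_{a'} pi(a'|s') q_pi(s', a') = v_pi(s').
q_pi = R + gamma * np.einsum('ast,t->as', P, v_pi)
print(v_pi)
print(q_pi)
```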