Infinite-Horizon Discounted Markov Decision Processes

Dan Zhang
Leeds School of Business
University of Colorado at Boulder
Outline
The expected total discounted reward
Policy evaluation
Optimality equations
Value iteration
Policy iteration
Linear Programming
Expected Total Reward Criterion
Let π = (d_1, d_2, ...) ∈ Π^{HR}.
Starting at a state s, using policy π generates a sequence of state-action pairs {(X_t, Y_t)}. The corresponding sequence of rewards is {R_t ≡ r(X_t, Y_t) : t = 1, 2, ...}.
Let λ ∈ [0, 1) be the discount factor
The expected total discounted reward of policy π starting in state s is given by

    v_\lambda^\pi(s) \equiv \lim_{N \to \infty} E_s^\pi \left[ \sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t) \right].
The limit above exists when r(·) is bounded; i.e., \sup_{s \in S, a \in A_s} |r(s, a)| = M < ∞.
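As a concrete illustration, the discounted return can be approximated by truncating the sum at a large N (the neglected tail is at most λ^N M / (1 − λ)) and averaging over simulated trajectories. Below is a minimal Python sketch under assumed inputs: a stationary deterministic policy d, rewards r[s, a], and per-action transition matrices P[a]; none of these array names come from the slides.

import numpy as np

def estimate_discounted_value(P, r, d, s0, lam=0.9, N=500, n_paths=1000, seed=0):
    """Monte Carlo estimate of v_lambda^pi(s0) for a stationary deterministic
    policy d, truncating the infinite discounted sum at N terms.

    P[a] is an |S| x |S| transition matrix for action a,
    r[s, a] is the (bounded) one-step reward, d[s] is the action taken in state s.
    """
    rng = np.random.default_rng(seed)
    n_states = r.shape[0]
    total = 0.0
    for _ in range(n_paths):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(N):
            a = d[s]
            ret += disc * r[s, a]          # add lambda^{t-1} r(X_t, Y_t)
            s = rng.choice(n_states, p=P[a][s])
            disc *= lam
        total += ret
    return total / n_paths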
Expected Total Reward Criterion
Under suitable conditions (such as the boundedness of r(·)), we have

    v_\lambda^\pi(s) \equiv \lim_{N \to \infty} E_s^\pi \left[ \sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t) \right] = E_s^\pi \left[ \sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t) \right].
Let

    v^\pi(s) \equiv E_s^\pi \left[ \sum_{t=1}^{\infty} r(X_t, Y_t) \right].

We have v^π(s) = \lim_{λ↑1} v_λ^π(s) whenever v^π(s) exists.
Optimality Criteria
A policy π^* is discount optimal for λ ∈ [0, 1) if

    v_\lambda^{\pi^*}(s) \ge v_\lambda^{\pi}(s), \quad \forall s \in S,\ \pi \in \Pi^{HR}.
The value of a discounted MDP is defined by

    v_\lambda^*(s) \equiv \sup_{\pi \in \Pi^{HR}} v_\lambda^\pi(s), \quad \forall s \in S.

Let π^* be a discount optimal policy. Then v_λ^{π^*}(s) = v_λ^*(s) for all s ∈ S.
Vector Notation for Markov Decision Processes
Let V denote the set of bounded real-valued functions on S with componentwise partial order and norm ||v|| ≡ \sup_{s \in S} |v(s)|.
The corresponding matrix norm is given by

    \|H\| \equiv \sup_{s \in S} \sum_{j \in S} |H(j|s)|,

where H(j|s) denotes the (s, j)-th component of H.
Let e ∈ V denote the function with all components equal to
1; that is, e(s) = 1 for all s ∈ S.
Vector Notation for Markov Decision Processes
For d ∈ D^{MD}, let

    r_d(s) \equiv r(s, d(s)) \quad \text{and} \quad p_d(j|s) \equiv p(j|s, d(s)).

Similarly, for d ∈ D^{MR}, let

    r_d(s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, r(s, a), \qquad p_d(j|s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, p(j|s, a).

Let r_d denote the |S|-vector with s-th component r_d(s) and P_d the |S| × |S| matrix with (s, j)-th entry p_d(j|s). We refer to r_d as the reward vector and P_d as the transition probability matrix.
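For concreteness, r_d and P_d can be assembled from the model primitives as in the minimal sketch below. The array layout (r as |S| x |A|, P as |A| x |S| x |S|, q as |S| x |A|) is an assumed convention, not part of the slides.

import numpy as np

def induced_reward_and_transition(r, P, q):
    """Build r_d and P_d for a randomized decision rule.

    r is |S| x |A| (rewards), P is |A| x |S| x |S| (transition matrices),
    q is |S| x |A| with q[s, a] = probability the rule chooses a in state s.
    For a deterministic rule, q has a single 1 in each row.
    """
    r_d = (q * r).sum(axis=1)              # r_d(s) = sum_a q_{d(s)}(a) r(s, a)
    P_d = np.einsum('sa,asj->sj', q, P)    # p_d(j|s) = sum_a q_{d(s)}(a) p(j|s, a)
    return r_d, P_d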
Vector Notation for Markov Decision Processes
Let π = (d_1, d_2, ...) ∈ Π^{MR}. The (s, j) component of the t-step transition probability matrix P_\pi^t satisfies

    P_\pi^t(j|s) = [P_{d_1} \cdots P_{d_{t-1}} P_{d_t}](j|s) = P^\pi(X_{t+1} = j \mid X_1 = s).
For v ∈ V,

    E_s^\pi[v(X_t)] = \sum_{j \in S} P_\pi^{t-1}(j|s)\, v(j).

We also have

    v_\lambda^\pi = \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t}.
Assumptions
Stationary rewards and transition probabilities: r (s, a) and
p(j|s, a) do not vary with time
Bounded rewards: |r (s, a)| ≤ M < ∞
Discounting: λ ∈ [0, 1).
Discrete state space: S is finite or countable
Policy Evaluation
Theorem
Let π = (d_1, d_2, ...) ∈ Π^{HR}. Then for each s ∈ S, there exists a policy π' = (d_1', d_2', ...) ∈ Π^{MR} satisfying

    P^{\pi'}(X_t = j, Y_t = a \mid X_1 = s) = P^{\pi}(X_t = j, Y_t = a \mid X_1 = s), \quad \forall t.
=⇒ If π ∈ Π^{HR}, then for each s ∈ S, there exists a policy π' ∈ Π^{MR} such that v_λ^{π'}(s) = v_λ^π(s).
=⇒ It suffices to consider Π^{MR}:
    v_\lambda^*(s) = \sup_{\pi \in \Pi^{HR}} v_\lambda^\pi(s) = \sup_{\pi \in \Pi^{MR}} v_\lambda^\pi(s).
Policy Evaluation
Let π = (d_1, d_2, ...) ∈ Π^{MR}. Then

    v_\lambda^\pi(s) = E_s^\pi \left[ \sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t) \right].
In vector notation, we have

    v_\lambda^\pi = \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t}
                  = r_{d_1} + \lambda P_\pi^1 r_{d_2} + \lambda^2 P_\pi^2 r_{d_3} + \cdots
                  = r_{d_1} + \lambda P_{d_1} r_{d_2} + \lambda^2 P_{d_1} P_{d_2} r_{d_3} + \cdots
                  = r_{d_1} + \lambda P_{d_1} (r_{d_2} + \lambda P_{d_2} r_{d_3} + \cdots)
                  = r_{d_1} + \lambda P_{d_1} v_\lambda^{\pi'},

where π' = (d_2, d_3, ...).
Policy Evaluation
When π is stationary, π = (d, d, ...) ≡ d^∞ and π' = π. It follows that v_λ^{d^∞} satisfies

    v_\lambda^{d^\infty} = r_d + \lambda P_d v_\lambda^{d^\infty} \equiv L_d v_\lambda^{d^\infty},

where L_d : V → V is an affine transformation.
Theorem
Suppose λ ∈ [0, 1). Then for any stationary policy d^∞ with d ∈ D^{MR}, v_λ^{d^∞} is a solution in V of

    v = r_d + \lambda P_d v.

Furthermore, v_λ^{d^∞} may be written as

    v_\lambda^{d^\infty} = (I - \lambda P_d)^{-1} r_d.
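A minimal computational sketch of this result: for a finite state space, the policy value is obtained by solving the linear system directly. The arrays r_d and P_d are assumed to be given (e.g., built as in the vector-notation sketch above).

import numpy as np

def evaluate_stationary_policy(r_d, P_d, lam=0.9):
    """Solve (I - lam * P_d) v = r_d for the value of the stationary policy d^infinity."""
    n = len(r_d)
    return np.linalg.solve(np.eye(n) - lam * P_d, r_d)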
Optimality Equations
For any fixed n, the finite-horizon optimality equation is given by

    v_n(s) = \sup_{a \in A_s} \left[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v_{n+1}(j) \right].

Taking limits on both sides leads to

    v(s) = \sup_{a \in A_s} \left[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v(j) \right].
The equations above for all s ∈ S are the optimality equations.
For v ∈ V, let

    \mathcal{L} v \equiv \sup_{d \in D^{MD}} [r_d + \lambda P_d v], \qquad L v \equiv \max_{d \in D^{MD}} [r_d + \lambda P_d v],

where L is defined whenever the maximum is attained.
Optimality Equations
Proposition
For all v ∈ V and λ ∈ [0, 1),

    \sup_{d \in D^{MD}} [r_d + \lambda P_d v] = \sup_{d \in D^{MR}} [r_d + \lambda P_d v].
Replacing D^{MD} with D, the optimality equation can be written as

    v = \mathcal{L} v.

In case the supremum above is attained for all v ∈ V,

    v = L v.
Solutions of the Optimality Equations
Theorem
Suppose v ∈ V .
(i) If v ≥ \mathcal{L} v, then v ≥ v_λ^*;
(ii) If v ≤ \mathcal{L} v, then v ≤ v_λ^*;
(iii) If v = \mathcal{L} v, then v is the only element of V with this property and v = v_λ^*.
Solutions of the Optimality Equations
Let U be a Banach space (complete normed linear space).
Special case: space of bounded measurable real-valued
functions
An operator T : U → U is a contraction mapping if there
exists a λ ∈ [0, 1) such that ||Tv − Tu|| ≤ λ||v − u|| for all u
and v in U.
Theorem [Banach Fixed-Point Theorem]
Suppose U is a Banach space and T : U → U is a contraction
mapping. Then
(i) There exists a unique v^* in U such that Tv^* = v^*;
(ii) For arbitrary v^0 ∈ U, the sequence {v^n} defined by v^{n+1} = Tv^n = T^{n+1} v^0 converges to v^*.
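As an illustration of part (ii), successive approximation with the operator T v = r_d + λ P_d v (a contraction with modulus λ) converges to the unique fixed point (I − λ P_d)^{-1} r_d from any starting point. A minimal sketch, again assuming finite-dimensional arrays r_d and P_d:

import numpy as np

def fixed_point_iterate(r_d, P_d, lam=0.9, tol=1e-10, max_iter=100_000):
    """Illustrate Banach's theorem: iterate T v = r_d + lam * P_d v from v^0 = 0.

    T is a contraction with modulus lam, so v^n converges to the unique
    fixed point (I - lam * P_d)^{-1} r_d.
    """
    v = np.zeros(len(r_d))
    for _ in range(max_iter):
        v_new = r_d + lam * P_d @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v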
Solutions of the Optimality Equations
Proposition
Suppose λ ∈ [0, 1). Then \mathcal{L} and L are contraction mappings on V.
Theorem
Suppose λ ∈ [0, 1), S is finite or countable, and r (s, a) is bounded.
The following results hold.
(i) There exists a v^* ∈ V satisfying \mathcal{L} v^* = v^* (L v^* = v^*). Furthermore, v^* is the only element of V with this property and equals v_λ^*;
(ii) For each d ∈ D^{MR}, there exists a unique v ∈ V satisfying L_d v = v. Furthermore, v = v_λ^{d^∞}.
Existence of Stationary Optimal Policies
A decision rule d^* is conserving if

    d^* \in \operatorname*{arg\,max}_{d \in D} \{ r_d + \lambda P_d v_\lambda^* \}.
Theorem
If there exists a conserving decision rule or an optimal policy, then there exists a deterministic stationary policy which is optimal.
Value Iteration
1. Select v^0 ∈ V, specify ε > 0, and set n = 0.
2. For each s ∈ S, compute v^{n+1}(s) by

       v^{n+1}(s) = \max_{a \in A_s} \left[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^n(j) \right].

3. If

       \|v^{n+1} - v^n\| < \frac{\varepsilon (1 - \lambda)}{2 \lambda},

   go to step 4. Otherwise, increment n by 1 and return to step 2.
4. For each s ∈ S, choose

       d_\varepsilon(s) \in \operatorname*{arg\,max}_{a \in A_s} \left[ r(s, a) + \sum_{j \in S} \lambda p(j|s, a)\, v^{n+1}(j) \right]

   and stop.
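A minimal implementation sketch of the algorithm above for a finite model; the array layout (r as |S| x |A|, P as |A| x |S| x |S|) is an assumed convention, and the stopping rule is the one in step 3.

import numpy as np

def value_iteration(r, P, lam=0.9, eps=1e-6, max_iter=10_000):
    """Value iteration with the eps*(1-lam)/(2*lam) stopping rule.

    r[s, a] are rewards, P[a, s, j] are transition probabilities.
    Returns the final value function and an eps-optimal deterministic rule.
    """
    n_states = r.shape[0]
    v = np.zeros(n_states)
    tol = eps * (1.0 - lam) / (2.0 * lam)
    for _ in range(max_iter):
        # Q[s, a] = r(s, a) + lam * sum_j p(j|s,a) v^n(j)
        Q = r + lam * np.einsum('asj,j->sa', P, v)
        v_new = Q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Step 4: greedy rule with respect to v^{n+1}
    Q = r + lam * np.einsum('asj,j->sa', P, v)
    d = Q.argmax(axis=1)
    return v, d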
Policy Iteration
1. Set n = 0 and select an arbitrary decision rule d_0 ∈ D.
2. (Policy evaluation) Obtain v^n by solving

       (I - \lambda P_{d_n}) v = r_{d_n}.

3. (Policy improvement) Choose d_{n+1} to satisfy

       d_{n+1} \in \operatorname*{arg\,max}_{d \in D} [ r_d + \lambda P_d v^n ],

   setting d_{n+1} = d_n if possible.
4. If d_{n+1} = d_n, stop and set d^* = d_n. Otherwise, increment n by 1 and return to step 2.
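A minimal implementation sketch of policy iteration for a finite model, using exact evaluation in step 2 and greedy improvement in step 3; the array layout is the same assumed convention as in the value iteration sketch.

import numpy as np

def policy_iteration(r, P, lam=0.9, max_iter=1_000):
    """Policy iteration for a finite MDP.

    r[s, a] are rewards, P[a, s, j] are transition probabilities.
    Alternates exact policy evaluation with greedy policy improvement.
    """
    n_states, n_actions = r.shape
    d = np.zeros(n_states, dtype=int)           # arbitrary initial decision rule
    for _ in range(max_iter):
        # Policy evaluation: solve (I - lam * P_d) v = r_d
        P_d = P[d, np.arange(n_states), :]
        r_d = r[np.arange(n_states), d]
        v = np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)
        # Policy improvement: greedy rule, keeping d_n if it is still a maximizer
        Q = r + lam * np.einsum('asj,j->sa', P, v)
        d_new = np.where(np.isclose(Q[np.arange(n_states), d], Q.max(axis=1)),
                         d, Q.argmax(axis=1))
        if np.array_equal(d_new, d):
            return v, d
        d = d_new
    return v, d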
Linear Programming
Let α(s), s ∈ S, be positive scalars such that \sum_{s \in S} \alpha(s) = 1.
The primal linear program is given by

    \min_{v} \sum_{j \in S} \alpha(j)\, v(j)

subject to

    v(s) - \sum_{j \in S} \lambda p(j|s, a)\, v(j) \ge r(s, a), \quad \forall s \in S,\ a \in A_s.
The dual linear program is given by

    \max_{x} \sum_{s \in S} \sum_{a \in A_s} r(s, a)\, x(s, a)

subject to

    \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} \lambda p(j|s, a)\, x(s, a) = \alpha(j), \quad \forall j \in S,

    x(s, a) \ge 0, \quad \forall s \in S,\ a \in A_s.
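As an illustration, the primal LP can be solved numerically. The sketch below uses scipy.optimize.linprog with uniform weights α, which is a tooling choice not taken from the slides, and the same assumed array layout as the earlier sketches.

import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(r, P, lam=0.9):
    """Solve the primal LP for v_lambda^* with uniform weights alpha.

    r[s, a] are rewards, P[a, s, j] are transition probabilities.
    Constraints: v(s) - lam * sum_j p(j|s,a) v(j) >= r(s, a) for all (s, a).
    """
    n_states, n_actions = r.shape
    alpha = np.full(n_states, 1.0 / n_states)    # objective weights
    # One <= constraint per (s, a): (lam * P[a][s] - e_s) . v <= -r(s, a)
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            row = lam * P[a, s, :].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-r[s, a])
    res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states, method="highs")
    return res.x    # optimal value function v_lambda^*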