Topic 5
Markov Decision Problems and Dynamic Programming
Finite Horizon Case
1 Controlled Markov Chains
Consider a Markov chain (M.C.) with countable state space $S$. The control action at time $k$, $u_k$, takes values in $U(x_k) \subset U$. Note that the choice of control at time $k$ is restricted to $U(x_k)$, a subset of the set $U$ of all allowable control actions that depends on the state of the system at time $k$. The choice of action $u \in U$ affects the transition probabilities, i.e.,
$$P(\cdot) : U \to \mathcal{P},$$
where $\mathcal{P}$ is the set of all stochastic matrices of dimension $|S|$, with the interpretation that
$$P_{ij}(u_k) = \mathrm{Prob}\{x_{k+1} = j \mid x_k = i, u_k\}.$$
Here we assume the action space is finite (most of the results can be extended, with care, to the case where $U$ is compact or countable).
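As a concrete illustration, here is a minimal sketch (not from the notes; the number of states, the action set, and the matrices are made up for illustration) of how such a controlled chain can be represented numerically, with one stochastic matrix per action:

import numpy as np

# Hypothetical controlled Markov chain: |S| = 3 states, two actions {0, 1}.
# P[u][i, j] = Prob{x_{k+1} = j | x_k = i, u_k = u}; each P[u] is a stochastic matrix.
P = {
    0: np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.0, 0.3, 0.7]]),
    1: np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.5, 0.4],
                 [0.0, 0.1, 0.9]]),
}

rng = np.random.default_rng(0)

def step(i, u):
    """Sample x_{k+1} given x_k = i and u_k = u."""
    return rng.choice(P[u].shape[1], p=P[u][i])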
1.1 Control Policy (Feedback Law) in Case of Perfect Observations
A policy is nothing but a strategy (or plan of action). In general, a policy is the description of a feedback controller, and hence its domain is determined by the available observations (we will discuss this further later). For now, though, we assume that a policy is a plan of action for each state $x_k \in S$ at time $k$ (in other words, we assume that the M.C. is perfectly observable). Mathematically, $\theta := \{\theta_0, \theta_1, \ldots\}$ is a sequence of stochastic kernels on the action space $U$. In other words,
$$\theta_k(x^k) := \theta_k(x_0, x_1, \ldots, x_k) \in \Theta_k,$$
where $\Theta_k$ is the space of all distributions on $U$. In other words, at time $k$, and given the history¹ $x^k$, action $u \in U(x_k)$ is taken with probability $\theta_k(u \mid x^k)$ (needless to say, $\sum_{u \in U(x_k)} \theta_k(u \mid x^k) = 1$ for all $x^k$).

¹ The notation $a^k$ is used to represent all past values associated with the parameter $a$, i.e. $a^k = (a_0, a_1, \ldots, a_k)$.
Definition 1 A policy is said to be deterministic if the stochastic kernel $\theta_k$ is deterministic, i.e. for any $x^k$ there exists a $u \in U(x_k)$ such that $\theta_k(u \mid x^k) = 1$. In such a case, the policy is described by a sequence of functions $g := \{g_0, g_1, \ldots\}$,
$$g_k(x^k) := g_k(x_0, x_1, \ldots, x_k) \in U.$$
It turns out that we can restrict attention to deterministic policies when $S$ is countable and $U$ is finite. For this reason we focus on the more traditional setting where policies are described as a sequence of functions $\{g_k\}$.
Definition 2 A policy $g$ is said to be Markov if $g = \{g_0, g_1, \ldots\}$ is a feedback policy such that $g_k$ depends only on the current state $x_k$ and not on the past states.
Definition 3 A Markov policy $g$ is said to be stationary if $g_0 = g_1 = \cdots = g$.
Lemma 1 When a Markov policy $g$ is employed, the resulting state process $\{x_k\}$ is a Markov process, and the m-step transition matrix at time $k$ is given by
$$P_k^g\, P_{k+1}^g \cdots P_{k+m-1}^g,$$
where
$$(P_l^g)_{ij} := P_{ij}(g_l(i)).$$
(See (3.2) of the textbook for the calculation of $P_k^g$.)
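As a sketch of this lemma, continuing the hypothetical P from the previous snippet (policy_matrix and m_step_matrix are illustrative names of my own), the matrices $P_l^g$ and their m-step product can be computed as follows:

def policy_matrix(g_l):
    """Build P_l^g: row i of P_l^g is row i of the transition matrix of the action g_l(i)."""
    n = P[0].shape[0]
    return np.array([P[g_l(i)][i] for i in range(n)])

def m_step_matrix(g, k, m):
    """Return P_k^g P_{k+1}^g ... P_{k+m-1}^g for a (time-varying) Markov policy g = [g_0, g_1, ...]."""
    M = np.eye(P[0].shape[0])
    for l in range(k, k + m):
        M = M @ policy_matrix(g[l])
    return M

# Example: a stationary Markov policy that applies action 0 in state 0 and action 1 otherwise.
g = [lambda i: 0 if i == 0 else 1] * 5
print(m_step_matrix(g, k=0, m=3))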
2 Cost of a Policy: Finite Horizon Case
A policy determines the probability distribution of the state process $\{x_k\}$ and of the control process generated by the kernels $\{\theta_k(x^k)\}$. Different policies lead to different probability distributions. In optimal control one is interested in finding the best, also known as optimal, policy. To do this one needs to compare different policies, which is done by specifying a cost function. In most of this course we deal with additive costs. In other words, for a known and fixed sequence of state-action pairs $\{(x_k, u_k)\}$, the cost over horizon $N$ is given as
$$C^\theta := \sum_{k=0}^{N-1} c_k(x_k, u_k) + c_N(x_N),$$
where $c_k(i, u)$ can be interpreted as the cost to be paid if, at time $k$, the state is $x_k = i$ and control action $u_k = u$ is taken; it is referred to as the immediate or one-period cost. $c_N(x_N)$ is the terminal cost.
Notice that the stochastic evolution of the system $x_k$ and the choices of $u_k$ all depend on the choice of policy $\theta$. In other words, $\sum_{k=0}^{N-1} c_k(x_k, u_k) + c_N(x_N)$ is a random variable. Hence, it is more meaningful to talk about the expected cost:
$$J(\theta) := E^\theta C^\theta = E^\theta\Big\{\sum_{k=0}^{N-1} c_k(x_k, u_k) + c_N(x_N)\Big\}.$$
Why $E^\theta$? What does the notation imply?
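Since a policy together with the transition probabilities determines the distribution of the whole trajectory, $J(\theta)$ can be estimated by simulation. A minimal Monte Carlo sketch, reusing P, step, and rng from the first snippet (simulate_cost and its argument names are mine, chosen for illustration):

def simulate_cost(theta, c, cN, x0, N, n_runs=10000):
    """Monte Carlo estimate of J(theta) = E^theta{ sum_{k<N} c_k(x_k, u_k) + c_N(x_N) }.
    theta[k](history) returns a probability vector over actions (a randomized kernel)."""
    total = 0.0
    for _ in range(n_runs):
        x, hist, cost = x0, [x0], 0.0
        for k in range(N):
            probs = theta[k](hist)               # theta_k( . | x^k )
            u = rng.choice(len(probs), p=probs)  # sample u_k
            cost += c[k](x, u)
            x = step(x, u)                       # sample x_{k+1} from P[u][x]
            hist.append(x)
        total += cost + cN(x)
    return total / n_runs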
2.1 Cost of a Markov Policy
As seen before, for any finite horizon $N$,
$$J(\theta) := E^\theta \sum_{k=0}^{N} c_k(x_k, u_k). \qquad (1)$$
Here, for notational simplicity, we have absorbed the terminal cost $c_N$ into the sum, but we need to keep the difference in mind.
In addition to the general cost function above, some non-additive cost functions can be put in this
form.
Example 1 Suppose the cost associated with the maintenance of a stochastic system is such that the system incurs a penalty of $A$ dollars if its state ever crosses a threshold $\alpha$, and no cost otherwise. Consider a finite horizon $N$. Show that the expected cost can be written in the form given by (1).
We will see that it is sufficient to restrict attention to the class of deterministic Markov policies, i.e. there always exists an optimal policy which is deterministic Markov. For this reason, in this section we derive formulas which express the cost function under a deterministic Markov policy $g = \{g_0, g_1, \ldots, g_N\}$. Recall that, due to the Markovian assumption on $g$, the controlled process remains Markovian, i.e.
$$J(g) = E^g \sum_{k=0}^{N} c_k(x_k, u_k) = E^g \sum_{k=0}^{N} c_k(x_k, g_k(x_k)).$$
Now, using the structure of the transition probability matrices, we have:
$$J(g) = \sum_{k=0}^{N} E^g\big[c_k(x_k, g_k(x_k))\big] = \sum_{k=0}^{N} \sum_{i} P(x_k = i)\, c_k(i, g_k(i)),$$
where each marginal $P(x_k = i)$ is obtained from the initial distribution and the product $P_0^g \cdots P_{k-1}^g$.
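This formula can be evaluated by propagating the state marginals forward in time. A minimal sketch under the same made-up setup, reusing policy_matrix from the earlier snippet (cost_of_markov_policy, c, and mu0 are illustrative names):

def cost_of_markov_policy(g, c, mu0, N):
    """J(g) = sum_{k=0}^{N} sum_i P(x_k = i) c_k(i, g_k(i)), with the terminal cost absorbed into c[N]."""
    mu = np.asarray(mu0, dtype=float)   # marginal distribution of x_k, starting from x_0
    J = 0.0
    for k in range(N + 1):
        J += sum(mu[i] * c[k](i, g[k](i)) for i in range(len(mu)))
        if k < N:
            mu = mu @ policy_matrix(g[k])  # P(x_{k+1} = j) = sum_i P(x_k = i) (P_k^g)_{ij}
    return J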
Definition 4 Define the expected cost incurred during times $k, \ldots, N$, when $x_k = i$, as
$$V_k^g(i) := E^g\Big\{\sum_{l=k}^{N} c_l(x_l, g_l(x_l)) \,\Big|\, x_k = i\Big\}.$$
What is the relationship between $J(g)$ and $V_k^g$?
Lemma 2 The functions $V_k^g(\cdot)$ can be calculated by backward recursion,
$$V_k^g(i) = c_k(i, g_k(i)) + \sum_j (P_k^g)_{ij} V_{k+1}^g(j), \qquad 0 \le k < N,$$
starting with the final condition
$$V_N^g(i) = c_N(i, g_N(i)).$$
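A minimal sketch of this backward recursion, again reusing the hypothetical P and policy_matrix from the earlier snippets (policy_value_functions is my own name):

def policy_value_functions(g, c, N):
    """Compute the V_k^g of Lemma 2 by backward recursion:
    V_N^g(i) = c_N(i, g_N(i)),
    V_k^g(i) = c_k(i, g_k(i)) + sum_j (P_k^g)_{ij} V_{k+1}^g(j)."""
    n = P[0].shape[0]
    V = [None] * (N + 1)
    V[N] = np.array([c[N](i, g[N](i)) for i in range(n)])
    for k in range(N - 1, -1, -1):
        stage = np.array([c[k](i, g[k](i)) for i in range(n)])
        V[k] = stage + policy_matrix(g[k]) @ V[k + 1]
    return V  # weighting V[0] by the initial distribution recovers J(g)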
In fact, we can use the same arguments to arrive at the following generalization:
Lemma 3 Let $\{x_k, 0 \le k \le N\}$ be a Markov chain with transition probabilities $P_{x_{k+1} \mid x_k}$. Let $\{h_k\}$ be functionals. Define the set of functionals
$$H_N^g(i) = h_N(i),$$
$$H_k^g(i) = h_k(i, g_k(i)) + \sum_j (P_k^g)_{ij} H_{k+1}^g(j).$$
Then the random variables $H_k^g(x_k)$ satisfy
$$H_k^g(x_k) = E\Big\{\sum_{l=k}^{N} h_l(x_l, g_l(x_l)) \,\Big|\, x_k\Big\} = E\Big\{\sum_{l=k}^{N} h_l(x_l, g_l(x_l)) \,\Big|\, x_k, \ldots, x_0\Big\}. \qquad (2)$$
Now that we have seen the basic (recursive) equations to calculate the cost of Markov policies, we are ready to tackle the problem of finding an optimal policy.
3 Optimal Policy and Dynamic Programming
Dynamic programming is based on the recursive structure discussed in the last lecture. In other words, in dynamic programming problems the computation proceeds step by step:
$$E\{C^g \mid x_k = i\} = \inf_{u}\Big\{c(i, u) + E_{x_{k+1} \mid u_k, x_k}\, E\big(C^g \mid x_{k+1} = j,\ u_k = u,\ x_k = i\big)\Big\}.$$
What is the interpretation?
In other words, we are after recursive forms to describe the total cost (for simplicity of notation we assume that the expectation can be written in terms of a sum, as is the case with countable state space Markov chains):
$$V_k(i) := E\, C_k^g(i) = \inf_{u}\Big\{c(i, u) + \sum_j V_{k+1}(j)\, P_{ij}(u)\Big\}.$$
The validity of the recursion suggests that it is possible to restrict attention to the class of deterministic Markov strategies, since it requires the Markovian structure to remain valid for the controlled chain. What comes next is to ensure that such a restriction of attention to deterministic Markov policies is not consequential in terms of performance.
Recall that the cost of a deterministic Markov policy can be calculated recursively:
$$V_N^g(i) = c_N(i),$$
$$V_k^g(i) = c_k(i, g_k(i)) + \sum_j (P_k^g)_{ij} V_{k+1}^g(j). \qquad (3)$$
Also recall that, for a policy $\theta$ (possibly randomized and non-Markovian), the random variable
$$J_k^\theta = E\Big\{\sum_{l=k}^{N} c_l(x_l^\theta, u_l^\theta) \,\Big|\, x_0^\theta, \ldots, x_k^\theta\Big\}$$
is the cost-to-go at time $k$ corresponding to policy $\theta$, i.e. $J(\theta) = E J_0^\theta$. Because $\theta$ might be non-Markovian, a recursive relationship might not hold. Instead we have the following important lemma for arbitrary feedback policies:
Lemma 4 (Comparison Principle) Let $V_k(x)$, $0 \le k \le N$, be any set of functions such that
$$V_N(i) \le c_N(i), \qquad (4)$$
$$V_k(i) \le c_k(i, u) + \sum_j P_{ij}(u) V_{k+1}(j) \qquad (5)$$
for all states $i$ and all $u \in U(i)$. Let $\theta$ be an arbitrary policy. Then, w.p.1,
$$V_k(x_k^\theta) \le J_k^\theta, \qquad k = 0, \ldots, N.$$
Corollary 1 Let $V_k(x)$ be functions satisfying (4) and (5). Also assume there exists a policy $g$ such that $J_0^g = V_0(x_0)$. Then $g$ is optimal.
The following is the fundamental theorem of dynamic programming.
Theorem 1 Define recursively the functions
$$V_N(i) = c_N(i), \qquad (6)$$
$$V_k(i) = \inf_{u \in U(i)} \Big\{ c_k(i, u) + \sum_j P_{ij}(u) V_{k+1}(j) \Big\}. \qquad (7)$$
(i) For any arbitrary policy $\theta$, $V_k(x_k^\theta) \le J_k^\theta$ w.p.1; in particular, $J(\theta) \ge E V_0(x_0)$.
(ii) A Markov policy $g = (g_0, \ldots, g_{N-1})$ is optimal if for all $k$ the infimum in (7) is achieved at $g_k(x)$; hence, $V_k(x_k^g) = J_k^g$ and $J^* = J(g) = E V_0(x_0)$.
(iii) A Markov policy $g = (g_0, \ldots, g_{N-1})$ is optimal only if for each $k$ the infimum at $x_k^g$ in (7) is achieved by $g_k(x_k^g)$, i.e.
$$V_k(x_k^g) = c_k(x_k^g, g_k(x_k^g)) + \sum_j P_{x_k^g\, j}\big(g_k(x_k^g)\big)\, V_{k+1}(j) \qquad \text{w.p.1.}$$
Definition 5 The functions $V_k(x)$, $k = 0, 1, \ldots, N$, are called the (optimal) value functions.
Corollary 2 When dealing with a finite horizon problem, one can restrict attention to the class of Markov policies, $G_M$, without incurring a loss in optimality. (The only issue to worry about is the existence of an optimal Markov policy, i.e. a policy that achieves the infimum! The finiteness of the action space guarantees this. Note that, to ensure the existence of such a policy, we can instead assume that the action sets $U(x)$ are compact, in order to guarantee that the infimum is achieved.)
Note: Based on this theorem, one can devise a recursive DP algorithm:
Step 1: Define $V_N(x) := c_N(x)$.
Step 2: For $k = N-1, N-2, \ldots, 0$, find $g_k : S \to U$ by
$$g_k(i) = \arg\min_{u \in U(i)} \Big\{ c_k(i, u) + \sum_j P_{ij}(u) V_{k+1}(j) \Big\},$$
and set $V_k(i)$ to the corresponding minimum value, which is then used at stage $k-1$. A code sketch of this algorithm is given below.
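A minimal sketch of this backward DP algorithm under the same hypothetical setup (dynamic_programming, cN, and actions are names of my own; actions[i] plays the role of U(i)):

def dynamic_programming(c, cN, actions, N):
    """Backward DP of Theorem 1: V_N(i) = c_N(i),
    V_k(i) = min_{u in U(i)} { c_k(i, u) + sum_j P_{ij}(u) V_{k+1}(j) },
    with g_k(i) a minimizing action (one exists since U(i) is finite)."""
    n = P[0].shape[0]
    V = [None] * (N + 1)
    g = [dict() for _ in range(N)]
    V[N] = np.array([cN(i) for i in range(n)])
    for k in range(N - 1, -1, -1):
        V[k] = np.zeros(n)
        for i in range(n):
            q = {u: c[k](i, u) + P[u][i] @ V[k + 1] for u in actions[i]}
            g[k][i] = min(q, key=q.get)   # argmin over the finite action set U(i)
            V[k][i] = q[g[k][i]]
    return V, g  # the optimal expected cost starting from x_0 = i is V[0][i]

# Example with made-up costs: c_k(i, u) = i + u for k < N, terminal cost c_N(i) = 2 * i.
N = 4
c = [lambda i, u: i + u] * N
V, g = dynamic_programming(c, cN=lambda i: 2 * i,
                           actions={0: [0, 1], 1: [0, 1], 2: [0, 1]}, N=N)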
3.1 Generalizations
So far we have proved and studied the dynamic programming results for a controlled Markov chain
with countable state space and finite action space. The results can be generalized. See the references in
the textbook.