AMRX Applied Mathematics Research eXpress
2005, No. 2
Value Iteration for Average Cost Markov
Decision Processes in Borel Spaces
Quanxin Zhu and Xianping Guo
1 Introduction
This paper studies the average cost problem for discrete-time Markov decision processes
(MDPs) in Borel state and action spaces. The costs may have neither upper nor lower
bounds, in contrast to the nonnegative (or bounded-below) costs widely assumed in the previous literature. Our approach to this problem is the value iteration algorithm (VIA), also known as the successive approximation algorithm.
The VIA for average cost problems in discrete-time MDPs is originally due to
White [19] and it has been widely studied. As is well known, when the state and action
spaces are both finite, the VIA converges in a finite number of steps (see, e.g., [3, 4, 7, 14]).
However, when the state space is countably infinite, a counterexample shows that the VIA may fail to converge even when the action space is compact (see, e.g., [14]). Thus, a main concern is to find conditions ensuring that the VIA converges, and an extensive literature has been devoted to this question (see, e.g., [1, 2, 5, 6, 14, 16, 17]
for denumerable MDPs and [10, 11, 13] for MDPs in Borel spaces). In this paper, we also
deal with the VIA for MDPs in Borel spaces, and we now briefly describe some existing work. (I) When the costs/rewards are bounded, the simultaneous Doeblin condition, which implies the strong ergodicity condition, is shown in [5] to ensure that the VIA converges. The main results in [5] have been extended to the case of the so-called simple and relatively simple conditions [2]. (II) When the costs are nonnegative (or bounded below),
one of the optimality conditions is that the relative difference hα (x) := Vα∗ (x) − Vα∗ (x0 ) of
the discounted optimal value function is assumed to be bounded below in both states x
and discount factors α, where Vα∗ (x) is the discounted optimal value function (see, e.g.,
[11, 13, 17]). (III) For the much more general case when the costs have neither upper nor
lower bounds, to the best of our knowledge, only Cavazos-Cadena addresses the problem. In [6] he provides a Lyapunov function condition, which is stronger because it requires that some states be positive recurrent, and under which the VIA yields convergent approximations to the solution of the average cost optimality equation. However, the treatment in [6] is restricted to denumerable MDPs, and the VIA there does not yield an average optimal stationary policy. In this paper, we further study the
much more general case in Borel spaces. We give another set of weaker conditions under
which we first obtain the average cost optimality equation; then, under an additional condition (Assumption 3.8), we show that the VIA yields the optimal (minimum) average cost, an average optimal stationary policy, and a solution to the average cost optimality equation. A key feature of our conditions is that the function hα(x) is required to be bounded only in the discount factors α, whereas in [11, 13, 17] hα(x) is assumed to be bounded below in both the states x and the
discount factors α. Finally, we provide an example to illustrate our conditions.
The rest of this paper is organized as follows. In Section 2, we introduce the control model and the optimality problem that we are concerned with. After the optimality conditions and the average cost optimality equation are given in Section 3, we study the VIA for
average cost Markov decision processes in Section 4. Our conditions are illustrated with
an example in Section 5.
2 The optimal control problem
Notation. If X is a Borel space (i.e., a Borel subset of a complete and separable metric
space), we denote by B(X) the Borel σ-algebra.
In this section, we first introduce the control model
$$\big\{S,\ \big(A(x)\subset A,\ x\in S\big),\ Q(\cdot\,|\,x,a),\ c(x,a)\big\}, \tag{2.1}$$
where S and A are the state and the action spaces, respectively, which are assumed to
be Borel spaces, and A(x) denotes the set of available actions at each state x ∈ S. We
suppose that the set
$$K := \big\{(x,a) : x\in S,\ a\in A(x)\big\} \tag{2.2}$$
is a Borel subset of S × A. Furthermore, Q(·|x, a) with (x, a) ∈ K, the transition law, is a
stochastic kernel on S given K.
Finally, c(x, a), the cost function, is assumed to be a real-valued measurable function on K. (As c(x, a) is allowed to take positive and negative values, it can also be interpreted as a reward function rather than a “cost.”)
To state the optimal control problem that we are concerned with, we need to introduce the classes of admissible control policies.
For each t ≥ 0, let Ht be the family of admissible histories up to time t, that is,
H0 := S and Ht := K × Ht−1 for each t ≥ 1.
Definition 2.1. A randomized history-dependent policy is a sequence π := (πt , t ≥ 0) of
stochastic kernels πt on A given Ht satisfying
$$\pi_t\big(A(x)\,\big|\,h_t\big) = 1 \quad \forall\, h_t = \big(x_0,a_0,\dots,x_{t-1},a_{t-1},x\big)\in H_t,\ t\ge 0. \tag{2.3}$$
The class of all randomized history-dependent policies is denoted by Π. A randomized
history-dependent policy π := (πt , t ≥ 0) ∈ Π is called (deterministic) stationary if there
exists a measurable function f on S with f(x) ∈ A(x) for all x ∈ S, such that
$$\pi_t\big(\{f(x)\}\,\big|\,h_t\big) = \pi_t\big(\{f(x)\}\,\big|\,x\big) = 1 \quad \forall\, h_t = \big(x_0,a_0,\dots,x_{t-1},a_{t-1},x\big)\in H_t,\ t\ge 0. \tag{2.4}$$
For simplicity, denote this policy by f. The class of all stationary policies is denoted by F,
which means that F is the set of all measurable functions f on S with f(x) ∈ A(x) for all
x ∈ S.
For each x ∈ S and π ∈ Π, by the well-known Tulcea’s theorem [8, 10, 12], there
exist a unique probability measure space (Ω, F, Pxπ ) and a stochastic process {xt , at , t ≥
0} defined on Ω such that, for each D ∈ B(S) and t ≥ 0,
$$P_x^\pi\big(x_{t+1}\in D \,\big|\, h_t, a_t\big) = Q\big(D\,\big|\,x_t,a_t\big), \tag{2.5}$$
with ht = (x0 , a0 , . . . , xt−1 , at−1 , xt ) ∈ Ht , where xt and at denote the state and action
variables at time t ≥ 0, respectively. The expectation operator with respect to $P_x^\pi$ is denoted by $E_x^\pi$. In particular, when the policy π := f is in F, the corresponding process {xt} is
a Markov chain with values in S and transition law Q(·|x, f(x)).
Now define the α-discounted cost and the long-run average expected cost criteria, respectively, as follows: for each π ∈ Π, x ∈ S, and α ∈ (0, 1),
$$V_\alpha(x,\pi) := E_x^\pi\Bigg[\sum_{t=0}^{\infty} \alpha^t c\big(x_t,a_t\big)\Bigg], \qquad V_\alpha^*(x) := \inf_{\pi\in\Pi} V_\alpha(x,\pi),$$
$$\bar V(x,\pi) := \limsup_{n\to\infty} \frac{V_n(x,\pi)}{n}, \qquad \bar V^*(x) := \inf_{\pi\in\Pi} \bar V(x,\pi), \tag{2.6}$$
where $V_n(x,\pi) := E_x^\pi\big[\sum_{t=0}^{n-1} c(x_t,a_t)\big]$ denotes the n-stage total expected cost up to time n when using the policy π, given the initial state x.
Definition 2.2. A policy π∗ ∈ Π is said to be α-discounted cost optimal if
$$V_\alpha\big(x,\pi^*\big) = V_\alpha^*(x) \quad \forall\, x\in S. \tag{2.7}$$
An average expected cost optimal policy is defined similarly.
To define the n-step value functions, let $V_0^*(\cdot) := 0$ and, assuming that $V_{n-1}^*(\cdot)$ has been defined, let
$$V_n^*(x) := \min_{a\in A(x)}\Big[ c(x,a) + \int_S V_{n-1}^*(y)\,Q(dy\,|\,x,a)\Big] \quad \forall\, n\ge 1,\ x\in S. \tag{2.8}$$
It is easy to see that $V_n^*(x)$ is the minimum expected n-step cost, given the initial state x; that is,
$$V_n^*(x) = \inf_{\pi\in\Pi} V_n(x,\pi) \quad \forall\, n\ge 1,\ x\in S. \tag{2.9}$$
The main goal of this paper is to give new conditions for the convergence of $V_n^*(x)$ and for
the VIA for average cost Markov decision processes in Borel spaces.
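Although the paper works in general Borel spaces, the recursion (2.8) is easy to visualize computationally when both the state and action sets are finite. The following Python sketch is our own illustration for a hypothetical finite model (it is not part of the original development) and simply computes the n-step value functions $V_0^*,\dots,V_n^*$ by dynamic programming.

```python
import numpy as np

def n_step_values(c, Q, n):
    """Compute the n-step value functions V_0^*, ..., V_n^* of (2.8)-(2.9) for a
    finite MDP.  c[x, a] is the one-step cost and Q[x, a, y] the transition
    probability; every action is assumed admissible at every state."""
    V = np.zeros(c.shape[0])              # V_0^* := 0
    values = [V.copy()]
    for _ in range(n):
        # V_k^*(x) = min_a [ c(x, a) + sum_y V_{k-1}^*(y) Q(y | x, a) ]
        V = np.min(c + Q @ V, axis=1)     # Q @ V has shape (states, actions)
        values.append(V.copy())
    return values
```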
3 Average cost optimality equation
In this section, we state conditions for the VIA for average cost Markov decision processes and establish the average cost optimality equation.
We will first introduce two sets of hypotheses. The first one, Assumption 3.1, combines a “Lyapunov-like inequality” condition with a growth condition on the one-step cost c.
Assumption 3.1. (1) There exist constants b ≥ 0 and 0 ≤ β < 1 and a (measurable) function w ≥ 1 on S such that
$$\int_S w(y)\,Q(dy\,|\,x,a) \le \beta w(x) + b \quad \forall\,(x,a)\in K. \tag{3.1}$$
(2) There exists a constant M > 0 such that |c(x, a)| ≤ Mw(x) for all (x, a) ∈ K.
Remark 3.2. Assumption 3.1(1) is well known as a “Lyapunov-like inequality” condition
(see, e.g., [12, page 121]). Obviously, the constant b in (3.1) can be replaced by a bounded
nonnegative measurable function b(x) on S, as in [12, Assumption 10.2.1(f)]. Assumption
3.1(2) is the so-called “growth condition” and it is not required when the costs are
bounded.
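As a purely illustrative aid (not part of the paper), the drift inequality (3.1) can be checked numerically once a concrete finite model, a candidate weight function w, and constants β and b are in hand; the sketch below, with names of our own choosing, tests the inequality at every state-action pair.

```python
import numpy as np

def satisfies_drift(Q, w, beta, b, tol=1e-12):
    """Check the Lyapunov-like inequality (3.1):
       sum_y w(y) Q(y | x, a) <= beta * w(x) + b   for all (x, a).
    Q has shape (states, actions, states); w has shape (states,)."""
    lhs = Q @ w                      # expected weight after one step, shape (states, actions)
    rhs = beta * w[:, None] + b      # right-hand side, broadcast over actions
    return bool(np.all(lhs <= rhs + tol))
```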
The second set of hypotheses we need consists of the following standard continuity-compactness conditions; see, for instance, [9, 14, 18] and their references.
Assumption 3.3. (1) For each x ∈ S, A(x) is compact.
(2) For each fixed x ∈ S, c(x, a) is lower semicontinuous in a ∈ A(x), and the function $\int_S u(y)\,Q(dy\,|\,x,a)$ is continuous in a ∈ A(x) for each bounded measurable function u on S, and also for u = w as in Assumption 3.1.
Remark 3.4. Assumption 3.3 is the same as [12, Assumption 10.2.1]. Obviously, Assumption 3.3 holds when A(x) is finite for each x ∈ S.
To get the average cost optimality equation, in addition to Assumptions 3.1 and
3.3, we give a new condition (Assumption 3.5 below). To state this assumption, we introduce the following notation.
For the function w ≥ 1 in Assumption 3.1, we define the weighted supremum norm $\|u\|_w$ for real-valued functions u on S by
$$\|u\|_w := \sup_{x\in S}\big[w(x)^{-1}\,|u(x)|\big], \tag{3.2}$$
and the Banach space $B_w(S) := \{u : \|u\|_w < \infty\}$.
Assumption 3.5. (1) There exist two functions v1 , v2 ∈ Bw (S) and some state x0 ∈ S such
that
$$v_1(x) \le h_\alpha(x) \le v_2(x) \quad \forall\, x\in S,\ \alpha\in(0,1), \tag{3.3}$$
where $h_\alpha(x) := V_\alpha^*(x) - V_\alpha^*(x_0)$ is the so-called relative difference of the function $V_\alpha^*(x)$.
(2) There exists a sequence {αn } ⊂ (0, 1) such that the sequence {hαn } is equicontinuous.
Remark 3.6. (a) Assumption 3.5(1) is from [9, Assumption C] and it is weaker than [11,
Assumption 5.4.1(b)], [13, condition C1(b)], and [17, conditions S2 and S3] because the
function v1 (x) may not be bounded below, whereas the function hα (x) is assumed to be
bounded below in [11, 13, 17].
(b) In [9] we give some sufficient conditions as well as examples to verify Assumption 3.5(1). These conditions are generalizations of the stochastic monotonicity condition and the “Lyapunov-like inequality” condition.
(c) Assumption 3.5(2) is the same as [11, Assumption 5.5.1] and it is a key condition to obtain the average cost optimality equation (see Theorem 3.7 below). Obviously,
Assumption 3.5(2) is not required when the state space S is denumerable.
Under Assumptions 3.1, 3.3, and 3.5, we may obtain the average cost optimality
equation and ensure the existence of average optimal stationary policies (see Theorem
3.7 below).
Theorem 3.7. Under Assumptions 3.1, 3.3, and 3.5, the following statements hold.
(a) There exist a unique constant g∗ , a function h∗ ∈ Bw (S), and a stationary policy f∗ ∈ F satisfying the average cost optimality equation
$$g^* + h^*(x) = \min_{a\in A(x)}\Big[ c(x,a) + \int_S h^*(y)\,Q(dy\,|\,x,a)\Big] \tag{3.4}$$
$$\phantom{g^* + h^*(x)} = c\big(x,f^*(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) \quad \forall\, x\in S. \tag{3.5}$$
(b) $g^* = \inf_{\pi\in\Pi} \bar V(x,\pi)$ for all x ∈ S.
(c) Any stationary policy f in F realizing the minimum of (3.4) is average optimal,
and so f∗ in (3.5) is an average optimal stationary policy.
Proof. (a) Under Assumptions 3.1, 3.3, and 3.5(1), Theorem 4.1(a) in [9] gives the existence of a unique constant g∗ , two functions h∗k ∈ Bw (S) (k = 1, 2), and a stationary policy
f∗ ∈ F satisfying the two average cost optimality inequalities
$$g^* + h_1^*(x) \le \min_{a\in A(x)}\Big[ c(x,a) + \int_S h_1^*(y)\,Q(dy\,|\,x,a)\Big] \quad \forall\, x\in S, \tag{3.6}$$
$$g^* + h_2^*(x) \ge \min_{a\in A(x)}\Big[ c(x,a) + \int_S h_2^*(y)\,Q(dy\,|\,x,a)\Big] \tag{3.7}$$
$$\phantom{g^* + h_2^*(x)} = c\big(x,f^*(x)\big) + \int_S h_2^*(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) \quad \forall\, x\in S. \tag{3.8}$$
Moreover, from the proof of (3.6)–(3.8) in [9], we have that there exists a subsequence
{αk } of {αn } such that
$$h_1^*(x) := \limsup_{k\to\infty} h_{\alpha_k}(x), \qquad h_2^*(x) := \liminf_{k\to\infty} h_{\alpha_k}(x) \quad \forall\, x\in S. \tag{3.9}$$
On the other hand, by Assumption 3.5(2) and the well-known Ascoli theorem, we obtain
$$h^*(x) := \lim_{k\to\infty} h_{\alpha_k}(x) = h_1^*(x) = h_2^*(x) \quad \forall\, x\in S, \tag{3.10}$$
which together with (3.6)–(3.8) gives part (a).
Obviously, parts (b) and (c) follow from the proof of [9, Theorem 4.1(b) and (c)].
To guarantee the convergence of value iteration for average cost Markov decision
processes, in addition to Assumptions 3.1, 3.3, and 3.5, we give another key condition
(Assumption 3.8 below).
Assumption 3.8. For the policy f∗ in Theorem 3.7(a),
(1) the sequence {Vn∗ } is equicontinuous;
(2) there exists a probability measure P on S such that
$$\lim_{n\to\infty} \int_S u(y)\,Q^n\big(dy\,\big|\,x,f^*(x)\big) = \int_S u(y)\,P(dy) \quad \forall\, x\in S, \tag{3.11}$$
for every bounded continuous function u on S, and P(G) > 0 for every open set G;
(3) let v1, v2 be the two functions in Assumption 3.5, and for n = 1, 2, . . . , suppose that fn ∈ F realizes the minimum in (2.8); that is,
$$V_n^*(x) = c\big(x,f_n(x)\big) + \int_S V_{n-1}^*(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, x\in S. \tag{3.12}$$
Then there exist two functions Lk (k = 1, 2) on S such that
(3a)
$$\int_S v_1(y)\,Q^n\big(dy\,\big|\,x,f^*(x)\big) \ge L_1(x) \quad \forall\, x\in S,\ n\ge 0; \tag{3.13}$$
(3b)
$$\int_S v_2(y)\,Q^n\big(dy\,\big|\,x,f_n(x),f_{n-1}(x),\dots,f_1(x)\big) \le L_2(x) \quad \forall\, x\in S,\ n\ge 0. \tag{3.14}$$
Remark 3.9. (a) Assumption 3.8 is an extension of [11, Assumption 5.6.1] and [13, Assumption 5.2] to the case that the costs are allowed to be negative (or unbounded below).
(b) The existence of the minimizers fn ∈ F in (3.12) is ensured by Assumption 3.3.
(c) Obviously, Assumption 3.8(3a) is not required when the costs are nonnegative
(or bounded below).
(d) Assumption 3.8(1) holds when the state space S is denumerable.
(e) When the state space S is Borel and the costs are nonnegative, some examples
are provided in [11, 13] to verify Assumption 3.8. In fact, when the costs in [11, 13] have
neither upper nor lower bounds, Assumption 3.8 still holds. For instance, let $c(x,a) := qx^2 + ra^2$, where r > 0 and q < 0. Then, under the conditions $-r/\alpha\beta^2 < q < -r\gamma^2/\beta^2$ and $\alpha\gamma^2 < 1$, we can use approaches similar to those in [11, 13] to verify Assumption 3.8 by a
direct calculation.
Finally, we give an important lemma to close this section.
Lemma 3.10. Suppose that A(x) is compact for all x ∈ S, and let {fn } be a stationary policy sequence in F. Then there exists a stationary policy f ∈ F such that f(x) ∈ A(x) is an
accumulation point of {fn (x)} for each x ∈ S.
Proof. For the proof, see [15, Proposition 12.2].
4 Value iteration algorithm
In this section, under Assumptions 3.1, 3.3, 3.5, and an additional condition (Assumption
3.8), we show that the VIA yields the optimal average cost, an average optimal stationary policy, and a solution to the average cost optimality equation. To this end, we use the following notation.
Let {Mn , n ≥ 0} be a sequence of constants, and define
$$g_n(x) := V_n^*(x) - M_n, \qquad k_n := M_n - M_{n-1}, \tag{4.1}$$
where $V_n^*$ is as in (2.8). Then we may rewrite (2.8) as
$$k_n + g_n(x) = \min_{a\in A(x)}\Big[ c(x,a) + \int_S g_{n-1}(y)\,Q(dy\,|\,x,a)\Big] \quad \forall\, n\ge 1,\ x\in S. \tag{4.2}$$
We consider the functions
$$u_n(x) := V_n^*(x) - V_n^*\big(x_0\big), \qquad w_n(x) := V_n^*(x) - V_{n-1}^*(x) \quad \forall\, x\in S, \tag{4.3}$$
where x0 is as in Assumption 3.5, and
$$e_n(x) := n g^* + h^*(x) - V_n^*(x) \quad \forall\, x\in S. \tag{4.4}$$
Notice that the following relations are satisfied for all x ∈ S:
$$w_n(x) = g^* - e_n(x) + e_{n-1}(x), \quad n\ge 1, \tag{4.5}$$
$$u_n(x) = h^*(x) - e_n(x) + e_n\big(x_0\big), \quad n\ge 0. \tag{4.6}$$
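The functions $u_n$ and $w_n$ are exactly the quantities one monitors in a relative value iteration scheme. The Python sketch below, again for a hypothetical finite model and with function and variable names of our own choosing, iterates (2.8) while tracking $u_n$, $w_n$, and a minimizer $f_n$ as in (3.12); Theorem 4.1 below states that, under our assumptions, $u_n \to h^*$, $w_n \to g^*$, and limit points of $f_n$ yield an average optimal stationary policy.

```python
import numpy as np

def relative_value_iteration(c, Q, x0=0, n_iter=500):
    """Iterate (2.8) and track u_n(x) = V_n^*(x) - V_n^*(x0) and
    w_n(x) = V_n^*(x) - V_{n-1}^*(x).  c[x, a] is the cost, Q[x, a, y] the
    transition law, and x0 the reference state from Assumption 3.5."""
    V_prev = np.zeros(c.shape[0])              # V_0^* := 0
    u = w = f_n = None
    for _ in range(n_iter):
        q_values = c + Q @ V_prev              # c(x, a) + sum_y V_{n-1}^*(y) Q(y | x, a)
        V = q_values.min(axis=1)
        f_n = q_values.argmin(axis=1)          # a minimizer f_n as in (3.12)
        u = V - V[x0]                          # u_n, approximates h^*
        w = V - V_prev                         # w_n, approximates g^* componentwise
        V_prev = V
    return u, w, f_n
```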
Theorem 4.1. Suppose that Assumptions 3.1, 3.3, 3.5, and 3.8 hold. Then the following
hold:
(a)
$$\lim_{n\to\infty} u_n(x) = h^*(x), \qquad \lim_{n\to\infty} w_n(x) = g^*, \tag{4.7}$$
and the convergence is uniform on compact subsets of S;
(b) for each x ∈ S, there exists a subsequence $n_i = n_i(x)$ (depending on x) such that the limit $\lim_{i\to\infty} f_{n_i}(x)$ exists;
(c) if the limit $\lim_{i\to\infty} f_{n_i}(x)$ is denoted by $\bar f(x)$, then $\bar f(x)\in A(x)$ for all x ∈ S, and $\bar f$ is an average expected cost optimal stationary policy.
Proof. (a) We will first prove that there exists a constant k such that
$$\lim_{n\to\infty} e_n(x) = k \quad \forall\, x\in S, \tag{4.8}$$
and the convergence is uniform on compact sets. In fact, from (4.4), we have
$$e_n(x) - e_n(y) = h^*(x) - h^*(y) - \big[V_n^*(x) - V_n^*(y)\big], \tag{4.9}$$
which together with the equicontinuity of h∗ (by Assumption 3.5(2) and Theorem 3.7)
and Vn∗ (by Assumption 3.8(1)) yields the equicontinuity of en .
Then, under Assumptions 3.3 and 3.8, from (2.8), we see that there exists fn ∈ F
such that
$$V_n^*(x) = c\big(x,f_n(x)\big) + \int_S V_{n-1}^*(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, x\in S,\ n\ge 1. \tag{4.10}$$
Combining (4.10) and (3.4), we get
$$\begin{aligned} g^* + h^*(x) &\le c\big(x,f_n(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \\ &= V_n^*(x) + \int_S \big[h^*(y) - V_{n-1}^*(y)\big]\,Q\big(dy\,\big|\,x,f_n(x)\big), \end{aligned} \tag{4.11}$$
which together with (4.4) yields
$$e_n(x) \le \int_S e_{n-1}(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, x\in S,\ n\ge 1. \tag{4.12}$$
Iterating this inequality, we have
$$e_n(x) \le \int_S e_{n-2}(y)\,Q^2\big(dy\,\big|\,x,f_n(x),f_{n-1}(x)\big) \le \cdots \le \int_S e_0(y)\,Q^n\big(dy\,\big|\,x,f_n(x),f_{n-1}(x),\dots,f_1(x)\big). \tag{4.13}$$
Thus, noting that e0 (x) := h∗ (x), by Assumptions 3.5(1) and 3.8(3b), we obtain
$$e_n(x) \le L_2(x) \quad \forall\, x\in S,\ n\ge 0. \tag{4.14}$$
On the other hand, for f∗ as in (3.5), from (2.8), again it follows that
$$V_{n+1}^*(x) \le c\big(x,f^*(x)\big) + \int_S V_n^*(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) \quad \forall\, n\ge 0,\ x\in S. \tag{4.15}$$
This inequality together with (4.4) and (3.5) gives
$$\begin{aligned} \int_S e_n(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) &= n g^* + \int_S h^*(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) - \int_S V_n^*(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) \\ &\le (n+1)\,g^* + h^*(x) - V_{n+1}^*(x) \quad \forall\, n\ge 0,\ x\in S. \end{aligned} \tag{4.16}$$
That is,
$$\int_S e_n(y)\,Q\big(dy\,\big|\,x,f^*(x)\big) \le e_{n+1}(x) \quad \forall\, n\ge 0,\ x\in S. \tag{4.17}$$
By induction and (4.17), we get
$$\int_S e_n(y)\,Q^m\big(dy\,\big|\,x,f^*(x)\big) \le e_{n+m}(x) \quad \forall\, m,n\ge 0,\ x\in S. \tag{4.18}$$
In particular, letting n = 0 in (4.18), we have
$$e_m(x) \ge \int_S h^*(y)\,Q^m\big(dy\,\big|\,x,f^*(x)\big) \quad \forall\, m\ge 0,\ x\in S, \tag{4.19}$$
which together with Assumptions 3.5(1) and 3.8(3a) yields
$$e_m(x) \ge L_1(x) \quad \forall\, m\ge 0,\ x\in S. \tag{4.20}$$
Combining (4.14) and (4.20), we have
$$L_1(x) \le e_n(x) \le L_2(x) \quad \forall\, n\ge 0,\ x\in S. \tag{4.21}$$
From (4.21) and the well-known Ascoli theorem, we see that there exist a subsequence
{eni } of {en } and a continuous function l such that
$$\lim_{i\to\infty} e_{n_i}(x) = l(x) \quad \forall\, x\in S, \tag{4.22}$$
and the convergence is uniform on compact sets. Therefore, from [12, Lemma 5.6.5] and
(4.22), there exists a constant k such that
$$\lim_{n\to\infty} e_n(x) = k \quad \forall\, x\in S, \tag{4.23}$$
and the convergence is uniform on compact sets. Thus, by (4.5), (4.6), and (4.23), we see that
part (a) is satisfied.
(b) Since A(x) is compact, from Lemma 3.10, there exists a stationary policy f̄
such that $\bar f(x)\in A(x)$ is an accumulation point of $\{f_n(x)\}$ for every x ∈ S; that is, for every x ∈ S, there exists a subsequence $n_i = n_i(x)$ (depending on x) such that
$$\lim_{i\to\infty} f_{n_i}(x) = \bar f(x), \tag{4.24}$$
which gives part (b).
We now prove (c). In fact, from (b), we have f̄(x) ∈ A(x) for all x ∈ S. Then, we
wish to prove that f̄ is an average expected cost optimal stationary policy. Indeed, from
part (a), we have that for arbitrary ε > 0, there exists an integer N > 0 such that
$$h^*(x) - \frac{\varepsilon}{3} \le u_n(x) \le h^*(x) + \frac{\varepsilon}{3} \quad \forall\, n\ge N,\ x\in S, \qquad w_n\big(x_0\big) \le g^* + \frac{\varepsilon}{3} \quad \forall\, n\ge N. \tag{4.25}$$
On the other hand, we may rewrite (2.8) and (4.3) as
$$\begin{aligned} w_n\big(x_0\big) + u_n(x) &= \min_{a\in A(x)}\Big[ c(x,a) + \int_S u_{n-1}(y)\,Q(dy\,|\,x,a)\Big] \\ &= c\big(x,f_n(x)\big) + \int_S u_{n-1}(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, n\ge 1,\ x\in S, \end{aligned} \tag{4.26}$$
which together with (4.25) yields
$$\begin{aligned} g^* + h^*(x) + \frac{2\varepsilon}{3} &\ge w_n\big(x_0\big) + u_n(x) \\ &= c\big(x,f_n(x)\big) + \int_S u_{n-1}(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \\ &\ge c\big(x,f_n(x)\big) + \int_S \Big[h^*(y) - \frac{\varepsilon}{3}\Big]\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, n\ge N,\ x\in S, \end{aligned} \tag{4.27}$$
and so
$$g^* + h^*(x) + \varepsilon \ge c\big(x,f_n(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,f_n(x)\big) \quad \forall\, x\in S. \tag{4.28}$$
By Assumption 3.3(2), part (b), and (4.28) as well as the “extension of Fatou’s lemma” [12,
Lemma 8.3.7], we obtain
$$\begin{aligned} g^* + h^*(x) + \varepsilon &\ge \lim_{i\to\infty}\Big[ c\big(x,f_{n_i}(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,f_{n_i}(x)\big)\Big] \\ &\ge c\big(x,\bar f(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,\bar f(x)\big) \quad \forall\, x\in S. \end{aligned} \tag{4.29}$$
Letting ε → 0, we have
$$g^* + h^*(x) \ge c\big(x,\bar f(x)\big) + \int_S h^*(y)\,Q\big(dy\,\big|\,x,\bar f(x)\big) \quad \forall\, x\in S. \tag{4.30}$$
Thus, from (4.30) and [9, Theorem 4.1], we conclude that f̄ is an average expected cost
optimal stationary policy.
5 An example
In this section, we use an example in [9] to illustrate our assumptions.
Example 5.1 (see [9, Example 5.1], a controlled queueing system). Consider a controlled
queueing system in which the state variable denotes the number of customers in the system. The arrival rate is assumed to be a positive constant λ, and the service rate µ is assumed to be controlled by a decision maker. Here, we interpret the service rate µ as the control action. When the system's state is x ∈ S := {0, 1, . . . }, the decision maker takes an action a from a given set A(x) ≡ [µ1, µ2] with µ2 > µ1 > 0, which increases or decreases the service rate as described by (5.1) and (5.2) below. The action incurs a cost c̄(x, a). In addition, the decision maker obtains a reward px during the period in which the system remains in state x, where the unit reward per customer is represented by a constant p > 0.
We now formulate this system as a discrete-time Markov decision process. The
corresponding transition law Q(y|x, a) and cost rates c(x, a) are given as follows:
$$Q(0\,|\,0,a) := \frac{\mu_2}{\lambda+\mu_2}, \qquad Q(1\,|\,0,a) := \frac{\lambda}{\lambda+\mu_2} \quad \forall\, a\in[\mu_1,\mu_2]. \tag{5.1}$$
For each x ≥ 1 and a ∈ [µ1, µ2],
$$Q(y\,|\,x,a) := \begin{cases} \dfrac{a}{\lambda+\mu_2} & \text{if } y = x-1,\\[6pt] \dfrac{\mu_2-a}{\lambda+\mu_2} & \text{if } y = x,\\[6pt] \dfrac{\lambda}{\lambda+\mu_2} & \text{if } y = x+1,\\[6pt] 0 & \text{otherwise}. \end{cases} \tag{5.2}$$
$$c(x,a) := \bar c(x,a) - px \quad \text{for } (x,a)\in K := \{(x,a) : x\in S,\ a\in A(x)\}. \tag{5.3}$$
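For a rough numerical illustration (ours, not drawn from [9] or the present analysis), the transition law (5.1)-(5.2) and the cost (5.3) can be assembled for a version of the model truncated at a finite queue length; the truncation level, the discretization of the action interval, and the reflection of arrivals at the upper boundary below are our own choices.

```python
import numpy as np

def build_queue_mdp(lam, mu1, mu2, p, c_bar, n_states=200, n_actions=25):
    """Build a truncated version of the queueing model (5.1)-(5.3).
    States {0, ..., n_states-1} truncate S = {0, 1, ...}; the action grid
    discretizes [mu1, mu2]; lam, mu1, mu2, p, and c_bar(x, a) are user inputs."""
    actions = np.linspace(mu1, mu2, n_actions)
    denom = lam + mu2
    Q = np.zeros((n_states, n_actions, n_states))
    c = np.zeros((n_states, n_actions))
    for i, a in enumerate(actions):
        # State 0: transitions (5.1), which do not depend on the action.
        Q[0, i, 0] = mu2 / denom
        Q[0, i, 1] = lam / denom
        # States x >= 1: transitions (5.2), with the arrival reflected at the
        # truncation level so that each row still sums to one.
        for x in range(1, n_states):
            Q[x, i, x - 1] = a / denom
            Q[x, i, x] = (mu2 - a) / denom
            Q[x, i, min(x + 1, n_states - 1)] += lam / denom
        c[:, i] = [c_bar(x, a) - p * x for x in range(n_states)]   # cost (5.3)
    return c, Q
```

Feeding c and Q to a relative value iteration routine such as the one sketched in Section 4 then gives a numerical approximation of $g^*$ and of an optimal service-rate policy on the truncated model.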
We suppose the following.
(E1) The parameter λ satisfies eλ < µ1, where e is the well-known exponential constant.
(E2) The function c̄(x, a) is continuous in a ∈ A(x) = [µ1, µ2] for each fixed x ∈ S, and $\bar c^*(x) := \sup_{a\in A(x)} |\bar c(x,a)| < Cx$ for all x ∈ S and some constant C > 0.
Under (E1) and (E2), it has been shown in [9] that Assumptions 3.1, 3.3, and 3.5(1) hold. Since the state space S is denumerable, by Remarks 3.6(c) and 3.9(d), we see that Assumptions 3.5(2) and 3.8(1) are satisfied. Moreover, from [9, Lemma 3.3 and Example 5.2], we
conclude that for each f ∈ F, the Markov process {xt} is uniformly w-exponentially ergodic, which implies Assumption 3.8(2). It remains to verify Assumption 3.8(3). In fact, from
the proof of [9, Lemma 3.3 and Example 5.2], we have that v1 (x) ≤ hα (x) ≤ v2 (x), with
$v_1(x) = -v_2(x) = -MR\big(1 + w(x_0) + 2b'/R\big)w(x)$, where x0 is as in Assumption 3.5, R and b′ are the two constants in [9, Lemma 3.3], $w(x) = e^x$, and M = p + C. In particular, letting x0 = 0, we get $v_1(x) = -v_2(x) = -MR(2 + 2b'/R)e^x$. Now take $L_1(x) = -L_2(x) = -MR(2 + 2b'/R)\big(e^x + b/(1-\beta)\big)$; then we have
$$\begin{aligned}
\int_S v_1(y)\,Q(dy\,|\,x,a) &= -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\int_S w(y)\,Q(dy\,|\,x,a) \\
&\ge -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\big[\beta w(x) + b\big], \\
\int_S v_1(y)\,Q^2(dy\,|\,x,a) &\ge -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\Big[\beta\int_S w(y)\,Q(dy\,|\,x,a) + b\Big] \\
&\ge -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\big[\beta^2 w(x) + \beta b + b\big], \\
&\ \ \vdots \\
\int_S v_1(y)\,Q^n(dy\,|\,x,a) &\ge -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\big[\beta^n w(x) + \beta^{n-1} b + \cdots + \beta b + b\big] \\
&\ge -MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\Big[w(x) + \frac{b}{1-\beta}\Big] \quad \forall\,(x,a)\in K,
\end{aligned} \tag{5.4}$$
where
$$\beta := \frac{e^2\lambda + e\mu_2 + \mu_1(1-e)}{e\big(\lambda+\mu_2\big)} \in (0,1), \qquad b := \frac{\mu_1(e-1)}{e\big(\lambda+\mu_2\big)}. \tag{5.5}$$
And so
$$\int_S v_1(y)\,Q^n\big(dy\,\big|\,x,f^*(x)\big) \ge L_1(x). \tag{5.6}$$
On the other hand, we have
$$\begin{aligned} \int_S v_2(y)\,Q(dy\,|\,x,a) &= MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\int_S w(y)\,Q(dy\,|\,x,a) \\ &\le MR\Big(1 + w\big(x_0\big) + \frac{2b'}{R}\Big)\big[\beta w(x) + b\big]. \end{aligned} \tag{5.7}$$
Thus, similar to the proof of (5.6), we obtain
$$\int_S v_2(y)\,Q^n\big(dy\,\big|\,x,f_n(x),f_{n-1}(x),\dots,f_1(x)\big) \le L_2(x). \tag{5.8}$$
Hence, from (5.6) and (5.8), we see that Assumption 3.8(3) is true.
Acknowledgments
This research was partially supported by NCET, EYTP, and NSFC.
References
[1] A. Arapostathis, V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus, Discrete-time controlled Markov processes with average cost criterion: a survey, SIAM J. Control Optim. 31 (1993), no. 2, 282–344.
[2] Y. Aviv and A. Federgruen, The value iteration method for countable state Markov decision processes, Oper. Res. Lett. 24 (1999), no. 5, 223–234.
[3] B. W. Brown, On the iterative method of dynamic programming on a finite space discrete time Markov process, Ann. Math. Statist. 36 (1965), 1279–1285.
[4] X. R. Cao and X. P. Guo, A unified approach to Markov decision problems and performance sensitivity analysis with discounted and average criteria: multichain cases, Automatica 40 (2004), no. 10, 1749–1759.
[5] R. Cavazos-Cadena, A note on the convergence rate of the value iteration scheme in controlled Markov chains, Systems Control Lett. 33 (1998), no. 4, 221–230.
[6] R. Cavazos-Cadena, A pause control approach to the value iteration scheme in average Markov decision processes, Systems Control Lett. 33 (1998), no. 4, 209–219.
[7] R. Cavazos-Cadena, Value iteration and approximately optimal stationary policies in finite-state average Markov decision chains, Math. Methods Oper. Res. 56 (2002), no. 2, 181–196.
[8] E. B. Dynkin and A. A. Yushkevich, Controlled Markov Processes, Grundlehren der Mathematischen Wissenschaften, vol. 235, Springer, New York, 1979.
[9] X. P. Guo and Q. X. Zhu, Average optimality for Markov decision processes in Borel spaces: a new condition and approach, submitted to J. Appl. Probab.
[10] O. Hernández-Lerma, Adaptive Markov Control Processes, Applied Mathematical Sciences, vol. 79, Springer, New York, 1989.
[11] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, Applications of Mathematics, vol. 30, Springer, New York, 1996.
[12] O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Applications of Mathematics, vol. 42, Springer, New York, 1999.
[13] R. Montes-de Oca and O. Hernández-Lerma, Value iteration in average cost Markov control processes on Borel spaces, Acta Appl. Math. 42 (1996), no. 2, 203–222.
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons, New York, 1994.
[15] M. Schäl, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal, Z. Wahrscheinlichkeitstheorie und Verw. Gebiete 32 (1975), no. 3, 179–196.
[16] L. I. Sennott, Value iteration in countable state average cost Markov decision processes with unbounded costs, Ann. Oper. Res. 28 (1991), no. 1–4, 261–271.
[17] L. I. Sennott, The convergence of value iteration in average cost Markov decision chains, Oper. Res. Lett. 19 (1996), no. 1, 11–16.
[18] L. I. Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability and Statistics: Applied Probability and Statistics, John Wiley & Sons, New York, 1999.
[19] D. J. White, Dynamic programming, Markov chains, and the method of successive approximations, J. Math. Anal. Appl. 6 (1963), 373–376.
Quanxin Zhu: Department of Mathematics, South China Normal University, Guangzhou 510631,
China; School of Mathematics and Computational Science, Zhongshan University, Guangzhou 510275,
China
E-mail address: [email protected]
Xianping Guo: School of Mathematics and Computational Science, Zhongshan University, Guangzhou
510275, China
E-mail address: [email protected]