Contracting Markov decision processes
van Nunen, J.A.E.E.
DOI:
10.6100/IR29937
Published: 01/01/1976
Document Version
Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)
Please check the document version of this publication:
• A submitted manuscript is the author's version of the article upon submission and before peer-review. There can be important differences
between the submitted version and the official published version of record. People interested in the research are advised to contact the
author for the final version of the publication, or visit the DOI to the publisher's website.
• The final author version and the galley proof are versions of the publication after peer review.
• The final published version features the final layout of the paper including the volume, issue and page numbers.
Link to publication
Citation for published version (APA):
van Nunen, J. A. E. E. (1976). Contracting Markov decision processes Amsterdam: Mathematisch Centrum DOI:
10.6100/IR29937
General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners
and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain
• You may freely distribute the URL identifying the publication in the public portal ?
Take down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately
and investigate your claim.
Download date: 28. Jul. 2017
i
.-:.
CONTRACTING MARKOV DECISION PROCESSES
\
,
CONTRACTING MARKOV DECISION PROCESSES
PROEFSCHIRFT
TER VERKRTJGING VAN VE GRAAV VAN VOCTOR IN VE
TECHNISCHE WETENSCHAPPEN AAN VE TECHNISCHE
HOGESCHOOL EINVHOVEN, OP GEZAG VAN VE RECTOR
MAGNIFICUS, PROF.VR.IR. G. VOSSERS, VOOR E'EN
COMMISSIE AANGEWEZEN VOOR HET COLLEGE VAN
VEKANEN ZN HET OPENBAAR TE VERVEVZGEN OP
VINSVAG 25 MEI 1916 TE 16.00 UUR.
do011.
JOHANNES ARNOLVUS ELISABETH EMMANUEL VAN NUNEN
GEBOREN TE VENLO
VU
p11.0e6~c.hhi.6t
..U goedgekewr.d
dooJt. de pMmo:toJten
PM 6. dlt. J. Wu~ e1.o
en
PM 6. dJt.. S. T. M. AckeJUna.11.6
VOOJf.
MaJtlun
vooJt. mi..jn ou.deJU
Page 11: replace line 11 and 12 from above by
= l; v1 ~ 0
[6((0,ill =OJ; YaEG.,[[o(a)
~OJ
(2.2.ll
o((OJ)
(2.2.2)
Ya,S€G., [o(a) = O•o((a,13))=0]; yaEG., yiES\{O}
.. [o((a,O)J
[o((ct,
= 1]]
O,i)) =OJ.
page 11: replace line 1 from below by
page 12: replace line 6 and 7 from above by
(ii)
(0)
E G; YctEG [
(a,0)
E G].
page 12: replace line 9 from below by
G
n
I
: = B u { (a, S)
n
B E
o; E B,
u
{O}k}, with B= ( u
k=l
k=l
page 13: replace line 6 from above by
page 14
line 3 from above: replace o(a) < 1 by o(a)
~
o.
page 14: replace line 6 from below by
..
G (0) :=
8
k
u {O}; GH(i)
k=l
:=
{(i,a)
I aE
i-1
u GH(j)} u {i}, for i ~ 0.
j=O
page 14: replace line 4 and 3 from below by
G := S u ( u
Bk) u { (a,S)
k=2
with B c: s\{O}.
page 133
line 13 from below: add
(to appear in Computing)
I
ct € S u ( u
k=2
Bk), S
E
u
k=l
{O}k}
CONTENTS
1. INTR.OVUCTION
'l.. PRELIMINARIES
2. 1.
2.2.
'l.. 3.
2. 4.
8
No:t:.ati:on4
Stopp.lng ti.mei.
Wugh.ted .6 upJte.mum nolf.ln6
14
Some Jr.e.ma/!.k.6 on bounding 6unc.ti.on4
J5
8
10
3. MARKOV REWARV PROCESSES
19
3. 1. The Mallkov Jr.eWaJtd model
3. 'l.. Con:tJuu!ti.on ma.pp.i..ng.6 and .e..topp.lng Ume-6
3. 3. A cll6cU6.6.i..on on :the tu,.t.wnpti.on4 3.1. 2-3.1. 5
4. MARKOV 17EC1S10N PROCESSES
19
U
32
37
4. 1. The Mallkov ded.6.i..on model
4. 'l.. Ved.6.i..on 11,Ulu and a.6.t.wnp.tlon4
4. 3. Some pJtopeJttlu 06 MaJckov dec-U.i..on pJtoc.U4U
4. 4. RemaJ!k.6 on :the a.6.6umpti.on4 4. 'l.. 1-4. 'l.. 4
5. STOPPING TIMES ANV CONTRACTION IN MARKOV 17EC1SION PROCESSES
5. 1. The cof!Vt.a.cti.on mapp.lng L~
5.2. The opti..mal Jte.twm. mapp.lng o&
37
38
43
59
63
64
70
6, VAWE ORlENTEV SUCCESSIVE APPROXIMATION
81
6.1. PoUcq hnpltovement value detellmlna:tlon pMc.eduJte.6
81
6. 2. Some lle.tllalllu on :the value Miented me:thod.6
92
7. UPPER BOUNVS, Lell/ER BOUNVS ANV SUBOPT1MAL1TY
1. 1. UppeJL bound.6 t:md toweJL bound.6 6oll vf and v*
n
1. 'l.. The .e.ubopt:..i.ma,Uty 06 MaJckov poUc.lu and .e.ubopthna,f. aet.l..on4
7. 3. Some lle.tMldu on Q.ln.lte 4.tate .6pa.c.e MaJckcv ded.6.i..on pJtoce.6.6e.6
8. N-STAGE CONTRACTION
95
95
102
106
111
8.1. ConveJLgence undeJL :the .6tlwt.g.thened N-.6.tage c.on:tJux.cti.on
111
9. SOME EXPLANATORY EXAMPLES
9. 1. Some.
example.6 06 i.pe.elQ..lc. M€Vlk.ov dew-i..on pMc.e.uu
9. 2. The. i.olu:tlon 06 4!{4.t.e.m6 06 UneM e.quatlon.6
9. 3. An -i..nventoJc.y p1t.oblem
111
117
119
123
REFERENCES
SUBJECT ZNVEX
135
SELECTEV LIST OF SYMBOLS
137
SA.MEN VATTING
139
CURRICULUM VITAE
143
CHAPTER 1
INTROVUCTION
In the last three decades much attention has been given to Markov decision
processes. Markov decision processes w!;)re first introduced by Bellman [ 2]
in 1957, and constitute a special class of dynamic programming problems. In
1960 How?-rd [ 35] published his book "Dynamic programming and Markov process-
es". This publication gave an important impulse to the investigation of
Markov decision processes.
'
We will first give an outline of the decision processes to be investigated.
s.
= 0,1,2, ••• ,
Consider a system with a countable state space
The system can be con-
trolled at discrete points in time t
by a decision maker.,If
at time t the system is observed to be in state i € s, the decision maker
can select an action from a nonempty set A. This set is independent of the
state i €Sand of the time instant t. If he selects the action a € A in state
i €
s, at time t , the system's state at time t + 1 will be
j € S with proba-
bility pa{i,j) again independent of t. He then earns an immediate {expected)
reward r{i,a).
Usually, the problem is to choose the actions "as good as possible" with respect to a given optimality criterion.
We will use the (expeated) totaZ r>ewaPd ar>iterion.
So the prOblem is:
1)
to determine a recipe {decision rule) according to which actions should
be taken such that the {expected) total reward over an infinite time horizon is maximal1
2) to determine the total reward that may be expected if we act according
to that decision rule.
In first instance the following solution techniques for Markov decision
processes with a finite state space and a finite action space were available: the policy iteration algorithm developed by Howard [35], linear programming [ 11], [ 13], apprpximation by standard dynamic programming [ 35].
2
A disadvantage {especially for large scaled problems) of the former two methods is that each iteration step requires a relatively large amount of computation.
Furthermore, the convergence of the method of standard dynamic programming
is very slow.
Hence, the construction of numerical methods for determining optimal solutions has been the subject of much research in this area. Moreover,
muc~
attention has been paid to the generalization of the model as described by
~ellman
and Howard. Both subjects will be studied in this monograph.
MacQueen [ 46) introduced an improved version of the standard dynamic programming
algo~ithm
by constructing, in each iteration step, improved upper
and lower bounds for the optimal return vector. His approach yields a rather ,fast algorithm for solving finite state space finite action space Markov
decision processes.
Modifications of the afore mentioned optimization procedures have been given
J:)y e.g. Hastinqs [25], (26] who proposed a Gauss-Seidel-like technique and
Reetz [60] who based his optimization procedure on an overrelaxation idea.
Modifications have also been given by Porteus (59], Wessels [74], van Nunen
,and Wessels [55], and van Nunen [53], [54].
By and by several extensions of the original model were presented. The finite state space and finite action space restriction was dropped, For
example Maitra [48] and Derman [14] studied Markov decision processes with
a countable state space and finite action space, whereas Blackwell [ S] and
Denardo [12] already investigated Markov decision processes with a general
state
~d
action space.
The restriction of equidistant decision point was dropped as well; see
Jewell (37], [38].
As a remaining restriction, however, a bounded reward structure is assumed
in the above articles.
This restriction has been released recently, see Lippman [44], [45], Harrison
,[24], weasels [75], and Hinderer [31].
In this monograph we will investigate Markov decision processes on a countably infinite or finite st.ate space and with a general action space. Furthermore, we allow for an unbounded reward st:ructu:re. Ne do not require the
transition probabilities to be strictly defective, with respect to the usual
3
supremum noxm0e assume the existence of a function b:
ve function µ:
s
+
:m+
:= {x
e :m
I
s
+JR and a positi-
x > O} such that
VieS VaeA(i) lrCi,a) - b(iJI < µ(i) ,
the function µ will bi:i used to construct a weighting function, or a bounding function. Moreover, we assume
3
• V
V
O<p<1 iE:S aeA(i) .
ls
pa(i,j)µ (j) S pµ (i) •
)€
In order to guarantee the existence of the total expected reward we assume
the function b on S to be a charge with respect to the transition probability structure (see section 4.1).
These assumptions on the reward and transition probability structure arise
in fact by a combination and a slight extension of the conditions as pro/
posed by Wessels [75] and Harrison [24]. As will be shown in the final
chapter the assumptions allow e.g. the investigation of a large class of
discounted Markov and Semi-Markov decision Processes. Lippman's assumptions
[45] are covered as well, see van Nunen and Wessels [56].
We will develop a set of optimization procedures for solving Markov decision problems, satisfying the described conditions, with respect to the
\
total reward criterion. This will be done by using the concept of stopping
time (see also Wessels [74]), which results in a unifying approach. This
set of methods includes the procedures for finite state, finite action
space Markov decision processes as proposed by Howard [35], Reetz [60],
Hastings [25] MacQueen [46], A main role in our approach will.be played by
the theory of monotone contraction mappings defined on a complete metric
space of functions on
s.
This space will be denoted by
V,
The concept of stopping time will be used to define a set ·of contraction
mappings on
V. Given a decision rule and given the starting state i e S we
may define the stochastic process {st
I
t
= 0,1, ••• }
where st denotes the
state of the system at time t. Roughly speaking a stopping time is a recipe
for terminating the stochastic process {st
I
t
2:
O}. For each stopping time
(denoted by 6) we define the mapping U of V by defining (U 0v) (i) as the
0
supremum over all decision rules of the expected total reward until the
process is stopped according to the stopping time
o,
given that the process
starts in state i e S, while, in addition, a terminal reward v(j) is earned
4
if the process is stopped in state j e
s. u
6
tractive mapping on the complete metric space
is proved to be a monotone con-
V.
For stoppinq times that are nonzel"o (see section 2.2) u6 will be strictly
contracting, its fixed point beinq equal to the requested optimal expected
'
total reward over an infinite time horizon (denoted by V* ) • Bence, the
fixed point is independent of the chosen nonzero stopping time. This implies
that for each nonzero stopping time
0
0
.
o, v*
b ya sequence vn :• U vn-i ,starting
0
f
may be approximated successively
6
rom any v0 e V. So for each nonzero
stopping time 6 we have
v* •
v0 +
n
These results may be formulated alternatively as follows: for each nonzero
stopping time
o, v*
is the unique solution of the optimality equation
The class of described methods may be extended.
For a special class of stoppinq times, which we called transition memoryless stopping times (see section 2.2), the mappings U produce the oppor0
tunity of determining in each iteration step a decision rule of a special
type for which the supremum by applying u is attained or approximately
0
attained (see chapter 5). Such a decision rule will be called a stationary
f
~
Markov strategy (denoted by f ) • We define the mapping L0 of V in a similar
way as we have defined u 6 , with the difference that the expected reward by
applying the stationary Markov strategy f
~
is computed instead of the sup-
remum over all decision rules.
For transition memoryless nonzero stopping times we define for each
(A) .
= {1,~, ••• }
a mapping u 6
of V.
If the supremum by computing u v is attained for a stationary Markov stra0
tegy (f ) then
;>. e:N
~
(;\)
U0
with ;>. e :N ,
v
uJ""> is defined by
:= lim U
n-
(n)
0
v •
5
If this supremum Ls not attained
uJ">
is defined by using a Markov strategy
for which the supremum is approximated (see chapter 6).
uJ">
is neither
necessarily contracting' nor monotone. However, the monotone contraction pro-
Li
perty of the mappings U0 and
enables the use of uJ"> as a base for successive approximation methods (see chapter 6).
For each transition memoryless nonzero stopping time
o>.
a sequence vn
o and
each A
E: liJ
u {co}
defined by
f
.. is chosen such that L nv!: approximates u v::
converges to V* • Here fn
0
1
0
1
sufficiently well. So each pair (o,>.) yields a successive approximation of
v*. Moreover, the stationary Markov strategy found in the n-th
such a procedure becomes (£-)optimal
o>.
cS>.
The vectors vn-l and vn
i~eration
of
for n sufficiently large.
enable us to construct upper and lower bounds for
the optimal return V* • In addition, the availability of upper and lower
bounds allows an incorporation of a suboptimality test. The use of upp,er
and lower bounds and a suboptimality test may yield a considerable gain in
computation time, see section 7.3.
We conclude this introduction with a short overview of the contents of the
subsequent chapters.
In the firs't three sections of chapter 2 some basic notions required in the
sequel are presented. After the introduction of some notations (section
2.1) we discuss in sections 2.2 and 2.3 the concepts of stopping time and
weighted supremum norms respectively. The final section of chapter 2 is
devoted to some properties of weighted supremum norms.
In chapter 3 we treat Markov reward processes. (stochastic processes without the possibility of making decisions). In section 3.1 the Markov reward
model is defined. Reward functions may be unbounded under our assumptions.
In section 3.2 the concept of stopping time is used to define the contraction mappings on the complete metric space V (introduced in section 2,3).
A discussion of the assumptions is the topic of the final section 3. 3.
6
The study of Markov decision processes starts in chapter 4. After a description of the model (section 4.1), section 4,2 contains the introduction of
decision rules and assilmptions, These assumptions will be a natural extens~on
of those in chapter 3. under our assumptions some results about Markov
decision
~rocesses
will be proved (section 4,3), The final section (4.4) is
again devoted to a discussion of the assumption5,
:In
chapter 5 the c0ncept of stopping time is used to generate a whole set
Of optimization procedures based on the mappings U , For each decision rule
0
~, not necessarily a stationary Markov strategy, and each stopping time o
a contractive mapping L; of
V will
be defined and investigated (section
5,1), Next, (section 5,2) the operator U0 will be studied. Finally, we will
present necessary and sufficient conditions for the stopping times under
which we can restrict the attention to stationary Markov strategies only
(transition memoryless stopping times).
In chapter.6 we investigate value oriented successive approximations based
on the mappings ur~.). The term "value oriented" is used since in each iteration step extra effort is given to obtain better estimates for the total
expected reward corresponding to the
co
stationary Markov strategy fn.
Chapter 7 will be used to construct upper and lower bounds for the optimal
reward
v*.
In this chapter also a suboptimality test will be introduced, In
the third section of this chapter we show how our theory may be used in
the special case of a Markov decision process with a finite state space and
a finite action space,
We
indicate the relation with the existing optimiza-
tion procedures.
In the brief chapter 8 we weaken the assumptions as imposed in chapter 4,
This weakened version corresponds to the N-stage contraction assumption
introduced by Denardo [12]. It will be proved that N-stage contraction with
respect to a given bounding function implies the existence of a new bounding function satisfying the assumptions of chapter 4.
\'
7
We
conclude this monoqraph with a chapter in which we show that/a number of
specific Markov decision processes is covered by our theory. We will also
show how a number of the existing approximation.methods for certain systems
of linear equations are included in our treatment of Markov reward processes (chapter 3). The final part of this chapter consists of an example. In
this example we treat an inventory problem.
9
CHAPTER 2
PRE L1 MINARI ES
,The goal of this chapter will be the introduction of some of the noti6ns
which play an important role throughout this monograph.
First (section 2,1) we will give some notations and we will introduce the
measurable spaces relevant for the stochastic processes that will be investigated.
Next (section 2.2) stopping times are introduced. we will allow for randomized stopping times. Several specific stopping times will be described.
In section 2.3 a bounding function µ is introduced. The function µ will be
used to define a weighted supremum norm. In the following chapters this
bounding function will appear to be one of the tools for handling Markov
decision processes with an unbounded reward structure and with a transition
prci>ability structure that needs not to be contractive with respect to the
usual supremum norm.
Using the bounding function µ a Banach space W and a complete metric space
V are defined.
Finally (section 2,4), we discuss some properties of bounding functions.
2.1. No:ta:tlom
As mentioned in the introductory chapter we study a system which is ob-
served to be in one of the states from a state spaae S at times t
we assume S to be countably infinite or
finit~,
=O, 1; •••
and represent the states
by the integers, starting with zero. So if the state space is finite, S is
represented by {0,1, •• .,N}, where N+l is the number of states. Ifs is
countably infinite it is represented by {0,1,2, ••• }. A path is a sequence
of states that are subsequently visited.
NOl"ATIONS 2,1.1.
(i)
.
sk := s x s x
s
... x s,.. the k-fold Cartesian product of s, sos
1
=s.
s x ••• J s is the set of all paths.
.
s k , with
k ~ n, k,n e :N then a (n) denotes the row vector of
:= s x
{ii)
Let a
(iii)
ka is the number of components of a. So ka is n if and only if
a E Sn.
E
the first n components of a.
10
(iv)
The i-th component of a e: Sn, n ;;:: i is denoted by (a] i-l •
(v)
Hence a
(vi}
Y := (a,B> := ([a]o,[a]1, ... ,[a]k -l'(BJo, ... ,[a]k -1>, k.y•ka +
"'
k a
B
where a,B e: G with G := u s •
(vii)
The term (column) veator is used hereafter for real valued func.-
E
n
.
00
tions on
.
s may be written as a= ((aJ 0 ,(aJ 1 , ••• ,(a]k _ >.
.
a 1
co
ke,
k=l
s.
(viii) The term matl'ix is used hereafter for real valued functions on
s2 •
(ix)
The (1,j)-th element of a matrix A will be denoted by a(i,j).
(x)
A is the identity matrix (with diagonal entries equal to. one and
0
other entries equal to zero),
(Xi}
Matrix multiplication and matrix-vector multiplication are defined
as usual (in all cases there will be absolute convergence}.
·(Xii)
An is then-fold matrix produ~t Ax A x ••• x A, the (i,j)-th entry
of An is denoted by a(n)(i,j).
(xiii) Let. v,w be vectors, then vs w if and only if v(i) s w(i) for all.
i
€
Si
v < w if and only if v s w and for at least one i ;;: S
v(i) < w(i).
S0 be the a-field of all subsets of s, then the measurable space
Let
m ~F > is defined to be the product space, with o0 = s"" and F0 is the a0 0
algebra on '2 generated by the finite products of the a-field S •
0
0
In order to be able to use the concept of stopping time in an adequate way
we extend the measurable space ('2 ,F > to the measurable space ('2,F). The
0 0
space (il,f) will play a main role in the sequel. Let the set E := {0,1} and
let
S be the a-field of all subsets of s x E then the measurable space
(G, f) is defined to be the product space with '2
:=
(S x E)
00
and F is the
a-algebra on G generated by the finite products of the a-field
So
o0
contains all sequences of the form
whereas G· contains - all sequences of the form
S.
11
REMARK 2.1.1. The state 0 will play a special role throughout this monograph. We do not exclude Markov decision processes with defective transition probabilities, Therefore, the state 0 will be used as a (fictive) ab'
' '
sorbing state. This will give us the possibility of defining (based on the
transition probabilities) adequate probability measures on
0
cn~,F )
W,F>.
and
2.2. S.topp.in.g :tirne6
We now are ready to introduce the notion of stopping time.
DEFINITION 2.2.1. A function O: G.., .... [0,1] is called a (Pandomized) stopp-
ing time if and only if o satisfies the following properties
1] ,
(2. 2.1)
(2.2.2)
V
[
a ,Se:G..,
(o (a)
0) A ([13]k -l
a
r O)].,.. [o({a,13))
O] •
DEFINITION 2.2.2. The set of all {randomized) stopping times is 6,
REMARK
(i)
2.2, 1.
Roughly speaking for each
a
e sk, we will use 1 - 6 (a}
lity that a stochastic process on
as
the pr,obatii-
s is stopped at time k -
1 in state
[a.]k-l given that the states [aJ 0 ,[aJ 1 , ••• ,[a]k-l have been visited
successively.
(ii) From now on
we
will use the less formal notation o(a,B), o(i) instead
of o((a,13)) and o({i)) respectively.
DEFINITION 2.2.3 •
. (i)
o e ti. is said
to
be a norwandomized stopping time if and only if
Vae:G.., o(a) e {0,1} •
{ii)
o e A is said to be a memo:riytess stopping time i f and only i f
12
(iii)
oe
t:.. is said to be an ent;riy time i f and ~ly i f
o is
nonrandomized
and memoryless.
DEFINITION 2.2.4. A nonempty subset G c ('\., is said to be a goahead eet if
and only if
(i)
{ii)
V
[ (a ,S) e G .. a e G]
a,SeG.,.
(0) € G
(iii)
NOTATIONS 2.2.1.
(i)
Gn is the goahead set of those sequences of
~
nents [a]. are zero for i
l.
c;..
for which the compo-
n, if there are any. So
n
Gn := { U sk} u {(a,S)
I
Se
k=l
..
u
k=l
{ii) For a goahead set G we define G(i) by
G(i) := {a e G
I
[aJ
0
= i},
i
e
s •
LEMMA 2.2.1. There is a one to one correspondence between nonrandomized
stopping times and goahead sets.
PllOOF. The characteristic function of G is a nonrandomized stopping time. D
DEFINITION 2.2.s. o e t:.. is said to be a nonzezio stopping time if and only
if
13
REMARK 2.2.2.
(i)
A nonrandomized stopping time is nonzero if and only if
V iES
o (i)
1 •
(ii) The only nonzero stopping time
o
8, which is an entry time, has the
E
following property
V
G
at: co
o (a) = 1
•
DEFINITION 2. 2. 6. A goahead set is said to be nonzero if and only if S c G.
DEFINITION 2.2.7. Let
o0 ,o 1
EA then
o0 $of
if and only if
LEMMA 2.2.2. Let Q be an index set and suppose for each q E Q,
oq
E A, is a
nonrandanized stopping time,
Let
o-,o+
be defined by
0- (a) :=inf 0 (a),
q£Q
respectively, then
o-
o+ (a)
i= sup 0 (a)
q
and
qt:Q
o+ are
q
elements of 8.
Consequently, if for q E': Q, G corresponds to the stopping time oq then
+
q
and o correspond to the goahead sets n G , n G respectively.
qE':Q
q
qEQ
q
D
PROOF. The proof follows by inspection.
DEFINITION 2.2.8. The nonrandomized
stopping fwiation
defined by
(ii)
-r (w) = .. •
(dt = O for all t
o-
E':
.r; +) •
T: '1-+- r;+
u {00 } is
\
'
14
~DEFINITION,
2.2.9. A stopping time
o is
and only if there exists a subset Tl E
said to be t:ronsition memoeyZess if
s 2 such that
DEFINITION 2.2.10. A goahead set is said to be transition memo:cyless if and
only if the corresponding nonrandomized stopping time is transition memoryless.
LEMMA 2.2.3. Memoryless stopping times are transition memoryless.
We will now give some simple
examples of nonzero stopping times. The exam-
ples 2.2.1-2.2.4 are nonrandomized stopping times, so they can be expressed
in
terms
of goahead sets. Example 2. 2. 5 is a simple illustration of a ran-
domized stopping time. The examples 2.2.2-2.2.5 give transition memoryless
stopping times, see also Wessels [74], and van Nunen and Wessels [55].
EXAMPLE 2.2.1. G := G or in terms of stopping times V
n
o(a) = o.
llEG
o (a)
1 else
n
EXAMPLE 2.2.2. The goahead set G is defined by
8
co
:=
u
k=l
{O}k, G (1) := {(i,a)
8
Ia
e
EXAMPLE 2.2.3.
co
G :=
u
{a E Sn
I
[a] j
E
Bi j < n} u S
n=2
with B c s such that O
E
B.
EXAMPLE 2. 2. 4. The goahead set GR is defined by
"'
i-1
u G (j)}, for i '/' 0 .
8
j=O
15
GR(i) :={a
I a f:
..
u {i}k} u { (a,13)
k=l
I a f:
.
u {i}k,
k==l
a f:GR(O)},
EXAMPLE 2.2.S. 6 is qiven by Vif:S\{O} 6(i) =~else 6(a)
i e: s\{O} •
1.
DEFINITION 2.3.1. A real valued function µ on S is said to be a bounding
funation if and only
if
(il
µ(i) > 0 1 for all i
{ii)
µ{0)
=0
£
S\{O} •
•
DEFINITION 2.3.2. Let µ be a bounding function, then Wµ is the set of vectors such that
(2.3.1)
REMARK 2.3.1. Note that w{O)
0 for each w
£
~µ•
DEFINITION 2. 3. 3. Let µ be a bounding function. Then, for each
W
f: Wµ
1
the
µ-norm of w is defined by·
LEMMA 2. 3.1. The space W with this µ-norm (weighted supremum norm) is a
µ
Banach space.
PROOF. The proof is straightforward.
0
DEFINITION 2.3.4. Let the matrix A be a bounded linear operator in Wµ• The
no%lll of A is defined by
16
REMARK 2.3.2.
(i)
It is easily verified that
II All •
µ
l
sup
µ -l (i)
la(i,j)lµ(j) ,
iES\{O}
jES
and if this supremum is finite then A is a bounded linear operator.
(ii)
'I'he concept of bounding function is studied in more detail in sec-
tions 2.4, 3.3, 4.4, and
s.1.
(iii) We refer to Wessels (75], who introduced the concept of weighted supremum no:rms in this context and to Hinderer [ 31] 1 who used Wessels'
idea of weighted supremum norms for defining bounding functions.
DEFINITION 2.3.5. Let b be a vector with b(O)
=0
1
and let p E [0,1). The
set of vectors V
is defined by
µ,b,p
Vµ,b,p := { V
I
(V -
(1
- p)
-1
b) E Ill } •
µ
REMARK 2.3.3. Note that also v{O) = 0 for v
v1 ,v2
E
E
V and v -v
2
1
E
Wµ for
V.
DEFINITION 2.3.6. The metr>ie dµ on Vµ,b,p is defined by
LEMMA 2.3.2. A set
Vµ, b ,p
with the metric d µ is a complete metric space.
unless explicitely mentioned we fixed µ, b, and p for the remaining part of
this monograph. Referring to these fixed µ, b and p we will omit the subscripts µ, b, p.
2.4. Some
~
on bounding 6unction.6
In this section we give some properties of a bounding function µ' with respect to the corresponding spaces W , and V , •
µ
µ
17
LEMMA 2.4.1. Suppose S contains a finite number of elements. Let µ 0 ):)e a
bounding function and wn
€
W
µo
~ 0)
(n
[II w - w II
+ O] ,.[II w - w II
n
µ0
n
µ
then
, .. o, for all bounding function µ 'J •
PROOF. For a proof we refer to books on numerical mathematics, see e.g.
D
Collatz [ 8 ] , Krasnosel' skii [ 42].
LEMMA 2.4.2. Suppose s contains a finite number of elements and µ 0 is a
bounding function, then
where
B
is a matrix.
PR:>OF. The proof of (i) follows directly from the fact that
[II B II
< 1] • lim Bn
µo
n-+co
0 ,
where O is the matrix with all entries zero. The proof of (ii) can be found
in e.g. Krasnosel'skii [42].
LEMMA 2.4.3. Suppose µ 0 is a bounding function and B is a matrix with finite µo-norm, then
< 1] .. 3
µ
I
[II B II
µ
I
< 1] •
PROOF. For a proof we refer to van Hee and Wessels [29], who proved this
theorem for (sub-) Markov matrices, but their proof can easily be extended
to this lemma.
IJ
18
REMARK 2.4.1.
(i)
Note that in lemma 2. 4. 3
S is not niquired to be finite.
(ii)
If S is co~tably infinite the linear space Cll may contain elements
that are not bounded. If µ (i) -+- .. for i + .., then it is also pe:cmitted
for lw(i) I + ""•
(iii) Note that it is not requested that the µ-no:r:m of b exists. If the JJno:r:m of b exists then clearly
W = V.
19
CHAPTER 3
MARKOV RBllARV PROCESSES
In this chapter we restrict ourselves to Markov reward processes. So we exhibit our method for the first time in a relatively simple situation.
After defining the model {section 3.1) we introduce the assumptions on the
reward structure and the transition probability structure. As mentioned we
require neither the reward structure to be bounded nor the transition probabilities to be strictly defective.
Next {section 3.2) we show that each nonzero stopping time
contraction mapping
(L
) on the
complet~
metric space
oe
~
defines a
V. The fixed point
0
appears to be independent of the stopping time. It equals the total ex-
pected reward over an infinite time horizon. In the final section (3.3) we
discuss the assumptions on the reward structure and the transition.probabilities in relation to the bounding function µ and the function b.
The Ma!rkov Jr.eWaJtd model
3.1.
We consider a system that is observed to be in one of the states of S at
discrete points in time t
i
€
= 0,1, .•.•
If the system's state at time tis
s, the system's state at time t+l will
be j €
s with probability
p(i,j), independent of the time instant t.
ASSUMPTION 3.1.1.
(i)
V.
(ii)
v. s
.
l.1)€
l.e
s o
l
jeS
:s; p Ci, j J :s; 1
p{i,j)
E
1
p(0,0) - l •
(iii)
For each i es the unique probability measureP1 on {00 ,F0 J is defined in
the standard way, see' e.g. Neveu [52], Bauer [ 1] by defining the probabiliti,:es of cylindrical sets.
(3.1.1)
where n
€
Z+ and o. j is the Kronecker symbol
l.,
20
if i .. j
else •
Given the starting state i
€
S and the matrix P (with entries p(i,jJJ we
I
coo.sider the stochastic process { s
n ;i: 0}, where s
cw J • [w J • So
0 ,n
0 ,n 0
0 n
s ,n is the state of the process at time n. The stochastic process
0
{s
n ~ O} is a Markov chain with stati6nary transition probabilities.
0 ,n
See e.g. Ross [62], Feller [17], Karlin (39], Cox and Miller [ 9], Kemeny
I
and Snell (40],
For each stoppinq time 6
€
/1 and each startinq state i € S we define in a
similar way the unique probability measure Pi,o on CQ,F) by qiving for
n
€
z+
with tk
E
s
and
~ E
E.
This defines for each i
€
s and 6
E
/1 a stochastic process { (sn ,en)
I n ~ 0},
where sn(w) :•in' en(w) :• dn.
So sn 1' en are the state and the value of en at time n.
REMARK
3.1.1. The stochastic process {(sn,en)
I
n ~ O} is not a Markov
chain since the value o{a) may depend on the complete history
([aJ ,[aJ , ••• ,[a]k _ > for each a€ G=.
1
0
1
a
Formula {3.1.2) shows the connection between Pi and P i,o •
For each w €
For
Ao
E
n,
w = ((io•da),(i 1 ,d J •• ) we define w0 by w0 := Ci 0 ,1 1 ,1 2 , .. J.
1
F0 we define the set A € f by
It is easily verified that for A € F and each o
0
0
E
/1 we have for i
E
S
21
Let f be a measurable function on ('2 ,F ). The function f on ('2,F) is then
0
0 0
defined such that
It follows from formula (3.1.3) that
where Ei f , Ei,of denote the expectation of f 0 and f with respect to the
0
prd:>abili ty measures Pi and Pi, 0 respectively.
In the sequel we omit the subscript 0 in f • The process {sO,n
will thus be denoted by {s
n
I
n ~ O}.
0
I
n ~ O}
NOTATION 3.1.1. ByEf, E 0 f we denote the vector with components Eif, Ei,of
respectively.
We now state the assumptions on the reward structure and the tr.ansition
probabilities of the system considered in this section. Therefore, we first
introduce the reward function. At each point in time a reward is earned. We
assume this reward to depend on the actual state of the system only. So the
reward function r is a vector.
ASSUMPTION 3.1.2.
(r - b) E (JJ •
ASSUMPTION 3.1.3.
ASSUMPTION 3.1.4.
II P II < 1 •
ASSUMPTION 3.1.5,
(Pb - Pb)
E
W•
22
REMARK 3.1. 2.
(i)
Since b(O}
O
and µ
(0)
0, asswnption 3.1.2 implies that also
r(O) = O.
(ii)
I W,
If b
then r
I W.
(iii) In terms of potential theory (see e.g. Hordijk [33]) the second asswnption states that b is a charge with respect to P, which implies
the existence of the total expected reward over an infinite time horizon.
(iv)
Assumption 3.1.4 means that the transition probabilities are such
that the expectation of µ (~ ) , with respect to Pi is at most llP H• U (i) •
1
This implies that the process has a tendency to decrease its J.!-value.
(v)
The final asswnption states that, given the startinq state i
0
€
s,
the difference between the expected one-stage return and Pr (i 0 ) lies
between -MJJ (i Q ) and
(vi)
Note that if b €
'(a) r
€
Mµ
(i 0 ) for some
M € lR+
and all i
W the assumptions 3.1.2-3.1.5 may
0
€
s.
be replaced by
W,
(b) l!Plf < 1.
LEMMA 3.1.1.
(Pr - pb)
E:
W•
P:ROOF. II Pr - pb II !> II Pb - pb II + 211 r - b I!=: M, which is finite according to
0
the assumptions 3.1.2-3.1.5.
LEMMA 3. 1. 2. For M as defined in the proof of lemma 3 .1.1, we have
1
n:;;;l,2, ••. ,
with p 0 := max{llPl!,p}.
PROOF. The proof proceeds inductively. The statement is true for n = 1.
Suppose it is true for arl:>itrary n
~
1. Using the assumptions 3.1.2-3.1.5
we then have
0
23
LEMMA 3.1. 3.
"ES
lim (Pl\.) (i)
lini (Pnr) (i)
l.
n+'»,
n+«>
V
=O
•
PROOF. The proof is a direct consequence.of assumption 3.1.3 and the fore-
D
going lemma.
For each n
~
1
we define the vector Vn by
n-1
(3. 1. 4)
v
:=E
n
l
k=O
r(sk) •
LEMMA 3.1.4.
vn =
PROOF. The proof follows by inspection.
0
Clearly Vn(i) represents the total expected reward over n time periods when
the initial state is 1
THEOREM
E
s.
3.1.1.
PROOF. The convergence of
l
Pnr follows from assumption 3.1.3 and 3.1.2,
n•O
since,
l
pl\, +
n=O
We now have by lemma 3.1.2,
"'
l
n=O
Pn(r - b) •
24
DEFINITION 3.1.1. The total eli;pected reward vector V is defined by
3. 2 •
CcntJux.c:tlon mappbtgt. and .&:toppbtg :ti.mu
µ:MMA 3,2.1. Let v e: V then
(1)
Pnlvl exists for all n e: £+,
= O,
(ii) lim (Pnjvl> (i)
n.._
i e: s.
PROOF. v e: V implies that v can be written as v
.l
So Pnlvl
s
..l
= (1-p)-lb
+ w where we fJJ.
(1-p)-!Pnlbl + Pnlwl which is defined. Moreover, since
Pnlbl and
n=O
0
Pnlwl exist we find part (ii) of the le11111a.
n=O
DEFINITION 3.2.1. The mappinq L
1
of V is defined by
LEMMA 3.2.2.
(i)
'L
(ii)
L
1
1
maps V into V.
is a monotone mappinq.
(iii) The set {v
E:
V
I llv-
1
2
(1-p)- bll s M (1-p 0 ) - } is mapped in itself
1
by Ll.
(iv)
(v)
is strictly contractinq with contraction radius If P II.
1
The unique fixed point of L is v.
L
1
PROOF.
(i)
L v = r + Pv = r + P ( (1 - p)
1
II L 1v - ( 1 -
p)
-1
b If
= II r
-1
b + w) , with w
+ {1 -
p)
-1
e fJJ. so
Pb + Pw - ( 1 - p)
-1
b II
s II b + ( 1 - p) - lPb - U - pl - lb II +II Pw II +II r - b II
s II (1-p)-l(Pb-Pb) ll+llPll•llwll+llr-bll
•
1
(1-p)- 11Pb-pbll+llPll•llwll+llr-bll
<"'.
25
The proof of part (ii) is trivial.
(iii) Let v € {v € V II v- (1-p) -1bll S: M (l-P 0 ) -2} then
1
.
I
.
llL v-(1-p)
-1
1
bll=llr+Pv-(1-p)
s
(iv)
Let v ,v
1
v
2
2
€
.. (1-p)
llP(v-(1-p)
-1
b)
-1
bllS:
11+11
(1-p)
-1
Pb-. (1-p)
-1
b
.
+ rll
-1
V then v ,v2 can be given by v 1 = (1 -P) b + w1 and
1
.
b + w respectively where w ,w € W, So v -v •w -w ,
1
1
2
2
1 2
2
-1
thus
By choosing v
1
and v
such that w • µ and w = 0 the equality is at2
1
2
tained.
The last part of the lemma follows directly from
L V = r + P(
1
l
Pnr)
n=O
=r
+
l
n=l
where the interchange of summation is justified since
..l
I
Pn rl < co.O
n=O
Now
we return to the concept of stopping time. Note that the stopping func-
tion T on
n is a random variable. so we can define the random variable s,
by
if
T
= n
if
'l"
= ...
(3.2.1)
Given the starting state i
€
s
1
and the transition
probabilit~es
bution of T is uniquely determined by the choice of
o€
A.
the distri-
26
LEMMA 3.2.3. Let o
ll be a n0nzero stopping time, then
€
PROOF. The proof follows directly from the definition of ' and the defini-
0
tion of nonzero stopping time.
DEFINITION 3.2.2. Leto e /:;;, the mapping L
0 of V is defined component-wise
by
(L v) (i) :.,;Ei [
0
0
I
•-1
l
r(sk) + v(s,)J,
i
s .
€
k=O
l'U!:MARK 3. 2. 1. Note that as a consequence of the definitions of
CL0 v) (0) =
o
for all v
EXAMPLE 3.2.1. Let o
€
€
0
v) (i)
EXAMPLE 3.2.2. Let
=
V,
V.
A be the nonrandanized stopping time that corres-
ponds to the goahead set Gi and let v
(L
o and
r(i) +
l
€
V then
p(i,j)v(j) •
jE:S
oe
ll be the stopping time that corresponds to the go-
ahead set GH and v e V then
0
EXAMPLE 3.2.3. Let
l
= r(i) +
(L v) (i)
jq
oe
p(i,j) (L v) (j) +
0
l
p(i,j)v(j) •
~i
fl be the stopping time that corresponds to the go-
ahead set GR then
(L v) (il = (1-p(i,i))
0
DEFINITION 3.2.3. The matrix P
0
-1
r(i) + (1-p(i,i))
l
pi o(Sn
'
l.
p(i,j)v(j), i
,&o.
is defined to be the matrix with (i,j)-th
.,.
n=O
~
jf'i
element (p (i,j)) equal to
0
po(i,j) :=
-1
= j,
L
= n) •
27
LEMMA 3.2.4. Let o
P
0
:=llP
E
0
11
stoppinq~'time.
fl be a nonzero
then
S {1-inf o(i)) +(inf o(i))llPll < 1 .
ieS
iES
PROOF. First note that II p 6 II is finite, since up0 II
s
r
II pn ll < .. (assump-
n=O
tion 3.1.4).
For ~ E fl we define the stopping time oM e fl by
00
a.
if
oM(a) ==
u
<!
(S\{O}) k
k=M+1
{:Cal
else •
Now,
since
co
I11 P6
II - II P0 11 I s
M
l
II Pn II
k=M
So it suffices to prove the lemma for stoppinq times oM. This will be done
by induction with respect to M.
Let lln c fl be the set of stopping times with
fl
So
'1J
n
:=
{o
e fl
I o<a.>
0 for all a.
..u
e
(S\{O})k} •
k=n+l
only contains the stopping times with o (i) = 0 for all i
6 e ll 0 we have II P 0 II = 1.
Suppose 6 E ll
1
is a nonzero stopping time, then
c1-o(i))µCil + o(i)
I
p(i,j)J.1(j)
j<!S
s [0 -0<1» + 0CilllPllJµc1> •
E
s\{O}. For
28
o€
is suppos~d to be a nonzero stopping time, there exists a mm1
ber e: > 0 such that o (i) > e:, for all i e S which implies II P0 II = P 0 < 1.
Since
1::.
Now we state the induction hypothesis: Suppose for arbitrary n
II P0 II
Let
o
E
~
verified that o i
E
An.
Now for each i
s
we have
j€S
E
p
1
1 on lln and II P0 II < 1 if o e An is nonzero •
t..n+ 1 , we define oi (a)
l
~
0
(i,j)µ(j) =
:=
l
o (i,a) for i
l
l
Pi
[pi
j£S
'
0 (sm = j, T = m)µ(j)
,o <so = j,
n+1
+
=
l
m=1
is easily
n+1
jES m=O
=
s and a e G..• It
E
Pio (s
,
m
o-
'
= j,'
=O) +
= m)]µ(j)
n+1
I I I
0Cil>µc1> + oCi>
pCi,kl •
jES m=1 kES
(1 -
+
o(i)
0 (i)) µ
l
kES
s
(i)
+
n
l I
p(i,k)
jES
m=O
c1 - 0Ci))J.1Cil + o(i)
pk 0 (sm=j, T=m)J.l(j)
' i
I
p(i,klJ.1Ck>
k€S
s
So if
oe
~+l
c1-0Cil>J.1Ci> +
oCilllPll)µ(il.
is a nonzero stopping time then
This completes the proof.
D
29
LEMMA 3.2.5, Leto € /1 then L V,. V.
0
PROOF. We first mention a property of Markov chains
1fP (sn,. j) > O, k,n
1
E~.
s
consider for each i €
T.-1
(L0 V) (i) = Ei
'
0[
l
r(sk) + V(s,)J
k•O
T-1
oo
l
l
l
,. Ei 0 [
r(sk)] +
Pi 0 (s =j, T =n)V(j)
' k=O
n=O j€S
'
n
T-1
oo
l
l
oo
l
•E1 &(
r(sk)] +
P. 0 (s =j,
' k=O
n=O jES i ,
n
T-1
=E. 0[
i'
l
l
r(sk)]+
k=O
l
T
•n)
l
Ejr(sk)
k"'O
oo
l
n=O jES k=O
Pi 0 (sn=j, T=n)Ejr(sk)
1
T-1
[l
=Ei 0
r(sk>J+
' k=O
00
l
+
..
l
l
n=O j€S k=O
Pi 0 (sn=j, T=n)Ei(r(sn+k) lsn = j]
'
1'-1
= Ei 0[
r(sk)
'k=O
l
..
J+
""
Ei 0 r(sT kl =Ei 0
r(sk) =V(i)
k=O'
+
'k=O
l
l
where the interchange of SUlllll!ations is justified by the fact that
l
Eilr<sn>
n=O
LEMMA 3.2.6. Let
o
€
I
<"" •
6 then
L maps V into V.
0
(ii)
L is a monotone mapping.
0
(iii) L is strictly contracting if and only if o is nonzero.
0
{iv) The contraction radius of L equals P = II P II.
0
0
0
(i)
0
30
1
llv- (1-p)- bll S (l-p 0
VI
The set {v e:
(v)
self, where M'
2M1
:= ~~1 -p
)-
2 1
M } is mapped by L
0
in it-
and M is defined as in the proof of lelllllla
1
0
3.1.1 by Ml :=Ii Pb - Pbll +
21r - bll.
PROOF. The proof of (i) follows from lemma 3.2.5 and theorem 3.1.11 since
•
1
VE: Veach v e;
= L0 V + P w
0
To
prove
V may be written
aa
v = V + w with we:
W. Now L 0 v=L0 (V+w) •
=V+
(iii)
P0w e: V. The monotonicity of L0 is trivial.
we first note that v ,v2 EV imply that <v 1 -v l,
1
2
(L v _- L v > are elements of W. Moreover v ,v may be given by v =V + w ,
1
0 1
1
0 2
1 2
v
V + w with w ,w € W. So
L
2
=
0
is strictly contracting if and only if o is nonzero follows from lemma
2
1
2
3.2.4, The contraction radius equals llP 0 11 as is verified by choosing
= V +µand v = v. The last assertion follows from llv- (1-p) -1 bll s
1
v
S M (1-P)1
2
2
so each v
E
{v
V
€
I llv
1
(1-Pl- bll
-
written as v = V + w, where the µ-norm of w
W is
€
2
(l -P 0 ) - M'1 may be
$
at most
Now
II L.s v - ( 1 -
p)
-1
b II = II Lo (V + w)
s llv-
- ( 1 - p)
(1-p)
-1
Ml (1-po)
o).
b II
$
bll + llP ll•llwll
0
-2
+Po(Ml+M')(l-po)
TBEOREM 3.2.1. For any nonzero stopping time
unique fixed point v (independent of
-1
o E:
~
-2
the mapping L
=
0
has the
PROOF. The proof follows directly from the fact that V is a complete metric
·space and the foregoing lemmas.
0
31
LEMMA 3.2.7.
(ii)
Suppose o ,o 2 are nonrandomized nonzero stopping' times and G1 ,G2 the
1
goahead sets corresponding to o and o then
1
2
G1 c G2 .. o
1
$
o
2
and thus p
(iii) Let Q be a set of indices, o
+
p
sup p
0
q€Q
q
$
0
and
q
~
p
0
0
1
€ A, o nonzero and nonrandomized then
q
inf
q€Q
p
0
q
LEMMA 3.2.8. Let o € A be a nonzero stopping time, v~ €
0
(i)
v
(ii)
Lovo
(iii)
0
Lovo
n
0
0
v
(in µ-norm)
"'v
(in µ-norm)
+v
(in µ-norm)
Lo (vn-1) +
:=
0
0
0
0
$
VO .. vn
~
VO .. vn
V then
where, the convergence is component-wise.
PROOF. The proof of (i) is a direct consequence of theorem 3.2.1, whereas
parts (ii) and (iii) follow from the monotonicity of the mapping L and
0
theorem 3.2.1.
D
So the determination of the total expected reward over an infinite time
horizon,
v,
may be done by successive approximation of V by v!, with arbi-
trary nonzero stopping time o and arbitrary element
v~ of V.
Particularly if o , oH, oR are the nonrandomized nonzero stopping times
1
corresponding to the goahead sets G1 , ~' GR respectively, the following
well-known successive approximations converge to the return vector V
32
01
(i)
vn
(ii)
vn
oH
:
=r
01
01
+ Pvn- l , with v 0 e: V;
is component-wise defined by
o
OH
,o
v H{i) := r(i) +
p(i,j)vn (j) +
p(i,j)v ~
n
j<i
j~i
n
l
.
oH
with v
0
(iii)
0
l
1 Cj),
i
€
S ,
e: V:
v R is component-wise defined by
n
0
v
R
(i) := (1-p(i,i))
-1
r(i) + (1-p(i,i)
-1
n
i :;
o,
with v
oR
0
0
R
p(i,j)v
Cj), i e:
j;'i
n- 1
\'
l..
s,
e: V.
Furthermore if in each of these approximations
v~
is chosen as required in
lemma 3.2.8(ii) or (iii) the convergence will be monotone.
The following lellllllas will clarify some of the relations between the bound'ing function µ and the function b
(remind remark 3.1. 2 (vi}).
LEMMA 3.3.1. Suppose 3M'ElR+ Vie:S [b(i) :1!: -M'µ(.i:)] then there exist a
p',
0 < p • < 1 and a bounding function µ • such that
(3. 3.1)
llPll 1 S p ' < l
µ
C3.3.2l
II rllµ, <
00
•
-1
PROOF. Choose M3 :• max{211Pb-pbll. (1-p)
,2M'}, p' :• P~ + l.i(1-p~) and
'
'
µ
"
µ' (i) := b(i) + M µ(i), t:llen clearlyµ• is a bounding function. Now
3
II r II ,
µ
=
sup
(Ir (i) I) (b (i) + M. µ (i)) -l
3
ie:S\{O}
s
sup
clbCill>Cb(i)+M3µ(i))-l
ie:S\{O}
+
sup
clr(i) -b(i) I> (b(i) +M31.l(i))-l <"'.
ie:S\{O}
33
Furthermore
:s; p •µ ••.
D
In a similar way the following lell!llla. can be proved.
LEMMA 3.3.2. Suppose 3M'ElR+ ViES [b(i) ~ M'µ(i)]; then there exist a
P',
0 < p' < 1 and a bounding functionµ' such that (3.3.1) and (3,3.2) are
satisfied.
The latter two lemmas express that if the reward function :Ls bounded from
one side (with respect to the weighting factors µ (i)), a new bounding function µ• can be defined such that the Banach space Wµ, contains b, r, Vn and
V. However, for the existence of such a bounding function µ •, the condition
that b is bounded from one side (with respect to
µ)
is essential, as is
illustrated by the following example.
1, ViES\{O} p(i,0) • 1 -p,
Let i 0 :=min { c + > > p}, if i 0 is even then we redefine
1 2
EXAMPLE 3.3.1. S := {0,1,2,.;.}l p(0,0)
r(O)
= o.
iE'.N
+ 1.
0
0
For all O < i < i
i
:= i
r(i) =
o,
p(i,i)
0
the rewards and transition probabilities are given by
= p,
p(i,j) = O for j # i and j
~
O. For i
0
> ~i , we .
choose
p(2i,2i + 2) =p(2i -1,2i + 1) =p (1 -ai) ,
p(2i,2i + 1) =p(21 -1,21 +2) =Pai ,
with
r(2i+1) = -(i+1)-2 p-(i+l)I p(21-1,j) = p(2i,j) = 0 otherwise;
b
r; µ(i) = 1 for i
~
O.
34
= P.
We clearly have II Pllµ
Moreover it is easily verified that
..l
l
P(n)
(21,j)lr(j)I
P-i
s
n=O j€S
l
n
-2
n=i
and
.l
l
P(n)
(2i+1,j)lrCj)I
S
p-(i+!)
n=O j€S
l
n
-2
n=i+1
It can also be verified that
vi€S II p(i,j)bCj> - pb(ill
=o
j
So the assumptions 3.1.2-3.1.5 are satisfied.
However, no bounding function µ' exists for which P ' < 1 and II r II , < '"'.
= (i) -2 p -i , r (2i + 1) = - (i + 1} -2P- (i+l) µ for
This follows since r (2i}
i.> ~io implies that an eventuel bounding functionµ• should satisfy
µ' <2il ~
for some M
€
M'1
P
:R+ and i
-i
>
Ci>
-2
=: µ
0 c2i>
~i 0 •
Assume the existence of a bounding function µ' and a p' < 1 such that
(3.3.3)
µ•
~LPµ'
p'
•
i
:== max{i ,min{(i;
We define
1
Then, substituting µ
the condition
0
0
ia-1
1
)
2
> '1(1
+p')}} •
in the right hand side of (3.3.3) yields, for i > 21 ,
1
35
µ• <il
with 13
:=
:?:
aµ <i> =: µ 1 Cil ,
0
1 + p.
1
pi-<-2- - l
> 1.
Substituting µ (il in (3.3.3) yields for i > 2i
1
1
µI (i)
i!:
13µ
1
(i)
= 13 2 µ0 (i)
•
Iterating in this way proves that no bounding function exists.
REMARK 3. 3.1.
(i)
It is easily verified that the assumptions 3.1.3 and 3.1.4 do not imply assumption 3.1.5. By replacing the rewards in the above example by
lrl the assumptions 3.1.3 and 3.1.4 remain satisfied whereas assumption
3.1.5 fails.
{ii) If b is bounded from one side (in µ-norm) it follows from lemma 3.3.1
or 3.3.2 that r is a charge with respect
..l
..l
n=O
to
P, since
Pn Ir I 11 , S ( 1 - P ' ) -lll r II , ,
µ
Pnlrl Cil s
µ
(1-p'>- 1 1r<i>I (µ'(i))-l,
i ;. 0 •
n=O
In this case assumption 3.1.3 may be replaced by the assumption that
bis bounded from one side (in µ-norm).
37
CHAPTER. 4
MARKOV DECISION PROCESSES
As
mentioned in chapter l we consider in this and the following chapters
Markov decision processes.
In section 4.1 the model is described. Next, in section 4.2, decision rules
and the assumptions on the transition probabilities and on the reward structure will be introduced. Again the assumptions will allow for an
unbound~d
reward structure. They are in fact a natural extension of the assumptions
3.1.2-3.l.5 to the case in which decisions are permitted. A number of re-
sults about Markov decision processes will be proved under our assumptions
(section 4. 3). For example, the existence of e:-optimal stationary Markov
decision rules will be shown. For Markov decision processes with a bounded'
reward structure this result has also been obtained by Blackwell [ 5] and
Denardo [ 12] •
Harrison [24] proved the same for discounted Markov decision processes with
a bounding function µ (i)
=1
for i
E:
s\ {O}. Moreover, in section 4. 3 we
prove the convergence (in µ-norm) of the standard dynamic programming algorithm.
The final section will be devoted again
4.1.
to
a discussion on the assumptions.
'The MaJtkov dec,U,ion model.
We consider a Markov decision process on the countably infinite or finite
state spaces at discrete points in time t
= O,l, ••••
In each state i es
the set of actions available is A. We allow A to be general arid suppose
to be a a-field on A with {a} e
is i
E:
s and action a e
{i)
(ii)
(iii)
a
E:
A
A. If the system's actual state
A is selected, then the system's next state will be
j e s with probability pa(i,j).
ASSUMPTION 4.1.1.
A if
38
(iv) pa(i,j) as a function of a, is a measurable function on (A,A) for each
i,j e.
s.
If state i e. s is observed at time n and action a e. A has been selected,
then an immediate (expected) reward r(i,a) is earned. So frOlll now on the
reward function r is a real valued measurable function on S x A.
The objective is to choose the actions at the successive points in time in
such a way that the total expected reward over an infinite time horizon is
maximal. A precise formulation will be given in the following sections.
It will be shown later on (chapter 9) that our model formulation includes
the discounted case (with a discounted factor 13 < 1), since
e may
be sup-
posed to be incorporated in the pa(i,j). The same holds for semi-Markov
decision processes where it is only required that t is interpreted as the
number of the decision moment rather than actual time. For semi-Markov decision processes with discounting, the resulting discount factor depends on
i,j and a e. A only, and may again be supposed to be incorporated in the
transition probabilities pa(i,j).
In the first part of this section we are concerned with the concept of decision rules. Roughly, a decision rule is a recipe for taking actions at
each point in time. A decision rule will be denoted by
selected at time n, according to
11,
1f.
The action to be
may be a function of the entire history
of the process until that time. We allow for the decision rule
'If
to be such
that for each state i e. S actions are selected by a randc:m mechanism. This
random mechanism may be a function of the history too.
DEFINITION 4. 2. 1.
n-stage histo1.'y hn of the process is a (2n + 1)-tuple
(i)
An
(ii)
Hn' n
~
0 denotes the set of all possible n-stage histories.
(iii) Sn is the product o-field of subsets of Hn generated by S
0
and
A.
39
DEFINITION 4.2.2.
(i)
Let~
be a transition probability of (Hn,Sn) into (A,A), n
~
O. So
(a) for every hn e Hn
~<·lhn) is a probability measure on (A,A);
(b) for every A' e A
~(A'I•> is measurable on (Hn,Sn)•
Then a decision rule
is defined to be a sequence of transition
('IT)
probabilities, 'IT := (~,q ,q , ••• ).
1 2
(ii) The set of all decision rules is denoted by D.
DEFINITION 4.2.3.
(i)
A decision rule
'IT=
(~,q 1 , ••• ) is called non'l'andorrrized if ~<·lhn)
is a degenerated measure on (A,A) for each n
3 A [q Calh )
a.::
(ii)
-n
n
A decision rule
= 1].
'IT
~
O, i.e.
'l'he set of all nonrandomizeddecision rules is N.
== (~ ,q , ... ) is said to be Mazokov or memo'l'Y'leee
if for all n ~ O a
-n
1
(<
Ihn )
depends on the last component of h
n
only.
(iii)
The set of all Markov decision rules is denoted by BM.
(iv)
A decision rule is said to be a Markov-strategy if it is nonrandomized and Markov.
(v)
The set of Markov strategies is denoted by M.
(vi)
A Markov strategy can thus be identified with a sequence of functions {fn I n = 0,1, ••• } where fn is a function from S into A. Such
a function is called a (Markov) poliay. The set of all possible policies is denoted by F.
(vii)
A Markov strategy is called
station~
are identical. We denote by
£""
if all its component policies
the stationary Markov strategy with
component f. F.. denotes the set of all stationary Markov strategies.
(viii) For
(f ,£ , ... ) .:: Mand g c F we denote by (g,1T) := (g,f ,f , •• )
0 1
0 1
the Markov strategy that applies g first and then applies the poli'IT=
cies of
'IT
in their given order.
n
n
· n
For n c:N we define the measurable space ( (S x A) ,s0 ,A), where S 0 ,A is the
product a-field generated by S 0 and A. The product space cn A'FO A) is the
0
m
'
'
space with n A = (S x A) and F A the product a-field of subsets of
0,
0,
(S x A) generated by S and A. For each 1T = (~ ,q1 , ••• ) and each n .:: ::N we
0
define for ( (S x A) n ,S~ A) and (S x A,S A) the transition probability on
0,
n
,
((s x A)n,so,A> as follows
..
40
I
p n,n+l (Bx A'
where B
S0
€
and A'
Ci 0 ,z0 ), ••• ,(in,zn)) :=
A.
€
For each decision rule
11 €
D and each starting state i
€
S
(as a conse-
quence of the Ionescu Tulcea theorem, see e.g., Neveu [52]), the sequence
transition probabilities {p
l defines a unique probability measure Pi1T
n,n+ 1 'n
on C!l0 ,A,FO,A). So we may consider the stochastic processes { (sn,an) I n~O}
and {Sn I n ~ 0} where s n and a n are the projections on the n-th state
space and n-th action space respectively. This means that sn is the state
and an is the action at time n.
It may be verified that for the simplified situation considered in chapter
3, the Ionescu Tulcea theorem would yield the probability measure Pi.
NOTATIONS 4.2.1.
(i}
Let v be a real valued function on m A, f A) • We denote by E ~ v the
0t
0I
1T
expectation of v with respect to the probability measure Pi.
(ii)
E'ITv denotes the vector with i-th component equal to E~v.
~Dis
(iii) If 1T
E
(iv)
a stationary Markov strategy
1T
1!
instead of Ei, E respectively.
f
11
= {f,f,, •• } we
f
may useEi'
By rf we denote the vector with components r(i,f(i)).
By Pf we denote the matrix with (i,j)-th element equal to pf(i) (i,j).
0
11 e D we define P (1T) := I, where I is the identity matrix,
(v)
(vi)
For each
and the matrix Pn(tt), (n > 0) is the matrix with (i,j)-th entry equal
toP~(sn =
So if
1T
j).
e M1
=
11
The stochastic process {s ,a I n ~ O} is not necessarily a Markov process
n n
since the decision rules 'IT e D may be such that actions are selected depending on the complete history of the process. If 'IT is Markov then the stochastic process {s
n
I
n
<!:
O} is a, not necessarily stationary, Markov chain.
Moreover, if 'IT is a stationary Markov strategy, then the stochastic process
. {s
·n
I
n
:!:
O} is a Markov chain with stationary transition probabilities.
41
· As
mentioned, we will use
horizon
the expected total reward over an infinite time
as a measure for the effectiveness of a decision rule
ir·.
If at
{i ,z0 ,i ,z , ••• ,in} has been observed and action
0
1 1
zn E A is selected at that point in time, then the total reward (return)
time n the history hn
:=
over n time periods is
n-1
l
(4.2.1)
k=O
r(iit,~) •
Without assumptions on the reward function r and the transition probabilities pa{i,j) there is of course no guarantee that under an arbitrary decision rule
ir,
vn has a finite expectation. In order to guarantee the exis-
tence of the expectation of Vn and the total expected reward over an infinite time horizon for an arbitrary decision rule
ir
we generalize the assump-
tions 3.1.2-3.1.5 to the situation in which decisions (actions) are allowed.
ASSUMPTION 4.2.1.
ASSUMPTION 4.2.2.
ASSUMPTION 4.2.3.
ASSUMPTION 4.2.4.
REMARK
(i)
4.2.1.
Note that if A contains one element only, the assumptions coincide
with .those in section 3.1.
42
(ii)
'As a consequence of the definition of the bounding function µ and the
=
function b assumption 4.2.1 implies that V A r(O,a)
O. Since for
aE
i = O for all a € A the reward and the transition probabilities are
equal they may be identified.
(iii) r
{iv)
f
may be written as r
f
= b + y
f
where y
f
€
w.
f
We define p* : .. sup {llP II} and p 0 := max{p,p*}.
fEF
LEMMA 4. 2.1.
PROOF. We will only prove part (iii) since the proof of the other parts
proceeds along the same lines
S
llPfb - pbff + llPfygll + pllyhll
S
llPfh - pbll + llPfll•llyqll + pllyhll <"' •
0
LEMMA 4.2.2.
PROOF. For 1f
= (f0 ,f1 , ••• ),
n-1
II Pn (1f) 11 s
TI
f
Pn(1f) Ir nl .. Pn(tt)
lb+ y
f
nl
f
II P k II s p~ •
k=O
f
f
F we have II Pn (1f) y n II s p ~M , where M is chosen
1
in accordance with assumption 4.2.1.
Since y n
€
W for all fn
€
1
43
So,
"'
I
n•O
f
llPn(ir)y nll S (1 - p *) -1 Ml
which implies
..,
f
I Pn(ir)jy nl
n ..o
s
( 1 - p *)
-1
M1• µ •
This yields the result
....
l
n=O
f
Pn(irJlrnl
s
..
l
n•O
..,
l
n=O
Pn(ir) jbj +
f
Pn(irJ jy nl < ...
D
REMARK 4.2.2. In terms of potential theory {see e.g. Hordijk (33]). Lemma
4.2.2 says that the reward structure is a charge structure with respect to
the transition probabilities.
In the previous section we have introduced the assumptions on the reward,
and the probability structure. In this section a number of results in Markov
decision theory will be proved under our assumptions. We shall first prove
that the total expected reward over an infinite time horizon for every decision rule 1T
i;;
D is an element of the complete metric space V. Let 1T be an
amitrary decision rule, Given the initial state i
determines the unique probability measure
P; on
i::
s the decision rule
m0 ,A,FO,A)
as described in
the previous section.
LEMMA 4. 3.1. For each ir
€
o, II Pn (1T)
II :S p~.
PROOF. The proof proceeds by induction. The statement is trivial for n
Suppose it is true for some n :e: 1, then
=0, 1 •.
44
s
JP:(dhn)p*µ(sn)
H
n
0
LEMMA 4.3.2. There exists an M ∈ ℝ+ such that for all π ∈ D and all n = 0,1,2,...

    ||E^π r(s_n, a_n) − ρ^n b|| ≤ ρ_0^n M (n + 1) .

PROOF. The proof proceeds along the same lines as the proof of lemma 3.1.2. Choose M := 2M_1 + M_2, where M_1 is such that ||r_f − b|| ≤ M_1 for all f ∈ F and M_2 is such that ||P_f r_g − ρ r_h|| ≤ M_2 for all f,g,h ∈ F. Then the proposition is true for n = 0,1, as follows from lemma 4.2.1(ii). Assume it is true for an arbitrary n ≥ 1; then we have to prove that it is true for n + 1. We first note that for all n ≥ 1 and for all i ∈ S

    ||E_i^π[r(s_{n+1}, a_{n+1}) − ρ r(s_n, a_n)]|| ≤ ρ_0^n M ,

as follows from

    E_i^π[r(s_{n+1}, a_{n+1}) − ρ r(s_n, a_n)]
      = E_i^π[ Σ_{j∈S} p^{a_n}(s_n, j) ∫_A q_{n+1}(da | s_0, a_0, ..., a_{n-1}, s_n, a_n, j) r(j, a) − ρ r(s_n, a_n) ]
      ≤ E_i^π[ Σ_{j∈S} p^{a_n}(s_n, j)(b(j) + M_1 µ(j)) − ρ r(s_n, a_n) ]
      ≤ E_i^π[ ρ b(s_n) + M_2 µ(s_n) + ρ_* M_1 µ(s_n) − ρ r(s_n, a_n) ]
      ≤ (ρ M_1 + M_2 + ρ_* M_1) ρ_*^n µ(i) ≤ ρ_0^n M µ(i) .

Using this inequality and the induction hypothesis yields the assertion for n + 1. In a similar way it may be proved that the corresponding lower bound holds. □
Let the decision rule π be given. The expected n-period reward by using decision rule π given the initial state i ∈ S is defined by

(4.3.1)    V_{π,n}(i) := E_i^π Σ_{k=0}^{n-1} r(s_k, a_k) .

THEOREM 4.3.1. For all π ∈ D, define the corresponding total expected reward vector V_π by

    V_π := lim_{n→∞} V_{π,n} ;

then

    ||V_π − (1 − ρ)^{-1} b|| ≤ (1 − ρ_0)^{-2} M ,

where M is defined as in the proof of lemma 4.3.2.

PROOF. By lemma 4.3.2, ||E^π r(s_n, a_n) − ρ^n b|| ≤ ρ_0^n M(n+1); summing over n and using Σ_{n=0}^∞ ρ_0^n (n+1) = (1 − ρ_0)^{-2} yields the assertion. □
So for each π ∈ D the corresponding expected total reward over an infinite time horizon exists and is an element of V. We now want to prove the existence of ε-optimal Markov strategies.

DEFINITION 4.3.1.
(i)   A decision rule π* is said to be ε-optimal if and only if

          V_{π*} ≥ V_π − εµ    for all π ∈ D.

(ii)  A decision rule π* is said to be optimal if and only if

          V_{π*} ≥ V_π    for all π ∈ D.
11'
THEOREM 4.3.2. For every ε > 0 and π = (q_0, q_1, ...) ∈ D there exists a Markov strategy π* ∈ M such that for all i ∈ S

(i)   V_{π*}(i) ≥ V_π(i) − εµ(i) ,

(ii)  sup_{π∈D} V_π(i) = sup_{π∈M} V_π(i) .
REMARK 4.3.1.
(i)   The theorem is proved in a more general setting by van Hee [22]. To prove his theorem van Hee used a result of Derman and Strauch (which was generalized by Hordijk [33]) in which it was stated that for fixed i ∈ S and an arbitrary π ∈ D there is a π* ∈ RM such that

          P_i^π(s_n = j, a_n ∈ A') = P_i^{π*}(s_n = j, a_n ∈ A')    for all n, all j ∈ S and A' ⊆ A.

      We will use a slightly different approach that proceeds along the same lines as a proof given by Blackwell [5] for discounted Markov decision processes with bounded rewards.
(ii)  The result obtained by van Hee covers the corresponding results of Blackwell [6] (positive dynamic programming), Strauch [66] (negative dynamic programming), Hordijk [34] (convergent dynamic programming) and our result (contracting dynamic programming).
(iii) For discounted (semi-)Markov decision processes with a finite state space and a finite decision space an elementary proof is given in Wessels and van Nunen [76].
PROOF OF THEOREM 4.3.2. Choose N such that Σ_{n=N+1}^∞ ρ_0^n (n+1) M ≤ ε/2, where M is as defined in the proof of lemma 4.3.2. Now if π' is another decision rule such that q'_0 = q_0, q'_1 = q_1, ..., q'_N = q_N, then

    ||V_π − V_{π'}|| ≤ 2 Σ_{n=N+1}^∞ ρ_0^n (n+1) M ≤ ε .

This enables us to assume that π is Markov from some point (say N + 1) on. So π might be represented by π = (q_0, q_1, ..., q_N, f_{N+1}, f_{N+2}, ...) with f_k ∈ F for k > N. We will show that there exists a decision rule π'' of the form π'' := (q_0, q_1, ..., q_{N-1}, f_N, f_{N+1}, ...) such that

    V_{π''}(i) ≥ V_{π'}(i) − ε'µ(i) ,    ε' > 0 .

Using this fact N times will produce the desired Markov strategy π*, if ε' is sufficiently small.
For each j ∈ S we define

    w(j) := Σ_{n=N+1}^∞ E_i^{π'}[r(s_n, a_n) | s_{N+1} = j]    for P_i^{π'}(s_{N+1} = j) > 0, and 0 otherwise.

We determine the action f_N(i) ∈ A such that

(4.3.2)    r(i, f_N(i)) + Σ_{j∈S} p^{f_N(i)}(i,j) w(j) ≥ sup_{a∈A} { r(i,a) + Σ_{j∈S} p^a(i,j) w(j) } − ε'µ(i) .

It will then be clear that π'' with f_N such that (4.3.2) holds has the desired property. □
As a consequence of the foregoing theorem we can state the following corollary.

COROLLARY 4.3.1. For all i ∈ S, all π ∈ D and each ε > 0 we have

    sup_{π_0∈M} V_{π_0}(i) > V_π(i) − εµ(i) .

So if we look for an ε-optimal decision rule π, and its corresponding total expected return over an infinite time horizon V_π, it is allowed to restrict the investigations to Markov strategies.
LEMMA 4.3.3. For each v ∈ V and each π ∈ M we have

    lim_{n→∞} (P_n(π) v)(i) = 0    for all i ∈ S.

PROOF. From lemma 4.3.2 it follows that ||P_n(π) b − ρ^n b|| → 0 for n → ∞. Since v may be written as v = (1 − ρ)^{-1} b + w, with w ∈ W, and since ||P_n(π)|| ≤ ρ_*^n, the statement will be clear. □

Now, in a similar way as we have introduced the mapping L_1 on V in chapter 3, we introduce the mapping L_1^f of V for each f ∈ F.
DEFINITION 4.3.2. Let f ∈ F and v ∈ V; then the mapping L_1^f of V is defined component-wise by

    (L_1^f v)(i) := r(i, f(i)) + Σ_{j∈S} p^{f(i)}(i,j) v(j) ,    i ∈ S,

or in vector notation

    L_1^f v := r_f + P_f v .

REMARK 4.3.2. For P := P_f, r := r_f the results of chapter 3 may be obtained. So we have e.g.
(i)   L_1^f is a monotone mapping of V into V.
(ii)  L_1^f is a contraction mapping on V with contraction radius ||P_f|| =: ρ_f.
(iii) L_1^f V_{f^∞} = V_{f^∞}.

We now define the well-known optimal return operator U_1 on V (see e.g. Blackwell [4], MacQueen [46], Harrison [24] or van Nunen [53], [54]).

DEFINITION 4.3.3. The mapping U_1 of V is defined component-wise as follows:

    (U_1 v)(i) := sup_{a∈A} { r(i,a) + Σ_{j∈S} p^a(i,j) v(j) } ,    i ∈ S,

or in vector notation

    U_1 v := sup_{f∈F} { r_f + P_f v } .
REMARK 4.3.3. It will be clear that U_1 v(i) gives the supremum of the expected return that can be earned if we start in state i at time t = 0, use decision a ∈ A at time t = 0, and receive v(j) if, as a result of that decision, state j ∈ S is reached at time t = 1.

THEOREM 4.3.3.
(i)    U_1 maps V into V.
(ii)   U_1 is a monotone mapping.
(iii)  U_1 is a contraction mapping with contraction radius

           ν_1 := sup_{f∈F} ||P_f|| = sup_{f∈F} ρ_f = ρ_* < 1 .

(iv)   U_1 maps {v ∈ V | ||v − (1−ρ)^{-1} b|| ≤ (1−ρ_0)^{-2} M_0} into itself, where M_0 is defined as in theorem 4.3.1.
(v)    U_1 has a unique fixed point V*, so V* is the unique solution in V of the optimality equation

(4.3.3)    v = U_1 v .

(vi)   Let v_0 ∈ V; then the sequence v_n := U_1 v_{n-1} converges in µ-norm to V*.

PROOF. Clearly for any ε > 0 and v ∈ V there exists an f ∈ F such that L_1^f v ≥ U_1 v − εµ. It now follows from remark 4.3.2(i) that U_1 maps V into V.
The monotonicity of U_1 is trivial. Let v_1 and v_2 ∈ V. For any ε > 0 choose f, g ∈ F such that L_1^f v_1 ≥ U_1 v_1 − εµ and L_1^g v_2 ≥ U_1 v_2 − εµ. Then

    U_1 v_1 − U_1 v_2 ≤ L_1^f v_1 + εµ − L_1^f v_2 ≤ ρ_* ||v_1 − v_2|| µ + εµ ,

and on the other hand

    U_1 v_1 − U_1 v_2 ≥ L_1^g v_1 − L_1^g v_2 − εµ ≥ −ρ_* ||v_1 − v_2|| µ − εµ ,

which yields ||U_1 v_1 − U_1 v_2|| ≤ ρ_* ||v_1 − v_2|| + ε. Since ε > 0 was chosen arbitrarily, U_1 is a contraction with radius at most ν_1.
By using v_1 := (1−ρ)^{-1} b + ℓµ and v_2 := (1−ρ)^{-1} b, with ℓ ∈ ℝ+, it is easily verified that for any ε > 0

    ||U_1 v_1 − U_1 v_2|| ≥ (ν_1 − ε) ||v_1 − v_2||

for ℓ sufficiently large. So the contraction radius of U_1 equals ν_1.
Let v ∈ {v ∈ V | ||v − (1−ρ)^{-1} b|| ≤ (1−ρ_0)^{-2} M_0}; then for each ε > 0 there exists an f ∈ F such that L_1^f v ≥ U_1 v − εµ. However, since for each f ∈ F the mapping L_1^f maps {v ∈ V | ||v − (1−ρ)^{-1} b|| ≤ (1−ρ_0)^{-2} M_0} into itself, also part (iv) of the theorem is proved. The latter two assertions are direct consequences of the fact that U_1 is a contraction mapping from the complete metric space V into V, see e.g. Ljusternik and Sobolew [43]. □
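For a finite state space and finite action sets the operator U_1 and a maximizing policy can be written down directly. The following Python sketch is only an illustration and is not part of the thesis: the arrays r, p and mu and all function names are assumptions introduced here (r[i, a] is the immediate reward, p[a, i, j] the transition probability, mu[i] > 0 the bounding function), with state 0 playing the role of the absorbing state with reward 0.

    import numpy as np

    # Illustrative sketch only: r[i, a], p[a, i, j] and mu[i] are hypothetical
    # inputs, with state 0 absorbing (r[0, :] = 0 and p[:, 0, 0] = 1 assumed).

    def U1(v, r, p):
        """Optimal return operator: (U_1 v)(i) = sup_a { r(i,a) + sum_j p^a(i,j) v(j) }."""
        q = r + np.einsum('aij,j->ia', p, v)   # q[i, a]
        return q.max(axis=1)

    def greedy(v, r, p):
        """A policy f with L_1^f v = U_1 v (the supremum is attained since A is finite)."""
        q = r + np.einsum('aij,j->ia', p, v)
        return q.argmax(axis=1)

    def weighted_norm(w, mu):
        """Weighted supremum norm ||w|| = sup_{i != 0} |w(i)| / mu(i)."""
        return np.max(np.abs(w[1:]) / mu[1:])

A single call U1(v, r, p) corresponds to one step of the sequence v_n := U_1 v_{n-1} of theorem 4.3.3(vi).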
THEOREM 4.3.4.
(i)    V_π ≤ V* for all π ∈ D.
(ii)   For each ε > 0 there exists an f ∈ F such that V_{f^∞} ≥ V* − εµ.
(iii)  Let π ∈ M and g ∈ F; then [V_{(g,π)} > V_π] ⇒ [V_{g^∞} > V_π].
(iv)   π* is optimal if and only if V_{π*} satisfies the optimality equation (4.3.3), so V_{π*} = V*.

PROOF. Suppose π = (f_0, f_1, ...) is an arbitrary element of M. Then

    L_1^{f_0} L_1^{f_1} ⋯ L_1^{f_n} V* ≤ U_1^{n+1} V* = V*    for each n ≥ 0.

However, L_1^{f_0} L_1^{f_1} ⋯ L_1^{f_n} V* = V_{π,n+1} + P_{n+1}(π) V*, so letting n → ∞ we have as a consequence of lemma 4.3.3 that

    lim_{n→∞} V_{π,n} ≤ V* .

As a consequence of theorem 4.3.2 the statement (i) is true for all decision rules π since it is true for Markov strategies.
(ii)  Choose f ∈ F in such a way that L_1^f V* ≥ V* − ε(1−ρ_0)µ, with ε > 0; then

    (L_1^f)^n V* ≥ V* − ε(1−ρ_0)[1 + ρ_0 + ⋯ + ρ_0^{n-1}]µ ,

so, letting n → ∞, we have

    V_{f^∞} ≥ V* − εµ .

(iii) V_{(g,π)} = L_1^g V_π. Now since L_1^g is monotone we have

    (L_1^g)^n V_π ≥ (L_1^g)^{n-1} V_π ≥ ⋯ ≥ L_1^g V_π = V_{(g,π)} > V_π ,

so again, letting n → ∞, we obtain V_{g^∞} ≥ V_{(g,π)} > V_π.
(iv)  If V_{π*} = V* then trivially π* is optimal (see part (i) of this theorem). Conversely, if π* is optimal then L_1^f V_{π*} ≤ V_{π*} for all f ∈ F, since otherwise f^∞ would be better than π* and thus π* would not be optimal. Hence U_1 V_{π*} ≤ V_{π*}. However, since U_1 is a monotone operator we have

    V* = lim_{n→∞} U_1^n V_{π*} ≤ V_{π*} ,

and together with part (i) this yields the equality V_{π*} = V*. □
REMARK 4.3.4.
(i)    If V_{f^∞} ≥ V* − εµ then f^∞ is ε-optimal.
(ii)   We will often denote V_{f^∞} by V_f.
(iii)  From part (iii) of the foregoing theorem we see that Howard's policy improvement routine [35] remains valid in this more general setting.
(iv)   A proof of the foregoing theorems for discounted Markov decision processes with a general state and action space but with a bounded reward structure was given by Blackwell [5]. For discounted Markov decision processes with countable state and action space and with a bounded reward structure an elementary proof was given by Ross [62]. For bounded rewards a proof can also be found in Denardo [12].
For two actions a_0, a_1 ∈ A and i ∈ S it is possible that r(i,a_0) = r(i,a_1) and p^{a_0}(i,j) = p^{a_1}(i,j) for all j ∈ S. We then say that a_0 and a_1 coincide with respect to i ∈ S; otherwise a_0 and a_1 are different with respect to state i ∈ S. Let A(i) be the set of all different actions with respect to state i ∈ S. We say A(i) is finite if A(i) contains only finitely many elements.

LEMMA 4.3.4. If A(i) is finite for each i ∈ S then there exists an optimal stationary Markov strategy.

PROOF. If, for each i ∈ S, the action set A(i) is finite then, since the supremum which defines the mapping U_1 is taken component-wise, there will exist an f ∈ F such that L_1^f V* = V*, which means that f^∞ is optimal. □

REMARK 4.3.5.
(i)   A proof of this lemma, for discounted Markov decision processes with a bounded reward structure, is also given by Blackwell [5] and Derman [15].
(ii)  If A(i) is not finite for each i ∈ S then it is easy to construct counterexamples showing that an optimal decision rule may not exist, see e.g. Blackwell [5].
Our interest is in fact the computation of the optimal return vector V* and the determination of an (ε-)optimal stationary Markov strategy. We will prove that, as a consequence of theorems 4.3.3 and 4.3.4, the method of successive approximation may be used to determine the optimal return vector V* and an (ε-)optimal stationary Markov strategy.

DEFINITION 4.3.4. A function g defined on the power set of F into F is said to be a choice function if and only if for each nonempty set B ⊆ F we have g(B) ∈ B.

From now on we assume g to be an arbitrary but given choice function.

REMARK 4.3.6. For the existence of a choice function, in a general situation as we have defined it here, we need the axiom of choice. We introduce the choice function only for notational convenience. For the processes we described, its full strength is never required, since we have to make only countably many choices. In the first place, only at the successive points in time n = 0,1,... a policy f has to be selected from a nonempty set B ⊆ F. Secondly, since S is at most countable, for each f ∈ F only countably many actions have to be selected.
Usually, in practical situations, the choice function can be given explicitly. This is specifically true in the situation that S and A are finite, since in that situation also F contains finitely many elements.
DEFINITION 4.3.5. For ε > 0 we define the mapping U_{1,ε} of V by

    U_{1,ε} v := L_1^f v ,    with f := g({f | ||U_1 v − L_1^f v|| < ε}) .

From theorem 4.3.3 it follows that U_{1,ε} is well defined for ε > 0. If for every v ∈ V there exists an f ∈ F with U_1 v = L_1^f v, then ε may be zero; in that case we define

    U_{1,0} v := L_1^f v ,    with f := g({f | U_1 v = L_1^f v}) .

LEMMA 4.3.5. The mapping U_{1,ε} maps V into V.

PROOF. The proof is evident since for each f ∈ F, L_1^f maps V into V. □
DEFINITION 4.3.6.
(i)   A mapping B of V into V is said to be ε-monotone (ε ≥ 0) if for each v, w ∈ V with v ≥ w we have

          Bv ≥ Bw − εµ .

(ii)  A mapping B of V into V is said to be ε-contracting (ε ≥ 0) with contraction radius ρ' if for each v, w ∈ V we have

          ||Bv − Bw|| ≤ ρ' ||v − w|| + ε .
LEMMA 4.3.6. The mapping U_{1,ε} (ε > 0) has the following properties:
(i)   U_{1,ε} is ε-monotone (ε > 0).
(ii)  U_{1,ε} is ε-contracting with contraction radius ν_1.
If U_{1,0} is defined then U_{1,0} has these properties as well.

PROOF.
(i)  Let v, w ∈ V be such that v ≥ w; then, since U_1 is a (0-)monotone mapping, we have

         U_{1,ε} v ≥ U_1 v − εµ ≥ U_1 w − εµ ≥ U_{1,ε} w − εµ .

Part (ii) follows similarly from the contraction property of U_1. □
LEMMA 4.3.7. Let v_0 ∈ V and suppose U_{1,0} is defined. Let the sequence v_n be defined by v_n := U_{1,0} v_{n-1} and let f_n := g({f | U_1 v_{n-1} = L_1^f v_{n-1}}); then

(i)    ||v_n − V*|| ≤ ν_1^n ||v_0 − V*|| ,

(ii)   ||V_{f_n} − v_n|| ≤ (1 − ρ_*)^{-1} ρ_* ||v_n − v_{n-1}|| ,

(iii)  if v_0 ∈ V is chosen such that U_1 v_0 ≥ v_0, then v_{n-1} ≤ v_n ≤ V_{f_n} ≤ V*.

PROOF.
(i)   ||v_n − V*|| = ||U_1^n v_0 − V*|| = ||U_1^n v_0 − U_1^n V*|| ≤ ν_1^n ||v_0 − V*||.
(ii)  Consider

    (L_1^{f_n})^k v_n − v_n = (L_1^{f_n})^k v_n − (L_1^{f_n})^{k-1} v_n + (L_1^{f_n})^{k-1} v_n − ⋯ + L_1^{f_n} v_n − v_n ,

which yields

    ||(L_1^{f_n})^k v_n − v_n|| ≤ (1 − ρ_*)^{-1}(1 − ρ_*^k) ||L_1^{f_n} v_n − v_n|| ≤ (1 − ρ_*)^{-1} ρ_* ||v_n − v_{n-1}|| .

By letting k → ∞ we have

    ||V_{f_n} − v_n|| ≤ (1 − ρ_*)^{-1} ρ_* ||v_n − v_{n-1}|| .

The final part follows from the monotonicity of U_{1,0} and L_1^{f_n}. □
LEMMA 4.3.8. Let ε > 0 and let v_0 ∈ V be such that v_0 ≤ U_1 v_0 − εµ. Let the sequence v_n (n ≥ 0) be defined by

    v_n := L_1^{f_n} v_{n-1} ,

where f_n := g(B_n) with B_n := {f ∈ F | L_1^f v_{n-1} ≥ max{v_{n-1}, U_1 v_{n-1} − ε(1−ρ_0)µ}}; then

(i)   ||v_n − V*|| < ε for n sufficiently large,

(ii)  v_{n-1} ≤ v_n ≤ V_{f_n} ≤ V* .

PROOF. We define η := (1−ρ_0)ε. Note that B_n is not empty: namely, if v_{n-1} ≥ U_1 v_{n-1} − ηµ, then f_n := f_{n-1} already suffices, since L_1^{f_{n-1}} v_{n-1} ≥ L_1^{f_{n-1}} v_{n-2} = v_{n-1}, L_1^{f_{n-1}} being monotone.
Similarly, we have for k ∈ ℕ

    (L_1^{f_n})^k v_{n-1} ≥ (L_1^{f_n})^{k-1} v_{n-1} ≥ ⋯ ≥ v_n ≥ v_{n-1} ,

which yields, by letting k → ∞,

    v_n ≤ V_{f_n} .

However, we know already V* ≥ V_{f_n}, so v_n ≤ V_{f_n} ≤ V*. The inequality v_n ≥ v_{n-1} follows directly from the definition of v_n.
It now remains to be proven that v_n → V* in µ-norm; since v_n ≥ v_{n-1} this convergence will be monotone. Continuing in this way we get

    v_n ≥ U_1 v_{n-1} − ηµ ≥ U_1^2 v_{n-2} − η(1 + ρ_*)µ ≥ ⋯ ≥ U_1^n v_0 − η(1−ρ_0)^{-1}µ = U_1^n v_0 − εµ .

Now, since U_1^n v_0 → V* (in µ-norm), we have the final result ||v_n − V*|| < ε for n sufficiently large. □
REMARK 4.3.7. It is easy to find a starting vector v_0 ∈ V satisfying v_0(i) ≤ U_1 v_0(i) − εµ(i), namely v_0 := (1−ρ)^{-1} b − ℓµ for ℓ ∈ ℝ+ chosen sufficiently large.
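For a finite state and action space the successive approximation of lemma 4.3.7, started from a vector as in remark 4.3.7, can be sketched as follows. This is a hedged illustration, not the thesis' algorithm verbatim: the names and the inputs r, p, mu, rho (playing the role of an upper bound for ν_1), b and ell are assumptions introduced here, and the stopping test uses the a posteriori bound of lemma 4.3.7(ii).

    import numpy as np

    def solve_mdp(r, p, mu, rho, ell=1e3, tol=1e-8, max_iter=100_000):
        """Successive approximation v_n = U_1 v_{n-1}; returns an approximation
        of V* and a greedy policy.  Hypothetical inputs: r[i, a] rewards,
        p[a, i, j] transition probabilities, mu[i] > 0 bounding function,
        rho an upper bound on sup_f ||P_f|| < 1."""
        n_states, n_actions = r.shape
        b = np.zeros(n_states)                      # b = 0 is assumed for simplicity
        v = (1.0 / (1.0 - rho)) * b - ell * mu      # starting vector of remark 4.3.7
        f = np.zeros(n_states, dtype=int)
        for _ in range(max_iter):
            q = r + np.einsum('aij,j->ia', p, v)
            v_new, f = q.max(axis=1), q.argmax(axis=1)
            diff = np.max(np.abs(v_new - v)[1:] / mu[1:])   # ||v_n - v_{n-1}||
            v = v_new
            # lemma 4.3.7(ii): ||V_{f_n} - v_n|| <= rho (1 - rho)^{-1} ||v_n - v_{n-1}||
            if rho / (1.0 - rho) * diff < tol:
                break
        return v, f

Because U_1 v_0 ≥ v_0 for this starting vector, the iterates increase monotonically, in accordance with lemma 4.3.7(iii).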
LEMMA 4.3.9. Let v_0 ∈ V. Let the sequence v_n be defined by v_n := L_1^{f_n} v_{n-1}, where f_n := g({f | ||U_1 v_{n-1} − L_1^f v_{n-1}|| < ν_1^n}); then
(i)   lim_{n→∞} v_n = V* (in µ-norm),
(ii)  lim_{n→∞} V_{f_n} = V* (in µ-norm).

PROOF. Let M := ||v_0 − V*||. The first part of the proof follows now by induction. Suppose ||v_n − V*|| ≤ ν_1^n (M + n) for some n ≥ 0; then

    ||v_{n+1} − V*|| ≤ ||L_1^{f_{n+1}} v_n − U_1 v_n|| + ||U_1 v_n − U_1 V*||
                    ≤ ν_1^{n+1} + ν_1 (ν_1^n (M + n)) = ν_1^{n+1} (M + (n + 1)) .

So, since the induction hypothesis holds for n = 0,1, it holds for all n; this implies part (i) of the lemma.
Choose ε > 0. Note that as a consequence of part (i) there exists an N ∈ ℕ such that for n > N
(a)  ||v_n − v_{n-1}|| ≤ (1 − ρ_*) ε/2 ,
(b)  ||V* − v_{n-1}|| < ε/2 .
Now, as a consequence of (a) we have

    (L_1^{f_n})^k v_{n-1} − v_{n-1} = (L_1^{f_n})^k v_{n-1} − (L_1^{f_n})^{k-1} v_{n-1} + (L_1^{f_n})^{k-1} v_{n-1} − ⋯ + L_1^{f_n} v_{n-1} − v_{n-1} ,

so by letting k → ∞ we have ||V_{f_n} − v_{n-1}|| < ε/2 (n > N). Similarly, we deduce from (b)

    ||V* − v_{n-1}|| < ε/2 ,    n > N .

Since ε was chosen arbitrarily this implies part (ii) of the lemma. □
So the computation of V*, the optimal total expected reward over an infinite time horizon, may be executed by successive approximation of V* by v_n as described in the lemmas 4.3.8 and 4.3.9. Moreover, for each ε > 0 the Markov strategy f_n^∞ is ε-optimal for n sufficiently large.
4.4. Remarks on the assumptions 4.2.1-4.2.4.
The first assumption (4.2.1) stated that the difference between b and r_f is bounded in µ-norm for f ∈ F. This assumption arises by combining the assumptions of Wessels [75] and Harrison [24] on the reward structure and the transition probabilities. As proved in section 3.3 our assumptions are more general than the assumptions of Wessels and Harrison separately. Assumption 4.2.2 is introduced to ensure the existence of the (conditional) expectations of the stochastic variables. If we use an extended notion of expectation (see van Hee and van Nunen [30]), assumption 4.2.2 may be replaced by a weaker assumption; this weaker assumption was first used by Harrison [23]. Assumption 4.2.3 requires that also for a ∈ A the expectation of µ(s_1) is at most ρ_* µ(s_0). This implies that for each decision rule π the corresponding stochastic process {s_n | n ≥ 0} has a tendency to decrease its µ-value and/or to fade. The final assumption requires that, for each i ∈ S and each a ∈ A, the expected one-stage reward does not differ too much from the immediate reward r(i,a).
For µ(i) ≡ 1, i ≠ 0, and the usual notion of expectation our assumptions include the assumptions as used by Harrison [24] for discounted Markov decision processes. If b ∈ W, the Markov decision processes as described by Wessels [75] are obtained. For a specific choice of a bounding function µ and b ∈ W, the idea of using weighted supremum norms is used by Lippman [44], [45] for semi-Markov decision processes; see also van Nunen and Wessels [56]. Wijngaard [78] used the idea of a specific (exponential) weighted supremum norm for inventory problems with respect to the average reward criterion. Hinderer [31] used the idea of bounding functions for finite horizon Markov decision processes.
For µ(i) ≡ 1, i ≠ 0, and b ∈ W our assumptions reduce to the well-known assumptions for Markov decision processes (see e.g. Denardo [12]) which guarantee the existence of the total expected reward over an infinite time horizon.
Finally, we want to extend the lemmas 3.3.1 and 3.3.2 to the case in which decisions are allowed.

LEMMA 4.4.1. If there exists an M ∈ ℝ+ such that b(i) ≥ −Mµ(i) for all i ∈ S, then there exists a number ρ' with 0 < ρ' < 1 and a bounding function µ' such that

    ||P_f||_{µ'} ≤ ρ' < 1    for all f ∈ F.

LEMMA 4.4.2. If there exists an M ∈ ℝ+ such that b(i) ≤ Mµ(i) for all i ∈ S, then there exists a number ρ' with 0 < ρ' < 1 and a bounding function µ' such that

    ||P_f||_{µ'} ≤ ρ' < 1    for all f ∈ F.

So if, in addition to the assumptions 4.2.1-4.2.4, the rewards are bounded from one side (with respect to the bounding function µ), then it is possible to define a new bounding function µ' such that the Banach space W_{µ'} of vectors v with ||v||_{µ'} < ∞ contains V* and V_f for f ∈ F. Moreover, the operators L_1^f and U_1 are contraction mappings on W_{µ'} with fixed points V_f and V* respectively. If b is bounded from one side with respect to the bounding function µ, the assumption 4.2.2 may be omitted. In that case it is easily proven that the reward structure is a charge with respect to the transition probabilities (see Hordijk [33] and Groenewegen [21]), i.e. for each π = (f_0, f_1, ...) ∈ M we have

    Σ_{n=0}^∞ P_n(π)|r_{f_n}| < ∞ ,

which follows from

    Σ_{n=0}^∞ || P_n(π)|r_{f_n}| ||_{µ'} ≤ (1 − ρ')^{-1} sup_{f∈F} ||r_f||_{µ'} < ∞ .
CHAPTER 5

STOPPING TIMES AND CONTRACTION IN MARKOV DECISION PROCESSES

In this chapter we will use the concept of stopping time as introduced in chapter 2 to generate a whole set of optimization procedures for solving Markov decision processes satisfying the assumptions 4.2.1-4.2.4.
The idea of using stopping times for generating such a set was introduced by Wessels [74]. Wessels used nonrandomized stopping times to generate a set of optimization procedures for finite state space, finite action space discounted Markov decision processes.
In this chapter this set of optimization procedures is extended by allowing randomized stopping times. Furthermore, the class of problems for which the procedures hold is generalized by using the less restrictive assumptions 4.2.1-4.2.4. The set of optimization procedures will include the known solution techniques as introduced by Blackwell [4], Hastings [25], Reetz [60] and van Nunen [54] for discounted Markov decision problems. Recently, again for finite state finite decision space Markov decision processes, Porteus [59] also introduced a set of optimization procedures. He introduced a number of transformations which might be used to investigate transformed Markov decision processes with the same (ε-)optimal policies and the same (possibly transformed) optimal return vector V*. A number of the transformations introduced by Porteus are in fact covered by our approach.
After some preliminaries the contraction mappings L_δ^π of V will be introduced and investigated (section 5.1). We will restrict the considerations to nonrandomized decision rules, which is justified by the results of the preceding chapter. Then (section 5.2) the optimal return operator U_δ on V will be defined. For each stopping time δ ∈ Δ the mapping U_δ yields a policy improvement procedure on which the determination of an (ε-)optimal Markov strategy f^∞ and the successive approximation of V* and V_f may be based. We will prove that the sequence v_n^δ, arising by successive application of U_δ to an arbitrary element v_0 ∈ V, converges to V* if and only if the stopping time δ is nonzero. Moreover, we prove in section 5.2 that by using U_δ attention can be restricted to stationary Markov strategies if and only if the stopping time δ is transition memoryless.
Finally, we will give algorithms which determine an (ε-)optimal Markov strategy and compute the optimal return vector V* by means of successive approximation.
5.1. The contraction mapping L_δ^π

For each n ∈ ℕ we define the measurable space ((S × E × A)^n, S_A^n), where S_A^n is the product σ-field on (S × E × A)^n generated by S, E and A. The product space (Ω_A, F_A) is the space with Ω_A := (S × E × A)^∞ and F_A the product σ-field of subsets of (S × E × A)^∞ generated by S, E and A.
In a similar way as in chapter 3 we define, for each stopping time δ ∈ Δ, each decision rule π := (q_0, q_1, ...) ∈ N and each starting state i ∈ S, a probability measure P_{i,δ}^π on (Ω_A, F_A) in the standard way by giving the probability for cylindrical sets in Ω_A,

    ω_A := ((i_0, d_0, z_0), (i_1, d_1, z_1), ...) ;

in this expression δ_{i,i_0} is again the Kronecker symbol (see section 3.1). This probability measure can also be obtained in a similar way as in chapter 4 by using transition probabilities.
For each π = (q_0, q_1, ...) ∈ N, each n ∈ ℕ and each δ ∈ Δ we then define the transition probabilities P^δ_{n,n+1} from ((S × E × A)^n, S_A^n) to (S × E × A, S_A),

    P^δ_{n,n+1}(B × E' × A' | (i_0,d_0,z_0), ..., (i_n,d_n,z_n)) ,

given in terms of the transition probabilities p^{z_n}(i_n, ·), the stopping probabilities determined by δ, and the integrals ∫_{A'} q_{n+1}(da | (i_0,z_0,...,i_n,z_n,i)); here A' ∈ A, and B and E' are elements of the power sets of S and E respectively.
For each decision rule π ∈ N and each starting state i ∈ S, as a consequence of the Ionescu Tulcea theorem (see e.g. Neveu [52]), these transition probabilities P^δ_{n,n+1} induce a unique probability measure P_{i,δ}^π on (Ω_A, F_A).
So we may consider the stochastic processes {s_n | n ≥ 0}, {(s_n, e_n) | n ≥ 0}, {(s_n, e_n, a_n) | n ≥ 0}, where, as in chapters 3 and 4, s_n, e_n, a_n are the projections on the n-th state space, the n-th E space and the n-th action space respectively.
NOTATIONS 5.1.1.
(i)   For f a real-valued function on (Ω,F) we denote by E_{i,δ}^π f the expectation of f with respect to the probability measure P_{i,δ}^π.
(ii)  E_δ^π f denotes the vector with i-th component equal to E_{i,δ}^π f.
(iii) If π ∈ M is a stationary Markov strategy, π = (f, f, ...), then E_{i,δ}^f, E_δ^f may be used instead of E_{i,δ}^π, E_δ^π respectively.

Clearly, the probability measure P_{i,δ}^π on (Ω,F) induces a unique probability measure on the coarser measurable space (Ω_0, F_0).
LEMMA 5.1.1. For each δ ∈ Δ, π ∈ N and i ∈ S the probability measure induced by P_{i,δ}^π on the coarser measurable space (Ω_0, F_0) equals P_i^π.

PROOF. The statement follows directly from the definitions of P_{i,δ}^π and P_i^π. □
We now extend the mapping L_δ of chapter 3 (defined by using the stopping time τ) to the situation in which decisions are allowed.

DEFINITION 5.1.1. For each δ ∈ Δ and π ∈ N the mapping L_δ^π of V is defined by

    (L_δ^π v)(i) := E_{i,δ}^π [ Σ_{k=0}^{τ-1} r(s_k, a_k) + v(s_τ) ] ,    i ∈ S,

or in vector notation

    L_δ^π v := E_δ^π [ Σ_{k=0}^{τ-1} r(s_k, a_k) + v(s_τ) ] .
REMARK 5.1.1.
(i)   We remind the reader that v(s_τ) := v(0) if τ = ∞.
(ii)  Note that for all δ ∈ Δ, π ∈ N and v ∈ V we have (L_δ^π v)(0) = 0.

LEMMA 5.1.2. If δ ∈ Δ is the nonrandomized stopping time corresponding to G_1 as defined in chapter 4, and π = (q_0, q_1, ...) ∈ N, then L_δ^π equals L_1^{f_0}, where f_0 is defined by f_0(i) := z_i and z_i ∈ A is the action for which q_0(z_i | i) = 1. Consequently, for that δ, L_δ^π is a monotone mapping of V into V, and L_δ^π is strictly contracting with contraction radius ρ_{f_0} and fixed point V_{f_0}.
DEFINITION 5.1.2. For each δ ∈ Δ and π ∈ N the matrix P_δ^π is defined to be the matrix with (i,j)-th element equal to

    p_δ^π(i,j) := Σ_{n=0}^∞ P_{i,δ}^π(s_n = j, τ = n) .

REMARK 5.1.2. Note that for each δ ∈ Δ and π ∈ N we have p_δ^π(0,j) = 0 for all j ∈ S\{0} and p_δ^π(0,0) = 0.
LEMMA 5.1.3. Let δ be a nonzero stopping time and let π ∈ N; then

    ρ_δ^π := ||P_δ^π|| ≤ (1 − inf_{i∈S} δ(i)) + inf_{i∈S} δ(i) ρ_* < 1 .

PROOF. The proof proceeds along the same lines as the proof of lemma 3.2.4. □
LEMMA 5.1, 4.
then for all 1T ~
Suppose o
(ii)
Suppose o ,o
1 2
ahead sets corresponding to o
1
s
o2
(i)
N,
'II
~ P
1T
•
0
1
2
2
1
are nonrandomized stopping times G and G are the go1
and
p
0
o2 •
Then
for all 1T e
N•
(iii) Let Q be an index set and silppose for each q e Q, o e A is a nonq
randomized stopping time then for all 1T e
N
lf
P +
6
:::
lf
sup PIS
qEQ
q
LEMMA 5.1.5. Let δ ∈ Δ and π ∈ N; then L_δ^π maps V into V.

PROOF. From the foregoing chapter we know that there exists an M' ∈ ℝ+ such that ||V_π − (1−ρ)^{-1} b|| ≤ M' for all π ∈ N. Choose, for each v ∈ V, M_v ∈ ℝ+ such that ||v − (1−ρ)^{-1} b|| = M_v. So for all π ∈ N we have ||v − V_π|| ≤ M' + M_v.
Let π := (q_0, q_1, ...) ∈ N and i ∈ S, and let f_0(i) be defined as in lemma 5.1.2. Then π_i ∈ N is defined by π_i := (q_0^i, q_1^i, ...), the decision rule obtained from π by conditioning on the initial state i.
Let Δ_N contain all stopping times with the following property:

    δ(α(N)) ≠ 0  ⇒  δ(α) = 1    for all α ∈ ∪_{k=N+1}^∞ S^k ,

where α(N) denotes the initial segment of α of length N. Then it is easily verified that for each i ∈ S, π ∈ N and δ ∈ Δ_N

    P_{i,δ}^π(τ ≤ N ∨ τ = ∞) = 1 .

Let δ ∈ Δ_{N+1}, and define for each i ∈ S the stopping time δ_i ∈ Δ by δ_i(α) := δ((i,α)) for each α ∈ ∪_{k=1}^∞ S^k; then clearly δ_i ∈ Δ_N. Now

    (L_δ^π v)(i) = (1−δ(i)) v(i) + δ(i)[ r(i,f_0(i)) + E_{i,δ}^π[ Σ_{k=1}^{τ-1} r(s_k,a_k) + v(s_τ) | e_0 = 0 ] ]
        ≤ (1−δ(i)) v(i) + δ(i)[ r(i,f_0(i)) + Σ_{j∈S} p^{f_0(i)}(i,j)[ V_{f_0}(j) + (M'+M_v)µ(j) ] ]
        ≤ (1−δ(i)) v(i) + δ(i)[ V_{f_0}(i) + ρ_*(M'+M_v)µ(i) ]
        ≤ V_{f_0}(i) + (1−δ(i))(M'+M_v)µ(i) + δ(i) ρ_*(M'+M_v)µ(i)
        ≤ V_{f_0}(i) + (M'+M_v)µ(i) .

In a similar way it may be proved that (L_δ^π v)(i) ≥ V_{f_0}(i) − (M'+M_v)µ(i). So L_δ^π v ∈ V. Since for δ ∈ Δ_0, L_δ^π v ∈ V, as is easily verified, it follows by induction that L_δ^π v ∈ V for all δ ∈ Δ. □
LEMMA 5.1.6. Let δ ∈ Δ and π ∈ N; then
(i)    L_δ^π is monotone.
(ii)   L_δ^π is strictly contracting if and only if δ is a nonzero stopping time; in this case the contraction radius equals ρ_δ^π.
(iii)  For each nonzero stopping time, L_δ^π possesses a unique fixed point V_δ^π ∈ V.
(iv)   The set {v ∈ V | ||v − (1−ρ)^{-1} b|| ≤ (1−ρ_0)^{-2} M'} is mapped into itself by L_δ^π, where M' := (1−ρ_0)^{-1} M and M is defined as in theorem 4.3.1.

PROOF. Part (i) of the lemma is trivial. To prove part (ii), let v_1, v_2 ∈ V; then v_1 and v_2 can be given by v_k = (1−ρ)^{-1} b + w_k (k = 1,2), where w_k ∈ W, so ||v_1 − v_2|| = ||w_1 − w_2||. Hence

    ||L_δ^π v_1 − L_δ^π v_2|| = ||P_δ^π(w_1 − w_2)|| ≤ ||P_δ^π|| · ||v_1 − v_2|| .

The fact that L_δ^π is strictly contracting if and only if δ is a nonzero stopping time is a consequence of lemma 5.1.3. The contraction radius equals ||P_δ^π||, as is easily verified by choosing v_1 = (1−ρ)^{-1} b + µ and v_2 = (1−ρ)^{-1} b.
Part (iii) of the lemma is evident since L_δ^π is then a contraction mapping on the complete metric space V (see e.g. Ljusternik and Sobolew [43]). The final statement is clear since

    L_δ^π v ≤ V_δ^π + ρ_δ^π (1−ρ_0)^{-2} M'µ ≤ (1−ρ)^{-1} b + (1−ρ_0)^{-2} M_0 µ + (1−ρ_0)^{-2} ρ_δ^π M'µ ≤ (1−ρ)^{-1} b + (1−ρ_0)^{-2} M'µ ,

and similarly for the lower bound. □
LEMMA 5.1.7.
(i)   Let δ ∈ Δ be a nonzero stopping time and let π := (f, f, ...) be a stationary Markov strategy; then L_δ^f has the unique fixed point V_f, independent of the stopping time.
(ii)  If π ∈ N is not stationary then there exists a situation {p^a(i,j), r(i,a)} for which nonzero stopping times δ_1 and δ_2 ∈ Δ exist such that L_{δ_1}^π and L_{δ_2}^π possess different fixed points.

PROOF.
(i)   If π is a stationary Markov strategy then the process {s_n | n ≥ 0} is a Markov chain with rewards only depending on the current state. Theorem 3.2.1 now implies the result.
(ii)
Suppose
11
:= (q ,q , ••• )
1
more elements. Let n
0
0
€
N is nonstationary then A contains two or
c :N be such that
(a) for all k < n0 qkC£ (1)j1 ,f c1 J,1 ,f (i ), ••• ,ikl = 1
0 0 0
1 0 1
0
(note that n 0 ~ 1).
(b) for at least one jO
€
S and at least one path
Let r(i,z)
b(i) for i € s\{j } and all z €A, choose r(j ,f Cj )) :•
0
0 0 0
= bCj 0 ) and r(j ,z> = b(j J + tµ for all z € A\f (j ). Choose
0
0
0 0
zl
p
:=
(i,j) = p
z2
(i,j) > 0 for all i,j € s and z ,z € A such that the
1 2
assumptions 4.1.1, 4.2.1-4.2.4 are satisfied. so for each n the stochastic process {sn' n
o
no
corresponding
~ 0)
is a Markov chain. Now by choosing
no
1
o1
,
respectively it is easily verified that
to G , G
the fixed points of the mapping,s
are different for t suf-
ficiently large.
D
LEMMA 5.2.1. Let i ∈ S, let δ ∈ Δ and v ∈ V; then for each ε > 0 there exists a decision rule π_i ∈ N such that L_δ^{π_i} v(i) ≥ L_δ^π v(i) − εµ(i) for all π ∈ N.

PROOF. The existence of an M ∈ ℝ+ such that for all π ∈ N, L_δ^π v ≤ (1−ρ)^{-1} b + Mµ follows from lemma 5.1.5, so for each i ∈ S there exists a π_i ∈ N such that L_δ^{π_i} v(i) ≥ L_δ^π v(i) − εµ(i) for all π ∈ N. □
DEFINITION 5.2.1. The mapping U_δ of V is defined component-wise by

    (U_δ v)(i) := sup_{π∈N} (L_δ^π v)(i) ,    i ∈ S.

REMARK 5.2.1. Note that the supremum over π ∈ N is taken component-wise.
THEOREM 5.2.1. Let δ ∈ Δ; then
(i)    U_δ maps V into V.
(ii)   U_δ is monotone.
(iii)  U_δ is a contraction mapping if and only if δ is nonzero. The contraction radius ν_δ satisfies ν_δ := sup_{π∈N} ρ_δ^π.
(iv)   The set {v ∈ V | ||v − (1−ρ)^{-1} b|| ≤ (1−ρ_0)^{-2} M'} is mapped into itself by U_δ, where M' is defined as in lemma 5.1.6 with ρ_δ^π replaced by ν_δ.
(v)    If δ is nonzero then U_δ possesses the unique fixed point V*, or equivalently V* is the unique solution in V of the optimality equation

(5.2.1)    v = U_δ v .
PROOF. Part (i) of the proof follows directly from,lemmas 5.1.5 and 5.2.1.
Part (ii) is trivial.
(iii) Let v,w
€
II and e: > O. Let i
€
s, choose
tt
for all
'1T €
0
,tt
1
€
N such that
1T 0
'1T
LIS v(i) 2: ItV{i) - e:µ(i)
and
N •
We then have
'!To
I
j€S
P 0 (w(jl - v(j)J + e:µ(il
and
=l
j
p
1T 1
0
(i,j) (v(j) - w(j)) - e:µ (i) •
This implies
Since & was chosen arbitrarily we have
It follows from lemma 5.1.3 that u6 is strictly contracting for nonzero 6.
Now we prove that the contraction radius is v 0 • Let -rr 0 be such that
110
II P
U 2: v0 - l.ie: and let i
0
6
Then choose v
11
0
L5
({1-p)
-1
bl
:=
2:
(1
(1
-p)
-p)
-1
-1
€
s be such that
b +JI.µ and w
I
j€S
11
p 0 °u 0 ,jlPCjl
:= (1 - p) -lb.
Let
M
2:
cv 6 -e:lµCi 0 l.
CIR+ be such that
b - Mµ (see lemma 5.1.5). Consider u 11 v<i 0 l -u11 wCi 0 l
11'
0
06 v <i0 > - 00 w Ci 0 > ;;:: ~ v Ci 0 > - c1 -P >-lb Cio> - Mµ Cio>
;: I
jES
11'
p °<i ,j)tµ(j) - 2MµCi >
6
0
0
2M
for t. chosen sufficiently large (t > T) .
This implies v 6 is the contraction radius.
.
11'
Since II P 0 II = 1 if o is not a nonzero stopping time (see the proof of lemma
5.1.3) we have also proved that U0 is not strictly contracting in this case.
(iv) It is a direct consequence of lemma 5.1.6 that
is mapped into itself by u0 •
(v) We first note that for o nonzero U0 has a .unique fixed point. The final
part of the theorem follows by considering u6v * •
Now choose e > O and f e F such that Vf;;:: V* - eµ. In theorem 4.3.4 we have
proved that such an f exists. Let
11'
00
e F
be defined by
11' :=
(f,f, ••• ).
Then
f
However, (see remark 4.3.2), we already know L6vf = Vf' which implies
u6v*(i) ;;:: Vf(i) - eµ(i) and since e was chosen arbitrarily we have
on. the other hand U0V* (i) s V* (i) as can be proved inductively. This proof
proceeds in a similar way as the first part of the proof of lemma 5.1.5. D
COROLLARY 5.2.1. Let δ ∈ Δ be nonzero and let v_0^δ ∈ V. Let the sequence v_n^δ be defined by

    v_n^δ := U_δ v_{n-1}^δ ;

then v_n^δ → V* (in µ-norm).
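One concrete member of this family of operators is easy to illustrate in the finite case: a nonrandomized go-ahead set that lets the process run within the set of states already updated during the current sweep gives a Gauss-Seidel-like iteration (cf. the go-ahead set G_B and Hastings' algorithm mentioned in section 6.2). The Python sketch below is an assumption-laden illustration, not the thesis' construction: the inputs r, p, mu and the in-place update rule are hypothetical.

    import numpy as np

    def gauss_seidel_sweep(v, r, p):
        """One Gauss-Seidel-style sweep: states are updated in increasing order
        and each update already uses the new values of the states treated
        earlier in the sweep (a nonrandomized go-ahead set in the sense of
        chapter 5).  r[i, a], p[a, i, j] are hypothetical inputs."""
        v = v.copy()
        n_states, n_actions = r.shape
        for i in range(1, n_states):              # state 0 stays absorbing with value 0
            best = -np.inf
            for a in range(n_actions):
                best = max(best, r[i, a] + p[a, i, :] @ v)
            v[i] = best
        return v

    def iterate(v0, r, p, mu, tol=1e-8, max_iter=10_000):
        """Successive application of the sweep; for a nonzero stopping time the
        iterates converge to V* (corollary 5.2.1), usually faster per sweep
        than the standard operator U_1."""
        v = v0.copy()
        for _ in range(max_iter):
            v_new = gauss_seidel_sweep(v, r, p)
            if np.max(np.abs(v_new - v)[1:] / mu[1:]) < tol:
                return v_new
            v = v_new
        return v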
LEMMA 5.2.2.
(i)
(ii)
Let
Let
o1,o 2
o1 , o2
t:. be nonzero and suppose
€
€
o1 s o2 ,
then v
1
0
s v
1
2
0
•
2
t:. be nonrandomized and nonzero, let G and G be the go-
ahead sets corresponding to
Gl c G2 ..
o1
s
o2
o1
and
and thus v
o2
01
respectively, then
~ v
02
In the previous chapter we have proved the existence of (ε-)optimal stationary Markov strategies. Regrettably, the procedure as described in corollary 5.2.1 does not produce such (ε-)optimal Markov strategies. So we should like to characterize nonzero stopping times which allow for the use of stationary Markov strategies only. The two theorems in the sequel provide the main step for such a characterization.
THEOREM 5.2.2. If δ ∈ Δ is transition memoryless, ε > 0 and v ∈ V, then there exists an f ∈ F such that for all i ∈ S

    L_δ^f v(i) ≥ U_δ v(i) − εµ(i) .

Hence

    U_δ v = sup_{f∈F} L_δ^f v .

PROOF. Let v ∈ V. We will define a new Markov decision process such that L_δ^π v(i) (of the old process) is the total expected reward over an infinite time horizon (for the new process) if the starting state is i and strategy π ∈ N is used. Hence, for this new Markov decision process attention may be restricted to memoryless Markov strategies, as is proved in chapter 4. This implies that for the determination of U_δ v in the original problem the restriction to stationary Markov strategies is permitted too.
We will assume, as is allowed without loss of generality, that δ(i) = 1 for all i ∈ S. We define the new Markov decision process in the following way: the new state space is the union of two copies of S, so the states of S are represented twice. For the states i* ∈ S*, one of the two copies, and all a ∈ A we define p^a(i*, 0*) := 1 and r(i*, a) := v(i); for the states i in the other copy we define r(i, a) := r(i, a), while the transitions are determined by p^a(i,j) and the stopping time δ.
It is easily verified that L_δ^π v(i) is just the total expected reward over an infinite time horizon if the process starts in state i and decision rule π is used. □
Transition memoryless stopping times are the only stopping times for which a restriction to memoryless or stationary Markov strategies is always allowed; this fact is expressed in the following theorem.

THEOREM 5.2.3. Suppose the stopping time δ ∈ Δ is not transition memoryless. Then there exists a Markov decision process with state space S (i.e. there exist a set A and numbers {p^a(i,j)}, {r(i,a)}) such that for this Markov decision process

    U_δ v ≠ sup_{f∈F} L_δ^f v    for some v ∈ V.

REMARK 5.2.2. In fact sup_{f∈F} L_δ^f v may not be defined.
PROOF OF THEOREM 5.2.3. 6 E:
existence of states i ,j
6
ea ,1 0 ,j 0 )
0
< 6
tr ,i 0 ,j 0 >.
0
Ais
not transition memo;eyless, implies the
E: s and two paths.a,y E: ~- {S\{O})k such that
k=O
Let
which implies c > 1. The case
o CY ,i0 ,j 0 )
= 1 requires a slight, self-evi-
dent modification. For n E: (S\{O})O we define o(n,i ,j ) := o(i ,j ) w~ere
0 0
0 0
nisaory.
We will construct a counterexample satisfying the following conditions
(1)
VH:S\ (io) A(i) contains only one element, whereas A(i 0 ) contains two
elements, A(i0 ) := {1,2}.
(2) For allies, i t B := {[aJ 0 ,(aJ 1 , ••• ,[aJka-l'(yJ 0 , ••• [y]k.y_ 1 ,1 ,j 0 }
0
we have pa(i,i)
:= p,
pa(i,j)
:=
0 for j
E:
S {i,O}; p(i,O) := 1 - P;
r{i,a) := O; v(j) := O. Moreover, p(0,0) := 1.
(3) For all i E: Band all j, pa(i,j) := 0 if j t B u {O}, else pa(i,j) > O.
Vi jE:B V E:A(i)
'
a
(5) µ(i) = 1 for i
(4)
As
l
jE:B
~ O.
pa(i,j)
a consequence of condition
Sp;
(1)
ViE:B VaeA(i) [pa(i,0) = 1-
l
jeB
pa(i,j)].
the index a in pa(i,j) and r(i,a) can be
r i 0•
omitted if i
For the investigation of u v the following form has to be maximized with
6
respect to a for n = a,y respectively
(5.2.2)
where on is defined by on Ca>
:=
.
u sk and n = a,y
o(n,io,a> for all a E:
k=1
respectively.
The second term in (5.2.2) may be written as
+
l
paCi ,jlo Cn,10 ,jlrCj,a) +
0
l
pa(i ,j)o(n,i ,j)
0
0
jeB
+
jeB
l
keB
pa(j,k)U
0
n ,j
v(k) •
76
with 6
Tit
j (8)
= 6 (n,io,j,.S)
~·
for
a kcl
'ri
sk and
E:
n
a,y.
Suppose i 0 "' j 0 •
Let p 1 c1 ,j ) := q ~ 01 p 2 c1 ,j ) := l:iq. Let for j e: a\{j },
0 0
0
0 0
1
2
p Ci ,j) := p Ci 0 ,j) := e; for i ~ i 0 and all j e: B p(i,j) := e. we define
0
-1 -1
v(j ) := [1 - 6 (a,i ,j 0 )J q
and v(j) := o for j 'f j ; r(j) := o for
0
j
0
0
e: B\{i 0 }; we will ahoose lrCi 0 ,a>I
< 2
for a= 1,2.
It is now easily verified that U0 v{j), U0
v(k) are bounded by
v(jo> + 2 c1·-p)-1
11
n,j
M :=
So formula 5.2.2 can be given (using 5.2.3) for Tl = a by
for e + 0
fore+o.
For 11
=y
we have
for e + O
for e + 0 •
1
1
1
Choose r(i ,2> = O and r(i 0 ,1) =+ 2E--2 - cl. Then in state i 0 after path a
0
2
decisiflll 1 has to be selected whereas in state i 0 after path Y decision 2
is optimal if
e; is chosen sufficiently small.
suppose j 0 = i 0 •
Let p 1 {1 ,i ) := q; p 2 {i ,i ) := l:iq; V{j) := r(j) := 0 for j € B\{i0 }
0 0
0 02
1
2
-1 -1
p c1 ,;1> :.. p c1 ,j) = q , q « 1 and choose v{i ) := [ 1 - 6 (a,1 ,i ) J q
0
0
0
0
Then formula (5.2.2) can be given for 11 = a by
For n
=y
we have
for q
+
0
for q
+
0
0
77
r(i0 ,1) +
*
+ 0(q)
{ r(i ,2) + !c + O(q}
0
for q +
o
for q +
o•
This implies, in a similar way as for i_0 ≠ j_0, that in state i_0 after path α decision 1 is optimal whereas after path γ decision 2 is optimal (for q chosen sufficiently small). □

COROLLARY 5.2.2. Let δ ∈ Δ be nonzero and transition memoryless and let v_0^δ ∈ V; then the sequence {v_n^δ}, n ≥ 0, defined by

    v_n^δ := U_δ v_{n-1}^δ

converges in µ-norm to V*.

PROOF. The statement follows directly from the foregoing theorems 5.2.1 and 5.2.2. □
DEFINITION 5.2.2. For ε > 0 and δ ∈ Δ transition memoryless, we define the mapping U_{δ,ε} of V by

    U_{δ,ε} v := L_δ^f v ,    with f := g({f | ||U_δ v − L_δ^f v|| < ε}) .

From theorem 5.2.2 it follows that U_{δ,ε} is well defined for ε > 0. If for every v ∈ V there exists an f ∈ F with U_δ v = L_δ^f v, then ε may be zero; in that case we define

    U_{δ,0} v := L_δ^f v ,    with f := g({f | U_δ v = L_δ^f v}) .

LEMMA 5.2.3. The mapping U_{δ,ε} maps V into V.

LEMMA 5.2.4. The mapping U_{δ,ε}, with ε > 0 and δ transition memoryless and nonzero, has the following properties:
(i)   U_{δ,ε} is ε-monotone.
(ii)  U_{δ,ε} is ε-contracting with contraction radius ν_δ.
If U_{δ,0} is defined, then U_{δ,0} also has these properties.
LEMMA 5,2.5. Leto
and suppose
<!:
0
A be transition memoryless and nonzero, let v
0
u6 , 0 is defined. Let the sequence
v!
€
V,
be defined by
(i)
0
(ii)
llvn - vf II s: (1-v )
0
n
(iii) if
-1
\1
0
0
llvn - v
II
0
n-1
vg e V is chosen such that u0 ,Ovg :: vg then
PROOF.
(i)
0
__n 0
vn = u v • This implies
0 0
So
So tfor k
-+- "'
we find
f
Part (iii) of the lemma follows from the monotonicity of U ,O and L n.
0
0
0
LEMMA 5.2.6. Let ε > 0, let δ ∈ Δ be transition memoryless and nonzero and let v_0^δ ∈ V be such that v_0^δ ≤ U_δ v_0^δ − εµ. Let the sequence v_n^δ (n ≥ 0) be defined by

    v_n^δ := L_δ^{f_n} v_{n-1}^δ ,

where f_n := g(B_n), with B_n := {f | L_δ^f v_{n-1}^δ ≥ max{v_{n-1}^δ, U_δ v_{n-1}^δ − ε(1−ν_δ)µ}}; then

(i)   ||v_n^δ − V*|| < ε for n sufficiently large,
(ii)  v_{n-1}^δ ≤ v_n^δ ≤ V_{f_n} ≤ V* .

PROOF. The proof is completely analogous to the proof of lemma 4.3.8. □
REMARK 5.2.3.
(i)   Also in this situation it is easy to find a starting vector v_0^δ ∈ V such that U_δ v_0^δ ≥ v_0^δ + εµ, by choosing e.g. v_0^δ := (1−ρ)^{-1} b − ℓµ with ℓ ∈ ℝ+ sufficiently large.
(ii)  It follows by combining parts (i) and (ii) of the foregoing lemma that ||V_{f_n} − V*|| < ε for n sufficiently large.
LEMMA 5.2.7. Let δ ∈ Δ be transition memoryless and nonzero and let v_0^δ ∈ V. Let the sequence v_n^δ be defined as in lemma 4.3.9, with U_1 replaced by U_δ; then
(i)   lim_{n→∞} v_n^δ = V* (in µ-norm),
(ii)  lim_{n→∞} V_{f_n} = V* (in µ-norm).

PROOF. The proof is completely analogous to the proof of lemma 4.3.9. □
REMARK 5.2.4.
(i)   From the preceding lemmas it follows that for any nonzero transition memoryless stopping time δ ∈ Δ the determination of the optimal return vector V* can be done by successive approximation of V* with a sequence v_n^δ as described in lemmas 5.2.5-5.2.7. Moreover, for each ε > 0 the policy f_n is ε-optimal for n sufficiently large. So in fact each nonzero transition memoryless stopping time produces a policy improvement procedure for solving the Markov decision problem.
(ii)  For finite state space Markov decision processes with a bounded reward structure and nonrandomized stopping times, many of the results given in this chapter may be found in Wessels [74] and van Nunen and Wessels [55].
CHAPTER 6

VALUE ORIENTED SUCCESSIVE APPROXIMATION

In the previous chapter we have developed policy improvement procedures for Markov decision processes. Here we will deal mainly with policy improvement value determination procedures. Procedures of this type require extra computational effort, in each iteration step, in order to find better estimates for the actual value vector V_{f_n}.
In section 6.1 we will show that each transition memoryless stopping time δ ∈ Δ (so in fact each policy improvement procedure) generates a whole set of policy improvement value determination procedures. In section 6.2 some aspects of the value oriented methods will be discussed and connections with results of other authors will be given.
From now on we will restrict the considerations to transition memoryless nonzero stopping times. This restriction is made since stationary (Markov) strategies are in fact the relevant decision rules, as is proved in theorem 4.3.4, and transition memoryless stopping times are the only stopping times for which a restriction to stationary strategies is always allowed (see section 5.3). In spite of this restriction, the existing policy improvement procedures, as introduced by Howard [35], Hastings [26], Reetz [60], and the present author [54], are contained in the set of policy improvement procedures generated by transition memoryless stopping times.
For each δ ∈ Δ that is transition memoryless and nonzero, a set of value oriented procedures will be based on an extension of the contraction mapping U_δ of V.
DEFINITION 6.1.1. The set of all transition memoryless nonzero stopping times is denoted by Δ'.

DEFINITION 6.1.2. For ε > 0, δ ∈ Δ' and λ ∈ ℕ we define the mapping U_{δ,ε}^{(λ)} of V by

    U_{δ,ε}^{(λ)} v := (L_δ^f)^λ v ,    with f := g({f | ||U_δ v − L_δ^f v|| < ε}) .

From theorem 5.2.2 it follows that U_{δ,ε}^{(λ)} is well defined for ε > 0. We define the mapping U_{δ,ε}^{(∞)} of V by

    U_{δ,ε}^{(∞)} v := lim_{λ→∞} U_{δ,ε}^{(λ)} v = V_f .

If for every v ∈ V there exists an f ∈ F with U_δ v = L_δ^f v, then ε may be zero. In that case we define

    U_{δ,0}^{(∞)} v := lim_{λ→∞} U_{δ,0}^{(λ)} v = V_f ,    with f := g({f | U_δ v = L_δ^f v}) .

LEMMA 6.1.1. For λ ∈ ℕ ∪ {∞} and δ ∈ Δ', U_{δ,ε}^{(λ)} maps V into V.

LEMMA 6.1.2. Let δ ∈ Δ', let ε > 0 and λ ∈ ℕ ∪ {∞}; then
(i)   U_{δ,ε}^{(λ)} is not necessarily ε-monotone.
(ii)  U_{δ,ε}^{(λ)} is not necessarily ε-contracting.
_ PBOOF. The statements follow directly from the following example, where the
exact value of e: is irrelevant, since only finitely many Markov policies
=
are available. We choose e:
1. Leto €A• correspond to G , S\{O} := h,2},
1
µ(1) :m µ(2) :• 1, A(1) :m A(2) :• {1,2} 1 pl(l,1) :• p 1 (2 1 1) :• p 2 (2 1 2) :=0
2
p (1,2) := o,
p 1 (1,2) p• p 1 (2,2) := p 2 (2,1) := p 2 (1,1) := 0.99,
r(l,1) :=
r(2~1)
:• 10, r(1,2) := r(2,2) :=O, v (1) := v (2) :=O,
1
1
v (1) := 100, v (2) := 10, Now it is easily verified that
2
2
Bl
:=
f
{flll u v 1 - L v 1 II s 1}
0
6
= {f = ( 11) }
and
B2 := {flllu0 v 2 - L:v2 11
$
1}
(2)}
2
{f
However, for A + "" we have
(A)
1000
0 o,1v1 + <1000> ,
(A)
0 o,1v2 + (0)
0
.
0
From the foregoing lemma it may seem impossible to use the mapping U_{δ,ε}^{(λ)} in a similar way as we have used U_δ and U_{δ,ε}. However, the structure of Markov decision processes enables us to base approximation procedures for solving the Markov decision process on U_{δ,ε}^{(λ)}. This will be proved in the sequel of this section. We will first assume that U_{δ,0} is defined; in practical situations this assumption is often satisfied.
/''A s />'A s vf s v*
(i)
n-1
/i'nJ,, 1
(ii)
n
n
v* ,
fl
Since L0 is a monotone contraction on V
Moreover, it follows from the monotonicity of L
u0 v o'1Ji. .. L0f2 v o'1A ~ L0fl v o'1Ji. ~
o'
J
i
.
o'Ji. ~ u2 v o'A • The proof
v
and v
0 0
2
1
Now, since
oJ..
v
~
v
fl
that
0
o'A
it is similarly verified that
1
proceeds further in an inductive way
2
using the monotonicity and contraction properties of the mappings U and
0
f
_..n o'A 1 V* (see lemma 5.2.5).
L and the fact that u 0 v
0
0
REMARK 6.1.1. It is easy to find a v
ol..
0
€
H
~
V such that u v0
0
v
o'Ji. (see re-
0
mark 5.2.3).
LEMMA 6.1.4. Let δ ∈ Δ' and suppose U_{δ,0} is defined. Let v_0^{δλ} ∈ V and λ ∈ ℕ ∪ {∞}. Let the sequence v_n^{δλ} be defined by

    v_n^{δλ} := U_{δ,0}^{(λ)} v_{n-1}^{δλ} ;

then
(i)   lim_{n→∞} v_n^{δλ} = V* (in µ-norm),
(ii)  lim_{n→∞} V_{f_n} = V* (in µ-norm).
PROOF. We will prove the lelllllla :\. ;:: JN since the proof of the lemma for l. =""is
("") l)oo
a direct consequence of the foregoing lemma. Namely, U.s
implies U0 vf
0
•
fl
<!: L.s vf
1
v
10 0
• vf • This
1
= vf • So after the first iteration step the con1
1
ditions of the foregoing lemma are satisfied.
Proof of part (i) of the lemma: First, we remark that
as follows from
.S>. <!: V* - eµ for n sufficiently large is more complicated.
The proof of vn
We will first prove
(6.1. 1)
and
Similarly we find
BS
So
By induction it follows that
(6.1.2)
This proves formula (6.1.1).
Let£ > O, let N be such that (A -1)v
Nl
;\ -1
(1 -v 0 ) M < £. Then (6.1.2) with
n • N implies
Similarly we find
This implies
(6.1.3)
From formula (6.1.2) with n ., N + 1 we have
which yields
Substituting the riqht hand side of formula (6.1,3) in· the above formula we
qet
(), _ l) v (N+1) AMµ
0
So
However, from formula (6.1.2) we have
In a similar way as for N + 1, N + 2 this yields
In general
oA ~ k o). _ (' - l·)[ (N+k-1)).
(N+k-2)1.+1 +
+ NA+k-1]M
VN+k
0-,svN
"
Vo
+Vo
• • • \Io
).I
So for k -+
ao
we 9et
ol.
lim inf VN+k
k--
~
*
V - (l - 1) V
N).
(1
A. -1
- \I ) Mµ
0
This completes the proof of part (i) of the lemma.
Note that it follows from this proof that
(ii) For e > O and n sufficiently large we have
So ·for all k
<£
:N
~
*
V -
€).I
87
This results in
llvf
- v * II< 2& ,
n
for n sufficiently large.
D
Until now we have proved the convergence of the sequences v_n^{δλ} to V* under the restrictive condition that U_{δ,0} is defined. In the sequel of this section this assumption will be replaced by weaker assumptions.
THEO.REM 6.1.1. Let 0 €A'. Let A
E ~
u {oo} and let
o:l
Vo
€
v.
Let the
sequence v!A be defined by
(i)
v*
(in µ-norm)
lim vf .. v *
n
(in µ-norm)
lim v0 A
n
fr+""
(ii)
n-
PROOF. The proof proceeds along the same lines as the proof of lemma 6. 1. 4
with the exception of the first part where a slight modification is required.
Define
then
88
fl
as follows from the monotonicity and contraction property of L • Moreover
0
In general we find inductively
oA
- vn
The final part of the theorem proceeds in a similar way as the proof of
D
lemma 6.1.4.
LEMMA 6.1.5. Let e.: > O, let Ci
that v
ClA
0
where f
QA
s u v
6 0
€
- e.:µ. Let the sequence vn
:= g(B) with B :={f
n
n
n
ClA
!J. 1 • Let A € JN u {co} and let v
0
ClA
f oA
I L,,v
u n- 1
€
V such
be defined by
·
d).
Ci).
,u,,v
n- 1 u n- 1
~ max{v
e.:(1 -
v,,)µ}.
"
Then
II v 0 A - v* II < e.:, for n sufficiently larqe ,
(i)
n
s;
{ii)
v* •
PROOF. The proof may be found st:raii.ghtforwardly
by usinq similar arguments
as in lemma 6.1.3 and the foregoing theorem (6.1.1), see also the lemmas
D
4.3.8 and 5.2.6.
REMARK 6.1.2. As mentioned earlier it is not difficult to find a starting vector v_0^{δλ} ∈ V such that v_0^{δλ} ≤ U_δ v_0^{δλ} − εµ.
THEOREM 6.1.2. Let ε > 0 and let δ ∈ Δ'. Let λ ∈ ℕ ∪ {∞}, and let v_0^{δλ} ∈ V. Let the sequence v_n^{δλ} be defined by

    v_n^{δλ} := U_{δ,η}^{(λ)} v_{n-1}^{δλ} ,    with η := ½ ε (1−ν_δ)^3 .

Then
(i)   ||v_n^{δλ} − V*|| < 2ε for n sufficiently large,
(ii)  ||V_{f_n} − V*|| < 3ε for n sufficiently large.
PROOF. The proof proceeds in a similar way as the_proof of lemma 6.1.4.
Also in this case it will be clear that
So
and
L
f2 oA.
oA.
oA.
o:I.
0 v 1 - v 1 ;:: ut'iv 1 - v 1 - nµ
: : L:1 v~A.
Inductively we find
So for n sufficiently large
-
v~A
- nµ :::
-v~Mi.t
- nµ •
90
So from formula (6.1.4) with n = N we have
Iterating in this way we get
Now since
w~
get
This yields by applying U
0
(6.1.5)
From formula (6, 1. 4) with n = N + 1 we have
and
In a similar way as for n = N this yields
So
91
By using (6.1.5) we find
and
So for k
+
we have
00
cS>. '"'" V* - eµ •
lim inf vN+k
k+a>
This proves part (i) of the lenma.
The final part follows in a similar way as the proof of part (ii) of lemma
D
6.1. 4.
REMARK 6.1.3.
(i)   Note that the foregoing theorem (for λ = 1) states that, for any δ ∈ Δ', ε > 0 and v_0^δ ∈ V, we have for v_n^δ := U_{δ,η} v_{n-1}^δ
      (a)  ||v_n^δ − V*|| < 2ε ,
      (b)  ||V_{f_n} − V*|| ≤ 3ε ,
      for n sufficiently large.
(ii)  From the foregoing theorems and lemmas it will be clear that for any δ ∈ Δ' and for any λ ∈ ℕ ∪ {∞} the optimal return vector V* may be approximated by a sequence v_n^{δλ} as described in theorems 6.1.1 and 6.1.2 and the lemmas 6.1.3-6.1.5. Moreover, the policy f_n becomes (ε-)optimal for n sufficiently large. So in fact each δ ∈ Δ' and each λ ∈ ℕ ∪ {∞} produce a policy improvement value determination procedure for solving the Markov decision problem.
(iii) Note that for λ = 1 the sequences v_n^{δλ} coincide with the corresponding sequences v_n^δ discussed in chapter 5.
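In the finite case the value oriented mapping U_{δ,0}^{(λ)} with δ corresponding to G_1 amounts to the following scheme: determine a maximizing policy and apply its one-step operator λ times before the next improvement. The Python sketch below is illustrative only; all names and the inputs r, p, mu are assumptions introduced here, and λ = ∞ would correspond to computing V_{f_n} exactly (policy iteration).

    import numpy as np

    def value_oriented(r, p, mu, lam=5, tol=1e-8, max_iter=10_000):
        """Value oriented successive approximation, a sketch of
        v_n = U^{(lam)}_{delta,0} v_{n-1} for delta corresponding to G_1:
        one improvement step followed by lam - 1 further applications of
        L^{f_n} to obtain a better estimate of V_{f_n}."""
        n_states, n_actions = r.shape
        idx = np.arange(n_states)
        v = np.zeros(n_states)
        f = np.zeros(n_states, dtype=int)
        for _ in range(max_iter):
            q = r + np.einsum('aij,j->ia', p, v)
            f = q.argmax(axis=1)                       # policy improvement step
            v_new = q[idx, f]                          # = L^{f} v
            for _ in range(lam - 1):                   # lam - 1 value determination steps
                v_new = r[idx, f] + np.einsum('ij,j->i', p[f, idx, :], v_new)
            if np.max(np.abs(v_new - v)[1:] / mu[1:]) < tol:
                return v_new, f
            v = v_new
        return v, f

The value of lam need not be fixed; as noted at the end of section 6.2, it may be varied from one iteration step to the next.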
6.2. Some remarks on the value oriented methods
In this section we shall try to illustrate what really happens when we use a value oriented successive approximation method. Furthermore, connections with existing solution techniques for Markov decision problems will be given. We shall restrict ourselves here to the situation that U_{δ,0} is defined. In the preceding sections we have seen how the results can be extended if this restriction is released.
In a very simple situation, finite state and decision space discounted Markov decision processes, it has been illustrated by the author [53] that for δ corresponding to G_1 the mapping U_{δ,0}^{(λ)} produces better estimates for V_{f_n} compared to the estimates of the mapping U_δ. This might be expected, since in each iteration step extra computational effort is spent to find better estimates for V_{f_n} (the value vector of the actual policy f_n).
For v ∈ V and δ ∈ Δ', let f := g({f | L_δ^f v = U_δ v}). Then, since L_δ^f V_f = V_f, the vector U_{δ,0}^{(λ)} v is in general a better estimate for the total expected return V_f under the actual policy f than U_δ v.
Choose v_0 ∈ V such that U_δ v_0 ≥ v_0. It is easily verified that the sequences v_n^{δλ} := U_{δ,0}^{(λ)} v_{n-1}^{δλ} and v_n^δ := U_{δ,0} v_{n-1}^δ, both started with the chosen v_0, satisfy v_n^δ ≤ v_n^{δλ} ≤ V* for all n. For some examples we refer to van Nunen [53]. In [53] we have shown for finite state space finite decision space discounted Markov decision processes that, for δ corresponding to G_1, the value vector v_n^{δλ} := U_{δ,0}^{(λ)} v_{n-1}^{δλ} may be interpreted as the total expected reward over nλ time periods if the strategy which uses each of the policies f_n, f_{n-1}, ..., f_1 during λ consecutive periods is employed and the terminal value vector equals v_0. A similar interpretation may be given if more complicated stopping times are used.
If δ is nonzero and λ = ∞, then the algorithms as described in theorem 6.1.1 are clearly of the policy iteration type, see e.g. Howard [35], Mine and Osaki [50]. This means that in each iteration step the value vector V_{f_n} (the total expected return over an infinite time horizon) of the actual policy f_n is computed exactly. The choice of δ ∈ Δ' only determines the way of looking for possible improvements of policies. For any δ ∈ Δ', in the case λ = ∞, each iteration step brings a strict improvement of the values V_{f_n} until the optimum V* is reached. This occurs in a finite number of steps if only finitely many Markov policies are available. For λ = ∞, in the first iteration step the value vector V_{f_1} is computed exactly. So after one iteration step the condition U_{δ,0} v ≥ v, as required in lemma 6.1.3, is satisfied; from then on the convergence will thus be monotone.
Howard's policy iteration algorithm [35], [50], for finite state space finite decision space discounted Markov decision processes equals the example δ corresponding to G_1, λ = ∞, µ(i) = 1 and b(i) = 0 for i ∈ S\{0}. If in this situation δ is replaced by the stopping time corresponding to the go-ahead set G_B we get Hastings' modified (Gauss-Seidel) policy iteration algorithm [26]. As mentioned, for λ = 1 and δ corresponding to G_1 the successive approximation methods yield the standard dynamic programming method as described by e.g. Bellman [3], Blackwell [4], [5], MacQueen [46].
In this chapter the procedures have been defined for a fixed number λ ∈ ℕ ∪ {∞}. However, it is not essential that λ is fixed for all iteration steps. The value of λ may depend on the number of the iteration step and even on specific aspects of the actual iteration process.
For numerical experiences with a value oriented method we refer to van Nunen [53].
CHAPTER 7

UPPER BOUNDS, LOWER BOUNDS AND SUBOPTIMALITY

In the previous chapters we have proved the convergence of the sequences v_n, v_n^δ, v_n^{δλ}, as defined in the preceding chapters, to the optimal return vector V*. We have also proved that the Markov strategy f_n^∞ found in the n-th iteration step of the actual algorithm is ε-optimal for n sufficiently large. In general the convergence proceeds at a geometric rate, since we have required ||P_f|| ≤ ρ_* < 1 for all f ∈ F (see lemma 5.2.5). The question now arising is whether one is able to construct better estimates for V* and V_{f_n} at an earlier stage of the iteration process. We will show that this can be done by extrapolation based on the differences v_n^δ − v_{n-1}^δ. Upper and lower bounds for the value vectors V_{f_n} and V* enable us to incorporate a test for the suboptimality of policies.
Section 7.1 will be devoted to the construction of upper and lower bounds. Section 7.2 will deal with the concept of suboptimality of policies. In the final section 7.3 we restrict the considerations to finite state space Markov decision processes. The concepts of bounds and suboptimality will be considered with respect to these processes. Several numerical aspects will be discussed; relations with the work of other authors will be given.
7.1. Upper bounds and lower bounds for V_{f_n} and V*

We will treat the concept of bounds in a similar way as Porteus [58], [59] did in the finite state case with δ corresponding to G_1, b = 0 and the unweighted supremum norm. We also refer to van der Wal [71] who treated the concept of bounds in a similar way for finite state space Markov games with b = 0 and the unweighted supremum norm.
DEFINITION 7.1.1. Let v_n^δ, n = 1,2,..., be a sequence of vectors in V; then we define α_n^δ, β_n^δ by

    α_n^δ := inf_{i∈S\{0}} [ (v_n^δ(i) − v_{n-1}^δ(i)) µ^{-1}(i) ] ,
    β_n^δ := sup_{i∈S\{0}} [ (v_n^δ(i) − v_{n-1}^δ(i)) µ^{-1}(i) ] .

Moreover, we define z_n^δ, y_n^δ by

    z_n^δ := inf_{f∈F} inf_{i≠0} ( µ^{-1}(i) Σ_{j∈S} p_δ^f(i,j) µ(j) )    if α_n^δ ≥ 0 ,
    z_n^δ := sup_{f∈F} sup_{i≠0} ( µ^{-1}(i) Σ_{j∈S} p_δ^f(i,j) µ(j) )    if α_n^δ < 0 ,

    y_n^δ := sup_{f∈F} sup_{i≠0} ( µ^{-1}(i) Σ_{j∈S} p_δ^f(i,j) µ(j) )    if β_n^δ ≥ 0 ,
    y_n^δ := inf_{f∈F} inf_{i≠0} ( µ^{-1}(i) Σ_{j∈S} p_δ^f(i,j) µ(j) )    if β_n^δ < 0 .
LEMMA 7.1.1.
(i)    For all f ∈ F:  z_n^δ α_n^δ µ ≤ L_δ^f v_n^δ − L_δ^f v_{n-1}^δ ≤ y_n^δ β_n^δ µ .
(ii)   (z_n^δ)^N α_n^δ µ ≤ U_δ^N v_n^δ − U_δ^N v_{n-1}^δ ≤ (y_n^δ)^N β_n^δ µ ,    N ∈ ℕ.
(iii)  For all ε > 0:  z_n^δ α_n^δ µ − εµ ≤ U_{δ,ε} v_n^δ − U_{δ,ε} v_{n-1}^δ ≤ y_n^δ β_n^δ µ + εµ .

PROOF. Since L_δ^f is a monotone mapping it is straightforwardly verified that part (i) holds. We will prove part (ii) only for N = 1; for general N the proof follows by induction. Choose an arbitrary ε > 0 and choose f_1, f_2 ∈ F such that L_δ^{f_1} v_n^δ ≥ U_δ v_n^δ − εµ and L_δ^{f_2} v_{n-1}^δ ≥ U_δ v_{n-1}^δ − εµ. Then

    U_δ v_n^δ − U_δ v_{n-1}^δ ≤ L_δ^{f_1} v_n^δ + εµ − L_δ^{f_1} v_{n-1}^δ ≤ y_n^δ β_n^δ µ + εµ ,

and

    U_δ v_n^δ − U_δ v_{n-1}^δ ≥ L_δ^{f_2} v_n^δ − L_δ^{f_2} v_{n-1}^δ − εµ ≥ z_n^δ α_n^δ µ − εµ .

Since ε was chosen arbitrarily this implies part (ii). Part (iii) follows similarly by using the definition of U_{δ,ε}. □
15
6 € l:i. •, v 0 € V and suppose U0 ,O is defined. Let the se0
quence v be defined as in lemma S.2.5 then
LEMMA 7.1. 2. Let
n
{i)
0 0 > O,. 0 0 l
"n
"n- > 0 and thus
{ii)
~n
No
<
v0
15 = yn-l
6
= Yn
0
0
0 • an-i
< 0 and thus v 0 • z n
=
PROOF.
(i)
(ii)
O >a
6
n
0
6
n
n- 1
= inf [ (v (i) - v
irO
(i)) µ
-1
{i)] =
D
0
€ V and suppose U ,
0 0 is defined. Let the
0
sequence v! be defined as in lemma 5.2.S then
COROLLARY 7. 1. 1. Let
o€
A' , v
98
(ii)
6
0
PROOF. For Bn > 0 and a.n < 0 the proof of part (i) and (ii) follows from
the foregoing two lemmas.
O
The other cases are trivial.
LEMMA 7.1.3. Let δ ∈ Δ', let v_0^δ ∈ V and suppose U_{δ,0} is defined. Let the sequence v_n^δ be defined as in lemma 5.2.5.
(i)   The sequence u_n^δ (n ≥ 1) defined by

          u_n^δ := v_n^δ + y_n^δ (1 − y_n^δ)^{-1} β_n^δ µ

      yields monotone nonincreasing upper bounds for V* and thus for V_{f_n}. Moreover, u_n^δ ↓ V*.
(ii)  The sequence ℓ_n^δ (n ≥ 1) defined by

          ℓ_n^δ := v_n^δ + z_n^δ (1 − z_n^δ)^{-1} α_n^δ µ

      yields monotone nondecreasing lower bounds for V_{f_n} and thus for V*. Moreover, ℓ_n^δ ↑ V*.
(iii) ℓ_n^δ ≤ V_{f_n} ≤ V* ≤ u_n^δ .

PROOF.

    U_δ^N v_n^δ = v_n^δ + (U_δ^N v_n^δ − U_δ^{N-1} v_n^δ) + (U_δ^{N-1} v_n^δ − U_δ^{N-2} v_n^δ) + ⋯ + (U_δ v_n^δ − v_n^δ)
                ≤ v_n^δ + Σ_{k=1}^N (y_n^δ)^k β_n^δ µ ,

see lemma 7.1.1(ii). So for N → ∞ we find V* ≤ u_n^δ. The monotonicity of u_n^δ follows from lemma 7.1.1(ii). The statements about the lower bounds follow similarly by considering (L_δ^{f_n})^N v_{n-1}^δ. The convergence of u_n^δ and ℓ_n^δ to V* follows from lemma 5.2.5. Part (iii) follows by inspection. □
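For the standard operator (δ corresponding to G_1 and µ ≡ 1, so that y_n^δ and z_n^δ reduce to extreme row sums) the bounds of lemma 7.1.3 take the familiar MacQueen form. The following Python sketch is an illustration under these simplifying assumptions only; the scalar gamma plays the role of y_n^δ = z_n^δ and is an input chosen by the user, not a quantity taken from the thesis.

    import numpy as np

    def bounds(v_prev, v_next, gamma):
        """MacQueen-type extrapolation bounds (lemma 7.1.3 with mu = 1 and
        y = z = gamma): returns vectors (lower, upper) with lower <= V* <= upper."""
        diff = v_next - v_prev
        alpha, beta = diff[1:].min(), diff[1:].max()   # state 0 is excluded
        lower = v_next + gamma / (1.0 - gamma) * alpha
        upper = v_next + gamma / (1.0 - gamma) * beta
        return lower, upper

    # Usage inside a successive approximation loop (v = U v_prev):
    #   lower, upper = bounds(v_prev, v, gamma)
    #   stop as soon as np.max(upper - lower) is below the desired accuracy.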
REMARK 7.1,1,
{i)
Note that for
o chosen
corresponding to G the bounds
1
the sequence vn as defined in lemma 4.3.7.
may
be used for
7,1,3 the starting vector
0
0
0
v is chosen such that u v :?:: v then it is easily verified that
0
6 0
0
0
0
0
0
a 0 ~ O for all n ~ 1. This implies z = z
and y = y
for all
n
n
n- 1
n
n- 1
n > 1.
(ii) If in the situation as described in lemma
LEMMA 7.1.4. Let the sequence v_n^δ be defined as in lemma 5.2.6; then
(i)  The sequence w_n^δ (n ≥ 1) defined by

         w_n^δ(i) := min{ w_{n-1}^δ(i) , u_n^δ(i) + (1 - y_n^δ)^{-1} ε (1 - ν_δ) μ(i) } ,   n > 1 , i ∈ S\{0} ,

     where u_n^δ is defined as in lemma 7.1.3, yields monotone nonincreasing upper bounds for v* and thus for V_{f_n}.
(ii) The sequence x_n^δ (n ≥ 1) defined by

         x_n^δ(i) := max{ x_{n-1}^δ(i) , ℓ_n^δ(i) } ,   n > 1 , i ∈ S\{0} ,

     where ℓ_n^δ is defined as in lemma 7.1.3, yields monotone nondecreasing lower bounds for v*.

PROOF. The proof proceeds along the same lines as the proof of the foregoing lemma. We first remark that v_n^δ - v_{n-1}^δ ≥ 0 for all n, since

    v_n^δ(i) ≥ max{ v_{n-1}^δ(i) , U_δ v_{n-1}^δ(i) - ε(1 - ν_δ) μ(i) } ,

so for all n > 1

    y_n^δ = y_{n-1}^δ = ν_δ   and   z_n^δ = z_{n-1}^δ .

Now consider U_δ^N v_n^δ as in the proof of lemma 7.1.3; for N → ∞ we obtain the requested result. For the lower bounds we have the analogous expansion with (L_δ^{f_n})^N v_{n-1}^δ, so for N → ∞ we obtain the lower bound. The monotonicity of w_n^δ, x_n^δ follows from the definition of w_n^δ and x_n^δ, whereas ||w_n^δ - x_n^δ|| ≤ 2ε(1 - ν_δ)^{-1} follows from lemma 5.2.6.  □

REMARK 7.1.2. From the foregoing proof it will be clear that ℓ_n^δ is a lower bound for V_{f_n}.
LEMMA 7.1.5. Let v_n^δ be defined as in lemma 5.2.7; then
(i)  the sequence w_n^δ (n ≥ 1) defined by

         w_n^δ(i) := min{ w_{n-1}^δ(i) , u_n^δ(i) + (ν_δ)^n (1 - y_n^δ)^{-1} μ(i) } ,   n > 1 , i ∈ S\{0} ,

     where u_n^δ is defined as in lemma 7.1.3, yields monotone nonincreasing upper bounds for v* and thus for V_{f_n};
(ii) the sequence x_n^δ (n ≥ 1) defined by

         x_n^δ(i) := max{ x_{n-1}^δ(i) , ℓ_n^δ(i) } ,   n > 1 , i ∈ S\{0} ,

     where ℓ_n^δ is defined as in lemma 7.1.3, yields monotone nondecreasing lower bounds for v*.
(iii)

PROOF. The proof of the first two parts proceeds in a similar way as the proofs of the foregoing lemmas, whereas the final part follows from lemma 5.2.7.  □

REMARK 7.1.3.
(i)   A lower bound for V_{f_n} may be found in a similar way as in lemma 7.1.4 (see also remark 7.1.2).
(ii)  Note that if δ is chosen corresponding to G_1 then the bounds may be used for the sequences v_n and f_n as defined in lemma 4.3.8.
(iii) Upper and lower bounds for the sequences v_n^{δλ} as defined in chapter 6 can be obtained in a similar way. We will illustrate this by giving the bounds for the sequences v_n^{δλ} as defined in lemma 6.1.4, i.e.

          u_n^{δλ} := v_n^{δλ} + (1 - y_n^{δλ})^{-1} β_n^{δλ} μ ,

      where

          β_n^{δλ} := sup_{i≠0} [(L_δ^{f_{n+1}} v_n^{δλ}(i) - v_n^{δλ}(i)) μ^{-1}(i)] ,
          α_n^{δλ} := inf_{i≠0} [(L_δ^{f_{n+1}} v_n^{δλ}(i) - v_n^{δλ}(i)) μ^{-1}(i)] ,

      and y_n^{δλ} and z_n^{δλ} are defined by means of α_n^{δλ} and β_n^{δλ} in a similar way as y_n^δ and z_n^δ.
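For a finite state space the bounds of lemma 7.1.3 can be computed alongside the iterations at negligible cost. The following Python sketch illustrates the idea for the simplest case, δ corresponding to G_1 (ordinary value iteration), b ≡ 0 and a strictly positive bounding function μ whose weighted norm of every transition matrix is below one; the data layout (a list of reward vectors r[a] and transition matrices P[a]) and the function name are hypothetical and serve only as an illustration.

import numpy as np

def value_iteration_with_bounds(r, P, mu, n_iter=50):
    """Value iteration v_n = max_a [r_a + P_a v_{n-1}] together with the
    upper and lower bounds of section 7.1 (delta corresponding to G_1).
    r[a] is the reward vector and P[a] the transition matrix of action a;
    mu is the (positive) bounding function on the states."""
    n_states = len(mu)
    # nu: the contraction radius sup_a sup_i mu(i)^{-1} sum_j p^a(i,j) mu(j)
    nu = max((Pa @ mu / mu).max() for Pa in P)
    v_old = np.zeros(n_states)
    for _ in range(n_iter):
        v_new = np.max([r[a] + P[a] @ v_old for a in range(len(P))], axis=0)
        diff = (v_new - v_old) / mu
        alpha, beta = diff.min(), diff.max()
        # z_n and y_n as in definition 7.1.1 (inf-inf expression or nu)
        low = min((Pa @ mu / mu).min() for Pa in P)
        z = low if alpha >= 0 else nu
        y = nu if beta >= 0 else low
        upper = v_new + y / (1.0 - y) * beta * mu    # u_n of lemma 7.1.3
        lower = v_new + z / (1.0 - z) * alpha * mu   # l_n of lemma 7.1.3
        v_old = v_new
    return lower, v_new, upper

In the sense of lemma 7.1.3 the returned vectors bracket the value vector, lower ≤ v* ≤ upper, and the gap shrinks geometrically.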
7.2. The suboptimality of Markov policies and suboptimal actions
We first want to remark once more that we still restrict ourselves to transition memoryless nonzero stopping times.
DEFINITION 7.2.1.
(i)  A Markov policy f ∈ F is called suboptimal if f^∞ is not optimal.
(ii) A Markov policy f ∈ F is called ε-suboptimal if f^∞ is not ε-optimal.
LEMMA 7.2.1. Suppose u, ℓ ∈ V are an upper bound and a lower bound for the value vector v*, and let δ ∈ Δ'; then
(i)  [L_δ^f u < U_δ ℓ]  ⟹  f is suboptimal ;
(ii) [L_δ^f u < U_δ ℓ - εμ]  ⟹  f is ε-suboptimal .

PROOF.
(i) In chapter 5 we have proved that f ∈ F is optimal if and only if L_δ^f v* = v*; moreover we have proved U_δ v* = v*. However,

    L_δ^f v* ≤ L_δ^f u < U_δ ℓ ≤ U_δ v* = v* ;

consequently f is suboptimal.
(ii) Iterating in this way yields that V_f < v* - εμ, so f is ε-suboptimal.  □

It would also be nice to make assertions as to the (ε-)suboptimality of actions.
DEFINITION 7.2.2.
(i)  An action a_0 ∈ A(i) is said to be suboptimal with respect to state i ∈ S if there exists no optimal policy f ∈ F with f(i) = a_0.
(ii) An action a_0 ∈ A(i) is said to be ε-suboptimal with respect to state i ∈ S if there exists no ε-optimal policy f ∈ F with f(i) = a_0.

LEMMA 7.2.2. Let δ ∈ Δ', and let u, ℓ ∈ V be an upper bound and a lower bound for v*. Let i ∈ S. For each f ∈ F we define the set F_i^f ⊂ F by

    F_i^f := { f' ∈ F | f'(j) = f(j) for all j ≠ i } ;

then
(i)  The action a ∈ A(i) is suboptimal with respect to i ∈ S if for each f ∈ F with f(i) = a there exists an f' ∈ F_i^f such that (L_δ^f u)(i) < (L_δ^{f'} ℓ)(i).
(ii) The action a ∈ A(i) is ε-suboptimal with respect to i ∈ S if for each f ∈ F with f(i) = a there exists an f' ∈ F_i^f such that (L_δ^f u)(i) < (L_δ^{f'} ℓ)(i) - εμ(i).

PROOF. The proof proceeds in a similar way as the proof of the foregoing lemma.  □
The assertions of the foregoing lemma may also be expressed in terms of maximizing over the decisions in state i ∈ S as follows.

COROLLARY 7.2.1. Let δ ∈ Δ', and let u, ℓ ∈ V be an upper bound and a lower bound for v*. Denote by f^a ∈ F a Markov policy with f^a(i) = a; then
(i)  the action a' ∈ A(i) is suboptimal with respect to i ∈ S if for all f^{a'} ∈ F

         (L_δ^{f^{a'}} u)(i) < sup_{a∈A(i)} (L_δ^{f^a} ℓ)(i) .

(ii) The action a' ∈ A(i) is ε-suboptimal with respect to i ∈ S if for all f^{a'} ∈ F

         (L_δ^{f^{a'}} u)(i) < sup_{a∈A(i)} (L_δ^{f^a} ℓ)(i) - εμ(i) .
Especially for computational purposes it would be desirable to maximize component-wise in the determination of U_δ v and U_{δ,ε} v. Sufficient conditions that allow a component-wise maximization may be found in the following lemma.

LEMMA 7.2.3. Let δ ∈ Δ' and suppose (eventually after renumbering the state space) that

(7.2.1)    δ((i,j)) = 0   for all i, j ∈ S\{0} with j ≥ i ;

then

    (U_δ v)(i) = (1 - δ(i)) v(i) + δ(i) sup_{a∈A(i)} { r(i,a) + Σ_{j<i} p^a(i,j) [(1 - δ(i,j)) v(j) + δ(i,j) U_δ v(j)] + Σ_{j≥i} p^a(i,j) v(j) } .

PROOF. The proof of the lemma follows directly by analysing the transformed problem as described in the proof of theorem 5.2.2.  □
REMARK 7.2.1. The transition memoryless nonzero stopping times corresponding to G_H and G_R as described in the examples 2.2.2 and 2.2.4, and the stopping time δ corresponding to G_1, satisfy the conditions of the foregoing lemma.
In the next section we will consider in more detail situations as described in formula (7.2.1). This will be done for finite state space Markov decision processes.
LEMMA 7.2.4. Let δ ∈ Δ', let ℓ, u ∈ V be a lower and an upper bound for v* respectively, and suppose (7.2.1) is satisfied; then
(i)  the action a_0 ∈ A(i) is suboptimal if

         r(i,a_0) + Σ_{j<i} p^{a_0}(i,j) [(1 - δ(i,j)) u(j) + δ(i,j) U_δ u(j)] + Σ_{j≥i} p^{a_0}(i,j) u(j)
           <  sup_{a∈A(i)} { r(i,a) + Σ_{j<i} p^a(i,j) [(1 - δ(i,j)) ℓ(j) + δ(i,j) U_δ ℓ(j)] + Σ_{j≥i} p^a(i,j) ℓ(j) } .

(ii) The action a_0 ∈ A(i) is ε-suboptimal if the first term under (i) is smaller than the second term minus εμ(i)(δ(i))^{-1}.

PROOF. The proof follows by inspection.  □
REMARK 7.2.2.
(i)   For δ corresponding to G_1 the condition required in lemma 7.2.4(i) reduces to

          r(i,a_0) + Σ_{j∈S} p^{a_0}(i,j) u(j)  <  sup_{a∈A(i)} { r(i,a) + Σ_{j∈S} p^a(i,j) ℓ(j) } .

(ii)  The latter suboptimality criterion can be expressed more explicitly if the upper and lower bounds are given in terms of v_n and v_{n-1}, see e.g. MacQueen [47].
(iii) Better criteria can be obtained if we impose additional conditions on the transition probabilities, see e.g. Hübner [36].
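In the finite state space case the criterion of remark 7.2.2(i) fits naturally into a value-iteration loop. A minimal Python sketch, assuming the same hypothetical data layout as before and vectors upper and lower bounding v* from above and below:

import numpy as np

def suboptimal_actions(r, P, upper, lower):
    """Marks action a as suboptimal in state i when
    r(i,a) + sum_j p^a(i,j) u(j) < max_a' [ r(i,a') + sum_j p^a'(i,j) l(j) ],
    cf. remark 7.2.2(i); such actions can be skipped in later iterations."""
    lhs = np.array([r[a] + P[a] @ upper for a in range(len(P))])   # one row per action
    rhs = np.max([r[a] + P[a] @ lower for a in range(len(P))], axis=0)
    return lhs < rhs   # boolean array [action, state]: True = suboptimal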
7.3. Some remarks on finite state space Markov decision processes
For practical purposes (finite state space) Markov decision processes become more and more important. Many practical problems such as inventory problems, replacement problems and marketing problems may be described by Markov decision models. We refer to e.g. Scarf [63], Tijms [67], Howard [35], Hastings [25] and Wessels and van Nunen [73].
Sometimes it will be self-evident to approximate Markov decision processes with a countable state space by finite state space processes; this is done e.g. by Fox [19].
For the solution of Markov decision problems in the finite state case, linear programming [13], [49], and policy iteration may be used. If the underlying decision space contains only a finite number of elements, both methods yield an optimal solution in a finite number of steps. However, difficulties arise if the state space is large. For example, the policy iteration method requires in each iteration step the solution of a system of
linear equations of the size of the number of states. For solving Markov
decision problems with a large state space, successive approximation methods that avoid the solution of the large systems of linear equations,
become preferable. This is especially true if the concept of extrapolation is used to construct upper and lower bounds, see Schellhaas [64], MacQueen [46], Porteus [59], Finkbeiner and Runggaldier [18], Das Gupta [22] and the present author [54].
As we have seen upper and lower bounds enable us to incorporate a test for
the suboptimality of actions. Such a test may also reduce the required computational effort considerably, see e.g. Grinold [20], MacQueen [47],
Porteus [58], Hastings and Mello [27], and Hübner [36].
Hinderer [32] derived in a similar way bounds for finite stage dynamic programs in the case b ≡ 0, δ corresponding to G_1 and the usual supremum norm. In [31] Hinderer extended his results by using weighted supremum norms as introduced by Wessels [74].
A number of the described techniques for solving Markov decision problems
with respect to the total reward criterion can be used in a modified way
for solving these problems with respect to the average reward criterion,
see Odoni [57], White [77], Schweitzer [65], Morton [51], Veinott [69] and
van der Wal [70]. Also for Markov games similar techniques can be used; see
van der Wal [71], [72]. Since periodic Markov decision processes (see e.g. Carton [7], Riis [61]) can be described as ordinary Markov decision processes by incorporating the period in the state definition, the same holds for such processes.
For Markov decision processes with a finite state space the assumptions 4.1.1, 4.2.1-4.2.4 reduce in fact to the well-known assumptions (n-stage contraction), see for instance Denardo [12] or Porteus [59]. For finite state space problems van Hee and Wessels [29] have proved the equivalence between the existence of a bounding function μ' such that ||P_f||_{μ'} ≤ ρ < 1 for all f ∈ F and N-stage contraction. We will discuss the concept of N-stage contraction in chapter 8.
The use of a bounding function may be interpreted as the transformation of the problem into an equivalent one (in a similar way as Porteus did in his recent paper [59] with the similarity transformations). This can be done by defining transformed rewards r̃(i,a) := μ^{-1}(i) r(i,a) and transformed transition probabilities p̃^a(i,j) := μ^{-1}(i) p^a(i,j) μ(j). Then the optimal value vector of the transformed problem (Ṽ*) equals Ṽ* = μ^{-1}·V*, with V* the optimal return of the original problem, as is straightforwardly verified.
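A minimal Python sketch of this similarity transformation for a finite state space (the data layout is hypothetical):

import numpy as np

def transform(r, P, mu):
    """Transformed rewards r~(i,a) = mu(i)^{-1} r(i,a) and transition
    probabilities p~^a(i,j) = mu(i)^{-1} p^a(i,j) mu(j); the optimal value
    vector of the transformed problem equals mu^{-1} * V* of the original one."""
    D_inv = np.diag(1.0 / mu)
    D = np.diag(mu)
    r_t = [D_inv @ ra for ra in r]
    P_t = [D_inv @ Pa @ D for Pa in P]
    return r_t, P_t

Value iteration on the transformed data then corresponds to value iteration on the original data measured in the μ-norm.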
Also the use of stopping times for successive approximation procedures may be viewed as investigating a transformed problem. As proved in the preceding chapters, the optimal return vector of the transformed problem then equals V* of the original problem. For finite state space Markov decision processes the last mentioned transformation corresponds to the pre-inverse transformation as introduced by Porteus [59].
Let the norm of the process be defined by sup_{f∈F} ||P_f|| and let the spectral radius of the process be defined by sup_{f∈F} γ_f, with γ_f the spectral radius of P_f (its maximum absolute eigenvalue); then it will be clear that the choice of an adequate transition memoryless nonzero stopping time yields a transformed process with a reduced spectral radius. The choice of the μ-function, however, may influence the norm of the transformed process but does not alter the spectral radius.
A reduction of the spectral radius improves the performance of the value iteration, while a reduction of the norm of the process (using an adequate μ-function) may improve the bounds introduced in section 7.1.
For computational purposes it will be desirable that for each f ∈ F and each δ ∈ Δ' the matrix with (i,j)-th entry equal to p^{f(i)}(i,j)δ(i,j) possesses the triangular structure as described in formula (7.2.1). As mentioned earlier, if the stopping times corresponding with G_1, G_R, G_H are used then (7.2.1) is satisfied. For δ corresponding to G_1 the approximation procedure described in lemma 4.3.6 results in

    ∀i∈S :  v_n(i) = max_{a∈A(i)} { r(i,a) + Σ_{j∈S} p^a(i,j) v_{n-1}(j) } .
For δ corresponding to G_R the procedure described in lemma 5.2.4 may be represented more explicitly by

    v_n(i) = max_{a∈A(i)} (1 - p^a(i,i))^{-1} { r(i,a) + Σ_{j≠i} p^a(i,j) v_{n-1}(j) } ,

which is the well-known Jacobi iteration, see e.g. Porteus [59]. By choosing δ such that δ corresponds to G_R with the exception of δ(i,i), i ∈ S, for which it is allowed that δ(i,i) < 1 for each i ∈ S\{0}, we get a successive overrelaxation procedure. An arbitrary choice of δ such that (7.2.1) is satisfied may yield a combination of overrelaxation and Gauss-Seidel-like procedures.
For δ corresponding to G_H the procedure as described in lemma 5.2.4 can be executed component-wise.
Of course the way in which the state space is ordered may influence the rate of convergence of the process, see e.g. Kushner and Kleinman [41].
As described in the previous sections all those methods allow for the construction of lower and upper bounds and for the use of suboptimality criteria. Which procedure is preferable for solving a given finite state space Markov decision process depends on the problem under consideration. If we want to compare two different procedures it will be necessary to compare the corresponding sequences of upper and lower bounds. However, where the estimates for V* in the n-th iteration step of a specific algorithm may be better than those of another algorithm (other stopping time), this does not mean, unfortunately, that it is possible to construct in an easy way bounds that are better too. We have illustrated this phenomenon by some examples in [54]. Moreover, the question whether the additional effort spent on the computation by more sophisticated stopping times and the computational gain in the number of required iterations are evenly matched cannot be answered in general. For numerical experiences in these directions we refer to Porteus [58], Schellhaas [64] and Kushner and Kleinman [41].
The use of value oriented procedures, as described in chapter 6, for solving finite state and decision space Markov decision problems may yield a considerable gain in computational effort. This is the case if the policy improvement procedure requires many operations, i.e. if the total number of iterations is large compared with the number of states. Furthermore, in practice optimal policies are achieved after a relatively small number of iteration steps, whereas a nonnegligible number of iterations is still required to satisfy the convergence criterion. Especially for the last reason it will be profitable to adapt the value of λ during the iteration process. For numerical experiences with the value oriented method in the case δ corresponding to G_1 we refer to [53].
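A compact Python sketch of one such value oriented step, for δ corresponding to G_1 and a hypothetical finite data layout; lam plays the role of λ, the number of extra applications of the current policy's operator:

import numpy as np

def value_oriented_step(r, P, v, lam):
    """One iteration of a value oriented method in the spirit of chapter 6:
    determine the maximizing policy f for v and then apply L_f a total of
    lam times instead of once."""
    q = np.array([r[a] + P[a] @ v for a in range(len(P))])
    f = np.argmax(q, axis=0)                  # policy improvement
    for _ in range(lam):                      # lam applications of L_f
        v = np.array([r[f[i]][i] + P[f[i]][i] @ v for i in range(len(v))])
    return f, v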
CHAPTER 8
N-STAGE CONTRACTION
In chapter 4 we have in fact required one stage contraction with respect to the bounding function μ, i.e.

    ∀f∈F :  P_f μ ≤ ρμ   with ρ < 1 .

This assumption may be weakened to the case of the so-called N-stage contraction. The extension is comparable with Denardo's [12] concept of N-stage contraction in its strengthened form, introduced for the unweighted supremum norm case. We will prove that if the N-stage contraction assumption in its strengthened form is satisfied instead of assumption 4.2.2, this implies the existence of a bounding function μ' such that the assumptions 4.2.1-4.2.2 are satisfied when μ is replaced by μ'. Moreover we will indicate the convergence (in μ-norm) of the sequences v_n, v_n^δ, v_n^{δλ} as defined in the preceding chapters under the N-stage contraction assumption in its strengthened form.
First we will introduce the strengthened N-stage contraction assumption.

ASSUMPTION 8.1.1.
(i)  ∃_{M∈ℝ+} ∀_{f∈F} :  P_f μ ≤ Mμ ;
(ii) ∃_{N∈ℕ} ∃_{ρ*<1} ∀_{f_0,...,f_{N-1}∈F} :  P_{f_0} P_{f_1} ··· P_{f_{N-1}} μ ≤ ρ*μ .
THEOREM 8.1.1. Suppose the assumptions 4.2.1, 4.2.3, 4.2.4 and 8.1.1 are satisfied; then there exists a bounding function μ' such that
(i)   μ' ≥ μ ;
(ii)  ∃_{α<1} ∀_{f∈F} :  ||P_f||_{μ'} ≤ α ;
(iii) ∃_{M∈ℝ+} ∀_{f∈F} :  ||P_f b||_{μ'} ≤ M .

PROOF. Choose α such that ρ* < α^N < 1, where ρ* < 1 follows from assumption 8.1.1, and define μ' by

    μ' := sup_{π∈M} Σ_{n=0}^{∞} α^{-n} P_n(π) μ .

However, since P_{f_0} ··· P_{f_{N-1}} μ ≤ ρ*μ it holds that

    P_{f_{m+1}} ··· P_{f_{m+nN}} μ ≤ (ρ*)^n μ .

Hence

    μ' ≤ sup_{π∈M} Σ_{m=0}^{N-1} P_m(π) (1 - ρ* α^{-N})^{-1} α^{-m} μ .

On the other hand, for each f ∈ F,

    P_f μ' ≤ α sup_{π∈M} Σ_{n=1}^{∞} α^{-n} P_n(π) μ ≤ αμ' ,

as follows from the definition of μ' and the fact that Markov strategies dominate all decision rules; for a proof see van Hee [28]. We also refer to van Hee and Wessels [29].
This implies αμ' ≥ P_f μ' for all f ∈ F, so ∀f∈F : ||P_f||_{μ'} ≤ α < 1. From the definition of μ' we see μ' ≥ μ. Now the proof of part (i) and (iii) is trivial.  □

The foregoing theorem proves that the strengthened N-stage contraction assumption implies the existence of a bounding function μ' such that for this new bounding function the assumptions 4.2.1-4.2.4 are satisfied.
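For a finite number of states and policies the strengthened N-stage contraction condition can simply be checked numerically. The Python sketch below evaluates the weighted μ-norm of all N-fold products of the matrices P_f; the data layout is hypothetical and the exhaustive enumeration is of course only feasible for small instances.

import numpy as np
from itertools import product

def n_stage_contraction(P_list, mu, N):
    """Returns the maximum over all policy sequences f_0,...,f_{N-1} of
    || P_{f_0} ... P_{f_{N-1}} ||_mu ; a value smaller than 1 verifies the
    strengthened N-stage contraction for the given matrices."""
    worst = 0.0
    for seq in product(range(len(P_list)), repeat=N):
        M = np.eye(len(mu))
        for f in seq:
            M = M @ P_list[f]
        worst = max(worst, (M @ mu / mu).max())
    return worst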
REMARK 8.1.1. Also Lippman [45] gives conditions for N-stage contraction. However, it is proved in van Nunen and Wessels [56] that these conditions imply the existence of a bounding function μ' such that for μ' the assumptions 4.2.1-4.2.4, with b ≡ 0, are satisfied.
On one hand one could say that the strengthened N-stage contraction assumption yields no real extension, since it guarantees the existence of a μ' with the described properties. On the other hand the strengthened assumption 8.1.1 also guarantees the convergence of the sequences v_n, v_n^δ, v_n^{δλ}, as defined in the previous chapters, with respect to the original μ-norm. This will be indicated in the sequel of this section by proving that the above statement is true for the sequences as defined in lemmas 4.3.7 and 5.2.5.
THEOREM 8.1.2. Suppose the assumptions 4.2.1, 4.2.3, 4.2.4 and 8.1.1 are satisfied; then for each f ∈ F we have
(i)   L_f is a mapping of V into V.
(ii)  (L_f)^N is a contraction mapping of V into V with contraction radius ρ*, where N and ρ* follow from assumption 8.1.1.
(iii) For each v_0^f ∈ V the sequence v_n^f defined by v_n^f := L_f v_{n-1}^f converges to V_f (in μ-norm).
(iv)  ||v_{nN}^f - V_f|| ≤ (ρ*)^n ||v_0^f - V_f|| .

PROOF. The proof of the first two parts of the lemma is straightforward. To prove part (iii), note that if L_f has a fixed point it has to be the fixed point of (L_f)^N. Let v be the fixed point of (L_f)^N; then v is a fixed point of L_f, as follows from

    (L_f)^N (L_f v) = L_f (L_f)^N v = L_f v .

That v_n^f → v (in μ-norm) for n → ∞ is proved by choosing ε > 0 and k ∈ ℕ such that c (ρ*)^k ||v_0^f - v|| ≤ ε. Let n > Nk; then n can be written as n = Nk_1 + ℓ with k_1 ≥ k and 1 ≤ ℓ ≤ N, and

    ||v_n^f - v|| = ||(L_f)^ℓ (L_f)^{Nk_1} v_0^f - (L_f)^ℓ v|| ≤ c (ρ*)^{k_1} ||v_0^f - v|| ≤ ε ,

where c := max{1, (sup_{f∈F} ||P_f||)^N}. That v equals V_f is easily verified. Part (iv) follows in the standard way.  □
THEOREM 8.1.3. Suppose the assumptions 4.2.1, 4.2.3, 4.2.4 and 8.1.1 are satisfied and suppose U_{1,0} is defined; then
(i)   U_{1,0} maps V into V.
(ii)  (U_{1,0})^N is a contraction mapping of V into V with contraction radius ρ*, where N and ρ* follow from assumption 8.1.1.
(iii) The sequence v_n defined for v_0 ∈ V by v_n := U_{1,0} v_{n-1} converges to v* (in μ-norm).
(iv)  ||v_{nN} - v*|| ≤ (ρ*)^n ||v_0 - v*|| .

PROOF. The first part of the lemma is trivial. Part (ii) holds since the N-stage contraction assumption 8.1.1 is satisfied. The final parts follow in a similar way as the final parts of the foregoing theorem.  □

We finish this chapter with a theorem concerning the sequence v_n^δ as defined in lemma 5.2.5(iii).
THEOREM 8.1.4. Suppose the assumptions 4.2.1, 4.2.3, 4.2.4 and 8.1.1 are satisfied, let δ be a transition memoryless nonzero stopping time and suppose U_{δ,0} is defined; then
(i)  U_{δ,0} maps V into V.
(ii) Suppose v_0^δ ∈ V is such that U_{δ,0} v_0^δ ≥ v_0^δ; then the sequence v_n^δ defined by v_n^δ := U_{δ,0} v_{n-1}^δ converges to v* in μ-norm.

PROOF. The proof follows directly from the fact that the sequence v_n^δ is monotone and bounded from above by v*, and the fact that v_δ^* = v*.  □
CHAPTER 9
SOME EXPLANATORY EXAMPLES
In section 9.1 we will describe how a number of specific Markov decision
processes, namely discounted Markov decision processes and (discounted)
semi-Markov decision processes are covered by the theory developed in this
monograph.
In section 9.2 we will show how the theory developed in chapter 3 can be
used to generate successive approximation procedures for solving systems of
linear equations of the form
Ax = r ,
where A can be written as A = I - P, with I the identity matrix and P
satisfies the assumptions of chapter 3. Four special cases will be indicated.
In section 9.3 the monograph will be concluded with an example concerning
an inventory problem. The problem is the countable state space discounted equivalent of the inventory problem with average reward criterion as treated by Wijngaard [78].
We start with a remark concerning the applicability of our model. In chapter 4 we have introduced the reward r(i,a) as the immediate return if the system is in state i ∈ S and action a ∈ A is selected. However, in several situations the one stage return will depend on the subsequent transition that occurs as well. So the one stage return can be composed of two parts: an amount r_1(i,a) depending on the actual state i ∈ S and the selected action a ∈ A, and an amount r_2(i,a,j) which depends on the next state (j) that is visited as well. However, we can still use our model if we define r(i,a) as an expected one stage return, i.e.

    r(i,a) := r_1(i,a) + Σ_{j∈S} p^a(i,j) r_2(i,a,j) .
We will now consider discounted Markov decision processes, treated for finite state space, finite decision space Markov decision processes by e.g. Howard [35] and for general state space, general action space Markov decision processes with a bounded reward structure by e.g. Blackwell [5]. The difference between the models we have discussed and discounted models is that in the latter situation the reward earned at time n is weighted by a factor β^n, where β ≥ 0 is the discount factor. We denote the transition probabilities for the discounted process by q^a(i,j) for i,j ∈ S and a ∈ A. For an arbitrary Markov strategy π := (f_0, f_1, ...) we define Q_n(π) with respect to the transition probabilities q^a(i,j) in a similar way as we have defined P_n(π). Then the total expected discounted reward over an infinite time horizon equals

    Σ_{n=0}^{∞} β^n Q_n(π) r_{f_n} .

By incorporating the discount factor β in the transition probabilities, i.e. p^a(i,j) := β q^a(i,j), the total expected discounted reward equals

    Σ_{n=0}^{∞} P_n(π) r_{f_n} .

So if for these redefined transition probabilities and for the reward structure the assumptions of chapter 4 are satisfied, then discounted Markov decision processes are covered by the theory developed in the previous chapters.
Usually this discount factor is supposed to be smaller than one (i.e. 0 ≤ β < 1). However, depending on the transition probabilities it may be allowed that β ≥ 1.
For discounted Markov decision processes with β < 1, b ≡ 0 and with μ(i) = 1 if i ≠ 0, the assumptions of chapter 4 are satisfied if the rewards are bounded.
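A one-line Python sketch of this incorporation (the data layout is hypothetical):

def discounted_to_total_reward(q, beta):
    """Replace q^a(i,j) by p^a(i,j) = beta * q^a(i,j); the defect 1 - beta is
    the probability of absorption in the fictive state 0, and the total expected
    reward of the new process equals the discounted reward of the old one."""
    return [beta * qa for qa in q]

With μ ≡ 1 and bounded rewards the assumptions of chapter 4 then hold as soon as β < 1.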
We now consider semi-Markov decision processes as introduced by Jewell [37], [38].
For semi-Markov decision processes state transitions do not necessarily occur at equidistant points in time. The time (t) between two state transitions is a random variable with a probability distribution function F_{i,j}^a(t). The probabilities of the state transitions are as in chapter 4. So if immediately after a transition the state of the system is i ∈ S and action a ∈ A(i) is selected, then the system's next state will be j with a probability denoted by q^a(i,j), where the q^a(i,j) satisfy assumption 4.1.1. Such a transition will occur before time t_1 with probability

    q^a(i,j) ∫_{0-}^{t_1} dF_{i,j}^a(t) .

We can consider the embedded Markov decision process in discrete time by defining the state of the embedded process at time n (n = 0,1,...) to be the state immediately after the n-th transition of the original decision process (see e.g. Mine and Osaki [50], Ross [62], De Cani [10]).
For semi-Markov decision processes with respect to the total expected reward criterion (without discounting) the total expected reward over an infinite number of transitions (using π = (f_0, f_1, ...) and Q_n(π) as we did before) equals

    Σ_{n=0}^{∞} Q_n(π) r_{f_n} .

So provided that for the embedded process the assumptions of chapter 4 are satisfied, the theory of the preceding chapters can be applied.
In the theory of semi-Markov decision processes with discounting one assumes that rewards incurred at time t are discounted by a factor β^t. In a similar way as in the ordinary discounted case the discount factor can be incorporated in the transition probabilities for the embedded process, i.e.

    p^a(i,j) := q^a(i,j) ∫_{0-}^{∞} β^t dF_{i,j}^a(t) .
9.2. The solution of systems of linear equations
Suppose the following system of linear equations has to be solved

(9.2.1)    A x = r

with A := I - P, where I is the identity matrix and P satisfies the conditions as imposed in chapter 3 with μ(i) := 1, i ≠ 0, and b(i) := 0 for i ∈ S.
For each δ ∈ Δ we have proved in chapter 3 that the sequence

    v_n^δ := L_δ v_{n-1}^δ

converges to the solution of (9.2.1) if and only if δ is a nonzero stopping time.
So for each nonzero stopping time we have a successive approximation method for solving (9.2.1). For some specific stopping times these methods are already known from numerical mathematics (see e.g. Varga [68]). This will be shown in this section. So the convergence of these numerical methods to the solution of the system of linear equations (9.2.1) follows at one blow from theorem 3.2.1.
EXAMPLE 9.2.1. Let δ ∈ Δ' correspond to the goahead set G_R. We have for each i ∈ S

(9.2.2)    v_n^δ(i) := L_δ v_{n-1}^δ(i) = (1 - p(i,i))^{-1} r(i) + (1 - p(i,i))^{-1} Σ_{j≠i} p(i,j) v_{n-1}^δ(j)

with v_0^δ ∈ V.
We define D as the diagonal matrix with diagonal entries (1 - p(i,i)) and define F and E as the strictly upper and strictly lower triangular parts of P. Then clearly (I - P) can be expressed by

    I - P = D - E - F .

Now v_n^δ can be given by

    v_n^δ := D^{-1}(E + F) v_{n-1}^δ + D^{-1} r .

This iterative method is known as the point Jacobi or point total step method (see Varga [68]).
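A Python sketch of the resulting iteration, for a hypothetical P and r satisfying the conditions of chapter 3:

import numpy as np

def point_jacobi(P, r, n_iter=100):
    """Point Jacobi (total step) iteration for (I - P)x = r:
    x_n = D^{-1}(E + F) x_{n-1} + D^{-1} r with D = diag(1 - p(i,i))
    and E + F the off-diagonal part of P."""
    D = 1.0 - np.diag(P)                 # diagonal of I - P
    off = P - np.diag(np.diag(P))        # E + F
    x = np.zeros(len(r))
    for _ in range(n_iter):
        x = (off @ x + r) / D
    return x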
EXAMPLE 9.2.2. Let δ ∈ Δ' correspond to the goahead set G_H with the exception of δ(i,i), which is equal to 1 for all i ∈ S (i.e. ∀i∈S δ(i,i) = 1); then (as is easily verified, see also example (iii) on page 32) v_n^δ is given component-wise by

(9.2.3)    v_n^δ(i) := (1 - p(i,i))^{-1} r(i) + (1 - p(i,i))^{-1} Σ_{j<i} p(i,j) v_n^δ(j) + (1 - p(i,i))^{-1} Σ_{j>i} p(i,j) v_{n-1}^δ(j) .

By using the matrices D, E and F as defined in the previous example, formula (9.2.3) can be given by

    (D - E) v_n^δ = F v_{n-1}^δ + r

or alternatively by

    v_n^δ = (D - E)^{-1} F v_{n-1}^δ + (D - E)^{-1} r .

This iteration method is known as the point Gauss-Seidel or point single step iterative method.
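The corresponding Python sketch of (9.2.3), updating the components in the order of the state numbering (hypothetical data):

import numpy as np

def point_gauss_seidel(P, r, n_iter=100):
    """Point Gauss-Seidel (single step) iteration for (I - P)x = r:
    new values x(j), j < i, are used as soon as they are available."""
    n = len(r)
    x = np.zeros(n)
    for _ in range(n_iter):
        for i in range(n):
            s = P[i, :i] @ x[:i] + P[i, i+1:] @ x[i+1:]
            x[i] = (r[i] + s) / (1.0 - P[i, i])
    return x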
EXAMPLE 9.2.3. Let δ ∈ Δ' be the transition memoryless stopping time such that δ corresponds to the goahead set G_H with the exception of δ(i) and δ(i,i), which are chosen arbitrarily but such that

    ∀_{i,j∈S\{0}} :  δ(i,i)(1 - p(i,i)) / (1 - δ(i,i) p(i,i)) = δ(j,j)(1 - p(j,j)) / (1 - δ(j,j) p(j,j)) ,   and   δ(i) = δ(i,i) .

Then v_n^δ can be expressed component-wise by

(9.2.4)    v_n^δ(i) = δ(i)(1 - δ(i) p(i,i))^{-1} r(i) + (1 - δ(i))(1 - δ(i) p(i,i))^{-1} v_{n-1}^δ(i)
                      + δ(i)(1 - δ(i) p(i,i))^{-1} Σ_{j<i} p(i,j) v_n^δ(j) + δ(i)(1 - δ(i) p(i,i))^{-1} Σ_{j>i} p(i,j) v_{n-1}^δ(j) .

By defining ω := δ(i)(1 - p(i,i)) / (1 - δ(i) p(i,i)) and using the matrices D, E, F, formula (9.2.4) can be given by

    (D - ωE) v_n^δ = ωr + ((1 - ω)D + ωF) v_{n-1}^δ .

This is known as the point successive overrelaxation method.
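A Python sketch of (9.2.4) with a constant relaxation factor ω (hypothetical data; ω = 1 gives the previous example back):

import numpy as np

def point_sor(P, r, omega, n_iter=100):
    """Point successive overrelaxation for (I - P)x = r; omega corresponds to
    the stopping functions delta(i) = delta(i,i) chosen as in example 9.2.3."""
    n = len(r)
    x = np.zeros(n)
    for _ in range(n_iter):
        for i in range(n):
            s = P[i, :i] @ x[:i] + P[i, i+1:] @ x[i+1:]
            gs = (r[i] + s) / (1.0 - P[i, i])       # Gauss-Seidel value
            x[i] = (1.0 - omega) * x[i] + omega * gs
    return x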
EXAMPLE 9.2.4. Suppose the state space is partitioned in blocks B_1 = {1,2,...,n_1}, B_2 = {n_1+1,...,n_2} and so on. Let P be given by

    P := ( P_{1,1}  P_{1,2}  ...  P_{1,n} )
         ( P_{2,1}  P_{2,2}  ...  P_{2,n} )
         (   ...      ...    ...    ...   )
         ( P_{n,1}  P_{n,2}  ...  P_{n,n} )

where P_{n,m} contains the entries p(i,j) with i ∈ B_n and j ∈ B_m. We define the matrices D, E and F by

    D := I - diag(P_{1,1}, P_{2,2}, ..., P_{n,n}) ,

E := the strictly lower block triangular part of P, and F := the strictly upper block triangular part of P.
Now let δ ∈ Δ' be the transition memoryless nonrandomized stopping time with δ(i) = 1 for all i ∈ S, and δ(i,j) = 1 if j = 0 or if there is some n ∈ ℕ such that i, j ∈ B_n; let δ(i,j) = 0 elsewhere. Now it is easily verified that v_n^δ := L_δ v_{n-1}^δ can be given by

    D v_n^δ = (E + F) v_{n-1}^δ + r .

This iterative method is known as the block Jacobi iterative method (see e.g. Varga [68]).
From the foregoing examples it will be clear that the so-called block successive overrelaxation iterative method can be found by combining the ideas used in examples 9.2.3 and 9.2.4. Of course several other options (other choices of δ) are available. The previous examples are only given to illustrate how the concept of stopping time can be used to generate iterative methods for solving systems of linear equations of the form (9.2.1).
9.3. An inventory problem
The inventory problem we deal with in this final section will not show the full strength of the described methods. It only illustrates the power of the concept of bounding function (weighted supremum norm). Therefore, in the remaining part of this section, the vector b is taken equal to the zero vector (∀i∈S b(i) = 0). In our example we treat the discrete analogue of the inventory problem studied by Wijngaard [78]. However, we will not investigate whether optimal strategies exhibit a specific structure or not.
We will first describe what we mean here by an inventory problem. An inventory problem consists of
(i)   A state space containing the allowed inventory levels. Here, we assume the allowed inventory levels to be integers. We allow for backlogging, which explains why the inventory levels can be negative.
(ii)  Decision spaces A(i) containing for each inventory level i ∈ ℤ the possible orders. We assume the orders to be nonnegative integers. So A(i) might be given by e.g. A(i) = {0,1,...,N} if at level i all orders of size 0 up to N are allowed. We assume the leadtime to be zero.
(iii) A distribution function of the demand (d) per period. We assume this demand to be nonnegative. For k ∈ ℤ+ the probability that the demand equals k is p_k; we assume

      (9.3.1)    p_0 < 1 .

(iv)  A cost function c_1 on ℤ+ into ℝ+; for each k ∈ ℤ+, c_1(k) gives the ordering costs of k units. A cost function c_2 on ℤ into ℝ+; for each k ∈ ℤ, c_2(k) gives the one period (expected) inventory and stock out cost if at the beginning of the period k units are available (note that k < 0 means a shortage). We assume

      (9.3.2)    ∃_{M∈ℝ+} ∀_{k∈ℤ+} [c_1(k) exp(-k) < M] ,
                 ∃_{M∈ℝ+} ∀_{k∈ℤ}  [c_2(k) exp(-|k|) < M] .

(v)   An optimality criterion. Here we will use the total (expected) discounted reward criterion with discount factor β (0 < β < 1).
We will now adapt the formulation of the inventory problem in such a way that it is covered by the models developed in the preceding chapters.
Of course it is possible to label the inventory levels with ℕ ∪ {0}. However, we will use the state space S := ℤ ∪ {0'} to maintain the correspondence with the inventory problems as treated by Wijngaard. The state 0' represents the fictive absorbing state. We define the bounding function μ on S by

    μ(0') := 0 ;   μ(i) := exp(|i|)   for i ∈ ℤ .

We define the transition probabilities p^a(i,j) with a ∈ A(i), i,j ∈ S and i ≠ 0' by

    p^a(i,j)  := β p_{i+a-j}   for j ≤ i + a ,
    p^a(i,0') := 1 - β ,
    p^a(i,j)  := 0             else .

For i = 0' we define p(0',0') = 1.
We define for i ∈ S\{0'} and a ∈ A(i) the rewards r(i,a) by

    r(i,a) := -(c_1(a) + c_2(i + a)) .
THEOREM 9.3.1. Let m, M, R ∈ ℤ be such that m < 0 < M, R ≤ M - m and

    Σ_{k=0}^{∞} p_k exp(k) < exp(R) .

Suppose A(i) is such that

    i ≤ M ⟹ ∀_{a∈A(i)} : i + a ≤ M ,
    i > M ⟹ A(i) = {0} .

Then
(i)
(ii)
(iii) ∃_{ρ'<1} ∃_{N∈ℕ} ∀_{f_0,...,f_N∈F} :  ||P_{f_0} P_{f_1} ··· P_{f_N}|| ≤ ρ' .

PROOF. The parts (i) and (ii) follow straightforwardly. To prove the final part of the theorem we use a result of Wijngaard [78]. From Wijngaard's theorem 5.4 and the boundedness of μ on [m,M] it follows that

    ∃_{M*∈ℝ+} ∀_{n∈ℕ} ∀_{f_0,...,f_n∈F} :  β^{-n} (P_{f_0} P_{f_1} ··· P_{f_n}) μ ≤ M* μ .

Multiplying both sides by β^n we have for n sufficiently large that the N-stage contraction assumption 8.1.1 is satisfied.  □
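A truncated numerical version of this model is easy to set up. The Python sketch below builds the transition data and the bounding function μ(i) = exp(|i|) on a finite window of inventory levels; the cost functions, the demand distribution, the truncation at the lower level m and the function name are hypothetical choices for illustration, not the quantities used by Wijngaard [78].

import numpy as np

def build_inventory_mdp(m, M, beta, p_demand, c1, c2):
    """Discounted inventory model on the levels m,...,M (backlogging allowed),
    orders a with i + a <= M, demand distribution p_demand (a dict k -> p_k),
    and the discount factor incorporated in the transition probabilities as in
    section 9.1; the missing mass 1 - beta goes to the absorbing state 0'."""
    levels = list(range(m, M + 1))
    idx = {i: n for n, i in enumerate(levels)}
    mu = np.array([np.exp(abs(i)) for i in levels])
    P, r = {}, {}
    for i in levels:
        for a in range(0, M - i + 1):
            Pa = np.zeros(len(levels))
            for k, pk in p_demand.items():
                j = max(m, i + a - k)            # truncate below at level m
                Pa[idx[j]] += beta * pk
            P[(i, a)] = Pa
            r[(i, a)] = -(c1(a) + c2(i + a))     # expected one stage reward
    return levels, mu, P, r

The weighted norms (P[(i,a)] @ mu / mu[idx[i]]) can then be inspected directly to see for which β and demand distributions the contraction conditions of the theorem hold on the truncated model.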
REFERENCES

[1] Bauer, H., Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie. Berlin, Walter de Gruyter & Co., 1968.
[2] Bellman, R., A Markovian decision process. J. Math. Mech. 6 (1957), 679-684.
[3] Bellman, R., Dynamic programming. Princeton (N.J.), Princeton University Press, 1957.
[4] Blackwell, D., Discrete dynamic programming. Ann. Math. Statist. 33 (1962), 719-726.
[5] Blackwell, D., Discounted dynamic programming. Ann. Math. Statist. 36 (1965), 226-235.
[6] Blackwell, D., Positive dynamic programming. Proc. Fifth Berkeley Sympos. Math. Stat. and Prob., Vol. 1 (1967), 415-418.
[7] Carton, D.C., Une application de l'algorithme de Howard pour des phénomènes saisonniers. Proc. 3rd Intern. Conf. Operations Res., Oslo (1963), 683-691.
[8] Collatz, L., Funktionalanalysis und Numerische Mathematik. Berlin, Springer-Verlag, 1964.
[9] Cox, D.R. and H.D. Miller, The theory of stochastic processes. London, Methuen & Co. Ltd., 1968.
[10] De Cani, J.S., A dynamic programming algorithm for embedded Markov chains when the planning horizon is at infinity. Management Sci. 10 (1964), 716-733.
[11] De Ghellinck, G.T. and G.D. Eppen, Linear programming solutions for separable Markovian decision problems. Management Sci. 13 (1967), 371-394.
[12] Denardo, E.V., Contraction mappings in the theory underlying dynamic programming. SIAM Rev. 9 (1967), 165-177.
[13] D'Epenoux, F., Sur un problème de production et de stockage dans l'aléatoire. Rev. Franç. Rech. Opér. 14 (1960), 3-16.
[14] Derman, C., Markovian sequential control processes — denumerable state space. J. Math. Anal. Appl. 10 (1965), 295-302.
[15] Derman, C., Finite state Markovian decision processes. New York etc., Academic Press, 1970.
[16] Derman, C. and R.E. Strauch, A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37 (1966), 276-278.
[17] Feller, W., An introduction to probability theory and its applications, Vol. II (2nd ed.). New York, Wiley, 1971.
[18] Finkbeiner, B. and W. Runggaldier, A value iteration algorithm for Markov renewal programming. Computing methods in optimization problems 2; ed. by L. Zadeh. New York etc., Academic Press (1969), pp. 95-104.
[19] Fox, B.L., Finite state approximations to denumerable state dynamic programs. J. Math. Anal. Appl. 34 (1971), 665-670.
[20] Grinold, R., Elimination of suboptimal actions in Markov decision problems. Operations Res. 21 (1973), 848-851.
[21] Groenewegen, L.P.J., Convergence results related to the equalizing property in a Markov decision process. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-18.)
[22] Das Gupta, S., An algorithm to estimate suboptimal present values for unichain Markov processes with alternative reward structures. Berlin etc., Springer-Verlag, 1973, pp. 399-406. (Lecture Notes in Computer Science, no. 3.)
[23] Harrison, J., Countable state discounted Markovian decision processes with unbounded rewards. Stanford, Dept. of Operations Res., Stanford University, 1970. (Techn. Rep. no. 17.)
[24] Harrison, J., Discrete dynamic programming with unbounded rewards. Ann. Math. Statist. 43 (1972), 636-644.
[25] Hastings, N.A.J., Some notes on dynamic programming and replacement. Operations Res. Quart. 19 (1968), 453-464.
[26] Hastings, N.A.J., Bounds on the gain of a Markov decision process. Operations Res. 19 (1971), 240-248.
[27] Hastings, N.A.J. and J. Mello, Tests for suboptimal actions in discounted Markov programming. Management Sci. 19 (1973), 1019-1022.
[28] van Hee, K.M., Markov strategies in dynamic programming. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-20.)
[29] van Hee, K.M. and J. Wessels, Markov decision processes and strongly excessive functions. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-22.)
[30] van Hee, K.M. and J.A.E.E. van Nunen, A note on the iterated expectation criterion for discrete dynamic programming. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1976. (Memorandum COSOR 76-03.)
[31] Hinderer, K., Bounds for stationary finite stage dynamic programs with unbounded reward functions. Hamburg, Institut für Mathematische Stochastik der Universität Hamburg, June 1975. (Report.)
[32] Hinderer, K., Estimates for finite stage dynamic programs. J. Math. Anal. Appl. (to appear).
[33] Hordijk, A., Dynamic programming and Markov potential theory. Amsterdam, Mathematisch Centrum, 1974. (Mathematical Centre Tracts, no. 51.)
[34] Hordijk, A., Convergent dynamic programming. Stanford, Dept. Operations Res., Stanford University, 1974. (Techn. Rep. 28.)
[35] Howard, R.A., Dynamic programming and Markov processes. Cambridge (Mass.), M.I.T. Press, 1960.
[36] Hübner, G., Improved procedures for eliminating suboptimal actions in Markov programming by the use of contraction properties. Transactions of the seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Prague, 1974 (incl. 1974 European Meeting of Statisticians). Academia, Prague, to appear.
[37] Jewell, W.S., Markov renewal programming: I. Formulation, finite return models. Operations Res. 11 (1963), 938-948.
[38] Jewell, W.S., Markov renewal programming: II. Infinite return models, example. Operations Res. 11 (1963), 949-971.
[39] Karlin, S., A first course in stochastic processes. New York etc., Academic Press, 1966.
[40] Kemeny, J.G. and J.L. Snell, Finite Markov chains. Princeton (N.J.), Van Nostrand, 1960.
[41] Kushner, H.J. and A.J. Kleinman, Accelerated procedures for the solution of discrete Markov control problems. IEEE Transactions on Automatic Control AC-16 (1971), 147-152.
[42] Krasnosel'skii, M.A., Approximate solutions of operator equations. Groningen, Wolters-Noordhoff Publ. Comp., 1972.
[43] Ljusternik, L.A. and W.I. Sobolew, Elemente der Funktionalanalysis. Berlin, Akademie-Verlag, 1968.
[44] Lippman, S.A., Semi-Markov decision processes with unbounded rewards. Management Sci. 19 (1973), 717-731.
[45] Lippman, S.A., On dynamic programming with unbounded rewards. Management Sci. 21 (1975), 1225-1233.
[46] MacQueen, J., A modified dynamic programming method for Markovian decision problems. J. Math. Anal. Appl. 14 (1966), 38-43.
[47] MacQueen, J., A test for suboptimal actions in Markovian decision problems. Operations Res. 15 (1967), 559-561.
[48] Maitra, A., Dynamic programming for countable state systems. Sankhyā Ser. A 27 (1965), 259-266.
[49] Manne, A.S., Linear programming and sequential decisions. Management Sci. 6 (1960), 259-267.
[50] Mine, H. and S. Osaki, Markovian decision processes. New York etc., Elsevier, 1970.
[51] Morton, T.E., Undiscounted Markov renewal programming via modified successive approximations. Operations Res. 19 (1971), 1081-1089.
[52] Neveu, J., Mathematical foundations of the calculus of probability. San Francisco, Holden-Day, 1965.
[53] van Nunen, J.A.E.E., A set of successive approximation methods for discounted Markovian decision problems. Zeitschrift für Operations Res. (to appear).
[54] van Nunen, J.A.E.E., Improved successive approximation methods for discounted Markov decision processes. Colloquia Mathematica Societatis János Bolyai 12 (1974), 667-682. Amsterdam, North-Holland Publ. Comp., 1974.
[55] van Nunen, J.A.E.E. and J. Wessels, A principle for generating optimization procedures for discounted Markov decision processes. Colloquia Mathematica Societatis János Bolyai, Vol. 12, pp. 683-695. Amsterdam, North-Holland Publ. Comp., 1974.
[56] van Nunen, J.A.E.E. and J. Wessels, A note on dynamic programming with unbounded rewards. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-13.)
[57] Odoni, A., On finding the maximal gain for Markov decision processes. Operations Res. 17 (1969), 857-860.
[58] Porteus, E.L., Some bounds for discounted sequential decision processes. Management Sci. 18 (1971), 7-11.
[59] Porteus, E.L., Bounds and transformations for discounted finite Markov decision chains. Operations Res. 23 (1975), 761-784.
[60] Reetz, D., Solution of a Markovian decision problem by successive overrelaxation. Zeitschrift für Operations Res. 17 (1973), 29-32.
[61] Riis, J.O., Discounted Markov programming in a periodic process. Operations Res. 13 (1965), 920-929.
[62] Ross, S.M., Applied probability models with optimization applications. San Francisco, Holden-Day, 1970.
[63] Scarf, H., The optimality of (S,s) policies in the dynamic inventory problem. Mathematical methods in the social sciences; ed. by K.J. Arrow, S. Karlin and P. Suppes. Stanford, Stanford University Press, 1960; chap. 13.
[64] Schellhaas, H., Zur Extrapolation in Markoffschen Entscheidungsmodellen mit Diskontierung. Zeitschrift für Operations Res. 18 (1974), 91-104.
[65] Schweitzer, P., Iterative solution of the functional equations of undiscounted Markov renewal programming. J. Math. Anal. Appl. 34 (1971), 495-501.
[66] Strauch, R.E., Negative dynamic programming. Ann. Math. Statist. 37 (1966), 871-889.
[67] Tijms, H.C., Analysis of (s,S) inventory models. Amsterdam, Mathematisch Centrum, 1972. (Mathematical Centre Tracts, no. 40.)
[68] Varga, R.S., Matrix iterative analysis. Englewood Cliffs, Prentice-Hall, 1962.
[69] Veinott, A.F., On finding optimal policies in discrete dynamic programming with no discounting. Ann. Math. Statist. 37 (1966), 1284-1294.
[70] van der Wal, J., The solution of an undiscounted completely ergodic Markov decision process by successive approximations. (To appear in Computing.)
[71] van der Wal, J., The method of successive approximations for the discounted Markov game. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-02.)
[72] van der Wal, J., The solution of Markov games by successive approximation. Master's thesis, Department of Mathematics, University of Technology Eindhoven, 1975.
[73] Wessels, J. and J.A.E.E. van Nunen, Dynamic planning of sales promotions by Markov programming. Proceed. XX Int. Meeting of the Institute of Manag. Sci., Tel Aviv. Jerusalem Academic Press, 1975, Vol. II, 737-742.
[74] Wessels, J., Stopping times and Markov programming. Transactions of the seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Prague, 1974 (incl. 1974 European Meeting of Statisticians). Academia, Prague, to appear.
[75] Wessels, J., Markov programming by successive approximations with respect to weighted supremum norms. J. Math. Anal. Appl. (to appear).
[76] Wessels, J. and J.A.E.E. van Nunen, Discounted semi-Markov decision processes: linear programming and policy iteration. Statistica Neerlandica 32 (1975), 1-7.
[77] White, D.J., Dynamic programming, Markov chains, and the method of successive approximations. J. Math. Anal. Appl. 6 (1963), 373-376.
[78] Wijngaard, J., Stationary Markovian decision problems. Thesis, University of Technology Eindhoven, 1975.
SUBJECT INDEX

actions: suboptimal, ε-suboptimal
action space: general, finite
bounding function
bounds: lower, upper
choice function
decision model: contracting, ε-contracting
decision process: Markov
decision rules: Markov, memoryless, nonrandomized, optimal, ε-optimal, stationary
discounted: Markov decision process, rewards, semi-Markov decision process, total expected reward
embedded Markov decision process
entry time
Gauss-Seidel iterative method
goahead set: nonzero, transition memoryless
histories
inventory problem
Jacobi iterative method: block, point
linear equations, system of
Markov: decision process, policy, reward model, reward process, strategy
matrix norm
memoryless: decision rules, stopping time
metric
monotone, ε-monotone linear operator
norm: matrix, vector, weighted supremum, of the process
N-stage contraction
n-stage history
optimal: decision rule, return operator
ε-optimal decision rule
optimality equation
path
point iterative methods: Gauss-Seidel, Jacobi, single step, successive overrelaxation, total step
policy: Markov, suboptimal, ε-suboptimal
reward: model, process, function, vector, total expected
reward structure, assumptions on the
semi-Markov decision process: with discounting
spectral radius
state space
state 0
stationary strategy
stopping functions
stopping time: examples, memoryless, nonrandomized, nonzero, randomized, transition memoryless
strategy: Markov, stationary
suboptimality of actions and policies
transition memoryless: goahead sets, stopping times
transition probability structure, assumptions on the
upper bounds
value oriented methods
weighted supremum norms
SELECTED LIST OF SYMBOLS

a_n ; A ; A(i) ; B ; b ; D ; Δ ; Δ' ; δ_1 ; δ_H ; δ_R ; E ; E_i^π ; ℓ_n^δ ; F ; F^∞ ; f ; G(i) ; G_n ; G_H ; G_R ; G_∞ ; g ; h_n ; L_δ ; L_f ; L^π ; M ; ℕ ; μ ; ν_1 ; ν_δ ; π ; P_f ; P_n(π) ; P_δ ; p(i,j) ; p^a(i,j) ; ρ ; r ; R_M ; S ; s_k ; T ; u_n^δ ; U_1 ; U_δ ; U_{1,ε} ; U_{δ,ε} ; U^{(λ)} ; V ; V_{μ,b,ρ} ; V_f ; V^π ; v* ; W ; W_μ ; y_n^δ ; z_n^δ ; α_n^δ ; β_n^δ ; 0 ; 0'
SAMENVATTING (SUMMARY)

The research described in this thesis concerns Markov decision processes. The practical relevance of Markov decision processes is increasing rapidly. Many practical problems, such as inventory problems, replacement problems, marketing problems and queueing problems, can be described by means of Markov decision models. A Markov decision process is a system that at the times t = 0,1,2,... is in one of the states of some state space. At each time point one may choose one action from a set of possible actions. The set of actions, which is independent of time, is denoted by A. If the system is in state i of the state space S at time n and action a from the set A is chosen, then the system will be in state j of S at time (n+1) with probability p^a(i,j). The probability of finding the system one time unit later in state j is thus determined by the decision taken in state i. Each time the system enters a state and an action is taken, one receives an amount r(i,a) depending on the state (i) and the action (a). The problem is to control the system as well as possible. In this thesis this means choosing the actions at the successive time points, in the states observed at those times, in such a way that the total expected amount received is maximal. A prescription that indicates for every state and every time point which action has to be chosen is called a decision rule or strategy. Decision rules for which the action to be chosen depends only on the state the system is in are called stationary decision rules or stationary strategies. Under every stationary strategy the process under consideration is a Markov process.
In this thesis Markov decision processes with a finite or countable state space and a general action space are studied. We allow the rewards to be unbounded. We require the existence of functions b and μ, defined on the state space, such that for every state i of the state space S and for every action a of A

    |r(i,a) - b(i)| < μ(i) .

We further require that there exists a positive number ρ smaller than 1 such that

    Σ_{j∈S} p^a(i,j) μ(j) ≤ ρ μ(i) .

Our conditions on the reward structure and the transition probabilities arise in fact by combining the conditions proposed in a paper by J. Wessels and a paper by J. Harrison. It is shown that the conditions obtained in this way are more general than the conditions of each of the mentioned authors separately.
In this thesis a collection of methods is presented for determining optimal strategies and for computing the corresponding total expected reward for Markov decision processes that satisfy the described conditions. This collection of solution methods is defined with the aid of stopping times. The use of stopping times leads to a unifying approach. In our approach the theory of monotone contractions on a complete metric space plays an important role. A stopping time is a prescription for terminating the Markov decision process. In chapter 5 it is proved that with the aid of every nonzero stopping time a monotone contracting mapping on a complete metric space can be defined. The fixed point of such a mapping is precisely the optimal total expected reward. This fixed point is thus independent of the chosen stopping time; in other words, every nonzero stopping time provides us with a successive approximation method for computing the optimal reward. In chapter 6 the class of solution methods is extended further by introducing, for each stopping time separately, a whole class of value oriented successive approximation methods. All the mentioned approximation methods enable us to determine, besides the optimal total expected reward, also the corresponding optimal or nearly optimal strategies. In chapter 7 it is indicated how upper and lower bounds for the optimal value can be computed. Moreover it is shown how, with the aid of upper and lower bounds, suboptimality criteria for identifying non-optimal actions can be given. The use of upper and lower bounds and of suboptimality tests in the determination of the optimal reward and an optimal strategy can lead to a considerable gain in the required computation time. In chapter 8 the conditions introduced earlier are relaxed somewhat further. This relaxation is comparable with the N-stage contraction condition introduced by E. Denardo. The thesis concludes with a number of explanatory examples, in which it is shown among other things how a number of specific Markov decision processes fall within our theory.
CURRICULUM VITAE

The author of this thesis was born in Venlo on 23 December 1945. In 1964 he obtained the diploma H.B.S.-b at the Rijks H.B.S. in that town. After fulfilling his military service he studied mathematics at the Eindhoven University of Technology from 1966 to 1971, specializing in "Probability Theory, Statistics and Operations Research", in particular stochastic (decision) processes. In 1971 he obtained his master's degree in mathematics (with honours). From 1970 until his graduation he worked as an assistant of Prof. J. Wessels, under whose supervision the doctoral research was later carried out as well.
Since 1971 he has been a scientific staff member of the Department of Mathematics of the same university.
STELLINGEN (PROPOSITIONS)

I
In order to describe a problem by means of a Markov model one will often have to use "the principle of supplementary variables", i.e. the inclusion of additional characteristics in the state definition. The indicated procedure, however, leads to larger models, which may endanger the numerical tractability. Concrete problems, however, often possess so much extra structure that tractable models can nevertheless be constructed.

II
For determining the recruitment numbers needed to realize a desired personnel composition over a certain time span, a method based on dynamic programming may be preferable to a linear programming approach.
Wessels, J. and J.A.E.E. van Nunen, FORMASY, forecasting and recruitment in manpower systems. Eindhoven, University of Technology Eindhoven, Dept. of Math., 1975. (Memorandum COSOR 75-17.)

III
In stationary inventory problems the use of renewal theory can be meaningful in practice as well as in theory, since for gamma distributed demand the corresponding renewal function and renewal density can be computed explicitly.

IV
The convergence of the successive approximation algorithm proposed by Porteus [1] for solving Markov decision problems, in which the policy improvement procedure is skipped a number of times, follows from the results of chapter 6 of this thesis.
[1] Porteus, E.L., Some bounds for discounted sequential decision processes. Management Sci. 18 (1971), 7-11.

V
In the planning of advertising campaigns the use of Markov decision models can provide insight.
Wessels, J. and J.A.E.E. van Nunen, Dynamic planning of sales promotions by Markov programming. Proceed. XX Int. Meeting of the Institute of Management Sci., Tel Aviv. Jerusalem Academic Press, 1975, Vol. II, 737-742.

VI
Graduation projects carried out in cooperation with an organization outside the student's own university department can bear fruit both for the teaching and research of the department and for the organization involved. If the latter wants to reap the harvest to a sufficient extent, however, it should take care that after the departure of the graduating student more remains than just the graduation report.

VII
It is known that Markov decision processes with transition probabilities that change periodically (see [3]) can be described by an extended Markov decision model in which the period is included in the state definition. In [1] such problems are considered with respect to the average reward criterion and in [3] with respect to the total expected discounted reward criterion. The methods given there follow directly if one uses the policy iteration algorithm as modified by Hastings [2] for the extended problem.
[1] Carton, D.C., Une application de l'algorithme de Howard pour des phénomènes saisonniers. Proceed. 3rd Intern. Conf. Operations Res., Oslo (1963), 683-691.
[2] Hastings, N.A.J., Some notes on dynamic programming and replacement. Operations Res. Quart. 19 (1968), 453-464.
[3] Riis, J.O., Discounted Markov programming in a periodic process. Operations Res. 13 (1965), 920-929.

VIII
Dynamic programming is not (yet) a standard tool for Operations Research groups working in practice. The cause of this is not a lack of practical usefulness of the mentioned technique.

IX
Suppose that a Markov decision process satisfies assumption 4.1.1 of this thesis (p. 37) and suppose that (in the notation of this thesis)
(1)
then the following conditions are equivalent:
(2) ∃_{ρ'<1} ∃_{M∈ℕ} ∀_{f∈F} :  ||(P_f)^M||_μ ≤ ρ' ;
(3) ∃_{ρ'<1} ∃_{M∈ℕ} ∀_{f_1,f_2,...,f_M∈F} :  ||P_{f_1} P_{f_2} ··· P_{f_M}||_μ ≤ ρ' .
One can then replace assumption 8.1.1 of this thesis (p. 111) by (1) and (2).

X
For many a mathematical engineer it is possible to occupy himself with mathematical aspects of marketing. For many a mathematical engineer, however, it is a necessity to occupy himself with marketing aspects of mathematics.

Eindhoven, 25 May 1976
J.A.E.E. van Nunen