The Annals of Probability
1983, Vol. 11, No. 3, 642-647
MARKOV STRATEGIES FOR OPTIMAL CONTROL PROBLEMS
INDEXED BY A PARTIALLY ORDERED SET
By G. F. Lawler and R. J. Vanderbei¹

Duke University and University of Illinois
We define the notion of a Markov strategy for the general optimal control
problem where the index set is partially ordered. We prove that the supremum
over all strategies is always attained by a Markov strategy if and only if the
structure of the probability space is that of a Markov random field.
Received April 1982; revised July 1982.

¹ Research supported by NSF grant MCS 81-14163.

AMS 1980 subject classifications. Primary 60G40; secondary 60G45.
Key words and phrases. Optimal control problem, partially ordered index set, Markov strategies.
1. Introduction.
Several papers have appeared recently (Krengel and Sucheston, 1981; Mandelbaum and Vanderbei, 1981; Mazziotto and Szpirglas, 1981; Washburn and Willsky, 1981) which generalize the optimal stopping problem to the case where the index
set S is partially ordered. In the paper by Mandelbaum and Vanderbei, it was shown that
if the pay-off is a function of the state of a multi-time parameter Markov chain then the
supremum over all policies coincides with the supremum over all "Markov policies." This
means that the strategy and the stopping rule depend only on the current state of the
process. In the case when $S = \mathbb{N} = \{0, 1, 2, \ldots\}$, the converse to the above statement has
been proved recently by Irle (1981). That is, the supremum over all stopping times
coincides with the supremum over all "Markov" stopping times if and only if the structure
of the probability space is that of a Markov chain.
The aim of this paper is to extend Irle's result to the case of a partially ordered index
set. In this case it turns out that in addition to the reward obtained when we stop, we need
to consider also a running reward. That is, the correct setting is optimal control as opposed
to optimal stopping (in the case where $S = \mathbb{N}$, these notions coincide).
In Section 2, we investigate the general optimal control problem. We define the notion
of a Markov strategy in Section 3. We then prove that the supremum over all strategies is
always attained by a Markov strategy if and only if the structure of the probability space
is that of a Markov random field. Finally we apply this to the case of a family of
independent stochastic processes and we see that the supremum is attained by a strategy
which at each time is only a function of the current state of each process if and only if each
process is Markov.
2. The optimal control problem. Throughout this paper we assume that $S$ is a countable partially ordered set such that: (i) there is a unique minimal element $0$ in $S$; (ii) the set of direct successors $U(s)$ of each point $s \in S$ is finite; and (iii) each point $s \in S$ has only finitely many predecessors (for example, $S = \mathbb{N}^k$ with the coordinatewise partial order).
Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $\mathcal{F} = \{\mathcal{F}_s\}_{s \in S}$ be an increasing family of sub-$\sigma$-algebras of $\mathcal{A}$. For any sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{A}$ and any random variable $Z$, the notation $Z \in \mathcal{G}$ means that $Z$ is $\mathcal{G}$-measurable.

A measurable mapping $\nu$ from a measurable subset of $\Omega$ into $S$ (with the $\sigma$-algebra of all subsets) is called a random point. For a random point $\nu$ in $S$, let $\mathcal{F}_\nu$ be the $\sigma$-algebra generated by functions of the form $\sum_{s \in S} Z_s 1_{\{\nu = s\}}$, $Z_s \in \mathcal{F}_s$. A random point $\nu$ is called a stopping point if $\{\nu = s\} \in \mathcal{F}_s$ for all $s \in S$. If $\nu$ is a stopping point, the above definition of $\mathcal{F}_\nu$ is equivalent to the usual definition: a set $A$ is in $\mathcal{F}_\nu$ if and only if $A \cap \{\nu = s\} \in \mathcal{F}_s$ for all $s \in S$.
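When $\Omega$ is finite, each $\mathcal{F}_s$ can be modeled concretely as a partition of $\Omega$, and the stopping-point condition becomes a mechanical check. The following Python sketch is ours, not the paper's; the names (`is_stopping_point`, the partition representation `F`) are hypothetical.

```python
# A minimal sketch (ours, not from the paper): Omega finite, F(s) a
# partition of Omega representing the sigma-algebra F_s.  A random point
# nu (a map omega -> point of S, or None where undefined) is a stopping
# point iff each event {nu = s} is a union of cells of F(s).
def is_stopping_point(nu, F, S, Omega):
    for s in S:
        event = {w for w in Omega if nu(w) == s}
        for c in map(set, F(s)):
            if event & c and not c <= event:
                return False  # {nu = s} splits a cell: not F_s-measurable
    return True
```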
Let $\sigma = \{\sigma_t\}_{t \in \mathbb{N}}$ be an increasing sequence of random points in $S$ (i.e., $\sigma_t \le \sigma_{t+1}$ for all $t \in \mathbb{N}$). Put $T = \inf\{t : \sigma_{t+1} = \sigma_t\}$, with the convention that $\inf \emptyset = \infty$. We say that $\sigma$ is a
strategy starting at $s$ if:

(a) $\sigma_0 = s$;
(b) $\sigma_{t+1} \in U(\sigma_t)$ for $t < T$;
(c) $\sigma_{t+1} \in \mathcal{F}_{\sigma_t}$ for $t < T$, and $\sigma_t = \sigma_T$ for $t \ge T$.

Usually we will be interested in strategies starting at $0$. For such strategies we omit the phrase "starting at $0$". The following properties of $\sigma$ follow from (a), (b), and (c):

(d) For each $t \in \mathbb{N}$, $\sigma_t$ is a stopping point;
(e) $\sigma_T$ is a stopping point (defined on $\{T < \infty\}$);
(f) $T$ is a stopping time (with values in $\mathbb{N} \cup \{\infty\}$) relative to the increasing sequence of $\sigma$-algebras $\mathcal{F}^\sigma_t = \mathcal{F}_{\sigma_t}$, $t \in \mathbb{N}$.
In the case where $S = \mathbb{N}$, there is a one-to-one correspondence between strategies and stopping times: to each strategy corresponds the stopping time $\nu = \sigma_T$ on $\{T < \infty\}$, $\nu = \infty$ on $\{T = \infty\}$, and to each stopping time $\nu$ corresponds the strategy $\sigma_t = \nu \wedge t$.
Let $Z^r_{s,u}$, $u \in U(s)$, $s \in S$, and $Z^f_s$, $s \in S$, be random variables such that $Z^r_{s,u} \in \mathcal{F}_u$ and $Z^f_s \in \mathcal{F}_s$. The random variable $Z^r_{s,u}$ is the running reward obtained in going from $s$ to $u$ and $Z^f_s$ is the final reward obtained if we stop at $s$. Since our goal is not to solve the most general optimal control problem, we will make the simplifying assumption that the running and final rewards are uniformly bounded:

(2.1)  $\sup_{s \in S,\, u \in U(s)} \operatorname{ess\,sup}_\omega |Z^r_{s,u}| < \infty, \qquad \sup_{s \in S} \operatorname{ess\,sup}_\omega |Z^f_s| < \infty.$

For each strategy $\sigma$ let

$\mathscr{P}(\sigma) = \sum_{t=0}^{T-1} \alpha^t Z^r_{\sigma_t, \sigma_{t+1}} + \alpha^T Z^f_{\sigma_T}$

denote the discounted payoff obtained using $\sigma$. Here $\alpha$ is a real number strictly between zero and one (the discount factor). We use the convention that $\alpha^T Z^f_{\sigma_T} = 0$ on $\{T = \infty\}$. The problem is to find a strategy $\sigma^*$ which is optimal in the sense that

(2.2)  $E\mathscr{P}(\sigma^*) = \sup_{\sigma \in \Sigma} E\mathscr{P}(\sigma).$

The supremum is taken over the class $\Sigma$ of all strategies starting at $0$.
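Pointwise, $\mathscr{P}(\sigma)$ is just a discounted sum along the realized path. A minimal sketch (ours, not the paper's; `payoff` and its arguments are hypothetical names), with $T$ read off as the first repeated point of the path:

```python
# A minimal sketch (ours, not from the paper): the discounted payoff of one
# realized path sigma_0, sigma_1, ..., where the path signals stopping by
# repeating a point (T = first t with path[t+1] == path[t]).
def payoff(path, Zr, Zf, alpha, omega):
    total = 0.0
    for t in range(len(path) - 1):
        if path[t + 1] == path[t]:                 # t = T: stop here
            return total + alpha ** t * Zf(path[t], omega)
        total += alpha ** t * Zr(path[t], path[t + 1], omega)
    raise ValueError("path never stops; on {T = infinity} the payoff "
                     "is the running sum alone")
```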
Let $X_s$ denote the highest reward possible using strategies starting from $s$:

(2.3)  $X_s = \operatorname{ess\,sup}_{\sigma \in \Sigma_s} E^{\mathcal{F}_s} \mathscr{P}(\sigma).$

Here $\Sigma_s$ denotes the collection of all strategies starting at $s$. The process $X_s$ is called Snell's envelope. We have
THEOREM 1. Snell's envelope satisfies the stochastic dynamic programming equation

(2.4)  $X_s = \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X_u)\}.$
Suppose we are using some strategy and we have arrived at the point $s$. If $Z^f_s$ attains the maximum in (2.4) we stop. If not, we proceed to a point $u \in U(s)$ and, of course, we pick one which attains the maximum. In this way we construct a strategy $\sigma = \{\sigma_t\}_{t \in \mathbb{N}}$, starting at $0$, such that

(2.5)  $X_{\sigma_t} = E^{\mathcal{F}_{\sigma_t}}(Z^r_{\sigma_t, \sigma_{t+1}} + \alpha X_{\sigma_{t+1}}), \quad t < T;$

(2.6)  $X_{\sigma_T} = Z^f_{\sigma_T}$ on $\{T < \infty\}$.

Strategies starting from $0$ which satisfy (2.5) and (2.6) will be called admissible.
Let $\sigma^*$ be an admissible strategy and put $T^* = \inf\{t : \sigma^*_{t+1} = \sigma^*_t\}$. By summing the stochastic difference equation (2.5) we get

LEMMA 1 (Dynkin's formula).

(2.7)  $E X_0 = E\left\{\sum_{t=0}^{T^*-1} \alpha^t Z^r_{\sigma^*_t, \sigma^*_{t+1}} + \alpha^{T^*} X_{\sigma^*_{T^*}}\right\}.$
In the case where $S = \mathbb{N}$, $Z^r_{s,u} \equiv 0$, and $\alpha = 1$, formula (2.5) is the definition of a martingale up to time $\nu = \sigma_T$ and (2.7) is the martingale optional sampling theorem (as is well known, the optional sampling theorem does not hold for all stopping times but requires some additional restrictions; we have avoided these difficulties by requiring that $\alpha$ be strictly less than one and assuming (2.1)). Using (2.6), (2.7), and the intuitive meaning of $X_0$, we are tempted to write

$E\mathscr{P}(\sigma^*) = E \sum_{t=0}^{T^*-1} \alpha^t Z^r_{\sigma^*_t, \sigma^*_{t+1}} + E \alpha^{T^*} X_{\sigma^*_{T^*}} = E X_0 = \sup_{\sigma \in \Sigma} E\mathscr{P}(\sigma).$
This is the idea behind the proof of
THEOREM 2. Every admissible strategy is optimal.
The following theorem justifies our calling $X_s$ Snell's envelope.

THEOREM 3. The process $X_s$ is the minimal process which satisfies

(2.8)  $X_s \ge Z^f_s, \quad s \in S;$

(2.9)  $X_s \ge E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X_u), \quad u \in U(s),\ s \in S.$

That is, if $Y_s$ also satisfies (2.8) and (2.9), then $X_s \le Y_s$ a.s. $P$ for all $s \in S$.
Theorems 1, 2, and 3, and Lemma 1 are straightforward generalizations of results found in Neveu (1975) and Mandelbaum and Vanderbei (1981), so the proofs have been omitted (in fact, introducing the discount factor $\alpha$ makes the proofs easier and eliminates the need of any technical assumptions other than (2.1)). The next theorem shows that Snell's envelope is the increasing limit of the best that can be done using $k$-step look-ahead strategies. The proof is quite similar to the case $S = \mathbb{N}$ (see e.g. Neveu (1975), Section VI-2).
THEOREM 4. Put $X^{(0)}_s = Z^f_s$ and define $X^{(k)}_s$ recursively by the formula

$X^{(k+1)}_s = \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(k)}_u)\}.$

Then $X^{(k)}_s$ increases with $k$ and $\lim_{k \to \infty} X^{(k)}_s = X_s$.
PROOF. We start by noting that $X^{(1)}_s \ge X^{(0)}_s$ and, proceeding inductively, we see that if $X^{(k)}_s \ge X^{(k-1)}_s$ for all $s \in S$, then

$X^{(k+1)}_s = \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(k)}_u)\} \ge \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(k-1)}_u)\} = X^{(k)}_s.$

The limit $X^{(\infty)}_s = \lim_{k \to \infty} X^{(k)}_s$ therefore satisfies the inequalities

(2.10)  $X^{(\infty)}_s \ge Z^f_s, \quad s \in S;$

(2.11)  $X^{(\infty)}_s \ge E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(\infty)}_u), \quad u \in U(s),\ s \in S.$

Suppose that $Y_s$ is any other process which satisfies (2.10) and (2.11). By (2.10), $Y_s \ge X^{(0)}_s$, and again proceeding inductively we see that if $Y_s \ge X^{(k)}_s$ for all $s \in S$, then (2.11) implies that

$Y_s \ge E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha Y_u) \ge E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(k)}_u)$
and so

$Y_s \ge \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{F}_s}(Z^r_{s,u} + \alpha X^{(k)}_u)\} = X^{(k+1)}_s.$

Hence, by Theorem 3, $X^{(\infty)}_s$ is Snell's envelope.
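Theorem 4 is effectively an algorithm. The sketch below is ours, not the paper's; every identifier is hypothetical. It runs the recursion on a toy poset $S = \{0, 1, 2\}^2$ with a finite $\Omega$, modeling each $\mathcal{F}_s$ as a partition that refines as $s$ increases, so that $E^{\mathcal{F}_s}$ is averaging over a cell; the final function realizes an admissible move per (2.4).

```python
# A numerical sketch of Theorem 4 (ours, not from the paper).
import random
from itertools import product

random.seed(0)
n, alpha, Omega = 2, 0.9, range(8)      # uniform probability on 8 points

S = list(product(range(n + 1), repeat=2))           # poset: {0,1,2}^2
def U(s):                                           # direct successors
    (i, j) = s
    return [p for p in ((i + 1, j), (i, j + 1)) if p in set(S)]

def F(s):                  # F_s as a dyadic partition, finer as s grows
    k = 2 ** min(s[0] + s[1], 3)
    return [range(c * 8 // k, (c + 1) * 8 // k) for c in range(k)]

def cond_exp(Z, part, w):  # E[Z | F](w): average over the cell holding w
    c = next(c for c in part if w in c)
    return sum(Z[v] for v in c) / len(c)

def meas_rand(part):       # a random variable measurable w.r.t. part
    out = [0.0] * 8
    for c in part:
        x = random.random()
        for w in c:
            out[w] = x
    return out

Zr = {(s, u): meas_rand(F(u)) for s in S for u in U(s)}   # running rewards
Zf = {s: meas_rand(F(s)) for s in S}                      # final rewards

# Value iteration: X^(k+1)_s = max{Zf_s, max_u E^{F_s}(Zr_{s,u} + a X^(k)_u)}.
X = {s: list(Zf[s]) for s in S}                           # X^(0) = Zf
for _ in range(10):        # the poset is finite, so this reaches the limit
    X = {s: [max([Zf[s][w]] +
                 [cond_exp([Zr[s, u][v] + alpha * X[u][v] for v in Omega],
                           F(s), w) for u in U(s)])
             for w in Omega] for s in S}

# An admissible move per (2.4): stop if Zf attains the max, else move to a
# maximizing successor; iterating this from 0 gives an optimal strategy.
def step(s, w):
    best, nxt = Zf[s][w], s
    for u in U(s):
        v = cond_exp([Zr[s, u][x] + alpha * X[u][x] for x in Omega], F(s), w)
        if v > best:
            best, nxt = v, u
    return nxt
```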
3. Markov strategies and the Markov property. In practice, when an optimal control problem is solved either by hand or on a computer, the $\sigma$-algebras $\mathcal{F}_s$ are finite. Unfortunately, however, they tend to grow very fast (usually exponentially fast) and so it would be nice to discard unnecessary information. In this section we investigate under what conditions this is possible.
Let $\mathcal{G}_s$, $s \in S$, be a family of $\sigma$-algebras such that $\mathcal{F}_s = \bigvee_{u \le s} \mathcal{G}_u$. The $\sigma$-algebra $\mathcal{G}_s$ represents the knowledge available at the "present" at point $s$. Let $\mathcal{H}_s = \bigvee_{u \ge s} \mathcal{G}_u$ represent the "future." For a random point $\nu$ in $S$, let $\mathcal{G}_\nu$ be the $\sigma$-algebra generated by functions of the form $\sum Z_s 1_{\{\nu = s\}}$, $Z_s \in \mathcal{G}_s$ (if, for each $s$, $\mathcal{G}_s$ is the $\sigma$-algebra generated by a mapping $Y_s$ of $(\Omega, \mathcal{A})$ into a state space $(E, \mathcal{B})$, then $\mathcal{G}_\nu = \sigma\{Y_\nu, \nu\}$). A strategy $\sigma$ is Markov if

(3.1)  $\sigma_{t+1} \in \mathcal{G}_{\sigma_t}, \quad t \in \mathbb{N}.$
THEOREM 5. The following are equivalent:

(a) For every $Z^r_{s,u} \in \mathcal{G}_s \vee \mathcal{G}_u$ and $Z^f_s \in \mathcal{G}_s$ satisfying (2.1), the supremum over all $\sigma \in \Sigma$ is attained by a Markov strategy.

(b) For every $s \in S$, $\mathcal{F}_s$ and $\mathcal{H}_s$ are conditionally independent given $\mathcal{G}_s$.
We remarked in Section 2 that in the case $S = \mathbb{N}$ there is a one-to-one correspondence between strategies and stopping times. Hence it makes sense to call a stopping time $\nu$ Markov if $\nu \wedge (t+1) \in \mathcal{G}_{\nu \wedge t}$. It is easy to check that this definition is equivalent to the one given in Irle (1981): for every $t \in \mathbb{N}$ there exists a $\mathcal{G}_t$-measurable set $G_t$ such that $\{\nu = t\} = \{\nu \ge t\} \cap G_t$.
PROOF OF THEOREM 5. (b) ⇒ (a). Let $X^{(k)}_s$ be the process defined in Theorem 4. Note that, since $Z^f_s \in \mathcal{G}_s$, $X^{(0)}_s \in \mathcal{G}_s$. Suppose that $X^{(k)}_s \in \mathcal{G}_s$ for all $s$. Then, since $Z^r_{s,u} \in \mathcal{G}_s \vee \mathcal{G}_u$, the Markov property (b) gives

$X^{(k+1)}_s = \max\{Z^f_s,\ \max_{u \in U(s)} E^{\mathcal{G}_s}(Z^r_{s,u} + \alpha X^{(k)}_u)\}.$

Hence $X^{(k+1)}_s \in \mathcal{G}_s$. Since $X_s = \lim_{k \to \infty} X^{(k)}_s$, we see that $X_s \in \mathcal{G}_s$ for all $s$. Now according to (2.4) we can find an admissible strategy which is Markov, and so, by Theorem 2, the supremum over all strategies is attained by a Markov strategy.
(a) ⇒ (b). Suppose there is a point $v \in S$ for which $\mathcal{F}_v$ and $\mathcal{H}_v$ are not conditionally independent given $\mathcal{G}_v$. Then there exists a point $w \in S$, $w > v$, and a $Y \in \mathcal{G}_w$ such that $0 \le Y \le 1$ and $P\{E^{\mathcal{F}_v} Y \ne E^{\mathcal{G}_v} Y\} > 0$. In fact, since both conditional expectations have the same expected value, the set

(3.2)  $B = \{E^{\mathcal{F}_v} Y < E^{\mathcal{G}_v} Y\}$

must have positive probability.

Let $0 = r_0, r_1, \cdots, r_n = v, r_{n+1}, \cdots, r_{n+m} = w$ be a sequence of direct successors connecting $0$ to $v$ to $w$ and chosen so that $n$ and $m$ are minimal. Put

$Z^r_{s,u} = \begin{cases} 1 & \text{if } s = r_t,\ u = r_{t+1},\ 0 \le t < n, \\ 0 & \text{otherwise,} \end{cases} \qquad Z^f_s = \begin{cases} \alpha^m E^{\mathcal{G}_v} Y & \text{if } s = v, \\ Y & \text{if } s = w, \\ 0 & \text{otherwise.} \end{cases}$
The expected payoff obtained using the non-Markov strategy

$\sigma_t = \begin{cases} r_{t \wedge n} & \text{on } B, \\ r_{t \wedge (n+m)} & \text{on } B^c, \end{cases}$

can be estimated as follows:

(3.3)  $E\mathscr{P}(\sigma) = \beta_n + E 1_B \alpha^{n+m} E^{\mathcal{G}_v} Y + E 1_{B^c} \alpha^{n+m} Y > \beta_n + \alpha^{n+m} E Y,$

where $\beta_n = \sum_{t=0}^{n-1} \alpha^t$. The strict inequality follows from the definition of $B$.
Now consider any strategy $\sigma$. Put $C = \{\sigma_t = r_t,\ 0 \le t \le n\}$. On the set $C^c$, the payoff $\mathscr{P}(\sigma)$ is bounded above by $\beta_{n-1} + \alpha^n < \beta_n$ (we use here the fact that $n$ is minimal to conclude $\alpha^T Z^f_{\sigma_T} \le \alpha^n Z^f_{\sigma_T} \le \alpha^n$). Hence,

(3.4)  $E\mathscr{P}(\sigma) = E\mathscr{P}(\sigma) 1_{C^c} + E\mathscr{P}(\sigma) 1_C \le \beta_n P(C^c) + E\{\beta_n + \alpha^T Z^f_{\sigma_T}\} 1_C.$

Since $Z^f$ takes on non-zero values only at the points $v$ and $w$,

(3.5)  $E \alpha^T Z^f_{\sigma_T} 1_C \le \alpha^{n+m} E 1_C \{1_{\{\sigma_{n+1} = v\}} E^{\mathcal{G}_v} Y + 1_{\{\sigma_{n+1} \ne v\}} Y\}.$
Now suppose that $\sigma$ is Markov. Then, by the definition of $\mathcal{G}_{\sigma_n}$, there is a $\mathcal{G}_v$-measurable set $A$ such that

$\{\sigma_n = v\} \cap \{\sigma_{n+1} = v\} = \{\sigma_n = v\} \cap A, \qquad \{\sigma_n = v\} \cap \{\sigma_{n+1} \ne v\} = \{\sigma_n = v\} \cap A^c.$

Hence

(3.6)  $E 1_C \{1_{\{\sigma_{n+1} = v\}} E^{\mathcal{G}_v} Y + 1_{\{\sigma_{n+1} \ne v\}} Y\} \le E\{1_A E^{\mathcal{G}_v} Y + 1_{A^c} Y\} = E Y.$
Combining (3.4), (3.5), and (3.6) we get

$E\mathscr{P}(\sigma) \le \beta_n + \alpha^{n+m} E Y$

for any Markov strategy $\sigma$. By (3.3) we see that the supremum over all strategies is strictly greater than the supremum over Markov strategies.
The above result would not be true if we considered only a final reward, i.e., the case of optimal stopping. To see this we consider the following example. Let $S = \{0 = (0,0),\ a = (1,0),\ b = (0,1),\ c = (1,1)\}$ with the usual partial order. Let $(\Omega, \mathcal{A}, P)$ be any nontrivial probability space and put $\mathcal{G}_0 = \mathcal{G}_c = \mathcal{A}$ and $\mathcal{G}_a = \mathcal{G}_b = \{\emptyset, \Omega\}$. Clearly, if $s = a$ or $b$, $\mathcal{F}_s$ and $\mathcal{H}_s$ are not conditionally independent given $\mathcal{G}_s$. For any collection of final rewards $Z^f_s \in \mathcal{G}_s$, we wish to maximize $E \alpha^T Z^f_{\sigma_T}$ over all $\sigma \in \Sigma$. It is easy to see in this example that by changing $Z^f$ we can replace $\alpha$ by $1$. For any strategy $\sigma$,

$E Z^f_{\sigma_T} \le E \max_s Z^f_s.$
We will give a Markov strategy which attains this upper bound. Put $M(\omega) = \max_s Z^f_s(\omega)$. Let $A_0$, $A_a$, $A_b$ be an $\mathcal{A}$-measurable partition of $\Omega$ such that

$A_0 \subset \{Z^f_0 = M\},$

$A_a \subset \{Z^f_a = M\} \cup \{Z^f_c = M,\ Z^f_a \le Z^f_b\},$

$A_b \subset \{Z^f_b = M\} \cup \{Z^f_c = M,\ Z^f_a > Z^f_b\}.$
Let $\sigma$ be the strategy defined by

$\sigma_0 = 0;$

$\sigma_1 = \begin{cases} 0 & \text{on } A_0, \\ a & \text{on } A_a, \\ b & \text{on } A_b; \end{cases} \qquad \sigma_2 = \begin{cases} 0 & \text{on } \{\sigma_1 = 0\}, \\ a & \text{on } \{\sigma_1 = a\} \cap \{Z^f_a > Z^f_b\}, \\ b & \text{on } \{\sigma_1 = b\} \cap \{Z^f_a \le Z^f_b\}, \\ c & \text{otherwise;} \end{cases}$

and $\sigma_t = \sigma_2$ for $t \ge 2$. Since $\mathcal{G}_a$ and $\mathcal{G}_b$ are trivial, $Z^f_a$ and $Z^f_b$ are constants, so the events $\{Z^f_a > Z^f_b\}$ and $\{Z^f_a \le Z^f_b\}$ are either $\emptyset$ or $\Omega$. It is easy to see that $\sigma$ is Markov and $E Z^f_{\sigma_T} = E M$.
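The example is easy to check numerically. In the sketch below (ours, not the paper's) $Z^f_a$ and $Z^f_b$ are constants, as forced by the trivial $\sigma$-algebras; the partition used is one concrete choice consistent with the conditions above, and `sigma_T` traces the displayed strategy to its final position.

```python
# A numerical check of the example (ours, not from the paper): the Markov
# strategy recovers E M even though G_a and G_b carry no information.
import random
random.seed(1)

Omega = range(1000)                        # uniform probability
Zf0 = [random.random() for _ in Omega]     # Zf_0 in G_0 = A
Zfc = [random.random() for _ in Omega]     # Zf_c in G_c = A
ga, gb = random.random(), random.random()  # Zf_a, Zf_b: constants

M = [max(Zf0[w], ga, gb, Zfc[w]) for w in Omega]

def sigma_T(w):
    # One admissible partition: A_0 = {Zf_0 = M}; off A_0, route through a
    # when Zf_a > Zf_b (the constant decision at a is then "stay") and
    # through b otherwise, moving on to c whenever the constant is not M.
    if Zf0[w] == M[w]:
        return '0'                          # omega in A_0: stop at 0
    if ga > gb:
        return 'a' if ga == M[w] else 'c'   # stay at a, or go via b to c
    return 'b' if gb == M[w] else 'c'       # stay at b, or go via a to c

final = {'0': lambda w: Zf0[w], 'a': lambda w: ga,
         'b': lambda w: gb, 'c': lambda w: Zfc[w]}
lhs = sum(final[sigma_T(w)](w) for w in Omega) / len(Omega)
assert abs(lhs - sum(M) / len(Omega)) < 1e-12   # E Zf_{sigma_T} = E M
```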
It was pointed out in Mandelbaum and Vanderbei (1981) and Washburn and Willsky (1981) that the importance of the condition $\sigma_{t+1} \in \mathcal{F}_{\sigma_t}$ is that it preserves martingales. That is, if $M_s$, $s \in S$, is an $\mathcal{F}$-martingale then $M^\sigma_t = M_{\sigma_t}$ is an $\mathcal{F}^\sigma$-martingale. Similarly, (3.1) preserves the Markov property. Put $\mathcal{G}^\sigma_t = \mathcal{G}_{\sigma_t}$, $\mathcal{F}^\sigma_t = \bigvee_{r \le t} \mathcal{G}^\sigma_r$ and $\mathcal{H}^\sigma_t = \bigvee_{r \ge t} \mathcal{G}^\sigma_r$. If $\mathcal{F}_s$ and $\mathcal{H}_s$ are conditionally independent given $\mathcal{G}_s$ for all $s \in S$, and $\sigma$ satisfies (3.1), then $\mathcal{F}^\sigma_t$ and $\mathcal{H}^\sigma_t$ are conditionally independent given $\mathcal{G}^\sigma_t$ for every $t \in \mathbb{N}$. In the case $S = \mathbb{N}$, condition (3.1) is the discrete analogue of the statement that $\sigma_t$ is the inverse of a (time inhomogeneous) additive functional.
We now apply Theorem 5 to the case of several independent stochastic processes. For $i = 1, \ldots, k$, let $\{Y^i_{s^i}\}_{s^i \in \mathbb{N}}$ be a stochastic process defined on a probability space $(\Omega^i, \mathcal{A}^i, P^i)$ and taking values in a state space $(E^i, \mathcal{B}^i)$. Let $\mathcal{G}^i_{s^i} = \sigma\{Y^i_{s^i}\}$, $\mathcal{F}^i_{s^i} = \bigvee_{u^i \le s^i} \mathcal{G}^i_{u^i}$, $\mathcal{H}^i_{s^i} = \bigvee_{u^i \ge s^i} \mathcal{G}^i_{u^i}$. Put

$(\Omega, \mathcal{A}, P) = (\Omega^1, \mathcal{A}^1, P^1) \times \cdots \times (\Omega^k, \mathcal{A}^k, P^k),$

$(E, \mathcal{B}) = (E^1, \mathcal{B}^1) \times \cdots \times (E^k, \mathcal{B}^k),$

$Y_s = (Y^1_{s^1}, \ldots, Y^k_{s^k}) \in E, \quad s = (s^1, \ldots, s^k) \in \mathbb{N}^k.$

Applying Theorem 5 to $\mathcal{G}_s = \sigma\{Y_s\}$ and using the fact that the processes $Y^i$ are independent, we get
COROLLARY 1. The following are equivalent:

(a) For every pair of bounded measurable functions $h : \mathbb{N}^k \times \mathbb{N}^k \times E \times E \to \mathbb{R}$ and $f : \mathbb{N}^k \times E \to \mathbb{R}$, the supremum over all $\sigma \in \Sigma$ of

$E\left\{\sum_{t=0}^{T-1} \alpha^t h(\sigma_t, \sigma_{t+1}, Y_{\sigma_t}, Y_{\sigma_{t+1}}) + \alpha^T f(\sigma_T, Y_{\sigma_T})\right\}$

is attained by a Markov strategy.

(b) For every $i = 1, \ldots, k$ and every $s^i \in \mathbb{N}$, $\mathcal{F}^i_{s^i}$ and $\mathcal{H}^i_{s^i}$ are conditionally independent given $\mathcal{G}^i_{s^i}$ (i.e., each process $Y^i$ is Markov).
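As a companion to Corollary 1, here is a small simulation sketch (ours, not the paper's; the chains, kernels, and reward functions are made up) of the product setting: a strategy advances one coordinate of $s = (s^1, \ldots, s^k)$ at a time or stops, and a Markov rule consults only the current index $s$ and the current states $Y_s$.

```python
# A minimal sketch of the setting of Corollary 1 (ours, not from the
# paper): k independent chains; each step advances one coordinate or stops.
import random
random.seed(2)
k, alpha = 3, 0.9

def kernel(i, x):               # made-up transition for chain i on {0,1,2}
    return random.choice([x, (x + 1) % 3])

def run(rule, h, f, max_steps=50):
    s, y = [0] * k, [0] * k     # current multi-index and current states
    total, disc = 0.0, 1.0
    for _ in range(max_steps):
        i = rule(tuple(s), tuple(y))    # Markov rule: sees only (s, Y_s)
        if i is None:                   # stop: collect final reward
            return total + disc * f(tuple(s), tuple(y))
        s2, y2 = list(s), list(y)
        s2[i] += 1
        y2[i] = kernel(i, y[i])
        total += disc * h(tuple(s), tuple(s2), tuple(y), tuple(y2))
        s, y, disc = s2, y2, disc * alpha
    return total                # never stopped within the horizon

# Example rule: advance the lowest-state chain until |s| = 5, then stop.
rule = lambda s, y: None if sum(s) >= 5 else min(range(k), key=lambda i: y[i])
value = run(rule, h=lambda s, s2, y, y2: 0.1, f=lambda s, y: max(y))
```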
REFERENCES

IRLE, A. (1981). Transitivity in problems of optimal stopping. Ann. Probability 9 642-647.
KRENGEL, U. and SUCHESTON, L. (1981). Stopping rules and tactics for processes indexed by a directed set. J. Multivariate Anal. 11 199-229.
MANDELBAUM, A. and VANDERBEI, R. J. (1981). Optimal stopping and supermartingales over partially ordered sets. Z. Wahrsch. verw. Gebiete 57 253-264.
MAZZIOTTO, G. and SZPIRGLAS, J. (1981). Arrêt optimal sur le plan. (Preprint.)
NEVEU, J. (1975). Discrete-Parameter Martingales. North-Holland, Amsterdam.
WASHBURN, R. B. and WILLSKY, A. S. (1981). Optional sampling of supermartingales indexed by partially ordered sets. Ann. Probability 9 957-970.
DEPARTMENT OF MATHEMATICS
UNIVERSITY OF ILLINOIS
URBANA, ILLINOIS 61801
DEPARTMENT OF MATHEMATICS
DUKE UNIVERSITY
DURHAM, NORTH CAROLINA 27706