Economic Theory 4, 303-309 (1994)
© Springer-Verlag 1994

On the convexity of the value function in Bayesian optimal control problems*

Yaw Nyarko
Department of Economics, New York University, New York, NY 10003, USA

Received: July 16, 1991; revised version: January 18, 1993

* Financial support from the C. V. Starr Center and the Research Challenge Fund at New York University is gratefully acknowledged. I thank Professor Tara Vishwanathan, whose questioning resulted in this paper.
Summary. I study the question of the convexity of the value function and Blackwell's (1951) Theorem, and relate this to the uniqueness of optimal policies. The main results conclude that strict convexity, and a strict inequality in Blackwell's Theorem, will hold if and only if from different priors different optimal actions may be chosen.
I. Introduction
I study the question of the convexity of the value function and Blackwell's (1951) Theorem, and relate this to the uniqueness of optimal policies. The main results will conclude that strict convexity and a strict inequality in Blackwell's Theorem will hold if and only if from different priors different optimal actions may be chosen.

The principal purpose of this paper is to provide simple and accessible proofs to economists of the above results. Most of the proofs of these results in the literature either use special assumptions (e.g., finiteness of the set of signals and/or states) or, because of a greater interest in maximal generality, they rely on arguments which make them inaccessible to many economists. For further work on the convexity of the value function and Blackwell's Theorem see Kihlstrom (1984) or LeCam (1964). For applications in economics see Rothschild (1974), Grossman et al. (1977), Kiefer and Nyarko (1989), and Nyarko (1991).
II. The decision problem
An agent does not know the true value of a parameter $\theta$ in a parameter space $\Theta$. The agent's (prior) beliefs about $\theta$ in the first period, date 1, are represented by the prior probability $\mu_0$ over $\Theta$. The agent must choose at each date $t$ an action $a_t$ in an action space $A$. The action results in an observation (or signal) $y_t$ in the set $Y$, with probability distribution $P(dy_t \mid a_t, \theta)$ which depends upon both the action $a_t$ and the parameter $\theta$. The action and observation result in a date $t$ utility of $u(a_t, y_t)$. We suppose that $A$, $Y$ and $\Theta$ are complete and separable metric spaces. Given any metric space $S$ we let $\mathcal{P}(S)$ denote the set of all (Borel) probability measures on $S$. $\mathcal{P}(S)$ will, unless otherwise stated, be endowed with the topology of weak convergence (see Billingsley (1968) for more on this). Suppose that at some date $t$ the agent begins the period with beliefs about the parameter $\theta$ represented by the posterior probability $\mu_{t-1}$. The agent uses the observation $y_t$ resulting from the action $a_t$ to update the posterior via Bayes' rule (i.e., the laws of probability). We let $B: A \times Y \times \mathcal{P}(\Theta) \to \mathcal{P}(\Theta)$ be the Bayes' rule operator, so that for all $t \ge 1$

$$\mu_t = B(a_t, y_t, \mu_{t-1}). \qquad (2.1)$$
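To fix ideas, the operator $B$ of (2.1) is easy to compute when $\Theta$ and $Y$ are finite. The following is a minimal sketch of one updating step; it is not from the paper, and the likelihood matrix and numbers are purely illustrative assumptions.

```python
import numpy as np

def bayes_update(likelihood, y_index, prior):
    """One application of the operator B(a, y, mu) of (2.1).
    likelihood[i, j] = P(y_j | a, theta_i) for the chosen action a;
    y_index picks the realized signal; prior is the vector mu over Theta."""
    joint = prior * likelihood[:, y_index]   # mu(theta) * P(y | a, theta)
    return joint / joint.sum()               # normalize to obtain the posterior

# Illustrative two-point example: Theta = {theta_0, theta_1}, Y = {y_0, y_1}.
P_given_a = np.array([[0.8, 0.2],   # P(. | a, theta_0)
                      [0.3, 0.7]])  # P(. | a, theta_1)
mu0 = np.array([0.5, 0.5])          # prior at date 1

mu1 = bayes_update(P_given_a, y_index=0, prior=mu0)
print(mu1)  # posterior after observing y_0: approximately [0.73, 0.27]
```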
II.a. Histories and policies
A date $n$ partial history is any sequence of observations, actions and prior probabilities at all dates preceding $n$; i.e., $h_n = (\mu_0, \{(a_t, y_t, \mu_t)\}_{t=1}^{n-1}) \in \mathcal{P}(\Theta) \times [A \times Y \times \mathcal{P}(\Theta)]^{n-1} \equiv H_n$. A policy is a sequence $\pi = \{\pi_t\}_{t=1}^{\infty}$, where each (measurable) function $\pi_t: H_t \to A$ specifies the date $t$ action $a_t = \pi_t(h_t)$ as a function of the date $t$ partial history. We let $D$ denote the set of all policies. Any policy, $\pi$, given initial beliefs $\mu_0$, then generates a sequence of actions, observations and posterior beliefs, $\{(a_t, y_t, \mu_{t-1})\}_{t=1}^{\infty}$, via the conditional probability $P(dy \mid \theta, a)$ and the Bayesian updating rule (2.1). Any policy $\pi$ results in an expected sum of discounted rewards with discount factor $\delta \in (0,1)$, $V_\pi(\mu_0) = E[\sum_{t=1}^{\infty} \delta^{t-1} u(a_t, y_t) \mid \mu_0, \pi]$. We define the value function by $V(\mu) = \sup_{\pi \in D} V_\pi(\mu)$. A policy is optimal if it attains this supremum. We assume the existence of an optimal policy and we assume the utility function is uniformly bounded, so our value function is well defined. (See Kiefer and Nyarko (1989) for conditions which ensure this and for other details.)
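Because the posterior is the natural state variable here, the value function of a problem with finitely many parameters, actions and signals can be approximated on a grid of beliefs by standard value iteration. The sketch below is only a hypothetical illustration of the objects $V_\pi$ and $V$; the primitives chosen, and the use of a Bellman-type recursion, are assumptions of the sketch rather than results of the paper.

```python
import numpy as np

# Hypothetical primitives: Theta = {0, 1}, A = {0, 1}, Y = {0, 1}.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[a=0][theta][y]: informative action
              [[0.5, 0.5], [0.5, 0.5]]])  # P[a=1][theta][y]: uninformative action
u = np.array([[1.0, 0.0],                 # u[a=0][y]: risky payoff
              [0.6, 0.6]])                # u[a=1][y]: safe payoff
delta = 0.9                               # discount factor

grid = np.linspace(0.0, 1.0, 201)         # belief mu = Prob(theta = 1)
V = np.zeros_like(grid)                   # initial guess

def posterior(mu, a, y):
    """The operator B(a, y, mu) for a two-point parameter space."""
    num = mu * P[a, 1, y]
    den = num + (1.0 - mu) * P[a, 0, y]
    return num / den if den > 0 else mu

for _ in range(400):                      # value iteration
    V_new = np.empty_like(V)
    for i, mu in enumerate(grid):
        best = -np.inf
        for a in (0, 1):
            total = 0.0
            for y in (0, 1):
                p_y = mu * P[a, 1, y] + (1.0 - mu) * P[a, 0, y]  # predictive prob.
                cont = np.interp(posterior(mu, a, y), grid, V)   # V at the posterior
                total += p_y * (u[a, y] + delta * cont)
            best = max(best, total)
        V_new[i] = best
    V = V_new

print(np.round(V[::50], 3))   # approximate V at beliefs 0, 0.25, 0.5, 0.75, 1
```

With $\delta \in (0,1)$ and bounded $u$ the iteration is a contraction, so the grid values settle down to an approximation of $V$; the convexity of the resulting function in the belief anticipates Proposition 3.2 below.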
II.b. *-Policies
We define a date $n$ *-partial history, $h_n^*$, to be any sequence of past actions and observations (i.e., the same as date $n$ partial histories but without a specification of the posterior beliefs); i.e., any $h_n^* = \{(a_t, y_t)\}_{t=1}^{n-1} \in \prod_{t=1}^{n-1} A \times Y \equiv H_n^*$. A date $n$ *-policy is any function $\pi_n^*: H_n^* \to A$, and a *-policy is any collection of date $n$ *-policies, $\pi^* = \{\pi_n^*\}_{n=1}^{\infty}$. Let $D^*$ denote the set of all *-policies. At date one there is no history, so $h_1^*$ is a "null" history and $\pi_1^*$ is identified with an action (i.e., $\pi_1^* \in A$). Any parameter value $\theta$ and *-policy $\pi^*$ generates the sequence of actions and observations, $\{a_n, y_n\}_{n=1}^{\infty}$, defined inductively via the following relations: $a_1 = \pi_1^*$, $y_1 \sim P(dy \mid \theta, a_1)$, and given the *-partial history $h_n^*$, $a_n = \pi_n^*(h_n^*)$ and $y_n \sim P(dy \mid \theta, a_n)$, where by the notation $y_n \sim P$ we mean that $y_n$ is a random variable with distribution given by the probability $P$. $\pi^*$ and $\theta$ also generate a conditional expected discounted sum of returns

$$L(\pi^*, \theta) = E\Big[\sum_{n=1}^{\infty} \delta^{n-1} u(a_n, y_n) \,\Big|\, \theta, \pi^*\Big]. \qquad (2.2)$$
Recall that a date $n$ policy as defined in Section II.a is allowed to be a function also of the history of posterior distributions. Hence every *-policy may also be considered a policy, but not vice versa. However, a fixed prior $\mu$ and policy $\pi$ "induce" a *-policy as follows: Define $\pi_1^* = \pi_1(\mu)$; for any $n \ge 2$, any *-partial history $h_n^*$ will generate via Bayes' rule (2.1) a unique sequence of posteriors $\{\mu_t\}_{t=1}^{n-1}$ and hence a unique partial history $h_n$; so define $\pi_n^*(h_n^*) = \pi_n(h_n)$. The constructed *-policy, $\pi^*$, will generate the same sequence of actions and observations as the policy $\pi$ from the fixed initial prior $\mu$. (Of course, a *-partial history, $h_n^*$, is any member of $H_n^*$, and in particular may not be consistent with Bayes' rule (2.1) from the fixed initial prior $\mu$. Such *-partial histories have zero probability of being observed under the policy $\pi$ and initial prior $\mu$, and at such *-partial histories, $h_n^*$, we are free to arbitrarily define the corresponding date $n$ *-policy, $\pi_n^*$.)
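The induction just described is mechanical: run the *-partial history through Bayes' rule (2.1) to rebuild the posteriors, and hand the rebuilt partial history to the original policy. A minimal sketch for a computational setting follows; the function names and the representation of histories are assumptions of the sketch, not the paper's notation.

```python
def induced_star_policy(policy, prior, bayes_update):
    """Build the *-policy induced by `policy` and a fixed `prior`.

    `policy(prior, history)` maps the prior and a list of (a_t, y_t, mu_t)
    triples (a partial history h_n) to the date n action; `bayes_update(a, y, mu)`
    is the operator B of (2.1).  The returned function sees only past
    (action, signal) pairs, i.e. a *-partial history h*_n."""
    def star_policy(star_history):
        mu, history = prior, []
        for a, y in star_history:            # rebuild mu_1, ..., mu_{n-1} via (2.1)
            mu = bayes_update(a, y, mu)
            history.append((a, y, mu))
        return policy(prior, history)        # pi_n evaluated at the rebuilt h_n
    return star_policy
```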
It should be fairly obvious from the above constructions that for an agent with a fixed prior $\mu$, choosing a policy is equivalent to choosing a *-policy. The argument is as follows: Every *-policy may be considered a policy, namely one which is independent of the posteriors. Hence choosing a policy (as opposed to a *-policy) cannot result in a lower sum of discounted payoffs. Further, for fixed prior, we argued above that any policy induces a *-policy which results in the same distribution of actions and observations, and hence the same payoffs. So, for each fixed prior and for each policy there exists a *-policy which attains the same sum of discounted payoffs. This shows the required equivalence. Now fix any policy $\pi$ and initial prior $\mu$ and let $\pi^{**}$ be the *-policy induced by $\pi$ and $\mu$. We may therefore conclude that

$$V_\pi(\mu) = \int L(\pi^{**}, \theta)\, \mu(d\theta) \qquad (2.3)$$

and

$$V(\mu) = \sup_{\pi \in D} V_\pi(\mu) = \sup_{\pi^* \in D^*} \int L(\pi^*, \theta)\, \mu(d\theta). \qquad (2.4)$$

Any *-policy that attains the supremum on the right-hand side of (2.4) shall be called an optimal *-policy for the given initial prior, $\mu$.
III. The (weak) convexity of the value function
Let $M$ be the vector space of all finite signed measures on the parameter space $\Theta$, endowed with the total variation norm defined by $\|\mu\| = \sup_A |\mu(A)|$, where the supremum is over measurable subsets $A$ of $\Theta$. For any *-policy, $\pi^*$, and any $\mu \in M$, we may define (with a slight abuse of notation; see (2.2)):

$$L(\pi^*, \mu) = \int L(\pi^*, \theta)\, \mu(d\theta). \qquad (3.1)$$

Note that the function $L(\pi^*, \mu)$ is a function of $\mu$ only directly through the second argument and not indirectly through $\pi^*$ (which is indeed why we work with *-policies as opposed to policies)! The lemma below is easy to check and implies Proposition 3.2, the convexity of the value function.

Lemma 3.1. (a) $L(\pi^*, \mu)$ is linear in $\mu$ for fixed *-policy $\pi^*$. (b) $L(\pi^*, \mu)$ is continuous on $M$ under the total variation norm, uniformly in $\pi^*$; i.e., $\exists K > 0$ such that for all $\mu'$ and $\mu''$ in $M$, $\sup_{\pi^* \in D^*} |L(\pi^*, \mu') - L(\pi^*, \mu'')| \le K \|\mu' - \mu''\|$.

Proposition 3.2. The value function $V$ is (a) continuous on $M$ when $M$ is endowed with the total variation norm and (b) convex on $M$.
Proof. The value function, $V$, can of course be defined on the set of finite signed measures $M$ (as opposed to being defined on the set of probability measures), e.g., by (2.3) and (2.4). From (2.4) and Lemma 3.1(b) it is easy to check that $V$ is continuous. Further, it is easy to check that the supremum of an arbitrary collection of linear real-valued functions is convex. Hence from (2.4) and the linearity of $L(\pi^*, \mu)$ in $\mu$ we conclude that the value function is convex.
IV. Strict convexity of the value function and uniqueness of optimal actions

We now show that the value function is NOT strictly convex between two priors if and only if all the priors in the convex hull of those two priors share a common optimal *-policy. In particular, if one can check that all the priors in the convex hull of two priors do NOT share a common date one optimal action, then the value function must be strictly convex between those two priors.

Proposition 4.1. Fix any $\mu', \mu'' \in M$ and $\lambda \in (0,1)$. Denote their convex hull by $C(\mu', \mu'') = \{\mu \in M \mid \exists \phi \in [0,1] \text{ s.t. } \mu = \phi \mu' + (1-\phi)\mu''\}$. Then $V(\lambda \mu' + (1-\lambda)\mu'') = \lambda V(\mu') + (1-\lambda) V(\mu'')$ if and only if $\exists$ a *-policy $\pi^*$ such that, from all initial priors $\mu \in C(\mu', \mu'')$, $\pi^*$ is an optimal *-policy from that prior $\mu$.
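A one-period special case (a hypothetical specification, not taken from the paper) makes the proposition concrete: with two parameter values, a *-policy is just an action, $V(\mu) = \max_a \int \bar u(a, \theta)\, \mu(d\theta)$ is the upper envelope of finitely many functions that are linear in $\mu$, and $V$ is linear on a segment of priors exactly when a single action remains optimal along it. The sketch below checks this directly.

```python
import numpy as np

# One-period illustration with hypothetical payoffs: Theta = {0, 1} and three
# actions, with expected payoffs ubar[a, theta].
ubar = np.array([[1.0, 0.0],    # action 0: good only if theta = 0
                 [0.0, 1.0],    # action 1: good only if theta = 1
                 [0.6, 0.6]])   # action 2: safe

def V(mu1):
    """One-period value at the prior putting probability mu1 on theta = 1."""
    mu = np.array([1.0 - mu1, mu1])
    return float((ubar @ mu).max())

def best_action(mu1):
    mu = np.array([1.0 - mu1, mu1])
    return int((ubar @ mu).argmax())

for p, q in [(0.1, 0.3), (0.1, 0.9)]:
    mid = 0.5 * (p + q)
    gap = 0.5 * V(p) + 0.5 * V(q) - V(mid)  # zero iff V is linear between p and q
    print((p, q), best_action(p), best_action(q), round(gap, 3))

# (0.1, 0.3): action 0 is optimal at every prior on the segment, gap = 0.0
# (0.1, 0.9): the optimal action changes along the segment, gap = 0.3 > 0
```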
V. The comparison of experiments
Any action and observation combination $(a, y) \in A \times Y$ provides information on the unknown parameter $\theta$ and hence may be referred to as an experiment on $\theta$. We shall say that Experiment A is sufficient for Experiment B if the observation in Experiment B is the observation of Experiment A perturbed by some noise. Formally we have:
Definition 5.1. Fix any actions $a$ and $a'$ in $A$ and let $y$ and $y'$ denote the observations resulting from those actions. The experiment $P(dy \mid a, \theta)$ is sufficient for $P(dy' \mid a', \theta)$ if there exists a conditional probability $M(dy' \mid y)$ (of $y'$ conditional on $y$) such that $M(dy' \mid y) \cdot P(dy \mid a, \theta) = P(dy' \mid a', \theta)$ for all $\theta$ in $\Theta$; i.e., for any bounded (measurable) function $g: Y \to \mathbb{R}$ and any $\theta \in \Theta$, $\int_Y g(y')\, P(dy' \mid a', \theta) = \int_Y \int_Y g(y')\, M(dy' \mid y)\, P(dy \mid a, \theta)$. In particular, if $y^{*'}$ is the random variable generated via $M(dy^{*'} \mid y) \cdot P(dy \mid a, \theta)$ then, conditional on $\theta$, $y'$ and $y^{*'}$ have the same distribution.
Example 5.2. Suppose $A$ and $\Theta$ are both subsets of the real line and $y = \theta a + \epsilon$ with $\epsilon$ normally distributed with zero mean and unit variance. Fix any $a$ and $a' \in A$ with $|a| > |a'| > 0$. Then one may show that $P(dy \mid a, \theta)$ is sufficient for $P(dy' \mid a', \theta)$, with $y^{*'} = (y a'/a) + \epsilon^*$, where $\epsilon^*$ is independent of $\epsilon$ and normally distributed with mean zero and variance equal to $[1 - (a'/a)^2]$.
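Example 5.2 can be checked by simulation. The sketch below (sample size, seed and parameter values are arbitrary choices of the illustration) draws $y$ from the more informative experiment, applies the garbling $y^{*'} = (y a'/a) + \epsilon^*$, and confirms that, conditional on $\theta$, $y^{*'}$ matches the distribution of $y'$ from the less informative experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, a, a_prime = 2.0, 1.0, 0.5        # a fixed parameter value and |a| > |a'| > 0
n = 200_000                              # arbitrary sample size

y       = theta * a + rng.standard_normal(n)        # experiment P(dy  | a,  theta)
y_prime = theta * a_prime + rng.standard_normal(n)  # experiment P(dy' | a', theta)

# The garbling of Example 5.2: y*' = y * (a'/a) + eps*, eps* ~ N(0, 1 - (a'/a)^2).
r = a_prime / a
y_star = y * r + np.sqrt(1.0 - r**2) * rng.standard_normal(n)

# Conditional on theta, y*' and y' should share the same N(theta * a', 1) law.
print(round(y_prime.mean(), 3), round(y_star.mean(), 3))   # both approximately 1.0
print(round(y_prime.var(), 3),  round(y_star.var(), 3))    # both approximately 1.0
```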
Proposition 5.3 (Blackwell's Theorem). If the experiment $P(dy \mid a, \theta)$ is sufficient for $P(dy' \mid a', \theta)$ then $\forall \mu \in \mathcal{P}(\Theta)$,

$$\int_{\Theta \times Y} V(B(a', y', \mu))\, P(dy' \mid a', \theta)\, \mu(d\theta) \le \int_{\Theta \times Y} V(B(a, y, \mu))\, P(dy \mid a, \theta)\, \mu(d\theta).$$
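For a finite-dimensional illustration of the inequality (again a hypothetical example: the experiments, the garbling kernel, and the one-period stand-in for the convex value function are all assumptions of the sketch), one can garble an informative likelihood by a Markov kernel and compare the two expectations directly.

```python
import numpy as np

mu = np.array([0.5, 0.5])                # prior over Theta = {0, 1}
P_A = np.array([[0.9, 0.1],              # P(y | a, theta = 0)
                [0.2, 0.8]])             # P(y | a, theta = 1): informative experiment
M   = np.array([[0.7, 0.3],              # M(y' | y = 0): garbling kernel
                [0.3, 0.7]])             # M(y' | y = 1)
P_B = P_A @ M                            # P(y' | a', theta): the garbled experiment

ubar = np.array([[1.0, 0.0],             # hypothetical payoffs ubar[action, theta]
                 [0.0, 1.0]])

def convex_value(post):
    """A simple continuous convex stand-in for V: act optimally against `post`."""
    return float((ubar @ post).max())

def expected_value(P):
    """Expectation over theta and the signal of convex_value at the Bayes posterior."""
    total = 0.0
    for y in range(P.shape[1]):
        p_y = float(mu @ P[:, y])              # predictive probability of the signal
        post = mu * P[:, y] / p_y              # posterior B(., y, mu)
        total += p_y * convex_value(post)
    return total

print(expected_value(P_A))   # value of the sufficient (informative) experiment: 0.85
print(expected_value(P_B))   # value of the garbled experiment: 0.64, never larger
```

Any continuous convex function of the posterior would serve in place of this one-period stand-in, which is precisely how Proposition 3.2 and Jensen's inequality are used in the proof in the appendix.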
VI. The 'strict comparison' of experiments
We seek to determine when the inequality in Proposition 5.3 can be made strict. As one may guess from Proposition 4.1, the conclusion will be of the form that Proposition 5.3 holds with equality if and only if the actions from the two experiments are somehow the "same." Fix the initial prior $\mu$ and let $P(dy \mid a, \theta)$ be sufficient for $P(dy' \mid a', \theta)$. Let $y^{*'}$ be as in the Definition 5.1 of sufficient experiments. Let $Q$ denote the joint probability distribution of $y^{*'}$, $y$ and $\theta$ generated by letting $\theta$ have distribution $\mu$, letting $y$ have distribution $P(dy \mid a, \theta)$, and letting $y^{*'}$ have distribution $M(dy^{*'} \mid y)$ (i.e., given any real-valued function $f$ on $Y \times Y \times \Theta$, $\int f\, dQ = \int_\Theta \int_Y \int_Y f(y^{*'}, y, \theta)\, M(dy^{*'} \mid y)\, P(dy \mid a, \theta)\, \mu(d\theta)$). Suppose that at date one, two initially identical agents with the same prior $\mu$ choose the actions $a$ and $a'$ respectively. Upon making their observations, $y$ and $y'$ respectively, they will choose date 2 actions $a_2(y)$ and $a_2'(y')$ respectively. Let $a_2^{*'}(y^{*'})$ be the optimal date 2 action of the agent who observes only $y^{*'}$. Given any probability $P$ on a metric space let $\mathrm{Supp}\, P$ denote the support of $P$ (i.e., the smallest closed set with $P$-probability one). Then we have,
Proposition 6.1. Suppose Proposition 5.3 holds with equality. Then $Q$-a.e. $y^{*'}$, (a) $Q(\{a_2 = a_2^{*'}\} \mid y^{*'}) = 1$ (i.e., for all $y^{*'}$ outside of a set of $(y, y^{*'})$ values with $Q$ probability zero, $a_2(y) = a_2^{*'}(y^{*'})$ for all $y \in \mathrm{Supp}\, Q(\cdot \mid y^{*'})$), and (b) $a_2$ and $a_2'$ have the same distribution conditional on $\theta$; i.e., if $C$ is any (measurable) subset of the action space $A$ then $Q(\{a_2 \in C\} \mid \theta) = Q(\{a_2' \in C\} \mid \theta)$ $\forall \theta$.
Remark. Suppose that the observations take values on the real line and that the conditional probability $M(dy^{*'} \mid y)$ and the marginal distribution on $Y$ of $Q$ both admit strictly positive density functions, $q(y, y^{*'})$ and $q(y)$ respectively (as in Ex. 5.2). Then using Bayes' rule it is easy to see that the marginal distribution on $Y$ of $Q(\cdot \mid y^{*'})$ has a strictly positive density function given by $q(y, y^{*'})/q(y)$. Then $\{y \in \mathrm{Supp}\, Q(dy \mid y^{*'})\} = Y$. Proposition 6.1(a) concludes that there is a common action which is optimal from all observations $y$. Hence, if the utility function of the agent is such that different observations of $y$ lead to different optimal actions, Proposition 5.3 must hold with strict inequality.
Appendix: The Proofs
Proof of Lemma 3.1. (a) Obvious. (b) Given any $\mu \in M$ define $\mu^+$ and $\mu^-$ to be, respectively, the positive and negative parts of $\mu$; then both $\mu^+$ and $\mu^-$ are non-negative measures and $\mu = \mu^+ - \mu^-$ (see Dunford and Schwartz (1957)). By assumption the utility function is uniformly bounded, so we may suppose without loss of generality that $L(\pi^*, \theta)$ is non-negative and bounded above, uniformly in $(\pi^*, \theta)$, by some $K > 0$. Then for any *-policy $\pi^*$, and any $\mu', \mu'' \in M$,

$$|L(\pi^*, \mu') - L(\pi^*, \mu'')| = |L(\pi^*, \mu' - \mu'')| = \Big|\int L(\pi^*, \theta)\, d(\mu' - \mu'')\Big| = \Big|\int L(\pi^*, \theta)\, d(\mu' - \mu'')^+ - \int L(\pi^*, \theta)\, d(\mu' - \mu'')^-\Big| \le \int L(\pi^*, \theta)\, d(\mu' - \mu'')^+ + \int L(\pi^*, \theta)\, d(\mu' - \mu'')^- \le K[(\mu' - \mu'')^+(\Theta) + (\mu' - \mu'')^-(\Theta)] \le 2K \|\mu' - \mu''\|,$$

which implies part (b) of the lemma.
Proof of Proposition 4.1. Suppose there exist $\mu', \mu'' \in \mathcal{P}(\Theta)$ and a $\lambda \in (0,1)$ such that $V(\lambda \mu' + (1-\lambda)\mu'') = \lambda V(\mu') + (1-\lambda)V(\mu'')$. Let $\pi^*$ be any *-policy optimal from initial prior $\lambda \mu' + (1-\lambda)\mu''$. Then $V(\mu') \ge L(\pi^*, \mu')$ and $V(\mu'') \ge L(\pi^*, \mu'')$. Suppose, per absurdum, that from one of the priors $\mu'$ or $\mu''$, the *-policy $\pi^*$ is not optimal; then one of these inequalities holds strictly. Then,

$$V(\lambda \mu' + (1-\lambda)\mu'') = \lambda V(\mu') + (1-\lambda)V(\mu'') > \lambda L(\pi^*, \mu') + (1-\lambda)L(\pi^*, \mu'') = L(\pi^*, \lambda \mu' + (1-\lambda)\mu'') = V(\lambda \mu' + (1-\lambda)\mu''),$$

a contradiction. Hence the *-policy $\pi^*$ is optimal from initial priors $\mu'$ and $\mu''$. Now fix any $\phi \in [0,1]$ and suppose that from initial prior $\phi \mu' + (1-\phi)\mu''$, $\pi^*$ is not optimal. Then $V(\phi \mu' + (1-\phi)\mu'') > L(\pi^*, \phi \mu' + (1-\phi)\mu'') = \phi L(\pi^*, \mu') + (1-\phi)L(\pi^*, \mu'') = \phi V(\mu') + (1-\phi)V(\mu'')$, which is a contradiction to the convexity of $V$. This proves the "only if" part of the proposition.

Next, for fixed probability measures $\mu'$ and $\mu''$ on $\Theta$, suppose that there exists a *-policy $\pi^*$ such that from all initial priors $\mu$ in $C(\mu', \mu'')$, $\pi^*$ is an optimal *-policy from that prior. Then for all such $\mu$, $V(\mu) = L(\pi^*, \mu)$. The "if" part of the proposition then follows from the linearity of $L$ in $\mu$.
Lemma 5.3.1 (Jensen's Inequality). Let $S$ be a complete and separable metric space and let $q$ be any joint (Borel) probability on $\Omega = \Theta \times S$. Let $\mu$ and $\mu(\cdot \mid s)$ denote respectively the marginal on $\Theta$ and the conditional distribution of $q$ on $\Theta$. If $W$ is any continuous and convex function on $M$ (the set of finite signed measures on $\Theta$ endowed with the total variation norm) then $W(\mu) \le \int_S W(\mu(\cdot \mid s))\, dq$.

Proof of Lemma 5.3.1. This is a straightforward application of the version of Jensen's inequality typically found in mathematical analysis textbooks. (See Woodroofe (1982).)
Proof of Proposition 5.3. Let $\mu$, $a$, $a'$, $y$, and $y'$ be as in Proposition 5.3. Let $y^{*'}$ be as in the Definition 5.1 of sufficient experiments. Let $Q$ denote the joint probability distribution of $(y^{*'}, y, \theta)$ as defined in Section VI; and let $\mu(\cdot \mid y)$, $\mu(\cdot \mid y^{*'})$ and $\mu(\cdot \mid y^{*'}, y)$ denote the marginals on $\Theta$ of the probability $Q$ conditional on $y$, $y^{*'}$ and $\{y^{*'}, y\}$, respectively. Fix any $y^{*'} \in Y$, and consider an agent with initial prior $\mu(\cdot \mid y^{*'})$, obtained from the observation of $y^{*'}$ only. Suppose such an agent then observes $y$. The posterior distribution over $\Theta$ will be $\mu(\cdot \mid y^{*'}, y)$. Since the value function $V$ is continuous and convex (Proposition 3.2), an application of Jensen's inequality (Lemma 5.3.1) implies that

$$V(\mu(\cdot \mid y^{*'})) \le \int_Y V(\mu(\cdot \mid y^{*'}, y))\, dQ(\cdot \mid y^{*'}). \qquad (5.3.2)$$

Let LHS, RHS denote respectively the left and right hand side of (5.3.2). Since by construction the distribution of $y'$ conditional on $\theta$ is the same as that of $y^{*'}$ conditional on $\theta$, $\int (\mathrm{LHS})\, dQ = \int V(\mu(\cdot \mid y^{*'}))\, dQ = E[V(B(a', y', \mu))]$. Since conditional on $y$, $y^{*'}$ is independent of $\theta$, $\mu(\cdot \mid y^{*'}, y) = B(a, y, \mu)$, so $\int (\mathrm{RHS})\, dQ = E[V(B(a, y, \mu))]$. Hence (5.3.2) implies Proposition 5.3.
Proof of Proposition 6.1. If Proposition 5.3 holds with equality then (5.3.2) in the proof of Proposition 5.3 must hold with equality (for $Q$-a.e. $y^{*'}$). For any such $y^{*'}$, using arguments similar to those used in the proof of Proposition 4.1, it is easy to show that if the *-policy $\pi^*(y^{*'})$ is optimal for the initial prior $\mu(\cdot \mid y^{*'})$ then that *-policy is also optimal for the initial prior $\mu(\cdot \mid y^{*'}, y)$ for $Q(\cdot \mid y^{*'})$ almost every $y$. Since $\mu(\cdot \mid y^{*'}, y) = \mu(\cdot \mid y)$, this means that conditional on $y^{*'}$, the optimal action from $\mu(\cdot \mid y^{*'})$ is equal to the optimal action from $\mu(\cdot \mid y)$, for $Q(\cdot \mid y^{*'})$ a.e. $y$, from which part (a) follows. Next, let $C$ be any (measurable) subset of the action space $A$, and let $E_Q$ denote expectations with respect to $Q$. Then $Q(\{a_2 \in C\} \mid \theta) = E_Q[Q(\{a_2 \in C\} \mid y^{*'}, \theta) \mid \theta]$ (by iterated conditioning) $= E_Q[Q(\{a_2^{*'} \in C\} \mid y^{*'}, \theta) \mid \theta]$ (from part (a)) $= Q(\{a_2^{*'} \in C\} \mid \theta)$ (upon integration) $= Q(\{a_2' \in C\} \mid \theta)$ (since $y^{*'}$ has the same distribution as $y'$).
References

Billingsley, P.: Convergence of probability measures. New York: Wiley 1968
Blackwell, D.: The comparison of experiments. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press 1951
Dunford, N., Schwartz, J.: Linear operators. New York: Wiley 1957
Grossman, S., Kihlstrom, R., Mirman, L.: A Bayesian approach to the production of information and learning by doing. Rev. Econ. Stud. 44, 533-547 (1977)
Kiefer, N., Nyarko, Y.: Optimal control of an unknown linear process with learning. Int. Econ. Rev. 30, 571-586 (1989)
Kihlstrom, R.: A Bayesian exposition of Blackwell's theorem on the comparison of experiments. In: Boyer, M., Kihlstrom, R. (eds.) Bayesian models of economic theory. Amsterdam: Elsevier Science Publishers B.V. 1984
LeCam, L.: Sufficiency and approximate sufficiency. Ann. Math. Stat. 35, 1419-1455 (1964)
Nyarko, Y.: On the convergence of Bayesian posterior processes in linear economic models. J. Econ. Dynam. Control 15, 687-713 (1991)
Rothschild, M.: A two-armed bandit theory of market pricing. J. Econ. Theory 9, 185-202 (1974)
Woodroofe, M.: Sequential allocation with covariates. Sankhya, Ind. J. Stat. [Ser. A] 44, 403-414 (1982)