On the Convexity of the Value Function in Bayesian Optimal Control Problems
Author: Yaw Nyarko
Source: Economic Theory, Vol. 4, No. 2 (Mar., 1994), pp. 303-309
Published by: Springer
Stable URL: http://www.jstor.org/stable/25054763

Econ. Theory 4, 303-309 (1994)
© Springer-Verlag 1994

On the convexity of the value function in Bayesian optimal control problems*

Yaw Nyarko
Department of Economics, New York University, New York, NY 10003, USA

Received: July 16, 1991; revised version January 18, 1993

Summary. I study the question of the convexity of the value function and Blackwell's (1951) Theorem, and relate this to the uniqueness of optimal policies.
The main results conclude that strict convexity, and a strict inequality in Blackwell's Theorem, hold if and only if different optimal actions may be chosen from different priors.

I. Introduction

I study the question of the convexity of the value function and Blackwell's (1951) Theorem, and relate this to the uniqueness of optimal policies. The main results conclude that strict convexity, and a strict inequality in Blackwell's Theorem, hold if and only if different optimal actions may be chosen from different priors.

The principal purpose of this paper is to provide proofs of the above results that are simple and accessible to economists. Most of the proofs of these results in the literature either use special assumptions (e.g., finiteness of the set of signals and/or states) or, because of a greater interest in maximal generality, rely on arguments which make them inaccessible to many economists.

For further work on the convexity of the value function and Blackwell's Theorem see Kihlstrom (1984) or LeCam (1964). For applications in economics see Rothschild (1974), Grossman et al. (1977), Kiefer and Nyarko (1989), and Nyarko (1991).

II. The decision problem

An agent does not know the true value of a parameter θ in a parameter space Θ. The agent's (prior) beliefs about θ in the first period, date 1, are represented by the prior probability μ₀ over Θ. The agent must choose at each date t an action a_t in an action space A. The action results in an observation (or signal) y_t in the set Y, with probability distribution P(dy_t | a_t, θ), which depends upon both the action a_t and the parameter θ. The action and observation result in a date t utility of u(a_t, y_t).

* Financial support from the C. V. Starr Center Research Fund at New York University is gratefully acknowledged. I thank Professor Tara Vishwanathan, whose questioning resulted in this paper.
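Concretely, the signal distribution P(dy | a, θ) lets the agent revise beliefs about θ by Bayes' rule. A minimal sketch for a finite parameter space follows; the two-point parameter space, the likelihood numbers, and the function name `bayes_update` are invented for illustration and are not from the paper.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Bayes' rule for a finite parameter space:
    posterior(theta) is proportional to prior(theta) * P(y | a, theta).

    `prior` and `likelihood` are arrays indexed by the finitely many
    parameter values; both are stand-ins for the measures in the text.
    """
    joint = prior * likelihood
    return joint / joint.sum()

# Two-point parameter space; likelihood of the observed signal under each theta.
mu0 = np.array([0.5, 0.5])         # uniform prior over {theta_0, theta_1}
lik_y = np.array([0.8, 0.3])       # hypothetical P(y | a, theta) at the observed y
mu1 = bayes_update(mu0, lik_y)     # posterior after observing y
```

The signal is more likely under θ₀, so the posterior shifts mass toward θ₀.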
We suppose that the action and observation spaces A and Y, and the parameter space Θ, are complete separable metric spaces. Given any metric space S we let 𝒫(S) denote the set of all (Borel) probability measures on S. 𝒫(S) will, unless otherwise stated, be endowed with the topology of weak convergence (see Billingsley (1968) for more on this).

Suppose that at some date t the agent begins the period with beliefs about the parameter θ represented by the posterior probability μ_{t−1}. The agent uses the observation y_t resulting from the action a_t to update the posterior via Bayes' rule (i.e., the laws of conditional probability). We let B: A × Y × 𝒫(Θ) → 𝒫(Θ) be the Bayes' rule operator, so that for all t ≥ 1,

    μ_t = B(a_t, y_t, μ_{t−1}).    (2.1)

II.a. Histories and policies

A date n partial history is any sequence of observations, actions and posterior probabilities at all dates preceding n; i.e., h_n = (μ₀, {(a_t, y_t, μ_t)}_{t=1}^{n−1}) ∈ 𝒫(Θ) × Π_{t=1}^{n−1} [A × Y × 𝒫(Θ)] ≡ H_n. A policy is a sequence π = {π_t}_{t=1}^∞ where, for each t ≥ 1, the (measurable) function π_t: H_t → A specifies the date t action a_t = π_t(h_t) as a function of the partial history. We let D denote the set of all policies. Any policy π, given initial beliefs μ₀, generates a sequence of actions, observations and posterior beliefs {(a_t, y_t, μ_t)}_{t=1}^∞ via the conditional probability P(dy | θ, a) and the Bayesian updating rule (2.1). Any policy π results in an expected sum of discounted rewards with discount factor δ ∈ (0,1); we define

    V_π(μ₀) = E[ Σ_{t=1}^∞ δ^{t−1} u(a_t, y_t) | μ₀, π ].

We define the value function by V(μ) = sup_{π∈D} V_π(μ). A policy is optimal if it attains this supremum. We assume the existence of an optimal policy, and we assume the utility function is uniformly bounded, so our value function is well defined. (See Kiefer and Nyarko (1989) for conditions which ensure this and for other details.)

II.b. *-Policies

We define a date n *-partial history, h*_n, to be any sequence of past actions and observations (i.e., the same as a date n partial history but without a specification of the posterior beliefs); i.e., any h*_n = {(a_t, y_t)}_{t=1}^{n−1} ∈ Π_{t=1}^{n−1} [A × Y] ≡ H*_n.
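The discounted objective defining V_π can be sketched numerically. The discount factor δ = 0.9, the truncation horizon, and the constant unit reward stream below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Truncated discounted sum approximating V_pi = E[ sum_t delta^(t-1) u(a_t, y_t) ].
delta = 0.9
rewards = np.ones(500)                    # u(a_t, y_t) along one hypothetical path
t = np.arange(len(rewards))               # t = 0, 1, ... corresponds to t-1 in the text
v = float((delta ** t * rewards).sum())   # with u = 1 the infinite sum is 1/(1 - delta) = 10
```

With uniformly bounded utility, the truncation error is at most δ^T · sup|u| / (1 − δ), which is why the value function is well defined.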
A date n *-policy is any function π*_n: H*_n → A, and a *-policy is any collection of date n *-policies, π* = {π*_n}_{n=1}^∞. Let D* denote the set of all *-policies. At date one there is no history (i.e., h*_1 is a "null" history), so a date one *-policy π*_1 is identified with an action (i.e., π*_1 ∈ A). Any parameter value θ and *-policy π* generate a sequence of actions and observations {(a_n, y_n)}_{n=1}^∞, defined inductively via the following relations: a_1 = π*_1 and y_1 ~ P(dy | θ, a_1); and given the *-partial history h*_n, a_n = π*_n(h*_n) and y_n ~ P(dy | θ, a_n), where by the notation y_n ~ P we mean that y_n is a random variable with probability distribution given by P.

π* and θ also generate a conditional expected discounted sum of returns

    L(π*, θ) = E[ Σ_{n=1}^∞ δ^{n−1} u(a_n, y_n) | θ, π* ].    (2.2)

Recall that a date n policy as defined in section II.a is allowed to be a function of the history of posterior distributions. Hence every *-policy may also be considered a policy, but not vice versa. However, a fixed prior μ and policy π "induce" a *-policy as follows: define π*_1 = π_1(μ); for any n ≥ 2, any *-partial history h*_n will generate via Bayes' rule (2.1) a unique sequence of posteriors {μ_t}_{t=1}^{n−1} and hence a unique partial history h_n; so define π*_n(h*_n) = π_n(h_n). The constructed *-policy, π*, considered as a policy, will generate the same sequence of actions and observations as the policy π from the fixed initial prior μ. (Of course, a *-partial history, h*_n, is any member of H*_n, and in particular may not be consistent with Bayes' rule (2.1) from initial prior μ. Such *-partial histories have zero probability of being observed under the fixed prior μ and policy π, and at such *-partial histories, h*_n, we are free to define the corresponding date n *-policy, π*_n, arbitrarily.)

It should be fairly obvious from the above constructions that for an agent with a fixed prior μ, choosing a policy is equivalent to choosing a *-policy.
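The induction of a *-policy from a posterior-based policy can be sketched for a finite parameter space: replay the (action, signal) history through Bayes' rule to recover the posterior, then apply the original decision rule. The likelihoods, signal labels and action labels below are hypothetical.

```python
import numpy as np

lik = {"y0": np.array([0.8, 0.3]),   # hypothetical P(y | theta), theta in {0, 1}
       "y1": np.array([0.2, 0.7])}

def policy(mu):
    """A policy acting on the current posterior (a date-t decision rule)."""
    return "a0" if mu[0] >= 0.5 else "a1"

def star_policy(history, mu0):
    """Induced *-policy: acts on the (action, signal) history alone by
    replaying it through Bayes' rule (2.1) to recover the posterior."""
    mu = mu0
    for _action, y in history:
        mu = mu * lik[y]
        mu = mu / mu.sum()
    return policy(mu)
```

By construction `star_policy` chooses the same actions as `policy` along any history consistent with the prior, which is the equivalence claimed in the text.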
The argument is as follows: Every *-policy may be considered a policy, one which is independent of the posteriors. Hence choosing a policy (as opposed to a *-policy) cannot result in a lower sum of discounted payoffs. Further, for a fixed prior, we argued above that any policy induces a *-policy which results in the same distribution of actions and observations, and hence the same payoffs. So, for each fixed prior and policy, there exists a *-policy which attains the same sum of discounted payoffs. This shows the required equivalence.

Now fix any policy π and initial prior μ, and let π* be the *-policy induced by π and μ. We may therefore conclude that

    V_π(μ) = ∫ L(π*, θ) μ(dθ)    (2.3)

and

    V(μ) = sup_{π∈D} V_π(μ) = sup_{π*∈D*} ∫ L(π*, θ) μ(dθ).    (2.4)

Any *-policy that attains the supremum on the right-hand side of (2.4) shall be called an optimal *-policy for the given initial prior, μ.

III. The (weak) convexity of the value function

Let M be the vector space of all finite signed measures on the parameter space Θ, endowed with the total variation norm defined by ||μ|| = sup_A |μ(A)|, where the supremum is over measurable subsets A of Θ. For any *-policy, π*, and any μ ∈ M, we may define (with a slight abuse of notation - see (2.2)):

    L(π*, μ) = ∫ L(π*, θ) μ(dθ).    (3.1)

Note that the function L(π*, μ) depends on μ only directly, through the second argument, and not indirectly through π* (which is indeed why we work with *-policies as opposed to policies)! The lemma below is easy to check and implies Proposition 3.2, the convexity of the value function.

Lemma 3.1. (a) L(π*, μ) is linear in μ for fixed *-policy π*. (b) L(π*, μ) is continuous on M under the total variation norm, uniformly in π*; i.e., there exists K > 0 such that for all μ′ and μ″ in M, sup_{π*∈D*} |L(π*, μ′) − L(π*, μ″)| ≤ K ||μ′ − μ″||.

Proposition 3.2. The value function V is (a) continuous on M when M is endowed with the total variation norm and (b) convex on M.
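A toy finite-dimensional illustration of Lemma 3.1 and Proposition 3.2: with finitely many *-policies and a finite parameter space, each L(π*, ·) is a payoff vector, L(π*, μ) is linear in μ, and V = max over these linear functionals is convex. The payoff matrix and priors below are invented for illustration.

```python
import numpy as np

# Hypothetical payoff values L(pi*, theta): one row per *-policy,
# one column per parameter value.
L = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.6, 0.6]])

def V(mu):
    return float((L @ mu).max())   # supremum of linear functionals of mu

mu_a, mu_b = np.array([0.9, 0.1]), np.array([0.2, 0.8])
for lam in np.linspace(0.0, 1.0, 11):
    mix = lam * mu_a + (1.0 - lam) * mu_b
    # convexity of a max of linear functions:
    assert V(mix) <= lam * V(mu_a) + (1.0 - lam) * V(mu_b) + 1e-12
```

When the maximizing row differs at μ_a and μ_b, the inequality is strict at interior mixtures, previewing the strict-convexity result of section IV.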
Proof. The value function, V, can of course be defined on the set of finite signed measures M (as opposed to being defined only on the set of probability measures), e.g., by (2.3) and (2.4). From (2.4) and Lemma 3.1(b) it is easy to check that V is continuous. Further, it is easy to check that the supremum of an arbitrary collection of linear real-valued functions is convex. Hence from (2.4) and the linearity of L(π*, μ) in μ we conclude that the value function is convex.

IV. Strict convexity of the value function and uniqueness of optimal actions

We now show that the value function is NOT strictly convex between two priors if and only if all the priors in the convex hull of those two priors share a common optimal *-policy. In particular, if one can check that all the priors in the convex hull of two priors do NOT share a common date one optimal action, then the value function must be strictly convex between those two priors.

Proposition 4.1. Fix any μ′, μ″ ∈ M and λ ∈ (0,1). Denote their convex hull by C(μ′, μ″) = {μ ∈ M | ∃φ ∈ [0,1] s.t. μ = φμ′ + (1−φ)μ″}. Then V(λμ′ + (1−λ)μ″) = λV(μ′) + (1−λ)V(μ″) if and only if there exists a *-policy π* such that from all initial priors μ ∈ C(μ′, μ″), π* is an optimal *-policy from that prior μ.

V. The comparison of experiments

Any action and observation combination (a, y) ∈ A × Y provides information on the unknown parameter θ and hence may be referred to as an experiment on θ. We shall say that Experiment A is sufficient for Experiment B if the observation in Experiment B is in some sense the observation of Experiment A perturbed by noise. Formally we have:

Definition 5.1. Fix any actions a and a′ in A and let y and y′ denote the observations resulting from those actions. The experiment P(dy | a, θ) is sufficient for P(dy′ | a′, θ) if there exists a conditional probability M(dy′ | y) (of y′ conditional on y) such that M(dy′ | y)·P(dy | a, θ) = P(dy′ | a′, θ) for all θ in Θ; i.e., for any bounded (measurable) function g: Y → R and any θ ∈ Θ,

    ∫_Y g(y′) P(dy′ | a′, θ) = ∫_Y ∫_Y g(y′) M(dy′ | y) P(dy | a, θ).
In particular, if y*′ is the random variable generated via M(dy*′ | y)·P(dy | a, θ), then conditional on θ, y′ and y*′ have the same distribution.

Example 5.2. Suppose A and Θ are both subsets of the real line and y = θa + ε, with ε normally distributed with zero mean and unit variance. Fix any a and a′ ∈ A with |a| > |a′| > 0. Then one may show that P(dy | a, θ) is sufficient for P(dy′ | a′, θ), with y*′ = (ya′/a) + ε*, where ε* is independent of ε and normally distributed with mean zero and variance equal to [1 − (a′/a)²].

Proposition 5.3. (Blackwell's Theorem). If the experiment P(dy | a, θ) is sufficient for P(dy′ | a′, θ) then for all μ ∈ 𝒫(Θ),

    ∫_Θ ∫_Y V(B(a′, y′, μ)) P(dy′ | a′, θ) μ(dθ) ≤ ∫_Θ ∫_Y V(B(a, y, μ)) P(dy | a, θ) μ(dθ).

VI. The 'strict comparison' of experiments

We seek to determine when the inequality in Proposition 5.3 can be made strict. As one may guess from Proposition 4.1, the conclusion will be of the form that Proposition 5.3 holds with equality if and only if the actions from the two experiments are somehow the "same." Fix the initial prior μ and let P(dy | a, θ) be sufficient for P(dy′ | a′, θ). Let y*′ be as in the definition (5.1) of sufficient experiments. Let Q denote the joint probability distribution of y*′, y and θ generated by letting θ have distribution μ, letting y have distribution P(dy | a, θ), and letting y*′ have distribution M(dy*′ | y) (i.e., given any real-valued function f on Y × Y × Θ, ∫ f dQ = ∫_Θ ∫_Y ∫_Y f(y*′, y, θ) M(dy*′ | y) P(dy | a, θ) μ(dθ)).

Suppose that at date one, two initially identical agents with the same prior μ choose the actions a and a′ respectively. Upon making their observations, y and y′ respectively, they will choose date 2 actions given by a₂(y) and a′₂(y′) respectively. Let a₂*′(y*′) denote the optimal date 2 action of the agent who observes only y*′. Given any probability P on a metric space, let Supp P denote the support of P (i.e., the smallest closed set with P-probability one). Then we have:
Proposition 6.1. Suppose Proposition 5.3 holds with equality. Then, Q-a.e., (a) a₂ = a₂*′ (i.e., Q({a₂ = a₂*′} | y*′) = 1 for all y*′ outside of a set of Q-probability zero; i.e., for Q-a.e. y*′, a₂(y) = a₂*′(y*′) for all y ∈ Supp Q(·| y*′)); and (b) a₂ and a′₂ have the same distribution conditional on θ; i.e., if C is any (measurable) subset of the action space A then Q({a₂ ∈ C} | θ) = Q({a′₂ ∈ C} | θ) for all θ.

Remark. Suppose that the observations take values on the real line and that both the conditional probability M(dy*′ | y) and the marginal distribution on Y of Q admit strictly positive density functions, q(y, y*′) and q(y) respectively (as in Example 5.2). Then using Bayes' rule it is easy to see that the conditional distribution on Y of Q(·| y*′) has a strictly positive density function given by q(y, y*′)/q(y). Then {y ∈ Supp Q(dy | y*′)} = Y, and Proposition 6.1(a) concludes that there is a common action which is optimal from all observations y. Hence, if the utility function of the agent is such that different observations of y lead to different optimal actions, Proposition 5.3 must hold with strict inequality.

Appendix: The Proofs

Proof of Lemma 3.1. (a) Obvious. (b) Given any μ ∈ M define μ⁺ and μ⁻ to be, respectively, the non-negative measures forming the positive and negative parts of μ, so that μ = μ⁺ − μ⁻ (see Dunford and Schwartz (1957)). By assumption the utility function is uniformly bounded, so we may suppose without loss of generality that L(π*, θ) is non-negative for any *-policy π* and bounded above, uniformly in (π*, θ), by some K > 0. Then for any *-policy π* and any μ′, μ″ ∈ M,

    |L(π*, μ′) − L(π*, μ″)| = |L(π*, μ′ − μ″)| = |∫ L(π*, θ) d(μ′ − μ″)|
        = |∫ L(π*, θ) d(μ′ − μ″)⁺ − ∫ L(π*, θ) d(μ′ − μ″)⁻|
        ≤ K[(μ′ − μ″)⁺(Θ) + (μ′ − μ″)⁻(Θ)] ≤ 2K ||μ′ − μ″||,

which implies part (b) of the lemma.

Proof of Proposition 4.1. Suppose there exist μ′, μ″ ∈ 𝒫(Θ) and λ ∈ (0,1) such that V(λμ′ + (1−λ)μ″) = λV(μ′) + (1−λ)V(μ″). Let π* be any *-policy optimal from the initial prior λμ′ + (1−λ)μ″. Then V(μ′) ≥ L(π*, μ′) and V(μ″) ≥ L(π*, μ″).
Suppose, per absurdum, that from one of the priors μ′ or μ″ the *-policy π* is not optimal; then one of these inequalities holds strictly. Then

    V(λμ′ + (1−λ)μ″) = λV(μ′) + (1−λ)V(μ″) > λL(π*, μ′) + (1−λ)L(π*, μ″) = L(π*, λμ′ + (1−λ)μ″) = V(λμ′ + (1−λ)μ″),

a contradiction. Hence the *-policy π* is optimal from initial priors μ′ and μ″. Now fix any φ ∈ [0,1] and suppose that from the initial prior φμ′ + (1−φ)μ″, π* is not optimal. Then V(φμ′ + (1−φ)μ″) > L(π*, φμ′ + (1−φ)μ″) = φL(π*, μ′) + (1−φ)L(π*, μ″) = φV(μ′) + (1−φ)V(μ″), which is a contradiction to the convexity of V. This proves the "only if" part of the proposition.

Next, for fixed probability measures μ′ and μ″ on Θ, suppose that there exists a *-policy π* such that from all initial priors μ in C(μ′, μ″), π* is an optimal *-policy from that prior. Then for all such μ, V(μ) = L(π*, μ). The "if" part of the proposition then follows from the linearity of L in μ.

Lemma 5.3.1 (Jensen's Inequality). Let S be a complete and separable metric space and let q be any joint (Borel) probability on Ω = Θ × S. Let μ and μ(·| s) denote respectively the marginal and conditional distribution of q on Θ. If W is any continuous and convex function on M (the set of finite signed measures on Θ endowed with the total variation norm) then W(μ) ≤ ∫_S W(μ(·| s)) dq.

Proof of Lemma 5.3.1. This is a straightforward application of the version of Jensen's inequality typically found in mathematical analysis textbooks. (See Woodroofe (1982).)

Proof of Proposition 5.3. Let μ, a, a′, y, and y′ be as in Proposition 5.3. Let y*′ be as in the definition 5.1 of sufficient experiments. Let Q denote the joint probability distribution of (y*′, y, θ) as defined in section VI, and let μ(·| y), μ(·| y*′) and μ(·| y*′, y) denote the marginals on Θ of the probability Q conditional on y, y*′ and {y*′, y}, respectively. Fix any y*′ ∈ Y, and consider an agent with initial prior μ(·| y*′), a function of y*′ only. Suppose such an agent then observes y.
The posterior distribution over Θ obtained from the observation will be μ(·| y*′, y). Since the value function V is continuous and convex (Proposition 3.2), an application of Jensen's inequality (Lemma 5.3.1) implies that

    V(μ(·| y*′)) ≤ ∫_Y V(μ(·| y*′, y)) dQ(·| y*′).    (5.3.2)

Let LHS, RHS denote respectively the left and right hand sides of (5.3.2). Since by construction the distribution of y′ conditional on θ is the same as that of y*′ conditional on θ, ∫(LHS) dQ = ∫ V(μ(·| y*′)) dQ = E[V(B(a′, y′, μ))]. Since conditional on y, y*′ is independent of θ, μ(·| y*′, y) = μ(·| y) = B(a, y, μ), so ∫(RHS) dQ = E[V(B(a, y, μ))]. Hence (5.3.2) implies Proposition 5.3.

Proof of Proposition 6.1. If Proposition 5.3 holds with equality then (5.3.2) in the proof of Proposition 5.3 must hold with equality (for Q-a.e. y*′). For any such y*′, using arguments similar to those used in the proof of Proposition 4.1, it is easy to show that if the *-policy π*(y*′) is optimal for the initial prior μ(·| y*′) then that *-policy is also optimal for the initial prior μ(·| y*′, y) for Q(·| y*′)-almost every y. Since μ(·| y*′, y) = μ(·| y), this means that conditional on y*′, the optimal action from μ(·| y*′) is equal to the optimal action from μ(·| y), for Q(·| y*′)-a.e. y, from which part (a) follows. Next, let C be any (measurable) subset of the action space A, and let E_Q denote expectations with respect to Q. Then Q({a₂ ∈ C} | θ) = E_Q[Q({a₂ ∈ C} | y*′, θ) | θ] (by iterated conditioning) = E_Q[Q({a₂*′ ∈ C} | y*′, θ) | θ] (from part (a)) = Q({a₂*′ ∈ C} | θ) (upon integration) = Q({a′₂ ∈ C} | θ) (since y*′ has the same distribution as y′).

References

Billingsley, P.: Convergence of probability measures. New York: Wiley 1968
Blackwell, D.: The comparison of experiments. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press 1951
Dunford, N., Schwartz, J.: Linear operators. New York: Wiley 1957
Grossman, S., Kihlstrom, R., Mirman, L.: A Bayesian approach to the production of information and learning by doing. Rev. Econ. Stud. 44, 533-547 (1977)
Kiefer, N., Nyarko, Y.: Optimal control of an unknown linear process with learning. Int. Econ. Rev. 30, 571-586 (1989)
Kihlstrom, R.: A Bayesian exposition of Blackwell's theorem on the comparison of experiments. In: Boyer, M., Kihlstrom, R. (eds.) Bayesian models in economic theory. Amsterdam: Elsevier Science Publishers B.V. 1984
LeCam, L.: Sufficiency and approximate sufficiency. Ann. Math. Stat. 35, 1419-1455 (1964)
Nyarko, Y.: On the convergence of Bayesian posterior processes in linear economic models. J. Econ. Dynam. Control 15, 687-713 (1991)
Rothschild, M.: A two-armed bandit theory of market pricing. J. Econ. Theory 9, 185-202 (1974)
Woodroofe, M.: Sequential allocation with covariates. Sankhyā: The Indian Journal of Statistics, Ser. A, 44, 403-414 (1982)