Microeconomic Theory I
Terence Johnson
[email protected]
nd.edu/∼tjohns20/micro601.html
Updated 8/2014

Disclaimer: These notes have plenty of errors and misstatements, and are not intended to substitute for a legitimate textbook, like Simon and Blume or Chiang and Wainwright, or MWG and Varian. They are simply to make it easier to follow what is going on in class, and to help remind me what I'm supposed to teach from year to year. I am making them available to you as a courtesy.

Contents

I Mathematics for Economics

1 Introduction
  1.1 Economic Models
  1.2 Perfectly Competitive Markets
    1.2.1 Price-Taking Equilibrium
  1.3 Basic questions of economic analysis

2 Basic Math and Logic
  2.1 Set Theory
  2.2 Functions and Correspondences
  2.3 Real Numbers and Absolute Value
  2.4 Logic and Methods of Proof
    2.4.1 Examples of Proofs

II Optimization and Comparative Statics in R

3 Basics of R
    3.0.2 Maximization and Minimization
  3.1 Existence of Maxima/Maximizers
  3.2 Intervals and Sequences
  3.3 Continuity
  3.4 The Extreme Value Theorem
  3.5 Derivatives
    3.5.1 Non-differentiability
  3.6 Taylor Series
  3.7 Partial Derivatives
    3.7.1 Differentiation with Multiple Arguments and Chain Rules

4 Necessary and Sufficient Conditions for Maximization
  4.1 First-Order Necessary Conditions
  4.2 Second-Order Sufficient Conditions

5 The Implicit Function Theorem
  5.1 The Implicit Function Theorem
  5.2 Maximization and Comparative Statics

6 The Envelope Theorem
  6.1 The Envelope Theorem

7 Concavity and Convexity
  7.1 Concavity
  7.2 Convexity

III Optimization and Comparative Statics in RN

8 Basics of RN
  8.1 Intervals → Topology
  8.2 Continuity
  8.3 The Extreme Value Theorem
  8.4 Multivariate Calculus
  8.5 Taylor Polynomials in Rn
  8.6 Definite Matrices

9 Unconstrained Optimization
  9.1 First-Order Necessary Conditions
  9.2 Second-Order Sufficient Conditions
  9.3 Comparative Statics
  9.4 The Envelope Theorem

10 Equality Constrained Optimization Problems
  10.1 Two useful but sub-optimal approaches
    10.1.1 Using the implicit function theorem
    10.1.2 A more geometric approach
  10.2 First-Order Necessary Conditions
    10.2.1 The Geometry of Constrained Maximization
    10.2.2 The Lagrange Multiplier
    10.2.3 The Constraint Qualification
  10.3 Second-Order Sufficient Conditions
  10.4 Comparative Statics
  10.5 The Envelope Theorem

11 Inequality-Constrained Maximization Problems
  11.1 First-Order Necessary Conditions
    11.1.1 The sign of the KT multipliers
  11.2 Second-Order Sufficient Conditions and the IFT

12 Concavity and Quasi-Concavity
  12.1 Some Geometry of Convex Sets
  12.2 Concave Programming
  12.3 Quasi-Concave Programming
  12.4 Convexity and Quasi-Convexity
  12.5 Comparative Statics with Concavity and Convexity
  12.6 Convex Sets and Convex Functions

IV Classical Consumer and Firm Theory

13 Choice Theory
  13.1 Preference Relations
  13.2 Utility Functions
  13.3 Consistent Decision-Making and WARP

14 Consumer Behavior
  14.1 Consumer Choice
  14.2 Walrasian Demand
  14.3 The Law of Demand
  14.4 The Indirect Utility Function

15 Consumer Behavior, II
  15.1 The Expenditure Minimization Problem
  15.2 The Slutsky Equation and Roy's Identity
  15.3 Example

16 Welfare Analysis
  16.1 Taxes
  16.2 Consumer Surplus vs. CV and EV

17 Production
  17.1 Production Sets
  17.2 Profit Maximization
  17.3 The Law of Supply
  17.4 Cost Minimization
  17.5 A Cobb-Douglas Example
  17.6 Scale
  17.7 Efficient Production

18 Aggregation
  18.1 Firms
  18.2 Consumers

V Other Topics

19 Choice Under Uncertainty
  19.1 Probability and Random Variables
  19.2 Choice Theory with Uncertainty
  19.3 Risk Aversion
  19.4 Stochastic Orders and Comparative Statics
  19.5 Exercises

20 Informational Economics
  20.1 The Lemons Model
  20.2 Insurance Markets and Existence of Competitive Equilibria
  20.3 Market Intermediation

21 Signaling
  21.1 Educational Signaling
    21.1.1 Separating Equilibria
    21.1.2 Pooling Equilibria
    21.1.3 Least Cost Separating Equilibria
22 Monopolistic Screening
  22.1 A Simple Price Discrimination Model
  22.2 Price Discrimination with a Trade-off

23 Monotone Comparative Statics
    23.0.1 Single Crossing and the Monotonicity Theorem
  23.1 Lattices and Supermodularity

Part I
Mathematics for Economics

Chapter 1
Introduction

If you learn everything in this math handout — text and exercises — you will be able to work through most of MWG and pass comprehensive exams with reasonably high likelihood. The math presented here is, honestly, what I think you really, truly need to know. I think you need to know these things not just to pass comps, but so that you can competently skim an important Econometrica paper to learn a new estimator you need for your research, or pick up a book on solving non-linear equations numerically written by a non-economist and still be able to digest the main ideas. Tools like Taylor series come up in the asymptotic analysis of estimators, in proving necessity and sufficiency of conditions for a point to maximize a function, and in the theory of functional approximation, topics that touch every field from applied micro to empirical macro to theoretical micro. Economists use these tools all the time, and not learning them will handicap your ability to continue acquiring skills.

1.1 Economic Models

There are five ingredients to an economic model:

• Who are the agents? (Players)
• What can the agents decide to do, and what outcomes arise as a consequence of the agents' choices? (Actions)
• What are the agents' preferences over outcomes? (Payoffs)
• When do the agents make decisions? (Timing)
• What do the agents know when they decide how to act?
(Information)

Once the above ingredients have been fixed, the question then arises how agents will behave. Economics has taken the hard stance that agents will act deliberately to influence the outcome in their favor, or that each agent maximizes his own payoff. This does not mean that agents cannot care — either for selfish reasons, or per se — about the welfare or happiness of other agents in the economy. It simply means that, when faced with a decision, agents act deliberately to get the best outcome available to them according to their own preferences, and not as, say, a wild animal in the grip of terror (such a claim is controversial, at least in the social sciences). Once the model has been fixed, the rest of the analysis is largely an application of mathematical and statistical reasoning.

1.2 Perfectly Competitive Markets

The foundational economic model is the classical "Marshallian" or "partial equilibrium" market. This is the model you are probably most familiar with from undergraduate economics. In a price-taking market, there are two agents: a representative firm and a representative consumer.

The firm's goal is to maximize profits. It chooses how much of its product to make, q, taking as given the price, p, and its costs, C(q). Then the firm's profits are

    π(q) = pq − C(q)

The consumer's goal is to maximize his utility. The consumer chooses how much quantity to purchase, q, taking as given the price, p, and wealth, w. The consumer's utility function takes the form u(q, m), where m is the amount of money spent on goods other than q. The consumer's utility function is quasi-linear if u(q, m) = v(q) + m, where v(q) is positive, increasing, and has diminishing marginal benefit. Then the consumer is trying to solve

    max_{q,m} v(q) + m

subject to a budget constraint w = pq + m.

An outcome is an allocation (q*, m*) of goods giving the amount traded between the consumer and the firm, q*, and the amount of money the consumer spends on other goods, m*.
The allocation will be decided by finding the market-clearing price and quantity. The consumer is asked to report how much quantity he demands, q^D(p), at each price p, and the firm is asked how much quantity, q^S(p), it is willing to supply at each price p. The market clears where q^D(p*) = q^S(p*) = q*, which is the market-clearing quantity; the market-clearing price is p*. The market meets once, and all of the information above is known by all the agents. Then the model is

• Agents: The consumer and firm.
• Actions: The consumer reports a demand curve q^D(p), and the firm reports a supply curve q^S(p), both taking the price as given.
• Payoffs: The consumer and firm trade the market-clearing quantity at the market-clearing price, giving the consumer a payoff of u(q*, w − p*q*) and the firm a payoff of p*q* − C(q*).
• Information: The consumer and firm both know everything about the market.
• Timing: The market meets once, and the demand and supply curves are submitted simultaneously.

1.2.1 Price-Taking Equilibrium

Let's make further assumptions so that the model is extremely easy to solve. First, let C(q) = (c/2)q², and let v(q) = b log(q).

The Supply Side

Since firms are maximizing profits, their objective is

    max_q pq − (c/2)q²

The first-order necessary condition for the firm is

    p − cq^S = 0

and the second-order sufficient condition for the firm is

    −c < 0

which is automatically satisfied. Solving the FONC gives the supply curve,

    q^S(p) = p/c

This expresses the firm's willingness to produce the good as a function of its costs, parametrized by c, and the price it receives, p.

The Demand Side

Consumers act to maximize utility subject to their budget constraint, or

    max_q b log(q) + m

subject to w = pq + m.
To handle the constraint, re-write it as m = w − pq and substitute it into the objective function to get

    max_q b log(q) + w − pq

Then the first-order necessary condition for the consumer is

    b/q^D − p = 0

and the second-order sufficient condition for the consumer is

    −b/(q^D)² < 0

which is automatically satisfied. Solving the FONC gives the demand curve,

    q^D(p) = b/p

This expresses how much quantity the consumer would like to purchase as a function of his preferences, parametrized by b, and the price, p.

Market-Clearing Equilibrium

In equilibrium, supply equals demand, and the market-clearing price p* and market-clearing quantity q* are determined by

    q^D(p*) = q^S(p*) = q*

or

    b/p* = q* = p*/c

Solving for p* yields

    p* = √(bc)

and

    q* = √b/√c

Market Equilibrium

Notice that the variables of interest — p* and q* — are expressed completely in terms of the parameters of the model that are outside of any agent's control, b and c. We can now vary b and c freely to see how the equilibrium price and quantity vary, and study how changes in tastes or technology will change behavior.

1.3 Basic questions of economic analysis

After building a model, two mathematical questions naturally arise:

• How do we know solutions exist to the agents' maximization problems? (the Weierstrass theorem)
• How do we find solutions? (first-order necessary conditions, second-order sufficient conditions)

as well as two economic questions:

• How does an agent's behavior respond to changes in the economic environment? (the implicit function theorem)
• How does an agent's payoff respond to changes in the economic environment? (the envelope theorem)

In Micro I, we learn in detail the nuts and bolts of solving agents' problems in isolation (optimization theory). In Micro II, you learn — through general equilibrium and game theory — how to solve for the behavior when many agents are interacting at once (equilibrium and fixed-point theory).
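The closed-form equilibrium of Section 1.2.1 is easy to sanity-check numerically. A minimal sketch (the parameter values b = 2 and c = 0.5 are arbitrary illustrative choices, not from the notes):

```python
from math import sqrt, isclose

# Illustrative parameters (arbitrary choices): taste b, cost c.
b, c = 2.0, 0.5

def supply(p):
    """Firm's supply curve, from the FONC p - c*q = 0."""
    return p / c

def demand(p):
    """Consumer's demand curve, from the FONC b/q - p = 0."""
    return b / p

# Closed-form market-clearing price and quantity.
p_star = sqrt(b * c)
q_star = sqrt(b / c)

# The market clears: supply equals demand at p*.
assert isclose(supply(p_star), demand(p_star))
assert isclose(supply(p_star), q_star)
```

Raising b shifts demand out and raises both p* and q*, while raising c shifts supply in, raising p* and lowering q*, exactly as the square-root formulas indicate.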
The basic methodology for solving problems in Micro I is:

• Check that a solution exists to the problem using the Weierstrass theorem
• Build a candidate list of potential solutions using the appropriate first-order necessary conditions
• Find the global maximizer by comparing candidate solutions directly by computing their payoffs, or use the appropriate second-order sufficient conditions to verify which are maxima
• Study how an agent's behavior changes when economic circumstances change using the implicit function theorem
• Study how an agent's payoff changes when economic circumstances change using the envelope theorem

Actually, the above is a little misleading. There are many versions of the Weierstrass theorem, the FONCs, the SOSCs, the IFT, and the envelope theorem. The appropriate version depends on the kind of problem you face: Is there a single choice variable or many? Are there equality or inequality constraints? And so on. The entire class — including the parts of Mas-Colell, Whinston, and Green that we cover — is nothing more than the application of the above algorithm to specific problems. At the end of the course, it is imperative that you understand the above algorithm and know how to implement it to pass the comprehensive exams.

We will start by studying the algorithm in detail for one-dimensional maximization problems in which agents only choose one variable. Then we will generalize it to multi-dimensional maximization problems in which agents choose many things at once.

Exercises

The exercises refer to the simple partial equilibrium model of Section 1.2.

1. [Basics] As b and c change, how do the supply and demand curves shift? How do the equilibrium price and quantity change? Sketch graphs as well as compute derivatives.

2. [Taxes] Suppose there is a tax t on the good q. A portion µ of t is paid by the consumer for each unit purchased, and a portion 1 − µ of t is paid by the firm for each unit sold. How does the consumer's demand function depend on µ?
How does the firm's supply function depend on µ? How do the equilibrium price and quantity p* and q* depend on µ? If t goes up, how are the market-clearing price and quantity affected? Sketch a graph of the equilibrium in this market.

3. [Long-run equilibrium] (i) Suppose that there are K firms in the industry, each with short-run cost function C(q) = (c/2)q², so that each firm produces q, but aggregate supply is Kq = Q. Consumers maximize b log(Q) + m subject to w = pQ + m. Solve for the market-clearing price and quantity for each K and the short-run profits of the firms. (ii) Suppose firms have to pay a fixed cost F to enter the industry. Find the equilibrium number of firms in the long run, K*, if there is entry as long as profits are strictly positive. How does K* vary in F, b, and c? How do the market-clearing price and quantity vary in the long run with F?

4. [Firm cost structure] Suppose a price-taking firm's costs are C(q) = c(q) + F, where c(0) = 0, c′(0) > 0 and c′′(0) > 0, and F is a fixed cost. (i) Show that marginal cost C′(q) intersects average cost C(q)/q at the minimum of C(q)/q. This is the efficient scale. (ii) How does the firm's optimal choice of q depend on F?

5. [Monopoly] Suppose now that the consumer's utility function is 2b√q + m and the budget constraint is w = pq + m. Suppose there is a single firm with cost function C(q) = cq. (i) Derive the demand curve q^D(p) and the inverse demand curve p^D(q). (ii) If the monopolist recognizes its influence on the market-clearing price and quantity, it will maximize

    max_q p^D(q)q − cq

or

    max_p q^D(p)(p − c)

Show that the solutions to these problems are the same. If total revenue is pq^D(p), show that its derivative, the marginal revenue curve, lies below the inverse demand curve, and compare the monopolist's FONC with a price-taking firm's FONC in the same market.

6.
[Efficiency] A benevolent, utilitarian social planner (or well-meaning government) would choose the market-clearing quantity to maximize the sum of the two agents' payoffs, or

    max_{p,q} (v(q) + w − pq) + (pq − C(q))

Show that this outcome is the same as that selected by the perfectly competitive market. Conclude that a competitive equilibrium achieves the same outcome that a benevolent government would pick. Give an argument for why a government trying to intervene in a decentralized market would then probably achieve a worse outcome. Give an argument for why a decentralized market would probably achieve a worse outcome than a well-meaning government. Show that the allocation selected by the decentralized market and the utilitarian social planner is not the allocation selected by a monopolist. Sketch a graph of the situation. Congratulations, you are now qualified to be a micro TA!

Chapter 2
Basic Math and Logic

These are basic definitions that appear in all of mathematics and modern economics, but reviewing them doesn't hurt.

2.1 Set Theory

Definition 2.1.1 A set A is a collection of elements. If x is an element or member of A, then x ∈ A. If x is not in A, then x ∉ A. If all the elements of A and B are the same, A = B. The set with no elements is called the empty set, ∅.

We often enumerate sets by collecting the elements between braces, as

    F = {Apples, Bananas, Pears}
    F′ = {Cowboys, Bears, Chargers, Colts}

We can build up and break down sets as follows:

Definition 2.1.2 Let A, B, and C be sets.

• A is a subset of B if all the elements of A are elements of B, written A ⊆ B; if, in addition, there exists an element x ∈ B with x ∉ A, we can also write A ⊂ B.
• If A ⊆ B and B ⊆ A, then A = B.
• The set of all elements in either A or B is the union of A and B, written A ∪ B.
• The set of all elements in both A and B is the intersection of A and B, written A ∩ B. If A ∩ B = ∅, then A and B are disjoint.
• Suppose A ⊂ B.
The set of elements in B but not in A is the complement of A in B, written A^c.

These precisely define all the normal set operations like union and intersection, and give exact meaning to the symbol A = B. It's easy to see that operations like union and intersection are associative (check) and commutative (check). But how does taking the complement of a union or intersection behave?

Theorem 2.1.3 (DeMorgan's Laws) For any sets A and B that are both subsets of X,

    (A ∪ B)^c = A^c ∩ B^c

and

    (A ∩ B)^c = A^c ∪ B^c,

where the complement is taken with respect to X.

Proof Suppose x ∈ (A ∪ B)^c. Then x is not in the union of A and B, so x is in neither A nor B. Therefore, x must be in the complement of both A and B, so x ∈ A^c ∩ B^c. This shows that (A ∪ B)^c ⊂ A^c ∩ B^c. Suppose x ∈ A^c ∩ B^c. Then x is contained in the complement of both A and B, so it is not a member of either set, so it is not a member of the union A ∪ B; that implies x ∈ (A ∪ B)^c. This shows that (A ∪ B)^c ⊃ A^c ∩ B^c. Since each set contains the other, they are equal. The proof of the second law is similar.

If you ever find yourself confused by a complicated set relation ((A ∪ B ∩ C)^c ∪ D ...), draw a Venn diagram and try writing in words what is going on.

We are interested in defining relationships between sets. The easiest example is a function, f : X → Y, which assigns a unique element of Y to each element of X. However, in economics we will have to study some more exotic objects, so it pays to start slowly in defining functions, correspondences, and relations.

Definition 2.1.4 The (Cartesian) product of two non-empty sets X and Y is the set of all ordered pairs (x, y) with x ∈ X and y ∈ Y. The product of two sets X and Y is often written X × Y. If it's just one set copied over and over, X × X × ... × X = X^n, and an element would be an ordered n-tuple (x1, x2, ..., xn). For example, R × R = R² is the product of the real numbers with itself, which is just the plane. If we have a "set of sets", {Xi}_{i=1}^N, we often write ×_{i=1}^N Xi or ∏_{i=1}^N Xi.
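DeMorgan's laws can be checked element by element on small finite sets; a quick sketch (the universe X and the sets A, B are arbitrary choices for illustration):

```python
# An arbitrary universe and two subsets of it.
X = frozenset(range(10))
A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})

def complement(S, universe=X):
    """Complement of S, taken with respect to the universe X."""
    return universe - S

# (A ∪ B)^c = A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)

# (A ∩ B)^c = A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
```

Of course, checking one example is not a proof; the element-chasing argument above is what establishes the laws for all sets.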
Example Suppose two agents, A and B, are playing the following game: If A and B both choose "heads" (HH), or both choose "tails" (TT), agent B pays agent A a dollar. If one player chooses heads and the other chooses tails (HT or TH), agent A pays agent B a dollar. Then A and B both have the strategy sets S_A = S_B = {H, T}, and an outcome is an ordered pair (s_A, s_B) ∈ S_A × S_B. This game is called matching pennies.

Besides a Cartesian product, we can also take a space and investigate all of its subsets:

Definition 2.1.5 The power set of A is the set of all subsets of A, often written 2^A, since it has 2^|A| members.

For example, take A = {a, b}. Then the power set of A is ∅, {a}, {b}, and {a, b}.

2.2 Functions and Correspondences

Definition 2.2.1 Let X and Y be sets. A function is a rule that assigns a single element y ∈ Y to each element x ∈ X; this is written y = f(x) or f : X → Y. The set X is called the domain, and the set Y is called the range. Let A ⊂ X; then the image of A is the set f(A) = {y : y = f(x) for some x ∈ A}.

Definition 2.2.2 A function f : X → Y is injective or one-to-one if for every x1, x2 ∈ X, x1 ≠ x2 implies f(x1) ≠ f(x2). A function f : X → Y is surjective or onto if for all y ∈ Y, there exists an x ∈ X such that f(x) = y.

Definition 2.2.3 A function is invertible if for every y ∈ Y, there is a unique x ∈ X for which y = f(x).

Note that the definition of invertible is extremely precise: For every y ∈ Y, there is a unique x ∈ X for which y = f(x). Do not confuse whether or not a function is invertible with the following idea:

Definition 2.2.4 Let f : X → Y, and I ⊂ Y. Then the inverse image of I under f is the set of all x ∈ X such that f(x) = y and y ∈ I.

Example On [0, ∞), the function f(x) = x² is invertible, since x = √y is the unique inverse element. Then for any set I ⊂ [0, ∞), the inverse image is f⁻¹(I) = {x : x = √y, y ∈ I}. On [−1, 1], f(x) = x² has the image set I = [0, 1].
For any y ≠ 0, y ∈ I, we can solve for x = ±√y, so that x² is not invertible on [−1, 1]. However, the inverse image f⁻¹([0, 1]) is [−1, 1], by the same reasoning.

There is an important generalization of a function called a correspondence:

Definition 2.2.5 Let X and Y be sets. A correspondence is a rule that assigns a subset F(x) ⊆ Y to each x ∈ X. Alternatively, a correspondence is a rule that maps X into the power set of Y, F : X → 2^Y. (For reasons beyond our current purposes, there are advantages and disadvantages to each approach.)

So a correspondence is just a "function with set-valued images": the restriction on functions that f(x) = y, with y a single element, is being relaxed, so that F(x) = Z, where Z is a set containing elements of Y.

Example Consider the equation x² = c. We can think of the correspondence X(c) as the set of (real) solutions to the equation. For c < 0, the equation cannot be solved, because x² ≥ 0, so X(c) is empty for c < 0. For c = 0, there is exactly one solution, X(c) = {0}, and X(c) happens to be a "function at that point". But for c > 0, X(c) = {√c, −√c}, since (−1)² = 1: here there is a non-trivial correspondence, where the image is a set with multiple elements.

Example Suppose two agents are playing "matching pennies". Their payoffs can be succinctly enumerated in a table, with the row player A's payoff listed first in each cell:

                    B
              H          T
    A   H   1, −1     −1, 1
        T   −1, 1      1, −1

Suppose agent B uses a mixed strategy and plays randomly, so that σ = pr[Column uses H]. Then A's expected utility from using H is

    σ(1) + (1 − σ)(−1)

while the row player's expected utility from using T is

    σ(−1) + (1 − σ)(1)

We ask, "When is it an optimal strategy for the row player to use H against the column player's mixed strategy σ?", or, "What is the row player's best response to σ?" Well, H is strictly better than T if

    σ(1) + (1 − σ)(−1) > σ(−1) + (1 − σ)(1)  →  σ > 1/2
and H is strictly worse than T if

    σ(−1) + (1 − σ)(1) > σ(1) + (1 − σ)(−1)  →  σ < 1/2

But when σ = 1/2, the row player is exactly indifferent between H and T. Indeed, the row player can himself randomize over any mix between H and T and get exactly the same payoff. Therefore, the row player's best-response correspondence is

    pr[Row uses H | σ] =
        1        if σ > 1/2
        0        if σ < 1/2
        [0, 1]   if σ = 1/2

So correspondences arise naturally (and frequently!) in game theory.

Example Consider the maximization problem

    max_{x1,x2} x1 + x2

subject to the constraints x1, x2 ≥ 0 and x1 + px2 ≤ 1. Since x1 and x2 are perfect substitutes, the solution is intuitively to pick the cheapest good and buy only that one. However, in the case when p = 1, the objective and the constraint lie on top of one another, and any pair with x1* + x2* = 1 is a solution. The solution to the maximization problem then is

    (x1*(p), x2*(p)) =
        (0, 1/p)    if p < 1
        (1, 0)      if p > 1
        (z, 1 − z)  for any z ∈ [0, 1], if p = 1

Therefore, we have a correspondence, where the optimal solution in the case when p = 1 is set-valued: (z, 1 − z) for z ∈ [0, 1]. So correspondences arise naturally in optimization theory.

It turns out that correspondences are very common in microeconomics, even though they aren't usually studied in calculus or undergraduate math courses.

2.3 Real Numbers and Absolute Value

The set of numbers we usually work in, (−∞, ∞), is called the real numbers, or R. The symbols −∞ and ∞ are not in R (they are not even really numbers), but represent the idea of an unbounded process, like 1, 2, 3, .... The features of R that make it attractive follow from the intuitive idea that R is a continuum, with no breaks. This is unlike the integers, Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}, which have spaces between each number.
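The best-response correspondence from the matching-pennies example above can be written out directly. A small sketch (the function name and the list encoding of the optimal set — its endpoints only — are my own conventions, not from the notes):

```python
def best_response(sigma):
    """Row player's set of optimal probabilities of playing H,
    given that Column plays H with probability sigma.
    Returns the endpoints of the optimal set."""
    eu_H = sigma * 1 + (1 - sigma) * (-1)   # expected payoff of H
    eu_T = sigma * (-1) + (1 - sigma) * 1   # expected payoff of T
    if eu_H > eu_T:
        return [1.0]          # play H for sure
    elif eu_H < eu_T:
        return [0.0]          # play T for sure
    else:
        return [0.0, 1.0]     # indifferent: any mix in [0, 1] is optimal
```

At σ = 1/2 the return value describes the whole interval [0, 1], which is exactly the set-valued image that makes this a correspondence rather than a function.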
Absolute value is given by

    |x| =
        −x   if x < 0
        0    if x = 0
        x    if x > 0

The most important feature of absolute value is that it satisfies the triangle inequality,

    |x + y| ≤ |x| + |y|

And in particular,

    |x − y| ≤ |x| + |y|

2.4 Logic and Methods of Proof

It really helps to get a sense of what is going on in this section, and to return to it a few times over the course. Thinking logically and finding the most elegant (read as, "slick") way to prove something is a skill that is developed. It is a talent, like composing or improvising music or athletic ability, in that you begin with a certain stock of potential to which you can add by working hard and being alert and careful when you see how other people do things. Even if you want to be an applied micro or macro economist, you will someday need to make logical arguments that go beyond taking a few derivatives or citing someone else's work, and it will be easier if you remember the basic nuts and bolts of what it means to "prove" something.

Propositional Logic

Definition 2.4.1 A proposition is a statement that is either true or false.

For example, some propositions are

• "The number e is transcendental"
• "Every Nash equilibrium is Pareto optimal"
• "Minimum wage laws cause unemployment"

These statements are either true or false. If you think minimum wage laws only sometimes cause unemployment, then you are simply asserting the third proposition is false, although a similar statement that is more tentative might be true. The following are not propositions:

• "What time is it?"
• "This statement is false"

The first sentence is simply neither true nor false. The second sentence is such that if it were true it would be false, and if it were false it would be true. (Note that this sentence is not a proposition not because it "contradicts" the definition of a proposition, but because it fails to satisfy the definition.) Consequently, you can arrange symbols in ways that do not amount to propositions, so the definition is not meaningless.

Our goal in theory is usually to establish that "If P is true, then Q must be true", or "P → Q", or "Q is a logical implication of P". We begin with a set of conditions that take the form of propositions, and show that these propositions collectively imply the truth of another proposition.

Since any proposition P is either true or false, we can consider the proposition's negation: the proposition that is true whenever P is false, and false whenever P is true. We write the negation of P in symbols as ¬P, read "not P". For example, "the number e is not transcendental", "some Nash equilibria are not Pareto optimal", and "minimum wage laws do not cause unemployment". Note that when negating a proposition, you have to take some care, but we'll get to more on that later.

Consider the following variations on "P → Q":

• The Converse: Q → P
• The Contrapositive: ¬Q → ¬P
• The Inverse: ¬P → ¬Q
We begin with a set of conditions that take the form of propositions, and show that these propositions collectively imply the truth of another proposition. Since any proposition P is either true or false, we can consider the proposition's negation: the proposition that is true whenever P is false, and false whenever P is true. We write the negation of P in symbols as ¬P, read "not P". For example, "the number e is not transcendental", "some Nash equilibria are not Pareto optimal", and "minimum wage laws do not cause unemployment". Note that when negating a proposition, you have to take some care, but we'll get to more on that later. Consider the following variations on "P → Q":

• The Converse: Q → P
• The Contrapositive: ¬Q → ¬P
• The Inverse: ¬P → ¬Q

2 Note that this sentence is not a proposition not because it "contradicts" the definition of a proposition, but because it fails to satisfy the definition.

The converse is usually false, but sometimes true. For example, a differentiable function is continuous ("differentiable → continuous" is true), but there are continuous functions that are not differentiable ("continuous → differentiable" is false). The inverse, likewise, is usually false: non-differentiable functions need not be discontinuous, since many functions with kinks, like |x|, are continuous but non-differentiable. The contrapositive, however, is always true if P → Q is true, and always false if P → Q is false. For example, "a discontinuous function cannot be differentiable" is a true statement. Another way of saying this is that a claim P → Q and its contrapositive have the same truth value (true or false). Indeed, it is often easier to prove the contrapositive than to prove the original claim. While the above paragraph shows that the converse and inverse need not be true if the original claim P → Q is true, we should show more concretely that the contrapositive is true. Why is this?
Well, if P → Q, then whenever P is true, Q must be true also, which we can represent in a table:

P | Q | P → Q
T | T |   T
T | F |   F
F | T |   T
F | F |   T

So the first two columns say what happens when P and Q are true or false. If P and Q are both true, then of course P → Q is true. But if P and Q are both false, then P → Q is also a (vacuously) true statement. The proposition "P → Q" is actually only false when P is true but Q is false, which is the second row. Now, as for the contrapositive, the truth table is:

P | Q | ¬Q → ¬P
T | T |    T
T | F |    F
F | T |    T
F | F |    T

By the same kind of reasoning, when P and Q are both true or both false, ¬Q → ¬P is true. In the case where P is true but Q is false, ¬Q is true but ¬P is false, so ¬Q → ¬P is false. Thus, we end up with exactly the same truth values for the contrapositive as for the original claim, so they are equivalent. To see whether you understand this, make a truth table for the inverse and converse, and compare them to the truth table for the original claim.

Quantifiers and negation

Once you begin manipulating the notation and ideas, you must inevitably negate a statement like, "All bears are brown." This proposition has a quantifier in it, namely the word "all". In this situation, we want to assert that "All bears are brown" is false; there are, after all, giant pandas that are black and white and not brown at all. So the right negation is something like, "Some bears are not brown". But how do we get "some" from "all" and "not brown" from "brown" in more general situations without making mistakes?

Definition 2.4.2 Given a set A, the set of elements S that share some property P is written S = {x ∈ A : P (x)} (e.g., S1 = {x ∈ R : 2x > 4} = {x ∈ R : x > 2}).

Definition 2.4.3 In a set A, if all the elements x share some property P, we write ∀x ∈ A : P (x). If there exists an element x ∈ A that satisfies a property P, then write ∃x ∈ A : P (x). The symbol ∀ is the universal quantifier and the symbol ∃ is the existential quantifier.
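These definitions translate directly into code for finite sets: a list comprehension plays the role of {x ∈ A : P (x)}, and Python's all() and any() play the roles of ∀ and ∃. A minimal sketch (the finite set A below is just an illustrative stand-in, not anything from the text):

```python
# Finite "universe" A and the property P(x): 2x > 4
A = range(-10, 11)
P = lambda x: 2 * x > 4

# S = {x in A : P(x)} via a comprehension
S = [x for x in A if P(x)]
print(S)  # [3, 4, 5, 6, 7, 8, 9, 10]

# Quantifiers: "for all x in A, P(x)" and "there exists x in A with P(x)"
print(all(P(x) for x in A))  # False: e.g., x = 0 fails
print(any(P(x) for x in A))  # True: e.g., x = 5 works

# Negation rules: not-(forall P) is exists-(not P), and vice versa
assert (not all(P(x) for x in A)) == any(not P(x) for x in A)
assert (not any(P(x) for x in A)) == all(not P(x) for x in A)
```

The two assertions at the end are finite-set checks of the quantifier-negation rules discussed next.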
So a claim might be made,

• All allocations of goods in a competitive equilibrium of an economy are Pareto optimal.

Then we are implicitly saying: Let A be the set of all allocations generated by a competitive equilibrium, and let P (a) denote the claim that a is a "Pareto optimal allocation", whatever that is. Then we might say, "If a ∈ A, P (a)", or

• ∀a ∈ A : P (a)

Negating propositions with quantifiers can be very tricky. For example, consider the claim: "Everybody loves somebody sometime." We have three quantifiers, and it is unclear what order they should go in. The negation of this statement becomes a non-trivial problem precisely because of how many quantifiers are involved: Should the negation be, "Everyone hates someone sometime?" Or "Someone hates everyone all the time"? It requires care to get these details right. Recall the definition of negation:

Definition 2.4.4 Given a statement P (x), the negation of P (x) asserts that P (x) is false for x. The negation of a statement P (x) is written ¬P (x).

The rules for negating statements are

¬(∀x ∈ A : P (x)) = ∃x ∈ A : ¬P (x)

and

¬(∃x ∈ A : P (x)) = ∀x ∈ A : ¬P (x)

For example, the claim "All allocations of goods in a competitive equilibrium of an economy are Pareto optimal" could be written in the above form as follows: Let A be the set of allocations achievable in a competitive equilibrium of an economy (whatever that is), let a ∈ A be a particular allocation, and let P (a) be the proposition that the allocation is Pareto optimal (whatever that is). Then the claim is equivalent to the statement ∀a ∈ A : P (a). The negation of that statement, then, is

• ∃a ∈ A : ¬P (a)

or in words, "There exists an allocation of goods in some competitive equilibrium of an economy that is not Pareto optimal." Be careful about "and" and "or" statements. In propositional logic, "P → Q1 or Q2" means, "P logically implies Q1, or Q2, or both". The negation of "Q1 or Q2" is "not Q1 and not Q2", so the negation of the whole claim is "P is true, but neither Q1 nor Q2 holds".
For example, "All U.S. Presidents were U.S. citizens and older than 35 when they took office" is negated to "Some U.S. President was not a U.S. citizen, or was not older than 35, or both, when they took office." Some examples:

• "A given strategy profile is a Nash equilibrium if there is no player with a profitable deviation." Negating this statement gives, "A given strategy profile is not a Nash equilibrium if there exists a player with a profitable deviation."

• "A given allocation is Pareto efficient if there is no agent who can be made strictly better off without making some other agent worse off." Negating this statement gives, "A given allocation is not Pareto efficient if there exists an agent who can be made strictly better off without making any other agent worse off."

• "A continuous function with a closed, bounded subset of Euclidean space as its domain always achieves its maximum." Negating this statement gives, "There exists a continuous function with a closed, bounded subset of Euclidean space as its domain that does not achieve its maximum."

Obviously, we are not logicians, so our relatively loose statements of ideas will have to be negated with some care.

2.4.1 Examples of Proofs

Most proofs use one of the following approaches:

• Direct proof: Show that P logically implies Q
• Proof by Contrapositive: Show that ¬Q logically implies ¬P
• Proof by Contradiction: Assume P and ¬Q, and show this leads to a logical contradiction
• Proof by Induction: Suppose we want to show that for any natural number n = 1, 2, 3, ..., P (n) → Q(n). A proof by induction shows that P (1) → Q(1) (the base case), and that for any n, if P (n) → Q(n) (the induction hypothesis), then P (n + 1) → Q(n + 1). Consequently, P (n) → Q(n) is true for all n.
• "Disproof by Counterexample": Let ∀x ∈ A : P (x) be a statement we wish to disprove. Show that ∃x′ ∈ A such that P (x′) is false.
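The truth-table facts from the previous subsection — a claim is equivalent to its contrapositive, but not to its converse or inverse — can be checked mechanically by enumerating every truth assignment. A small sketch:

```python
from itertools import product

# Truth-functional implication: P -> Q is false only when P is true and Q is false
implies = lambda p, q: (not p) or q

# All four assignments of truth values to (P, Q)
rows = list(product([True, False], repeat=2))

claim          = [implies(p, q) for p, q in rows]
contrapositive = [implies(not q, not p) for p, q in rows]
converse       = [implies(q, p) for p, q in rows]
inverse        = [implies(not p, not q) for p, q in rows]

print(claim == contrapositive)  # True: same truth value in every row
print(claim == converse)        # False: they disagree when P and Q differ
print(converse == inverse)      # True: the converse and inverse match each other
```

The last line is the exercise suggested in the text: the converse and inverse are contrapositives of one another, so their truth tables coincide.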
The claim might be fairly complicated, like

• If "f (x) is a bounded, increasing function on [a, b]", then "the set of points of discontinuity of f () is a countable set".

The "P" is (f (x) is a bounded, increasing function on [a, b]), and the "Q" is (the set of points of discontinuity of f (x) is a countable set). To prove this, we could start by using P to show Q must be true. Or, in a proof by contrapositive, we could prove that if the function has an uncountable number of discontinuities on [a, b], then f (x) is either unbounded, or decreasing, or both. Or, in a proof by contradiction, we could assume that the function is bounded and increasing, but that the function has an uncountable number of discontinuities. The challenge for us is to make sure that while we are using mathematical sentences and definitions, no logical mistakes are made in our arguments. Each method of proof has its own advantages and disadvantages, so we'll do an example of each kind now.

A Proof by Contradiction

Theorem 2.4.5 There is no largest prime.

Proof Suppose there was a largest prime number, n. Let p = n ∗ (n − 1) ∗ ... ∗ 1 + 1. Then p leaves remainder 1 when divided by any of the numbers 2, ..., n, so every prime factor of p must be greater than n. Since p > 1, it has at least one prime factor, so there is a prime larger than n — this is a contradiction, since n was assumed to be the largest prime. Therefore, there is no largest prime.

A Proof by Contrapositive

Suppose we have two disjoint sets, M and W with |M| = |W|, and are looking for a matching of elements in M to elements in W — a relation mapping each element from M into W and vice versa, which is one-to-one and onto. Each element has a "quality" attached, either qm ≥ 0 or qw ≥ 0, and the value of the match is v(qm, qw) = (1/2) qm qw. The assortative match is the one where the highest quality agents are matched, the second-highest quality agents are matched, and so on. If agent m and agent w are matched, then h(m, w) = 1; otherwise h(m, w) = 0.
Theorem 2.4.6 If a matching maximizes ∑_m ∑_w h(m, w) v(qm, qw), then it is assortative.

Proof Since the proof is by contrapositive, we need to show that any matching that is not assortative does not maximize ∑_m ∑_w h(m, w) v(qm, qw) — i.e., any non-assortative match can be improved. If the match is not assortative, we can find two pairs of agents where qm1 > qm2 and qw1 > qw2, but h(m1, w2) = 1 and h(m2, w1) = 1. Then the value of those two matches is (dropping the common factor of 1/2)

qm1 qw2 + qm2 qw1

If we swapped the partners, the value would be

qm1 qw1 + qm2 qw2

and

{qm1 qw1 + qm2 qw2} − {qm1 qw2 + qm2 qw1} = (qm1 − qm2)(qw1 − qw2) > 0.

So a non-assortative match does not maximize ∑_m ∑_w h(m, w) v(qm, qw).

What's the difference between proof by contradiction and proof by contrapositive? In proof by contradiction, you get to use P and ¬Q to arrive at a contradiction, while in proof by contrapositive, you use ¬Q to show that ¬P is a logical implication, without using P or ¬P in the proof.

A Proof by Induction

Theorem 2.4.7 ∑_{i=1}^{n} i = (1/2) n(n + 1)

Proof Basis Step: Let n = 1. Then ∑_{i=1}^{1} i = 1 = (1/2)(1)(1 + 1). So the theorem is true for n = 1.

Induction Step: Suppose ∑_{i=1}^{n} i = (1/2) n(n + 1); we'll show that ∑_{i=1}^{n+1} i = (1/2)(n + 1)(n + 2). Add n + 1 to both sides:

∑_{i=1}^{n} i + (n + 1) = (1/2) n(n + 1) + (n + 1)

Factoring (1/2)(n + 1) out of both terms on the right gives

∑_{i=1}^{n+1} i = (1/2)(n + 1)(n + 2)

And the induction step is shown to be true. Now, since the basis step is true (P (1) → Q(1) is true), and the induction step is true ("P (n) → Q(n)" → "P (n + 1) → Q(n + 1)"), the theorem is true for all n.

Disproof by Counter-example

Definition 2.4.8 Given a (possibly false) statement ∀x ∈ A : P (x), a counter-example is an element x′ of A for which ¬P (x′).

For example, "All continuous functions are differentiable." This sounds like it could be true. Continuous functions have no jumps and are pretty well-behaved. But it's false.
The negation of the statement is, "There exist continuous functions that are not differentiable." Maybe we can prove that? My favorite non-differentiable function is the absolute value,

|x| =
    −x  if x < 0
    0   if x = 0
    x   if x > 0

For x < 0, the derivative is f ′(x) = −1, and for x > 0, the derivative is f ′(x) = 1. But at zero, the derivative is not well-defined. Is it +1? −1? Something in-between?

(Figure: the non-uniqueness of the tangent slope of |x| at zero.)

This is what we call a kink in the function, a point of non-differentiability. So |x| is non-differentiable at zero. But |x| is continuous at zero, since for any sequence xn → 0, |xn| → |0|. So we have provided a counterexample, showing that "All continuous functions are differentiable" is false.

Necessary and Sufficient Conditions

If P → Q and Q → P, then we often say "P if and only if Q", or "P is necessary and sufficient for Q". Consider

• To be a lawyer, it is necessary to have attended law school.
• To be a lawyer, it is sufficient to have passed the bar exam.

To pass the bar exam, it is required that a candidate have a law degree. On the other hand, a law degree alone does not entitle someone to be a lawyer, since they still need to be certified. So a law degree is a necessary condition to be a lawyer, but not sufficient. On the other hand, if someone has passed the bar exam, they are allowed to represent clients, so it is a sufficient condition to be a lawyer. There is clearly a gap between the necessary condition of attending law school and the sufficient condition of passing the bar exam: attending law school does not by itself make someone a lawyer, while passing the bar exam requires more than attendance. Let's look at a more mathematical example:

• A necessary condition for a point x∗ to be a local maximum of a differentiable function f (x) on (a, b) is that f ′(x∗) = 0.
• A sufficient condition for a point x∗ to be a local maximum of a differentiable function f (x) on (a, b) is that f ′(x∗) = 0 and f ′′(x∗) < 0.

Why is the first condition necessary?
Well, if f (x) is differentiable on (a, b) and f ′(x∗) > 0, we can take a small step to x∗ + h and increase the value of the function (alternatively, if f ′(x∗) < 0, take a small step to x∗ − h and you can also increase the value of the function). Consequently, x∗ could not have been a local maximum. So it is necessary that f ′(x∗) = 0. Why is the second condition sufficient? Well, Taylor's theorem says that

f (x) = f (x∗) + f ′(x∗)(x − x∗) + (f ′′(x∗)/2)(x − x∗)² + o(h²)

where h = x − x∗ and the remainder o(h²) goes to zero faster than the (x − x∗)² and (x − x∗) terms as h goes to zero. Since f ′(x∗) = 0, we then have

f (x) = f (x∗) + (f ′′(x∗)/2)(x − x∗)² + o(h²)

so that for x sufficiently close to x∗,

f (x∗) − f (x) ≈ −(f ′′(x∗)/2)(x∗ − x)²

which is strictly positive when f ′′(x∗) < 0, so that f (x∗) > f (x) for all nearby x. So if the first-order necessary condition and the second-order sufficient condition are satisfied, x∗ must be a local maximum. However, there is a substantial gap between the first-order necessary and second-order sufficient conditions. For example, consider f (x) = x³ on (−1, 1). The point x∗ = 0 is clearly a solution of the first-order necessary conditions, and f ′′(0) = 0 as well. However, f (1/2) = (1/2)³ > 0 = f (0), so the point x∗ is a critical point, but not a maximum. Similarly, suppose we try to maximize f (x) = x² on (−1, 1). Again, x∗ = 0 is a solution to the first-order necessary conditions, but it is a minimum, not a maximum. So the first-order necessary conditions only identify critical points, and these critical points come in three flavors: maximum, minimum, and saddle-point. Alternatively, consider the function

g(x) =
    x       if −∞ < x < 5
    5       if 5 ≤ x ≤ 10
    10 − x  if 10 < x < ∞

This function looks like a plateau, where the set [5, 10] all sits at a level of 5, and everything before and after drops off. The necessary conditions are satisfied for all x strictly between 5 and 10, since g′(x) = 0 for x strictly between 5 and 10.
But the sufficient conditions aren't satisfied, since g′′(x) = 0 for x strictly between 5 and 10. These points are all clearly global maxima, but the second-order sufficient conditions require strict negativity to rule out pathologies like f (x) = x³. So, even when you have useful necessary and sufficient conditions, there can be subtle gaps that you miss if you're not careful.

Exercises

1. If A and B are sets, prove that A ⊆ B iff A ∩ B = A.

2. Prove the distributive law

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and the associative law

(A ∪ B) ∪ C = A ∪ (B ∪ C)

What is ((A ∪ B) ∪ C)ᶜ?

3. Sketch graphs of the following sets:

{(x, y) ∈ R² : x ≥ y}
{(x, y) ∈ R² : x² + y² ≤ 1}
{(x, y) ∈ R² : √(xy) ≥ 5}
{(x, y) ∈ R² : min{x, y} ≥ 5}

4. Write the converse and contrapositive of the following statements (note that you don't have to know what any of the statements actually mean to do this):

• A set X is convex if, for every x1, x2 ∈ X and λ ∈ [0, 1], the element xλ = λx1 + (1 − λ)x2 is in X.
• The point x∗ is a maximum of a function f on the set A if there exists no x′ ∈ A such that f (x′) > f (x∗).
• A strategy profile is a subgame perfect Nash equilibrium if, for every player and in every subgame, the strategy profile is a Nash equilibrium.
• A function is uniformly continuous on a set A if, for all ε > 0 there exists a δ > 0 such that for all x, c ∈ A with |x − c| < δ, |f (x) − f (c)| < ε.
• Any competitive equilibrium (x∗, y∗, p∗) is a Pareto optimal allocation.

Part II
Optimization and Comparative Statics in R

Chapter 3
Basics of R

Deliberate behavior is central to every microeconomic model, since we assume that agents act to get the best payoff available to them, given the behavior of other agents and exogenous circumstances that are outside their control. Some examples are:

• A firm hires capital K and labor L at prices r and w per unit, respectively, and has production technology function F (K, L).
It receives a price p per unit it sells of its good. It would like to maximize profits,

π(K, L) = pF (K, L) − rK − wL

• A consumer has utility function u(x) over bundles of goods x = (x1, x2, ..., xN), and a budget constraint ∑_{i=1}^{N} pi xi ≤ w. The consumer wants to maximize utility, subject to his budget constraint:

max_x u(x) subject to ∑_i pi xi ≤ w.

• A buyer goes to a first-price auction, where the highest bidder wins the good and pays his bid. His value for the good is v, and the probability he wins, conditional on his bid, is p(b). If he wins, he gets a payoff v − b, while if he loses, he gets nothing. Then he is trying to maximize

max_b p(b)(v − b) + (1 − p(b)) · 0

These are probably very familiar to you. But, there are other, related problems of interest you might not have thought of:

• A firm hires capital K and labor L at prices r and w per unit, respectively, and has production technology function F (K, L). It would like to minimize costs, subject to producing q̄ units of output, or

min_{K,L} rK + wL subject to F (K, L) = q̄.

• A consumer has utility function u(x) over bundles of goods x = (x1, x2, ..., xN), and a budget constraint ∑_{i=1}^{N} pi xi ≤ w. The consumer would like to minimize the amount they spend, subject to achieving utility equal to ū, or

min_x ∑_{i=1}^{N} pi xi subject to u(x) ≥ ū.

• A seller faces N buyers, each of whom has a value for the good unobservable to the seller. How much revenue can the seller raise from the transaction, among all possible ways of soliciting information from the buyers?

We want to develop some basic, useful facts about solving these kinds of models. There are three questions we want to answer:

• When do solutions exist to maximization problems? (The Weierstrass theorem)
• How do we find candidate solutions? (First-order necessary conditions)
• How do we show that a candidate solution is a maximum?
(Second-order sufficient conditions)

Since we're going to study dozens of different maximization problems, we need a general way of expressing the idea of a maximization problem efficiently:

Definition 3.0.9 A maximization problem is a payoff function or objective function, f (x, t), and a choice set or feasible set, X. The agent is then trying to solve

max_{x ∈ X} f (x, t)

The variable x is a choice variable or control, since the agent gets to decide its value. The variable t is an exogenous variable or parameter, over which the agent has no control.

For this chapter, agents can choose any real number: Firms pick a price, consumers pick a quantity, and so on. This means that X is the set of real numbers, R. Also, unless we are interested in t in particular, we often ignore the exogenous variables in an agent's decision problem. For example, we might write π(q) = pq − cq², even though π() depends on p and c. A good question to start with is, "When does a maximizer of f (x) exist?" While this question is simple to state in words, it is actually pretty complicated to state and show precisely, which is why this chapter is fairly "math-heavy".

3.0.2 Maximization and Minimization

A function is a mapping from some set D, the domain, uniquely into R = (−∞, ∞), the real numbers, and we write f : D → R. For example, √x : [0, ∞) → R. The image of D under f (x) is the set f (D). For example, for √x with D = [0, ∞), the image is [0, ∞); and for log(x) : (0, ∞) → R, we have log(D) = (−∞, ∞). It turns out that the properties of the domain D and the function f (x) both matter for us.
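As a concrete instance of the max_{x ∈ X} f (x, t) template, here is a rough numerical sketch that grid-searches the profit function π(q) = pq − cq² from above. The parameter values p = 4 and c = 1 are made up for illustration, and a finite grid stands in for an interval of R:

```python
p, c = 4.0, 1.0                        # exogenous parameters t (illustrative values)
profit = lambda q: p * q - c * q ** 2  # objective f(x, t) with choice variable x = q

# Choice set X: a fine grid standing in for the interval [0, 5]
X = [i / 1000 for i in range(5001)]

q_star = max(X, key=profit)  # argmax of the objective over the grid
print(q_star)                # 2.0, which matches p / (2c)
```

The grid answer agrees with the first-order condition π′(q) = p − 2cq = 0, i.e. q∗ = p/(2c), which the following chapters develop systematically.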
Definition 3.0.10 A point x∗ is a global maximizer of f : D → R if, for any other x′ in D,

f (x∗) ≥ f (x′)

A point x∗ is a local maximizer of f : D → R if, for some δ > 0 and for any other x′ in (x∗ − δ, x∗ + δ) ∩ D,

f (x∗) ≥ f (x′)

(Figure: domain, image, maxima, and minima.)

So a local maximum is basically saying, "If we restrict attention to (x∗ − δ, x∗ + δ) instead of the whole domain, then x∗ is a global maximum in (x∗ − δ, x∗ + δ)." In even less mathematical terms, "x∗ is a local maximum if it does at least as well as anything nearby." A point f (x∗) is a local or global minimum if, instead of f (x∗) ≥ f (x′) above, we have f (x∗) ≤ f (x′). However, we can focus on just maximization or just minimization, since

Theorem 3.0.11 If x∗ is a local maximum of f (), then x∗ is a local minimum of −f ().

Proof If f (x∗) ≥ f (x′) for all x′ in (x∗ − δ, x∗ + δ) for some δ > 0, then −f (x∗) ≤ −f (x′) for all x′ in (x∗ − δ, x∗ + δ), so x∗ is a local minimum of −f ().

So whenever you have a minimization problem,

min_x f (x)

you can turn it into a maximization problem

max_x −f (x)

and forget about minimization entirely.

(Figure: maximization vs. minimization.)

On the other hand, since I have programmed myself to do this automatically, I find it annoying when a book is written entirely in terms of minimization, and some of the rules used down the road for minimization (second-order sufficient conditions for constrained optimization problems) are slightly different than those for maximization.

3.1 Existence of Maxima/Maximizers

So what do we mean by "existence of a maximum or maximizer"? Before moving to functions, let's start with the maximum and minimum of a set. For a set F = [a, b] with a and b finite, the maximum is b and the minimum is a — these are members of the set, and b ≥ x for all other x in F, and a ≤ x for all other x in F. If we consider the set E = (a, b), however, the points b and a are not actually members of the set, although b ≥ x for all x in E, and a ≤ x for all x in E.
We call b the least upper bound of E, or the supremum,

sup (a, b) = b

and a is the greatest lower bound of E, or the infimum,

inf (a, b) = a

Even if the maximum and minimum of a set don't exist, the supremum and infimum always do (though they may be infinite, like for (a, ∞)). Since a function is a rule that assigns a real number to each x in the domain D, the image of D under f is going to be a set, f (D). For maximization purposes, we are interested in whether the image of a function includes its largest point or not. For example, the function f (x) = x² on [0, 1] achieves a maximum at x∗ = 1, since f (1) = 1 > f (x) for 0 ≤ x < 1, and sup f ([0, 1]) = max f ([0, 1]). However, if we consider f (x) = x² on (0, 1), the supremum of f (x) is still 1, since 1 is the least upper bound of f ((0, 1)) = (0, 1), but the function does not achieve a maximum.

(Figure: existence and non-existence of maximizers.)

Why? On the left, the graph has no "holes", so it moves smoothly to the highest point. On the right, however, a point is "missing" from the graph exactly where it would have achieved its maximum. This is the difference between a maximizer (left) and a supremum when a maximizer fails to exist (right). Why does this happen? In short, because the choice set on the left, [a, b], "includes all its points", while the choice set on the right, (a, b), is missing a and b. To be a bit more formal, suppose x′ ∈ (0, 1) is the maximum of f (x) = x² on (0, 1). It must be less than 1, since 1 is not an element of the choice set. Let x′′ = (1 − x′)/2 + x′ — this is the point exactly halfway between x′ and 1. Then f (x′′) > f (x′), since f (x) is increasing. But then x′ couldn't have been a maximizer. This is true for all x′ in (0, 1), so we're forced to conclude that no maximum exists. But this isn't the only way something can go wrong.
Consider the function on [0, 1] where

g(x) =
    x      if 0 ≤ x ≤ 1/2
    2 − x  if 1/2 < x ≤ 1

Again, the function's graph is missing a point precisely where it would have achieved a maximum, at 1/2: there it takes the value 1/2 instead of 3/2. So in either case, the function can fail to achieve a maximum. In one situation, it happens because the domain D has the wrong features: It is missing some key points. In the second situation, it happens because the function f has the wrong features: It jumps around in unfortunate ways. Consequently, the question is: What features of the domain D and the function f (x) guarantee that sup_{x ∈ D} f (x) = m∗ is equal to f (x∗) for some x∗ ∈ D?

Definition 3.1.1 A (global) maximizer of f (x), x∗, exists if f (x∗) = sup f (x), and f (x∗) is the (global) maximum of f ().

3.2 Intervals and Sequences

Intervals are special subsets of the real numbers, but some of their properties are key for optimization theory.

Definition 3.2.1 The set (a, b) = {x : a < x < b} is open, while the set [a, b] = {x : a ≤ x ≤ b} is closed. If −∞ < a < b < ∞, the sets (a, b) and [a, b] are bounded (both the endpoints a and b are finite).

A sequence xn = x1, x2, ... is a rule that assigns a real number to each of the natural (counting) numbers, N = 1, 2, 3, .... Some basic sequences are:

• Let xn = n. Then the sequence is 1, 2, 3, ...
• Let xn = 1/n. Then the sequence is 1, 1/2, 1/3, ...
• Let xn = (−1)^(n−1) (1/n). Then the sequence is 1, −1/2, 1/3, −1/4, ...
• Let x1 = 1, x2 = 1, and for n > 2, xn = xn−1 + xn−2. Then the sequence is 1, 1, 2, 3, 5, 8, 13, 21, ...
• Let xn = sin(nπ/2). Then the sequence is 1, 0, −1, 0, ...

At first, sequences might seem like a weird object to study when our goal is to understand functions. A function is basically an uncountable number of points, while sequences are countable, so it seems like we would lose information by simplifying an object this way.
However, it turns out that studying how functions f (x) act on a sequence xn gives us all the information we need to study the behavior of a function, without the hassle of dealing with the function itself.

Definition 3.2.2 A sequence xn = x1, x2, ... is a function from N → R. A sequence converges to x̄ if, for all ε > 0 there exists an N such that n > N implies |xn − x̄| < ε. Then we write

lim_{n→∞} xn = x̄

and x̄ is the limit of xn.

Everyone's favorite convergent sequence is

xn = 1/n = 1, 1/2, 1/3, ...

and the limit of xn is zero. Why? First, for any positive number ε, I can make |xn − 0| < ε by picking n large enough. Just take n > N = 1/ε, so that

1/n < 1/(1/ε) = ε

Second, since ε was an arbitrary positive number, this works for every error tolerance, no matter how small. Therefore, xn → 0 as n → ∞. Notice, however, how 1/n is never actually equal to zero for finite n; it is only in the limit that it hits its lower bound. This is another example of the difference between minimum and infimum: inf xn = 0 but min xn does not exist, since 1/n > 0 for all n. Everyone's favorite non-convergent or divergent sequence is

xn = n

Pick any finite number ε, and I can make |xn| > ε, contrary to the definition of convergence. How? Just make n > ε. This sequence goes off to infinity, and its long-run behavior doesn't "settle down" to any particular number.

Definition 3.2.3 A sequence xn is bounded if |xn| ≤ M for all n, for some finite real number M.

But boundedness alone doesn't guarantee that a sequence converges, and just because a sequence fails to converge doesn't mean that it's not interesting. Consider

xn = cos((n − 1)π/2) = 1, 0, −1, 0, 1, 0, −1, ...

This sequence fails to converge (right?). But it's still quite well-behaved. Every even term is zero,

x_{2k} = 0, 0, 0, ...

and it includes a convergent string of ones,

x_{4(k−1)+1} = 1, 1, 1, ...

and a convergent string of −1's,

x_{4(k−1)+3} = −1, −1, −1, ...

So the sequence xn — even though it is non-convergent — is composed of three convergent sequences.
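The three convergent strings inside xn = cos((n − 1)π/2) can be listed numerically; a quick sketch (the rounding removes the tiny floating-point error cos produces at odd multiples of π/2):

```python
import math

# x_n = cos((n - 1) * pi / 2), rounded to the nearest integer: 1, 0, -1, 0, ...
x = lambda n: round(math.cos((n - 1) * math.pi / 2))

print([x(n) for n in range(1, 9)])                # [1, 0, -1, 0, 1, 0, -1, 0]

# Sub-sequences picked out by increasing index rules n_k:
print([x(2 * k) for k in range(1, 5)])            # even terms: [0, 0, 0, 0]
print([x(4 * (k - 1) + 1) for k in range(1, 5)])  # [1, 1, 1, 1]
print([x(4 * (k - 1) + 3) for k in range(1, 5)])  # [-1, -1, -1, -1]
```

Each index rule n_k is strictly increasing, so each of the three lists is a sub-sequence in the sense of the definition that follows.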
These building blocks are important enough to deserve a name:

Definition 3.2.4 Let nk be a sequence of natural numbers, with n1 < n2 < ... < nk < .... Then xnk is a sub-sequence of xn.

It turns out that every bounded sequence has at least one sub-sequence that is convergent.

Theorem 3.2.5 (Bolzano-Weierstrass Theorem) Every bounded sequence has a convergent subsequence.

This seems like an esoteric (and potentially useless) result, but it is actually a fundamental tool in analysis. Many times, we have a sequence of interest but know very little about it. To study its behavior, we use the Bolzano-Weierstrass theorem to find a convergent sub-sequence, and sometimes that's enough.

3.3 Continuity

In short, a function is continuous if you can trace its graph without lifting your pencil off the paper. This definition is not precise enough to use, but it captures the basic idea: The function has no "jumps" where the value f (x) is changing very quickly even though x is changing very little. The reason continuity is so important is that continuous functions preserve key properties of sets. For example, if you take an interval S and apply a continuous function f () to it, the image f (S) is also an interval. This is not the case for every function. The most basic definition of continuity is

Definition 3.3.1 A function f (x) is continuous at c if, for all ε > 0, there is a δ > 0 so that if |x − c| < δ, then |f (x) − f (c)| < ε.

(Figure: the ε–δ definition of continuity.)

Cauchy's original explanation was that ε is an "error tolerance", and a function is continuous at c if the actual error (|f (x) − f (c)|) can be made smaller than any error tolerance by making x close to c (|x − c| < δ).
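The ε–δ definition can be explored numerically. A sketch for f (x) = x² at c = 1: for any small ε, the choice δ = ε/3 works, because |x² − 1| = |x − 1| · |x + 1| ≤ 3 |x − 1| whenever |x − 1| ≤ 1 (the specific function and point are just illustrations):

```python
f = lambda x: x * x
c = 1.0

for eps in [1.0, 0.1, 0.01, 0.001]:
    delta = eps / 3  # valid because |x + 1| <= 3 when |x - 1| <= 1
    # Sample points within delta of c and confirm the error tolerance is met
    xs = [c - delta + 2 * delta * i / 1000 for i in range(1001)]
    assert all(abs(f(x) - f(c)) < eps for x in xs if abs(x - c) < delta)

print("epsilon-delta check passed for f(x) = x^2 at c = 1")
```

Of course, a finite sample can only illustrate the definition, not prove it; the inequality |x² − 1| ≤ 3 |x − 1| is what does the real work.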
An equivalent definition of continuity is given in terms of sequences by the following:

Theorem 3.3.2 (Sequential Criterion for Continuity) A function f (x) is continuous at c iff for all sequences xn → c, we have

lim_{n→∞} f (xn) = f (lim_{n→∞} xn) = f (c)

We could just as easily have taken "if lim_{n→∞} f (xn) = f (c) for all sequences xn → c, then f () is continuous at c" as the definition of continuity instead of the ε–δ definition. The Sequential Criterion for Continuity converts an idea about functions expressed in ε–δ notation into an equivalent idea about how f () acts on sequences. In particular, a continuous function preserves convergence of a sequence xn to its limit c. The important result is that if f (x) is continuous and xn → c, then

lim_{n→∞} f (xn) = f (lim_{n→∞} xn)

or, in words, the limit operator and the function can be interchanged if and only if f (x) is continuous.

3.4 The Extreme Value Theorem

Up to this point, what are the facts that you should internalize?

• Sequences are convergent only if they settle down to a constant value in the long run
• Every bounded sequence has a convergent subsequence, even if the original sequence doesn't converge (Bolzano-Weierstrass Theorem)
• A function is continuous if and only if lim_{n→∞} f (xn) = f (x) for all sequences xn → x

With these facts, we can prove the Weierstrass Theorem, also called the Extreme Value Theorem.

Theorem 3.4.1 (Weierstrass' Extreme Value Theorem) If f : [a, b] → R is a continuous function and a and b are finite, then f achieves a global maximum on [a, b].

First, let's check that if any of the assumptions are violated, then examples exist where f does not achieve a maximum. Recall our examples of functions that failed to achieve maxima, f (x) = x² on (0, 1) and

g(x) =
    x      if 0 ≤ x ≤ 1/2
    2 − x  if 1/2 < x ≤ 1

on [0, 1]. In the first example, f (x) is continuous, but the set (0, 1) is open, unlike [a, b], violating the hypotheses of the Weierstrass theorem.
In the second example, the function's domain is closed and bounded, but the function is discontinuous, violating the hypotheses of the Weierstrass theorem. So the reason the Weierstrass theorem is useful is that it provides sufficient conditions for a function to achieve a maximum, so that we know for sure, without exception, that any continuous f(x) on a closed, bounded set [a, b] will achieve a maximum.

Proof Suppose that

sup_{x ∈ [a,b]} f(x) = m*

We know m* < ∞, since if f(x) is continuous on [a, b], it is well-defined on its whole domain and has no asymptotes (unlike 1/x on [0, 1], which is discontinuous at zero). We want to show that there exists an x* in [a, b] so that f(x*) = m*.

Step 1: Since m* is the supremum of f(), we can find a sequence of points {x_n} = x_1, x_2, ... in [a, b] so that

m* − 1/n ≤ f(x_n) ≤ m*

Why? We show this by contradiction. If the above inequality failed for some n, so that we could not find an x_n with m* − 1/n ≤ f(x_n) ≤ m*, then m* − 1/n would be greater than f(x) for all x in [a, b], implying that m* − 1/n is an upper bound of f AND less than m*, so that m* was not actually the supremum in the first place, since the supremum is the least upper bound. This would be a contradiction.

Step 2: Since the sequence x_n is contained in [a, b], by the Bolzano-Weierstrass theorem we can find a convergent subsequence, x_{n_k} → x*, where x* is in [a, b] because the interval is closed.

Step 3: If we take the limit of the convergent subsequence,

lim_{k→∞} ( m* − 1/n_k ) ≤ lim_{k→∞} f(x_{n_k}) ≤ m*

by continuity of f() and the Sequential Criterion for Continuity, we have lim_{k→∞} f(x_{n_k}) = f(x*), and

m* ≤ f(x*) ≤ m*

The above inequalities imply f(x*) = m*, so a maximizer x* exists in [a, b], and the function achieves a maximum at f(x*).

This is the foundational result of all optimization theory, and it pays to appreciate how and why these steps are each required.
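The proof's construction can be mimicked on a computer. The following sketch (illustrative, using a hand-picked function, not part of the notes) searches finer and finer grids of [a, b], producing points whose values approach the supremum and which themselves approach the true maximizer:

```python
# Grid-search sketch of the proof's construction for a hand-picked function:
# f(x) = 1 - (x - 0.3)^2 is continuous on [0, 1] with maximizer x* = 0.3.
# Finer grids produce points whose values approach the supremum m* = 1.

def f(x):
    return 1.0 - (x - 0.3) ** 2

a, b = 0.0, 1.0
best_points = []
for n in [10, 100, 1000, 10000]:
    grid = [a + (b - a) * i / n for i in range(n + 1)]
    best_points.append(max(grid, key=f))   # the best point on the n-point grid
print(best_points[-1])   # approaches the true maximizer 0.3
```

The sequence of grid maximizers plays the role of the x_n in Step 1, and its convergence to a point of [a, b] is exactly what Bolzano-Weierstrass guarantees in Step 2.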
This is the kind of powerful result you can prove using a little bit of analysis.

3.5 Derivatives

As you know, derivatives measure the rate of change of a function at a point, or

f′(x) = D_x f(x) = df(x)/dx = lim_{h→0} [ f(x + h) − f(x) ] / h

The way to visualize the derivative is as the limit of a sequence of chords,

lim_{n→∞} [ f(x_n) − f(x) ] / (x_n − x)

whose slopes converge to the slope of the tangent line, f′(x).

[Figure: The Derivative]

Since this sequence of chords is just a sequence of numbers, the derivative is just the limit of a particular kind of sequence. So if the derivative exists, it is unique, and a derivative exists only if it takes the same value no matter what sequence x_n → x you pick.

For example, the derivative of the square root of x, √x, can be computed "bare-handed" as

lim_{h→0} [ √(x+h) − √x ] / h = lim_{h→0} ( [ √(x+h) − √x ] / h ) · ( [ √(x+h) + √x ] / [ √(x+h) + √x ] )
= lim_{h→0} (x + h − x) / ( h [ √(x+h) + √x ] ) = lim_{h→0} 1 / [ √(x+h) + √x ] = 1 / (2√x)

For the most part, of course, no one really computes derivatives like that. We have theorems like

D_x [a f(x) + b g(x)] = a f′(x) + b g′(x)

D_x [f(x) g(x)] = f′(x) g(x) + f(x) g′(x) (multiplication rule)

D_x [f(g(x))] = f′(g(x)) g′(x) (chain rule)

(f⁻¹)′(f(x)) = 1 / f′(x) (inverse rule)

as well as the derivatives of specific functional forms

D_x x^k = k x^{k−1}

D_x e^x = e^x

D_x log(x) = 1/x

and so on. This allows us to compute many fairly "complicated" derivatives by grinding through the above rules. But a notable feature of economics is that we are fundamentally unsure of what functional forms we should be using, despite the fact that we know a reasonable amount about what they "look" like. These qualitative features are often expressed in terms of derivatives. For example, it is typically assumed that a consumer's benefit from a good is positive, marginal benefit is positive, but marginal benefit is decreasing. In short, v(q) ≥ 0, v′(q) ≥ 0 and v′′(q) ≤ 0. A firm's total costs are typically positive, marginal cost is positive, and marginal cost is increasing.
In short, C(q) ≥ 0, C′(q) ≥ 0, and C′′(q) ≥ 0. By specifying our assumptions this way, we are being precise as well as avoiding the arbitrariness of assuming that a consumer has log(q) preferences for some good, but 1 − e^{−bq} preferences for another.

3.5.1 Non-differentiability

How do we recognize non-differentiability? Consider the absolute value of x,

f(x) = |x| = { −x  if x < 0
             { 0   if x = 0
             { x   if x > 0

What is the derivative, f′(x)? For x < 0, the function is just f(x) = −x, which has derivative f′(x) = −1. For x > 0, the function is just f(x) = x, which has derivative f′(x) = 1. But what about at zero?

First, let's define the derivative of f at x from the left as

f′(x⁻) = lim_{h↑0} [ f(x + h) − f(x) ] / h

and the derivative of f at x from the right as

f′(x⁺) = lim_{h↓0} [ f(x + h) − f(x) ] / h

Note that:

Theorem 3.5.1 A function is differentiable at a point x if and only if its one-sided derivatives exist and are equal.

Then for f(x) = |x| with x = 0, we have

f′(0⁺) = lim_{h↓0} [ |0 + h| − |0| ] / h = lim_{h↓0} h/h = 1

and

f′(0⁻) = lim_{h↑0} [ |0 + h| − |0| ] / h = lim_{h↑0} −h/h = −1

So we could hypothetically assign any number from −1 to 1 as the slope of a tangent line to |x| at zero. In this case, we say that f(x) is non-differentiable at zero, since the tangent line to the graph of f(x) is not unique — people often say there is a "corner" or "kink" in the graph of |x| at zero.

We already computed

D_x √x = 1 / (2√x)

Obviously, we can't evaluate this function for x < 0, since √x is only defined for non-negative numbers. For x > 0, the function is also well behaved. But at zero, we have

D_x √x |_{x=0} = 1 / (2√0)

which is undefined, so the derivative fails to exist at zero. So, if you want to show a function is non-differentiable, you need to show that the derivatives from the left and from the right are not equal, or that the derivative fails to exist.

3.6 Taylor Series

It turns out that the sequence of derivatives of a function

f′(x), f′′(x), f′′′(x), ..., f^{(k)}(x), ...
generally provides enough information to recover the function, or approximate it "as well as we need" near a particular point using only the first k terms.

Definition 3.6.1 The k-th order Taylor polynomial of f(x) based at x₀ is

f(x) = f(x₀) + f′(x₀)(x − x₀) + f′′(x₀)(x − x₀)²/2 + ... + f^{(k)}(x₀)(x − x₀)^k/k!   [k-th order approximation of f]
       + f^{(k+1)}(c)(x − x₀)^{k+1}/(k + 1)!   [remainder term]

where c is between x and x₀.

Example The second-order Taylor polynomial of f(x) based at x₀, with its remainder, is

f(x) = f(x₀) + f′(x₀)(x − x₀) + f′′(x₀)(x − x₀)²/2 + f′′′(c)(x − x₀)³/6

For f(x) = e^x with x₀ = 1, we have

f(x) = e + e(x − 1) + e(x − 1)²/2 + e^c (x − 1)³/6

while for x₀ = 0 we have

f(x) = 1 + x + x²/2 + e^c x³/6

For f(x) = x⁵ + 7x² + x with x₀ = 3, we have

f(x) = 309 + 448(x − 3) + 554(x − 3)²/2 + 60c²(x − 3)³/6

while for x₀ = 10 we have

f(x) = 100,710 + 50,141(x − 10) + 20,014(x − 10)²/2 + 60c²(x − 10)³/6

For f(x) = log(x) with x₀ = 1, we have

f(x) = (x − 1) − (x − 1)²/2 + (x − 1)³/(3c³)

So while the Taylor polynomial with the remainder/error term is exact, dropping the remainder introduces approximation error. We often simply work with the approximation and claim that if we are "sufficiently close" to the base point, it won't matter. Or, we will use a Taylor series to expand a function in terms of its derivatives, perform some calculations, and then take a limit so that the error vanishes. Why are these claims valid?

Consider the second-order Taylor polynomial,

f(x) = f(x₀) + f′(x₀)(x − x₀) + f′′(x₀)(x − x₀)²/2 + f′′′(c)(x − x₀)³/6

This equality is exact when we include the remainder term, but not when we drop it. Let

f̂(x) = f(x₀) + f′(x₀)(x − x₀) + f′′(x₀)(x − x₀)²/2

be our second-order approximation of f around x₀. Then the approximation error

f(x) − f̂(x) = f′′′(c)(x − x₀)³/6

is, in absolute value, just a constant |f′′′(c)|/6 multiplied by |(x − x₀)³|.
Therefore, we can make the error arbitrarily small (less than any ε > 0) by making x very close to x₀:

(|f′′′(c)|/6) |x − x₀|³ < ε  whenever  |x − x₀| < ( 6ε / |f′′′(c)| )^{1/3}

We write this as

f(x) = f(x₀) + f′(x₀)(x − x₀) + f′′(x₀)(x − x₀)²/2 + O(h³)

where h = x − x₀, or we say that the error is of order h-cubed. This is understood to mean that if h = x − x₀ is small enough, then the error f(x) − f̂(x) will be as small as desired. This is important for maximization theory because we will often want to use low-order Taylor polynomials around a local maximum x*, and we need to know that if x is close enough to x*, the approximation will satisfy f(x*) ≥ f(x) (why is this important?).

3.7 Partial Derivatives

Most functions of interest to us are not a function of a single variable, but many. As a result, even though we're focused on maximization where the choice variable is one-dimensional, it helps to introduce partial derivatives so we can study how solutions and payoffs vary in terms of variables outside the agent's control. For example, a firm's profit function

π(q) = pq − (c/2)q²

is really a function of q, p and c, or π(q, p, c). We will need to differentiate not just with respect to q, but also p and c.

Definition 3.7.1 Let f(x₁, x₂, ..., x_n) : Rⁿ → R. The partial derivative of f(x) with respect to x_i is

∂f(x)/∂x_i = lim_{h→0} [ f(x₁, ..., x_i + h, ..., x_n) − f(x₁, ..., x_i, ..., x_n) ] / h

The gradient is the vector of partial derivatives

∇f(x) = ( ∂f(x)/∂x₁, ∂f(x)/∂x₂, ...,
∂f(x)/∂x_N )

Since this notation can become cumbersome — especially when we differentiate multiple times with respect to different variables x_i and then x_j, and so on — we often write

∂f(x)/∂x_i = f_{x_i}(x)

or

∂²f(x)/∂x_j ∂x_i = f_{x_j x_i}(x)

Example Consider a simple profit function

π(q, p, c) = pq − (c/2)q²

Then

∂π(q, p, c)/∂c = −q²/2

and

∂π(q, p, c)/∂p = q

and the gradient is

∇π(q, p, c) = ( ∂π(q, p, c)/∂q, ∂π(q, p, c)/∂p, ∂π(q, p, c)/∂c ) = ( p − cq, q, −q²/2 )

The partial derivative with respect to x_i holds all the other variables (x₁, ..., x_{i−1}, x_{i+1}, ..., x_n) constant and only varies x_i slightly, exactly like a one-dimensional derivative where the other arguments of the function are treated as constants.

3.7.1 Differentiation with Multiple Arguments and Chain Rules

Recall that the one-dimensional chain rule is that, for any two differentiable functions f(x) and g(y),

D_x g(f(x)) = g′(f(x)) f′(x)

Of course, since a partial derivative is just a regular derivative where all other arguments are held constant, it's true that

∂g(y₁, y₂, ..., f(x_i), ..., y_N)/∂x_i = [ ∂g(...)/∂y_i ] f′(x_i)

But we run into problems when we face a function f(y₁, ..., y_N) where many of the variables y_i are functions that depend on some other, common variable, c. For example, consider f(x(c), c). The c term shows up in multiple places, so it is not immediately obvious how to differentiate with respect to c. Let g(c) = f(x(c), c), and consider totally differentiating g(c) with respect to c:

g′(c) = df(x(c), c)/dc = lim_{h→0} [ f(x(c + h), c + h) − f(x(c), c) ] / h

This limit looks incalculable, since both arguments are varying at the same time. Consider using a Taylor series to expand the first term in x at x(c), as

f(x(c + h), c + h) = f(x(c), c + h) + [ ∂f(ξ, c + h)/∂x ] (x(c + h) − x(c))

where ξ is between x(c) and x(c + h). Let's now expand the first term on the right-hand side in c at c, as

f(x(c), c + h) = f(x(c), c) + [ ∂f(x(c), ζ)/∂c ] h

where ζ is between c and c + h.
Inserting the second equation into the first, we get

f(x(c + h), c + h) = f(x(c), c) + [ ∂f(x(c), ζ)/∂c ] h + [ ∂f(ξ, c + h)/∂x ] (x(c + h) − x(c))

Since ξ is between x(c + h) and x(c), it must tend to x(c) as h → 0; since ζ is between c and c + h, it must tend to c as h → 0. Then re-arranging and dividing by h yields

[ f(x(c + h), c + h) − f(x(c), c) ] / h = ∂f(x(c), ζ)/∂c + [ ∂f(ξ, c + h)/∂x ] [ x(c + h) − x(c) ] / h

The left-hand side is almost the derivative of g(c); we just need to take limits with respect to h. Taking the limit as h → 0 then yields

g′(c) = df(x(c), c)/dc = f_c(x(c), c) + f_x(x(c), c) ∂x(c)/∂c

So we work argument by argument, partially differentiating all the way through using the chain rule, and then summing all the resulting terms. For example,

d g(x₁(c), x₂(c), ..., x_N(c), c)/dc = ∂g(x(c), c)/∂c + Σ_{i=1}^{N} g_{x_i}(x(c), c) ∂x_i(c)/∂c

Exercises

1. Write out the first few terms of the following sequences, find their limits if the sequence converges, and find the suprema and infima:

x_n = (−1)ⁿ/n

x_n = n^{1/n}

x_n = [ √(n+1) − √n ] / [ √(n+1) + √n ]

x_n = sin( π(n − 1)/2 )

(Hint: Does this sequence have multiple convergent sub-sequences? Argue that a sequence with multiple convergent sub-sequences that have different limits cannot be convergent.)

x_n = (1/n)^{1/n}

(Hint: Show that x_n is an increasing sequence, and then argue that sup_n (1/n)^{1/n} = 1. Then x_n → 1, right?)

2. Give an example of a sequence x_n and a function f(x) so that x_n → c, but lim_n f(x_n) ≠ f(c).

3. Suppose x* is a local maximizer of f(x). Let g() be a strictly increasing function. Is x* a local maximizer of g(f(x))? Suppose x* is a local minimizer of f(x). What kind of transformations g() ensure that x* is also a local minimizer of g(f(x))? Suppose x* is a local maximizer of f(x). Let g() be a strictly decreasing function. Is x* a local maximizer or minimizer of g(f(x))?

4. Rewrite the proof of the extreme value theorem for minimization, rather than maximization.

5.
A function is Lipschitz continuous on [a, b] if, for all x′ and x′′ in [a, b], |f(x′) − f(x′′)| < K|x′ − x′′|, where K is a finite constant. Show that a Lipschitz continuous function is continuous. Provide an example of a continuous function that is not Lipschitz continuous. How is the Lipschitz constant K related to the derivative of f(x)?

6. Using Matlab or Excel, numerically compute first- and second-order approximations of the exponential function with x₀ = 0 and x₀ = 1. Graph the approximations and the approximation error as you move away from each x₀. Do the same for the natural log function with x₀ = 1 and x₀ = 10. Explain whether the first- or second-order approximation does better, and how the performance degrades as x moves away from x₀ for each approximation.

7. Suppose f′(x) > g′(x) for all x. Is f(x) > g(x)? Prove or disprove: if f′(x) > g′(x) > 0 for all x, there exists a point x₀ such that for x > x₀, f(x) ≥ g(x).

8. Explain when the derivatives of f(g(x)), f(x)g(x), and f(x)/g(x) are positive for all x or strictly negative for all x.

9. Consider a function f(c, d) = g(y₁(c, d), y₂(d), c). Compute the partial derivatives of f(c, d) with respect to c and d. Repeat with f(a, b) = g(y₁(z(a), h(a, b)), y₂(w(b))).

Proofs

Bolzano-Weierstrass Theorem If a sequence is bounded, all its terms are contained in a set I = [a, b]. Take the first term of the sequence, x₁, and let it be x_{n_1}. Now, split the interval I = [a, b] into [a, (a + b)/2] and [(a + b)/2, b]. One of these subsets must contain infinitely many terms of the sequence, since x_n is an infinitely long sequence. Pick that subset, and call it I₁. Select any member of the sequence in I₁ whose index is larger than n₁, and call it x_{n_2}. Now split I₁ in half. Again, one of the subsets of I₁ contains infinitely many terms, so pick it and call it I₂. Proceeding in this way, splitting the sets in half and picking an element from the set with infinitely many terms in it, we construct a sequence of sets I_k and a subsequence x_{n_k}.
Now note that the length of the sets I_k is

(b − a)(1/2)^k → 0

So the distance between the terms of the sub-sequence x_{n_k} and x_{n_{k+1}} cannot be more than (b − a)(1/2)^k. Since this process continues indefinitely and (b − a)(1/2)^k → 0, there will be a limit term x̄ of the sequence¹. Then for some k we have |x_{n_k} − x̄| < (1/2)^k, which can be made arbitrarily small by making k large; for any ε > 0, if k ≥ K then |x_{n_k} − x̄| < ε for K ≥ log(ε)/log(1/2).

Sequential Criterion for Continuity Suppose f(x) is continuous at c. Then for all ε > 0, there is a δ > 0 so that |x − c| < δ implies |f(x) − f(c)| < ε. Take any sequence x_n → c. This implies that there is some N so that for all n ≥ N, |x_n − c| < δ. Then |f(x_n) − f(c)| < ε for all n ≥ N. Therefore, lim f(x_n) = f(c).

For the converse, suppose that lim f(x_n) = f(c) for every sequence x_n → c, but that f is not continuous at c. Then there is some ε > 0 so that for every δ = 1/n, we can find a point x_n with |x_n − c| < 1/n but |f(x_n) − f(c)| ≥ ε. But then x_n → c while f(x_n) does not converge to f(c), a contradiction. Therefore f(x) is continuous at c.

Questions about sequences and the extreme value theorem:

7. Prove that a convergent sequence is bounded. (You will need the fact that |y| < c implies −c ≤ y ≤ c. Try to bound the "tail" terms as |x_m| < |ε + x_N|, m > N, and then argue the sequence is bounded by the constant M = max{x₁, x₂, ..., x_N, |ε + x_N|}.)

8. Prove that a convergent sequence has a unique limit. (Start by supposing that x_n converges to x′ and x′′. Then use the facts that |y| ≤ c implies −c ≤ y ≤ c and |a + b| ≤ |a| + |b| to show that |x′′ − x′| < ε/2 + |x_n − x′′|, and |x_n − x′′| can be made less than ε/2.)

9. The extreme value theorem says that if f : [a, b] → R is continuous, then a maximizer of f(x) exists in [a, b]. Is it necessary for f(x) to be continuous for this result to be true? (Provide an example to explain your answer.)
A function f : [a, b] → R is upper semi-continuous if, for all ε > 0, there exists a δ > 0 so that if |x − c| < δ, then

f(x) ≤ f(c) + ε

(a) Sketch some upper semi-continuous functions.

(b) Show that upper semi-continuity is equivalent to: For any sequence x_n → c,

lim_{n→∞} sup_{k≥n} f(x_k) ≤ f(c)

(¹ This part of the proof is a little loose. The existence of x̄ is ensured by the Cauchy criterion for convergence, which is covered in exercise 10.)

(c) Use the sequential criterion for upper semi-continuity in part b to show that the extreme value theorem extends to upper semi-continuous functions.

10. A sequence is Cauchy if, for all ε > 0, there exists an N so that for all n, m ≥ N, |x_n − x_m| < ε. Show that a sequence x_n → x̄ if and only if it is Cauchy. The bullet points below sketch the proof for you; provide the missing steps:

• To prove a convergent sequence is Cauchy: (Easy part)
  – Add and subtract x̄ inside |x_n − x_m|, then use the triangle inequality, |a + b| ≤ |a| + |b|. Lastly, use the definition of a convergent sequence.

• To prove a Cauchy sequence converges: (Hard part)
  – Show that a Cauchy sequence is bounded. (You will need the fact that |y| < c implies −c ≤ y ≤ c. Try to bound the "tail" terms as |x_m| < |ε + x_N|, m > N, and then argue the sequence is bounded by the constant M = max{x₁, x₂, ..., x_N, |ε + x_N|}.)
  – Use the Bolzano-Weierstrass theorem to argue that a Cauchy sequence then has a convergent subsequence with limit x̄.
  – Show that x_n → x̄, so that the sequence converges to x̄. Suppose there is a sub-sequence x_{n_k} of x_n that does not converge to x̄, and show that this leads to a contradiction (again, the fact that |y| ≤ c implies −c ≤ y ≤ c is useful. Compare the non-convergent subsequence with the convergent one, along with the definition of a Cauchy sequence for points from the two sequences.).

Cauchy sequences can be very useful when you want to prove a sequence converges, but have no idea what the exact limit is.
One example is studying sequences of decision variables in macroeconomics, where, for example, a sequence of capital decisions k_t is generated by the economy each period t, but it is unclear whether this sequence converges or diverges.

Chapter 4 Necessary and Sufficient Conditions for Maximization

While existence results are useful in terms of helping us understand which maximization problems have answers at all, they do little to help us find maximizers. Since analyzing any economic model relies on finding the optimal behavior for each agent, we need a method of finding maximizers and determining when they are unique.

4.1 First-Order Necessary Conditions

We start by asking the question, "What criteria must a maximizer satisfy?" This throws away many possibilities, and narrows attention to a candidate list. This does not mean that every candidate is indeed a maximizer, but merely that if a particular point is not on the list, then it cannot be a maximizer. These criteria are called first-order necessary conditions.

Example Suppose a price-taking firm has total costs C(q), and gets a price p for its product. Then its profit function is

π(q) = pq − C(q)

How can we find a maximizer of π()? Let's start by differentiating with respect to q:

π′(q) = p − C′(q)

Since π′(q) measures the rate of change of profit with respect to q, we can increase profits if p − C′(q) > 0 at q by increasing q a little bit. On the other hand, if p − C′(q) < 0, we can raise profits by decreasing q a little bit. The only place where the firm has no incentive to change its decision is where p = C′(q*). Note that the logic of the above argument is that if q* is a maximizer of π(q), then π′(q*) = 0. We are using a property of a maximizer to derive conditions it must satisfy, or necessary conditions.

We can make the same argument for any maximization problem:

Theorem 4.1.1 If x* is a local maximum of f(x) and f(x) is differentiable at x*, then f′(x*) = 0.
Proof Suppose x* is a local maximum of f(). If f′(x*) > 0, then we could take a small step to x* + h, and the Taylor series would be

f(x* + h) = f(x*) + f′(x*)h + O(h²)

implying that for very small h > 0, the first-order term dominates the remainder and

f(x* + h) − f(x*) ≈ f′(x*)h > 0

so that x* + h gives a higher value of f() than x*, which would be a contradiction. So f′(x*) cannot be strictly greater than zero.

Suppose x* is a local maximum of f(). If f′(x*) < 0, then we could take a small step to x* − h, and the Taylor series would be

f(x* − h) = f(x*) − f′(x*)h + O(h²)

implying that for very small h > 0,

f(x* − h) − f(x*) ≈ −f′(x*)h > 0

so that x* − h gives a higher value of f() than x*, which would be a contradiction. So f′(x*) cannot be strictly less than zero.

That leaves f′(x*) = 0 as the only possibility: anywhere else, we can improve on f(x*) by taking a small step away from x*, so a local maximum must be a critical point.

So the basic candidate list for an (unconstrained, single-dimensional) maximization problem boils down to

• The set of points at which f(x) is not differentiable.
• The set of critical points, where f′(x) = 0.

Example A consumer with utility function u(q, m) = b log(q) + m over the market good q and money spent on other goods m faces budget constraint w = pq + m. The consumer wants to maximize utility. We can rewrite the constraint as m = w − pq, and substitute it into u(q, m) to get b log(q) + w − pq. Treating this as a one-dimensional maximization problem,

max_q b log(q) + w − pq

our FONCs are

b/q* − p = 0

or

q* = b/p

Note, however, that the set of critical points potentially includes some local minimizers. Why? If x* is a maximizer of −f(x), then −f′(x*) = 0; but then x* is a minimizer of f(x), and f′(x*) = 0. So being a critical point is not enough to guarantee that a point is a maximizer.

Example Consider the function

f(x) = −x⁴/4 + c²x²/2

The FONC is

f′(x) = −x³ + c²x = 0

One critical point is x* = 0.
Dividing by x ≠ 0 and solving gives two more: x* = ±c. So our candidate list has three entries. To figure out which is best, we substitute them back into the objective:

f(0) = 0,  f(+c) = f(−c) = c⁴/4 > 0

So +c and −c are both global maxima (since f(x) is decreasing for x > c and increasing for x < −c, we can ignore all those points). The point x = 0 is a local minimum (but not a global minimum).

On the other hand, we can find critical points that are neither maxima nor minima.

Example Let f(x) = x³ on [−1, 1] (a solution to this problem exists, right?). Then f′(x) = 3x² = 0 has only one solution, x = 0. So the only critical point of x³ is zero. But it is neither a maximum nor a minimum on [−1, 1] since f(0) = 0, but f(1) = 1 and f(−1) = −1. This is an example of an inflection point or a saddle point.

However, if we have built the candidate list correctly, then one or more of the points on the candidate list must be the global maximizer. If worse comes to worst, we can always evaluate the function f(x) for every candidate on the list, and compare their values to see which does best. We would then have found the global maximizer for sure. (But this might be prohibitively costly, which is why we use second-order sufficient conditions, which are the next topic.)

So even though FONCs are useful for building a candidate list (non-differentiable points and critical points), they don't discriminate between maxima, minima, and inflection points. However, we can develop ways of testing critical points to see how they behave locally, called second-order sufficient conditions.

4.2 Second-Order Sufficient Conditions

The idea of second-order sufficient conditions is to provide criteria that ensure a critical point is a maximum or minimum. This gives us both a test to see if a point on the candidate list is a local maximum or local minimum, as well as more information about the behavior of a function near a local maximum in terms of calculus.
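The evaluate-every-candidate procedure just described can be carried out mechanically. A sketch (illustrative, with c = 2 picked by hand) for f(x) = −x⁴/4 + c²x²/2:

```python
# Candidate-list evaluation for f(x) = -x^4/4 + c^2 x^2/2 with c = 2 (hand-picked).
c = 2.0

def f(x):
    return -x**4 / 4 + c**2 * x**2 / 2

candidates = [0.0, c, -c]        # the critical points: solutions of -x^3 + c^2 x = 0
values = {x: f(x) for x in candidates}
best = max(candidates, key=f)
print(values)                    # f(0) = 0, f(+c) = f(-c) = c^4/4 = 4
```

Comparing the values confirms that ±c tie for the global maximum while 0 does strictly worse, exactly as the substitution argument in the text concludes.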
Theorem 4.2.1 If f′(x*) = 0 and f′′(x*) < 0, then x* is a local maximum of f().

Proof The Taylor series at x* is

f(x) = f(x*) + f′(x*)(x − x*) + f′′(x*)(x − x*)²/2 + O(h³)

Since f′(x*) = 0, we have

f(x) = f(x*) + f′′(x*)(x − x*)²/2 + O(h³)

and re-arranging yields

f(x*) − f(x) = −f′′(x*)(x − x*)²/2 − O(h³)

so for h = x − x* very close to zero, the quadratic term dominates the remainder, and since f′′(x*) < 0,

f(x*) − f(x) ≈ −f′′(x*)(x − x*)²/2 > 0

and f(x*) > f(x), so that x* is a local maximum of f(x).

This is the standard proof of the SOSCs, but it doesn't give much geometric intuition about what is going on. Using one-sided derivatives, the second derivative can always be written as

f′′(x) = lim_{h→0} [ f′(x + h) − f′(x) ] / h = lim_{h→0} [ ( f(x + h) − f(x) )/h − ( f(x) − f(x − h) )/h ] / h

which equals

f′′(x) = lim_{h→0} [ f(x + h) + f(x − h) − 2f(x) ] / h²

or

f′′(x) = lim_{h→0} 2 [ ( f(x + h) + f(x − h) )/2 − f(x) ] / h²

Now, the term

( f(x + h) + f(x − h) )/2

is the average of the points one step above and one step below f(x). So if (f(x + h) + f(x − h))/2 < f(x), the average of the function values at x + h and x − h is less than the value at x, so the function must locally look like a "hill". If (f(x + h) + f(x − h))/2 > f(x), then the average of the function values at x + h and x − h is above the function value at x, and the function must locally look like a "valley". This is the real intuition for the second-order sufficient conditions: The second derivative is testing the "curvature" of the function to see whether it's a hill or a valley.

[Figure: Second-Order Sufficient Conditions]

If f′′(x) = 0, however, we can't conclude anything about a critical point. Recall that if f(x) = x³, then f′(x) = 3x² and f′′(x) = 6x. Evaluated at zero, the critical point x = 0 gives f′′(0) = 0. So an indeterminate second derivative provides no information about whether we are at a maximum, minimum or inflection point.
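The hill-versus-valley test is easy to run numerically. A sketch (illustrative, not part of the notes) that classifies critical points by the sign of the second difference quotient from the display above:

```python
# The second-difference curvature test from the text (illustrative sketch):
# approximate f''(x) by [f(x+h) + f(x-h) - 2 f(x)] / h^2 and classify a
# critical point as a hill (negative) or a valley (positive).

def second_difference(f, x, h=1e-4):
    return (f(x + h) + f(x - h) - 2 * f(x)) / h**2

hill = lambda x: -x**2      # critical point at 0, a local maximum
valley = lambda x: x**2     # critical point at 0, a local minimum

assert second_difference(hill, 0.0) < 0    # average of neighbors below f(0): a hill
assert second_difference(valley, 0.0) > 0  # average of neighbors above f(0): a valley
```

The sign of the quotient is exactly the comparison between f(x) and the average of its neighbors, which is the geometric content of the SOSC.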
Example Consider a price-taking firm with cost function C(q), which solves

max_q pq − C(q)

To maximize profit, its FONC is

p − C′(q*) = 0

and its SOSC is

−C′′(q*) < 0

So as long as C′′(q*) > 0, a critical point is a maximum. For example, C(q) = (c/2)q² with c > 0 satisfies C′′(q) = c > 0, so the SOSC is always satisfied.

Example Consider a consumer with utility function u(q, m) = v(q) + m and budget constraint w = pq + m. The consumer then maximizes

max_q v(q) + w − pq

yielding FONC

v′(q*) − p = 0

and SOSC

v′′(q*) < 0

So as long as v′′(q*) < 0, a critical point q* is a local maximum. For example, v(q) = b log(q) satisfies this condition.

Example Recall the function

f(x) = −x⁴/4 + c²x²/2

The FONC is

f′(x) = −x³ + c²x = 0

and it had three critical points, 0 and ±c. The second derivative is

f′′(x) = −3x² + c²

For x* = 0, we get f′′(0) = c² > 0, so it is a local minimum, not a maximum. For x* = ±c, we get f′′(±c) = −3c² + c² = −2c² < 0, so these points are both local maxima.

Example Suppose an agent consumes in period 1 and period 2, with utility function

u(c₁, c₂) = log(c₁) + δ log(c₂)

where c₁ + s = y₁ and Rs + y₂ = c₂, and R > 1 is the interest rate plus 1. Substituting the constraints into the objective yields the problem

max_s log(y₁ − s) + δ log(Rs + y₂)

Maximizing over s yields the FONC

−1/(y₁ − s*) + δR/(Rs* + y₂) = 0

and the SOSC is

−1/(y₁ − s*)² − δR²/(Rs* + y₂)² < 0

which is automatically satisfied, since for any s — not just a critical point — we have

−1/(y₁ − s)² − δR²/(Rs + y₂)² < 0

It's a "nice" feature that the SOSCs are automatically satisfied in the previous example. If we could find general features of functions that guarantee this, it would make our lives much easier. In particular, it turned out that the second derivative was negative for all s, not just at a critical point s*. This is the special characteristic of a concave function. We can say a bit more about local maximizers using a similar approach.
Theorem 4.2.2 If x* is a local maximum of f() and f′(x*) = 0, then f′′(x*) ≤ 0.

Proof If x* is a local maximum, then the Taylor series around x* is

f(x) = f(x*) + f′(x*)(x − x*) + f′′(x*)(x − x*)²/2 + O(h³)

Using the FONCs and re-arranging yields

f(x*) − f(x) = −f′′(x*)(x − x*)²/2 − O(h³)

Since x* is a local maximum, f(x*) ≥ f(x) for x close to x*; dividing by (x − x*)²/2 and letting x → x* then implies

f′′(x*) ≤ 0

These are called second-order necessary conditions, since they follow by necessity from the fact that x* is a local maximum and critical point (can you have a local maximum that is not a critical point?). Exercise 5 asks you to explain the difference between Theorem 4.2.1 and Theorem 4.2.2. This is one of those subtle points that will bother you in the future if you don't figure it out now.

Exercises

1. Derive FONCs and SOSCs for the firm's profit-maximization problem for the cost functions C(q) = (c/2)q² and C(q) = e^{cq}.

2. When an objective function f(x, y) depends on two controls, x and y, and is subject to a linear constraint c = ax + by, the problem can be simplified to a one-dimensional program by solving the constraint in terms of one of the controls,

y = (c − ax)/b

and substituting it into the objective to get

max_x f( x, (c − ax)/b )

Derive FONCs and SOSCs for this problem. What assumptions guarantee that the SOSCs hold?

3. Derive FONCs and SOSCs for the consumer's utility-maximization problem for the benefit functions v(q) = b log(q), v(q) = 1 − e^{−bq}, and v(q) = bq − (c/2)q².

4. For a monopolist with profit function

max_q p(q)q − C(q)

where C(q) is an increasing, convex function, derive the FONCs and SOSCs. Provide the best sufficient condition you can think of for the SOSCs to hold for any increasing, convex cost function C(q).

5. Explain the difference between Theorem 4.2.1 and Theorem 4.2.2 both (i) as completely as possible, and (ii) as briefly as possible. Examples can be helpful.

6.
Prove that for a function f : Rⁿ → R, a point x* is a local maximum only if ∇f(x*) = 0. Suppose a firm hires both capital K and labor L to produce output through technology F(K, L), where p is the price of its good, w is the price of hiring a unit of labor, and r is the price of hiring a unit of capital. Derive FONCs for the firm's profit maximization problem.

7. For a function f : Rⁿ → R and a critical point x*, is it sufficient that f(x) have a negative second partial derivative in each argument x_i,

∂²f(x*)/∂x_i² < 0,

for x* to be a local maximum of f(x)? Prove or disprove with a counter-example. (Try using Matlab or Excel to plot the family of quadratic forms

f(x₁, x₂) = a₁x₁ + a₂x₂ + (b₁/2)x₁² + (b₂/2)x₂² + b₃x₁x₂

and experiment with different coefficients to see if the critical point is a maximum or not.)

8. Suppose you have a hard maximization problem that you cannot solve by hand, or even in a straightforward way on a computer. How might you proceed? (i) Write out a second-order Taylor series around an initial guess, x₀, to the point x₁. Now derive an expression for f(x₁) − f(x₀), and maximize the difference (you are using a local approximation of the function to maximize the increase in the function's value as you move from x₀ to x₁). Solve the first-order necessary condition in terms of x₁. (ii) If you replace x₀ with x_k and x₁ with x_{k+1}, repeating this procedure will generate a sequence of guesses x_k. When do you think it converges to the true x*? This numerical approach to finding maximizers is called Newton's method.
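The iteration described in exercise 8 can be sketched in a few lines (illustrative; the objective f(x) = log(x) + log(1 − x), with maximizer x* = 1/2, is a hand-picked example). Maximizing the second-order Taylor approximation around x_k gives the update x_{k+1} = x_k − f′(x_k)/f′′(x_k):

```python
# A sketch of the Newton iteration from exercise 8 (illustrative, not a
# definitive implementation). Maximizing the quadratic approximation of f
# around x_k yields the update x_{k+1} = x_k - f'(x_k)/f''(x_k).

def newton_max(fp, fpp, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = fp(x) / fpp(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Hand-picked example: f(x) = log(x) + log(1 - x), maximized at x* = 1/2
fp  = lambda x: 1.0 / x - 1.0 / (1.0 - x)            # f'(x)
fpp = lambda x: -1.0 / x**2 - 1.0 / (1.0 - x)**2     # f''(x) < 0 everywhere (concave)
print(newton_max(fp, fpp, 0.3))   # converges to about 0.5
```

Because f′′ < 0 everywhere here, each step maximizes (rather than minimizes) the local quadratic, which is why the same formula can head toward a minimum for a non-concave objective if the starting guess is poor.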
Chapter 5

The Implicit Function Theorem

The quintessential example of an economic model is a partial equilibrium market with a demand equation, expressing how much consumers are willing to pay for an amount q as a function of weather w,

p = f d (q, w)

and a supply equation, expressing how much firms require to produce an amount q as a function of technology t,

p = f s (q, t)

At a market-clearing equilibrium (p∗ , q ∗ ), we have

f d (q ∗ , w) = p∗ = f s (q ∗ , t)

But now we might ask, how do q ∗ and p∗ change when w and t change? If the firms' technology t improves, how is the market affected? If the weather w improves and more consumers want to travel, how does demand shift? Since we rarely know much about f d () and f s (), we want to avoid adding additional, unneeded assumptions to the framework. The implicit function theorem allows us to do this.

In economics, we make a distinction between endogenous variables (p∗ and q ∗ ) and exogenous variables (w and t). An exogenous variable is something like the weather, where no economic agent has any control over it. Another example is the price of lettuce if you are a consumer: Refusing to buy lettuce today probably has no impact on the price, so it is essentially beyond your control, and can be treated as a constant. An endogenous variable is one that is determined "within the system". For example, the weather is out of your control, but you can choose whether to go to the beach or go to the movies. Prices might be beyond a consumer's control, but the consumer can still choose what bundle of goods to purchase. A "good model" is one which judiciously picks what behavior to explain (endogenous variables) in terms of economic circumstances that are plausibly beyond the agents' control (exogenous variables), in the clearest and simplest way possible.
Once a good model is provided, we want to ask, "how do endogenous variables change when exogenous variables change?" The Implicit Function Theorem is our tool for doing this. You might want to see how a change in a tax affects behavior, or how an increase in the interest rate affects capital accumulation. Since data are not available for these "hypothetical" worlds, we need models to explain the relationships between exogenous and endogenous variables, so we can then adjust the exogenous variables in our theoretical laboratory.

Example Consider a partial equilibrium market where consumers have utility function u(q, m) = v(q) + m, where v(q) is increasing and concave, and a budget constraint w = pq + m. Then consumers maximize

max_q v(q) + w − pq

yielding a FONC

v ′ (q ∗ ) − p = 0

and an inverse demand curve given by p = v ′ (q D ). Suppose that firms have increasing, convex costs C(q), and face a tax t for every unit they sell. Then their profit function is

π(q) = (p − t)q − C(q)

The FONC is

p − t − C ′ (q ∗ ) = 0

and the inverse supply curve is given by p = t + C ′ (q S ). A market-clearing equilibrium (p∗ , q ∗ ) occurs where

v ′ (q ∗ ) = p∗ = t + C ′ (q ∗ )

If we re-arrange this equation, we get

v ′ (q ∗ ) − t − C ′ (q ∗ ) = 0

Now, if we think of the market-clearing quantity q ∗ (t) as a function of taxes t, we can totally differentiate with respect to t to get

v ′′ (q ∗ (t)) ∂q ∗ /∂t − 1 − C ′′ (q ∗ (t)) ∂q ∗ /∂t = 0

and re-arranging yields

∂q ∗ /∂t = 1/(v ′′ (q ∗ (t)) − C ′′ (q ∗ (t))) < 0

So if t increases, the market-clearing quantity falls (what happens to the market-clearing price?). The general idea behind this example is called the Implicit Function Theorem.

5.1 The Implicit Function Theorem

Suppose we have an equation

f (x, c) = 0

Then an implicit solution is a function x(c), so that

f (x(c), c) = 0

We say the variables c are exogenous variables, and that the x are endogenous variables.
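Before stating the theorem formally, the tax example above can be sanity-checked numerically. The functional forms below are assumed for illustration (v(q) = b log(q) and C(q) = (c/2)q², both consistent with the increasing/concave and increasing/convex assumptions), so the market-clearing quantity has a closed form to differentiate against.

```python
import math

# Check dq*/dt = 1 / (v''(q*) - C''(q*)) < 0 with assumed forms
# v(q) = b*log(q), C(q) = (c/2)*q^2. Market clearing v'(q) = t + C'(q)
# becomes b/q = t + c*q, i.e. c*q^2 + t*q - b = 0.
b, c = 2.0, 1.0

def q_star(t):
    """Positive root of c*q^2 + t*q - b = 0."""
    return (-t + math.sqrt(t * t + 4 * b * c)) / (2 * c)

t = 0.5
q = q_star(t)
predicted = 1.0 / (-b / q ** 2 - c)   # v''(q) = -b/q^2, C''(q) = c
h = 1e-6
estimated = (q_star(t + h) - q_star(t - h)) / (2 * h)
print(predicted, estimated)  # both negative and nearly equal
```

The implicit-function-theorem prediction and the finite-difference derivative of the closed-form solution agree, and both are negative: a higher tax lowers the traded quantity.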
Theorem 5.1.1 (Implicit Function Theorem) Suppose f (x₀, c₀) = 0 and ∂f (x₀, c₀)/∂x ≠ 0. Then there exists a continuous implicit solution x(c), with derivative

∂x(c)/∂c = − fc (x(c), c)/fx (x(c), c)

for c close to c₀.

Proof If we differentiate f (x(c), c) with respect to c, we get

(∂f (x(c), c)/∂x)(∂x(c)/∂c) + ∂f (x(c), c)/∂c = 0

Re-arranging the equation yields

∂x(c)/∂c = − (∂f (x(c), c)/∂c)/(∂f (x(c), c)/∂x)

which exists for all c only if ∂f (x(c), c)/∂x ≠ 0 for all c. Since we have computed ∂x(c)/∂c, it follows that x(c) is continuous, since all differentiable functions are continuous.

The key step in proving the implicit function theorem is the equation

(∂f (x(c), c)/∂x)(∂x(c)/∂c) + ∂f (x(c), c)/∂c = 0

The first term is the equilibrium effect or indirect effect: The system itself adjusts the endogenous variables to restore equilibrium in the equation f (x(c), c) = 0 (for us, this will occur because a change in c causes agents to re-optimize, cancelling out part of the change). The second term is the direct effect: By changing the parameters, the nature of the system has changed slightly, so that equilibrium will be achieved in a slightly different manner.

Example Consider a monopolist's profit-maximization problem:

max_q p(q)q − (c/2)q²

The first-order necessary condition is

p′ (q)q + p(q) − cq = 0

This equation fits our f (x(c), c) = 0, with q(c) and c as the endogenous and exogenous variables. Totally differentiating as in the proof yields

p′′ (q)q (dq/dc) + 2p′ (q) (dq/dc) − q(c) − c (dq/dc) = 0

and re-arranging a bit yields

(p′′ (q)q + 2p′ (q) − c) (dq/dc) − q(c) = 0

The second term is the direct effect: a small increase in c reduces the monopolist's profits by −q(c). The first term is the equilibrium effect: The monopolist adjusts q(c) to maintain the FONC, and a small change in q changes the FONC by exactly the term in parentheses. But it is unclear how to sign this in general, since the term in parentheses may not immediately mean anything to you (is it positive?
negative?).

5.2 Maximization and Comparative Statics

Suppose we have an optimization problem

max_x f (x, c)

where c is some parameter the decision-maker takes as given, like the temperature or a price that can't be influenced by gaming the market. Let x∗ (c) be a maximizer of f (x, c). Then the FONCs imply

∂f (x∗ (c), c)/∂x = fx (x∗ (c), c) = 0

and

∂²f (x∗ (c), c)/∂x² = fxx (x∗ (c), c) ≤ 0

Note that we know fx (x∗ (c), c) = 0 and fxx (x∗ (c), c) ≤ 0 since we are assuming that x∗ (c) is a maximizer. The FONC looks exactly like the kinds of equations studied in the proof of the implicit function theorem, fx (x, c) = 0. The only difference is that it is generated by a maximization problem, not an abstractly given equation. If we differentiate the FONC with respect to c, we get

fxx (x∗ (c), c) (∂x∗ (c)/∂c) + fcx (x∗ (c), c) = 0

and solving for ∂x∗ /∂c yields

∂x∗ (c)/∂c = fcx (x∗ (c), c)/(−fxx (x∗ (c), c))

This is called the comparative static of x∗ with respect to c: We are measuring how x∗ (c) (the agent's behavior) responds to a change in c (some exogenous parameter outside their control). When the SOSC holds strictly, fxx (x∗ (c), c) < 0, so we know that

sign(∂x∗ (c)/∂c) = sign(fcx (x∗ (c), c))

So x∗ (c) is increasing in c if fcx (x∗ (c), c) ≥ 0.

Theorem 5.2.1 Suppose x∗ (c) is a local maximum of f (x, c). Then

sign(∂x∗ (c)/∂c) = sign(fcx (x∗ (c), c))

Example Recall the monopolist, facing the problem

max_q p(q)q − (c/2)q²

His FONCs are

p′ (q ∗ )q ∗ + p(q ∗ ) − cq ∗ = 0

and his SOSCs are

p′′ (q ∗ )q ∗ + 2p′ (q ∗ ) − c < 0

We apply the IFT to the FONCs: Treat q ∗ as an implicit function of c, and totally differentiate to get

p′′ (q ∗ )q ∗ (∂q ∗ /∂c) + 2p′ (q ∗ ) (∂q ∗ /∂c) − c (∂q ∗ /∂c) − q ∗ = 0

Re-arranging, we get

(p′′ (q ∗ )q ∗ + 2p′ (q ∗ ) − c) (∂q ∗ /∂c) − q ∗ = 0

where the term in parentheses is negative by the SOSCs, or

∂q ∗ /∂c = q ∗ /(p′′ (q ∗ )q ∗ + 2p′ (q ∗ ) − c) < 0

So we use the information from the SOSCs to sign the denominator, giving us an unambiguous comparative static.
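The monopolist's comparative static can be checked against a closed form. The linear inverse demand p(q) = a − q below is an assumed special case: the FONC a − 2q − cq = 0 then gives q∗(c) = a/(2 + c) explicitly, so the IFT formula can be compared with a direct finite-difference derivative.

```python
# Check dq*/dc = q* / (p''(q*) q* + 2 p'(q*) - c) for the (assumed)
# linear inverse demand p(q) = a - q, where p' = -1 and p'' = 0.
a = 10.0

def q_star(c):
    """Closed-form solution of the FONC a - 2q - cq = 0."""
    return a / (2.0 + c)

c = 1.0
q = q_star(c)
predicted = q / (0.0 * q + 2 * (-1.0) - c)   # IFT formula with p''=0, p'=-1
h = 1e-6
estimated = (q_star(c + h) - q_star(c - h)) / (2 * h)
print(predicted, estimated)  # both ≈ -10/9
```

Both numbers are negative, matching the sign we obtained from the SOSC alone.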
Example Suppose an agent consumes in period 1 and period 2, with utility function u(c1 , c2 ) = log(c1 ) + δ log(c2 ), where c1 + s = y1 and Rs + y2 = c2 , and R > 1 is the interest rate plus 1. Substituting the constraints into the objective yields the problem

max_s log(y1 − s) + δ log(Rs + y2 )

Maximizing over s yields the FONC

−1/(y1 − s∗ ) + δR/(Rs∗ + y2 ) = 0

and the SOSC is automatically satisfied, since for any s (not just a critical point) we have

−1/(y1 − s)² − δR²/(Rs + y2 )² < 0

How does s∗ vary with R? Well, I don't want to write it all out again. Let the FONC be the function defined by

f (s∗ (R), R) = 0

Then we know that

∂s∗ /∂R = fR (s∗ (R), R)/(−fs (s∗ (R), R))

From the SOSCs, fs (s∗ (R), R) is negative, so the denominator is positive, and since

fR (s∗ (R), R) = δ/(Rs∗ + y2 ) − δRs∗ /(Rs∗ + y2 )²

we only need to sign fR (s∗ (R), R) to get the sign of the expression. It is positive if

δ(Rs∗ + y2 ) − δRs∗ = δy2 > 0

which is true. So s∗ is increasing in R (and look, we didn't even solve for s∗ (R)).

Example Consider a firm with maximization problem

max_q π(q) = pq − C(q, t)

where t represents the firm's technology. In particular, increasing t reduces the firm's marginal cost for all q, or

∂/∂t (∂C(q, t)/∂q) = Cqt (q, t) < 0

The firm's FONC is

p − Cq (q ∗ , t) = 0

and the SOSC is

−Cqq (q ∗ , t) < 0

We can study how the optimal quantity q ∗ varies with either p or t. Let's start with p. Applying the implicit function theorem, we get

∂q ∗ (p)/∂p = 1/Cqq (q ∗ (p), t) > 0

so the supply curve is upward sloping. If t increases instead, we get

∂q ∗ (t)/∂t = −Cqt (q ∗ , t)/Cqq (q ∗ (p), t) > 0

so that if technology improves, the supply curve shifts out. We might ask, what is the cross-partial with respect to both t and p? There's nothing stopping us from pushing this approach further.
Differentiating the FONC with respect to p, we get

1 − Cqq (q ∗ (t, p), t) (∂q ∗ (t, p)/∂p) = 0

and differentiating that with respect to t we get

−Cqqq (q ∗ (t, p), t) (∂q ∗ /∂t)(∂q ∗ /∂p) − Ctqq (q ∗ (t, p), t) (∂q ∗ /∂p) − Cqq (q ∗ (t, p), t) (∂²q ∗ /∂t∂p) = 0

But now we're in trouble. What's the sign of Cqqq (q ∗ (t, p), t)? Can we make any sense of this? This is the kind of game you often end up playing as a theorist. We have a complicated, apparently ambiguous comparative static that we would like to make sense of. The goal now is to figure out what kinds of worlds have unambiguous answers. Let's start by re-arranging to get the comparative static of interest alone and seeing what we can sign with our existing assumptions:

∂²q ∗ /∂t∂p = [Cqqq (q ∗ (t, p), t) (∂q ∗ /∂t)(∂q ∗ /∂p) + Ctqq (q ∗ (t, p), t) (∂q ∗ /∂p)] / (−Cqq (q ∗ (t, p), t))

where ∂q ∗ /∂t and ∂q ∗ /∂p are positive and −Cqq is negative. So if Cqqq (q, t) and Ctqq (q, t) have the same sign, the cross-partial of q ∗ with respect to p and t will be unambiguous. Otherwise, you need to make more assumptions about how variables relate, adopt specific functional forms, or rely on the empirical literature to argue that some quantities are more important than others.

Example As an application of the implicit function theorem, consider the equation

f −1 (f (x)) = x

If we differentiate this with respect to x, we get

(f −1 )′ (f (x)) f ′ (x) = 1

or

(f −1 )′ (y) = 1/f ′ (x)

where y = f (x).

Theorem 5.2.2 (Inverse Function Theorem) If f ′ (x) > 0 for all x, then (f −1 )′ (y) > 0 for all y, and if f ′ (x) < 0 for all x, then (f −1 )′ (y) < 0 for all y. If f (x) = y and f −1 (y) = x, then

df −1 (y)/dy = 1/f ′ (x)

This allows us to sign the derivative of the inverse of a function just from information about the derivative of the function itself. This is actually pretty useful. This exact theorem is a key step in deriving the Nash equilibrium of many pay-as-bid auctions, for example.
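The inverse-function rule is easy to verify numerically. The example below assumes the test function f(x) = eˣ, whose inverse is log(y), purely for illustration.

```python
import math

# Check (f^{-1})'(y) = 1 / f'(x) at y = f(x), using f(x) = exp(x), f^{-1} = log.
x = 1.3
y = math.exp(x)

h = 1e-6
numerical = (math.log(y + h) - math.log(y - h)) / (2 * h)  # (f^{-1})'(y) by finite differences
predicted = 1.0 / math.exp(x)                              # 1 / f'(x)
print(numerical, predicted)  # agree
```

Since f′(x) = eˣ > 0 everywhere, the theorem also says the inverse must be increasing everywhere, which log(y) is.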
Example From the implicit function theorem applied to a FONC, we have

∂x(c)/∂c = − fxc (x(c), c)/fxx (x(c), c)

Multiplying by c and dividing by x(c) gives

(∂x(c)/∂c)(c/x(c)) = − (fxc (x(c), c)/fxx (x(c), c))(c/x(c))

which is often written as

∂ log(x(c))/∂ log(c) = %∆x(c)/%∆c = − (fxc (x(c), c)/fxx (x(c), c))(c/x(c))

This quantity

∂ log(x(c))/∂ log(c) = (∂x(c)/∂c)(c/x(c))

is called an elasticity. Why are these useful? Suppose we are comparing the effect of a tax on the gallons of water and bushels of apples traded. We might naively compute the derivatives, but what do the numbers even mean? One good is denominated in gallons, and the other in bushels. We might convert gallons to bushels by comparing the weight of water and apples and coming up with a gallons-to-bushels conversion scale, like Fahrenheit to Celsius. But again, this is somewhat irrelevant. People consume a large amount of water every day, while they probably only eat a few dozen apples a year (if that). So we might gather data on usage and improve our conversion scale to have economic significance, rather than just physical significance. This approach, though, seems misguided. The derivative of the quantity q(t) of water traded with respect to t is

lim_{t′ → t} (q(t′ ) − q(t))/(t′ − t)

so the numerator is the change in the quantity, measured in gallons, while the denominator is the change in the tax, measured in dollars. If we multiply by t/q(t), we get

[(q(t′ ) − q(t))/(t′ − t)] (gallons/dollars) × (t/q(t)) (dollars/gallons) = [(q(t′ ) − q(t))/(t′ − t)] (t/q(t))

so the units "cancel", leaving a dimensionless quantity, the elasticity. This can be freely compared across goods since we don't have to keep track of currency, quantity denominations, economic significance of units, and so on.

Exercises

1. A monopolist faces demand curve p(q, a), where a is advertising expenditure, and has costs C(q). Solve for the monopolist's optimal quantity, q ∗ (a), and explain how the optimal quantity varies with advertising if pa (q, a) ≥ 0.

2.
Suppose there is a perfectly competitive market, where consumers maximize utility v(q, w) + m subject to budget constraints w = (p + t)q + m, where t is a tax on consumption of q, and firms have costs C(q) = cq. (i) Characterize a perfectly competitive equilibrium in the market, and show how tax revenue, tq ∗ (t, w), varies with t and w. (ii) Suppose w changes (there is a shock to consumer preferences) and the government wants to adjust t to hold tax revenue constant. Use the IFT to show how to achieve this.

3. An agent is trying to decide how much to invest in a safe asset that yields return 1 and a risky asset that returns H with probability π and L with probability 1 − π, where H > 1 > L. His budget constraint is w = pa + S, where p is the price of the risky asset, a is the number of shares of the risky asset he purchases, and S is the number of shares of the safe asset he purchases. His expected utility (objective function) is

max_{a,S} πu(S + aH) + (1 − π)u(S + aL)

(i) Provide FONCs and SOSCs for the agent's investment problem; when are they satisfied? (ii) How does the optimal a∗ vary with p and H? (iii) Can you figure out how the optimal a∗ varies with wealth?

4. Re-write Theorem 5.2.1 to apply to minimization problems, rather than maximization problems, and provide a proof. Consider the consumer's expenditure-minimization problem, e(p, ū) = min_{q,m} pq + m subject to v(q) + m = ū, where a consumer finds the cheapest bundle that provides ū of utility. How does q ∗ (p, ū) vary in p and ū?

5. Consider a representative consumer with utility function u(q1 , q2 , m) = v(q1 , q2 ) + m and budget constraint w = p1 q1 + p2 q2 + m. The two goods are produced by firm 1, whose costs are C1 (q1 ) = c1 q1 , and firm 2, whose costs are C2 (q2 ) = (c2 /2)q2². (i) Solve for the consumer's and firms' demand and supply curves. How does demand for good 1 change with respect to a change in the price of good 2?
How does supply of good 1 change with respect to a change in the price of good 2? (ii) Solve for the system of equations that characterizes equilibrium in the market. How does the equilibrium quantity of good 1 traded change with respect to a change in firm 2's marginal cost?

6. Do YOU want to be a New York Times columnist? Let's learn IS-LM in one exercise! We start with a system of four equations: (i) The NIPA accounts equation,

Y = C + I + G

so that total national income, Y, equals total expenditure: consumption C plus investment I plus government expenditure G. (ii) The consumption function

C = f (Y − T)

giving consumption as a function of income less taxes. Assume f ′ () ≥ 0, so that more income-less-taxes translates to more spending on consumption. (iii) The investment function

I = i(r)

where r is the interest rate. Assume i′ (r) ≤ 0, so that a higher interest rate leads to less investment. (iv) The money market equilibrium equation

M s = h(Y, r)

so that supply of money must equal demand for money, which depends on national income and the interest rate. Assume hY ≥ 0, so that more income implies a higher demand for currency for trade, and hr ≤ 0, so that a higher interest rate moves resources away from consumption into investment, thereby reducing demand for currency. This can be simplified to two equations,

Y − f (Y − T) − i(r) = G
h(Y, r) = M s

The endogenous variables are Y and r, and the exogenous variables are T, M s , and G. (i) How do increases in G affect Y and r? What does this suggest about fiscal interventions in the economy? (ii) How does an increase in M s affect Y and r? What does this suggest about monetary interventions in the economy? (iii) Suppose there is a balanced-budget amendment so that G = T. How does an increase in G affect Y and r? Explain the effect of the amendment on the government's ability to affect the economy.

7.
Consider a partial equilibrium model with a tax t where λ of the legal incidence falls on the consumer and 1 − λ falls on the firm. The consumer has objective max_q v(q) + w − (p + λt)q and the firm has objective max_q (p − (1 − λ)t)q − C(q). (i) Compute the elasticity of the supply and demand curves with respect to p, λ, and t. (ii) Compute the change in the total taxes paid by the consumer and producer in equilibrium with respect to t and λ. (iii) Show that for a small increase in t starting from t = 0, the increase in the consumer's tax burden is larger than the producer's tax burden when the consumer's demand curve is more inelastic than the producer's supply curve.

Chapter 6

The Envelope Theorem

Recall our friend the monopolist:

max_q p(q)q − (c/2)q²

His FONCs are

p′ (q ∗ )q ∗ + p(q ∗ ) − cq ∗ = 0

and SOSCs are

p′′ (q ∗ )q ∗ + 2p′ (q ∗ ) − c < 0

This defines an implicit solution q ∗ (c) in terms of the marginal cost parameter, c. But then we can define the value function or indirect profit function as

π(c) = p(q ∗ (c))q ∗ (c) − (c/2)(q ∗ (c))²

giving the monopolist's maximized payoff for any value of c. If we differentiate with respect to c, we get

π ′ (c) = p′ (q ∗ (c))q ∗ (c) (∂q ∗ (c)/∂c) + p(q ∗ (c)) (∂q ∗ (c)/∂c) − (1/2)q ∗ (c)² − cq ∗ (c) (∂q ∗ (c)/∂c)

To sign this, it appears at first that we have to use the Implicit Function Theorem to derive ∂q ∗ (c)/∂c, and then maybe substitute it in or just sign as much as we can. But wait! If we re-arrange it to put all the ∂q ∗ (c)/∂c terms together, we get

π ′ (c) = [p′ (q ∗ (c))q ∗ (c) + p(q ∗ (c)) − cq ∗ (c)] (∂q ∗ (c)/∂c) − (1/2)q ∗ (c)²

where the bracketed term is exactly the FONC. Since the FONC is zero at q ∗ (c), we get

π ′ (c) = −(1/2)q ∗ (c)² < 0

without using the IFT at all. This is the basic idea of Envelope Theorems.
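The claim π′(c) = −q∗(c)²/2 can be verified numerically. The linear inverse demand p(q) = a − q below is an assumed special case, chosen because the FONC a − 2q − cq = 0 then gives the closed form q∗(c) = a/(2 + c).

```python
# Envelope-theorem check for the monopolist with (assumed) p(q) = a - q:
# the derivative of the indirect profit function should equal -q*(c)^2 / 2,
# with all the dq*/dc terms cancelling through the FONC.
a = 6.0

def q_star(c):
    return a / (2.0 + c)   # solves the FONC a - 2q - cq = 0

def value(c):
    q = q_star(c)
    return (a - q) * q - 0.5 * c * q * q

c = 1.0
h = 1e-6
dpi_dc = (value(c + h) - value(c - h)) / (2 * h)  # numerical pi'(c)
envelope = -0.5 * q_star(c) ** 2                  # envelope-theorem prediction
print(dpi_dc, envelope)  # both ≈ -2.0
```

The finite-difference derivative of the value function matches the envelope formula without ever computing ∂q∗/∂c.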
The reason it is called an "envelope" is that the value function traces out the curve connecting the maxima of the objective function as the parameter varies:

The Envelope Theorem

When you differentiate the value function, you are studying how the peaks shift.

6.1 The Envelope Theorem

Suppose an agent faces the maximization problem

max_x f (x, c)

where c is some parameter. The FONC is

fx (x∗ (c), c) = 0

Now, consider the value function or indirect payoff function

V (c) = f (x∗ (c), c)

This is the agent's optimized payoff, given the parameter c. We might want to know how V (c) varies with c, or V ′ (c). That derivative equals

V ′ (c) = fx (x∗ (c), c) (∂x∗ (c)/∂c) + fc (x∗ (c), c)

where the first factor, fx (x∗ (c), c), is the FONC. At first glance, it looks like we'll need to determine ∂x∗ (c)/∂c using the implicit function theorem, substitute it in, and then try to sign the expression. But since fx (x∗ (c), c) is zero at the optimum, the expression reduces to

V ′ (c) = fc (x∗ (c), c)

This means that the derivative of the value function with respect to a parameter is the partial derivative of the objective function, evaluated at the optimal solution.

Theorem 6.1.1 (Envelope Theorem) For the maximization problem

max_x f (x, c)

the derivative of the value function is

V ′ (c) = ∂f (x, c)/∂c evaluated at x = x∗ (c)

Again, in words, the derivative of the value function is the partial derivative of f (x, c) with respect to c, evaluated at the optimal choice, x∗ (c).

Example Consider

π(q, c) = pq − (c/2)q²

The FONC is

p − cq ∗ (p, c) = 0

Substituting this back into the objective yields

V (p, c) = p²/(2c)

and

Vp (p, c) = p/c > 0
Vc (p, c) = −p²/(2c²) < 0

If we use the envelope theorem, however,

Vp (p, c) = πp (q ∗ (p, c), p, c) = q ∗ (p, c) = p/c > 0
Vc (p, c) = πc (q ∗ (p, c), p, c) = −(1/2)q ∗ (p, c)² = −p²/(2c²) < 0

These have the correct signs, and illustrate the usefulness of the envelope theorem.

Example Suppose a consumer has utility function u(q, m) and budget constraint w = pq + m.
Then his value function will be

V (p, w) = u(q ∗ (p, w), w − pq ∗ (p, w))

Without even computing FONCs or SOSCs, then, we know that

Vp (p, w) = −um (q ∗ (p, w), w − pq ∗ (p, w)) q ∗ (p, w)

and

Vw (p, w) = um (q ∗ (p, w), w − pq ∗ (p, w))

Do you see that without grinding through the FONCs? So we can figure out how changes in the environment affect the welfare of an agent without necessarily solving for the agent's optimal behavior explicitly. This can be very useful, not just in determining welfare changes, but also in simplifying the analysis of models. The next two examples illustrate how this works.

Example As a simple way of incorporating dynamic concerns into models, we might consider models of firms in which some variables are fixed (capital, technology, capacity, etc.) and maximize over a flexible control variable like price or quantity to generate a short-run profit function. Then, the fixed variable becomes flexible in the long run, giving us the long-run profit function.

Suppose a price-taking firm's cost function depends on the quantity it produces, q, as well as its technology, t. Its short-run profit is

πS (q, p, t) = pq − C(q, t)

maximized over q in the short run. Let's stop now to think about what "better" technology means. Presumably, better technology should, at the least, mean that Ct (q, t) < 0, so that total cost is decreasing in t. Similarly, Cqt (q, t) < 0 would also make sense: Marginal cost Cq (q, t) is decreasing in t. However, is this always the case? Some technologies have scale effects, where for q < q̄, Cqt (q, t) > 0, but for q > q̄, Cqt (q, t) < 0, so that the cost-reducing benefits are sensitive to scale. For example, having a high-tech factory where machines do all the labor will certainly be more efficient than an assembly line with workers, provided that enough cars are being made. Let's see what the FONCs and SOSCs we derive say about the relationships between these partials and cross-partials.
In the "short run" we can treat t as fixed and study how the firm maximizes profits: It has an FONC

p − Cq (q, t) = 0

and SOSC

−Cqq (q, t) < 0

So as long as C(q, t) is convex in q, there will be a unique profit-maximizing q ∗ (p, t). Using the implicit function theorem, we can see how the optimal q ∗ (p, t) varies in t:

−Cqq (q ∗ (p, t), t) (∂q ∗ (p, t)/∂t) − Ctq (q ∗ (p, t), t) = 0

or

∂q ∗ (p, t)/∂t = Ctq (q ∗ (p, t), t)/(−Cqq (q ∗ (p, t), t))

If better technology lowers marginal cost, Ctq (q, t) < 0, and q ∗ (p, t) is increasing in t; otherwise, q ∗ (p, t) is decreasing in t. Note that there is no contradiction between Ct (q, t) < 0 and Cqt (q, t) > 0.

But now suppose we take a step back to the "long run". How should the firm invest in technology? Consider the long-run profit function

πL (p, t) = πS (q ∗ (p, t), p, t) − kt

where the −kt term is the cost of adopting level-t technology. Then the firm's FONC is

d/dt [πS (q ∗ (p, t), p, t)] − k = 0

but by the envelope theorem, the first term is just the partial derivative of the short-run profit function with respect to t, so

−Ct (q ∗ (p, t∗ (p)), t∗ (p)) − k = 0

and the SOSC is

−Cqt (q ∗ (p, t∗ (p)), t∗ (p)) (∂q ∗ (p, t)/∂t) − Ctt (q ∗ (p, t∗ (p)), t∗ (p)) < 0

Since Cqt < 0 implies ∂q ∗ (p, t)/∂t > 0, the first term is positive, so for the SOSC to be satisfied it must be the case that Ctt (q ∗ (p, t), t) is positive and large enough. This means that the marginal cost reduction from t is shrinking in t: Better technology always reduces the total cost function, Ct (q, t) < 0, but at a decreasing rate, Ctt (q, t) > 0. How does the profit-maximizing t∗ (p) depend on price?
Using the implicit function theorem,

−Cqt (q ∗ (p, t∗ (p)), t∗ (p)) [(∂q ∗ (p, t∗ (p))/∂p) + (∂q ∗ (p, t∗ (p))/∂t)(∂t∗ (p)/∂p)] − Ctt (q ∗ (p, t∗ (p)), t∗ (p)) (∂t∗ (p)/∂p) = 0

or

∂t∗ (p)/∂p = −Cqt (∂q ∗ /∂p) / (Cqt (∂q ∗ /∂t) + Ctt )

The numerator is positive, and the denominator is positive if (using the comparative static ∂q ∗ /∂t = −Ctq /Cqq from the short-run problem)

Ctt − Cqt Ctq /Cqq > 0

or

Cqq Ctt − Cqt Ctq > 0

It will turn out that the above inequality, together with Cqq > 0, implies that C(q, t) is a convex function when considered as a two-dimensional object. This kind of short-run/long-run model is very useful in providing simple but rigorous models of how firms behave across time. Of course, there are no dynamics here, but it captures the idea that in the short run the firm can vary output but not technology, while in the long run things like technology, capital, and other "investment goods" become choice variables.

Example This is an advanced example of how outrageously powerful the envelope theorem can be. In particular, we'll use it to derive the (Bayesian Nash) equilibrium strategies of a first-price auction. At a first-price auction, there are i = 1, 2, ..., N buyers competing for a good. Buyer i knows his own value, vi > 0, but no one else's. The buyers simultaneously submit bids bi . The highest bidder wins, and gets a payoff vi − bi , while the losers get nothing. Let p(bi ) be the probability that i wins given a bid of bi . Presumably, all the buyers' bids should be increasing in their values. For example, if buyer 1's value is 5 and buyer 2's value is 3, buyer 1 should bid a higher amount. Said another way, the bid function bi = b(vi ) is increasing.
Then buyer i's expected payoff is

U (vi ) = max_{bi} p(bi )(vi − bi )

with FONC

p′ (bi )(vi − bi ) − p(bi ) = 0

The envelope theorem implies

U ′ (vi ) = p(b(vi ))

and integrating with respect to vi yields

U (vi ) = ∫₀^{vi} p(b(x)) dx

Now, equating the two expressions for U (vi ) gives

p(b(vi ))(vi − b(vi )) = U (vi ) = ∫₀^{vi} p(b(x)) dx

and re-arranging yields

b(v) = v − (∫₀^v p(b(x)) dx) / p(b(v))

If we knew the probability that an agent with value v making a bid b(v) won, we could solve the above expression. But consider the rules of the auction: The highest bidder wins. If b(v) is increasing, then the probability of i being the highest bidder with bid b(vi ) is

pr[b(vi ) ≥ b(v1 ), ..., b(vi ) ≥ b(vi−1 ), b(vi ) ≥ b(vi+1 ), ..., b(vi ) ≥ b(vN )] = pr[vi ≥ v1 , ..., vi ≥ vi−1 , vi ≥ vi+1 , ..., vi ≥ vN ]

If the bidders' values are independent and the probability distribution of each bidder's value v is F (v), then the probability of having the highest value given vi is

pr[vi ≥ v1 , ..., vi ≥ vi−1 , vi ≥ vi+1 , ..., vi ≥ vN ] = pr[vi ≥ v1 ] ⋯ pr[vi ≥ vi−1 ] pr[vi ≥ vi+1 ] ⋯ pr[vi ≥ vN ] = F (vi ) ⋯ F (vi ) = F (vi )^{N−1}

So that p(b(vi )) = F (vi )^{N−1} and

b(v) = v − (∫₀^v F (x)^{N−1} dx) / F (v)^{N−1}

Since this function is indeed increasing, it satisfies the FONC above. So using the envelope theorem and some basic probability, we've just derived the strategies in one of the most important strategic games that economists study. Other derivations require solving systems of differential equations or using a sub-field of game theory called mechanism design.

Exercises

1. Suppose that a price-taking firm can vary its choice of labor, L, in the short run, but capital, K, is fixed. Quantity q is produced using technology F (K, L). In the long run the firm can vary capital. The cost of labor is w, the cost of capital is r, and the price of its good is p. (i) Derive the short-run profit function and show how q ∗ varies with r and p.
(ii) Derive the long-run profit function and show how K varies with r. (iii) How does a change in p affect the long-run choices of K and L, and the short-run choice of L?

2. Suppose an agent maximizes

max_x f (x, c)

yielding a solution x∗ (c). Suppose the parameter c is perturbed to c′. Use a second-order Taylor series expansion to characterize the loss that arises from using x∗ (c) instead of the new maximizer, x∗ (c′ ). Show that the loss is proportional to the square of the maximization error, x∗ (c′ ) − x∗ (c).

3. Suppose consumers have utility function u(Q, m) = b log(Q) + m and face a budget constraint w = pQ + m. Solve for the firms' short-run profit functions in equilibrium. (i) If there are K price-taking firms, each with cost function c(q) = (c/2)q², and aggregate supply is Q = Kq, solve for the firms' short-run profit functions πS (K) in terms of K. (ii) If there is a fixed cost F of entry and there are no other barriers to entry, characterize the long-run profit function πL (K) and solve for the long-run number of firms K ∗. How does K ∗ vary in b, c, and F? (iii) Can you generalize this analysis to a convex, increasing C(q) and concave, increasing v(q) using the IFT and envelope theorem? In particular, how do p∗ and q ∗ respond to an increase in F, and how does K ∗ respond to an increase in F?

4. Consider a partial equilibrium model with a consumer with utility function u(q, m) = v(q) + m and budget constraint w = pq + m, and a firm with cost function C(q). Suppose the firm must pay a tax t for each unit of the good it sells. Let social welfare be given by

W (t) = (v(q ∗ (t)) − p∗ (t)q ∗ (t) + w) + ((p∗ (t) − t)q ∗ (t) − C(q ∗ (t))) + tq ∗ (t)

Use a second-order Taylor series to approximate the loss in welfare from the tax, W (t) − W (0). Show that this welfare loss is approximately proportional to the square of the tax. Sketch a graph of the situation.

5.
Consider a partial equilibrium model with a consumer with utility function u(q, m, R) = v(q) + m + g(R) and budget constraint w = pq + m, where R is government expenditure and g(R) is the benefit to consumers from government expenditure. The firm has a cost function C(q), and pays a tax t for each unit of the good it sells. Tax revenue is used to fund government services, which yield benefit g(R) to consumers, where R = tq is total tax revenue. Define social welfare as

W (t) = (v(q) − pq + w) + ((p − t)q − C(q)) + g(tq)

where the last term represents the benefit of tax revenue. (i) If the firm and consumer maximize their private benefit taking R as a constant, what are the market-clearing price and quantity? How do the market-clearing price and quantity vary with the tax? (ii) What is the welfare-maximizing level of t?

Chapter 7

Concavity and Convexity

At various times, we've seen functions for which the SOSCs are satisfied at any point, not just the critical point. For example,

• A price-taking firm with cost function C(q) = (c/2)q² has SOSC −c < 0, independent of q
• A price-taking consumer with benefit function v(q) = b log(q) has SOSC −b/q² < 0, which is satisfied for any q
• A household with savings problem max_s log(y1 − s) + δ log(Rs + y2 ) has SOSC

−1/(y1 − s)² − δR²/(Rs + y2 )² < 0

which is satisfied for any s

The common feature that ties all these examples together is that f ′′ (x) < 0 for all x, not just a critical point x∗. This implies that the derivative f ′ (x) is a monotone decreasing function of x, so if it has a zero, it can only have one (keep in mind, though, that some decreasing functions have no zeros, like e^{−x}).

Concave functions have a unique critical point, or none

This actually solves all our problems with identifying maxima: If a function satisfies f ′′ (x) < 0, it has at most one critical point, and any critical point will be a global maximum. This is a special class of functions that deserves some study.
7.1 Concavity

Recall the partial equilibrium consumer with preferences u(q, m) = v(q) + m. Earlier, we claimed that v'(q) > 0 and v''(q) < 0 were good economic assumptions: The agent has positive marginal value for each additional unit, but this marginal value is decreasing. Let's see what these economic assumptions imply about Taylor series. The benefit function, v(q), satisfies

v(q) = v(q0) + v'(q0)(q − q0) + (v''(q0)/2)(q − q0)^2 + o(h^3)

If we re-arrange this a little bit, we get

(v(q) − v(q0))/(q − q0) = v'(q0) + (v''(q0)/2)(q − q0) + o(h^2)

and since we know that v''(q0) ≤ 0, this implies, for q close to q0,

(v(q) − v(q0))/(q − q0) ≤ v'(q0)

In words or pictures, the derivative at q0 is steeper than the chord from v(q0) to v(q).

Concavity

This is one way of defining the idea of a concave function. Here are the standard, equivalent definitions:

Theorem 7.1.1 Let f : D = (a, b) → R. The following are equivalent:

• f(x) is concave on D

• For all λ ∈ (0, 1) and x', x'' in D, f(λx' + (1 − λ)x'') ≥ λf(x') + (1 − λ)f(x'')

• f''(x) ≤ 0 for all x in D, when f is twice differentiable

• For all x' and x'' in D,

f'(x') ≥ (f(x'') − f(x'))/(x'' − x')

If the weak inequalities above hold as strict inequalities, then f(x) is strictly concave.

Note that concavity is a global property (it holds on all of D) that depends on the domain of the function, D = (a, b). For example, log(x) is concave on (0, ∞), since

f''(x) = −1/x^2 < 0

for all x in (0, ∞). The function √|x|, however, is concave on (0, ∞) and concave on (−∞, 0), but not concave on (−∞, ∞). Why? If we connect the points at √|±1| = 1 by a chord, it lies above the function at √|0| = 0, violating the second criterion for concavity. So a function can have f''(x) ≤ 0 for some x and not be concave: It is concave only if f''(x) ≤ 0 for all x in its domain, D.
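The second (chord) criterion is easy to verify numerically for a specific concave function. A minimal sketch for f(x) = log(x) on (0, ∞), with two assumed endpoints, checks that the function lies above the chord at every mixture:

```python
import math

# Chord condition of Theorem 7.1.1 for f(x) = log(x):
# f(lam*x1 + (1-lam)*x2) >= lam*f(x1) + (1-lam)*f(x2)
f = math.log
x1, x2 = 0.5, 7.0   # assumed endpoints in (0, infinity)

for k in range(1, 10):
    lam = k / 10
    mix = lam * x1 + (1 - lam) * x2
    assert f(mix) >= lam * f(x1) + (1 - lam) * f(x2)
```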
Theorem 7.1.2 If f(x) is differentiable and strictly concave, then it has at most one critical point, which is the unique global maximum.

Proof If f(x) is strictly concave, the FONC is

f'(x) = 0

Since f(x) is strictly concave, the derivative f'(x) is monotone decreasing. This implies there is at most one time that f'(x) crosses zero, so it has at most one critical point. Since

f(x*) − f(x) = −(f''(x*)/2)(x − x*)^2 > 0

holds (if f''(x) < 0 for all x, not just x*), this is a local maximum. Since there is a unique critical point x* and f''(x) < 0 for all x, the candidate list consists only of x*.

There is some room for confusion about what the previous theorem buys you, because a concave function might not have any critical points. For example, f(x) = log(x) has FONC

f'(x) = 1/x

which reaches f'(x) = 0 only as x → ∞ (and ∞ is not a number). Since log(x) → ∞ as x → ∞, there isn't a finite maximizer. On the other hand,

f(x) = log(x) − x

has FONC

f'(x) = 1/x − 1

and SOSC

f''(x) = −1/x^2

so that x* = 1 is the unique global maximizer.

7.2 Convexity

Similarly, our cost function C(q) satisfies

C(q) = C(q0) + C'(q0)(q − q0) + (C''(q0)/2)(q − q0)^2 + o(h^3)

Re-arranging it the same way, we get the opposite conclusion,

(C(q) − C(q0))/(q − q0) ≥ C'(q0)

so that the chord from C(q0) to C(q) is steeper than the derivative C'(q0). This is a convex function.

Convexity

Theorem 7.2.1 Let f : D = (a, b) → R. The following are equivalent:

• f(x) is convex on D

• For all λ ∈ (0, 1) and x', x'' in D, f(λx' + (1 − λ)x'') ≤ λf(x') + (1 − λ)f(x'')

• f''(x) ≥ 0 for all x in D, when f is twice differentiable

• For all x', x'' in D,

(f(x'') − f(x'))/(x'' − x') ≥ f'(x')

If the above weak inequalities hold strictly, then f(x) is strictly convex.

Convexity is a useful property for, in particular, minimization.

Exercises

1.
Rewrite Theorem 7.1.2, replacing the word "concave" with "convex" and "maximization" with "minimization." Sketch a graph like the first one in the chapter to illustrate the proof.

2. Show that if f(x) is concave, then −f(x) is convex, and if f(x) is convex, then −f(x) is concave.

3. Show that if f1(x), f2(x), ..., fn(x) are concave, then f(x) = Σ_{i=1}^n fi(x) is concave. If f1(x), f2(x), ..., fn(x) are convex, then f(x) = Σ_{i=1}^n fi(x) is convex.

4. Can concave or convex functions have discontinuities? Provide an example of a concave or convex function with a discontinuity, or show why they must be continuous.

Part III

Optimization and Comparative Statics in RN

Chapter 8

Basics of RN

Now we need to generalize all the key results (extreme value theorem, FONCs/SOSCs, and the implicit function theorem) to situations in which many decisions are being made at once, sometimes subject to constraints. Since we need functions involving many choice variables and possibly many parameters, we need to generalize the real numbers to RN, or N-dimensional Euclidean vector space.

An N-dimensional vector is a list of N ordered real numbers,

x = (x1, x2, ..., xN)'

where certain rules of addition and multiplication are defined. In particular, if x and y are vectors and a is a real number, scalar multiplication is defined as

ax = (ax1, ax2, ..., axN)'

and vector addition is defined as

x + y = (x1 + y1, x2 + y2, ..., xN + yN)'

The transpose of a column vector is a row vector, x' = (x1, x2, ..., xN), and vice versa. A basis vector ei is a vector with a 1 in the i-th spot and zeros everywhere else.

The set of N-dimensional vectors with real entries, RN, is Euclidean, since we define length by the Euclidean norm. The norm is a generalization of absolute value, and is given by

||x|| = sqrt(x1^2 + x2^2 + ... + xN^2)

Though, we could just as easily use

||x|| = max_i |xi|

or

||x|| = (Σ_i |xi|^p)^{1/p}

These are all just ways of summarizing how "large" a vector is.
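The three norms just defined can be computed directly. A small sketch for an assumed sample vector (the helper names are not standard):

```python
import math

x = [3.0, -4.0, 12.0]   # assumed sample vector

euclidean = math.sqrt(sum(v * v for v in x))      # sqrt(9 + 16 + 144) = 13
sup_norm = max(abs(v) for v in x)                 # the max norm

def p_norm(p):
    # the p-norm (sum of |x_i|^p, raised to 1/p)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

assert euclidean == 13.0
assert sup_norm == 12.0
assert abs(p_norm(2) - euclidean) < 1e-12   # p = 2 recovers the Euclidean norm
```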
The Norm

For N = 1, this is just R, since ||x|| = sqrt(x^2) = |x|. With this norm in mind, we define the distance between two vectors x and y as

||x − y|| = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xN − yN)^2)

Instead of defining an open interval (a, b), we define the open ball of radius δ at x0, given by

Bδ(x0) = {y : ||y − x0|| < δ}

and instead of defining a closed interval [a, b], we define the closed ball of radius δ at x0,

B̄δ(x0) = {y : ||y − x0|| ≤ δ}

The last special piece of structure on Euclidean space is vector multiplication, or the inner product or dot product:

x'y = <x, y> = x · y = x1y1 + x2y2 + ... + xNyN

In terms of geometric intuition, the dot product is related to the angle θ between x and y:

cos(θ) = (x · y)/(||x|| ||y||)

Note that if x · y = 0, then cos(θ) = 0, so that the vectors must be at right angles to each other, or x is orthogonal to y.

Orthogonal Vectors

So a Euclidean space has notions of direction, distance, and angle, and open and closed balls are easy to define. Not all spaces have these properties, which makes Euclidean space special1.

8.1 Intervals → Topology

In R, we had open sets (a, b), closed sets [a, b], and bounded sets, where a and b are finite. In multiple dimensions, however, it's not immediately obvious what the generalizations of these properties should be. For example, we could define open cells as sets C = (0, 1) × (0, 1) × ... × (0, 1), which look like squares or boxes. Or we could define open balls as sets B = {y : ||y|| < 1}. But then, we're going to study things like budget sets, defined by inequalities like {(x, y) : x ≥ 0, y ≥ 0, px x + py y ≤ w}, which are neither balls nor cells. To avoid these difficulties, we'll use some ideas from topology.

Topology is the study of properties of sets, and what kinds of functions preserve those properties. For example, the function f(x) = x maps the set [−1, 1] into the set [−1, 1], but the function

g(x) = x if |x| ≠ 1, and g(x) = 0 if |x| = 1

maps [−1, 1] into (−1, 1).
So f(x) maps a closed set to a closed set, while g(x) maps a closed set to an open set. So the question, "What kinds of functions map closed sets to closed sets?" is a topological question.

Why do YOU care about topology? What we want to do is study the image of a function, f(D), and see if it achieves its maximum. The function f(x) = x above achieves a maximum at x = 1. The second function, g(x), does not achieve a maximum, since sup g([−1, 1]) = 1, but 1 is not in g([−1, 1]). Our one-dimensional Weierstrass theorem tells us that we shouldn't expect g(x) to achieve a maximum, since it is not a continuous function. But how do we generalize this to Euclidean space and to more general sets than simple intervals (right? Because now we might be maximizing over spheres, triangles, tetrahedrons, simplexes, and all kinds of non-pathological sets that are more complicated than [a, b])?

1 For example, the space of continuous functions over a closed interval [a, b] is also a vector space, called C([a, b]), with scalar multiplication af = (af(x)), vector addition f + g = (f(x) + g(x)), norm ||f|| = max_{x∈[a,b]} |f(x)|, and distance d(f, g) = max_{x∈[a,b]} |f(x) − g(x)|. You use this space all the time when you do dynamic programming in macroeconomics, since you are looking for an unknown function that satisfies certain properties. This space has very different properties — for example, the closed unit ball {g : d(f, g) ≤ 1} is not compact.

Definition 8.1.1

• A set is open if it is a union of open balls

• x is a point of closure of a set S if for every δ > 0, there is a y in S such that ||x − y|| < δ

• The set of all points of closure of a set S is S̄, and S̄ is the closure of S

• A set S is closed if S̄ = S

• A set S is closed if its complement, S^c, is open

• A set S is bounded if it is a subset of an open ball Bδ(x0), where δ is finite.
Otherwise, the set is unbounded.

Open Balls, Closed Balls, and Points of Closure

Note that x can be a point of closure of a set S without being a member of S. For example, the open interval (−1, 1) is an open set, since it is an open ball, B1(0) = {y : |y − 0| < 1} = (−1, 1). Each point in B1(0) is a point of closure of (−1, 1), since for every δ > 0, the ball of radius δ around it contains an element of (−1, 1). However, the points −1 and 1 are also points of closure of (−1, 1), since for every δ > 0, the balls of radius δ around −1 and 1 contain some points of (−1, 1). So the closure of (−1, 1) is [−1, 1]. As you might have expected, this is a closed set.

This works just as well in Euclidean space. The open ball at zero, Bδ(0) = {y : ||y − 0|| < δ}, is an open ball, so it is a union of open balls, so it is open. All elements of Bδ(0) are points of closure of Bδ(0), since the ball of any positive radius around each of them contains a point of Bδ(0). However, the points satisfying {y : ||y − 0|| = δ} are also points of closure, since part of any open ball around one of these points must intersect with Bδ(0). Consequently, the set of all points of closure of Bδ(0) is the set {y : ||y − 0|| ≤ δ}, which is the closed ball.

There are many other properties that are considered topological: connected, dense, convex, separable, and so on. But for our purposes of developing maximization theory, there is a special topological property that will buy us what we need to generalize the extreme value theorem:

Definition 8.1.2 A subset K of Euclidean space is compact if each sequence in K has a subsequence that converges to a point in K.

The Bolzano-Weierstrass theorem (which is a statement about the properties of sequences) looks similar, but this definition concerns the properties of a set. A set is called compact if any sequence constructed from it has the Bolzano-Weierstrass property. For example, the set (a, b) is not compact, because xn = b − 1/n is a sequence entirely in (a, b), but all of its subsequences converge to b, which is not in (a, b).
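The non-compactness of (a, b) can be illustrated numerically with exactly the sequence above, here with assumed endpoints a = 0 and b = 1:

```python
# The sequence x_n = b - 1/n lives entirely in (a, b), but its only
# (subsequential) limit is b, which escapes the open interval.
a, b = 0.0, 1.0   # assumed endpoints

xs = [b - 1.0 / n for n in range(2, 10001)]
assert all(a < x < b for x in xs)   # every term is inside (a, b)

limit = b                            # the limit of every subsequence
assert not (a < limit < b)           # ... but it is not in (a, b)
```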
Characterizing compact sets in a space is the typical starting point for studying optimization in that space2. As it happens, there is an easy characterization for RN:

Theorem 8.1.3 (Heine-Borel) In Euclidean space, a set is compact iff it is closed and bounded.

Basically, bounded sets that include all their points of closure are compact in RN. Non-compact sets are either unbounded, like {(x, y) : (x, y) ≥ (0, 0)}, or open, like {(x, y) : x^2 + y^2 < c}.

Example Consider the set in RN described by xi ≥ 0 for all i = 1, ..., N, and

Σ_{i=1}^N pi xi = p · x ≤ w

This is called a budget set, Bp,w. In two dimensions, this looks like a triangle with vertices at 0, (w/p1)e1, and (w/p2)e2. This set is compact. We'll use the Heine-Borel theorem, and show the set is closed and bounded.

Budget Sets

It is bounded, since if we take δ = max_i w/pi + 1, the set is contained in Bδ(0) (right?). It is closed since if we take any sequence xn satisfying xn ≥ 0 for all n and p · xn ≤ w for all n, the limit of the sequence must satisfy these inequalities as well (a sequence that is weakly positive in every term can't become negative in the limit). If you don't like that argument, we can prove Bp,w is closed by showing that the complement of Bp,w is open: Let y be a point outside Bp,w. Then the segment z(α) = αy, where α ∈ [0, 1], starts in the budget set at 0, but eventually exits it to reach y. Take α* to be the value at which the segment exits the set. Then we can draw an open ball around y with radius r = (1 − α*)||y||/2, so that Br(y) contains no points of Bp,w. That means that around any y in the complement of Bp,w, we can draw an open ball that contains no point of Bp,w. So the complement of Bp,w is a union of open balls, hence open, and Bp,w is closed.

2 See the earlier footnote: closed, bounded sets in C([a, b]) are not compact.

So competitive budget sets are compact. In fact, we've shown that any set characterized by x ≥ 0 and a · x ≤ c is compact, which is actually a sizeable number of sets of interest.
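A small sketch makes the closedness argument concrete, with assumed prices p = (2, 3) and wealth w = 12: a sequence in the budget set approaching the boundary point (6, 0) has its limit in the set as well.

```python
# Membership test for the budget set B_{p,w} = {x >= 0 : p.x <= w},
# with assumed prices and wealth (not from the text).
p, w = (2.0, 3.0), 12.0

def in_budget(x):
    return all(v >= 0 for v in x) and sum(pi * xi for pi, xi in zip(p, x)) <= w

# A sequence inside the set converging to the boundary point (6, 0):
xs = [(6.0 - 1.0 / n, 0.0) for n in range(1, 1000)]
assert all(in_budget(x) for x in xs)
assert in_budget((6.0, 0.0))   # the limit is still in the set: closedness
```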
Example The N-dimensional unit simplex, ∆N, is constructed by taking the N basis vectors, (e1, e2, ..., eN), and considering all convex combinations

xλ = λ1 e1 + λ2 e2 + ... + λN eN

such that Σ_{i=1}^N λi = 1 and 0 ≤ λi ≤ 1. Then ∆N is bounded, since if we take the N-dimensional open ball B2(0) of radius 2, it contains the simplex. And ∆N is closed, since, taking any convergent sequence of weights λn → λ, we have lim_{n→∞} λn'(e1, ..., eN) = λ'(e1, ..., eN), which is in the simplex. So the simplex is its own closure, ∆̄N = ∆N, so it is closed. The simplex comes up in probability theory all the time: It is the set of all probability distributions over N outcomes.

8.2 Continuity

Having generalized the idea of "[a, b]" to RN, we now need to generalize continuity. Continuity is more difficult to visualize in RN, since we can no longer sketch a graph on a piece of paper. For a function f : R2 → R, we can visualize the graph in three dimensions. A continuous function is one for which the "sheet" is relatively smooth: It may have "ridges" or "kinks" like a crumpled-up piece of paper that has been smoothed out, but there are no "rips" or "tears" in the surface.

Definition 8.2.1 A function f : D → R is continuous at c ∈ D if for all ε > 0, there exists a δ > 0 so that if ||x − c|| < δ, then |f(x) − f(c)| < ε.

The only modification from the one-dimensional definition is that we have replaced the set (c − δ, c + δ) with an open ball, Bδ(c) = {x : ||x − c|| < δ}. Otherwise, everything is the same. Subsequently, the proof of the Sequential Criterion for Continuity is almost exactly the same:

Theorem 8.2.2 (Sequential Criterion for Continuity) A function f : D → R is continuous at c iff for all sequences xn → c, we have

lim_{n→∞} f(xn) = f(lim_{n→∞} xn) = f(c)

Again, continuity allows us to commute the function f() with a limit operator lim. This is the last piece of the puzzle of generalizing the extreme value theorem.
8.3 The Extreme Value Theorem

So, what do you have to internalize from the preceding discussion in this chapter?

• A set is open if it is a union of open balls.

• A set is closed if it contains all its points of closure. A set is closed if its complement is open.

• A set is compact in RN iff it is closed and bounded.

• In compact sets, all sequences have a convergent subsequence.

• If a function is continuous in RN, then

lim_{n→∞} f(xn) = f(lim_{n→∞} xn)

for any sequence xn → c.

The above facts let us generalize the Extreme Value Theorem:

Theorem 8.3.1 (Weierstrass' Extreme Value Theorem) If K is a compact subset of RN and f : K → R is a continuous function, then f(x) achieves a maximum on K.

Proof Let

sup f(K) = m*

Then we can construct a sequence xn satisfying

m* − 1/n ≤ f(xn) ≤ m*

Since K is compact, the sequence xn has a convergent subsequence x_{nk} → x*. Taking limits throughout the inequality, we get

lim_{k→∞} (m* − 1/nk) ≤ lim_{k→∞} f(x_{nk}) ≤ m*

and by continuity of f(),

m* ≤ f(lim_{k→∞} x_{nk}) ≤ m*

so that

m* ≤ f(x*) ≤ m*

Since K is closed, the limit x* of the convergent subsequence x_{nk} is in K (see the proof appendix; closed sets contain all of their limit points). Therefore, a maximizer exists, since x* is in K and achieves the supremum, f(x*) = m*.

This is the key result in explaining when maximizers and minimizers exist in RN. Moreover, the assumptions are generally easy to verify: Compact sets are closed and bounded, which we can easily find in Euclidean space, and since our maximization problems generally involve calculus, f(x) will usually be differentiable, which implies continuity.

8.4 Multivariate Calculus

As we discussed earlier, differentiability of a function f : RN → R is slightly more complicated than for the one-dimensional case. We review some of that material here briefly and introduce some new concepts.
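Before turning to calculus, the extreme value theorem is easy to illustrate numerically: a brute-force grid search over a compact set always returns a maximizer in the set. The sketch below uses an assumed objective f(x, y) = log(1 + x) + log(1 + y) and the assumed budget set {x, y ≥ 0 : x + 2y ≤ 10}:

```python
import math

def f(x, y):
    # assumed continuous objective (not from the text)
    return math.log(1 + x) + math.log(1 + y)

best, argmax = -math.inf, None
n = 200
for i in range(n + 1):
    for j in range(n + 1):
        x, y = 10 * i / n, 5 * j / n
        if x + 2 * y <= 10:            # stay inside the compact budget set
            if f(x, y) > best:
                best, argmax = f(x, y), (x, y)

assert argmax is not None               # a maximizer is attained...
assert argmax[0] + 2 * argmax[1] <= 10  # ...and it lies in the set
```

Grid search is crude, but it mirrors the logic of the proof: the compact set guarantees the search has somewhere to converge.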
Definition 8.4.1

• The partial derivative of f(x) with respect to xi, where f : D → R and D is an open subset of RN, is

∂f(x)/∂xi = f_{xi}(x) = lim_{h→0} [f(x1, ..., xi−1, xi + h, xi+1, ..., xN) − f(x1, ..., xi−1, xi, xi+1, ..., xN)]/h

• The gradient of f(x) is the vector of partial derivatives,

∇f(x) = (∂f(x)/∂x1, ∂f(x)/∂x2, ..., ∂f(x)/∂xN)

• The total differential of f(x) at x is

df(x) = Σ_{i=1}^N f_{xi}(x) dxi

Now that we have a slightly better understanding of RN, the geometric intuition of the gradient is more clear. The partial derivative with respect to xi is a one-dimensional derivative in the i-th dimension, giving the change in the function value that can be attributed to perturbing xi slightly. The gradient is just the vector from the point (x, f(x)) pointing in the direction ∇f(x), which represents how the function is changing at that point. If we multiply ∇f(x) by the basis vector ei — where ei has a 1 in the i-th entry and zeros otherwise — we get

∇f(x) · ei = ∂f(x)/∂xi

so we are asking, "If we increase xi a small amount, how does f(x) change?"

What if we wanted to investigate how f(x) changes in some other direction, y? The change in f(x) in the direction yh as h goes to zero is given by

lim_{h→0} [f(x1 + y1 h, ..., xN + yN h) − f(x1, ..., xN)]/h

To compute this, we can use a Taylor series dimension-by-dimension on f(x1 + y1 h, x2 + y2 h) to get

f(x1 + y1 h, x2 + y2 h) = f(x1, x2 + y2 h) + y1 h ∂f(ξ1, x2 + y2 h)/∂x1

where ξ1 is between x1 and x1 + y1 h. Doing this again with respect to x2 gives

f(x1, x2 + y2 h) = f(x1, x2) + y2 h ∂f(x1, ξ2)/∂x2

Substituting the above equation into the previous one yields

f(x1 + y1 h, x2 + y2 h) = f(x1, x2) + y2 h ∂f(x1, ξ2)/∂x2 + y1 h ∂f(ξ1, x2 + y2 h)/∂x1

Re-arranging and dividing by h yields

[f(x1 + y1 h, x2 + y2 h) − f(x1, x2)]/h = y1 ∂f(ξ1, x2 + y2 h)/∂x1 + y2 ∂f(x1, ξ2)/∂x2

Since x1 + y1 h → x1 as h → 0, ξ1 → x1 and ξ2 → x2.
Then taking limits in h yields

Dy f(x) = y1 ∂f(x1, x2)/∂x1 + y2 ∂f(x1, x2)/∂x2

Note that even though y is not an "infinitesimal vector", this is the differential change in the value of f(x) with respect to an infinitesimal change in x in the direction y. In general,

Theorem 8.4.2 The change in f(x) in the direction y is given by the directional derivative of f(x) in the direction y,

Dy f(x) = Σ_{i=1}^N (∂f(x)/∂xi) yi

or

Dy f(x) = ∇f(x) · y

This is certainly useful for maximization purposes: You should anticipate that x* is a local maximum if, for any y, Dy f(x*) = 0. Notice also that the total differential, df(x), is just the directional derivative with respect to the infinitesimal vector dx = (dx1, dx2, ..., dxN), or

df(x) = ∇f(x) · dx

The Hessian

The gradient only generalizes the first derivative, however, and our experience from the one-dimensional case tells us that the second derivative is also important. But Exercise 8 from Chapter 4 suggests that the vector of second partial derivatives of f(x) is not sufficient to determine whether a point is a local maximum or minimum. The correct generalization of the second derivative is not another vector, but a matrix.

Definition 8.4.3 The Hessian of f(x) is the N × N matrix whose (i, j)-th entry is ∂^2 f(x)/∂xi∂xj:

H(x) =
[ ∂^2 f(x)/∂x1^2     ∂^2 f(x)/∂x1∂x2   ...   ∂^2 f(x)/∂x1∂xN ]
[ ∂^2 f(x)/∂x2∂x1    ∂^2 f(x)/∂x2^2    ...   ∂^2 f(x)/∂x2∂xN ]
[ ...                                                          ]
[ ∂^2 f(x)/∂xN∂x1    ∂^2 f(x)/∂xN∂x2   ...   ∂^2 f(x)/∂xN^2  ]

Various notations for the Hessian are ∇^2_x f(x), D^2 f(x), ∇xx f(x), and H(x). For notational purposes, it is sometimes easier to write ∇xx = ∇x · ∇x' for the matrix of second-order differential operators ∂^2/∂xi∂xj, and think of the Hessian as ∇xx f(x) = H(x): when you take the "gradient of a gradient" you get a Hessian.
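The formula Dy f(x) = ∇f(x) · y is easy to check by finite differences. The sketch below uses an assumed function f(x1, x2) = x1^2 x2, the point x = (1, 2), and the direction y = (3, 4):

```python
def f(x1, x2):
    # assumed example function (not from the text)
    return x1 ** 2 * x2

x, y, h = (1.0, 2.0), (3.0, 4.0), 1e-6

# one-sided difference quotient along the ray x + h*y
num = (f(x[0] + h * y[0], x[1] + h * y[1]) - f(*x)) / h

grad = (2 * x[0] * x[1], x[0] ** 2)         # analytic gradient (2*x1*x2, x1^2)
analytic = grad[0] * y[0] + grad[1] * y[1]  # grad . y = 4*3 + 1*4 = 16

assert abs(num - analytic) < 1e-3
```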
To begin understanding this matrix, consider the quadratic form

f(x1, x2) = a1 x1 + a2 x2 + (1/2) b11 x1^2 + (1/2) b22 x2^2 + b3 x1 x2 + c

This is the correct generalization of a quadratic function bx^2 + ax + c to two dimensions, allowing interaction between the x1 and x2 arguments. It can be written as

f(x) = a'x + (1/2) x'Bx + c

where a = [a1, a2]' and

B = [ b11  b3  ]
    [ b3   b22 ]

By increasing the rows and columns of a and B, this can easily be extended to mappings from RN for any N. The gradient of f(x1, x2) is

∇f(x) = ( a1 + b11 x1 + b3 x2 , a2 + b22 x2 + b3 x1 )

But you can see that x2 appears in f_{x1}(x), and vice versa, so that changes in x2 affect the partial derivative with respect to x1. To summarize this information, we need the extra, off-diagonal terms of the Hessian, or

H(x) = [ b11  b3  ] = B
       [ b3   b22 ]

A basic fact of importance is that the Hessian is a symmetric matrix, H' = H, when f(x) has continuous second partial derivatives:

Theorem 8.4.4 (Young's Theorem) If f : RN → R has continuous second partial derivatives, then

∂^2 f(x)/∂xi∂xj = ∂^2 f(x)/∂xj∂xi

and more generally, the order of partial differentiation can be interchanged arbitrarily for all higher-order partial derivatives, provided those derivatives are continuous.

8.5 Taylor Polynomials in RN

Second-order approximations played a key role in developing second-order sufficient conditions for the one-dimensional case, and we need a generalization of them for the multi-dimensional case. Again, however, the right "generalization" is somewhat subtle.
The second-order Taylor polynomials of the one-dimensional case were quadratic functions with an error term,

f(x) = f(x0) + f'(x0)(x − x0) + (f''(x0)/2)(x − x0)^2 + o(h^3)

In the multi-variate case, we start with a quadratic function,

f(x) = c + a'x + (1/2) x'Bx

and replace each piece with its generalization of the one-dimensional derivative and second derivative to get a quadratic approximation,

f(x) ≈ f(x0) + ∇f(x0)(x − x0) + (1/2)(x − x0)' H(x0) (x − x0)

and this quadratic approximation becomes a Taylor polynomial when we add the remainder/error term, o(h^3):

f(x) = f(x0) + ∇f(x0)(x − x0) + (1/2)(x − x0)' H(x0) (x − x0) + o(h^3)

where h = x − x0.

Definition 8.5.1 Let f : D → R be a thrice-differentiable function, and D an open subset of RN. Then the second-order Taylor polynomial of f(x) at x0 is

f(x) = f(x0) + ∇f(x0)(x − x0) + (1/2)(x − x0)' H(x0)(x − x0) + o(h^3)

where h = x − x0, and the remainder o(h^3) → 0 as x → x0. A linear approximation of f around a point x0 is the function

f(x) ≈ f(x0) + ∇f(x0)(x − x0)

and a quadratic approximation of f around a point x0 is the function

f(x) ≈ f(x0) + ∇f(x0)(x − x0) + (1/2)(x − x0)' H(x0)(x − x0)

Higher-order approximations require a lot more notation, but second-order approximations are all we need.

8.6 Definite Matrices

In the one-dimensional case, the quadratic function f(x) = c + ax + Ax^2 was well-behaved for maximization purposes when it had a negative second derivative, or A < 0. In the multi-variate case,

f(x) = c + a · x + (1/2) x'Ax

it is unclear what sign x'Ax takes. For instance, a · x > 0 if a ≥ 0 and x ≥ 0. But when is x'Ax ≥, ≤, >, < 0? If we knew this, it would be easier to pick out "nice" optimization problems, just as we know that the one-dimensional function f(x) = Ax^2 + ax + c will only have a "nice" solution if A < 0.

We can easily pick out some matrices A so that x'Ax < 0 for all x ∈ RN \ {0}. If A is the diagonal matrix

A = diag(a1, a2, ..., aN)
then x'Ax = Σ_{i=1}^N ai xi^2, and this will only be negative for every x ≠ 0 if each ai < 0. Notice that,

1. The eigenvalues of A are precisely the diagonal terms a1, ..., aN

2. The absolute value of each ai is greater than the sum of the off-diagonal terms in its row (which is zero)

3. If we compute the determinant of each sub-matrix starting from the upper left-hand corner, the determinants alternate in sign, starting negative

These are all characterizations of a negative definite matrix, or one for which x'Ax < 0 for all x ≠ 0. But once we start adding off-diagonal terms, this all becomes much more complicated, because it is no longer sufficient merely to have negative terms on the diagonal. Consider the following matrix:

A = [ −1  −2 ]
    [ −2  −1 ]

All the terms are negative, so many students unfamiliar with definite matrices think that x'Ax will then be negative, as an extension of their intuitions about one-dimensional quadratic forms ax^2 + bx. But this is false. Take the vector x = [1, −1]':

x'Ax = [1, −1] A [1, −1]' = 1 + 1 = 2 > 0

The failure here is that the quadratic form

x'Ax = −x1^2 − x2^2 − 4 x1 x2

looks like a "saddle", not like a "hill". If x1 x2 < 0, then −4 x1 x2 > 0, and this can wipe out −x1^2 − x2^2, leaving a positive number. So a negative definite matrix must have a negative diagonal, and the "contribution" of the diagonal terms must outweigh those of the off-diagonal terms.

Let A be an N × N matrix, and x = (x1, x2, ..., xN) a vector in RN. A quadratic form on RN is the function

x'Ax = Σ_{i=1}^N Σ_{j=1}^N aij xi xj

Definition 8.6.1 Let A be an n × n symmetric matrix. Then A is

• positive definite if, for any x ∈ Rn\{0}, x'Ax > 0

• negative definite if, for any x ∈ Rn\{0}, x'Ax < 0

• positive semi-definite if, for any x ∈ Rn\{0}, x'Ax ≥ 0

• negative semi-definite if, for any x ∈ Rn\{0}, x'Ax ≤ 0

Why does this relate to optimization theory?
When maximizing a multi-dimensional function, a local (second-order Taylor) approximation of it around a point x* should look like a hill if x* is a local maximum. If, locally around x*, the function looks like a saddle or a bowl, the value of the function could be increased by moving away from x*. So these tests are motivated by the connections between the geometry of maximization and quadratic forms. In short, these definitions generalize the idea of a positive or negative number in R to a "positive or negative matrix".

Definition 8.6.2 The leading principal minors of a matrix A are the matrices

A1 = [a11]

A2 = [ a11  a12 ]
     [ a21  a22 ]

...

An = [ a11  a12  ...  a1n ]
     [ a21  a22  ...  a2n ]
     [ ...                ]
     [ an1  an2  ...  ann ]

Here are the main tools for characterizing definite matrices:

Theorem 8.6.3 Let A be an n × n symmetric matrix. Then A is negative definite if

• All the eigenvalues of A are negative

• The leading principal minors of A alternate in sign, starting with det(A1) < 0

• The matrix has a negative dominant diagonal: For each row r, arr < 0 and

|arr| > Σ_{i≠r} |ari|

Example Consider

A = [ −2   1 ]
    [  1  −2 ]

The leading principal minors are −2 < 0 and 4 − 1 = 3 > 0, so this matrix is negative definite. The eigenvalues are found by solving det(A − λI) = (−2 − λ)^2 − 1 = 0, so λ* = −2 ± 1 < 0, and the eigenvalues of A are all negative. Since 2 > 1, it has the dominant diagonal property.

Example Consider

A = [ −1  −2 ]
    [ −2  −1 ]

The leading principal minors are −1 < 0 and 1 − 4 = −3 < 0, so this matrix is not negative definite. The eigenvalues are found by solving det(A − λI) = λ^2 + 2λ − 3 = 0, so λ* = −1 ± 2. It is not negative definite, since it has an eigenvalue 1 > 0. The principal minors test is obviously not satisfied. Since A is neither negative nor positive (semi-)definite, it is actually indefinite.

These tests seem "magical", so it's important to develop some intuition about what they mean.
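A minimal sketch of the leading-principal-minor test for the 2 × 2 case, applied to the two example matrices above (the helper names are not standard):

```python
def det2(A):
    # determinant of a 2x2 matrix given as nested tuples
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def negative_definite_2x2(A):
    # leading principal minors: a11 < 0 and det(A) > 0
    return A[0][0] < 0 and det2(A) > 0

A = ((-2, 1), (1, -2))
B = ((-1, -2), (-2, -1))
assert negative_definite_2x2(A)       # minors: -2 < 0, 3 > 0
assert not negative_definite_2x2(B)   # minors: -1 < 0, -3 < 0

# spot-check with the quadratic form at x = (1, -1)
x = (1, -1)
qB = sum(B[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
assert qB == 2                        # positive, so B is not negative definite
```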
• If H is a real symmetric matrix, it can be factored into an orthogonal matrix P and a diagonal matrix D, where H = PDP' and the eigenvalues (λ1, λ2, ..., λn) of H are on the diagonal of D. If we do this,

x'Hx = x'PDP'x = (x'P)D(P'x) = (P'x)'D(P'x)

Note that P'x is a vector; call it y. Then we get

y'Dy = Σ_i yi^2 λi

Since the yi^2 terms are weakly positive scalars (and y ≠ 0 whenever x ≠ 0), we can conclude that x'Hx < 0 for all x ≠ 0 if and only if H has all strictly negative eigenvalues.

• How can the principal minors test be motivated? We can get a feel for why it must be true from the following observations: (i) If x'Hx < 0 for all x ≠ 0, then if we use x̃ = (x1, x2, ..., xk, 0, ..., 0), with not all components zero, then x̃'Hx̃ < 0 as well. But because of the zeros, we're really just considering a k × k sub-matrix of H, Hk. (ii) The determinant of a matrix is equal to the product of its eigenvalues. So if we take H1, the determinant is h11, and that should be negative; det(H2) = λ1λ2, which should be positive if H is negative definite (since the determinant is the product of the eigenvalues, and we know that the eigenvalues of a negative definite matrix are negative); det(H3) = λ1λ2λ3 < 0; and so on. Combining these two facts tells us that if H is negative definite, then we should observe this alternating sign pattern in the determinants of the leading principal minors. (Note that this only shows that if H is negative definite, then the principal minors test holds, not the converse.)

• The connection between the dominant diagonal condition and negative definiteness is easy to suggest, though proving sufficiency is much harder. Take the vector 1 = (1, 1, ..., 1), and observe the quadratic form

1'H1 = Σ_i Σ_j hij

Now, if for each row i we had Σ_j hij < 0, then 1'H1 < 0, since we then just sum over all i.
But this is equivalent to

hii + Σ_{j≠i} hij < 0

Re-arranging yields

hii < −Σ_{j≠i} hij

Multiplying by −1 yields

−hii > Σ_{j≠i} hij

and taking absolute values yields

|hii| > Σ_{j≠i} hij

which is implied by the dominant diagonal condition, since Σ_{j≠i} hij ≤ Σ_{j≠i} |hij|.

There are other tests and conditions available, but these are often the most useful in practice. Note that it is harder to identify negative semi-definiteness, because the only test that still applies is that all eigenvalues be weakly negative. In particular, if the determinants of the leading principal minors alternate in sign but are sometimes zero, the quadratic form might be a saddle, similar to f(x) = x^3 being indefinite at x = 0, with f''(x) = 6x.

Exercises

1. Show that RN is both open and closed. Show that the intersection of any collection of closed sets is closed, and that the union of a finite number of closed sets is closed. Show that the union of any collection of open sets is open, and the intersection of a finite number of open sets is open. Show that a countable intersection of open sets can be closed.

2. Determine if the following matrices are positive or negative (semi-)definite:

[ −1   3 ]    [  0  −1 ]    [ −1   1 ]    [  1  −2   1 ]
[  3  −10 ]   [ −1   5 ]    [  1  −1 ]    [ −2   4  −2 ]
                                          [  1  −2   1 ]

3. Find the gradients and Hessians of the following functions, evaluate the Hessian at the point given, and determine whether it is positive or negative (semi-)definite. (i) f(x, y) = x^2 + x at (1, 1). (ii) f(x, y) = √(xy) at (3, 2), and at any (x, y). (iii) f(x, y) = (xy)^2 at (7, 11) and at any (x, y). (iv) f(x, y, z) = √(xyz) at (2, 1, 3).

4. Give an example of a matrix that has all negative entries on the diagonal but is not negative definite. Give an example of a matrix that has all negative entries but is not negative definite.

5. Prove that a function is continuous iff the inverse image of an open set is open, and the inverse image of a closed set is closed. Prove that if a function is continuous, then the image of a bounded set is bounded. Conclude that a continuous function maps compact sets to compact sets.
Use this to provide a second proof of the extreme value theorem.
Proofs
The proofs here are mostly about compactness and the Heine-Borel theorem.
Definition 8.6.4 A subset $K$ of Euclidean space is compact if each sequence in $K$ has a subsequence that converges to a point in $K$.
So a set is compact if the Bolzano-Weierstrass theorem holds for every sequence constructed only from points in the set. For an example of a non-compact set, consider $(a,b)$. The sequence $x_n = b - 1/n$ is constructed entirely from points in $(a,b)$, but its limit, $b$, is not in the set, so the set does not have the Bolzano-Weierstrass property. Recall that in our proof of the Weierstrass theorem, we ensured a maximum existed by studying convergent subsequences. This will again be the key to ensuring existence in the $N$-dimensional case. The next theorem characterizes "closedness" in terms of sequences.
Theorem 8.6.5 Suppose $x_n$ is a convergent sequence such that $x_n$ is in $S$ for all $n$. Then $x_n$ converges to a point in the closure of $S$, $\bar{S}$.
Proof (Sketch a picture as you go along.) Let $\bar{x}$ denote the limit of $x_n$, and suppose, by way of contradiction, that $\bar{x}$ is not in $\bar{S}$. Then we can draw an open ball around $\bar{x}$ of some radius $\varepsilon > 0$ such that no points of $\bar{S}$ are in $B_\varepsilon(\bar{x})$. But since $x_n \to \bar{x}$, we also know that for this $\varepsilon$ there is an $H$ such that for all $n \geq H$, $\|x_n - \bar{x}\| < \varepsilon$, so that infinitely many points of $S$ are in $B_\varepsilon(\bar{x})$. This is a contradiction.
Since a closed set satisfies $\bar{S} = S$, this implies that every sequence in a closed set, if it converges, converges to a member of the set. However, there are sets like $[0, \infty)$ that are closed (since any sequence in the set that converges, converges to a point of the set) but allow plenty of non-convergent sequences, like $x_n = n$. For example, if you were asked to maximize $f(x) = x$ on the set $[b, \infty)$, no maximum exists: $f([b, \infty)) = [b, \infty)$, and the range is unbounded. So for maximization theory, it appears that closedness and boundedness of the image of a set under a function are key properties.
In fact, they are equivalent to compactness in $\mathbb{R}^N$.
Theorem 8.6.6 (Heine-Borel) In Euclidean space, a set is compact iff it is closed and bounded.
Proof Consider any sequence $x_n$ in a closed and bounded subset $K$ of $\mathbb{R}^N$. We will show it has a convergent subsequence with limit in $K$, and consequently that $K$ is compact. Since $K$ is bounded, there is an $N$-dimensional "hypercube" $H = \times_{i=1}^N [a_i, b_i]$ with $K \subset H$. Without loss of generality we can enlarge the hypercube to $[a, b]^N$, where $a = \min_i a_i$ and $b = \max_i b_i$, so that every side has the same length. Now cut $H$ into $N^N$ equally sized subcubes, each of volume $((b-a)/N)^N$. One of these cubes contains an infinite number of terms of the sequence. Select a term from that cube, call it $x_{n_1}$, and throw the rest of the cubes away. Now repeat this procedure, cutting the remaining cube $N$ ways along each dimension; one of these subcubes contains an infinite number of terms of the sequence; select a later term from that subcube, call it $x_{n_k}$, and throw the rest away. The volume of the subcubes at step $k$ is
\[ \left(\frac{b-a}{N^k}\right)^N, \]
which is clearly converging to zero as $k \to \infty$. The side length of the step-$k$ subcube then bounds the distance from each term to all later terms of the subsequence. Therefore, the subsequence constructed from this procedure is a Cauchy sequence, so it has a limit, $\bar{x}$: the subsequence $x_{n_k}$ converges.
Footnote: The most general definition of compactness is, "A subset $K$ of Euclidean space is compact if any collection of open sets $\{O_i\}_{i=1}^\infty$ for which $K \subset \cup_i O_i$ has a finite subcollection $\{O_{i_k}\}_{k=1}^K$ so that $K \subset \cup_{k=1}^K O_{i_k}$," which is converted into words by saying, "$K$ is compact if every open cover of $K$ has a finite subcover," or: if an infinite collection of open sets covers $K$, we can find a finite number of them that do the same job. This is actually easier to work with than the sequential compactness we'll use, but the two are equivalent in $\mathbb{R}^N$.
Since $K$ is closed, it contains all of its limit points by Theorem 8.6.5 above, so $\bar{x}$ is an element of $K$. Therefore $K$ is compact.
The Bolzano-Weierstrass theorem was a statement about bounded sequences: every bounded sequence has a convergent subsequence. The Heine-Borel theorem is a statement about closed and bounded sets: a set is compact iff it is closed and bounded. The bridge between the two is that an infinitely long sequence in a bounded set must be near some point $\bar{x}$ infinitely often. If the set contains all of its points of closure, this point $\bar{x}$ is actually a member of the set $K$, so compactness and boundedness are closely related. However, boundedness alone is not sufficient, since a sequence might converge to a point outside the set, as in the case of $(a,b)$ with $x_n = b - 1/n$. To ensure the limit is in the set, we add the closedness condition, and we get a useful characterization of compactness.
Chapter 9 Unconstrained Optimization
We've already seen some examples of unconstrained optimization problems. For example, a firm that faces a price $p$ for its good, hires capital $K$ and labor $L$ at rates $r$ and $w$ per unit, and produces output according to the production technology $F(K,L)$ faces an unconstrained maximization problem
\[ \max_{K,L} \; pF(K,L) - rK - wL. \]
In this and similar problems, we usually allow the choice set to be $\mathbb{R}^N$, and allow the firm to pick any $K$ and $L$ that maximize profits. As long as $K$ and $L$ are both positive at the solution, this is a fine approach, and we don't have to move to the more complicated world of constrained optimization. Similarly, many constrained problems have the feature that the constraint can be re-written in terms of one of the controls and substituted into the objective.
For example, in the consumer's problem
\[ \max_x u(x) \quad \text{subject to} \quad w = p \cdot x, \]
the constraint can be re-written as
\[ x_N = \frac{w - \sum_{i=1}^{N-1} p_i x_i}{p_N} \]
and substituted into the objective to yield
\[ \max_{x_1, \ldots, x_{N-1}} u\left(x_1, \ldots, x_{N-1}, \frac{w - \sum_{i=1}^{N-1} p_i x_i}{p_N}\right), \]
which is an unconstrained maximization problem in $x_1, \ldots, x_{N-1}$. Even though we have better methods for solving the original problem from a theoretical perspective, solving an unconstrained problem numerically is generally easier than solving a constrained one.
Definition 9.0.7 Let $f : \mathbb{R}^N \to \mathbb{R}$. Then the unconstrained maximization problem is
\[ \max_{x \in \mathbb{R}^N} f(x). \]
A global maximum of $f$ is a point $x^*$ so that $f(x^*) \geq f(x')$ for all other $x' \in \mathbb{R}^N$. A point $x^*$ is a local maximum of $f$ if there is an open set $S = \{y : \|x^* - y\| < \varepsilon\}$ for which $f(x^*) \geq f(y)$ for all $y \in S$.
How do we find solutions to this problem?
9.1 First-Order Necessary Conditions
Our first step to solving unconstrained maximization problems is to build up a candidate list using FONCs, just like in the one-dimensional case.
Definition 9.1.1 If $\nabla f(x^*) = 0$, then $x^*$ is a critical point of $f(x)$.
Example Recall the firm with profit function
\[ \pi(K,L) = pF(K,L) - rK - wL. \]
Then if we make a small change in $K$, the change in profits is
\[ \frac{\partial \pi(K,L)}{\partial K} = pF_K(K,L) - r, \]
and if we make a small change in $L$, the change in profits is
\[ \frac{\partial \pi(K,L)}{\partial L} = pF_L(K,L) - w. \]
If there are no profitable adjustments away from a given point $(K^*, L^*)$, then it is a local maximum, so the expressions above must both equal zero. But then $\nabla \pi(K^*, L^*) = 0$ implies that $(K^*, L^*)$ is a critical point of $\pi$.
The above argument works for any function $f(x)$:
Theorem 9.1.2 (First-Order Necessary Conditions) If $x^*$ is a local maximum of $f$ and $f$ is differentiable at $x^*$, then $x^*$ is a critical point of $f$.
Proof If $x^*$ is a local maximum and $f$ is differentiable at $x^*$, there cannot be any improvement in any direction $v \neq 0$.
The directional derivative is
\[ \nabla f(x^*) \cdot v = \sum_i \frac{\partial f(x^*)}{\partial x_i} v_i. \]
So we can consider the one-dimensional directional derivatives in the directions $y = (0, \ldots, 0, v_i, 0, \ldots, 0)$, where $y$ is zero except for the $i$-th slot, which takes the value $v_i \neq 0$:
\[ \nabla f(x^*) \cdot y = \frac{\partial f(x^*)}{\partial x_i} v_i = 0. \]
So each partial derivative must be zero along each dimension individually, implying that a local maximum is a critical point.
As in the one-dimensional case, this gives us a way of building a candidate list of potential maximizers: critical points and any points of non-differentiability.
Example Consider a quadratic objective function,
\[ f(x_1, x_2) = a_1 x_1 + a_2 x_2 - \frac{b_1}{2}x_1^2 - \frac{b_2}{2}x_2^2 + c x_1 x_2. \]
The FONCs are
\[ a_1 - b_1 x_1^* + c x_2^* = 0 \]
\[ a_2 - b_2 x_2^* + c x_1^* = 0. \]
Any solution to these equations is a critical point. If we solve the system by hand, the second equation is equivalent to
\[ x_2^* = \frac{a_2 + c x_1^*}{b_2}, \]
and substituting it into the first gives
\[ a_1 - b_1 x_1^* + c\,\frac{a_2 + c x_1^*}{b_2} = 0, \]
or
\[ x_1^* = \frac{a_1 b_2 + a_2 c}{b_1 b_2 - c^2}, \qquad x_2^* = \frac{a_2 b_1 + a_1 c}{b_1 b_2 - c^2}. \]
If, instead, we convert this to a matrix equation,
\[ \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} + \begin{pmatrix} -b_1 & c \\ c & -b_2 \end{pmatrix}\begin{pmatrix} x_1^* \\ x_2^* \end{pmatrix} = 0. \]
Re-arranging yields
\[ -\begin{pmatrix} -b_1 & c \\ c & -b_2 \end{pmatrix}\begin{pmatrix} x_1^* \\ x_2^* \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}, \]
which looks exactly like $-Bx^* = a$. From linear algebra, we know that there is a solution as long as $B$ is non-singular (it has full rank $\leftrightarrow$ all its eigenvalues are non-zero $\leftrightarrow$ it is invertible $\leftrightarrow$ it has non-zero determinant). Then $x^* = (-B)^{-1}a$ and we have a solution. The determinant of $B$ is non-zero iff
\[ b_1 b_2 - c^2 \neq 0. \]
So if the above condition is satisfied, there is a unique critical point. Notice, however, that we could have done the whole analysis as
\[ f(x) = a \cdot x + \frac{x'Bx}{2}, \]
yielding FONCs
\[ a + Bx^* = 0, \]
or $x^* = -B^{-1}a$. Moreover, this equation can be solved in a few lines of Matlab code, making it a useful starting point for playing with non-linear optimization problems. However, we still don't know whether $x^*$ is a maximum, minimum, or saddle point.
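The remark about solving the FONC $a + Bx^* = 0$ in a few lines of code can be illustrated directly. A small Python sketch (the parameter values are my own choice):

```python
# f(x1, x2) = a1 x1 + a2 x2 - (b1/2) x1^2 - (b2/2) x2^2 + c x1 x2
# FONCs: a + B x* = 0 with B = [[-b1, c], [c, -b2]], so x* = -B^{-1} a
a1, a2, b1, b2, c = 1.0, 2.0, 3.0, 4.0, 1.0

detB = b1 * b2 - c ** 2        # determinant of B up to sign; must be non-zero
assert detB != 0               # B non-singular: a unique critical point exists

# the closed-form solution derived in the text
x1 = (a1 * b2 + a2 * c) / detB
x2 = (a2 * b1 + a1 * c) / detB

# the gradient of f should vanish at (x1, x2)
g1 = a1 - b1 * x1 + c * x2
g2 = a2 - b2 * x2 + c * x1
assert abs(g1) < 1e-12 and abs(g2) < 1e-12
```

The same check scales to any dimension by replacing the closed forms with a linear solve of $Bx^* = -a$.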
9.2 Second-Order Sufficient Conditions
With FONCs, we can put together a candidate list for any unconstrained maximization problem: critical points and any points of non-differentiability. However, we still don't know whether a given critical point is a maximum, minimum, or saddle/inflection point.
Example Consider the function
\[ f(x,y) = -x^2 y. \]
The FONCs for this function are
\[ f_x(x,y) = -2xy = 0 \]
\[ f_y(x,y) = -x^2 = 0. \]
The critical points are the points $(0, y)$ for any $y$; consider $(0,0)$ in particular. If we evaluate the function at $(1,-1)$, we get $f(1,-1) = -(1)^2(-1) = 1 > 0 = f(0,0)$. What is going on? Well, if we were considering $-x^2$ alone, it would have a global maximum at zero. But the $y$ term has no global maximum. When you multiply these functions together, even though one is well-behaved, the misbehavior of the other leads to non-existence of a maximizer in $\mathbb{R}^2$ (I can make $f(x,y)$ arbitrarily large by making $y$ arbitrarily negative and setting $x = 1$).
Still worse, as you showed in Exercise 8 of Chapter 4, it is not enough that a critical point $(x^*, y^*)$ satisfy $f_{xx}(x^*, y^*) < 0$ and $f_{yy}(x^*, y^*) < 0$ to be a local maximum.
Example Consider the function
\[ f(x,y) = -\frac{1}{2}x^2 - \frac{1}{2}y^2 + bxy, \]
where $b > 0$. The FONCs are
\[ f_x(x,y) = -x + by = 0 \]
\[ f_y(x,y) = -y + bx = 0. \]
For $b \neq 1$, the unique critical point of the system is $(0,0)$. However, if we set $x = y = z$, the function becomes
\[ f(z,z) = -z^2 + bz^2, \]
and if $b > 1$, we can make the function arbitrarily large. The function is perfectly well behaved in each direction: if you plot a cross-section of the function for all $x$, setting $y$ to zero, it achieves a maximum in $x$ at zero, and if you plot a cross-section for all $y$, setting $x$ to zero, it achieves a maximum in $y$ at zero. Nevertheless, $(0,0)$ is not a global maximum if $b$ is too large.
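The cross-section behavior in this example is easy to verify numerically. A short Python sketch (the sample points are arbitrary):

```python
def f(x, y, b):
    # f(x, y) = -x^2/2 - y^2/2 + b*x*y from the example above
    return -0.5 * x ** 2 - 0.5 * y ** 2 + b * x * y

b = 2.0
# along each axis, (0,0) looks like a maximum
assert all(f(x, 0.0, b) < f(0.0, 0.0, b) for x in (-1.0, -0.5, 0.5, 1.0))
assert all(f(0.0, y, b) < f(0.0, 0.0, b) for y in (-1.0, -0.5, 0.5, 1.0))
# but along the diagonal x = y = z, f = (b - 1) z^2 grows without bound when b > 1
vals = [f(z, z, b) for z in (1.0, 10.0, 100.0)]
assert vals[0] < vals[1] < vals[2]
# with b = 0.5 < 1 the diagonal is also well-behaved, and (0,0) survives
assert all(f(z, z, 0.5) < f(0.0, 0.0, 0.5) for z in (-2.0, -0.5, 0.5, 2.0))
```

The point of the exercise is exactly the one made in the text: checking each coordinate direction in isolation misses what happens along off-axis directions.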
Using a Taylor series, however, we can write any twice differentiable multi-dimensional function as
\[ f(x) = f(x_0) + \nabla f(x_0) \cdot (x - x_0) + (x - x_0)'\frac{H(x_0)}{2}(x - x_0) + o(\|x - x_0\|^2). \]
If $x^*$ is a critical point, we know that $\nabla f(x^*) = 0$, so
\[ f(x) = f(x^*) + (x - x^*)'\frac{H(x^*)}{2}(x - x^*) + o(\|x - x^*\|^2), \]
and re-arranging yields
\[ f(x^*) - f(x) = -(x - x^*)'\frac{H(x^*)}{2}(x - x^*) - o(\|x - x^*\|^2). \]
Letting $h = x - x^*$ be arbitrarily close to zero, the quadratic term dominates the remainder. So if
\[ y'H(x^*)y < 0 \]
for every vector $y \neq 0$, then $f(x^*) - f(x) > 0$ for all $x$ sufficiently close to $x^*$. This is the definition of a negative definite matrix, giving us a workable set of SOSCs for multi-dimensional maximization:
Theorem 9.2.1 (Second-Order Sufficient Conditions) If $x^*$ is a critical point of $f(x)$ and $H(x^*)$ is negative definite, then $x^*$ is a local maximum of $f(x)$.
For example, the Hessian of
\[ f(x,y) = -\frac{1}{2}x^2 - \frac{1}{2}y^2 + bxy \]
is
\[ \begin{pmatrix} -1 & b \\ b & -1 \end{pmatrix}. \]
The determinant of the Hessian is $1 - b^2$, which is positive if $b < 1$. So if $b$ is sufficiently small, $f(x,y)$ has a local maximum at $(0,0)$.
On the other hand, if we start with the assumption that $x^*$ is a local maximum of $f(x)$, and not just a critical point, then our Taylor series approximation gives
\[ f(x^*) - f(x) = -(x - x^*)'\frac{H(x^*)}{2}(x - x^*) - o(\|x - x^*\|^2) \geq 0 \]
for $x$ sufficiently close to $x^*$, implying that $y'H(x^*)y \leq 0$ for every $y$. This is negative semi-definiteness. To summarize:
Theorem 9.2.2 (Second-Order Conditions) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a twice differentiable function, and let $x^*$ be a point in $\mathbb{R}^n$.
• If $x^*$ is a critical point of $f(x)$ and $H(x^*)$ is negative definite, then $x^*$ is a local maximum of $f$.
• If $x^*$ is a critical point of $f(x)$ and $H(x^*)$ is positive definite, then $x^*$ is a local minimum of $f$.
• If $f$ has a local maximum at $x^*$, then $H(x^*)$ is negative semi-definite.
• If $f$ has a local minimum at $x^*$, then $H(x^*)$ is positive semi-definite.
The first two points are useful for checking whether or not a particular point is a local maximum or minimum. The second two are useful when using the implicit function theorem. Here are some examples:
Example Suppose a price-taking firm gets a price $p$ for its product, which it produces using capital $K$ and labor $L$ and technology $F(K,L) = q$. Capital costs $r$ per unit and labor costs $w$ per unit. This gives a profit function of
\[ \pi(K,L) = pF(K,L) - rK - wL. \]
The FONCs are
\[ pF_K(K,L) - r = 0 \]
\[ pF_L(K,L) - w = 0, \]
and our SOSCs are that
\[ \begin{pmatrix} F_{KK}(K^*,L^*) & F_{KL}(K^*,L^*) \\ F_{LK}(K^*,L^*) & F_{LL}(K^*,L^*) \end{pmatrix} \]
be a negative definite matrix. This implies
\[ F_{KK}(K^*,L^*) < 0, \qquad F_{LL}(K^*,L^*) < 0, \]
and
\[ F_{KK}(K^*,L^*)F_{LL}(K^*,L^*) - F_{KL}(K^*,L^*)^2 > 0. \]
So $F(K,L)$ must be own-concave in $K$ and in $L$. Likewise, the cross-partial $F_{KL}(K^*,L^*)$ cannot be "too large": in terms of economics, $K$ and $L$ cannot be such strong substitutes or complements that switching from one to the other has a larger impact than using more of each.
A simple example might be
\[ F(K,L) = a_1 K - \frac{a_2}{2}K^2 + b_1 L - \frac{b_2}{2}L^2 + cKL. \]
Our Hessian would then be
\[ \begin{pmatrix} -a_2 & c \\ c & -b_2 \end{pmatrix}, \]
with determinant $a_2 b_2 - c^2$, so that $(K^*, L^*)$ is a local maximum if $a_2 b_2 - c^2 > 0$, i.e. $\sqrt{a_2 b_2} > |c|$.
Example A slightly different firm problem: a firm hires capital $K$ at rental rate $r$ and labor $L$ at wage rate $w$. The firm's output from a given mix of capital and labor is $F(K,L) = \phi K^\alpha L^\beta$. The price of the firm's good is $p$.
This is an unconstrained maximization problem in $(K,L)$, so we can solve
\[ \max_{K,L} \; p\phi K^\alpha L^\beta - rK - wL. \]
This gives first-order conditions
\[ r = p\phi\alpha K^{\alpha-1}L^\beta \]
\[ w = p\phi\beta K^\alpha L^{\beta-1}. \]
Dividing these two equations yields
\[ \frac{r}{w} = \frac{\alpha L}{\beta K}. \]
Solving for $L$ in terms of $K$ yields
\[ L = \frac{r\beta}{w\alpha}K. \]
Substituting back into the first-order condition for $K$ gives
\[ r = p\phi\alpha K^{\alpha-1}\left(\frac{r\beta}{w\alpha}\right)^\beta K^\beta, \]
so that
\[ K^* = \left(\frac{p\phi\,\alpha^{1-\beta}\beta^\beta}{w^\beta r^{1-\beta}}\right)^{\frac{1}{1-\alpha-\beta}}. \]
Hypothetically, we could now differentiate $K^*$ with respect to $\alpha$, $\beta$, $r$, $\phi$, or $w$, to see how the firm's choice of capital varies with economic conditions. But that looks inconvenient, especially for $\alpha$ or $\beta$, which appear everywhere and in exponents.
What are the SOSCs? Dropping the positive factor $p\phi$, the Hessian is
\[ \begin{pmatrix} \alpha(\alpha-1)K^{\alpha-2}L^\beta & \alpha\beta K^{\alpha-1}L^{\beta-1} \\ \alpha\beta K^{\alpha-1}L^{\beta-1} & \beta(\beta-1)K^\alpha L^{\beta-2} \end{pmatrix}. \]
The upper and lower corners are negative if $0 < \alpha < 1$ and $0 < \beta < 1$. The determinant is
\[ \det H = \alpha(\alpha-1)\beta(\beta-1)K^{2\alpha-2}L^{2\beta-2} - \alpha^2\beta^2 K^{2\alpha-2}L^{2\beta-2}, \]
or
\[ \det H = \{(\alpha-1)(\beta-1) - \alpha\beta\}\,\alpha\beta K^{2\alpha-2}L^{2\beta-2}, \]
which is positive if $\alpha\beta - \alpha - \beta + 1 - \alpha\beta > 0$, or $1 > \alpha + \beta$. So as long as $0 < \alpha < 1$, $0 < \beta < 1$, and $1 > \alpha + \beta$, the Hessian is negative definite, and the critical point $(K^*, L^*)$ is a local maximum. Since it is the only point on the candidate list, it is a global maximum.
Example Suppose we have a consumer with utility function $u(q_1, q_2, m) = v(q_1, q_2) + m$ over two goods and money, with wealth constraint $w = p_1 q_1 + p_2 q_2 + m$. Substituting the constraint into the objective, we get
\[ \max_{q_1, q_2} \; v(q_1, q_2) + w - p_1 q_1 - p_2 q_2. \]
The FONCs are
\[ v_1(q_1^*, q_2^*) - p_1 = 0 \]
\[ v_2(q_1^*, q_2^*) - p_2 = 0, \]
and the SOSCs are that
\[ \begin{pmatrix} v_{11}(q_1^*, q_2^*) & v_{12}(q_1^*, q_2^*) \\ v_{21}(q_1^*, q_2^*) & v_{22}(q_1^*, q_2^*) \end{pmatrix} \]
be negative definite. Can we figure out how purchases of $q_1^*$ respond to a change in $p_2$? Well, the two functions $q_1^*(p_2)$ and $q_2^*(p_2)$ are implicitly determined by the system of FONCs.
If we totally differentiate with respect to $p_2$, we get
\[ v_{11}\frac{\partial q_1}{\partial p_2} + v_{12}\frac{\partial q_2}{\partial p_2} = 0 \]
\[ v_{21}\frac{\partial q_1}{\partial p_2} + v_{22}\frac{\partial q_2}{\partial p_2} - 1 = 0. \]
Solving the first equation for $\partial q_2/\partial p_2$, we get
\[ \frac{\partial q_2}{\partial p_2} = -\frac{v_{11}}{v_{12}}\frac{\partial q_1}{\partial p_2}, \]
and substituting into the second equation gives
\[ v_{21}\frac{\partial q_1}{\partial p_2} - v_{22}\frac{v_{11}}{v_{12}}\frac{\partial q_1}{\partial p_2} - 1 = 0, \]
and solving for $\partial q_1/\partial p_2$ yields (using the symmetry $v_{12} = v_{21}$)
\[ \frac{\partial q_1}{\partial p_2} = \frac{-v_{21}}{v_{11}v_{22} - v_{12}v_{21}}. \]
Notice that the denominator is the determinant of $H(q^*)$, so it must be positive. Consequently, the sign is determined by the numerator:
\[ \mathrm{sign}\left(\frac{\partial q_1}{\partial p_2}\right) = \mathrm{sign}(-v_{21}). \]
So $q_1$ and $q_2$ are gross complements when $v_{21} > 0$, and gross substitutes when $v_{21} < 0$.
So it is pretty straightforward to apply the implicit function theorem to this two-dimensional problem. But when the number of controls becomes large, it is less obvious that this will work. We will need to develop more subtle tools for working through higher-dimensional comparative statics problems.
9.3 Comparative Statics
Perhaps if we take a broader perspective, we can see some of the structure behind the comparative statics exercise we did for the consumer above. The unconstrained maximization problem
\[ \max_x f(x, c) \]
has FONCs
\[ \nabla_x f(x^*(c), c) = 0, \]
where $c$ is a single exogenous parameter (the extension to a vector of parameters is easy, and in any math econ text). If we differentiate the FONCs with respect to $c$, we get
\[ \nabla_{xx} f(x^*(c), c) \cdot \nabla_c x^*(c) + \nabla_x f_c(x^*(c), c) = 0. \]
Since $\nabla_{xx} f(x^*(c), c) = H(x^*(c), c)$, we can write
\[ H(x^*(c), c)\,\nabla_c x^*(c) = -\nabla_x f_c(x^*(c), c). \]
Since $H(x^*(c), c)$ is negative definite, all its eigenvalues are negative, so it is invertible, and
\[ \nabla_c x^*(c) = -H(x^*(c), c)^{-1}\,\nabla_x f_c(x^*(c), c). \]
If $x$ and $c$ are both one-dimensional, this is just
\[ \frac{\partial x^*(c)}{\partial c} = -\frac{f_{xc}(x^*(c), c)}{f_{xx}(x^*(c), c)}. \]
So the $H(x^*(c), c)^{-1}$ term is just the generalization of $1/f_{xx}(x^*(c), c)$, and $\nabla_x f_c(x^*(c), c)$ is just the generalization of $f_{xc}(x^*(c), c)$.
Theorem 9.3.1 (Implicit Function Theorem) Suppose that $\nabla_x f(x^*(c), c) = 0$ and that $\nabla_x f(x, c)$ is differentiable in $x$ and $c$.
Then, provided $H(x^*(c), c)$ is invertible, there is a locally continuous, implicit solution $x^*(c)$ with derivative
\[ \nabla_c x^*(c) = -H(x^*(c), c)^{-1}\,\nabla_x f_c(x^*(c), c). \]
Example Recall the profit-maximizing firm with general production function $F(K,L)$, and let's see how a change in $r$ affects $K^*$ and $L^*$. The system of FONCs is
\[ pF_K(K^*, L^*) - r = 0 \]
\[ pF_L(K^*, L^*) - w = 0. \]
Totally differentiating with respect to $r$ yields
\[ pF_{KK}\frac{\partial K^*}{\partial r} + pF_{KL}\frac{\partial L^*}{\partial r} - 1 = 0 \]
\[ pF_{LK}\frac{\partial K^*}{\partial r} + pF_{LL}\frac{\partial L^*}{\partial r} = 0. \]
Doing the system "by hand" yields
\[ \frac{\partial L^*}{\partial r} = \frac{-pF_{LK}}{\det(H)}, \qquad \det(H) = p^2(F_{KK}F_{LL} - F_{KL}F_{LK}). \]
Note that the denominator is the determinant of the Hessian. You can always grind the solution out by hand, and I like to do it sometimes to check my answer. But there is another tool, Cramer's rule, for solving equations like this. Writing the system of equations in matrix notation gives
\[ \underbrace{\begin{pmatrix} pF_{KK} & pF_{KL} \\ pF_{LK} & pF_{LL} \end{pmatrix}}_{H(x^*(c))} \underbrace{\begin{pmatrix} \partial K^*/\partial r \\ \partial L^*/\partial r \end{pmatrix}}_{\nabla_c x^*(c)} = \underbrace{\begin{pmatrix} 1 \\ 0 \end{pmatrix}}_{-\nabla_x f_c(x^*(c), c)}. \]
So we have a matrix equation, $Ax = b$, and we want to solve for $x$, the vector of comparative statics.
Theorem 9.3.2 (Cramer's Rule) Consider the matrix equation
\[ Ax = (A_{\cdot 1}, A_{\cdot 2}, \ldots, A_{\cdot N})\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = b, \]
where $A_{\cdot k}$ is the $k$-th column of $A$. Then
\[ x_i = \frac{\det([A_{\cdot 1}, \ldots, A_{\cdot i-1}, b, A_{\cdot i+1}, \ldots, A_{\cdot N}])}{\det(A)}. \]
So to use Cramer's rule to solve for the $i$-th component of $x$, replace the $i$-th column of $A$ with $b$, compute that determinant, then divide by the determinant of $A$. For the firm example, we get
\[ \frac{\partial L^*}{\partial r} = \frac{\det\begin{pmatrix} pF_{KK} & 1 \\ pF_{LK} & 0 \end{pmatrix}}{\det(H)} = \frac{-pF_{LK}}{\det(H)}. \]
Since $\det(H) > 0$ at a local maximum, the sign of $\partial L^*/\partial r$ is the opposite of the sign of $F_{KL}$: if capital and labor are complements, a higher rental rate reduces labor demand as well. In fact, this approach always works, because
\[ \underbrace{H(x^*(c), c)}_{N \times N \text{ matrix}}\underbrace{\nabla_c x^*(c)}_{N \times 1 \text{ vector}} = \underbrace{-\nabla_x f_c(x^*(c), c)}_{N \times 1 \text{ vector}} \]
can always be written as a matrix equation $Hx = b$. And since $H$ is negative definite at a local maximum, its eigenvalues are all strictly negative, so it is invertible.
9.4 The Envelope Theorem
Similarly, we can derive an envelope theorem for unconstrained maximization problems that shows how an agent's payoff varies with an exogenous parameter.
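Cramer's rule from Theorem 9.3.2 itself takes only a few lines of code. A pure-Python sketch on a small symmetric system standing in for the Hessian (the matrix and right-hand side are my own choice):

```python
def det(M):
    # determinant by Laplace expansion along the first row
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j]
               * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def cramer(A, b):
    # Cramer's rule: x_i = det(A with column i replaced by b) / det(A)
    d = det(A)
    return [det([row[:i] + [b[k]] + row[i + 1:] for k, row in enumerate(A)]) / d
            for i in range(len(A))]

# a small symmetric system H x = b, as in the comparative statics equation
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = cramer(A, b)
# verify that A x = b
for i in range(len(A)):
    assert abs(sum(A[i][j] * x[j] for j in range(len(A))) - b[i]) < 1e-9
```

For hand-sized systems like the firm example, this mirrors the pencil-and-paper computation exactly; for large systems a direct linear solver is preferable to expanding determinants.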
The generic unconstrained maximization problem is
\[ \max_x f(x, c) \]
with FONCs
\[ \nabla_x f(x^*(c), c) = 0. \]
If we consider the value function,
\[ V(c) = f(x^*(c), c), \]
this gives the maximized payoff of the agent for each value of the exogenous parameter $c$. Differentiating with respect to $c$ yields
\[ \nabla_c V(c) = \underbrace{\nabla_x f(x^*(c), c)}_{=\,0 \text{ by the FONC}} \cdot\, \nabla_c x^*(c) + \nabla_c f(x^*(c), c). \]
Again, since the FONC must hold at a maximum, we have
\[ \nabla_c V(c) = \nabla_c f(x^*(c), c). \]
So the derivative of the value function with respect to a given parameter is just the partial derivative of the objective with respect to that parameter.
Example Consider a consumer with value function
\[ V(p_1, p_2, w) = v(q_1^*, q_2^*) + w - p_1 q_1^* - p_2 q_2^*. \]
If we differentiate with respect to, say, $p_2$, each maximized value $q_1^*$ and $q_2^*$ must also be differentiated, yielding
\[ \frac{\partial V}{\partial p_2} = v_1\frac{\partial q_1^*}{\partial p_2} + v_2\frac{\partial q_2^*}{\partial p_2} - p_1\frac{\partial q_1^*}{\partial p_2} - p_2\frac{\partial q_2^*}{\partial p_2} - q_2^*, \]
which is a mess. It would appear that we have to go back and compute all the comparative statics to sign this (and even then there would be no guarantee that would work). But re-arranging yields
\[ \frac{\partial V}{\partial p_2} = (v_1 - p_1)\frac{\partial q_1^*}{\partial p_2} + (v_2 - p_2)\frac{\partial q_2^*}{\partial p_2} - q_2^*. \]
Using the FONCs, we get
\[ \frac{\partial V}{\partial p_2} = -q_2^*. \]
This is the logic of the envelope theorem. So we can skip all the intermediate re-arranging steps, and just differentiate directly with respect to parameters once the payoff-maximizing behavior has been substituted in:
\[ \frac{\partial V}{\partial w} = 1, \qquad \frac{\partial V}{\partial p_1} = -q_1^*. \]
Exercises
1. Prove that if $x^*$ is a local maximizer of $f(x)$, then it is a local minimizer of $-f(x)$.
2. Suppose $x^*$ is a global maximizer of $f : \mathbb{R}^n \to \mathbb{R}$. Show that for any monotone increasing transformation $g : \mathbb{R} \to \mathbb{R}$, $x^*$ is a maximizer of $g(f(x))$.
3. Find all critical points of $f(x,y) = (x^2 - 4)^2 + y^2$ and show which are maxima and which are minima.
4. Find all critical points of $f(x,y) = (y - x^2)^2 - x^2$ and show which are maxima and which are minima.
5. Describe the set of maximizers and the set of minimizers of
\[ f(x,y) = \cos\left(\sqrt{x^2 + y^2}\right). \]
6.
Suppose a firm produces two goods, $q_1$ and $q_2$, whose prices are $p_1$ and $p_2$, respectively. The costs of production are $C(q_1, q_2)$. Characterize profit-maximizing output and show how $q_1$ varies with $p_2$. If $C(q_1, q_2) = c_1(q_1) + c_2(q_2) + bq_1q_2$, explain when a critical point is a local maximum of the profit function. How do profits vary with $b$ and $p_1$?
7. Suppose you have a set of dependent variables $\{y_1, y_2, \ldots, y_N\}$ generated by independent variables $\{x_1, x_2, \ldots, x_N\}$, and believe the true model is given by
\[ y_i = \beta'x_i + \varepsilon_i, \]
where $\varepsilon$ is a normally distributed random variable with mean $m$ and variance $\sigma^2$. The sum of squared errors is
\[ \sum_{i=1}^N (y_i - \beta'x_i)^2 = (y - X\beta)'(y - X\beta), \]
where $X$ is the matrix whose $i$-th row is $x_i'$. Check that this is a quadratic programming problem. Compute the gradient and Hessian. Solve for the optimal estimator $\hat{\beta}$. (This is just OLS, right?)
8. A consumer with utility function $u(q_1, q_2, m) = (q_1 - \gamma_1)q_2^\alpha + m$ and budget constraint $w = p_1q_1 + p_2q_2 + m$ is trying to maximize utility. Solve for the optimal bundle $(q_1^*, q_2^*, m^*)$ and show how $q_1^*$ varies with $p_2$, and how $q_2^*$ varies with $p_1$, both using the closed-form solutions and the IFT. How does the value function vary with $\gamma_1$?
9. Suppose you are trying to maximize $f(x_1, x_2, \ldots, x_N)$ subject to the non-linear constraint $g(x_1, \ldots, x_N) = 0$. Use the Implicit Function Theorem to (i) use the constraint to define $x_N(x_1, \ldots, x_{N-1})$ and substitute this into $f$ to get an unconstrained maximization problem in $x_1, \ldots, x_{N-1}$, and (ii) derive FONCs for $x_1, \ldots, x_{N-1}$.
Chapter 10 Equality Constrained Optimization Problems
Optimization problems often come with extra conditions that the solution must satisfy. For example, consumers can't spend more than their budget allows, and firms are constrained by their technology. Some examples are:
• The canonical equality constrained maximization problem comes from consumer theory. There is a consumer choosing between bundles of $x_1$ and $x_2$.
He has a utility function $u(x_1, x_2)$, which is increasing and differentiable in both arguments. However, he only has wealth $w$, and the prices of $x_1$ and $x_2$ per unit are $p_1$ and $p_2$, respectively, so his budget constraint is $w = p_1x_1 + p_2x_2$. Then his maximization problem is
\[ \max_{x_1, x_2} u(x_1, x_2) \quad \text{subject to} \quad w = p_1x_1 + p_2x_2. \]
Here, the objective function is non-linear in $x$, while the constraint is linear in $x$.
• Consider a firm that transforms inputs $z$ into output $q$ through a technology $F(z) = q$. The cost of input $z_k$ is $w_k$. The firm would like to minimize costs subject to producing a certain amount of output, $\bar{q}$. Then its problem is
\[ \min_z w \cdot z \quad \text{subject to} \quad F(z) = \bar{q}. \]
This is a different problem from the consumer's, primarily because the objective is linear in $z$ while the constraint is non-linear in $z$.
These are equality constrained maximization problems because the set of feasible points is described by an equation, $w = p_1x_1 + p_2x_2$ (as opposed to an inequality constraint, like $w \geq p_1x_1 + p_2x_2$). To provide a general theory for all the constrained maximization problems we might encounter, then, we need to write the constraints in a common form. Often, the constraints will have the form $f(x) = c$, where $x$ are the choice variables and $c$ is a constant.
• For maximization problems, move all the terms to the side with the choice variables, and define a new function
\[ g(x) = f(x) - c. \]
Then whenever $g(x) = 0$, the constraints are satisfied.
Suppose that the choice of x is subject to an equality constraint, such that any solution must satisfy g(x) = 0 where g : D → R. Then the equality-constrained maximization problem is max f (x) x subject to g(x) = 0 Simply differentiating f (x) with respect to x is no longer a sensible approach to finding a maximizer, since that ignores the constraints. We need a theory that incorporates constraints into the search for a maximum. 10.1 Two useful but sub-optimal approaches There are two ways of approaching the question of constrained maximization that are instructive, but not necessarily efficient. If you look at proof of Lagrange’s theorem, these ideas show up, however, and they give a lot of intuition about how this kind of maximization problem works. 10.1.1 Using the implicit function theorem Note that the constraint g(x1 , x2 , ..., xN −1 , xN ) = 0 can be used to formulate an implicit function, xN (x1 , x2 , ..., xN −1 ), defined by g(x1 , x2 , ..., xN −1 , xN (x1 , x2 , ..., xN −1 )) = 0 Then the unconstrained problem in x1 , ..., xN −1 can be stated as: max x1 ,...,xN−1 f (x1 , x2 , ..., xN −1 , xN (x1 , x2 , ..., xN −1 )) with first-order necessary conditions for k = 1, ..., N − 1, ∂f (x∗ ) ∂f (x∗ ) ∂xN (x∗ ) + =0 ∂xk ∂xN ∂xk Using the implicit function theorem on the constraint, we get ∂g(x∗ ) ∂g(x∗ ) ∂xN (x∗ ) + =0 ∂xk ∂xN ∂xk Substituting this into the FONC for k, we get ∂f (x∗ ) ∂f (x∗ ) ∂g(x∗ )/∂xk − =0 ∂xk ∂xN ∂g(x∗ )/∂xN 100 Yielding the tangency conditions ∂f (x∗ )/∂xk ∂g(x∗ )/∂xk = =0 ∂f (x∗ )/∂xN ∂g(x∗ )/∂xN which is a generalization of the familiar “marginal utility of x over marginal utility of y equals the price ratio” relationship in consumer theory. However, developing and verifying second-order sufficient conditions using this approach appears to be quite challenging. We would need to apply the implicit function theorem to the system of FONC’s, leading to a very complicated Hessian that might not obviously be negative definite. 
10.1.2 A more geometric approach
Since we are using calculus, let's focus on what must be true locally for a point $x^*$ to maximize $f(x)$ subject to $g(x) = 0$. In particular, instead of maximizing $f(x)$ over $\mathbb{R}^N$ while being restricted to the points satisfying $g(x) = 0$, imagine that the set of points satisfying $g(x) = 0$ is itself the set over which we maximize $f(x)$. What do I mean by that? Let $g(x_0) = 0$. The set of feasible local variations on $x_0$ is the set of directions $x'$ for which the directional derivative of $g(x)$ at $x_0$, evaluated in the direction $x'$, is zero:
\[ \nabla g(x_0) \cdot x' = 0. \]
This is the set of directions $x'$ for which the constraint is still satisfied if a differential step is taken in that direction: imagine standing at the point $x_0$ and taking a small step towards $x'$ so that your foot is still in the set of points satisfying $g(x) = 0$. Define this set as
\[ Y(x_0) = \{x' : \nabla g(x_0) \cdot x' = 0\}. \]
Now, a point $x^*$ is a local maximum of $f(x)$ subject to $g(x) = 0$ if it is a local maximum of $f(x)$ on the set $Y(x^*)$ (right?). This implies the following:
\[ \nabla f(x^*) \cdot x' = 0 \text{ for all } x' \text{ such that } \nabla g(x^*) \cdot x' = 0. \]
This gives great geometric intuition: for any feasible local variation $x'$ on $x^*$, it must be the case that the gradient of the objective function $f(x)$ and the gradient of the constraint $g(x)$ are both orthogonal to $x'$. This clarifies the set of points we must consider as potential improvements on a local maximizer (the set of feasible local variations), but it doesn't seem to provide an algorithm for solving for maximizers, or for testing whether they are local maximizers, minimizers, or neither.
10.2 First-Order Necessary Conditions
The preferred method of solving these problems is Lagrange maximization. To use this approach, we introduce a special function:
Definition 10.2.1 The Lagrangean is the function
\[ \mathcal{L}(x, \lambda) = f(x) - \lambda g(x), \]
where $\lambda$ is a real number.
The Lagrangean is designed so that if $g(x) \neq 0$, the term $-\lambda g(x)$ acts as a penalty on the objective function $f(x)$. You might even imagine coming up with an algorithm that penalizes the decision-maker for violating constraints, raising the penalties to push him towards a "good" solution that makes $f(x)$ large and satisfies the constraint. I find that it is best to think of $\mathcal{L}(x, \lambda)$ as a convenient way of converting a constrained maximization problem in $x$ subject to $g(x) = 0$ into an unconstrained maximization problem in $(x, \lambda)$. When you subtract $\lambda g(x)$ from the objective in the Lagrangean, you are basically imposing an extra cost on the decision-maker for violating the constraint. When you maximize with respect to $\lambda$, you are minimizing the pain of this cost, $\lambda g(x)$ (since maximizing the negative is the same as minimizing). So Lagrange maximization trades off between increasing the objective function, $f(x)$, and satisfying the constraint, $g(x) = 0$, by introducing this fictional cost of violating it. This is the basic idea of our new first-order necessary conditions:
Theorem 10.2.2 (Lagrange First-Order Necessary Conditions) If
1. $x^*$ is a local maximum of $f(x)$ among the points that satisfy the constraint $g(x) = 0$,
2. the constraint gradient satisfies $\nabla g(x^*) \neq 0$ (this is called the constraint qualification), and
3. $f(x)$ and $g(x)$ are both differentiable at $x^*$,
then there exists a unique $\lambda^*$ such that the FONCs
\[ \nabla_x \mathcal{L}(x^*, \lambda^*) = \nabla f(x^*) - \lambda^*\nabla g(x^*) = 0 \]
\[ \nabla_\lambda \mathcal{L}(x^*, \lambda^*) = -g(x^*) = 0 \]
hold.
Proof We will use the penalty approach. Define the sequence of unconstrained objective functions
\[ F^k(x) = f(x) - \frac{k}{2}\|g(x)\|^2 - \frac{\alpha}{2}\|x - x^*\|^2, \]
where $x^*$ is the local maximum of the constrained problem and $\alpha$ is some positive scalar. The second term is a fictional cost for violating the constraint, and the final term is a relaxation term that ensures we are going to look at a sequence of solutions $x^1, x^2, \ldots$ that converges to $x^*$.
First, let's show that we can study the unconstrained solutions to this sequence of problems for sufficiently large k. Since x∗ is a local maximum, we can select ε > 0 so that f(x∗) ≥ f(x) for all x in the closed sphere S = {x : ||x − x∗|| ≤ ε}. Define the sequence of maximization problems

max_{x ∈ S} F^k(x).

A solution x^k exists for each k by the Weierstrass theorem. We will show that the sequence of solutions x^k converges to x∗, and that in the limit the Lagrange multiplier is well-defined. Note that for all k,

F^k(x^k) = f(x^k) − (k/2)||g(x^k)||² − (α/2)||x^k − x∗||² ≥ F^k(x∗) = f(x∗)

since x∗ is feasible but not necessarily optimal for F^k(x). Since f(x) is continuous and therefore bounded on S,

lim_{k→∞} ||g(x^k)|| = 0

since otherwise the term −(k/2)||g(x^k)||² would tend to negative infinity, which cannot be optimal since x∗ is feasible and does not deliver a value of −∞. Therefore, every limit point x̄ of {x^k}∞_{k=1} satisfies g(x̄) = 0. Taking the limit as k → ∞ in the displayed inequality then implies

f(x̄) − (α/2)||x̄ − x∗||² ≥ f(x∗)

implying that f(x̄) ≥ f(x∗). But since x∗ is a local maximum on S, it must be that f(x̄) ≤ f(x∗), and it follows that f(x̄) = f(x∗). Then ||x̄ − x∗|| = 0, and we have x̄ = x∗. Thus, x^k is an unconstrained local maximum of F^k(x) in the interior of S for sufficiently large k, and we can exploit the theory of unconstrained maxima we have already developed. Second, we show that a Lagrange multiplier λ∗ exists and is unique. The first-order necessary condition for F^k at x^k for sufficiently large k is

0 = ∇F^k(x^k) = ∇f(x^k) − k∇g(x^k)g(x^k) − α(x^k − x∗).

Now, we use a trick that is not obvious.
The problem is that the column vector ∇g(x^k) is not "invertible" in any sense, but the matrix ∇g(x^k)′∇g(x^k) is, so let's pre-multiply the above first-order necessary condition by (∇g(x^k)′∇g(x^k))⁻¹∇g(x^k)′ to get

0 = (∇g(x^k)′∇g(x^k))⁻¹∇g(x^k)′(∇f(x^k) − α(x^k − x∗)) − kg(x^k)

and re-arranging yields

kg(x^k) = (∇g(x^k)′∇g(x^k))⁻¹∇g(x^k)′(∇f(x^k) − α(x^k − x∗)).

The sequence {kg(x^k)}∞_{k=1} therefore converges to the vector

λ∗ = (∇g(x∗)′∇g(x∗))⁻¹∇g(x∗)′∇f(x∗)

which is unique, since all convergent sequences have a unique limit. Returning to the first-order necessary condition and evaluating the limit, with kg(x^k) → λ∗ in particular, we then have

∇f(x∗) − λ∗∇g(x∗) = 0

as was to be shown.

Basically, the FONCs say that any local maximum of f(x) subject to g(x) = 0 is a critical point of the Lagrangean when the Lagrangean is viewed as an unconstrained maximization problem in (x, λ). As usual, these are necessary conditions and help us identify a candidate list:

• Points where f(x) or g(x) is non-differentiable
• Points where the constraint qualification fails
• The critical points of the Lagrangean

All we know from the FONCs is that the global maximizer of f(x) subject to g(x) = 0 must be on the list, not whether any particular entry on the list is a maximum, minimum, or saddle.

Example Consider the problem

max_{x,y} xy subject to a = bx + cy.

The Lagrangean is

L(x, y, λ) = xy − λ(bx + cy − a)

with FONCs

L_x(x, y, λ) = y − λb = 0
L_y(x, y, λ) = x − λc = 0
L_λ(x, y, λ) = −(bx + cy − a) = 0

To solve the system, notice that the first two equations can be rewritten as y = λb and x = λc, so that

y/x = b/c.

Solving for y in terms of x yields

y = (b/c)x.

We have one equation left, the constraint.
Substituting the above equation into the constraint yields

a = bx + c(b/c)x = 2bx

or

x∗ = a/(2b), y∗ = a/(2c), and λ∗ = a/(2bc).

Since f(x) and g(x) are continuously differentiable and the constraint qualification is satisfied everywhere, the candidate list consists of this single critical point. (As of yet, we cannot determine whether it is a maximum or a minimum, but it's a maximum.)

Example Consider the firm cost-minimization problem: A firm hires capital K and labor L at prices r and w to produce output F(K, L) = q. The firm minimizes cost subject to achieving output q̄. Then the firm is trying to solve

min_{K,L} rK + wL subject to F(K, L) = q̄.

First, we convert to a maximization problem,

max_{K,L} −rK − wL subject to g(K, L) = q̄ − F(K, L) = 0,

and then form the Lagrangean,

L(K, L, λ) = −rK − wL − λ(q̄ − F(K, L)).

Then the FONCs are

−r + λF_K(K∗, L∗) = 0
−w + λF_L(K∗, L∗) = 0
−(q̄ − F(K∗, L∗)) = 0

which characterizes the cost-minimizing plan in terms of w, r, and q̄.

Example Consider maximizing f(x, y) = xy subject to the constraint g(x, y) = x² + y² − 1 = 0, so we are trying to make xy as large as possible on the unit circle. Then the Lagrangean is

L(x, y, λ) = xy − λ(x² + y² − 1)

and the FONCs are

y∗ − 2λ∗x∗ = 0
x∗ − 2λ∗y∗ = 0
−(x∗² + y∗² − 1) = 0

Then we can use the same approach, re-arranging and dividing the first two equations. This yields

x∗² = y∗².

So the set of critical points is the set of points on the circle satisfying x∗² = y∗². There are four such points,

(±√(1/2), ±√(1/2)).

There are two global maxima and two global minima. One maximum is (√(1/2), √(1/2)) and the other is (−√(1/2), −√(1/2)); the two global minima are (√(1/2), −√(1/2)) and (−√(1/2), √(1/2)). Right?

10.2.1 The Geometry of Constrained Maximization

Lagrange maximization is actually very intuitive geometrically. Let's start by recalling what indifference curves are.
Suppose we fix a value, C, and set

f(x, y) = C.

Then by the implicit function theorem, there is a function x(y) that solves f(x(y), y) = C, at least locally near a point where f_x ≠ 0. Now if we totally differentiate with respect to y, we get

f_x x′(y) + f_y = 0

and

∂x(y)/∂y = −f_y(x(y), y) / f_x(x(y), y).

This derivative is called the marginal rate of substitution between x and y: it expresses how much x must be given to or taken from the agent to compensate him for a small increase in the amount of y. So if we give the agent one more apple, he is presumably better off, so we have to take away half a banana to leave him equally well off, and so on. If we graph x(y) in the plane, we see the set of bundles (x, y) which all achieve an f(x, y) = C level of satisfaction. The agent is indifferent among all these bundles, and the locus of points (y, x(y)) is an indifference curve.

[Figure: Indifference Curves]

Since ∇f(x, y) ≥ 0, the set of points above an indifference curve are all better than anything on the indifference curve, and we call these the upper contour sets of f(x):

UC(a) = {x : f(x) ≥ a}

while the set of points below an indifference curve are all worse than anything on the indifference curve, and we call these the lower contour sets of f(x):

LC(a) = {x : f(x) ≤ a}

Now, along any indifference curve (y, x(y)), let's compute

∇f(x) · (x′(y), 1) = f_x x′(y) + f_y = f_x(−f_y/f_x) + f_y = 0.

Recall that if v₁ · v₂ = 0, then v₁ and v₂ are at a right angle to each other (they are orthogonal). This implies that the gradient and the indifference curve are at right angles to one another.

[Figure: Tangency of the Gradient to Indifference Curves]

If ∇f(x) ≥ 0, the gradient gives the direction in which the function is increasing. Now, if we started on the interior of the constraint set and consulted the gradient, taking a step in the x direction if f_x(x, y) ≥ f_y(x, y) and taking a step in the y direction if f_y(x, y) > f_x(x, y), we would drift up to the constraint.
At this point, we would be on an indifference curve that is orthogonal to the gradient and tangent to the constraint, g(x), and any further step would violate the constraint. Now, the FONCs are

∇f(x∗) = λ∇g(x∗)
g(x∗) = 0

Recall that the gradient, ∇f(x), is a vector pointing from x in the direction in which the function has the greatest rate of change. So the equation ∇f(x∗) = λ∇g(x∗) implies that the constraint gradient is a scalar multiple of the gradient of the objective function:

[Figure: Tangency of the Gradient to the Constraint]

So at a local maximum, it must be the case that the indifference curve is tangent to the constraint, since the constraint gradient points in the same direction as the gradient of the objective, and the gradient of the objective is orthogonal to the indifference curve. This is all of the intuition: we're looking for a spot where the agent's indifference curve is tangent to the constraint, so that the agent can't find any local changes that improve his payoff. But if the objective function or constraint set has a lot of curvature, there can be multiple local solutions:

[Figure: Multiple Solutions]

This is the basic idea of Lagrange maximization, expressed graphically.

10.2.2 The Lagrange Multiplier

What is the interpretation of the Lagrange multiplier? Consider the problem of maximizing f(x) subject to a linear constraint w = p′x, p ≫ 0. You can think of w as an agent's wealth, p as the prices, and f(x) as his payoff from consuming a bundle x. The Lagrangean evaluated at the optimum is

L(x∗, λ∗) = f(x∗) − λ∗(p′x∗ − w).

But if we consider this instead as a function of w and differentiate, we get

V′(w) = (∇f(x∗(w)) − λ∗p)′ ∇_w x∗(w) − (∂λ∗/∂w)(p′x∗(w) − w) + λ∗(w)

where the first term is zero by the first-order conditions and the second term is zero because the constraint is satisfied. So V′(w) = λ∗(w), and the Lagrange multiplier gives the change in the optimized value of the objective for a small relaxation of the constraint.
Economists often call it the shadow price of w: how much the decision-maker would be willing to pay to relax the constraint slightly.

10.2.3 The Constraint Qualification

The constraint qualification can be confusing because it is a technical condition that has nothing to do with maximization. To make some sense of it (and generalize things a bit), consider the following problem:

max_x f(x)

subject to m = 1, 2, ..., M equality constraints, g₁(x) = 0, g₂(x) = 0, ..., g_M(x) = 0. Let

g(x) = (g₁(x), g₂(x), ..., g_M(x))′.

The new Lagrangean is given by

L(x, λ) = f(x) − Σ_{m=1}^{M} λ_m g_m(x) = f(x) − λ′g(x)

Theorem 10.2.3 (First-Order Necessary Conditions) If

1. x∗ is a local maximum of f(x) that satisfies the constraints g(x) = 0
2. the matrix of constraint gradients ∇g(x∗) has full row rank (this is called the constraint qualification)
3. f(x) and g(x) are both differentiable at x∗

then there exists a unique vector λ∗ such that the FONCs

∇x L(x∗, λ∗) = ∇f(x∗) − λ∗′∇g(x∗) = 0
∇λ L(x∗, λ∗) = −g(x∗) = 0

hold.

Now, the new FONCs are

∇f(x) − λ′∇g(x) = ∇f(x) − Σ_{m=1}^{M} λ_m ∇g_m(x) = 0
−g(x) = 0

The term ∇g(x) is actually an M × N matrix whose (m, n) entry is ∂g_m(x)/∂x_n:

∇g(x) =
[ ∂g₁(x)/∂x₁    ∂g₁(x)/∂x₂    ...   ∂g₁(x)/∂x_N ]
[ ∂g₂(x)/∂x₁    ∂g₂(x)/∂x₂    ...   ∂g₂(x)/∂x_N ]
[ ...                                             ]
[ ∂g_M(x)/∂x₁   ∂g_M(x)/∂x₂   ...   ∂g_M(x)/∂x_N ]

Call this matrix G, just to clean things up a bit. Now our FONCs are

∇f(x)′ = λ′G
g(x) = 0

Remember that we are trying to solve for λ (that's what Lagrange's theorem guarantees), so we need G to have full row rank, making GG′ invertible, so that

λ∗ = (GG′)⁻¹G∇f(x).

But if G fails to have full row rank, we can't solve for λ∗, and we can't "finish the proof". So the constraint qualification is just saying, "the matrix of constraint gradients is well-behaved enough that we can solve for the multipliers," nothing more.

Example Here's an example where the constraint qualification fails with only one constraint:

max_{x,y : y³ − x² = 0} −y

The Lagrangean is

L = −y − λ(y³ − x²)

with first-order necessary conditions

−(y³ − x²) = 0
2λx = 0
−1 − 3λy² = 0

The constraint y³ = x² ≥ 0 requires that y be (weakly) positive.
Since the objective is decreasing in y and unresponsive to x, the global maximum is at y∗ = x∗ = 0. But at (0, 0) the first-order necessary conditions become

2λ·0 = 0
−1 − 3λ·0² = −1 ≠ 0

and cannot be satisfied. So we have a simple situation where the Lagrange approach fails. The reason is that the constraint has gradient

(∂g(0, 0)/∂x, ∂g(0, 0)/∂y) = (−2·0, 3·0²) = (0, 0)

so the constraint gradient vanishes at the optimum. Then ∇L = ∇f − λ∇g = 0 cannot be solved if ∇f ≠ 0 but ∇g = 0. This is the basic idea of the constraint qualification.

10.3 Second-Order Sufficient Conditions

Just like unconstrained problems, constrained problems have second-order conditions, but they are more complicated and require patience and practice to understand. There are two main differences from the unconstrained problem. First, we need not consider all possible local changes in the choice variables x∗ to verify whether a particular critical point is a local maximum, but only those that do not violate the constraints. For example, if the function is f(x, y) = xy and the constraint is w = px + y, there are always better allocations near any critical point, but most of them violate the constraint w = px + y. The set of changes (dx, dy) that leave total expenditure unchanged satisfy p dx + dy = 0, and these are the only changes in which we are really interested.

Definition 10.3.1 Let (x∗, λ∗) be a critical point of the Lagrangean. The set of feasible local variations is

Y = {y ∈ Rⁿ : ∇g(x∗) · y = 0}

This is the set of directions y local to x∗ in which all of the constraints are still satisfied, in the sense of the directional derivative, ∇g(x∗) · y = 0. Second, the Hessian of L(x, λ) is not just the Hessian of f(x). Intuitively, I have suggested that we should think of equality-constrained maximization of f(x) subject to g(x) = 0 as unconstrained maximization of L(x, λ) in terms of (x, λ). This is reflected in the fact that the Hessian of L(x, λ), not the Hessian of f(x), is the focus of the SOSCs.
Theorem 10.3.2 (Second-Order Sufficient Conditions) Suppose (x∗, λ∗) is a critical point of the Lagrangean (i.e., the constraint qualification and the FONCs ∇f(x∗) − λ∗′∇g(x∗) = 0 and g(x∗) = 0 are satisfied). Let ∇xx L(x∗, λ∗) be the Hessian of the Lagrangean with respect to the choice variables only. Then

• If y′∇xx L(x∗, λ∗)y < 0 for all feasible local variations y ≠ 0, then x∗ is a local maximum of f
• If y′∇xx L(x∗, λ∗)y > 0 for all feasible local variations y ≠ 0, then x∗ is a local minimum of f

Since we are not solving an unconstrained maximization problem and the solution must lie on the locus of points defined by g(x) = 0, it follows that we can imagine checking the local, second-order sufficient conditions only on that locus. Since we are using calculus, we restrict attention to feasible local variations, and the consequence is that for all "small steps" y that do not violate the constraint, the Lagrangean's Hessian in x alone must have a negative quadratic form. This does not mean that the Hessian of the Lagrangean is negative definite, because the Lagrangean depends on both x and λ. However, we would like a test that avoids dealing with the set of feasible local variations directly, since that set appears difficult to solve for and manipulate.

Example If

L(x, y, λ) = f(x, y) − λ(ax + by − c)

then the FONCs are

−(ax + by − c) = 0
f_x(x∗, y∗) − λ∗a = 0
f_y(x∗, y∗) − λ∗b = 0

Now, differentiating all three equations again with respect to λ∗, x∗, and y∗, in that order, gives the matrix

H(x∗, y∗, λ∗) =
[ 0    −a             −b            ]
[ −a   f_xx(x∗, y∗)   f_xy(x∗, y∗)  ]
[ −b   f_yx(x∗, y∗)   f_yy(x∗, y∗)  ]

The above Hessian is called bordered, since the first row and first column are transposes of each other, while the lower right-hand corner is the Hessian of the objective function. For a single constraint g(x), the Hessian of the Lagrangean looks like this:

H(x∗, λ∗) =
[ 0            −∇x g(x∗)′                      ]
[ −∇x g(x∗)   ∇xx f(x∗) − λ∗∇xx g(x∗)        ]    (10.1)

This is called the bordered Hessian.
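To make the theorem concrete before turning to the determinant tests, here is a minimal numeric sketch (my own example, not from the text): maximize f(x, y) = xy subject to x + y = 2, which has critical point x∗ = y∗ = 1 with λ∗ = 1. The Hessian of the Lagrangean in the choice variables is indefinite on R², but its quadratic form is negative on the feasible local variations.

```python
# SOSC check for max f(x, y) = x*y subject to x + y = 2 (assumed example).
# The constraint is linear, so the Hessian of the Lagrangean in (x, y)
# is just the Hessian of f: f_xx = 0, f_xy = f_yx = 1, f_yy = 0.
H = [[0.0, 1.0],
     [1.0, 0.0]]

# Feasible local variations: grad g = (1, 1), so Y = {t * (1, -1)}.
y = (1.0, -1.0)
quad = sum(y[i] * H[i][j] * y[j] for i in range(2) for j in range(2))
print(quad)  # -2.0 < 0: the critical point is a local maximum on the constraint

# In the infeasible direction (1, 1), the quadratic form is positive,
# so H itself is indefinite even though the SOSC for a maximum holds on Y.
z = (1.0, 1.0)
quad_infeasible = sum(z[i] * H[i][j] * z[j] for i in range(2) for j in range(2))
print(quad_infeasible)  # +2.0 > 0
```

This is exactly why the theorem restricts the test to Y: definiteness of the full Hessian is too strong a requirement.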
Suppose you are maximizing f(x) subject to a single constraint, g(x) = 0 (you can find the generalization to any number of constraints in any math econ textbook). Then the determinant of the upper-left-hand corner is trivially zero. The non-trivial leading principal minors of the bordered Hessian are, for the case with a single constraint,

H₃ =
[ 0            −g_x1(x∗)                        −g_x2(x∗)                       ]
[ −g_x1(x∗)   f_x1x1(x∗) − λ∗g_x1x1(x∗)       f_x2x1(x∗) − λ∗g_x2x1(x∗)      ]
[ −g_x2(x∗)   f_x1x2(x∗) − λ∗g_x1x2(x∗)       f_x2x2(x∗) − λ∗g_x2x2(x∗)      ]

H₄ =
[ 0            −g_x1(x∗)                        −g_x2(x∗)                        −g_x3(x∗)                       ]
[ −g_x1(x∗)   f_x1x1(x∗) − λ∗g_x1x1(x∗)       f_x2x1(x∗) − λ∗g_x2x1(x∗)       f_x3x1(x∗) − λ∗g_x3x1(x∗)      ]
[ −g_x2(x∗)   f_x1x2(x∗) − λ∗g_x1x2(x∗)       f_x2x2(x∗) − λ∗g_x2x2(x∗)       f_x3x2(x∗) − λ∗g_x3x2(x∗)      ]
[ −g_x3(x∗)   f_x1x3(x∗) − λ∗g_x1x3(x∗)       f_x2x3(x∗) − λ∗g_x2x3(x∗)       f_x3x3(x∗) − λ∗g_x3x3(x∗)      ]

and in general

H_{k+1} =
[ 0            −g_x1(x∗)                        ...   −g_xk(x∗)                       ]
[ −g_x1(x∗)   f_x1x1(x∗) − λ∗g_x1x1(x∗)       ...                                   ]
[ ...                                                                                 ]
[ −g_xk(x∗)                                     ...   f_xkxk(x∗) − λ∗g_xkxk(x∗)     ]

So H_{k+1} is the upper-left (k+1) × (k+1) leading principal minor of the full bordered Hessian. The "+1" comes from the fact that we have an extra leading row and column that correspond to the constraint. An example makes this a bit clearer:

Example Let

f(x₁, x₂, x₃) = x₁x₂ + x₂x₃ + x₁x₃ subject to x₁ + x₂ + x₃ = 3.

Then the Lagrangean is

L = x₁x₂ + x₂x₃ + x₁x₃ − λ(x₁ + x₂ + x₃ − 3)

This generates a system of first-order conditions:

∂L/∂λ = 3 − x₁ − x₂ − x₃ = 0
∂L/∂x₁ = x₂ + x₃ − λ = 0
∂L/∂x₂ = x₁ + x₃ − λ = 0
∂L/∂x₃ = x₂ + x₁ − λ = 0

The only critical point is x∗ = (1, 1, 1), with λ∗ = 2.
The bordered Hessian at the critical point is

H(x∗, λ∗) =
[ 0   −1  −1  −1 ]
[ −1   0   1   1 ]
[ −1   1   0   1 ]
[ −1   1   1   0 ]

and the non-trivial leading principal minors are

H₃ =
[ 0   −1  −1 ]
[ −1   0   1 ]
[ −1   1   0 ]

and H₄, the full matrix above.

It turns out that we can develop a test based on the bordered Hessian of the whole Lagrangean, including derivatives with respect to the multiplier, rather than focusing on the Hessian of the Lagrangean restricted only to the choice variables.

Theorem 10.3.3 (Alternating Sign Test)

• A critical point (x∗, λ∗) is a local maximum of f(x) subject to the constraint g(x) = 0 if the determinants of the non-trivial leading principal minors of the bordered Hessian alternate in sign, starting with det H₃ > 0
• A critical point (x∗, λ∗) is a local minimum of f(x) subject to the constraint g(x) = 0 if the determinants of the non-trivial leading principal minors of the bordered Hessian are all negative, starting with det H₃ < 0

Returning to the example:

Example We need to decide whether the critical point passes the test for a maximum. The non-trivial leading principal minors of the bordered Hessian have determinants

det(H₃) = 2 > 0, det(H₄) = −3 < 0.

The signs alternate starting with det H₃ > 0, so the critical point is a local maximum.

Example Let

f(x₁, x₂) = f₁(x₁) + f₂(x₂)

where f₁(x₁) and f₂(x₂) are increasing, f₁″(x), f₂″(x) < 0, and f_i(0) = 0. There is a constraint that C = x₁ + x₂. Then the Lagrangean is

L = f₁(x₁) + f₂(x₂) − λ(x₁ + x₂ − C)

This generates a system of first-order conditions

∂L/∂λ = −(x₁ + x₂ − C) = 0
∂L/∂x₁ = f₁′(x₁) − λ = 0
∂L/∂x₂ = f₂′(x₂) − λ = 0

with bordered Hessian

H(x, λ) =
[ 0    −1         −1        ]
[ −1   f₁″(x₁)    0         ]
[ −1   0          f₂″(x₂)   ]

The only non-trivial leading principal minor is H₃ itself, with determinant

det(H₃) = −f₁″(x₁) − f₂″(x₂) > 0.

So the alternating sign test for a maximum is satisfied, and an interior solution to the first-order conditions is a solution to the maximization problem.
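The determinants quoted in the symmetric example above (det H₃ = 2, det H₄ = −3) can be verified directly; a small sketch with a cofactor-expansion determinant:

```python
# Verify the leading principal minors of the bordered Hessian for
# f = x1*x2 + x2*x3 + x1*x3 subject to x1 + x2 + x3 = 3.

def det(M):
    """Determinant by cofactor expansion along the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total

H4 = [[0, -1, -1, -1],
      [-1, 0, 1, 1],
      [-1, 1, 0, 1],
      [-1, 1, 1, 0]]
H3 = [row[:3] for row in H4[:3]]

print(det(H3))  # 2.0
print(det(H4))  # -3.0: signs alternate starting positive, a local maximum
```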
Example Suppose a consumer is trying to maximize utility, with utility function u(x, y) = x^α y^β and budget constraint w = px + y. Then we can write the Lagrangean as (why?)

L = α log(x) + β log(y) − λ(px + y − w)

and we get a system of first-order conditions

∂L/∂λ = w − px − y = 0
∂L/∂x = α/x − λp = 0
∂L/∂y = β/y − λ = 0

The bordered Hessian of the Lagrangean is

[ 0    −p           −1          ]
[ −p   −α/(x∗)²    0           ]
[ −1   0            −β/(y∗)²   ]

The determinant of the only non-trivial leading principal minor, H₃, is

det(H₃) = p²β/(y∗)² + α/(x∗)² > 0.

So the alternating sign test is satisfied. This works because adding the border imposes the restriction ∇g(x) · y = 0 on the test of whether the objective's quadratic form is negative: when we use the alternating sign test, we are really asking, "Is the Hessian of the Lagrangean in the choice variables negative definite on Y?" This requires more matrix algebra to show, but you can find the details in MWG or Debreu's papers, for example.

10.4 Comparative Statics

Our system of first-order necessary conditions

∇x f(x∗) − λ∗∇x g(x∗) = 0
−g(x∗) = 0

is a non-linear system of equations with endogenous variables (x∗, λ∗), just like any other system we have applied the IFT to. The first "new" part is that you have to keep in mind that λ∗ is an endogenous variable, so it changes when we vary any exogenous variables. Second, the sign of the determinant of the bordered Hessian is determined by whether we are looking at a maximum or a minimum, and by the number of equations. For simplicity, consider a constrained maximization problem

max_{x,y} f(x, y, t)

subject to a single equality constraint ax + by − s = 0, where t and s are exogenous variables. This is simpler than assuming a general constraint g(x, s) = 0, since the second-order derivatives of the constraint are all zero. Fortunately, there are many economic problems of interest where the constraints are linear, and a general g(x, s) can be very complicated to work with.
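The alternating sign test for the log-utility consumer can also be checked numerically at the closed-form optimum. A minimal sketch (the parameter values α = β = 1, p = 2, w = 10 are my own illustration):

```python
# Check det H3 > 0 for L = alpha*log(x) + beta*log(y) - lambda*(p*x + y - w)
# at the optimum. Parameter values are assumed for illustration.
alpha, beta, p, w = 1.0, 1.0, 2.0, 10.0

# Closed form from the FONCs: alpha/x = lambda*p, beta/y = lambda, p*x + y = w
lam = (alpha + beta) / w
x = alpha / (lam * p)   # 2.5
y = beta / lam          # 5.0

H3 = [[0.0, -p, -1.0],
      [-p, -alpha / x**2, 0.0],
      [-1.0, 0.0, -beta / y**2]]

def det3(M):
    """3x3 determinant by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

d = det3(H3)
print(d)  # p^2*beta/y^2 + alpha/x^2 = 0.16 + 0.16 = 0.32 > 0: a local maximum
```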
Then the Lagrangean is

L(x, y, λ) = f(x, y, t) − λ(ax + by − s)

and the FONCs are

−(ax∗ + by∗ − s) = 0
∇f(x∗, y∗, t) − λ∗∇g(x∗, y∗, s) = 0

Since we have two parameters (t shifts the objective function, s shifts the constraint), we can look at two different comparative statics.

Example Let's start with t. Totally differentiate the FONCs to get three equations:

−(a ∂x∗/∂t + b ∂y∗/∂t) = 0
f_tx + f_xx ∂x∗/∂t + f_xy ∂y∗/∂t − a ∂λ∗/∂t = 0
f_ty + f_yx ∂x∗/∂t + f_yy ∂y∗/∂t − b ∂λ∗/∂t = 0

If we write this as a matrix equation,

[ 0    −a     −b    ] [ ∂λ∗/∂t ]   [ 0     ]
[ −a   f_xx   f_xy  ] [ ∂x∗/∂t ] = [ −f_tx ]
[ −b   f_yx   f_yy  ] [ ∂y∗/∂t ]   [ −f_ty ]

which is just a regular "Ax = b"-type matrix equation. On the left-hand side, the bordered Hessian appears; from the alternating sign test and the fact that x∗ is a local maximum, we can determine the sign of its determinant. To some extent, we are finished, since all that remains is to solve the system. You can solve the system by hand, substituting to reduce the number of equations, but that is a lot of work. It is simpler to use Cramer's rule. To see how x∗ varies with t,

              [ 0    0      −b    ]
∂x∗/∂t = det [ −a   −f_tx   f_xy  ] / det H₃
              [ −b   −f_ty   f_yy  ]

which is

∂x∗/∂t = −b(a f_ty(x∗, y∗, t) − b f_tx(x∗, y∗, t)) / det H₃

Since (x∗, y∗) is a local maximum, det H₃ > 0, and the sign of the comparative static is just the sign of the numerator. Having done t, let's do s:

Example Totally differentiate the system of FONCs with respect to s to get

−(a ∂x∗/∂s + b ∂y∗/∂s − 1) = 0
f_xx ∂x∗/∂s + f_xy ∂y∗/∂s − a ∂λ∗/∂s = 0
f_yx ∂x∗/∂s + f_yy ∂y∗/∂s − b ∂λ∗/∂s = 0

Writing this as a matrix equation yields

[ 0    −a     −b    ] [ ∂λ∗/∂s ]   [ −1 ]
[ −a   f_xx   f_xy  ] [ ∂x∗/∂s ] = [ 0  ]
[ −b   f_yx   f_yy  ] [ ∂y∗/∂s ]   [ 0  ]

Let's focus on ∂y∗/∂s.
Using Cramer's rule,

              [ 0    −a     −1 ]
∂y∗/∂s = det [ −a   f_xx   0  ] / det H₃
              [ −b   f_yx   0  ]

so

∂y∗/∂s = (a f_yx − b f_xx) / det H₃

An example with more economic content is:

Example Consider the consumer problem

L = α log(x) + β log(y) − λ(px + y − w)

with system of first-order conditions

∂L/∂λ = w − px − y = 0
∂L/∂x = α/x − λp = 0
∂L/∂y = β/y − λ = 0

This is a system of non-linear equations, so there's no problem differentiating it with respect to p, for instance, and using all the same steps as for the unconstrained firm. You need to remember, however, that λ∗(w, p, α, β) is now a function of the parameters as well, so it is not a constant that drops out. Totally differentiating with respect to p gives

−x − p ∂x/∂p − ∂y/∂p = 0
−(α/x²) ∂x/∂p − λ − p ∂λ/∂p = 0
−(β/y²) ∂y/∂p − ∂λ/∂p = 0

Solving the system the "long way" for ∂x/∂p gives

∂x/∂p = −( (β/(y∗)²) p x∗ + λ∗ ) / ( α/(x∗)² + (β/(y∗)²) p² ) < 0

Since the Lagrange multiplier λ∗ is positive, if p goes up, the consumer reduces his consumption of x∗.

10.5 The Envelope Theorem

The last tool we need to generalize is the envelope theorem. Since a local maximum of the Lagrangean satisfies

L(x∗, λ∗) = f(x∗) − λ∗′g(x∗) = f(x∗)

we can use it as our value function. As a start, suppose we are studying the consumer's problem with utility function u(x, y) and budget constraint w = px + y. Then the Lagrangean at any critical point is

L(x∗, y∗, λ∗) = u(x∗, y∗) − λ∗(px∗ + y∗ − w)

Let's consider it as a value function in terms of w,

V(w) = u(x∗(w), y∗(w)) − λ∗(w)(px∗(w) + y∗(w) − w)

Then

V′(w) = [u_x x∗′(w) + u_y y∗′(w)]  (I)
        − [λ∗(w)px∗′(w) + λ∗(w)y∗′(w)]  (II)
        − [λ∗′(w)(px∗(w) + y∗(w) − w)]  (III)
        + λ∗(w)  (IV)

There are a bunch of terms, each representing a different consequence of giving the agent more wealth. First, the agent re-optimizes and the value of the objective function changes (I). Second, this has an impact on the constraint, and its implicit cost changes (II).
Third, since the constraint has been relaxed, the Lagrange multiplier changes, again changing the implicit cost (III). Fourth, there is the direct effect on the objective of increasing w (IV). Re-arranging, however, gives

V′(w) = (u_x − λ∗(w)p)x∗′(w) + (u_y − λ∗(w))y∗′(w) − λ∗′(w)(px∗(w) + y∗(w) − w) + λ∗(w)

where the first two terms are zero by the FONCs and the third term is zero because the constraint is satisfied, leaving just the direct effect,

V′(w) = λ∗(w)

or

V′(w) = dL(x∗(w), y∗(w), λ∗(w), w)/dw = ∂L(x∗(w), y∗(w), λ∗(w), w)/∂w

So as before, we can simply take the partial derivative of the Lagrangean to see how an agent's payoff changes with respect to a parameter, rather than working out all the comparative statics through the implicit function theorem.

Theorem 10.5.1 (Envelope Theorem) Consider the constrained maximization problem

max_x f(x, t) subject to g(x, s) = 0.

Let (x∗(t, s), λ∗(t, s)) be a critical point of the Lagrangean. Define

V(t, s) = L(x∗(t, s), λ∗(t, s), t, s) = f(x∗(t, s), t) − λ∗(t, s)g(x∗(t, s), s)

Then

∂V(t, s)/∂t = ∂f(x∗(t, s), t)/∂t

and

∂V(t, s)/∂s = −λ∗(t, s) ∂g(x∗(t, s), s)/∂s

In words, the envelope theorem implies that the change in the agent's payoff with respect to an exogenous variable is the partial derivative of the non-optimized Lagrangean with respect to that variable, evaluated at the optimal decision x∗(t, s).

Example Suppose a firm has production function F(K, L, t), where K is capital, L is labor, and t is technology. The price of the firm's good is p. Then we can write the profit maximization problem as a constrained maximization problem,

π(t) = max_{q,K,L} pq − rK − wL subject to q = F(K, L, t).

The Lagrangean then is

L(q, K, L, λ_p) = pq − rK − wL − λ_p(q − F(K, L, t))

And, without any actual work,

π′(t) = λ∗_p F_t(K∗, L∗, t)

Similarly, the cost minimization problem is

C(q, t) = min_{K,L} rK + wL subject to F(K, L, t) ≥ q.
The Lagrangean for the equivalent maximization problem (maximize −rK − wL) is

L(K, L, λ_c) = −rK − wL + λ_c(F(K, L, t) − q)

and the envelope theorem gives

C_t(q, t) = −λ∗_c F_t(K∗, L∗, t) < 0,

so an improvement in technology lowers the cost of producing q. So be careful: the envelope theorem takes the derivative of the Lagrangean with respect to the parameter, then evaluates it at the optimal choice to obtain the derivative of the value function.

Exercises

1. Suppose we take a strictly increasing transformation of the objective function and leave the constraints unchanged. Is a solution of the transformed problem a solution of the original problem? Suppose we have constraints g(x) = c and take a strictly increasing transformation of both sides. Is a solution of the transformed problem a solution of the original problem?

2. Consider the maximization problem

max_{x₁,x₂} x₁ + x₂ subject to x₁² + x₂² = 2

Sketch the constraint set and contour lines of the objective function. Find all critical points of the Lagrangean. Verify whether each critical point is a local maximum or a local minimum. Find the global maximum.

3. Consider the maximization problem

max_{x,y} xy subject to (a/2)x² + (b/2)y² = r

Sketch the constraint set and contour lines of the objective function. Find all critical points of the Lagrangean. Verify whether each critical point is a local maximum or a local minimum. Find the global maximum. How does the value of the objective vary with r? With a? How does x∗ respond to a change in r? Show this using both the closed-form solutions and the IFT.

4. Solve the problem

max_{x,y,z} 2x − 2y + z subject to x² + y² + z² = 9.

5. Solve the following two-dimensional maximization problems subject to the linear constraint w = p₁x₁ + p₂x₂, p₁, p₂ > 0. Then compute the change in x₁ with respect to p₂, and the change in x₁ with respect to w. Assume α₁ > α₂ > 0. For i and ii, when do the SOSCs hold?

i. Cobb-Douglas: f(x) = x₁^α₁ x₂^α₂
ii. Stone-Geary: f(x) = (x₁ − γ₁)^α₁ (x₂ − γ₂)^α₂
iii. Constant Elasticity of Substitution: f(x) = (α₁x₁^ρ + α₂x₂^ρ)^{1/ρ}
iv. Leontief: f(x) = min{α₁x₁, α₂x₂}

6.
An agent receives income w₁ in period one and no income in the second period. He consumes c₁ in the first period and c₂ in the second. He can save s in the first period and receives back Rs in the second period. His utility function is

u(c₁, c₂) = log(c₁) + δ log(c₂)

where R > 1 and 0 < δ < 1. (i) Write this down as a constrained maximization problem. (ii) Find first-order necessary conditions for optimality and check the SOSCs. What does the Lagrange multiplier represent in this problem? (iii) Compute the change in c₂ with respect to R and δ. (iv) How does the agent's payoff change if R goes up? If δ goes up?

7. For utility function u(x₁, x₂) = (x₁ − γ₁)x₂^α and budget constraint w = p₁x₁ + p₂x₂, compute the utility-maximizing bundle. You should get closed-form solutions. Now compute ∂x₁∗/∂p₂ using the closed-form solution as well as the implicit function theorem applied to the system of FONCs. Repeat with ∂x₂∗/∂p₁. How does the agent's payoff vary in γ₁? In w? In p₂?

8. Generalize and prove the envelope theorem in the case where f(x, t) and g(x, t) both depend on the same parameter, t. Construct an economically relevant problem in which this occurs, and use the implicit function theorem to show how x∗(t) varies in t when both the constraint and objective are varying at the same time.

9. Suppose you are trying to maximize f(x). Show that if the constraint is linear, so that p · x = w, you can always rewrite the constraint as

x_N = (w − Σ_{i=1}^{N−1} p_i x_i) / p_N

and substitute this into the objective, turning the problem into an unconstrained maximization problem. Why do we need Lagrange maximization then?

10. Suppose you have a maximization problem

max_x f(x) subject to g(x, t) = 0.

Show how to use the implicit function theorem to derive comparative statics with respect to t. Explain briefly what the bordered Hessian looks like, and how it differs from the examples in the text.
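As a final check on the tools of this chapter, the envelope-theorem prediction V′(w) = λ∗(w) can be verified numerically for the log-utility consumer of Section 10.5. A minimal sketch (the parameter values α = β = 1 and p = 2 are my own illustration):

```python
# Envelope theorem check: V'(w) should equal lambda*(w) = (alpha + beta)/w
# for max alpha*log(x) + beta*log(y) subject to p*x + y = w.
import math

alpha, beta, p = 1.0, 1.0, 2.0

def solution(w):
    """Closed-form solution of the FONCs: alpha/x = lam*p, beta/y = lam, px + y = w."""
    lam = (alpha + beta) / w
    return alpha / (lam * p), beta / lam, lam

def V(w):
    x, y, _ = solution(w)
    return alpha * math.log(x) + beta * math.log(y)

w = 10.0
h = 1e-6
numerical = (V(w + h) - V(w - h)) / (2 * h)  # central finite difference
_, _, lam = solution(w)
print(numerical, lam)  # both approximately (alpha + beta)/w = 0.2
```

The finite-difference derivative of the value function matches the multiplier, with no comparative statics of x∗(w) and y∗(w) required, which is the whole point of the theorem.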
Chapter 11

Inequality-Constrained Maximization Problems

Many problems, particularly in macroeconomics, don't involve strict equality constraints, but rather inequality constraints. For example, a household might be making a savings/consumption decision subject to the constraint that savings can never become negative; this is called a borrowing constraint. Many households will never have to worry about this constraint, as long as they maintain positive savings in all periods. A simple version is:

Example Consider a household with utility function

u(c₁) + δu(c₂)

The household gets income y₁ in period one and y₂ in period two, with y₂ > y₁. The household would potentially like to borrow money today to smooth consumption, but faces a financial constraint: it can borrow an amount B only up to a limit R < y₂. Then we have the constraints

c₁ = y₁ + B
c₂ = y₂ − B
B ≤ R

Substituting the first two constraints into the objective yields

max_B u(y₁ + B) + δu(y₂ − B) subject to B ≤ R.

Now, sometimes the constraint will be irrelevant: u′(y₁ + B∗) = δu′(y₂ − B∗) at some B∗ < R, and the constraint is "slack" or "inactive". Other times, the unconstrained optimum would require B∗ > R, the constraint becomes "binding" or "active", and the solution is B∗ = R. So we have two potential solutions, and the one selected depends on the parameters of the problem. The simple borrowing problem illustrates the general issue: having inequality constraints creates the possibility that, for different values of the parameters, different collections of the constraints are binding. The maximization theory developed here is designed to deal with these shifting sets of binding constraints.

Definition 11.0.2 Let f(x) be the objective function. Suppose that the choice of x is subject to m = 1, 2, ..., M equality constraints, g_m(x) = 0, and k = 1, 2, ..., K inequality constraints, h_k(x) ≤ 0.
Then the inequality-constrained maximization problem is

max_x f(x)

such that for m = 1, 2, ..., M and k = 1, 2, ..., K,

g_m(x) = 0
h_k(x) ≤ 0

Note that we can solve this problem by brute force as follows: since there are K inequality constraints, there are 2^K ways to pick a subset of the inequality constraints. For each subset, we can make the chosen inequality constraints into equality constraints and solve the resulting equality-constrained maximization problem (which we already know how to do). Each of these sub-problems might generate no candidates or many, and some of the candidates may conflict with some of the constraints we are ignoring. Once we compile a list of all the solutions that are actually feasible, the global maximum must be on the list somewhere. Then we can simply compare the payoffs from our 2^K sets of candidate solutions and pick the best.

Example Suppose an agent is trying to maximize f(x, y) = ax + by, a, b > 0, subject to the constraints x ≥ 0, y ≥ 0, and c ≥ px + qy. Since ∇f = (a, b) ≫ 0, it is immediate that the constraint c ≥ px + qy will bind with equality. For if c > px + qy, we can always increase x or y a little bit, improving the objective function's value without violating the constraint. This is feasible and better than any proposed solution with c > px + qy, so we must have c = px + qy. Then we have three cases remaining:

1. The constraint x ≥ 0 binds and x = 0, but y > 0
2. The constraint y ≥ 0 binds and y = 0, but x > 0
3. The constraints x, y ≥ 0 are both slack, and x, y > 0

We then solve case-by-case:

1. If y = 0, then the constraint implies x = c/p, giving a value of ac/p
2. If x = 0, then the constraint implies y = c/q, giving a value of bc/q
3.
If x, y > 0, then we solve the constrained problem using a Lagrangean,

L(x, y, λ) = ax + by − λ(px + qy − c)

with FONCs

a − λp = 0
b − λq = 0
−(px + qy − c) = 0

This has a solution only if a/b = p/q, in which case any x and y that satisfy the constraint are a solution: the objective is proportional to (a/b)x + y and the constraint to (p/q)x + y, so the indifference curves and the constraint set are exactly parallel, and any px∗ + qy∗ = c with x∗, y∗ > 0 is a solution.

So we have two candidate solutions with either x or y equal to zero, and a continuum of solutions when the objective and constraint are parallel. As in the previous problem, we differentiate between two kinds of solutions:

Definition 11.0.3 A point x∗ that is a local maximizer of f(x) subject to gm(x) = 0 for all m and hk(x) ≤ 0 for all k is a corner solution if hk(x∗) = 0 for some k, and an interior solution if hk(x∗) < 0 for all k.

A corner solution is one like x∗ = c/p and y∗ = 0, while the third case from the previous example, where only the budget constraint c = px + qy binds, corresponds to a solution that is interior with respect to the non-negativity constraints.

This example suggests we don't really need a new maximization theory, just a lot of paper or a powerful computer. However, just as Lagrange multipliers can be theoretically useful, the multipliers we'll attach to the inequality constraints can be theoretically useful. In fact, the framework we'll develop is mostly a systematic way of carrying out the brute-force approach described in the previous paragraph.

11.1 First-Order Necessary Conditions

For simplicity, we'll work with one equality constraint g(x) = 0 and k = 1, ..., K inequality constraints, hk(x) ≤ 0.
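The brute-force recipe described above is easy to automate. Below is a minimal Python sketch, assuming scipy is available; the particular objective and constraint values are my own illustrative choices, not from the text. It enumerates all 2^K subsets of the inequality constraints, imposes each chosen subset as equalities, solves that sub-problem numerically, discards infeasible candidates, and keeps the best.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import minimize

# Illustrative problem (my own choice, not from the text):
#   max -(x-2)^2 - (y-2)^2   s.t.   x >= 0, y >= 0, x + y <= 1.
f = lambda z: -(z[0] - 2) ** 2 - (z[1] - 2) ** 2
# all inequality constraints written as h_k(z) <= 0
hs = [lambda z: -z[0], lambda z: -z[1], lambda z: z[0] + z[1] - 1]

candidates = []
for r in range(len(hs) + 1):
    for subset in combinations(range(len(hs)), r):
        # impose the chosen constraints as equalities, ignore the rest
        cons = [{"type": "eq", "fun": hs[k]} for k in subset]
        res = minimize(lambda z: -f(z), x0=[0.25, 0.25], constraints=cons)
        # keep only candidates that satisfy ALL original constraints
        if res.success and all(h(res.x) <= 1e-6 for h in hs):
            candidates.append((f(res.x), res.x))

best_value, best_point = max(candidates, key=lambda c: c[0])
```

For this toy problem the winning candidate comes from the case in which only the constraint x + y ≤ 1 binds, giving (x, y) = (0.5, 0.5).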
As before, we form the Lagrangean, putting a Lagrange multiplier λ on the equality constraint and Kuhn-Tucker multipliers µ = (µ1, µ2, ..., µK) on the inequality constraints:

L(x, λ, µ) = f(x) − λg(x) − Σk µk hk(x) = f(x) − λg(x) − µ′h(x)

But now we have a dilemma, because some constraints may be binding at the solution while others are not. Moreover, there may be many solutions that correspond to different binding constraint sets. How are we supposed to know which bind and which don't? We proceed by enlarging the set of FONCs to include complementary slackness conditions.

Theorem 11.1.1 (Kuhn-Tucker First-Order Necessary Conditions) Suppose that

1. x∗ is a local maximum of f(x) subject to the constraints g(x) = 0 and hk(x) ≤ 0 for all k.

2. A(x∗) = {1, 2, ..., ℓ} is the set of inequality constraints that are binding at x∗, so that hj(x∗) = 0 for all j ∈ A(x∗) and hj(x∗) < 0 for all j ∉ A(x∗), and the matrix formed by the gradient of the equality constraint, ∇g(x∗), and the gradients of the binding inequality constraints in A(x∗), ∇h1(x∗), ∇h2(x∗), ..., ∇hℓ(x∗), is non-singular (this is the constraint qualification).

3. The objective function, equality constraint, and inequality constraints are all differentiable at x∗.

Then there exist unique vectors λ∗ and µ∗ of multipliers so that

• the FONCs

∇f(x∗) − λ∗∇g(x∗) − Σj µ∗j ∇hj(x∗) = 0
−g(x∗) = 0

• and the complementary slackness conditions

µ∗j hj(x∗) = 0 for all j = 1, ..., K

are satisfied.

We always have the issue of the constraint qualification lurking in the background of a constrained maximization problem; set that aside. The rest of the theorem says that if x∗ is a local maximum subject to the equality constraint and inequality constraints, some constraints will be active/binding and some will be inactive/slack.
If we treat the active/binding constraints as regular equality constraints, then µ∗j ≥ 0 for each binding constraint, while each slack constraint satisfies hj(x∗) < 0, so µ∗j = 0; these are the complementary slackness conditions. (Note that it is not obvious right now that µ∗j ≥ 0, but this is shown later.)

Example Suppose a consumer's utility function is u(x, y), and he faces constraints x ≥ 0, y ≥ 0 and w = px + y. Suppose that ∇u(x, y) ≥ 0, so that the consumer's utility is weakly increasing in the amount of each good. Then we can form the Lagrangean

L(x, y, λ, µ) = u(x, y) − µx x − µy y − λ(px + y − w)

The FONCs are

ux(x∗, y∗) − µ∗x − λ∗p = 0
uy(x∗, y∗) − µ∗y − λ∗ = 0
−(px∗ + y∗ − w) = 0

and the complementary slackness conditions are

µ∗x x∗ = 0
µ∗y y∗ = 0

Now we hypothetically have 2² = 4 cases:

1. µ∗x, µ∗y > 0 (which will imply x∗ = 0 and y∗ = 0 by the complementary slackness conditions)
2. µ∗x = 0, µ∗y > 0 (which will imply y∗ = 0 by the complementary slackness conditions)
3. µ∗x > 0, µ∗y = 0 (which will imply x∗ = 0 by the complementary slackness conditions)
4. µ∗x = µ∗y = 0 (which places no restriction on x∗, y∗ ≥ 0)

We now solve the FONCs case-by-case:

1. In the first case, µ∗x, µ∗y > 0, we look at the corresponding complementary slackness conditions

µ∗x x∗ = 0
µ∗y y∗ = 0

For these to hold, it must be the case that x∗ = 0 and y∗ = 0. Substituting these into the FONCs yields

ux(0, 0) − µ∗x − λ∗p = 0
uy(0, 0) − µ∗y − λ∗ = 0
w = 0

which yields a contradiction, since w ≠ 0. This case provides no candidate solutions.

2. In the second case, µ∗x = 0, µ∗y > 0, complementary slackness requires y∗ = 0. Substituting this into the FONCs yields

ux(x∗, 0) − λ∗p = 0
uy(x∗, 0) − µ∗y − λ∗ = 0
−(px∗ − w) = 0

from which we get x∗ = w/p from the final equation. This case gives a candidate solution x∗ = w/p and y∗ = 0.

3.
In the third case, µ∗x > 0, µ∗y = 0, complementary slackness requires x∗ = 0. Substituting this into the FONCs yields

ux(0, y∗) − µ∗x − λ∗p = 0
uy(0, y∗) − λ∗ = 0
−(y∗ − w) = 0

from which we get y∗ = w from the final equation. This case gives a candidate solution x∗ = 0 and y∗ = w.

4. In the last case, µ∗x = µ∗y = 0, the complementary slackness conditions are satisfied automatically. Substituting this into the FONCs yields

ux(x∗, y∗) − λ∗p = 0
uy(x∗, y∗) − λ∗ = 0
−(px∗ + y∗ − w) = 0

which gives an interior solution x∗, y∗ > 0.

So the KT conditions produce three candidate solutions: all x and no y, all y and no x, and some of each. Which solution is selected will depend on whether ∇u(x, y) is strictly greater than zero or only weakly greater than zero in each good.

The complementary slackness conditions work as follows. Suppose we have an inequality constraint x ≤ 0, with associated Kuhn-Tucker multiplier µ∗x, so the complementary slackness condition is µ∗x x∗ = 0. If µ∗x > 0, it must be the case that x∗ = 0 for the complementary slackness condition to hold; if µ∗x = 0, then x∗ can take any feasible value. This is how KT theory systematically works through all the possibilities of binding constraints: if the KT multiplier is non-zero, the constraint must be binding, while if the KT multiplier is zero, the constraint can be slack.

Example Consider maximizing f(x, y) = x^α y^β subject to the inequality constraints

y + 2x ≤ 6
x + 2y ≤ 6
x ≥ 0
y ≥ 0

Then there are four inequality constraints. The set of points satisfying all four is not empty (graph it to check). However, the first two constraints only bind at the same time when x = y = 2; otherwise either the first is binding or the second, but not both.
This means that a Lagrangean with both constraints imposed as equalities is not the right tool: you could only satisfy both equations as equality constraints at the point x = y = 2, which may not be a solution. You could set up a series of problems where you treat each constraint as binding or non-binding, and solve the 2^4 resulting maximization problems that come from considering each subset of constraints separately, but this is exactly what Kuhn-Tucker does for you, in a systematic way. The Lagrangean (after taking logs of the objective, a monotone transformation) is

L = α log(x) + β log(y) − µ1(y + 2x − 6) − µ2(x + 2y − 6) − µ3 x − µ4 y

The first-order necessary conditions are

∂L/∂x = α/x − 2µ1 − µ2 − µ3 = 0
∂L/∂y = β/y − µ1 − 2µ2 − µ4 = 0

The complementary slackness conditions are

µ1(y + 2x − 6) = 0
µ2(x + 2y − 6) = 0
µ3 x = 0
µ4 y = 0

We now proceed by picking combinations of multipliers, setting µj > 0 if hj(x∗, y∗) = 0 and µj = 0 if the constraint is slack. This is what the "complementary" part of complementary slackness means. First, note that the gradient of the objective is strictly positive, so if x = 0 or y = 0 at the optimum, the solution occurs at an extreme point of the constraint set: if x = 0, then y must be as large as possible, and if y = 0, x must be as large as possible. Second, note that if y + 2x = 6 is binding (µ1 > 0), then x + 2y = 6 is slack (µ2 = 0), and vice versa, except at x = y = 2. So we really have five possibilities:

1. x = y = 2, with both y + 2x = 6 and x + 2y = 6 binding (µ1, µ2 ≥ 0) and µ3 = µ4 = 0
2. x, y > 0 with µ1 > 0 and µ2 = µ3 = µ4 = 0
3. x, y > 0 with µ2 > 0 and µ1 = µ3 = µ4 = 0
4. x > 0 and y = 0, with µ1, µ4 ≥ 0 and µ2 = µ3 = 0
5. y > 0 and x = 0, with µ2, µ3 ≥ 0 and µ1 = µ4 = 0

For each case, we find the set of critical points that satisfy the resulting first-order conditions:

1. The unique critical point is x = y = 2.

2. If x, y > 0 and µ1 > 0, the binding constraint is y + 2x = 6. The first-order necessary conditions become

α/x − 2µ1 = 0
β/y − µ1 = 0

This implies (why?) αy = 2βx. With the constraint, this implies

x∗ = 3α/(α + β), y∗ = 6β/(α + β)

3.
If x, y > 0 and µ2 > 0, the binding constraint is x + 2y = 6. The first-order necessary conditions become

α/x − µ2 = 0
β/y − 2µ2 = 0

Using the conditions and the constraint gives

x∗ = 6α/(α + β), y∗ = 3β/(α + β)

4. If y∗ = 0 and x∗ > 0, then x∗ is as large as possible, the constraint 2x + y = 6 binds, and x∗ = 3.

5. If x∗ = 0 and y∗ > 0, then the constraint x + 2y = 6 binds, and y∗ = 3.

So we have a candidate list of five potential solutions:

(2, 2), (3α/(α + β), 6β/(α + β)), (6α/(α + β), 3β/(α + β)), (3, 0), (0, 3)

To figure out which is the global maximum, you generally need to plug these back into the objective (checking feasibility against the constraints ignored in each case) and determine which point achieves the highest value. This will depend on the values of α and β.

Example Suppose an agent is trying to solve an optimal portfolio problem, where two assets, x and y, are available. Asset x has mean mx and variance vx, while asset y has mean my and variance vy. The covariance of x and y is vxy. He has wealth w, shares of x cost p, and shares of y cost 1, so his budget constraint is w = px + y, but x ≥ 0 and y ≥ 0, so he cannot short. The objective function is the mean minus the variance of the portfolio,

x mx + y my − (x²vx + y²vy − 2xy vxy)

Then the Lagrangean is

L = x mx + y my − (x²vx + y²vy − 2xy vxy) − λ(w − px − y) − µx x − µy y

The FONCs are

mx − (2x∗vx − 2y∗vxy) − λ∗p − µ∗x = 0
my − (2y∗vy − 2x∗vxy) − λ∗ − µ∗y = 0
−(w − px∗ − y∗) = 0

and the complementary slackness conditions are

µ∗x x∗ = 0
µ∗y y∗ = 0

Then there are 2² = 4 cases:

1. µ∗x = µ∗y = 0, placing no restrictions on x∗ and y∗, by the complementary slackness conditions
2. µ∗x = 0 and µ∗y > 0, so that y∗ = 0, by the complementary slackness conditions
3. µ∗x > 0 and µ∗y = 0, so that x∗ = 0, by the complementary slackness conditions
4. µ∗x > 0 and µ∗y > 0, so that x∗ = y∗ = 0, by the complementary slackness conditions

We solve case by case:

1.
If µ∗x = µ∗y = 0, then the FONCs are

mx − (2x∗vx − 2y∗vxy) − λ∗p = 0
my − (2y∗vy − 2x∗vxy) − λ∗ = 0
−(w − px∗ − y∗) = 0

which is a quadratic programming problem and can be solved by matrix methods. In particular,

y∗ = [w(vx + p vxy) − (p/2)(mx − p my)] / (vx + 2p vxy + p²vy)

2. If µ∗x = 0 and µ∗y > 0, then y∗ = 0 and the FONCs are

mx − 2x∗vx − λ∗p = 0
my + 2x∗vxy − λ∗ − µ∗y = 0
−(w − px∗) = 0

and the last equation gives a candidate solution x∗ = w/p and y∗ = 0.

3. This case is similar to the previous one, with x∗ = 0, and the budget constraint gives y∗ = w.

So the portfolio optimization problem produces three candidate solutions. Our candidate solution with x∗ and y∗ both positive requires

w(vx + p vxy) − (p/2)(mx − p my) > 0

If p = 1, this reduces to

w > (mx − my) / (2(vx + vxy))

so that the difference in means (mx − my) can't be too large, or the agent will prefer to purchase all x. If vx or vxy is large, however, the agent prefers to diversify, even if mx > my. The KT conditions allow us to pick out the various solutions and compare them, even for large numbers of assets (since this is just a quadratic programming problem).

11.1.1 The sign of the KT multipliers

We haven't yet shown that µj ≥ 0. To do this, consider relaxing the first inequality constraint:

h1(x) ≤ ε

where ε is a relaxation parameter that makes the constraint easier to satisfy. The Lagrangean for a problem with one equality constraint and J inequality constraints becomes

L = f(x) − λg(x) − µ1(h1(x) − ε) − Σ_{j=2}^{J} µj hj(x)

Let V(ε) be the value function in terms of ε. If we have relaxed the constraint, the agent's choice set is larger, and he must be weakly better off, so that V′(ε) ≥ 0. But by the envelope theorem,

V′(ε) = ∂L(x∗, λ∗, µ∗)/∂ε = µ∗1 ≥ 0

So the KT multipliers must be weakly positive, as long as your constraints are always written as hj(x) ≤ 0.
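The interior candidate of the portfolio example can be spot-checked numerically by substituting the budget constraint into the objective and maximizing over a grid. A minimal sketch; the parameter values are my own illustrative choices, and the closed-form expression is the interior solution for y∗:

```python
import numpy as np

# Illustrative parameter values (my own choices, not from the text)
mx, my, vx, vy, vxy = 0.10, 0.05, 0.04, 0.02, 0.01
p, w = 1.0, 1.0

def U(x):
    # substitute y = w - p*x (budget constraint) into the objective
    y = w - p * x
    return x * mx + y * my - (x**2 * vx + y**2 * vy - 2 * x * y * vxy)

# brute-force grid maximization over the no-shorting region 0 <= x <= w/p
xs = np.linspace(0.0, w / p, 200001)
x_num = xs[np.argmax(U(xs))]

# closed-form interior candidate from the FONCs
D = vx + 2 * p * vxy + p**2 * vy
y_cf = (w * (vx + p * vxy) - (p / 2) * (mx - p * my)) / D
x_cf = (w - y_cf) / p
```

With these parameters both x_cf and y_cf come out strictly positive, so the interior case applies and the grid maximizer agrees with the closed form.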
11.2 Second-Order Sufficient Conditions and the IFT

Some good news: since KT maximization is essentially the same as Lagrange maximization once you have fixed the set of binding constraints, the SOSCs and implicit function theorem for an inequality-constrained problem are the same as those for the equality-constrained maximization problem in which the binding constraints are imposed as equality constraints and the slack constraints are ignored. Remember: KT is just a formal way of working through the process of testing all the possible combinations of binding constraints. So for a particular set of binding constraints, the problem is equivalent to an equality-constrained problem where these are the only constraints, and to check whether a critical point is a local maximum or to compute comparative statics, we use exactly the same approach as in an equality-constrained problem.

The envelope theorem is similar, but you have to be careful. In an inequality-constrained problem there are potentially many local solutions for some values of the exogenous variables. The envelope theorem is computed from the maximum over all of them, and as you vary the exogenous variables, you have to be careful that you don't drift into another solution's territory. Locally, there is not much to worry about, since you are almost never on such a boundary (in the sense of Lebesgue measure zero). But suppose you want to plot the maximized value of the objective function for an agent making a savings decision across a variety of interest rates, where a borrowing constraint binds for some values of the interest rate but not for others: the set of binding constraints, and therefore the formula for the value function, switches as the interest rate varies. However, this upper envelope over all maximization solutions is at least continuous (though not necessarily differentiable at "kinks" where the regime switches):

Theorem 11.2.1 (Theorem of the Maximum) Consider maximizing f(x, c) over x, where the choice set defined by the equality and inequality constraints is compact and potentially depends on c.
Then f(x∗(c), c) is a continuous function of c.

The theorem of the maximum is a kind of generalization of the envelope theorem that says, "While V(c) may not be differentiable because of the kinks where the set of binding constraints shifts, the value function will always be at least continuous in a parameter c."

Exercises

1. Suppose an agent has the following utility function:

u(x1, x2) = √x1 + √x2

and faces a linear constraint 1 = px1 + x2 and non-negativity constraints x1 ≥ 0, x2 ≥ 0.

i. Verify the objective function is concave in (x1, x2).
ii. Graph the budget set for various prices p, and some of the upper contour sets.
iii. Solve the agent's optimization problem for all p > 0. Check that the second-order conditions are satisfied at the optimum.
iv. Sketch the set of maximizers as a function of p, x1(p).
v. Solve the agent's problem when u(x1, x2) = √x1 + √(x2 + 1). Is it possible to get corner solutions in this case? Explain the difference between the old objective function and the new one.

2. Suppose a firm hires capital K and labor L to produce output using technology q = F(K, L), with ∇F(K, L) ≥ 0. The price of capital is r and the price of labor is w. Solve the cost minimization problem

min_{K,L} rK + wL

subject to F(K, L) ≥ q̄, K ≥ 0 and L ≥ 0. Check the SOSCs or otherwise prove that your solutions are local minima subject to the constraints. How does (K∗, L∗) vary in r and q̄ at each solution? How does the firm's cost function vary in q̄?

3. Warning: This question is better approached by thinking without a Lagrangean than by trying to use KT theory. A manager has two factories at his disposal, a and b. The cost function for factory a is Ca(qa), where Ca(0) = 0 and Ca(qa) is increasing, differentiable, and convex. The cost function for factory b is

Cb(qb) = 0 if qb = 0;  cb·qb + F if qb > 0

i. Is factory b's cost function convex? Briefly explain the economic intuition of the property Cb(0) = 0 but lim_{qb↓0} Cb(qb) = F.
ii.
Solve for the cost-minimizing production plan (qa, qb) that achieves q̄ units of output: qa + qb = q̄, qa ≥ 0, qb ≥ 0. Is the cost function C∗(q̄) continuous? Convex? Differentiable?
iii. Let p be the price of the good. If the firm's profit function is π(q) = pq − C∗(q), what is the set of maximizers of π(q)?

4. Suppose an agent has the following utility function:

u(x1, x2) = √(x1·x2) if x2 > x1;  (x1 + x2)/2 if x2 ≤ x1

and faces a linear constraint 1 = px1 + x2 and non-negativity constraints x1 ≥ 0, x2 ≥ 0.

(i.) Verify that these preferences are continuous but non-differentiable along the ray x1 = x2.
(ii.) Graph the budget set for various prices p, and some of the upper contour sets.
(iii.) Solve the agent's optimization problem for all p > 0.
(iv.) Sketch the set of maximizers as a function/correspondence of p, x1(p).

5. Suppose an agent gets utility from consumption, c ≥ 0, and leisure, ℓ ≥ 0. He has one unit of time, which can also be spent working, h ≥ 0. From working, he gets a wage of w per hour, and his utility function is

u(c, ℓ)

However, he faces taxes, which take the form

t(wh) = 0 if wh < t0;  τ·wh if wh ≥ t0

Therefore, his income is wh, consumption costs pc, and he is taxed, linearly at rate τ, based on whether his income is above or below the threshold t0.

i. Sketch the agent's budget set. Is it a convex set?
ii. Formulate a maximization problem and characterize any first-order necessary conditions or complementary slackness conditions.
iii. Characterize the agent's behavior as a function of w and t0. What implications does this model have for the design of tax codes?
iv. If the tax took the form

t(wh) = 0 if wh < t0;  τ if wh ≥ t0

sketch the agent's budget set. Is it convex? What implications does the geometry of this budget set have for consumer behavior?

6. A consumer has wealth w, and access to a riskless asset with return R > 1, and a risky asset with return r̃ ∼ N(r, σ²).
He places φ1 of his wealth into the riskless asset and φ2 of his wealth into the risky asset:

w = φ1 + φ2

and he maximizes the mean less the variance of his returns:

U(φ1, φ2) = φ1 R + φ2 r − (γ/2) φ2² (σ² + r²)

(i) Find conditions on γ, R, r, σ² so the agent holds some of both the risky and riskless assets, and solve for the optimal portfolio weights, (φ∗1, φ∗2).
(ii) How do the optimal portfolio weights vary with r, σ² and R?
(iii) How does his payoff vary with w and R when he uses the optimal portfolio?

Chapter 12

Concavity and Quasi-Concavity

Checking second-order sufficient conditions for equality- and inequality-constrained maximization problems is often outrageously painful. It is time-consuming and error-prone, and the pain increases exponentially in the number of choice variables. For this reason, it would be very helpful to know when the second-order sufficient conditions for multi-dimensional maximization problems are automatically satisfied.

There are two classes of functions that are useful for maximization: concave and quasi-concave. (The corresponding classes for minimization are convex and quasi-convex.) In the one-dimensional case, strict concavity meant that f″(x) < 0 for all x. In the multi-dimensional case, strict concavity will similarly mean that y′∇xx f(x)y < 0 for all x and all y ≠ 0, and it follows that the second-order conditions are satisfied. This is the best case.

However, recall that if x∗ is a local maximum of f(x), then for any strictly increasing function g(), x∗ is also a local maximum of g(f(x)). This is nice, because it means that the "units" of f(x) are irrelevant to the set of maximizers. But concavity and convexity are not preserved under monotone transformations. For example, log(x) is one of our prototype concave functions. But if we take the strictly increasing transformation g(y) = (e^y)², we get

g(log(x)) = x²

which is a convex function.
This has the potential to cause problems for us as economists, because we want the set of maximizers to be independent of how we describe f(x), up to increasing transformations (that lets us claim that we don't need "utils" or cardinal utility to measure people's preferences; we can just observe how they behave). We want our theories to be based on ordinal utility, meaning that the numbers provided by a utility function are meaningless in themselves, and only serve to verify whether one option is better than another. This is where quasi-concavity comes in. It is a property similar to concavity that is preserved under monotone transformations, and it comes with the same convenient theorems built in.

12.1 Some Geometry of Convex Sets

For analyzing maximization problems, it is helpful to consider the behavior of the "better than" sets:

Definition 12.1.1 The upper contour sets of a function are the sets

UC(a) = {x : f(x) ≥ a}

Now, we need to be careful about the difference between convex sets and convex functions, since we're about to use both words side by side quite a bit.

Definition 12.1.2 A set S is convex if, for any x′ and x′′ in S and any λ ∈ [0, 1], the point xλ = λx′ + (1 − λ)x′′ is also in S. The interior of a convex set S, int(S), is the set of points x for which there is a ball Bδ(x) ⊂ S. If every convex combination of distinct points of S is in the interior of S, then S is a strictly convex set.

(Figure: Convexity)

Convex sets are well behaved since you can never drift outside the set along a chord between two points. Geometrically, if your upper contour sets are convex and the constraint set is strictly convex, then we should be able to "separate" them, like this:

(Figure: Separation of Constraint and Upper-contour Sets)

This would mean there is a unique maximizer, since there is a unique point of tangency between the indifference curves and the constraint set. Why?
If we pick a point x0 on an indifference curve and some point x1 better than x0, all of the options along the path from x0 to x1 are better than x0. We might express this more formally by saying that for λ ∈ [0, 1], the options xλ = λx1 + (1 − λ)x0 are all better than x0, or that the upper contour sets are convex sets. This is a key geometric feature of well-behaved constrained maximization problems.

12.2 Concave Programming

Concave functions are the best-case scenario for optimization, since they imply that first-order necessary conditions are sufficient for a critical point to be a global maximum (this makes checking second-order sufficient conditions unnecessary, and life is much easier).

Definition 12.2.1 A function f(x) is concave if for all x′ and x′′ and all λ ∈ [0, 1],

f(λx′ + (1 − λ)x′′) ≥ λf(x′) + (1 − λ)f(x′′)

It is strictly concave if the inequality is strict for all λ ∈ (0, 1) and x′ ≠ x′′.

(Figure: Concave functions)

This is a natural property in economics: agents often prefer variety, consuming the bundle λx′ + (1 − λ)x′′, to consuming either of two "extreme" bundles. For example, which sounds better: five apples and five oranges, or ten apples with probability 1/2 and ten oranges with probability 1/2? If you say "five apples and five oranges," you have concave preferences.

However, there are many equivalent ways to characterize a concave function:

Definition 12.2.2 The following are equivalent. Let D ⊂ RN be the domain of f().

• f(x) is concave on D
• For all x′, x′′ ∈ D and all λ ∈ (0, 1), f(λx′ + (1 − λ)x′′) ≥ λf(x′) + (1 − λ)f(x′′)
• For all x′, x′′ ∈ D, f(x′) − f(x′′) ≤ ∇f(x′′)(x′ − x′′)
• The Hessian of f(x) satisfies y′H(x)y ≤ 0 for all x ∈ D and y ∈ RN

If the weak inequalities above are replaced with strict inequalities, the function is strictly concave.
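The chord characterization above is easy to spot-check numerically. A minimal sketch for the prototype concave function log(x), evaluated on random pairs of points; this is an illustration of the definition, not a proof:

```python
import numpy as np

# Spot-check the chord definition of concavity for f(x) = log(x):
# f(lam*x1 + (1-lam)*x2) >= lam*f(x1) + (1-lam)*f(x2) should always hold.
rng = np.random.default_rng(1)
x1 = rng.uniform(0.1, 10.0, 10000)
x2 = rng.uniform(0.1, 10.0, 10000)
lam = rng.uniform(0.0, 1.0, 10000)

lhs = np.log(lam * x1 + (1 - lam) * x2)          # f at the convex combination
rhs = lam * np.log(x1) + (1 - lam) * np.log(x2)  # the chord value
violations = int(np.sum(lhs < rhs - 1e-12))
```

Every random draw should satisfy the inequality, so `violations` should be zero; running the same check on a convex function such as x² would produce many violations.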
The Hessian characterization (concave if H(x) is negative semi-definite for all x in the domain of f(x), and strictly concave if H(x) is negative definite for all x in the domain of f(x)) is extremely useful, since the Hessian is so ubiquitous in proving SOSCs. If the objective function is strictly concave, then at a critical point x∗ (where ∇f(x∗) = 0),

f(x) = f(x∗) + (1/2)(x − x∗)′H(x∗)(x − x∗) + o(‖x − x∗‖²)

so for x ≠ x∗ close to x∗,

f(x∗) − f(x) = −(1/2)(x − x∗)′H(x∗)(x − x∗) − o(‖x − x∗‖²) > 0

since y′H(x∗)y < 0 for any y ≠ 0. Then we can conclude that f(x∗) > f(x), so that the critical point x∗ is a local maximum.

Theorem 12.2.3

• Consider an unconstrained maximization problem maxx f(x). If f(x) is concave, any critical point is a global maximum of f(x). If f(x) is strictly concave, any critical point is the unique global maximum.
• Consider an inequality-constrained maximization problem maxx f(x) subject to hk(x) ≤ 0, for k = 1, ..., K. If f(x) is concave and the constraint set is convex, any critical point of the Lagrangean is a global maximum. If f(x) is strictly concave and the constraint set is convex, any critical point of the Lagrangean is the unique global maximum.

Note that for Kuhn-Tucker, the situation is slightly more complicated than it might appear: the objective function might be strictly concave, so that for each set of active constraints there is a unique global maximizer (since each case is just an equality-constrained maximization problem), but these candidates must still be compared somehow.

12.3 Quasi-Concave Programming

Concavity is a cardinal property, not an ordinal one, and we would like to have a generalization of concavity that is merely ordinal. For example, if we consider x^α y^β and compute the Hessian, we get

[ α(α − 1)x^{α−2}y^β    αβx^{α−1}y^{β−1}   ]
[ αβx^{α−1}y^{β−1}      β(β − 1)x^α y^{β−2} ]

which is not negative semi-definite, so the function is not concave, if α + β > 1.
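The Hessian characterization can itself be checked numerically. For a Cobb-Douglas function x^a y^b with a + b < 1 (concave on the positive orthant; the exponents are my own illustrative choices), the largest eigenvalue of the analytic Hessian should be negative everywhere on a grid:

```python
import numpy as np

# Analytic Hessian of f(x, y) = x^a y^b with a + b < 1 (illustrative values)
a, b = 0.3, 0.5

def hessian(x, y):
    return np.array([
        [a * (a - 1) * x**(a - 2) * y**b,  a * b * x**(a - 1) * y**(b - 1)],
        [a * b * x**(a - 1) * y**(b - 1),  b * (b - 1) * x**a * y**(b - 2)],
    ])

# negative definiteness <=> all eigenvalues strictly negative
grid = np.linspace(0.1, 5.0, 25)
max_eig = max(np.linalg.eigvalsh(hessian(x, y)).max()
              for x in grid for y in grid)
```

Repeating the check with a + b > 1 would turn up positive eigenvalues, matching the claim above that such a Cobb-Douglas function is not concave.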
But f(x, y) = xy still has convex upper contour sets, so we would expect it to be well-behaved for maximization purposes, even though it isn't concave (see Proposition 12.3.2 below). To deal with this, we introduce a new class of functions, quasi-concave functions.

We're going to motivate quasi-concavity in a somewhat roundabout way. Recall that for a function f(x, y), the indifference curve x(y) was defined as the implicit solution to the equation

f(x(y), y) = c

Now, if the indifference curves x(y) were

1. Downward sloping, so that taking some y away from the agent requires increasing x to keep him on the same indifference curve
2. Convex functions, so that taking a lot of y away from the agent requires giving him ever higher quantities of x to compensate him
3. Invariant to strictly increasing transformations

we would have the right geometric properties for maximization without the cardinal baggage that comes with assuming concavity. When are 1-3 satisfied? Well, for a strictly increasing transformation g(), the indifference curves of g(f(x(y), y)) are defined by

g′(f(x(y), y))(fx(x(y), y)x′(y) + fy(x(y), y)) = 0

which, since g′ > 0, are the same as those generated by f(x(y), y). So far so good. Now, the derivative of x(y) is

x′(y) = −fy(x(y), y)/fx(x(y), y)

which is negative whenever fx, fy > 0. The second derivative is

x′′(y) = −[(fyx x′(y) + fyy)fx − (fxx x′(y) + fxy)fy] / fx²

so x′′(y) > 0 requires

(fyx x′(y) + fyy)fx − (fxx x′(y) + fxy)fy < 0

Substituting x′(y) = −fy/fx and multiplying through by fx > 0, this becomes

fyy fx² + fxx fy² − 2fxy fx fy < 0

which doesn't appear to be anything at first glance. Actually, it is the negative of the determinant of the matrix

[ 0    fx    fy  ]
[ fx   fxx   fxy ]
[ fy   fyx   fyy ]

So if this matrix is negative semi-definite, the indifference curves x(y) will be convex functions, and if it is negative definite, the indifference curves x(y) will be strictly convex functions. This is the idea of quasi-concavity.

Definition 12.3.1 The following are equivalent. Let D be the domain of f(x).
• f(x) is quasi-concave on D
• For every real number a, the upper contour sets of f(x), UC(a) = {x : f(x) ≥ a}, are convex
• The bordered Hessian

H(x) = [ 0          ∇x f(x)′  ]
       [ ∇x f(x)   ∇xx f(x)  ]

is negative semi-definite on D, so that y′H(x)y ≤ 0 for all x ∈ D and all y ∈ RN
• For all λ ∈ [0, 1] and x′, x′′ ∈ D, f(λx′ + (1 − λ)x′′) ≥ min{f(x′), f(x′′)}
• For all x′, x′′ ∈ D, if f(x′) ≥ f(x′′), then ∇f(x′′)(x′ − x′′) ≥ 0

If the weak inequalities above are replaced with strict inequalities, the function is strictly quasi-concave.

Now, the above characterizations are the usual ones, but the following provides what is usually the easiest route to proving quasi-concavity of a function:

Proposition 12.3.2 Let g() be an increasing function. If g(f(x)) is concave, then f(x) is quasi-concave.

Proof If g(f(x)) is concave, then

g(f(λx + (1 − λ)x′)) ≥ λg(f(x)) + (1 − λ)g(f(x′))

Without loss of generality, let g(f(x)) ≤ g(f(x′)); since g is increasing, this implies f(x) ≤ f(x′). But then

g(f(λx + (1 − λ)x′)) ≥ λg(f(x)) + (1 − λ)g(f(x′)) ≥ g(f(x))

and, since g is increasing, we get

f(λx + (1 − λ)x′) ≥ f(x) = min{f(x), f(x′)}

so that f() is quasi-concave.

For example, this works with f(x, y) = x^α y^β with α + β > 1: take the monotone transformation g(z) = z^{1/(α+β+1)}, which "concavifies" the function, since the exponents of g(f(x, y)) = x^{α/(α+β+1)} y^{β/(α+β+1)} now sum to less than one. The concavity of g(f(x, y)) is then easy to show using the standard tools for concavity, which are somewhat easier than those for quasi-concavity.

Quasi-concavity is useful because of the following theorem:

Theorem 12.3.3 Consider an inequality-constrained maximization problem maxx f(x) subject to hk(x) ≤ 0 for k = 1, ..., K. If f(x) is quasi-concave and the constraint set is convex, any critical point of the Lagrangean is a global maximum. If f(x) is strictly quasi-concave and the constraint set is convex, any critical point of the Lagrangean is the unique global maximum.

Note that in unconstrained problems, quasi-concavity might not be enough to guarantee that a critical point is a maximum.
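The min characterization in Definition 12.3.1 gives a quick numerical spot-check of quasi-concavity. For instance, f(x, y) = x²y² is not concave on the positive orthant, but it is a monotone transformation of the concave function log x + log y, so by Proposition 12.3.2 it should pass the check (an illustration, not a proof):

```python
import numpy as np

# Spot-check quasi-concavity of f(x, y) = x^2 * y^2 on the positive orthant
# via the min characterization: f(l*z1 + (1-l)*z2) >= min(f(z1), f(z2)).
rng = np.random.default_rng(0)
f = lambda z: z[0]**2 * z[1]**2

violations = 0
for _ in range(10000):
    z1 = rng.uniform(0.1, 5.0, 2)
    z2 = rng.uniform(0.1, 5.0, 2)
    lam = rng.uniform()
    if f(lam * z1 + (1 - lam) * z2) < min(f(z1), f(z2)) - 1e-9:
        violations += 1
```

No random pair should violate the inequality; a function that is not quasi-concave, such as x² + y² restricted to a line through the origin's complement, would fail this check on many draws.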
For example, the function f(x) = x³ is quasi-concave, since the upper contour sets UC(a) = {x : x³ ≥ a} are convex sets (they are intervals of the form [a^{1/3}, ∞)). It has a critical point at x = 0, but fails to achieve a maximum.

To check that a function is quasi-concave, let Hk(x) be the bordered Hessian built from the first k − 1 variables,

Hk(x) =
[ 0              ∂f/∂x1             ∂f/∂x2             ...   ∂f/∂x_{k−1}           ]
[ ∂f/∂x1         ∂²f/∂x1²           ∂²f/∂x1∂x2         ...   ∂²f/∂x1∂x_{k−1}       ]
[ ∂f/∂x2         ∂²f/∂x2∂x1         ∂²f/∂x2²           ...   ∂²f/∂x2∂x_{k−1}       ]
[ ...            ...                ...                ...   ...                   ]
[ ∂f/∂x_{k−1}    ∂²f/∂x_{k−1}∂x1    ∂²f/∂x_{k−1}∂x2    ...   ∂²f/∂x_{k−1}²         ]

Then we have an alternating sign test on the determinants of the leading principal minors:

Theorem 12.3.4 (Alternating Sign Test) If f is quasi-concave, then the determinants of the leading principal minors of the bordered Hessian alternate in sign, starting with det H3(x) ≥ 0, det H4(x) ≤ 0, and so on. Conversely, if the leading principal minors of the bordered Hessian alternate in sign, starting with det H3(x) > 0, det H4(x) < 0, and so on, then f is quasi-concave.

12.4 Convexity and Quasi-Convexity

Convex and quasi-convex functions play the same role in minimization theory that concave and quasi-concave functions play in maximization theory.

Definition 12.4.1 The following are equivalent:

• f(x) is convex
• For all x′ and x′′ and all λ ∈ [0, 1], f(λx′ + (1 − λ)x′′) ≤ λf(x′) + (1 − λ)f(x′′)
• For all x′ and x′′, f(x′) − f(x′′) ≥ ∇f(x′′)(x′ − x′′)
• The Hessian of f(x) satisfies y′H(x)y ≥ 0 for all x in the domain of f(x) and y ∈ RN

If the weak inequalities above are replaced with strict inequalities, the function is strictly convex.

and

Definition 12.4.2 The following are equivalent:

• f(x) is quasi-convex
• For every real number a, the lower contour sets of f(x), LC(a) = {x : f(x) ≤ a}, are convex
• The bordered Hessian

H(x) = [ 0          ∇x f(x)′  ]
       [ ∇x f(x)   ∇xx f(x)  ]

is positive semi-definite, y′H(x)y ≥ 0 for all y
• For all λ in (0, 1), f (λx′ + (1 − λ)x′′) ≤ max{f (x′), f (x′′)}

• If f (x′) ≤ f (x′′), then ∇f (x′′)(x′ − x′′) ≤ 0

If the weak inequalities above are replaced with strict inequalities, the function is strictly quasi-convex.

and we have

Theorem 12.4.3

• Consider an unconstrained minimization problem min_x f (x). If f (x) is convex, then any critical point of f (x) is a global minimizer of f (x). If f (x) is strictly convex, any critical point is the unique global minimizer.

• Consider an inequality-constrained minimization problem min_x f (x) subject to h_k(x) ≤ 0 for k = 1, ..., K. If f (x) is convex or quasi-convex and the constraint set is convex, any critical point of the Lagrangian is a global minimum. If f (x) is strictly convex or strictly quasi-convex and the constraint set is convex, any critical point of the Lagrangian is the unique global minimum.

The alternating sign test for quasi-convexity is slightly different:

Theorem 12.4.4 (Alternating Sign Test) If f is quasi-convex, then the determinants of the leading principal minors of the bordered Hessian are all weakly negative. If the leading principal minors of the bordered Hessian are all strictly negative, then f is quasi-convex.

12.5 Comparative Statics with Concavity and Convexity

There are some common tricks for deriving comparative statics relationships under the assumptions of concavity and convexity.

Example Suppose a firm chooses inputs z = (z₁, z₂) whose costs per unit are w = (w₁, w₂), subject to a production constraint F (z₁, z₂) = q. Let

c(w, q) = min_z w · z subject to F (z) = q

First, note that c(w, q) is concave in w. How do we prove this? Suppose z′ minimizes costs at w′ and z′′ minimizes costs at w′′. Now consider the price vector wλ = λw′ + (1 − λ)w′′, and its cost-minimizing solution zλ.
Then

c(wλ, q) = wλ · zλ = (λw′ + (1 − λ)w′′) · zλ = λw′ · zλ + (1 − λ)w′′ · zλ

But by definition w′ · zλ ≥ w′ · z′ and w′′ · zλ ≥ w′′ · z′′, so that

wλ · zλ ≥ λw′ · z′ + (1 − λ)w′′ · z′′

or

c(λw′ + (1 − λ)w′′, q) ≥ λc(w′, q) + (1 − λ)c(w′′, q)

so that the cost function is concave.

Second, we use the envelope theorem to differentiate the cost function with respect to w:

∇_w c(w, q) = z(w, q)

or
\[
\begin{pmatrix} c_{w_1}(w_1, w_2, q) \\ c_{w_2}(w_1, w_2, q) \end{pmatrix}
= \begin{pmatrix} z_1(w_1, w_2, q) \\ z_2(w_1, w_2, q) \end{pmatrix}
\]
and again with respect to w to get

∇_{ww} c(w, q) = ∇_w z(w, q)

or
\[
\begin{pmatrix} c_{w_1 w_1}(w, q) & c_{w_1 w_2}(w, q) \\ c_{w_2 w_1}(w, q) & c_{w_2 w_2}(w, q) \end{pmatrix}
= \begin{pmatrix} \partial z_1(w, q)/\partial w_1 & \partial z_1(w, q)/\partial w_2 \\ \partial z_2(w, q)/\partial w_1 & \partial z_2(w, q)/\partial w_2 \end{pmatrix}
\]
Finally, since c(w, q) is concave in w, every element on the diagonal of its Hessian is weakly negative. Therefore, every element on the diagonal of ∇_w z(w, q) must also be weakly negative, or

∂z_k(w, q)/∂w_k = ∂²c(w₁, w₂, q)/∂w_k² ≤ 0

so that if w_k ↑ then z_k ↓. This exercise would actually be pretty difficult without the knowledge that c(w, q) is concave, and that is the key.

Example Suppose a consumer buys bundles q = (q₁, ..., q_N) of goods, has utility function u(q, m) = v(q) + m, and budget constraint w = p′q + m. Suppose v(q) is concave. Then the objective function is

max_q v(q) − p′q + w

The FONCs are

∇_q v(q∗) − p = 0

and totally differentiating with respect to p yields

∇_{qq} v(q∗)∇_p q∗ − I = 0

and

∇_p q∗ = [∇_{qq} v(q∗)]^{−1}

Since v(q) is concave, its Hessian is negative definite (assuming it is invertible), and the inverse of a negative definite matrix is negative definite, so all the entries on its diagonal are weakly negative. Therefore,

∂q_k∗(p)/∂p_k ≤ 0

Again, the concavity of the objective function is what makes the last part do-able. Without knowing that [∇_{qq} v(q∗)]^{−1} is negative semi-definite by the concavity of v(q), we would be unable to determine the sign of the comparative statics of each good with respect to a change in its own price.
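The cost-function argument can be illustrated with a concrete technology. For F(z₁, z₂) = √(z₁z₂) = q (an assumed example, not one from the notes), cost minimization has the closed form z₁ = q√(w₂/w₁), z₂ = q√(w₁/w₂), and c(w, q) = 2q√(w₁w₂). The sketch below checks that this c is concave in w along a segment and that the own-price input demand slopes down, exactly as the comparative statics above predict:

```python
import math

def z(w1, w2, q):
    """Cost-minimizing inputs for F(z1, z2) = sqrt(z1*z2) = q (closed form)."""
    return (q * math.sqrt(w2 / w1), q * math.sqrt(w1 / w2))

def c(w1, w2, q):
    """Minimized cost: w1*z1 + w2*z2 = 2*q*sqrt(w1*w2)."""
    return 2.0 * q * math.sqrt(w1 * w2)

# Sanity check: the closed-form inputs are feasible (they produce q).
z1, z2 = z(1.0, 4.0, 2.0)
feasible = abs(math.sqrt(z1 * z2) - 2.0) < 1e-9

# (i) Concavity in w along a segment:
#     c(l*w' + (1-l)*w'', q) >= l*c(w', q) + (1-l)*c(w'', q)
wA, wB, q, l = (1.0, 4.0), (9.0, 1.0), 2.0, 0.3
wL = (l * wA[0] + (1 - l) * wB[0], l * wA[1] + (1 - l) * wB[1])
concave_ok = c(wL[0], wL[1], q) >= l * c(wA[0], wA[1], q) + (1 - l) * c(wB[0], wB[1], q)

# (ii) Raising w1 lowers the cost-minimizing z1 (holding w2 and q fixed).
z1_low = z(1.0, 4.0, q)[0]
z1_high = z(2.0, 4.0, q)[0]
demand_slopes_down = z1_high < z1_low
```

Both properties hold here by construction; the point of the envelope-theorem argument above is that they hold for any technology, without ever solving for z(w, q) explicitly.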
12.6 Convex Sets and Convex Functions

Since the relationship between a convex set and a convex function is so important, this section focuses on that alone. This is somewhat repetitive after the previous content of the chapter, but there are some new results.

Definition 12.6.1
1. A set S is convex if for all x′, x′′ ∈ S and all λ ∈ (0, 1), λx′ + (1 − λ)x′′ ∈ S.
2. A function F is convex if for all x′, x′′ in the domain of F and all λ ∈ (0, 1), F (λx′ + (1 − λ)x′′) ≤ λF (x′) + (1 − λ)F (x′′)

In economics we often have statements like, "For the utility maximization problem, if u(x) is quasi-concave, then the upper contour sets of u(x) are convex, and the set of maximizers is a convex set." This ties together a bunch of objects that all share some aspect of convexity or concavity: (i) the function u(x) is quasi-concave, (ii) since u(x) is quasi-concave, its upper contour sets are convex, (iii) since the upper contour sets are convex and the constraint set is convex (p · x ≤ w, x ≥ 0), the separating hyperplane theorem guarantees we can separate them, (iv) the set of solutions to the UMP will itself be a convex set. Take care not to lose track of which object is being referred to: the function u(x), the sets UC(α), or the set of maximizers x(p, w).

There are some equivalent definitions of convexity: The following are equivalent:
1. A function F is convex
2. For all x′, x′′ in the domain of F, F (λx′ + (1 − λ)x′′) ≤ λF (x′) + (1 − λ)F (x′′) (convexity inequality)
3. For all x in the domain of F, the Hessian at x satisfies y′H(x)y ≥ 0 for all y ≠ 0 (second-order test)
4. For all x′, x′′ in the domain of F, F (x′) + ∇F (x′)(x′′ − x′) ≤ F (x′′) (first-order test)

If the weak inequalities in the above are made strict, the function is strictly convex. Now note that if a function is convex, it is automatically quasi-convex. Why is that?
Well, a function is quasi-convex if

F (λx′ + (1 − λ)x′′) ≤ max{F (x′), F (x′′)}

But since λF (x′) + (1 − λ)F (x′′) ≤ max{F (x′), F (x′′)}, a convex function automatically satisfies the inequality defining quasi-convexity. Therefore, we can always exploit information about quasi-convex functions as well: The following are equivalent:
1. A function F is quasi-convex
2. For all x′, x′′ in the domain of F, F (λx′ + (1 − λ)x′′) ≤ max{F (x′), F (x′′)} (quasi-convexity inequality)
3. For all x in the domain of F, the bordered Hessian at x satisfies y′BH(x)y ≥ 0 for all y ≠ 0 (second-order test)
4. If F (x′) ≤ F (x′′), then ∇F (x′′)(x′ − x′′) ≤ 0 (first-order test)
5. The lower contour sets of F (), LC(a) = {x : F (x) ≤ a}, are convex sets

If the above weak inequalities are made strict, the function is strictly quasi-convex.

Note the important fact that since convex functions are quasi-convex, their lower contour sets are convex sets. However, the hassle of computing determinants is exponential in the number of rows and columns, so it is easier to use the second-order test for convex functions than the second-order test for quasi-convex functions. Quasi-convex functions, therefore, have slightly more straightforward geometric features, while convex functions are easier to work with analytically.

Note that for optimization, quasi-convexity and minimization go together perfectly: If the lower contour sets of F (x) are convex and our feasible/constraint set is a convex set, we can use the separating hyperplane theorem to prove that the "better than" set can be separated from the feasible set. If the function is strictly quasi-convex, we even get uniqueness of the solution (this was proved in class for a quasi-concave function and a convex feasible set, but the details are almost exactly the same).

1. A sum of convex functions is convex, but a sum of quasi-convex functions is not necessarily quasi-convex
2.
An intersection of convex sets is a convex set
3. A function F is quasi-convex if there is an increasing transformation g such that g(F (x)) is convex
4. If a function F is convex and g is a strictly increasing, convex function, then g(F (x)) is convex
5. A function F is convex if −F is concave (so this all applies to concave functions with a suitable reworking of the direction of the inequalities)

With these facts around, there are two methods of proving a set is convex:
1. Take two arbitrary elements in the set and show their convex combination is in the set
2. Show that the set is the lower contour set of a convex or quasi-convex function

The most useful tools are the convexity inequalities, the second-order tests, and points 3 and 4 of the list above. I have rarely used the other items myself, but they are potentially useful alternatives when more obvious approaches fail.

Note that you want to be very clear about what set or function you are trying to prove is convex. For example, if you want to prove that the set S(α) = {x : F (x) ≤ α} is convex, you are considering the lower contour sets of F (x), which are sets in R^N. This would come up in a situation where, perhaps, you are minimizing and want to invoke a result about separating hyperplanes. On the other hand, you might have a function F (x) = y and be interested in the set S = {(x, y) : F (x) − y ≤ 0}, which is a set in R^{N+1}. An example of this is the production sets from class, for instance.

Since proving functions are concave or convex is something we've talked about before — x^α y^β is concave when α, β > 0 and α + β ≤ 1, for instance — let's focus on the second kind of problem: Proving that sets of the form S = {(x, y) : F (x) − y ≤ 0} are convex.

Example Consider the function f (x) = x². We will show that the set S = {(x, y) : y ≥ f (x)} is convex (sketch this set).
To start, take any two points that satisfy

y ≥ x², y′ ≥ x′²

Then for all λ ∈ (0, 1),

λy + (1 − λ)y′ ≥ λx² + (1 − λ)x′²

But since g(z) = z² is a convex function (g′(z) = 2z, g′′(z) = 2 > 0), we then have

λy + (1 − λ)y′ ≥ λx² + (1 − λ)x′² ≥ (λx + (1 − λ)x′)²

so that

λy + (1 − λ)y′ ≥ (λx + (1 − λ)x′)²

and the convex combination of the points (x, y) and (x′, y′) lies in S, and S is convex.

Now consider the function f (x) = √x. We will show that the set S = {(x, y) : y ≤ f (x)} is convex. Suppose

y ≤ √x, y′ ≤ √x′

Then for λ ∈ (0, 1),

λy + (1 − λ)y′ ≤ λ√x + (1 − λ)√x′

But since h(z) = √z is a concave function (h′(z) = .5z^{−.5}, h′′(z) = −.25z^{−1.5} < 0), we have

λ√x + (1 − λ)√x′ ≤ √(λx + (1 − λ)x′)

or

λy + (1 − λ)y′ ≤ √(λx + (1 − λ)x′)

so that if (x, y) and (x′, y′) are in the set S, then the convex combination (λx + (1 − λ)x′, λy + (1 − λ)y′) is in the set, so S is a convex set.

Now, the above examples use the definition of convexity to prove this property directly from the definitions. How can we use the second-order test?

Example Consider the function f : R → R given by y = x². Let's convert it into a function F : R² → R as

F (x, y) = y − x²

So as we vary x and y, the original function is represented by the isoquant/indifference curve/one-dimensional manifold F (x, y) = 0, but the function can be "translated" up and down. Consider the set F (x, y) ≥ 0. This is equivalent to the set {(x, y) : y ≥ f (x)} considered above. Now, if F (x, y) is a concave function, then its upper contour sets will be convex sets, and we can conclude that the set {(x, y) : F (x, y) ≥ 0} = {(x, y) : y ≥ f (x)} is a convex set. We show this with the second derivative test. Differentiate F (x, y) to get the Hessian
\[
\begin{pmatrix} 0 & 0 \\ 0 & -2 \end{pmatrix}
\]
which is everywhere negative semi-definite.
Unlike the Hessian of a function like x³ − x², which is sometimes positive definite, sometimes negative definite, and sometimes negative semi-definite depending on where it is evaluated, this one is uniformly negative semi-definite, so we can conclude the function is concave and its upper contour sets are convex. Note that a faster route to proving concavity is to say that y is a concave function of (x, y) and −x² is a concave function of (x, y), so that their sum must be concave.

Now consider the function f : R₊ → R given by y = √x. This function is not defined on all of R, so we must be careful. Consider the function F (x, y) = y − √x that maps R₊ × R → R. If this is a convex function, we can conclude that its lower contour sets are convex. The Hessian is
\[
\begin{pmatrix} 0 & 0 \\ 0 & .25x^{-1.5} \end{pmatrix}
\]
which is undefined at x = 0, where the lower right-hand corner takes the value .25/0. So on the set (R₊ \ {0}) × R, the Hessian is positive semi-definite, and the function is convex on this set. Consequently, its lower contour sets are convex sets, and we've shown that {(x, y) : y ≤ f (x)} is a convex set. Note that a faster route to proving convexity is to point out that y and −√x are convex functions of (x, y), so that F (x, y) is a sum of convex functions.

The previous two examples were for functions f : R → R. But what about higher-dimensional functions?

Example Let F (z) = x^α y^β, z = (x, y), where α, β > 0 but α + β < 1. We want to show that

S = {(µ, z) : µ ≤ F (z)}

is a convex set. Suppose

µ ≤ x^α y^β, µ′ ≤ x′^α y′^β

Then for all λ ∈ (0, 1),

λµ + (1 − λ)µ′ ≤ λx^α y^β + (1 − λ)x′^α y′^β

But we know that F (z) is concave when α + β < 1 and α, β > 0, so we have

λµ + (1 − λ)µ′ ≤ λx^α y^β + (1 − λ)x′^α y′^β ≤ (λx + (1 − λ)x′)^α (λy + (1 − λ)y′)^β

or

λµ + (1 − λ)µ′ ≤ (λx + (1 − λ)x′)^α (λy + (1 − λ)y′)^β

so that the convex combination (λµ + (1 − λ)µ′, λz + (1 − λ)z′) is in the set S. I find this approach straightforward and easy, but to use the function approach rather than the set approach, define G(µ, z) = µ − F (z).
This function is the sum of two convex functions, µ and −F (z), so its lower contour sets are convex. Or, you can compute the Hessian and check the signs of its leading principal minors. The next example generalizes all of the previous ones.

Example We will show that the epigraph,

epi f = {(x, µ) : µ ≥ f (x)}

of a convex function is a convex set. This is the set of points above the graph of f (x). For example, think about f (x) = x²: Shade all the points above the graph, and you get the epigraph. If f (x) is convex, we want to show that the epigraph is a convex set. This means that for any µ ≥ f (x) and µ′ ≥ f (x′),

λµ + (1 − λ)µ′ ≥ f (λx + (1 − λ)x′)

Now, we have the inequalities

µ ≥ f (x), µ′ ≥ f (x′)

since (x, µ) and (x′, µ′) are in the epigraph of f. Then multiplying by λ and 1 − λ yields

λµ + (1 − λ)µ′ ≥ λf (x) + (1 − λ)f (x′)

but by convexity of f (x),

λµ + (1 − λ)µ′ ≥ λf (x) + (1 − λ)f (x′) ≥ f (λx + (1 − λ)x′)

or

λµ + (1 − λ)µ′ ≥ f (λx + (1 − λ)x′)

so that if (x, µ) and (x′, µ′) are in the epigraph of f, then the convex combination (λx + (1 − λ)x′, λµ + (1 − λ)µ′) is in the epigraph, and it is a convex set.

Similarly, the hypograph,

hypo f = {(x, µ) : µ ≤ f (x)}

of a concave function is a convex set. We'll go faster for this one. Suppose

µ ≤ f (x), µ′ ≤ f (x′)

Then for λ ∈ (0, 1),

λµ + (1 − λ)µ′ ≤ λf (x) + (1 − λ)f (x′)

and by concavity of f (x),

λµ + (1 − λ)µ′ ≤ λf (x) + (1 − λ)f (x′) ≤ f (λx + (1 − λ)x′)

or

λµ + (1 − λ)µ′ ≤ f (λx + (1 − λ)x′)

so that if (x, µ) and (x′, µ′) are in the hypograph of f (x), then so is (λx + (1 − λ)x′, λµ + (1 − λ)µ′), and the hypograph of a concave function is a convex set.

Note that the epigraph and hypograph are NOT the upper and lower contour sets of the function. To visualize them, imagine the surface mapping R^N → R: It is like a sheet floating in space. The epigraph is the sheet and everything above it, while the hypograph is the sheet and everything below it, which are objects in R^{N+1}.
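The epigraph argument applies to any convex f, including non-differentiable ones, where the second-order test is unavailable. A brute-force spot check (illustrative only, with an assumed kinked function) for f(x₁, x₂) = max(|x₁|, |x₂|), which is convex but not differentiable along its kinks:

```python
def f(x1, x2):
    """A convex but non-differentiable function of two variables."""
    return max(abs(x1), abs(x2))

def in_epi(x1, x2, mu):
    """Membership test for the epigraph {(x, mu) : mu >= f(x)}."""
    return mu >= f(x1, x2) - 1e-12   # tiny tolerance for rounding

# A handful of points in the epigraph of f.
points = [(-2.0, 1.0, 5.0), (3.0, -4.0, 4.0), (0.0, 0.0, 0.0), (1.5, 1.5, 2.0)]

# Every convex combination of two epigraph points should stay in the epigraph.
epi_ok = all(
    in_epi(l * a1 + (1 - l) * b1,
           l * a2 + (1 - l) * b2,
           l * am + (1 - l) * bm)
    for (a1, a2, am) in points
    for (b1, b2, bm) in points
    for l in (0.0, 0.25, 0.5, 0.75, 1.0)
)
```

This is exactly the chain of inequalities in the proof above, evaluated pointwise: µ-coordinates average to something above the chord, and the chord sits above f at the averaged x.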
The upper and lower contour sets are subsets of R^N. The previous example generalizes the earlier ones to arbitrary concave and convex functions F : R^N → R: Convex functions have convex epigraphs, and concave functions have convex hypographs. These are the production sets Y of chapter four: They are essentially defined as the epigraphs of transformation functions F (y).

Exercises

1. Let (x, y) ∈ R²₊. (i) When is f (x, y) = x^α y^β concave? Quasi-concave? (ii) When is f (x, y) = (x^ρ + y^ρ)^{1/ρ} concave? (iii) When is min{ax, by} concave? Quasi-concave?

2. Show that if f (x) is quasi-concave and h(y) is strictly increasing, then h(f (x)) is quasi-concave. Show that x∗ maximizes f (x) subject to g(x) = 0 iff x∗ maximizes h(f (x)) subject to g(x) = 0.

3. Show that any increasing function f : R → R is quasi-concave, but not every increasing function is concave. Show that the sum of two concave functions is concave, but the sum of two quasi-concave functions is not necessarily quasi-concave.

4. Prove that if f (x) is convex, then any critical point of f (x) is a global minimum of the unconstrained minimization problem.

5. Show that all concave functions are quasi-concave. Show that all convex functions are quasi-convex.

6. Suppose a firm maximizes profits π(q, K, L) = pq − rK − wL subject to the constraint F (K, L) = q. Show that π() is convex in (p, r, w). Prove that the gradient of π with respect to (p, r, w) is equal to (q, −K, −L). Lastly, show that q∗ is increasing in p, K∗ is decreasing in r, and L∗ is decreasing in w. Note how these comparative statics do not rely on the properties of F (K, L).

Part IV

Classical Consumer and Firm Theory

Chapter 13

Choice Theory

13.1 Preference Relations

Definition 13.1.1 Suppose an agent is facing a decision problem where he can choose any x ∈ X. Then for any x, y ∈ X, let x ⪰ y if and only if the agent weakly prefers x to y — i.e., whenever x and y are both available, the agent is at least as happy choosing x as y.
If he is strictly happier, write x ≻ y, and if he is indifferent, write x ∼ y. Then we say ⪰ (or ≻) is a preference relation. A preference relation is rational if it is

• Complete: For all x, y ∈ X, either x ⪰ y or y ⪰ x (or both)

• Transitive: For all x, y, z ∈ X, if x ⪰ y and y ⪰ z, then x ⪰ z

It is natural for economics to begin with preference relations in mind as a basic object of study. It seems fair to require that agents be able to sensibly rank a list of alternatives from best to worst, allowing for indifference. It might be the case that these preferences are contingent on other circumstances, so that the preference ordering varies across states of the world such as wealth or weather, or that it is costly for agents to decide whether a given alternative is better than another. It might even be the case that agents' preferences are simply somewhat stochastic, so that the answer to whether a is better than b varies over time holding all else fixed (see the random utility models used in empirical industrial organization). But these issues are all embellishments on the basic idea that agents can rank alternatives. The added restrictions of completeness and transitivity merely reflect a reasonable level of internal consistency: Agents' preferences don't have internal contradictions, and no pairwise comparison is impossible.

Example An agent has choice set X = {x₁, x₂, x₃}.

• The following are all rational preferences on X:

x₁ ≻ x₂ ≻ x₃
x₁ ∼ x₂ ≻ x₃
x₁ ∼ x₂ ∼ x₃

• Consider the following preferences on X:

x₁ ≻ x₂ ≻ x₃ ≻ x₁

These aren't transitive, since they imply x₁ ≻ x₁. Therefore, these preferences aren't rational.

• Consider the following preferences on X:

x₁ ≻ x₂, x₁ ≻ x₃

These aren't complete: No relationship is specified between x₂ and x₃. Therefore, these preferences aren't rational.

Here is a non-trivial example of an economic agent with non-rational preferences:

Example There are three persons voting on alternatives a, b, and c.
Their preference orderings are

1: a ≻ b ≻ c
2: b ≻ c ≻ a
3: c ≻ a ≻ b

The three persons have resolved to decide the group's preference ordering by doing a sequence of pairwise comparisons of the alternatives and picking the alternative that wins. So the "agent" of this example is the group itself, not the three persons (who indeed have complete, transitive preferences). Suppose the group starts by comparing a and b. Then a gets two votes and b gets one, so that a wins, and a ≻ b. Then a and c are compared, and c wins, since it gets two votes and a gets one, so c ≻ a. Then b and c are compared, and b wins, since it gets two votes and c gets one, so b ≻ c. Then the preference ordering is a ≻ b ≻ c ≻ a..., which is non-transitive.

It is arguable that the group studied above, using the pairwise run-off social choice function, is not rational, since its preferences are non-transitive. In particular, a ≻ b ≻ c ≻ a. This is an instance of Arrow's impossibility theorem, which says that even if individual voters have rational preferences, there is, in general, no rule that aggregates those preferences into a rational preference ordering for the entire group. So, when we think about "representative" agents or talk about societal preferences, it is important to remember that there is no way to guarantee that such aggregative agents are rational (but we will eventually do it, anyway).

What about "social" or "other-regarding" preferences?

Example There are two apples. Amy and Bob can each have 0 or 1 apples, so that (x_a, x_b) is an allocation of the apples where x_a is the number of apples that Amy gets, and x_b is the number of apples Bob gets. So there are four possibilities: {(0, 0), (1, 0), (0, 1), (1, 1)}. Amy cares about equality, so she prefers (1, 1) ≻ {(0, 1), (1, 0)} ≻ (0, 0). Bob prefers inequality in his own favor, and hates it when others do better than him, so that (0, 1) ≻ {(1, 0), (1, 1)} ≻ (0, 0).
These preferences depend on the consumption bundles of other agents, but are completely rational: Each agent has complete, transitive preferences over all the possible outcomes. It is important to note that we are economists, so our job is not to judge preferences and say some are better than others, or that some fail to make sense to us. Our job is to explain how preferences translate into economic outcomes of interest, and we impose restrictions on ⪰ only insofar as the assumptions allow us to make useful inferences, not because we agree or disagree with agents' choices or desires.

13.2 Utility Functions

Preference relations are not typically very convenient to work with — in particular, maximizing them requires searching through all the possible alternatives for the best option. This makes it difficult to use the kind of maximization theory developed in the first part of the course.

Definition 13.2.1 A function u : X → R is a utility function representing preference relation ⪰ if, for all x, y ∈ X,

x ⪰ y if and only if u(x) ≥ u(y)

The big question for economists is, "When can ⪰ be represented by a utility function u, so that optimization theory and comparative statics can be used?" In particular, we want to know whether utility functions and preference relations contain all the same information, so that working with one is equivalent to working with the other.

Theorem 13.2.2 (1.B.2) If a preference relation can be represented by a utility function, then it is rational.

What do we have to do to prove the theorem? Just show that if we have a utility function representing ⪰, then ⪰ is transitive and complete.

Proof Transitivity: Suppose x ⪰ y and y ⪰ z. Since u() represents ⪰, this implies u(x) ≥ u(y) and u(y) ≥ u(z). Then u(x) ≥ u(z), by transitivity of ≥. Since u() represents ⪰, it follows that x ⪰ z, and ⪰ is transitive. Completeness: Since u() represents ⪰, for all elements x, y, we have u(x) ≥ u(y) or u(y) ≥ u(x).
But then, since u() represents ⪰, that means either x ⪰ y or y ⪰ x, so ⪰ is complete.

So we've shown that if we have a utility function representing some preferences, they must be rational. But when is rationality enough to guarantee that there is a utility function representing the preferences? Unfortunately, it depends on the topology of the choice set and the behavior of the preference ordering, but there are two cases that can easily be studied (there are plenty of others you could understand fairly quickly, but they would require new definitions and proofs that take time):

Theorem 13.2.3 Suppose the choice set X is finite. Then preferences ⪰ can be represented by a utility function u() if and only if they are complete and transitive.

The previous theorem gives one direction. We need to show that any complete and transitive preference relation can be represented by a utility function when the choice set is finite.

Proof The proof will be by induction on the size of the choice set. Note that if we have a rational preference relation on X, then the preference relation is also rational on X′ ⊂ X, because rationality is decided by pairwise comparisons, which are unchanged by considering a smaller choice set on which ⪰ is already rational. So we can start by constructing a utility function for a set of two elements, then work our way back up to the entire set.

(Base Case) Let's start with a choice set of two options, x₀ and x₁. Because ⪰ is complete and transitive, either x₀ ∼ x₁, x₀ ≻ x₁, or x₁ ≻ x₀. If x₀ ∼ x₁, assign them both to the number u(x₀) = u(x₁) = .5. If x₀ ≻ x₁, assign u(x₀) = .75 and u(x₁) = .25. If x₁ ≻ x₀, assign u(x₁) = .75 and u(x₀) = .25. Then u() represents ⪰ on X₁ = {x₀, x₁}.

(Induction Case) At the k-th step, suppose we have a utility function u() that represents ⪰ over the k + 1 elements of X_k. We will show there is a way to extend u() to the k + 2 elements of X_{k+1} so that the extension represents ⪰. First, pick an element x_k from X \ X_k.
We need to add it to the utility function in such a way that u() represents ⪰ on X_{k+1} = X_k ∪ {x_k}. There are four cases:

• We compare x_k to all the elements in X_k, and find that it is worse than all of them; this is possible because ⪰ is transitive and complete. Take one of the worst elements in X_k, x̲, and find its utility value, u(x̲). Set

u(x_k) = u(x̲)/2

Then u() represents ⪰ on X_{k+1} by construction.

• We compare x_k to all the elements in X_k, and find that it is better than all of them; this is possible because ⪰ is transitive and complete. Take one of the best elements in X_k, x̄, and find its utility value, u(x̄). Set

u(x_k) = u(x̄) + (1 − u(x̄))/2

Then u() represents ⪰ on X_{k+1} by construction.

• We compare x_k to all the elements in X_k, and find that x_b ≻ x_k ≻ x_a, with no other element x_c of X_k "between" x_b and x_a, in the sense that x_b ≻ x_c ≻ x_a (if there were such an element, we could compare it to x_k and see if we were still in this case, with x_c now taking the role of x_a or x_b; or, if we end up with an indifference x_k ∼ x_c, we move to the next case); this is possible because ⪰ is transitive and complete. Then set

u(x_k) = u(x_a) + (u(x_b) − u(x_a))/2

Then u() represents ⪰ on X_{k+1} by construction.

• We compare x_k to all the elements in X_k, and find that x_k ∼ x_c for some x_c ∈ X_k; this is possible because ⪰ is transitive and complete. Then set

u(x_k) = u(x_c)

Then u() represents ⪰ on X_{k+1} by construction.

So by induction, we can construct a utility function u() that represents ⪰ on any finite choice set X.

For uncountable choice sets (like R^n), the induction proof given above doesn't work, since we can't sit around doing pairwise comparisons — we would literally never, even if given a countably infinite amount of time to do the comparisons, get through all the possibilities.

Definition 13.2.4 Say that preferences ⪰ are continuous on X if, for any sequence of pairwise comparisons {(x^n, y^n)}∞_{n=1} with x^n ⪰ y^n for all n, lim_{n→∞} x^n = x, and lim_{n→∞} y^n = y, we have x ⪰ y.
So for a violation of continuity, you need to construct a sequence of choices where x^n ⪰ y^n arbitrarily close to the limit, but at the limit, y ≻ x. This means that for an infinitesimal change in the things being compared, the agent's preferences suddenly reverse. MWG have a good example:

Example Preferences are lexicographic on R² if, for all pairs x = (x₁, x₂) and y = (y₁, y₂), x ≻ y if x₁ > y₁, and if x₁ = y₁, then x ⪰ y if x₂ ≥ y₂. This basically means that there is one dimension that is always more important than the other dimension, and if two options are equal on that dimension, then the second dimension decides the choice. For example, in a computer, you might only care about speed and not care about monitor size unless two computers are equally fast. Or you might have children and put the safety features of a car above all considerations of horsepower, but compare two equally safe cars on the dimension of horsepower.

These preferences are discontinuous, however. Take the sequences x^n = (1/n, 0) and y^n = (0, 1). Then for all finite n, x^n ≻ y^n, because 1/n > 0 in the first coordinate, but the limits are x = (0, 0) and y = (0, 1), so the preference reverses at the limit. These preferences also cannot be represented by a utility function. So it isn't guaranteed that rationality implies that the preferences can be represented by a utility function. But it turns out that

Theorem 13.2.5 (3.C.1) Suppose that the preference relation ⪰ on X is rational and continuous. Then there exists a continuous utility function u() representing ⪰.

So sufficiently "smooth" preferences can be represented not just by any utility function, but by a continuous one. Consequently, we know that a rational, continuous preference relation on a compact set has a maximizer, by the extreme value theorem.

13.3 Consistent Decision-Making and WARP

Economics, however, aspires to be an observational science.
In particular, we don't ever actually get to observe ⪰, only the way that agents actually choose, and — when we're lucky — some detailed information about the circumstances under which those choices were made. For example, you might have data about which digital camcorders consumers purchased over five years. If you are even luckier, you have data about those consumers' incomes and whether they have children. Ideally, you even know that the consumers all bought the camcorders online, and can observe their browsing behavior, such as what online stores they browsed, so you can reconstruct the prices and products that they were comparing. But even with all of this detailed information, you still only get to observe their final choice, not the pairwise comparisons among all the possible options they faced. So we want to say something about whether the observed choices of agents are consistent with them having a rational preference relation.

Definition 13.3.1 A choice structure on X is (i) a set B of non-empty subsets B of X, and (ii) a choice rule C assigning a set of chosen elements to each B ∈ B: C(B) ⊂ B.

Example Let X = {x₁, x₂, x₃}, and let B = {B₁ = {x₁, x₂}, B₂ = {x₂, x₃}}. Then if an agent has preferences x₁ ≻ x₂ ≻ x₃, he would behave as C(B₁) = {x₁} and C(B₂) = {x₂}.

Definition 13.3.2 A choice structure (B, C) satisfies the weak axiom of revealed preference (WARP) if, whenever x, y ∈ B for some B ∈ B with x ∈ C(B), then for any B′ ∈ B with x, y ∈ B′ and y ∈ C(B′), we also have x ∈ C(B′).

Whenever you get a very ambiguous-looking definition, you should negate it immediately. Sometimes the negation of a definition is more intuitive than the definition itself (and you get a "theorem" for free):

Theorem 13.3.3 A choice structure (B, C) does not satisfy WARP if there exist two budget sets B, B′ ∈ B with x, y ∈ B, B′, x ∈ C(B), y ∈ C(B′), but x ∉ C(B′).

So if we see x chosen in the presence of y once, it must always be chosen whenever y is chosen and both are available.
This keeps decision-makers from making choices that obviously contradict the preferences they have exhibited in other situations.

Example Let X = {x, y, z}, B₁ = {x, y}, B₂ = {x, y, z}. If the agent chooses C(B₁) = x and C(B₂) = z, they seem to be making consistent choices — this person probably prefers x to y and z to x and y. But if C(B₁) = x and C(B₂) = y, we certainly have a violation. The option to choose z has now motivated the agent to pick y over x — why would the existence of z as an option affect the behavior of the agent in this way?

The interesting case you don't want to make a mistake with is the following: Suppose C(B₁) = {y} and C(B₂) = {x, y}. This violates WARP, because we observed x chosen along with y at B₂, but now we're throwing x away at B₁. Where exactly is the violation? Go back to the negation of WARP: Let B = B₂ and B′ = B₁. Then we have x ∈ C(B) and y ∈ C(B′), but x ∉ C(B′). Another way of looking at this is that the decision-maker is "elevating" elements into choice sets when they've previously been discarded in favor of other options; this also violates WARP.

However, if we only observe some of an agent's choices, we can't be sure if the agent's preferences are rational or not. We can only discuss whether a given preference relation is consistent with the observed behavior, and whether observed behavior is consistent with a given preference relation:

Definition 13.3.4 Suppose that ⪰ is a preference relation over X, and let B be any collection of choice sets. Then the choice rule

C∗(B, ⪰) = {x ∈ B : x ⪰ y for all y ∈ B}

is generated by ⪰. Similarly, a choice structure (B, C()) is rationalized by a preference relation ⪰ if C(B) = C∗(B, ⪰) for all B ∈ B. Given a choice structure (B, C()), the revealed preference relation ⪰∗ is defined as x ⪰∗ y if there is some B ∈ B such that x, y ∈ B and x ∈ C(B).
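The bookkeeping in the WARP examples above can be mechanized. The sketch below is one possible encoding (budgets as frozensets and the choice rule as a dict are my representation choices, not the notes'); it scans every pair of observed budgets for the violation pattern in Theorem 13.3.3:

```python
from itertools import product

def satisfies_warp(C):
    """C maps each budget (a frozenset) to the set of chosen elements.

    Returns False iff there exist budgets B, B' with x, y in both,
    x chosen at B, y chosen at B', but x NOT chosen at B'.
    """
    for B, Bp in product(C, repeat=2):
        for x in C[B]:
            for y in B:
                if y in Bp and x in Bp and y in C[Bp] and x not in C[Bp]:
                    return False
    return True

B1 = frozenset({"x", "y"})
B2 = frozenset({"x", "y", "z"})

# Consistent: x beats y, and z beats everything when it is available.
consistent = {B1: {"x"}, B2: {"z"}}

# The tricky case from the text: x is chosen along with y at B2,
# but discarded in favor of y at B1 -- a WARP violation.
violation = {B1: {"y"}, B2: {"x", "y"}}
```

Note that the checker finds the B₂-then-B₁ violation automatically, which is exactly the "let B = B₂ and B′ = B₁" step of the argument above.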
So when are choice structures generated by preference relations, and when can preference relations be recovered from choice structures?

Example Suppose X = {w, x, y, z}. Then all of the possible budget sets with at least two elements are wx, wy, wz, xy, xz, yz, wxy, wyz, wxz, xyz, wxyz. Let B = {{x, y, z}, {w, y, z}}, and suppose C({x, y, z}) = {x, y}, C({w, y, z}) = w. Then we can say that x and y are revealed preferred to z, and that w is revealed preferred to y and z. If the decision-maker’s choices satisfy WARP, then C({x, y}) = {x, y}, C({w, y}) = w, C({w, z}) = w, C({z, y}) = y, C({x, z}) = x. That leaves wx, wxyz, wxz, wxy as the only sets for which we might not be sure about what the decision-maker will do. But from WARP, we know that whenever x is chosen and y is available, y must be chosen as well. We also know that w was chosen over y, so y can never be chosen when w is available. Now, can x be chosen with w? We’ve seen x chosen with y, so to avoid a violation of WARP, we can’t see x chosen but y not chosen when both are available. But w is revealed preferred to y, so y can never be chosen when w is present. Therefore, there can’t be a choice set containing w, x, and y where x or y is chosen, and C({w, x, y, z}) = C({w, x, y}) = w. But if C({w, x, y}) = w, then w is revealed preferred to x, which pins down C({w, x, z}) = w as well. So our best guess of the agent’s preferences — given WARP — is w ≻ x ∼ y ≻ z.

Now suppose B = {{w, y}, {x, z}} and C({w, y}) = w, C({x, z}) = x. What can we conclude? Well, w ≻∗ y, so y can never be chosen with or above w if the choice structure satisfies WARP, and x ≻∗ z, so z can never be chosen with or above x. But we have no clue whether w is better or worse than x. Here are some preference relations that rationalize the observed choices: w ≻ x ≻ y ≻ z, w ≻ y ≻ x ≻ z, x ≻ w ≻ z ≻ y, and so on. So it’s not obvious that preferences can be recovered from observed choices, and we want a general idea of when we can or can’t recover true preferences from observed choices.

Theorem 13.3.5 (1.D.1) Suppose that ≿ is a rational preference relation.
Then the choice structure (B, C∗(B, ≿)) generated by ≿ satisfies the weak axiom.

Proof Suppose that for some B ∈ B, we have x, y ∈ B and x ∈ C∗(B, ≿) (this is the set-up of WARP). The definition of C∗(B, ≿) then implies that x ≿ y. Now take some B′ ∈ B with x, y ∈ B′ and y ∈ C∗(B′, ≿). Then for any other z ∈ B′, we must have y ≿ z (if z is not chosen, y ≻ z, and if z ∈ C∗(B′, ≿), y ∼ z). But we know that x ≿ y and y ≿ z, so by transitivity, x ≿ z for all z ∈ B′, so x ∈ C∗(B′, ≿). Therefore, the weak axiom is satisfied.

The following is a partial converse to the previous theorem:

Theorem 13.3.6 (1.D.2) Suppose (B, C(·)) is a choice structure that satisfies the weak axiom, and B contains all subsets of X of up to three elements. Then there is a rational preference relation that rationalizes C(·) on B. This relation is unique.

Proof First, the revealed preference relation ≿∗ is rational. Since B contains all choice sets of up to three elements, we can see how the decision-maker does all pairwise comparisons: for any x, y, we have {x, y} ∈ B, and C({x, y}) provides either x ≿∗ y or y ≿∗ x or both, so ≿∗ is complete. To show transitivity, assume x ≿∗ y and y ≿∗ z, and consider the three-element budget set {x, y, z} ∈ B, whose choice set is non-empty. If y ∈ C({x, y, z}), then since x ≿∗ y, WARP implies x ∈ C({x, y, z}) as well. If z ∈ C({x, y, z}), then it must be the case that y ∈ C({x, y, z}), since y was revealed preferred to z; but then x must also be in C({x, y, z}), since x was revealed preferred to y. Therefore, x ∈ C({x, y, z}) in every case, so x ≿∗ z.

Now we have to show that C∗(B, ≿∗) is the same function as C(B) — that if we knew the agent’s revealed preference relation ≿∗ on B, we could make all the same decisions as he would. First, suppose x ∈ C(B). Then x is revealed preferred to all other elements y ∈ B, so x ∈ C∗(B, ≿∗), and C(B) ⊂ C∗(B, ≿∗). Now suppose that x ∈ C∗(B, ≿∗); then x ≿∗ y for any other y ∈ B.
Then for any y ∈ B, there is some set By ∈ B with {x, y} ⊆ By and x ∈ C(By). In particular, pick some y ∈ C(B), which is non-empty; since x ∈ C(By), x, y ∈ B, and y ∈ C(B), the weak axiom implies that x ∈ C(B). So C∗(B, ≿∗) ⊂ C(B), and consequently C∗(B, ≿∗) = C(B).

So revealed preference axioms like WARP are central to constructing an equivalence between the abstract idea of a preference relation ≿ that represents an agent’s tastes and the choice structure (B, C(·)) that they are observed to exhibit. If WARP is satisfied, we can think of choice structures and preference relations as equivalent objects. If WARP fails, however, this equivalence may fail.

Exercises MWG: 1.B.1, 1.B.2, 1.C.1, 1.B.4, 1.C.2, 1.D.3, 1.D.4

Chapter 14 Consumer Behavior

Imagine consumers are agents with preferences over bundles of goods, represented by a continuous utility function u(x). The bundles are vectors in RL, such as (apples, car tires, socks, ..., pencils). The prices of the goods are a vector p = (p1, p2, ..., pL). Then the consumer chooses max_x u(x) subject to p·x ≤ w.

14.1 Consumer Choice

For the case of preferences over goods, there are some useful additional definitions:

Definition 14.1.1 The preference relation ≿ is
• monotone on X if y ≫ x implies y ≻ x
• strongly monotone if y ≥ x and y ≠ x imply y ≻ x
• locally non-satiated if, for all x and every ε > 0, there is a y such that ||y − x|| ≤ ε and y ≻ x
• convex if, for all x, y, z and λ ∈ (0, 1), y ≿ x and z ≿ x imply λy + (1 − λ)z ≿ x
• strictly convex if, for all x, y, z with y ≠ z and λ ∈ (0, 1), y ≿ x and z ≿ x imply λy + (1 − λ)z ≻ x.

As we know, if ≿ is continuous, it can be represented by a continuous utility function, u(x). Since the budget set {x : p·x ≤ w} is compact (for p ≫ 0), a solution exists to the utility maximization problem.

Definition 14.1.2 Suppose the consumer’s choice set is X ⊂ RL. Then the consumer’s competitive budget set is B(p, w) = {x ∈ X : p·x ≤ w}.

We proved that a closed subset of a compact set is compact, so if X is compact, so is B(p, w).
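The utility maximization problem over B(p, w) can be approximated numerically. Below is a small sketch, under my own choice of example (u(x1, x2) = x1·x2, which is not introduced in the notes until later): since local non-satiation implies the optimum exhausts the budget, we can search along the budget line p·x = w and compare with the closed-form Cobb-Douglas answer x1 = w/(2p1), x2 = w/(2p2).

```python
# Grid-search sketch (toy example, not from the notes): solving
# max u(x) over B(p, w) for u(x1, x2) = x1 * x2. Local non-satiation lets
# us restrict the search to the budget line p.x = w.

def brute_force_demand(p, w, n=400):
    """Search bundles on the budget line for the utility maximizer."""
    best, best_u = None, -1.0
    for i in range(n + 1):
        x1 = (w / p[0]) * i / n          # spend fraction i/n of wealth on good 1
        x2 = (w - p[0] * x1) / p[1]      # spend the remainder on good 2
        u = x1 * x2
        if u > best_u:
            best, best_u = (x1, x2), u
    return best

p, w = (2.0, 5.0), 100.0
x = brute_force_demand(p, w)
# Closed form for u = x1*x2: x1 = w/(2 p1) = 25, x2 = w/(2 p2) = 10
print(x)  # approximately (25.0, 10.0)
```

The grid search agrees with the closed form here; it is only a sanity check, since the analytic solution follows from the first-order conditions developed next.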
The points in B(p, w) are the feasible bundles for the consumer. A particular bundle x∗ ∈ B(p, w) is optimal if u(x∗) ≥ u(x′) for all x′ ∈ B(p, w). If u(x) is a differentiable function, the problem described above can be formulated as a constrained maximization problem with Lagrangean L(x, λ, µ) = u(x) + λ(w − p·x) + µ·x. Since ≿ is locally non-satiated, Walras’ law holds, and the solution will satisfy the budget constraint with equality. The Kuhn-Tucker multipliers µ acknowledge the possibility of a corner solution where some good is not consumed at all. Then the first-order necessary conditions are 0 = Du(x) − λp + µ and w − p·x = 0. If there is an interior solution where µ = 0, the FONCs imply that the gradient of the utility function is a scalar multiple of p: Du(x) = λp. Moreover, if u(·) is strictly quasiconcave, there will be a unique solution, so that we can use comparative statics to find relationships between xk and xℓ.

14.2 Walrasian Demand

The solution is called the Walrasian demand (function or correspondence), x(p, w). Before doing anything with maximization, some things can already be proved:

Theorem 14.2.1 Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿ defined on the consumption set X ⊂ RL. Then Walrasian demand satisfies
• Homogeneity of degree zero: x(αp, αw) = x(p, w), for all p, w, and α > 0.
• Walras’ law: p·x = w for all x ∈ x(p, w).
• Convexity/uniqueness: If ≿ is convex, so that u(·) is quasiconcave, then x(p, w) is a convex set. If ≿ is strictly convex, so that u(·) is strictly quasiconcave, then x(p, w) is a singleton.

Proof (i) If the original budget set is B(p, w) = {x : p·x ≤ w}, consider B(αp, αw) = {x : αp·x ≤ αw} = {x : p·x ≤ w} = B(p, w). The choice set is the same, so the maximizer is the same. (ii) Suppose the agent chose x (assume x is optimal) but p·x < w. Then by local non-satiation, for every ε > 0 there exists a y satisfying ||x − y|| < ε but y ≻ x.
For ε small enough, y will also satisfy p·y ≤ w (y is affordable). But this contradicts x being optimal. Therefore, any optimizer must satisfy p·x = w. (iii) Suppose u(·) is quasiconcave and that x ≠ x′ are both optimal (x, x′ ∈ x(p, w)). Note that xλ = λx + (1 − λ)x′ satisfies p·xλ = p·(λx + (1 − λ)x′) = λp·x + (1 − λ)p·x′ ≤ w, so xλ is feasible. If x and x′ are both optimal, u(x) = u(x′), and by quasiconcavity, u(xλ) ≥ min{u(x), u(x′)} = u(x), so xλ is optimal as well. Therefore, xλ ∈ x(p, w), and x(p, w) is a convex set. If u is strictly quasiconcave, then u(xλ) > u(x) = u(x′) for all λ ∈ (0, 1). Since xλ is feasible, this contradicts the assumption that x, x′ ∈ x(p, w), so x(p, w) must be a singleton.

Before using any more information about the preferences, we can learn a lot about Walrasian demand just from the previous proposition. In particular, we want to start studying how changes in prices or wealth change agents’ behavior. The wealth effect for the ℓ-th good is ∂xℓ(p, w)/∂w. A good is normal if ∂xℓ/∂w ≥ 0, and inferior if ∂xℓ/∂w < 0. The price effect of pk on the ℓ-th good is ∂xℓ(p, w)/∂pk. You might think ∂xℓ/∂pℓ ≤ 0, or that the own-price effect is negative, but goods for which ∂xℓ/∂pℓ > 0 exist; they are called Giffen goods. From these assumptions, we might start to search for interesting comparative statics results. Unfortunately, with so little assumed, it is hard to say much:

Proposition 14.2.2 (2E1, 2E2, 2E3) Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿ defined on the consumption set X ⊂ RL.
• If the Walrasian demand function x(p, w) is homogeneous of degree zero, then for all (p, w), Dp x(p, w)p + Dw x(p, w)w = 0, or, for each ℓ = 1, ..., L,
Σ_{k=1}^{L} [∂xℓ(p, w)/∂pk] pk + [∂xℓ(p, w)/∂w] w = 0.
• If x(p, w) satisfies Walras’ law, then for all (p, w), p·Dp x(p, w) + x(p, w)′ = 0′, or, for each k = 1, ..., L, Σ_{ℓ=1}^{L} pℓ [∂xℓ(p, w)/∂pk] + xk(p, w) = 0.
• If x(p, w) satisfies Walras’ law, then for all (p, w), p·Dw x(p, w) = 1, or Σ_{ℓ=1}^{L} pℓ [∂xℓ(p, w)/∂w] = 1.

14.3 The Law of Demand

The archetypal example of a comparative statics result is the Law of Demand: If the price of a good goes up, the quantity demanded by consumers goes down. Does that hold for our full-blown consumer model? Likewise, how does this idea of quantity demanded and price moving in opposite directions relate to WARP and consistent decision-making? We developed a weak axiom of revealed preference for situations where X was simply an abstract choice set. Now that we have specialized to the case of consumers with competitive budget sets, we can restate that idea:

Definition 14.3.1 Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿ defined on the consumption set X ⊂ RL. The Walrasian demand function x(p, w) satisfies the weak axiom of revealed preference (WARP) if the following holds for any two price-wealth situations (p, w) and (p′, w′): If p·x(p′, w′) ≤ w and x(p′, w′) ≠ x(p, w), then p′·x(p, w) > w′. In other words, if x(p, w) is chosen over x(p′, w′) at (p, w) when x(p′, w′) was affordable, then x(p, w) must be unaffordable at (p′, w′).

What we want to study now is how behavior changes when prices and wealth change. Define ∆p = (p′ − p), ∆w = ∆p·x(p, w). A compensated price change is any move from one price-wealth pair (p, w) to (p′, w′) that satisfies (p′, w′) = (p′, p′·x(p, w)). This kind of change ensures that if x(p, w) was chosen at (p, w), the same bundle is exactly affordable at the new price-wealth pair (p′, p′·x(p, w)).

Proposition 14.3.2 (2F1) Suppose that the Walrasian demand function x(p, w) is homogeneous of degree zero and satisfies Walras’ law.
Then x(p, w) satisfies WARP if and only if, for any compensated price change from (p, w) to (p′, w′) = (p′, p′·x(p, w)), (p′ − p)·(x(p′, w′) − x(p, w)) = ∆p·∆x ≤ 0, with strict inequality whenever x(p, w) ≠ x(p′, w′) (this is the compensated law of demand).

Proof (WARP implies CLD) Suppose x(p, w) ≠ x(p′, w′), and write x = x(p, w), x′ = x(p′, w′). Then ∆p·∆x = (p′ − p)·(x′ − x) = p′·(x′ − x) − p·(x′ − x). Because the price change is compensated, p′·x = w′, and Walras’ law gives p′·x′ = w′, so the first term is w′ − w′ = 0. Since x was affordable at (p′, w′) but x′ was chosen instead, WARP implies that x′ was not affordable at (p, w): p·x′ > w. Walras’ law gives p·x = w, so the second term is −(p·x′ − w) < 0. Therefore ∆p·∆x < 0, and the inequality holds. (CLD implies WARP) p. 31 in MWG.

So the weak axiom of revealed preference is equivalent to a form of the law of demand for compensated price changes: prices and quantities move in opposite directions, ∆p·∆x ≤ 0. The differential version of the CLD is that dp·dx ≤ 0 whenever dw = x(p, w)·dp. This is equivalent to the following string of manipulations:

dx = Dp x(p, w)dp + Dw x(p, w)dw = Dp x(p, w)dp + Dw x(p, w)[x(p, w)·dp] = [Dp x(p, w) + Dw x(p, w)x(p, w)′]dp.

Multiplying on the left by dp′ yields, by the CLD, dp·dx = dp′[Dp x(p, w) + Dw x(p, w)x(p, w)′]dp ≤ 0. Since the price change dp is arbitrary, the matrix Dp x(p, w) + Dw x(p, w)x(p, w)′ must be negative semi-definite. This special matrix is called the Slutsky matrix, S(p, w) = Dp x(p, w) + Dw x(p, w)x(p, w)′, and its properties play a central role in determining whether consumer behavior is rational or not. The above has shown:

Proposition 14.3.3 If the compensated law of demand holds, the Slutsky matrix must be negative semi-definite.

The entries of the Slutsky matrix are sℓk(p, w) = ∂xℓ(p, w)/∂pk + [∂xℓ(p, w)/∂w] xk(p, w). This describes how a consumer adjusts his consumption of good ℓ in response to a differential change in the price pk when wealth w is adjusted to keep the original bundle affordable.
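The Slutsky matrix can be computed numerically by finite differences. The sketch below uses my own example demand system, the Cobb-Douglas form xℓ = aℓ w/pℓ with a = (1/2, 1/2), and then checks the two properties at issue: symmetry and negative semi-definiteness of the quadratic form dp′ S dp.

```python
# Numeric sketch (toy Cobb-Douglas example, not from the notes): building
# S(p, w) = Dp x + Dw x x' by finite differences, then checking symmetry and
# negative semi-definiteness along a few price-change directions.

def demand(p, w, a=(0.5, 0.5)):
    """Cobb-Douglas Walrasian demand: x_l = a_l * w / p_l."""
    return [a[l] * w / p[l] for l in range(2)]

def slutsky(p, w, h=1e-6):
    x = demand(p, w)
    S = [[0.0, 0.0], [0.0, 0.0]]
    for l in range(2):
        for k in range(2):
            pk = list(p); pk[k] += h
            dx_dpk = (demand(pk, w)[l] - x[l]) / h    # price effect
            dx_dw = (demand(p, w + h)[l] - x[l]) / h  # wealth effect
            S[l][k] = dx_dpk + dx_dw * x[k]
    return S

S = slutsky((2.0, 5.0), 100.0)
print(S)                                  # approx [[-6.25, 2.5], [2.5, -1.0]]
print(abs(S[0][1] - S[1][0]) < 1e-3)      # symmetry, up to numerical error
for v in [(1, 0), (0, 1), (1, 1), (1, -1)]:
    q = sum(v[i] * S[i][j] * v[j] for i in range(2) for j in range(2))
    print(q <= 1e-6)                      # v' S v <= 0 in each direction
```

For this demand system the exact matrix is [[-w/(4p1²), w/(4p1p2)], [w/(4p1p2), -w/(4p2²)]], which is singular (the direction dp proportional to p gives dp′ S dp = 0, consistent with S(p, w)p = 0).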
In particular, sℓk(p, w) = ∂xℓ(p, w)/∂pk (the price effect) + [∂xℓ(p, w)/∂w] xk(p, w) (the wealth effect).

14.4 The Indirect Utility Function

If x∗(p, w) is a solution of the utility maximization problem max_{x : p·x ≤ w} u(x), then the indirect utility function is the function v(p, w) = u(x∗(p, w)). This gives the utility value of the consumer for any price-wealth situation (p, w).

Proposition 14.4.1 (3D3) The indirect utility function v(p, w) is
• homogeneous of degree zero
• strictly increasing in w and non-increasing in pℓ for each ℓ
• quasiconvex — the set {(p, w) : v(p, w) ≤ v̄} is convex for any v̄
• continuous in (p, w).

These results all follow because you can’t lower a consumer’s wealth and make him better off, or raise prices and make him better off, since these both shrink the choice set. The proofs of each point formalize one implication of that fact.

Exercises MWG: 2E1, 2E2, 2E3, 2E5, 3B1, 3B2, 3C6, 3D6

Chapter 15 Consumer Behavior, II

For any maximization problem max_x u(x) subject to b(x) = w, we can consider the problem min_x b(x) subject to u(x) = y. We call the original problem the primal problem, and the second program — with the roles of the objective and constraint reversed — the dual problem. At first, you might wonder why we would study such a problem. The Lagrangian for the primal is Lp = u(x) − λp(b(x) − w), with FONCs ∇u(xp) = λp∇b(xp), b(xp) − w = 0, and the Lagrangian for the dual is Ld = b(x) − λd(u(x) − y), with FONCs ∇b(xd) = λd∇u(xd), u(xd) − y = 0. These two problems have the same solution when the constraint levels are consistent — b(xp) = w = b(xd) and u(xp) = y = u(xd) — in which case the multipliers satisfy λd = 1/λp. But this is actually incredibly useful information. As we vary exogenous parameters, we know that the primal and dual solutions xp and xd start off with the same value, but react differently to the variation in things like prices and wealth.
This gives us a variety of angles of attack on the problem that can clarify which changes are a result of price effects (relate to the dual, which does not involve wealth) and which changes are a result of wealth effects (relate to the primal, which does).

15.1 The Expenditure Minimization Problem

Consider the dual of the utility maximization problem, the expenditure minimization problem: min_x p·x subject to u(x) ≥ ū. If u(x) is differentiable, we can form the Lagrangean and first-order necessary conditions for a minimum: L(x, λ) = −p·x + λ(u(x) − ū), yielding the FONCs p = λ∇u(x), u(x) = ū. The solution to this problem is called Hicksian demand or compensated demand, h(p, u), in parallel with Walrasian/uncompensated demand.

For example, consider u(x) = x1^α x2^β. Then the EMP is min_x p1x1 + p2x2 subject to x1^α x2^β ≥ ū. This has Lagrangean L = −p1x1 − p2x2 + λ(x1^α x2^β − ū) and solution h1∗ = (αp2/(βp1))^{β/(α+β)} ū^{1/(α+β)}, h2∗ = (βp1/(αp2))^{α/(α+β)} ū^{1/(α+β)}.

Proposition 15.1.1 (3E1) Suppose u(·) is a continuous utility function representing a locally non-satiated preference relation ≿, and the price vector p ≫ 0. Then
• If x∗(p, w) is optimal in the UMP, then x∗(p, w) is optimal in the EMP when the utility target is u = u(x∗(p, w)).
• If h∗(p, u) is optimal in the EMP, then h∗(p, u) is optimal in the UMP when wealth is w = p·h∗(p, u).

Proof If x∗ is optimal in the UMP at (p, w), then it is feasible in the EMP when the utility target is u(x∗); if it weren’t optimal there, that would imply there were a vector x′ that was strictly cheaper and achieved the same utility as x∗, and by local non-satiation a strictly better bundle would then have been affordable in the UMP, so x∗ couldn’t have been optimal in the first place — a contradiction. The same kind of argument establishes the other part of the proposition. This implies that x(p, e(p, u)) = h(p, u) and x(p, w) = h(p, v(p, w)). These are important identities to keep in mind.
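The two identities just stated can be checked numerically. The sketch below uses the Cobb-Douglas closed forms for u(x1, x2) = x1x2 that the example in section 15.3 derives (Walrasian demand, Hicksian demand, v, and e); the specific numbers are my own.

```python
# Numeric sketch of the duality identities x(p, e(p, u)) = h(p, u) and
# h(p, v(p, w)) = x(p, w), for u(x1, x2) = x1 * x2, using the closed forms
# derived in the example of section 15.3.

from math import sqrt

def walrasian(p, w):      # x(p, w) = (w/(2 p1), w/(2 p2))
    return (w / (2 * p[0]), w / (2 * p[1]))

def hicksian(p, u):       # h(p, u) = (sqrt(u p2/p1), sqrt(u p1/p2))
    return (sqrt(u * p[1] / p[0]), sqrt(u * p[0] / p[1]))

def v(p, w):              # indirect utility: w^2 / (4 p1 p2)
    return w * w / (4 * p[0] * p[1])

def e(p, u):              # expenditure function: 2 sqrt(u p1 p2)
    return 2 * sqrt(u * p[0] * p[1])

p, w, u = (2.0, 5.0), 100.0, 30.0
print(walrasian(p, e(p, u)), hicksian(p, u))   # these two bundles coincide
print(hicksian(p, v(p, w)), walrasian(p, w))   # and so do these
```

Minimizing expenditure to reach the utility the consumer would attain anyway, or maximizing utility given the expenditure the consumer would spend anyway, lands on the same bundle.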
Similarly, just as v(p, w) = u(x∗(p, w)) is the value function for the UMP, there is a value function for the EMP: the expenditure function e(p, u) = p·h∗(p, u).

Proposition 15.1.2 (3E3) Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿. Hicksian demand h(p, u) possesses the following properties:
• homogeneity of degree zero in p: h(αp, u) = h(p, u) for all α > 0
• no excess utility: for any h ∈ h(p, u), u(h) = u
• convexity/uniqueness: if ≿ is convex, then h(p, u) is a convex set, and if ≿ is strictly convex, then h(p, u) is a function.

Proof (i) Minimizing p·x subject to u(x) ≥ u is the same as minimizing αp·x subject to u(x) ≥ u for α > 0, since scaling the objective by α is a monotone transformation that does not change the minimizer. (ii) Suppose there were an x ∈ h(p, u) with u(x) > u. Then the bundle x′ = αx with α ∈ (0, 1) satisfies u(x′) = u(αx) > u if α is close enough to 1, by continuity of u. But the bundle x′ is strictly less expensive: αp·x < p·x. Consequently, x could not have been optimal in the first place (since x′ is cheaper and still meets the utility target).

Proposition 15.1.3 (3E2) Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿. The expenditure function e(p, u) is
• homogeneous of degree one in p
• strictly increasing in u and non-decreasing in pℓ for all ℓ
• concave in p
• continuous in p and u.

Proof (i) e(αp, u) = αp·x∗ = α(p·x∗) = αe(p, u): scaling the objective by α is a monotone transformation, so the solution doesn’t change, and the value scales by α. (ii) Suppose e(p, u) were not strictly increasing in u, and let x′ be optimal at u′ and x′′ be optimal at u′′, where u′′ > u′ and p·x′ ≥ p·x′′. Consider a bundle x̃ = αx′′, α ∈ (0, 1), with p·x′ > p·x̃. Then if α is close to 1, u(x̃) > u′ but p·x′ > p·x̃, so x̃ is feasible at target u′ and cheaper than x′, contradicting the assumption that x′ was optimal. (iii) Suppose p′′ and p′ have p′′ℓ ≥ p′ℓ but are otherwise equal.
Let x′′ be optimal in the EMP at p′′. Then e(p′′, u) = p′′·x′′ ≥ p′·x′′ ≥ e(p′, u). So p′′ ≥ p′ implies e(p′′, u) ≥ e(p′, u). (iv) For concavity, fix u and let p′′ = αp + (1 − α)p′ for α ∈ [0, 1]. Suppose x′′ is optimal in the EMP at p′′. Then e(p′′, u) = p′′·x′′ = αp·x′′ + (1 − α)p′·x′′ ≥ αe(p, u) + (1 − α)e(p′, u), since x′′ is feasible, but not necessarily optimal, at prices p and p′.

And there is a corresponding law of demand for Hicksian demand:

Proposition 15.1.4 (3E4) Suppose u is a continuous utility function representing a locally non-satiated preference relation ≿, and suppose h(p, u) is a function for all p ≫ 0. Then the Hicksian demand function satisfies the compensated law of demand: for all p′ and p′′, (p′′ − p′)·(h(p′′, u) − h(p′, u)) = ∆p·∆h ≤ 0.

Proof For any p ≫ 0, h(p, u) is optimal in the EMP, so it achieves a lower expenditure at p than any other bundle offering that much utility. Therefore p′′·h(p′′, u) ≤ p′′·h(p′, u) and p′·h(p′′, u) ≥ p′·h(p′, u). Subtracting yields (p′′ − p′)·h(p′′, u) ≤ (p′′ − p′)·h(p′, u), which rearranges to the result.

This is one of our most important results to keep in mind:

Proposition 15.1.5 (3G1) For all p and u, Hicksian demand satisfies h(p, u) = Dp e(p, u).

Proof Using the envelope theorem: take V(p, u) = −p·h(p, u) + λ(u(h(p, u)) − ū), and totally differentiate with respect to p: Dp V(p, u) = −h(p, u) − p′Dp h(p, u) + λDx u(h(p, u))Dp h(p, u). Since the FOC of the EMP is 0 = −p + λDx u(x), all the terms involving Dp h(p, u) drop out, leaving Dp V(p, u) = −h(p, u). But V(p, u) = −e(p, u) at the optimum, so Dp V(p, u) = −Dp e(p, u), and therefore Dp e(p, u) = h(p, u).

Now, since h(p, u) = Dp e(p, u) and we know that e(p, u) has certain properties in p, we can differentiate again to learn more. This trick is very common, especially in producer theory, so it pays to understand this proposition:

Proposition 15.1.6 (3G2) Suppose h(p, u) is differentiable at p with Jacobian matrix Dp h(p, u).
Then
• Dp h(p, u) = D²p e(p, u)
• Dp h(p, u) is negative semi-definite
• Dp h(p, u) is symmetric
• Dp h(p, u)·p = 0.

Proof (i) This follows from the previous proposition (3G1). (ii) Since e(p, u) is a concave function (proposition 3E2), its Hessian — which by (i) is the Jacobian of h(p, u) — is negative semi-definite. (iii) The Jacobian Dp h(p, u) equals the Hessian D²p e(p, u), and Hessians of twice continuously differentiable functions are symmetric (Young’s theorem), so Dp h(p, u) is symmetric. (iv) Since h(p, u) is homogeneous of degree zero in p, we have h(αp, u) − h(p, u) = 0 for all α > 0; differentiating with respect to α yields Dp h(αp, u)·p = 0, and evaluating at α = 1 gives the result.

So now that the expenditure minimization problem has been characterized and its solution and value function studied, let’s ask again: what is this about? The EMP says, “Find the least expensive way to give the consumer u worth of utility.” If you worked at a hospital and they asked you to maximize the nutritional value of a patient’s diet subject to a budget constraint, you would be solving the UMP. If you worked at a hospital and they asked you to minimize the cost of a patient’s diet subject to achieving a certain level of nutrition, you would be solving the EMP. Naturally, the solutions and value functions of these problems are related, and their solutions have similar properties. But there are actually some deeper connections.

15.2 The Slutsky Equation and Roy’s Identity

Slutsky’s equation and Roy’s identity link all the ideas developed up to now together. In particular, Slutsky’s equation links Hicksian to Walrasian demand, and Roy’s identity shows how the indirect utility function relates to Walrasian demand.

Proposition 15.2.1 (3G3, the Slutsky Equation) For all (p, w) and u = v(p, w), Dp h(p, u) = Dp x(p, w) + Dw x(p, w)x(p, w)′, or, for all ℓ and k, ∂hℓ(p, u)/∂pk = ∂xℓ(p, w)/∂pk + [∂xℓ(p, w)/∂w] xk(p, w).

Proof Consider a consumer at prices (p, w) with utility u = v(p, w). That implies that w = e(p, u) and h(p, u) = x(p, e(p, u)).
Differentiating with respect to p yields Dp h(p, u) = Dp x(p, e(p, u)) + Dw x(p, e(p, u))Dp e(p, u)′. But then 3G1 implies Dp h(p, u) = Dp x(p, e(p, u)) + Dw x(p, e(p, u))h(p, u)′, and since w = e(p, u) and x(p, w) = x(p, e(p, u)) = h(p, u), Dp h(p, u) = Dp x(p, w) + Dw x(p, w)x(p, w)′.

This also shows that S(p, w) = Dp h(p, u), since the right-hand side of the identity is exactly the Slutsky matrix. Since we know that Dp h(p, u) is negative semi-definite, symmetric, and satisfies Dp h(p, u)·p = 0, these properties must be satisfied by S(p, w) as well.

Another indispensable tool that comes from exploiting duality theory is Roy’s identity, which allows us to derive the Walrasian/uncompensated demand function from the indirect utility function.

Proposition 15.2.2 (3G4, Roy’s identity) Suppose that the indirect utility function v(p, w) is differentiable at (p, w) ≫ 0. Then x(p, w) = −Dp v(p, w)/Dw v(p, w), or, for all ℓ = 1, ..., L, xℓ(p, w) = −[∂v(p, w)/∂pℓ]/[∂v(p, w)/∂w].

Proof Note that u = v(p, w), so that u = v(p, e(p, u)). Differentiating with respect to p yields 0 = Dp v(p, e(p, u)) + Dw v(p, e(p, u))Dp e(p, u). From 3G1, Dp e(p, u) = h(p, u), so 0 = Dp v(p, e(p, u)) + Dw v(p, e(p, u))h(p, u), and since e(p, u) = w and u = v(p, w), 0 = Dp v(p, w) + Dw v(p, w)h(p, v(p, w)) = Dp v(p, w) + Dw v(p, w)x(p, w), so that x(p, w) = −Dp v(p, w)/Dw v(p, w).

15.3 Example

Suppose u(x1, x2) = x1x2.
Then the EMP has Lagrangean L = −p1x1 − p2x2 + λ(x1x2 − ū). This generates Hicksian demands h1(p, u) = √(ū p2/p1), h2(p, u) = √(ū p1/p2), and an expenditure function e(p, u) = p1h1 + p2h2 = 2√(ū p1p2). If we make the substitution w = e(p, u) and ū = v(p, w), then w = 2√(v(p, w) p1p2), and solving for v gives v(p, w) = w²/(4p1p2). Using Roy’s identity gives x1(p, w) = −[∂v/∂p1]/[∂v/∂w] = −(−w²/(4p1²p2))/(w/(2p1p2)) = w/(2p1), which is the solution of the UMP: L = x1x2 + λ(w − p1x1 − p2x2), which has solutions x1(p, w) = w/(2p1), x2(p, w) = w/(2p2). From these we can compute the indirect utility function v(p, w) = [w/(2p1)][w/(2p2)] = w²/(4p1p2). Making the substitution ū = v(p, w) and w = e(p, u), we get ū = e(p, u)²/(4p1p2). Re-arranging to solve for e(p, u) yields e(p, u) = 2√(p1p2ū), and differentiating with respect to p1 and p2 yields h1(p, u) = ∂e/∂p1 = √(ūp2/p1), h2(p, u) = ∂e/∂p2 = √(ūp1/p2), which are exactly the Hicksian demands we started with.

So, in principle, the following is true:
• Walrasian demand can be generated from the UMP
• Hicksian demand can be generated from the EMP
• The indirect utility function can be generated from Walrasian demand
• The expenditure function can be generated from Hicksian demand
• The expenditure function can be generated from the indirect utility function, and the indirect utility function can be generated from the expenditure function
• Walrasian demand can be generated from the indirect utility function by Roy’s identity
• Hicksian demand can be generated from the expenditure function by 3G1.

So if you are given any piece of the puzzle, you can, in principle, generate all of the other objects.

Exercises MWG: 3E4, 3E6, 3E8, 3E9, 3G1, 3G3, 3G7

Chapter 16 Welfare Analysis

Suppose a consumer has a rational, continuous, locally non-satiated preference relation ≿ represented by a continuous utility function u(x). Consider an initial price p0 and a final price p1.
You might want to know, “How does the consumer’s welfare change when prices move from p0 to p1?” However, the consumer’s behavior varies in response to those prices, making some price changes more painful than others. Moreover, what if some prices increase and some decrease? For example, a shock to oil prices has an impact on many products, and measuring the impact will be complicated precisely because the consumer will adjust to shrug off some of the welfare loss by buying substitute goods or taking other steps to adapt to the new price regime. Since v(p, w) is our theoretical index of the consumer’s welfare for all price-wealth situations, the real question is, “What is v(p1, w) − v(p0, w)?” But we don’t actually ever observe v(p, w), so is there some other function φ(p, v(p, w)) that is observable (in principle) and is closely related to v(p, w)? The expenditure function e(p̄, v(p, w)) is strictly increasing in its second argument and is denominated in money, so it is a natural choice; this is called the money metric utility function, and we measure welfare changes as e(p̄, v(p1, w)) − e(p̄, v(p0, w)). Note that the prices p̄ are held fixed, but the utility is varying. Since the choice of p̄ is still undetermined, there are two natural choices: Compensating Variation, CV(p0, p1, w) = e(p1, v(p1, w)) − e(p1, v(p0, w)) = e(p1, u1) − e(p1, u0) = w − e(p1, u0), and Equivalent Variation, EV(p0, p1, w) = e(p0, v(p1, w)) − e(p0, v(p0, w)) = e(p0, u1) − e(p0, u0) = e(p0, u1) − w. The value of CV measures how much wealth the consumer would have to be paid after the price change to make her indifferent about the price change, while EV measures how much she would have to be paid before the price change to make her indifferent about it. Or, EV is the amount she would pay to stop the price change, while CV is how much she would have to be paid after the price change to achieve the original utility level.

Example Consider a consumer with preferences u(x1, x2) = x1x2.
Then his expenditure function is e(p, u) = 2√(p1p2ū) and his indirect utility function is v(p, w) = w²/(4p1p2), so our composite function e(pᵃ, v(pᵇ, w)) = 2√(p1ᵃp2ᵃ)·√(w²/(4p1ᵇp2ᵇ)) = w√(p1ᵃp2ᵃ/(p1ᵇp2ᵇ)) is the money-metric utility function described above. Suppose that prices are initially (1, 1), but then the price of good two changes from 1 to 2. Our compensating variation is CV(p0, p1, w) = e(p1, v(p1, w)) − e(p1, v(p0, w)) = w − w√((1·2)/(1·1)) = w(1 − √2), while the equivalent variation is EV(p0, p1, w) = e(p0, v(p1, w)) − e(p0, v(p0, w)) = w√((1·1)/(1·2)) − w = w(√(1/2) − 1). So in this case it is easy to calculate closed-form solutions.

But suppose we want to know more generally how changes in p2 affect the consumer’s welfare. Hold p1 fixed, and consider CV(p0, p1, w) = e(p1, v(p1, w)) − e(p1, v(p0, w)) = w − e(p1, u0) = e(p0, u0) − e(p1, u0). In this case, that equals CV((p1, p2⁰), (p1, p2¹), w) = e((p1, p2⁰), u0) − e((p1, p2¹), u0). From the fundamental theorem of calculus, e((p1, p2⁰), u0) − e((p1, p2¹), u0) = ∫_{p2¹}^{p2⁰} [∂e((p1, z), u0)/∂p2] dz, and from 3G1, CV = ∫_{p2¹}^{p2⁰} h2((p1, z), u0) dz = ∫_{p2¹}^{p2⁰} √(u0 p1/z) dz = 2√(u0 p1)(√(p2⁰) − √(p2¹)). Inserting u0 = v(p0, w) = w²/(4p1p2⁰) yields CV = w(1 − √(p2¹/p2⁰)). Evaluating at p0 = (1, 1) and p1 = (1, 2) yields CV = w(1 − √2), which is exactly as before.

So the example shows that when it comes to welfare, we are talking about integrals of Hicksian demand, while our undergraduate measure was Consumers’ Surplus, based on integrals of Walrasian demand. So suppose only the price of good ℓ is changing, from pℓ⁰ to pℓ¹. In general, how can EV or CV be computed? CV = e(p1, v(p1, w)) − e(p1, v(p0, w)) = w − e(p1, v(p0, w)) = e(p0, v(p0, w)) − e(p1, v(p0, w)) = e(p0, u0) − e(p1, u0). Let p̄−ℓ = (p1, p2, ..., pℓ−1, pℓ+1, ..., pL) be the vector of prices that are being held fixed.
Then from the Fundamental Theorem of Calculus, CV = e(p0, u0) − e(p1, u0) = ∫_{pℓ¹}^{pℓ⁰} [∂e(z, p̄−ℓ, u0)/∂pℓ] dz, and from 3G1, CV = ∫_{pℓ¹}^{pℓ⁰} hℓ(z, p̄−ℓ, u0) dz. So compensating variation is the area to the left of the Hicksian demand curve, evaluated at u0, and equivalent variation is EV = ∫_{pℓ¹}^{pℓ⁰} hℓ(z, p̄−ℓ, u1) dz, or the area to the left of the Hicksian demand curve, evaluated at u1. This is the new Consumer Surplus. In fact, consider the Slutsky equation: Dp h(p, u) = Dp x(p, w) + Dw x(p, w)x(p, w)′. If there were no wealth effects, ∫ Dp h(p, u) dp = ∫ Dp x(p, w) dp (in the sense of line integrals), and Hicksian and Walrasian demand would be essentially the same. Then the integral in the above example could replace h2(p, u) with x2(p, w), and we would have computed the area beneath a Walrasian demand curve to measure the agent’s welfare. Since wealth effects are not negligible, however, we must work with Hicksian demand.

16.1 Taxes

For example, the deadweight loss triangle represents the loss from a tax in introductory microeconomics. Consider a tax t on good ℓ such that pℓ¹ = pℓ⁰ + t, with all other prices remaining the same. The consumer’s equivalent variation measures the payment that would compensate him for the introduction of the tax. If T is the total tax revenue collected from the consumer, the tax harms him by more than the revenue it raises whenever −EV > T — i.e., the burden of the tax exceeds what merely transferring T away from him would cost (remember, a tax changes the relative prices of goods in addition to reducing his real wealth). In particular, −T − EV(p0, p1, w) = w − e(p0, u1) − T gives exactly the measure of the welfare lost by imposing the tax, beyond the revenue collected. The revenue from the tax is T = t·xℓ(p1, w), his demand for good ℓ after the tax times the size of the tax, or equivalently T = t·hℓ(p1, u1).
Then, since $e(p^1, u^1) = w$,
$$-T - EV(p^0,p^1,w) = e(p^1,u^1) - e(p^0,u^1) - t\, h_\ell(p^1,u^1) = \int_{p_\ell^0}^{p_\ell^1} \frac{\partial e(z,\bar p_{-\ell},u^1)}{\partial p_\ell}\,dz - t\, h_\ell(p^1,u^1).$$
Then substituting $p_\ell^1 = p_\ell^0 + t$, this equals
$$\int_{p_\ell^0}^{p_\ell^0+t} \frac{\partial e(z,\bar p_{-\ell},u^1)}{\partial p_\ell}\,dz - (p_\ell^0+t-p_\ell^0)\, h_\ell(p_\ell^0+t,\bar p_{-\ell},u^1) = \int_{p_\ell^0}^{p_\ell^0+t} \left[h_\ell(z,\bar p_{-\ell},u^1) - h_\ell(p_\ell^0+t,\bar p_{-\ell},u^1)\right] dz.$$
Again, this is an integral involving the Hicksian demand curve.

16.2 Consumer Surplus vs. CV and EV

The Area Variation in MWG is just Consumer Surplus, or
$$AV(p^0,p^1,w) = \int_{p_\ell^1}^{p_\ell^0} x_\ell(z,\bar p_{-\ell},w)\,dz.$$
As we mentioned, this is the wrong welfare measure. But consider
$$AV(p^0,p^1,w) = \int_{p_\ell^1}^{p_\ell^0}\left[x_\ell(p_\ell^1,\bar p_{-\ell},w) + \int_{p_\ell^1}^{z} \frac{\partial x_\ell(y,\bar p_{-\ell},w)}{\partial p_\ell}\,dy\right]dz$$
versus
$$EV(p^0,p^1,w) = \int_{p_\ell^1}^{p_\ell^0}\left[h_\ell(p_\ell^1,\bar p_{-\ell},u^1) + \int_{p_\ell^1}^{z} \frac{\partial h_\ell(y,\bar p_{-\ell},u^1)}{\partial p_\ell}\,dy\right]dz.$$
From the Slutsky equation, we know that
$$\frac{\partial h_\ell(p,u)}{\partial p_\ell} = \frac{\partial x_\ell(p,w)}{\partial p_\ell} + \frac{\partial x_\ell(p,w)}{\partial w}\, x_\ell(p,w).$$
Substituting this in (and using $h_\ell(p_\ell^1,\bar p_{-\ell},u^1) = x_\ell(p_\ell^1,\bar p_{-\ell},w)$) yields
$$EV(p^0,p^1,w) = \int_{p_\ell^1}^{p_\ell^0}\left[x_\ell(p_\ell^1,\bar p_{-\ell},w) + \int_{p_\ell^1}^{z} \frac{\partial x_\ell(y,\bar p_{-\ell},w)}{\partial p_\ell} + \frac{\partial x_\ell(y,\bar p_{-\ell},w)}{\partial w}\, x_\ell(y,\bar p_{-\ell},w)\,dy\right]dz$$
or
$$EV(p^0,p^1,w) = \underbrace{\int_{p_\ell^1}^{p_\ell^0}\int_{p_\ell^1}^{z} \frac{\partial x_\ell(y,\bar p_{-\ell},w)}{\partial w}\, x_\ell(y,\bar p_{-\ell},w)\,dy\,dz}_{\text{Wealth Effect}} + AV(p^0,p^1,w).$$
On page 89, this is why MWG say that AV tends to underestimate EV when $\ell$ is a normal good: if $\partial x_\ell/\partial w \ge 0$ and $p_\ell^1 < p_\ell^0$, the integral term will be positive, so $EV = AV + \varepsilon$ where $\varepsilon > 0$. A similar derivation can be done for the compensating variation to show that $AV > CV$, giving
$$EV > AV > CV.$$

Exercises

MWG: 3I1, 3I2, 3I3, 3I5, 3I7

Chapter 17 Production

In classical economics, consumers demand and firms supply. Now we apply the theory of rational choice to firms, just as we did for consumers in the previous two chapters.

17.1 Production Sets

A production vector $y \in \mathbb{R}^L$ is a plan describing the net outputs of a production process. For example, if you had a machine that takes a pound of plastic and half a pound of capacitors to make one computer, $y = (-1, -1/2, 1)$.
Also, firms might sell each other goods, so that the computer produced by one firm is sold to another firm to monitor the chemicals being sprayed on oranges, and so on.

Each firm has a set $Y \subset \mathbb{R}^L$ of possible production plans, the production set. Any $y \in Y$ is feasible, while any $y \in \mathbb{R}^L \setminus Y$ is not. Another way of describing $Y$ is the transformation function $F$, which satisfies
$$Y = \{y \in \mathbb{R}^L : F(y) \le 0\}.$$
The boundary of $Y$ is the set of plans $y$ such that $F(y) = 0$; this is like the production possibilities frontier of introductory microeconomics.

Definition 17.1.1 A production set $Y \ne \emptyset$ satisfies
• "no free lunch" if $y \in Y$ and $y \ge 0$ imply $y = 0$ (it is not possible to produce something with nothing)
• free disposal if $y \in Y$ and $y' \le y$ imply $y' \in Y$
• irreversibility if $y \in Y$ implies $-y \notin Y$
• non-increasing returns to scale if for any $y \in Y$, $\alpha y \in Y$ for all $\alpha \in [0,1]$
• non-decreasing returns to scale if for any $y \in Y$, $\alpha y \in Y$ for all $\alpha > 1$
• constant returns to scale if for any $y \in Y$, $\alpha y \in Y$ for all $\alpha \ge 0$
• additivity if for any $y, y' \in Y$, $y + y' \in Y$
• convexity if $y, y' \in Y$ and $\alpha \in [0,1]$ imply $\alpha y + (1-\alpha)y' \in Y$
• closedness if $Y$ includes its boundary: for any sequence $y^n \to y$ with $y^n \in Y$, $y \in Y$

Example Consider $F(y_1,y_2) = y_1 - 1 + e^{\beta y_2}$ and assume that $y_2 \le 0$ and $p_1 > p_2/\beta$ (sketch the set $F(y) \le 0$ before going any further). Then you can think of this as a firm who can transform $y_2$ into $y_1$ and vice versa according to the technology $F$. Then the firm maximizes
$$\max_{y : F(y) \le 0}\; p \cdot y.$$
This is probably not what you have in mind when you think of a canonical profit maximization problem. For example, you probably have in mind
$$\max_{z_1, z_2}\; p f(z_1,z_2) - w_1 z_1 - w_2 z_2,$$
where the $z$'s are factor inputs, like capital and labor, and $f$ is a production function. But that is not how MWG set up the profit maximization or cost minimization problems.
They are being "agnostic" about what it means for something to be an input or an output, and merely considering a transformation function $F : \mathbb{R}^L \to \mathbb{R}$ that bounds the set of vectors a firm can produce. In this case, the Lagrangean takes the form
$$\mathcal{L} = p_1 y_1 + p_2 y_2 - \lambda(y_1 - 1 + e^{\beta y_2})$$
with solution
$$y_1^* = 1 - \frac{p_2}{\beta p_1}, \qquad y_2^* = \frac{1}{\beta}\log\left(\frac{p_2}{\beta p_1}\right),$$
in which it is not clear a priori which good is the input and which is the output. Note that we didn't assume which good was the output and which was the input; this was determined by the technology and prices. For example, a farm might raise cows and corn, producing milk. Under some prices, the firm doesn't sell the corn and uses it simply to feed cows to produce milk. At other prices (for instance, high demand for ethanol to make gasoline), the farm begins to sell corn as well as use it to feed cows, so that corn is both an input and an output. Then, if prices shift far enough towards corn, the farm stops raising cows, butchers them for meat, and focuses on growing corn (but note that corn kernels are an input in growing corn, so it is, in some sense, its own input). MWG want to have this level of generality of inputs and outputs in mind, and it's important in macroeconomics as well (Blinder wrote an influential paper about how sectoral shocks propagate through intermediate goods channels, for example).

Make sure you understand the above example, because MWG never actually provide a fully worked example of the profit-maximization problem they've chosen, and if you think of it as a "$pF(K,L) - rK - wL$" kind of problem, the propositions below will probably become confusing. In particular, the functional form $\pi = pF(K,L) - rK - wL$ assumes that $K$ and $L$ are inputs and substitutes $q = F(K,L)$ into the objective. How can we transform it into the kind of framework MWG are using? First, let $\tilde K = -K$ and $\tilde L = -L$, define $\tilde F(q, \tilde K, \tilde L) = q - F(-\tilde K, -\tilde L)$, and then consider
$$\max_{q, \tilde K, \tilde L}\; pq + r\tilde K + w\tilde L$$
subject to $\tilde F(q, \tilde K, \tilde L) \le 0$.
This generates a Lagrangean
$$\mathcal{L} = pq + r\tilde K + w\tilde L - \lambda \tilde F(q, \tilde K, \tilde L).$$
That's exactly the PMP defined at the bottom of page 135. So remember that MWG are including output in the production plan, so we're working with $\tilde F$.

17.2 Profit Maximization

Let $p = (p_1, p_2, ..., p_L)$. Then the profit maximization problem is
$$\pi(p) = \max_{y \in Y}\; p \cdot y.$$
Since $Y = \{y : F(y) \le 0\}$, this can be written
$$\pi(p) = \max_{y : F(y) \le 0}\; p \cdot y.$$

Proposition 17.2.1 (5C1) Suppose $\pi(\cdot)$ is the profit function of the production set $Y$ and $y(\cdot)$ is the supply correspondence. If $Y$ is closed and satisfies the free disposal property, then
1. $\pi(p)$ is homogeneous of degree one
2. $\pi(p)$ is convex
3. If $Y$ is convex, then $Y = \{y \in \mathbb{R}^L : p \cdot y \le \pi(p) \text{ for all } p \gg 0\}$
4. $y(\cdot)$ is homogeneous of degree zero
5. If $Y$ is convex, then $y(p)$ is a convex set. If $Y$ is strictly convex, $y(p)$ is single-valued.
6. If $y(p)$ is a function, then $\pi(p)$ is differentiable and $D\pi(p) = y(p)$ (Hotelling's Lemma)
7. If $y(p)$ is differentiable, then $Dy(p) = D^2\pi(p)$ is a symmetric and positive semidefinite matrix with $Dy(p)\, p = 0$

Proof

• (i) If $\pi(p) = \max_{y \in Y} p \cdot y$, then for $\alpha > 0$ the objective $\alpha p \cdot y$ is a monotone transformation of the original objective, so it has the same solution set and $\pi(\alpha p) = \alpha\pi(p)$.

• (ii) Take any $p, p'$ and $\lambda \in [0,1]$. Then
$$\pi(\lambda p + (1-\lambda)p') = \lambda p \cdot y(\lambda p + (1-\lambda)p') + (1-\lambda)p' \cdot y(\lambda p + (1-\lambda)p') \le \lambda p \cdot y(p) + (1-\lambda)p' \cdot y(p') = \lambda\pi(p) + (1-\lambda)\pi(p'),$$
so $\pi(p)$ is convex.

• (iv) Since the constraint set is unaffected by a scalar transformation of the prices $p' = \alpha p$, if $y$ is optimal for $\max_{y \in Y} p \cdot y$, then $y$ is optimal for $\max_{y \in Y} \alpha p \cdot y = \alpha \max_{y \in Y} p \cdot y$, so $y(\alpha p) = y(p)$.

• (v) If $Y$ is convex, then for all $y, y' \in Y$, $\lambda y + (1-\lambda)y' \in Y$. Suppose $y$ and $y'$ are optimal at $p$. Then
$$\pi(p) = p \cdot y = p \cdot y' = \lambda p \cdot y + (1-\lambda)p \cdot y' = p \cdot (\lambda y + (1-\lambda)y'),$$
so $\lambda y + (1-\lambda)y'$ must also be optimal. If $Y$ is strictly convex, then $y = y'$, or else a feasible plan near $\lambda y + (1-\lambda)y'$ would give strictly higher profits than $y$ or $y'$, contradicting the assumption that $y$ and $y'$ were optimal.
• (vi) Hotelling's lemma follows from $\pi(p) = p \cdot y(p)$ and
$$D\pi(p) = y(p) + p \cdot Dy(p) = y(p),$$
where the second term vanishes by the Envelope Theorem or a duality argument.

• (vii) Last, since $\pi(p)$ is convex, we know that $D^2\pi(p)$ is a positive semi-definite matrix. Then Hotelling's Lemma implies that $y(p) = D\pi(p)$, so differentiating again yields $D^2\pi(p) = Dy(p)$, so the Jacobian of the firm's supply curve is the Hessian of the profit function.

In some sense, the above results are the UMP version of the firm's problem, and $y(p)$ behaves much like Walrasian demand.

Example Recall the example with $F(y_1,y_2) = y_1 - 1 + e^{\beta y_2}$ and $p_1 > p_2/\beta$. Then the firm's profit-maximizing production plan was
$$y_1 = 1 - \frac{p_2}{\beta p_1}, \qquad y_2 = \frac{1}{\beta}\log\left(\frac{p_2}{\beta p_1}\right).$$
Clearly, these are HOD-0 with respect to $p_1$ and $p_2$. The Jacobian of the optimal plan is
$$D_p y(p) = \frac{1}{\beta}\begin{pmatrix} \dfrac{p_2}{p_1^2} & -\dfrac{1}{p_1} \\[4pt] -\dfrac{1}{p_1} & \dfrac{1}{p_2} \end{pmatrix},$$
which is positive semi-definite (and symmetric). In particular, the terms on the diagonal are positive, so the law of supply will hold. The profit function becomes
$$\pi(p) = p_1\left(1 - \frac{p_2}{\beta p_1}\right) + \frac{p_2}{\beta}\log\left(\frac{p_2}{\beta p_1}\right),$$
which is HOD-1 in $p$, since
$$\pi(\alpha p) = \alpha p_1\left(1 - \frac{\alpha p_2}{\alpha\beta p_1}\right) + \frac{\alpha p_2}{\beta}\log\left(\frac{\alpha p_2}{\alpha\beta p_1}\right) = \alpha\pi(p).$$
So all the required properties are satisfied.

17.3 The Law of Supply

The following kind of argument is very common in economics: Suppose $y$ solves $\max_{\tilde y \in Y} p \cdot \tilde y$ and $y'$ solves $\max_{\tilde y \in Y} p' \cdot \tilde y$. Then
$$p \cdot y \ge p \cdot y' \qquad \text{and} \qquad p' \cdot y' \ge p' \cdot y.$$
Subtracting the second inequality from the first gives
$$(p - p') \cdot y \ge (p - p') \cdot y'$$
or
$$(p - p') \cdot (y - y') \ge 0,$$
yielding $\Delta p\, \Delta y \ge 0$, which is the Law of Supply. In particular, if only the price of good $\ell$ changes, $(p_\ell - p'_\ell)(y_\ell - y'_\ell) \ge 0$, so that if good $\ell$'s price goes up, the firm produces more of $y_\ell$. That all looked easy, but this is often a very effective argument.

Proposition 17.3.1 If $y \in y(p)$ and $y' \in y(p')$, then
$$(p - p') \cdot (y - y') = \Delta p\, \Delta y \ge 0.$$

Note that we don't have "Walrasian" supply and "Hicksian" supply, and a "Compensated Law of Supply"; we just have the supply function $y(p)$ and a law of supply.
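The claims in the example above are easy to verify numerically. Here is a Python sketch (the value $\beta = 2$ and the specific price vectors are arbitrary choices satisfying $p_1 > p_2/\beta$): Hotelling's lemma is checked by numerically differentiating $\pi$, and the law of supply is checked for a pair of price vectors.

```python
import math

beta = 2.0   # technology parameter; prices below satisfy p1 > p2/beta

def supply(p1, p2):
    # profit-maximizing plan for F(y1, y2) = y1 - 1 + exp(beta*y2)
    y1 = 1 - p2 / (beta * p1)
    y2 = math.log(p2 / (beta * p1)) / beta
    return y1, y2

def profit(p1, p2):
    y1, y2 = supply(p1, p2)
    return p1 * y1 + p2 * y2

p1, p2, h = 3.0, 2.0, 1e-6

# Hotelling's lemma: the gradient of pi is the supply vector
dpi_dp1 = (profit(p1 + h, p2) - profit(p1 - h, p2)) / (2 * h)
dpi_dp2 = (profit(p1, p2 + h) - profit(p1, p2 - h)) / (2 * h)
y1, y2 = supply(p1, p2)
print(dpi_dp1, y1)   # approximately equal
print(dpi_dp2, y2)   # approximately equal

# Law of supply: (p - p') . (y - y') >= 0 for any two price vectors
p1b, p2b = 5.0, 1.0
y1b, y2b = supply(p1b, p2b)
print((p1 - p1b) * (y1 - y1b) + (p2 - p2b) * (y2 - y2b) >= 0)  # True
```

The same pattern (differentiate the value function numerically, compare against the optimal plan) is a useful habit for checking any envelope-theorem computation by hand.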
Remember that for consumers we defined compensated price-wealth changes $dw = x \cdot dp$, and then considered
$$dx = D_p x \cdot dp + D_w x\, dw = D_p x \cdot dp + D_w x\,(x' dp) = \left[D_p x + D_w x \cdot x'\right] dp,$$
giving us the Slutsky matrix $S(p,w) = D_p x(p,w) + D_w x(p,w)\, x(p,w)'$. This equation included wealth effects through $D_w x(p,w)$, making the evaluation of welfare complicated, since price changes had an indirect effect on the consumer by changing his real wealth. When it comes to the classical theory of the firm, there is no budget constraint that plays a similar role to the wealth constraint the consumer faces. Consequently, $dy = D_p y(p)\, dp$, and the law of supply holds for all price changes.

17.4 Cost Minimization

Now we focus on a slightly different problem, where we specify which goods are inputs and which are outputs. The dual problem to the profit maximization problem is the cost minimization problem (CMP):
$$\min_z\; w \cdot z \quad \text{subject to} \quad f(z) \ge \bar q.$$
The solution to this problem is a set of factor demands, expressing how much of each input to hire at a vector of wage rates $w$.

Proposition 17.4.1 (5C2) Let $c(w,q)$ be the cost function of a single-output technology $Y$ with production function $f(\cdot)$, and let $z(w,q)$ be the factor demand correspondence. Assume that $Y$ is closed and satisfies the free disposal property. Then
• $c(\cdot)$ is homogeneous of degree one in $w$ and non-decreasing in $q$
• $c(\cdot)$ is concave in $w$
• $z(\cdot)$ is homogeneous of degree zero in $w$
• If $z(w,q)$ is single-valued, then $c(\cdot)$ is differentiable with respect to $w$ and $D_w c(w,q) = z(w,q)$
• If $z(\cdot)$ is differentiable at $w$, then $D_w z(w,q) = D_w^2 c(w,q)$ is a symmetric and negative semi-definite matrix with $D_w z(w,q)\, w = 0$
• If $f(\cdot)$ is homogeneous of degree one, then $c(\cdot)$ and $z(\cdot)$ are homogeneous of degree one in $q$
• If $f(\cdot)$ is concave, then $c(\cdot)$ is a convex function of $q$

Proof

• (i) Since the objective is $w \cdot z$, a monotone transformation of the objective $\alpha w \cdot z$ has the same solution set, so $c(\cdot)$ is HOD-1 in $w$: $c(\alpha w, q) = \alpha c(w, q)$.
To see that $c(\cdot)$ is non-decreasing in $q$, suppose that $z$ is the solution at $(w,q)$ and $z'$ is the solution at $(w,q')$, with $q' \ge q$. Then $z'$ is feasible at $(w,q)$, since $f(z') \ge q' \ge q$. If $c(w,q') < c(w,q)$, then $w \cdot z' < w \cdot z$, so $z'$ improves on $z$ at $(w,q)$. But this contradicts $z$ being optimal.

• (ii) Let $z$ be the solution at $(w,q)$, $z'$ the solution at $(w',q)$, and $z''$ the solution at $(\lambda w + (1-\lambda)w', q)$. Then
$$c(\lambda w + (1-\lambda)w', q) = \lambda w \cdot z'' + (1-\lambda)w' \cdot z''.$$
Since $w \cdot z = c(w,q)$ and $w' \cdot z' = c(w',q)$ are the cheapest ways of reaching $q$ at $w$ and $w'$ respectively, the plan $z''$ must satisfy $w \cdot z'' \ge w \cdot z$ and $w' \cdot z'' \ge w' \cdot z'$. Then
$$\lambda w \cdot z'' + (1-\lambda)w' \cdot z'' \ge \lambda w \cdot z + (1-\lambda)w' \cdot z' = \lambda c(w,q) + (1-\lambda)c(w',q).$$

• (iii) Since a monotone transformation of the objective doesn't change the solution correspondence, $z(\alpha w, q) = z(w,q)$.

• (v) (Shephard's lemma) The Lagrangean of the constrained minimization problem is
$$\mathcal{L} = w \cdot z - \lambda(f(z) - q).$$
Differentiating the value $c(w,q) = w \cdot z(w,q) - \lambda(w,q)\left(f(z(w,q)) - q\right)$ with respect to $w$ yields
$$D_w c(w,q) = z(w,q) + w \cdot D_w z(w,q) - \lambda D_z f(z(w,q)) D_w z(w,q) - D_w \lambda\,(f(z) - q).$$
Using the first-order condition $w = \lambda D_z f(z(w,q))$ and the binding constraint $f(z(w,q)) = q$, all terms but the first vanish, which implies
$$D_w c(w,q) = z(w,q).$$

• (vi) If $c(w,q)$ is twice differentiable at $w$, then $D_w z(w,q) = D_w^2 c(w,q)$ is a symmetric and negative semidefinite matrix satisfying $D_w z(w,q) \cdot w = 0$, from the concavity of $c(w,q)$ in $w$ and the HOD-0 in $w$ of $z(w,q)$.

Example The cost minimization problem for $F(y_1,y_2) = y_1 + e^{\beta y_2} - 1$ is trivial, since producing output $y_1 = q$ pins down the input directly: $e^{\beta y_2} = 1 - q$, so $y_2 = \log(1-q)/\beta$. Instead, let the production function be $f(z_2, z_3) = e^{\beta z_2 z_3} - 1$, so that there is an interesting trade-off between the inputs $z_2$ and $z_3$. Then the Lagrangean for the CMP is
$$\mathcal{L} = w_2 z_2 + w_3 z_3 - \lambda(e^{\beta z_2 z_3} - 1 - q).$$
This has first-order conditions
$$w_2 = \lambda\beta z_3 e^{\beta z_2 z_3}, \qquad w_3 = \lambda\beta z_2 e^{\beta z_2 z_3},$$
implying that $w_2 z_2 = w_3 z_3$. Substituting this relationship into the constraint gives
$$z_2^* = \sqrt{\frac{w_3}{w_2}\frac{\log(1+q)}{\beta}}, \qquad z_3^* = \sqrt{\frac{w_2}{w_3}\frac{\log(1+q)}{\beta}}.$$
These are the conditional factor demands, and give the least-cost way of producing $q$ units of output.
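These conditional factor demands can be checked numerically: they should be feasible, they should attain the cost function, and Shephard's lemma should hold. A Python sketch (the parameter values $\beta$, $w_2$, $w_3$, $q$ are arbitrary choices for illustration):

```python
import math

beta = 1.5

def factor_demands(w2, w3, q):
    # conditional factor demands for f(z2, z3) = exp(beta*z2*z3) - 1
    L = math.log(1 + q) / beta
    return math.sqrt(w3 * L / w2), math.sqrt(w2 * L / w3)

def cost(w2, w3, q):
    # the cost function implied by the demands above
    return 2 * math.sqrt(w2 * w3 * math.log(1 + q) / beta)

w2, w3, q, h = 4.0, 9.0, 5.0, 1e-6
z2, z3 = factor_demands(w2, w3, q)

# the demands attain the cost function and are feasible: f(z2*, z3*) = q
print(cost(w2, w3, q), w2 * z2 + w3 * z3)   # equal
print(math.exp(beta * z2 * z3) - 1)          # = q

# Shephard's lemma: dc/dw_k = z_k, checked by numerical differentiation
dc_dw2 = (cost(w2 + h, w3, q) - cost(w2 - h, w3, q)) / (2 * h)
dc_dw3 = (cost(w2, w3 + h, q) - cost(w2, w3 - h, q)) / (2 * h)
print(dc_dw2, z2)   # approximately equal
print(dc_dw3, z3)   # approximately equal
```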
So we can play many of the same games with firms using Shephard's lemma and Hotelling's lemma as we did for consumers with Roy's identity and the Slutsky equation. In particular, if we take the last example and solve for $c(w,q)$ we get
$$c(w,q) = 2\sqrt{w_2 w_3\, \frac{\log(1+q)}{\beta}},$$
from which we can describe a profit maximization problem
$$\max_q\; pq - c(w,q) = pq - 2\sqrt{w_2 w_3\, \frac{\log(1+q)}{\beta}}.$$

17.5 A Cobb-Douglas Example

In a lot of economic problems, we describe the firm's technology by a production function $f : \mathbb{R}^L_+ \to \mathbb{R}$ that takes a vector of $L$ inputs $z$ and produces a single output, $q$. Consider $q = z_1^\alpha z_2^\beta$ with $w = (w_1, w_2)$ and output price $p$. Then the profit maximization problem is (substituting the technology into the profit function)
$$\pi(p) = \max_{z_1, z_2}\; p z_1^\alpha z_2^\beta - w_1 z_1 - w_2 z_2.$$
A necessary condition for maximization is
$$p\alpha z_1^{\alpha-1} z_2^{\beta} - w_1 = 0, \qquad p\beta z_1^{\alpha} z_2^{\beta-1} - w_2 = 0,$$
and the second-order sufficient condition is that
$$\begin{pmatrix} p\alpha(\alpha-1)z_1^{\alpha-2}z_2^{\beta} & p\alpha\beta z_1^{\alpha-1}z_2^{\beta-1} \\ p\alpha\beta z_1^{\alpha-1}z_2^{\beta-1} & p\beta(\beta-1)z_1^{\alpha}z_2^{\beta-2} \end{pmatrix}$$
be negative semi-definite (which is satisfied as long as $\alpha + \beta < 1$). Then the solution to the FOCs is
$$z_1^* = \left(p\left(\frac{\alpha}{w_1}\right)^{1-\beta}\left(\frac{\beta}{w_2}\right)^{\beta}\right)^{\frac{1}{1-\alpha-\beta}}, \qquad z_2^* = \left(p\left(\frac{\alpha}{w_1}\right)^{\alpha}\left(\frac{\beta}{w_2}\right)^{1-\alpha}\right)^{\frac{1}{1-\alpha-\beta}}.$$
Using the envelope theorem,
$$\frac{\partial \pi}{\partial w_1} = -z_1^* \qquad \text{and} \qquad \frac{\partial \pi}{\partial p} = z_1^{*\alpha} z_2^{*\beta} = q,$$
just as Hotelling's Lemma predicts (make sure you see why this is true). Similarly, the cost minimization problem is
$$\min_{z_1,z_2}\; w_1 z_1 + w_2 z_2 \quad \text{subject to} \quad z_1^\alpha z_2^\beta \ge q.$$
Solving the first-order conditions and plugging back into the constraint yields
$$z_1^* = \left(q\left(\frac{\alpha w_2}{\beta w_1}\right)^{\beta}\right)^{1/(\alpha+\beta)}, \qquad z_2^* = \left(q\left(\frac{\beta w_1}{\alpha w_2}\right)^{\alpha}\right)^{1/(\alpha+\beta)}.$$
Plugging this in to get the cost function,
$$c(w_1,w_2,q) = w_1\left(q\left(\frac{\alpha w_2}{\beta w_1}\right)^{\beta}\right)^{1/(\alpha+\beta)} + w_2\left(q\left(\frac{\beta w_1}{\alpha w_2}\right)^{\alpha}\right)^{1/(\alpha+\beta)}.$$
Using the envelope theorem to differentiate the cost function yields
$$\frac{\partial c(w,q)}{\partial w_1} = z_1^* \qquad \text{and} \qquad \frac{\partial c(w,q)}{\partial w_2} = z_2^*,$$
which is Shephard's Lemma.

17.6 Scale

A firm has constant returns to scale (CRS) if $f(\alpha z) = \alpha f(z)$ for all $\alpha \ge 0$.
Differentiating with respect to $\alpha$ and evaluating at $\alpha = 1$ yields
$$f(z) = \sum_\ell z_\ell \frac{\partial f(z)}{\partial z_\ell} = z \cdot Df(z),$$
which is a useful property (Euler's theorem) often invoked in macroeconomics. When do our usual production functions satisfy this condition? For CES technologies,
$$f(\alpha z) = \left(\sum_\ell \beta_\ell (\alpha z_\ell)^\rho\right)^{1/\rho} = \left(\alpha^\rho \sum_\ell \beta_\ell z_\ell^\rho\right)^{1/\rho} = \alpha f(z),$$
so there is at least one non-trivial CRS technology. When is Cobb-Douglas CRS?
$$f(\alpha z) = \prod_\ell (\alpha z_\ell)^{\beta_\ell} = \prod_\ell \alpha^{\beta_\ell} \prod_\ell z_\ell^{\beta_\ell} = \alpha^{\sum_\ell \beta_\ell} \prod_\ell z_\ell^{\beta_\ell},$$
which equals $\alpha f(z)$ iff $\sum_\ell \beta_\ell = 1$. If $\sum_\ell \beta_\ell < 1$, then $f(\alpha z) < \alpha f(z)$ for $\alpha > 1$, so there are decreasing returns to scale, and if $\sum_\ell \beta_\ell > 1$, then $f(\alpha z) > \alpha f(z)$ for $\alpha > 1$, so there are increasing returns to scale.

In economics, decreasing and constant returns to scale technologies are generally well-behaved, but increasing returns to scale can pose a number of problems. To see this clearly, consider a one-input technology $q = z^\beta$, where the firm faces price $p$ for its output and pays $w$ for the input. Then
$$\pi(z) = p z^\beta - wz.$$
The first-order condition is $0 = p\beta z^{\beta-1} - w$ and the second-order condition is $0 > p\beta(\beta-1)z^{\beta-2}$ when evaluated at a critical point. The first thing to note is that if $\beta > 1$, the second-order condition cannot be satisfied, since the right-hand side is always positive. Therefore, when conceptualizing increasing returns to scale situations, the story needs to be adjusted so that (i) the firm faces varying input costs $w$ or output prices $p$, or (ii) the firm is a monopolist or part of a concentrated industry. For example, a firm might completely saturate demand in a market, driving $p$ to zero, or there might be a rare input that is costly to acquire that limits the firm's scale of production. In other industries, costs might be constant, but there may be fixed costs of entry, and incumbent firms build up capacity, thereby discouraging entry of new competitors. Consequently, under increasing returns to scale it is not the technology that determines the firm's maximizing behavior, but its strategic motives.
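These scale properties are easy to confirm numerically. A Python sketch (the CES weights, exponent vectors, input vector, and scaling factor $\alpha = 1.7$ are arbitrary illustrative choices): CES is CRS for any weights, Cobb-Douglas is CRS, DRS, or IRS according to whether the exponents sum to one, and Euler's formula $f(z) = z \cdot Df(z)$ holds in the CRS case.

```python
import math

weights = [0.2, 0.3, 0.5]
z = [2.0, 3.0, 5.0]
alpha = 1.7  # arbitrary scaling factor > 1

def ces(zz, rho=0.5):
    return sum(b * x**rho for b, x in zip(weights, zz)) ** (1 / rho)

def cobb_douglas(zz, betas):
    out = 1.0
    for b, x in zip(betas, zz):
        out *= x**b
    return out

# CES is CRS: f(alpha*z) - alpha*f(z) is (numerically) zero
crs_gap = ces([alpha * x for x in z]) - alpha * ces(z)
print(crs_gap)                          # ~0

# Cobb-Douglas: compare f(alpha*z) against alpha*f(z)
def scale_type(betas):
    f_az = cobb_douglas([alpha * x for x in z], betas)
    af_z = alpha * cobb_douglas(z, betas)
    if abs(f_az - af_z) < 1e-9:
        return "CRS"
    return "DRS" if f_az < af_z else "IRS"

print(scale_type([0.2, 0.3, 0.5]))      # CRS: exponents sum to 1
print(scale_type([0.1, 0.2, 0.3]))      # DRS: sum < 1
print(scale_type([0.5, 0.5, 0.5]))      # IRS: sum > 1

# Euler's theorem for the CRS technology: f(z) = z . Df(z)
h = 1e-6
grad = []
for i in range(len(z)):
    zp, zm = list(z), list(z)
    zp[i] += h
    zm[i] -= h
    grad.append((ces(zp) - ces(zm)) / (2 * h))
euler_gap = sum(x * g for x, g in zip(z, grad)) - ces(z)
print(euler_gap)                        # ~0
```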
For decreasing returns to scale technologies, $\beta < 1$, so $f(\alpha z) = \alpha^\beta z^\beta < \alpha z^\beta$ for $\alpha > 1$. In this case, the second-order condition is automatically satisfied, and any solution to the first-order condition is a local maximum. For constant returns to scale technologies ($\beta = 1$), the profit function becomes
$$\pi = \max_z\; pz - wz = (p-w)z.$$
If $p > w$, then the firm wants to produce an infinite amount. If $p < w$, the firm can never make any money, so it produces nothing. If $p = w$ exactly, then the firm is indifferent about whether and how much to make, since it makes no economic profit on any unit. Consequently, the solution correspondence is
$$z(p,w) = \begin{cases} 0 & p < w \\ [0,\infty) & p = w \\ \infty & p > w \end{cases}$$

17.7 Efficient Production

The goal of much of economics is to decide whether or not a given market generates "efficient" outcomes. Generally, this is in the sense of Pareto: An outcome is efficient if no agent can be made strictly better off without making some agent strictly worse off. For example, if a society were simply throwing away tons of food, this is ostensibly inefficient, because even if no individual member of the society wanted the food, the resources that went into producing the excess output could have been put into something more socially useful. This is the same sense in which we'll talk about efficient production:

Definition 17.7.1 A production plan $y \in Y$ is efficient if there is no $y' \in Y$ such that $y' \ge y$ and $y' \ne y$.

This means that a firm is operating efficiently if there is no other plan that uses weakly fewer inputs to produce weakly more outputs. If you consider $F(y_1,y_2) = y_1 + e^{\beta y_2} - 1$, there are many choices of $(y_1,y_2)$ that satisfy $y_1 + e^{\beta y_2} - 1 < 0$, and choosing such an interior point is like a consumer choosing a consumption bundle in the interior of his budget set. This motivates

Proposition 17.7.2 (5F1) If $y \in Y$ is profit maximizing for some $p \gg 0$, then $y$ is efficient.

Proof Suppose there is a $y' \in Y$ such that $y' \ne y$ and $y' \ge y$.
Because $p \gg 0$, this implies $p \cdot y' > p \cdot y$, implying that $y$ is not profit maximizing.

This is a simple version of the First Fundamental Theorem of Welfare Economics (look up Proposition 10.D.1 on page 326 and Proposition 16.C.1 on page 549): competitive, price-taking behavior generates efficient outcomes. We might ask the opposite question, however: For an efficient plan $y$, does there exist a price vector for which that $y$ is profit-maximizing? Note that we are really asking whether a particular kind of objective function exists. The answer comes from separating hyperplane theorems such as the following:

Theorem 17.7.3 (Eidelheit) Let $K_1$ and $K_2$ be convex sets in $\mathbb{R}^N$ such that $K_1$ has interior points and $K_2$ contains no interior points of $K_1$. Then there is a closed hyperplane $H$ separating $K_1$ and $K_2$; i.e., there is a vector $x^*$ such that
$$\sup_{x \in K_1} x \cdot x^* \le \inf_{x \in K_2} x \cdot x^*.$$

So if we have a convex set $K_1$ such that there is some $x \in K_1$ with a ball $B_\delta(x) \subset K_1$, and any convex set $K_2$ (even a single point), we can separate them with a hyperplane. To use this theorem, we set $K_1 = Y$, the set of feasible production plans, and assume it is convex. Pick an efficient $y^*$ that we want to make profit maximizing. We then define the set $K_2 = \{y : y \ge y^*\}$, which contains all the plans that improve on $y^*$; apart from $y^*$ itself, these are infeasible, since they lie to the northeast of the feasible set $Y$. Note that $K_2$ contains no interior point of $K_1 = Y$, and both sets are convex. Then the Eidelheit separation theorem implies there exists a vector $p^*$ such that
$$\max_{y \in K_1 = Y} y \cdot p^* \le \min_{y \in K_2} y \cdot p^*.$$
The left-hand side of the inequality says that $y^*$ is profit-maximizing over $Y$, so we have shown a $p^*$ exists for which the efficient production plan $y^*$ is profit-maximizing. The right-hand side says that any plan satisfying $y \ge y^*$ does at least as well, but is not feasible.
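To make the separation argument concrete, here is a Python sketch for the transformation-function example $F(y_1,y_2) = y_1 - 1 + e^{\beta y_2}$ used earlier. At an efficient plan on the boundary, the gradient $\nabla F = (1, \beta e^{\beta y_2})$ is normal to $Y$ and serves as a supporting price vector; the particular $\beta$, the chosen boundary point, and the grid search are illustrative assumptions, not from the text.

```python
import math

beta = 2.0

def F(y1, y2):
    # transformation function: Y = {y : F(y) <= 0}
    return y1 - 1 + math.exp(beta * y2)

# pick an efficient plan on the boundary F(y) = 0
y2_star = -0.7
y1_star = 1 - math.exp(beta * y2_star)
print(abs(F(y1_star, y2_star)) < 1e-12)   # True: y* is on the boundary

# candidate supporting price: the gradient of F at y* (normal to the boundary)
p = (1.0, beta * math.exp(beta * y2_star))

# no feasible plan earns more than y* at these prices; since p >> 0,
# boundary plans dominate interior ones, so search along the boundary
profit_star = p[0] * y1_star + p[1] * y2_star
best = max(
    p[0] * (1 - math.exp(beta * v)) + p[1] * v
    for v in [x / 1000 - 2 for x in range(2001)]  # grid of y2 values in [-2, 0]
)
print(best <= profit_star + 1e-9)  # True: y* is profit maximizing at p
```

This is the content of the separation result in reverse: the efficient plan was chosen first, and the price vector supporting it was read off the geometry of the boundary.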
A bit more formally, and using MWG's version of the separating hyperplane theorem:

Proposition 17.7.4 (5F2) Suppose that $Y$ is convex. Then every efficient production plan $y \in Y$ is a profit-maximizing production plan for some price vector $p \ge 0$, $p \ne 0$.

Proof (i) Since $y$ is efficient, any plan $y' \gg y$ must lie outside $Y$, to the northeast, or else there would be a $y' \in Y$ with $y' \ge y$ and $y' \ne y$, so $y$ could not have been efficient. (ii) That means the set $P_y = \{y' : y' \gg y\}$ lies to the northeast of a point on the boundary of $Y$. Note that $P_y$ is a convex set as well, since if we take any two vectors $y', y'' \gg y$, their convex combination is also strictly greater than $y$. (iii) Then we have two convex sets, $Y$ and $P_y$, with $Y \cap P_y = \emptyset$. These are the hypotheses of the Separating Hyperplane Theorem: there exists a vector $p \ne 0$ for which $p \cdot y' \ge p \cdot y$ for all $y' \in P_y$ and $p \cdot y'' \le p \cdot y$ for all $y'' \in Y$. The second inequality implies that $y$ achieves at least as much profit as any other feasible $y'' \in Y$, and the first inequality implies that anything that achieves strictly higher profit must be infeasible at $p$. Therefore, any efficient $y$ is profit-maximizing for some $p$.

This is a simple version of the Second Fundamental Theorem of Welfare Economics (look up Proposition 10.D.2 on page 327 and Proposition 16.D.1 on page 553): any efficient outcome can be supported by competitive, price-taking behavior, provided that production sets and preferences are convex. Convexity generally can't be relaxed in the assumptions of the second welfare theorem, but note that it isn't assumed in the statement of the first welfare theorem.

Exercises

MWG: 5B2, 5B6, 5C3, 5C6, 5C9, 5C10, 5D1, 5D2

Chapter 18 Aggregation

We have two basic questions:

• When can we treat a collection of $j = 1, ..., J$ firms as a single firm?
• When can we treat a collection of $i = 1, ..., I$ consumers as a single consumer?

In terms of firms, we want to think of a single aggregate supply curve $y(p)$ that accurately reflects how all the individual firms behave. In terms of consumers, we want to know when
$$\sum_{i=1}^I x_i(p, w_i) = x\left(p, \sum_{i=1}^I w_i\right),$$
so that a single aggregate Walrasian demand curve captures all the information contained in the individual demand curves, and relies only on aggregate wealth.

These questions are intimately related to the idea of a "representative firm" and a "representative consumer": Can groups of agents be imagined as a single entity without loss of generality? In that case, we could think of the profit function $\pi(p)$ associated with aggregate supply $y(p)$ as describing the payoffs and behavior of a single firm rather than dealing with industries of firms. Likewise, if we could work with a single indirect utility function $v\left(p, \sum_{i=1}^I w_i\right)$ associated with the aggregate Walrasian demand curve, we could treat all the consumers as a single entity.

18.1 Firms

Let the individual supply correspondence of firm $j$ be
$$y_j^*(p) = \operatorname*{Argmax}_{y_j \in Y_j}\; p \cdot y_j,$$
where $Y_j$ is firm $j$'s production set. If we imagine that the firms are all merged into one entity, this entity would solve
$$\max_{y_1 \in Y_1, ..., y_J \in Y_J}\; \sum_{j=1}^J p \cdot y_j,$$
yielding the aggregate supply correspondence
$$y^*(p) = \operatorname*{Argmax}_{y_1 \in Y_1, ..., y_J \in Y_J}\; \sum_{j=1}^J p \cdot y_j.$$

Theorem 18.1.1 The aggregate supply correspondence equals the sum of the individual supply correspondences,
$$y^*(p) = \sum_{j=1}^J y_j^*(p),$$
which is the supply of the representative firm, and the aggregate profit function equals the sum of the individual profit functions,
$$\pi^*(p) = \sum_{j=1}^J p \cdot y_j^*(p),$$
which is the profit function of the representative firm.
Proof Note that anything that is feasible for the individual firms individually is feasible for the aggregate firm, and anything that is feasible for the aggregate firm can be achieved by having the individual firms each pick their part of the aggregate plan. If the individual firms are profit maximizing, then for all plans $y_j'$ feasible for firm $j$,
$$p \cdot y_j^*(p) \ge p \cdot y_j',$$
and summing over $j$ yields
$$\sum_{j=1}^J p \cdot y_j^*(p) \ge \sum_{j=1}^J p \cdot y_j',$$
so that the sum of the individual firms' maximized profits is an upper bound on the profits achievable by the aggregate firm. Since $(y_1^*(p), ..., y_J^*(p))$ is a feasible production plan for the aggregate firm, it must be a global maximizer. Therefore $y^*(p) = \sum_{j=1}^J y_j^*(p)$, and multiplying by $p$ yields $\pi^*(p) = \sum_{j=1}^J p \cdot y_j^*(p)$.

18.2 Consumers

Consumers are more complicated because of wealth effects, as usual. We would like to know when consumers' individual demand curves $x_1(p,w_1), x_2(p,w_2), ..., x_I(p,w_I)$ satisfy
$$\sum_{i=1}^I x_i(p,w_i) = x\left(p, \sum_{i=1}^I w_i\right),$$
where $x\left(p, \sum_i w_i\right)$ is the aggregate demand curve, which depends only on prices and aggregate wealth.

Imagine we redistribute some money among all the consumers in the economy but take none away from them in total, so that $dw = \sum_i dw_i = 0$. Then the effect of this redistribution on consumption of good $j$ is
$$\sum_{i=1}^I \frac{\partial x_{ij}(p,w_i)}{\partial w}\, dw_i.$$
If this were equal to zero for (i) all goods $j$, (ii) all initial wealth distributions $(w_1, ..., w_I)$, and (iii) all patterns of redistribution $dw$, then we could conclude that the distribution of wealth does not matter for determining demand behavior, only the total amount of wealth. This would then imply that $\sum_{i=1}^I x_i(p,w_i) = x\left(p, \sum_{i=1}^I w_i\right)$ (check that you understand this completely). This requires
$$\sum_{i=1}^I \frac{\partial x_{ij}(p,w_i)}{\partial w}\, dw_i = 0.$$
Consider consumers $k$ and $\ell$, with $dw_k + dw_\ell = 0$; since all redistributions $dw$ must be considered, we can repeat the following steps for all pairs of individuals in the economy, so the actual $k$ and $\ell$ selected are not important. Then
$$\frac{\partial x_{kj}(p,w_k)}{\partial w}\, dw_k + \frac{\partial x_{\ell j}(p,w_\ell)}{\partial w}\, dw_\ell = 0.$$
Since $dw_\ell = -dw_k$,
$$\frac{\partial x_{kj}(p,w_k)}{\partial w}\, dw_k = \frac{\partial x_{\ell j}(p,w_\ell)}{\partial w}\, dw_k$$
and
$$\frac{\partial x_{kj}(p,w_k)}{\partial w} = \frac{\partial x_{\ell j}(p,w_\ell)}{\partial w},$$
implying that the wealth effects must be identical across all individuals in the economy, for all goods, whatever their wealth levels. But that in turn implies that consumers' demands must be essentially linear in wealth (this is the key to the Gorman form).

Theorem 18.2.1 Suppose that consumers' indirect utility functions take the Gorman form,
$$v_i(p,w_i) = a_i(p) + b(p) w_i.$$
Then there exists a representative consumer with indirect utility function $v\left(p, \sum_{i=1}^I w_i\right)$ whose Walrasian demand curve satisfies
$$\sum_{i=1}^I x_i(p,w_i) = x\left(p, \sum_{i=1}^I w_i\right).$$

Proof If consumers' indirect utility functions satisfy the Gorman form, define
$$v\left(p, \sum_{i=1}^I w_i\right) = \sum_{i=1}^I \left(a_i(p) + b(p) w_i\right) = \sum_{i=1}^I a_i(p) + b(p)\sum_{i=1}^I w_i.$$
If the consumers' indirect utility functions are increasing in $w_i$, non-increasing in $p$, quasi-convex in $(p,w)$, and continuous, then $v\left(p, \sum_i w_i\right)$ will be as well, so it is a valid indirect utility function. Now, Roy's identity implies that the consumers' individual demand curves must be
$$x_i(p,w_i) = -\frac{\nabla_p a_i(p) + \nabla_p b(p)\, w_i}{b(p)},$$
which implies
$$\sum_{i=1}^I x_i(p,w_i) = -\frac{\sum_{i=1}^I \nabla_p a_i(p) + \nabla_p b(p) \sum_{i=1}^I w_i}{b(p)}.$$
Applying Roy's identity to $v(p, W)$ with $W = \sum_{i=1}^I w_i$ yields
$$x(p,W) = -\frac{\sum_{i=1}^I \nabla_p a_i(p) + \nabla_p b(p)\, W}{b(p)} = \sum_{i=1}^I x_i(p,w_i).$$
Therefore, the representative consumer with indirect utility function $v(p,W)$ has the same aggregate demand curve $x(p,W)$ as the sum of the individual Walrasian demand curves.
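A quick numerical illustration of the theorem (a sketch with hypothetical numbers): identical Cobb-Douglas consumers have indirect utility of the Gorman form with $a_i(p) = 0$ and a common $b(p)$, so aggregate demand depends only on total wealth, not on its distribution.

```python
# Identical Cobb-Douglas consumers u_i = x1^a * x2^(1-a): v_i(p, w_i) = b(p)*w_i
# (Gorman form with a_i = 0), so any redistribution leaves aggregate demand fixed.
a = 0.3
p1, p2 = 2.0, 5.0

def demand(w):
    # Walrasian demand of one consumer at wealth w
    return (a * w / p1, (1 - a) * w / p2)

dist1 = [10.0, 20.0, 70.0]   # two wealth distributions with the same total
dist2 = [50.0, 25.0, 25.0]

agg1 = [sum(d) for d in zip(*(demand(w) for w in dist1))]
agg2 = [sum(d) for d in zip(*(demand(w) for w in dist2))]
W = sum(dist1)
rep = demand(W)              # the representative consumer's demand at wealth W

print(agg1, agg2, rep)       # all three coincide
```

With heterogeneous exponents across consumers, $b(p)$ would differ by consumer, the Gorman form would fail, and the three vectors above would generally disagree.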
Now, you might ask, "can we do better?" For example, if we allow aggregate Walrasian demand to depend on more moments of the distribution of wealth, like the variance, skewness, and so on, maybe this added information will allow more general results? Of course, but there's clearly a limit. If consumers all behave differently at different wealth levels, so that individual quirkiness doesn't cancel out in the aggregate, there won't be a representative consumer. For your research, it is better either to adopt more advanced tools from industrial organization or macroeconomics to deal with heterogeneity without assuming that representative agents exist, or to focus on questions for which a representative agent can be shown to be a relatively good approximation. For example, researchers have adopted discrete choice models to study micro-level consumer behavior and do welfare analysis (see the Handbook of Econometrics chapters by McFadden, Train's book, or Anderson, de Palma and Thisse's Discrete Choice Theory of Product Differentiation). These models go many steps further in allowing comparisons of consumer behavior and incorporate broader information about the characteristics of the goods and consumers.

Exercises

Exercises: 5E1, 5E5, 4B1, 4C11, 4D2

• Show that the aggregate profit function satisfies Hotelling's Lemma and that $\nabla_p \pi^*(p) = \sum_{j=1}^J y_j^*(p)$.

• Show that the Hessian of the aggregate profit function is the sum of the derivatives of the individual firms' supply functions with respect to $p$.

• If each individual firm's cost function is $c_k(w, q_k)$, can the aggregate firm's profit maximization problem be written as
$$\max_q\; pq - \sum_j c_j(w, q_j)$$
subject to $q = \sum_j q_j$? If so, show that the solution to this problem is the same as the solution to the profit maximization problem studied above.

• Suppose consumer $i$'s utility function is $u_i(x_1, x_2) = x_1^{\alpha_i} x_2^{\beta_i}$.
For what restrictions on $(\alpha_i, \beta_i)$ do the indirect utility functions of this collection of consumers satisfy the Gorman form? What does the aggregate demand curve look like?

• Suppose consumer $i$'s utility function is $u_i(x_1,x_2) = x_1^{\alpha_i}(x_2 - \gamma_i)^{\beta_i}$, where $\alpha_i + \beta_i = 1$ for all $i$. For what restrictions on $(\alpha_i, \beta_i, \gamma_i)$ do the indirect utility functions of this collection of consumers satisfy the Gorman form? What does the aggregate demand curve look like?

• For consumers with quasi-linear utility functions, $u_i(x, m) = \phi_i(x) + m$, where $x = (x_1, ..., x_L)$ and $m$ is "money spent on other things", show that there is always a representative consumer and an aggregate demand curve for the goods $x$ that depends only on aggregate wealth.

• For a collection of consumers whose preferences satisfy the Gorman form, derive the expenditure function for the representative consumer. Derive Hicksian demand and compensating variation.

• Is the Slutsky matrix that corresponds to aggregate Walrasian demand the sum of the Slutsky matrices that correspond to individual Walrasian demand?

Part V Other Topics

Chapter 19 Choice Under Uncertainty

19.1 Probability and Random Variables

Many interesting economic situations involve uncertainty, and a theory of decision-making under risk becomes necessary to provide meaningful results. Before we look at decision-making under uncertainty, we want a simple model of randomness that doesn't involve too much detail, but does have the structure we need to think about uncertainty. Let $\Omega$ be a space of outcomes, like the pips on a die, the temperature on a given day, the color and number of a slot on a roulette wheel, or the suit and rank of a card drawn from a deck of cards. This is just a set; it doesn't have to be numerical, ordered, or anything else.
Then we define three objects to describe an uncertain outcome:

Definition 19.1.1 A probability space is a triple $(\Omega, \mathcal{E}, p)$ such that
• $\Omega$ is a set of outcomes;
• $\mathcal{E}$ is a set of events: $\mathcal{E}$ is a set of subsets of $\Omega$ such that (i) $\emptyset$ and $\Omega$ are in $\mathcal{E}$, (ii) if $E \in \mathcal{E}$, then $E^c \in \mathcal{E}$, and (iii) if $E_1, E_2, \ldots$ are events in $\mathcal{E}$, then $\cup_i E_i \in \mathcal{E}$ and $\cap_i E_i \in \mathcal{E}$;
• $p$ is a probability distribution: $p : \mathcal{E} \to [0,1]$ such that $p(\Omega) = 1$, $p(\emptyset) = 0$, and if $E_1, E_2, \ldots, E_n, \ldots$ is a countable collection of pairwise disjoint events, then $p(\cup_i E_i) = \sum_i p(E_i)$.

A probability space can describe any random environment: for example, the various things that can happen when you roll a die, like getting six dots, an even or an odd number of dots, or one, two or five dots. Similarly, you could reach into a jar of red and green balls, and the outcomes are R or G. The definition of a probability space puts just enough structure on the outcomes and events so that any resulting (conditional or marginal) probability distribution will be well-defined, and we can take the kinds of limits we want to.

Example A household might have a concave utility function $u(c_t)$ for consumption in each period, but face a sequence of random wages $y_t$ and prices $p_t$. It can save assets $a_t$ at a risk-free interest rate $r$. Then the household wants to solve
$$U(y_0, p_0) = \lim_{T \to \infty} E_{y_t, p_t}\left[\, \sum_{t=0}^{T} u(c_t) \;\Big|\; y_0, p_0 \right]$$
subject to $a_{t+1} = (1+r)(a_t + y_t - p_t c_t)$, $c_t \geq 0$. See how even computing $U(y_0, p_0)$ is an involved mathematical process? We want to take a limit of an expected sum, but that requires interchanging the limit and the expectation (which is itself a kind of limit). For such manipulations to be well-defined, your probability space has to have at least the properties given above.

For quantitative analysis, we define a random variable, mostly separate from the probability space:

Definition 19.1.2 A random variable is a function $X : \Omega \to R$, where $R \subset \mathbb{R}$.
Note that a random variable $X : \Omega \to R$ is just a function that assigns real numbers to each of the outcomes. There is no such thing as a "normal random variable": there are random variables that are normally distributed. The variable is just the mapping from outcomes to real numbers, and depends on the application. In fact, any probability space can have many random variables defined on it; which ones are interesting is driven by the economic application at hand.

Example Consider rolling a die. Then the following are probability spaces and random variables:

• The random variable is the number of realized pips: $\Omega = \{1,2,3,4,5,6\}$, the set of events $\mathcal{E}$ is the set of all subsets of $\Omega$, $p(k) = 1/6$ for each outcome $k$, and $X(\omega) = \omega$.

• The random variable is whether or not the die roll is even: $\Omega = \{1,2,3,4,5,6\}$, the set of events is $\mathcal{E} = \{\emptyset, \Omega, O, E\}$ where $O = \{1,3,5\}$ and $E = \{2,4,6\}$, $p(O) = p(E) = 1/2$, and $X(E) = 1$ and $X(O) = 0$.

• The random variable is whether a six is rolled: $\Omega = \{1,2,3,4,5,6\}$, the set of events is $\mathcal{E} = \{\emptyset, \Omega, A, S\}$ where $A = \{1,2,3,4,5\}$ and $S = \{6\}$, $p(A) = 5/6$ and $p(S) = 1/6$, and $X(S) = 1$ and $X(A) = 0$.

Clearly, there are many probability spaces and random variables that we might construct from the idea of a die roll. If we are betting on the outcome 6, the last version is the one of interest. If we care about the average number of pips, the first version is best.

Example For flipping two fair coins, consider the probability space
• $\Omega = \{HH, HT, TH, TT\}$
• $\mathcal{E}$ is the set of all subsets of $\Omega$
• $p$ assigns each singleton probability $1/4$, so $p(HH) = p(HT) = p(TH) = p(TT) = 1/4$, and the probability of any event is the sum over its outcomes: for example, $p(\{HH, HT\}) = p(\{TH, TT\}) = p(\{HT, TH\}) = p(\{HH, TT\}) = 1/2$ and $p(\{HH, TH, HT\}) = p(\{HH, TT, TH\}) = p(\{TT, TH, HT\}) = p(\{HH, TT, HT\}) = 3/4$, with $p(\Omega) = 1$ and $p(\emptyset) = 0$.

This is a complete probabilistic characterization of flipping two coins.
For instance, we can ask, "What's the probability at least one coin is heads?" That's $p(\{HT, TH, HH\}) = \frac{3}{4}$. Or we can ask, "What's the probability of getting no heads?" That's $p(\{HT, TH, HH\}^c) = p(TT) = \frac{1}{4}$.

We mostly use continuous random variables in economics, where $\Omega = (a, b) \subset \mathbb{R}$. Then $\mathcal{E}$ is generated by the open subsets of $\mathbb{R}$, and we define the cumulative distribution function as
$$F(x) = p\left(\{\omega \in \Omega : X(\omega) \leq x\}\right).$$
For example,
$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^2} dz$$
is the probability that a normally distributed random variable is less than $x$. Or, for a variable uniformly distributed between 0 and 1,
$$F(x) = \begin{cases} 0 & x < 0 \\ x & x \in [0,1] \\ 1 & x > 1 \end{cases}$$
The derivatives of these distribution functions are density functions, or probability density functions,
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{or} \quad f(x) = \begin{cases} 1 & x \in [0,1] \\ 0 & \text{otherwise} \end{cases}$$
The most common thing to do with a random variable is take its expected value,
$$E[X] = \int_{\omega \in \Omega} x(\omega)\, dP(\omega) = \begin{cases} \int_{\omega \in \Omega} x(\omega) p(\omega)\, d\omega & R \text{ continuous} \\ \sum_{\omega \in \Omega} p(\omega) x(\omega) & R \text{ discrete} \end{cases}$$
If a function is first applied to $X(\omega)$ to get $u(X(\omega))$, this is written
$$E[u(X)] = \int_{\omega \in \Omega} u(x(\omega))\, dP(\omega) = \begin{cases} \int_{\omega \in \Omega} u(x(\omega)) p(\omega)\, d\omega & R \text{ continuous} \\ \sum_{\omega \in \Omega} p(\omega) u(x(\omega)) & R \text{ discrete} \end{cases}$$
There are two key facts to remember when working with economic problems and random variables.

Theorem 19.1.3 (Jensen's Inequality) If $u()$ is a concave function, then $E[u(X)] \leq u(E[X])$, and if $u()$ is a convex function, then $E[u(X)] \geq u(E[X])$.

Proposition 19.1.4 Suppose that $X$ is distributed $F$ with density $f$ on $[a,b]$ and $u()$ is normalized so that $u(a) = 0$. Then
$$E[u(X)] = \int_a^b u'(x)(1 - F(x))\, dx$$
This is because
$$E[u(X)] = \int_a^b u(x) f(x)\, dx = -\int_a^b u(x)\, d\{1 - F(x)\} = -\left[(1 - F(x))u(x)\right]_a^b + \int_a^b u'(x)(1 - F(x))\, dx$$
and $1 - F(b) = 0$ while $u(a) = 0$. This is really helpful because we often have information about $F$ and $u'()$, but not about $f$ (for example, look up the definition of first-order stochastic dominance).
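Proposition 19.1.4 is easy to check numerically. Below is a quick sketch (my own, not from the notes; the midpoint-rule helper is a simple stand-in for a quadrature library) comparing the direct formula $\int u(x)f(x)\,dx$ with the survival-function formula $\int u'(x)(1-F(x))\,dx$ for a uniform random variable and $u(x) = \sqrt{x}$, where both should equal $2/3$:

```python
import math

# Midpoint-rule numerical integration of g over [a, b].
def midpoint_integral(g, a, b, n=100000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

u = math.sqrt                          # Bernoulli utility with u(0) = 0
u_prime = lambda x: 0.5 / math.sqrt(x)
f = lambda x: 1.0                      # density of X ~ Uniform[0, 1]
F = lambda x: x                        # cdf of X ~ Uniform[0, 1]

# E[u(X)] two ways: directly, and via u'(x)(1 - F(x)) as in Prop. 19.1.4.
direct = midpoint_integral(lambda x: u(x) * f(x), 0.0, 1.0)
via_survival = midpoint_integral(lambda x: u_prime(x) * (1.0 - F(x)), 0.0, 1.0)
```

Both integrals come out near $2/3$, exactly as the integration-by-parts argument predicts.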
Note that if you have two random variables $X$ and $Y$ on the same probability space $(\Omega, \mathcal{E}, p)$, then
$$E[X + Y] = \int_{\omega \in \Omega} \left(x(\omega) + y(\omega)\right) dP(\omega) = \int_{\omega \in \Omega} x(\omega)\, dP(\omega) + \int_{\omega \in \Omega} y(\omega)\, dP(\omega) = E[X] + E[Y]$$
$$E[\alpha X] = \int_{\omega \in \Omega} \alpha x(\omega)\, dP(\omega) = \alpha \int_{\omega \in \Omega} x(\omega)\, dP(\omega) = \alpha E[X]$$
so that the expectation operator is additive and linear. However, the following is not, in general, true:
$$E[XY] = \int_{\omega \in \Omega} x(\omega) y(\omega)\, dP(\omega) \neq E[X]E[Y]$$
For example, the covariance of $X$ and $Y$ is defined as
$$E[(X - E[X])(Y - E[Y])] = E[XY - X E[Y] - Y E[X] + E[X]E[Y]] = E[XY] - E[X]E[Y]$$
so that $E[XY] = \mathrm{cov}(X, Y) + E[X]E[Y]$, and the expectation can be factored as $E[XY] = E[X]E[Y]$ only if $\mathrm{cov}(X, Y) = 0$, which is a special case.¹ Two random variables $X$ and $Y$ are independent iff their joint density factors as $f(x, y) = f_X(x) f_Y(y)$, so that
$$E[XY] = \int_x \int_y xy f(x, y)\, dy\, dx = \left(\int_x x f_X(x)\, dx\right)\left(\int_y y f_Y(y)\, dy\right) = E[X]E[Y]$$

19.2 Choice Theory with Uncertainty

Now we want to develop a theory of choice similar to what we had for goods, a rational preference relation $\succsim$, but for situations where outcomes are random. Suppose there is a set of final outcomes $C_1, C_2, \ldots, C_N$ that are non-stochastic, like the prizes in a lottery. We then have the set of probability distributions over $\{C_1, C_2, \ldots, C_N\}$, which is a simplex, and is written $\mathcal{L}$ as the space of "lotteries". We then say $L \succsim L'$ if the agent prefers to face lottery $L$ rather than lottery $L'$.

Definition 19.2.1 A simple lottery $L$ is a list $L = (p_1, p_2, \ldots, p_N)$ with $p_n \geq 0$ for all $n$ and $\sum_n p_n = 1$, where $p_n$ is the probability that outcome $n$ occurs. Given $K$ simple lotteries $L_k = (p_1^k, p_2^k, \ldots, p_N^k)$, $k = 1, 2, \ldots, K$, and weights $\alpha_1, \alpha_2, \ldots, \alpha_K$ with $\sum_k \alpha_k = 1$, the compound lottery $(L_1, L_2, \ldots, L_K; \alpha_1, \alpha_2, \ldots, \alpha_K)$ is the risky alternative that yields the simple lottery $L_k$ with probability $\alpha_k$.
¹ And also $E[XX] = \mathrm{cov}(X, X) + E[X]E[X]$, so that $E[X^2] = \mathrm{var}(X) + E[X]^2$.

So a compound lottery is a lottery with probabilities $\alpha_1, \alpha_2, \ldots, \alpha_K$ over outcomes, where each outcome is itself another lottery. For example, as a high school senior you might be uncertain about where you'll be admitted to college, which will then lead to a lottery over first jobs or graduate schools, which then leads to a lottery over careers. At each step, some uncertainty is resolved, but a lot of uncertainty remains. The reduced lottery $L^*$ is obtained from the compound lottery $(L_1, L_2, \ldots, L_K; \alpha_1, \alpha_2, \ldots, \alpha_K)$ by setting
$$p_n^* = \alpha_1 p_n^1 + \alpha_2 p_n^2 + \cdots + \alpha_K p_n^K$$
So instead of worrying about the intermediate uncertainty of college and first jobs or grad school, we combine the probabilities over all the paths to generate a simple lottery over final careers. If we suppose the decision-maker only cares about the final outcomes, $C = \{C_1, C_2, \ldots, C_N\}$, then this is a reasonable way to describe many situations.

In particular, let $\mathcal{L}$ be the set of all simple lotteries over the set $C$. This is called a probability simplex, which is created by taking the $N$-dimensional vector space generated by the $N$ basis vectors $e_n = (0, \ldots, 0, 1, 0, \ldots, 0)$, where the 1 occurs in the $n$-th spot, and considering the set of lotteries
$$\mathcal{L} = \left\{ p = p_1 e_1 + p_2 e_2 + \cdots + p_N e_N : \sum_n p_n = 1,\; 0 \leq p_n \leq 1 \right\}$$
Each basis vector $e_n$ is then the lottery yielding outcome $C_n$ for sure, and each element $L \in \mathcal{L}$ is some mixture of these sure outcomes. These lotteries are what the decision-maker has preferences over, rather than the sure outcomes themselves.

Definition 19.2.2 (6B3) The rational preference relation $\succsim$ on $\mathcal{L}$ is continuous if, for all $L, L', L'' \in \mathcal{L}$, the sets $\{\alpha \in [0,1] : \alpha L + (1-\alpha)L' \succsim L''\}$ and $\{\alpha \in [0,1] : L'' \succsim \alpha L + (1-\alpha)L'\}$ are closed.

Think of it this way: there is a sequence $\alpha_n \to \bar\alpha$ slowly shifting weight from $L'$ to $L$.
Then if the terms of the sequence of compound lotteries $\alpha_n L + (1 - \alpha_n)L'$ are preferred to $L''$ for all $n$, the limiting lottery $\bar\alpha L + (1 - \bar\alpha)L'$ should also be preferred to $L''$. So if we perturb a lottery very slightly, it doesn't reverse the agent's preferences. The continuity axiom also guarantees that there exists a continuous function $U(L)$ that represents the preferences $\succsim$ on $\mathcal{L}$. What we would like, however, is for $U(L)$ to look essentially like an expected value.

Definition 19.2.3 (6B4) The preference relation $\succsim$ on $\mathcal{L}$ satisfies the independence axiom if for all $L, L', L'' \in \mathcal{L}$ and $\alpha \in (0,1)$,
$$L \succsim L' \iff \alpha L + (1-\alpha)L'' \succsim \alpha L' + (1-\alpha)L''$$
So if we have two lotteries $L$ and $L'$ with $L \succsim L'$ and we mix in the possibility of a third, $L''$, the preference ordering cannot reverse. This axiom is probably violated systematically and often, and many choice theorists think about how to relax it without losing the desirable parts of choice theory. The most desirable part of this theory deals with the form and existence of utility functions. For choice sets $X$, it was true that any continuous, rational preference relation could be represented by a continuous utility function $u()$. For these lotteries, there is a similar result.

Definition 19.2.4 (6B5) The utility function $U : \mathcal{L} \to \mathbb{R}$ has the expected utility form if there is an assignment of numbers $(u_1, u_2, \ldots, u_N)$ to the $N$ outcomes in $C$ such that for every simple lottery $L = (p_1, \ldots, p_N) \in \mathcal{L}$, we have
$$U(L) = p_1 u_1 + p_2 u_2 + \cdots + p_N u_N$$
Such a utility function is called a von Neumann-Morgenstern expected utility function.

This is mathematically a nice form, because $U(L) = E[u_n] = \sum_n p_n u_n$, so the decision-maker's preferences can be represented as an expectation. That is the main benefit of this form.
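To make the definition concrete, here is a small sketch (my own, with made-up numbers) that evaluates a vNM expected utility function on a simple lottery and on the reduction of a compound lottery:

```python
# Utilities u_1, u_2, u_3 of the sure outcomes C_1, C_2, C_3 (hypothetical).
u = [0.0, 1.0, 4.0]

def U(L):
    """Expected utility form: U(L) = p_1 u_1 + ... + p_N u_N."""
    return sum(p * un for p, un in zip(L, u))

def reduce_lottery(lotteries, weights):
    """Reduced lottery of (L_1,...,L_K; a_1,...,a_K): p*_n = sum_k a_k p^k_n."""
    return [sum(a * L[n] for a, L in zip(weights, lotteries))
            for n in range(len(lotteries[0]))]

L1 = [0.5, 0.5, 0.0]                 # two simple lotteries over the outcomes
L2 = [0.0, 0.25, 0.75]
reduced = reduce_lottery([L1, L2], [0.4, 0.6])
```

Evaluating $U$ on the reduced lottery gives the same number as averaging $U(L_1)$ and $U(L_2)$ with the compound weights, which is exactly the linearity property of the next proposition.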
There is also substantial evidence that in a statistical comparison between this and other theories, such as prospect theory, the gains are not very large (prospect theory replaces $\int_x p(x)u(x)\,dx$ with $\int_x \phi(p(x))u(x)\,dx$, so that agents have "non-linear" sensitivities to the probability weights). The next proposition is very important:

Proposition 19.2.5 (6B1) A utility function $U : \mathcal{L} \to \mathbb{R}$ has the expected utility form iff it is linear:
$$U\left(\sum_k \alpha_k L_k\right) = \sum_k \alpha_k U(L_k)$$
for any $K$ lotteries $L_k \in \mathcal{L}$ and probabilities $(\alpha_1, \alpha_2, \ldots, \alpha_K)$ with $\sum_k \alpha_k = 1$.

To put this result in perspective, think about the similar result that "a preference relation can be represented by a continuous utility function $u()$ only if $\succsim$ is rational and continuous." The above proposition says that preferences have the expected utility form if and only if we can move seamlessly between one level of compound lotteries and the next. For example, consider a situation with two certain outcomes $C = \{c_1, c_2\}$. If $U$ has the expected utility form, then $U(\alpha_1 L_1 + \alpha_2 L_2) = \alpha_1 U(L_1) + \alpha_2 U(L_2)$ for any probability distribution $(\alpha_1, \alpha_2)$. But then note that the terms $U(L_1)$ and $U(L_2)$ appear in the new expression. Suppose that $L_1$ corresponds to the lottery $(\alpha_1', \alpha_2')$ over two new lotteries over the certain outcomes, $L_3$ and $L_4$, and $L_2$ corresponds to the lottery $(\alpha_1'', \alpha_2'')$ over two more new lotteries, $L_5$ and $L_6$. Then
$$U(\alpha_1 L_1 + \alpha_2 L_2) = \alpha_1 \alpha_1' U(L_3) + \alpha_1 \alpha_2' U(L_4) + \alpha_2 \alpha_1'' U(L_5) + \alpha_2 \alpha_2'' U(L_6)$$
We can keep "drilling down" until we reach the lotteries $e_1$ and $e_2$ that correspond to the certain outcomes. For example, let $L_3 = L_5 = e_1$ and $L_4 = L_6 = e_2$. Then
$$U(\alpha_1 L_1 + \alpha_2 L_2) = (\alpha_1 \alpha_1' + \alpha_2 \alpha_1'')U(e_1) + (\alpha_1 \alpha_2' + \alpha_2 \alpha_2'')U(e_2)$$
This is incredibly valuable, since now the expected utility of the agent depends only on the value of the certain outcomes and the probabilities that are generated by the above "drilling down" process.

Proof The expected utility form allows us to write
$$U\left(\sum_k \alpha_k L_k\right) = \sum_n u_n \left(\sum_k \alpha_k p_n^k\right)$$
just as the "drilling down" process above illustrates. Our job is to show that the linear form is equivalent to the expected utility form.

Suppose $U$ is linear, so that $U(\sum_k \alpha_k L_k) = \sum_k \alpha_k U(L_k)$. Any lottery $L = (p_1, \ldots, p_N)$ can be written as the compound lottery $\sum_n p_n e_n$ over the sure outcomes, so linearity gives
$$U(L) = U\left(\sum_n p_n e_n\right) = \sum_n p_n U(e_n) = \sum_n p_n u_n$$
and $U$ has the expected utility form. Suppose now that $U$ has the expected utility form. Then
$$U\left(\sum_k \alpha_k L_k\right) = \sum_n u_n \left(\sum_k \alpha_k p_n^k\right) = \sum_k \alpha_k \left(\sum_n u_n p_n^k\right) = \sum_k \alpha_k U(L_k)$$

Monotone transformations of ordinal utility functions induced the same orderings over the options: if an agent's preferences could be represented by a continuous utility function $u(x)$ and $g()$ was a monotone increasing function, then $g(u(x))$ represented the same preferences. This is not true for expected utility. The above proposition shows that a utility function has the expected utility form iff it is linear, and only affine transformations of linear functions are still linear:
$$\beta U(L) + \gamma = \beta \left(\sum_n u_n p_n\right) + \gamma = \sum_n (\beta u_n + \gamma) p_n = \sum_n \tilde u_n p_n = \tilde U(L)$$
using $\sum_n p_n = 1$. What this manipulation shows is that a positive affine transformation of an expected utility function simply changes the units of the utility of the "certain" outcomes, like going from Fahrenheit to Celsius. This is a cardinal property, not an ordinal one.

Proposition 19.2.6 (6B2) Suppose that $U : \mathcal{L} \to \mathbb{R}$ is an expected utility function for the preference relation $\succsim$ on $\mathcal{L}$. Then $\tilde U : \mathcal{L} \to \mathbb{R}$ is another expected utility function for $\succsim$ iff $\tilde U(L) = \beta U(L) + \gamma$ for some scalars $\beta > 0$ and $\gamma$, for all $L \in \mathcal{L}$.

Proof Choose two lotteries $\bar L$ and $\underline{L}$ with $\bar L \succsim L \succsim \underline{L}$ for all $L \in \mathcal{L}$. If $\bar L \sim \underline{L}$, then all the lotteries are equivalent and the result is trivially true, so suppose that $\bar L \succ \underline{L}$. First, suppose that $U(L)$ is an expected utility function and $\tilde U(L) = \beta U(L) + \gamma$. Then
$$\tilde U\left(\sum_k \alpha_k L_k\right) = \beta U\left(\sum_k \alpha_k L_k\right) + \gamma = \beta \sum_k \alpha_k U(L_k) + \gamma = \sum_k \alpha_k \left[\beta U(L_k) + \gamma\right] = \sum_k \alpha_k \tilde U(L_k)$$
again using $\sum_k \alpha_k = 1$. Since this function $\tilde U$ satisfies Proposition 6B1, it has the expected utility form.

Now, suppose that $U$ and $\tilde U$ both have the expected utility form; we'll construct constants $\beta$ and $\gamma$ with the desired properties. For any $L$, define
$$\lambda = \frac{U(L) - U(\underline{L})}{U(\bar L) - U(\underline{L})}$$
so that $U(L) = \lambda U(\bar L) + (1 - \lambda)U(\underline{L})$. Since $U$ has the expected utility form, $U(\lambda \bar L + (1 - \lambda)\underline{L}) = \lambda U(\bar L) + (1 - \lambda)U(\underline{L}) = U(L)$, so the agent is indifferent between $L$ and $\lambda \bar L + (1 - \lambda)\underline{L}$. Then
$$\tilde U(L) = \tilde U\left(\lambda \bar L + (1 - \lambda)\underline{L}\right) = \lambda \tilde U(\bar L) + (1 - \lambda)\tilde U(\underline{L}) = \lambda\left[\tilde U(\bar L) - \tilde U(\underline{L})\right] + \tilde U(\underline{L})$$
Substituting in $\lambda$ yields
$$\tilde U(L) = \frac{U(L) - U(\underline{L})}{U(\bar L) - U(\underline{L})}\left[\tilde U(\bar L) - \tilde U(\underline{L})\right] + \tilde U(\underline{L}) = \beta U(L) + \gamma$$
where
$$\beta = \frac{\tilde U(\bar L) - \tilde U(\underline{L})}{U(\bar L) - U(\underline{L})}, \qquad \gamma = \tilde U(\underline{L}) - \beta U(\underline{L})$$
People often express this by saying that "the set of vNM preferences is closed under positive affine transformations". So the previous two propositions say that if agents have preferences that take the expected utility form, then we can represent them as
$$U(L) = \sum_n (\beta u_n + \gamma) p_n$$
where $u_n$ is the utility of the certain outcome $c_n$, and $(\beta, \gamma)$ is any positive affine transformation. But what restrictions on preferences guarantee that a utility function exists that has this kind of form?

Proposition 19.2.7 (Expected Utility Theorem, 6B3) Suppose that the rational preference relation $\succsim$ on the space of lotteries $\mathcal{L}$ satisfies the continuity and independence axioms. Then $\succsim$ can be represented by an expected utility function, so that for any two reduced lotteries $L = (p_1, p_2, \ldots, p_N)$ and $L' = (p_1', p_2', \ldots, p_N')$,
$$L \succsim L' \iff \sum_n p_n u_n \geq \sum_n p_n' u_n$$
The proof is long and includes many steps, so I'll omit parts that are straightforward to prove (see p. 176 in MWG for the whole proof).
It is very similar to the regular proof that a utility function exists, with the only difference being that the "index of preference" constructed in the real numbers appears in step three, where it is shown that for any lotteries $L' \succ L \succ L''$, there exists an $\alpha_L$ such that $\alpha_L L' + (1 - \alpha_L)L'' \sim L$. The $\alpha_L$'s become the index of preference of the intermediate lotteries.

Proof Suppose that there are best and worst lotteries in $\mathcal{L}$, $\bar L$ and $\underline{L}$ respectively, so that $\bar L \succsim L \succsim \underline{L}$ for all $L \in \mathcal{L}$.

• First note that if $L \succ L'$, then $L \succ \alpha L + (1 - \alpha)L' \succ L'$ for $\alpha \in (0,1)$. This follows from the independence axiom.

• Let $\alpha, \beta \in [0,1]$. Then $\beta \bar L + (1 - \beta)\underline{L} \succ \alpha \bar L + (1 - \alpha)\underline{L}$ iff $\beta > \alpha$. (This follows from the previous step, which follows from the independence axiom.)

• For any $L \in \mathcal{L}$, there is a unique $\alpha_L \in [0,1]$ such that $\alpha_L \bar L + (1 - \alpha_L)\underline{L} \sim L$. Existence follows from the continuity axiom, and uniqueness comes from the previous step.

• Consider the function $U : \mathcal{L} \to \mathbb{R}$ that assigns $U(L) = \alpha_L$; we'll show that it represents $\succsim$. The previous step implies that for any two lotteries $L, L' \in \mathcal{L}$, we have $L \succsim L'$ iff $\alpha_L \bar L + (1 - \alpha_L)\underline{L} \succsim \alpha_{L'} \bar L + (1 - \alpha_{L'})\underline{L}$. By step 2 (hence the independence axiom) we have it that $L \succsim L'$ iff $\alpha_L \geq \alpha_{L'}$.

• Finally, we show that $U(L) = \alpha_L$ is linear (so it has the expected utility form). Take $L \sim U(L)\bar L + (1 - U(L))\underline{L}$ and $L' \sim U(L')\bar L + (1 - U(L'))\underline{L}$. Then the independence axiom (used twice) implies
$$\beta L + (1 - \beta)L' \sim \beta\left[U(L)\bar L + (1 - U(L))\underline{L}\right] + (1 - \beta)\left[U(L')\bar L + (1 - U(L'))\underline{L}\right]$$
$$= \left[\beta U(L) + (1 - \beta)U(L')\right]\bar L + \left[1 - \beta U(L) - (1 - \beta)U(L')\right]\underline{L}$$
Consequently we can interpret $\beta U(L) + (1 - \beta)U(L')$ as the probability of $\bar L$, and $1 - \beta U(L) - (1 - \beta)U(L')$ as the probability of $\underline{L}$, when facing the compound lottery $\beta L + (1 - \beta)L'$. This implies
$$\beta U(L) + (1 - \beta)U(L') = U(\beta L + (1 - \beta)L')$$
So for any two lotteries $L, L'$, the utility function has the expected utility form.
So the first two steps of the proof establish a relationship between preferences over compound lotteries $\alpha \bar L + (1 - \alpha)\underline{L}$: namely, that lotteries with a higher $\alpha$ are better. Then in step 3, we show that any lottery can be uniquely assigned an $\alpha_L$. This sets up a way of comparing lotteries $L, L'$ by comparing $\alpha_L$ with $\alpha_{L'}$ that is consistent with the underlying preferences $\succsim$. Steps four and five then show the results we want: that independence and continuity imply linearity and the existence of a linear utility function representing $\succsim$. Note that the continuity axiom is only used once, to establish a unique correspondence between lotteries $L$ and index values $\alpha_L$. On the other hand, the independence axiom is used many times. So while the continuity axiom is mostly a technical axiom used in a technical way, the independence axiom has a strong behavioral interpretation and is used repeatedly to "break up" lotteries so that they can be reduced into simpler ones. Without independence, it's clear that many parts of the above proof would no longer work.

Now, what do you need to internalize from this discussion?

• The theory of choice under uncertainty depends on the independence and continuity axioms, which are mathematically convenient but not rock solid behaviorally. However, these allow you to write an agent's expected payoff as a vNM expected utility function.

• The phrase "(von Neumann-Morgenstern) expected utility" refers to expected utility functions of the form $\sum_n p_n u_n$ or $\int p(n)u(n)\,dn$, where $u_n$ or $u(n)$ is the utility of getting outcome $n$ with certainty.

• The set of vNM expected utility functions is closed under transformations of the kind $\beta U(L) + \gamma = \beta \sum_n p_n u_n + \gamma = \sum_n p_n(\beta u_n + \gamma)$, $\beta > 0$. The utility function is linear in the probability weights $p_n$.
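The last two bullets are easy to verify numerically. The sketch below (my own toy numbers) checks that a vNM utility function is linear in compound lotteries, and that a positive affine transformation preserves the ranking of lotteries:

```python
u = [0.0, 2.0, 5.0]                       # utilities of the sure outcomes
U = lambda L: sum(p * un for p, un in zip(L, u))

# Linearity: U(a L1 + (1-a) L2) = a U(L1) + (1-a) U(L2).
L1, L2 = [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]
a = 0.35
mix = [a * p + (1 - a) * q for p, q in zip(L1, L2)]
linearity_gap = U(mix) - (a * U(L1) + (1 - a) * U(L2))

# A positive affine transformation beta*U + gamma (beta > 0) ranks the
# same set of lotteries in the same order.
beta, gamma = 3.0, -7.0
U_tilde = lambda L: beta * U(L) + gamma
lotteries = [[1, 0, 0], [0.2, 0.5, 0.3], [0, 0.3, 0.7], [0.1, 0.8, 0.1]]
ranking = sorted(range(4), key=lambda i: U(lotteries[i]))
ranking_tilde = sorted(range(4), key=lambda i: U_tilde(lotteries[i]))
```

The linearity gap is zero up to floating-point error, and the two rankings coincide.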
19.3 Risk Aversion

The following kinds of situations are very common in economics:

• There is a lottery at the fair, where for an entry fee $e$ you can drop a business card into a bucket; at the end of the night the lottery-owner draws one business card, and that player wins half of the proceeds. If $N$ people participate, playing gives expected utility
$$E[U] = \frac{1}{N}\, u\!\left(\frac{Ne}{2} - e\right) + \frac{N-1}{N}\, u(-e)$$
and not playing gives expected utility $E[U] = u(0)$.

• A risk-averse agent with utility function over money $u()$ is deciding whether or not to buy homeowner's insurance. He has wealth $w$, a policy costs $t$ and pays out $r$ in the case of an accident, the loss from an accident is $L$, and the probability of an accident is $p$. With the policy,
$$E[U] = p\,u(w - L - t + r) + (1 - p)u(w - t)$$
while his expected utility without the policy is
$$E[U] = p\,u(w - L) + (1 - p)u(w)$$

• A risk-neutral agent is competing in a first-price auction for a good for which he has value $v$: he submits a sealed bid $b$, and if his bid is the highest of all those submitted, he wins the good and pays his bid $b$. Then there are two outcomes: winning, which yields a payoff of $v - b$, and losing, which yields a payoff of zero. Suppose that the agent wins the auction with probability $\phi(b)$ if he bids $b$. Then
$$U(v, b) = \phi(b)(v - b) + (1 - \phi(b)) \cdot 0 = \phi(b)(v - b)$$

But now we have the right tools to analyze these formally. Namely, we consider outcomes as realizations of a random variable $X : \Omega \to \mathbb{R}$, where $(\Omega, \mathcal{E}, p)$ is a probability space, and the agent gets certain utility $u(x(\omega))$ from each of the outcomes $\omega \in \Omega$. The function $u()$ is called the agent's Bernoulli utility function.
Then each lottery $L$ is a probability distribution over a set $\Omega$, and the agent's expected utility is
$$U(L) = E[u(X)] = \int_\omega u(x(\omega))\, dP(\omega)$$
In the discrete case, this becomes
$$U(L) = E[u(X)] = \sum_\omega u(x(\omega)) p(\omega)$$
and in the case where $\Omega$ is a one-dimensional subset of $\mathbb{R}$,
$$U(L) = E[u(X)] = \int u(x) f(x)\, dx$$
And because of the work in the previous section, this is all well-defined in terms of there being a rational, continuous preference relation satisfying the independence axiom that gives rise to $U(L)$.

Definition 19.3.1 (6C1) A decision maker is risk averse if for any lottery $F()$, $E[u(X)] < u(E[X])$; risk neutral if for any lottery $F()$, $E[u(X)] = u(E[X])$; and risk inclined or risk seeking if for any lottery $F()$, $E[u(X)] > u(E[X])$.

It follows immediately from Jensen's inequality that if $u()$ is concave, the agent is risk averse; if $u()$ is linear, the agent is risk neutral; and if $u()$ is convex, the agent is risk inclined.

Example Let $X$ be distributed $F(x) = x^\alpha$ on $[0,1]$, and suppose the agent has preferences $u(x) = x^\beta$ (so that $\beta > 1$ implies convexity and $\beta < 1$ implies concavity). Then his expected utility is
$$U(L) = \int_0^1 x^\beta\, dx^\alpha = \int_0^1 \alpha x^{\beta + \alpha - 1}\, dx = \frac{\alpha}{\alpha + \beta}$$
and the expectation of $X$ is
$$E[X] = \int_0^1 x\, dx^\alpha = \int_0^1 \alpha x^\alpha\, dx = \frac{\alpha}{1 + \alpha}$$
so that
$$U(L) < u(E[X]) \iff \frac{\alpha}{\alpha + \beta} < \left(\frac{\alpha}{1 + \alpha}\right)^\beta$$
and the agent is risk averse iff $\beta < 1$. If $\beta = 1$, he is risk neutral, and if $\beta > 1$, he is risk inclined.

Example Let $X$ be a random variable with mean $m$ and variance $\sigma^2$. If an agent has quadratic preferences
$$u(x) = x - \frac{\gamma}{2}x^2$$
then his expected utility is (see the footnote a few pages back)
$$U(L) = E\left[x - \frac{\gamma}{2}x^2\right] = m - \frac{\gamma}{2}\left(\sigma^2 + m^2\right)$$
The agent is risk averse if $U(L) < u(E[X])$, or
$$m - \frac{\gamma}{2}\left(\sigma^2 + m^2\right) < m - \frac{\gamma}{2}m^2$$
which is true as long as $\gamma > 0$. If $\gamma = 0$, he is risk neutral, and if $\gamma < 0$, he is risk inclined.

Example Suppose we have an agent with wealth $w$ who faces a loss $D$ with probability $p$, and no loss with probability $1 - p$.
The agent can buy insurance: for each unit of coverage he must pay $q$, and he receives a payout of 1 per unit if the loss occurs, so that with $\alpha$ units his ex post wealth is $w - \alpha q - D + \alpha$ in the event of a loss and $w - \alpha q$ if no loss occurs. You might expect him to under- or over-insure because of the risk aversion. He maximizes
$$\max_\alpha\; p\,u(w - \alpha q - D + \alpha) + (1 - p)u(w - \alpha q)$$
Whenever $\alpha^* > 0$, the first-order condition holds:
$$0 = p(1 - q)u'(w - \alpha^* q - D + \alpha^*) - (1 - p)q\,u'(w - \alpha^* q)$$
This equates the marginal utility of consumption across the two states. Suppose that $q = p$, so that the price of insurance is equal to the expected cost; you can imagine perfectly competitive insurance companies undercutting each other until they get to the point where the price is the expected marginal cost. Then
$$u'(w - \alpha^* p - D + \alpha^*) = u'(w - \alpha^* p)$$
Since $u''() < 0$, the function $u'()$ is strictly monotone, and we can invert it to get
$$w - \alpha^* p - D + \alpha^* = w - \alpha^* p$$
or $\alpha^* = D$. So the optimal amount of coverage exactly offsets the loss, and the agent fully insures.

Example There is an agent with an increasing, concave Bernoulli utility function $u(x)$ and initial wealth $w_0$. There is a safe asset that yields a return of 1 with certainty, and a risky asset that yields a return $H > q$ with probability $p$ and $L < q$ with probability $1 - p$. The price of the risky asset is $q$, the amount of the risky asset purchased is $x$, and the amount of the safe asset purchased is $y$, giving the agent a budget constraint $w_0 = qx + y$. Assume both $x$ and $y$ must be weakly greater than zero.
The agent's expected payoff is $p\,u(Hx + y) + (1 - p)u(Lx + y)$, so the agent solves the expected utility maximization problem
$$\max_{x, y}\; p\,u(Hx + y) + (1 - p)u(Lx + y) \quad \text{subject to} \quad w_0 = qx + y,\; x \geq 0,\; y \geq 0$$
or, substituting the budget constraint and putting a multiplier $\mu$ on the constraint $x \geq 0$,
$$\max_x\; p\,u((H - q)x + w_0) + (1 - p)u((L - q)x + w_0) + \mu x$$
with first-order necessary condition
$$p\,u'((H - q)x + w_0)(H - q) + (1 - p)u'((L - q)x + w_0)(L - q) + \mu = 0$$
and complementary slackness condition $\mu x = 0$. So the agent holds no risky asset ($x = 0$) only if $\mu \geq 0$, or
$$p\,u'(w_0)(H - q) + (1 - p)u'(w_0)(L - q) \leq 0$$
or $pH + (1 - p)L \leq q$. So if the expected return is less than the price, the agent holds none of the risky asset. The agent holds a strictly positive amount only if the expected return, $pH + (1 - p)L$, is strictly greater than the price, $q$.

The certain value at which the agent is indifferent between facing the lottery and taking that value for sure is called the certainty equivalent. You can imagine an agent sitting in front of an insurer, and the insurer slowly increasing the price of a policy until the agent is indifferent between facing the risk and having the insurer bear the risk instead. Similarly, we can ask, "For a small gamble over $\{x + \varepsilon, x - \varepsilon\}$, what deviation from a fair lottery over the two outcomes is required for the decision-maker to be indifferent between the gamble and getting $x$ for sure?" This is called the probability premium. You see something like this in sports all the time, when there is an uninteresting game. For example, if one team is almost sure to win, people might bet on the "spread", giving a higher likelihood of winning to the people who bet on the team that is likely to lose. This compensates them for bearing the risk of accepting the bet by increasing the probability that they win.

Definition 19.3.2 The certainty equivalent of $F()$, $c(F, u)$, is defined by $u(c(F, u)) = \int u(x)\, dF(x)$. The probability premium for a fixed value $x$ and spread $\varepsilon$ is the $\pi(x, \varepsilon, u)$ given by $u(x) = u(x + \varepsilon)\left[\tfrac12 + \pi(x, \varepsilon, u)\right] + u(x - \varepsilon)\left[\tfrac12 - \pi(x, \varepsilon, u)\right]$.
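Since $u$ is increasing, the definition suggests a direct computation: find the certainty equivalent by inverting $u$ at $E[u(X)]$ with a bisection search. A minimal sketch (my own helper, not from the notes) for finite lotteries:

```python
import math

def certainty_equivalent(u, lottery):
    """Solve u(c) = E[u(X)] by bisection; lottery is a list of (prob, outcome)."""
    eu = sum(pr * u(x) for pr, x in lottery)
    lo, hi = min(x for _, x in lottery), max(x for _, x in lottery)
    for _ in range(80):                  # u increasing, so bisection converges
        mid = 0.5 * (lo + hi)
        if u(mid) < eu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Outcome 0 with probability 1-p and y^2 with probability p, for u = sqrt:
# the certainty equivalent should be (p y)^2.
p, y = 0.3, 10.0
c = certainty_equivalent(math.sqrt, [(1 - p, 0.0), (p, y * y)])
```

Here $c$ comes out at $p^2 y^2 = 9$, well below the mean of 30: the risk-averse agent would trade the gamble for far less than its expected value.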
Example Suppose an agent has Bernoulli utility function $u(x) = \sqrt{x}$, and faces a lottery where the outcome is 0 with probability $1 - p$ and $y^2$ with probability $p$. Then the agent's expected payoff is
$$E[u(X)] = (1 - p)\sqrt{0} + p\sqrt{y^2} = py$$
Now, what is the amount of money which makes the agent indifferent between facing the gamble and not?
$$u(c(F, u)) = \sqrt{c(F, u)} = py \quad \Rightarrow \quad c(F, u) = p^2 y^2$$
This is the certainty equivalent.

We can express risk aversion in four ways, then:

Proposition 19.3.3 (6C1) Suppose a decision maker is an expected utility maximizer with increasing Bernoulli utility function $u()$. Then the following are equivalent:
• The decision maker is risk averse
• $u()$ is concave
• $c(F, u) < \int x\, dF(x)$ for all $F()$
• $\pi(x, \varepsilon, u) \geq 0$ for all $x, \varepsilon$

Proof (i and ii) Jensen's inequality. (ii and iii) If $u$ is increasing, $u(c(F, u)) < u\left(\int x\, dF(x)\right)$ iff $c(F, u) < \int x\, dF(x)$, and the former is equivalent to Jensen's inequality. (i and iv) Expand around $x$ with a second-order remainder:
$$u(x + \varepsilon) = u(x) + u'(x)\varepsilon + u''(c)\varepsilon^2/2$$
$$u(x - \varepsilon) = u(x) - u'(x)\varepsilon + u''(c)\varepsilon^2/2$$
Multiply by $p$ and $1 - p$ respectively and add to get
$$p\,u(x + \varepsilon) + (1 - p)u(x - \varepsilon) = u(x) + (2p - 1)u'(x)\varepsilon + u''(c)\varepsilon^2/2$$
If $u''(c) < 0$ and $u'(c) > 0$, then evaluating at $p = 1/2$ yields
$$\frac12 u(x + \varepsilon) + \frac12 u(x - \varepsilon) = u(x) + u''(c)\varepsilon^2/2 < u(x)$$
and to compensate the agent for the risk, we must increase the likelihood of the outcome $x + \varepsilon$: namely, from $1/2$ to the value $\frac12 + \pi$ that solves
$$\left[\tfrac12 + \pi\right]u(x + \varepsilon) + \left[\tfrac12 - \pi\right]u(x - \varepsilon) = u(x)$$

We often need to compare risk aversion within or across agents. For example, agents with large quantities of wealth are probably less averse to a small bet than agents who have very little wealth. Agents who accept one kind of bet may refuse another, similar one. We want to develop a framework in which to think about these questions. As shown above, saying that an agent is risk averse is equivalent to claiming that they have a concave utility function.
So to say an agent is "more risk averse", what we want to say is that they have a "more concave" utility function. But concavity refers to the curvature of a function, so it is a difficult property to measure. We could start with $u''(x)$, but if we took the same agent and applied a positive affine transformation to get $\beta u(x) + \gamma$ (which represents the same preferences), the second derivative would become $\beta u''(x)$, so the measure wouldn't be invariant to affine transformations. So we need something else.

Definition 19.3.4 If $u()$ is twice-differentiable, then the Arrow-Pratt measure of absolute risk aversion at $x$ is defined as
$$r_A(x) = -\frac{u''(x)}{u'(x)}$$
If $u()$ is twice-differentiable, then the Arrow-Pratt measure of relative risk aversion at $x$ is defined as
$$r_R(x) = -\frac{x u''(x)}{u'(x)}$$
So this measure normalizes the second derivative by the first, and multiplies by $-1$ to make it a positive number for concave utility. For a linear function (risk neutrality), the measure is zero. If we want constant absolute risk aversion, we are looking for a function satisfying
$$-\frac{u''(x)}{u'(x)} = r$$
for all $x$. This is a differential equation, $u''(x) + r u'(x) = 0$, with solution (up to an additive constant)
$$u(x) = Ce^{-rx}$$
We can check this easily, since $r^2 Ce^{-rx} - r^2 Ce^{-rx} = 0$. However, we want $u(x)$ to be increasing, or
$$u'(x) = -rCe^{-rx} > 0$$
so $C < 0$ if this is a utility function. In particular, the choice $C = -1/r$ is often made so that the first derivative is just $e^{-rx}$. Then the CARA utility function is
$$u(x) = -\frac{1}{r}e^{-rx}$$
or (when you want "utility" to be a positive number)
$$u(x) = \frac{1}{r}\left(1 - e^{-rx}\right)$$
There are a few other standard forms whose solutions are generally found using the above pattern of solving a differential equation.
For example, constant relative risk aversion (CRRA) utility, a special case of the hyperbolic absolute risk aversion (HARA) family, is given by
$$u(c) = \frac{c^{1-r}}{1-r}$$
Then we can start comparing individuals:

Proposition 19.3.5 (6C2) The utility function $u_1()$ represents the preferences of a decision maker who is more risk averse than a decision maker represented by $u_2()$ iff the following equivalent conditions hold:
• $r_A(u_1, x) \geq r_A(u_2, x)$ for all $x$
• There exists an increasing, concave function $\psi()$ so that $u_1(x) = \psi(u_2(x))$
• $c(F, u_1) \leq c(F, u_2)$ for any $F()$
• $\pi(x, \varepsilon, u_1) \geq \pi(x, \varepsilon, u_2)$ for any $x$ and $\varepsilon$

The absolute and relative measures of risk aversion are local ones, in that we are defining risk aversion "at $x$". We might make more demands. In particular, we might want risk aversion to decrease in $x$, so that at higher levels of wealth, the agent is less risk averse.

Definition 19.3.6 The Bernoulli utility function $u()$ exhibits decreasing absolute risk aversion (DARA) if $r_A(x)$ is decreasing in $x$.

This requires that
$$r_A'(x) = \frac{-u'''(x)u'(x) + u''(x)^2}{u'(x)^2} < 0$$
which in turn requires $u'''(x) > 0$. This implies that $u''(x)$ is an increasing (but negative) function, so as $x$ becomes large, the function becomes "less concave". Some macroeconomists refer to considerations about the third derivative as "prudence".

Proposition 19.3.7 (6C3) The following are equivalent:
• The Bernoulli utility function $u()$ exhibits decreasing absolute risk aversion.
• If $x_2 < x_1$, then $u_2(z) = u(x_2 + z)$ is a concave transformation of $u_1(z) = u(x_1 + z)$.
• For any lottery $F()$, define $c_x$ by $u(c_x) = \int u(x + z)\, dF(z)$. Then $x - c_x$ is decreasing in $x$.
• The probability premium $\pi(x, \varepsilon, u)$ is decreasing in $x$.
• For any $F()$, if $\int u(x_2 + z)\, dF(z) \geq u(x_2)$ and $x_2 < x_1$, then $\int u(x_1 + z)\, dF(z) \geq u(x_1)$.

Proof To prove the second point, note that DARA implies that
$$r(z + x) = -\frac{u''(z + x)}{u'(z + x)}$$
is decreasing in $x$, so that with $x_2 < x_1$ we have $r(z + x_2) \geq r(z + x_1)$ for every $z$, and from Proposition 6C2 it follows that $u_2$ is a concave transformation of $u_1$.
Since that is true, the other points of this Proposition are just restatements of Proposition 6C2 as comparisons across wealth levels.

19.4 Stochastic Orders and Comparative Statics

Often in economics, we want to provide comparative statics predictions with respect to changes in distributions. A first step might be to classify a family of distributions that depend on a parameter θ, such as f(z, θ), and then provide comparative statics predictions with respect to θ. For example, consider the expected utility

∫ u(x) dF(x, θ)

If we differentiate with respect to θ, we get

∫ u(x) fθ(x, θ) dx

But signing this term is difficult. We can’t just assume that “fθ(x, θ) > 0”, because

∫_{−∞}^{∞} f(x, θ) dx = 1

and differentiating with respect to θ yields

∫_{−∞}^{∞} fθ(x, θ) dx = 0

meaning that if we change θ and perturb f(), the sum of all the perturbations must cancel out. For instance, we can put more weight on the “good” outcomes, but only if we put less weight on the “bad” outcomes. So any of our variational changes to the distribution must cancel out across all states. If we go back to the expected utility ∫_a^b u(x) dF(x, θ), integration by parts gives

∫_a^b u(x) dF(x, θ) = −∫_a^b u(x) d(1 − F(x, θ)) = u(a) + ∫_a^b u′(x)(1 − F(x, θ)) dx

Then a change from F2 to F1 for a decision maker with payoff function u(x) satisfies

∫_a^b u(x) dF1(x) − ∫_a^b u(x) dF2(x) = ∫_a^b u′(x)(F2(x) − F1(x)) dx

But now we need a way to compare distributions. If u′(x) > 0, then the integrand above is the product of u′(x) > 0 and the term F2(x) − F1(x), whose sign is unknown. If F1(x) ≤ F2(x) for all x, then we can conclude that the integrand is non-negative for all x, and the change is positive.

Definition 19.4.1 A distribution F first-order stochastically dominates G if, for all x, F(x) ≤ G(x).

In other words, F first-order stochastically dominates G if it places less weight on the low outcomes, which necessarily means it places more weight on the high outcomes.
The simplest example is a lottery over two dollar amounts a and b, b > a, where the agent receives b with probability p and a with probability 1 − p. The cumulative distribution is

F(x, p) = 0 for x < a, 1 − p for a ≤ x < b, 1 for x ≥ b

So if we increase p to p′, we get a distribution where more weight is placed on getting the prize b, and F(x, p′) ≤ F(x, p) for all x.

This is one example of a stochastic order. It is a partial order on the space of probability distributions, but not a total order, since there are pairs of distributions with F(x) > G(x) but F(y) < G(y). Here are the other main examples:

• A distribution F hazard rate dominates G if, for all x,

f(x)/(1 − F(x)) ≤ g(x)/(1 − G(x))

• A distribution F likelihood ratio dominates G if, for all x < y,

f(x)/g(x) ≤ f(y)/g(y)

Depending on what you’re doing, one or another of these orders might be easier to use. The intuition of hazard rate dominance comes from the following expression: if Y is a random variable, then

f(x) dx/(1 − F(x)) ∼ Pr[Y = x | Y ≥ x]

so the hazard rate is (roughly) the probability that Y takes the value x, given that Y ≥ x. Said another way, the hazard rate is the probability that a lightbulb burns out at this exact moment, given that it hasn’t burned out up to now. Likelihood ratio dominance implies that the probability density functions f and g only cross once on the interior of their support, and when this happens f intersects g from below. Consequently, f places less weight on low outcomes than g.

Proposition 19.4.2 Likelihood ratio dominance implies hazard rate dominance, and hazard rate dominance implies first-order stochastic dominance.

MWG only use first-order stochastic dominance, but when comparing distributions it is helpful to know that these other orders exist. For example, if you are estimating how likely it is that a worker is fired at given dates after being hired, you might think of the hazard rate. If a better education lowers the likelihood of being fired at all dates, you have hazard rate dominance.
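First-order stochastic dominance can be checked directly on the two-point lottery above. The Python sketch below (the grid and parameter values are illustrative assumptions) verifies that raising p lowers the CDF everywhere, and that any increasing utility function then prefers the shifted lottery.

```python
def cdf_two_point(x, p, a=1.0, b=4.0):
    """CDF of a lottery paying b with probability p and a with probability 1 - p."""
    if x < a:
        return 0.0
    if x < b:
        return 1.0 - p
    return 1.0

def expected_u(u, p, a=1.0, b=4.0):
    """Expected utility of the two-point lottery."""
    return p * u(b) + (1 - p) * u(a)

p_low, p_high = 0.3, 0.6
grid = [0.5 + 0.1 * k for k in range(45)]  # points from 0.5 to 4.9

# First-order stochastic dominance: F(x, p_high) <= F(x, p_low) for all x.
assert all(cdf_two_point(x, p_high) <= cdf_two_point(x, p_low) for x in grid)

# Any increasing u prefers the dominating lottery.
for u in (lambda x: x, lambda x: x ** 0.5, lambda x: -1.0 / x):
    assert expected_u(u, p_high) >= expected_u(u, p_low)
```

The same grid-based comparison fails for distributions whose CDFs cross, which is exactly the sense in which first-order stochastic dominance is only a partial order.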
Proposition 19.4.3 (6D1, 6D2) The distribution F first-order stochastically dominates G if and only if, for every non-decreasing function u : R → R,

∫ u(x) dF(x) ≥ ∫ u(x) dG(x)

In particular, any decision maker with an increasing utility function will prefer F to G.

These ideas provide a way of discussing shifting probability from less to more favorable outcomes. But another kind of change in uncertainty is when weight isn’t shifted upward monotonically, but instead pushed from the center towards the tails. For example, when the variance of a normal distribution is increased, the bell becomes “fatter” and the probability at the mean falls, but the mean value itself isn’t changed. This motivates a second kind of order.

Definition 19.4.4 For any two distributions F and G with the same mean, F second-order stochastically dominates G if for every non-decreasing concave function u : R → R,

∫ u(x) dF(x) ≥ ∫ u(x) dG(x)

In particular, any risk averse decision maker with an increasing utility function will prefer F to G.

This definition suggests that what’s really going on is that probability is being shifted away from the mean. Another way of imagining this is that we draw a value x distributed according to F, but then add a “noise” term that perturbs it further, but has an expectation of zero. For example, we draw a value of x from a normal distribution with mean m and variance σx², but then add a normally distributed shock z with mean zero and variance σz², getting a variable y = x + z. This new variable has density

f(y) = (1/√(2π(σx² + σz²))) exp(−(y − m)²/(2(σx² + σz²)))

This is just another normally distributed random variable, but with mean m and variance σx² + σz². A general definition of this kind of construction is

Definition 19.4.5 Let X be a random variable with distribution F, and let Z be a random variable whose distribution conditional on x, H(z|X = x), satisfies ∫ z h(z|X = x) dz = 0.
Then the distribution G of the random variable Y = X + Z is a mean-preserving spread of F.

Using the above definition, and Jensen’s inequality applied to the concave function u(),

∫ u(y) dG(y) = ∫∫ u(x + z) h(z|x) f(x) dz dx = ∫ [∫ u(x + z) h(z|x) dz] f(x) dx ≤ ∫ u(∫ (x + z) h(z|x) dz) f(x) dx

But then since E[z|x] = 0, the term ∫ (x + z) h(z|x) dz = x, and

∫ u(y) dG(y) ≤ ∫ u(x) dF(x)

which is to say that if G is a mean-preserving spread of F, then F second-order stochastically dominates G.

Lastly, consider the difference in means (using integration by parts):

∫ x dF(x) − ∫ x dG(x) = [xF(x) − xG(x)]_a^b − ∫ F(x) dx + ∫ G(x) dx

But F(b) = G(b) = 1 and F(a) = G(a) = 0, so

∫ x dF(x) − ∫ x dG(x) = ∫ G(x) dx − ∫ F(x) dx

Since G is a mean-preserving spread of F, they have the same mean, so the left-hand side is zero, leaving

0 = ∫ (G(x) − F(x)) dx

But G is obtained from F by shifting weight into the tails, so at each x we’ve already integrated over more probability than in the distribution F, implying that

∫_a^x G(t) dt ≥ ∫_a^x F(t) dt

So this gives three ways of characterizing second-order stochastic dominance:

Proposition 19.4.6 (6D2) For two distributions F and G with the same mean, the following are equivalent:

• F second-order stochastically dominates G
• G is a mean-preserving spread of F
• ∫_a^x G(t) dt ≥ ∫_a^x F(t) dt for all x

19.5 Exercises

MWG: 6B1, 6B2, 6C6, 6C15, 6C18, 6C20

Chapter 20

Informational Economics

The idea of “information” didn’t appear formally in economics until the 1960’s and 1970’s, despite the fact that many economists discussed it. The difficulty is that information is hard to quantify, and the way it affects agents’ decisions is difficult to discuss. Not only does it matter what one agent “knows”, but also whether that agent knows whether or not other agents know the same information, and so on. For example, imagine you are a loan officer at a bank, and someone comes in hoping to borrow money. You look at their assets and collateral, and consult the bank’s policy, and are forced to turn them down.
Then the person offers to pay back even more money. Should you accept? On the one hand, the person is offering an apparently more attractive deal. On the other hand, the very fact that the person is so eager to promise to pay back more money may be because he never planned on making the repayment in the first place.

Another example is education. Many people spend many years acquiring costly degrees to convince other people that they are intelligent or talented. However, if getting the degree creates a premium in the labor market, the unintelligent or untalented might undertake getting a costly degree because it is worth it to have other people conclude that they are intelligent and talented. If this is the case, the education itself loses the ability to discriminate between the two types.

20.1 The Lemons Model

MWG have a good version of the lemons model that we’ll do in class, but here is the simplest possible model that illustrates the situation. There are many sellers and many buyers. Each seller owns a car, which is a lemon with probability p and a peach with probability 1 − p. Each seller knows whether their own car is a peach or a lemon, but no one besides the seller can observe this. Sellers value their car at αvH if the car is a peach and αvL if it is a lemon, where α < 1 — imagine that the sellers are people who are moving and would prefer to sell the car rather than drive it across the country. Buyers value peaches at vH and lemons at vL, but can’t observe quality, so the quality of a car v is a random variable from their perspective.

In such a market, what is a reasonable trading price? In a perfectly competitive world, the price would equal the marginal cost of production. Here, we have the extra piece of information that the seller is willing to trade at the current price. So one reasonable choice for the price would be

t∗ = E[v|t∗]

so that the price paid to the seller is the expected value of the car, given that the seller is willing to trade at that price.
Then the sellers who have peaches are willing to trade at t∗ if t∗ ≥ αvH, and the sellers who have lemons are willing to trade at t∗ if t∗ ≥ αvL. Buyers are willing to trade if E[v|t∗] − t∗ ≥ 0.

Let’s consider the world case-by-case:

• Suppose both peaches and lemons are traded. Then

E[v|t∗] = pvL + (1 − p)vH = t∗

This is agreeable to sellers with peaches if pvL + (1 − p)vH ≥ αvH, or

(1 − α)vH/(vH − vL) ≥ p

Lemon owners are willing to sell if pvL + (1 − p)vH ≥ αvL, which is always true.

• If the price drops too low, however, the peach owners will withdraw from the market, leaving only lemons. Then the price is

E[v|t∗] = vL = t∗

and lemon owners will be willing to sell only if vL ≥ αvL, which is always true.

• If peach owners are in the market, the price must be at least αvH > αvL, so lemon owners will always be in the market when peach owners are.

This breaks the world up into cases that depend on p and α. If (1 − α)vH/(vH − vL) ≥ p, then peaches and lemons are both traded; buyers sometimes get peaches and are happy, and sometimes get lemons and are unhappy, but break even on average. In the range where p > (1 − α)vH/(vH − vL), however, the presence of the bad lemons has driven the price so low that peach owners are no longer willing to trade. You can see this in the following:

E[v|t∗] = t∗ = pvL + (1 − p)vH = vH − p(vH − vL)

Since t∗ is decreasing in p, the trading price is decreasing in the proportion of lemons, and more lemons equates to lower prices. This eventually drives the peaches out, causing the price to drop discontinuously to vL. Consequently, at a price of t∗ = vL, the buyers and sellers both know that only lemons are traded.

This is called market unraveling, because the presence of “bad” types drives out good types, creating a vicious cycle where all trade may vanish. The presence of the lemons clearly exerts an informational externality on the other participants in the market.
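The case analysis above can be traced out numerically. The Python sketch below uses made-up parameter values (vH = 10, vL = 2, α = 0.8, so the threshold is (1 − α)vH/(vH − vL) = 0.25) and shows the pooled price falling linearly in p until the market unravels and the price drops discontinuously to vL.

```python
v_H, v_L, alpha = 10.0, 2.0, 0.8

def trading_price(p):
    """Equilibrium price t* = E[v | t*] in the two-type lemons market."""
    t_pool = p * v_L + (1 - p) * v_H   # candidate price if both types trade
    if t_pool >= alpha * v_H:          # peach owners are willing to sell
        return t_pool
    return v_L                         # otherwise only lemons trade

p_bar = (1 - alpha) * v_H / (v_H - v_L)   # threshold proportion of lemons
assert abs(p_bar - 0.25) < 1e-12

# Below the threshold both types trade and the price falls linearly in p...
assert abs(trading_price(0.10) - (10.0 - 8.0 * 0.10)) < 1e-9
# ...at the threshold the pooled price equals the peach owners' reservation value...
assert abs(trading_price(0.25) - alpha * v_H) < 1e-9
# ...and just above it the market unravels: the price drops discontinuously to v_L.
assert trading_price(0.26) == v_L
```

The size of the discontinuity, αvH − vL, is the informational externality the marginal lemon imposes on peach owners.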
Their very presence has a negative effect on the peach-owners who trade. In a model with more types, you can imagine equilibria where some “very good” types refuse to trade, while “mediocre” and “bad” agents do so, and if the proportion of bad types is increased, the market collapses completely. This is one way that economists explain things like the recent financial crisis: the presence of bad debt in the market made participants unsure of the value of assets, leading to situations where no one wanted to trade, and agents even hoarded cash so that, if it became known that they were holding bad debt, they could still pay off their obligations.

20.2 Insurance Markets and Existence of Competitive Equilibria

This section considers a simple insurance market (see Rothschild and Stiglitz (1976) for a general model and analysis). In particular, there is an agent with wealth W who may incur a loss of size d, leaving him with W − d if the loss occurs and W otherwise. The agent has a concave utility function, u(), which describes his payoff in each state of the world. If the loss occurs with probability p, then his expected utility is

E[U] = pu(W − d) + (1 − p)u(W)

Such an agent would be happy to pay for insurance if it were priced correctly. Suppose that there is an insurer who offers contracts where the agent pays q per unit of insurance, and receives a payout of 1 per unit if an accident occurs.
This would make his expected utility

E[U] = max_α pu(W − d + α − qα) + (1 − p)u(W − qα)

A maximizing agent would then choose α∗ so that

0 = p(1 − q)u′(W − d + α∗ − qα∗) − q(1 − p)u′(W − qα∗)

On the other side of the market, a risk-neutral insurer would be willing to provide a contract if

p(qα − α) + (1 − p)αq ≥ 0

If there is only one type of consumer, we might proceed by considering a competitive insurance industry in which no firms make economic profits, or

p(qα − α) + (1 − p)αq = 0

That yields a price

q∗ = p

Combining the two conditions yields

0 = p(1 − p)u′(W − d + α∗ − pα∗) − p(1 − p)u′(W − pα∗)

and simplifying implies

u′(W − d + α∗ − pα∗) = u′(W − pα∗)

And since u() is concave, u′() is monotone, so

W − d + α∗ − pα∗ = W − pα∗

and simplifying yields

α∗ = d

In this case, the consumer fully insures, and the outcome is Pareto efficient. In particular, the consumer’s marginal utilities of consumption are equated across both states:

u′(W − d + d − pd) = u′(W − pd) ←→ u′(W − pd) = u′(W − pd)

Proposition 20.2.1 Define a perfectly competitive insurance market as one in which (i) consumers maximize their expected utility, (ii) firms maximize profits given the market price of insurance, and (iii) firms make zero profits. In an insurance market with complete information and sufficient competition between insurance providers, the Pareto optimal outcome occurs.

Now, however, consider a market where there are multiple “types” of consumers: each consumer privately knows his or her likelihood of suffering the loss, but this information is not observable to others. In particular, there is a high risk type, who suffers the accident with probability ph, and a low risk type, who suffers the accident with probability pl, with ph > pl. In the population, there is a proportion n of high risk types and 1 − n of low risk types. If the insurer could observe the agents’ types, he could contract separately with each as in the previous example.
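The single-type, full-insurance benchmark above can be verified numerically. The Python sketch below (the utility function u(x) = √x and the parameter values are illustrative assumptions) solves the consumer's first-order condition by bisection at the actuarially fair price q = p and recovers α∗ = d.

```python
import math

W, d, p = 2.0, 1.0, 0.25
q = p  # actuarially fair price from the zero-profit condition

def u_prime(x):
    """Marginal utility of u(x) = sqrt(x)."""
    return 0.5 / math.sqrt(x)

def foc(alpha):
    """First-order condition: p(1-q)u'(W-d+a(1-q)) - q(1-p)u'(W-qa)."""
    return (p * (1 - q) * u_prime(W - d + alpha * (1 - q))
            - q * (1 - p) * u_prime(W - q * alpha))

# foc is strictly decreasing in alpha: more coverage raises loss-state wealth
# and lowers no-loss wealth, so bisection finds the unique root.
lo, hi = 0.01, 2.0
assert foc(lo) > 0 and foc(hi) < 0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if foc(mid) > 0:
        lo = mid
    else:
        hi = mid

alpha_star = 0.5 * (lo + hi)
assert abs(alpha_star - d) < 1e-8  # full insurance: alpha* = d
```

At α∗ = d both arguments of u′() equal W − pd, which is the equal-marginal-utilities condition stated in the text.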
But if this information is hidden, as in the lemons market, the insurer can’t contract on it. Consumers simply show up and decide whether or not to take one of the contracts offered. Our goal now is to ask whether an equilibrium like the kind above exists: in particular, we want to know if (i) both types can be fully insured and (ii) the (representative) firm makes zero profits. In general, these goals will conflict.

If the insurer sticks to offering a single contract, the Pareto optimal contract won’t occur (somewhat obviously, but let’s check). If the firm offers a single price for insurance, q, then the two types of agents face the following maximization problems:

max_{αh} ph u(W − d + αh − αh q) + (1 − ph)u(W − αh q)

max_{αl} pl u(W − d + αl − αl q) + (1 − pl)u(W − αl q)

and each consumer type maximizing yields conditions

0 = ph(1 − q)u′(W − d + α∗h − α∗h q) − (1 − ph)q u′(W − α∗h q)

0 = pl(1 − q)u′(W − d + α∗l − α∗l q) − (1 − pl)q u′(W − α∗l q)

If the firm makes zero profits,

n(ph[−α∗h + q∗α∗h] + (1 − ph)q∗α∗h) + (1 − n)(pl[−α∗l + q∗α∗l] + (1 − pl)q∗α∗l) = 0

This implies that q∗ must satisfy

q∗ = (n ph α∗h + (1 − n) pl α∗l)/(n α∗h + (1 − n) α∗l)

Then these three equations describe the equilibrium in the market:

0 = ph(1 − q∗)u′(W − d + α∗h − q∗α∗h) − (1 − ph)q∗u′(W − α∗h q∗)

0 = pl(1 − q∗)u′(W − d + α∗l − q∗α∗l) − (1 − pl)q∗u′(W − α∗l q∗)

q∗ = (n ph α∗h + (1 − n) pl α∗l)/(n α∗h + (1 − n) α∗l)

But now, since q∗ ∉ {ph, pl}, neither type gets full insurance. The new price is somewhere between pl and ph, since

q∗ = ph · (n α∗h)/(n α∗h + (1 − n) α∗l) + pl · ((1 − n) α∗l)/(n α∗h + (1 − n) α∗l)

so that q∗ is a convex combination of the two accident probabilities.

If, instead, the insurers offer two contracts — one intended for the high risk type and one intended for the low risk type — the efficient outcome might be restored.
The zero-profit, full insurance contracts must then have prices

q∗h = ph and q∗l = pl

and efficient quantities

α∗h = d and α∗l = d

These are the prices where, if the consumers each take the contract intended for them, they fully insure and the insurer makes zero profits. Note that any other set of contracts will fail to be an efficient allocation of risk, or the insurer will make strictly positive profits (or both). But consider what happens if the insurer posts this menu of contracts {(q∗h, α∗h), (q∗l, α∗l)} but cannot observe the types of the agents. The contract intended for the low risk type satisfies q∗l < q∗h, so that it is cheaper, and α∗h = α∗l, so that the two types receive the same amount of insurance. Consequently, the high risk type never has an incentive to choose the contract intended for him: he is always better off choosing the contract intended for the low risk type. This means that if the insurer offers these contracts, all consumers will pick (q∗l, α∗l), and the insurer makes losses with probability one:

n(ph[−d + pl d] + (1 − ph)pl d) + (1 − n)(pl[−d + pl d] + (1 − pl)pl d) = n d(pl − ph) < 0

Then we’ve shown that:

Proposition 20.2.2 In an insurance market with private information, there is no equilibrium in which (i) both types fully insure and (ii) the representative firm makes zero profits.

It turns out that, in general, competitive equilibria may fail to exist in these markets. To provide incentives for the consumers to choose the contracts intended for them, the firm must offer less-than-full insurance to the low risk types at a discount, while the high risk types end up paying relatively high prices. To achieve this, the representative firm might not satisfy the zero profit condition. This shows that perfectly competitive equilibrium may not exist when there are informational frictions.

20.3 Market Intermediation

This section considers the problem facing a simple market maker (see Glosten and Milgrom (1985)).
In many markets, there is an agent who coordinates trades by setting prices. In financial markets, for example, there are often not enough active buyers and sellers at the same time for the market to clear. In these cases, an intermediary is often introduced who “smooths” the market by purchasing excess shares when supply is greater than demand, and selling those shares when demand exceeds supply. These intermediaries often charge different prices to buy and sell, and this is potentially a puzzle in a competitive equilibrium framework.

There are four kinds of agents in this market: those who are buying or selling for “exogenous” reasons, and inside traders who are trying to trade before news becomes public — selling ahead of bad news and buying ahead of good news. The current estimate of the stock’s value is v0. The probability that the news is bad is p, in which case the value falls to vl; the probability that the news is good is 1 − p, in which case the value rises to vh. Thus, v0 = pvl + (1 − p)vh. However, some agents in the economy are informed, or are engaged in insider trading: they already know whether the news is good or bad, and are trying to trade before the rest of the market does. Consequently, their actions have a negative informational effect on the market maker: when the news is bad, some people try to dump their worthless shares, while if the news is good, some agents get a big discount at the expense of others. If the intermediary is to make zero expected profits, he will have to account for the presence of these informed agents in his pricing.

The probability that a trader is inside trading (“informed”) is µ and the probability that a trader is executing a regular trade (“uninformed”) is 1 − µ. Uninformed traders want to buy with probability λ and sell with probability 1 − λ.
(Now is a good time to draw a “tree” picture showing all the possibilities.) The market maker charges a price pb to a buyer who wants to purchase a share of the asset, and pays a price ps to any agent who wishes to sell a share of the asset. For the market maker to make zero expected profits on average, the following needs to be true: let p̃ be the (random) trading price. Then if

E[p̃] = v0

the average price charged equals the expected value, so the market maker makes a net profit of zero in expectation: E[p̃] − v0 = 0.

But whether a particular agent wishes to buy or sell provides information to the market maker. An informed trader buys only when the news is good and sells only when the news is bad. So if a buyer arrives, which occurs with probability p(1 − µ)λ + (1 − p)(1 − µ)λ + (1 − p)µ, the market maker’s new estimate of the stock’s value is

E[v| Buyer] = (p(1 − µ)λ vl + (1 − p)(1 − µ)λ vh + (1 − p)µ vh)/(p(1 − µ)λ + (1 − p)(1 − µ)λ + (1 − p)µ)

and if a seller arrives, which occurs with probability p(1 − µ)(1 − λ) + (1 − p)(1 − µ)(1 − λ) + pµ,

E[v| Seller] = (p(1 − µ)(1 − λ) vl + (1 − p)(1 − µ)(1 − λ) vh + pµ vl)/(p(1 − µ)(1 − λ) + (1 − p)(1 − µ)(1 − λ) + pµ)

Suppose we set pb = E[v| Buyer] and ps = E[v| Seller]; note that these prices are conditional on the requested trade. When we take the expectation,

E[p̃] = E[v| Buyer] Pr[Buyer] + E[v| Seller] Pr[Seller]
= ((1 − µ)λ v0 + (1 − p)µ vh) + ((1 − µ)(1 − λ) v0 + pµ vl)
= (1 − µ)v0 + µ(p vl + (1 − p)vh)
= v0

This describes a way of setting prices so that the market maker does not make a loss or gain on average, given the commonly available information at date zero. However, the bid-ask spread, |ps − pb|, is not zero. In particular, a non-zero spread ensures that the market maker achieves zero profits by making gains on one side of the market and losses on the other.
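The zero-profit bid and ask prices can be computed directly. In the Python sketch below (the parameter values are made up), an informed trader buys only on good news and sells only on bad news; the code checks that pb > v0 > ps, so the spread is positive, and that the expected trading price equals v0, as derived above.

```python
p, mu, lam = 0.3, 0.4, 0.5     # prob of bad news, informed share, uninformed buy prob
v_l, v_h = 1.0, 2.0
v0 = p * v_l + (1 - p) * v_h   # prior expected value of the stock

# Buyers: uninformed traders in either state, informed traders only on good news.
pr_buy = p * (1 - mu) * lam + (1 - p) * (1 - mu) * lam + (1 - p) * mu
p_b = (p * (1 - mu) * lam * v_l + (1 - p) * (1 - mu) * lam * v_h
       + (1 - p) * mu * v_h) / pr_buy

# Sellers: uninformed traders in either state, informed traders only on bad news.
pr_sell = p * (1 - mu) * (1 - lam) + (1 - p) * (1 - mu) * (1 - lam) + p * mu
p_s = (p * (1 - mu) * (1 - lam) * v_l + (1 - p) * (1 - mu) * (1 - lam) * v_h
       + p * mu * v_l) / pr_sell

assert abs(pr_buy + pr_sell - 1.0) < 1e-12   # every trader buys or sells
assert p_s < v0 < p_b                        # positive bid-ask spread
assert abs(p_b * pr_buy + p_s * pr_sell - v0) < 1e-12   # E[price] = v0
```

Setting mu = 0 collapses both prices to v0, which is the comparative static asked for in Exercise 5 below.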
While this might appear to be monopolistic price discrimination, it isn’t: the market maker is simply satisfying his zero profit condition. So the moral of the story is that when there are uninformed and informed agents in the market, the market maker suffers losses on trades with informed agents, who face no risk. This hurts the uninformed traders because the bid-ask spread destroys some profitable trades, since the market maker has to balance losses on some trades with gains on others.

Exercises

1. Suppose sellers each have one unit of a good and privately know its value, which comes from the set {v0, v1, v2} with vk+1 > vk for all k and v0 = 0. For any given seller, the probability of being type k is pk, where 0 ≤ pk ≤ 1 and Σk pk = 1. Sellers value their good at αvk, where α < 1, and buyers value the good at vk. Fix p1, p2, and consider p0 as the variable of interest. To do this, let p̄1 + p̄2 = 1 describe the “kind of world where low types don’t exist”. Then we’ll add the lowest types in such a way that the proportion of type 2 to type 1 remains the same, so that

p2 = p̄2(1 − p0) and p1 = p̄1(1 − p0)

so that p1 + p2 = (p̄1 + p̄2)(1 − p0) = 1 − p0, and p0 + p1 + p2 = 1 for all p0 ∈ [0, 1].

i. If t∗ = E[v|t∗], characterize the market for all values of p0. In particular, show that there is a “best” seller who participates when p0 = 0, and explain the qualitative changes that occur in the market as p0 increases.

ii. Note that t∗ = E[v|t∗] gives all of the gains from trade to the seller. For any distribution (p1, p2) and p0, characterize the set of Walrasian prices: trade is mutually beneficial for all participants at price t∗, given that they want to participate. How do the Walrasian prices vary as p0 varies, holding (p̄1, p̄2) fixed?

iii. Suppose there was a method of verifying quality that costs c > 0, and once quality is verified, trades occur at t∗ = (vk + αvk)/2. Characterize the equilibrium.
In particular, which types have their quality verified, and under what conditions?

iv. Suppose that there was a method of verifying quality that costs c > 0, and once quality is verified, trades occur at t∗ = αvk. Characterize the equilibrium. In particular, which types have their quality verified, and under what conditions?

v. Given your results in iii and iv, explain why verification technology is good, bad, or irrelevant for markets with adverse selection, in general.

2. Suppose there is a consumer with preferences over the market good q, quality of the market good v, and the numeraire n,

u(q, n) = E[v|p]q + n

and wealth constraint w = pq + n. There are two sellers. One is the high type, who makes goods of quality vh and has cost function (cvh/2)qh², and one is the low type, who makes goods of quality vl and has cost function (cvl/2)ql², where vh > vl. Both firms face a fixed cost F. The consumer cannot distinguish the quality of the goods at the time of purchase, so his order will be filled with goods from both types of sellers. A perfectly competitive equilibrium is one in which (i) the consumer takes the price and average quality of the goods as given and selects an amount q to purchase to maximize his utility, (ii) each firm, taking the price as given, maximizes firm profits, and (iii) the amount produced by both firms fills the consumer’s order (markets clear) and the realized average quality of the goods is equal to the consumer’s expectation of their quality (consumers’ expectations are correct).

(0) Show that, if p is such that both types of firms produce q∗h and q∗l in equilibrium, then

E[v|p] = (vh q∗h + vl q∗l)/(q∗h + q∗l)

so that, in equilibrium, E[v|p]q∗ = vh q∗h + vl q∗l, and the consumer only cares about the average quality of the goods he has received. (This is a hint.)

(i) Solve the consumer’s and producers’ problems. For what values of F and p do the firms drop out? Which drops out first?
(ii) Suppose there is a perfectly competitive equilibrium in which both firms operate. Characterize the equilibrium prices and quantities. How does total production depend on F? When is the equilibrium price increasing in vh, and when is it decreasing in vh? How does the equilibrium quantity vary in c, vh and vl? How does the high type’s equilibrium supply vary in vh and vl? Briefly explain your results.

(iii) Suppose that F is sufficiently high that only one firm operates in equilibrium. Which firm is it? How high does F have to be for the remaining firm to drop out? How does this critical value of F depend on the value of the remaining firm?

(iv) Show there are values of F where the low type firm will drop out (so the market entirely collapses), but if the high type firm could operate alone, it could make profits. Show that there are levels of F where, if the high type firm is operating alone, it is possible for the low type firm to profitably enter, but the low type firm could not operate profitably without the high type firm in the market.

3. Consider the insurance utility maximization problem:

max_α pu(w − d + α − αq) + (1 − p)u(w − αq)

Suppose that q ≥ p.

i. Is α∗ increasing or decreasing in p?

ii. Is α∗ increasing or decreasing in q?

iii. Is α∗ increasing or decreasing in w? Explicitly provide an answer for the case when u′′′() ≥ 0, and in general.

iv. If u(x) is more concave than v(x) in the sense of the Arrow-Pratt coefficient of absolute risk aversion for all x, is α∗ weakly greater or weakly less under v() than under u()?

4. For the insurance market in section 20.2, provide an analysis when u(x) = √x, w = 1, d = 1, and p and q are variable:

i. Show that if q = p, then α∗ = 1 = d.

ii. Show that any pair of contracts (αh, qh) and (αl, ql) where αl > αh and ql < qh provides incentives for the high type to imitate the low type. (This includes the contract that is efficient for the low type, (αl, ql) = (1, pl).)

iii.
At this point, we’ve covered individual rationality and incentive compatibility constraints.

(a) Formulate the problem as a “monopolistic screening” program with constraints for the two consumer types to separate, and the added constraint that the insurer makes zero profits.

(b) Write out the incentive compatibility and individual rationality constraints. What do they tell you about the nature of optimal contracts? Briefly explain your findings.

(c) (Optional) If you characterized incentive compatibility, can you characterize the allocations that satisfy the constraints? Explain your findings.

5. For the Market Maker model, show that:

i. As µ → 0, both pb and ps converge to v0. Interpret this in terms of convergence to efficiency, and explain the relationship with a simple partial equilibrium model.

ii. If p = λ = 1/2, how does the bid-ask spread, |ps − pb|, vary in µ?

iii. If p = λ = 1/2, characterize when the market maker suffers losses when selling the asset vs. when buying the asset, in terms of µ (i.e., is pb < v0? is ps > v0?). Explain your result in terms of insider trading.

Chapter 21

Signaling

This last section introduces a simple model of market signaling, just as we introduced simple models of adverse selection and moral hazard, and studied the consequences for efficiency. Recall that the simple adverse selection model (after Akerlof’s lemons paper) merely added private information to the simple partial equilibrium framework, with the consequence that trade often broke down, or the most profitable trades were driven out by the presence of bad trades that lowered the clearing price. Subsequently, we began to ask the question, “How can contracts and more complicated trading institutions mitigate the adverse selection problem?” This led us to consider monopolistic screening and price discrimination, which created a notion of informational rents.
But this opened up the possibility of other such market frictions, such as moral hazard, where agents take actions after contracting, leading to agency rents. In all of these cases, the “active” agent in the market is essentially the monopolist seller or the principal, who designs contracts to maximize some objective (such as welfare or his own profits). This framework of contract design — with the help of game theory — can be considerably expanded, and is referred to as market screening, since the party without private information provides incentives for the party with information to reveal their type. But what about the other side of the market? Agents often take actions to prove themselves and convince others of their type. Instead of screening, this is market signaling.

21.1 Educational Signaling

For simplicity, the model is given the interpretation of a labor market where agents signal their quality by acquiring costly degrees, but all the pieces of the model can be easily reinterpreted to apply to other scenarios. Suppose there are two types of agents in the population: high productivity ones, h, who occur with probability p, and low productivity ones, l, who occur with probability 1 − p. There are two kinds of education they can choose: stopping after high school, L, or going to college, H. Education does not contribute to the agents’ productivity, but going to university is comparatively cheaper for the high productivity type than the low productivity type: if c(i, j) is the cost to type i of choosing education level j, then c(h, H) − c(h, L) < c(l, H) − c(l, L). Note that c(h, H) + c(l, L) < c(h, L) + c(l, H), or c() is submodular in (i, j). High productivity types have a marginal product of labor of πh for a firm, while low productivity types have a marginal product of labor of πl, with πh > πl.
After agents choose high school or university, a representative firm makes a competitive wage offer, where the wage offered is equal to the expected marginal product of labor, conditional on the kind of degree that the agent received. So the representative firm evaluates the resume of each candidate, checks whether the individual went to high school or college, and offers a different wage to each kind of degree. A natural prediction would be that the university degree will be associated with the high type, which will lead to a higher wage for university graduates, while a high school diploma will receive a lower wage. This ties together the adverse selection model with a partial equilibrium model of wage determination in labor markets. However, the situation is not as simple as the model so far described might suggest. The timing is as follows: each productivity type h or l chooses to stop at high school, L, or go to college, H; the representative firm then observes the choice and offers a wage wL or wH, conditional not on the type of the agent, but on the kind of education they pursued. There are three possible configurations of interest that might arise:

• The low type goes to high school while the high type goes to university: then wH = πh and wL = πl.
• Both types go to university, no one goes to high school: wH = pπh + (1 − p)πl, and wL is not pinned down, since no one chooses L.
• Both types go to high school, no one goes to university: wL = pπh + (1 − p)πl, and wH is not pinned down, since no one chooses H.

The first case, where the high productivity type goes to university and the low productivity type goes to high school, is called a separating equilibrium because each type makes a different choice, and the choices reveal the agents’ information. The second and third cases are pooling equilibria, because the private information of the agents cannot be reliably inferred from their educational choice.

21.1.1 Separating Equilibria

Separating equilibria are the easiest to analyze, and the use of incentive compatibility constraints aids the analysis.
First, it must be incentive compatible for the high type h to choose H over L. This yields the constraint:

ICh: wH − c(h, H) ≥ wL − c(h, L)

Second, it must be incentive compatible for the low type l to prefer L over H. This yields another constraint:

ICl: wL − c(l, L) ≥ wH − c(l, H)

If w∗H = πh and w∗L = πl, this requires that

ICh: πh − c(h, H) ≥ πl − c(h, L)
ICl: πl − c(l, L) ≥ πh − c(l, H)

So the low type could go to college, but this is only profitable if the ICl constraint is violated, or πh − πl > c(l, H) − c(l, L): the increased wage outweighs the increased cost of education. Likewise, the high type might find that the added debt, years of lost productivity, and endless problem sets are less attractive than going straight into the workforce, so that the ICh constraint is violated: πh − πl < c(h, H) − c(h, L). If the two constraints jointly hold, then we have

c(l, H) − c(l, L) ≥ πh − πl ≥ c(h, H) − c(h, L)

The above inequality provides the necessary and sufficient conditions for a separating equilibrium: the added cost of college for the low type is greater than the gain in terms of final wages, but not so great as to deter the high type from going to college. If the inequality is satisfied, the two types prefer the proposed signals to switching. Consequently, the firm makes zero profits, since πh − w∗H = 0 and πl − w∗L = 0. Is the outcome efficient? On the one hand, the high type and low type are correctly identified in equilibrium and receive their marginal product of labor as a wage. On the other hand, the total costs of education in the economy are pc(h, H) + (1 − p)c(l, L), which contribute nothing to productivity. If these costs are merely there for signaling purposes, the outcome might be very inefficient relative to a world where everyone simply goes to high school and receives a wage of w = pπh + (1 − p)πl.

21.1.2 Pooling Equilibria

We now consider the case where both types go to college (sound familiar?).
In this situation, the wage is

w∗H = pπh + (1 − p)πl

On average, the firm will again break even and make zero profits. The problem is that in this scenario, the consequences of only graduating from high school are not well-defined, in the following sense:

• Suppose you are the manager of the firm, and in your entire tenure you have only seen candidates for the job who have gone to college. One day, a candidate walks in and hands you his application, and it lists only a high school diploma. You ask him why he didn’t go to college, and he makes the following speech: “I’m a highly productive person, and four years of study seem like a waste, especially when it doesn’t even increase my productivity. I decided to just go straight into the workforce.” You might conclude (i) this is a bright individual who indeed would have wasted his time in college, or (ii) this is a low productivity person trying to escape the burden of college, since more education is comparatively more expensive for the low productivity types. In some sense, both inferences are equally valid, since either could be true.

So what should we do? Statistically, we should use Bayes’ law, so that we assess the probability that this applicant is a high type given that they only graduated from high school by computing

pr[h|L] = pr[h ∩ L]/pr[L] = 0/0

But (as written above) this probability isn’t even well-defined, since (i) we assumed no high types h choose L and (ii) no one chooses L, so both the numerator and the denominator are zero. This is the main difference between market signaling and market screening: outcomes must be considered which do not actually occur in equilibrium.¹ This means that we cannot, at this point, compute a wage for high school graduates, because we cannot decide on a “realistic” set of inferences about their productivity. To close the gap in the model, let pr[h|L] = µ be a subjective probability assessment, and we will see what values of µ are consistent with a pooling equilibrium.
The wage for those choosing L, given µ, is then

w∗L = µπh + (1 − µ)πl

For both types to prefer college H over high school L, the following incentive compatibility constraints must be satisfied:

ICh: pπh + (1 − p)πl − c(h, H) ≥ µπh + (1 − µ)πl − c(h, L)
ICl: pπh + (1 − p)πl − c(l, H) ≥ µπh + (1 − µ)πl − c(l, L)

Combining these constraints yields

(p − µ)(πh − πl) ≥ max{c(l, H) − c(l, L), c(h, H) − c(h, L)} = c(l, H) − c(l, L)

Now, if µ is equal to or greater than p, so that the firm assigns a high likelihood to a high school graduate being a high type, the pooling equilibrium cannot hold, because both types will want to jump ship and get the high wage. If µ = 0, the firm assumes that any applicant who refused to go to college is a low productivity type, and will offer wL = πl. Even then, pooling can still fail if the low type dreads going to college (c(l, H) − c(l, L) is very high), since ICl is then violated. So the pooling equilibrium will be incentive consistent for all parties only if (i) both types find college relatively affordable and (ii) the firm concludes that high school graduates are likely to be low productivity types. This outcome is inefficient as well, especially because the low productivity types are overinvesting in education simply so that potential employers will not correctly infer that they are low productivity. This has the flavor of a “rat race” which you have probably experienced at various times in your life.

¹The theory of games of incomplete information provides a framework for understanding these kinds of models in a formal way.

21.1.3 Least Cost Separating Equilibria

Suppose that H and L are now variables, representing the difficulty of college and high school, respectively, where c(i, 0) = 0 for both types. Is there a least cost separating equilibrium?
Recall that the necessary and sufficient conditions for a separating equilibrium were

c(l, H) − c(l, L) ≥ πh − πl ≥ c(h, H) − c(h, L)

and social surplus, net of total output pπh + (1 − p)πl (which is constant, since education does not affect productivity, and wages are just transfers), is

p(w∗H − c(h, H)) + (1 − p)(w∗L − c(l, L)) − pw∗H − (1 − p)w∗L = −pc(h, H) − (1 − p)c(l, L)

Thus, a social planner will want to minimize the total signaling costs, pc(h, H) + (1 − p)c(l, L). If we set L∗ = 0, then the conditions reduce to

c(l, H) ≥ πh − πl ≥ c(h, H)

and the optimal difficulty of college solves πh − πl = c(l, H∗): this is the least costly college such that the low type finds it unprofitable to pretend to be a high type. As long as πh − c(h, H∗) ≥ πl, no high productivity types will be tempted to switch, and L∗ = 0, πh − πl = c(l, H∗) characterizes the least cost way of inducing the types to separate.

Chapter 22 Monopolistic Screening

In markets where sellers face buyers with private information, similar issues arise as in the insurance model: if the monopolist knew the agents’ types, he could offer contracts that perfectly price discriminate against them and achieve the socially efficient outcome. Unfortunately, the agents have private information, so the monopolist has to incentivize them to behave the way he wants, and this comes at a cost.

22.1 A Simple Price Discrimination Model

Suppose there are two types of consumers, “high types” who occur with probability p and “low types” who occur with probability 1 − p. Each is interested in purchasing up to two units of the seller’s good, which costs c per unit to produce.
The payoff to type i for purchasing quantity q at a total price tq is

u(i, q) = v(i, q) − tq

We will assume that v(h, q) > v(l, q) for all q, and that v(i, q) is supermodular, or that

v(h, 2) + v(l, 1) ≥ v(h, 1) + v(l, 2), or equivalently, v(h, 2) − v(h, 1) ≥ v(l, 2) − v(l, 1)

This says that the gain to the high type of moving from one unit to two units is greater than the gain to the low type of moving from one to two units, though supermodularity is usually defined using the first expression (for x ≥ x′, f(x, x) + f(x′, x′) ≥ f(x′, x) + f(x, x′)). If a consumer purchases nothing, they get a payoff of zero.

In particular, we consider the problem of implementing a particular pattern of purchasing decisions on the part of the customers. We might want the high types to buy two units and the low types to buy one; we might want the high types to buy one unit and the low types to buy nothing; we might want both types to purchase one unit; and so on. There are some things that are not implementable: the low types purchasing one unit and the high types purchasing none, for instance. For if the low types find it profitable, v(l, 1) − t1 ≥ 0, then v(h, 1) − t1 > v(l, 1) − t1 ≥ 0, and the high types could always “impersonate” the low types and make the purchase.

Suppose the monopolist wants both types to purchase a single unit. Then it must be the case that both types prefer buying rather than not, or that

v(h, 1) − t1 ≥ 0
v(l, 1) − t1 ≥ 0

These are called individual rationality constraints (IR constraints): the agents prefer to participate rather than withdraw from the market. The monopolist’s profits in this implementation scheme are

π = max_{t1} p(t1 − c) + (1 − p)(t1 − c)

The objective function is increasing in t1, so we want to raise it as high as possible without breaking the constraints. The two IR constraints imply that

min{v(h, 1), v(l, 1)} ≥ t1

so the profit-maximizing price is t∗1 = v(l, 1). This is the highest price at which both types are willing to participate.
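A minimal numeric sketch of this single-unit scheme. The valuations, probability, and cost below are illustrative assumptions, not numbers from the text:

```python
# Hypothetical primitives: v(h,1) = 5, v(l,1) = 3, Pr[high] = 0.25, unit cost c = 1.
v_h1, v_l1 = 5.0, 3.0
p, c = 0.25, 1.0

# Both types buy one unit at the highest price satisfying both IR constraints:
t1_star = min(v_h1, v_l1)            # = v(l,1)
profit_serve_both = t1_star - c      # every consumer pays t1*

# Alternative scheme: price at v(h,1), so low types drop out and only
# a fraction p of consumers buy.
profit_exclude_low = p * (v_h1 - c)

print(t1_star, profit_serve_both, profit_exclude_low)  # 3.0 2.0 1.0
```

With these numbers, serving both types is more profitable, but raising p or v(h, 1) enough flips the comparison, which is exactly the trade-off between margin and volume discussed next.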
Note that we are not asserting that it wouldn’t be better to, say, increase the price to t∗1 = v(h, 1) and allow all the low types to drop out, yielding profits of pv(h, 1) rather than v(l, 1); there is an obvious trade-off from selling fewer units at a higher price. We are saying that if attention is restricted to the implementation scheme where all types buy a single unit, the optimal price is t∗1 = v(l, 1).

Now suppose that the monopolist wants the high types to purchase two units while the low types purchase a single unit. Then we have the IR constraints

IRH: v(h, 2) − t2 ≥ 0
IRL: v(l, 1) − t1 ≥ 0

But there is a new opportunity for the agents to “lie” to the seller: the high types might imitate the low types and choose a single unit, and the low types might imitate the high types and choose two units. To prevent this from happening, we need to impose incentive compatibility constraints (IC constraints). These constraints require that, given the intended implementation scheme, no type has an incentive to “impersonate” another type.

ICH: v(h, 2) − t2 ≥ v(h, 1) − t1
ICL: v(l, 1) − t1 ≥ v(l, 2) − t2

Note that we can combine the IC constraints by multiplying the second by −1 and adding the result to the first to get

v(h, 2) − v(h, 1) ≥ v(l, 2) − v(l, 1)

which is just the supermodularity condition assumed earlier: this is the restriction that allows the monopolist to “sort” the agents. So the monopolist’s new optimization problem is

π = max_{t2, t1} p(t2 − 2c) + (1 − p)(t1 − c)

subject to

IRH: v(h, 2) − t2 ≥ 0
IRL: v(l, 1) − t1 ≥ 0
ICH: v(h, 2) − t2 ≥ v(h, 1) − t1
ICL: v(l, 1) − t1 ≥ v(l, 2) − t2

First, note that ICH and IRL together imply that

v(h, 2) − t2 ≥ v(h, 1) − t1 > v(l, 1) − t1 ≥ 0

(the first inequality is ICH, the last is IRL), so that v(h, 2) − t2 > 0 and IRH is slack. Additionally, note that if ICL and IRL both bind, it must be true that t1 = v(l, 1) and t2 = v(l, 2).
If this is the case, note that ICH holds with slack, since supermodularity gives v(h, 2) − v(l, 2) > v(h, 1) − v(l, 1) when the inequality is strict. So if we raised t2 to t2 + ε, the high type would still purchase two units, while the low type would not change behavior (since they were buying a single unit anyway), and profits increase. Therefore, it is not profit-maximizing for IRL and ICL both to bind; ICH and IRL are the constraints that actually matter. This reduces the problem to

π = max_{t2, t1} p(t2 − 2c) + (1 − p)(t1 − c)

subject to

IRL: v(l, 1) − t1 = 0
ICH: v(h, 2) − t2 = v(h, 1) − t1

But the remaining constraints completely determine the prices:

t∗1 = v(l, 1)
t∗2 = v(h, 2) − v(h, 1) + v(l, 1)

Then the utilities of the consumers are

u∗H = v(h, 1) − v(l, 1)
u∗L = 0

and the monopolist’s profits are

π∗ = p(v(h, 2) − v(h, 1) + v(l, 1) − 2c) + (1 − p)(v(l, 1) − c)

Basically, the monopolist must allow the high type to retain some informational rent, v(h, 1) − v(l, 1), which provides the incentive for the high type to reveal his information. If the monopolist raised the price any higher, the high type would simply pool with the low type, and get a payoff of v(h, 1) − t∗1 = v(h, 1) − v(l, 1) anyway. Therefore, the presence of the low type helps the high type by giving him another option besides dropping out, improving his bargaining power against the monopolist.

22.2 Price Discrimination with a Trade-off

Suppose now that the consumers have tastes for quality/quantity q, so their utility functions take the form

v(i, q) − t(q)

In this case, the quantity/quality variable q is a real number, and the types are either h or l, with the assumption that v(h, q) > v(l, q). Quantity/quality costs c per unit for the monopolist to produce. We imagine the monopolist posting two contracts, (qh, th) intended for the high type, and (ql, tl) for the low type (why is this sufficient?).
Then, again, we have IR constraints:

IRH: v(h, qh) − th ≥ 0
IRL: v(l, ql) − tl ≥ 0

and IC constraints:

ICH: v(h, qh) − th ≥ v(h, ql) − tl
ICL: v(l, ql) − tl ≥ v(l, qh) − th

In this new problem, what does incentive compatibility mean? Again, multiply ICL by −1 and add it to ICH to get

v(h, qh) + v(l, ql) ≥ v(h, ql) + v(l, qh), or equivalently, v(h, qh) − v(h, ql) ≥ v(l, qh) − v(l, ql)

which again has the interpretation that the high type gets a larger increase in utility from moving from ql to qh than the low type does, and which is equivalent to the supermodularity condition. Using the fundamental theorem of calculus, this is

∫_{ql}^{qh} (∂v(h, x)/∂q) dx ≥ ∫_{ql}^{qh} (∂v(l, x)/∂q) dx

or

∫_{ql}^{qh} [∂v(h, x)/∂q − ∂v(l, x)/∂q] dx ≥ 0

If h and l are real numbers with h > l (as in hv(q) and lv(q), for example), we can go further to get

∫_{ql}^{qh} ∫_{l}^{h} (∂²v(y, x)/∂θ∂q) dy dx ≥ 0

Since supermodularity is equivalent to the integrand being non-negative and h > l, the inequality requires qh ≥ ql to hold. This characterizes incentive compatibility in general: the quantity schedule as a function of type, q(θ), must be an increasing function.

Again, we can show that IRL and ICH bind (do so by applying the previous argument to this situation). Then the monopolist’s problem is reduced to

π = max_{th, qh, tl, ql} p(th − cqh) + (1 − p)(tl − cql)

subject to

IRL: v(l, ql) − tl = 0
ICH: v(h, qh) − th = v(h, ql) − tl

Again, we can rearrange the constraints to solve for the transfers:

t∗l = v(l, ql)
t∗h = v(h, qh) − v(h, ql) + v(l, ql)

Substituting these into the objective, we get

π = max_{qh, ql} p(v(h, qh) − v(h, ql) + v(l, ql) − cqh) + (1 − p)(v(l, ql) − cql)

Maximizing yields two first-order conditions:

vq(h, q∗h) = c
p(−vq(h, ql) + vq(l, ql)) + (1 − p)(vq(l, ql) − c) = 0

The first implies that the high type gets the socially optimal provision of quality/quantity: there is no distortion at the top.
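The two first-order conditions can be solved in closed form for particular functional forms. The sketch below assumes v(θ, q) = θ log(1 + q), so that vq(θ, q) = θ/(1 + q); this functional form and the parameter values are illustrative choices, not taken from the text:

```python
# Assumed form: v(theta, q) = theta * log(1 + q), so v_q(theta, q) = theta / (1 + q).
theta_h, theta_l, p, c = 4.0, 3.0, 0.25, 1.0

# High type: v_q(h, q_h) = c  =>  q_h = theta_h / c - 1  ("no distortion at the top").
q_h = theta_h / c - 1

# Low type: v_q(l, q_l) - c = (p/(1-p)) * (v_q(h, q_l) - v_q(l, q_l)),
# which with this form solves to
#   q_l = (theta_l - (p/(1-p)) * (theta_h - theta_l)) / c - 1.
q_l = (theta_l - (p / (1 - p)) * (theta_h - theta_l)) / c - 1

# The efficient quantity for the low type, v_q(l, q) = c, for comparison:
q_l_efficient = theta_l / c - 1

print(q_h, q_l, q_l_efficient)  # q_h = 3.0; q_l ≈ 1.67 < 2.0, the efficient level
```

The output confirms the qualitative result: the high type's quantity is efficient, while the low type's is distorted downward.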
On the other hand, the low type’s condition can be rearranged as

vq(l, ql) − c = (p/(1 − p))(vq(h, ql) − vq(l, ql)) > 0

So if vqq(l, q) < 0, the low type will get an inefficiently low level of quality/quantity. In short, this occurs because providing quality/quantity to the low type improves the high type’s bargaining position against the seller, so the seller restricts supply to extract more from the high type.

Note that the monopolist’s objective function can be written

p(v(h, qh) − cqh) + (1 − p)(v(l, ql) − cql) − p(v(h, ql) − v(l, ql))

where the first two terms are social surplus and the last is H’s informational rent. So the seller can be thought of as a kind of social planner who is maximizing total surplus less the second term, which represents the informational rent that accrues to the high types. The seller treats this rent as a cost, which ends up hurting the low type.

In general, these kinds of problems usually follow a sequence of steps:

• Characterize incentive compatibility and individual rationality. Here, it was equivalent to q(θ) being an increasing function.
• Use the characterization of incentive compatibility and individual rationality to eliminate the transfers from the problem; i.e., figure out a way to substitute the transfers into the objective function.
• Find the profit-maximizing allocation for the seller.

This approach is actually rather general. One of the homework questions guides you through the case where there is a continuum of types with distribution f(θ).

Exercises

1. Suppose that v(h, 2) = 5, v(h, 1) = 3, v(l, 2) = 2, and v(l, 1) = 1, where high types occur with probability p and low types occur with probability 1 − p. The cost of a unit is c = 0. i. Suppose the seller wants to implement a scheme where the high type buys 2 units and the low type buys 1 unit. (a) Characterize incentive compatibility and individual rationality. (b) Formulate the seller’s profit maximization problem. (c) Find the profit-maximizing payments. ii.
Suppose the seller wants to implement a scheme where the high type buys 2 units and the low type gets nothing. (a) Characterize incentive compatibility and individual rationality. (b) Formulate the seller’s profit maximization problem. (c) Find the profit-maximizing payments. iii. When is the scheme in part ii more profitable than part i? Explain your results.

2. Suppose there are three types, L, M and H, with values v(θ, q). The seller is willing to sell quantities 1, 2, 3 at prices t1, t2, t3. The cost of each unit is zero. The probabilities of each type are pL, pM and pH with pL + pM + pH = 1. i. Suppose that the seller wants to implement a scheme where the high type buys 3 units, the medium type buys 2 units, and the low type buys 1 unit. Characterize incentive compatibility and individual rationality for the types. ii. Show that if v(θ, q) satisfies v(θi, q) − v(θi, q′) ≥ v(θℓ, q) − v(θℓ, q′) whenever θi ≥ θℓ and q ≥ q′, then incentive compatibility is satisfied. iii. Formulate the seller’s profit maximization problem and find the profit-maximizing prices, t∗1, t∗2 and t∗3. iv. Find some (non-trivial) parameterization of the problem where the seller prefers to offer the same quantity-price contract to the medium type and the high type, rather than a separate contract to each of them. Is there a systematic reason why this occurs?

3. Suppose that v(θ, q) = θ²√q, with types θh and θl, who occur with probability p and 1 − p respectively. The cost of q to the seller is cq. i. Characterize individual rationality and incentive compatibility for the two types. ii. Formulate the seller’s profit maximization problem. iii. Find the profit-maximizing quantities and transfers. iv. Define δ = q∗l − q°l as the difference between the level of quality chosen by the monopolist and the socially optimal level of quality provision. Is δ increasing or decreasing in c, p, θh, θl? v.
When is it better to separate the two types rather than offer a single contract to the high type and provide nothing to the low type? vi. Show that in the general model, where v(θ, q) is increasing in both arguments and vθq ≥ 0, th ≤ (tl/ql)qh. Explain why this is an example of “quantity discounting”.

4. This question is a little different. There are two types, vh and vl, who occur with probability p and 1 − p, respectively, with vh > vl. There are two periods, and agents discount payments/payoffs in period two at a rate δ. The seller is introducing a new product, and can sell in period 1 at a price p1 or period 2 at a price p2. The good costs c per unit. i. Suppose the seller wants to implement a scheme where the high type buys in period 1 but the low type buys in period 2. Characterize individual rationality and incentive compatibility constraints. ii. Formulate the seller’s profit maximization problem. iii. What are the profit-maximizing prices? iv. When does the seller prefer the scheme characterized above to one in which he sells to all consumers at date one? v. Let δ = e^{−ρ∆}. Show that as ∆ → 0, p1 → p2. What does this mean about the seller’s ability to price discriminate when the time between offers goes to zero? Explain your result briefly in words. Do you think this holds for more general models of trade over time? Why or why not?

5. (Optional, if you feel like seeing something a bit more advanced.) This is patterned after Mussa and Rosen (1978), a seminal paper on price discrimination. There is a measure 1 of consumers, whose types are distributed f(θ) on [0, θ̄] with f(θ) > 0 for all θ ∈ [0, θ̄]. A contract schedule (q(θ), t(θ)) is incentive compatible if each type θ prefers to select the contract (q(θ), t(θ)) rather than some other contract (q(θ′), t(θ′)). The consumers’ indirect utility functions are

U(θ) = max_{θ̂} θv(q(θ̂)) − t(θ̂)

Quality/quantity costs cq for the seller. i.
Write out individual rationality and incentive compatibility conditions, so that type θ does not want to imitate type θ′. Since the consumer’s problem is

U(θ) = max_{θ̂} θv(q(θ̂)) − t(θ̂)

find first-order conditions for maximization in θ̂. ii. Use the envelope theorem and the first-order condition to show that in any incentive compatible mechanism

U(θ) = ∫_{θ̲}^{θ} v(q(z)) dz + U(θ̲)

where θ̲ is the worst-off type who receives any (weakly positive) quantity. iii. Show that the transfers can now be written as

t(θ) = θv(q(θ)) − ∫_{θ̲}^{θ} v(q(z)) dz − U(θ̲)

Note that since the individual rationality constraint binds for θ̲, U(θ̲) = 0. iv. Use an integration by parts to show that

∫_{θ̲}^{θ̄} t(θ)f(θ) dθ = ∫_{θ̲}^{θ̄} [θv(q(θ)) − ((1 − F(θ))/f(θ)) v(q(θ))] f(θ) dθ

Hint: use f(θ)dθ = −d[1 − F(θ)]. v. Use the result from part iv to show that the monopolist’s profits can be written

π = max_{θ̲, q(θ)} ∫_{θ̲}^{θ̄} [θv(q(θ)) − ((1 − F(θ))/f(θ)) v(q(θ)) − cq(θ)] f(θ) dθ

vi. Show that the profit-maximizing schedule q(θ) is defined pointwise by

(θ − (1 − F(θ))/f(θ)) v′(q(θ)) = c

and θ̲ is characterized by

θ̲v(q(θ̲)) − ((1 − F(θ̲))/f(θ̲)) v(q(θ̲)) − cq(θ̲) = 0

vii. When is q(θ) increasing in θ, as required by incentive compatibility? viii. If v(q) = √q, characterize the optimal schedule. Is it concave or convex? Are the tariffs t(q(θ)) concave or convex? ix. The Lerner index is

L = (p − c)/p

If p = t′(q(θ)), how does L vary across the types?

Chapter 23 Monotone Comparative Statics

This last chapter provides a really useful set of tools for doing comparative statics in economics. We haven’t mentioned it much before because MWG and most of what you do in the first year can be accomplished (indeed, is expected to be accomplished) using older methods. These are tools that you will really want to keep in mind for your own research later on. How does the monotone comparative statics (MCS) approach compare to the implicit function theorem?
The IFT takes as a starting point a generic system of equations, F(x(α), α) = 0, where x ∈ Rn is a vector of endogenous variables determined implicitly by α, and α is an exogenous parameter. We totally differentiate the system with respect to α, getting the identity

Dx F(x(α), α) Dα x(α) + Dα F(x(α), α) = 0

If the n × n matrix Dx F(x(α), α) is invertible (for a maximization problem, this is the Hessian matrix, and negative definite matrices are always invertible), then

Dα x(α) = −[Dx F(x(α), α)]⁻¹ Dα F(x(α), α)

Recall a quick example of how this works. Suppose a monopolist faces a demand curve p(x) and has total costs c(x) = cx. Then his maximization problem is

π(x) = max_x p(x)x − cx

Taking a derivative yields

p′(x)x + p(x) − c = 0

and the sufficient second-order condition is

p′′(x(c))x(c) + 2p′(x(c)) < 0

The first-order condition is the system to which the IFT is applied:

F(x(c), c) = p′(x)x + p(x) − c = 0

so x(c) is an implicit function of c. Totally differentiating the condition gives

[p′′(x(c))x(c) + 2p′(x(c))] (dx(c)/dc) − 1 = 0

where the bracketed term is the second-order condition, and rearranging gives

dx(c)/dc = 1/(p′′(x(c))x(c) + 2p′(x(c))) < 0

The IFT just formalizes the above steps, so that you can always skip right to:

x′(c) = −Fc(x(c), c)/Fx(x(c), c) = 1/(p′′(x(c))x(c) + 2p′(x(c))) < 0

Note, however, that the IFT’s starting point is an arbitrary system of equations F(x(α), α) = 0. It is equally valid when used in economics, in a fluid mechanics problem in engineering, in a biology study of species evolution in a dynamical system, and so on. In economics, however, our systems typically come from maximization problems. This added structure might be exploited usefully to gain additional insight. In particular, MCS is a different approach that does not even rely on differentiability (though there are powerful tools that assume differentiability).
Instead, it relies on complementarity of the controls and parameters inherent in the problem to derive the changes in behavior that result from changes in parameters.

23.0.1 Single Crossing and the Monotonicity Theorem

Let X ⊂ R be any set.

Definition 23.0.1 A function f : X → R satisfies the single crossing condition (SCC) if, for all t > t′, f(t′) > 0 implies f(t) > 0, and f(t′) ≥ 0 implies f(t) ≥ 0. A function f satisfies the strict single crossing condition if t > t′ and f(t′) ≥ 0 imply f(t) > 0.

In particular, imagine a first-order condition for a maximization problem, f′(x(α), α) = 0. This first-order condition satisfies the SCC if it only crosses the x-axis once (or possibly “sits” on the x-axis for some convex interval). If shifting α entails a systematic shifting of the first-order condition one way or the other, then we can predict how changes in α will translate into changes in the solution x(α).

Let g(x, t) be a function mapping R² → R; in particular, x is the control and t is a parameter.

Definition 23.0.2 The function g(x, t) satisfies single crossing differences (SCD) if, for x′ > x, f(t) = g(x′, t) − g(x, t) satisfies the SCC. Namely, if x′ > x and t′ > t, then g(x′, t) − g(x, t) > 0 implies that g(x′, t′) − g(x, t′) > 0. (Strict SCD: if x′ > x, t′ > t, and g(x′, t) − g(x, t) ≥ 0, then g(x′, t′) − g(x, t′) > 0.)

The following theorem is fundamental:

Theorem 23.0.3 (Monotonic Selection Theorem) Let X be a compact subset of R, and let t ∈ T = [a, b]. Then the function g : X × T → R satisfies the strict single crossing differences condition iff for every compact subset S ⊂ X, every optimal selection x∗(t, S) ∈ Argmax_{x∈S} g(x, t) is non-decreasing in t.

Proof If g() satisfies the strict SCD, then every optimal selection is non-decreasing: let t0 < t1, and let x∗0 ∈ Argmax_{x∈S} g(x, t0) and x∗1 ∈ Argmax_{x∈S} g(x, t1). Optimality implies that g(x∗0, t0) − g(x∗1, t0) ≥ 0 and g(x∗0, t1) − g(x∗1, t1) ≤ 0.
These two inequalities and strict SCD imply x∗1 ≥ x∗0: if instead x∗0 > x∗1, then g(x∗0, t0) − g(x∗1, t0) ≥ 0 and strict SCD would give g(x∗0, t1) − g(x∗1, t1) > 0, a contradiction.

(By contrapositive) If g() does not satisfy the strict SCD, then there is some optimal selection that is not non-decreasing: suppose that strict SCD does not hold. Then there are t0 < t1 and x0 > x1 such that g(x0, t0) − g(x1, t0) ≥ 0 and g(x0, t1) − g(x1, t1) ≤ 0. But then if we let S be the compact set S = {x0, x1}, the selection with x∗(t0) = x0 and x∗(t1) = x1 is optimal, and since x0 > x1, it is decreasing.

This theorem answers all your questions about comparative statics predictions: monotonicity of every optimal selection on compact subsets of X is equivalent to the strict SCD. Note that nothing about concavity, convexity, or differentiability entered the picture. Moreover, the assumption of compactness is just to ensure that a solution Argmax_{x∈S} g(x, t) exists, so S can be any closed, bounded subset of R, including finite sets {x1, x2, ..., xK}.

What functions satisfy SCD? If g(x, t) is differentiable, then either of the following conditions implies SCD:

∂²g(x, t)/∂x∂t ≥ 0, or (for g > 0) ∂²log(g(x, t))/∂x∂t ≥ 0

Additionally, the SCD and strict SCD properties are preserved by strictly monotone transformations of g(), such as log() or ()^α.

23.1 Lattices and Supermodularity

The natural setting for most comparative statics studies is actually not Rn, but another kind of mathematical object called a “lattice”.

Lattices

Suppose X is a partially ordered set with order ≥. Let x, y ∈ X. Then

• x ∨ y is the join of x and y. This is the least upper bound of x and y, or x ∨ y = inf{z : z ≥ x, z ≥ y}
• x ∧ y is the meet of x and y. This is the greatest lower bound of x and y, or x ∧ y = sup{z : z ≤ x, z ≤ y}

For example, take X = {(1, 1), (1, 2), (2, 1), (2, 2)} with the componentwise order. Then

(1, 2) ∨ (2, 1) = (2, 2), (1, 2) ∧ (2, 2) = (1, 2)

If we had X′ = {(1, 4), (2, 3), (3, 2), (4, 4)}, then (1, 4) ∧ (2, 3) = (1, 3), and (1, 4) ∨ (4, 4) = (4, 4). Note that every meet and join of two elements of X is also in X, but that is false for X′: (1, 4) ∧ (2, 3) = (1, 3) ∉ X′.
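Under the componentwise order on R^n, the join and meet are just the coordinate-wise max and min, so the examples above are easy to verify directly:

```python
# Join and meet in R^n under the componentwise order.
def join(x, y):
    return tuple(max(a, b) for a, b in zip(x, y))

def meet(x, y):
    return tuple(min(a, b) for a, b in zip(x, y))

X = {(1, 1), (1, 2), (2, 1), (2, 2)}
# Every pairwise join and meet stays in X, so X is a lattice.
assert all(join(x, y) in X and meet(x, y) in X for x in X for y in X)

X2 = {(1, 4), (2, 3), (3, 2), (4, 4)}   # the set called X' in the text
print(meet((1, 4), (2, 3)))             # (1, 3)
print(meet((1, 4), (2, 3)) in X2)       # False: X' fails to contain a meet
```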
Partially ordered sets with the added property that the meet and join of any two elements are also in the set are special. A partially ordered set X is a lattice if, for any x, y ∈ X, the elements x ∨ y and x ∧ y are also in X. A set Y ⊂ X is a sublattice of X if, using the same partial order as on X, for any x, y ∈ Y, the elements x ∨ y and x ∧ y are also in Y.

Proposition 23.1.1 Any n-cell in Rn, ordered componentwise (x ≥ y if xi ≥ yi for all i), is a lattice.

Proof Recall that an n-cell C(a, b) in Rn is defined by taking two vectors a and b with ai < bi for all i = 1, 2, ..., n, and letting C(a, b) = {x : ai ≤ xi ≤ bi}. Now, let x and y be two vectors in C(a, b). The element x ∨ y is the coordinate-wise maximum of x and y: (x ∨ y)i = max{xi, yi}. Each component max{xi, yi} is bounded below by ai and above by bi, so every component of x ∨ y lies in the corresponding interval [ai, bi], and the cell contains the least upper bound of any two of its members. The same argument holds for x ∧ y, substituting min for max, so the cell contains the greatest lower bounds of its members as well. Therefore, C(a, b) is a lattice.

Note, however, that if we replaced the cell C(a, b) in the previous example with its intersection with Qn, the same argument would apply, since the coordinate-wise maximum of two vectors of rational numbers is a rational vector as well. Further still, if we picked any grid of numbers G = {g1, g2, ..., gn} with gi ≤ gi+k, created the n-fold product Gn, and endowed it with the componentwise partial order, the same arguments would go through, since the meet and join are just the coordinate-wise min and coordinate-wise max of any two points, which will again be in Gn.

Consider the set A = {x : x1 + x2 ≥ 1, x1 ≤ 1, x2 ≤ 1}. It looks like a triangle, with vertices at (0, 1), (1, 0), and (1, 1).
If we compute (0, 1) ∧ (1, 0) = (0, 0), that is not a point in A, so A is not a lattice. So it is easy to construct objects that are partially ordered sets but not lattices.

The takeaway of this subsection: lattices are sets that include the coordinate-wise max and coordinate-wise min of every pair of points in the set. In Rn, they look roughly like n-cells or Cartesian products of "square" grids of points.

Super- and submodularity

The classes of super- and submodular functions are very important.

Definition 23.1.2 Let X be a lattice, and f : X → R. Then f is supermodular on X if

f(x ∨ y) + f(x ∧ y) ≥ f(x) + f(y)

and submodular if

f(x ∨ y) + f(x ∧ y) ≤ f(x) + f(y)

Where have we seen these conditions before? Monopolistic screening and signaling. We needed them to ensure that types could be separated by contracts. If f is twice differentiable on X, then f is supermodular if, for all i ≠ j,

∂²f(x)/∂xi∂xj ≥ 0

and submodular if

∂²f(x)/∂xi∂xj ≤ 0

Note, furthermore, that these are the cross-partial conditions that implied single-crossing differences (strict SCD when the inequality is strict). So everything is tied together: to construct an incentive scheme that separates the different types of agents, a supermodularity condition is required so that the agents' payoffs satisfy SCD and, consequently, the monotonic selection theorem: higher types choose higher quantities or more intense signaling.

Note that modular functions are defined on lattices, but we might have simply defined supermodular functions on appropriate subsets of Rn. However, it turns out that the set of optimizers of a supermodular function on a lattice is a sublattice (Theorem 23.1.5 below). This means that there is a greatest maximizer and a least maximizer, and all the intermediate maximizers are ordered. In many applications, this is an incredibly useful property. For example, suppose you want to do comparative statics and find that your model has multiple maximizers, but the objective function happens to be supermodular.
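The defining inequality is easy to check numerically. Here is a sketch (my own example, with an assumed Cobb-Douglas form and grid) verifying f(x ∨ y) + f(x ∧ y) ≥ f(x) + f(y) for f(K, L) = K^0.5 L^0.5 on a lattice of grid points:

```python
# Numeric check of supermodularity for f(K, L) = sqrt(K * L),
# whose cross-partial 0.25 * K**-0.5 * L**-0.5 is positive for K, L > 0.
import itertools

def f(p):
    K, L = p
    return K ** 0.5 * L ** 0.5

def join(x, y): return tuple(max(a, b) for a, b in zip(x, y))
def meet(x, y): return tuple(min(a, b) for a, b in zip(x, y))

grid = list(itertools.product([0.5, 1.0, 2.0, 4.0], repeat=2))
supermodular = all(
    f(join(x, y)) + f(meet(x, y)) >= f(x) + f(y) - 1e-12
    for x in grid for y in grid
)
print(supermodular)  # True
```

The small tolerance (1e-12) only guards against floating-point rounding; the inequality itself holds exactly for this function.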
Then you can focus on the greatest maximizer and conduct the comparative statics exercise without worrying about the multiplicity of solutions.

A simple example is complementary inputs to production like capital and labor, where π(K, L) = K^α L^β:

∂²π(K, L)/∂K∂L = αβ K^{α−1} L^{β−1} ≥ 0

A closely related idea to supermodularity is "increasing differences":

Definition 23.1.3 Let X be a lattice and T a partially ordered set. The function g : X × T → R exhibits (strictly) increasing differences in (x, t) if g(x, t) − g(x, t′) is (strictly) increasing in x for all t ≥ t′.

Note that the SCD condition only restricted the sign of f(t) = g(x′, t) − g(x, t): once f crosses the t-axis from below, it can never cross back. This new condition places a full monotonicity restriction on f(t). Supermodularity is a stronger property than increasing differences: if g() is (strictly) supermodular on X × T, then it has (strictly) increasing differences on X × T. Note that supermodularity is unrelated to convexity, concavity, returns to scale, or any topological concepts like compactness. The comparative statics results below are tied instead to lattices and the supermodularity and increasing differences conditions.

Maximization

Consider the family of optimization problems indexed by a parameter t ∈ T:

max_{x∈S} g(x, t)

where X is a lattice, T is a partially ordered set, and S ⊂ X. A solution correspondence φ : T → S gives all of the solutions to max_{x∈S} g(x, t) for each t ∈ T. We are interested in the behavior of solution correspondences, since these are the objects of our comparative statics analysis. A value φ(t) of the correspondence is higher than φ(t′) if sup φ(t) ≥ sup φ(t′) and inf φ(t) ≥ inf φ(t′). The correspondence is increasing if, for t ≥ t′, φ(t) is higher than φ(t′).

Theorem 23.1.4 Let g : X × T → R be supermodular on the lattice X for each t in the partially ordered set T, and let S be a sublattice of X. Then

• (i) φ(t) is a sublattice of S for all t.

• (ii) If g() has increasing differences in (x, t), then φ(t) is increasing.
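Part (ii) is easy to see in action. Below is a sketch (my own example: the objective, grid, and parameter values are assumptions, not from the notes) where g(x, t) = t·x − x²/2 has ∂²g/∂x∂t = 1 > 0, so it has strictly increasing differences, and the brute-force argmax over a finite grid is non-decreasing in t:

```python
# Monotone comparative statics by brute force: the solution
# correspondence phi(t) = argmax_{x in S} g(x, t) moves up with t.

S = [k * 0.25 for k in range(13)]          # grid from 0 to 3 (a lattice)
T = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # ordered parameter values

def g(x, t):
    return t * x - x ** 2 / 2              # cross-partial in (x, t) is 1 > 0

def phi(t):
    """All maximizers of g(., t) on S (the solution correspondence)."""
    best = max(g(x, t) for x in S)
    return [x for x in S if abs(g(x, t) - best) < 1e-12]

selections = [max(phi(t)) for t in T]      # greatest maximizer for each t
print(selections)
assert all(a <= b for a, b in zip(selections, selections[1:]))
```

Here the unconstrained maximizer is x = t, and the grid selection tracks it upward, exactly the monotonicity that Theorem 23.1.4(ii) guarantees without any appeal to first-order conditions.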
• (iii) If g() is strictly supermodular in x, then φ(t) is ordered.

• (iv) If g() has strictly increasing differences in (x, t), then φ(t) is strictly increasing.

Proof (ii) Let x1 ∈ φ(t1) and x2 ∈ φ(t2) with t2 ≥ t1. We show that min{x1, x2} ∈ φ(t1) and max{x1, x2} ∈ φ(t2). Observe that

0 ≥ g(max{x1, x2}, t2) − g(x2, t2)      (x2 is maximal at t2)
  ≥ g(max{x1, x2}, t1) − g(x2, t1)      (g has increasing differences)
  ≥ g(x1, t1) − g(min{x1, x2}, t1)      (g is supermodular)
  ≥ 0                                    (x1 is maximal at t1)

so every inequality holds with equality: max{x1, x2} ∈ φ(t2) and min{x1, x2} ∈ φ(t1), and φ is increasing.

(iii) If g() is strictly supermodular on S, let x1, x2 ∈ φ(t) with neither x1 ≥ x2 nor x2 ≥ x1, so they are not comparable. Since g() is strictly supermodular on S,

0 ≥ g(max{x1, x2}, t) − g(x1, t) > g(x2, t) − g(min{x1, x2}, t) ≥ 0

which is a contradiction. Therefore, all x ∈ φ(t) are comparable.

(iv) Let t2 > t1, x2 ∈ φ(t2), and x1 ∈ φ(t1). Suppose (by way of contradiction) that x2 < x1. Then max{x1, x2} = x1 > x2, and

0 ≥ g(max{x1, x2}, t2) − g(x2, t2)      (x2 is maximal at t2)
  > g(max{x1, x2}, t1) − g(x2, t1)      (g has strictly increasing differences)
  = g(x1, t1) − g(x2, t1) ≥ 0            (x1 is maximal at t1)

which is a contradiction, since the chain would give 0 > 0.

Part (i) and part (iii) also suggest the following:

Theorem 23.1.5 If g() is a supermodular function on a lattice S ⊂ X, then argmax_{x∈S} g(x, t) is a sublattice of S.

Proof Pick any x and y in argmax_{x∈S} g(x, t). Then by supermodularity and the maximality of x and y,

0 ≤ g(x, t) − g(x ∨ y, t) ≤ g(x ∧ y, t) − g(y, t) ≤ 0

so that x ∨ y and x ∧ y are also in argmax_{x∈S} g(x, t).

So we have made no concavity or differentiability assumptions, and no particular topological assumptions except those that ensure a maximizer exists at all, and we get two very useful theorems: the set of maximizers of a supermodular function on a lattice is itself a lattice, and comparative statics results are practically free.

Example: Firms

Here's an example. Suppose a firm is choosing capital and labor mixes, given a technology parameter t, resulting in a profit function π(K, L, t).
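Theorem 23.1.5 can also be illustrated numerically. Here is a sketch (my own example, not from the notes): f(x1, x2) = −(x1 − x2)² has cross-partial +2 ≥ 0, so it is supermodular, and its maximizer set on the lattice {0, 1, 2}² is the diagonal, which is closed under meet and join.

```python
# The argmax set of a supermodular function on a lattice is a sublattice.
import itertools

def f(p):
    x1, x2 = p
    return -(x1 - x2) ** 2          # supermodular: cross-partial is +2

def join(x, y): return tuple(max(a, b) for a, b in zip(x, y))
def meet(x, y): return tuple(min(a, b) for a, b in zip(x, y))

X = list(itertools.product(range(3), repeat=2))   # the lattice {0,1,2}^2
best = max(f(p) for p in X)
argmax = {p for p in X if f(p) == best}
print(sorted(argmax))               # [(0, 0), (1, 1), (2, 2)]

is_sublattice = all(join(x, y) in argmax and meet(x, y) in argmax
                    for x in argmax for y in argmax)
print(is_sublattice)                # True
```

Note that the maximizers here form a chain, so there is a greatest and a least maximizer, as the theorem promises.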
The matrix of cross-partials is

    [        *            ∂²π/∂L∂K        ∂²π/∂t∂K   ]
    [   ∂²π/∂K∂L              *           ∂²π/∂t∂L   ]
    [   ∂²π/∂K∂t          ∂²π/∂L∂t            *      ]

where every entry is evaluated at (K, L, t). These cross-partials are all positive for a function like tK^α L^β. Then, since π(K, L, t) is a supermodular function on any 3-cell in R³, we get the result that K and L are increasing in t. Done. No Hessians, no mention of concavity, convexity, or Cramer's rule. Even if K and L were real numbers but t came from an ordered set t0 ≤ t1 ≤ ... ≤ tn, this would still work: define (K, L, t) ≥ (K′, L′, t′) if K ≥ K′, L ≥ L′, and t ≥ t′, and require that relationships like

∂π(K, L, tk)/∂K − ∂π(K, L, tk−1)/∂K ≥ 0

hold. This can be incredibly useful when studying models that have natural complementarities but are not differentiable.
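The firm example can be run numerically. Below is a sketch (my own numbers: the exponents, input costs, and grid are assumptions, not from the notes) with π(K, L, t) = t K^0.4 L^0.4 − K − L, which is supermodular in (K, L, t) on any grid, so the brute-force optimal K and L should be non-decreasing in t:

```python
# Topkis in action for the firm: optimal capital and labor rise with
# the technology parameter t, found by brute force on a grid lattice.
import itertools

def profit(K, L, t):
    return t * K ** 0.4 * L ** 0.4 - K - L

grid = [0.01, 0.05, 0.1, 0.3, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

def best_input(t):
    """A profit-maximizing (K, L) on the grid for technology level t."""
    best = max(profit(K, L, t) for K, L in itertools.product(grid, grid))
    opt = [(K, L) for K, L in itertools.product(grid, grid)
           if abs(profit(K, L, t) - best) < 1e-12]
    return max(opt)   # greatest maximizer (unique on this grid anyway)

K1, L1 = best_input(1.0)
K2, L2 = best_input(2.0)
K3, L3 = best_input(3.0)
print((K1, L1), (K2, L2), (K3, L3))
assert K1 <= K2 <= K3 and L1 <= L2 <= L3
```

No derivatives of the solution, no Hessian conditions: monotonicity of the optimal inputs in t comes entirely from the complementarity structure.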