Chapter 6: Single Firm Dynamics

Steven Berry (Yale Univ.) and Ariel Pakes (Harvard Univ.)∗

November 11, 2003
1 Introduction
In the course so far, we have considered the empirical analogs of many of the
classic models of Industrial Organization. However, almost all of these models
lack any sophisticated dynamics. This is perhaps not surprising, in that it
can be difficult enough to model interactions between firms without modeling
how firms and markets evolve over time. Yet, we know that industries and
firms really do change and evolve over time and we would like models that
capture some of this behavior.
In the present notes, we take a step back in some ways by returning to
models of single firms facing an exogenous environment. However, models of
single-firm dynamics are interesting both in their own right and as building
blocks for dynamic models of industry behavior.
First we begin with some theoretical background. This theory background
is more detailed than in earlier sections, because the basic theory is often
lacking even from theoretically oriented IO textbooks like Tirole. We then
consider a set of simple models that have been applied to data.
For more on the theory, the reader might also want to consult a standard
text, such as Stokey, Lucas, and Prescott (1989). For estimation issues, one
could also read Rust (1994), Pakes (1994) and Wolpin and Keane (1994).
For computational issues, one might consult Judd (1998).

∗This is very much a work-in-progress. Corrections welcome. Copyright by Steven
Berry and Ariel Pakes.
2 Review of Theory
We consider a simple investment example and review some results for
models with and without uncertainty. In the body of the text, we provide
some intuition for results, but refer the reader to a fairly detailed and
technical appendix for formal treatment.
2.1 A Simple Investment Problem and the Contraction Mapping Theorem.
We consider the problem of a firm that must choose a capital level, kt , in
each period t. Say that capital evolves according to the deterministic rule
    kt+1 = (1 − δ)kt + it,    (1)
where it is new investment and δ is a depreciation factor. Say that there is
some cost of investment, c(it), and that gross single-period profits are given
by π(kt). Then, single-period profits net of investment costs are

    π(kt) − c(it).    (2)
We can also think of these profits as a function of current and future capital
stocks by writing single period profits as
    F(kt, kt+1) = π(kt) − c(kt+1 − (1 − δ)kt).    (3)

Note that the notation generalizes to many other problems involving a
choice in the future that depends on a choice today.
The simplest way to write the firm's dynamic problem is

    max_{kt, t=1,...,∞} Σ_{t=0}^∞ β^t F(kt, kt+1)    (4)

subject to

    kt+1 ∈ Γ(kt), t = 0, 1, 2, ...,
and k0 is given. The set Γ(kt) is the set of feasible future levels of capital,
which might for example be

    Γ(kt) = [(1 − δ)kt, (1 − δ)kt + ī],

where ī is some maximum possible level of investment.
The maximization problem in (4) is called the "sequence problem," because
its solution is a sequence of policies and a function which gives us the
value of those policies for every initial condition k0.
The sequence problem can be difficult because it requires us to solve for
an infinite stream of future policies. Luckily, the solution to the sequence
problem is often the same as the solution to a different, and in some ways
easier, problem, called the functional equation. This expresses the problem
recursively as

    v(kt) = sup_{kt+1 ∈ Γ(kt)} [F(kt, kt+1) + βv(kt+1)].    (5)
Now the problem is to find a valuation function v as well as a “policy”
function that chooses kt+1 as a function of kt .
Note that the sequence problem looks for an infinite sequence of numbers
among all possible such sequences. The difficult part of the functional
equation problem (5) is to find the valuation function, v(·). Once we know
that function, all we have to do is maximize the right-hand side of (5) and
we will find the optimal policy for each possible k0. I.e., given v(·), this is a
familiar static maximization problem.
What we have done is transform the sequence problem to a problem of
finding the value function, v(·). It is not immediately obvious that this is an
easier problem, or even that a solution exists. Fortunately, the Contraction
Mapping theorem in the appendix tells us that
1. the Sequence Problem and the Functional Equation are equivalent and
2. the Functional Equation defines a contraction mapping.
The formal definition of a contraction mapping is once again given in the
appendix. Informally, consider the right-hand side of (5) evaluated at an
arbitrary function w(k):

    w′(k) = sup_{k′ ∈ Γ(k)} [F(k, k′) + βw(k′)].    (6)
The maximized value of this problem is another function, w′(k). When
the right-hand side is a contraction mapping, then the sequence of functions
found by the recursive application of (6) is guaranteed to converge to a unique
value function, which is the solution to the functional equation. Note that the
contraction mapping theorem applies under fairly weak conditions, including
when k is a vector of state variables.
This has great importance as a theoretical result about the dynamic
problem. It also has tremendous importance as a practical suggestion for
computing the answer to our problem. We take an initial guess as to the value
function v (say that it is identically equal to zero) and then iterate on the
functional equation to obtain an answer. Further, the contraction mapping
theorem tells us an upper bound for how far we can be from the answer.
2.2 Computation
It may help to consider how one would recursively compute a value function.
Consider a general deterministic dynamic problem with state variable x,
defined by the value function
    v(x) = max_{x′} [π(x, x′) + βv(x′)].    (7)
To compute v, one would typically begin by discretizing the state space
so that x ∈ {1, . . . , K}. The single-period profit function π is then a K by K
matrix with rows corresponding to values of x′ and columns corresponding
to x. The value function v is a K by 1 vector. We begin with an arbitrary
guess at the value function, say it is the vector w. Applying the contraction
mapping, we find the new vector
    Tw(x) = max_{x′} [π(x, x′) + βw(x′)].    (8)
To find the k-th element of Tw, we just find the maximum element of the
column vector that is the sum of the vector βw and the k-th column of the
matrix π. This is a trivial exercise in a matrix programming language like
Gauss or Matlab.1
1Indeed, in Gauss, the code for one iteration on the value function would just be

    tw = maxc(π + βw),    (9)

where π is a matrix, w is a vector, and the "maxc" operator returns the vector which
contains the maximums of each column of π + βw. (Note that in a matrix programming
language, addition of a matrix plus a vector creates the "obvious" resulting matrix with
(i, j) element πij + wi.)
While a single iteration is easy, the amount of computation grows with
the dimension of the state space. If there are L state variables (instead of
just one), with each state variable discretized on K points, then the size of
the relevant state space is K^L, which grows very rapidly in L. This problem
is a simple version of the "curse of dimensionality" which affects dynamic
computations.
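To make the iteration concrete, here is a minimal Python sketch of value function iteration on a discretized state space. Everything in it (the grid, the square-root profit, the quadratic adjustment cost, the tolerance) is an illustrative assumption, not a calibrated model; rows of the profit matrix index x′ and columns index x.

```python
import numpy as np

# Sketch of value function iteration on a discretized state space.
# pi[i, j] is the one-period payoff F(x_j, x'_i) of moving from current
# state x = x_j (columns) to next state x' = x_i (rows). The square-root
# profit and quadratic adjustment cost are purely illustrative.
K = 50
beta = 0.95
x = np.linspace(0.1, 10.0, K)          # discretized state (e.g., capital)
xp = x.reshape(K, 1)                   # next-period state, down the rows
pi = np.sqrt(x) - 0.2 * (xp - 0.9 * x) ** 2   # hypothetical F(x, x')

v = np.zeros(K)                        # arbitrary initial guess
for _ in range(10_000):
    tv = np.max(pi + beta * v.reshape(K, 1), axis=0)   # column maxima
    gap = np.max(np.abs(tv - v))
    v = tv
    # The contraction property bounds the remaining error by
    # beta / (1 - beta) times the last iteration's change.
    if gap * beta / (1 - beta) < 1e-8:
        break

policy = np.argmax(pi + beta * v.reshape(K, 1), axis=0)  # optimal x' index
```

The stopping rule uses the error bound from the contraction mapping theorem: when the sup-norm change in one iteration is g, the current iterate is within βg/(1 − β) of the fixed point.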
2.3 Establishing Properties of the Value Function.
Here, we discuss applications of “Corollary 1”. Not written.
1. Idea of the Corollary, which is formally stated in the Appendix.
2. Discuss how to show strong properties, as opposed to just weak.
3. Show the proof of monotonicity.
4. Discuss the proof of convexity.
2.4 Introduction to Stochastic Problems.
For empirical work, purely deterministic models are not very interesting. At
the least, one would want to add a stochastically changing environment, if
not also random shocks to the firm’s own outcomes. In this environment, we
can sometimes use “Euler equation” methods to estimate model parameters.
Our earlier discussion of the size distribution suggested some form of
"Gibrat's Law" as an explanation for the skewed firm sizes that are observed
in data. We will not discuss models that directly lead to Gibrat's law, but
we do want to emphasize a point that is related to the Gibrat/Simon model:
it is likely that the outcomes of firms' decisions are subject to random shocks.
Thus, we will also consider models of “Stochastic Accumulation” where the
firm cannot exactly choose the outcomes of its own investment, but can only
alter the distribution of future outcomes.
Stochastic Environments
We can begin with the simple investment model with an exogenous random
term, z, that shifts profits over time. This could be a market price (or a set
of prices) in a perfectly competitive market. We will assume that z follows
a first-order Markov process, with distribution P(zt+1 | zt). The variable z
is exogenous in that the firm cannot affect its evolution over time.
For the simple investment problem, the single-period return is now

    π(k, z) − c(i),    (10)

where again k is capital and c is the cost of investment, i. We assume the
deterministic accumulation rule

    kt+1 = (1 − δ)kt + it.    (11)
The value function becomes

    v(k, z) = max_{i ∈ Γ(k,z)} [π(k, z) − c(i) + β ∫ v((1 − δ)k + i, z′) P(dz′ | z)].    (12)
Moving away from the simplest investment context, we can consider a
model with endogenous state x (where x is possibly a vector) and exogenous
environment z. The problem is to choose next period's state, x′, from the
feasible set Γ(x, z). The value function is

    v(x, z) = max_{x′ ∈ Γ(x,z)} [π(x, z, x′) + β ∫ v(x′, z′) P(dz′ | z)].    (13)
Here x could be any kind of physical investment, or could be a marketing
effort. One could set up a dynamic demand model where x is the current
“stock” of consumers of the firm and where today’s price set by the firm,
together with today’s stock of consumers, determines tomorrow’s stock x0 .
Much of the discussion from the purely deterministic case carries over to
the present context. For example,

• we can use Blackwell's conditions to show that (13) defines a contraction,

• we can use Corollary 1 in the appendix to prove properties of the value
function, and

• we can compute the value function by simple recursive methods.
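The last point can be made concrete: iterating on (13) with a discretized x and a finite Markov chain for z is a direct extension of the deterministic computation. Below is a sketch with an invented two-state "price" process and an illustrative profit function; none of the numbers come from an actual application.

```python
import numpy as np

# Sketch: value function iteration with an exogenous first-order Markov
# shock z. v is a K-by-M array over (endogenous state x, exogenous z).
# The profit function, grids, and transition matrix are all illustrative.
K, M = 40, 2
beta = 0.95
x = np.linspace(0.1, 8.0, K)
z = np.array([0.8, 1.2])                 # two exogenous "price" states
Pz = np.array([[0.9, 0.1],               # Pz[m, n] = Pr(z' = z_n | z = z_m)
               [0.2, 0.8]])

# pi[i, j, m]: one-period profit of choosing x' = x_i in state (x_j, z_m).
xp = x[:, None, None]                    # next state x'
xc = x[None, :, None]                    # current state x
pi = z[None, None, :] * np.sqrt(xc) - 0.3 * (xp - 0.9 * xc) ** 2

v = np.zeros((K, M))
for _ in range(5000):
    ev = v @ Pz.T                        # ev[i, m] = E[v(x_i, z') | z_m]
    tv = np.max(pi + beta * ev[:, None, :], axis=0)
    gap = np.max(np.abs(tv - v))
    v = tv
    if gap * beta / (1 - beta) < 1e-8:   # contraction-mapping error bound
        break
```

Because the shock process here is stochastically increasing and profits rise with z, the computed value function is higher in the good exogenous state for every x.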
The next subsection provides an estimation method that is sometimes
useful for this class of models when investment is a continuous variable.
2.5 Estimation for Continuous Problems: Euler Equations.
Here we discuss Euler equation methods and their limitations.
To illustrate the method, consider a classic investment problem. The
single-period return for firm j depends on capital, k, investment, i, and an
exogenous state of the world, z:

    π(kjt, zjt, θ) − c(ijt, θ),    (14)

where c is the cost of investment. Assume a standard depreciation rule

    kt+1 = (1 − δ)kt + it.    (15)
The value function is

    v(kjt, zjt) = max_{ijt ∈ Γ(kjt)} [π(kjt, zjt) − c(ijt) + β ∫ v((1 − δ)kjt + ijt, z′) P(dz′ | zjt)],    (16)
which gives an optimal policy i∗(k, z). One way to take the model to data
would be, for each θ, to solve Bellman's equation for v and the associated
optimal investment rule. This would require knowledge (or an estimate) of
the law of motion for the exogenous shifters z. The actual investment could
then be compared to the predicted investment at alternative θs.
A Euler equation method relies instead on a feasible perturbation of the
optimal policy which reallocates investment from t + 1 to t, while keeping k
constant at all times after t + 1. Start from the optimal strategy i∗jt and
consider an alternative strategy which increases investment by a small
amount ε in period t and then in period t + 1 reduces the optimal investment
from i∗jt+1 to i∗jt+1 − (1 − δ)ε.
First, note that the two changes together keep kjt+2 at its original
unperturbed level, k∗jt+2 (writing ε for the small increase in period-t
investment):

    kjt+2 = (1 − δ)kjt+1 + ijt+1    (17)
          = (1 − δ)((1 − δ)kjt + i∗jt + ε) + i∗jt+1 − (1 − δ)ε    (18)
          = (1 − δ)((1 − δ)kjt + i∗jt) + i∗jt+1
          = (1 − δ)k∗jt+1 + i∗jt+1 ≡ k∗jt+2.    (19)
Second, note that the perturbation is only feasible if i∗jt+1 is always (for
any realization of zjt+1) bounded away from the lower boundary of Γ. This
requirement will frequently be violated in micro data, where firms do not
necessarily invest in every period. This is an important constraint on the use
of Euler equation methods.
Note that if i∗ is in the interior of Γ, then the optimal investment path
i∗ must satisfy the necessary condition that the derivative of the expected
profit stream with respect to the perturbation is zero. Since the profit
stream from t + 2 onwards is constructed so as to be unaffected by the
perturbation, the derivative will only involve terms in t and t + 1. Specifically,
this first-order condition (or "Euler equation") is
    −(∂c/∂i)(ijt, θ) + β ∫ [(∂π/∂k)(kjt+1, z′, θ) + (1 − δ)(∂c/∂i)(i∗jt+1, θ)] P(dz′ | zjt) = 0.    (20)

We do not directly observe the expectation in (20), but, given the parameter
θ and functional forms for π and c, we do see a realization of the integrand.
In an important article, Hansen and Singleton (1982) showed how to use this
fact to construct a method of moments estimation procedure.

To develop the estimator, first denote the (discounted) integrand of the
expectation as the function

    h(k, z, i, θ) = β(∂π/∂k)(k, z, θ) + β(1 − δ)(∂c/∂i)(i, θ).    (21)

Then the deviation of the realized h from its mean is the random variable

    vjt = h(kjt+1, zjt+1, ijt+1, θ) − ∫ h(kjt+1, z′, i∗(kjt+1, z′), θ) P(dz′ | zjt),    (22)

where i∗(k, z) is the optimal investment policy as a function of capital and
the state of the world, z, and where (kjt+1, zjt+1, ijt+1) is a vector of realized
data. Since vjt is an expectational error, it is uncorrelated with everything
known at time t, including (kjs, ijs, zjs) for s < t + 1.

If we add and subtract h in (20), we get

    −(∂c/∂i)(ijt, θ) + h(kjt+1, zjt+1, ijt+1, θ) = vjt.    (23)

This is set up exactly as a method of moments estimating equation, using
the condition E[vjt | Jjt] = 0, where Jjt is the set of information available to
the firm at time t.
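A sketch of how such moment conditions might be built from panel data follows. It uses the standard interior investment Euler condition c′(it) = βE[πk(kt+1, zt+1) + (1 − δ)c′(it+1)] with hypothetical functional forms π(k, z) = z·k^α and c(i) = (γ/2)i², and random placeholder "data"; none of this comes from an actual application.

```python
import numpy as np

# Sketch of building method-of-moments conditions from the investment
# Euler equation  c'(i_t) = beta * E[ pi_k(k_{t+1}, z_{t+1})
#                                     + (1 - delta) * c'(i_{t+1}) ],
# with hypothetical functional forms pi(k, z) = z * k**alpha and
# c(i) = (gamma / 2) * i**2. The parameter names and the placeholder
# "panel data" below are illustrative, not from any real application.
rng = np.random.default_rng(0)
beta, delta = 0.95, 0.1
alpha, gamma = 0.5, 2.0                  # theta = (alpha, gamma)

J, T = 200, 5                            # firms and periods in the panel
k = rng.uniform(1.0, 4.0, (J, T))        # placeholder capital stocks
z = rng.uniform(0.5, 1.5, (J, T))        # placeholder exogenous shocks
i = rng.uniform(0.1, 1.0, (J, T))        # placeholder investment

def euler_residual(k, z, i, alpha, gamma):
    """v_jt = -c'(i_t) + beta*[pi_k(k_{t+1}, z_{t+1}) + (1-delta)*c'(i_{t+1})]."""
    c_i_t = gamma * i[:, :-1]
    pi_k_next = z[:, 1:] * alpha * k[:, 1:] ** (alpha - 1)
    c_i_next = gamma * i[:, 1:]
    return -c_i_t + beta * (pi_k_next + (1 - delta) * c_i_next)

# Interact the residual with time-t instruments (elements of J_jt) and
# average; a GMM routine would choose theta to make these moments small.
resid = euler_residual(k, z, i, alpha, gamma)
instruments = np.stack([np.ones_like(resid), k[:, :-1], z[:, :-1], i[:, :-1]])
moments = (instruments * resid).mean(axis=(1, 2))   # one per instrument
```

In an actual application the instruments would be any variables in the firm's time-t information set, and the sample moments would be stacked into a GMM objective.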
Note that the Euler equation method does not require information on the
distribution of z, nor knowledge of the optimal policy rule i∗ (k, z), but only
observations on the outcomes. This is a major advantage of the approach.
One disadvantage has already been mentioned – optimal investment has
to be strictly in the interior of Γ.2 Another possible problem concerns the
distribution of the exogenous shocks zjt. Sometimes, a mistake is made in
using a short panel of firms where the shock zt is common to all the firms (for
example, it is a market price). In this case, the GMM estimation procedure
is not averaging over a large number of independent draws on z and it need
not converge, even as the number of firms in the panel gets large.3
Note that the Euler equation method can often be adapted to other
single-agent decision-making problems with a stochastic environment and
deterministic transitions of the endogenous state variables.
The empirical application section below discusses the application in
Pindyck and Rotemberg (1983) of Euler equations to the problem of production
in the face of energy price shocks.
A Discrete Example
This example is taken from Dixit (1989). Dixit motivates his examples by the
problem of a firm deciding whether to enter (if it is currently out) or exit (if it
is currently in) a market when the firm’s entry-exit decision does not affect
the price process and there are sunk costs of entry (and possibly of exit).
We should note that a similar type of analysis could be applied to the many
phenomena that involve switching between different regimes when there are
costs to switching (producing different goods, switching between different
types of fuels, or preparing land for different crops, are but a few of the
examples). These problems are all a subclass of a set of problems sometimes
referred to as (s, S) problems. The original proof of the optimality of (s, S)
policies is due to Scarf (·).
2See Pakes (1994) for an extension to the case where x need not be in the
interior of Γ in the immediately following period, but must be in the interior
in some later period. The form of the Euler equation then becomes more
complicated, and exit of the firm is still ruled out by the condition.
3Technical footnote here about what is really required.

We will look for a pair of functions: one that gives the value of being active
in the market given the "price", x, and one that gives the value of being out
of the market (i.e., of being a potential entrant) given x. When in the market,
there is a per-period production cost of w. If an active firm exits, it pays a
lump-sum exit cost of l. An entrant incurs a sunk entry cost of k. The problem
is then defined by the two functions
    υA(x) = x − w + β ∫ max{υA(x′), υO(x′) − l} P(dx′ | x)

    υO(x) =         β ∫ max{υA(x′) − k, υO(x′)} P(dx′ | x),

where (υA(x), υO(x)) give the value of being active and inactive, respectively.
A contraction mapping is then defined by

    [T f](x) = ((T1 f)(x), (T2 f)(x)), with

    (T1 f)(x) = x − w + β ∫ max{f1(x′), f2(x′) − l} P(dx′ | x)

    (T2 f)(x) =         β ∫ max{f1(x′) − k, f2(x′)} P(dx′ | x),

where P = {P(· | x), x ∈ X} is a family of distributions which is stochastically
increasing in x.
Exercise: Verify that the operator T satisfies the definition of a contraction mapping.
The solution generates two "cutoff" values of the price x. The inactive firm
enters iff x > xh, the entry cutoff. An active firm exits iff x < xl, the exit
cutoff. This is reminiscent of Scarf's (s, S) inventory policy and the reasoning
is similar.
[See Ariel’s Notes 4 for the full analysis.]
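One way to see the cutoff structure is to iterate T numerically. The sketch below uses invented values for (w, l, k), an illustrative stochastically increasing transition kernel, and reads the entry and exit cutoffs off the converged value functions; it is not Dixit's parameterization.

```python
import numpy as np

# Sketch of solving the Dixit entry/exit problem by iterating the
# contraction T on the pair (value active, value out) over a discretized
# price grid. The grid, transition kernel, and cost parameters (w, l, k)
# are invented for illustration.
N = 60
beta = 0.95
w, l, k_entry = 1.0, 0.3, 2.0
x = np.linspace(0.0, 3.0, N)

# A persistent, stochastically increasing transition: a Gaussian kernel
# centered at the current price, renormalized on the grid.
P = np.zeros((N, N))
for j in range(N):
    weights = np.exp(-0.5 * ((x - x[j]) / 0.25) ** 2)
    P[j] = weights / weights.sum()

vA = np.zeros(N)                       # value of being active
vO = np.zeros(N)                       # value of being out
for _ in range(5000):
    tA = x - w + beta * (P @ np.maximum(vA, vO - l))
    tO = beta * (P @ np.maximum(vA - k_entry, vO))
    gap = max(np.max(np.abs(tA - vA)), np.max(np.abs(tO - vO)))
    vA, vO = tA, tO
    if gap < 1e-10:
        break

enter = vA - k_entry > vO              # inactive firm enters where True
exits = vO - l > vA                    # active firm exits where True
x_h = x[enter][0] if enter.any() else None     # entry cutoff
x_l = x[~exits][0] if (~exits).any() else None # lowest no-exit price
```

Because the transition family is stochastically increasing in x, both value functions are monotone and the entry and exit regions are intervals at the top and bottom of the price grid, matching the cutoff characterization in the text.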
2.6 Estimation with i.i.d. Errors
The intuition of estimation with discrete controls and a stochastic
environment is that we would like to "fit" the observed outcomes to those
predicted by the computed model at various values of the parameters. We will
eventually elaborate on the general principle; for now see the Rust example
below. Also to discuss: problems with the i.i.d. assumption and difficulties
with the extension. The extension to first-order Markov processes is probably
doable now, given faster computers . . .
Brief Notes:
First consider a problem with discrete controls. The simplest stochastic
environment is an i.i.d. additive shock ε to the single-period return from each
discrete choice. In this case the value function is

    v(x, ε) = max_{x′} [π(x, x′) + ε(x′) + β Eε′ v(x′, ε′)],    (24)

where π is a K by K matrix and ε is a K by 1 vector. If we knew Eε′ v(x′, ε′),
the right-hand side would describe a typical multinomial discrete choice
problem with an i.i.d. additive error.
The trick in Rust (1987) is to take the expectation of the left-hand side,
which generates a contraction mapping (verify this as an exercise) in the
expected value function v̄(x) ≡ Eε v(x, ε):

    v̄(x) ≡ Eε max_{x′} [π(x, x′) + ε(x′) + βv̄(x′)].    (25)

The expected value function v̄ is now the expected value of the maximized
discrete choice. This has a closed form when ε has the "double-exponential"
distribution that generates the "logit" discrete choice model. In this case,
then, the contraction is easy to solve.
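With the double-exponential (type-1 extreme value) assumption, Eε max_{x′}[u(x′) + ε(x′)] equals log Σ_{x′} exp(u(x′)) up to an additive Euler constant, so the contraction in v̄ can be iterated in closed form. A sketch, with an invented "replacement-cost" profit matrix (not Rust's bus-engine specification):

```python
import numpy as np

# Sketch of solving the Rust contraction in closed form. With type-1
# extreme value (double-exponential) shocks,
#   E max_{x'} [u(x') + eps(x')] = log(sum_{x'} exp(u(x'))) + const,
# so (dropping the constant) vbar can be iterated directly. The profit
# matrix below is an invented example.
K = 30
beta = 0.9
x = np.arange(K)
# pi[j, i] = profit of choosing next state x' = x_i from current state x_j:
# an operating cost rising in x' plus a cost of large changes (illustrative).
pi = -0.1 * x[None, :] - 0.5 * np.abs(x[:, None] - x[None, :])

def lse(u):
    """Row-wise log-sum-exp, computed stably."""
    m = u.max(axis=1, keepdims=True)
    return m + np.log(np.exp(u - m).sum(axis=1, keepdims=True))

vbar = np.zeros(K)
for _ in range(2000):
    tv = lse(pi + beta * vbar[None, :]).ravel()
    gap = np.max(np.abs(tv - vbar))
    vbar = tv
    if gap < 1e-12:
        break

# Conditional choice probabilities are then multinomial logit:
u = pi + beta * vbar[None, :]
ccp = np.exp(u - lse(u))
```

The resulting conditional choice probabilities are exactly the multinomial logit formula evaluated at the choice-specific values π(x, x′) + βv̄(x′), which is what makes estimation by maximum likelihood straightforward in this setup.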
We now consider a model with continuous x′, which we might still
discretize on K points. In some cases an Euler equation method is inappropriate
because there is a chance of ending up on the boundary (as with zero
investment). In this case, we might add a scalar error ε which interacts with the
value of x′, as in

    v(x, ε) = max_{x′} [π(x, x′) + εx′ + β Eε′ v(x′, ε′)].    (26)

Now if the ε were normal and we knew v̄(x) ≡ Eε v(x, ε), this would simply
be an ordered probit, which is a familiar econometric model. Once again
we could solve for v̄(x) using the expected maximum of the ordered probit
maximand.
More realistically, we would like to introduce serial correlation in the
unobservables. Wolpin and Keane [cite] do this by introducing unobserved
“fixed effects” that do not vary over time but do vary across firms. They
discretize the support of the fixed effects and solve the value function once
for each possible value of the fixed effect. The predicted outcome is then the
expected outcome across the distribution of the fixed effects. Note that if
the fixed effects can take on, say, L values, then there are 2L − 1 additional
parameters to estimate: a value and a probability for each point, minus one
for the adding-up constraint on probabilities.
An alternative would be to model the ε distribution as a first-order
Markov process. In the case of (26), this introduces a second dimension
of the state space, changing the value function to

    v(x, ε) = max_{x′} [π(x, x′) + εx′ + β ∫ v(x′, ε′) P(dε′ | ε)].    (27)
Such a two-dimensional state space can now probably be computed in the
context of an estimation algorithm. However, if the state space has many
dimensions, then the "curse of dimensionality" will rapidly cause problems.
Consider what would happen if there were one ε for each of L choices. Say
that we discretize each ε on K points. Then the transition matrix P is a K^L
by K^L matrix. For some ideas on addressing the curse of dimensionality,
see Rust (1997) and Pakes and McGuire (2001).
2.7 Stochastic Accumulation.
In this section we consider models where the outcome of the firm’s investment is uncertain. (See for example, the single-firm case of Ericson and
Pakes (1995)). In the case of physical investment, there may be delays in
construction or the capital good may be of uncertain quality. Research and
Development (R&D) efforts are a form of investment that is notoriously uncertain. In such cases, the firm may choose an investment level y that alters
the distribution of future states.
If today's "capital" stock is x, then the future outcome x′ might have the
distribution Px(x′ | y, x); z is again an exogenous state of the world. We
assume that Px is stochastically increasing (see the Appendix) in both y
and x. This is a way of formalizing the idea that higher "capital" x and
higher "investment" y should increase the probability of good outcomes
tomorrow.
The value function is now

    v(x, z) = max_{y ∈ Γ(x,z)} [π(x, z, y) + β ∫∫ v(x′, z′) Px(dx′ | y, x) Pz(dz′ | z)].    (28)
Once again, one can show that the right-hand side of (28)

• defines a contraction mapping (from Blackwell's conditions), so that

• the value function can be solved for via a simple recursive method.

"Corollary 1" style proofs are still possible, often aided by the fact that Px
is stochastically increasing.
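The recursive computation for such a problem can be sketched as follows; the binomial-style transition (investment raises the chance of moving up one "capital" step) and all functional forms are invented for illustration.

```python
import numpy as np

# Sketch of value function iteration when investment y shifts the
# distribution of next period's state x' (stochastic accumulation).
# The binomial-style transition and all functional forms are invented.
K, Y = 20, 5                 # "capital" levels 0..K-1, investment grid 0..Y-1
beta = 0.9
y = np.arange(Y)
cost = 0.3 * y               # linear investment cost (assumption)

# Px[m, j, i] = Pr(x' = i | y = y_m, x = j): move up one step with
# probability increasing in y, otherwise depreciate one step.
p_up = y / (1.0 + y)         # stochastically increasing in y
Px = np.zeros((Y, K, K))
for m in range(Y):
    for j in range(K):
        up, down = min(j + 1, K - 1), max(j - 1, 0)
        Px[m, j, up] += p_up[m]
        Px[m, j, down] += 1.0 - p_up[m]

pi = np.log(1.0 + np.arange(K))   # profit increasing in x (assumption)

v = np.zeros(K)
for _ in range(3000):
    ev = Px @ v                               # ev[m, j] = E[v(x') | y_m, x_j]
    tv = np.max(pi - cost[:, None] + beta * ev, axis=0)
    gap = np.max(np.abs(tv - v))
    v = tv
    if gap < 1e-12:
        break

policy = np.argmax(pi - cost[:, None] + beta * (Px @ v), axis=0)  # y(x)
```

Since Px is stochastically increasing in x and the profit function is increasing, the computed value function is monotone in the capital stock, as the "Corollary 1" style arguments suggest.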
In the course of solving for v by successive approximations, we make a
guess, w(x, z), for the value function. If π and Px are differentiable in y, then
an optimal policy in the interior of Γ satisfies the first-order condition

    (∂π/∂y)(x, z, y) + β ∫∫ w(x′, z′) (∂Px/∂y)(dx′ | y, x) Pz(dz′ | z) = 0.    (29)

This is useful in computation and in characterizing the optimal policy. It
is not a traditional Euler equation, however, as it involves the (initially
unknown) function v. Usefully, this first-order condition for stochastic
accumulation problems does not require the differentiability of v.
Often, we are interested in discrete policies, such as entry and exit. For
example, consider the stochastic investment problem modified so that the
firm can exit and obtain a scrap value Φ(x). In this case, the value function
is the maximum of the scrap value and the maximized value of continuing in
business:

    v(x, z) = max { Φ(x), max_{y ∈ Γ(x,z)} [π(x, z, y) + β ∫∫ v(x′, z′) P(dx′ | y, x) P(dz′ | z)] }.    (30)

While a first-order condition may hold for investment conditional on not
exiting, the exit decision is necessarily discrete. (Note that the Olley-Pakes
empirical model discussed above relies on a version of this problem where x
is physical capital while the vector z consists of "management talent" and
age.)

Exercise.
1. Prove that the operator defined implicitly in (30) is a contraction mapping.

2. Prove that the value function is weakly increasing in x.

3. What is the form of the stopping rule? I.e., what is the form of the
optimal policy for the exit decision χ?

4. Can the value function be strictly increasing in x everywhere?

5. Draw the value function. Can it be concave in x everywhere?
We now consider several empirical examples of stochastic accumulation
models, each of which features discrete controls.
3 Empirical Applications
3.1 Euler Equations and Dynamic Cost Minimization
Pindyck and Rotemberg (1983) apply the Euler equation method to estimate
dynamic factor demands in the presence of energy price shocks. One economic
question is the degree to which capital is substitutable with energy.

The authors treat energy and materials as purely static choices (fully
variable inputs), while capital and labor are subject to costs of adjustment
(as "partially fixed" inputs). The variable cost of energy and materials,
conditional on a choice of capital, labor and output, is given by

    C(et, mt, Kt, Lt, Qt, t),    (31)
where
• et is the price of energy,
• mt is the price of materials, and
• K is capital, L labor and Q output.
The firm faces convex adjustment costs for capital and labor of c1(It) and
c2(Ht), where investment is defined as It ≡ Kt − (1 − δ)Kt−1 and the net
hiring of labor is Ht ≡ Lt − Lt−1.
The value function is

    v(Kt, Lt, zt) = max_{K′,L′} [ pt Qt − C(et, mt, Kt, Lt, Qt, t) − rt Kt − wt Lt
                    − c1(K′ − (1 − δ)Kt) − c2(L′ − Lt) + β ∫ v(K′, L′, z′) Pz(dz′ | zt) ],    (32)

where the exogenous stochastic environment consists of the price vector zt ≡
(et, mt, pt). Note that pt is the output price.
One can construct a feasible perturbation that leaves output fixed in every
period, substituting in period t from variable inputs to (K, L) and back from
(K, L) to variable inputs in period t + 1. The first-order conditions with
respect to this perturbation define the dynamically cost-minimizing use of K
and L.
The Euler equation for capital is:

    (∂C/∂K)(et, mt, Kt, Lt, Qt, t) + rt + (∂c1/∂I)(It)
        − β(1 − δ) ∫ (∂c1/∂I)(I∗t+1((1 − δ)Kt + It, z′)) Pz(dz′ | zt) = 0,    (33)
where I∗ is again the optimal investment rule. The Euler equation for the
hiring of labor, Ht ≡ Lt − Lt−1, is

    (∂C/∂L)(et, mt, Kt, Lt, Qt, t) + wt + (∂c2/∂H)(Ht)
        − β ∫ (∂c2/∂H)(H∗t+1((1 − δ)Kt + It, z′)) Pz(dz′ | zt) = 0.    (34)
The equations for the purely static inputs, energy and materials, are just
given by Shephard's Lemma: E = ∂C/∂e and M = ∂C/∂m.
The empirical work uses industry-level data, so one must think of a
"representative" firm, which is unfortunate. The application is to the 1978
decontrol of natural gas prices, which substantially increased the price of
natural gas. The conditional variable cost function C is parameterized as a
translog, while the adjustment costs are assumed to be quadratic.
The empirical findings suggest that labor adjustment costs are small and
that energy and capital are complements, rather than substitutes, in
production. The short-run estimate (with K and L fixed) of the own-price
elasticity of energy is −0.36, while the long-run estimate is −0.99. Thus, the
long-run effect of a persistent energy price change is quite different from the
short-run effect, a finding that could not occur in a purely static
cost-minimization exercise.
3.2 Patents as Options
There is a long literature examining the value of patents and the value of
innovative ideas in general. Some of this literature tries to measure how effectively patents protect the profits from innovation. A related literature uses
patents as a proxy output measure. Since only “novel” ideas can be patented,
counts of patents in a given country or time are sometimes used as a rough
measure of innovative activity. However, it is known that most patented
ideas never produce profits, so patent counts are a very rough measure.
For some time in Europe (and, now, in the US), patent holders have been
periodically required to pay a renewal fee to keep their patents in force. Many
patent holders let their patents expire, indicating that the renewal fee
exceeds the expected discounted present value of current and future patent
protection. At a very simple level, rather than using raw patent counts as
an output measure, it may be better to weight patents by how long they are
kept in force. The unrenewed patents presumably did not produce great
value, and so they should be given little weight in a measure of innovative
activity.
Pakes (1986) takes this logic further and (building on earlier work) estimates the distribution of patent values over time from the renewal behavior
of patent holders.
3.2.1 The Model.
Let us begin with a simple two period model. We condition on the existence
of the patent. In the first period, the patent returns (random) net profits of
r1 . These profits are net of what would have been earned in the absence of
the patent and therefore may not be the total return to the innovative idea.
In the second period, the patent returns net profits of r2 , with a distribution
function G(r2 | r1 ), which is non-increasing in r1 . To keep the patent in force
in period one, a renewal fee of c1 must be paid, and in period two the fee is
c2 .
Using the usual logic of dynamic programming, we solve the model backwards. The second period renewal fee is paid if r2 > c2 and so the second
period return function is
max[r2 − c2 , 0].
If the patent is renewed in the first period, the firm receives
    r1 − c1 + β ∫ max[r2 − c2, 0] G(dr2 | r1).
Since G is decreasing in r1 (i.e. the probability of a low r2 is decreasing in
r1 ), the expected future value of the patent is increasing in r1 . The current
value r1 − c1 is obviously increasing in r1 , so the total return is increasing
in r1 . There must therefore be some value r̄1 such that if r1 < r̄1 then the
patent is not renewed and if r1 > r̄1 then the patent is renewed.
The two period optimal policy is therefore governed by the optimal cut-off
points r̄1 and r̄2 ≡ c2 .
Now consider the multiple period problem, where the patent can be renewed up to age A. In each period there is a renewal fee ca and a distribution
of returns ra with distribution function G(ra |ra−1 , a, θ). The parameters θ
of the return distribution are to be estimated. The last period, A, is then
identical to the period-2 problem just solved, and the remaining optimal
policies follow by backwards recursion. The value function for the problem is

    V(ra, a) = max [ra − ca + βE[V(ra+1, a + 1) | ra], 0].
If
1. G is weakly decreasing in ra−1
2. G is weakly increasing in a and
3. ca is increasing in a,
then it will not be hard to show (using the exercise below) that there is a
sequence of cut-offs r̄a determining the optimal policies in each period.
Pakes observed as data the number of patents renewing at each age a as
well as the renewal fee schedule ca . His estimation method is as follows. For
a given θ, first solve for the optimal r̄a ’s. To begin we know that r̄A = cA .
Then, as before, r̄A−1 solves
    r̄A−1 − cA−1 + β ∫ max[rA − cA, 0] G(drA | rA−1, A, θ) = 0.
Since the left-hand side is increasing in rA−1, there is a unique solution. Similar
logic is used in the following exercise:
Exercise: Prove by backwards recursion that V (ra , a) is increasing in ra and
decreasing in a.
Given the properties just shown, the r̄a are then recursively determined
from

    r̄a − ca + β ∫ V(ra+1, a + 1) G(dra+1 | ra, a + 1) = 0.
The integrals in this equation can be difficult; at age a we require an
expectation of returns A − a periods into the future. Pakes chose a family of
distributions that gave an analytic (though still quite complicated) solution.
The unconditional proportion of patents dropping out by age a is then
F(r̄a, a, θ), where

    1 − F(r̄a, a) = Pr(r1 > r̄1, r2 > r̄2, . . . , ra > r̄a).

The fraction of the original sample who drop out at a given age is then

    π(a) = F(r̄a, a) − F(r̄a−1, a − 1).

If F could be calculated, then θ could be chosen in a method of moments
framework to match the observed data on π(a).
However, F does not have an analytical solution, so Pakes simulates the
required distribution and computes the fraction that drop out at each time
period. Given the cut-off points r̄a and the distribution G, this simulation is
easy. To begin a single simulated return path, draw r1i from G(·|a =
1). If r1i > r̄1 , continue to the next period and draw r2i from G(· |r1i , a = 2).
If r2i > r̄2 , continue, and so forth. The fraction of the i = 1, . . . , Ns draws
surviving to period a − 1 but dropping out at a is the simulated estimate of
the period-a drop-out rate. As Ns , the number of simulation draws, grows
large, the simulated drop-out rates approach the true values. Even for
small Ns , the simulated drop-out rates are unbiased.
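The simulation step can be sketched as follows. The cut-off schedule and the return process below are hypothetical placeholders (not the estimated primitives), chosen only to show the mechanics:

```python
import numpy as np

# Simulated drop-out rates, as in the text: draw r_1, compare to the
# cut-off, and continue the survivors forward.  The cut-offs r_bar and
# the process G below are hypothetical placeholders.
rng = np.random.default_rng(1)
A = 20
r_bar = 30.0 * 1.2 ** np.arange(A)       # hypothetical cut-offs, rising with age

def simulate_dropout_rates(Ns=100_000):
    """Fraction of Ns simulated patents dropping out at each age a."""
    dropouts = np.zeros(A)
    r = rng.lognormal(mean=4.5, sigma=1.0, size=Ns)    # draws of r_1
    alive = np.ones(Ns, dtype=bool)
    for a in range(A):
        newly_dead = alive & (r <= r_bar[a])
        dropouts[a] = newly_dead.mean()
        alive &= ~newly_dead
        # draw r_{a+1} from G(.|r_a): depreciation times a mean-one shock
        r = 0.8 * r * rng.lognormal(mean=-0.08, sigma=0.4, size=Ns)
    return dropouts

pi_hat = simulate_dropout_rates()
print(pi_hat.round(3))       # patents still alive at age A simply expire
```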
The asymptotics of such simulation estimators are given in Pakes and Pollard
(1989) and McFadden (1989). Pakes observed a very large number of
patents and also took a very large number of simulation draws, so that any
failure to fit the data is a sign of specification problems rather than sampling
variation. Using the latter asymptotic results, one could have taken fewer
simulation draws and still calculated correct standard errors.
3.2.2 Results
Pakes fits the renewal patterns reasonably well. The estimated distributions
for r are consistent with the idea, found in many institutional discussions
of patents, that the great majority of patents have little economic value,
but that the remaining patents are extremely valuable on average. Firms
continue to pay renewal fees in the hope that their innovation will prove to
be one of the small number of big winners. However, much of the uncertainty
is resolved in the early to mid renewal periods, when drop-out proportions
are high.
3.3 “An Empirical Model of Harold Zurcher”
In Rust (1987), the idea is to model discrete investment processes at a
detailed level via a dynamic programming model that generates a “stopping”
time for machine replacement. The specific application is to bus engine
replacement. Uses include [i] checking to see whether a DP model can well
approximate actual decision processes and [ii] providing a method to build,
from the ground up, the aggregate demand for a discrete investment good.
In every period, the machine has accumulated usage (mileage) xt , which
gives an operating cost of c(xt , θ1 ), where θ1 is a parameter to be estimated.
The operating cost function is assumed to be increasing in xt . In a given
period, the firm can either continue to operate the machine, or else can
replace it at a cost of R. After replacement, the usage variable drops to xt =
0. Over time, usage evolves according to the exogenous Markov distribution
P (xt+1 |xt ), which is assumed to be stochastically increasing.
The single period return is then:

    u(xt , it , θ) = −c(xt , θ1 )       if it = 0
                   = −R − c(0, θ1 )     if it = 1                       (35)

where it is an indicator variable for machine replacement.
The value function is:

    v(xt ) = max{ −c(xt , θ1 ) + β ∫ v(xt+1 )P (dxt+1 |xt ) ,
                  −R − c(0, θ1 ) + β ∫ v(xt+1 )P (dxt+1 |0) }           (36)
To avoid an “over-prediction” problem (of probability one or zero events),
Rust must add some errors to the utility function. In particular, he assumes
that

    u(xt , it , θ) = −c(xt , θ1 ) + ε0t        if it = 0
                   = −R − c(0, θ1 ) + ε1t      if it = 1                (37)

where the ε’s are unobserved shocks.
In general, if the ε’s are correlated over time, then the problem is considerably
complicated. The value function now depends on two states:

    ṽ(xt , εt ) = max{ −c(xt , θ1 ) + ε0t + β ∫ ṽ(xt+1 , εt+1 )P (dxt+1 , dεt+1 |xt , εt ) ,
                       −R − c(0, θ1 ) + ε1t + β ∫ ṽ(xt+1 , εt+1 )P (dxt+1 , dεt+1 |0, εt ) }      (38)

Now the burden of calculating ṽ would be greatly increased because of the
“curse of dimensionality.”
However, Rust makes a greatly simplifying conditional independence assumption:

    P (xt+1 , εt+1 |xt , εt ) = P1 (xt+1 |xt ) P2 (εt+1 ).              (39)
If we define the expected value

    v(x) ≡ ∫ ṽ(x, ε) P2 (dε)                                           (40)

then equation (38) has almost the same right-hand side as (36):
    ṽ(xt , εt ) = max{ −c(xt , θ1 ) + ε0t + β ∫ v(xt+1 )P (dxt+1 |xt ) ,
                       −R − c(0, θ1 ) + ε1t + β ∫ v(xt+1 )P (dxt+1 |0) }          (41)
Note that this is just a simple binomial discrete choice model, except that
we don’t know the function v that enters the right-hand side of the “utility”
function.
Note that this conditional independence assumption is quite strong. Most
economic time series show strong serial correlation, and we might expect
that to be true of the unobserved shock to maintenance expenses for a given
bus. If the bus is troublesome in one month, then perhaps it is more likely
to be troublesome the next month. In fairness to the application, there is a
long engineering literature that assumes that machine failures are uncorrelated
over time, and Rust is controlling for the observed use of the equipment.
However, in other economic contexts, the assumption of conditional
independence over time might be more bothersome.
To find v, take the expectation of both sides of the last equation with
respect to the distribution of the ε’s, giving the recursive equation

    v(xt ) = E max{ −c(xt , θ1 ) + ε0t + β ∫ v(xt+1 )P (dxt+1 |xt ) ,
                    −R − c(0, θ1 ) + ε1t + β ∫ v(xt+1 )P (dxt+1 |0) }             (42)
This is especially simple if the distribution of the ε’s is i.i.d. “extreme
value” (the “logit” distribution), so that the choice probabilities and
expected utilities have simple functional forms. First define
    δ0 (xt , v) = −c(xt , θ1 ) + β ∫ v(xt+1 )P (dxt+1 |xt )             (43)

    δ1 (v) = −R − c(0, θ1 ) + β ∫ v(xt+1 )P (dxt+1 |0).                 (44)
Then the choice probabilities have the logit form:

    P r(it = 1|xt ) = exp(δ1 (v)) / [exp(δ0 (xt , v)) + exp(δ1 (v))].   (45)
However, we cannot estimate the parameters from this equation directly,
because we still don’t know the function v.
Also from McFadden we know that the expected maximized value is

    v(xt ) = ln[exp(δ0 (xt , v)) + exp(δ1 (v))]                         (46)

This last equation then provides a simple recursive formula for determining v.
To compute the equilibrium, Rust follows the usual procedure of discretizing
the state space of x. In particular, he divides it into 90 intervals of equal
length. We can then think of x as being a member of the set (0, 1, 2, . . . , K),
where K = 90. Rust assumes that the x process advances by at most one
increment each period, and defines the probability of advancing as the
parameter θ3 :

    P r(xt+1 = xt + 1|xt ) ≡ θ3                                         (47)
    P r(xt+1 = xt |xt ) ≡ 1 − θ3                                        (48)
Note that the integrals in the mean returns now have the particularly
simple form of

    δ0 (xt , v) = −c(xt , θ1 ) + β [v(xt + 1)θ3 + v(xt )(1 − θ3 )]      (49)
    δ1 (v) = −R − c(0, θ1 ) + β [v(1)θ3 + v(0)(1 − θ3 )]                (50)
Rust then uses a nested algorithm to calculate the likelihood function. For
a given parameter vector, the function v (which is a K-length vector) is found
from the solution to the K non-linear equations in (46). Then, this answer
is put into the probabilities (45). The product of these probabilities forms
the likelihood. The maximum likelihood algorithm then searches over the
possible parameters, with the computation of v nested into each evaluation
of the likelihood.
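The inner fixed-point step can be sketched as follows. The parameter values and the linear cost function c(x, θ1 ) = θ1 x are illustrative choices, not Rust's estimates; the code iterates on (46), using the transition in (47)-(48), and then forms the choice probabilities in (45):

```python
import numpy as np

# Inner loop of the nested fixed point: solve the K equations
# v(x) = ln[exp(delta0(x, v)) + exp(delta1(v))] by successive approximation.
# Parameters and the linear cost c(x) = theta1 * x are illustrative only.
K, beta = 90, 0.99
theta1, theta3, R = 0.05, 0.3, 10.0

x = np.arange(K)
c = theta1 * x                            # operating cost c(x, theta1)

def solve_v(tol=1e-10, max_iter=50_000):
    v = np.zeros(K)
    for _ in range(max_iter):
        # E[v(x')|x, keep]: mileage advances one notch w.p. theta3 (eqs. 47-48)
        ev_keep = theta3 * v[np.minimum(x + 1, K - 1)] + (1 - theta3) * v
        ev_replace = theta3 * v[1] + (1 - theta3) * v[0]   # after reset to 0
        d0 = -c + beta * ev_keep                           # eq. (49)
        d1 = -R - 0.0 + beta * ev_replace                  # eq. (50), c(0) = 0
        # eq. (46), via a numerically stabilized log-sum-exp
        m = np.maximum(d0, d1)
        v_new = m + np.log(np.exp(d0 - m) + np.exp(d1 - m))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, d0, d1
        v = v_new
    raise RuntimeError("value iteration did not converge")

v, d0, d1 = solve_v()
# eq. (45), written to avoid overflow: Pr(replace|x) rises with mileage
p_replace = 1.0 / (1.0 + np.exp(d0 - d1))
print(p_replace[[0, 45, 89]].round(4))
```

In the outer loop, these probabilities would be multiplied into a likelihood and the whole computation repeated at each trial parameter vector.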
Rust finds that the model fits the data fairly well. One possible next
step in this research agenda might be to construct an aggregate investment
demand function and ask questions about the suppliers of the investment
goods.
3.4 Serial Correlation
Here one might discuss at least the methods of Keane and Wolpin and of
Eckstein and Wolpin.
Exercises
1. Discuss the reasonableness of the conditional independence assumption
on ε.
2. Use an inductive argument applied to (36) to show that v(xt ) is decreasing in xt and that the optimal policy associated with (36) is a
“stopping rule.”
3.5 Dynamic Regulation of a Natural Resource
In Timmins (1999), the issue facing the water authorities is how to set price.
A similar problem faces the authorities determining the extraction of many
other exhaustible resources. If extraction today raises the cost of extraction
tomorrow, then it is clearly suboptimal to set current price equal to current
marginal cost (this ignores the fact that a lower current price will cause a
lower aquifer level, and hence a higher marginal cost, in the future). What
a social planner ought to do is take into consideration the effect of a lower
price today on the cost of extraction in future years, through its effect on the
level of the aquifer. Thus the planner’s problem is an optimal control
problem.
The prices actually charged in California are noticeably lower (not higher)
than marginal costs (and by quite a bit). So it is pretty clear that the
authority is not behaving as a planner. In fact the authority determining
prices is either an elected board, or responsible to elected officials. It therefore
has to keep the upcoming elections in mind.
The motivating questions are
• how does the authority in fact choose prices?, and
• what are the implications of that choice on consumer welfare?
To determine the answer to the first question we will assume a behavioral
model up to a parameter vector, which will be estimated from pricing
behavior; that is, the parameter estimate is the value which rationalizes the
water authority’s pricing behavior. Once we have both the behavioral rule
and the primitives estimated (demand, costs, ...), it will be easy for us to
analyze an assortment of issues: e.g., how much better off would we be if a
planner were making the choices, or whether it is worthwhile to install a new
technology which saves water but has installation costs.
3.5.1 The Primitives.

There are three of these:
• A demand function,
• A cost of extraction function whose two major arguments are the demand
for water and the “height” of the aquifer system,
• An equation which determines the evolution of the “height” of the
aquifer system as a function of previous height, demand, and rainfall.
His observations are on different municipalities over time. Thus there
are two indices, one for the municipality and one for the time period, and
some of the parameters are allowed to be municipality specific. I ignore
these distinctions in my description. However, such distinctions can become
important to the properties of alternative estimators, so in general one
would want to be careful to specify them before embarking on the econometric
details.
Demand is given by

    D = exp[δ0 + δ1 P + δ2 Inc + δ3 R + δ4 S + εd ],                    (51)

where D is quantity demanded; (P, Inc, R, S) are, respectively, price, (median)
income, rainfall, and number of service connections; and εd is a disturbance
term whose realization is not known by the authority when pricing
decisions are made (so price cannot be a function of it). The {εdt } are assumed
to be i.i.d. over time.4
A few notes are in order.
• The assumptions you make on the disturbances in these models will
determine how the model is estimated. The way to make the right
assumptions is to figure out what the major determinants of the
phenomena of interest that you do not have empirical measures of are, and
then try to figure out their properties. In this context an implication of
the fact that εd is not known when pricing decisions are made is that
price cannot be a function of it. If in addition εd is supposed primarily
to result from things that do not depend on price, then it is reasonable
to assume that price and εd are independent – or at least uncorrelated.
• Note that this is in the tradition of representative agent demand systems.
That is, we do not have a distribution of agents with different
incomes, special water needs (like agriculture), et cetera, and then
integrate out to obtain the demand system. It should therefore be thought
of as primarily a simple summary of the data on demand. As noted, it
runs into special problems when we want to analyze the distributional
impacts of a given policy, but provided we get a good approximation to
demand in the region of interest, it could well do O.K. on other aspects of
policy evaluation. On the other hand, the distributional impacts might
be quite important when policy decisions depend on votes and interest
groups.

4 In the paper rainfall is treated as a random process whose mean is different in different
municipalities and whose disturbance is i.i.d. over time. The mean is estimated in the
algorithm.
Extraction costs are given by

    C = h^α1 D^α2 exp[α0 + εc + ξ],                                     (52)

where C is extraction cost, εc is a disturbance whose realization the
authorities do not know when they are setting price, and ξ is a source of
disturbance known to the authorities when prices are set but not known to the
econometrician. Consequently prices will be a function of ξ but not of εc .
Since D is a function of p, and h is a function of D, both right-hand-side
variables are correlated with the composite disturbance term, and hence
O.L.S. cannot be used to estimate the parameters of this equation.
{εct } is assumed to be i.i.d. over time, though it can be correlated with
εdt at a given point in time.
He entertains two different assumptions on {ξt } in different papers:
• {ξt } are i.i.d. In this case past values of D and h, as well as the current
and past values of the other exogenous variables (rainfall, S, ...), are
instruments for this equation.
• ξt = ρξt−1 + vt . In this case we can either use just current and past
values of the exogenous variables as instruments, and then worry about
serial correlation in the error, or “quasi first difference” and use, in
addition, lagged values of D and h as instruments.
Aquifer height evolves as

    ht+1 = γ1 ht + γ2 Dt + γ3 Rt + εh,t+1 ,                             (53)

where, once again, εh is assumed unknown when prices are determined, is
i.i.d. over time, and is freely correlated with εc and εd . Note that since εd is a
determinant of D and D is a determinant of h, this means that we need to
use instruments for h and D. However, here lagged values of D and h work.
These definitions then give us consumer surplus, CSt , and tax revenue
(which may be negative), T Rt , as, respectively,

    CSt ≡ ∫_P^∞ D(·, p) dp

and

    T Rt ≡ P · D(·) − Cost(·).
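Given the exponential demand in (51), the consumer-surplus integral has a convenient closed form whenever δ1 < 0 (downward-sloping demand); writing D(·, p) for demand evaluated at price p:

```latex
CS_t \;=\; \int_P^{\infty} D(\cdot,p)\,dp
     \;=\; e^{\delta_0 + \delta_2 Inc + \delta_3 R + \delta_4 S + \epsilon_d}
           \int_P^{\infty} e^{\delta_1 p}\,dp
     \;=\; -\,\frac{D(\cdot,P)}{\delta_1}, \qquad \delta_1 < 0 .
```

This is one practical payoff of the exponential functional form: surplus is proportional to demand at the posted price.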
3.5.2 Behavioral Assumptions.
The basic behavioral assumption is that the authority measures benefits as
a weighted average of consumer surplus and revenues, say

    π ≡ νCS(·) + (1 − ν)T R,                                            (54)

and chooses a pricing policy that maximizes the expected discounted value
of these benefits over time. That is, if a policy is designated by d, then it is
a sequence of functions mapping the history at each t into a pricing function
for that t. The policy chosen maximizes

    Ed Στ β^τ π(·)t+τ ,

where Ed is notation for expectations when the d policy is chosen.
Note here that ν is to be estimated. Thus if ν = .5, then consumers weigh
tax revenue the same way they weigh consumer surplus, and we just maximize
the (expected discounted) sum of consumer surplus and tax revenues. In
a representative agent world it is hard to imagine a benevolent social planner
doing anything else. However, there is no benevolent social planner, and the
water authority may believe that consumers credit it with the benefits
from water pricing, but not with the benefits from tax savings (since this
is a small part of the fiscal health of the city). In general, given that the
authorities are either elected or respond to elected officials, you might think
that the way they discount the future would also, in some sense, reflect the
electoral cycle. This would argue for richer discounting schemes than the
geometric decay he is assuming. The political economy issues also weigh
against using the representative agent demand function used here (after all,
votes are determined by interest groups, etc.). These are issues Timmins is
pursuing in other research.
Given our behavioral assumption, you should be able to prove that to
solve this problem we need only find the value function defined implicitly by
the functional equation

    V (ht , xt , ξt |θ) = sup_{p≥0} { E[πt (·)|ht , pt , xt , ξt ]
        + β ∫ V (h′ , x′ , ξ ′ ) dP (ξ ′ ) dP (h′ |pt , ht , xt , ξt ) dP (x′ |x) },      (55)

where x = (Inc, R, S). Note that h′ depends on D(·) and, through that, on
p. So the solution for the optimal price is a dynamic problem. This is a
familiar type of problem in the economics of exhaustible resources.
When computing the value function, we also obtain the policy function:

    Pt = P (ht , xt , ξt |θ)                                            (56)

3.5.3 Estimation.
The parameter vector includes all the parameters determining the primitives,
the ν from the current benefits function, and the β from the discounting
process. Let the vector of parameters be denoted by θ. All told there are
over 50 components of θ.
The Computational Problem.
Estimation routines are based on computing the implications of the models
for different values of the parameter vector, and then finding that value of
the parameter vector that makes the model’s implications as close as possible
to the data. Note that we have to calculate the value function each time
we need to evaluate a different parameter vector – hence the name “nested
fixed point algorithms”. Thus the computational burden of the algorithm is
determined, in large part, by the number of different parameter vectors we
have to evaluate.
If we use a Newton-Raphson or other derivative-based routine, we require
estimates of at least the following at each iteration:
• the derivatives of the distance function with respect to each element
of the parameter vector (and if these are not analytic, they generally
use two-sided estimates of the derivative, so this requires two function
evaluations for each parameter);
• points on a line search, the direction being given by derivatives.
So even without computing second derivatives exactly, the number of
function evaluations we need per iteration is more than twice the number of
parameters, and characteristically the number of iterations we need also goes
up significantly in the number of parameters. Since it is the evaluation of
the value function that is costly, we do not want to do this too many times.
Two-stage estimators let us economize on this (at some cost in the statistical
efficiency of the estimator).
Two Stage Estimators; the First Stage.
Timmins tries to alleviate this problem by writing θ = (θ1 , θ2 ), where θ1 are
the parameters that can be consistently estimated without ever computing
the value function. He then fixes θ1 at the value estimated in the first stage,
say θ̂1 , and iterates only over θ2 in a second stage which requires computing
the value function.
This is a procedure used in most papers that require calculation of a fixed
point to evaluate the objective function at different values of the parameter
vector, and is very helpful. One has to keep in mind, however, that the
variance of the second-stage parameter estimate is a function of the distribution
of the first-stage parameter estimate, and this has to be taken account of in
forming standard errors. Any standard econometric discussion of two-stage
estimators will tell you how to do this.
More precisely, the model has implications for:
• The demand for water.
• The cost of producing that water.
• The height of the aquifer.
• The price charged for the water.
We can analyze the predictions for each of the first three of these variables
(demand, cost, and aquifer height) without ever calculating the contraction
mapping. This produces estimates of the vast majority of the parameters
(over fifty of them) without ever computing the value function. That is, the
following system of equations can be estimated without computing the value
function:

    ln(Dit ) = δ0 + δ1 Pit + δ2 Incit + δ3 Rit + δ4 Sit + εd,it ,
    ln(Cit ) = α0 + α1 ln(hit ) + α2 ln(Dit ) + ξit + εc,it ,
    hi,t+1 = γ1 hit + γ2 Dit + γ3 Rit + εh,it .
We have already discussed methods of estimation for this system. Note
that the system does not identify ν, β, or the full error covariance matrix
(since we cannot separate σ²(εc ) from σ²(ξ) without the implications for
price, which depends only on the latter).
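The instrumental-variables step for equations like these can be sketched with a minimal two-stage least squares routine. The data below are synthetic and all names (a single cost-shifter instrument, a rainfall regressor) are invented for the illustration:

```python
import numpy as np

# Generic two-stage least squares: the estimator used for equations where
# a regressor (here, price) is correlated with the disturbance and
# exogenous shifters serve as instruments.
def two_sls(y, X, Z):
    """beta_hat = (X' Pz X)^(-1) X' Pz y, with Pz = Z (Z'Z)^(-1) Z'."""
    Pz_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(Pz_X.T @ X, Pz_X.T @ y)

# Synthetic illustration (all values invented): price is endogenous,
# rainfall is exogenous, and a cost shifter instruments for price.
rng = np.random.default_rng(2)
n = 5000
rain = rng.standard_normal(n)
cost_shifter = rng.standard_normal(n)
eps = rng.standard_normal(n)
price = 1.0 + 0.8 * cost_shifter + 0.5 * eps       # correlated with eps
log_d = 2.0 - 1.5 * price + 0.3 * rain + eps       # true slopes: -1.5, 0.3

X = np.column_stack([np.ones(n), price, rain])
Z = np.column_stack([np.ones(n), cost_shifter, rain])
print(two_sls(log_d, X, Z).round(2))               # close to [2.0, -1.5, 0.3]
```

OLS on the same data would be biased toward zero on the price coefficient, since price and the disturbance move together; the instrument removes that correlation.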
The Pricing Equation
Recall that Pt = P (ht , xt , ξt ; θ). The higher is ξ, the higher is price, so this
equation is invertible; i.e., we can write ξt = P −1 (pt , ht , xt ; θ). Given an
assumption on the distribution of ξ (say normal (0, σ²)), we can maximize
the likelihood to obtain the parameters not yet estimated.
There are two cases. If ξ is i.i.d. with density µ(·), the likelihood is

    Σi,t log[µ(P −1 (pt , ht , xt ; θ))].

If ξ follows the first-order autoregressive process, the likelihood is

    Σi,t log[µ(ξt (·, θ) − ρξt−1 (·, θ))].

Note that in the last expression we lose the first year (we only use the
quasi-difference from the second year on). We cannot use the first year
because its ξ has an unknown correlation with the first-year h.
Something on the Estimates.
Timmins gets an estimate of ν = .59 (standard error .002), which is
significantly different from zero, the value at which the authority would simply
maximize revenues. Recall that the social planner without any distributional
preferences has ν = .5; the estimate is significantly above that value as well.
The estimate implies that the marginal dollar transferred from tax revenues
to water gives a gain of only 71 cents. Timmins then goes on to evaluate
several experiments. One is the question of whether it is worthwhile to
install low-flow toilets. The answer is that the authority, given low-flow
toilets, would further decrease the price of water, thus diminishing the
returns to the innovation. So low-flow toilets would not be beneficial given
current decision procedures. On the other hand, if we could somehow
induce the water authority to make the “socially optimal” decisions, low-flow
toilets would be beneficial.
4 Appendix: Mathematical Preliminaries.

4.1 Some Definitions.
You might want to have these definitions accessible.
Definition 1 (Real Vector Space). A (real) vector space X is a set of elements
(vectors) together with two operations, addition and scalar multiplication.
Further:
1. For any two vectors x, y ∈ X, addition gives a vector x + y ∈ X; and
for any vector x ∈ X and any real number α ∈ R, scalar multiplication
gives a vector αx ∈ X. These operations obey the usual algebraic laws;
that is, for all x, y, z ∈ X, and α, β ∈ R:
• x + y = y + x;
• (x + y) + z = x + (y + z);
• α(x + y) = αx + αy;
• (α + β)x = αx + βx; and
• (αβ)x = α(βx).
2. Moreover, there is a zero vector θ ∈ X that has the following properties:
• x + θ = x; and
• 0x = θ.
3. Finally, the scalar 1 ∈ R has the property that
1x = x.
Definition 2 (Metric Space). A metric space is a set S, together with a metric
(distance function) ρ : S × S → R, such that for all x, y, z ∈ S:
1. ρ(x, y) ≥ 0, with equality if and only if x = y;
2. ρ(x, y) = ρ(y, x); and
3. ρ(x, z) ≤ ρ(x, y) + ρ(y, z).
Definition 3 (Normed Vector Space). A normed vector space is a vector space
S, together with a norm ‖·‖ : S → R, such that for all x, y ∈ S and α ∈ R:
1. ‖x‖ ≥ 0, with equality if and only if x = θ;
2. ‖αx‖ = |α| · ‖x‖; and
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (the triangle inequality).
Question: Show that a normed vector space is a metric space with ρ(x, y) =
‖x − y‖.
Definition 4 (Convergence). A sequence {xn }∞n=0 in S converges to x ∈ S
if, for each ε > 0, there exists N such that

    ρ(xn , x) < ε, ∀ n ≥ N .

Definition 5 (Cauchy Sequence). A sequence {xn }∞n=0 in S is a Cauchy
sequence (satisfies the Cauchy criterion) if, for each ε > 0, there exists
N such that

    ρ(xn , xm ) < ε, ∀ n, m ≥ N .

Definition 6 (Complete Metric Space). A metric space (S, ρ) is complete if
every Cauchy sequence in S converges to an element of S. A Banach space
is a complete normed vector space.
Definition 7 (Stochastically Increasing). A family of probability measures,
say

    Pq = {Pq (·|w, r) : (w, r) ∈ W × R},

is stochastically increasing in w if, for every bounded increasing function
f (·) and every r,

    ∫ f (q)P (dq|w, r) ≥ ∫ f (q)P (dq|w′ , r), whenever w ≥ w′ .

Pq is stochastically increasing in w if and only if, for any (t, r),

    P (q > t|w, r) ≥ P (q > t|w′ , r) whenever w > w′ .•
4.2 Relevant Theorems

The theorems that we will use in class are Theorem 1 and its Corollary,
and Theorem 2. Theorem 1 is the contraction mapping theorem. This will
generally be used to find the value functions for our problems (though it is
sometimes used to prove uniqueness of solutions to other fixed point
problems in both theoretical and empirical I.O.). Its corollary establishes a way
of proving properties of the limit functions, generally the value function and
policy functions in our examples. Theorem 2 is Blackwell’s theorem (it
provides easy-to-use sufficient conditions for a contraction).
Definition 8. (contraction mapping). Let (S, ρ) be a complete metric space
and T : S → S be a function mapping S into itself. T is a contraction
mapping (with modulus β) if for some β ∈ (0, 1), ρ(T x, T y) ≤ βρ(x, y),
for all x, y ∈ S.•
T will usually take one set of functions into another, and will be defined
pointwise by an equation like (T f0 )(x) = f1 (x) for all x in some domain.
Theorem 1 (Contraction Mapping Theorem). If (S, ρ) is a complete metric
space and T : S → S is a contraction mapping with modulus β, then
• T has exactly one fixed point, say υ∗ ∈ S;
• for any υ0 ∈ S, ρ(T n υ0 , υ∗ ) ≤ β n ρ(υ0 , υ∗ ), and hence by completeness
limn→∞ T n υ0 = υ∗ . Here T n is the nth iterate of T , so T 2 υ = T (T (υ))
and T n (υ) = T (T n−1 (υ));
• writing υn for T n υ0 , ρ(υn , υ∗ ) ≤ (1 − β)−1 ρ(υn+1 , υn ).•
Typically υn is a function we have obtained by iterating on some initial
condition. The l.h.s. of the last inequality is just ρ(υn , υ∗ ), a quantity we
would like to know in order to determine how far away we are from the υ∗
we are trying to approximate. The r.h.s. is a quantity we can measure, and
hence allows us to bound how far we are away from υ∗ .
A simple proof of the last point follows. For m > n,

    ρ(υm , υn ) ≤ ρ(υm , υm−1 ) + . . . + ρ(υn+1 , υn )

(use the triangle inequality to prove this)

    ≤ (β m−1 + . . . + β n ) ρ(υ1 , υ0 )

(from Theorem 1, since it ensures ρ(υq , υq−1 ) ≤ β q−1 ρ(υ1 , υ0 ))

    = [(β n − β m )/(1 − β)] ρ(υ1 , υ0 )

(all we have done is sum the series). Now take limits as m → ∞:

    ρ(υ∗ , υn ) = limm→∞ ρ(υm , υn ) ≤ (1 − β)−1 β n ρ(υ1 , υ0 ).

Since this is true regardless of the initial condition υ0 ,

    ρ(υ∗ , υn ) ≤ minj≤n (1 − β)−1 β n−j ρ(υj+1 , υj ) = (1 − β)−1 ρ(υn+1 , υn ).•
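The practical payoff of the last inequality is a computable stopping rule: however bad the initial guess, the distance to the fixed point is bounded by the size of the last step, scaled by (1 − β)−1. A quick numerical check on a one-dimensional contraction (the map below is an arbitrary example, chosen only because its derivative is bounded by β):

```python
import numpy as np

# Numerical check of the a-posteriori error bound: for a contraction T
# with modulus beta, rho(v_n, v*) <= rho(v_{n+1}, v_n) / (1 - beta).
beta = 0.6

def T(v):
    # |T'(v)| = |beta * cos(v)| <= beta < 1, so T is a contraction on R
    return 1.0 + beta * np.sin(v)

# Locate the fixed point to machine precision by brute-force iteration.
v_star = 0.0
for _ in range(200):
    v_star = T(v_star)

v = 5.0                                   # a deliberately bad initial guess
for _ in range(10):
    v_next = T(v)
    bound = abs(v_next - v) / (1.0 - beta)
    assert abs(v - v_star) <= bound + 1e-12   # the bound holds at every step
    v = v_next
print("error bound verified along the iteration path")
```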
Corollary 1. Let (S, ρ) be a complete metric space, and let T : S → S be
a contraction mapping with fixed point υ∗ ∈ S. Then
1. if S′ is a closed nonempty subset of S and T (S′ ) ⊆ S′ , then υ∗ ∈ S′ ;
2. if in addition T (S′ ) ⊆ S′′ ⊆ S′ , then υ∗ ∈ S′′ .•
The first part will usually be used as follows. Say we take a function
which is in a closed subset of S (e.g. the set of weakly increasing functions is
a subset of the set of functions). Further assume that if we take any member
of the subset and apply T to it, we obtain another member of the subset.
Then the fixed point is a member of the subset (e.g. it is weakly increasing).
The second part assumes that any time we apply the operator T to a member
of the initial closed subset, we obtain a member of a smaller subset of that
closed subset (e.g. a member of the set of strictly increasing functions). If
this is true, then the fixed point is also a member of that smaller subset.
Corollary 1 is important; you should remember it.
Theorem 2 (Blackwell’s Sufficient Conditions for a Contraction).
Let X ⊆ Rl , and let B(X) be a space of bounded functions f : X → R, with
the sup norm. Let T : B(X) → B(X) be an operator satisfying
1. Monotonicity: f, g ∈ B(X) and f (x) ≤ g(x) for all x ∈ X implies
(T f )(x) ≤ (T g)(x) for all x ∈ X; and
2. Discounting: there exists some β ∈ (0, 1) such that

    [T (f + a)](x) ≤ (T f )(x) + βa, for all f ∈ B(X), a ≥ 0, x ∈ X.

Then T is a contraction with modulus β.•
Above, (f + a)(x) is the function defined by (f + a)(x) = f (x) + a.
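As an example of how the conditions are verified, consider a generic Bellman operator (T v)(x) = sup_u { π(x, u) + β ∫ v(x′) P(dx′|x, u) } with π bounded. A sketch:

```latex
% Monotonicity: if f \le g pointwise, then for every (x,u)
%   \pi(x,u) + \beta \int f(x')\,P(dx'|x,u)
%     \;\le\; \pi(x,u) + \beta \int g(x')\,P(dx'|x,u),
% and taking the supremum over u preserves the inequality, so Tf \le Tg.
% Discounting: for any constant a \ge 0,
[T(f+a)](x) \;=\; \sup_u \Big\{ \pi(x,u) + \beta \int \big[f(x') + a\big]\,P(dx'|x,u) \Big\}
            \;=\; (Tf)(x) + \beta a ,
% so the discounting condition holds (with equality), and T is a
% contraction of modulus \beta on B(X) with the sup norm.
```

This is why value function iteration, as in the Rust and Timmins applications above, converges geometrically at rate β.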
Theorem 3 (Convergence of the Policy Function).
Let K ⊂ Rl , X ⊂ Rm , and assume that the correspondence Γ : K → X is
nonempty, compact- and convex-valued, and continuous. Assume that for
each n and each k ∈ K, fn (k, x) is strictly concave in its second argument
and that f has the same properties, where fn → f in the sup norm. (Recall
that this implies that fn and f are continuous.) Define gn and g by

    gn (k) = argmaxx∈Γ(k) fn (k, x), n = 1, 2, . . .
    g(k) = argmaxx∈Γ(k) f (k, x).

Then gn → g pointwise. If K is compact, gn → g uniformly.•
References
Ericson, R., and A. Pakes (1995): “Markov Perfect Industry Dynamics:
A Framework for Empirical Work,” Review of Economic Studies, 62, 53–82.
Hansen, L., and K. Singleton (1982): “Generalized Instrumental Variables
Estimation of Nonlinear Rational Expectations Models,” Econometrica, 50,
1269–1286.
Judd, K. (1998): Numerical Methods in Economics. MIT Press, Cambridge
MA.
Pakes, A. (1986): “Patents as Options: Some Estimates of the Value of
Holding European Patent Stocks,” Econometrica, 54, 755–784.
(1994): “Dynamic Structural Models, Problems and Prospects:
Mixed Continuous Discrete Controls and Market Interactions,” in (Sims
1994), chap. 5, pp. 171–260.
Pakes, A., and P. McGuire (2001): “Stochastic Approximation for Dynamic
Analysis: Markov Perfect Equilibrium and the ‘Curse’ of Dimensionality,”
Econometrica.
Pindyck, R. S., and J. Rotemberg (1983): “Dynamic Factor Demands
and the Effects of the Energy Price Shocks,” AER, 73(5), 1066–1079.
Rust, J. (1987): “Optimal Replacement of GMC Bus Engines: An Empirical
Model of Harold Zurcher,” Econometrica, 55(5), 999–1033.
(1994): “Estimation of Dynamic Structural Models, Problems and
Prospects: Discrete Decision Processes,” in (Sims 1994), chap. 4, pp. 119–
170.
(1997): “Using Randomization to Break the Curse of Dimensionality,” Econometrica, 65(3), 487–516.
Sims, C. (ed.) (1994): Advances in Econometrics. Cambridge University
Press, Cambridge.
Stokey, N. L., R. E. Lucas, and E. C. Prescott (1989): Recursive
Methods in Economic Dynamics. Harvard Univ., Cambridge MA.
Timmins, C. (1999): “Price as a Policy Tool for Dynamic Resource Allocation: Municipally Owned Water Utilities,” Working paper, Yale Univ.
Wolpin, K., and M. Keane (1994): “The Solution and Estimation of
Discrete Choice Dynamic Programming Models by Simulation and Interpolation: Monte Carlo Evidence,” Review of Economics and Statistics,
76(4), 648–672.