
Estimation.
Notes for "Applied Microeconometrics".
Martin Browning
Department of Economics, University of Oxford
Revised, 6 February 2012
1 Three classes of models.

1.1 Models and methods.
Very broadly we can categorise econometric models into three groups: (fully)
parametric, semi-parametric and nonparametric. Below we shall consider each of
these. As well as econometric models, there are numerous econometric methods
(least squares, maximum likelihood, GMM, instrumental variable estimation,
generalised empirical likelihood etc.). These methods are used to estimate the
models. Although some methods are more or less suited to different models, it
is important to keep clear the distinction between models and methods.
Suppose we have two random variables Y and X that may be stochastically linked and we wish to estimate the mean of Y conditional on X, E(Y | X).¹ This conditional mean is a deterministic function of X, denoted E(Y | X) = g(X). Without loss of generality, this can also be written as a ‘regression’ identity if we define:

ε = Y − g(X)    (1)

so that E(ε | X) = 0. This gives the model:

Y = g(X) + ε, with E(ε | X) = 0    (2)
(see chapter 2, sections 2.1 and 2.2 of Wooldridge).

These notes are solely for the use of students on the Applied Microeconometrics course. Please do not distribute them or quote them without permission.

¹ More generally we might be interested in the variance of Y given X or even, most ambitiously, the distribution of Y given X (from which we could calculate any statistic of interest), but the conditional mean is enough to be going on with.
We now consider three classes of econometric models and the econometric
methods that can be applied to them.
1.2 A (fully) parametric model.
In a fully parametric model we specify exactly the relationship between Y and X and the distributions involved, dependent on a finite set of parameters. For example, suppose X > 0 and we take:

Y = α + βX^γ + ε    (3)

ε ~ N(0, σ²)    (4)
Here we have made parametric assumptions concerning the functional form for the mean (the first equation) and assumptions concerning the distribution of the unobserved variable (the ‘error term’), ε. Specifically we assume that ε is normally distributed independently of X and that it has zero mean and a constant variance (homoskedasticity). The independence assumption is implicit since the mean and variance of the distribution are explicitly assumed to be independent of X and these two parameters fully characterise a Normal distribution.
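As a concrete illustration, here is a minimal Python sketch that simulates data from (3) and (4). The parameter values and the distribution of X are illustrative assumptions, not values from the notes; later sketches reuse this simulated sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 'true' parameter values, for illustration only.
alpha, beta, gamma, sigma = 1.0, 2.0, 0.5, 0.2

n = 1000
X = rng.uniform(0.5, 4.0, size=n)      # X > 0, as the model requires
eps = rng.normal(0.0, sigma, size=n)   # equation (4): eps ~ N(0, sigma^2), independent of X
Y = alpha + beta * X**gamma + eps      # equation (3)
```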
We are given a sample (Y_i, X_i), i = 1, …, n, and wish to estimate the parameters of interest (α, β, γ).² The estimator of choice for parametric models is maximum likelihood (MLE) since this exploits efficiently all of the functional form and distributional assumptions. In the current example, MLE leads to:
(α̂, β̂, γ̂) = arg min Σ_{i=1}^{n} (Y_i − α − βX_i^γ)²    (5)
This is nonlinear least squares (NLS); the coincidence of MLE and Least Squares generally only happens if we take Normal distributions for the (additive) unobservables. Under the assumptions made, this estimator is a consistent estimator of the true parameters. One of the attractions of MLE is that it is very well understood and we have a vast theoretical apparatus and lots of practical experience to draw on. Moreover, under weak assumptions it is the ‘best’ estimator for a fully parametric model. In section 4 below we shall discuss what we mean by ‘best’.

² Whether or not σ is of interest depends on the use you will make of the estimates. Here we are assuming that we are only interested in the deterministic mean function. Thus the variance term is a ‘nuisance parameter’. If you wished to simulate Y values for given X values, then σ becomes a parameter of interest and you would also need an estimator for it.
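A sketch of the minimisation in (5) on the simulated sample above; the optimiser and starting values are arbitrary choices, not prescriptions from the notes.

```python
import numpy as np
from scipy.optimize import minimize

def sum_of_squares(theta, Y, X):
    """NLS objective from (5): sum of squared deviations from the mean function."""
    a, b, c = theta
    return np.sum((Y - a - b * X**c) ** 2)

res = minimize(sum_of_squares, x0=[0.0, 1.0, 1.0], args=(Y, X), method="Nelder-Mead")
alpha_hat, beta_hat, gamma_hat = res.x
sigma_hat = np.sqrt(res.fun / len(Y))  # MLE of sigma: root of the mean squared residual
print(res.x.round(3), round(sigma_hat, 3))
```

If convergence is fragile, fixing γ at several ‘reasonable’ values and grid-searching before freeing all three parameters, as suggested below, is a sensible first step.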
In this example the object to be optimised (maximise the likelihood function or minimise the sum of squares) is very simple. In practice, even if you are clever enough to write down the likelihood function for your model³, maximising it might be difficult. The likelihood function could have multiple local maxima or it could be non-differentiable or simply be very slow to converge. Most practicing applied microeconometricians spend a lot of time trying to ‘get the model to converge’. There are no hard and fast rules for non-linear optimisation but often it helps to fix some parameters at ‘reasonable’ values and to do a grid search on the parameters for which likely values are not obvious.
MLE is not the only way to estimate a fully parametric model. A simple alternative is Simulated Minimum Distance (SMD) (sometimes known as indirect inference). SMD is much easier to implement than MLE but it is generally less efficient.
Sometimes we can also estimate the parameters of a fully parametric model using Generalised Method of Moments (GMM). For example, here we have assumed that ε is independent of X. This is a strong assumption and implies:

E(εX^r) = 0 for r = 0, 1, 2, …    (6)

(Note that independence also implies an infinite number of other such conditions.) If we take r = 0, 1, 2, 3 then we have:

E(Y − α − βX^γ) = 0
E((Y − α − βX^γ)X) = 0
E((Y − α − βX^γ)X²) = 0
E((Y − α − βX^γ)X³) = 0    (7)
³ And you have to be very clever indeed for interesting models that capture important theoretical structural features and fit the data.
⁴ Whether the parameters are identified from this system is an identification issue that is dealt with in the identification notes.

The first of these is simply the zero mean assumption.⁴ GMM proceeds by replacing population expectations with sample analogues:
(1/n) Σ_{i=1}^{n} (Y_i − α − βX_i^γ) = u₁ = 0
(1/n) Σ_{i=1}^{n} (Y_i − α − βX_i^γ) X_i = u₂ = 0
(1/n) Σ_{i=1}^{n} (Y_i − α − βX_i^γ) X_i² = u₃ = 0
(1/n) Σ_{i=1}^{n} (Y_i − α − βX_i^γ) X_i³ = u₄ = 0    (8)
This gives four equations in three unknowns; the parameters are said to be over-identified. For a given sample, there will not be any value for (α, β, γ) that satisfies all four equations. One way to deal with this is by dropping an equation (for example, the fourth one). When we take exactly as many moments as parameters then solving these equations is known as the Method of Moments (MM).
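A sketch of the exactly-identified Method of Moments step: drop the fourth equation in (8) and solve the remaining three numerically, reusing the simulated Y and X from above (the root-finder and starting values are assumed choices).

```python
import numpy as np
from scipy.optimize import fsolve

def moment_conditions(theta, Y, X):
    """First three sample moment conditions in (8)."""
    a, b, c = theta
    e = Y - a - b * X**c
    return [np.mean(e), np.mean(e * X), np.mean(e * X**2)]

theta_mm = fsolve(moment_conditions, x0=[1.0, 2.0, 0.5], args=(Y, X))
print("MM estimates of (alpha, beta, gamma):", theta_mm.round(3))
```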
In using only three moments we are throwing away some information (the excluded moment). If we wish to use all moments (and we could add an infinite number to the list above) we use GMM. This weights the equations and minimises the weighted sum of squares of the errors⁵:

ω₁u₁² + ω₂u₂² + ω₃u₃² + ω₄u₄²    (9)

where the weights are non-negative and sum to unity. Minimising this gives consistent estimates of the parameters for any weights that put positive weight on at least three of the equations. The GMM estimator gives an efficient (‘minimum asymptotic variance’) and automatic way to weight the different equations given the choice of moments. If we took different moments, for example replacing the last moment above with:

(1/n) Σ_{i=1}^{n} (Y_i − α − βX_i^γ) ln(X_i) = u₅ = 0    (10)

then we would have different weights.
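A sketch of the weighted objective (9) using all four moments on the same simulated sample. The equal weights, and the use of a simple diagonal weighting rather than the efficient two-step GMM weight matrix, are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, Y, X, w):
    """Weighted sum of squared sample moments, as in (9)."""
    a, b, c = theta
    e = Y - a - b * X**c
    u = np.array([np.mean(e), np.mean(e * X), np.mean(e * X**2), np.mean(e * X**3)])
    return np.sum(w * u**2)

w = np.full(4, 0.25)  # non-negative weights summing to one
res = minimize(gmm_objective, x0=[1.0, 2.0, 0.5], args=(Y, X, w), method="Nelder-Mead")
print("GMM estimates of (alpha, beta, gamma):", res.x.round(3))
```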
⁵ By not considering values of r that are greater than 3 in (7) we have already made an implicit weighting assumption.

The optimal weighted GMM estimates are generally ‘less efficient’ than MLE if the distributional assumptions are correct. This is because MLE uses all of the assumptions concerning the distribution. For example, the GMM estimator based on the moments above does not use the information that the Normal distribution is symmetric. On the other hand, the GMM estimates are more robust since they rely on weaker assumptions. As usual, Wooldridge (chapter 14) provides a good introduction to the technical details. Also good is Appendix A in:
‘Panel Data Econometrics’ by Manuel Arellano    (11)

GMM includes many familiar estimators (OLS, IVE) as special cases. A full account is given in:

‘Generalized Method of Moments’ by Alastair Hall    (12)
Exercise 1 It is a very useful exercise to simulate some data according to (3) and (4) and then to estimate by MLE (that is, use (5)) and by GMM. For the latter, you could try using different equations and other moments. Then simulate relaxing the Normality assumption (for example, take a skewed distribution for ε) but keeping the independence between ε and X and estimate again. Even better would be to conduct a small Monte Carlo analysis comparing MLE and various forms of GMM.
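A minimal Monte Carlo sketch along these lines, comparing NLS/MLE with the exactly-identified MM estimator when the error is skewed (a recentred exponential) but still independent of X. The design (sample size, replications, starting values) is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import minimize, fsolve

rng = np.random.default_rng(1)
alpha, beta, gamma, sigma = 1.0, 2.0, 0.5, 0.2  # assumed true values
n, reps = 500, 200

def nls(Y, X):
    obj = lambda th: np.sum((Y - th[0] - th[1] * X**th[2]) ** 2)
    return minimize(obj, x0=[1.0, 2.0, 0.5], method="Nelder-Mead").x

def mm(Y, X):
    def eqs(th):
        e = Y - th[0] - th[1] * X**th[2]
        return [np.mean(e), np.mean(e * X), np.mean(e * X**2)]
    return fsolve(eqs, x0=[1.0, 2.0, 0.5])

draws = {"NLS/MLE": [], "MM": []}
for _ in range(reps):
    X = rng.uniform(0.5, 4.0, size=n)
    eps = sigma * (rng.exponential(1.0, size=n) - 1.0)  # skewed, mean zero, independent of X
    Y = alpha + beta * X**gamma + eps
    draws["NLS/MLE"].append(nls(Y, X))
    draws["MM"].append(mm(Y, X))

for name, est in draws.items():
    est = np.array(est)
    print(name, "mean:", est.mean(axis=0).round(3), "sd:", est.std(axis=0).round(3))
```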
The additive-error form assumption in (3) is very commonly used but sometimes theory dictates that the unobservable enters in an awkward way. Suppose, for example, that we had X > 0 and:

Y = α + βX^(γ+η), η ~ N(0, σ²)    (13)

Note that X > 0 implies that Y − α and β must have the same sign. Since the object of interest, the conditional mean E(Y | X), now depends on all parameters, the parameters of interest are (α, β, γ, σ). Very often a researcher will take an approximation to (13) with the additive-error form:

Y = α̃ + β̃X^γ̃ + ε    (14)

but this is probably not a very good approximation for (13) and consistent estimates of (α̃, β̃, γ̃) (which could be found with NLS) will not be consistent estimates of the parameters of interest, (α, β, γ). However, we can invert (13) to give:

ln((Y − α)/β) / ln X = γ + η    (15)

This allows estimation of the parameters (α, β, γ, σ) using GMM or MLE or other estimators.
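To illustrate the approximation point, the following sketch (which takes the reconstruction of (13) above as given) simulates from (13) and fits the additive-error form (14) by NLS; the NLS estimates converge to pseudo-true values that need not equal (α, β, γ).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Assumed true values for model (13): Y = alpha + beta * X**(gamma + eta).
alpha, beta, gamma, sigma = 1.0, 2.0, 0.5, 0.3

n = 20000
X = rng.uniform(0.5, 4.0, size=n)
eta = rng.normal(0.0, sigma, size=n)
Y = alpha + beta * X**(gamma + eta)

# NLS on the additive-error approximation (14).
ssr = lambda th: np.sum((Y - th[0] - th[1] * X**th[2]) ** 2)
res = minimize(ssr, x0=[1.0, 2.0, 0.5], method="Nelder-Mead")
print("NLS estimates:", res.x.round(3))
print("true (alpha, beta, gamma):", (alpha, beta, gamma))
```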
It is very important to note that the MLE, SMD and GMM estimates will generally differ in any given sample (even though they are ‘asymptotically equivalent’ in the sense that all are consistent under the maintained assumptions). If the GMM estimates are ‘significantly different’ from the MLE estimates this suggests that the parametric assumptions are wrong. Perhaps we can estimate the parameters of interest with weaker assumptions. That is the subject of the next subsection.
1.3 Semi-parametric models.

1.3.1 General.
In the previous example the assumptions we made were very strong. If we assume (3) and are only interested in estimating (α, β, γ) we have seen that we can use the following assumption:

E(εX^r) = 0 for r = 0, 1, 2, 3    (16)

A stronger (and more conventional) assumption is mean independence:

E(ε | X) = 0    (17)

This is a semi-parametric model since we made some parametric assumptions (for the functional form) but left others open. For example, the assumption on the error term is consistent with ε having a non-Normal distribution with the variance depending on X. To estimate in this case we need to use GMM, or some variant of it. We cannot use MLE or SMD since we do not specify the distribution of the unobserved variable.

Note that if we knew the value of γ then we would have a problem that is linear in parameters; taking r = 0, 1 in (8) (with X^γ in place of X in the second condition) gives Ordinary Least Squares (OLS), as the sketch below illustrates. So you have conducted a semi-parametric analysis if you have ever used OLS!
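A tiny sketch of this point, with γ fixed at an assumed known value: the model is then linear in (α, β) and the first two moment conditions are the OLS normal equations for a regression of Y on a constant and X^γ (reusing the simulated sample).

```python
import numpy as np

gamma_known = 0.5                        # assumed known exponent
Z = X**gamma_known
A = np.column_stack([np.ones_like(Z), Z])
alpha_ols, beta_ols = np.linalg.lstsq(A, Y, rcond=None)[0]  # OLS = MM with two moments
print(round(alpha_ols, 3), round(beta_ols, 3))
```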
1.3.2 First order conditions.
Semi-parametric models very often arise when we have a theory model involving optimisation and we seek to estimate the parameters that appear in the first order conditions. Here I present the most famous example of this, the Euler equation for consumption.⁶ Given an intertemporal model with a particular utility function and perfect capital markets we have the following equation for consumption growth:

Δ ln c_{t+1} = α + β r_{t+1} + γ x_{t+1} + ε_{t+1}    (18)

where Δ is the first difference operator and:

ln c_{t+1} = log consumption in period t + 1    (19)
r_{t+1} = real interest rate between t and t + 1    (20)
x_{t+1} = value of a demographic variable in t + 1    (21)
ε_{t+1} = a ‘surprise’ or ‘consumption shock’    (22)
Under the theory of optimal intertemporal allocation this is a first order condition for optimisation. The theory has the strong implication that the consumption shock for t + 1, ε_{t+1}, is uncorrelated with any information known to the agent in time t (this is the econometrician’s definition of a surprise). Using the law of iterated expectations (Wooldridge, section 2.2) we can show that this implies:

E(ε_{t+1} z_t) = 0    (23)

for any variable z_t that was known in period t.⁷ In particular, we have:

E(ε_{t+1}) = 0    (24)

⁶ It was consideration of this model that led Lars Hansen to develop GMM in the first place.
⁷ This does not imply that the variance of ε_{t+1} is independent of past information. More generally the distribution of ε_{t+1} may depend on lagged information; it is only the conditional mean that is zero.

If we assume that x_{t+1} was known in period t (for example, we have quarterly data and x_t is a dummy for the fourth quarter when consumption is high relative to other quarters) then we also have:

E(ε_{t+1} x_{t+1}) = 0    (25)
If we substitute for ε_{t+1} from (18) into (24) and (25) then we have two equations in three unknowns, (α, β, γ). We need one further equation to identify the three parameters. It is unrealistic to assume that people know the real interest rate one period ahead⁸ so we have:

E(ε_{t+1} r_{t+1}) ≠ 0

But we can use the lagged real interest rate since that was known at time t. This gives a third equation (a particular example of (23)):

E(ε_{t+1} r_t) = 0    (26)

Lagged real rates are actually a very good instrument for the current real rate since real rates are serially correlated. The three equalities (24) to (26) are known as orthogonality conditions. First order conditions for theory models often give orthogonality conditions of this kind.
Suppose now that we are given observations on a single household followed for periods t = 1, 2, …, T. We define T − 1 functions:

e_{t+1}(α, β, γ) = Δ ln c_{t+1} − α − β r_{t+1} − γ x_{t+1}    (27)

for t = 1, …, T − 1. Then we take the sample analogues of (24) to (26):

(1/(T − 1)) Σ_{t=1}^{T−1} e_{t+1} = u₁ = 0
(1/(T − 1)) Σ_{t=1}^{T−1} e_{t+1} x_{t+1} = u₂ = 0
(1/(T − 1)) Σ_{t=1}^{T−1} e_{t+1} r_t = u₃ = 0    (28)

This is a system of three equations in three unknowns so we can solve uniquely for (α, β, γ). This is, of course, Instrumental Variable Estimation. Alternatively, we could construct the expectation of the real rate conditional on period t information, denoted ₜr̂_{t+1}, and then regress Δ ln c_{t+1} on a constant, x_{t+1} and ₜr̂_{t+1}; this is, of course, 2SLS. The two approaches will not generally give exactly the same parameter estimates.

⁸ Even if they have fixed nominal interest rates, the one period ahead inflation rate is not known.
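A sketch of solving (28): since e_{t+1} is linear in (α, β, γ), the three conditions form a linear system and the estimator is just-identified IV. The simulated series (an AR(1) real rate so that the lagged rate is a relevant instrument, a fourth-quarter dummy for x) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 400
alpha, beta, gamma = 0.005, 0.6, 0.02  # assumed true values

# Serially correlated real rate, so that r_t is a relevant instrument for r_{t+1}.
r = np.zeros(T)
for t in range(1, T):
    r[t] = 0.005 + 0.7 * r[t - 1] + 0.002 * rng.standard_normal()

x = (np.arange(T) % 4 == 3).astype(float)  # fourth-quarter dummy, known in advance
eps = 0.01 * rng.standard_normal(T)        # the consumption 'surprise'
dlnc = alpha + beta * r + gamma * x + eps  # consumption growth, equation (18)

# Sample conditions (28): instruments z_t = (1, x_{t+1}, r_t).
Z = np.column_stack([np.ones(T - 1), x[1:], r[:-1]])  # instruments
W = np.column_stack([np.ones(T - 1), r[1:], x[1:]])   # right-hand-side variables
theta_iv = np.linalg.solve(Z.T @ W, Z.T @ dlnc[1:])
print("IV estimates of (alpha, beta, gamma):", theta_iv.round(4))
```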
We could use other variables known at time t to give further orthogonality conditions. One particularly strong assumption is that the shock ε_{t+1} is uncorrelated with expected changes in income. Denote the change in income from period t to t + 1 by Δy_{t+1} and denote the expectation of this in period t by ₜΔy_{t+1}. The theory states that E(ε_{t+1} · ₜΔy_{t+1}) = 0. This is a strong test of the theory since it states that consumption changes are not affected by expected income changes. It would be invalid if, for example, the agent faced credit constraints and would have liked to consume more in period t but had to wait until an expected increase in income actually materialised.

To use the income information we need to take the income series and construct estimates of ₜΔy_{t+1}, denoted ₜΔŷ_{t+1}; this could be done using some standard forecasting technique. A direct approach is to add a fourth orthogonality condition:

(1/(T − 1)) Σ_{t=1}^{T−1} e_{t+1} · ₜΔŷ_{t+1} = u₄ = 0    (29)

to give an over-identified system. Then use GMM to estimate the parameters and use the test of the over-identifying restrictions to check the joint validity of the orthogonality conditions.
An alternative to setting up an over-identified system is to use the original three conditions (28) to give estimates (α̂, β̂, γ̂) and then use equation (27) to construct estimated residuals:

ê_{t+1} = Δ ln c_{t+1} − α̂ − β̂ ₜr̂_{t+1} − γ̂ x_{t+1}    (30)

Finally, test that the covariance between ê_{t+1} and ₜΔŷ_{t+1} is zero. This is known as a moments test and is an alternative to the test of the over-identifying restrictions.
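A sketch of the moments test, continuing from the IV sketch above. For simplicity it ignores the correction for estimation error in (α̂, β̂, γ̂) that a full moments test would include, uses the actual rate r_{t+1} in the residuals, and fabricates an ‘expected income growth’ series purely for illustration.

```python
import numpy as np

alpha_hat, beta_hat, gamma_hat = theta_iv  # from the IV sketch above

# Hypothetical expected income-change series, known at t (illustration only).
dy_hat = 0.01 + 0.003 * rng.standard_normal(T - 1)

# Residuals as in (30), here with the actual r_{t+1} for simplicity.
e_hat = dlnc[1:] - alpha_hat - beta_hat * r[1:] - gamma_hat * x[1:]

m = np.mean(e_hat * dy_hat)                  # sample analogue of the covariance
se = np.std(e_hat * dy_hat, ddof=1) / np.sqrt(T - 1)
print("moment:", m, "t-ratio:", m / se)      # |t| > 1.96 would reject at the 5% level
```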
2 Nonparametric models.
Above we took:

Y = g(X) + ε, with E(ε | X) = 0    (31)

and assumed a functional (parametric) form for the unknown function g(·).⁹ Using non-parametric regression we can dispense with the parametric assumption on g(·) and estimate it. The main advantage of the nonparametric approach is, of course, that it does not impose a functional form restriction. The main disadvantages are that it is not efficient relative to a parametric estimator (if the parametric assumptions are correct) and it is difficult to allow for more than two right hand side variables. Cameron and Trivedi have an excellent introductory discussion of nonparametric estimation (chapter 9, unfortunately called ‘Semiparametric methods’).

⁹ To repeat: ‘parametric’ means that we know the function g(·) except for a finite set of unknown parameters. For example, g(·) is a third order polynomial.
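A minimal Nadaraya-Watson sketch of nonparametric regression; this is one standard kernel estimator among several, and the Gaussian kernel, the bandwidth and the evaluation grid are assumed choices.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0.5, 4.0, size=1000)
Y = 1.0 + 2.0 * X**0.5 + rng.normal(0.0, 0.2, size=1000)  # same assumed DGP as before

def nw_estimate(x0, X, Y, h):
    """Nadaraya-Watson kernel estimate of g(x0) = E(Y | X = x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)

grid = np.linspace(0.6, 3.9, 8)
g_hat = [nw_estimate(x0, X, Y, h=0.3) for x0 in grid]
print(np.round(g_hat, 3))  # estimates of g(.) with no functional form imposed
```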
3 M-estimation.
Above we considered several estimators (econometric methods). There is a unifying theory that allows us to consider all of these as special cases. This is known as M-estimation (the ‘M’ is for maximisation or minimisation). This is a powerful approach since we can establish the properties of M-estimators and then simply apply these for the specific estimator we are interested in. In these notes I have not used M-estimator theory since I shall not be interested in technical results (for example, the asymptotic properties of a particular estimator). For those who are interested, a very useful discussion can be found in chapter 12 of Wooldridge. For those who are eager for even more detail the best reference is:

‘Asymptotic Statistics’ by A. van der Vaart    (32)

This is not recommended for the faint hearted.
4 Criteria for estimators.
Given a model, there are often several possible estimators. Which one should we use? Here are some criteria that have been used in the literature.
1. Unbiased or consistent.
2. In a finite sample, the estimator has a small root mean squared error.
3. If biased but consistent, the estimate converges quickly to the true value as the size of the sample increases.
4. The estimator has a known asymptotic covariance matrix that can be estimated.
5. The asymptotic covariance in 4 is as small as possible.
6. Robust to misspecification.
7. Robust to outliers.
8. Appropriate for the model (parametric, semi-parametric or non-parametric).
9. Easy to use.