ECON 721: Generalized Method of Moments (aka Minimum Distance Estimation)
Fall 2016
Main references: Hansen (1982, Econometrica); Hansen and Singleton (1982, 1988)
Suppose that we have some data $\{x_t\}$, $t = 1, \dots, T$, and we want to test some hypotheses about $E(x_t) = \mu$. For example,
$$H_0: \mu = \mu_0.$$
How do we proceed? By a central limit theorem,
$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu) \sim N(0, V_0),$$
where
$$V_0 = E\bigl[(x_t - \mu)(x_t - \mu)'\bigr] \quad \text{if } x_t \text{ is iid},$$
$$V_0 = \lim_{T \to \infty} E\left[\left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu)\right)\left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu)\right)'\right] \quad \text{in the general case.}$$
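When the $x_t$ are dependent, this limit has to be approximated at finite $T$. Below is a minimal sketch of one standard approach, a Newey-West (Bartlett kernel) long-run variance estimator; the lag truncation `L` is a hypothetical tuning choice, and nothing in these notes prescribes this particular estimator:

```python
import numpy as np

def long_run_vcv(x, L=5):
    """Newey-West style estimate of V0 for a T x n array of moment observations."""
    T = x.shape[0]
    u = x - x.mean(axis=0)                            # demeaned series
    V = u.T @ u / T                                   # lag-0 autocovariance
    for j in range(1, L + 1):
        Gamma = u[j:].T @ u[:-j] / T                  # lag-j autocovariance
        V += (1.0 - j / (L + 1)) * (Gamma + Gamma.T)  # Bartlett kernel weights
    return V
```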
Because $V_0$ is symmetric, we can decompose it as
$$V_0 = QDQ',$$
where $D$ is a diagonal matrix of eigenvalues and $Q$ is an orthogonal matrix of eigenvectors:
$$QQ' = I, \qquad Q^{-1} = Q'.$$
We can write
$$QD^{1/2}D^{1/2}Q' = V_0.$$
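A quick numerical check of this decomposition (a minimal numpy sketch; the matrix `V0` is made up for illustration):

```python
import numpy as np

# Hypothetical symmetric positive definite V0, for illustration only
V0 = np.array([[2.0, 0.5],
               [0.5, 1.0]])

# For a symmetric matrix, eigh returns the eigenvalues (diagonal of D)
# and an orthogonal matrix of eigenvectors Q
eigvals, Q = np.linalg.eigh(V0)
D_half = np.diag(np.sqrt(eigvals))

assert np.allclose(Q @ D_half @ D_half @ Q.T, V0)  # V0 = Q D^{1/2} D^{1/2} Q'
assert np.allclose(Q @ Q.T, np.eye(2))             # Q Q' = I
```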
Under the null $H_0$,
$$\left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right]' V_0^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right] = \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right]' QD^{-1/2}D^{-1/2}Q' \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right] \sim \chi^2(n),$$
where $n$ is the number of moment conditions (the length of the vector of moments).
Note that the previous equation is a quadratic form in
$$D^{-1/2}Q' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0) \sim N(0, I_n).$$
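As a concrete illustration, here is a minimal Python/numpy sketch of this test in the iid case; the simulated data and the choice of $\mu_0$ are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, n = 500, 2
mu0 = np.array([0.0, 1.0])                       # hypothesized mean under H0
x = rng.normal(loc=mu0, scale=1.0, size=(T, n))  # hypothetical iid data

g = np.sqrt(T) * (x.mean(axis=0) - mu0)          # (1/sqrt(T)) sum_t (x_t - mu0)
V = np.cov(x, rowvar=False)                      # sample VCV estimates V0 (iid case)

stat = g @ np.linalg.solve(V, g)                 # quadratic form g' V^{-1} g
pval = stats.chi2.sf(stat, df=n)                 # compare to chi^2(n) under H0
```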
How does the test statistic behave under the alternative ($\mu \neq \mu_0$)?
$$\left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right]' V_0^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0)\right]
= \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu)\right]' V_0^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu)\right]$$
$$+\; 2\left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (\mu - \mu_0)\right]' V_0^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu)\right]
+ \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (\mu - \mu_0)\right]' V_0^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (\mu - \mu_0)\right].$$
The last term equals $T(\mu - \mu_0)' V_0^{-1} (\mu - \mu_0)$, which is $O(T)$ and gets large under the alternative (as we want).
Problems:
(i) $V_0$ is not known a priori. Estimate it with $V_T \to_p V_0$: in an iid setting, use the sample VCV matrix; more generally, approximate the limit with a finite-$T$ estimator (e.g., the Newey-West style sketch above).
(ii) $\mu$ is not known. Suppose we want to test
$$H_0: \mu = \varphi(\beta),$$
where $\varphi$ is specified but $\beta$ is unknown.
We can estimate $\beta$ by minimum $\chi^2$ estimation:
$$\min_{\beta \in B} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta)\bigr)\right]' V_T^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta)\bigr)\right] \sim \chi^2(n-k),$$
where $k$ is the dimension of $\beta$ and $n$ is the number of moments.
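A minimal sketch of this minimum-$\chi^2$ problem using numpy and scipy; the map $\varphi$, the simulated data, and the starting value are all hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def criterion(beta, x, Vinv, phi):
    """Minimum-chi^2 / GMM objective: g(beta)' V^{-1} g(beta)."""
    T = x.shape[0]
    g = np.sqrt(T) * (x.mean(axis=0) - phi(beta))  # (1/sqrt(T)) sum (x_t - phi(beta))
    return g @ Vinv @ g

# Hypothetical restriction: two moments lie on a one-parameter curve (n = 2, k = 1)
phi = lambda b: np.array([b[0], b[0] ** 2])

rng = np.random.default_rng(1)
x = rng.normal(loc=[0.5, 0.25], scale=1.0, size=(1000, 2))
Vinv = np.linalg.inv(np.cov(x, rowvar=False))

res = minimize(criterion, x0=np.array([0.1]), args=(x, Vinv, phi))
beta_hat, J = res.x, res.fun   # J is the minimized criterion, ~ chi^2(n - k) under H0
```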
We will show that by searching over $k$ dimensions, you lose $k$ degrees of freedom.
Let's derive the distribution theory for $\hat\beta_T$. The first-order condition (FOC) is:
$$\sqrt{T}\, \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr) = 0.$$
Taylor expand $\varphi(\hat\beta_T)$ around $\beta_0$:
$$\varphi(\hat\beta_T) = \varphi(\beta_0) + \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} (\hat\beta_T - \beta_0), \quad \text{for } \beta^* \text{ between } \beta_0 \text{ and } \hat\beta_T.$$
Plug this into the FOC (for $\varphi(\hat\beta_T)$) to get
$$\sqrt{T}\, \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \left[x_t - \varphi(\beta_0) - \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} (\hat\beta_T - \beta_0)\right] = 0.$$
Rearrange to solve for $(\hat\beta_T - \beta_0)$:
$$\sqrt{T}(\hat\beta_T - \beta_0) = \left[\left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*}\right]^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta_0)\bigr).$$
Apply the CLT to the last term. If
$$\left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T} \to_p D_0, \qquad V_T \to_p V_0, \qquad \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} \to_p D_0, \qquad \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta_0)\bigr) \sim N(0, V_0),$$
then
$$\sqrt{T}(\hat\beta_T - \beta_0) \sim N\Bigl(0,\; (D_0' V_0^{-1} D_0)^{-1} D_0' V_0^{-1} V_0 V_0^{-1} D_0 (D_0' V_0^{-1} D_0)^{-1}\Bigr) = N\Bigl(0,\; (D_0' V_0^{-1} D_0)^{-1}\Bigr),$$
where $D_0'$ is $k \times n$ and $V_0$ is $n \times n$.
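Given consistent estimates of $D_0$ and $V_0$, this variance is immediate to compute; a minimal sketch (the helper name `gmm_avar` and its inputs are illustrative, not from the notes):

```python
import numpy as np

def gmm_avar(D, V):
    """Asymptotic VCV of sqrt(T)(beta_hat - beta_0): (D' V^{-1} D)^{-1},
    with D the n x k derivative matrix D0 and V the n x n matrix V0."""
    return np.linalg.inv(D.T @ np.linalg.solve(V, D))
```

Standard errors for $\hat\beta_T$ are then the square roots of the diagonal of this matrix divided by $T$.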
Why is the limiting distribution of the criterion $\chi^2(n-k)$? Write
$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr) = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0) + \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(\mu_0 - \varphi(\hat\beta_T)\bigr).$$
Recall that
$$\varphi(\hat\beta_T) = \varphi(\beta_0) + \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} (\hat\beta_T - \beta_0) = \mu_0 + \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} (\hat\beta_T - \beta_0),$$
which implies
$$\mu_0 - \varphi(\hat\beta_T) = -\left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} (\hat\beta_T - \beta_0).$$
Plug in the expression for $\sqrt{T}(\hat\beta_T - \beta_0)$ that we derived earlier to get
$$\sqrt{T}\bigl(\mu_0 - \varphi(\hat\beta_T)\bigr) = -\left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*} \left\{\left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta^*}\right\}^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' V_T^{-1} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0).$$
Note that this term is a linear combination of $\sum_{t=1}^{T} (x_t - \mu_0)$, so
$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr) = B \cdot \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0),$$
where
$$\operatorname{plim} B = I - \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0} \left[\left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0}' V_0^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0}\right]^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0}' V_0^{-1} = B_0.$$
Note that $\left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0}' V_0^{-1} B_0 = 0$. This tells us that certain linear combinations of $B_0$ give a degenerate distribution (along $k$ dimensions), which needs to be taken into account when testing.
Recall that
$$V_0^{-1} = QD^{-1/2}D^{-1/2}Q',$$
so
$$D^{-1/2}Q' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr) = D^{-1/2}Q' B \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0),$$
where
$$D^{-1/2}Q'B = D^{-1/2}Q' - D^{-1/2}Q' \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T} \left[\left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' QD^{-1/2}D^{-1/2}Q' \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}\right]^{-1} \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}' QD^{-1/2} \cdot D^{-1/2}Q'$$
$$= \bigl(I - A(A'A)^{-1}A'\bigr)D^{-1/2}Q' = M_A D^{-1/2}Q',$$
with
$$A = D^{-1/2}Q' \left.\frac{\partial\varphi}{\partial\beta}\right|_{\hat\beta_T}.$$
$M_A$ is an idempotent matrix.
Thus,
$$D^{-1/2}Q' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr) = M_A D^{-1/2}Q' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (x_t - \mu_0).$$
The matrix A accounts for the fact that we performed the minimization over β.
How is the distribution theory affected? We have a quadratic form in normal random variables with an idempotent matrix ($M_A$). For example,
$$\hat\varepsilon'\hat\varepsilon = \varepsilon' M_X \varepsilon, \qquad M_X = I - X(X'X)^{-1}X'.$$
Some facts (see, e.g., Seber):
(i) Let $Y \sim N(\theta, \sigma^2 I_n)$ and let $P$ be a symmetric matrix of rank $r$. Then
$$Q = \frac{(Y-\theta)' P (Y-\theta)}{\sigma^2} \sim \chi^2(r) \iff P = P^2,$$
i.e., iff $P$ is idempotent.
(ii) If $Q_i \sim \chi^2(r_i)$, $i = 1, 2$, with $r_1 > r_2$, and $Q = Q_1 - Q_2$ is independent of $Q_2$, then $Q \sim \chi^2(r_1 - r_2)$.
Apply these results to get the distribution of
$$\frac{\hat\varepsilon'\hat\varepsilon}{\sigma^2} = \frac{\varepsilon' M_X \varepsilon}{\sigma^2}.$$
The rank of an idempotent matrix is equal to its trace, $\operatorname{tr}(M_X) = \sum_{i=1}^{n} \lambda_i$, and for an idempotent matrix the eigenvalues $\lambda_i$ are all 0 or 1. Then
$$\operatorname{rank}\bigl(I_{n \times n} - X(X'X)^{-1}X'\bigr) = \operatorname{rank}(I) - \operatorname{rank}\bigl(X(X'X)^{-1}X'\bigr) = n - \operatorname{tr}\bigl(X(X'X)^{-1}X'\bigr) = n - \operatorname{tr}\bigl((X'X)^{-1}X'X\bigr) = n - k,$$
using the fact that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ (the trace is invariant to cyclic permutations).
Thus,
ε̂0 ε̂ 2
~χ (n − k)
σ2
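A quick numerical check of these idempotency and rank facts (the design matrix below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 3
X = rng.normal(size=(n, k))
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # M_X = I - X(X'X)^{-1}X'

assert np.allclose(MX @ MX, MX)                      # idempotent
assert np.isclose(np.trace(MX), n - k)               # rank = trace = n - k
```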
By the same reasoning, under the null,
$$\left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr)\right]' V_T^{-1} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr)\right] \sim \chi^2(n-k).$$
We preserve the $\chi^2$ distribution but lose degrees of freedom in estimating $\beta$. In the case where $n = k$ (the just-identified case), we can estimate $\beta$ but have no degrees of freedom left to perform the test.
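In practice this overidentification test just evaluates the minimized criterion; a minimal sketch, reusing `res` from the hypothetical minimum-$\chi^2$ sketch above (so $n = 2$, $k = 1$):

```python
from scipy import stats

# res.fun is the minimized criterion J; under H0 it is chi^2(n - k)
n, k = 2, 1
pval = stats.chi2.sf(res.fun, df=n - k)
# Reject H0: mu = phi(beta) for some beta if pval is below the chosen level
```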
Would GMM provide a method for estimating $\beta$ if we use a weighting matrix that is different from $V_T^{-1}$?
Suppose we replace $V_0^{-1}$ by $W_0$:
$$\min_{\beta \in B} \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta)\bigr)\right]' W_0 \left[\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\beta)\bigr)\right].$$
- We could choose, for example, $W_0 = I$, which avoids the need to estimate the weighting matrix.
- The result is that the asymptotic covariance matrix is altered, but $\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(x_t - \varphi(\hat\beta_T)\bigr)$ will still be normally distributed. The asymptotic distribution of the criterion will be different.
What is the advantage of focusing on min-$\chi^2$ estimation? It turns out that $W_0 = V_0^{-1}$ gives the smallest covariance matrix: you get a more efficient estimator of $\beta$ and the most powerful test of the restrictions.
We can show this by showing that the following matrix is positive semidefinite:
$$(D_0' W_0 D_0)^{-1} (D_0' W_0 V_0 W_0 D_0) (D_0' W_0 D_0)^{-1} - (D_0' V_0^{-1} D_0)^{-1},$$
where the first term is the covariance matrix of $\sqrt{T}(\hat\beta_T - \beta_0)$ when a general weighting matrix is used. Equivalently, we can show that
$$(D_0' V_0^{-1} D_0) - (D_0' W_0 D_0)(D_0' W_0 V_0 W_0 D_0)^{-1} (D_0' W_0 D_0)$$
is positive semidefinite.
For any vector $\alpha$,
$$\alpha' \left[(D_0' V_0^{-1} D_0) - (D_0' W_0 D_0)(D_0' W_0 V_0 W_0 D_0)^{-1} (D_0' W_0 D_0)\right] \alpha$$
$$= \alpha' D_0' V_0^{-1/2} \left[I - V_0^{1/2} W_0 D_0 \bigl(D_0' W_0 V_0^{1/2} V_0^{1/2} W_0 D_0\bigr)^{-1} D_0' W_0 V_0^{1/2}\right] V_0^{-1/2} D_0 \alpha$$
$$= \alpha' D_0' V_0^{-1/2} \left[I - \tilde V (\tilde V' \tilde V)^{-1} \tilde V'\right] V_0^{-1/2} D_0 \alpha \geq 0,$$
where $\tilde V = V_0^{1/2} W_0 D_0$. The expression equals 0 if $W_0 = V_0^{-1}$.
The term in brackets is idempotent, so all of its eigenvalues are either 0 or 1, and we have written the difference as a quadratic form in it. A quadratic form $x'Ax$ is positive (negative) semidefinite iff all eigenvalues of $A$ are nonnegative (nonpositive); here they are 0 or 1, so the difference is positive semidefinite.
Therefore, $W_0 = V_0^{-1}$ is the optimal choice for the weighting matrix.
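This motivates the usual two-step (feasible efficient) procedure: estimate with a fixed weight first, then reweight by the inverse of an estimated $V_0$. A minimal sketch, reusing the hypothetical `criterion`, `x`, and `phi` from the minimum-$\chi^2$ sketch above:

```python
import numpy as np
from scipy.optimize import minimize

# Step 1: any fixed weight (here W = I) gives a consistent first-step estimate
step1 = minimize(criterion, x0=np.array([0.1]), args=(x, np.eye(2), phi))

# Step 2: estimate V0 from the step-1 moment deviations, then reweight by its inverse
u = x - phi(step1.x)                # x_t - phi(beta_step1), T x n
Vhat = u.T @ u / x.shape[0]         # sample VCV of the moments
step2 = minimize(criterion, x0=step1.x, args=(x, np.linalg.inv(Vhat), phi))
```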
Many standard estimators can be interpreted as GMM estimators
Some examples:
(i) OLS
$$y_t = x_t\beta + u_t,$$
$$E(u_t x_t) = 0 \;\Rightarrow\; E\bigl((y_t - x_t\beta)x_t\bigr) = 0.$$
$$\min_{\beta \in B} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (y_t - x_t\beta)x_t\right)' V_T^{-1} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (y_t - x_t\beta)x_t\right),$$
where $V_T$ is an estimator for $E(x_t u_t u_t' x_t')$ in the iid case, which equals $\sigma^2 E(x_t x_t')$ if the errors are homoskedastic.
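In this just-identified case the criterion attains zero and the GMM estimator coincides with OLS for any weighting matrix; a quick numpy check on simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)   # hypothetical true beta

# The sample moment condition X'(y - X beta) = 0 is exactly the OLS normal
# equations, so the GMM estimator is the OLS estimator
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```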
(ii) Instrumental variables
$$y_t = x_t\beta + u_t,$$
$$E(u_t x_t) \neq 0, \qquad E(u_t z_t) = 0, \qquad E(x_t z_t) \neq 0,$$
where there are $k_1$ instruments.
$$\min_{\beta \in B} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (y_t - x_t\beta)z_t\right)' V_T^{-1} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} (y_t - x_t\beta)z_t\right),$$
where $V_T = \hat E(z_t u_t u_t' z_t')$ in the iid case.
Suppose $E(u_t u_t' \mid z_t) = \sigma^2 I$. Then $V_0 = \sigma^2 E(z_t z_t')$ and $V_0^{-1} = (\sigma^2)^{-1}\bigl[E(z_t z_t')\bigr]^{-1}$.
It is possible to verify that 2SLS and GMM give the same estimator:
$$\hat\beta_{2SLS} = \bigl(X'Z(Z'Z)^{-1}Z'X\bigr)^{-1}\bigl(X'Z(Z'Z)^{-1}Z'Y\bigr).$$
In the first stage, regress $X$ on $Z$ to get $\hat X = Z(Z'Z)^{-1}Z'X$ and regress $Y$ on $Z$ to get $\hat Y = Z(Z'Z)^{-1}Z'Y$. In the second stage, regress $\hat Y$ on $\hat X$.
$$\operatorname{Var}(\hat\beta_{2SLS}) = \bigl[E(x_i z_i')E(z_i z_i')^{-1}E(x_i z_i')'\bigr]^{-1} \times E(x_i z_i')E(z_i z_i')^{-1}E(z_i u_i u_i z_i')E(z_i z_i')^{-1}E(x_i z_i')' \times \bigl[E(x_i z_i')E(z_i z_i')^{-1}E(x_i z_i')'\bigr]^{-1}$$
$$= \sigma^2 \bigl[E(x_i z_i')E(z_i z_i')^{-1}E(x_i z_i')'\bigr]^{-1}.$$
Under GMM,
$$D_0 = \left.\frac{\partial\varphi}{\partial\beta}\right|_{\beta_0} = \operatorname{plim} \frac{1}{n} \sum_{i=1}^{n} x_i z_i' = E(x_i z_i'), \qquad V_0 = \sigma^2 E(z_i z_i'),$$
so
$$\operatorname{Var}(\hat\beta_{GMM}) = (D_0' V_0^{-1} D_0)^{-1} = \Bigl(E(x_i z_i')'\bigl(\sigma^2 E(z_i z_i')\bigr)^{-1}E(x_i z_i')\Bigr)^{-1} = \sigma^2 \bigl(E(x_i z_i')E(z_i z_i')^{-1}E(x_i z_i')'\bigr)^{-1}.$$
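A quick numerical check of this equivalence on a simulated (hypothetical) overidentified design; GMM with weight proportional to $(Z'Z)^{-1}$ reproduces the 2SLS formula exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2000
Z = rng.normal(size=(T, 2))                                   # two instruments
u = rng.normal(size=T)
x = Z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=T)   # endogenous regressor
y = 2.0 * x + u
X = x[:, None]

# beta_2SLS = (X'Z(Z'Z)^{-1}Z'X)^{-1} X'Z(Z'Z)^{-1}Z'Y
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # first stage: X_hat = Z(Z'Z)^{-1}Z'X
beta_2sls = np.linalg.solve(X.T @ Xhat, Xhat.T @ y)

# GMM with moments (y_t - x_t beta) z_t and weight W = (Z'Z/T)^{-1}:
# beta = (X'Z W Z'X)^{-1} X'Z W Z'y, the same expression up to the scale of W
W = np.linalg.inv(Z.T @ Z / T)
A = X.T @ Z @ W @ Z.T
beta_gmm = np.linalg.solve(A @ X, A @ y)
assert np.allclose(beta_2sls, beta_gmm)
```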
(iii) Nonlinear least squares
$$y_t = \varphi(x_t; \beta) + u_t,$$
$$E\bigl(u_t \varphi(x_t; \beta)\bigr) = 0,$$
$$\min_{\beta \in B} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(y_t - \varphi(x_t; \beta)\bigr)\varphi(x_t; \beta)\right)' V_T^{-1} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \bigl(y_t - \varphi(x_t; \beta)\bigr)\varphi(x_t; \beta)\right).$$
(iv) Generalized method of moments
Any moment condition (usually derived from an economic model):
$$E\bigl(f_t(\beta)\bigr) = 0,$$
$$\min_{\beta \in B} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} f_t(\beta)\right)' V_T^{-1} \left(\frac{1}{\sqrt{T}} \sum_{t=1}^{T} f_t(\beta)\right),$$
where $\frac{1}{\sqrt{T}} \sum_{t=1}^{T} f_t(\beta_0) \sim N(0, V_0)$.
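In code, the general case simply takes a user-supplied moment function; a minimal sketch (the function signature is illustrative, not from the notes):

```python
import numpy as np

def gmm_criterion(beta, f, data, Vinv):
    """General GMM objective. f(beta, data) returns the T x n matrix
    whose rows are the moment observations f_t(beta)."""
    T = len(data)
    g = np.sqrt(T) * f(beta, data).mean(axis=0)   # (1/sqrt(T)) sum_t f_t(beta)
    return g @ Vinv @ g
```

The estimators above (OLS, IV, NLS) are just particular choices of $f_t$.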