
Influence functions and their uses in econometrics
Maximilian Kasy
Object of interest: a statistic φ(P) ∈ R, which is a function of the distribution P(X) of the vector X. Examples include E[X], Var(X), quantiles of X, the Gini coefficient, ...
What does it mean for φ to be differentiable at P_0? Intuitively, that there is a continuous linear functional Dφ such that

φ(P) ≈ φ(P_0) + Dφ(P − P_0).
On functional spaces, there are different notions of differentiability; which one is appropriate
depends on the context.
Strongest notion of differentiability: Fréchet derivative at P_0:

lim_{P → P_0} [(φ(P) − φ(P_0)) − Dφ(P − P_0)] / ‖P − P_0‖ = 0,

where Dφ is a continuous linear functional with respect to some norm ‖P‖, for instance the L2 norm on the space of densities, which is defined by

‖P‖ = √( ∫ (dP/dP_0)² dP_0 ).

The limit has to equal 0 for all sequences of measures P converging to P_0.
A weak notion of differentiability: directional or Gâteaux derivative:
lim_{t → 0} (1/t) [(φ(P_0 + t(P − P_0)) − φ(P_0)) − t · Dφ(P − P_0)] = 0,

where the limit has to equal 0 for all measures P.
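As a simple check of this definition, consider the mean functional (a worked example added here for intuition): for φ(P) = E_P[X] we have

φ(P_0 + t(P − P_0)) − φ(P_0) = t ∫ X d(P − P_0),

which is exactly linear in t, so the Gâteaux derivative exists and equals

Dφ(P − P_0) = ∫ X d(P − P_0) = E_P[X] − E_{P_0}[X].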
Riesz representation theorem:
Consider a Hilbert space H, i.e., a vector space equipped with an inner product ⟨·, ·⟩ and with the corresponding norm. Then for any continuous linear functional

ψ : H → R

there is an element x ∈ H such that

ψ(y) = ⟨x, y⟩

for all y ∈ H. The vector x is the dual representation of the linear functional ψ.
Intuition of this theorem for H = R^k, ⟨x, y⟩ = Σ_i x_i y_i: We can write any vector y as y = Σ_i y_i e_i, where e_i is the i-th unit vector. By linearity of the functional ψ,

ψ(y) = Σ_i y_i ψ(e_i) = ⟨x, y⟩

for x defined by x_i = ψ(e_i).
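A minimal numerical illustration of this finite-dimensional intuition, in Python (an added sketch; the particular functional ψ below is an arbitrary choice):

import numpy as np

# An arbitrary continuous linear functional on R^k, here psi(y) = c'y.
rng = np.random.default_rng(0)
k = 5
c = rng.normal(size=k)
psi = lambda y: c @ y

# Recover the Riesz representer x from the values of psi at the unit vectors.
x = np.array([psi(e) for e in np.eye(k)])

# Check psi(y) = <x, y> at a random y.
y = rng.normal(size=k)
print(np.isclose(psi(y), x @ y))  # True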
To apply this representation theorem in the present context, take H to be the space of all measurable functions of X which have mean zero and finite variance under P_0. This space can be equipped with the inner product

⟨y, z⟩ = E_0[y(X) · z(X)] = Cov(y, z).
Consider the derivative Dφ to be a functional applied to the relative density of P to P_0, minus the relative density of P_0 to P_0, which equals 1. If Dφ is a continuous functional on H, the Riesz representation theorem implies the existence of a mean zero, finite variance function IF(X), the influence function, such that

Dφ(dP/dP_0 − 1) = E_0[IF(X) · (dP/dP_0 − 1)(X)] = E_0[IF(X) · (dP/dP_0)(X)]
= ∫ IF(X) (dP/dP_0)(X) dP_0(X) = ∫ IF(X) dP(X) = E[IF(X)],

where the second equality uses E_0[IF(X)] = 0. Put differently, we have an approximation of the functional φ by the mean of IF:

φ(P) ≈ φ(P_0) + E[IF(X)].
Alternative, intuitive derivation of the influence function: Suppose we want an approximation for the plug-in estimator φ̂ = φ(P_n), where P_n is the empirical distribution

P_n = (1/n) Σ_i δ_{X_i}.
Using the approximation given by the derivative, and the linearity of the derivative, we get
φ̂ ≈ φ(P_0) + Dφ(P_n − P_0) = φ(P_0) + (1/n) Σ_i Dφ(δ_{X_i} − P_0)
  = φ(P_0) + E_n[Dφ(δ_{X_i} − P_0)],

where E_n denotes the sample average. This suggests that

IF(X) = Dφ(δ_X − P_0).
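This last formula also suggests a numerical recipe: approximate IF(x) by a finite-difference directional derivative from the empirical distribution toward a point mass at x. A minimal Python sketch for the variance functional (an added illustration; the sample and the step size t are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=10_000)          # sample playing the role of P_0

def phi(values, weights):
    # Variance functional of a discrete distribution.
    m = weights @ values
    return weights @ (values - m) ** 2

def influence(x, X, t=1e-4):
    # Dphi(delta_x - P_n), approximated by the finite difference
    # [phi((1 - t) P_n + t delta_x) - phi(P_n)] / t.
    n = len(X)
    values = np.append(X, x)
    w0 = np.append(np.full(n, 1.0 / n), 0.0)
    w1 = np.append(np.full(n, (1.0 - t) / n), t)
    return (phi(values, w1) - phi(values, w0)) / t

# Compare with the closed form derived at the end of these notes:
# IF(x) = (x - E[X])^2 - Var(X).
x = 2.0
print(influence(x, X), (x - X.mean()) ** 2 - X.var())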
Influence functions play a role in a number of different contexts in econometrics:
• Asymptotic distribution theory, efficiency bounds:
One can show that any “regular” estimator φ̂ of the just-identified parameter φ(P) is asymptotically equivalent to a linearised plug-in estimator,

φ̂ ≈ φ(P_0) + E_n[IF(X)].

This implies the asymptotic efficiency bound

n · Var(φ̂) → Var(IF(X));

see the first simulation sketch after this list.
Tsiatis, A. (2006). Semiparametric theory and missing data. Springer Verlag; in particular Chapter 3.
van der Vaart, A. (2000). Asymptotic statistics. Cambridge University Press; Chapter 20.
• Robust statistics:
Since estimators can be approximated by φ̂ ≈ φ(P_0) + E_n[IF(X)], the value of φ̂ can be dominated by a single outlier, even in large samples, unless IF is bounded.
Huber, P. (1996). Robust statistical procedures. Number 68. Society for Industrial and Applied Mathematics.
• Distributional decompositions:
In labor economics, we are often interested in counterfactual distributions of the form
P(Y) = ∫ P_1(Y | X) dP_2(X),
where we observe samples from the distributions 1 and 2. In order to estimate φ(P), we can again use the approximation

φ(P) ≈ φ(P_2) + ∫ IF(Y) dP(Y) = φ(P_2) + ∫∫ IF(Y) dP_1(Y | X) dP_2(X)
     = φ(P_2) + E_2[E_1[IF(Y) | X]].
The conditional expectation E_1[IF(Y) | X] can be estimated using regression methods, and the expectation with respect to X can be estimated using the P_2 sample average of predicted values from the regression; see the second sketch after this list.
Firpo, S., Fortin, N., and Lemieux, T. (2009). Unconditional quantile regressions. Econometrica, 77(3):953–973.
• Partial identification:
Nonparametric models with endogeneity, such as those discussed in the first part of class, tend to lead to representations of potential outcome distributions of the form

P(Y^d) = α · P_1(Y) + (1 − α) · P_2(Y^d),
where draws from P_1(Y) are observable, while the data are uninformative about P_2(Y^d). A linear approximation to φ(P(Y^d)) then implies

φ(P(Y^d)) − φ(P_0) ≈ α · Dφ(P_1(Y) − P_0) + (1 − α) · Dφ(P_2(Y^d) − P_0).

The first term here is identified; the second term can be bounded if and only if Dφ is bounded on the admissible counterfactual distributions P_2(Y^d), the same condition as in robust statistics.
Kasy, M. (2012). Partial identification, distributional preferences, and the welfare ranking of policies. Working paper, Section 3.
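The following two Python sketches are additions to these notes: minimal simulations under arbitrary, hypothetical data-generating processes, meant only to illustrate the claims above.

First, as announced in the asymptotics bullet, a check of the linearisation φ̂ ≈ φ(P_0) + E_n[IF(X)] and the implied efficiency bound, for the variance functional (whose influence function is derived at the end of these notes):

import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2_000

estimates = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=n)        # P_0 = N(0, 1), so Var(X) = 1
    estimates[r] = X.var()        # plug-in estimator phi(P_n)

# IF(X) = (X - E[X])^2 - Var(X); under N(0, 1) its variance is
# Var(X^2 - 1) = E[X^4] - 1 = 2.
print(n * estimates.var())        # should be close to Var(IF(X)) = 2

Second, as referenced in the decomposition bullet, the regression-based estimator of φ(P_2) + E_2[E_1[IF(Y) | X]], for the simplest case of the mean functional φ(P) = E[Y], where IF(Y) = Y − E_2[Y]:

import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 2_000, 2_000

# Hypothetical samples: (Y, X) from population 1, (Y, X) from population 2.
X1 = rng.normal(size=n1)
Y1 = 1.0 + 2.0 * X1 + rng.normal(size=n1)   # P_1(Y | X)
X2 = rng.normal(loc=0.5, size=n2)           # P_2(X)
Y2 = 1.0 + 2.0 * X2 + rng.normal(size=n2)   # used only for phi(P_2) = E_2[Y]

# Regress IF(Y) = Y - E_2[Y] on X in sample 1 (here by OLS),
# predict on sample 2, and average the predictions.
IF1 = Y1 - Y2.mean()
beta = np.polyfit(X1, IF1, deg=1)
correction = np.polyval(beta, X2).mean()

print(Y2.mean() + correction)               # estimate of E_2[E_1[Y | X]]
print(1.0 + 2.0 * X2.mean())                # population counterfactual mean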
How can we actually calculate influence functions?
The easiest way is via directional derivatives: Consider families of distributions P(X; θ) indexed by a parameter θ ∈ R. Then the map θ ↦ φ(P(·; θ)) from R to R is a function we can easily differentiate. Next, do some algebra to get the resulting expression into the form
∂φ/∂θ = ∫ IF(X) (∂/∂θ) dP(X; θ).
Finally, normalize, by adding a constant, so that IF has mean zero.
Example:
φ(P) = Var(X) = ∫ X² dP − (∫ X dP)².

Thus

∂φ/∂θ = ∫ X² (∂/∂θ) dP − 2 (∫ X dP) (∫ X (∂/∂θ) dP)
      = ∫ (X² − 2E[X] · X) (∂/∂θ) dP.
Normalizing, as required, to E[IF] = 0, we get

IF(X) = X² − 2E[X] · X − Var(X) + E[X]².
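Note that this expression simplifies to IF(X) = (X − E[X])² − Var(X), which makes E_0[IF(X)] = 0 immediate. As a closing illustration (an added sketch; P_0 and the perturbed distribution P are arbitrary choices), the following Python snippet checks both E_0[IF] = 0 and the approximation φ(P) ≈ φ(P_0) + E[IF(X)]:

import numpy as np

rng = np.random.default_rng(4)

X0 = rng.normal(size=200_000)                 # P_0 = N(0, 1)
IF = lambda x: (x - X0.mean()) ** 2 - X0.var()

print(IF(X0).mean())                          # E_0[IF] ~ 0

# A nearby distribution P = N(0.1, 1.1^2).
X = rng.normal(loc=0.1, scale=1.1, size=200_000)
print(X.var() - X0.var())                     # phi(P) - phi(P_0) ~ 0.21
print(IF(X).mean())                           # E_P[IF] ~ 0.22; the gap is second order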