The influence function of the Stahel-Donoho estimator
of multivariate location and scatter
Daniel Gervini∗
University of Zurich
Abstract
This article derives the influence function of the Stahel-Donoho estimator of
multivariate location and scatter for elliptical distributions. Local robustness and
asymptotic relative efficiency are studied. The expressions obtained for the influence
functions coincide with those of one-step reweighted estimators.
Key words and phrases: asymptotic relative efficiency, gross-error sensitivity,
local robustness, reweighted estimators.
∗Department of Biostatistics, Sumatrastrasse 30, 8006 Zurich, Switzerland
1 Introduction
The statistical analysis of multidimensional data typically relies, explicitly or implicitly, on estimators of location and scatter. For example, principal component analysis, canonical correlation, and discrimination are all techniques based on the scatter estimator.
The most common estimators are the sample mean and the sample covariance matrix,
which are optimal if the data follows a normal distribution but are otherwise extremely
sensitive to outliers. Since visual detection of outliers is unworkable in high dimensions,
robust estimators are indispensable. An overview of robust estimation in the multivariate
setting is given in Maronna and Yohai (1998). Among the estimators considered in that
article, those proposed independently by Stahel (1981) and Donoho (1982) (which will
be called S-D estimators hereafter) possess very good robust properties. They were the
first equivariant estimators with maximum asymptotic breakdown point 1/2 (see Hampel, Rousseeuw, Ronchetti and Stahel, 1986, Theorem 5.5.3), and were in fact introduced
for that reason. More recently, Maronna and Yohai (1995) computed the maximum bias
function for contaminated normal models and carried out extensive simulations, showing
that properly tuned S-D estimators outperform other robust estimators in many situations. Nevertheless, these estimators have not received much attention in comparison
with other robust proposals. This is probably due to the practical difficulties involved in their computation. Another reason may have been the lack of asymptotic theory, a gap that has been partly filled in an as yet unpublished manuscript by Zuo, Cui and He (2002).
In this article we derive the influence function of the S-D estimator for elliptical
models. The influence function allows us to study the local robustness and the asymptotic efficiency of these estimators, providing a rationale for choosing appropriate weight
functions and tuning parameters. It turns out that properly tuned S-D estimators sometimes outperform, and are always at least comparable to, the more popular S-estimators
(Davies, 1987).
This article is organized as follows. Section 2 formally introduces the S-D estimator
and reviews some basics about elliptical distributions and influence functions. Section 3
gives some non-standard results about univariate location-scale estimators of projections that are needed in Section 4, where the influence function of the S-D estimator is derived
and the resulting properties are discussed. Proofs are left to the Appendix.
2 Definitions
The Stahel-Donoho estimator of location and scatter is defined as follows. Given a sample $X = (x_1, \ldots, x_n)$ in $\mathbb{R}^p$, consider all univariate projections $a^\top X$ with $a \in \mathbb{R}^p$, $a \neq 0$.
Let $m(\cdot)$ and $s(\cdot)$ be univariate estimators of location and scale. Then a measure of outlyingness of each $a^\top x_i$ is the standardized distance $|a^\top x_i - m(a^\top X)|/s(a^\top X)$. The overall outlyingness of the point $x_i$ is then defined as
$$r(x_i; X) = \sup_{a \in \mathbb{R}^p} \frac{|a^\top x_i - m(a^\top X)|}{s(a^\top X)}.$$
The S-D estimator downweights each observation according to its overall outlyingness.
The location estimator $T_n$ is then a weighted mean and the scatter estimator $V_n$ is a weighted covariance matrix:
$$T_n = \frac{\mathrm{ave}\{w(r^2(x_i; X))\, x_i\}}{\mathrm{ave}\{w(r^2(x_i; X))\}}, \qquad
V_n = \frac{\mathrm{ave}\{w(r^2(x_i; X))\,(x_i - T_n)(x_i - T_n)^\top\}}{\mathrm{ave}\{w(r^2(x_i; X))\}},$$
where w is a non-negative and usually non-increasing function. In what follows, we will
express estimators as functionals on the space of distribution functions. Thus Tn = T (Fn )
and Vn = V (Fn ), where Fn denotes the empirical distribution function of X.
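As a concrete numerical illustration (my own sketch, not code from the paper), the estimator can be approximated by maximizing the outlyingness over a finite set of random directions, with the median and standardized MAD playing the roles of $m$ and $s$; the direction count `n_dirs` and the Huber-type cut-off `c` are arbitrary choices:

```python
import math, random

def median(v):
    s = sorted(v); n = len(s)
    return s[n//2] if n % 2 else 0.5*(s[n//2 - 1] + s[n//2])

def mad(v):
    # standardized median absolute deviation (consistency factor 1/Phi^{-1}(3/4))
    m = median(v)
    return 1.4826*median([abs(x - m) for x in v])

def sd_outlyingness(X, n_dirs=200, rng=None):
    """Approximate r(x_i; X) by maximizing over random unit directions a."""
    rng = rng or random.Random(0)
    p = len(X[0]); r = [0.0]*len(X)
    for _ in range(n_dirs):
        a = [rng.gauss(0, 1) for _ in range(p)]
        na = math.sqrt(sum(c*c for c in a)); a = [c/na for c in a]
        proj = [sum(ai*xi for ai, xi in zip(a, x)) for x in X]
        m, s = median(proj), mad(proj)
        if s > 0:
            for i, t in enumerate(proj):
                r[i] = max(r[i], abs(t - m)/s)
    return r

def sd_location(X, c=6.0, **kw):
    """Weighted mean with Huber-type weights w(r^2) = min(r^2, c)/r^2."""
    r = sd_outlyingness(X, **kw)
    w = [min(ri*ri, c)/(ri*ri) if ri > 0 else 1.0 for ri in r]
    p = len(X[0]); W = sum(w)
    return [sum(wi*x[j] for wi, x in zip(w, X))/W for j in range(p)]
```

In practice the supremum over all directions can only be approximated; direction sets generated from the data (subsampling) are the usual device.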
To analyze the asymptotic properties of these functionals, we will restrict ourselves
to the class of elliptical distributions as target models. These models are flexible enough
to accommodate heavy-tailed distributions without finite moments, while possessing well
defined location and scatter parameters. A random vector $Y \in \mathbb{R}^p$ has an elliptical distribution if its density is of the form
$$f(y) = |\Sigma|^{-1/2}\, h\big((y - \mu)^\top \Sigma^{-1} (y - \mu)\big), \qquad (1)$$
where $h$ is a non-negative and integrable function, $\mu \in \mathbb{R}^p$ is the location parameter, and $\Sigma \in \mathrm{SP}(p)$ is the scatter parameter ($\mathrm{SP}(p)$ denotes the family of symmetric positive definite matrices). The vector $X = \Sigma^{-1/2}(Y - \mu)$, where $\Sigma^{1/2} \in \mathrm{SP}(p)$ is the unique symmetric square root of $\Sigma$, has density $f(x) = h(\|x\|^2)$ and its distribution is called spherical. A more general definition of sphericity, which does not need existence of a density function, requires that the distribution of $X$ be rotationally invariant. That is, $\mathcal{L}(\Gamma X) = \mathcal{L}(X)$ for every $\Gamma \in O(p)$, where $O(p)$ is the family of orthogonal matrices.

Several properties of spherical distributions are given in Bilodeau and Brenner (1999, chapters 4 and 13) and Hampel et al. (1986, chapter 5). Two of these, which will be used many times in this paper, are that $\mathcal{L}(a^\top X) = \mathcal{L}(\|a\| X_1)$ for any $a \in \mathbb{R}^p$, with $\mathcal{L}(X_1)$ symmetric about zero, and that $X$ can be factorized as $X = RU$, where $R = \|X\|$ is stochastically independent of $U = X/\|X\|$ and $U$ has the uniform distribution on $S^{p-1}$ (the unit sphere in $\mathbb{R}^p$). Then $E(U) = 0$ and $E(UU^\top) = p^{-1} I$. Moreover, if $W = (U_1, \ldots, U_k)^\top$ with $1 \le k < p$, then $\|W\|^2 \sim B(k/2, (p-k)/2)$ and $W$ has density
$$f(w) = \frac{\Gamma(\frac{p}{2})}{\pi^{k/2}\, \Gamma(\frac{p-k}{2})}\, (1 - \|w\|^2)^{\frac{p-k}{2} - 1}, \qquad 0 < \|w\|^2 < 1.$$
In addition, $R^2$ has density
$$f_{R^2}(t) = \frac{\pi^{p/2}}{\Gamma(\frac{p}{2})}\, t^{\frac{p}{2} - 1} h(t).$$
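These distributional facts are easy to check by simulation. The sketch below (an added illustration, not part of the paper) draws $U$ uniformly on $S^{p-1}$ by normalizing a standard normal vector; the dimension and tolerances in the accompanying check are arbitrary:

```python
import math, random

def runit(rng, p):
    """One uniform draw on the unit sphere S^{p-1} (normalized Gaussian vector)."""
    z = [rng.gauss(0, 1) for _ in range(p)]
    n = math.sqrt(sum(c*c for c in z))
    return [c/n for c in z]
```

A Monte Carlo run then verifies $E(U_1) = 0$, $E(U_1^2) = 1/p$, and the Beta mean $E\|W\|^2 = k/p$ for the first $k$ coordinates.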
Two elliptical models of importance are the multivariate normal $N_p(\mu, \Sigma)$, whose density is determined by
$$h(t) = \frac{e^{-t/2}}{(2\pi)^{p/2}},$$
and the multivariate $T$ on $\nu$ degrees of freedom, $T_{p,\nu}(\mu, \Sigma)$, whose density is determined by
$$h(t) = \frac{\Gamma\big(\frac{\nu+p}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)(\nu\pi)^{p/2}}\Big(\frac{\nu}{\nu + t}\Big)^{\frac{\nu+p}{2}}.$$
If $X \sim N_p(0, I)$ then $R^2 \sim \chi^2_p$ and the marginals of $X$ are independent $N_1(0, 1)$. The $T_{p,\nu}(0, I)$ distribution can be characterized as the distribution of $X = Z/(Q/\nu)^{1/2}$ with $Z \sim N_p(0, I)$, $Q \sim \chi^2_\nu$, $Z$ and $Q$ independent; then $R^2/p \sim F_{p,\nu}$, or equivalently $(1 + \nu/R^2)^{-1} \sim B(p/2, \nu/2)$, and the marginals of $X$ have $T_{1,\nu}$ distributions (but are not independent).
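The stochastic representation translates directly into a sampler. The following sketch (my illustration; the choices $p = 3$, $\nu = 5$ in the accompanying check are arbitrary) draws from $T_{p,\nu}(0, I)$ and checks the Beta fact through its mean $p/(p+\nu)$:

```python
import math, random

def rmvt(rng, p, nu):
    """One draw from T_{p,nu}(0, I) via X = Z / sqrt(Q/nu), Q ~ chi^2_nu."""
    z = [rng.gauss(0, 1) for _ in range(p)]
    q = sum(rng.gauss(0, 1)**2 for _ in range(nu))  # chi-square with integer df
    f = math.sqrt(q/nu)
    return [zi/f for zi in z]
```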
Let us turn now to the influence function. Let $T$ and $V$ be location and scatter functionals which are well defined on elliptical distributions and their point-mass contaminations $F_{\varepsilon,z} = (1-\varepsilon)F + \varepsilon\Delta_z$, for small enough $\varepsilon$. The influence function of $T$ at $F$ is
$$IF(z; T, F) = \lim_{\varepsilon \downarrow 0} \frac{T(F_{\varepsilon,z}) - T(F)}{\varepsilon},$$
whenever the limit exists. The definition of $IF(z; V, F)$ is analogous. We will restrict ourselves to affine equivariant estimators. Therefore, if $F_{\mu,\Sigma}$ is an elliptical distribution as in (1), we have that $T(F_{\mu,\Sigma}) = \Sigma^{1/2} T(F_{0,I}) + \mu$ and $V(F_{\mu,\Sigma}) = \Sigma^{1/2} V(F_{0,I}) \Sigma^{1/2}$, which implies that
$$IF(z; T, F_{\mu,\Sigma}) = \Sigma^{1/2}\, IF(\Sigma^{-1/2}(z - \mu); T, F_{0,I}),$$
$$IF(z; V, F_{\mu,\Sigma}) = \Sigma^{1/2}\, IF(\Sigma^{-1/2}(z - \mu); V, F_{0,I})\, \Sigma^{1/2},$$
where $F_{0,I}$ is spherical. Thus, without loss of generality we can and will work exclusively with spherical distributions from now on. For arbitrary equivariant estimators and spherical distributions, Hampel et al. (1986, chapter 5.3) show that $T(F) = 0$ and $V(F) = \beta_{V,F}\, I$ for some $\beta_{V,F} > 0$, and that the influence functions are of the form
$$IF(z; T, F) = z\, w_\mu(\|z\|^2), \qquad (2)$$
$$IF(z; V, F) = \Big(zz^\top - \frac{\|z\|^2}{p} I\Big) w_\eta(\|z\|^2) + \frac{1}{p}\, I\, w_\tau(\|z\|^2), \qquad (3)$$
for non-negative functions $w_\mu$ and $w_\eta$, and a function $w_\tau$ such that $E\{w_\tau(R^2)\} = 0$.
It is known that, under appropriate conditions, $T(F_n)$ is asymptotically normal with covariance matrix given by
$$C_T = E_F\{IF(X; T, F)\, IF(X; T, F)^\top\} = d_\mu I, \qquad (4)$$
with $d_\mu = E\{R^2 w_\mu^2(R^2)\}/p$. A similar result holds for $\mathrm{vecs}\, V(F_n)$, where $\mathrm{vecs}$ is the operation that stacks the $p + p(p-1)/2$ non-redundant elements of $V$ into a vector, as follows: $\mathrm{vecs}\, V = (v_{11}/\sqrt{2}, \ldots, v_{pp}/\sqrt{2}, v_{21}, v_{31}, v_{32}, \ldots, v_{p,p-1})^\top$. The asymptotic covariance of $\mathrm{vecs}\, V(F_n)$ is
$$C_V = E_F\{\mathrm{vecs}\, IF(X; V, F)\, \mathrm{vecs}\, IF(X; V, F)^\top\} = d_\eta\Big(I - \frac{1}{p} ww^\top\Big) + d_\tau \frac{1}{p} ww^\top, \qquad (5)$$
where $w^\top = (1_p^\top, 0_{p(p-1)/2}^\top)$, $d_\eta = E\{R^4 w_\eta^2(R^2)\}/(p(p+2))$ and $d_\tau = E\{w_\tau^2(R^2)\}/(2p)$. Note that $C_T^{-1}$ and $C_V^{-1}$ have the same forms as in (4) and (5) with $d_\mu$, $d_\eta$ and $d_\tau$ replaced by their respective inverses.
Useful summaries of local robustness are the information-standardized gross-error sensitivities,
$$\gamma_T^*(F) = \sup_{z \in \mathbb{R}^p} \{IF(z; T, F)^\top I_1(F)\, IF(z; T, F)\}^{1/2},$$
$$\gamma_V^*(F) = \sup_{z \in \mathbb{R}^p} \{\mathrm{vecs}(IF(z; V, F))^\top I_2(F)\, \mathrm{vecs}(IF(z; V, F))\}^{1/2},$$
where $I_1$ and $I_2$ are the Fisher information matrices of $F$ for location and scatter, namely $I_1(F) = E\{\ell^2(\|X\|^2) XX^\top\}$ and $I_2(F) = E(MM^\top)$, with $M = \mathrm{vecs}(-\frac{1}{2} I + \frac{1}{2}\ell(\|X\|^2) XX^\top)$ and $\ell(t) = -2h'(t)/h(t)$. These matrices have the forms (4) and (5) with $d_\mu^* = E\{R^2 \ell^2(R^2)\}/p$, $d_\eta^* = E\{R^4 \ell^2(R^2)\}/(p(p+2))$ and $d_\tau^* = E\{(R^2\ell(R^2) - p)^2\}/(2p) = ((p+2)d_\eta^* - p)/2$. Then
$$\gamma_T^*(F) = \sup_{t \ge 0} |t\, w_\mu(t^2)|\, d_\mu^{*\,1/2},$$
$$\gamma_V^*(F) = \sup_{t \ge 0} \Big\{\frac{1}{2}\Big(1 - \frac{1}{p}\Big) t^4 w_\eta^2(t^2)\, d_\eta^* + \frac{1}{2p}\, w_\tau^2(t^2)\, d_\tau^*\Big\}^{1/2}.$$
For the $N_p(0, I)$ distribution we have $d_\mu^* = d_\eta^* = d_\tau^* = 1$, while for the $T_{p,\nu}(0, I)$ distribution $d_\mu^* = d_\eta^* = (p+\nu)/(p+\nu+2)$ and $d_\tau^* = \nu/(p+\nu+2)$.
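For the $T_{p,\nu}$ density one finds $\ell(t) = (\nu + p)/(\nu + t)$, so the constant $d_\mu^* = (p+\nu)/(p+\nu+2)$ can be verified by Monte Carlo. The sketch below is my own check (sample size, seed and tolerance are arbitrary):

```python
import math, random

def d_mu_star_mc(p, nu, n, seed=0):
    """Monte Carlo estimate of E{R^2 l^2(R^2)}/p for the T_{p,nu}(0,I) model,
    using l(t) = (nu+p)/(nu+t) and the representation X = Z/sqrt(Q/nu)."""
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(n):
        z2 = sum(rng.gauss(0, 1)**2 for _ in range(p))
        q = sum(rng.gauss(0, 1)**2 for _ in range(nu))
        r2 = z2/(q/nu)
        l = (nu + p)/(nu + r2)
        tot += r2*l*l
    return tot/(n*p)
```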
What is of primary concern in many applications is not the estimation of $\Sigma$ itself but of a shape component of $\Sigma$. A shape component is a function $S$ such that $S(\lambda\Sigma) = S(\Sigma)$ for all $\lambda > 0$. Two common examples are the eigenvectors and the ratios of eigenvalues. The influence function of a shape component does not depend on $w_\tau$ (Kent and Tyler, 1997). In particular, the influence function of $S(V) = V/(\mathrm{tr}\, V/p)$ is
$$IF(z; S, F) = \frac{1}{\beta_{V,F}}\Big(zz^\top - \frac{\|z\|^2}{p} I\Big) w_\eta(\|z\|^2),$$
and then $C_S$ is like (5) with $d_{\tau,S} = 0$ and $d_{\eta,S} = d_{\eta,V}/\beta_{V,F}^2$. Then
$$\gamma_S^*(F) = \sup_{t \ge 0} |t^2 w_\eta(t^2)|\, \Big\{\frac{1}{2}\Big(1 - \frac{1}{p}\Big)\frac{d_\eta^*}{\beta_{V,F}^2}\Big\}^{1/2}.$$
In this article we will use $\gamma_S^*$ instead of $\gamma_V^*$ to evaluate the robustness of the scatter estimator. An additional advantage of working with $S(V)$ instead of $V$ is that the asymptotic relative efficiency with respect to the maximum likelihood estimator is determined by a single scalar, because the asymptotic covariance is just a multiple of $I - \frac{1}{p} ww^\top$. Thus, we have
$$ARE(T; F) = \frac{1/d_\mu^*}{d_\mu}, \qquad ARE(S; F) = \frac{1/d_\eta^*}{d_\eta/\beta_{V,F}^2}.$$
To summarize, in this article we will judge the performance of the estimators on the basis of $\gamma_T^*$, $\gamma_V^*$, $ARE(T)$ and $ARE(S)$.
Going back to the S-D estimators, their functional forms are given by
$$T(F) = \frac{E_F\{w(r^2(X; F))\, X\}}{E_F\{w(r^2(X; F))\}}, \qquad
V(F) = \frac{E_F\{w(r^2(X; F))(X - T(F))(X - T(F))^\top\}}{E_F\{w(r^2(X; F))\}},$$
where
$$r(x; F) = \sup_{a \in \mathbb{R}^p} \frac{|a^\top x - m(F^a)|}{s(F^a)}.$$
Here $F^a = \mathcal{L}(a^\top X)$ and $m(\cdot)$ and $s(\cdot)$ are univariate location and scale functionals. The S-D estimators are well defined for any $F$ if the weight function satisfies:

W1. $w: [0, \infty) \to [0, \infty)$ is bounded and $w(u^2)u^2$ is bounded.

In order that $T$ and $V$ be affine equivariant we also have to assume that the univariate estimators $m$ and $s$ are equivariant, that is, $m(\mathcal{L}(\alpha Z + \beta)) = \alpha\, m(\mathcal{L}(Z)) + \beta$ and $s(\mathcal{L}(\alpha Z + \beta)) = |\alpha|\, s(\mathcal{L}(Z))$ for any $\alpha, \beta \in \mathbb{R}$. This implies that $m(\mathcal{L}(Z)) = 0$ if $\mathcal{L}(Z)$ is symmetric about zero, and that
$$r(x; F) = \sup_{a \in S^{p-1}_+} \frac{|a^\top x - m(F^a)|}{s(F^a)},$$
where $S^{p-1}_+ = \{a \in S^{p-1} : a_1 \ge 0\}$. For a spherical distribution $F$ we have $m(F^a) = 0$ and $s(F^a) = s_0$ for all $a \in S^{p-1}$, so that $r^2(x; F) = \|x\|^2/s_0^2$. Then $T(F) = 0$ and
$$V(F) = \frac{E\{w(R^2/s_0^2)\, R^2\}}{p\, E\{w(R^2/s_0^2)\}}\, I.$$
To obtain the influence functions of $T$ and $V$, note that
$$T(F_{\varepsilon,z}) = \frac{(1-\varepsilon)\, h_1(\varepsilon) + \varepsilon\, w(r^2(z; F_{\varepsilon,z}))\, z}{h_0(\varepsilon)},$$
$$V(F_{\varepsilon,z}) = \frac{(1-\varepsilon)\, h_2(\varepsilon) + \varepsilon\, w(r^2(z; F_{\varepsilon,z}))(z - T(F_{\varepsilon,z}))(z - T(F_{\varepsilon,z}))^\top}{h_0(\varepsilon)},$$
where
$$h_0(\varepsilon) = E_F\{w(r^2(X; F_{\varepsilon,z}))\},$$
$$h_1(\varepsilon) = E_F\{w(r^2(X; F_{\varepsilon,z}))\, X\},$$
$$h_2(\varepsilon) = E_F\{w(r^2(X; F_{\varepsilon,z}))(X - T(F_{\varepsilon,z}))(X - T(F_{\varepsilon,z}))^\top\}.$$
Therefore
$$IF(z; T, F) = \frac{h_1'(0) + w(r^2(z; F))\, z}{h_0(0)}, \qquad
IF(z; V, F) = \frac{h_2'(0) + w(r^2(z; F))\, zz^\top - h_2(0)}{h_0(0)}.$$
Clearly, the problem is to obtain $h_1'(0)$ and $h_2'(0)$. This is not so straightforward because it involves computing the derivative of $r(x; F_{\varepsilon,z})$ with respect to $\varepsilon$, which in turn requires some non-standard results about the derivatives of $m(F^a_{\varepsilon,z})$ and $s(F^a_{\varepsilon,z})$. These results are given in the next section.
3 Derivatives of location-scale estimators of projections and of the outlyingness measure
Since it is necessary to find the derivatives of $m(F^a_{\varepsilon,z})$ and $s(F^a_{\varepsilon,z})$ with respect to $a$ and $\varepsilon$, we will restrict ourselves to the class of univariate location-scale estimators that are solutions of simultaneous M-estimating equations. This class is rich enough to include all estimators used in practice, and allows us to obtain nice expressions for the derivatives under fairly general conditions.
Let us consider, then, a pair of location-scale functionals $(m, s)$ that solve
$$E\Big\{\psi\Big(\frac{Z - m}{s}\Big)\Big\} = 0, \qquad (6)$$
$$E\Big\{\rho\Big(\frac{Z - m}{s}\Big)\Big\} = K. \qquad (7)$$
Assume that
A1. ψ is odd and ρ is even.
A2. $\psi$ and $\rho$ are twice continuously differentiable and $\psi(u)$, $\psi'(u)u$, $\psi''(u)u^2$, $\rho(u)$, $\rho'(u)u$, and $\rho''(u)u^2$ are bounded.
Two important cases, for which the resulting S-D estimators attain the maximum asymptotic breakdown point of 50%, are:

• The median and the standardized median absolute deviation, corresponding to $\psi(u) = \mathrm{sign}(u)$, $\rho(u) = I(|u| \ge \Phi^{-1}(3/4))$ and $K = 1/2$.

• The maximum likelihood estimator of location and scale of a univariate Cauchy distribution, corresponding to $\psi(u) = u/(1+u^2)$, $\rho(u) = u^2/(1+u^2)$ and $K = 1/2$.
Note that A2 is satisfied in the Cauchy case but not in the med/mad case. Nevertheless, the final formulas of Theorem 3 can be applied in this case as well, if the central
distribution has a density.
If $X$ is a random vector with spherical distribution $F$, and $F_1$ denotes the marginal distribution of $X_1$, then A1 implies that $(m(F^a), s(F^a)) = (0, \|a\| s_0)$ for every $a \in \mathbb{R}^p$, where $s_0 = s_0(F)$ is implicitly defined by
$$E_{F_1}\Big\{\rho\Big(\frac{X_1}{s_0}\Big)\Big\} = K. \qquad (8)$$
Typically $\rho$ is calibrated so that $s_0 = 1$ when $F = N_p(0, I)$, but this need not be the case. All we assume is the following:

A3. Equation (8) admits a solution $s_0 > 0$ for which $E_{F_1}\{\psi'(X_1/s_0)\} \neq 0$ and $E_{F_1}\{\rho'(X_1/s_0)\, X_1/s_0\} \neq 0$.
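As a small numerical illustration (mine, not the paper's; the quadrature step and bisection bracket are arbitrary choices), $s_0$ for the Cauchy $(\psi, \rho)$ pair under $F_1 = N(0, 1)$ can be obtained by solving (8) with $K = 1/2$ by bisection, using that $E\{\rho(X_1/s)\}$ decreases in $s$:

```python
import math

def rho_cauchy(u):
    # rho(u) = u^2/(1+u^2), from the Cauchy maximum likelihood pair
    return u*u/(1.0 + u*u)

def e_rho_normal(s, h=0.002, lim=8.0):
    """E{rho(X1/s)} for X1 ~ N(0,1), by a midpoint Riemann sum."""
    tot, x = 0.0, -lim + h/2
    while x < lim:
        tot += rho_cauchy(x/s)*math.exp(-x*x/2)/math.sqrt(2*math.pi)*h
        x += h
    return tot

def solve_s0(K=0.5, lo=0.05, hi=5.0, iters=50):
    """Bisection for E{rho(X1/s0)} = K; the expectation is decreasing in s."""
    for _ in range(iters):
        mid = 0.5*(lo + hi)
        if e_rho_normal(mid) > K:
            lo = mid  # expectation too large, need larger s
        else:
            hi = mid
    return 0.5*(lo + hi)
```

With this calibration $s_0 < 1$ under the normal model, so the Cauchy pair is an example where $s_0 = 1$ does not hold.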
The next proposition gives sufficient conditions for A3 to hold. These conditions are satisfied by most estimators and distributions that appear in practice (for the med/mad case, an ad-hoc proof is possible). Theorem 1 below provides expressions for the partial derivatives of $m(F^a_{\varepsilon,z})$ and $s(F^a_{\varepsilon,z})$ that are needed to compute the derivative of $r(x; F_{\varepsilon,z})$ at $\varepsilon = 0$, which is given in Theorem 2.
Proposition 1 Equation (8) admits a solution $s_0 \in (0, \infty)$ if $\rho$ is bounded, even, non-decreasing on $[0, \infty)$, $\rho(0) < K < \rho(\infty)$, and $X_1$ has a distribution symmetric about zero. In addition, A3 holds if $\psi$ is bounded, odd, non-negative on $[0, \infty)$, both $\psi(|u|)$ and $\rho'(|u|)$ are strictly positive in a neighborhood of zero (except at zero), and $P(X_1 = 0) < 1$.
Theorem 1 Let $F_{\varepsilon,z} = (1-\varepsilon)F + \varepsilon\Delta_z$ with $F$ a spherical distribution. Suppose that A1, A2 and A3 hold. For each $z \in \mathbb{R}^p$, let $G(\varepsilon, a) = (m(F^a_{\varepsilon,z}), s(F^a_{\varepsilon,z}))$. Then $G$ is twice continuously differentiable at $(0, a)$ for every $a \neq 0$, and if $\tilde{a} = a/\|a\|$, the partial derivatives are:
$$\frac{\partial G_1}{\partial \varepsilon}(0, a) = \frac{\|a\|\, s_0\, \psi(\tilde{a}^\top z/s_0)}{E_F\{\psi'(X_1/s_0)\}}, \qquad
\frac{\partial G_1}{\partial a}(0, a) = 0,$$
$$\frac{\partial G_2}{\partial \varepsilon}(0, a) = \frac{\|a\|\, s_0\, \{\rho(\tilde{a}^\top z/s_0) - K\}}{E_F\{\rho'(X_1/s_0)\, X_1/s_0\}}, \qquad
\frac{\partial G_2}{\partial a}(0, a) = \tilde{a}\, s_0,$$
$$\frac{\partial^2 G_1}{\partial a\, \partial\varepsilon}(0, a) = \frac{\tilde{a}\, \psi(\tilde{a}^\top z/s_0)\, s_0 + \psi'(\tilde{a}^\top z/s_0)(I - \tilde{a}\tilde{a}^\top)\, z}{E_F\{\psi'(X_1/s_0)\}}, \qquad
\frac{\partial^2 G_1}{\partial a\, \partial a^\top}(0, a) = 0,$$
$$\frac{\partial^2 G_2}{\partial a\, \partial\varepsilon}(0, a) = \frac{\tilde{a}\, \{\rho(\tilde{a}^\top z/s_0) - K\}\, s_0 + \rho'(\tilde{a}^\top z/s_0)(I - \tilde{a}\tilde{a}^\top)\, z}{E_F\{\rho'(X_1/s_0)\, X_1/s_0\}}, \qquad
\frac{\partial^2 G_2}{\partial a\, \partial a^\top}(0, a) = \frac{1}{\|a\|}(I - \tilde{a}\tilde{a}^\top)\, s_0.$$
Theorem 2 Under the conditions of Theorem 1, $r^2(x; F_{\varepsilon,z})$ is differentiable at $\varepsilon = 0$ for each $x, z \in \mathbb{R}^p$, $x \neq 0$, and
$$\frac{\partial}{\partial\varepsilon}\, r^2(x; F_{\varepsilon,z})\Big|_{\varepsilon=0} = -2\, \frac{\|x\|\, \psi(\tilde{x}^\top z/s_0)}{s_0\, E_F\{\psi'(X_1/s_0)\}} - 2\, \frac{\|x\|^2\, \{\rho(\tilde{x}^\top z/s_0) - K\}}{s_0^2\, E_F\{\rho'(X_1/s_0)\, X_1/s_0\}}, \qquad (9)$$
where $\tilde{x} = x/\|x\|$.
4 Influence function of S-D estimators
Now we can resume the derivation of the influence function of S-D estimators. Theorem 3 below gives the expressions of $IF(z; T, F)$ and $IF(z; V, F)$. For this theorem we need an additional assumption:

W2. $w$ is differentiable almost everywhere and $\sup_u |w'(u^2)\, u^4| < \infty$.

Theorem 3 Let $F$ be a spherical distribution such that $P_F(X = 0) = 0$ and assume that A1, A2, A3, W1 and W2 hold. Then
$$IF(z; T, F) = \Big\{\frac{c_1(F)}{c_0(F)}\, g_1(\|z\|) + \frac{w(\|z\|^2/s_0^2)\, \|z\|}{c_0(F)}\Big\}\, \frac{z}{\|z\|}$$
and
$$IF(z; V, F) = \Big\{\frac{c_2(F)}{c_0(F)}\, g_2(\|z\|) + \frac{w(\|z\|^2/s_0^2)\, \|z\|^2}{c_0(F)}\Big\}\Big\{\frac{zz^\top}{\|z\|^2} - \frac{I}{p}\Big\} + \Big\{\frac{c_2(F)}{c_0(F)}\, g_3(\|z\|) + \frac{w(\|z\|^2/s_0^2)\, \|z\|^2 - c_3(F)}{c_0(F)}\Big\}\, \frac{I}{p}$$
if $z \neq 0$, and $IF(0; T, F) = 0$, $IF(0; V, F) = 0$. Here
$$g_1(t) = \frac{E\{\psi(U_1 t/s_0)\, U_1\}}{E\{\psi'(X_1/s_0)\}},$$
$$g_2(t) = \frac{E\{\rho(U_1 t/s_0)\, U_1^2\}\, p - E\{\rho(U_1 t/s_0)\}}{E\{\rho'(X_1/s_0)\, X_1/s_0\}\, (p-1)},$$
$$g_3(t) = \frac{E\{\rho(U_1 t/s_0)\} - K}{E\{\rho'(X_1/s_0)\, X_1/s_0\}},$$
$$c_0(F) = E\{w(R^2/s_0^2)\}, \qquad c_1(F) = -2\, E\{w'(R^2/s_0^2)\, R^2/s_0\},$$
$$c_2(F) = -2\, E\{w'(R^2/s_0^2)\, R^4/s_0^2\}, \qquad c_3(F) = E\{w(R^2/s_0^2)\, R^2\},$$
where $X \sim F$, $R = \|X\|$ and $U = X/\|X\|$.
Note that $IF(z; T, F)$ does not depend on the scale estimator $s$ (except through $s_0$) and $IF(z; V, F)$ does not depend on the location estimator $m$. More remarkable, although not completely surprising, is that $IF(z; T, F)$ and $IF(z; V, F)$ have the forms of influence functions of one-step reweighted estimators (Lopuhaä, 1999, Remark 4.1; in Lopuhaä's notation, $c_0(F) = c_1$, $c_1(F) = pc_2$, $c_2(F) = p(p+2)c_4$ and $c_3(F) = pc_3$). Specifically, suppose that $T_0$ and $V_0$ are estimators with influence functions of the form (2) and (3) with $w_\mu(\|z\|^2) = p\, g_1(\|z\|)/\|z\|$, $w_\eta(\|z\|^2) = p(p+2)\, g_2(\|z\|)/\|z\|^2$ and $w_\tau(\|z\|^2) = 2p\, g_3(\|z\|)$. Then the influence functions of $T$ and $V$ coincide with those of the one-step reweighted estimators
$$T_1(F) = E_F\{w(d^2(X; T_0, V_0))\, X\}/E_F\{w(d^2(X; T_0, V_0))\},$$
$$V_1(F) = E_F\{w(d^2(X; T_0, V_0))(X - T_1)(X - T_1)^\top\}/E_F\{w(d^2(X; T_0, V_0))\},$$
where $d^2(X; T_0, V_0) = (X - T_0)^\top V_0^{-1}(X - T_0)$. We say that this result is not completely surprising because if one takes $m(F^a) = a^\top T_0(F)$ and $s(F^a) = \{a^\top V_0(F)\, a\}^{1/2}$ then
$d(X; T_0, V_0) = r(X; F)$. Of course, $T_0$ and $V_0$ are only "virtual" estimators; while it is possible to construct M-estimating equations from the influence functions of $T_0$ and $V_0$, they do not satisfy the conditions for existence of a solution required in Hampel et al. (1986, p. 287, Theorem 1). Nevertheless, this heuristic argument sheds some light on certain facts that have been observed in simulations; for example, that a given S-D estimator can attain high relative efficiency under both Normal and Cauchy models (Maronna and Yohai, 1995, Section 4.4), a behavior that is not uncommon for reweighted estimators (see Lopuhaä, 1999, and Gervini, 2002). It also allows us to conjecture that S-D estimators are asymptotically normal, which was doubted in the past. The asymptotic normality of $T$, with asymptotic variance given by $E_F\{IF(X; T, F)\, IF(X; T, F)^\top\}$, has been proved by Zuo et al. (2002, Theorem 3.1). The conditions assumed by these authors are more general than ours in principle, but when it comes to obtaining working formulas they also assume that the model distribution is elliptical and that $m$ is an M-estimator defined through a differentiable $\psi$ function, although they are allowed to use the MAD as scale estimator (see their Lemma 3.2). However, for the asymptotic normality of $V$ (which has not yet been rigorously established) it is presumably necessary to impose on $s$ smoothness conditions similar to ours.
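The one-step reweighting scheme itself is easy to state in code. The sketch below (my illustration of the reweighting step in $p = 2$ with a Huber-type weight; the initial pair `T0`, `V0` is assumed given, e.g. by some robust pilot estimator) recomputes a weighted mean and covariance from squared Mahalanobis distances relative to the initial estimates:

```python
def one_step_reweight_2d(X, T0, V0, w):
    """One-step reweighted location/scatter in dimension 2.
    X: list of 2-vectors; T0: initial location; V0: initial 2x2 scatter;
    w: weight function applied to squared Mahalanobis distances."""
    det = V0[0][0]*V0[1][1] - V0[0][1]*V0[1][0]
    vinv = [[V0[1][1]/det, -V0[0][1]/det],
            [-V0[1][0]/det, V0[0][0]/det]]
    d2 = []
    for x in X:
        u = (x[0] - T0[0], x[1] - T0[1])
        d2.append(u[0]*(vinv[0][0]*u[0] + vinv[0][1]*u[1])
                  + u[1]*(vinv[1][0]*u[0] + vinv[1][1]*u[1]))
    wts = [w(d) for d in d2]
    W = sum(wts)
    T1 = [sum(wi*x[j] for wi, x in zip(wts, X))/W for j in range(2)]
    V1 = [[sum(wi*(x[i] - T1[i])*(x[j] - T1[j]) for wi, x in zip(wts, X))/W
           for j in range(2)] for i in range(2)]
    return T1, V1
```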
The gross-error sensitivities and the asymptotic efficiencies that result from Theorem 3 provide a rational way to choose the weight function $w$. We need a weight function that offers a reasonable trade-off between local robustness and efficiency. Maronna and Yohai (1995) recommend a Huber-type weight function $w_H(t^2; c) = \min\{t^2, c\}/t^2$. Zuo et al. (2002) propose a weight function that gives weight one to observations with small outlyingness and then decreases exponentially to zero; following this idea, we consider a truncated exponential weight function $w_E(t^2; c) = \min\{1, \exp(5(1 - t^2/c))\}$. This function decreases so rapidly to zero that it is essentially a smoothed-down indicator function. Finally, we introduce a Gaussian weight function $w_G(t^2; c) = \varphi(t^2/c)/\varphi(1)$, which decreases exponentially for $t^2 > c$ but, unlike $w_E$ and $w_H$, which are constant on $[0, c)$, is strictly decreasing on $[0, \infty)$. We take cut-off values of the form $c = \chi^2_{p,1-\alpha}$. Figures 1 and 2 plot gross-error sensitivities and asymptotic relative efficiencies as functions of $\alpha$, for $N_p(0, I)$ models with $p = 2$ and $p = 10$, with the median and MAD as $m$ and $s$. We can see very clearly that the exponential weight is uniformly worse than the others. The reason is that $w_E$ decreases to zero too quickly after the cut-off value, so that $|w_E'|$ becomes too large, resulting in large coefficients $c_1(F)/c_0(F)$ and $c_2(F)/c_0(F)$. Figure 3 illustrates this, showing $c_1^2(F)/(p\, c_0^2(F))$ and $c_2^2(F)/(p(p+2)\, c_0^2(F))$ for a $N_5(0, I)$ and a $T_{5,3}(0, I)$ model. Note how these ratios are so close to zero for $\alpha = .05$ (the value suggested by Maronna and Yohai, 1995) that the contribution of $m$ and $s$ is virtually negligible and the asymptotic behavior is then dominated by the weight function.
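In code, the three weight families read as follows (a direct transcription of the formulas above; the cut-off $c = 5.99 \approx \chi^2_{2,.95}$ used in the accompanying check is illustrative):

```python
import math

def w_huber(t2, c):
    # Huber-type weight: min{t^2, c}/t^2
    return min(t2, c)/t2 if t2 > 0 else 1.0

def w_exp(t2, c):
    # truncated exponential weight: min{1, exp(5(1 - t^2/c))}
    return min(1.0, math.exp(5.0*(1.0 - t2/c)))

def w_gauss(t2, c):
    # Gaussian weight: phi(t^2/c)/phi(1), strictly decreasing in t^2
    phi = lambda u: math.exp(-u*u/2)/math.sqrt(2*math.pi)
    return phi(t2/c)/phi(1.0)
```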
Regarding the other two weights, neither stands out as uniformly better. Huber weights are the most efficient, but the value $\alpha = .05$ yields poor local robustness. Overall, the best choices seem to be either a Gaussian weight with $\alpha = .10$ or a Huber weight with $\alpha = .50$. For further insight, we have compared these two estimators with the multivariate S-estimator corresponding to Tukey's $\rho$-function, with asymptotic breakdown point $1/2$ and calibrated for consistency under the Normal model. Figures 4 and 5 plot gross-error sensitivities and asymptotic relative efficiencies as functions of $p$, for $N_p(0, I)$ and $T_{p,3}(0, I)$ models. Again, there is no clear winner between Huber and Gaussian weights, but we can confidently say that the S-D estimators perform fairly well compared to the S-estimator, and are almost always more efficient.
A Proofs
Proof of Proposition 1. The first part is immediate, since $E\{\rho(X_1/s)\}$ goes to $\rho(0)$ when $s \to \infty$, and goes to $\rho(\infty)$ when $s \to 0$. For the second part, there is by assumption a $\delta_1 > 0$ such that $1 - F_1(\delta_1) > 0$. Given $s > 0$, there exists a $\delta > 0$ small enough so that $\delta \le \delta_1/s$ and $\psi(\delta) > 0$. Then
$$E\Big\{\psi'\Big(\frac{X_1}{s}\Big)\Big\} = 2s \int_0^\infty (1 - F_1(sx))\, d\psi(x) \ge 2s\, (1 - F_1(\delta_1))\, \psi(\delta),$$
which is strictly positive. To show that $E\{\rho'(X_1/s)\, X_1/s\} > 0$, use the same argument replacing $\psi(u)$ by $\int_0^u \rho'(x)\, x\, dx$. $\Box$
Proof of Theorem 1. For a fixed $z \in \mathbb{R}^p$, let $H: [0, 1] \times \mathbb{R}^p \times \mathbb{R} \times \mathbb{R}^+ \to \mathbb{R}^2$ be given by
$$H_1(\varepsilon, a, m, s) = E_{F_{\varepsilon,z}}\Big\{\psi\Big(\frac{a^\top X - m}{s}\Big)\Big\},$$
$$H_2(\varepsilon, a, m, s) = E_{F_{\varepsilon,z}}\Big\{\rho\Big(\frac{a^\top X - m}{s}\Big)\Big\} - K.$$
Then $G(\varepsilon, a)$ is implicitly defined by the equation $H(\varepsilon, a, G(\varepsilon, a)) = 0$. To find the derivatives we will use the Implicit Function Theorem (Dieudonné, 1969, Theorems 10.2.2 and 10.2.3). By A1, A2 and a straightforward application of dominated convergence, we have
$$\frac{\partial H_1}{\partial\varepsilon} = -E_F\Big\{\psi\Big(\frac{\|a\| X_1 - m}{s}\Big)\Big\} + \psi\Big(\frac{a^\top z - m}{s}\Big),$$
$$\frac{\partial H_1}{\partial a} = (1-\varepsilon)\, E_F\Big\{\psi'\Big(\frac{\|a\| X_1 - m}{s}\Big)\frac{X_1}{s}\Big\}\frac{a}{\|a\|} + \varepsilon\, \psi'\Big(\frac{a^\top z - m}{s}\Big)\frac{z}{s},$$
$$\frac{\partial H_1}{\partial m} = (1-\varepsilon)\, E_F\Big\{\psi'\Big(\frac{\|a\| X_1 - m}{s}\Big)\Big\}\Big(-\frac{1}{s}\Big) + \varepsilon\, \psi'\Big(\frac{a^\top z - m}{s}\Big)\Big(-\frac{1}{s}\Big),$$
$$\frac{\partial H_1}{\partial s} = (1-\varepsilon)\, E_F\Big\{\psi'\Big(\frac{\|a\| X_1 - m}{s}\Big)\Big(\frac{\|a\| X_1 - m}{s}\Big)\Big\}\Big(-\frac{1}{s}\Big) + \varepsilon\, \psi'\Big(\frac{a^\top z - m}{s}\Big)\Big(\frac{a^\top z - m}{s}\Big)\Big(-\frac{1}{s}\Big),$$
with analogous expressions for the partial derivatives of $H_2$, replacing $\psi$ by $\rho - K$. All these partial derivatives possess themselves continuous partial derivatives, whence $H$ is twice continuously differentiable. Since $G(0, a) = (0, \|a\| s_0)$, we need to evaluate the partial derivatives of $H$ at $(0, a, 0, \|a\| s_0)$. The differential is
$$D_{m,s} H(0, a, 0, \|a\| s_0) = \begin{pmatrix} -\dfrac{E_{F_1}\{\psi'(X_1/s_0)\}}{\|a\| s_0} & 0 \\[2ex] 0 & -\dfrac{E_{F_1}\{\rho'(X_1/s_0)\, X_1/s_0\}}{\|a\| s_0} \end{pmatrix},$$
which is non-singular by A3. Hence $G$ is (twice) continuously differentiable at $(0, a)$ for any $a \neq 0$, and $D_{\varepsilon,a} G(0, a) = -(D_{m,s} H(0, a, 0, \|a\| s_0))^{-1} D_{\varepsilon,a} H(0, a, 0, \|a\| s_0)$. Since
$$D_{\varepsilon,a} H(0, a, 0, \|a\| s_0) = \begin{pmatrix} \psi\Big(\dfrac{a^\top z}{\|a\| s_0}\Big) & 0^\top \\[2ex] \rho\Big(\dfrac{a^\top z}{\|a\| s_0}\Big) - K & E_{F_1}\Big\{\rho'\Big(\dfrac{X_1}{s_0}\Big)\dfrac{X_1}{\|a\| s_0}\Big\}\dfrac{a^\top}{\|a\|} \end{pmatrix},$$
the expressions for the first partial derivatives given in the theorem follow. The second partial derivatives are obtained from them in a straightforward manner. $\Box$
Proof of Theorem 2. For a fixed $x \in \mathbb{R}^p$, $x \neq 0$, consider the Lagrangian
$$g(\varepsilon, a, \lambda) = \Big(\frac{a^\top x - G_1(\varepsilon, a)}{G_2(\varepsilon, a)}\Big)^2 + \lambda(1 - \|a\|^2).$$
By Theorem 1, $g$ is differentiable at $(0, a, \lambda)$ for any $a \neq 0$. Let $H: [0, 1] \times \mathbb{R}^p \times \mathbb{R}^+ \to \mathbb{R}^{p+1}$ be the vector of partial derivatives of $g$ with respect to $a$ and $\lambda$. That is,
$$H_1(\varepsilon, a, \lambda) = 2\Big(\frac{a^\top x - G_1(\varepsilon, a)}{G_2(\varepsilon, a)}\Big)\Big(\frac{x - \frac{\partial G_1}{\partial a}(\varepsilon, a)}{G_2(\varepsilon, a)} - \frac{a^\top x - G_1(\varepsilon, a)}{G_2^2(\varepsilon, a)}\, \frac{\partial G_2}{\partial a}(\varepsilon, a)\Big) - 2\lambda a,$$
$$H_2(\varepsilon, a, \lambda) = 1 - \|a\|^2.$$
Therefore the vector $a = a(\varepsilon)$ that realizes the supremum $r^2(x; F_{\varepsilon,z})$ satisfies the implicit equation $H(\varepsilon, a(\varepsilon), \lambda(\varepsilon)) = 0$. We will use the Implicit Function Theorem to show that $a(\varepsilon)$ is differentiable at $\varepsilon = 0$. The condition $H_2(0, a(0), \lambda(0)) = 0$ is equivalent to $\|a(0)\| = 1$. Since
$$H_1(0, a, \lambda) = 2\, \frac{(a^\top x)\, x}{\|a\|^2 s_0^2} - 2\, \frac{(a^\top x)^2\, a}{\|a\|^4 s_0^2} - 2\lambda a,$$
by multiplying each term by $a$ we deduce that $H_1(0, a(0), \lambda(0)) = 0$ if and only if $\lambda(0) = 0$ and either $a(0)^\top x = 0$ or $a(0) = \pm x/\|x\|$. Clearly the maximizer of $r^2(x; F)$ cannot be orthogonal to $x$, because that would imply $r^2(x; F) = 0$; hence it must be $a(0) = \pm x/\|x\|$, whichever belongs to $S^{p-1}_+$. Let $a_0 = a(0)$. Using the formulas in Theorem 1 we have that
$$\frac{\partial H_1}{\partial a}(0, a, \lambda) = \frac{2}{\|a\|^2 s_0^2}\Big\{\Big(I - 2\frac{aa^\top}{\|a\|^2}\Big)\, xx^\top \Big(I - 2\frac{aa^\top}{\|a\|^2}\Big) - \frac{(a^\top x)^2}{\|a\|^2}\, I\Big\} - 2\lambda I,$$
$$\frac{\partial H_1}{\partial \lambda}(0, a, \lambda) = -2a, \qquad \frac{\partial H_2}{\partial a}(0, a, \lambda) = -2a^\top, \qquad \frac{\partial H_2}{\partial \lambda}(0, a, \lambda) = 0,$$
and then
$$D_{a,\lambda} H(0, a_0, 0) = \begin{pmatrix} \dfrac{2}{s_0^2}\big(xx^\top - \|x\|^2 I\big) & -2a_0 \\[1.5ex] -2a_0^\top & 0 \end{pmatrix},$$
which is non-singular. Then, by the Implicit Function Theorem, $a(\varepsilon)$ and $\lambda(\varepsilon)$ are differentiable at $\varepsilon = 0$. Therefore
$$r^2(x; F_{\varepsilon,z}) = \Big(\frac{a(\varepsilon)^\top x - G_1(\varepsilon, a(\varepsilon))}{G_2(\varepsilon, a(\varepsilon))}\Big)^2$$
is also differentiable at $\varepsilon = 0$, and
$$\frac{\partial}{\partial\varepsilon}\, r^2(x; F_{\varepsilon,z})\Big|_{\varepsilon=0} = 2\Big(\frac{a_0^\top x - G_1(0, a_0)}{G_2(0, a_0)}\Big)\Big(\frac{a'(0)^\top x - \frac{\partial G_1(\varepsilon, a(\varepsilon))}{\partial\varepsilon}\big|_{\varepsilon=0}}{G_2(0, a_0)} - \frac{a_0^\top x - G_1(0, a_0)}{G_2^2(0, a_0)}\, \frac{\partial G_2(\varepsilon, a(\varepsilon))}{\partial\varepsilon}\Big|_{\varepsilon=0}\Big).$$
The explicit value of $a'(0)$ is irrelevant; we only use that $a'(0)^\top a_0 = 0$, a consequence of $\|a(\varepsilon)\|^2 = 1$. Since $G_1(0, a_0) = 0$ and $G_2(0, a_0) = s_0$, we have
$$\frac{\partial G_1(\varepsilon, a(\varepsilon))}{\partial\varepsilon}\Big|_{\varepsilon=0} = \frac{\partial G_1}{\partial\varepsilon}(0, a_0) + \frac{\partial G_1}{\partial a}(0, a_0)^\top a'(0) = \frac{s_0\, \psi(a_0^\top z/s_0)}{E_F\{\psi'(X_1/s_0)\}},$$
$$\frac{\partial G_2(\varepsilon, a(\varepsilon))}{\partial\varepsilon}\Big|_{\varepsilon=0} = \frac{\partial G_2}{\partial\varepsilon}(0, a_0) + \frac{\partial G_2}{\partial a}(0, a_0)^\top a'(0) = \frac{s_0\, \{\rho(a_0^\top z/s_0) - K\}}{E_F\{\rho'(X_1/s_0)\, X_1/s_0\}},$$
and (9) follows. $\Box$
Proof of Theorem 3. By dominated convergence we have
$$h_1'(0) = E_F\Big\{w'(r^2(X; F))\, \frac{\partial}{\partial\varepsilon} r^2(X; F_{\varepsilon,z})\Big|_{\varepsilon=0}\, X\Big\},$$
$$h_2'(0) = E_F\Big\{w'(r^2(X; F))\, \frac{\partial}{\partial\varepsilon} r^2(X; F_{\varepsilon,z})\Big|_{\varepsilon=0}\, XX^\top\Big\}.$$
The theorem follows upon replacing the derivative of $r^2(X; F_{\varepsilon,z})$ by the expression given in Theorem 2 and using the factorization $X = RU$. We have to evaluate expectations of the form $E\{\vartheta(z^\top U/s_0)\, U\}$ and $E\{\vartheta(z^\top U/s_0)\, UU^\top\}$, where $\vartheta$ is either $\psi$ or $\rho - K$. Take a $\Gamma \in O(p)$ with $z/\|z\|$ as its first column. Using that $\mathcal{L}(U) = \mathcal{L}(\Gamma U)$ we obtain
$$E\Big\{\vartheta\Big(\frac{z^\top U}{s_0}\Big)\, U\Big\} = E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\, U_1\Big\}\, \frac{z}{\|z\|}, \qquad (10)$$
$$E\Big\{\vartheta\Big(\frac{z^\top U}{s_0}\Big)\, UU^\top\Big\} = E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\, U_2^2\Big\}\, I + E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)(U_1^2 - U_2^2)\Big\}\, \frac{zz^\top}{\|z\|^2}, \qquad (11)$$
and since
$$E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\Big\} = \mathrm{tr}\, E\Big\{\vartheta\Big(\frac{z^\top U}{s_0}\Big)\, UU^\top\Big\} = E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\, U_1^2\Big\} + (p-1)\, E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\, U_2^2\Big\},$$
equation (11) can be rewritten entirely in terms of $U_1$ as
$$E\Big\{\vartheta\Big(\frac{z^\top U}{s_0}\Big)\, UU^\top\Big\} = E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)\, U_1^2\Big\}\, \frac{zz^\top}{\|z\|^2} + \frac{1}{p-1}\, E\Big\{\vartheta\Big(\frac{\|z\| U_1}{s_0}\Big)(1 - U_1^2)\Big\}\Big(I - \frac{zz^\top}{\|z\|^2}\Big).$$
The rest is straightforward, using that (10) is zero for $\vartheta = \rho - K$ and (11) is zero for $\vartheta = \psi$. $\Box$
Some formulas for the med/mad case. To compute the gross-error sensitivities of the S-D estimator, observe that
$$t\, w_\mu(t^2) = \frac{c_1(F)\, g_1(t) + w(t^2/s_0^2)\, t}{c_0(F)},$$
$$t^2 w_\eta(t^2) = \frac{c_2(F)\, g_2(t) + w(t^2/s_0^2)\, t^2}{c_0(F)},$$
$$w_\tau(t^2) = \frac{c_2(F)\, g_3(t) + \{w(t^2/s_0^2)\, t^2 - c_3(F)\}}{c_0(F)},$$
$$\beta_F = \frac{c_3(F)}{c_0(F)\, p}.$$
When $m$ is the median and $s$ is the standardized MAD, we have
$$E\Big\{\psi\Big(\frac{U_1 t}{s_0}\Big)\, U_1\Big\} = \frac{\Gamma(\frac{p}{2})}{\sqrt{\pi}\, \Gamma(\frac{p+1}{2})},$$
$$E\Big\{\rho\Big(\frac{U_1 t}{s_0}\Big)\Big\} = 1 - F_{B(\frac{1}{2}, \frac{p-1}{2})}\Big(\frac{\Phi^{-1}(.75)^2 s_0^2}{t^2}\Big),$$
$$E\Big\{\rho\Big(\frac{U_1 t}{s_0}\Big)\, U_1^2\Big\} = \frac{1}{p} - \frac{1}{p}\, F_{B(\frac{3}{2}, \frac{p-1}{2})}\Big(\frac{\Phi^{-1}(.75)^2 s_0^2}{t^2}\Big),$$
and then, changing the order of integration to avoid $\psi'$ and $\rho'$, we have
$$g_1(t) = \frac{\Gamma(\frac{p}{2})}{\sqrt{\pi}\, \Gamma(\frac{p+1}{2})}\, \frac{1}{2 s_0 f_1(0)},$$
$$g_2(t) = \frac{F_{B(1/2, (p-1)/2)}(\Phi^{-1}(.75)^2 s_0^2/t^2) - F_{B(3/2, (p-1)/2)}(\Phi^{-1}(.75)^2 s_0^2/t^2)}{(p-1)\, 2 s_0 \Phi^{-1}(.75)\, f_1(s_0 \Phi^{-1}(.75))},$$
$$g_3(t) = \frac{1/2 - F_{B(1/2, (p-1)/2)}(\Phi^{-1}(.75)^2 s_0^2/t^2)}{2 s_0 \Phi^{-1}(.75)\, f_1(s_0 \Phi^{-1}(.75))},$$
where $f_1$ is the marginal density function of $X_1$. For the $N_p(0, I)$ model we have $s_0 = 1$, while for the $T_{p,\nu}(0, I)$ model we have $s_0 = F_{T_{1,\nu}}^{-1}(.75)/\Phi^{-1}(.75)$.

For $\gamma_T^*(F)$ we can obtain an explicit expression:
$$\gamma_T^*(F) = \Big\{\frac{c_1(F)}{c_0(F)}\, \frac{\Gamma(\frac{p}{2})}{\sqrt{\pi}\, \Gamma(\frac{p+1}{2})}\, \frac{1}{2 s_0 f_1(0)} + \frac{\sup_{t \ge 0} w(t^2/s_0^2)\, t}{c_0(F)}\Big\}\, d_\mu^{*\,1/2}.$$
The maximum of $w(t^2/s_0^2)\, t$ is attained at $t = s_0\sqrt{c}$ for the Huber and Exponential weights, and then $\sup_{t \ge 0} w(t^2/s_0^2)\, t = s_0\sqrt{c}$; for the Gaussian weight the maximum is attained at $t = s_0\sqrt{c}/2^{1/4}$, and then $\sup_{t \ge 0} w_G(t^2/s_0^2)\, t = \varphi(1/\sqrt{2})\, s_0\sqrt{c}/(2^{1/4}\varphi(1))$.
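The stated maximizers can be confirmed numerically. The following check (my own; the values of $s_0$, $c$ and the grid are arbitrary) compares the closed-form maximum of $w_G(t^2/s_0^2)\, t$, and of the Huber analogue, against a grid search:

```python
import math

def phi(u):
    return math.exp(-u*u/2)/math.sqrt(2*math.pi)

def g_gauss(t, s0, c):
    # w_G(t^2/s0^2; c) * t with w_G(u; c) = phi(u/c)/phi(1)
    return phi(t*t/(s0*s0*c))/phi(1.0)*t

s0, c = 1.0, 5.99  # illustrative cut-off (roughly chi^2_{2,.95})
t_star = s0*math.sqrt(c)/2**0.25                                  # claimed maximizer
m_star = phi(1/math.sqrt(2))*s0*math.sqrt(c)/(2**0.25*phi(1.0))   # claimed maximum
```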
References
Bilodeau, M. and Brenner, D. (1999), Theory of Multivariate Statistics (Springer, NY).
Davies, P.L. (1987), Asymptotic behavior of S-estimates of multivariate location parameters and dispersion matrices, Ann. Statist. 15, 1269–1292.
Dieudonné, J. (1969), Foundations of Modern Analysis (Academic Press, NY).
Donoho, D.L. (1982), Breakdown properties of multivariate location estimators, Ph.D. qualifying paper (Department of Statistics, Harvard University).
Gervini, D. (2002), A robust and efficient adaptive reweighted estimator of multivariate
location and scatter. To appear in J. Multivar. Anal.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986), Robust
Statistics: The Approach Based on Influence Functions (Wiley, NY).
Kent, J.T. and Tyler, D.E. (1997), Constrained M-estimation for multivariate location
and scatter, Ann. Statist. 24, 1346–1370.
Lopuhaä, H.P. (1999), Asymptotics of reweighted estimators of multivariate location
and scatter, Ann. Statist. 27, 1638–1665.
Maronna, R.A. and Yohai, V.J. (1995), The behavior of the Stahel-Donoho robust multivariate estimator, J. Amer. Statist. Assoc. 90, 330–341.
Maronna, R.A. and Yohai, V.J. (1998), Robust estimation of multivariate location and
scatter, in: S. Kotz and N.L. Johnson, eds. Encyclopedia of Statistical Sciences,
Update Volume 2 (Wiley, NY) pp. 589–596.
Rousseeuw, P.J. and Yohai, V.J. (1984), Robust regression by means of S-estimators,
in: J. Franke, W. Härdle and R.D. Martin, eds. Robust and Nonlinear Time Series
Analysis. Lecture Notes in Statistics Vol. 26 (Springer, NY) pp. 256–272.
Stahel, W.A. (1981), Robust estimation: Infinitesimal optimality and covariance matrix
estimators, Ph.D. thesis (ETH, Zurich).
Zuo, Y., Cui, H., and He, X. (2002), On the Stahel-Donoho estimator and depth-weighted means of multivariate data, unpublished manuscript.
Figure 1: Gross-error sensitivities and asymptotic relative efficiencies of S-D estimators for three different weight functions (Huber, Exponential, Gaussian) and cut-off values $\chi^2_{p,1-\alpha}$. Model $N_p(0, I)$ with $p = 2$. [Figure: four panels, GES location, ARE location, GES shape, ARE shape, plotted against $\alpha$.]
Figure 2: Gross-error sensitivities and asymptotic relative efficiencies of S-D estimators for three different weight functions (Huber, Exponential, Gaussian) and cut-off values $\chi^2_{p,1-\alpha}$. Model $N_p(0, I)$ with $p = 10$. [Figure: four panels, GES location, ARE location, GES shape, ARE shape, plotted against $\alpha$.]
Figure 3: Coefficients $c_1^2/(p\, c_0^2)$ (for location) and $c_2^2/(p(p+2)\, c_0^2)$ (for scatter) under $N_5(0, I)$ and $T_{5,3}(0, I)$ models. [Figure: four panels, location and shape under the Normal and $T_3$ models, plotted against $\alpha$; weights: Huber, Exponential, Gaussian.]
Figure 4: Gross-error sensitivities and asymptotic relative efficiencies of S-D estimators compared with S-estimators, for $N_p(0, I)$ models. [Figure: four panels, GES location, ARE location, GES shape, ARE shape, plotted against $p$; estimators: SD-Huber, SD-Gauss, S-Tukey.]
Figure 5: Gross-error sensitivities and asymptotic relative efficiencies of S-D estimators compared with S-estimators, for $T_{p,3}(0, I)$ models. [Figure: four panels, GES location, ARE location, GES shape, ARE shape, plotted against $p$; estimators: SD-Huber, SD-Gauss, S-Tukey.]