Carroll, Raymond J., Ruppert, David and Stefanski, Leonard A. "Adapting for Heteroscedasticity in Regression Models"

ADAPTING FOR HETEROSCEDASTICITY IN REGRESSION MODELS

Raymond J. Carroll (1) and David Ruppert (2)
University of North Carolina

and

Leonard A. Stefanski
North Carolina State University

(1) Research supported by the Air Force Office of Scientific Research AFOSR-F-49620-85-C-0144.
(2) Research supported by the National Science Foundation MCS-8100748.
ABSTRACT

We investigate the limiting behavior of a class of one-step M-estimators in heteroscedastic regression models. The mean function is assumed to be known up to parameters, but the variance function is considered an unknown function of a p-dimensional vector. The variance function is to be estimated nonparametrically by a function of the absolute residuals from the current fit to the mean. Under a variety of conditions, we discuss when the estimates adapt for scale, i.e., estimate the regression parameter just as well as if the scale function were known. Connections with the theory of optimal semiparametric estimation are made.

Key Words and Phrases: Heteroscedasticity, adaptation, nonparametric regression, M-estimator, robustness, generalized least squares.
INTRODUCTION

We study aspects of the effect of estimating weights in a generalization of the heteroscedastic regression model considered by Carroll & Ruppert (1982a). The observations are (y_i, x_i) for i = 1, ..., N, where y_i is the response and x_i is the vector of predictors. Let L(y|x) and S(y|x) define location and scale for the distribution of y given x. For example, L(y|x) could be the mean of y given x, while S(y|x) could be the standard deviation. The model is specified conditionally on the x_i by

(1.1)    L(y|x) = f(x,β) ;

(1.2)    S²(y|x) = v_0(x) = v(x,θ) .

Throughout it will be understood that S(y|x) depends on a p-dimensional subvector of x or on a p-dimensional predictor z.
If L is expectation and S is standard deviation, then (1.1)-(1.2) specify the usual heteroscedastic regression model. In (1.2), the vector parameter θ is typically unknown. Carroll & Ruppert (1982a) consider a class of M-estimators for the regression parameter β. Let ψ be an odd function with derivative ψ_1. Let θ̂ be any N^{1/2}-consistent estimator of θ, and let f_β be the derivative of f with respect to β and v_θ the derivative of v with respect to θ.
Define

(1.3)    ε = y - f(x,β) ;    e(β,θ) = ε / v^{1/2}(x,θ) ;    z(β,θ) = -(∂/∂β) e(β,θ) .

We assume that f and v are such that the following limit result holds for some positive definite covariance matrix W_1:

(1.4)    N^{-1/2} Σ_{i=1}^N z_i(β,θ) ψ{ e_i(β,θ) } → Normal( 0, W_1 ) .

Define β̂(θ) to be a solution to the estimating equation

(1.5)    N^{-1/2} Σ_{i=1}^N z_i(β̂,θ) ψ{ e_i(β̂,θ) } = 0 .
If for some positive definite matrix W_2

(1.6)    N^{-1} Σ_{i=1}^N z_i(β,θ) ψ_1{ e_i(β,θ) } z_i(β,θ)' →_p W_2 ,

then we would have the asymptotic limit result

(1.7)    N^{1/2} { β̂(θ) - β } → Normal( 0, W_2^{-1} W_1 W_2^{-1} ) .
In most applications θ is unknown and thus β̂(θ) is also unknown. A natural estimator of β is then β̂(θ̂), where θ̂ is a N^{1/2}-consistent estimator of θ. A major question is to find conditions under which β̂(·) adapts for estimating θ, i.e., β̂(θ̂) and β̂(θ) have the same limiting distribution (1.7). This has important statistical consequences, because if adaptation for θ occurs, then at least for large samples we can pretend that θ is known and use standard methods of inference.
For example, if ψ(w) = w then (1.1) and (1.2) generally imply (1.4). In this case, the solution to (1.5) is a generalized least squares estimator computed by a weighted least squares algorithm with weights the inverse of v(x_i,θ). When adaptation occurs, inference can be made as if the weights were known, at least in the limit. For discussions which use second order calculations to understand the effect of estimating the weights, see Rothenberg (1984) and Carroll, Ruppert & Wu (1986).
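When ψ(w) = w, the weighted computation just described is ordinary weighted least squares. The following minimal sketch is purely illustrative and is not taken from the development above; the function name, the linear mean f(x,β) = x'β, and the particular variance function in the example are assumptions made only for this demonstration.

```python
import numpy as np

def generalized_least_squares(y, X, v, theta_hat):
    """Weighted least squares with weights 1 / v(x_i, theta_hat).

    Illustrative sketch only: linear mean f(x, beta) = x'beta and a
    user-supplied parametric variance function v(x, theta).
    """
    w = 1.0 / np.array([v(x, theta_hat) for x in X])     # weights = inverse variances
    W = np.diag(w)
    # Solve the weighted normal equations (X' W X) beta = X' W y.
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

if __name__ == "__main__":
    # Hypothetical power-type variance model, chosen only for the example.
    rng = np.random.default_rng(0)
    N = 200
    X = np.column_stack([np.ones(N), rng.uniform(1, 5, N)])
    beta_true = np.array([1.0, 2.0])
    v = lambda x, theta: (1.0 + abs(x[1])) ** theta
    y = X @ beta_true + np.sqrt([v(x, 1.5) for x in X]) * rng.standard_normal(N)
    print(generalized_least_squares(y, X, v, theta_hat=1.5))
```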
An easy extension of Theorem 1 in Carroll & Ruppert (1982a) states that, under regularity conditions, β̂ has the same asymptotic normal limit distribution as if θ were known as long as

(1.8)    N^{-1} Σ_{i=1}^N e_i(β,θ) ψ_1{ e_i(β,θ) } z_i(β,θ) v_θ(x_i,θ) / v(x_i,θ) →_p 0 .

For generalized least squares, ψ(w) = w and condition (1.8) is almost always satisfied. For general M-estimators, condition (1.8) essentially reduces to an assumption of symmetry of the distribution of ε_i given x_i.
In practice, the form of the location function f(x_i,β) may be easily specified, but the scale function v(x_i,θ) is less clear, especially when the dimension p of x is greater than one. There are at least three strategies for coping with this case. The first is to assume that the scale is a function of the mean response. This reduces the scale estimation problem to a single dimension. See Carroll & Ruppert (1982a) and Davidian & Carroll (1986).
A second strategy is to use an empirical model for the scale function v(x,θ), for example a response surface quadratic or its square root. This approach has not been tried too often in the literature, although Box & Meyer (1985) seem to suggest the idea.
A third approach, which we investigate, is to estimate the scale function nonparametrically. Carroll (1982) considered the case of (1.2) with ψ the identity function. He proposed that the unknown variance function v_0 be estimated by nonparametric kernel regression on the squared residuals from an unweighted least squares fit. Though v_0 is unknown, he showed that one can estimate β by generalized least squares just as well as if v_0 were known. Unfortunately, while the limit result is interesting, the conditions of his proof are unnecessarily stringent.
Subsequent to Carroll's paper, Matloff, Rose & Tai (1984) performed a simulation study, while Muller & Stadtmuller (1986) considered a fixed design. Robinson (1986) obtained Carroll's result for more than one dimension under far weaker regularity conditions. This is a nice piece of work, and we will borrow techniques from it where appropriate.
In this paper, we study adaptation in a broader framework by considering a class of one-step weighted M-estimators with weights estimated nonparametrically. The model for the means is allowed to be nonlinear. We allow a fairly broad class of smoothers, including kernel and nearest neighbor regression estimates. It is not the particular smoother that matters, but rather that the smoother satisfy certain reasonable moment conditions. In particular, robust smoothers could be used.
In the next section we introduce the class of estimators and the basic set of conditions. In sections 3 and 4 we use our work to provide examples for which adaptation occurs. In section 5, we link our work to the theory of information bounds for semiparametric models. In section 6 we address a few brief remarks to the case that the scale functions are known to be dependent on the mean response. All proofs are in the Appendix.
SECTION 2: The Estimators and Main Results

The choice of one-step weighted M-estimates avoids the necessity of specifying extraneous conditions for asymptotic normality in nonlinear regression. The estimators include those studied by Carroll (1982) and Robinson (1986) as special cases. Let β* be any N^{1/2}-consistent estimate of β.
Write

ε(β) = ε(y,x,β) = y - f(x,β) ;    d(β) = d(x,β) = f_β(x,β) .

It is most compact to treat (y,x) as a random variable, the translation to the case of fixed x's being immediate. In what follows we use the notation of Pollard (1984), so that for any random variable G(y,x), P{ G(x,y) } is the expectation of G, and P_N{ G(x,y) } is the average of the values G(x_i,y_i).
If g(x) is any estimate of the scaling function v_0(x), then as in Bickel (1975) a weighted one-step M-estimator of β is

(2.1)    β̂ = β* + { A_N(β*,g) }^{-1} B_N(β*,g) , where

A_N(β,g) = P_N{ g(x)^{-1} d(β) d(β)' ψ_1( ε(β) g(x)^{-1/2} ) } ;

B_N(β,g) = P_N{ g(x)^{-1/2} d(β) ψ( ε(β) g(x)^{-1/2} ) } .

Bickel (1975) calls this a Type 1 estimate. Our technical arguments are for the Type 1 estimates, but they are easily modified to apply to his Type 2 estimates.
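For concreteness, the one-step estimator (2.1) is simple to compute once a preliminary estimate and a variance smoother are available. The sketch below is illustrative only: the function and argument names are ours, the Huber score is just one admissible odd ψ, and no attempt is made to verify the regularity conditions.

```python
import numpy as np

def one_step_m_estimator(y, X, f, f_beta, beta_star, g, psi, psi1):
    """One-step weighted M-estimator in the spirit of (2.1).

    Illustrative sketch: f(X, beta) is the mean function, f_beta its derivative
    (an N x p matrix), beta_star a root-N consistent preliminary estimate,
    g an N-vector of estimated variances, psi an odd score with derivative psi1.
    """
    eps = y - f(X, beta_star)                              # epsilon(beta*)
    d = f_beta(X, beta_star)                               # d(beta*) = f_beta(x, beta*)
    r = eps / np.sqrt(g)                                   # standardized residuals
    A = (d * (psi1(r) / g)[:, None]).T @ d / len(y)        # A_N(beta*, g)
    B = d.T @ (psi(r) / np.sqrt(g)) / len(y)               # B_N(beta*, g)
    return beta_star + np.linalg.solve(A, B)

# One possible odd, bounded score: the Huber psi and its derivative.
def psi_huber(r, c=1.345):
    return np.clip(r, -c, c)

def psi1_huber(r, c=1.345):
    return (np.abs(r) <= c).astype(float)
```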
If ψ(w) = w, then (2.1) is a generalized least squares estimate with weights the inverse of g(x). Ordinarily, one first computes β̂ based on unweighted least squares residuals, updates the preliminary estimator and recomputes β̂. This process will be repeated a few times.
Let q_N = N^{-γ} for a sufficiently small positive γ. Let g(x) equal q_N plus an estimate of v_0(x) based on the residuals ε_i(β*). Let v̂_0(x) be an estimate of v_0(x) based on the true unknown errors ε_i(β), and let v̂(x) = v̂_0(x) + q_N. The addition of the small amount q_N avoids problems with infinite weights. In this and the next section, rather than adding q_N to the smoothers, we could without change have taken the maximum of q_N and the smoother.
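As one concrete (and entirely optional) choice of the estimate g, the following sketch applies a Nadaraya-Watson kernel smooth to the squared residuals from the preliminary fit and adds the floor q_N; the Gaussian kernel, the bandwidth, and the restriction to a one-dimensional predictor are arbitrary choices made only for illustration.

```python
import numpy as np

def variance_smoother(x, resid, x_eval, bandwidth, q_N):
    """Kernel regression of squared residuals, floored by q_N.

    Illustrative sketch of one smoother g(x) = q_N + smooth of epsilon_i(beta*)^2;
    the paper allows any smoother satisfying its moment conditions.
    """
    x = np.asarray(x, float)
    r2 = np.asarray(resid, float) ** 2
    g = np.empty(len(x_eval))
    for j, x0 in enumerate(np.asarray(x_eval, float)):
        k = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)     # Gaussian kernel weights
        g[j] = q_N + np.sum(k * r2) / np.sum(k)            # smoothed squared residuals
    return g
```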
The key conditions about these estimates are as follows.

(A.1)    infimum_x v_0(x) > 0 .

(A.2)    N^{a(1)-δ} P_N{ ( v̂_0(x) - v_0(x) )² } →_p 0 for all δ > 0, where a(1) = 4/(4+p) .

(A.3)    For some a(2), N^{a(2)-δ} P_N{ ( g(x) - v_0(x) )² } →_p 0 for all δ > 0 .

Assumption (A.2) holds for kernel and nearest neighbor estimates under appropriate conditions; see section 4. The constant a(1) is the optimal rate of convergence for nonparametric regression estimates, although slower rates could have been used. We are not restricting to smoothing only squared residuals. Condition (A.3) is often easy to check. For kernel and linear smoothers such as nearest neighbor estimates having "bandwidth" not depending on the responses, a(2) ≥ 1 > a(1), at least under minimal regularity conditions. In some instances, a(2) > 1. It is easy to develop conditions for these to hold in the case of a linear smoother applied to squared residuals, but in the interest of space we forego the details.
If x has compact support, then under additional regularity conditions the convergence in (A.2)-(A.3) is uniform in x, e.g., in (A.2)

sup_x | v̂_0(x) - v_0(x) | →_p 0 .

See section 4 for a further discussion. In particular, the regularity conditions on ψ and f could then be weakened: ψ would be constructed as in Bickel (1975) and would not need to be differentiable, and we could take q_N = 0. Where possible, we wish to avoid the assumption of compactness. If the x's are not confined to a compact set, the convergence of g and v̂_0 is not uniform and we must assume more smoothness for ψ. The following conditions are reasonable, most of them trivial in linear regression. We use the notation || · || for the Euclidean norm of the argument.
(A.4)    ψ_1 exists and is continuous, and for constants M_ψ(1), M_ψ(2) ≤ ∞,

         | ψ(a) - ψ(b) | ≤ c_ψ minimum( M_ψ(1), |a - b| ) ;    | ψ_1(a) - ψ_1(b) | ≤ c_ψ minimum( M_ψ(2), |a - b| ) .

(A.5)    P{ || d(β) ||² ψ²( ε(β)/v_0(x)^{1/2} ) } < ∞ .

(A.6)    N^{-1/2} P[ sup{ || d(β+Δ_1) || | ψ_1( ε(β+Δ_1)/Δ_2^{1/2} ) | : || Δ_1 || ≤ M N^{-1/2}, | Δ_2 - v_0(x) | ≤ η_N } ] = O(1) for all M > 0 .

(A.7)    N^{k/2} P[ sup{ || d(β+Δ_1) - d(β+Δ_2) ||^k : || Δ_1 || ≤ M N^{-1/2}, || Δ_2 || ≤ M N^{-1/2} } ] = O(1) for all M > 0 and k = 1, 2 .

(A.8)    P{ [ || d(β) || ( | ψ( ε(β)/v_0(x)^{1/2} ) | + | ε(β) | I( M_ψ(1) = ∞ ) ) ]² } < ∞ .

(A.9)    P[ || f_ββ(x,β) ||⁴ { 1 + ε⁴(β) } ] < ∞ .

(A.10)   If G(Δ_1,Δ_2) = d(β+Δ_1) d(β+Δ_1)' ψ_1{ ε(β+Δ_1)/Δ_2^{1/2} }, then for all M > 0,

         η_N^{-1} P[ sup{ || G(Δ_1,Δ_2) - G(0,Δ_2) || : || Δ_1 || ≤ M N^{-1/2}, | Δ_2 - v_0(x) | ≤ η_N } ] = O(1) .

(A.11)-(A.12)    N^{-2γ} P[ sup || f_β(x,β) ||^{2k} s^{2j}( ε(β)/v_0(x)^{1/2} ) ] → 0 for ( k = 1, 2 ; j = 0, 1 ; s = ψ_1 ) and for ( k = 1, j = 1, s = ψ ) .

(A.13)   Write H_{k,j,n}(β,x) = ε^{2n}(β) || (∂^j/∂β^j) f(x,β) ||^k / v_0^n(x). If M_ψ(2) = ∞ in (A.4), then P{ H_{k,j,n}(β,x) } < ∞ for n = 2, j = 2, k = 1; if M_ψ(2) < ∞ in (A.4), then P{ H_{k,j,n}(β,x) } < ∞ for n = 2, k = 1, j = 0 .

For linear regression and generalized least squares, (A.4)-(A.13) simplify to || x ||⁴, ε⁴(β) and || x ||² v_0(x) having finite expectations.

The first step is a preliminary expansion for β̂.
LEMMA 1: Suppose that C_N = o_p(1). Then under the conditions (1.4), (1.6) and (A.1)-(A.13), the expansion (2.2) holds.

In principle, we can obtain the asymptotic distribution of β̂ by direct methods using (2.2). However, sometimes it is possible to replace g by v̂, the estimator of v_0 based only on the errors ε_i(β).
THEOREM 1: Suppose a(2) > 1.0 in assumption (A.3). Define D_N as in (2.2), suppose that D_N = o_p(1), and suppose that either of the following hold:

(B.1)    P{ ε²(β) || d(β) ||² } < ∞ .

(B.2)    For some q(1) and q(2),

         N^{q(1)} sup_x | v̂_0(x) - v_0(x) | →_p 0    and    N^{q(2)} sup_x | g(x) - v̂(x) | →_p 0 ,

         and

         N P[ sup{ ( H(Δ_1,Δ_2) - H(Δ_1,0) )² : | Δ_1 | ≤ M N^{-q(1)}, | Δ_2 | ≤ M N^{-q(2)} } ] → 0 ,

         where H(Δ_1,Δ_2) = ψ( ε(β) / ( v_0(x) + Δ_1 + Δ_2 )^{1/2} ) .

Then, under the conditions of LEMMA 1, (2.3) holds.
Our device of adding a constant q_N to the nonparametric regression estimator of v_0 can be eliminated in certain circumstances.

COROLLARY 1: In the definition of g, do not add the positive constant q_N. Suppose that {g - v_0} and {v̂ - v_0} converge in probability to zero uniformly at the rate N^{-γ}. Then the conclusion to THEOREM 1 still holds.
SECTION 3: M-estimators in the Symmetric Case

If, given x_i, ε(x_i,β) = y_i - f(x_i,β) is symmetrically distributed about zero, then M-estimators adapt for heteroscedasticity, i.e., they have the same distribution as if the scaling function v_0 were known. In the previous section we assumed only that the estimator v̂_0 was a function of the errors ε(x_i,β). In this section we make the further assumption that v̂_0 is an even function of these errors. In effect, when estimating the scaling function v_0, we are restricting to nonparametric regression estimators which are functions of the absolute residuals from the current fit β*, as did Carroll (1982) and Robinson (1986). It makes perfectly good sense to use the residuals to gain understanding of the variance structure.
Assumption (C.2) below is typically even weaker than (A.2), because in proving the latter one typically shows that the expectation of the left hand side of (A.2) converges to zero.
THEOREM 2: Assume that

(C.1)    P{ || d(β) ||² [ 1 + || d(β) ||² ] ψ⁴( ε(β)/v_0(x)^{1/2} ) / v_0²(x) } < ∞ ;

(C.2)    E[ P_N{ | v̂_0(x) - v_0(x) |² } ] → 0 ;

(C.3)    if M_ψ(1) = ∞ in (A.4), then P{ ε²(β) || d(β) ||² / v_0²(x) } < ∞ .

Further assume (1.4) and the conclusion to THEOREM 1. Then the estimators (2.1) adapt for heteroscedasticity and have the same limit distribution (1.7) as if the scaling function v_0 were known.
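THEOREM 2 can be checked informally by simulation: with symmetric errors, weighted least squares using a smooth of the absolute residuals should behave much like weighted least squares using the true variance function. The following self-contained sketch, whose design, variance function, smoother, bandwidth and sample sizes are all arbitrary choices, is only a heuristic illustration of that claim.

```python
import numpy as np

rng = np.random.default_rng(1)

def wls(y, X, w):
    # Weighted least squares with weights w (inverse variances).
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def smooth_abs_resid(x, resid, bandwidth=0.3, q_N=1e-3):
    # Kernel smooth of absolute residuals, then square: an even function of the errors.
    out = np.empty(len(x))
    for j, x0 in enumerate(x):
        k = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        out[j] = (np.sum(k * np.abs(resid)) / np.sum(k)) ** 2 + q_N
    return out

N, reps = 100, 200
beta_known, beta_est = [], []
for _ in range(reps):
    x = rng.uniform(0, 1, N)
    X = np.column_stack([np.ones(N), x])
    v0 = (0.5 + x) ** 2                                    # true variance function
    y = X @ [1.0, 2.0] + np.sqrt(v0) * rng.standard_normal(N)   # symmetric errors
    b_star = np.linalg.lstsq(X, y, rcond=None)[0]          # unweighted preliminary fit
    g = smooth_abs_resid(x, y - X @ b_star)                # estimated weights
    beta_known.append(wls(y, X, 1 / v0))
    beta_est.append(wls(y, X, 1 / g))
# The two sets of Monte Carlo standard deviations should be close for large N.
print(np.std(beta_known, axis=0), np.std(beta_est, axis=0))
```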
For asymmetric errors when ψ is not the identity function, THEOREM 2 typically fails. As we indicate below, it is possible to compute the limit distribution in this case, although we will not pursue the matter in any detail. To see what happens, consider a problem of dimension p = 1. By a Taylor series and (A.2), under sufficient regularity conditions, setting η_N = 0 for convenience, we have

(3.1)    D_N = N^{1/2} P_N{ d(β) [ ψ( ε(β)/v̂_0(x)^{1/2} ) - ψ( ε(β)/v_0(x)^{1/2} ) ] }

             = N^{1/2} P_N{ G(ε,β,x) [ v̂_0(x) - v_0(x) ] } + o_p(1) ,

where

G(ε,β,x) = (∂/∂w) { d(β) ψ( ε(β)/w^{1/2} ) } |_{w = v_0(x)} .

As indicated in the next section, we can replace G(ε,β,x) by its conditional expectation given x, which is

G*(x) = -(1/2) v_0(x)^{-3/2} d(β) E[ ε(β) ψ_1{ ε(β)/v_0(x)^{1/2} } | x ] .

The second component of (3.1) has a nontrivial limit distribution. Thus, the limit distribution of β̂ typically will not be the same as if v_0 were known.
SECTION 4: Adaptation for Generalized Least Squares

If ψ(w) = w, the estimator (2.1) is a generalized least squares estimator. Under weak conditions, Robinson (1986) proves adaptation in linear regression using a variant of the nearest neighbor device, the smoother being applied to squared residuals. His g(x_i) does not use the ith residual y_i - f(x_i,β*). His proof is easily extended to nonlinear regression, as we now indicate.

First consider linear smoothers based on squared residuals:

(4.1)    g(x) = q_N + N^{-1} Σ_{i=1}^N c_N(x_i,x) ε_i²(β*) ;

(4.2)    N^{-1} Σ_{i=1}^N c_N(x_i,x) ≤ 1 for all x .
Define the expectation of v̂_0 given the design as

(4.3)    v*(x) = N^{-1} Σ_{i=1}^N c_N(x_i,x) E{ ε_i²(β) | x_i } .

LEMMA 1 and THEOREM 1 still hold if we replace v_0(x) in (A.2) by v*(x). The modified (A.2) is easy to verify under weak conditions. Following Robinson (1986), choose the weight function c_N(x,z) to be of nearest neighbor type with c_N(x,x) = 0. THEOREM 1 eliminates the nonlinear regression function, and we can apply Robinson's proof virtually without change, the only complication being the addition of q_N.
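A minimal version of the nearest neighbor device, with the ith residual excluded so that c_N(x,x) = 0, might look as follows; the uniform neighbor weights, the one-dimensional predictor and the function name are illustrative assumptions rather than Robinson's construction.

```python
import numpy as np

def nn_variance_smoother(x, resid, k, q_N):
    """Leave-one-out nearest neighbor smooth of squared residuals.

    Illustrative sketch: g(x_i) averages the squared residuals of the k nearest
    neighbors of x_i, excluding observation i itself (c_N(x, x) = 0), plus the
    small floor q_N.
    """
    x = np.asarray(x, float)
    r2 = np.asarray(resid, float) ** 2
    g = np.empty(len(x))
    for i in range(len(x)):
        dist = np.abs(x - x[i])
        dist[i] = np.inf                      # exclude the ith residual
        nbrs = np.argsort(dist)[:k]           # indices of the k nearest neighbors
        g[i] = q_N + r2[nbrs].mean()
    return g
```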
Here is a second application of Theorem 1. Suppose the support of x is compact. If the variance is to be modeled nonparametrically as a function of p ≤ 3 predictors, then we can obtain an adaptation result based on the usual nearest neighbor regression estimators which use the ith residual in computing an estimate of v_0(x), i.e., c_N(x,x) ≠ 0. For reasons of space and interest we forego the details.
Some background is useful. Uniform convergence of nonparametric regression estimators when x has compact support has been proved for kernel estimators by Collomb & Haerdle (1986) and for nearest neighbor estimators by Devroye (1978). Applying these results to our problem does not yield the weakest conditions, as the results assume that the marginal distribution of x is very smooth. Average mean squared error convergence as in (A.2) has been discussed by Marron & Haerdle (1986) for kernel estimators with x being compactly supported and having a smooth distribution. Results of Mack (1981) can be used to prove (A.2) for nearest neighbor estimators under similar conditions. For nearest neighbor estimators it is far easier to prove (A.2) with v_0 replaced by v*, and then note that (D.1) is the same as Robinson's Lemma 8, which by the way is a powerful result.
Here is a third application of THEOREM 1. One difficulty with using squared residuals to estimate the scaling function is the tendency for a few wildly aberrant values to distort the picture; see for example Figure 2 in Hinkley (1985). We have found it more useful to smooth absolute rather than squared residuals, a robust smoother being even better. Adaptation holds when absolute rather than squared residuals are smoothed. We give one set of strong conditions. Set c_N(x,x) = 0, write s(x_i,β) = | ε_i(β) |, and consider the smoothers

g(x)^{1/2} = q_N^{1/2} + N^{-1} Σ_{i=1}^N c_N(x_i,x) s(x_i,β*) ;

v̂(x)^{1/2} = q_N^{1/2} + N^{-1} Σ_{i=1}^N c_N(x_i,x) s(x_i,β) .
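A corresponding computational sketch, again with choices (leave-one-out nearest neighbors, a median as the robust smoother, a one-dimensional predictor) that are ours rather than prescribed above, smooths absolute residuals and then squares.

```python
import numpy as np

def abs_resid_variance_smoother(x, resid, k, q_N):
    """Smooth absolute residuals, then square, to estimate the scale function.

    Illustrative sketch: leave-one-out k nearest neighbor *median* of
    |epsilon_i(beta*)| (a robust smoother), squared, with the floor q_N
    added on the standard deviation scale.
    """
    x = np.asarray(x, float)
    a = np.abs(np.asarray(resid, float))
    g = np.empty(len(x))
    for i in range(len(x)):
        dist = np.abs(x - x[i])
        dist[i] = np.inf                      # c_N(x, x) = 0
        nbrs = np.argsort(dist)[:k]
        g[i] = (np.sqrt(q_N) + np.median(a[nbrs])) ** 2
    return g
```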
THEOREM 3: Consider linear regression in which the support of x is bounded. Since this is linear regression, β has dimension p. Assume that the fourth moment of ε(β) given x is bounded and that s(x,β) is strictly bounded away from zero. Suppose further that (A.4) through (A.13) hold and that

(D.1)    P[ P_N{ | v*(x)^{1/2} - v_0(x)^{1/2} |² } ] → 0 ;

(D.2)    β* is restricted to the set of values ( k N^{1/2} )^{-1} [ i_1, ..., i_p ]', where i_1, ..., i_p are integers ;

(D.3)    v̂(x)^{1/2} - v*(x)^{1/2} satisfies (A.2) ;

(D.4)    N^{1/2} P_N{ ( v̂(x)^{1/2} - v*(x)^{1/2} )² } →_p 0 ;

(D.5)    N^{1/2} P_N{ d(β) ε(β) v*(x)^{-1/2} [ v̂(x)^{-1/2} - v*(x)^{-1/2} ] } →_p 0 .

Then the limit distribution of generalized least squares is the same as if g were replaced by v_0.

(D.1) is the same as Robinson's key and powerful Lemma 8. (D.4) holds if the variance is a function of three or fewer predictors. Assumption (D.5) is essentially Robinson's key Proposition 2, substituting our | ε(β) | for his ε²(β) and our d(β)/v*(x)^{1/2} for his d(β).
SECTION 5: Semiparametric Models and Information Bounds

An alternative approach is optimal estimation in semiparametric models; see Begun, et al. (1983) and Bickel (1982). The easiest context for applying this theory is the location-scale model

y_i = f(x_i,β) + v_0^{1/2}(x_i) e_i .

Here, given x_i, the {e_i} are independent and identically distributed random variables with a known density function h(·), but the scaling function v_0(·) is unknown. Our previous results do not require that the e_i(β) be independent and identically distributed, conditionally on the x_i. The usual approach in the semiparametric literature is to find out how well one can do estimating β when v_0 is unknown, i.e., find a matrix I(h) such that

N^{1/2} ( β̂ - β ) → Normal( 0, S )    implies    S ≥ I(h)^{-1} .

The matrix I(h) is called the semiparametric information bound. Any estimator achieving this bound is said to be asymptotically efficient.
We can use the results of section 3 to construct asymptotically efficient estimators for β. Let ψ(w) = -(∂/∂w) log{ h(w) }, and suppose that h(·) is a symmetric density. Suppose further that the regression function f and the nonparametric regression estimators (g, v̂_0) satisfy (A.1)-(A.13), one of (B.1)-(B.2) with a(2) > 1.0, and (C.1)-(C.3). Then our estimators (2.1) are asymptotically efficient in the semiparametric sense, because they have the same limit distribution as maximum likelihood with v_0 known. As an added benefit of THEOREM 2, even if the errors are incorrectly specified but still symmetric, our estimators will behave as if v_0 were known.
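For illustration (these standard examples are not part of the original argument), the prescription ψ(w) = -(∂/∂w) log h(w) yields familiar scores:

```latex
\psi(w) \;=\; -\frac{\partial}{\partial w}\log h(w) \;=\;
\begin{cases}
w, & h(w) \propto e^{-w^{2}/2} \quad \text{(normal errors; generalized least squares)},\\[4pt]
\operatorname{sign}(w), & h(w) \propto e^{-|w|} \quad \text{(double exponential errors; a least absolute deviations type score)}.
\end{cases}
```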
That the information bound is the same as if v_0 were known is an easy informal calculation using the theory of Begun, et al. (1983). To apply their theory, set v_0(x) = σ² v*(x), where v* is a density with respect to a dominating measure. The calculations are immediate. It is also clear that if one is willing to assume symmetric, independent and identically distributed errors, then the information bound does not change even if the density h(·) is unknown. This program has already been carried out by Bickel (1982) for homoscedastic regression. It should be possible to construct asymptotically efficient estimators in this case. One can also contemplate h(·) being known and asymmetric or unknown and possibly asymmetric, but we leave this problem to others.
SECTION 6: Variance a Function of the Mean

Often, the variance can be modeled as a function of the mean response. Assume that the data are normally distributed with mean f(x,β) and variance modeled parametrically as σ² v_m( f(x,β), θ ). It is well known that generalized least squares estimates of β have the same limit distribution (1.7) as weighted least squares with known weights. Jobson & Fuller (1980) showed that the normal theory maximum likelihood estimate of β is, at the normal model, asymptotically more efficient than generalized least squares. However, the maximum likelihood estimate has some disadvantages. Generalized least squares has the robustness property that its asymptotic distribution is not dependent on assumptions about higher moments. McCullagh (1983) shows that this is not the case for the maximum likelihood estimate. It does not take a particularly nasty distribution before the maximum likelihood estimate becomes less efficient than generalized least squares. Carroll & Ruppert (1982b) note that the influence function of the maximum likelihood estimate is quadratic in the errors, and that unlike generalized least squares, the maximum likelihood estimate is not robust to a small misspecification of the variance function.

We have seen no real data examples where even the potential gain in efficiency for maximum likelihood is over 30%, and it is our belief that "asymptopia" occurs for the maximum likelihood estimate only for much larger sample sizes than necessary for generalized least squares.
With this background in mind, let us turn to the semiparametric regression model, a location-scale model with scale depending on the mean parameter. Referring for notation to (1.3), assume that the errors satisfy
(6.1)    ε_i(β) = v_0^{1/2}(x_i) e_i = v_m^{1/2}{ f(x_i,β) } e_i ,

where, given x_i, the {e_i} are independent and identically distributed. Each of the ways of writing the errors (6.1) has an interesting interpretation. The first indicates that the variances are not constant, so that we could apply the techniques of sections 2-4. This is almost certain to be inefficient if the dimension of x_i is of any size. The second way of writing the model suggests a variant of generalized least squares. Take the squared residuals from the current fit and regress them nonparametrically on the predicted values from the current fit. For symmetric errors and with stringent regularity conditions, Carroll (1982) showed that this form of generalized least squares has the limit distribution (1.7). It would be interesting to improve upon Carroll's result, but we leave this to another time.
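A computational sketch of this variant of generalized least squares is given below; it is illustrative only, with an arbitrary kernel smoother and default bandwidth standing in for whatever nonparametric regression of squared residuals on fitted values one prefers.

```python
import numpy as np

def gls_variance_of_mean(y, X, n_iter=3, bandwidth=None, q_N=1e-3):
    """Generalized least squares with variance estimated as a function of the mean.

    Illustrative sketch of the idea in the text: at each pass, smooth the squared
    residuals against the fitted values from the current fit, then reweight by
    the inverse of the smoothed variance.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # unweighted least squares start
    for _ in range(n_iter):
        mu = X @ beta
        r2 = (y - mu) ** 2
        h = bandwidth if bandwidth is not None else 0.5 * np.std(mu) + 1e-8
        g = np.empty(len(y))
        for j, m0 in enumerate(mu):                        # kernel smooth of r2 against mu
            k = np.exp(-0.5 * ((mu - m0) / h) ** 2)
            g[j] = q_N + np.sum(k * r2) / np.sum(k)
        W = np.diag(1.0 / g)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # reweighted fit
    return beta
```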
The second form of (6.1) also suggests a semiparametric model. We consider only the case that the errors (6.1) have known symmetric density h(·).
As in the parametric case, it does not follow that, having found an asymptotically efficient semiparametric estimator, one should use it. Our calculations outlined below indicate that for normally distributed data the efficient semiparametric estimator will suffer the drawbacks that the maximum likelihood estimator does in the parametric case (Carroll & Ruppert, 1982b). The efficient influence function is quadratic in the errors and hence the estimate will be sensitive to nonnormality. We conjecture that there is an analogue to the Carroll & Ruppert result, i.e., the estimators will also be affected by small misspecifications of the variance model. It is not clear that it is often the case that the increase in efficiency at the model can be achieved in reasonable size samples and for realistic problems.
The calculations we give here are informal, meant to indicate the form of the efficient influence function. We assume that the mean function is that of the linear model. Define the auxiliary functions

q(μ) = (1/2) (∂/∂μ) ln{ v_m(μ) } ;    t(μ) = E{ x | x'β = μ } ;

SC( e(β) ) = (∂/∂t) ln{ h(t) } |_{t = e(β)} ,

so that SC(·) is the usual score function in the univariate case. The derivatives of the loglikelihood with respect to β and σ² are

(6.2)    -q(μ) x { 1 + e(β) SC( e(β) ) } - x SC( e(β) ) / { σ² v_m*(μ) }^{1/2} ;

(6.3)    -( 1/(2σ²) ) { 1 + e(β) SC( e(β) ) } .

Using the notation of Begun, et al. (1983), if the density function of the data is f_d, the local derivative of the function v_m* is

(6.4)    2 ( A_1 b_1 ) / f_d^{1/2} .

This local derivative depends on the data only through μ and | e(β) |. As in Begun, et al. (1983) or Bickel & Ritov (1986), the efficient score function is defined by (6.5). Here d is to be chosen so that the expectation of the product of (6.5) and (6.4) is zero. Since (6.3) is itself a function only of μ and | e(β) |, it suffices to set d = 0. Thus, the efficient score function is
(6.6)    -q(μ) { x - E[x|μ] } { 1 + e(β) SC( e(β) ) } - x SC( e(β) ) / { σ² v_m*(μ) }^{1/2} .

The information bound I(h) is the covariance matrix of the efficient score (6.6), which is easy to compute because the two components of (6.6) are uncorrelated by symmetry. The covariance matrix of the second term in (6.6) is the information bound of section 5, where one did not know that the variance was a function of the mean. This means that there is extra information available when one knows that the variance is a function of the mean response. The only exception is when x = E[x|μ]. Since the efficient semiparametric score (6.6) differs from the usual score (6.2), this is a case where there is a loss of asymptotic efficiency for not knowing the form of the variance function.
For normally distributed errors, (6.6) shows that the efficient semiparametric score function is quadratic in the errors, not linear as is the case for generalized least squares. The same thing happens in the parametric case, and the same concerns we have raised previously apply. The possible gains in efficiency are less than in the parametric case. It is not clear to us that, when one is assuming h(·) is known, one should try to extract the extra information in the data.

We can carry out the same informal calculations when the density of the errors h(·) is symmetric and the density of the design s(·) is unknown.
The local derivatives for h and s are (respectively)

2 ( A_2 b_2 ) / f_d^{1/2} = 2 b_2( e(β) ) / h^{1/2}( e(β) ) ;

2 ( A_3 b_3 ) / f_d^{1/2} = 2 b_3( x ) / s^{1/2}( x ) .

These are orthogonal to (6.6), so the efficient score function and the information bound are the same as if h and s were known. While one can construct uniformly efficient estimates of β when the errors are symmetric, it would be more interesting to know if such estimates are actually efficient in samples of small to moderate size.
REFERENCES

Begun, J. M., Hall, W. J., Huang, W. M. & Wellner, J. A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Annals of Statistics 11, 432-452.

Bickel, P. J. (1975). One-step Huber estimates in the linear model. Journal of the American Statistical Association 70, 428-436.

Bickel, P. J. (1982). On adaptive estimation. Annals of Statistics 10, 647-671.

Bickel, P. J. & Ritov, Y. (1986). Efficient estimation in the errors in variables model. Annals of Statistics 14, 000-000.

Box, G. E. P. & Meyer, R. D. (1985). Dispersion effects from fractional designs. Technometrics 28, 19-28.

Carroll, R. J. (1982). Adapting for heteroscedasticity in linear models. Annals of Statistics 10, 1224-1233.

Carroll, R. J. & Ruppert, D. (1982a). A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. Journal of the American Statistical Association 77, 878-882.

Carroll, R. J. & Ruppert, D. (1982b). Robust estimation in heteroscedastic linear models. Annals of Statistics 10, 429-441.

Carroll, R. J., Ruppert, D. & Wu, C. F. J. (1986). Variance expansions and the bootstrap in generalized least squares. Preprint.

Collomb, G. & Haerdle, W. (1986). Strong uniform convergence rates in robust nonparametric time series analysis and prediction: kernel regression estimation from dependent observations. Stochastic Processes and Their Applications, to appear.

Davidian, M. & Carroll, R. J. (1986). An asymptotic theory of variance function estimators. Preprint.

Devroye, L. (1978). Uniform convergence of nearest neighbor regression function estimators and their applications in optimization. IEEE Transactions on Information Theory IT-24, 142-151.

Hinkley, D. V. (1985). Transformation diagnostics for linear models. Biometrika 72, 487-496.

Jobson, J. D. & Fuller, W. A. (1980). Least squares estimation when the covariance matrix and parameter vector are functionally related. Journal of the American Statistical Association 75, 176-181.

Mack, Y. P. (1981). Local properties of k-NN regression estimators. SIAM Journal on Algebraic and Discrete Methods 2, 311-323.

Marron, J. S. & Haerdle, W. (1986). Random approximations to some measures of accuracy in nonparametric curve estimation. Journal of Multivariate Analysis, to appear.

Matloff, N., Rose, R. & Tai, R. (1984). A comparison of two methods for estimating optimal weights in regression analysis. Journal of Statistical Computation and Simulation 19, 265-274.

McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics 11, 59-67.

Muller, H. G. & Stadtmuller, U. (1986). Estimation of heteroscedasticity in regression analysis. Preprint.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Robinson, P. M. (1986). Asymptotically efficient estimation in the presence of heteroscedasticity of unknown form. Econometrica, 000-000.

Rothenberg, T. J. (1984). Approximate normality of generalized least squares estimates. Econometrica 52, 811-825.
Appendix

The proof of LEMMA 1 is accomplished through a series of propositions. Let h(β,v) = d(x,β)/v(x)^{1/2} and r(β,v) = ε(β)/v(x)^{1/2}. Throughout, T_N will be used as a generic random variable under discussion.

PROPOSITION 1: As N → ∞,

N^{1/2} P_N{ h(β*,g) ψ( r(β*,g) ) } - T_N →_p 0 .
22
a Taylor series in p, with p
By
1
A
suffices to show that if N2 (P p -P)
=
°p (1)
on the line connecting p* and p, it
P
then
p
~1(r(pp,g)
PN{h(p*,g) [ h(Pp,g)
~1(r(p*,g)
- h(p*,g)
]}
a .
By assumption (A.4), it suffices to show that
d (p*)
(7.1)
-
d (P
P
) ]} .......... a
p
(7.2)
Since g(x)
~
qN
and
is
~
small,
(7.2)
follows
from
Cauchy-Schwarz, (7.1) follows from (A.7), (A. B) and since
PROPOSITION 2: Let s(x,ε) be any positive function. Then if P{ s²(x,ε) } < ∞,

T_N = P_N{ s(x,ε) g(x)^{-2} } = O_p(1) .

Proof: Since P_N{ s(x,ε)/v_0²(x) } = O_p(1), it suffices that

P_N{ s(x,ε) | g(x)^{-2} - v_0(x)^{-2} | } = o_p(1) .

Now if v_0^{-1} and v_0^{-2} are bounded by c > 1, then

| g(x)^{-2} - v_0(x)^{-2} | ≤ 2 c² | g(x) - v_0(x) | / q_N² .

It thus suffices to show that q_N^{-2} P_N{ s(x,ε) | g(x) - v_0(x) | } = o_p(1). Using Cauchy-Schwarz and assumptions (A.2) and (A.3) for γ sufficiently small, the result follows.
PROPOSITION 3: As N → ∞, T_N →_p 0 .

Proof: By a two term Taylor series of f_β done componentwise, it suffices that

(7.3)    P_N{ f_ββ(x,β) ψ( r(β,g) ) } →_p 0 ;

(7.4)    P_N{ f_βββ(x,β_p) ψ( r(β,g) ) } →_p 0 .

Assumption (A.13) is sufficient to prove (7.4). For (7.3), it suffices that

(7.5)    P_N{ f_ββ(x,β) [ ψ{ r(β,g) } - ψ{ r(β,v_0) } ] } →_p 0 ;

(7.6)    P_N{ f_ββ(x,β) ψ{ r(β,v_0) } [ g(x) - v_0(x) ] [ g(x) v_0(x) ]^{-1} } →_p 0 .

Because of (A.1), (A.9) and Cauchy-Schwarz, (7.6) follows if (7.7) holds. By adding and subtracting v(x) = v_0(x) + q_N in the numerator of (7.7) and then applying (A.2) and (A.3), (7.7) holds as long as

q_N^{-2} P_N{ | g(x) - v_0(x) |² } →_p 0 ,

which follows from PROPOSITION 2. It thus suffices to prove (7.5).

First consider the case that M_ψ(1) = ∞ in (A.4). Writing Δ_N(x) = | g(x) - v_0(x) |² / q_N², the square of (7.5) is bounded by

P_N{ v̂(x)^{-2} || f_ββ(x,β) ε(β) ||² } P_N{ Δ_N(x) } = C_N^{(1)} C_N^{(2)} .

By (A.9) and PROPOSITION 2, C_N^{(1)} = O_p(1). By (A.2)-(A.3), C_N^{(2)} →_p 0. Thus, (7.5) holds if M_ψ(1) = ∞, and it suffices to consider M_ψ(1) < ∞. Let ψ_Hc(v) = max( -c, min(v,c) ) be the Huber ψ function. PROPOSITION 2, (A.9) and Cauchy-Schwarz combine to show that it suffices that, for every c < ∞,

P_N{ ψ_Hc[ ε²(β) | g(x) - v_0(x) |² / { v_0(x) g(x) [ v_0(x)^{1/2} + g(x)^{1/2} ]² } ] } →_p 0 .

By (A.1), monotonicity of ψ_Hc and the fact that g(x) ≥ q_N, it suffices that

(7.8)    P_N{ ψ_Hc[ ε²(β) Δ_N(x) ] } →_p 0 .

Fix K > 0. Decompose the left hand side of (7.8) as A_{1N} + A_{2N}, where

A_{1N} = P_N{ ψ_Hc{ ε²(β) Δ_N(x) } I( ε²(β) < K ) } ≤ K P_N{ Δ_N(x) } →_p 0 ;

A_{2N} = P_N{ ψ_Hc{ ε²(β) Δ_N(x) } I( ε²(β) ≥ K ) } ≤ c P_N{ I( ε²(β) ≥ K ) } →_p c P{ I( ε²(β) ≥ K ) } .

If we let N → ∞ and then K → ∞, (7.8) follows. □
PROPOSITION 4: As N → ∞, T_N = o_p( N^{-2γ} ) .

Proof: Apply (A.1) to bound T_N in terms of constants c_1 and c_2; the proof is then completed by (A.2)-(A.3) and PROPOSITION 2.
PROPOSITION 5: As N → ∞, A_N(β*,g) →_p W_2 .

Proof: By (A.10) and (1.6), it suffices that

P_N{ h(β,g) h(β,g)' ψ_1( r(β,g) ) - h(β,v_0) h(β,v_0)' ψ_1( r(β,v_0) ) } →_p 0 .

By PROPOSITION 4 and (A.11),

P_N{ d(β) d(β)' ψ_1( r(β,g) ) [ g(x)^{-1} - v_0(x)^{-1} ] } →_p 0 ,

so it suffices that

(7.9)    P_N{ d(β) d(β)' [ ψ_1( r(β,g) ) - ψ_1( r(β,v_0) ) ] v_0(x)^{-1} } →_p 0 .

If M_ψ(2) = ∞ in (A.4), then by (A.12) and Cauchy-Schwarz it suffices that

P_N{ | g(x)^{-1/2} - v_0(x)^{-1/2} |² } →_p 0 .

For a constant c, by (A.1) this last term can be bounded, and it converges to zero by PROPOSITION 4. If M_ψ(2) < ∞ in (A.4), by (A.12) to prove (7.9) it suffices that (7.8) hold for every c < ∞, and (7.8) was proved in PROPOSITION 3. □
Proof of LEMMA 1: Combine PROPOSITIONS 1-5. □
PROPOSITION 6: Under (B.1) or (B.2) with a(2) > 1.0, as N → ∞,

T_N = N^{1/2} P_N{ d(β) ψ( r(β,g) ) [ g(x)^{1/2} - v̂(x)^{1/2} ] / [ g(x)^{1/2} v̂(x)^{1/2} ] } →_p 0 .

Proof: Applying Cauchy-Schwarz and choosing γ and δ small shows that it suffices that (7.10) is O_p(1). This follows from (A.11). □
Proof of THEOREM 1: By (1.4) and PROPOSITION 6, we must show that

T_N = N^{1/2} P_N{ h(β,v̂) [ ψ( r(β,g) ) - ψ( r(β,v̂) ) ] } →_p 0 .

This is routine for (B.1) and (B.2). □
Proof of THEOREM 2: It suffices that A_N^{(1)} →_p 0 and A_N^{(2)} →_p 0, where

A_N^{(1)} = N^{1/2} P_N{ d(β) ψ( ε(β)/v_0(x)^{1/2} ) [ v̂(x)^{-1/2} - v_0(x)^{-1/2} ] } ;

A_N^{(2)} = N^{1/2} P_N{ v̂(x)^{-1/2} d(β) [ ψ( ε(β)/v̂(x)^{1/2} ) - ψ( ε(β)/v_0(x)^{1/2} ) ] } .

Recall that v̂(x) is an even function of the errors. Because ψ is odd, { ε_i(β) } has a symmetric distribution and v̂ is even, it follows that for j = 1, 2, A_N^{(j)} is N^{1/2} times an average of N mean zero uncorrelated (not independent) random variables. Thus,

P || A_N^{(1)} ||² ≤ P[ P_N{ || d(β) ||² ψ²( ε(β)/v_0(x)^{1/2} ) | v̂(x)^{-1/2} - v_0(x)^{-1/2} |² } ] .

By (C.1) and Cauchy-Schwarz, A_N^{(1)} →_p 0 as long as (7.11) holds; this follows routinely from (A.1) and (A.2). Because A_N^{(2)} is N^{1/2} times a sum of uncorrelated random variables, we have by (A.4) and (C.1) that P | A_N^{(2)} |² → 0 as long as

(7.12)    P[ P_N{ T_N } ] → 0 ,    where    T_N = v̂(x)^{-1} min²[ M_ψ(1), | ε(β) | | v̂(x)^{-1/2} - v_0(x)^{-1/2} | ] .

Fix c > 0 and write T_N = T_N^{(1)} + T_N^{(2)}, where T_N^{(1)} = T_N I( | ε(β) | > c ) and T_N^{(2)} = T_N I( | ε(β) | ≤ c ). Because of (A.1) and (C.2), and mimicking the proof of PROPOSITION 2,

(7.13)    P[ P_N{ v̂(x)^{-2} } ] < ∞ .

We thus need only show that

(7.14)    lim_{c→∞} lim_{N→∞} P[ P_N{ T_N^{(1)} } ] = 0 .

By Cauchy-Schwarz,

P[ P_N{ T_N^{(1)} } ] ≤ [ P{ I( | ε(β) | > c ) } ]^{1/2} [ P[ P_N{ v̂(x)^{-2} } ] ]^{1/2} .

Invoking (7.13), equation (7.14) now follows. We thus need to verify (7.12) when M_ψ(1) = ∞. By (C.2)-(C.3) and a proof similar to that of PROPOSITION 2, (7.12) follows by applying Cauchy-Schwarz and (7.11). □
PROOF OF THEOREM 3 (Sketch): As in Bickel (1982, page 653), from (D.2) it suffices to prove all steps with β* replaced by β*_N = β + Δ_N / N^{1/2}, where Δ_N → Δ_0 is a deterministic sequence. Using compactness and the boundedness of v*,

| g(x) - v̂(x) | ≤ C N^{-1/2} { 1 + | v̂(x)^{1/2} - v*(x)^{1/2} | } ,

so that (A.3) holds with a(2) ≥ 1.0 and v_0 replaced by v*. This assures that LEMMA 1 holds. To prove THEOREM 1, it suffices to prove PROPOSITION 6, as the other step is similar. Write T_N as in PROPOSITION 6 and

S_N = N^{1/2} P_N{ v*(x)^{-3/2} d(β) ε(β) [ g(x)^{1/2} - v̂(x)^{1/2} ] } .

Then, since d(β) = d(x,β) = x is bounded and since we have replaced β* by β*_N, for some c > 0,

| g(x)^{1/2} - v̂(x)^{1/2} | ≤ c N^{-1/2} ,

so that | T_N - S_N | →_p 0 by (D.4) and (A.6). Thus, to prove an analogue to THEOREM 1, it suffices that S_N →_p 0. Using Bickel's trick and the boundedness of d(x,β) = x and the conditional fourth moment of ε(β), one shows directly that the covariance of S_N converges to zero, as desired.

The next step in the proof of THEOREM 3 is to show that Q_N →_p 0. By algebra, Q_N = C_N^{(1)} + C_N^{(2)}, where

C_N^{(1)} = N^{1/2} P_N{ d(β) ε(β) v*(x)^{-1/2} [ v̂(x)^{-1/2} - v*(x)^{-1/2} ] } .

These two terms converge to zero by (D.5) and (D.4) respectively. We are finished once we show that the remaining term converges in probability to zero. Note that since s(x,β) is bounded away from zero, so too is v*(x). The proof is completed by noting that this last random variable is a mean zero random variable whose covariance converges to zero by (D.1). □