Giltinan, David M. "Bounded Influence Estimation in Heteroscedastic Linear Models."

BOUNDED INFLUENCE ESTIMATION IN HETEROSCEDASTIC LINEAR MODELS

David M. Giltinan

This research was supported by the Air Force Office of Scientific Research, Contract AFOSR F49620-82-C-0009.

DAVID M. GILTINAN. Bounded Influence Estimation in Heteroscedastic Linear Models. (Under the direction of RAYMOND J. CARROLL and DAVID RUPPERT.)
In a heteroscedastic linear model, if there is no replication, the usual approach is to model the variance. A common situation is that where the variance is modelled as a function of the mean response, or of a subset of the explanatory variables. Maximum likelihood estimation of the variance parameter in such models is very sensitive to extreme data points. In this dissertation alternative 'bounded influence' methods are developed which overcome this problem.

The model for the variance which is considered is given by the relationship $\sigma_i = \exp[h'(\tau_i)\theta]$, where $h$ is a known vector-valued function and $\theta$ is the unknown variance parameter.
Using this model, there is a parallel between the problem of developing efficient bounded influence estimators for $\theta$ and that of developing efficient bounded influence estimators for the regression parameter in the homoscedastic regression case. Extension of several such methods of estimating $\beta$ to include the problem of estimating $\theta$ is discussed.
The method proposed by Krasker (1980) is generalized to include estimation of $\theta$ and shown to be optimal in a certain sense. The question of existence of a solution to the estimating equations is discussed.

The methods of Krasker and Welsch (1982) in the homoscedastic regression case are extended to cover estimation of $\theta$. A three-stage weighted regression estimate based on these techniques is proposed and the associated influence calculations are presented.
The class of bounded influence regression estimators proposed by Mallows (1975) is considered. A necessary condition for a strongly optimal weight function is derived and the resulting estimator of $\beta$ is generalized to obtain an estimator of $\theta$. Mallows estimators are shown to possess a stability of variance which is absent in Krasker-Welsch estimators. A three-stage Mallows regression estimate is proposed and its influence function is given.

Computation of the three-stage estimates is discussed, and the performance of these methods is evaluated by using them in the analysis of a number of data sets.
ACKNOWLEDGEMENTS
I would like to thank my advisers, Raymond Carroll and David Ruppert, for introducing me to the topics considered in this work and for their guidance and encouragement throughout its preparation.

I also wish to thank my committee members, Dr. Barry Margolin, Dr. Stamatis Cambanis and Dr. David Guilkey, for their help.

I am grateful to the entire Statistics Department for their support during my stay at the University of North Carolina at Chapel Hill. In particular, I would like to thank Ms June Maxwell and Ms Jane Wille for their moral support throughout.

Special thanks to Ms Ruth Bahr for her moral support and for her heroic work in typing the manuscript.

Finally, I want to thank my family for their continued encouragement throughout my education.

This work was supported in part by a Graduate Fellowship from the Morehead Foundation and by the US Air Force Office of Scientific Research under Grant AFOSR F49620-82-C-0009.
TABLE OF CONTENTS

CHAPTER I. INTRODUCTION AND PRELIMINARIES
1.0 Introduction and overview
1.1 The model
1.2 Literature review
1.3 Summary

CHAPTER II. HAMPEL-KRASKER ESTIMATION OF θ
2.0 Introduction
2.1 Hampel-Krasker estimation of θ (I)
2.2 Hampel-Krasker estimation of θ (II)
2.3 Details

CHAPTER III. KRASKER-WELSCH ESTIMATION OF θ
3.0 Introduction
3.1 Krasker-Welsch estimation of θ (I)
3.2 Krasker-Welsch estimation of θ (II)
3.3 A 3-stage Krasker-Welsch estimation procedure

CHAPTER IV. MALLOWS ESTIMATES
4.0 Introduction
4.1 Mallows estimation of β
4.2 Mallows estimation of θ
4.3 Stability of variance
4.4 A 3-stage Mallows estimation procedure

CHAPTER V. COMPUTATION AND APPLICATIONS
5.0 Introduction
5.1 Computational aspects
5.2 Example 1 - Glucose data
5.3 Example 2 - Salinity data
5.4 Example 3 - Gas vapor data
5.5 Conclusions
CHAPTER I
INTRODUCTION AND PRELIMINARIES
1.0 Introduction and overview
The single most widely used technique for estimating the unknown
regression coefficients in the standard linear model is undoubtedly the
method of least squares.
There are several good reasons for this; least
squares estimates are easy to compute and enjoy certain optimality properties.
Under standard assumptions, the ordinary least squares estimator is
the best linear unbiased estimator, and, if normality of errors is assumed,
it is also the maximum likelihood estimator.
Unfortunately, if the under-
lying assumptions about the data are not met exactly, use of least squares
techniques can be disastrous.
In particular, the existence of one, or more,
erroneous data points can have a considerable effect on the least squares
estimate.
This undesirable sensitivity to a small amount of 'bad' data is not confined to least squares estimators.
More generally, any maximum likelihood
estimator is based upon certain assumptions about the form of the likelihood
underlying the data, and the optimality properties of the MLE typically depend quite heavily on the validity of these assumptions.
Use of maximum like-
lihood methods when some of the data may be in error is potentially disastrous.
This is unsettling, since in practice, some fraction of bad data is the rule
rather than the exception
(Hampel, 1983, oral communication).
We shall be concerned with the problem of estimating the regression parameters in heteroscedastic linear models. If no replication is assumed, the usual approach is to model the form of the heteroscedasticity in some way, so that the variances depend on a few parameters. Typically, these parameters are not known and have to be estimated.
It is clearly desirable
to do this in a manner which overcomes the disadvantage of maximum likelihood methods mentioned above.
One possibility is to develop diagnostic techniques to identify influential data points, i.e. points which have a large effect on inferences made from the data. Such techniques might then be implemented in conjunction with standard estimation procedures.
A second ap-
proach, and the one which we shall pursue in the present work, is to develop
estimation procedures which automatically limit the influence of any small
fraction of the data.
Of course this cannot be done without paying a price.
Typically the price is a loss of efficiency of the estimates at the underlying theoretical model.
Obviously, it is desirable that this loss of ef-
ficiency be as small as possible.
We shall be interested, then, in developing estimators in heteroscedastic linear models which bound the influence of any small subset of the data,
and which do so as efficiently as possible.
In achieving this goal, we take
as our starting point certain types of bounded influence estimators which
have been proposed for homoscedastic linear models, and we extend them to
our situation of interest, modifying them where appropriate.
1.1 The model
We consider the model that, given $x_i$,

$$y_i = x_i'\beta + \sigma_i\varepsilon_i, \qquad i = 1, \ldots, n \qquad (1.1.1)$$

where $\beta$ is a $p \times 1$ vector of regression parameters to be estimated, $x_i$ is a $p \times 1$ vector, and the observations $(x_i, y_i)$ are assumed to be $n$ independent drawings from a $(p+1)$-variate distribution $H$. We assume that $\varepsilon_i = (y_i - x_i'\beta)/\sigma_i$ is distributed independently of $x_i$. Denoting the distribution of $\varepsilon$ by $F$, we assume that $F$ is symmetric with mean zero and variance 1. The quantities $\sigma_i^2$ are the conditional variances of the $y_i$ given $x_i$ and reflect the heterogeneity of variance in the model.
The particular model used to describe the heteroscedasticity is typically suggested by the data. One common situation in practice is that where the variance $\sigma_i^2$ increases as the mean $\tau_i$ ($= x_i'\beta$) does. This has been modelled in a variety of ways in the literature, e.g.

$$\sigma_i = \sigma|\tau_i|^\lambda \quad \text{(Box and Hill, 1974)}$$
$$\sigma_i = \sigma\exp(\lambda\tau_i) \quad \text{(Bickel, 1978)} \qquad (1.1.2)$$
$$\sigma_i = \sigma(1 + \lambda\tau_i^2)^{1/2} \quad \text{(Jobson and Fuller, 1980)}$$

Following Carroll and Ruppert (1982b), we shall consider this general model for the variance:

$$\sigma_i = \exp[h'(\tau_i)\theta] \qquad (1.1.3)$$

where $h$ is a known $\mathbb{R}^q$-valued function of the mean $\tau$, and $\theta$ is an unknown $q \times 1$ vector of variance parameters. The formulation in (1.1.3) contains the first two models of (1.1.2).
For instance, the model $\sigma_i = \sigma|\tau_i|^\lambda$ may be written as $\sigma_i = \exp[\theta_1 + \theta_2\log|\tau_i|]$ with $h(\tau) = (1, \log|\tau|)'$ and $\theta = (\log\sigma, \lambda)'$.
In some situations, the variance might be a function of one, or more, of the explanatory variables. Analogously to (1.1.3), this situation can be modelled as:

$$\sigma_i = \exp[h'(x_i)\theta] \qquad (1.1.4)$$

where $h$ is now a known function from $\mathbb{R}^p$ to $\mathbb{R}^q$. For example, if we believed that the variance increased as a function of a single explanatory variable, $x_{i1}$ say, this could be modelled as $\sigma_i = \sigma|x_{i1}|^\lambda$, analogously to the Box-Hill model. This model is clearly subsumed in the more general formulation in (1.1.4).
In what follows, we shall concentrate for the main part on the model specified by (1.1.3), and shall not state results and conclusions explicitly for the model in (1.1.4). Typically, results for the second model are quite analogous to, and, in fact, easier than, those for the first.
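To fix ideas, the following sketch evaluates the variance model (1.1.3) for the power-of-the-mean special case $h(\tau) = (1, \log|\tau|)'$, $\theta = (\log\sigma, \lambda)'$ noted above. It is an illustration of ours, not part of the original text; it assumes NumPy, and the function names are hypothetical.

```python
import numpy as np

def h_power(tau):
    """h(tau) = (1, log|tau|)', the choice that embeds the Box-Hill model in (1.1.3)."""
    return np.column_stack([np.ones_like(tau), np.log(np.abs(tau))])

def sigma_model(tau, theta):
    """Variance function (1.1.3): sigma_i = exp[h'(tau_i) theta]."""
    return np.exp(h_power(tau) @ theta)

# With theta = (log sigma, lambda)' this reproduces sigma * |tau|^lambda exactly.
tau = np.array([1.0, 4.0, 9.0])
theta = np.array([np.log(2.0), 0.5])
print(sigma_model(tau, theta))       # [2. 4. 6.]
print(2.0 * np.abs(tau) ** 0.5)      # identical
```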
Remark: The assumption of random $x$'s is a usual one in bounded influence studies (compare Krasker and Welsch, 1982; Maronna, Bustos and Yohai, 1979). It is largely a mathematical convenience, as shall become evident later; it simplifies the calculation of certain influence functions. At the cost of extra complexity, one can forgo this assumption and work with a more general definition of influence function (for non-i.i.d. data, see Huber, 1983). This more general definition yields the same expressions for the influence function which we consider, and thus, bounding either type of influence function leads to the same estimation procedures. From a statistical viewpoint the assumption is not a crucial one - the estimators which we derive later are conditional on the observed $x_i$'s and are thus equally sensible in a model where the $x$'s are fixed. We shall denote the marginal distribution of $x$ by $G$ and make no assumptions on $G$ other than that it has a finite second moment.
1.2 Literature review
Influence and sensitivity
Since we shall later be concerned with limiting the effect that a single observation $(x_j, y_j)$ may have on inference, we begin by considering the problem of assessing the influence of an observation $(x_j, y_j)$ on a parameter estimate. Implicit in what follows is the idea that the method of estimation considered is the same for each sample size.
Cook (1979) has proposed the following approach. Suppose $\hat\beta$ denotes an estimate of $\beta$ based on the full sample of size $n$; let $\hat\beta_{(j)}$ denote the estimate computed from the reduced sample with the $j$th observation deleted. An empirical measure of the effect of the $j$th observation on the estimate is the change, $\hat\beta - \hat\beta_{(j)}$, or some suitably scaled version thereof. Cook describes a number of such empirical measures for the case of least squares estimation of $\beta$. In this particular case, due to the linear nature of the estimates, certain simplifications apply which reduce the amount of computation; in general, however, such measures will require $n+1$ separate computations of the estimate, and for this reason we do not consider them further.
In a recent (1983) paper Cook and Weisberg consider the problem of detecting heteroscedasticity in the usual linear model and develop a class of score tests for heteroscedasticity. These tests are quite sensitive to influential data points. Cook and Weisberg suggest graphical methods for locating points which have a large influence on the test statistic; such methods are less useful if the number of explanatory variables is large, however. Cook and Weisberg do not address the question of estimation in such models, commenting merely that it is a topic of current research.
A useful way of characterizing the influence of a point on a parameter estimate is by means of the influence function, first introduced by Hampel (1968, 1974). The development here follows his 1974 paper.

Let $T$ be a mapping into $\mathbb{R}^p$, defined on the space $M$ of probability distributions on $\mathbb{R}^{p+1}$. Assume that the estimator being considered may be written as this functional of the empirical distribution function, and that $T(H) = \beta$, the true parameter value. Contaminate $H$ slightly by a point mass at the point $(x, y)$ by defining (for $0 \le \gamma \le 1$)

$$H_\gamma = (1 - \gamma)H + \gamma\,\delta_{(x,y)}.$$

Then the influence function of the estimator $T$ at the point $(x, y)$ and underlying distribution $H$ is defined pointwise as the Gateaux derivative

$$IF_T(x, y; H) = \lim_{\gamma \to 0} \frac{T(H_\gamma) - T(H)}{\gamma}$$

if this limit is defined for every $(x, y) \in \mathbb{R}^{p+1}$. We shall sometimes write $IF_T(x, y; \beta)$, sometimes $IF_{\hat\beta}(x, y)$, where the latter is understood to be defined at $H$. The value of the influence function at $(x, y)$ may be regarded as an approximate measure of the normalized effect on the estimate of adding a small amount of contamination at $(x, y)$.

Under reasonable regularity conditions,

$$n^{1/2}(\hat\beta_{(n)} - \beta) - n^{-1/2}\sum_{i=1}^n IF(x_i, y_i) \to_p 0$$

and

$$n^{1/2}(\hat\beta_{(n)} - \beta) \to_d N\big(0,\; E[IF(x,y)\,IF'(x,y)]\big).$$

We therefore call $V \equiv E[IF(x,y)\,IF'(x,y)]$ the asymptotic covariance matrix.
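The Gateaux derivative above can be approximated numerically by a finite difference. The sketch below is our illustration (not the dissertation's software): it takes $T$ to be the least squares functional, contaminates the empirical distribution with a small mass $\gamma$ at a chosen point $(x_0, y_0)$, and rescales the resulting change in the estimate. It assumes NumPy.

```python
import numpy as np

def ls_functional(X, y, w):
    """Least squares beta as a functional of a weighted empirical distribution."""
    XW = X * w[:, None]
    return np.linalg.solve(XW.T @ X, XW.T @ y)

def empirical_influence(X, y, x0, y0, gamma=1e-4):
    """Finite-difference Gateaux derivative: [T(H_gamma) - T(H_n)] / gamma,
    with H_gamma = (1 - gamma) H_n + gamma * delta_{(x0, y0)}."""
    n = len(y)
    base = ls_functional(X, y, np.full(n, 1.0 / n))
    Xc = np.vstack([X, x0])
    yc = np.append(y, y0)
    w = np.append(np.full(n, (1.0 - gamma) / n), gamma)
    return (ls_functional(Xc, yc, w) - base) / gamma

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
# A high-leverage point with a large residual has large empirical influence:
print(empirical_influence(X, y, np.array([1.0, 5.0]), 30.0))
```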
If one is to have stable inference, then it is desirable that no small subset of the data have a very large effect on the (asymptotic) variances of the parameter estimates. We would like, therefore, to be able to assess (and control) the influence of a single observation $(x_j, y_j)$ on the standard errors of $\hat\beta$ as well as on $\hat\beta$ itself. One way of describing this influence is by means of the change-of-variance function, first introduced in the single-parameter case by Hampel, Rousseeuw and Ronchetti (1981) and later extended to the multivariate case by Ronchetti and Rousseeuw (1983).
Essentially, the change-of-variance function is just the influence function of the asymptotic covariance matrix. We may view the matrix $V = E_H(IF\,IF')$ as a functional defined on the space of distribution functions on $\mathbb{R}^{p+1}$, i.e. $V = V(H)$. Then the change-of-variance function at $(x, y)$ and the underlying distribution $H$ is defined as the directional derivative towards the point mass at $(x, y)$:

$$CVF(x, y; H) = \lim_{\gamma \to 0} \frac{V(H_\gamma) - V(H)}{\gamma}$$

(where $H_\gamma = (1 - \gamma)H + \gamma\,\delta_{(x,y)}$ as before), if this limit is defined for every $(x, y) \in \mathbb{R}^{p+1}$. The value of the change-of-variance function at $(x, y)$ may be taken as an approximate measure of the effect on $V$ of an infinitesimal amount of contamination at $(x, y)$.
The maximum influence that a single point $(x, y)$ can have on the parameter estimate $\hat\beta$ is of interest. One measure, suggested by Hampel (1974), is the gross-error-sensitivity

$$\gamma_1 = \sup_{(x,y)} |IF(x, y)|.$$
Krasker and Welsch (1982) point out that this definition of sensitivity is not invariant to a change of coordinate system. They propose two alternative definitions of sensitivity which circumvent this lack of invariance. If prediction is of interest, the influence of $(x, y)$ on the linear combination $\lambda'\hat\beta$ is measured by $\lambda'IF(x, y)$ and would be potentially troublesome if large relative to the variability $(\lambda'V\lambda)^{1/2}$. Thus a natural standardized measure of sensitivity is

$$\gamma_2 = \sup_{(x,y)}\,\sup_{\lambda \ne 0}\,\frac{|\lambda'IF(x, y)|}{(\lambda'V\lambda)^{1/2}} = \sup_{(x,y)}\left[IF'(x, y)\,V^{-1}\,IF(x, y)\right]^{1/2}.$$

Stahel (1981) calls $\gamma_2$ the 'self-standardized sensitivity' because the estimator's influence function is normed by the covariance matrix of the estimator itself.
If one is concerned with the maximum influence a point can have on its own fitted value, one might consider

$$\gamma_3 = \sup_{(x,y)} |x'\,IF(x, y)|.$$
In a similar way, one can measure the maximum influence that a single point $(x, y)$ can have on the asymptotic variance of the estimate. Ronchetti and Rousseeuw (1983) suggest two possible definitions of a change-of-variance sensitivity. They define the unstandardized change-of-variance sensitivity as:

$$\kappa_1 = \sup_{(x,y)} \frac{\operatorname{trace}\,CVF(x, y)}{\operatorname{trace}\,V}.$$

An equivalent formulation is

$$\kappa_1 = \sup_{(x,y)}\left[\frac{\partial}{\partial\gamma}\log\{\operatorname{trace}\,V(H_\gamma)\}\right]_{\gamma = 0}.$$

Remark that trace $V$ is the asymptotic mean squared error of the estimate.
One can also define an invariant measure of CV-sensitivity, the (self-)standardized change-of-variance sensitivity

$$\kappa_2 = \sup_{(x,y)} \operatorname{trace}\left(CVF(x, y)\,V^{-1}\right).$$

In both cases the supremum is taken over the set of $(x, y)$ for which the change-of-variance function is defined. Note that these definitions do not involve absolute values, implying that sharp decreases in variance/mean squared error due to contamination at $(x, y)$ are not of tremendous concern.
It has been argued by Welsch (1983, oral communication) that, for stable inference, a small amount of contamination should result in neither a large inflation nor deflation of variance. We shall refer to this issue again later.

Our primary goal shall be to develop efficient estimators which have finite gross-error-sensitivity; however, as shall become evident later, in certain situations estimators developed with a view to bounding $\gamma_2$ also bound $\kappa_2$.

Henceforth, when we write 'sensitivity' we shall understand it to mean gross-error-sensitivity; if change-of-variance sensitivity is in question we shall refer to the CV-sensitivity. An estimator with finite (gross-error) sensitivity will be referred to as a bounded influence (BI) estimator.
Estimation in heteroscedastic models
A number of estimation procedures have been proposed for models similar to the one described by (1.1.1) and (1.1.3). Typically a preliminary estimate of $\beta$ is obtained, and the residuals from the preliminary fit are used to gain information about the variances. This information is then used to obtain a more efficient estimate of $\beta$.
For example, Box and Hill (1974) and Jobson and Fuller (1980) both suggest a form of generalized least squares. One obtains estimates of $\hat\beta_p$ and $\hat\theta$, constructs estimated weights $\hat\sigma_i^{-1} = \exp[-h'(x_i'\hat\beta_p)\hat\theta]$, and then performs ordinary weighted least squares. Jobson and Fuller have also suggested a 'feedback' procedure, which uses the information gained in the final estimate of $\beta$ in the terms $\hat\sigma_i = \exp[h'(x_i'\hat\beta)\hat\theta]$. In the case of normal errors, this essentially reduces to maximizing the likelihood jointly in $\beta$ and $\theta$.
Apart from the intrinsic lack of robustness in any maximum likelihood procedure (the estimates will clearly be adversely affected by outliers or non-normal error distributions), a second form of robustness should also be considered in this situation, namely robustness against misspecification of the function $h$ in the model for the variance. In a comparison between joint maximum likelihood and generalized least squares in this type of heteroscedastic model, Carroll and Ruppert (1982b) caution against use of the joint MLE on the grounds that it lacks functional robustness - it is much more sensitive to small misspecifications in the functional relationship between $\sigma_i$ and $\theta$ than the generalized least squares estimate. However, even the GLS estimate will be sensitive to erroneous data, since least squares methods are known to be highly non-robust.
Carroll and Ruppert (1982b) suggest a modification which gives partial protection against this danger. They show that using any $n^{1/2}$-consistent preliminary estimate of $\beta$ and an $n^{1/2}$-consistent estimate of $\theta$ yields estimated weights which are satisfactory, in the sense that the final estimate for $\beta$ obtained using these estimated weights is asymptotically equivalent to the "optimal" weighted estimate of $\beta$ (i.e. the estimate obtained if the true weights, $\sigma_i^{-1}$, were known). Thus, in particular, one could use a robust preliminary estimate of $\beta$.
For estimating $\theta$ they suggest the following procedure: the log-likelihood for $\theta$ is, up to a constant,

$$\ell(\theta) = -\sum_{i=1}^n h'(x_i'\beta)\theta - \frac{1}{2}\sum_{i=1}^n\left(\frac{y_i - x_i'\beta}{\exp[h'(x_i'\beta)\theta]}\right)^2.$$

Taking derivatives with respect to $\theta$ suggests that one solve

$$\sum_{i=1}^n\left\{\left(\frac{y_i - x_i'\beta}{\exp[h'(x_i'\beta)\theta]}\right)^2 - 1\right\}h(x_i'\beta) = 0. \qquad (1.2.1)$$

Since solving (1.2.1) leads to an influence function which is quadratic in $\varepsilon \equiv (y - x'\beta)\exp[-h'(x'\beta)\theta]$ and thus unbounded, Carroll and Ruppert suggest robustifying the estimation procedure by replacing the term in braces in (1.2.1) by a bounded function $\chi(\varepsilon)$. In practice one would also replace $x_i'\beta$ by $x_i'\hat\beta_p$ in (1.2.1), i.e. they suggest solving

$$\sum_{i=1}^n \chi\big((y_i - x_i'\hat\beta_p)\exp[-h'(x_i'\hat\beta_p)\theta]\big)\,h(x_i'\hat\beta_p) = 0 \qquad (1.2.2)$$

to obtain $\hat\theta$, where $\chi$ is an even function with $\chi(0) < 0$, $\chi(\infty) > 0$. They suggest a choice of $\chi$ which generalizes the classical Huber Proposal 2 for the homoscedastic case.
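To illustrate (1.2.2) concretely, the sketch below (ours, under stated assumptions, not the authors' code) solves the robustified estimating equation for a scalar variance parameter. We take $\chi(\varepsilon) = \min(\varepsilon^2, c^2) - \kappa(c)$ with $\kappa(c) = E\min(Z^2, c^2)$, a Proposal-2-type choice which is even and satisfies $\chi(0) < 0$, $\chi(\infty) > 0$; the equation is then solved by bracketing, assuming SciPy.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def kappa(c):
    """E min(Z^2, c^2) for Z ~ N(0,1); centers chi so that E chi(Z) = 0."""
    return 2 * norm.cdf(c) - 1 - 2 * c * norm.pdf(c) + 2 * c**2 * norm.sf(c)

def chi(eps, c=1.345):
    """Bounded, even chi with chi(0) < 0 and chi(inf) > 0 (Proposal-2 type)."""
    return np.minimum(eps**2, c**2) - kappa(c)

def solve_theta(resid, h, c=1.345):
    """Solve sum_i chi(r_i exp(-theta h_i)) h_i = 0 for scalar theta (eq. (1.2.2), q = 1)."""
    def score(theta):
        eps = resid * np.exp(-theta * h)
        return np.sum(chi(eps, c) * h)
    return brentq(score, -10.0, 10.0)   # score > 0 at -10, < 0 at 10 when h_i > 0

rng = np.random.default_rng(1)
h = np.abs(rng.normal(1.0, 0.3, size=200))              # h_i = h(x_i' beta), kept positive
resid = np.exp(0.7 * h) * rng.normal(size=200)          # r_i = sigma_i * eps_i, theta = 0.7
print(solve_theta(resid, h))                            # roughly 0.7
```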
It follows from the general theory of M-estimates that the solution of (1.2.1) has influence function which is proportional to $(\varepsilon^2 - 1)h(x'\beta)$. Clearly this may be arbitrarily large in absolute value if either $|\varepsilon|$ or $|h(\tau)|$ is large, where $\tau = x'\beta$. The estimate suggested by Carroll and Ruppert, with influence function proportional to $\chi(\varepsilon)h(\tau)$, bounds the effect of large $\varepsilon$, but will not, in general, have finite sensitivity, since the factor of $h(\tau)$ may still lead to unbounded influence. A moderate residual occurring in conjunction with a large value of $|h(\tau)|$ may still have quite an effect on $\hat\theta$. Clearly, if we want to protect against this, our estimator for $\theta$ must limit not only the effect of large residuals, but also of large values of $|h(\tau)|$.
The following example illustrates the desirability of bounding the effect of large values of $|h(\tau)|$ as well as of large residuals. In a data set discussed by Leurgans in the Biostatistics Casebook (1980), 46 pairs of measurements $(x_i, y_i)$ were obtained, representing levels of glucose concentration in blood as measured by a test method ($Y$) and a reference method ($X$). A residual plot from a least squares fit of $Y$ against $X$ gives clear evidence of heteroscedasticity - the variance increases with the fitted value. Modelling the variance by $\sigma_i = \sigma|\tau_i|^\lambda$ and using maximum likelihood methods to estimate $\lambda$ resulted in an estimate of $\lambda_{MLE} = 0.52$. However, if the point with the lowest $X$-value and the lowest $Y$-value is deleted from the analysis, the MLE of $\lambda$ based on the reduced data is 0.86 - deletion of just a single point has a tremendous effect on the estimate. Use of the method proposed by Carroll and Ruppert gave estimates of $\lambda$ as 0.72 and 0.87 based on the full and reduced data sets respectively - bounding the effect of a large residual narrows the gap, but if a really stable estimate of $\lambda$ is desired, then some attempt must be made to counter the effect of large values of $|h(\tau)|$.
This is analogous to the problem of estimating $\beta$ in the homoscedastic regression case. In this situation, the least squares estimate is the solution of:

$$\sum_{i=1}^n (y_i - x_i'\beta)\,x_i = 0 \qquad (1.2.3)$$

and, writing $\varepsilon$ for $(y - x'\beta)/\sigma$, has influence function proportional to $\varepsilon x$. The influence function will be large in absolute value if either $|\varepsilon|$ or $|x|$ is large.
Early attempts to robustify the L.S. estimate against the influence of "extreme" data points considered minimization of functions of the residuals $(y - x'\beta)/\sigma$ which grow more slowly than $[(y - x'\beta)/\sigma]^2$. For instance, Huber (1973) defines

$$\rho_c(t) = \begin{cases} t^2/2, & |t| \le c \\ c|t| - c^2/2, & |t| > c \end{cases}$$

and suggests minimizing

$$d\sigma + \sigma\sum_{i=1}^n \rho_c\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)$$

with respect to $\beta$ and $\sigma$, where $d$ is a constant chosen to give consistency of the scale estimate. The resultant influence function is proportional to $\psi_c(\varepsilon)\,x$, where

$$\psi_c(t) = \rho_c'(t) = \begin{cases} t, & |t| \le c \\ c\,\operatorname{sgn} t, & |t| > c \end{cases}$$

is the Huber psi-function. While $\psi_c(\cdot)$ limits the effect of large residuals, and is thus "robust" in this sense, the influence of $(x, y)$ can still be arbitrarily large if $x$ is an outlier. Clearly, any estimator which is to have a bounded influence function must limit not only the effect of large residuals, but also the effect of outliers in the $x$-space. One problem with this is that outliers in the $x$-space increase the efficiency of most estimation procedures - any downweighting of such outliers will typically result in reduced efficiency. There is a tradeoff between efficiency and sensitivity - the usual approach in the regression case has been to attempt to maximize efficiency subject to a bound on the sensitivity.
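For concreteness, a NumPy rendering of $\rho_c$ and $\psi_c$ (our sketch; the tuning constant $c = 1.345$ is a conventional default, not a value fixed by the text):

```python
import numpy as np

def huber_rho(t, c=1.345):
    """Huber's rho: quadratic for |t| <= c, linear beyond; grows more slowly than t^2."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= c, 0.5 * t**2, c * np.abs(t) - 0.5 * c**2)

def huber_psi(t, c=1.345):
    """psi_c = rho_c': the identity truncated at +/- c."""
    return np.clip(t, -c, c)

t = np.linspace(-4, 4, 9)
print(huber_rho(t))
print(huber_psi(t))   # bounded by c, so large residuals get bounded influence
```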
Bounded influence methods for homoscedastic regression
A number of methods have been proposed in the literature for obtaining efficient bounded influence regression estimates. Krasker (1980) considers the class of M-estimates for $\beta$ defined by an equation of the form

$$\sum_{i=1}^n \eta(x_i, y_i; \hat\beta) = 0 \qquad (1.2.4)$$

where $\eta: \mathbb{R}^p \times \mathbb{R} \times \mathbb{R}^p \to \mathbb{R}^p$ satisfies the Fisher consistency requirement

$$E_H\,\eta(x, y; \beta) = 0. \qquad (1.2.5)$$

He considers the problem of choosing $\eta$ optimally, subject to a given bound $a$ on the sensitivity $\gamma_1$ of the estimator. The 'optimal' $\eta$ is taken to mean that $\eta$ which minimizes the trace of the asymptotic covariance matrix of $\hat\beta$ (or, equivalently, that $\eta$ which minimizes the asymptotic expectation of $|\hat\beta - \beta|^2$). Stahel (1981) has shown that one cannot obtain an estimator which is strongly optimal for this class - i.e. one which minimizes $V_\beta$ over the whole class. Using a result of Hampel (1978), Krasker shows that the optimal estimator is defined by

$$\sum_{i=1}^n \psi_a\!\left[\frac{y_i - x_i'\hat\beta}{\sigma}\,A^{-1}x_i\right] = 0$$

where

$$\psi_a(z) = \begin{cases} z, & \text{if } |z| \le a \\ a\,z/|z|, & \text{if } |z| > a \end{cases}$$

is the multivariate Huber psi-function, and $A$ is a $p \times p$ invertible matrix chosen so that the normalization condition holds. In fact, $A$ must satisfy

$$A = E\left\{\left[2\Phi\!\left(\frac{a}{\sigma|A^{-1}x|}\right) - 1\right]x\,x'\right\} \qquad (1.2.6)$$

and will exist provided $a$ is sufficiently large. In practice $A$ will not be known and will have to be estimated, e.g. as the solution of

$$\hat A = \frac{1}{n}\sum_{i=1}^n\left[2\Phi\!\left(\frac{a}{\hat\sigma|\hat A^{-1}x_i|}\right) - 1\right]x_i\,x_i' \qquad (1.2.7)$$

where $\hat\sigma$ is an auxiliary scale estimate.
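Equation (1.2.7) lends itself to direct fixed-point iteration, starting from the $a = \infty$ limit $\hat A = n^{-1}\sum_i x_i x_i'$. The following sketch is our illustration (it assumes NumPy/SciPy and treats $\hat\sigma$ as given); it is one plausible scheme, not the author's algorithm:

```python
import numpy as np
from scipy.stats import norm

def solve_A(X, a, sigma=1.0, n_iter=100, tol=1e-10):
    """Fixed-point iteration for A in (1.2.7):
    A = (1/n) sum_i [2 Phi(a / (sigma |A^{-1} x_i|)) - 1] x_i x_i'."""
    n = X.shape[0]
    A = X.T @ X / n                       # the a = infinity limit as a starting value
    for _ in range(n_iter):
        norms = np.linalg.norm(np.linalg.solve(A, X.T), axis=0)   # |A^{-1} x_i|
        w = 2 * norm.cdf(a / (sigma * norms)) - 1
        A_new = (X * w[:, None]).T @ X / n
        if np.max(np.abs(A_new - A)) < tol:
            return A_new
        A = A_new
    return A

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
print(solve_A(X, a=4.0))
```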
Krasker and Welsch (1982) consider a smaller class of estimators, but look at a stronger kind of optimality. They restrict attention to estimators of the form:

$$\sum_{i=1}^n w(y_i, x_i; \beta)(y_i - x_i'\beta)\,x_i = 0 \qquad (1.2.8)$$

where $w$ is a non-negative, bounded continuous weight function which depends on $y$ only through the residual $|y - x'\beta|/\sigma$ and gives Fisher consistency at the normal model for each $x$ in $\mathbb{R}^p$:

$$E\left[w(y, x; \beta)(y - x'\beta) \mid x\right] = 0. \qquad (1.2.9)$$

(If $w \equiv 1$, then the solution of (1.2.8) is $\hat\beta_{LS}$.)

Krasker and Welsch address the question of choosing that weight function which minimizes the asymptotic covariance matrix, $V_w$, of $\hat\beta$ subject to a bound on the sensitivity. ($w_0$ minimizes the covariance matrix if $V_w - V_{w_0}$ is non-negative definite for all $w$.) The particular case treated explicitly in their paper is the self-standardized sensitivity, $\gamma_2$, but similar methods yield results for $\gamma_1$ and $\gamma_3$ as well. In particular, with a bound on $\gamma_1$, the efficient estimator is identical to the Hampel-Krasker estimator described above.

Krasker and Welsch obtain the following result: Let a sensitivity bound $a > 0$ be given. Suppose there exists a $w$ which minimizes $V_w$ among all admissible weight functions satisfying $\gamma_2 \le a$. Then $w$ satisfies

$$w(y, x; \beta, \sigma) = \min\left\{1,\; \frac{a}{\frac{|y - x'\beta|}{\sigma}\,(x'N_w^{-1}x)^{1/2}}\right\}. \qquad (1.2.10)$$

Equation (1.2.10) is a necessary condition for a minimum, but is not sufficient. Ruppert (1983) has shown that, in general, no weight function exists which minimizes $V_w$ strongly; nonetheless, the weight function satisfying (1.2.10) is intuitively appealing and works well in practice. It can be shown that

$$N_w = E_G\left[g\!\left(\frac{a}{(x'N_w^{-1}x)^{1/2}}\right)x\,x'\right] \qquad (1.2.11)$$

where $g(t) \equiv E_Z\min(Z^2, t^2)$ with $Z \sim N(0,1)$. Typically $N_w$ has to be estimated, e.g. by solving

$$\hat N_w = \frac{1}{n}\sum_{i=1}^n g\!\left(\frac{a}{(x_i'\hat N_w^{-1}x_i)^{1/2}}\right)x_i\,x_i'. \qquad (1.2.12)$$
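Since $g(t) = E\min(Z^2, t^2)$ has the closed form $2\Phi(t) - 1 - 2t\phi(t) + 2t^2\{1 - \Phi(t)\}$, both the fixed point (1.2.12) and the weights (1.2.10) are straightforward to compute. A sketch of ours, under the stated assumptions:

```python
import numpy as np
from scipy.stats import norm

def g(t):
    """g(t) = E min(Z^2, t^2) for Z ~ N(0,1), in closed form."""
    return 2 * norm.cdf(t) - 1 - 2 * t * norm.pdf(t) + 2 * t**2 * norm.sf(t)

def solve_Nw(X, a, n_iter=200, tol=1e-10):
    """Fixed-point iteration for N_w in (1.2.12)."""
    n = X.shape[0]
    N = X.T @ X / n
    for _ in range(n_iter):
        q = np.einsum('ij,ij->i', X, np.linalg.solve(N, X.T).T)   # x_i' N^{-1} x_i
        N_new = (X * g(a / np.sqrt(q))[:, None]).T @ X / n
        if np.max(np.abs(N_new - N)) < tol:
            return N_new
        N = N_new
    return N

def kw_weights(X, resid, sigma, a):
    """Krasker-Welsch weights (1.2.10): unit weight unless influence would exceed a."""
    N = solve_Nw(X, a)
    q = np.einsum('ij,ij->i', X, np.linalg.solve(N, X.T).T)
    return np.minimum(1.0, a / (np.abs(resid) / sigma * np.sqrt(q) + 1e-12))
```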
As early as 1975, Mallows had suggested estimators of the form

$$\sum_{i=1}^n x_i\,w(x_i)\,\eta(y_i - x_i'\hat\beta) = 0 \qquad (1.2.13)$$

where $w: \mathbb{R}^p \to \mathbb{R}^+$ is a bounded continuous weight function satisfying $w(x) \to 0$ as $|x| \to \infty$, and $\eta: \mathbb{R} \to \mathbb{R}$ is an odd function (this guarantees Fisher consistency of the estimator under any symmetric distribution for $\varepsilon$). Essentially an estimator of this type bounds the influence of large residuals and outlying $x$-values separately. The influence function of $\hat\beta$ defined by (1.2.13) is given by

$$IF(x, y) = M_w^{-1}\,\frac{x\,w(x)\,\eta(y - x'\beta)}{E_F[\dot\eta(\varepsilon)]}$$

where $M_w \equiv E_G[w(x)\,x\,x']$, and $\dot\eta$ is the derivative of $\eta$. Clearly, $\hat\beta$ will have bounded influence iff both $\eta(\varepsilon)$ and $x\,w(x)$ are bounded. We defer further discussion of this class of estimators until Chapter IV, where a more detailed treatment will be given.
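One practical route to solving (1.2.13) is iteratively reweighted least squares, with $\eta$ a Huber psi applied to scaled residuals and the $x$-weights held fixed. The sketch below is our illustration, not a procedure prescribed by the text; the particular $w(x)$ used in the example is a hypothetical choice satisfying $w(x) \to 0$ as $|x| \to \infty$.

```python
import numpy as np

def huber_psi(t, c=1.345):
    return np.clip(t, -c, c)

def mallows_estimate(X, y, x_weights, c=1.345, n_iter=50, tol=1e-8):
    """Iteratively reweighted LS for (1.2.13):
    sum_i x_i w(x_i) eta(y_i - x_i' beta) = 0, with eta the Huber psi."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r)) + 1e-12   # MAD scale for the residuals
        u = r / s
        small = np.abs(u) < 1e-8
        eta_w = np.where(small, 1.0, huber_psi(u, c) / np.where(small, 1.0, u))
        w = x_weights * eta_w                        # residual AND x-outlier downweighting
        beta_new = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(80), rng.normal(size=80)])
y = X @ np.array([0.5, 1.0]) + rng.standard_t(3, size=80)
xw = np.minimum(1.0, 2.0 / (1.0 + X[:, 1] ** 2))     # hypothetical w(x) -> 0 as |x| -> inf
print(mallows_estimate(X, y, xw))
```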
1.3 Summary
Our aim will be to obtain efficient bounded influence estimates for the model described in Section 1.1. Since Carroll and Ruppert (1982) have shown that the joint maximum likelihood approach is highly sensitive to slight misspecification of the functional relationship between $\sigma_i$ and $\theta$, we do not attempt to estimate $\beta$ and $\theta$ simultaneously. Instead we adopt a 3-stage generalized least squares approach. Clearly, any of the bounded influence regression methods described in the preceding section may be employed in the preliminary estimation of $\beta$, and in obtaining the final weighted estimate of $\beta$. Our main problem then will be the adaptation of these techniques to the problem of estimating the variance parameter, $\theta$.
In Chapter II we discuss the extension of the Hampel-Krasker method to the problem of variance estimation and we comment on a number of associated difficulties which arise in the context of estimating $\theta$.

In Chapter III we consider two ways of extending Krasker-Welsch techniques to cover estimation of $\theta$. The first of these entails certain computational difficulties; the second proposal avoids these problems and may be easily implemented in a 3-stage estimation procedure. In the final section of this chapter we describe such a procedure and present the associated influence calculations.

In Chapter IV we strengthen the results of Maronna, Bustos and Yohai for Mallows estimators in regression and extend the Mallows technique to include estimation of $\theta$. We shall see that estimators of this type possess a certain stability of variance which, for instance, Krasker-Welsch estimators do not. We therefore propose a 3-stage Mallows estimator for our model.

In the final chapter we discuss algorithms for the computation of the 3-stage estimators described in Chapters III and IV, and we evaluate the performance of these estimators by using them in the analysis of a number of data sets.

Implicit in our whole approach is the intuitively appealing idea that, if the influence of any small subset of the data is controlled at each stage of the estimation procedure, then its effect on the final estimate should not be excessive. Formal influence calculations given in later chapters show that this is indeed the case.
CHAPTER II

HAMPEL-KRASKER ESTIMATION OF θ

2.0 Introduction

In this chapter we consider how to extend the bounded influence methods of Hampel (1978) and Krasker (1980) to the problem of estimating the variance parameter $\theta$. We shall work with the model $\sigma_i = \exp[h'(\tau_i)\theta]$, where $\tau_i = x_i'\beta$, and we shall consider the question of estimating $\theta$, assuming $\tau_i$ known. In practice, as discussed earlier, one would use a preliminary estimate $\hat\beta_p$ and work with $\hat\tau_i = x_i'\hat\beta_p$ in the estimating equations rather than $\tau_i$.

We shall see that the extension of the Hampel-Krasker technique is complicated by the lack of symmetry of our problem, and the optimal estimator which we derive is quite hard to implement in practice, leading us to consider alternative bounded influence methods in subsequent chapters.
2.1 Hampel-Krasker Estimation of θ (I)

We consider the general class of M-estimates of $\theta$ defined by an equation of the form

$$\sum_{i=1}^n \zeta(y_i, \tau_i; \hat\theta) = 0 \qquad (2.1.1)$$

where $\zeta: \mathbb{R} \times \mathbb{R} \times \mathbb{R}^q \to \mathbb{R}^q$. The normal-theory MLE of $\theta$ is in this class, with

$$\zeta_{MLE}(y, \tau; \theta) = \left\{\left(\frac{y - \tau}{\exp[h'(\tau)\theta]}\right)^2 - 1\right\}h(\tau).$$
We shall derive a bounded influence estimator for $\theta$ which minimizes the trace of the asymptotic covariance matrix among all estimators in this class, subject to a given bound on $\gamma_1$, the sensitivity of the estimator. For clarity of exposition we frame the question in a more general setting and obtain the solution to our problem as a particular case.
Consider the following situation: Suppose $\Theta \subseteq \mathbb{R}^k$ and $\{f_\theta: \theta \in \Theta\}$ is a family of densities on $(\Omega, \mathcal{F}, \mu)$, a $\sigma$-finite measure space. Let $Z_1, \ldots, Z_n$ be i.i.d. according to $f_\theta$. Let $\theta$ be estimated by $\hat\theta$, the solution to

$$\sum_{i=1}^n \zeta(Z_i; \hat\theta) = 0 \qquad (2.1.2)$$

where $\zeta: \Omega \times \mathbb{R}^k \to \mathbb{R}^k$ is assumed to satisfy a Fisher-consistency requirement

$$E_\theta\,\zeta(Z; \theta) = 0 \quad \text{for } \theta \in \Theta. \qquad (2.1.3)$$

The influence function at $\theta$ for $\hat\theta$ solving (2.1.2) is given by

$$IF_\zeta(z; \theta) = M_\zeta^{-1}(\theta)\,\zeta(z; \theta)$$

where $M_\zeta \equiv E_\theta[-\partial\zeta/\partial\theta]$ is assumed to be nonsingular. Note that the equations

$$\sum_{i=1}^n M_\zeta^{-1}\,\zeta(Z_i; \theta) = 0$$

yield the same solution for $\hat\theta$, and thus we may assume the normalization

$$E_\theta\left[-\frac{\partial\zeta}{\partial\theta}\right] = I_{k \times k} \qquad (2.1.4)$$

without loss of generality, in which case

$$IF_\zeta(z; \theta) = \zeta(z; \theta).$$

The asymptotic covariance matrix of $\hat\theta$ is given by $V_\zeta(\theta) = E_\theta\,\zeta(Z; \theta)\,\zeta'(Z; \theta)$, and the sensitivity, $\gamma_1$, is $\sup_z |\zeta(z; \theta)|$. For $\theta \in \Theta$, write $\ell_\theta(z)$ for $\dfrac{\partial\log f_\theta(z)}{\partial\theta}$, and note that $\ell_\theta(z) = \dot f_\theta(z)/f_\theta(z)$, where $\dot f_\theta(z) = \dfrac{\partial f_\theta(z)}{\partial\theta}$.
Our first result is a generalization of a result of Hampel (1978).
This formulation is due to Ruppert (1982); we include the proof for completeness.
Theorem 2.1. For $\theta \in \Theta$, $B$ a $k \times k$ matrix and $\alpha \in \mathbb{R}^k$ define

$$\zeta^*(z; \theta) = \psi_a[B\,\ell_\theta(z) - \alpha]$$

where $\psi_a$ is the multivariate Huber psi-function defined by

$$\psi_a(u) = \begin{cases} u, & \text{if } |u| < a \\ a\,u/|u|, & \text{if } |u| \ge a. \end{cases}$$

If there exists a nonsingular matrix $B$ and a vector $\alpha$ such that the Fisher consistency requirement (2.1.3) and the normalization condition (2.1.4) are satisfied, then $\zeta^*$ minimizes the trace of $V_\zeta(\theta)$ subject to a bound $a$ on the sensitivity $\gamma_1$ and the conditions (2.1.3) and (2.1.4).
Proof: Consider, for $\zeta$ satisfying (2.1.3) and (2.1.4):

$$E\left[\zeta(z)\left(B\ell_\theta(z) - \alpha\right)'\right] = E\left[\zeta(z)\,\ell_\theta'(z)\right]B'$$

since $E\,\zeta(z)\,\alpha' = 0$ by (2.1.3). Differentiating the equation $\int \zeta(z)f_\theta(z)\,d\mu = 0$ with respect to $\theta$ and interchanging the order of differentiation and integration, we obtain:

$$\int \zeta\,\ell_\theta'\,f_\theta\,d\mu = -\int \dot\zeta\,f_\theta\,d\mu = I_{k \times k}, \quad \text{by (2.1.4)}$$

(a dot indicates differentiation with respect to $\theta$), i.e. $E\,\zeta\,\ell_\theta' = I$. Thus $E[\zeta(z)(B\ell_\theta(z) - \alpha)'] = B'$ does not depend on $\zeta$, and so choosing $\zeta$ to minimize trace $V_\zeta$ is equivalent to choosing $\zeta$ to minimize

$$\operatorname{trace}\,E\left[\{(B\ell_\theta(z) - \alpha) - \zeta(z)\}\{(B\ell_\theta(z) - \alpha) - \zeta(z)\}'\right] = E\left[\{(B\ell_\theta(z) - \alpha) - \zeta(z)\}'\{(B\ell_\theta(z) - \alpha) - \zeta(z)\}\right]$$

and, subject to the constraint $|\zeta| \le a$, $\zeta^*$ clearly achieves this, since $\psi_a[u]$ is the point of the ball $\{|v| \le a\}$ nearest to $u$. □
Existence of the matrix $B$ and vector $\alpha$ satisfying the requirements of Theorem 2.1 may be established provided $a$ is sufficiently large, as the following theorem shows. (For $a = \infty$, $\alpha = 0$ and $B = I_\theta^{-1}$, where $I_\theta = E\,\ell_\theta\,\ell_\theta'$ is the information matrix, satisfy the conditions.) We must find $\alpha$ and $B$ such that
Theorem 2.2. For $a$ sufficiently large there exist $B$ and $\alpha$ such that

$$E_\theta\left\{-\frac{\partial}{\partial\theta}\,\psi_a[B\,\ell_\theta(z) - \alpha]\right\} = I$$

and

$$E_\theta\,\psi_a[B\,\ell_\theta(z) - \alpha] = 0.$$

Proof: The proof uses a fixed point argument generalizing that in Proposition 1 of Krasker (1980). Since the details are somewhat technical we delay their presentation until the final section of this chapter.
We now apply Theorem 2.1 to our situation of interest. In our case $z_i = (y_i, \tau_i)$ and the density, $f_\theta(z)$, may be written as

$$f_\theta(y, \tau) = \frac{1}{\sqrt{2\pi}\,\exp[h'(\tau)\theta]}\,\exp\left[-\tfrac{1}{2}\varepsilon^2(\theta)\right]g(\tau)$$

where $g$ is the marginal density of $\tau$, and

$$\varepsilon(\theta) \equiv (y - \tau)\exp[-h'(\tau)\theta].$$

It follows readily that

$$\ell_\theta(z) = [\varepsilon^2(\theta) - 1]\,h(\tau)$$

and thus the estimator $\hat\theta$ which minimizes the trace of the asymptotic covariance matrix subject to the requirement of Fisher consistency and a bound $a$ on the sensitivity $\gamma_1$ solves an equation of the form

$$\sum_{i=1}^n \psi_a\left\{[\varepsilon_i^2(\theta) - 1]\,Bh(\tau_i) - \alpha\right\} = 0 \qquad (2.1.5)$$

where $B$ and $\alpha$ satisfy the normalization (2.1.4) and the Fisher consistency requirement

$$E\,\psi_a\left\{[\varepsilon^2(\theta) - 1]\,Bh(\tau) - \alpha\right\} = 0. \qquad (2.1.6)$$

The Fisher consistency requirement (2.1.6) is not entirely satisfactory, since it involves taking the expectation with respect to the unknown distribution of the $\tau$'s. This could be circumvented in two ways. One could simply take the expectation in (2.1.6) with respect to the empirical distribution of the $\tau$'s. Alternatively we may argue: if our assumption of random $\tau$'s is really made just for mathematical convenience and is not crucial from a statistical viewpoint, then our inference should be conditional on the observed $\tau$'s. This suggests that, while Theorems 2.1 and 2.2 are not without general relevance, a formulation based on the conditional distribution of the $y$'s given the $\tau$'s is more meaningful for our problem. In the next section we develop such a formulation.
2.2 Hampel-Krasker Estimation of θ (II)

In this section we consider the following general situation: $(y_i, z_i)$, $i = 1, \ldots, n$, are $n$ independent drawings from a $(k+1)$-variate distribution $H$, and suppose that the conditional distribution of $y$ given $z$ has a density $f_\theta(y|z)$ with respect to Lebesgue measure, depending on some $k$-dimensional parameter $\theta$. Denote the marginal distribution of $z$ by $G$. Assume that $\theta$ is in some open subset of $\mathbb{R}^k$. In general, $\theta$ is unknown and is to be estimated from the data $(y_i, z_i)$, $i = 1, \ldots, n$. The form of the density $f_\theta$ is known.

In this situation, consider the class of M-estimates of $\theta$ obtained by solving an equation of the form

$$\sum_{i=1}^n \zeta(y_i, z_i; \hat\theta) = 0 \qquad (2.2.1)$$

where $\zeta$ is an $\mathbb{R}^k$-valued function. Since our inference is to be conditional on $z$ we require Fisher consistency of $\zeta$ for each $z$, i.e.

$$E_\theta[\zeta(y, z; \theta) \mid z] = 0 \quad \text{for each } z. \qquad (2.2.2)$$

The influence function for $\hat\theta$ is given by

$$IF_\zeta(y, z; \theta) = M_\zeta^{-1}\,\zeta(y, z; \theta)$$

where $M_\zeta \equiv E[-\partial\zeta/\partial\theta]$ is assumed nonsingular. As before, we assume the normalization

$$E\left[-\frac{\partial\zeta}{\partial\theta}\right] = I \qquad (2.2.3)$$

without loss of generality, so that $IF_\zeta(y, z; \theta) = \zeta(y, z; \theta)$. The asymptotic covariance matrix of $\hat\theta$ is given by $V_\zeta = E_H\,\zeta(y, z)\,\zeta'(y, z)$, and the sensitivity, $\gamma_1$, is $\sup_{(y,z)} |\zeta(y, z)|$.
We have the following analogue to Theorem 2.1.
Theorem 2.3. For any $k \times k$ matrix $B$, function $\alpha: \mathbb{R}^k \to \mathbb{R}^k$ and $a > 0$, define

$$\zeta^*(y, z; \theta) = \psi_a[B\,\ell_\theta(y|z) - \alpha(z)]$$

where $\ell_\theta(y|z) = \partial\log f_\theta(y|z)/\partial\theta$ and $\psi_a$ is the multivariate Huber psi-function defined previously.

If there exists a nonsingular matrix $B$ and a function $\alpha(z)$ such that the Fisher consistency requirement (2.2.2) and the normalization condition (2.2.3) are met, then $\zeta^*$ minimizes the trace of $V_\zeta$ subject to a bound $a$ on the sensitivity $\gamma_1$.

Proof: The proof is entirely similar to that of Theorem 2.1. □

Existence of the matrix $B$ and the function $\alpha(z)$ must be checked in specific applications of Theorem 2.3. Krasker (1980) does this for the homoscedastic regression case - by the symmetry of that problem $\alpha \equiv 0$ gives Fisher consistency, and he shows existence of $B$ provided the sensitivity bound $a$ is large.
We now apply Theorem 2.3 to our situation of interest. Our model is

$$y_i = \tau_i + \varepsilon_i\exp[h'(\tau_i)\theta], \qquad i = 1, \ldots, n$$

where $(y_i, \tau_i)$ are $n$ independent drawings from some bivariate distribution. We assume an underlying normal model, i.e. that $f_\theta(y|\tau)$ is normal with mean $\tau$ and standard deviation $\exp[h'(\tau)\theta]$. It follows readily that

$$\ell_\theta(y|\tau) = [\varepsilon^2(\theta) - 1]\,h(\tau).$$

Theorem 2.3 implies that the optimal estimator $\hat\theta$, subject to the constraints, solves an equation of the form

$$\sum_{i=1}^n \psi_a\left\{[\varepsilon_i^2(\theta) - 1]\,Bh(\tau_i) - \alpha(\tau_i)\right\} = 0 \qquad (2.2.4)$$

where $\alpha(\tau)$ is chosen to give Fisher consistency and $B$ is chosen to satisfy the normalization requirement.
Clearly it suffices to show the existence of some $C(\tau)$ such that $\alpha(\tau) = [1 - C(\tau)]\,Bh(\tau)$ gives Fisher consistency. We therefore consider solutions of the equation

$$\sum_{i=1}^n \psi_a\left\{[\varepsilon_i^2(\theta) - C(\tau_i)]\,Bh(\tau_i)\right\} = 0 \qquad (2.2.5)$$

instead and show the existence of the required $B$ and $C(\tau)$. The normalization condition is now

$$E\left\{-\frac{\partial}{\partial\theta}\,\psi_a\{[\varepsilon^2(\theta) - C(\tau)]\,Bh(\tau)\}\right\} = I_{q \times q}.$$

Evaluation of the derivative shows that this requirement reduces to

$$B^{-1} = E\left[g\!\left(C(T),\; \frac{a}{|Bh(T)|}\right)h(T)\,h'(T)\right] \qquad (2.2.6)$$

where $g(c, d) \equiv E_Z\{\min(|Z^2 - c|,\,d)\,\operatorname{sgn}(Z^2 - c)\,(Z^2 - 1)\}$ and $Z \sim N(0,1)$. Simultaneously we require that

$$\int \psi_a\left[(\varepsilon^2 - C(\tau))\,Bh(\tau)\right]d\Phi(\varepsilon) = 0 \quad \text{for each } \tau.$$

Clearly it will do equally well to show that for sufficiently large $a$ there exists a nonsingular $q \times q$ matrix $A$ and a continuous function $c: \mathbb{R} \to \mathbb{R}$ such that

$$A = E\left[g\!\left(c(T),\; \frac{a}{|A^{-1}h(T)|}\right)h(T)\,h'(T)\right] \qquad (2.2.7)$$

$$\int \psi_a\left[(\varepsilon^2 - c(T))\,A^{-1}h(T)\right]d\Phi(\varepsilon) = 0. \qquad (2.2.8)$$

In practice, the expectation in (2.2.7) will be taken with respect to the empirical distribution of the $\tau$'s; we are interested in finding $A$ and $c(\tau_i)$, $i = 1, \ldots, n$, satisfying

$$A = \frac{1}{n}\sum_{i=1}^n g\!\left(c(\tau_i),\; \frac{a}{|A^{-1}h(\tau_i)|}\right)h(\tau_i)\,h'(\tau_i) \qquad (2.2.9)$$

where

$$\int \psi_a\left[(\varepsilon^2 - c(\tau_i))\,A^{-1}h(\tau_i)\right]d\Phi(\varepsilon) = 0, \qquad i = 1, \ldots, n.$$
Existence of $A$ and $c(\tau_i)$ satisfying (2.2.9) is a consequence of the following theorem.

Theorem 2.4. Let $P$ be a probability distribution on $\mathbb{R}$ with compact support $T$. Suppose that $Q_1 = E_P\,2h(T)h'(T)$ exists and is invertible. Further assume that the function $h: \mathbb{R} \to \mathbb{R}^q$ is continuous. Then, provided $a$ is sufficiently large, there exists a nonsingular matrix $A$ and a continuous function $c: T \to \mathbb{R}$ such that

$$A = E_P\left[g\!\left(c(T),\; \frac{a}{|A^{-1}h(T)|}\right)h(T)\,h'(T)\right]$$

$$\int \psi_a\left[(\varepsilon^2 - c(T))\,A^{-1}h(T)\right]d\Phi(\varepsilon) = 0.$$

Proof: The proof employs a fixed point argument similar to that in the proof of Theorem 2.2. Again, the details are quite technical, and we defer their presentation until Section 2.3.
In practice, therefore, we are led to solving the following system of equations for $\hat\theta$:

$$\sum_{i=1}^n \psi_a\left\{[\varepsilon_i^2(\hat\theta) - c_i]\,A^{-1}h(\tau_i)\right\} = 0$$

$$A = \frac{1}{n}\sum_{i=1}^n g\!\left(c_i,\; \frac{a}{|A^{-1}h(\tau_i)|}\right)h(\tau_i)\,h'(\tau_i) \qquad (2.2.10)$$

where $c_i$ satisfies

$$\int \psi_a\left[(\varepsilon^2 - c_i)\,A^{-1}h(\tau_i)\right]d\Phi(\varepsilon) = 0, \qquad i = 1, \ldots, n.$$

We remark that, had we proceeded from the assumption that the $\tau_i$'s were fixed, not random, and invoked the more general definition of influence function (for non-i.i.d. data), then a search for the optimal bounded influence estimator of $\theta$ would lead us to precisely the system (2.2.10); compare Ruppert (1982), Section V.
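The subproblem for the constants $c_i$ is actually one-dimensional: writing $v = A^{-1}h(\tau_i)$, we have $\psi_a[(\varepsilon^2 - c)v] = v\,\operatorname{sgn}(\varepsilon^2 - c)\min(|\varepsilon^2 - c|,\,a/|v|)$, so each $c_i$ solves a scalar equation in $c$. The sketch below (ours, assuming SciPy) performs that root-find; as the truncation bound $d = a/|v|$ grows, $c \to E\,Z^2 = 1$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def fisher_c(d):
    """Solve E_Z[sgn(Z^2 - c) * min(|Z^2 - c|, d)] = 0 for c, where d = a/|A^{-1} h(tau_i)|.
    This is the scalar reduction of int psi_a[(eps^2 - c_i) A^{-1} h(tau_i)] dPhi = 0."""
    def m(c):
        integrand = lambda z: np.sign(z * z - c) * min(abs(z * z - c), d) * norm.pdf(z)
        return 2.0 * quad(integrand, 0.0, 10.0)[0]    # the integrand is even in z
    return brentq(m, 1e-6, 10.0)

# As d -> infinity the truncation disappears and c -> E Z^2 = 1.
for d in (0.5, 1.0, 2.0, 100.0):
    print(d, fisher_c(d))
```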
Practical implementation of the estimator defined by (2.2.10) is unfortunately not particularly easy. The presence of the constants $c_i$, made necessary by the Fisher consistency requirement, leads to awkward computational problems. While these problems are not insurmountable, rather than becoming involved in the development of a (possibly costly) program to solve (2.2.10) for $\hat\theta$, in subsequent chapters we investigate other types of bounded influence estimators of $\theta$, some of which do prove to be computationally more tractable.

2.3 Details
We now present proofs for Theorem 2.2 and Theorem 2.4.
Proof of Theorem 2.2: We need to show that, for $a$ sufficiently large, there exists a nonsingular $k \times k$ matrix $B$ and $\alpha \in \mathbb{R}^k$ such that

$$E\left\{-\frac{\partial}{\partial\theta}\,\psi_a[B\,\ell_\theta(z) - \alpha]\right\} = I \qquad (2.3.1)$$

and

$$E\,\psi_a[B\,\ell_\theta(z) - \alpha] = 0. \qquad (2.3.2)$$

Write

$$\ell_\theta(z) = \frac{\partial\log f_\theta(z)}{\partial\theta} \quad \text{(a $k$-vector)}, \qquad \dot\ell_\theta(z) = \frac{\partial\ell_\theta(z)}{\partial\theta} \quad \text{(a $k \times k$ matrix)}$$

and let $I_\theta = E[-\dot\ell_\theta(z)]$ be the information matrix. Equation (2.3.1) may be written as

$$E\left\{-I_{[|B\ell_\theta(z) - \alpha| < a]}\,B\,\dot\ell_\theta(z)\right\} = I_{k \times k}$$

or equivalently, with $A = B^{-1}$, as $E\{-I_{[|A^{-1}\ell_\theta(z) - \alpha| < a]}\,\dot\ell_\theta(z)\} = A$; so it clearly suffices to find $A$ and $\alpha$ such that

$$A = E\left\{-I_{[|A^{-1}\ell_\theta(z) - \alpha| < a]}\,\dot\ell_\theta(z)\right\} \qquad (2.3.3)$$

and

$$E\,\psi_a[A^{-1}\ell_\theta(z) - \alpha] = 0. \qquad (2.3.4)$$
For the remainder of the proof we omit the subscript $\theta$. Let $M$ denote the space of $k \times k$ matrices, equipped with the operator norm. Let $\mathbb{R}^k$ be equipped with the usual Euclidean norm. Then both $M$ and $\mathbb{R}^k$ are Banach spaces in their respective norms. $M \times \mathbb{R}^k$ is a Banach space in, e.g., the norm

$$|(A, \alpha)| = \max\{|A|, |\alpha|\}$$

for $(A, \alpha)$ in $M \times \mathbb{R}^k$. We do not subscript the various norms in what follows; it should be clear which norm is meant in a given situation.
Let $S$ be a closed ball in $M$, centered at the information matrix $I$ (assumed nonsingular), of radius sufficiently small that all elements of $S$ are invertible. For $(A, \alpha) \in S \times \mathbb{R}^k$ define the (continuous) mapping

$$F_a(A, \alpha) = \begin{pmatrix} F_{1a}(A, \alpha) \\ F_{2a}(A, \alpha) \end{pmatrix} = \begin{pmatrix} E\{-I_{[|A^{-1}\ell(z) - \alpha| < a]}\,\dot\ell(z)\} \\ E\{\psi_a[A^{-1}\ell(z) - \alpha]\} + \alpha \end{pmatrix}.$$

Clearly it is enough to show that the mapping $F_a$ has a fixed point for sufficiently large $a$. We shall show that, for some $\delta > 0$, $F_a$ maps $S((I, 0); \delta)$ into itself for $a$ sufficiently large, where $S((I, 0); \delta)$ is the closed ball of radius $\delta$ centered at $(I, 0)$. Brouwer's Fixed Point Theorem (Dunford and Schwartz, I, 1958, page 468) then guarantees the existence of the required fixed point.
Suppose then, that $(A, \alpha) \in S((I, 0); \delta)$, i.e. $|I - A| \le \delta$, $|\alpha| \le \delta$. We want to show that $|(I, 0) - F_a(A, \alpha)| \le \delta$. It is clearly enough to show that $|I - F_{1a}(A, \alpha)| \le \delta$ and $|F_{2a}(A, \alpha)| \le \delta$, by definition of the maximum norm. We show the latter inequality first:

$$F_{2a}(A, \alpha) = E\{\psi_a[A^{-1}\ell(z) - \alpha]\} + \alpha = \int \psi_a[A^{-1}\ell(z) - \alpha]\,dF(z) + \alpha$$

$$= \int\left[\min\left(1,\; \frac{a}{|A^{-1}\ell(z) - \alpha|}\right)\{A^{-1}\ell(z) - \alpha\} + \alpha\right]dF(z).$$
Define

$$M_{A,\alpha} = \{z:\; |A^{-1}\ell(z) - \alpha| > a\}.$$

Then

$$F_{2a}(A, \alpha) = \int_{M_{A,\alpha}}\left[\frac{a}{|A^{-1}\ell(z) - \alpha|}\{A^{-1}\ell(z) - \alpha\} + \alpha\right]dF(z) + \int_{M_{A,\alpha}^c}\left[\{A^{-1}\ell(z) - \alpha\} + \alpha\right]dF(z)$$

$$= \int_{M_{A,\alpha}}\{A^{-1}\ell(z) - \alpha\}\left[\frac{a}{|A^{-1}\ell(z) - \alpha|} - 1\right]dF(z) + \int\left(\{A^{-1}\ell(z) - \alpha\} + \alpha\right)dF(z).$$

The second of these terms is zero, since $E\,\ell(z) = 0$. Thus we need to show that

$$\left|\int_{M_{A,\alpha}}\{A^{-1}\ell(z) - \alpha\}\left[\frac{a}{|A^{-1}\ell(z) - \alpha|} - 1\right]dF(z)\right| \le \delta$$

if $a$ is sufficiently large.
Note 'tha't, since A is in S(I,c)
and O such that O s lA-II s 0 for A in
2
2
1
1
inS(cI,Q);c), !A- !(z)-s:1 ~ D21!(z)I-8, i.e.
'there exis't positive numbers 0
5(1,0).
(A,~)
Thus, for
MA,a
c
and the inverse map is continuous,
M = {z: D21!(z)
1
I - c > a}
I fMA,a {A-I!( z) -~} r[I _1!(z)
a
-~I
and so
- 1J dF ( z)
A
s
fM
1
IA- !(z)-s:I!
A,a
Since the function I!(z)
a ~
00
-1 a
IA
I
!(z)-££I
-lldF(Z)
sfM
I
A,a
(Dll~(z) 1+ 8)
dF(z)
is F-integrable, the F-measure of the set M ~ 0
by the Dominated Convergence Theorem, and thus fM(D11!(z) l+c)dF(z) ~O
as
32
as a
~
with the convergence being obviously uniform for A and a
00
in
s((I,Q);o).
(A,a) I ~ 0 for (A,a) ES((t,~);o).
2a
We want to show that, for a sufficiently
Thus, if a is sufficiently large, IF
Now consider F
large,
la
(A,a) - 1.
-
~ 0 =>
!(A,s:)-(1,Q)!
F
la
IFla(A,a) - 11
~ 0.
By definition

$$F_{1a}(A, \alpha) - I = E\left\{\left(1 - I_{[|A^{-1}\ell(z) - \alpha| < a]}\right)\dot\ell(z)\right\}.$$

Since we assume that $|\dot\ell(z)|$ is integrable, there exists a compact set $T \subseteq \Omega$ such that

$$\int_{T^c}|\dot\ell(z)|\,dF(z) < \varepsilon/2.$$

Then

$$|F_{1a}(A, \alpha) - F_{1a}(I, \alpha)| \le \int_T\left|I_{[|A^{-1}\ell(z) - \alpha| < a]} - I_{[|I^{-1}\ell(z) - \alpha| < a]}\right||\dot\ell(z)|\,dF(z) + \varepsilon. \qquad (2.3.6)$$

Now, for each $z \in \Omega$,

$$I_{[|A^{-1}\ell(z) - \alpha| < a]} \to 1 \quad \text{and} \quad I_{[|I^{-1}\ell(z) - \alpha| < a]} \to 1 \quad \text{as } a \to \infty,$$

and thus the difference of these two indicator functions tends to 0 pointwise in $z$ as $a \to \infty$. Furthermore, the function $I_{[|A^{-1}\ell(z) - \alpha| < a]}$ may be sandwiched between the functions $I_{[D_1|\ell(z)| + \delta < a]}$ and $I_{[D_2|\ell(z)| - \delta < a]}$ for all $(A, \alpha) \in S((I, 0); \delta)$; thus, applying the Dominated Convergence Theorem (with dominating function $2|\dot\ell(z)|$), the integral on the R.H.S. of (2.3.6) $\to 0$ as $a \to \infty$, uniformly in $(A, \alpha) \in S((I, 0); \delta)$.
Thus

$$|F_{1a}(A, \alpha) - F_{1a}(I, \alpha)| < 2\varepsilon \quad \text{for } a \text{ sufficiently large.} \qquad (2.3.7)$$

Now consider $|F_{1a}(I, \alpha) - I|$ for $a$ sufficiently large:

$$|F_{1a}(I, \alpha) - I| = \left|\int\left\{1 - I_{[|I^{-1}\ell(z) - \alpha| < a]}\right\}\dot\ell(z)\,dF(z)\right| \le \int\left|1 - I_{[|I^{-1}\ell(z) - \alpha| < a]}\right||\dot\ell(z)|\,dF(z).$$

The integrand tends to 0 pointwise as $a \to \infty$ (uniformly for $\alpha \in S(0, \delta)$). Thus, by the Dominated Convergence Theorem and an argument similar to the one above, the integral $\to 0$ as $a \to \infty$ (uniformly for $\alpha \in S(0, \delta)$) and can thus be made $< \varepsilon$ for $a$ sufficiently large, i.e.

$$|F_{1a}(I, \alpha) - I| < \varepsilon \qquad \forall\,\alpha \in S(0, \delta) \text{ and } a \text{ sufficiently large.} \qquad (2.3.8)$$

Combining (2.3.7) and (2.3.8), along with the fact that $\varepsilon$ was arbitrary, shows that, for $a$ sufficiently large, $(A, \alpha) \in S((I, 0); \delta) \Rightarrow |F_{1a}(A, \alpha) - I| \le \delta$, which completes the proof. □
Proof of Theorem 2.4: Recall that we want to show that, for sufficiently large $a$, there exists a nonsingular $q \times q$ matrix $A$ and a continuous function $c: T \to \mathbb{R}$ such that

$$A = E\left[g\!\left(c(T),\; \frac{a}{|A^{-1}h(T)|}\right)h(T)\,h'(T)\right] \qquad (2.3.9)$$

$$\int \psi_a\left[(\varepsilon^2 - c(T))\,A^{-1}h(T)\right]d\Phi(\varepsilon) = 0. \qquad (2.3.10)$$

The method is quite similar to that employed in the proof of Theorem 2.2, so we shall not repeat all of the details.
Again, let $M$ be the space of $q \times q$ matrices, equipped with the operator norm. Let $C(T)$ be the space of continuous real-valued functions on $T$, equipped with the sup norm $|f| = \sup_{x \in T}|f(x)|$. Then $M$ and $C(T)$ are Banach spaces in their respective norms, and $M \times C(T)$ is a Banach space in the maximum norm employed previously. Let $S$ be a closed ball in $M$, centered at $Q_1$, of radius sufficiently small that all elements of $S$ are invertible. For $(A, c) \in S \times C(T)$ define the (continuous) mapping $F_a(A, c) = (F_{1a}(A, c), F_{2a}(A, c))$, where

$$F_{1a}(A, c) = E\left[g\!\left(c(T),\; \frac{a}{|A^{-1}h(T)|}\right)h(T)\,h'(T)\right]$$

$$F_{2a}(A, c)(\tau) = \int \min\left[1,\; \frac{a}{|\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)|}\right]\left(\varepsilon^2 - c(\tau)\right)d\Phi(\varepsilon) + c(\tau).$$

Remark that $F_{2a}(A, c)$ defines a continuous function on $T$.

It will suffice to show that $F_a$ has a fixed point for $a$ sufficiently large. This can be done by showing that $F_a$ maps the closed ball $S((Q_1, \mathbb{1}); \delta)$ into itself for some $\delta > 0$ and invoking Brouwer's theorem. By $\mathbb{1}$ we mean the function which is identically 1 on $T$.

Assume therefore that $(A, c) \in S((Q_1, \mathbb{1}); \delta)$. It is enough to show that, for $a$ sufficiently large, $|F_{1a}(A, c) - Q_1| \le \delta$ and $|F_{2a}(A, c) - \mathbb{1}| \le \delta$. These inequalities are proved by arguments which are very similar to those employed in proving the corresponding inequalities in the previous theorem. We provide the details for the second inequality, to show how the compactness of $T$ is used in the proof.
Note that, for any $\tau \in T$,

$$F_{2a}(A, c)(\tau) = \int \min\left[1,\; \frac{a}{|\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)|}\right](\varepsilon^2 - c(\tau))\,d\Phi(\varepsilon) + c(\tau)$$

$$= \int(\varepsilon^2 - c(\tau))\,d\Phi(\varepsilon) + c(\tau) + \int_{M_{A,\tau}}\left[\frac{a}{|\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)|} - 1\right](\varepsilon^2 - c(\tau))\,d\Phi(\varepsilon)$$

$$= 1 + \int_{M_{A,\tau}}\left[\frac{a}{|\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)|} - 1\right](\varepsilon^2 - c(\tau))\,d\Phi(\varepsilon)$$

where

$$M_{A,\tau} = \left\{\varepsilon:\; |\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)| > a\right\}.$$

Therefore,

$$|F_{2a}(A, c)(\tau) - 1| \le \int_{M_{A,\tau}}\left|\frac{a}{|\varepsilon^2 - c(\tau)|\,|A^{-1}h(\tau)|} - 1\right||\varepsilon^2 - c(\tau)|\,d\Phi(\varepsilon) \le \int_{M_{A,\tau}}|\varepsilon^2 - c(\tau)|\,d\Phi(\varepsilon).$$

Since $T$ is compact, and $h$ is assumed continuous, $|h(\tau)| \le h_{\max}$ for all $\tau \in T$, for some number $h_{\max}$. For $A \in S(Q_1, \delta)$, $|A^{-1}| \le D_1$ for some $D_1$. If $\varepsilon \in M_{A,\tau}$ then

$$|\varepsilon^2 - c(\tau)| > \frac{a}{|A^{-1}h(\tau)|} \ge \frac{a}{D_1 h_{\max}} \;\Rightarrow\; \varepsilon^2 \ge \frac{a}{D_1 h_{\max}} - |c(\tau)|.$$

Since $|c - \mathbb{1}| \le \delta$, $|c(\tau)| < 1 + \delta$ for all $\tau \in T$. Thus

$$M_{A,\tau} \subseteq \left\{\varepsilon:\; \varepsilon^2 > \frac{a}{D_1 h_{\max}} - (1 + \delta)\right\} = M_\delta, \text{ say,}$$

for all $(A, c) \in S((Q_1, \mathbb{1}); \delta)$ and all $\tau \in T$. Then

$$|F_{2a}(A, c)(\tau) - 1| \le \int_{M_\delta}|\varepsilon^2 - c(\tau)|\,d\Phi(\varepsilon).$$

As $a \to \infty$ the $\Phi$-measure of $M_\delta \to 0$. Thus, since $\varepsilon^2 + (1 + \delta)$ is in $L^1(\Phi)$, by the Dominated Convergence Theorem we can find $a$ sufficiently large that the R.H.S. of the last inequality is less than or equal to $\delta$. But the R.H.S. of the inequality does not depend on $\tau$, i.e. a single choice of $a$ works for all $\tau \in T$. Thus, if $a$ is sufficiently large, we can guarantee

$$|F_{2a}(A, c)(\tau) - 1| \le \delta \qquad \forall\,(A, c) \in S((Q_1, \mathbb{1}); \delta),\; \forall\,\tau \in T,$$

so that

$$\sup_{\tau \in T}|F_{2a}(A, c)(\tau) - 1| \le \delta$$

if $a$ is large enough. Showing that $|F_{1a}(A, c) - Q_1| \le \delta$ is done just as in the previous theorem, and we do not present the details here. □
CHAPTER III

KRASKER-WELSCH ESTIMATION OF θ

3.0 Introduction

In this chapter we examine how the estimator proposed by Krasker and Welsch (1982) in the regression case may be extended to the problem of estimating $\theta$. As before, in the derivations, we assume that the $\tau_i$'s are known; in practice estimated $\tau_i$'s would be used.

The first extension which we describe leads to an estimator suffering from the same computational difficulties as the Hampel-Krasker estimator described in Section 2.2. We therefore consider a second extension which bypasses these difficulties at the cost of inconsistently estimating the scale component of $\theta$. This is not a serious drawback, since one can compensate by including an auxiliary scale estimate in the final weighted regression stage.

In the final section of this chapter we describe a 3-stage Krasker-Welsch estimator for our problem and, using the influence function as a tool, we investigate the asymptotic behavior of the estimator. Details of how the estimator may be computed are postponed until Chapter V.
3.1 Krasker-Welsch estimation of θ (I)

In this section we consider estimators $\hat\theta$ which are obtained by solving an equation of the form

$$\sum_{i=1}^n w(y_i, \tau_i; \hat\theta)\left[\left(\frac{y_i - \tau_i}{\exp[h'(\tau_i)\hat\theta]}\right)^2 - K_w(\tau_i)\right]h(\tau_i) = 0 \qquad (3.1.1)$$

where $K_w(\tau)$ is chosen to give Fisher consistency at the normal model for each $\tau$, i.e. $K_w(\tau)$ satisfies

$$\int w(\varepsilon, \tau)\left(\varepsilon^2 - K_w(\tau)\right)d\Phi(\varepsilon) = 0. \qquad (3.1.2)$$

Here $w$ is a weight function which is assumed to be bounded, continuous, non-negative and to depend on $y$ only through $\varepsilon(\theta) = (y - \tau)\exp[-h'(\tau)\theta]$. The influence function for $\hat\theta$ defined by (3.1.1) is

$$IF_w(\varepsilon, \tau; \theta) = w(\varepsilon, \tau; \theta)\left(\varepsilon^2 - K_w(\tau)\right)M_w^{-1}h(\tau)$$

where $M_w$ is the usual M-estimation matrix $E[-\partial\zeta_w/\partial\theta]$ for the score $\zeta_w$ defining (3.1.1). A more useful expression for $M_w$ is given by the following lemma.
Lemma 3.1.

$$M_w = E\left\{w(\varepsilon, \tau)\left(\varepsilon^2 - K_w(\tau)\right)\left(\varepsilon^2 - 1\right)h(\tau)\,h'(\tau)\right\}.$$

Proof: By (3.1.2),

$$\iint w(\varepsilon(\theta), \tau)\left(\varepsilon^2(\theta) - K_w(\tau)\right)h(\tau)\,\phi(\varepsilon(\theta))\,d\varepsilon(\theta)\,dG(\tau) = 0.$$

Interchanging the order of differentiation and integration:

$$M_w = -\iint \frac{d}{d\theta'}\left[w(\varepsilon(\theta), \tau)\left(\varepsilon^2(\theta) - K_w(\tau)\right)h(\tau)\right]\phi(\varepsilon(\theta))\,d\varepsilon(\theta)\,dG(\tau)$$

$$= \iint w(\varepsilon(\theta), \tau)\left(\varepsilon^2(\theta) - K_w(\tau)\right)h(\tau)\left[\frac{d\phi(\varepsilon(\theta))}{d\theta'}\,d\varepsilon(\theta)\right]dG(\tau)$$

by the product rule for differentiation, and this can readily be shown to equal

$$\iint w(\varepsilon, \tau)\left(\varepsilon^2 - K_w(\tau)\right)\left(\varepsilon^2 - 1\right)\phi(\varepsilon)\,d\varepsilon\; h(\tau)\,h'(\tau)\,dG(\tau),$$

using the fact that $d\phi(\varepsilon)/d\varepsilon = -\varepsilon\phi(\varepsilon)$ and $d\varepsilon(\theta)/d\theta = -h(\tau)\,\varepsilon(\theta)$. □
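For a fixed $\tau$ and a given weight function, (3.1.2) gives $K_w(\tau) = E[w\,Z^2]/E[w]$ at the fixed point, which suggests simple iteration when $w$ itself involves $K_w$, as the weights considered below do. A sketch of ours, assuming SciPy; the Huber-type weight in the example is a hypothetical stand-in with the norming constant folded into a single bound $d$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def K_w(w_eps, n_iter=100, tol=1e-10):
    """Fisher-consistency constant (3.1.2) at a fixed tau: K solves
    int w(eps; K)(eps^2 - K) dPhi(eps) = 0, i.e. K = E[w Z^2] / E[w] at the fixed point.
    w_eps(eps, K) may itself use K, as the weights (3.1.3)-(3.1.7) do."""
    K = 1.0                                   # the unweighted (w = 1) value
    for _ in range(n_iter):
        num = 2 * quad(lambda z: w_eps(z, K) * z * z * norm.pdf(z), 0, 10)[0]
        den = 2 * quad(lambda z: w_eps(z, K) * norm.pdf(z), 0, 10)[0]
        K_new = num / den
        if abs(K_new - K) < tol:
            return K_new
        K = K_new
    return K

# Example: Huber-type weight min(1, d / |eps^2 - K|).
d = 2.0
w = lambda z, K: min(1.0, d / max(abs(z * z - K), 1e-12))
print(K_w(w))      # close to, but not exactly, 1
```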
The asymptotic covariance matrix is given by the usual sandwich formula

$$V_w = M_w^{-1}\,N_w\,M_w^{-1}, \qquad N_w = E\left[w^2(\varepsilon, \tau)\left(\varepsilon^2 - K_w(\tau)\right)^2 h(\tau)\,h'(\tau)\right].$$

In the regression case, Krasker and Welsch consider the problem of choosing that weight function which minimizes $V_w$ subject to a bound $a$ on the sensitivity. They derive a necessary condition that such a weight function must satisfy, but fail to show that the resulting weight function is strongly optimal (i.e. that it does minimize $V_w$). In fact, Ruppert (1983) has shown that strong optimality is too much to hope for, and that, in general, no weight function exists which minimizes $V_w$ strongly within the class under consideration. Nonetheless, the weight function derived by Krasker and Welsch in the regression case is intuitively appealing and appears to work quite well in practice, and we therefore consider how to extend it to our situation.

Their proof that the weight function satisfies the first order optimality condition does not readily extend to our situation, which is complicated by the more involved Fisher consistency requirement. Their proof consists essentially of considering a particular constrained optimization problem and obtaining a characterization of the $w$ solving that problem by considering certain derivatives. Since, in our case, the term subtracted for Fisher consistency, $K_w(\tau)$, depends on $w$, the derivatives are far more complicated and do not yield a manageable condition that $w$ should satisfy.

Although we were unable to extend the Krasker-Welsch proof, their results for the regression case nonetheless do suggest a course of action. Their weight function satisfying the first order optimality condition takes the intuitively appealing form: downweight an observation only if its influence would otherwise exceed the maximum allowable influence - otherwise give the observation a full weight of one. We consider three definitions of sensitivity and suggest the following estimation procedures:
Case 1. Choose $w$ with a view to bounding

$$\gamma_1 = \sup_{(\varepsilon,\tau)}|IF_w(\varepsilon, \tau)|$$

by $a_1$, say. The analogous problem in the regression case is to choose $w$ subject to a bound $a_1$ on $\gamma_1$, and Krasker and Welsch propose the choice

$$w_1(y, x) = \min\left\{1,\; \frac{a_1}{\frac{|y - x'\beta|}{\sigma}\,|M_{w_1}^{-1}x|}\right\}$$

which can be shown to minimize trace($V_w$) subject to the constraint. This suggests the following choice of $w$ for our problem:

$$w_1(\varepsilon, \tau) = \min\left\{1,\; \frac{a_1}{|\varepsilon^2 - K_{w_1}(\tau)|\,|M_{w_1}^{-1}h(\tau)|}\right\}. \qquad (3.1.3)$$

Remark that this choice of weight function results in the Hampel-Krasker estimator described in Section 2.2, and thus minimizes trace $V_w$ subject to the constraint $\gamma_1 \le a_1$, within the class of M-estimators of $\theta$.

Note that, by Lemma 3.1,

$$M_{w_1} = E\left[w_1(\varepsilon, \tau)\left(\varepsilon^2 - K_{w_1}(\tau)\right)\left(\varepsilon^2 - 1\right)h(\tau)\,h'(\tau)\right]$$

$$= E\left\{\min\left[1,\; \frac{a_1}{|\varepsilon^2 - K_{w_1}(\tau)|\,|M_{w_1}^{-1}h(\tau)|}\right]\left(\varepsilon^2 - K_{w_1}(\tau)\right)\left(\varepsilon^2 - 1\right)h(\tau)\,h'(\tau)\right\}$$

$$= E\left[g_1\!\left(K_{w_1}(\tau),\; \frac{a_1}{|M_{w_1}^{-1}h(\tau)|}\right)h(\tau)\,h'(\tau)\right]$$

where $g_1(c, d) \equiv E_Z\{\min(|Z^2 - c|, d)\,\operatorname{sgn}(Z^2 - c)\,(Z^2 - 1)\}$ with $Z \sim N(0,1)$. Typically $M_{w_1}$ will have to be estimated in practice, e.g. by solving

$$\hat M_{w_1} = \frac{1}{n}\sum_{i=1}^n g_1\!\left(K_{w_1}(\tau_i),\; \frac{a_1}{|\hat M_{w_1}^{-1}h(\tau_i)|}\right)h(\tau_i)\,h'(\tau_i). \qquad (3.1.4)$$
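The function $g_1$ is a one-dimensional normal expectation and is easily evaluated, e.g. by Monte Carlo as in this sketch of ours. A useful check: with $c = 1$ and no truncation, $g_1(1, \infty) = E(Z^2 - 1)^2 = 2$.

```python
import numpy as np

def g1(c, d, n_draws=400_000, seed=3):
    """Monte Carlo evaluation of g1(c, d) = E[min(|Z^2 - c|, d) sgn(Z^2 - c) (Z^2 - 1)]."""
    z2 = np.random.default_rng(seed).normal(size=n_draws) ** 2
    return np.mean(np.minimum(np.abs(z2 - c), d) * np.sign(z2 - c) * (z2 - 1))

print(g1(1.0, np.inf))   # approximately 2 = Var(Z^2)
print(g1(1.0, 1.5))      # truncation shrinks the value
```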
Case 2. Choose $w$ with a view to bounding

$$\gamma_2 = \sup_{(\varepsilon,\tau)}\left[IF_w'\,V_w^{-1}\,IF_w\right]^{1/2}$$

by $a_2$, say. The analogous problem in the regression case is to choose $w$ subject to a bound $a_2$ on $\gamma_2$, and the Krasker-Welsch proposal is

$$w_2(y, x) = \min\left\{1,\; \frac{a_2}{\frac{|y - x'\beta|}{\sigma}\,(x'N_{w_2}^{-1}x)^{1/2}}\right\}.$$

This suggests the following choice of $w$ for our problem:

$$w_2(\varepsilon, \tau) = \min\left\{1,\; \frac{a_2}{|\varepsilon^2 - K_{w_2}(\tau)|\,\left(h'(\tau)N_{w_2}^{-1}h(\tau)\right)^{1/2}}\right\}. \qquad (3.1.5)$$

Recall that

$$N_w = E\left[w^2(\varepsilon, \tau)\left(\varepsilon^2 - K_w(\tau)\right)^2 h(\tau)\,h'(\tau)\right]$$

so, by substitution,

$$N_{w_2} = E\left[g_2\!\left(K_{w_2}(\tau),\; \frac{a_2}{\left(h'(\tau)N_{w_2}^{-1}h(\tau)\right)^{1/2}}\right)h(\tau)\,h'(\tau)\right]$$

where $g_2(c, d) \equiv E_Z\left[\min\{(Z^2 - c)^2,\, d^2\}\right]$. In practice, $N_{w_2}$ would be estimated as the solution of

$$\hat N_{w_2} = \frac{1}{n}\sum_{i=1}^n g_2\!\left(K_{w_2}(\hat\tau_i),\; \frac{a_2}{\left(h'(\hat\tau_i)\hat N_{w_2}^{-1}h(\hat\tau_i)\right)^{1/2}}\right)h(\hat\tau_i)\,h'(\hat\tau_i). \qquad (3.1.6)$$
Case 3. Finally, suppose $w$ is chosen subject to a bound $a_3$ on

$$\gamma_3 = \sup_{(\varepsilon,\tau)}\left|h'(\tau)\,IF_w(\varepsilon, \tau)\right|.$$

The analogous problem in the regression case is to choose $w$ subject to a bound $a_3$ on $\gamma_3$, and the Krasker-Welsch proposal is

$$w_3(y, x) = \min\left\{1,\; \frac{a_3}{\frac{|y - x'\beta|}{\sigma}\,x'M_{w_3}^{-1}x}\right\}.$$

This suggests the following choice of $w$ for our problem:

$$w_3(\varepsilon, \tau) = \min\left\{1,\; \frac{a_3}{|\varepsilon^2 - K_{w_3}(\tau)|\,h'(\tau)M_{w_3}^{-1}h(\tau)}\right\}. \qquad (3.1.7)$$

By substitution, $M_{w_3}$ satisfies

$$M_{w_3} = E\left\{g_1\!\left(K_{w_3}(\tau),\; \frac{a_3}{h'(\tau)M_{w_3}^{-1}h(\tau)}\right)h(\tau)\,h'(\tau)\right\}$$

where $g_1$ is defined as before, and $M_{w_3}$ might reasonably be estimated by solving

$$\hat M_{w_3} = \frac{1}{n}\sum_{i=1}^n g_1\!\left(K_{w_3}(\hat\tau_i),\; \frac{a_3}{h'(\hat\tau_i)\hat M_{w_3}^{-1}h(\hat\tau_i)}\right)h(\hat\tau_i)\,h'(\hat\tau_i). \qquad (3.1.8)$$
Remark. Implementation of the estimators with weight functions defined by (3.1.3), (3.1.5) or (3.1.7) in practice leads to some quite difficult computational problems. For definiteness consider the estimator with weight function defined by (3.1.7). $\hat\theta$ is the solution to the equation

$$\sum_{i=1}^n \min\left[1,\; \frac{a_3}{|\varepsilon_i^2(\hat\theta) - K_3(\tau_i)|\,h'(\tau_i)M_3^{-1}h(\tau_i)}\right]\left(\varepsilon_i^2(\hat\theta) - K_3(\tau_i)\right)h(\tau_i) = 0 \qquad (3.1.9)$$

and $M_3$ solves

$$M_3 = \frac{1}{n}\sum_{i=1}^n g_1\!\left(K_3(\tau_i),\; \frac{a_3}{h'(\tau_i)M_3^{-1}h(\tau_i)}\right)h(\tau_i)\,h'(\tau_i). \qquad (3.1.10)$$

In practice, the quantities $K_3(\tau_i)$, $i = 1, \ldots, n$, are unknown and are determined by the further system of equations:

$$\int \min\left[1,\; \frac{a_3}{|\varepsilon^2 - K_3(\tau_i)|\,h'(\tau_i)M_3^{-1}h(\tau_i)}\right]\left(\varepsilon^2 - K_3(\tau_i)\right)h(\tau_i)\,d\Phi(\varepsilon) = 0, \qquad i = 1, \ldots, n. \qquad (3.1.11)$$

Solving the system of equations defined by (3.1.9)-(3.1.11) is not particularly easy. To avoid these computational difficulties, in the next section we consider a different class of Krasker-Welsch type estimators which are relatively easy to implement in practice.
3.2 Krasker-Welsch estimation of θ (II)

Here we consider estimators $\hat\theta$ which solve an equation of the form

$$\sum_{i=1}^n w(y_i, \tau_i; \hat\theta)\left[\left(\varepsilon_i(\hat\theta)\right)^2 - 1\right]h(\tau_i) = 0 \qquad (3.2.1)$$

where $w$ is a bounded, continuous, non-negative weight function depending on $y$ only through $\varepsilon = (y - \tau)\exp[-h'(\tau)\theta]$. Such estimators will not, in general, be consistent. However, we now show that, provided $h(\tau)$ is such that $h_1(\tau) \equiv 1$, then only the first component of $\theta$, $\theta_1$ say, is inconsistently estimated. In the examples of interest to us, the first component of $\theta$ corresponds to a scale parameter $\sigma$. Inconsistent estimation of $\sigma$ is not a problem, provided one employs an auxiliary scale estimate in the final weighted estimation of $\beta$.
Suppose then that 6 is a solution to the equation (3.Z.l).
-
h
Suppose
that the vector-valued function h: R -+ R q is partitioned as (h 1) where
-Z
~: ~~j R(:~:::el~un::::-n:::C:h::.i:::::C:::Yp:~ame::::::::: :fa:::r:::::~
we can always reparametrize in the following way:
6l+az6Z+···+aq8q
~Z
~*(T)
=
6*
=
6
Then
~*'
(T)!*
q
= ~'(T)~, by the definition of 6i.
Thus if
a. =
E W(E,T)(EZ-l)h.(T)
~_J"---
j=2, ... q
E W(E,T)(EZ-l)
J
then
E[h~(T) W(E,T)(E 2-l)]
J
- provided
E(~)
depends on
~
=0
only through
j=2, ... q
h'(T)~.
Tben we may, without loss
of generality, assume that our model is parametrized in such a way that
Z
E h 2 (T) W(E,T)(E -1) = ~, and only the definition of 6 is changed to accom1
modate this. We shall assume such a parametrization for the remainder of
this section.
45
To discuss consistency of the estimator 8 defined by the equation
(3.2.1) , remark that the equation may be written in the form
-nI n
L
i=l
"
"2
w(e:.exp[h' (T.)(8-8)],T.) [(e:.exp[h' (T.)(8-8)]) - 1] h(T.) = 0
1.
-
1.
-
-
1.
1.
-
1.
-
-
-
1.
(3.2.2)
or, equivalently, in functional form
J wee:
2
A
"
exp [h' (T) (~-£)] , T) [e: exp [2h' (T) (~-£)] - 1] !!.(T) dHn
=0
(3.2.3)
1
where H is the e.d.f., placing mass - at each (e:.,T.).
n
n
1.
1.
For brevity,
write (3.2.3) as
A
J ~(e:,T;8)
dH
n
=0
(3.2.4)
Let the solution of
(3.2.5)
1
be denoted by~.
0p(l).
A
Then, under reasonable regularity conditions, n~(~-~)
But the discussion above shows that
~2H
= ~2'
where
~2
=
is the "true
value" of the parameter (that which is estimated consistently by the MLE,
say).
Thus, only the first element of e is not consistently estimated by
the solution to (3.2.1).
These considerations show that it is reasonable to consider the class
of estimators of
~
defined by (3.2.1).
The question arises
how best to
choose the weight function subject to a bound on the sensitivity.
As
before,
one might consider various definitions of sensitivity - for brevity, in this
section we shall work only with the third definition:
Y3
= sup
(e: , T)
Ih'(T) IFe(e:,T)
I·
46
The influence function for
2
(E -l)h(T)]
A
a
(evaluated at~) is given by
e = Mw-1 [W(E,T;~)X
where
By an argument similar to that in Lemma 3.1 we may show that (under normality
of E)
If we wish to bound Y3' our basic problem is quite similar to that
considered in Case 3 of the previous section.
We must choose w subject to
-1 h(T) W(£,T) I£2 -1\. By analogy to the case
a bound a 3 on Y3 = sup h'(T)Mw
(E, T)
considered previously we propose the following weight function:
(3.2.6)
where
6.
2
Mw = E{w(£,T)(£ -1)
2
h(T)h'(T)}.
By substitution, it may be shown that M satisfies
w
a 3_
Jh(T)!l'(T)
lit,
(T)M l h (T)]
w2
where g(u) ~ E Min[lz2- ll ,u] Iz -ll
(Z ,... N(O,l)).
z
MW=ETg[
The extension of our proposal here to other definitions of sensitivity is
immediate.
We make no particular optimality claims for the weight function specified by (3.2.6).
It does, however, seem to be a reasonable choice, is fairly
simple to compute, and gives quite acceptable results in practice.
In the
next section we outline a 3-stage estimation procedure which uses this weight
function in the second stage.
47
3.3 A 3-stage Krasker-Welsch estimation procedure
We now describe a 3-stage estimation procedure which gives a bounded
influence estimate of Krasker-Welsch type for (~).
For concreteness, we
work with the third definition of sensitivity:
I).
(respectively sup I!!.' (T).!.F 8 (E,T)
(E,T)
In the numerical examples in Chapter V we also consider working with
Y2~
the influence calculations are similar to those presented in this section.
We present the full estimating equations and associated influence ca1cu1ations at this point for reference later - in subsequent sections we shall
not present the same level of detail.
The estimation procedure is as
follows:
Stage 1
Solve
I
i=l
Min[~,
A
(y. - x.'
1 -.l.
S ) x. = 0
l'
(3.3.1)
-1
A
x.x.'
-1-1
A
n
I
i=l
j
Y1. -x.'
S
-1 -p
X
A
t
= 0
x
(3.3.3)
°1
f:,
A
for M , 01 and~, where feu) = Ez Min[lzl,u]
1
X is an even function satisfying EX(E) = o.
Stage 2
(3.3.2)
Izi
(with Z -- NCO,l)) and
Next solve
I Min [1,
i=l
2
a2
Yi-\
2
x
rexp[h' (t.)8]
-
1-
J
J{(
A_I
J
-1 h' (t. ) M h (t. )
1
1
2 -
Yi - \
exp[!!.'
x
(ti)~J
J
-1
~h (t.)
J-
1.
= 0
-
(3.3.4)
43
and
I g(h'(t.)M-:\h(t.) J -h(t.)h'(t.) - M2 = 0
.!.
n i=l
Stage 3
1. -
-
1.
(3.3.5)
1.
2-1.
Form estimated standard deviations 0i =
Stage 1, using the transformed variables
x~
1.
exp[h'(ti)~l
= x.la.,
-:1.
1.
y~
1.
and repeat
= y.la .• i.e. solve
1.
1.
the equations
r
i=l
a
Min[l.
1. -1.
IY~-xt'
x
a
-1 nL
f
n i=l
(
3
j
K
3
a 31
K
x~'M'-3 x~
-1.
~Ixt'
-1.
A
A
(y~-x~'
(3)
1.-:1.-
= 0
(3.3.6)
M-1-l.
x~
3
) x~~,
-:1.
x~
-1.
(3.3.7)
-:1.-:1.
n
l
(3.3.8)
i=l
A
A
A
To compute the influence functions we first write the equations (3.3.1)
-(3.3.8) in functional form.
The estimators may be written as implicitly
defined functionals of the e.d.f.
dH
n
H as follows:
n
=0
(3.3.10)
A
r
lj
-X'C(34) -8-p )+( exp[h'lX'S
- - =0 )6
~
f "A _
x
a
1
dH
n
=0
(3.3.11)
49
(3.3.12)
(3.3.13)
A
A
X'
J
M-31 -
A
{£ exp[!!.' (.!'.§O).§O]
X
+
X
.!' (~-~)}X
A
exp [2h' (.!'~)~]
A
dH =
n
a
(3.3.14)
A
dH
n
=0
(3.3.15)
(~.3.l6)
1
Hn puts mass - at each of the sample points (x.Jy.) (equivalently
n
-1
1
(.!i J£i) where £i
= (y i -~'.§.o)
exp [-!!.' (xi' ~)~],
at
~ and ~ being the
true parameter values).
for a
S
t
S
• t.
1. Then H = dH/dt
= 0 (£,A)- H.
The estimators as functionals of the perturbed distributions H are subt
scripted with a t and satisfy equations (3.3.9)-(3.3.16) with H replaced
n
by H .
t
We may write these equations more succinctly as
50
f El (~t'
O"lt, MIt) dH t = 0
f
1;;2 (MIt) dH t = 0
f
1;;3(0"1t' ~t) dHt = 0
f ~ @t'
MZt' ~t) dHt = 0
(3.3.17)
f
I;;S(M 2t , ~t) dH t = 0
f
~ (St' ~t' 0"3t'
f
1;;7(M3t , ~t' ~t) dHt = 0
f
1;;8(0"3t' ~t' St' at) dHt = 0
M3t'
~t) dHt =
to emphasize the dependence on t.
0
If we write
(by MIt we mean the elements of MIt stacked as a column vector), and
E = (i1 ,···,1;;8)'
then the system (3.3.17) may be written more compactly
as
I;;(Y )
f -..I-t
dH
t
= 0 •
-
(3.3.18)
By definition, the influence function for the complete vector of estimates
is
aYt/atlt=o =
Y,
say.
Differentiating (3.3.18) with respect to t and
evaluating at t=O gives
(using the product rule and the chain rule for differentiation), i.e.
51
which we write as
•
(3.3.19)
Cy=~.
Equation (3.3.19) may be written more transparently, by displaying the
components of y• and v, and partitioning the matrix C accordingly.
v. =
-1
J 1i (10)
C
ll
C
21
do
i=1, ... ,8.
(e:,x)
Write
We have
•
Cl2 •
C
22
S
~l
M
v
~
1
•
0"1
e•
v
=
•
~2
•
~
•
2
3
~
(3.3.20)
X-s
~
M
3
~7
•
v
0"3
8
with
j ,k=1, ... ,8
where
~. is the function of the parameters appearing in the jth estimating
-J
equation and Yk is the k
th
E
component of y.
exp [~' (~'~)~]
Thus, for instance,
+ ~' (~-.§.t)
exp[2h' (x'~t)~t]
52
Since most of the
~.
-J
involve only some of the components of y, many of the
partial vector derivatives are identically zero, thereby rendering certain
blocks of the matrix C equal to zero.
Specifically, the submatrices
C , C ; C , C , C , C ; C ' C ' C and C
are zero for this
63
6S
72
73
76
7S
S2
S7
S3
SS
reason. A little effort shows that certain of the remaining submatrices
are also zero, as the following lemma indicates.
The matrices C , C , C , C ' C , C , C and C are
67
12
l3
6l
S6
3l
4S
64
Lemma 3.2
all zero.
Proof:
This follows chiefly from the assumed symmetry of the distri-
bution of E about O.
We give a detailed proof that C is zero; the method
64
is similar for the others. By definition
Straightforward calculation shows that the derivative evaluated at t=O is
OEI
1TI
~a
3
xlM-lx
cr
3
- 3exp[3h' (x'B
1 (sgn e:) _x 03 exp[h_' (_x'B,,)6,,]
)6 ]
-1
- 3£10<1 "3
-v
--~..:.o
e~~~~~'~)~] < "3]
~J
h_' (_x'B,,)
~J
x h'(x'B )
--
For each fixed x this is an odd function of E.
-.=<)
The expectation is therefore
zero, by the sYmmetry of the distribution of e:, and the independence of
and x.
The system (3.3.20) thus reduces to
E
o
53
= v
-1
= ~2
= v3
=~
v
=
v
.:..s
=~
v
= v
-7
= v8
The estimates of most interest to us are the estimate of 8 and the final
estimate of 8.
The corresponding influence functions are given by
(3.3.21)
(3.3.22)
We see from (3.3.21)
• the influence function for the final esti-
that~,
•
•
-
-'-P
mate of 8, involves neither 8 nor 8.
-
.
l3 is g1ven
by
• •
E(~~')
The asymptotic covariance matrix for
-1
= C
E ~ ~1 -C1 , which is precisely what would be
obtained if the true cr.1 IS
66
66
were known and 8 was the estimate obtained by
solving (3.3.6)-(3.3.8) with
x~
-'l.
= x./cr. and y~ =
-'l.
1
1
having to form estimated weights using 8 and 8
-
-'-P
y./cr..
1
1
In other words,
does not lead to any increase
in the asymptotic variance of 8.
On
the other hand, there is an asymptotic cost
t. (= x. I 8 ) rather than
1
-1
-p
T.
1
in the estimation of 8;
the asymptotic covariance matrix of
• •
-1
C-1
44 C4l E ~I C4l C44 ;
-
""
~
due
to
the
use of
•
•
8 does involve 8 , and
-p
-
• •
is given by E ~~I
=
-1
C E
44
~ ~
C-1
44
the second term in this expression comes from the
preliminary estimation of 8.
+
54
Finally we remark that equations (3.3.21) and (3.3.22) show that
e
and 8 do indeed have bounded influence, since the quantities vI' v and
-4
are bounded.
~
~
CHAPTER IV
MALLOWS ESTIMATES
4.0
Introduction
We have seen in the previous chapters that, while the methods of
Hampel, Krasker and Welsch may be extended to the problem of finding a
bounded influence estimate of
plicated.
~,
the resulting estimators are quite com-
We now consider a simpler class of estimators, similar to those
initially proposed by Mallows (1975) in the regression case.
While these
estimators will typically be less efficient than those decribed in the
preceding chapters, they are easier to implement in practice.
Furthermore,
as will be shown in Section 4.3, if stability of inference is a desideratum, they are, in a sense, the natural estimators to use.
4.1
Mallows estimation of
~
In this section we extend the previously known results for Mallows
regression estimates. We do this in two ways - by weakening the assumptions
made hitherto, and by considering various definitions of sensitivity.
There
is a certain similarity between the problem of finding optimal Mallows estimates and that of finding optimal Krasker-Welsch estimates, and we shall be
able to exploit this similarity to obtain our results.
Recall that, in the usual (homoscedastic) linear model, a Mallows estimate of
~
solves an equation of the type
S6
n
LX.
i=l-l
where w: lR p
A
w(x.) lji[y.-x.'(3]
-1
1 -1 -
=0
(4.1.1)
R + is bounded and continuous, satisfying We!)
-+
-+
0
as
lR satisfies E 1/1 (E) = O. (If 1/1 is an odd function,
F
then E 1/1 (E) = 0 provided F is symmetric about 0.)
F
Ixl
-+
00,
Define
and 1/1: lR
-+
M
= E
[w(!.) !.!.' ]
N
= E
[w (!.) x x' ]
w
w
G
G
2
The influence function of (3 is
= M- l
x w(x) 1/I[y-x'(3]
w -
-
(4.1.2)
•
E (4J(E))
F
•
where 1/1 is the derivative of 1/1 (assumed to exist a.e.), and the asymptotic
covariance matrix is
2
E 1/1 (E)
E2~(E)
= T u(1/I) say.
w
(4.1.3)
The Mallows approach to bounding the influence function consists of
separately bounding the effect of !. (by choice of w) and the effect of
(by choice of 1/1).
One would like to do this in an efficient manner.
E
One
approach, taken by Maronna, Bustos and Yohai (1979) is to split the problem
into two separate constrained minimization problems:
(i)
choose 1/1 to minimize u(1/I) subject to a bound on the contribution of
1/1 to the influence,
(ii)
e
choose w to minimize T subject to a bound on the contribution of
w
w to the influence.
The precise formulation of (i) and (ii) will depend on the optimality
57
criterion used and the definition of sensitivity which is of interest. The
problem of choosing
~
to minimize
u(~)
subject to a bound on sup
E
~
EF~(E)
is just a particular case of the multivariate version of Hampel's extremal
problem (1978)
and the optimal
~
is, in fact, a Huber psi-function.
interesting is the question of choosing w in an optimal way.
More
Maronna,
Bustos and Yohai considered this question and obtained that w which minimizes trace (T ) subject to a bound on sup
w
x
either that (a) p=l
1M-wI -x
or (b) the distribution of
w(x)
-
~
I
under the assumption
is assumed spherically
symmetric and consideration is restricted to w's depending on
~
only through
Motivated by the desire to consider alternative definitions of sensitivity, and to avoid such restrictive assumptions on the distribution of
we now formulate a number of related problems.
We shall give
~,
detailed
proofs only for the first case considered (bounding Y ) - since proofs for
l
the other two cases are entirely analogous, we merely state those results:
Case 1
I
If we are interested in bounding Y = sup 1M- x w(x)1 sup ~
1
x
wE E~(E)
I
then w should be chosen so as to bound sup 1M- x w(x)
w- x
manner.
I
in an efficient
There is a parallel here with the problem considered by Krasker and
Welsch (1982).
Their weight function w depends both on E
and~,
and they
are concerned with minimizing T = M- 1 N M-1 subject to a constraint on
w
w w W
where (up to a scale factor):
sup IW(E,X) E M-lxl,
w(E,~)
-
58
2
Since our Mw = E[w !.!.'], Nw = E[w !.!., ], the similarity of our problem to
theirs is evident (the only difference is the presence of E in their problem).
They adopt the approach - suppose there exists·a function wI which
minimizes the asymptotic covariance matrix T in the strong sense, subject
w
to the constraint (wI minimizes Tw strongly if T - T ~ 0 for all admiswI
w
sible w) - then derive a necessary condition which wI should satisfy. In
fact, as Ruppert (1983) has shown, in general, a strongly optimal weight
function will not exist; however, the weight function derived by Krasker
and Welsch is intuitively appealing, and gives reasonable estimates in practice.
We therefore adopt a similar approach, and now address the question
of deriving the first-order necessary condition which our weight function
should satisfy if it were to minimize T subject to a constraint a l on
w
[sup 1M:! !. w(!J
I].
x
Theorem 4.1
Let x be distributed according to a p-variate distribu-
tion G, where E I!.1 2 <~. Let wl~ be a bounded continuous function from
G
lR P to lR + • Define matrices
and
and let T = M- I N M- I . Define J(l)(x) = w(x) IM-Ixl.
Then, if wI miniw
w w w
w wmizes T subject to the sensitivity constraint sup J(l)(x) ~ aI' then WI
w
x
w
-
must satisfy (up to a scalar mUltiple)
(4.1.4)
Proof:
Our proof is a modification of that in Krasker and Welsch. We
first prove the following lemma.
59
Lemma 4.2
Suppose the sensitivity bound a l > 0 is given and wl
(1) (x)
such that it minimizes Tw subject to sup J w
x
< alloon the open set U = {x: J(1)(x)
w I
Proof:
~'
_
Let
~
~
al ·
(~
is
Then wI is constant
be an arbitrary nonzero pxI column vector. Then wI minimizes
T ~ among the class of all admissible weight functions.
w-
a bounded open set such that ~
c
U.
Let Co denote the Banach space of all
bounded continuous functions that vanish outside U'.
~'T
Co to minimize
~ =
tion problem:
choose v
the constraint
E[wI2(~)~~,] = E[(W I +V)2(_X) _x~,1].
€
Suppose U' is
-
w +v-
Consider this minimizaf(v), say, subj ect to
Note that J(1) (x) =
wl+v depends on v in a continuous way,
I
I (wl+v) (x)
I \M-wl+v
-
I
xl and, since Mwl+v
if v is sufficiently close to 0, then since sup J(l)(x) < aI' by assumption,
x wl it follows that sup J
(x) ~ a . Thus, by assumption on WI' ~' T ~ ~
1
w~
w1+vl
~' T
~, for v sufficiently close to 0, i.e. fey) has a local optimum
wl+v Writing G(v) =
at v = 0, subject to the constraint N = N
WI
WI +v
_
we see that fey) has a local minimum
=0
at v
2EW
V
1
subject to the constraint G(v)
xx'
= O.
Now, since G(v) - G(O)
=
+ o (v), the Frechet derivative of G at 0 is given by
(4.1.5)
A straightforward calculation yields
] -1 M-I] ~
WI -
(4.1.6)
Since the range of DG(O) is closed we may apply Corollary I to the Lagrange
MUltiplier Theorem given in Luenberger (1968, page 243) and conclude that
there exists some real number r and a bounded linear map A: M ~ R (where
M is the space of pxp matrices) such that
60
r Df(O) v
= A[DG(O)
v]
(4.1. 7)
I: r c .. x.x.] for some constants c .. ,
Now, r Df(O)v is of the form E[v(x)
- . . 1J 1 J
1J
1 J
while
E[wl(x)v(~) I: I:
A[DG(O)v] is of the form
A...
1J
A.. x.x.]
1J 1 J
i j
for constants
Thus, it follows from (4.1.7) that
E[vC.~) {wl(~) I: L
i j
A.. x.x. - I: I: c .. x.x.}]
1J 1 J
i j 1J 1 J
=0
This implies that
v
wI (!.) I: LA. . x. x . = I: I: c .. x ..
i j 1J 1 J
i j 1J 1J
The c ij and Aij are functions of
~
X E:
U'.
and hence there are two possibilities:
the first is that A.. (~) = k c .. (~) for some constant k and all _u, in which
1J
1J case wI is constant on U', and hence on U. If, on the other hand, the A's
were not proportional to the c's, then the weight function would be a nondegenerate ratio of quadratic forms on U that could not be independent of
But if WI is strongly efficient, then it must be efficient for all
cannot have a form depending on
~.
~
~.
and thus
Thus w (!.) must be constant on U, as
l
Claimed in the lemma.
We may assume w.l.o.g. that wI (!.) = I on U. We show that wl (!.) is
always S 1. Suppose, if possible, that for some!.o wI(!o) > 1. Then,
for· this x"' J (1) (x ) = a .
l
-v
wl:.:.o
a.
S
y
S
I}.
Let s = inf{a. ~ 0: J(1) (y x ) = a for all
l
wI
:.:.0
As y goes from 1 to s,
IM-WI1
y x,,1 =
-v
IYIIM-WI1
x,,1 decreases and
-v
thus wl(y !o) must increase to keep J~~)= a 1 . Thus, by continuity of WI'
w1 (s!.o) > 1. However, since s is the infimum and WI = 1 on U, continuity
implies wI(s!o)
= 1. This contradiction shows that wI(!o)
> 1 is impossible.
61
Thus,
btherwise
(when the sensitivity constraint applies).
Thus
o
as claimed.
The relationship wl
(~)
• Min[l'
IM~~ ~Ij
is an implicit one, since
I
the right hand side also involves wI.
By substitution,
al
j x x' }
IM-w1 -xl -I
In practice, therefore, M might be estimated by solving
wI
I
al
l x. x. '
M = I
Min[l,
I~-l
x. I] -1-1
WI n i=l
wI -1
and used to form estimated weights, wI (!.1
0
)
=
Min
[1,
IM-
a
l
l
wI
x.
l
IJ .
-1
As discussed above, condition (4.1.4) is simply a neeessQPY condition
for wI to give a minimum; it does not actually mean that WI minimizes T
w
strongly.
Choice of WI satisfying (4.1.4) does, however. guarantee a
weaker kind of optimality subject to the constraint, as the next theorem
shows.
Consider, for the moment, the class of estimators of
~
defined by an
equation of the type
n
I
i=1
n(x.) ~(Yo-xo '8) = 0
- -1
1 -1 -
(4.1.8)
62
Clearly, this class contains the Mallows estimators defined by (4.1.1)
(just set
2(~)
= w(~)~). The influence function for
is given by M- l n(x) ~(y-x'a)
n - - E~ (e:)
usu
a1 normalization
,
M
n
where M
n
= E[n(x)x'],
- - -
~
defined by (4.1.8)
and if we assume the
= I pxp' this simplifies to _n(_x) ~(y-x'a)
, and the
E1ji(e:)
asymptotic covariance matrix is given by
v (n,~) = E .!l(x)!}.' (x)
If we are interested in bounding Y , then it is of interest to choose n
l
to maximize efficiency subject to a bound a on sup In(~) I. The following
l
x
theorem gives the appropriate choice of .!l when the efficiency criterion is
the trace of the asymptotic covariance matrix.
Theorem 4.3
Let
Let .!l map lR Pinto lR p.
~
satisfy the same conditions as in Theorem 4.1.
Suppose there exists a nonsingular matrix M which
l
satisfies
Define
Then.!li
(4.1.9)
among all .!l satisfying sup l.!l(~) I sal
minimizes trace E[.!l(~).!l' (~)]
x
I
Proof:
pxp
For general .!l satisfying the constraints of the theorem
consider
-1
E[(M l
=
~
-
.!l(~)
-1
)(M l
~
-
M~l E x x' M~l - M~l _M~l
.!l(~))
+
']
E(2(~)2' (~)).
63
and thus choosing
~(~)
to minimize trace E
~(~)
constraints is equivalent to choosing
E[(M -1~ 1
~(!J)
(M-1
l ~ -
!l(~))']
.!ll~)
straints - and clearly
=
to
subject to the
minimize
trace
E IM-1 ~ - n(!J 12 ,subject to the conl
-1
= '¥a
n(~)~' (~)
(Ml~)
= ni (x)
achieves this.
0
1
Remark 1
The estimators obtained by solving
n
A
L
n~(~.) ~(y.-x.
i=l
where
~
~
1
1
'B) = 0
1-
is as in Theorem 4. 3, and
n
L
i =1
A
wl(X.)X. ~(y.-x. 'B)
-1 -1
1 -1 -
=0
where w is as in Theorem 4.1, are identical.
l
Thus the weight function
derived in Theorem 4.1 does minimize trace T subject to a constraint on
w
sup
Iw(~) ~I -
in fact the resulting estimator is optimal in this sense wi th-
x
in a somewhat broader class of estimators, as is shown by Theorem 4.3.
Remark 2
Theorem 4.3 is a stronger result than that of Maronna,
Bustos and Yohai (1979), since it avoids their assumptions on the distribution of x.
We now consider the problem of choosing a weight function w when other
definitions of sensitivity are used.
Case 2
If bounding the self-standardized sensitivity.
is of interest, then we are led to the problem of choosing w so as to bound
64
sup w(x) (x' N- 1 x) ~ in an efficient manner.
x
-
-
w
We adopt the same approach as
-
before and find the necessary condition which a weight function should satisfy if it were to minimize T subject to a constraint a 2 on sup
w
l
(~' N-w -x)~. We have the following analogue to Theorem 4.1. x
Theorem 4.4
Let
~
w(~)·
and w(x) satisfy the same assumptions as in
Theorem 4.1 and let M , N , T be defined as before. Define J(2)(x) =
w w w
w l
w(X)(x'N- x)~. Then, if W is chosen to minimize T subject to the sensiw
- - w z
tivity constraint sup J~Z)(~) ~ a Z' then w2 satisfies (up to a scalar
x
multiple) :
(4.1.10)
The proof is quite analogous to the proof of Theorem 4.1 and
Proof:
will not be presented here.
By substitution we see that
2
-x-,:-=-:-l-)
-
~~' ]
W
z
which suggests estimating NW by the solution to
z
"-
Nw = -I n~ Min [ 1,
. 1
x.
2 n ~=
-].
2
aZ
,;-1 xJ
Wz
x.x.'
-~-].
-~
and using estimated weights
W
(
X ) = Min[l,
2;
az
J
ex.' ~-l x. )~J
-~
in practice.
w -].
Z
We might conjecture that a result similar to Theorem 4.3 holds for the
estimator corresponding to the weight function wz.
This is not the case,
however, and attempts to solve the problem of minimizing trace Tw subject
65
-1 x) ~ do not lead to a tractable weight function.
ZThe weight function described by (4.1.10) has a certain intuitive appeal,
to a bound on sup w(x)(x'N
x
-
-
w
and is easy to work with in practice - and we used this weight function in
subsequent numerical work.
Case 3
If one works with the third definition of sensitivity
l
= sup !.'M: !.w (x)
x
sup~
£
E
IjJ(E)
-1
then one is led to the problem of choosing w so as to bound supw(!.)!.'M x
w
x
in an efficient way.
Adopting the same approach as in the previous two
cases, we have the following analogue to Theorems 4.1 and 4.4, which we
state without proof.
Theorem 4.5
Let!. and w(!.) satisfy the same assumptions as in
Theorem 4.1, and let M , Nand T be as before.
w w
w
Define J(3) (x) = w(x)w -
-1
!.'Mw~·
Then if w minimizes T subject to the sensitivity constraint
3
w
sup J~3)(!.) ~ a , w satisfies (up to a scalar multiple)
3
3
x
w3 (!.) = Min
[1, x'M-a l J
3
-
(4.1.11)
w
3
It follows readily that
M
w
3
=
E{Min
[1, x'M-a13 Jx x' }
-
w
3
and can thus be estimated by the solution to
1 n
= Min [ 1,
n 1=
. 1
L
x.x.'
-1-1
66
allowing one to form estimated weights
w (!.J.') = Minfl,a3
J
3
L
M- l
x.'
-1.
W
3
J
x.J·
-1
We remark once more that this choice of w does not yield a minimum
3
of T subject to the constraint, merely a saddle point. One might conjecw
ture that it minimizes trace T.
w
This conjecture is false, however. That
weight function which does minimize trace Tw subject to the constraint has
quite a complicated form, and does not lead to an estimator which is easily
computed.
For this reason, in subsequent numerical work we chose to work
with the weight function w3 defined by (4.1.11).
4.2
Mallows estimation of
4It
e
In this section we show how the results of the previous section may be
extended to the problem of estimating!.
We shall see that separately
bounding the influence of the residuals and of the position in !.-space allows
a much simpler extension of this method than was the case for the techniques
previously considered.
Assuming that the T. 's are known, the maximum likelihood estimate for
1.
(under normality) solves
I
n
[[
~
for
1.
exp[~ (Ti)~vfLE]
i=l
Write
Y·-T.
;
~(T)
)2
and E for (y-T)
-
J-h(T.)
I
=0
(4.2.1)
1
exp[-~'(T)!].
A Mallows estimator for
e
solves an equation of the form:
Ii=l
z. w (z.) X [
-1.
-1
Yi
-T i
l
l~xp[Z.1 6]J
-1. -
= 0
(4.2.2)
e
67
where w: lR q
E X(e:)
F
= 0,
-+-
lR
+
is bounded and continuous, and X: lR
E (e: X (e:)) >
F
o.
-+-
lR satisfies
The corresponding influence function is
xCe:)
In notation reminiscent of the preceding section we write . Mw= E [w(~) ~~' ],
G
N
w
= EG[w 2 C~~~'
].
The asymptotic covariance matrix of 8 is given by
EFX2Ce:)
2
E [e:X(e:)]
V(w,X)
In attempting to minimize V(w,X) subject to a bound on the sensitivity,
we may factor the bound as before, and split the problem into two separate
constrained optimization problems, choosing wand X separately.
2
The optimal choice of X is that which minimizes
a bound
on~
EFX Ce:)
E~[e:XCe:)]
subject to
Cat least if one considers Y or Y - for Y the conl
2
3
EF[e:X(e:)]
straint is somewhat different).
It follows from Hampel (1978) that the best
X (at the normal model) is given by xC·)
= ~~(.)
Huber psi-function.
Choosing w optimally entails attempting to minimize T = M- l N M- 1
w
w w w
subject to a bound on the contribution of w to the sensitivity, i.e. subject
sup J(~)Cz) ~ a~, ~ = 1,2,3, where, e.g., J51)C~) = IM~I~1 wC~). But
w z
a comparison with the previous section reveals that the form of T and JC~)
w
w
is exactly the same as for the regression case, with ~ replaced by z. Thus
to
the results obtained in the previous section are directly applicable to this
problem - the weight functions given in Theorems 4.1, 4.4 and 4.5 are also
appropriate to use in the estimation of 8.
So, for instance, if one is interested in bounding Y , then one would
I
obtain 8 by solving
68
n
where
~
[1,
x) -~}
x
al
]h(T.){1jJ2[
Yi-Ti
IM-lh(T.)1
J.
k
exp [h'(T.)8]
i=l
l-J.
J.,.,
~- E~ 1jJ~ (e:) and M is the solution to
l
o= L
Min
(4.2.3)
(4.2.4)
In the final section of this chapter we shall describe a 3-stage estimation procedure which combines Mallows estimates of the type suggested in
this and the preceding section.
Before doing this, however, we give some
attention to the change-of-variance function and its application in the
estimation problems under consideration.
4.3
Stability of variance
Until now, we have proceeded from the view that a small amount of con-
taminated data should not have a large effect on parameter estimates, and
have dealt primarily with bounding the sensitivity of our estimators. Large
increases in the (asymptotic) variance of an estimate are also undesirable
and thus, if possible, we would like to guard against very large values of
the change-of-variance function (CVF), i.e. we would like to bound the changeof-variance sensitivity (K) of our estimates.
In this section we shall show that, in the two situations of primary
interest to us here, Mallows
estimation of
developed with a view to optimally
~
and of
~,
the weight function
bounding the standardized gross-error-
sensitivity Y2' is, in a sense, precisely the weight function one would obtain if one proceeded from the standpoint of optimally bounding the standardized change-of-variance sensitivity, K2 .
69
We first derive the change-of-variance function for a general M-estimate; later we consider the regression and variance estimation cases separately.
Suppose then that an r-dimensional parameter n is estimated by the
solution of the equation
n
0
_l;(e:.,z.:n)
~_~ _ =
_
\
.1..
(4.3.1)
~=l
r
r
where .5.: R x R x R
r
R ;
-+-
(e:. ,z.) are independent drawings from an
~
-~
(r+l) dimensional distribution H with the property that e: and z are independently distributed according to distributions F and G respectively. Then
IF" (e:,_z;n)
-!l
where A =
EH[-d.5./d~],
(4.3.2)
and the asymptotic covariance matrix is
say.
(4.3.3)
Regarding V as a functional of the distribution H, contaminating slightly
(e:,~)
at a point mass
and taking the directional derivative, we obtain:
(4.3.4)
where
The standardized change-of-variance sensitivity is thus
K
2
=
sup trace[CVF(E,~) V-I]
(e:, ~)
= sup
[r+l'
-1
(e:,~)B
I(E.~)
+ 2 trace
·-1
l;(E.~)A
1
(4.3.5)
(e: .~)
where the supremum is taken over the set of points where
CVF(e:,~
is defined.
70
Now consider the regression case:
A Mallows estimate of B solves
n
l x. w(x.)
ljJ(y.-x.'
i=l -].
-'1
]. -].
5]
-
t:.
(4.3.6)
-
t:.
Recall that M = E[w(x)x x' ].
w
= 0
2
N = E[w (x)x x' ]
- --
w
- --
and
(X'N-lx)~
Y2 = sup w(x)
x
-
-
w-
-¥- .
sup
E (EIjJ (E)]~
Then. using (4.3.4) and (4.3.5)
CVF(E,_X)
1jJ2(E)
2
-1
-1
- w (x)M x x' M
E2[~(E)]
- W -- W
=V +
.
IjJ(E) we_x)
E~(E)
-r ~~
l
M
J
x x' V
w
(4.3.7)
x x' M-l
--
w
and the standardized change-of-variance sensitivity is
(4.3.8)
We have the following result, relating
Theorem 4.6
definite, then
If
K
2
and Y :
2
is a nondecreasing function and Mw is positive
IjJ
2 =P
K
+
2
Y2 .
This follows directly from Theorem l' and Corollary l' of
Proof:
Ronchetti and Rousseeuw (1983).
Thus, for a Mallows estimator for B
K
2
= P + sup
X
2
W (~
-1
XI N x
w -
sup
E
1jJ2 (E)
2
E1jJ (E)
71
~,
It follows that, if, for a given
we want to choose the weight function
l
which minimizes M- l N M- subject to a bound k on K2 , this is equivalent
w w w
l
2
-1
to choosing w to minimize M- l N M- subject to abound on sup w (x)x' N x.
w w w
- - wx
But this leads to a weight function of the form given in Theorem 4.4, i.e.
a2
. {
w2C.~) = M1.n 1,
-1
(x 'N
-
x)
!i} .
w -
Thus, whether one proceeds from the standpoint of minimizing T
subject to
w
a bound on Y2 or subject to a bound on K , one is led to the same condition
2
on w.
The same is true in the case of Mallows estimation of
needs a little more work.
n
L
i=l
where -1
z.
~y.
1
j
-T.
1
xp[z.'
-1
= 0
8]
(4.3.9)
-
-1 -
= -h(T.).
1.
(E )
-IF8'~
= M- l z
w -
w(z)
-
V(w,X) = M- l N M- l
w
and
but the proof
In this case 8;solves
z. w(z.) X
--J.
!,
Y2 =
W
W
X(E)
E[EX(E)]
2·
EX (E)
E2 [EX(E)]
supw(~)(~'N:l ~)~ sup~
2
~
z
E
It follows from (4.3.4) and (4.3.5)
(E)]
that
EX(E)
E[E:X(E:)]
and
[EX
l
J
M-l z z' w(z)V
+-;
w
Z Z '-W(Z)M-1
--
-
(4.3.10)
w
(4.3.11)
where we recall that the supremum is taken over those
CVF(E:,~)
is defined.
Again,
K
2
(E:,~)
for which
and Y are related in a simple way, as
2
72
described by the following result.
Theorem 4.7
Assume that both Mw and Nw are positive definite, that
X is continuous everywhere, and differentiable except on a finite set D.
Assume further that £X(E)
~
0 wherever X exists.
Then
The proof combines the results of Ronchetti and Rousseeuw
Proof:
(1983) for the regression case and the case of estimating a I-dimensional
We first show that Y~
scale parameter.
S K
2
- q.
We suppose that, for
some (£o'~)
2(
w
~
)
'N-I
~ w ~
(4.3.12)
and derive a contradiction.
Assume w.l.o.g. that £0 > 0, and that X(£O) exists.
(Otherwise,
K2(£0'~) =
EO > 0, X(£o)
~
O.
q => K2
~
implying 0 > K - q
2
q
Then
~
w(~)
0.)
Since
In fact, X(E ) > 0 because if X(E ) were 0
O
O
2
X (E O)
-1
2
q + -':'2-- w (~)~' Nw ~ >
EX (E)
> O.
K
then
2
,
a contradiction.
We now consider 2 cases:
Since X(E O) > 0 and X(E)
~
0 on [£O,EO+Y) for some Y > 0
that X(E) > X(E O) for E > EO'
e'
Define
net) =
-1
~
w(~) (~Nw ~)
2
(EX (E))!i
X(t) .
it follows
73
Define b
= (K
k
-q)2.
2
n2()
£
Then, by definition of
2EX·• (E)
E£X(£)
-
(
W Zn
)
"'l
'M- 1 Z
-=0 w -=0
Z
K
2
,
b2
S
£
Since D is finite, assume w.l.o.g. that (£0,00) n D
=~.
f- D
Then since
for E > £0
M: 1
i
2
2
EX(E) w(!o)!o'
!o ~
[n (E)_b ] > 0
E£X(£)
2
2
and thus, C1X (£) - 2 £X(E)C 2 s b
where
-1
b. ~' Nw
c1 =
Thus
2
2
w (~)
are positive constants.
EX (E)
-iC&2
2
X (E)-C
~
>
E[EX(E)]
for £ > EO
- E
where c
2 b.
= b 2 Ic 1 ,
= - ~ Coth-l[X~E))
Define R(£)
peE) =
Then R(£)
!o
~
d R,n E
V
peE)
E
~
£0 and thus
(4.3.13)
Now, as E
( E)
+
00
Since ~ =
the R.H.S. of (4.3.13)
xC£)
(4.3.13) is > 0
b
+
_00
t.:
c{
> 1
since both c and d are positive.
2
2
(by assumption, X (E O)C > b ), the L.H.S. of
1
(Coth
This is the required contradiction.
-1
(x)
=
1
x+1
"2 R,n(x_l))·
74
Since X(E O) > 0 and X(E) ~ 0 for E in [EO,EO-Y) for some Y > 0 X(E) <
X(E O) V E in (O,E O)' Then, by assumption, n(E ) < -b and so nee) < n(E ) <
O
o
-b
for 0 < E < EO'
Arguing as before, for 0 < E < EO
and we can deduce that R(E)
~
pee)
V E in (O,E O]' from which it follows that
Thus
Coth-l[X~E)J ~
C[P(EO)-R(EO)-d ,q,n E]
(4.3.14)
for E in (O,E O]
As E
~
0 the R.H.S. of (4.3.14)
~ +~
.
Since X~E) < -1 for 0 < E < EO' the L.H.S. is < O.
This contradiction
1·lty YZ <
·
proves t h e lnequa
Z - K Z - q. To see the reverse inequality we note
that, on (R \D) xR q since E X(E) ~ 0 and M > 0, then
w
2
-1
-1
2
q+ X2 (E) w C.~)~' N z - 2EX(E) w(z)z'M
-z
w
w
EEX(E)
EX (E)
2
-1
q+ X (E) w2 (~)~' N z S q + Y2
2
w 2
EX (E)
s
and thus
K
2
2
q + Y2' which proves the theorem.
s
We have shown that, for a Mallows estimator of
K
2
=q
+
sup w2 (z)z'N -1 z
- - w z
2
sup
E
X
0
e
(E)
2
EX (E)
l
Thus, if X is given, and we want to choose that w which minimizes M- N M- l
w w w
subject to a bound k on K , this is equivalent to choosing w to
2
75
-1
-1
2
-1
minimize M N M subject to a bound on sup w (z)z'N
w w w
-- w
z
weight function which satisfies
~,leading
to a
as before.
Throughout the development given above we have defined the standardized
CV-sensitivity to be
K
2
=
trace
CVF(E,~)
-1
V
.
Bounding this quantity corresponds essentially to guarding against large
positive values of the CVF, or equivalently, large increases in the asymptotic variance.
Welsch (1983, oral communication) has argued that, if
stable inference is one's goal, one might equally well wish to avoid a
severe deflation of variance when a small subset of the data is changed.
If this view is taken, then one should be concerned with bounding
Ki =
sup
(E,~)
Itrace CVF (E ,~) v-ll
.
Comparison with (4.3.5) shows that, for this quantity to be finite, we
require both
-1
£' Bland trace
.
~
-1
A
to be finite.
Welsch has remarked
that, in the regression situation, this requirement leads to a Mallows
estimate rather than a Krasker-Welsch estimate.
It will be instructive
to examine in more detail why this is so, since it will afford Us a deeper
insight into the different ways that Krasker-Welsch and Mallows estimators
give bounded influence.
the case of estimating
We shall also see
that the same hOlds true for
~.
Consider the Mallows (regression) case first:
In this situation,
76
trace
.
-1
A
I;
=
-~(e:)
x' M- 1 x
.
E\jJ(e:)
w
Clearly, if \jJ is a Huber psi-function, for instance, then sup \jJ(e:) and
e:
.
sup \jJ(e:) are finite (the supremum is taken over all e:, except the bending
e:
2
-1
points of the psi-function). To guarantee finiteness of both w (x)x'N
x
--
- -1
~
w(x)x' M
-- w
isfying
and
it is enough to bound the latter.
a3 ]
-1
x'M
x
w3 -
w (x)
3
use.
(Note that w2 (~)
= Min
[I,
of w(x)x' M- l x.)
- -
w -
This suggests w3 sat-
as an appropriate weight function to
a2
J does not guarantee boundedness
(x'N- ~)~
- w
l
2
w -
Now consider the case where it is proposed to use a Krasker-Welsch
estimator.
In this situation:
Aw = E[w e:
2
.!~'
and if w(e:,~) has the form Min
1
[1, TEl
i.!1 w
1
where
1·\w
is some norm de-
pending on w, then
= -s gn
e: I
[le:I
x x'
<~l-1~lw
and thus
= -sgn e: x' Aw-1 x
trace ~ A-I
- w
I
[le:I
while
1;'
-
B-
1
I;
w-
Now, bounding W2(e:,~) e:
= w2 (e:,x)
2
2-1
e: x' B
x' B~l x
-
<*1
x.
w-
only is not enough to ensure boundedness
77
can be kept bounded
for arbitrarily large values of
ing~'
I~I
by making
£
small enough.
Thus, bound-
B- 1 s does not in any way guarantee finiteness of IX'A-IX I
w - ~ - [1£1<
a
~
In order to ensure finiteness of Itrace CVF v-II, the quantity x' A-I x
w
must be bounded, and KW estimators do not necessarily achieve this.
is required is that outlying values of
~
of x must be controlled independently of
What
be downweighted, regardless of how
well the corresponding y-value fits the model.
£
-
In other words, the influence
i.e. a Mallows estimator should
be used!
The following example may illustrate why this is so.
Suppose, in a
simple linear regression problem, we obtain the following data
(dots
represent data points):
+Y
• •
•
-+
x
Consider the effect of adding a point (x,y) to these data, at the position
marked by the arrow.
While the x-value is far away from the bulk of the
x-values, the point fits the model suggested by the data very well very small.
£
]
is
Thus, a Krasker-Welsch analysis will give close to full weight
I.
73
to this point; as a result the variance of 8KW will decrease considerably.
On the other hand, a Mallows-type analysis will downweight the point consid-
erably, regardless of the small E-value, simply because the x-value is extreme - and thus adding the point will not affect the variance of
much.
~
very
The message is clear - if stability of variance is a desideratum, then
a Mallows estimate is appropriate.
A similar argument applies. in the case of estimating! - if stable inference is desired, then, to bound Itrace CVF(E,~)
v-I,
it proves necessary
to downweight outliers in factor space, even if the corresponding residuals
are small.
In other words, outlying values of
~
and large residuals need
to be treated separately, leading to a Mallows-type analysis.
Further, since
2
X (E)
EX
and
2
(E)
if both of these quantities are to be finite then w must be chosen so that
2
-1
-1
w (~) ~' Nw ~ and w(~) ~' Mw ~ are bounded.
It is clearly enough to
bound the latter of these - this suggests use of w satisfying
3
as before.
Remark
X(x)
= ~~(x)
Use of the classical 'Proposal 2' X-function
- E¢
~~(x)
guarantees boundedness of X2 (E) and E
X (E),
as do
a number of other common x-functions.
In summary, we have seen in this section that consideration of the
change-of-variance function gives further reason for using the types of
estimators proposed in Sections 4.1 and 4.2.
If one wishes to protect
against large increases in variance, i.e. if bounding
K
2
is a goal, we
have seen that one is led to the weight function w defined previously.
2
79
If, on the other hand, stability of variance is desired and one wishes to
large changes in variance due to contamination
avoid
(bounding Ki)' then
a Mallows estimator with weight function w3 as defined above is suggested.
In later applications we consider both weight functions.
4.4
A 3-stage Mallows estimation procedure
Here we combine the estimation procedures suggested in Sections 4.1
and 4.2 to obtain a 3-stage bounded estimate for
r:J.
For definiteness,
we work here with the second definition of sensitivity:
in later applications we also worked with Y3'
For future reference we now
present the full set of estimating equations; we also give the influence
e
function for the estimate of
and the final estimate of
B.
Since extensive
influence calculations of a similar nature were presented in Section 3.3
we omit the derivation here.
The estimation procedure is as follows:
Stage I
Solve
I x. Min [I,
i=l
-1
1
r
n i=1
n
I
i=l
for
~p'
Al and
01'
X
~ 1JJ [Y'-X"8~1 -1 --p
=0
al
A_I
(x.'
Al --J..
x.)
-1
rL
Min I,
[Y.1:1-P
-x.' B ~
~
'"
x.x.'
(4.4.2)
-1-1
=0
(4.4.3)
°1
We assume that
(4.4.1)
01
1JJ
is a nondecreasing odd function and
that X is an even function satisfying E X (E:)
= o.
80
Next solve
Stage 2
I h(x.' S) Min[l, [h' (x.' a
i=l- -], -p
x
-], -p
1
n
-n . I 1
Min[l,
1=
"a~
a2
~
tJX
x
r Yi-!.i'.§"
x
x
l'= 0
a )]:iJ L~xp [h'
(x.' a )~]J
--'1 -p
- -],-P
(4.4.4)
h(x.'
)A-1
2
-
x
~2 = 0
a )h'
(x.' 8)
- -], -p
- -],-p
1 h(x.'
h' (x.' a ) A-lh(x.' a )]
- -], -p
2 - --'1 -p
for AZ and!, where
Stage
3
-
x
(4.4.5)
is the estimate obtained in Stage 1.
Form estimated standard deviations a.1 = exp[h(x.'
f3 )8] and repeat
--J.-p-
Stage 1, using the transformed variables
x~
-1
"
= -1
x./a.1 and
Y~
1
'"
= y./a
..
],
1
In a manner similar to that described in Section 3.3. one can obtain the
influence function for the resulting estimates.
Those of most interest to
us are the estimate of 8 and the final estimate of
a.
The corresponding in-
fluence functions are (in a notation similar to that of Section 3.3):
(4.4.6)
(4.4.7)
Recall that
j,k
= 1, ... ,8
where ~. is the function of the parameters appearing in the jth estimating
-J
equation and
1k
is the k
th
component of the parameter vector, y.
We remark that, just as in the Krasker-Welsch case,
function for the final estimate of
~,
involves neither
8,
the influence
e nor -p
S.
The
-1
asymptotic covariance matrix for 8 is given by C-1 E ~ ~6 C66 - there is
66
no asymptotic cost in using estimates of a.], rather than the true a.1 's.
e
'"
As against this, the asymptotic covariance matrix for ! does have an
"
extra term due to the use of a preliminary estimate of 8 - Var(8) is given
81
by
There is
an asymptotic cost due to the use of t. (= x. I
1
S ) rather than
;~
T. in the estimation of 8.
1
Both the estimate of 8 and of
~
do, of course, have bounded influence,
as is evident from (4.4.6) and (4.4.7), since
~l' ~
and
~
are bounded.
CHAPTER V
COMPUTATION AND APPLICATIONS
5.0
Introduction
In this final chapter we examine the performance of the 3-stage
bounded influence estimators proposed in Sections 3.3 and 4.4
by using
them in the analysis of a number of data sets. Since the influence function
and change-of-variance function are essentially asymptotic tools, some investigation of how the estimators work in small samples seems advisable.
Examining the behavior of the estimators in specific examples also affords
better insight into the way they work.
In particular, we shall be inter-
ested in seeing just how well the proposed estimators deal with influential data points, and what price is exacted.
The particular model for the variance used in analyzing the data sets
below was the Box-Hill model:
0. =
1.
A
ol'LI
1.
if it seemed that the variance might reasonably be expressed as a monotonic
function of the mean.
Sometimes it seemed appropriate to model the variance as a monotonic
function of some linear combination
in this case we fit the model:
I
o. = 0 ~ (x.)
1.
-1.
IA
~(x)
of the explanatory variables -
83
Computational aspects
5.1
We wrote Fortran programs to compute the Krasker-Welsch estimator described in Section 3.3 and the Mallows estimator described in Section 4.4,
using the Box-Hill model for the variance.
a.1
= alT.1 IA)
Copies of the programs (with
are available from the author - only a small modification is
needed to fit the model a.=
alR-(x.)
1
-1
III..
We now outline the alRorithms used
in the computation.
Computation of the Krasker-We1sch estimate:
Stage 1:
The estimating equations are:
~L
Min
i=l
[1,
Xl]
bl
I~1' Ix.' M- x.
1 -1
l -1
x
(y.-x.'S )x. =
1 -1-p -1
0
(5.1.1)
where
~l'
·1
6
(y.-x.~
)/~l
1 -1-p
M satisfies:
l
l
'"
1 n (
bl
)
Ml = f
x 1
x . x .'
n i=l x.' M- X. -1-1
-1
l -1
L
(5.1.2)
(Z ~N(O,l))
where f (u) ~ E Min [ Iz I,u] Iz I
z
and a is a robust scale estimate solving:
l
t
x ., 8
nL X Y1.--;1
--p
i=l
a
J=0
(5.1.3)
1
For our choice of X we used the classical Huber proposal 2:
X(y)
= ~b2
2
(y) - E~ ~b (y) where ~b
1
1
1
is a Huber psi-function bending at ± b l ·
Note that the function feu) is readily seen to satisfy
u2
feu) = f 0 X3 Cy)dy
+.
u;r
r:l2 X2 Cy)dy
u
where XrCe) denotes the p.d.f. of a
x2 Cr) variable, and may thus be evaluated
using, for example, the IMSL subroutine MOCH.
84
n
To solve (5.1.2) we used starting value M(O) =! L x.x.' • then iterated
1
n.1=1-1-1
b 1 - I !]. . x .'
to convergence according to the scheme M1(k +1) = -1 nL f [ .
n i=l
,(k)
1 :1.
x.
M
x.
-1
-1
l
for k=O.1.2.....
The convergence criterion was framed in terms of the
II
d
, M-1 l xi' 1terat1ng
·
.
.
"ro bust d1stances.
Iib
= .xi
unt1·1 ~n
L.i=l Id(k+l)
Ii
- d(k)
Ii I was
less than a prespecified E.
Once the d
have been obtained. the equations (5.1.1) and (5.1.3)
li
may be solved by an iteratively reweighted least squares type algorithm
as follows:
(5.1.1) implies that
n~
L.
i=l
11,
't'
(Y.1 -x.'
B)
-1 -p
b
-d
x
x.
-1
°1
1
=0
(5.1.4)
Ii
while it follows from (5.1.3) that
nL ~1P 2
b
i=l
1
'"
(Y.1-x.'
BJ
-1 -p
x
01
~
- t;iJ = 0
The equations (5.1.4) and (5.1.6) may be written in the form
n
L
i=l
n~
• L.
1=1
'"
wI· (y.-x.' B ) x.
1 1 -1 '"""P -1
2
wIi
=0
(Y. -x.' B ) 2
1 -11'
x
01
J.=
0
(5.1.5)
85
where w
li
= l/J b ~El.i)A , whence it follows that ~ and 0"1 satisfy:
_1
Eli
dli
A
~
n
-1 n
x.y.w ·]
= . -1.-1.
x.x.' wI·]
1.
. 1~1. l 1.
1.=1
1.=
n:
n:
tL
n
"2
1 =
0
Letting S(k) and
2
i=lW li
(Y. -x.' S
1 -1
D &1
-p
r
]
(5.1. 7)
"2
1
(5.1.8)
0
;ik) denote the estimates at the beginning of the kth iter-
ation, (5.1.7) and (5.1.8) suggest the iteration scheme:
"2(k+l)
o1
=
iterating until some prespecified convergence criterion is satisfied.
We
used the ordinary least squares estimates as our starting values and
A(k 1) A(k)
"(k 1) A(1_)
Maxrl:}=ll S j + - Sj
I, 10 + - 0 " I ] as convergence criterion.
2
Note that the function f (u) = E Min(z2,u ) satisfies
2
z
and may thus be evaluated using the IMSL subroutine MDCH.
Stage 2:
computed.
!1
A
The estimated means from the preliminary fit, t. = x. ' S were
1. -1. l '
The equations to be solved at this stage are
(5.1.9)
with M satisfying
2
86
r
'"
1 n
b2
)
M
g(
x_I
h(t.)h'(t.)
2 =n i=l h'(t.)M h(t.)
1
1
-
1
2 -
(5.1.10)
1
where g(u) ~ EzMin[lz2-ll,u] Iz 2-l\.
Routine calculations involving
gamma integrals show that
l+u
l+u
[l-u] +
g(u) = 3 f Xs (x)dx + f
Xl (x)dx - u [ f Xl (x)dx - f
Xl (X)dxj
[l-u] +
[l-u] +
l+u
0
l+u
- 2
I
[l-u]+
x3 (x)dx + u
00
loof x (x)dx [l-u]
+
1
-f
X3(x)dx~
+U 3
0
J
and may be evaluated using the IMSL subroutine MDCH.
The equation (5.1.10) may be solved for M in a manner entirely anal2
'"
ogous to that used in Stage 1 to solve for M , and the robust distances,
l
l
d • = h'(t.) - h(t.), may be obtained for this estimation stage. For
21
2- 1
1
the Box-Hill model, h(t i ) = [lo~lt. I] and thus (5.1.9) yields:
M
.
I Min
i=l
~
1,
j i( [Yi-t~] -l~1
1
b
-t 22
2
'[",Yi ~1-1Id2i
alt.\A)
i~lMin
b2
[
1,
I[ Yi -t~] 2-lid
'"
A
.
JI(y. -t.] 2 }
J ll~I:. I~ -1
21
(5.1.11)
J
;\tiIA
1
n
=0
log It i
I•
0
(5.1.12)
1
o\til
It follows readily from (5.1.11) that, for any fixed value of A:
n
L w2 . CO,A) (Yi-ti) 2
i=l 1
It. IX
2
1
aA =
n
r
. 1
1=
where w2i (0, A)
w2 · (a, A)
1
(5.1.13)
87
We therefore proceeded as follows:
Fix A, solve (5.1.13) iteratively for a , and compute the absolute
A
value of the sum on the L.H.S. of (5.1.11) for these values of A and a A,
i.e. compute
H (A) /::,
n
•
Then minimize Hn(A) as a function of A (we used the subroutine ZXGSN in
the IMSL library).
For that value of A which gives the minimum, now solve
(5.1.13) iteratively to obtain the corresponding value of a.
Stage 3:
Using the fitted values {t.}, and the estimates a and A from
-
Stage 2, we computed
'"
~
'"
'"
A
a.~ = alt.~ I and transformed the data - defining
y.*
= y.la.
~
~
~
x.* = -~
x.la.~ ,
-~
We then repeated the computations of Stage I, using the transformed data
( xi * 'Yi *)n
i=l'
to
0 b tain
t h e f'~nal estimate of
~.
Q
Calculation of asymptotic covariance matrices:
The influence calculations of Section 3.3 provide the basis for computing the variances.
•
~
1
Recall that, in the notation of Section 3.3:
-1
= C66
~
••
'"
and var[n~(~KW- ~)] = E 3 13'
C
66
where E*
-1
= C66
-1
E(~~')C66
= E[-dla~{w3(E*,x*)E*x*}]
~ (y* -x* '§l/ a, and w3 (E* , x*)
• Min
~ '1 E~r"dJ
with d 3 =x*'
M~ Ix* .
88
It is easily shown that
x*x*'
C66 = E I[j£*I<bS/d ]
s
as
There is a number of possible ways to estimate C and N6 . One method
66
is simply to take expectations with respect to the empirical distribution
function of (£,!), i.e. estimate C66 by
n
x'!'x'!'
"
1
"""'1."""'1.
C66 = n I I "
i=l [I £i I<b/d ] as
si
-
and N by
6
N
6
b2
n
= -n I Min [1 , x ~ 2
i=l
I£~I
d S~·
~
1
j"
2
£~ x~x~'
~
-~-~
This method has the drawback that, if all the weights are 1 (i.e. if there
is no truncation), it does not reproduce least squares covariances, since
it does not separate the £'s and the x's.
One way to overcome this draw-
back is to exploit fully the assumed independence of the £'s and x's and
write the e.d.f. of (£,x) as the product of the marginal e.d.f.'s.
This
results in the estimates
1
n
b
n
· [1 ,
="2 I
I M~n
n i=l j=l
x
I£.~ I
2
J
S
* *'
2 2 "2
£. x.x.
~ -J-J
d ·
SJ
We used this method of estimation in the numerical work reported below,
since we considered it desirable to employ an estimation technique which
reproduces least squares covariances in the event of there being no downweighting.
89
To obtain the asymptotic covariance matrix of
recall that, in the
~,
notation of Section 3.3:
•
-1
8
= C44
~
and thus
Note that, if one works with the model o. = exp[h'(x.)8], then .the matrix
1.
- -1.C is zero and the second term vanishes. In two of the three data sets
4l
which we considered, this was the case; in the other we fit the model o.1. =
A
A
oiT.1. I and O.1. = olx.1. I with almost identical results.
-1
-1
We therefore omitted
-1-1
the second order term, C C4l C E(v1vi )C
C
C , in our analyses, on
44
ll
ll 4l 44
the basis that, for our examples, it is either zero or negligible, and not
worth the considerable extra complexity involved in computing it.
With this simplification
where C44 = Efl2E
and
N
4
=
2
I
E(~~')
[I E2 -11
=
where W (E,T) = Min[l,
2
h(T)h'(T)}
-
< b/ (!!.' (T)t-1;lh(T))]-
E{W~(E,T)(E2_1)2
IE2-II!!.'
h(T)h'(T)}
b2
(T)M~~
J
1!.(T)]
In the analyses reported below we used the estimates
r r 2E.1. I [lEi-II <b/d
i=l j=l
1 n
C44 ="2
n
n
"'2
"'2
2j
h(t.)h'(t.)
] - J J
J
1
n\
n\ M1.n
. ~(E.-I)
"'2
2 , b 2 /d 2 ·
N = --2
4 n
2 2J
L..~
1.
i=l J-l
.
'
h(t.)h (t.)
-
J -
J
90
d
where
"-1
= h' (t.) M h(t.),
2- J
J
2j
analogous to the estimates of C
and N described above.
66
6
Choice of truncation values:
The cutoff points b , b and b in the estimating equations still need
l
2
3
to be specified. Note that
so that
and, taking traces,
-1
Thus
p/b
l
= E{f[
b:
)
l
~'Ml ~
~':l~}
.
1
Consider the function feu)
= Ezu
.!.Min[lzl,ullzl= Ez Min[ill,lllzi
SE z Izi for
u
u
TIT?r '
S EI z I => b l 2:
l
is to have a solution, we need b l
each u. Thus p/b
2:
i. e. if the defining equation for M
l
k
(TI/2)z p where p is the dimension of B.
A similar argument shows that, if the equation
is to have a solution, then we need
where q is the dimension of
e.
91
In the examples which follow,
the lower bound initially.
we typically used cutoff points 1.5
x
We discuss the matter further in the analyses
below.
Remark
It is a relatively simple matter to alter the above algorithm
to compute Krasker-We1sch estimates where the emphasis is on bounding Y2'
rather than Y . The changes are confined to the computation of the 'robust
3
distances' at each stage, and to the manner in which the asymptotic variances
are calculated.
Equations (5.1.1) and (5.1.2) become
I
Min[l,
NI
= -n . L1
i=l
and
(5.1.14)
n
1
(5.1.15)
1=
respectively, so substitution of f
for f, and insertion of a square root
2
in the computation of the d li is sufficient. The changes in calculating
d 2i and d 3i are analogous. Calculation of the asymptotic variances requires
the
modification of inserting a square root in the appropriate places.
Lower bounds for the cutoff values also change.
The matrix N should
l
now satisfy
N
1
= E f
(
bl
)
l
2 (X'N~lX)~
t
X x'
--'
whence it follows that
-1
xx' N
-- 1
implying that
so that
-1
x'N 1
x-
-
b
2
1
92
Since the function
~(~
2
---2-
is bounded above by 1, it follows that b must
1
u
be ~ p if N is to exist. A similar argument yields a lower bound of q~
l
for b if the matrix N is to exist.
2
2
Computation of the Mallows estimate:
We now describe the algorithm used to compute a 3-stage Mallows estimate (bounding Y3) similar to that described in Section 4.4. Since some parts
of the algorithm are very similar to the Krasker-Welsch case, we shall give
a less detailed description here.
Stage 1:
The estimating equations are
n
=0
L
i=l
where
w1(~i)
= Min [ 1,
JJ
:_11
n
x.' A
-1.
X.
(5.1.16)
"- satisfies
and A
l
l -1.
(5.1.17)
and 01 is a robust scale estimate solving
)
n
1.=1
X
j
~.-x.'
8
l;l=P = 0 .
01
2
We used X(Y) = ~b(Y)
-
2
E~ ~b(Y)'
(5.1.18)
Equation (5.1.17) was solved in the
.
f as h 1.on
. , us1.ng
.
obvious recurS1.ve
n1 ~n
~i=l
, as start1.ng
.
va 1ue.
~ixi
E
'
st1.-
1
mated Mallows weights,w 1 (x.)
x.)] were then obtained.
-1. = Min[l, al/(x.'
-1. A-1 -1.
The equations (5.1.16) and (5.1.18) may be rewritten as
n
~
"
x. v. (y. - x.' l3 )
~ -1. 1. 1. -1. ~
i=l
r
=0
(5.1.19)
93
where
1P ((y.-x.'
B )/0)
1. -1. --p
b
~x
((y.-x.' 8 )/a)
1. -1. --p
=wl(X.)
1.
-1.
Vo
'"
2(Y'-X.,
8 )
1. -1. -n
nt
and
l.
i=l
1Pb
x
a
=n ~
~
(5.1. ZO)
where
Equations (5.1.19) and (5.1.20) may be solved by an iteratively reweighted
least squares algorithm quite similar to that described in the KraskerWelsch case.
Again, we used the ordinary least squares estimates as start-
The estimated means from the preliminary fit, t.1. ~ -:1.-p
x.' B were com-
Stage Z:
o
puted.
Mallows weights for the second step were obtained in the same way as
for the first, with
A
A
Z
w
2i
e
= -1
~(ti)
n
l:
n i =1
substituted for xi in the comnutation of A :
Z
Mi n [ 1,
h' (t
1.
0
)
A-lh
(t. )
2-1.
:2
= Min [1 ,
J h (t
aZ
-
0
)
h ' (t
1. -
0
)
1.
1
h' (t. )A-lh(t.)J
1. 2-1.
solves
n
[
i t h(t i ) wZi X
Z
We used X(Y) = 1P b (y)
and (5.1.21) yields
-~,
J
exp[~' (~i)~Jl = Q.
y.-t.
as before.
For the Box-Hill model,
(5.1.21)
~(t)
1
= [logltl]
(5.1. 22)
n
IW'
i=l z1.
[Yo-tO]
loglt. I X 1. ~ = 0
1.
alt. I
1.
To solve the system (5.1.22)-(5.1.23)
(5.1.Z3)
we adonted
94
the same approach as described previously - fix A. solve (5.1.22) for a A
(regarding A as fixed). and compute the absolute value of the sum on the
L.H.S. of (5.1.23) for these values of A and aA'
Using the subroutine
ZXGSN find the value of A which minimizes this absolute value. and then
find the corresponding value of a.
Stage 3:
Using the fitted values {t.}. and the estimated a and A from
1
A
A
A
A
Stage 2. compute a.1 = alt.1 I and transform the data - define
x'!' = -1
x.la.;
1
y'!' = y./a .•
-1
1
1
1
We then repeated the calculations of Stage 1 using the transformed data
(x~.y'!')
-1.
1
i=l •...• n.
Calculation of the asymptotic variances:
Influence calculations similar to those of Section 4.4 provide the basis
for computing the variances.
Var(n~(a-.§)) =
where
M
w3
~ E[w~(x*)x*x*']
.) - - -
may be estimated by
and
may be estimated by
1
n
2
NW3 = - L w (x'!')x~x~'
n i=l 3 -1 -1-1
The quantity
E 1jJ2(£)
- -
E2~(£)
may be estimated by
where e.1 =y~1 - -x'!"1 B
-
9S
In practice, we mUltiplied by a finite-sample correction factor:
Compare with, e. g., Carroll and Ruppert (1982) or Huber (1981, page 173).
To obtain the asymptotic covariance matrix of
!'
recall that (in the
notation of Section 4.4)
(5.1.24)
•
•
C = E[EX (E)] E[w h(T)h'(T)] = E[EX (E)] A ,
44
2
2
2
Z
E[~~' ) = E X (E) E[w !!.(T)!!.' (T)] .
2
Thus the first term on the R.H.S. of (5.1.24) may be estimated by
n 2
J
-nl i=l
L w2l.·h(t.)h'(t.)
l. l.
~
A
where r.l. = (Yl.·-t.)/(alt.
l.
1
A_I
AZ
1 n 2
-Lx(r.)
n l.=
. 1
l.
1 n.
2
[-n .Ll lr·x(r.)]
l.
l.= .
A
I ).In practice we multiplied by a correction fac-
tor of n/(n-q).
We note as before that, if one works with the model a.l. = exp[h'(x.)8]
- -1then the matrix C is zero so that the second term in (5.1.24) vanishes.
4l
For the model a.1 = alt.1
IA,
however, the second term is not necessarily zero
(though in practice it turned out to be negligible for the data set under
consideration).
Now
-1
-1
-1
1-1
We need thus to estimate C44 C4l Cll E(v1V 1 )C~l C4l C ·
44
-1
E~2t< eXP[~~ (~:~)!]j]
Cll E(~lvl' )C U
=
-1
-1
MWI NWI MWI
E2
[$[< eXP[;~ (~'~)!]JJ
96
2
= E[wI!.!.' ] may be estimated in the usual way,
1 n
n
and the scalar factor is estimated by
2
L
tJJ (u.)
. 1
1
1=
--~-------
l I ~(u. )J2
. 1
Un 1=
S )
(y. -x.'
1.
scaled residuals
1
-1
~
x
·
from the preliminary fit, and 01 is a correction
(1
factor based on these residuals, analogous to Cl above.
The matrix C is, by definition:
4l
E[-a/dS (h(x'S ) Min[l,
-pt - - -pt
!!.'
a2
c.~'~Pt)A;~hC!.'~t)
JX L
J J
l.exp[!!.' (!.'~t)~tJ t=O
which may be shown to equal
EG[Min[l,
a: l
!!.' (T)A 2 !!.(T)
j !.h(T)~'C'r)~J EF
E: X(E:)
and may be estimated by
1
n
AI.
•
(-n i =1
L w21.'
x. h(t.)h'(t.)~ -L r. x(r.).
-1. - 1. 1.
n
1.
1.
•
0
In the Box-Hill model, h(t
i ) = (lilt.
I)
so that
1
•
A
A
h'(t.)e = Allt.l.
- 1.1.
Thus
1 n
1 n
•
= (- L w2 ' x. h(t.) A/lt1.' I) - L r.1. x(r.)
1.
n i=l
1. -1. - 1.
n i=l
A
and we have all the quantities needed to estimate Var e.
Choice of truncation values:
Note that, e.g .
. [ 1,
Al = E M1.n
J x x'
1
a -1
x'A x
-
1-
(Y-!.'~t)
by definition ,
97
so that
so we must have a
l
~
p.
In general, the dimension of the parameter being
estimated is a lower bound for the truncation value.
Remark

If one wishes to bound γ₂ rather than γ₃, and defines Mallows weights accordingly - e.g. in Stage 1, let w_1(x) = Min[1, a_1/(x' B_1^{-1} x)^{1/2}], where B_1 satisfies

B_1 = E\left[ \mathrm{Min}\!\left(1,\; \frac{a_1^2}{x' B_1^{-1} x}\right) x x' \right]        (5.1.25)

- then only a slight modification to the above program is needed (in those segments where the Mallows weights are defined at each stage).  The lower bounds for the cutoff points a_1, a_2 and a_3 are then p^{1/2}, q^{1/2} and p^{1/2} respectively, as is clear from Equation (5.1.25).
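The text does not record how (5.1.25) was solved numerically; one natural possibility is a fixed-point iteration, sketched below (the iteration scheme, starting value and tolerance are assumptions of this illustration).

```python
import numpy as np

def mallows_weights_gamma2(X, a1, tol=1e-8, max_iter=200):
    """Solve (5.1.25) for B_1 by fixed-point iteration; return Stage 1 weights."""
    n, p = X.shape
    B = X.T @ X / n                                           # starting value
    for _ in range(max_iter):
        q = np.einsum('ij,jk,ik->i', X, np.linalg.inv(B), X)  # x_i' B^{-1} x_i
        m = np.minimum(1.0, a1**2 / q)
        B_new = (m[:, None] * X).T @ X / n                    # R.H.S. of (5.1.25)
        if np.max(np.abs(B_new - B)) < tol:
            B = B_new
            break
        B = B_new
    q = np.einsum('ij,jk,ik->i', X, np.linalg.inv(B), X)
    w = np.minimum(1.0, a1 / np.sqrt(q))                      # w_1(x) as defined above
    return w, B
```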
5.2  Example 1 - Glucose data
As our first example we return to the data set mentioned in Chapter I, originally presented by Leurgans in the Biostatistics Casebook (1980).  We recall that the data are concerned with the comparison of a test method and a reference method for measuring glucose concentration in blood.  The data are given in Table 5.2.1.
TABLE 5.2.1
GLUCOSE DATA

Patient  Reference  Test          Patient  Reference  Test
No.      (X)        (Y)           No.      (X)        (Y)
 1       155        150           24       100         89
 2       160        155           25        80         71
 3       180        169           26        90         78
 4        80         79           27        74         70
 5       106        103           28        95         87
 6       185        173           29        81         74
 7       139        132           30        84         79
 8       273        261           31        25         31
 9       220        192           32        68         62
10       293        274           33        80         77
11       350        333           34        75         72
12        79         73           35        85         85
13        67         64           36        77         79
14       202        177           37       120        114
15        57         57           38       115        109
16       354        318           39       164        157
17       300        268           40        85         82
18       370        351           41        86         83
19       198        168           42        87         86
20       307        281           43       247        227
21        77         75           44        94         88
22        85         75           45       111        102
23        58         56           46       210        188
We shall be interested in investigating the relationship between the reference method (X) and the test method (Y).  In doing this we shall have occasion to use the bounded influence techniques proposed in Chapters III and IV.  The fact that our method of analysis differs from that of Leurgans should in no way be interpreted as a criticism of her work - our main objective is to examine the performance of the Mallows and Krasker-Welsch estimators, described in previous chapters, on an appropriate data set.

A scatterplot of Y against X for these data (Figure 5.2.1) reveals a pronounced linear trend.  However, a plot of the least squares residuals against fitted values gives a clear indication of heteroscedasticity - the variance increases with the fitted value (Figure 5.2.2).  One way of dealing with this heterogeneity of variance is to transform the data - Leurgans performs a log transformation, fitting the model

\ln y_i = \alpha' + \beta' \ln x_i + e_i' .        (5.2.1)

The residual plot corresponding to this model is given in Figure 5.2.3.  It is clear from this plot that the point with the lowest X- and Y-value (observation #31) is a massive outlier, both in X and in Y.  Leurgans deletes observation #31, and thus obtains the residual plot shown in Figure 5.2.4.
[Figure 5.2.2: Least squares residual plot for the complete glucose data - residuals plotted against predicted value.]
[Figure 5.2.3: Least squares residual plot for the complete log-transformed glucose data - residuals plotted against predicted value (log scale); observation #31 stands out.]
[Figure 5.2.4: Least squares residual plot for the reduced log-transformed glucose data - residuals plotted against predicted value (log scale).]
Rather than transforming the data we adopt a weighted least squares approach in our analyses, partly because our interest in this thesis is focused on such methods, but also because in many practical situations transformation of the data gives results which are difficult to interpret.  For the moment we consider four methods of estimation:

(i)  The classical method, based on a normal likelihood assumption, using maximum likelihood estimation at each stage - we shall refer to this method as the 'generalized least squares' estimator (GLS).  It may be computed by setting all truncation parameters at infinity in the Mallows program.

(ii)  The method proposed by Carroll and Ruppert (1982), which guards against the influence of large residuals, but not necessarily against high leverage points.  This may be implemented by setting the Huber constant at the desired value in the Mallows estimation program (we used 1.5), and setting the Mallows truncation values at infinity.  We shall refer to this as the 'Huber' method (since the residuals are Huberized) and use the abbreviation 'HU' for the estimators.

(iii)  The Mallows-type bounded influence estimator described in Section 4.4 (bounding γ₃) - in implementing this we set the Huber constant at 1.5 and the Mallows truncation values at 3, 3, 3 respectively (i.e. 1.5 times the lower bound).  We shall refer to this as the Huber-Mallows estimation procedure and abbreviate it as HM3.

(iv)  The Krasker-Welsch estimator described in Section 3.3 (bounding γ₃), with the truncation parameters set at 1.5 times the lower bound.  We shall use the abbreviation KW3 when referring to this estimation procedure.
The various analyses were performed both on the full data and on the reduced data (with observation #31 deleted) to assess the effect of deleting the influential point.  We performed the analyses using both the model σ_i = σ|t_i|^λ and the model σ_i = σ|x_i|^λ for the variance, with virtually identical results.  For this reason, only the results for the first model are presented here.  Table 5.2.2 gives a summary of the parameter estimates and standard errors for each method of estimation.  Examination of the table reveals a few points:
(i)  The estimate of β is quite stable, whether or not observation #31 is included, regardless of the method of estimation.  In view of the very pronounced linear trend evidenced by Figure 5.2.1 this is not too surprising.

(ii)  As noted earlier, if θ is estimated by maximum likelihood methods, the estimate of λ changes considerably on deletion of observation #31.  This change is less for the Huber estimation procedure, but is still about 20%.  Both bounded influence methods give a more stable estimate of λ, the Krasker-Welsch estimate particularly so.  Similarly, the bounded influence techniques yield by far the most stable intercept estimate.

(iii)  The Huber-Mallows estimates also exhibit the stability of variance predicted by theory.  Deletion of observation #31 results in increased standard errors for each estimation method, but the change is smallest for the Mallows estimates.  Krasker-Welsch variances are also quite stable.

(iv)  The analyses suggest a value of about .9 for λ.  Leurgans' choice of a log transformation, corresponding to λ = 1, is thus seen to be a sensible one (she also performed a weighted least squares analysis, using λ = 1).
TABLE 5.2.2
GLUCOSE DATA - PARAMETER ESTIMATES
(Standard errors in parentheses)

               GLS                    HU                     HM3                    KW3
          Full      #31         Full      #31         Full      #31         Full      #31
          Data      deleted     Data      deleted     Data      deleted     Data      deleted

α̂         4.325     3.326       4.609     3.223       3.407     3.214       3.253     3.060
         (1.198)   (1.264)     (1.170)   (1.332)     (1.344)   (1.358)     (1.474)   (1.496)

β̂         0.901     0.907       0.899     0.909       0.907     0.908       0.908     0.910
         (.0099)   (.0118)     (.0110)   (.0125)     (.0130)   (.0132)     (.0137)   (.0138)

λ̂         0.52      0.86        0.72      0.87        0.91      0.97        0.73      0.75
         (.176)    (.169)      (.192)    (.206)      (.220)    (.218)      (.210)    (.221)
Since predicting Y from X is a primary concern for these data, the widths of prediction intervals are of interest.  We therefore computed such prediction intervals (for a mean) for various values in the range of X, using each of the four estimation methods, and investigated the change upon deleting observation #31.  The results are summarized in Table 5.2.3, which gives ratios of confidence interval lengths (length of C.I. based on reduced data divided by length of C.I. based on full data) for different values of X, for the respective estimation methods.
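For concreteness, the length of such an interval for a mean response at a reference value x₀ can be computed from an estimated covariance matrix V of the fitted coefficients (α̂, β̂); a minimal sketch, assuming a normal quantile:

```python
import numpy as np
from scipy.stats import norm

def mean_pred_interval_length(x0, V, level=0.95):
    x = np.array([1.0, x0])                           # intercept plus reference value
    half = norm.ppf(0.5 + level / 2) * np.sqrt(x @ V @ x)
    return 2 * half

# ratio reported in Table 5.2.3 at a given x0:
# mean_pred_interval_length(x0, V_reduced) / mean_pred_interval_length(x0, V_full)
```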
TABLE 5.2.3
GLUCOSE DATA - RATIOS OF PREDICTION INTERVAL LENGTHS (REDUCED/FULL)

Value       GLS       HU        HM3       KW3
of X
 60        0.915     1.052     0.977      .996
100        0.873     0.950     0.937      .960
140        1.027     0.992     0.964      .969
180        1.136     1.048     0.986      .986
220        1.179     1.079     0.997      .995
260        1.195     1.096     1.002     1.000
300        1.202     1.106     1.005     1.002
340        1.205     1.113     1.007     1.004
380        1.206     1.118     1.008     1.006
The gain in stability from using bounded influence methods is obvious from Table 5.2.3.  For the GLS method, deletion of a single observation resulted in changes in the length of the confidence interval of up to 20%; while using the 'robust' (Huber) method remedied this to some extent, changes of 10-11% still occurred.  The Huber-Mallows procedure resulted in a maximum change of just over 6% (though over much of the range of X it was confined to only 1 or 2%), while for the Krasker-Welsch method the change never exceeded 4%.

It is of interest to see how much downweighting occurred in the bounded influence estimation.  This information is summarized in Table 5.2.4.  The 'Huber weight' referred to is the quantity ψ_c(r_i)/r_i, where r_i is the i-th residual; this will have a value of 1 if |r_i| < c, and of c/|r_i| if |r_i| ≥ c.  (In our analyses c was 1.5.)
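In code, the Huber weight just described is simply:

```python
import numpy as np

def huber_weight(r, c=1.5):
    r = np.asarray(r, dtype=float)
    out = np.ones_like(r)                 # weight 1 when |r| < c
    big = np.abs(r) >= c
    out[big] = c / np.abs(r[big])         # weight c/|r| when |r| >= c
    return out
```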
TABLE 5.2.4
GLUCOSE DATA - AMOUNT OF DOWNWEIGHTING IN THE BOUNDED INFLUENCE METHODS
(Mallows and Krasker-Welsch truncation values = 1.5 x lower bound)

            HM3                                 KW3

Step I      Huber weight #31 = 1.0              Krasker-Welsch weight #31 = .52
            Mallows weight #31 = .81            Average Krasker-Welsch weight = .79
            Average Mallows weight = .86

Step II     Mallows weight #31 = .30            Krasker-Welsch weight #31 = .03
            Average Mallows weight = .95        Average Krasker-Welsch weight = .92

Step III    Huber weight #31 = .39              Krasker-Welsch weight #31 = .03
            Mallows weight #31 = .08            Average Krasker-Welsch weight = .89
            Average Mallows weight = .93
Certainly, both the Mallows and Krasker-Welsch estimators downweight observation #31 considerably - in the final stage of the analysis the total weight given to this point is only .03; in a sense this can be considered a formalization of Leurgans' deletion of the point.

The gain in stability provided by the bounded influence estimators is achieved at a price - namely a loss of efficiency.  Table 5.2.5 gives an idea of the cost in prediction - it gives the ratio of the length of a prediction interval obtained by each estimation method to the prediction interval obtained by GLS (figures are based on the reduced data).  Use of a Huber estimate results in an increase of about 6%; however, both bounded influence methods result in an increase of about 10-15%.  Similarly, comparison of the variances for the reduced data shows a loss of efficiency of about 20% (over GLS) for the Mallows estimates and up to 35% for the Krasker-Welsch estimate!  This seems quite a high price to pay for the extra stability of these estimates.
TABLE 5.2.5
GLUCOSE DATA (REDUCED) - RATIO OF PREDICTION INTERVAL LENGTHS TO GLS

Value       HU        HM3       KW3
of X
 60        1.050     1.029     1.171
100        1.049     1.031     1.127
140        1.056     1.097     1.123
180        1.058     1.118     1.137
220        1.059     1.123     1.147
260        1.059     1.123     1.153
300        1.059     1.123     1.156
340        1.059     1.122     1.158
380        1.059     1.121     1.160
We therefore repeated both the Huber-Mallows and Krasker-Welsch analyses, this time setting the truncation parameters at twice, rather than one-and-a-half times, the lower bound, to offset the loss in efficiency.  (It also seemed desirable to reduce the amount of Krasker-Welsch downweighting - the average K-W weight reported in Table 5.2.4 is somewhat low - it would be preferable to minimize the effect of observation #31 without necessarily downweighting as much across the board.)  The results of these further analyses are presented in Table 5.2.6.
TABLE 5.2.6
GLUCOSE DATA - BOUNDED INFLUENCE ANALYSES (BOUNDING γ₃)
TRUNCATION PARAMETERS = 2 x LOWER BOUND

               Huber-Mallows (3)              Krasker-Welsch (3)
          Full Data    #31 deleted        Full Data    #31 deleted

α̂          3.457        3.202              3.545        3.379
          (1.265)      (1.334)            (1.363)      (1.370)

β̂          0.907        0.909              0.905        0.907
          (.0120)      (.0127)            (.0129)      (.0129)

λ̂          .83          .90                .84          .87
          (.211)       (.203)             (.198)       (.191)

Step I     Huber weight #31 = 1.0          K-W weight #31 = 1.0
           Mallows weight #31 = 1.0        Average K-W weight = .87
           Average Mallows weight = .93

Step II    Mallows weight #31 = .46        K-W weight #31 = .07
           Average Mallows weight = .97    Average K-W weight = .975

Step III   Huber weight #31 = .43          K-W weight #31 = .04
           Mallows weight #31 = .16        Average K-W weight = .97
           Average Mallows weight = .98
The estimates and standard errors appear quite stable, and the loss of efficiency is reduced to about 15% in the Mallows case and 17-18% for the Krasker-Welsch estimates.  The amount of downweighting in the Krasker-Welsch procedure is, of course, also reduced; nonetheless observation #31 continues to receive a very low weight in the latter estimation stages.  In fact, the K-W estimate of λ with this amount of downweighting appears more trustworthy than the KW3 estimate of Table 5.2.2 - our feeling is that setting the truncation values at 1.5 times the lower bound actually resulted in too much downweighting in this example, adversely affecting the estimate of λ.  On the basis of the above analyses, a tentative recommendation would be to set the truncation parameters at twice the lower bound, at least as an initial value.
Finally, we performed bounded influence analyses, of both the Krasker-Welsch and Mallows variety, where the weight function was chosen with a view to bounding γ₂ rather than γ₃ (we abbreviate these estimates as KW2 and HM2 respectively).  As might be expected, the amount of downweighting with these Mallows estimators was less than for the corresponding estimators bounding γ₃.  (Roughly speaking, the explanation for this is as follows: an outlier in x-space is weighted proportionally to |x|⁻¹ if γ₂ is bounded, and proportionally to |x|⁻² if γ₃ is bounded, norms being taken with respect to the relevant matrices - clearly, bounding γ₃ will result in lower weights.)  We ran the analyses initially setting the truncation parameters equal to twice the lower bound.  This gave reasonable results in the Krasker-Welsch case: the estimates were fairly stable; observation #31 was given weight 1.0, 0.09 and .14 at the respective estimation stages, and the loss of efficiency over GLS was just less than 15%.  However, in the Huber-Mallows case, setting the cutoff points at 2p^{1/2} (resp. 2q^{1/2}) did not achieve the desired stability - the degree of downweighting was minimal.  We therefore repeated the analyses with cutoff values at 1.5 times the lower bound - the results are presented in Table 5.2.7, to illustrate the kind of behavior which occurs.  KW2 gave reasonably stable estimates, though not as stable as the KW3 estimates.  It would appear from Table 5.2.7 that one should choose even lower truncation values to stabilize the Mallows estimates; we ran the program with cutoff values at 1.2p^{1/2} (1.2q^{1/2}), but even then we failed to achieve the same stability as was shown by the HM3 estimator.
TABLE 5.2.7
GLUCOSE DATA - BOUNDED INFLUENCE ANALYSES (BOUNDING γ₂)
TRUNCATION PARAMETERS = 1.5 p^{1/2}, 1.5 q^{1/2}, 1.5 p^{1/2}

               Huber-Mallows (2)              Krasker-Welsch (2)
          Full Data    #31 deleted        Full Data    #31 deleted

α̂          3.871        3.232              3.445        2.965
          (1.315)      (1.333)            (1.384)      (1.518)

β̂          .904         .909               .910         .914
          (.0120)      (.0126)            (.0127)      (.0139)

λ̂          .77          .88                .78          .85
          (.201)       (.207)             (.221)       (.215)

Step I     Huber weight #31 = 1.0          K-W weight #31 = .48
           Mallows weight #31 = 1.0        Average K-W weight = .76
           Average Mallows weight = .975

Step II    Mallows weight #31 = .73        K-W weight #31 = .04
           Average Mallows weight = .99    Average K-W weight = .88

Step III   Huber weight #31 = .51          K-W weight #31 = .06
           Mallows weight #31 = .46        Average K-W weight = .79
           Average Mallows weight = .99
In summary, we found that either Krasker-Welsch estimation procedure (KW2 or KW3) gave satisfactory results - we would recommend setting the truncation values at twice the lower bound (setting them lower appears to give 'too much' downweighting, as well as a concomitant loss of efficiency).  If a Mallows estimator is to be used (and they are cheaper and easier to compute), we would recommend HM3, i.e. bounding γ₃ rather than γ₂ - again, setting the cutoff points equal to twice the lower bound gave reasonable results.  The Mallows estimator did, by and large, have slightly more stable standard errors, but we do not feel, on the basis of these data, that this phenomenon was noticeable enough to militate against the use of a Krasker-Welsch type estimator.
5.3  Example 2 - Salinity data
The second example which we consider involves measurements on water
salinity and river discharge taken in North Carolina's Pamlico Sound. The
data, which are given in Table 5.3.1, were used by Ruppert and Carroll (1980)
to illustrate robust regression methods, and have been further analyzed by
Atkinson (1983) as an illustration of diagnostic displays for influential
points in regression and transformations.
TABLE 5.3.1
SALINITY DATA

Observation   Salinity   Lagged      Trend   Discharge
                         salinity
  1            7.6        8.2         4       23.005
  2            7.7        7.6         5       23.873
  3            4.3        4.6         0       26.417
  4            5.9        4.3         1       24.868
  5            5.0        5.9         2       29.895
  6            6.5        5.0         3       24.200
  7            8.3        6.5         4       23.215
  8            8.2        8.3         5       21.862
  9           13.2       10.1         0       22.274
 10           12.6       13.2         1       23.830
 11           10.4       12.6         2       25.144
 12           10.8       10.4         3       22.430
 13           13.1       10.8         4       21.785
 14           12.3       13.1         5       22.380
 15           10.4       13.3         0       23.927
 16           10.5       10.4         1       33.443
 17            7.7       10.5         2       24.859
 18            9.5        7.7         3       22.686
 19           12.0       10.0         0       21.789
 20           12.6       12.0         1       22.041
 21           13.6       12.1         4       21.033
 22           14.1       13.6         5       21.005
 23           13.5       15.0         0       25.865
 24           11.5       13.5         1       26.290
 25           12.0       11.5         2       22.932
 26           13.0       12.0         3       21.313
 27           14.1       13.0         4       20.769
 28           15.1       14.1         5       21.393
TABLE 5.3.2
SALINITY DATA - ORDINARY LEAST SQUARES ANALYSIS

                               Full       #16        #16 & #5
                               Data       deleted    deleted

β̂₁ (Intercept)                 9.526      18.099     22.76
  S.E.(β̂₁)                    (2.92)     (3.26)     (3.71)

β̂₂ (Lagged salinity)           0.777      0.697      0.700
  S.E.(β̂₂)                    (.086)     (.072)     (.067)

β̂₃ (Trend)                    -.025      -.157      -.250
  S.E.(β̂₃)                    (.161)     (.133)     (.131)

β̂₄ (Discharge)                -.295      -.631      -.836
  S.E.(β̂₄)                    (.107)     (.123)     (.148)

Observation #16:  Cook's D = 2.78;   studentized residual = 3.037;   h_16 = .55
Observation #5:   Cook's D = .917;   studentized residual = 2.016;   h_5  = .48
In Table 5.3.2 we summarize the results of certain ordinary least squares analyses of these data (obtained using SAS PROC REG).  An ordinary least squares fit of salinity against lagged salinity, discharge and a time trend is very sensitive to the inclusion of observation #16 - deletion of this point results in considerable change in the parameter estimates.  Attention is drawn to this point by the very large value of Cook's distance for the observation.  The diagnostic displays developed by Atkinson (1983) also indicate that observation #16 is highly influential.  Inspection of the data reveals that the value of 'DISCHARGE' for this point is clearly disparate (33.443, while almost all of the remaining discharge values lie between 23 and 27).  The quantity h_16, the diagonal element of the 'hat matrix' H (H = X(X'X)⁻¹X') corresponding to this observation, is approximately .55, so the point has very high leverage in the usual sense (Huber, 1981, page 160, cautions against h_i values larger than .2).
The standard regression diagnostics provided by SAS (described in, e.g., Belsley, Kuh and Welsch, 1980) do not draw attention to any other influential points - nor do the diagnostic techniques of Atkinson.  However, if one deletes observation #16 from the data and performs a least squares analysis on the reduced data (see Table 5.3.2), it becomes evident that observation #5 is an influential point in the reduced data set (h_5 = .48).  Inspection reveals a discharge value of 29.895 for this point, a value which is disparate in the reduced data set, but whose disparity is considerably less in the complete data set.  It is clear from the plot of salinity against discharge in Figure 5.3.1 what is going on - in the full data set the leverage of observation #5 is masked by the even more extreme #16, and measures of influence which are based on deletion of one point at a time (such as Cook's distance) will not draw attention to points such as #5.
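The leverage computation behind this masking effect is easy to reproduce; a sketch, assuming a design matrix X_full (intercept, lagged salinity, trend, discharge) built from Table 5.3.1, with 0-based row index 15 for observation #16:

```python
import numpy as np

def hat_diagonal(X):
    # h_ii via the QR decomposition, numerically safer than forming (X'X)^{-1}
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

# h = hat_diagonal(X_full)                          # h[15] ~ .55 flags observation #16
# h_red = hat_diagonal(np.delete(X_full, 15, axis=0))
#                                                   # observation #5 now shows h ~ .48
```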
Figure 5.3.1 also indicates some heterogeneity of variance - as the discharge value increases, there is an increase in variability.  One way of modelling this (analogous to the Box-Hill model) is σ_i = σ|x_{4i}|^λ, where x_{4i} is the discharge for the i-th observation; a value of λ = 0 corresponds to homoscedasticity.  We performed the score test proposed by Cook and Weisberg (1983) for H₀: λ = 0 - the test statistic, which has an asymptotic χ²(1) distribution under H₀, had a value of 28.52.  We proceeded to estimate λ by a variety of methods - the results are summarized in Table 5.3.3.
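A sketch of the score statistic in this one-covariate setting, assuming the standard form S = ½ (u'z_c)²/(z_c'z_c) with u_i = e_i²/(mean e_i²) - 1 and taking z_i = log|x_{4i}| as the variance covariate implied by the model σ_i = σ|x_{4i}|^λ:

```python
import numpy as np

def cook_weisberg_score(X, y, z):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                       # OLS residuals
    u = e**2 / np.mean(e**2) - 1.0         # standardized squared residuals
    zc = z - z.mean()                      # centered variance covariate
    return 0.5 * (u @ zc)**2 / (zc @ zc)   # compare with chi^2(1)
```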
[Figure 5.3.1: Salinity data - plot of salinity against discharge.]
TABLE 5.3.3
SALINITY DATA
Variance modelled as σ_i = σ|x_{4i}|^λ;  estimation of λ
(Standard errors of estimates in parentheses)

          Full       #16        #16 & #5
          Data       deleted    deleted

GLS       1.69       1.78       2.47
         (1.17)     (1.62)     (1.96)

HU        1.36       0.71       0.98
         (2.22)     (3.74)     (2.86)

HM3       0.85       0.85       0.86
         (3.21)     (5.57)     (2.88)

KW3       0.68       0.58       0.45
         (2.00)     (2.26)     (2.53)

HM2       0.96       0.94       0.68
         (4.58)     (5.83)     (2.88)

KW2       0.75       0.61       0.57
         (1.57)     (1.95)     (2.34)
119
Notation and abbreviations in Table 5.3.3 are similar to those used in the
previous section; GLS refers to estimation of A by maximum likelihood methods,
based on residuals from a preliminary least squares fit; HU refers to the
'robust' method of the previous section, again with the Huber constant at 1.5.
HM and KW refer to the Huber-Mallows and Krasker-Welsch methods described
previously, with presence of a '2' or '3' indicating whether the weight function was chosen with a view to bounding Y2 or Y3 . For the figures reported
in Table 5.3.3 truncation parameters were set at twice the lower bound for
HM3 and KW2 respectively, at 1.5
I:P
(resp. 1.5 Iq) for HM2.
The estimates
presented in the row for KW3 were obtained with truncation parameters set
at 1.5 times the lower bound; on setting them at twice the lower bound for
these data,
we encountered numerical problems; the routine ZXGSN did not
return a maximizing value of A.
It is clear from Table 5.3.3 that all of the bounded influence estimates exhibit far more stability than GLS or HU.
However, all of the esti-
mates have very low precision, particularly in the data set with observation #16 and #5 deleted.
This is presumably because most of the informa-
tion on A is contained in these two observations, and when these points
are deleted, or severely downweighted, there is a corresponding loss of
precision in the estimation of A.
Upon noting this, we performed the score test for H : A=O mentioned
O
earlier on the successively reduced data sets. For the data with #16 deleted, the test statistic had a value of 5.83 (i.e. there is still evidence
for heteroscedasticity); however, if #5 was deleted as well, the value of
the test statistic fell to 1.89. (Recall that the test statistic is com2
pared to a x (1) distribution.) The test is quite non-robust; it is the
two outliers - #16 and #5 - which are providing all the evidence for the
120
heteroscedasticity.
Atkinson (1983) discusses the need for a transformation in these data and reports the following conclusion: if observation #16 is 'corrected', by altering its discharge value from 33.443 to 23.443, then, in the 'corrected' data set, a score test yields little evidence that a transformation is needed; however, Atkinson finds that observation #3 is very influential in denying the need for a transformation, and if this point is deleted from the 'corrected' data set, there is strong evidence that a log transformation of the response is appropriate, the MLE of the Box-Cox transformation parameter being -0.15.  While we do not necessarily agree with Atkinson's conclusions (specifically, transforming salinity without transforming lagged salinity does not seem appropriate), it is of interest to mimic his investigation, since if observation #3 is influential in the estimation of the Box-Cox transformation parameter it may reasonably be expected to have a large influence in the estimation of the variance parameter.  We would hope that our bounded influence techniques for estimating λ are not as sensitive to the inclusion/exclusion of observation #3 as likelihood methods appear to be in Atkinson's investigation.  Table 5.3.4 summarizes the results of our analyses along these lines - the 'correction' applied to observation #16 is that suggested by Atkinson.  Abbreviations have the same interpretation as in Table 5.3.3.
TABLE 5.3.4
SALINITY DATA
Variance modelled as σ_i = σ|x_{4i}|^λ;  estimation of λ
(Standard errors of estimates in parentheses)

          Full data with      #3 deleted,
          #16 'corrected'     #16 'corrected'

GLS       1.71                1.72
         (1.65)              (1.73)

HU        1.07                0.77
         (3.41)              (4.04)

HM3       0.85                0.78
         (5.02)              (6.11)

KW3       0.50                0.44
         (3.14)              (3.08)

HM2       0.93                0.85
         (5.23)              (4.58)

KW2       0.78                0.60
         (1.85)              (1.94)
By and large the bounded influence methods, with the possible exception of KW3, gave more credible estimates than GLS or HU.  For the KW3 estimates reported in Table 5.3.4, truncation parameters were set at one-and-a-half times the lower bound, resulting in an average Krasker-Welsch weight of 0.84 in the estimation of λ - this may be a little too much downweighting, thus explaining the divergence of these estimators from the others.

In view of the fact that, in the uncorrected data set, it is observations #16 and #5 which provide all the evidence for the heteroscedasticity, and since the estimates of λ provided by the bounded influence methods typically lie in the range .5 to .8, implying very little heteroscedasticity for the bulk of the data, we feel that a 3-stage weighted regression is not really necessary for these data, and that an unweighted analysis is sufficient.
We performed the unweighted regression by a variety of methods on the full data, then on the successively reduced data sets obtained by deleting observation #16 and then observation #5.  Specifically, we employed the following techniques in estimating the regression parameters:

(i)  OLS - the ordinary least squares estimate, as already reported in Table 5.3.2.

(ii)  HU - a Huber estimate with the Huber constant set at 1.5.

(iii)  HAM - a one-step Hampel estimate, obtained by starting with the Huber estimate in (ii) and performing one further iteration with a Hampel-type redescending ψ-function (with bending points at 1.5, 3 and 7; a sketch of this ψ-function follows the list).

(iv)  HM3 - the Huber-Mallows estimate (bounding γ₃) described in the previous section.

(v)  KW3 - the Krasker-Welsch estimate with weight function chosen to bound γ₃.

(vi)  HM2 - the Huber-Mallows estimate, bounding γ₂.

(vii)  KW2 - the Krasker-Welsch estimate, bounding γ₂.

(viii)  HAMM3 - a Hampel-Mallows estimate, with Mallows weight function chosen to bound γ₃, and a Hampel-type redescending ψ-function as in (iii).

Results of the analyses are presented in Tables 5.3.5-5.3.8.
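A sketch of the Hampel-type redescending ψ-function used in (iii) and (viii), taking the standard three-part form with the stated bending points (the exact parametrization used in the original program is not recorded, so this is an assumption):

```python
import numpy as np

def hampel_psi(r, a=1.5, b=3.0, c=7.0):
    r = np.asarray(r, dtype=float)
    s = np.sign(r)
    x = np.abs(r)
    return np.where(x <= a, r,                          # linear near zero
           np.where(x <= b, a * s,                      # constant plateau
           np.where(x <= c, a * s * (c - x) / (c - b),  # redescending part
                    0.0)))                              # zero beyond c
```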
TABLE 5.3.5
SALINITY DATA - ESTIMATION BY NON-BOUNDED INFLUENCE METHODS
(Standard errors of estimates in parentheses)

                      OLS                            HU                             HAMP
               Full     #16      #16&#5      Full     #16      #16&#5       Full     #16      #16&#5
               Data     deleted  deleted     Data     deleted  deleted      Data     deleted  deleted

β₁             9.526    18.099   22.760      12.509   18.062   23.025       13.100   18.062   23.025
(Intercept)   (2.92)   (3.26)   (3.71)      (2.72)   (3.81)   (3.73)       (2.37)   (3.81)   (3.73)

β₂             0.777    0.697    0.700       .760     .710     .710         .755     .710     .710
(Lagged sal.) (.086)   (.072)   (.067)      (.075)   (.080)   (.064)       (.064)   (.080)   (.064)

β₃            -.025    -.157    -.250       -.081    -.167    -.271        -.089    -.167    -.271
(Trend)       (.161)   (.133)   (.131)      (.140)   (.148)   (.125)       (.120)   (.148)   (.125)

β₄            -.295    -.631    -.838       -.406    -.615    -.820        -.429    -.615    -.820
(Discharge)   (.107)   (.123)   (.148)      (.093)   (.137)   (.142)       (.080)   (.137)   (.142)

Downweighting:  OLS - N/A.  HU (full data) - Huber weights #16 = .41, #5 = .95; average Huber weight = .97; (#16 deleted) - Huber weight #5 = 1.0.  HAMP - results identical to HU on the reduced data sets.
TABLE 5.3.6
SALINITY DATA - ESTIMATION BY BOUNDED INFLUENCE METHODS (γ₃ BOUNDED)
(Standard errors in parentheses - truncation parameter = 2 x lower bound)

                      HM3                            KW3                            Hampel-Mallows 3
               Full     #16      #16&#5      Full     #16      #16&#5       Full     #16      #16&#5
               Data     deleted  deleted     Data     deleted  deleted      Data     deleted  deleted

β₁             17.559   20.859   23.025      20.090   22.357   24.787       18.361   20.859   23.025
(Intercept)   (2.767)  (3.70)   (3.73)      (2.83)   (3.32)   (3.61)       (2.547)  (3.70)   (3.73)

β₂             0.723    0.710    0.710       0.714    0.696    0.674        0.715    0.710    0.710
(Lagged sal.) (.070)   (.075)   (.064)      (.071)   (.067)   (.063)       (.065)   (.075)   (.064)

β₃            -.163    -.228    -.271       -.198    -.223    -.243        -.178    -.228    -.271
(Trend)       (.130)   (.139)   (.125)      (.131)   (.126)   (.126)       (.119)   (.139)   (.125)

β₄            -.591    -.730    -.820       -.704    -.793    -.887        -.629    -.730    -.820
(Discharge)   (.096)   (.134)   (.142)      (.098)   (.121)   (.137)       (.089)   (.134)   (.142)

Downweighting:  HM3 (full data) - Huber weights #5 = 1.0, #16 = 0.21; Mallows weights #16 = .29, #5 = .67; (#16 deleted) - Huber weight #5 = 1.0, Mallows weight #5 = .47.  KW3 (full data) - K-W weights #16 = .03, #5 = .24; (#16 deleted) - K-W weight #5 = .13; average K-W weight = .94.  Hampel-Mallows 3 - Mallows weights as in HM3; Huber weight #5 = .62 (full data); results identical to HM3 on the reduced data sets.
TABLE 5.3.7
SALINITY DATA - ESTIMATION BY BOUNDED INFLUENCE METHODS (γ₂ BOUNDED)
(Standard errors in parentheses)

                HM2 (truncation = 1.5 p^{1/2})     KW2 (truncation = 2 p^{1/2})
               Full     #16      #16&#5      Full     #16      #16&#5
               Data     deleted  deleted     Data     deleted  deleted

β₁             14.868   19.407   23.025      16.731   20.834   23.338
(Intercept)   (3.11)   (4.11)   (3.73)      (2.47)   (3.02)   (3.66)

β₂             0.740    0.710    0.710       .727     .710     .703
(Lagged sal.) (.074)   (.079)   (.064)      (.070)   (.067)   (.063)

β₃            -.120    -.196    -.271       -.140    -.217    -.258
(Trend)       (.136)   (.148)   (.125)      (.130)   (.123)   (.124)

β₄            -.495    -.670    -.819       -.570    -.732    -.838
(Discharge)   (.110)   (.147)   (.142)      (.083)   (.107)   (.139)

Downweighting:  HM2 (full data) - Huber weights #5 = 1.0, #16 = .32; Mallows weights #5 = .93, #16 = .61; (#16 deleted) - Huber weight #5 = .74, Mallows weight #5 = .68; (#16 & #5 deleted) - no downweighting (results identical to HU in Table 5.3.5).  KW2 (full data) - K-W weights #16 = .11, #5 = .74; (#16 deleted) - K-W weight #5 = .29; average K-W weight = .96.
TABLE 5.3.8
SALINITY DATA - ESTIMATION BY BOUNDED INFLUENCE METHODS (γ₃ BOUNDED)
(Standard errors in parentheses - truncation parameter = 1.5 x lower bound)

                      HM3                            KW3                            Hampel-Mallows 3
               Full     #16      #16&#5      Full     #16      #16&#5       Full     #16      #16&#5
               Data     deleted  deleted     Data     deleted  deleted      Data     deleted  deleted

β₁             20.257   22.742   24.578      22.635   25.084   26.086       21.309   22.742   24.578
(Intercept)   (3.18)   (3.52)   (4.00)      (2.57)   (2.84)   (3.21)       (2.58)   (3.52)   (4.00)

β₂             0.707    0.690    0.682       .687     .654     .645         0.700    0.690    0.682
(Lagged sal.) (.075)   (.070)   (.069)      (.063)   (.060)   (.056)       (.061)   (.070)   (.069)

β₃            -.215    -.242    -.264       -.202    -.212    -.219        -.230    -.242    -.264
(Trend)       (.139)   (.129)   (.132)      (.118)   (.110)   (.104)       (.118)   (.129)   (.132)

β₄            -.705    -.803    -.877       -.803    -.895    -.934        -.745    -.803    -.877
(Discharge)   (.113)   (.129)   (.152)      (.083)   (.105)   (.123)       (.092)   (.129)   (.152)

Downweighting:  HM3 (full data) - Huber weights #5 = .72, #16 = .22; Mallows weights #5 = .37, #16 = .16; (#16 deleted) - Huber weight #5 = .53, Mallows weight #5 = .28; average Mallows weight = .94.  KW3 (full data) - K-W weights #16 = .01, #5 = .09; (#16 deleted) - K-W weight #5 = .05; average K-W weight = .86.  Hampel-Mallows 3 - Mallows weights as in HM3; results identical to HM3 on the reduced data sets.
Inspection of the tables reveals a few points of interest:

(i)  There is a real need for bounded influence (as opposed to merely 'robust') methods in analyzing these data.  We have already remarked on the instability of the OLS estimate of β - Table 5.3.5 shows that use of the traditional robust estimates goes some way to alleviating this, but they are still susceptible to fairly large changes on deletion of #16 and #5 (particularly the estimates of β₃ and β₄).  This is due to the fact that it is the high leverage of these points, combined with moderately high residuals, which causes the problem, so some effort must be made to combat this.

(ii)  By and large, the bounded influence estimates do give reasonable stability.  If we were to single out any particular method, both KW3 and HAMM3 seem to provide most stability.  In general, bounding γ₃ results in considerably more stable estimates than when γ₂ is bounded.

(iii)  No clear picture emerges from these data as regards efficiency - the robust/bounded influence methods tend to be more efficient than generalized least squares, which is to be expected in a data set of this nature, with severe outliers.  We are reluctant to draw firm conclusions about efficiency from these data, however, other than remarking that use of bounded influence methods is obviously worthwhile.

(iv)  Comparison of Tables 5.3.6 and 5.3.8 indicates that setting the truncation value at 1.5 times the lower bound in the Krasker-Welsch (3) case may result in too much downweighting.  Setting it at twice the lower bound appears to give reasonable stability, without overdoing the amount of downweighting.  This bears out our experience with the Leurgans data set.  For the Hampel-Mallows estimate (bounding γ₃), setting the Mallows truncation parameter at 2p gave reasonable stability, though setting it at 1.5p also gave quite acceptable results - an average Mallows weight of .94, which does not seem excessively low.  In practice, one might effect a compromise by starting with an initial truncation value of 1.75p.  Clearly, it would be foolish to make a universal recommendation based on our limited information, and the suggestions which we make here should be interpreted only as guidelines.  In practice, such factors as the number of parameters being estimated and the degree of 'badness' of the data might reasonably be expected to play a role in the choice of truncation parameter, so a certain amount of trial-and-error seems unavoidable.
5.4  Example 3 - Gas vapor data

The final example which we consider is taken from Weisberg (1980, page 146).  The data are concerned with the amount of gas vapor emitted into the atmosphere when gasoline is pumped into a tank.  In a laboratory experiment to investigate this, four variables were thought to be relevant for predicting Y, the quantity of emitted hydrocarbons - namely

X₁ - the initial tank temperature, in °F
X₂ - the temperature of the dispensed gasoline, in °F
X₃ - the initial vapor pressure in the tank, in psi
X₄ - the vapor pressure of the dispensed gasoline, in psi.

A sequence of 32 experimental fills was carried out - the resulting measurements are given in Table 5.4.1.
TABLE 5.4.1
GAS VAPOR DATA

Observation   Initial   Gas     Initial       Gas vapor   Emitted
              temp.     temp.   vapor pres.   pressure    gas vapor
  1           33        53      3.32          3.42        29
  2           31        36      3.10          3.26        24
  3           33        51      3.18          3.18        26
  4           37        51      3.39          3.08        22
  5           36        54      3.20          3.41        27
  6           35        35      3.03          3.03        21
  7           59        56      4.78          4.57        33
  8           60        60      4.72          4.72        34
  9           59        60      4.60          4.41        32
 10           60        60      4.53          4.53        34
 11           34        35      2.90          2.95        20
 12           60        59      4.40          4.36        36
 13           60        62      4.31          4.42        34
 14           60        36      4.27          3.94        23
 15           62        38      4.41          3.49        24
 16           62        61      4.39          4.39        32
 17           90        64      7.32          6.70        40
 18           90        60      7.32          7.20        46
 19           92        92      7.45          7.45        55
 20           91        92      7.27          7.26        52
 21           61        62      3.91          4.08        29
 22           59        42      3.75          3.45        22
 23           88        65      6.48          5.80        31
 24           91        89      6.70          6.60        45
 25           63        62      4.30          4.30        37
 26           60        61      4.02          4.10        37
 27           60        62      4.02          3.89        33
 28           59        62      3.98          4.02        27
 29           59        62      4.39          4.53        34
 30           37        35      2.75          2.64        19
 31           35        35      2.59          2.59        16
 32           37        37      2.73          2.59        22
130
This data set has been used by Cook and Weisberg (1983) to illustrate
their proposed score test for heteroscedasticity.
They found definite
evidence of heteroscedasticity. with the variance being a.function of Xl
and X4 • and obtained this empirical estimate of the direction in which the
variance is increasing:
Xs = 0.778 + .110 Xl - 1.432 X4
A plot of the residuals from fitting the ordinary least squares model
s
against X is shown in Figure 5.4.1 - the megaphone-shaped plot is clear
evidence of heteroscedasticity.
.e
We therefore chose to analyze these data using the 3-stage estimation
procedure for heteroscedastic linear models.
used a Box-Hill type model for the variance:
a.1
where X is as before.
s
= alxs1'
+
o.slA
In view of Figure 5.4.1
we
[Figure 5.4.1: Gas vapor data - plot of least squares residuals against X₅ (X₅ ranging from about -0.5 to 2.0); the megaphone shape indicates increasing variance.]
Initially we viewed these data as a 'good' data set, in the sense that there are no apparent outliers - our aim in the analyses was to use these data to assess the performance of the various estimation methods under consideration when the underlying data are good.  To this end we analyzed the data using the following five estimation procedures:

(i)  generalized least squares (GLS);

(ii)  Huber estimation, with the Huber truncation value set at 1.5 (HU);

(iii)  Huber-Mallows estimation, bounding γ₃ (HM3) - here the Huber constant was 1.5 and the Mallows truncation values were 2p, 2q, 2p respectively;

(iv)  Huber-Mallows estimation, bounding γ₂ (HM2) - Mallows truncation values were set at 1.5p^{1/2}, 1.5q^{1/2}, 1.5p^{1/2};

(v)  Krasker-Welsch estimation, bounding γ₂ (KW2) - with truncation parameters at 2p^{1/2}, 2q^{1/2}, 2p^{1/2}.

Table 5.4.2 summarizes the parameter estimates and associated standard errors for the various estimation methods.  Probably the most striking feature of the table is the discrepancy in the estimate of λ among the various estimation methods.  Closer scrutiny revealed that all three bounded influence analyses downweighted observations #1 and #2 severely in the estimation of λ, while giving full weight to all the other data points (for instance, the Krasker-Welsch weights for observations #1 and #2 were .002 and .20 respectively).  This suggests that these two points, observation #1 in particular, exert a lot of influence in determining the estimate of λ.  This is borne out by Table 5.4.3; the MLE for λ jumps from .28 to .50 on deletion of observation #1 from the data set.  (Subsequent deletion of observation #2 as well raised the MLE for λ to .80; we forgo presentation of the full table of estimates for this case, since our interest in these data is largely for illustrative purposes.)
TABLE 5.4.2
GAS VAPOR DATA - FULL DATA SET - PARAMETER ESTIMATES
(Standard errors in parentheses)

                    GLS        HU         HM3        KW2        HM2

λ̂                   .28        .28        .78        .63        .52
                   (.087)     (.139)     (.210)     (.129)     (.222)

β̂₀                  .407       .340      -.422      -.110      -.029
(Intercept)        (1.33)     (1.46)     (1.39)     (1.20)     (1.59)

β̂₁                 -.081      -.091      -.003      -.072      -.086
(Init. temp.)      (.058)     (.064)     (.067)     (.053)     (.075)

β̂₂                  .234       .233       .221       .212       .225
(Gas temp.)        (.040)     (.044)     (.043)     (.049)     (.058)

β̂₃                 -3.75      -3.21      -4.23      -4.25      -3.70
(Init. pres.)      (2.31)     (2.54)     (2.57)     (2.37)     (2.74)

β̂₄                  9.01       8.62       8.77       9.77       9.21
(Gas pres.)        (2.12)     (2.33)     (2.38)     (2.05)     (2.73)
TABLE 5.4.3
GAS VAPOR DATA - OBSERVATION #1 DELETED - PARAMETER ESTIMATES
(Standard errors in parentheses)

                    GLS        HU         HM3        KW2        HM2

λ̂                   .50        .40        .83        .85        .74
                   (.106)     (.180)     (.222)     (.181)     (.250)

β̂₀                  .427       .473      -.474      -.057      -.088
(Intercept)        (1.130)    (1.394)    (1.297)    (1.514)    (1.286)

β̂₁                 -.025      -.031      -.006      -.049      -.036
(Init. temp.)      (.066)     (.080)     (.075)     (.077)     (.076)

β̂₂                  .184       .187       .225       .191       .183
(Gas temp.)        (.038)     (.049)     (.044)     (.056)     (.051)

β̂₃                 -5.19      -4.73      -4.30      -5.02      -4.89
(Init. pres.)      (2.29)     (2.75)     (2.61)     (2.55)     (2.54)

β̂₄                 10.36       9.92       8.69      10.50      10.28
(Gas pres.)        (1.95)     (2.38)     (2.37)     (2.85)     (2.42)
It is clear from this example that bounded influence methods are useful not just in providing stable estimates, but also in diagnosing which points are influential.  We stress that, just because a point is influential does not mean it is a 'bad' point - it may, in fact, be highly informative; we do feel, however, that it is of value to be able to identify highly influential points.

We remark that, on the basis of Tables 5.4.2 and 5.4.3, HM3 certainly appears preferable to HM2 - the latter method seems to involve a considerable loss of efficiency (over GLS) without any spectacular gains in stability; while HM3 also entails a loss of efficiency, it does, at least, give quite stable estimates.

Comparison of Tables 5.4.2 and 5.4.3 shows that, for these data, standard errors of the Mallows (HM3) estimates are more stable than those of the Krasker-Welsch (KW2) estimates, in accordance with the theory of Section 4.3.  It is possible, however, that reducing the truncation parameters in the Krasker-Welsch case may stabilize the variances also.

Finally, we note that computation of the KW3 estimates was prohibitively expensive for these data - five minutes of CPU time produced an estimate of λ for the full data set (λ̂_{KW3} = 1.01), but no final estimates of β.  It is not clear why this method was so much more expensive than the others; however, since cost was a factor, we omitted this particular method from our analyses.
5.5  Conclusions
Consideration of the above examples indicates two main reasons to use bounded influence estimation methods:

(1)  By and large, the bounded influence techniques which we have discussed do give stable, reasonable estimates in situations where classical methods fail to do so.

(2)  By their very nature the methods proposed give diagnostic information on influential points.  Observations which are severely downweighted by the bounded influence methods are tagged as being influential.  Since the weights are computed automatically as part of the estimation procedure, no extra effort is required to obtain this diagnostic information.  Furthermore, the problem of one influential point masking another, which arises with diagnostic measures based on deletion of one point at a time, is dealt with to some extent.  In theory, the Mallows estimates provide diagnostic information that Krasker-Welsch estimates do not - since they bound the absolute value of the CVF as well as the IC, points with a large influence on either the parameter estimates or their standard errors will be downweighted in the Mallows estimation scheme, whereas the KW weights only flag those points which have a large effect on the parameter estimates.  This did not seem to be an important issue in the data sets which we considered, however.

Our experience indicates that both Mallows and Krasker-Welsch estimates give reasonable protection in practice, though we would recommend bounding γ₃ rather than γ₂ if a Mallows estimate is used.  Other things being equal, Mallows estimates are, of course, cheaper and easier to compute.  We are reluctant to give strict recommendations on the choice of truncation parameters based on such limited experience; while some trial and error seems inevitable, the lower bounds of Section 5.1 at least provide a point of departure.

The price for the gain in stability obtained by using bounded influence methods is a loss of efficiency if the data are 'good'.  Whether or not one is willing to pay this price is a matter of taste - we feel that, in any reasonably complicated data set, the risks involved in not using some type of bounded influence procedure are enormous.  Mallows (1975) has used the analogy of an insurance premium - one is willing to make certain sacrifices in order to avert complete disaster.  While we cannot claim that bounded influence methods are a panacea against all possible ills, they do go a long way towards avoiding catastrophe, and we recommend the use of bounded influence techniques as a routine precaution.
BIBLIOGRAPHY

Atkinson, A. C. (1983) "Diagnostic regression analysis and shifted power transformations", Technometrics, 25, 23-33.

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, New York: John Wiley and Sons.

Bickel, P. J. (1978) "Using residuals robustly I: Tests for heteroscedasticity, nonlinearity", Annals of Statistics, 6, 266-291.

Box, G. E. P. and Hill, W. J. (1974) "Correcting inhomogeneity of variance with power transformation weighting", Technometrics, 16, 385-389.

Carroll, R. J. (1982) "Power transformations when the choice of power is restricted to a finite set", Journal of the American Statistical Association, 77, 908-915.

Carroll, R. J. and Ruppert, D. (1981) "Prediction and the power transformation family", Biometrika, 68, 609-615.

Carroll, R. J. and Ruppert, D. (1982) "A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model", Journal of the American Statistical Association, 77, 878-882.

Carroll, R. J. and Ruppert, D. (1982b) "Robust estimation in heteroscedastic linear models", Annals of Statistics, 10, 429-441.

Cook, R. D. (1979) "Influential observations in linear regression", Journal of the American Statistical Association, 74, 169-174.

Cook, R. D. and Weisberg, S. (1983) "Diagnostics for heteroscedasticity in regression", Biometrika, 70, 1-10.

Dunford, N. and Schwartz, J. T. (1958) Linear Operators, Volume I, New York: Interscience Publishers.

Hampel, F. R. (1968) "Contributions to the theory of robust estimation", unpublished Ph.D. thesis, University of California, Berkeley.

Hampel, F. R. (1974) "The influence curve and its role in robust estimation", Journal of the American Statistical Association, 69, 383-394.

Hampel, F. R. (1978) "Optimally bounding the gross-error-sensitivity and the influence of position in factor space", 1978 Proceedings of the ASA Statistical Computing Section, American Statistical Association, Washington, DC, 59-64.

Hampel, F. R., Rousseeuw, P. J. and Ronchetti, E. (1981) "The change-of-variance curve and optimal redescending M-estimators", Journal of the American Statistical Association, 76, 643-648.

Huber, P. J. (1973) "Robust regression: Asymptotics, conjectures and Monte Carlo", Annals of Statistics, 1, 799-821.

Huber, P. J. (1981) Robust Statistics, New York: John Wiley and Sons.

Huber, P. J. (1983) "Minimax aspects of bounded-influence regression", Journal of the American Statistical Association, 78, 66-71.

Jobson, J. D. and Fuller, W. A. (1980) "Least squares estimation when the covariance matrix and parameter vector are functionally related", Journal of the American Statistical Association, 75, 176-181.

Krasker, W. S. (1980) "Estimation in linear regression models with disparate data points", Econometrica, 48, 1333-1346.

Krasker, W. S. and Welsch, R. E. (1982) "Efficient bounded-influence regression estimation using alternative definitions of sensitivity", Journal of the American Statistical Association, 77, 595-604.

Leurgans, S. (1980) "Evaluating laboratory measurement techniques", in Biostatistics Casebook (Ed. R. G. Miller, B. Efron, B. W. Brown, L. E. Moses), New York: John Wiley and Sons.

Luenberger, D. G. (1968) Optimization by Vector Space Methods, New York: John Wiley and Sons.

Mallows, C. L. (1975) "On some topics in robustness", unpublished memorandum, Bell Telephone Laboratories, Murray Hill, NJ.

Maronna, R. A., Bustos, O. H. and Yohai, V. J. (1979) "Bias- and efficiency-robustness of general M-estimators for regression with random carriers", in Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, No. 757, Berlin-Heidelberg-New York: Springer-Verlag, 91-116.

Ronchetti, E. and Rousseeuw, P. J. (1983) "Change-of-variance sensitivities in regression analysis", to appear.

Ruppert, D. (1982) "Some results on bounded influence estimation", unpublished manuscript.

Ruppert, D. (1983) "On the bounded-influence regression estimator of Krasker and Welsch", Institute of Statistics Mimeo Series #1531, University of North Carolina.

Ruppert, D. and Carroll, R. J. (1980) "Trimmed least squares estimation in the linear model", Journal of the American Statistical Association, 75, 828-838.

Stahel, W. A. (1981) "Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen", Ph.D. thesis, ETH, Zürich.

Weisberg, S. (1980) Applied Linear Regression, New York: John Wiley and Sons.