Carroll, R.J. and Ruppert, David. "Diagnostics and Robust Estimation When Transforming the Regression Model and the Response."

Diagnostics and Robust Estimation When Transforming
The Regression Model and The Response
R. J. Carroll
and
David Ruppert
Department of Statistics
University of North Carolina
Chapel Hill, N.C. 27514
Abstract:
In regression analysis, the response is often transformed to remove heteroscedasticity and/or skewness. When a model already exists for the untransformed response, then it can be preserved by transforming both the model and the response with the same transformation. This methodology, which we call "transform both sides", has been applied in several recent papers, and appears highly useful in practice. When a parametric transformation family such as power transformations is used, then the transformation can be estimated by maximum likelihood. The MLE, however, is very sensitive to outliers. In this article, we propose diagnostics which indicate cases influential for the transformation or regression parameters. We also propose a robust bounded-influence estimator similar to the Krasker-Welsch regression estimator. Both the diagnostics and the robust estimate can be implemented on standard software.
ACKNOWLEDGEMENT
The research of Professor Carroll was supported by Air Force Office of Scientific Research Contract AFOSR-F-49620-85-C-0144. Professor Ruppert's research was supported by the National Science Foundation Grant DMS-8400602.
1. INTRODUCTION
In regression analysis, the response y is often transformed for two distinct purposes, to induce normally distributed, homoscedastic errors and to improve the fit to some simple model involving known explanatory variables x. In many situations, however, y is already believed to fit a model $f(x;\beta)$, $\beta$ being a p-dimensional parameter vector. If a transformation of y is still needed to remove skewness and/or heteroscedasticity, then one can transform both y and $f(x;\beta)$ in the same manner. Specifically, let $y^{(\lambda)}$ be a transformation indexed by the parameter $\lambda$ and assume that for some value of $\lambda$
$$ y_i^{(\lambda)} = f^{(\lambda)}(x_i;\beta) + \sigma\epsilon_i \tag{1} $$
where $\epsilon_1,\ldots,\epsilon_N$ are independent and at least approximately normally distributed. Notice the difference between (1) and the usual approach of transforming only the response, not $f(x;\beta)$, i.e.,
$$ y_i^{(\lambda)} = f(x_i;\beta) + \sigma\epsilon_i. \tag{2} $$
It should be emphasized that model (1) is not a substitute for (2). Both models are appropriate, but under different circumstances. Model (2) has been amply discussed, e.g., by Box and Cox (1964), Draper and Smith (1980), Cook and Weisberg (1982), and others. Typically in model (2), $f(x;\beta)$ is linear but in principle nonlinear models can be used.
Model (1), which we call "transform both sides", has been discussed extensively in Carroll and Ruppert (1984), Snee (1985), and Ruppert and Carroll (1986), and we will only summarize those discussions.
In model (1), $f(x;\beta)$ has two closely related interpretations; it is the value of y when the error is zero and it is the median of the conditional distribution of y given x. In Carroll and Ruppert (1984), we were concerned with situations where a physical or biological model provides $f(x;\beta)$ but where the error structure is a priori unknown. Examples by Snee (1985), Carroll and Ruppert (1984), Ruppert and Carroll (1986), and Bates, Wolf, and Watts (1985) show that transforming both sides can be highly effective with real data, both when a theoretical model is available and, as Snee shows, when $f(x;\beta)$ is obtained empirically. By estimating $\lambda$ and $\beta$ simultaneously, rather than simply fitting the original response y to $f(x;\beta)$, we achieve two purposes. Firstly, $\beta$ is estimated efficiently and therefore we obtain an efficient estimate of the conditional median of y. Secondly, we model the entire conditional distribution of y given x, and, in particular, we have a model which can account for the skewness and heteroscedasticity in the data.
Carroll and Ruppert (1984) discuss the importance of modeling the conditional distribution of y in a special case, the spawner-recruit analysis of the Atlantic menhaden population.
To specify the conditional distribution of y, for fixed $\lambda$ let $h(y,\lambda)$ be the inverse of $y^{(\lambda)}$, i.e. $h(y^{(\lambda)},\lambda) = y$, and let F be the distribution function of $\epsilon_i$. We assume that F is approximately normal, but not necessarily exactly normal, since if $y^{(\lambda)}$ is the Box-Cox (1964) modified power family,
$$ y^{(\lambda)} = \begin{cases} (y^{\lambda}-1)/\lambda, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \tag{3} $$
then F must have finite support whenever $\lambda \neq 0$. The p-th quantile of y given x is
$$ h\{[f^{(\lambda)}(x;\beta) + \sigma F^{-1}(p)],\lambda\} \tag{4} $$
and the conditional mean of y is
$$ E(y|x) = \int_{-a}^{a} h\{[f^{(\lambda)}(x;\beta) + \sigma\epsilon],\lambda\}\, dF(\epsilon) \tag{5} $$
where $[-a, a]$ is the support of F. Ruppert and Carroll (1986) discuss estimation of (4) and (5). $E(y|x)$ is easily estimated by Duan's (1983) "smearing" estimate, which estimates F by the empirical distribution of the residuals; see section 5.
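As an illustration of the modified power transformation, its inverse h, and the conditional quantile built from them, the following is a minimal sketch using only the Python standard library. The function names (`boxcox`, `inv_boxcox`, `conditional_quantile`) are hypothetical and not taken from the paper, and the error quantile $F^{-1}(p)$ is assumed to be supplied by the caller.

```python
import math

def boxcox(y, lam):
    """Modified power transformation: (y**lam - 1)/lam, or log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("y must be positive")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_boxcox(v, lam):
    """Inverse h(v, lam) of the transformation, so h(boxcox(y, lam), lam) == y."""
    if lam == 0:
        return math.exp(v)
    base = lam * v + 1.0
    if base <= 0:
        raise ValueError("value outside the range of the transformation")
    return base ** (1.0 / lam)

def conditional_quantile(f_value, lam, sigma, eps_quantile):
    """p-th quantile of y given x: h(f^(lam)(x; beta) + sigma * F^{-1}(p), lam).

    f_value is f(x; beta) on the original scale; eps_quantile is F^{-1}(p),
    e.g. 0.0 for the median when F is symmetric about zero.
    """
    return inv_boxcox(boxcox(f_value, lam) + sigma * eps_quantile, lam)
```

With `eps_quantile = 0.0` the function returns `f_value` itself, which matches the interpretation of $f(x;\beta)$ as the conditional median.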
Many data sets we examined have substantial outliers in the untransformed response y, but not in the residuals $y^{(\lambda)} - f^{(\lambda)}(x;\beta)$; the transformation has accommodated, or explained, the outlying y's. There is still the danger, however, that a few outliers in y can greatly affect $\hat\lambda$ and $\hat\beta$. Outliers should not be automatically deleted or downweighted, especially when they appear to be part of the normal variation in the response, but it should be standard practice to detect and scrutinize influential cases and, when outliers are present, to compare the MLE with a robust estimator. In this paper we propose a diagnostic and a "bounded-influence" estimator which can be used together for detecting influential cases and for robustly estimating $\lambda$ and $\beta$.
Case deletion diagnostics for linear regression are discussed in Belsley, Kuh, and Welsch (1980) and Cook and Weisberg (1982), and have been extended to the response transformation model (2) by Cook and Wang (1983) and Atkinson (1986). The last two papers approximate the change in $\hat\lambda$ as single cases or subsets of cases are deleted. Subset deletion can be unwieldy because of the large number of possible subsets. If influential subsets are to be detected, one needs some strategy for searching for them. Alternatively, one can examine weights from a robust estimator with good breakdown properties.
Bounded-influence regression estimators, so-called because they place a bound on the influence of each observation, have been proposed by Hampel (1978), Krasker (1980), and Krasker and Welsch (1982), and this last paper provides a good overview. Carroll and Ruppert (1985) proposed a bounded influence transformation (BIT) estimator extending the Krasker-Welsch estimator to the response transformation model (2).
In this paper we adapt Atkinson's (1986) diagnostics and the BIT estimator to the "transform both sides" model. The basic technique is to linearize the model (1) by a Taylor approximation at the MLE, and then to apply ordinary regression diagnostics and bounded-influence estimates.
Our methods are designed to be easily implemented on standard software. All our computations were performed on the SAS package using PROC NLIN and rather simple data manipulations in PROC MATRIX and DATA steps. The computations would also be straightforward on other software packages. Our computational techniques can be applied to a bounded-influence estimate for the response-transformation model (2), thus eliminating the need for a lengthy FORTRAN program used in Carroll and Ruppert (1985).
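2. MAXIMUM LIKELIHOOD ESTIMATION
Throughout this paper $y^{(\lambda)}$ is the modified power transformation (3). Under model (1) the log-likelihood is
$$ L(\beta,\lambda,\sigma) = \sum_{i=1}^{N} \log f_i(\beta,\lambda,\sigma) $$
where
$$ \log f_i(\beta,\lambda,\sigma) = -\tfrac{1}{2}\log(2\pi\sigma^2) + (\lambda-1)\log(y_i) - \frac{1}{2\sigma^2}\,[y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]^2. \tag{6} $$
For fixed $\beta$ and $\lambda$,
$$ \hat\sigma^2(\beta,\lambda) = N^{-1}\sum_{i=1}^{N} [y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]^2 $$
maximizes $L(\beta,\lambda,\sigma)$ in $\sigma$. Thus, the MLE of $\beta$ and $\lambda$ maximizes
$$ L_{\max}(\beta,\lambda) = -(N/2)\log\Big\{N^{-1}\sum_{i=1}^{N}\big([y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]/\dot y^{\lambda-1}\big)^2\Big\} \tag{7} $$
where $\dot y$ is the geometric mean of $y_1,\ldots,y_N$. Therefore, $\hat\beta$ and $\hat\lambda$ minimize
$$ \sum_{i=1}^{N}\big\{[y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]/\dot y^{\lambda-1}\big\}^2. \tag{8} $$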
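The scaled residuals and the sum of squares that the MLE minimizes can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names (`tbs_residuals`, `objective`, `profile_loglik`) are hypothetical, `f` is any user-supplied positive model function `f(x, beta)`, and the additive constant of the profile log-likelihood is dropped.

```python
import math

def geometric_mean(y):
    return math.exp(sum(math.log(v) for v in y) / len(y))

def bc(v, lam):
    """Modified power transformation of a positive value v."""
    return math.log(v) if lam == 0 else (v ** lam - 1.0) / lam

def tbs_residuals(y, x, f, beta, lam):
    """z_i = [y_i^(lam) - f^(lam)(x_i; beta)] / ydot**(lam - 1)."""
    scale = geometric_mean(y) ** (lam - 1.0)
    return [(bc(yi, lam) - bc(f(xi, beta), lam)) / scale for yi, xi in zip(y, x)]

def objective(y, x, f, beta, lam):
    """Sum of squared scaled residuals; its minimizer over (beta, lam) is the MLE."""
    return sum(z * z for z in tbs_residuals(y, x, f, beta, lam))

def profile_loglik(y, x, f, beta, lam):
    """-(N/2) log(N^{-1} * sum z_i^2), the profile log-likelihood up to a constant."""
    n = len(y)
    return -(n / 2.0) * math.log(objective(y, x, f, beta, lam) / n)
```

Any general-purpose nonlinear least-squares routine could then be pointed at `tbs_residuals`; which routine to use is left open here.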
$\hat\lambda$ and $\hat\beta$ can be computed as follows. Following Box and Cox (1964), for fixed $\lambda$ minimize (8) in $\beta$ by ordinary (typically nonlinear) least squares and call the minimizer $\hat\beta(\lambda)$. Plot $L_{\max}(\hat\beta(\lambda),\lambda)$ on a grid and maximize graphically or numerically. This technique is particularly attractive when f is not transformed and $f(x;\beta) = x^T\beta$ is linear, for then (8) can be minimized in $\beta$ by linear least-squares. When transforming both sides, the technique is computationally less attractive but it does give the confidence interval
$$ \{\lambda : L_{\max}(\hat\beta(\lambda),\lambda) \geq L_{\max}(\hat\beta,\hat\lambda) - \tfrac{1}{2}\chi^2_1(1-\alpha)\} \tag{9} $$
where $\chi^2_1(1-\alpha)$ is the $(1-\alpha)$ quantile of the chi-square distribution with one degree of freedom.
Minimizing (8) simultaneously in $\lambda$ and $\beta$ is straightforward with standard nonlinear regression software. One simply fits the dummy variable $D_i = 0$ to the pseudo-model
$$ D_i = [y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]/\dot y^{\lambda-1} \tag{10} $$
with regression parameter $(\beta,\lambda)$. Not only is the least-squares estimate $(\hat\beta,\hat\lambda)$ the MLE, but for small values of $\sigma$ and large N, estimating the covariance matrix of $\hat\beta$ using the pseudo regression model (10) is essentially equivalent to inverting the Fisher information for $(\lambda,\beta,\sigma)$; see the appendix.
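Reading a likelihood-based confidence interval for $\lambda$ off a grid of profile log-likelihood values can be sketched as follows. The sketch assumes the usual likelihood-ratio form of the interval (all $\lambda$ whose profile log-likelihood lies within half a chi-square quantile of the maximum); the function name `profile_interval` and the default 95% level are illustrative choices, not the paper's.

```python
CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square distribution, 1 df

def profile_interval(lams, lmax_values, chi2_quantile=CHI2_1_95):
    """Grid-based likelihood-ratio interval for lambda.

    lams:        increasing grid of lambda values
    lmax_values: profile log-likelihood at each grid point
    Returns (lam_hat, lower, upper), all taken from the grid.
    """
    best = max(range(len(lams)), key=lambda i: lmax_values[i])
    cutoff = lmax_values[best] - 0.5 * chi2_quantile
    inside = [lam for lam, v in zip(lams, lmax_values) if v >= cutoff]
    return lams[best], min(inside), max(inside)
```

The endpoints are only as fine as the grid; interpolating between grid points would sharpen them.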
3. DIAGNOSTICS
Let $(\hat\lambda,\hat\beta)$ and $(\hat\lambda_{(i)},\hat\beta_{(i)})$ be the MLE's with and without case i, respectively. The changes $\Delta\hat\lambda_i = (\hat\lambda - \hat\lambda_{(i)})$ and $\Delta\hat\beta_i = (\hat\beta - \hat\beta_{(i)})$ are easily interpreted measures of influence, called the sample influence curve (Cook and Weisberg (1982)). Unlike in linear regression, $\Delta\hat\lambda_i$ and $\Delta\hat\beta_i$ cannot be computed exactly without actually recomputing the MLE with case i deleted. However, $\Delta\hat\lambda_i$ and $\Delta\hat\beta_i$ can be approximated by applying Atkinson's (1986) "quick estimate" to a linearization of model (1).
To approximate $\Delta\hat\lambda_i$ and $\Delta\hat\beta_i$ we linearize the model. Let
$$ z(\lambda,\beta) = [y^{(\lambda)} - f^{(\lambda)}(x;\beta)]/\dot y^{\lambda-1}, \tag{11} $$
$$ w(\lambda,\beta) = (\partial/\partial\lambda)\, z(\lambda,\beta), $$
$$ u_j(\lambda,\beta) = (\partial/\partial\beta_j)\, z(\lambda,\beta) = -(\partial/\partial\beta_j)\, f^{(\lambda)}(x;\beta)/\dot y^{\lambda-1}, $$
and
$$ u(\lambda,\beta) = (u_1(\lambda,\beta),\ldots,u_p(\lambda,\beta))^T. $$
Sometimes we will write $z(y,x;\beta,\lambda)$ instead of $z(\beta,\lambda)$ to emphasize dependence on y and x. The same holds for $w(\lambda,\beta)$ and $u(\lambda,\beta)$. Also let $\hat z = z(\hat\lambda,\hat\beta)$, $\hat w = w(\hat\lambda,\hat\beta)$, and $\hat u = u(\hat\lambda,\hat\beta)$. Then (11) is approximated by
$$ \hat z = -(\lambda-\hat\lambda)\hat w - (\beta-\hat\beta)^T\hat u + \text{error}. \tag{12} $$
If we fit equation (12) to the full data, then of course $\lambda = \hat\lambda$ and $\beta = \hat\beta$. If instead we fit (12) with the ith case deleted, then we obtain Atkinson's (1986) "quick estimate" approximation, which we call $\hat\lambda^Q_{(i)}$ and $\hat\beta^Q_{(i)}$. Let $\Delta\hat\lambda^Q_i = (\hat\lambda - \hat\lambda^Q_{(i)})$ and $\Delta\hat\beta^Q_i = (\hat\beta - \hat\beta^Q_{(i)})$ be the resulting approximations to $\Delta\hat\lambda_i$ and $\Delta\hat\beta_i$. Because (12) is linear, refitting without case i is easy using standard matrix identities, which have been programmed in many statistical packages.
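The variables needed for the linearized pseudo model can be generated numerically rather than by hand-coded derivatives. The following sketch, in standard-library Python, uses central differences and takes $w = \partial z/\partial\lambda$ and $u_j = \partial z/\partial\beta_j$ with the geometric mean held fixed at its full-data value; the function names and the step size are assumptions made for illustration.

```python
import math

def bc(v, lam):
    """Modified power transformation of a positive value v."""
    return math.log(v) if lam == 0 else (v ** lam - 1.0) / lam

def z_value(yi, xi, f, beta, lam, ydot):
    """Scaled residual for one case, with the geometric mean ydot held fixed."""
    return (bc(yi, lam) - bc(f(xi, beta), lam)) / ydot ** (lam - 1.0)

def linearize(y, x, f, beta, lam, step=1e-5):
    """Return (z, w, u): residuals, dz/dlam, and rows of dz/dbeta_j per case."""
    ydot = math.exp(sum(math.log(v) for v in y) / len(y))
    z, w, u = [], [], []
    for yi, xi in zip(y, x):
        z.append(z_value(yi, xi, f, beta, lam, ydot))
        # central difference in lambda
        w.append((z_value(yi, xi, f, beta, lam + step, ydot)
                  - z_value(yi, xi, f, beta, lam - step, ydot)) / (2 * step))
        row = []
        for j in range(len(beta)):
            up, dn = list(beta), list(beta)
            up[j] += step
            dn[j] -= step
            # central difference in the j-th regression parameter
            row.append((z_value(yi, xi, f, up, lam, ydot)
                        - z_value(yi, xi, f, dn, lam, ydot)) / (2 * step))
        u.append(row)
    return z, w, u
```

Evaluated at the MLE, the three outputs are the response and regressors of the linear pseudo model, ready for any ordinary regression routine with case-deletion diagnostics.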
We used the following computational scheme on SAS. First equation (8) was minimized by PROC NLIN to obtain $(\hat\lambda,\hat\beta)$, then z, w, and u were generated in a DATA step, and finally the linear model (12) was fit on PROC REG. PROC REG calculates the regression diagnostic DFBETAS (Belsley, Kuh, and Welsch 1980), which is a scaled version of $(\Delta\hat\lambda^Q_i, \Delta\hat\beta^Q_i)$. The unscaled $(\Delta\hat\lambda^Q_i, \Delta\hat\beta^Q_i)$ is DFBETA in the Belsley, Kuh, and Welsch nomenclature and is not part of standard SAS output. If we are interested in cases with relatively large values of $(\Delta\hat\lambda^Q_i, \Delta\hat\beta^Q_i)$, then the scaling is immaterial.
Atkinson's (1986) equation (19) gives a simple formula for calculating $\Delta\hat\lambda^Q_i$ alone. We feel, however, that influence for both $\lambda$ and $\beta$ should probably be assessed together, and DFBETAS is ideal for this. When only the response is transformed, $\hat\beta$ depends heavily on $\hat\lambda$ and it is sensible to estimate $\lambda$ first and then to estimate $\beta$. When transforming both sides, $\hat\beta$ is usually very stable as $\hat\lambda$ is perturbed, and $\lambda$ and $\beta$ can be treated simultaneously.
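For readers without a package that reports the unscaled DFBETA, the standard leave-one-out identity for linear least squares can be coded directly. The sketch below is a generic, standard-library illustration (function names are hypothetical): it computes, for each case, the change in the coefficient vector when that case is deleted, using $(X^TX)^{-1}x_i e_i/(1-h_i)$ instead of refitting.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            fac = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= fac * m[c][k]
    xs = [0.0] * n
    for r in range(n - 1, -1, -1):
        xs[r] = (m[r][n] - sum(m[r][k] * xs[k] for k in range(r + 1, n))) / m[r][r]
    return xs

def ols(X, y):
    """Least-squares coefficients and the cross-product matrix X'X."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty), xtx

def dfbeta(X, y):
    """Row i holds coef(full data) - coef(case i deleted), without refitting."""
    coef, xtx = ols(X, y)
    out = []
    for row, yi in zip(X, y):
        v = solve(xtx, row)                       # (X'X)^{-1} x_i
        h = sum(a * b for a, b in zip(row, v))    # leverage h_i
        e = yi - sum(a * b for a, b in zip(row, coef))
        out.append([vj * e / (1.0 - h) for vj in v])
    return out
```

Applied to the linearized pseudo model, the columns of `X` would be the generated regressors and `y` the scaled residuals at the MLE; the sign convention of the resulting changes depends on how the regressors are defined.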
In the example in section 5 and in other examples that we will not report, $\Delta\hat\lambda_i$ and $\Delta\hat\lambda^Q_i$ were often considerably different, which is surprising since the quick estimate is reasonably accurate when y alone is transformed (Atkinson 1986). The difference is that here u depends on $\lambda$ and $\beta$. The approximation $\Delta\hat\lambda^Q_i$ does indicate cases with relatively large values of $\Delta\hat\lambda_i$, and $\Delta\hat\lambda^Q_i$ seems adequate for diagnostic purposes.
To obtain a single measure of joint influence for $(\lambda,\beta)$ one can compute Cook's D or DFFITS (Belsley, Kuh, and Welsch (1980)) for the pseudo model (12).
4. ROBUST ESTIMATION
A general approach to robust estimation is to minimize asymptotic variance subject to a bound on the gross-error sensitivity. This approach was begun by Hampel (1968, 1974), applied to regression by Hampel (1978), Krasker (1980) and Krasker and Welsch (1982), and used in the response transformation problem by Carroll and Ruppert (1985).
Here we will find an estimator bounding the influence for the parameters $\lambda$ and $\beta$. We will ignore $\sigma$, which can be estimated separately with a robust scale functional, e.g. the MAD, applied to the residuals. Let $\ell(x,y;\lambda,\beta)$ be the score function
$$ \ell(x,y;\lambda,\beta) = z(\lambda,\beta)\begin{pmatrix} w(\lambda,\beta) \\ u(\lambda,\beta) \end{pmatrix}. \tag{13} $$
Since the MLE minimizes (8) it solves
$$ \sum_{i=1}^{N} \ell(x_i,y_i;\lambda,\beta) = 0, $$
at least when $\lambda$ and $\beta$ are unconstrained and $f(x;\beta)$ is a smooth function of $\beta$. The MLE is highly sensitive to cases with large values of $z(\lambda,\beta)$, $w(\lambda,\beta)$, or $u(\lambda,\beta)$, corresponding to response outliers, high leverage points for $\lambda$, and high leverage points for $\beta$, respectively.
A robust bounded-influence estimator $(\tilde\lambda,\tilde\beta)$ is found by solving
$$ \sum_{i=1}^{N} W(y_i,x_i;\lambda,\beta)\,\ell(y_i,x_i;\lambda,\beta) = 0 $$
where W is a scalar weight function such that $W\ell$ is bounded. The optimal choice of W was first studied by Hampel (1968, 1974) for general univariate parametric families. When choosing W, asymptotic efficiency measured by the covariance of $(\tilde\lambda,\tilde\beta)$ must be balanced against robustness measured by the norm of $W\ell$. For a multivariate parameter such as $(\lambda,\beta)$, this balancing raises philosophical questions since there are many ways of comparing covariance matrices or of norming vector functions. The approach we take generalizes the Krasker-Welsch (1982) bounded-influence regression estimates. Whether the Krasker-Welsch estimator is optimal in any meaningful sense is an open question (Ruppert 1985), but it seems quite satisfactory in practice.
Let g be the gradient of $\log f(\beta,\lambda,\sigma)$ with respect to $(\beta,\lambda)$. For any weighting function W, the influence function evaluated at $(y_i,x_i)$ will be defined as
$$ IF(y_i,x_i;\lambda,\beta) = B^{-1}\,W(y_i,x_i;\lambda,\beta)\,\ell(y_i,x_i;\lambda,\beta) $$
where
$$ B = N^{-1}\sum_{i=1}^{N} E_y\{W(y,x_i;\lambda,\beta)\,\ell(y,x_i;\lambda,\beta)\,g^T(y,x_i;\lambda,\beta)\}. $$
This definition of IF coincides with the usual definition when the $x_i$ are i.i.d. for some distribution H, and the averaging over $\{x_1,\ldots,x_N\}$ in the definition of B is replaced by expectation with respect to H. Our definition is appropriate for fixed or random $x_i$'s. In the definition of B on page 5 of Carroll and Ruppert (1985), W is incorrectly squared. In that article, but not here, $\ell = g$.
The asymptotic covariance matrix of $(\tilde\lambda,\tilde\beta)$ is $V = B^{-1}A(B^{-1})^T$, where
$$ A = N^{-1}\sum_{i=1}^{N} E_y\{W^2(y,x_i;\lambda,\beta)\,\ell(y,x_i;\lambda,\beta)\,\ell^T(y,x_i;\lambda,\beta)\}. $$
An intuitively reasonable way to norm $W\ell$ is to use the asymptotic covariance; see Krasker and Welsch (1982) for further motivation and discussion. The resultant measure of influence, the so-called self-standardized gross-error sensitivity, is
$$ \gamma_2 = \max_i\,[IF^T(y_i,x_i;\lambda,\beta)\,V^{-1}\,IF(y_i,x_i;\lambda,\beta)]^{1/2} = \max_i\,\{[\ell^T(y_i,x_i;\lambda,\beta)\,A^{-1}\,\ell(y_i,x_i;\lambda,\beta)]\,W^2(y_i,x_i;\lambda,\beta)\}^{1/2}. $$
Note that $W^2$ has been incorrectly omitted from the last term in equation (15) of Carroll and Ruppert (1985). $\gamma_2$ must be at least $(p+1)^{1/2}$. From experience with other problems we suggest bounding $\gamma_2$ by $a(p+1)^{1/2}$, where "a" is between 1.1 and 1.5, and a = 1.2 or 1.3 has generally been satisfactory. To bound $\gamma_2$ by $a(p+1)^{1/2}$, we use the weighting function
$$ W(y,x;\lambda,\beta) = \min\{1,\; a(p+1)^{1/2}\,[\ell^T(y,x;\lambda,\beta)\,A^{-1}\,\ell(y,x;\lambda,\beta)]^{-1/2}\}. $$
Here A is defined implicitly since it depends upon W and vice versa. In practice, $\lambda$, $\beta$, and A are estimated iteratively.
We used a simple iterative scheme:
(1) Fix a > 1. Let C be the total number of cycles. Set c = 1. Let $\lambda_p$ and $\beta_p$ be preliminary estimates, probably the MLEs. Set $W_i = 1$.
(2) Define
$$ A = N^{-1}\sum_{i=1}^{N} W_i^2\,\ell(y_i,x_i;\lambda_p,\beta_p)\,\ell^T(y_i,x_i;\lambda_p,\beta_p). $$
In $\ell$ use the weighted geometric mean $\dot y = \exp(\sum W_i \log y_i / \sum W_i)$.
(3) Update the weights:
$$ W_i = \min\{1,\; a(p+1)^{1/2}\,[\ell^T(y_i,x_i;\lambda_p,\beta_p)\,A^{-1}\,\ell(y_i,x_i;\lambda_p,\beta_p)]^{-1/2}\}. $$
(4) Solve
$$ \sum_{i=1}^{N} W_i\,\ell(y_i,x_i;\lambda,\beta) = 0. $$
(5) If c < C, set $\lambda_p = \lambda$, $\beta_p = \beta$, c = c+1 and return to (2). If c = C then stop.
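The matrix and weight computations in one cycle of the scheme above can be sketched in standard-library Python. This is an illustration under assumptions, not the authors' PROC MATRIX program: `scores[i]` is taken to be the score vector for case i already evaluated at the preliminary estimates, the matrix A is taken to be the weighted average of score outer products, and the names `update_weights` and `solve` are hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            fac = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= fac * m[c][k]
    xs = [0.0] * n
    for r in range(n - 1, -1, -1):
        xs[r] = (m[r][n] - sum(m[r][k] * xs[k] for k in range(r + 1, n))) / m[r][r]
    return xs

def update_weights(scores, weights, a=1.2):
    """Form A from the current weights, then return the updated weights."""
    n, q = len(scores), len(scores[0])   # q = p + 1 parameters (lambda and beta)
    A = [[sum(w * w * s[j] * s[k] for s, w in zip(scores, weights)) / n
          for k in range(q)] for j in range(q)]
    bound = a * math.sqrt(q)
    new = []
    for s in scores:
        quad = sum(sj * vj for sj, vj in zip(s, solve(A, s)))   # s' A^{-1} s
        new.append(1.0 if quad <= 0 else min(1.0, bound / math.sqrt(quad)))
    return new
```

The weighted estimating equation would then be solved by a weighted nonlinear least-squares fit, after which the scores are recomputed and the function is called again for the next cycle.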
In the examples discussed in section 5 and other examples that we will not report, $\lambda$ and $\beta$ stabilized at C=2. Therefore, we recommend C=2, or perhaps C=3 for small N or data sets with extremely influential points. In fact C=1 seems adequate, at least for diagnostic purposes.
We calculated step (3) with a short program in PROC MATRIX and step (4) was performed in PROC NLIN. PROC MATRIX is needed only to invert A, and the program should be easily modified when PROC MATRIX is replaced by an interactive matrix language. Undoubtedly, the computations would also be easy on other packages. We will call the final estimate the BITBS.
This iterative method can also be used for the response transformation model. Instead of using (13), one sets $\ell$ to the corresponding score function for model (2).
It should be noted that this approach differs from the BIT estimate of Carroll and Ruppert (1985), since the BIT estimates $\lambda$, $\beta$, and $\sigma$ simultaneously. Because the likelihood scores for $\beta$ and $\sigma$ are, respectively, linear and quadratic in the residual, the joint bounded-influence estimator of $\beta$ and $\sigma$ behaves as a redescending "psi-function", e.g. a Hampel M-estimator (Hampel 1978). Therefore, the influence on $\lambda$ and $\beta$ of extreme response outliers approaches zero when the influence for $(\lambda,\beta,\sigma)$ is simultaneously bounded.
5. AN EXAMPLE
When managing a fish stock, one must model the relationship between the annual spawning stock size and the eventual production of new catchable-sized fish (returns or recruits) from the spawning. Ricker and Smith (1975) give numbers of spawners (S) and returns (R) from 1940 until 1967 for the Skeena River sockeye salmon stock. Using some simple assumptions about factors influencing the survival of juvenile fish, Ricker (1954) derived the theoretical model
$$ R = \beta_1 S \exp(\beta_2 S) = f(S;\beta) \tag{14} $$
relating R and S. Other models have been proposed, e.g. by Beverton and Holt (1957). However, the Ricker model appears adequate and, in particular, gives almost the same fit for this stock as the Beverton-Holt model.
From figure 1, a plot of R against S, it is clear that R is highly variable and heteroscedastic, with the variance of R increasing with its mean. Several cases appear somewhat outlying, in particular #5, #19, and #25.
The model (14) was linearized about the MLE to form the pseudo model (12) and the square root of Cook's D was plotted against case number; see figure 2. Case #5 and especially case #12 stand out. In figure 1 case #12 is somewhat masked by the heteroscedasticity since the residual on the original scale $[R_{12} - f(S_{12};\hat\beta)]$ is relatively small, but after transformation by the MLE $\hat\lambda = .314$ the residual $[R_{12}^{(\hat\lambda)} - f^{(\hat\lambda)}(S_{12};\hat\beta)]$ is substantially larger though still not excessive. Case #12 is an extremely high leverage point, and its Hat matrix diagonal according to model (12) is $h_{12} = 0.685$. An h value exceeding 2p/N = 6/28 = .214 is considered high by Hoaglin and Welsch (1978). Since $h_5 = .23$, case #5 is also a leverage point by this criterion. In figure 3, the residuals $e_i = [R_i^{(\hat\lambda)} - f^{(\hat\lambda)}(S_i;\hat\beta)]$ are plotted against $S_i$ and though #12 stands out, it does not seem extremely outlying until one accounts for leverage. To compensate for leverage, Belsley, Kuh, and Welsch (1980) suggest standardizing $e_i$ by its estimated standard error to produce
$$ \mathrm{RSTUDENT}_i = e_i/[s(i)(1-h_i)^{1/2}], $$
where s(i) is the root mean square error without case i. For this data set $\mathrm{RSTUDENT}_{12} = -4.40$!
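The studentized residual can be computed from the residuals and hat-matrix diagonals of a linear least-squares fit without refitting, using the standard deletion identity for the error sum of squares. The sketch below is a generic standard-library illustration; the function name `rstudent` and its argument names are hypothetical, and `n_params` is the number of fitted regression coefficients.

```python
import math

def rstudent(residuals, leverages, n_params):
    """e_i / (s(i) * sqrt(1 - h_i)), where s(i) is the root mean square error
    of the fit without case i, obtained from SSE - e_i**2 / (1 - h_i)."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    out = []
    for e, h in zip(residuals, leverages):
        s2_del = (sse - e * e / (1.0 - h)) / (n - n_params - 1)
        out.append(e / math.sqrt(s2_del * (1.0 - h)))
    return out
```

For example, fitting only a mean to the values 1, 2, 3, 10 gives residuals -3, -2, -1, 6 with leverage 0.25 each, and the last case then has a studentized residual of about 6.93.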
In table 1 we present influence diagnostics applied to model (12), the exact change $\Delta\hat\lambda_i$, the quick estimate $\Delta\hat\lambda^Q_i$, and another approximation $\Delta\hat\lambda^N_i$. It is evident that $\Delta\hat\lambda^Q_i$ is not always close to $\Delta\hat\lambda_i$, and there are at least two possible causes of inaccuracy: (i) $\Delta\hat\lambda^Q_i$ uses a linearization of the parameters and (ii) in model (12) $\dot y$ is held fixed rather than readjusted as cases are deleted. To isolate the effects of cause (ii), we experimented with a different approximation. We computed the nonlinear least squares estimate $\hat\lambda^N_{(i)}$ minimizing (8) when $\dot y$ is calculated from the full data but the sum of squares in (8) is over $j \neq i$. Then we set $\Delta\hat\lambda^N_i = \hat\lambda - \hat\lambda^N_{(i)}$. Of course, $\Delta\hat\lambda^N_i$ is as difficult to compute as $\Delta\hat\lambda_i$ itself and is not of interest as a practical approximation; we have calculated $\Delta\hat\lambda^N_i$ just to learn why $\Delta\hat\lambda^Q_i$ is inaccurate. In table 1, $|\Delta\hat\lambda_i|$ is large for i=5 and 12. In both cases, $\Delta\hat\lambda^N_i$ approximates $\Delta\hat\lambda_i$ much better than $\Delta\hat\lambda^Q_i$, which suggests that cause (i) is the primary problem. It is clear, though, that small changes in $\dot y$ from deleting an outlying $y_i$ can have a notable effect on $\hat\lambda$. We have compared $\Delta\hat\lambda_i$ and $\Delta\hat\lambda^Q_i$ for several other data sets: the kinetics data of Carr (1960) analyzed in Box and Hill (1974) and Carroll and Ruppert (1984), the Atlantic menhaden spawner-recruit data in Carroll and Ruppert (1985), and the "population A" spawner-recruit data in Ruppert and Carroll (1986). In all cases $\Delta\hat\lambda^Q_i$, though not an accurate approximation, did indicate large values of $|\Delta\hat\lambda_i|$, and $\Delta\hat\lambda^Q_i$ appears to be an effective diagnostic of high influence.
We computed the BITBS estimate with bound a=1.2 and C=3 cycles. In table 2, $\hat\lambda$, $\hat\beta$, and the non-unit $W_i$ are given for the MLE and each iteration of BITBS. The changes in the estimates and weights from the first to the second iteration are small, and one iteration seems adequate for most practical purposes; certainly two iterations suffice. Although the BITBS estimate severely downweights case #12, the estimate of $\lambda$ only changes from .31 to .13 in two iterations of BITBS, while the MLE becomes $\hat\lambda = -.2$ if #12 is deleted. For these data, the BITBS detects the influential points and reduces, but does not eliminate, their influence.
Case #12 was the year 1951 when a rock slide severely reduced recruitment (Ricker and Smith 1975). To model recruitment under normal conditions one would delete case #12 and refit. Without #12, the MLE and the BITBS estimate are similar (table 3), and one would probably use the MLE. The bulk of the data indicate that recruitment is highly heteroscedastic and a severe transformation ($\lambda = -.2$) is needed to induce homoscedastic errors. Because the anomalous #12 occurs where S is small, it indicates less heteroscedasticity and a more moderate transformation ($\lambda = .3$) is used if #12 is not deleted. Since case #12 is not from the target population of normal spawning years, it seems safe to delete it. This data set is an example where one might consider a robust estimator that gives essentially zero influence to extreme outliers, e.g. a generalization to transformation models of a redescending-psi M-estimator. We normally prefer using a redescending M-estimator to rejecting outliers, since an M-estimator has a known large sample distribution. The effects of outlier rejection methods upon the MLE are not well understood, even asymptotically. Here we distinguish between rejecting outliers based on some statistical criterion, say a hypothesis test, and deleting observations known to come from a non-target population.
Although the MLE and the BITBS are similar when #12 is deleted, two cases, #4 and #5, are substantially downweighted by the BITBS. Case #4 had been masked when #12 was included, but #5 was already influential in the presence of #12. One might explore further deletions, though without further information we feel that #4 and #5 should be retained in the final analysis. If in addition to #12, case #4, case #5, or both are deleted, then the MLE changes noticeably; see Table 4. The deletion of #5 and the deletion of #4 affect the MLE in somewhat opposite directions, though deleting #4 affects the MLE more severely. The BITBS downweights #4 somewhat more than #5, so effects of downweighting #4 and #5 tend to cancel.
We do not view the "transform both sides" method chiefly as choosing a new scale for analyzing the response, but rather as modeling the conditional distribution of y on the original scale. By "original" scale, we mean the scale of primary scientific interest, usually also the scale on which direct measurements have been made. The model
$$ y^{(\lambda)} = f^{(\lambda)}(x;\beta) + \sigma\epsilon $$
leads to
$$ y = \begin{cases} (f^{\lambda}(x;\beta) + \lambda\sigma\epsilon)^{1/\lambda}, & \lambda \neq 0 \\ \exp(\log f(x;\beta) + \sigma\epsilon), & \lambda = 0. \end{cases} $$
We are assuming that $\epsilon$ is approximately normal and in particular approximately symmetric, and this last point suggests that the conditional median of y given x can be estimated by $\hat m(y|x) = f(x;\hat\beta)$. The conditional mean is easily estimated by the "smearing" estimate of Duan (1983). Let $r_i$ be the i-th residual
$$ r_i = y_i^{\hat\lambda} - f^{\hat\lambda}(x_i;\hat\beta) = \hat\lambda\,(y_i^{(\hat\lambda)} - f^{(\hat\lambda)}(x_i;\hat\beta)). $$
Then the smearing estimate is
$$ \hat E(y|x) = N^{-1}\sum_{i=1}^{N} [f^{\hat\lambda}(x;\hat\beta) + r_i]^{1/\hat\lambda} \tag{15} $$
with obvious changes if $\hat\lambda = 0$.
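A smearing estimate of the conditional mean can be sketched as follows in standard-library Python. The sketch assumes the residuals are formed on the power scale as $y_i^{\lambda} - f^{\lambda}(x_i;\hat\beta)$ and back-transformed by the $1/\lambda$ power, with logs and exponentials when $\lambda = 0$; the function and argument names are hypothetical.

```python
import math

def smearing_mean(f_new, f_fitted, y, lam):
    """Smearing estimate of E(y | x).

    f_new:    fitted median f(x; beta_hat) at the x of interest
    f_fitted: fitted medians f(x_i; beta_hat) for the observed cases
    y:        observed responses y_i (all positive)
    lam:      estimated transformation parameter
    """
    n = len(y)
    if lam == 0:
        # residuals on the log scale, back-transformed with exp
        return sum(f_new * math.exp(math.log(yi) - math.log(fi))
                   for yi, fi in zip(y, f_fitted)) / n
    total = 0.0
    for yi, fi in zip(y, f_fitted):
        base = f_new ** lam + (yi ** lam - fi ** lam)   # f^lam + r_i
        if base <= 0:
            raise ValueError("f^lam + r_i must be positive to back-transform")
        total += base ** (1.0 / lam)
    return total / n
```

To assess the effect of deleting a case, one would pass only the retained cases in `f_fitted` and `y`, so that the averaging runs over the retained residuals.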
If $\hat\lambda$ is 0 or if $\hat\lambda^{-1}$ is an integer, then $E(y|x)$ can be estimated by methods of Miller (1984). Miller's estimators are particularly simple when $\lambda$ is in the set $\{-1, -\tfrac{1}{2}, 0, \tfrac{1}{2}, 1\}$. It is a common practice to round $\hat\lambda$ to a value in this set, in which case Miller's (1984) estimate is applicable. We do not necessarily advocate rounding $\hat\lambda$, especially since there is some theoretical evidence against this practice (Carroll 1984), but when the rounded value is very plausible according to a hypothesis test, then the rounding should have little effect on subsequent inference. The common rationale for rounding $\hat\lambda$, to make the transformation easily interpretable, is less compelling when one views $\lambda$ as part of a model for y on the original scale. In the present example, the transformed recruitment $R^{(\hat\lambda)}$ is admittedly of little direct biological or economic interest, but the same is true of, say, log(R).
In figure 1, $\hat m(R|S)$ and $\hat E(R|S)$, both calculated without #12, are plotted. $E(R|S)$ was also estimated by fixing $\lambda = 0$, re-estimating $\beta$ and $\sigma$, and using Miller's (1984) estimate:
$$ f(S;\hat\beta)\,e^{\hat\sigma^2/2}. $$
The smearing estimate and Miller's estimate are so close that they would be barely distinguishable had Miller's estimate been included in figure 1.
To see the influence of case #12 on $\hat m(R|S)$ and $\hat E(R|S)$, these estimates were calculated both with and without case #12. When #12 is deleted then not only are $\hat\lambda$ and $\hat\beta$ set equal to the MLE without #12, but also the averaging in (15) is over $i \neq 12$. The changes in the estimated median and mean caused by deletion of case #12 are graphed in figure 4. As might be expected, deleting #12 caused the estimated median and mean recruitment to increase for small S, especially for S near 300-400. The most dramatic change when deleting #12 is a decrease in estimated median and mean recruitment for large S. This decrease is largely brought about by the decrease in $\hat\beta_2$, since $\beta_2$ controls the shape of the Ricker curve.
The "transform both sides" model is certainly not the only model that would be appropriate for this example. Since R is heteroscedastic but not greatly skewed one should consider heteroscedastic models such as
$$ R = f(S;\beta) + \sigma f^{\alpha}(S;\beta)\,\epsilon, \tag{16} $$
where the variance of R is proportional to a power of S or of f. In Ruppert and Carroll (1986), the model
$$ R^{(\lambda)} = f^{(\lambda)}(S;\beta) + \sigma S^{\alpha}\epsilon \tag{17} $$
was fit to the Skeena River data, with case #12 included. The MLE was $\hat\lambda = .75$ and $\hat\alpha = .5$. However, both $H_0: \alpha = 0$ and $H_0: \lambda = 1$ are accepted by likelihood ratio tests at level .10, so models (1) and (16) both appear reasonable for these data, though a re-analysis without case #12 would be of interest. We plan to study diagnostics and robust estimation for model (17) in the future.
6. SUMMARY
When a response y is thought to fit a model $f(x;\beta)$, but y is heteroscedastic and/or nonnormally distributed, then y and $f(x;\beta)$ can be transformed in the same manner to induce approximately homoscedastic, normal errors while retaining the model $f(x;\beta)$ for the conditional median of y. Often outliers in the original response are accommodated by transformation; that is, the outliers are seen to be the result of the skewness or heteroscedasticity in the untransformed data.
In some situations, an outlier will indicate a substantially different transformation than that fitting the bulk of the data. In our example with 28 observations, case #12 is a response outlier associated with a small value of the conditional median (and mean) response. Therefore, case #12 counter-indicates the severe heteroscedasticity in the rest of the data, and deleting #12 changes the estimated power transformation from $\hat\lambda = .3$ to $\hat\lambda = -.2$.
Influential cases should be detected and scrutinized as a matter of standard good statistical practice. In some situations, such as with case #12 in our example, there are good reasons for removing an influential case. In other cases, the appropriate treatment of the outliers will be less clear-cut.
In this paper, we propose an approximation to the sample influence curve. Although the approximation is not highly accurate, it is an effective diagnostic for influential cases. We also propose a bounded influence estimator, which can be used to pinpoint influential cases, or to accommodate them, or both. The diagnostic and the robust estimator can both be computed with standard software.
APPENDIX
There are at least three methods of estimating the covariance matrix of $(\hat\beta,\hat\lambda)$:
(i) Evaluate the Hessian of $-L(\beta,\lambda,\sigma)$ at $(\hat\beta,\hat\lambda,\hat\sigma)$, the observed Fisher information matrix, which we call I. Then use the upper-left $(p+1)\times(p+1)$ submatrix of $I^{-1}$.
(ii) Let $I^*$ be the Hessian of $-L_{\max}(\beta,\lambda)$ evaluated at $(\hat\beta,\hat\lambda)$. Use $(I^*)^{-1}$.
(iii) Use the estimate from model (10) treated as a nonlinear regression problem.
Also there is a fourth method which only estimates the covariance matrix of $\hat\beta$. This is the Box-Cox method. Suppose we followed the method of estimation described in section 2 and we have obtained $\hat\lambda$ and $\hat\beta(\hat\lambda)$, which is the least squares estimate when $y^{(\lambda)}$ is fit to $f^{(\lambda)}(x;\beta)$ with $\lambda$ fixed at $\hat\lambda$. This fit also gives an estimated covariance matrix for $\hat\beta$, which we will call the method (iv) estimate.
Methods (iii) and (iv) are the easiest to use since they can be implemented on standard nonlinear regression software. Method (i) is justified by the well-known large sample theory of maximum likelihood estimation. Method (ii) ignores $\sigma$ and treats $L_{\max}(\beta,\lambda)$ as a likelihood for $(\beta,\lambda)$. In theorem A.2 below, we show that methods (i) and (ii) are identical, not just for the transformation problem under study but in general for parametric estimation where a parameter is eliminated by maximizing the likelihood over that parameter.
In general, methods (i) and (ii) are not equivalent to (iii) even as $N \to \infty$, but by theorem A.1 below all three estimates of $\mathrm{var}(\hat\beta)$ are the same in the limit as $N \to \infty$ and $\sigma \to 0$. Carroll and Ruppert (1984) have let $N \to \infty$ and $\sigma \to 0$ simultaneously to provide a simple asymptotic theory for transformations, since the usual theory for $N \to \infty$ and $\sigma$ fixed is complicated (Bickel and Doksum 1981). "Small $\sigma$" asymptotics have often proved to be good approximations to finite sample results, when checked against Monte Carlo. Moreover, in many data sets, especially from engineering and the physical sciences, $\sigma$ does seem small in the sense that the model fits the data very well.
In Carroll and Ruppert (1984) we show that methods (i) and (iv) are asymptotically equivalent estimates of $\mathrm{var}(\hat\beta)$ as $N \to \infty$ and $\sigma \to 0$.
In summary, methods (i) and (ii) are identical, except of course that method (ii) does not estimate $\mathrm{var}(\hat\sigma)$ or the covariance of $\hat\sigma$ with $\hat\beta$ and $\hat\lambda$. All four methods give equivalent estimates of $\mathrm{var}(\hat\beta)$ as $N \to \infty$ and $\sigma \to 0$. It does not appear that method (iii) correctly estimates $\mathrm{var}(\hat\lambda)$ or $\mathrm{cov}(\hat\lambda,\hat\beta)$. The confidence interval (9) for $\lambda$ can be used in place of a standard error for $\hat\lambda$, but a $\delta$-method standard error of $\hat E(y|x)$ or $\hat m(y|x)$ will require an estimate of $\mathrm{cov}(\hat\lambda,\hat\beta)$ and $\mathrm{var}(\hat\lambda)$.
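Programming method (ii) by computing the analytic second derivative matrix of $L_{\max}$ is somewhat a bother, but the gradient of $L_{\max}$ is easily programmed and can be differentiated numerically. Since

$$L_{\max}(\beta,\lambda) = -(N/2)\log\hat\sigma^2(\beta,\lambda),$$

where

$$\hat\sigma^2(\beta,\lambda) = N^{-1}\sum_{i=1}^{N} z^2(y_i,x_i;\beta,\lambda),$$

and since the gradient of $\hat\sigma^2$ at $(\hat\beta,\hat\lambda)$ is zero, the Hessian of $L_{\max}$ at $(\hat\beta,\hat\lambda)$ is

$$-\hat\sigma^{-2}\sum_{i=1}^{N}\begin{bmatrix}(\partial/\partial\beta)(zu) & (\partial/\partial\lambda)(zu)\\ (\partial/\partial\beta)(zw) & (\partial/\partial\lambda)(zw)\end{bmatrix},$$

where all quantities on the right hand side are evaluated at $(\hat\beta,\hat\lambda)$. It is not difficult to numerically compute the derivatives of $(zu)$ and $(zw)$ with respect to $\beta$ and $\lambda$.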
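As a rough illustration of the numerical route, the sketch below forms the Hessian of $L_{\max}$ by central differences. Several pieces are assumptions made only for the example and are not taken from this appendix: the scaled residual is taken to be $z_i = [y_i^{(\lambda)} - f^{(\lambda)}(x_i;\beta)]/\dot{y}^{\lambda-1}$ with $\dot{y}$ the geometric mean of the responses, the model is a Ricker-type curve $f(S;\beta) = \beta_1 S \exp(\beta_2 S)$, the data are synthetic, and the names `box_cox`, `ricker`, `l_max` and `numerical_hessian` are hypothetical. For brevity it takes second differences of $L_{\max}$ itself rather than differencing an analytic gradient.

```python
import math


def box_cox(v, lam):
    """Power transformation (v**lam - 1)/lam, with log(v) at lam = 0."""
    return math.log(v) if abs(lam) < 1e-12 else (v ** lam - 1.0) / lam


def ricker(s, beta):
    """Assumed model f(S; beta) = beta1 * S * exp(beta2 * S)."""
    return beta[0] * s * math.exp(beta[1] * s)


def l_max(theta, y, x, f=ricker):
    """Concentrated log-likelihood -(N/2) log sigma2_hat for theta = (beta..., lam)."""
    *beta, lam = theta
    n = len(y)
    log_gm = sum(math.log(v) for v in y) / n       # log of the geometric mean of y
    scale = math.exp((lam - 1.0) * log_gm)         # assumed scaling of the residuals
    z = [(box_cox(yi, lam) - box_cox(f(xi, beta), lam)) / scale
         for yi, xi in zip(y, x)]
    sigma2 = sum(zi * zi for zi in z) / n
    return -0.5 * n * math.log(sigma2)


def numerical_hessian(fun, theta, rel_step=1e-3):
    """Central-difference Hessian of fun at theta (steps relative to each parameter)."""
    p = len(theta)
    h = [rel_step * (abs(t) if t != 0 else 1.0) for t in theta]

    def shifted(j, sj, k, sk):
        t = list(theta)
        t[j] += sj * h[j]
        t[k] += sk * h[k]
        return fun(t)

    hess = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            val = (shifted(j, 1, k, 1) - shifted(j, 1, k, -1)
                   - shifted(j, -1, k, 1) + shifted(j, -1, k, -1)) / (4.0 * h[j] * h[k])
            hess[j][k] = hess[k][j] = val
    return hess


# Synthetic illustration only; these are NOT the Skeena River data.
x = [100.0, 250.0, 400.0, 550.0, 700.0, 850.0, 1000.0]
noise = [1.10, 0.90, 1.05, 0.95, 1.20, 0.85, 1.00]
y = [ricker(s, (3.3, -7e-4)) * e for s, e in zip(x, noise)]
theta = [3.3, -7e-4, 0.3]                          # (beta1, beta2, lambda)
hess = numerical_hessian(lambda t: l_max(t, y, x), theta)
```

In practice `theta` would be the maximizer $(\hat\beta,\hat\lambda)$ returned by whatever optimizer is used, and the method (ii) covariance estimate would be the inverse of the negative of `hess`.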
In table A.1 we compare method (ii), (iii) and (iv) standard errors for the Skeena River data with case #12 deleted. The three methods produce similar standard errors of $\hat\beta_1$ and $\hat\beta_2$. The standard error of $\hat\lambda$ by method (iii) seems substantially inflated.

Theorem A.1: Suppose that $x_1, x_2, \ldots$ are i.i.d. Then as $N \to \infty$ and $\sigma \to 0$, methods (i) and (iii) of estimating the covariance matrix of $\hat\beta$ are asymptotically equivalent.
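Sketch of proof: Define

$$\hat\Sigma_{11} = \sum_{i=1}^{N} u(y_i,x_i;\hat\beta,\hat\lambda)\,u^T(y_i,x_i;\hat\beta,\hat\lambda), \qquad \hat\Sigma_{12} = \sum_{i=1}^{N} u(y_i,x_i;\hat\beta,\hat\lambda)\,w(y_i,x_i;\hat\beta,\hat\lambda),$$

$$\hat\Sigma_{22} = \sum_{i=1}^{N} w^2(y_i,x_i;\hat\beta,\hat\lambda),$$

and

$$\hat\Sigma = \begin{bmatrix}\hat\Sigma_{11} & \hat\Sigma_{12}\\ \hat\Sigma_{12}^T & \hat\Sigma_{22}\end{bmatrix}.$$

Then the estimated covariance matrix of $(\hat\beta,\hat\lambda)$ by method (iii) is $s^2\hat\Sigma^{-1}$, where $s^2$ is the mean square error. Now as $N \to \infty$ and $\sigma \to 0$, by Taylor expansions

$$(N\sigma)^{-1}\hat\Sigma_{12} \to 0.$$

(Note that $y_i$ is a function of $\epsilon_i$; see equation (1).) Therefore, by method (iii) the estimate of $\mathrm{Var}((N^{1/2}/\sigma)\hat\beta,\, N^{1/2}\hat\lambda)$ converges to a block-diagonal matrix. By theorem 1 of Carroll and Ruppert (1984), method (i) has the same asymptotic estimate of $\mathrm{Var}((N^{1/2}/\sigma)\hat\beta)$ but a different estimate of $\mathrm{Var}(N^{1/2}\hat\lambda)$.

Note: Some type of regularity conditions on $\{x_i\}$ are needed for the asymptotics to hold. The assumption that $\{x_i\}$ are i.i.d. is convenient, but other assumptions could be used instead. For a rigorous result, an appropriate regularity condition on $f$ would also be needed.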
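Theorem A.2: Let $L(\theta,\sigma)$, $\theta \in R^k$ and $\sigma \in R^q$, be a real-valued function. Let $L_\theta$ and $L_\sigma$ be the first partial derivatives and let $L_{\theta\theta}$, $L_{\theta\sigma}$, and $L_{\sigma\sigma}$ be the second partial derivatives of $L$. For each $\theta$ suppose $\hat\sigma(\theta)$ satisfies

$$L(\theta,\hat\sigma(\theta)) = \sup_{\sigma} L(\theta,\sigma). \qquad \text{(A.1)}$$

Then $[(\partial^2/\partial\theta\,\partial\theta^T)\,L(\theta,\hat\sigma(\theta))]^{-1}$ is the upper left $k \times k$ submatrix of

$$\begin{bmatrix} L_{\theta\theta}(\theta,\hat\sigma(\theta)) & L_{\theta\sigma}(\theta,\hat\sigma(\theta))\\ L_{\theta\sigma}^T(\theta,\hat\sigma(\theta)) & L_{\sigma\sigma}(\theta,\hat\sigma(\theta))\end{bmatrix}^{-1}. \qquad \text{(A.2)}$$

Proof: It is enough to prove the theorem when $q = 1$, for then the general case follows by induction. By (A.1),

$$L_\sigma(\theta,\hat\sigma(\theta)) = 0,$$

so that, differentiating with respect to $\theta$,

$$L_{\theta\sigma} + L_{\sigma\sigma}\,(\partial\hat\sigma/\partial\theta) = 0.$$

Therefore

$$\partial\hat\sigma/\partial\theta = -L_{\theta\sigma}/L_{\sigma\sigma}. \qquad \text{(A.3)}$$

Next

$$(\partial^2/\partial\theta\,\partial\theta^T)\,L(\theta,\hat\sigma(\theta)) = L_{\theta\theta} + L_{\theta\sigma}\,(\partial\hat\sigma/\partial\theta)^T, \qquad \text{(A.4)}$$

where all terms on the right-hand side of (A.4) are evaluated at $\theta, \hat\sigma(\theta)$. Substituting (A.3) into (A.4) we have

$$(\partial^2/\partial\theta\,\partial\theta^T)\,L(\theta,\hat\sigma(\theta)) = L_{\theta\theta} - L_{\theta\sigma}L_{\theta\sigma}^T/L_{\sigma\sigma}.$$

Using the identity

$$(A + UV^T)^{-1} = A^{-1} - A^{-1}UV^TA^{-1}/(1 + V^TA^{-1}U)$$

if $A \in R^{k \times k}$ and $U, V \in R^k$ (see problem 2.8, page 33 of Rao 1973) we have

$$[(\partial^2/\partial\theta\,\partial\theta^T)\,L(\theta,\hat\sigma(\theta))]^{-1} = L_{\theta\theta}^{-1} + L_{\theta\theta}^{-1}L_{\theta\sigma}L_{\theta\sigma}^TL_{\theta\theta}^{-1}/(L_{\sigma\sigma} - L_{\theta\sigma}^TL_{\theta\theta}^{-1}L_{\theta\sigma}). \qquad \text{(A.5)}$$

By another identity (see problem 2.7, page 33 of Rao 1973), (A.5) is the $k \times k$ upper left submatrix of (A.2).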
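The statement of Theorem A.2 can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: $L$ is taken to be a quadratic form with an arbitrary negative definite Hessian `H` (with $k = 2$, $q = 1$), so the Hessian of the maximized function is the Schur complement $L_{\theta\theta} - L_{\theta\sigma}L_{\theta\sigma}^T/L_{\sigma\sigma}$. The matrix entries and the helper name `solve_inverse` are hypothetical.

```python
def solve_inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical full Hessian of L(theta, sigma): theta has k = 2 components, sigma has 1.
H = [[-4.0, 1.0, 2.0],
     [1.0, -3.0, 0.5],
     [2.0, 0.5, -5.0]]
k = 2
h_ss = H[k][k]

# Hessian of L(theta, sigma_hat(theta)): L_tt - L_ts L_ts^T / L_ss.
profile = [[H[i][j] - H[i][k] * H[j][k] / h_ss for j in range(k)] for i in range(k)]

inv_profile = solve_inverse(profile)
inv_full_block = [row[:k] for row in solve_inverse(H)[:k]]   # upper left k x k block

max_gap = max(abs(inv_profile[i][j] - inv_full_block[i][j])
              for i in range(k) for j in range(k))
print(max_gap)   # agreement up to floating-point rounding
```

The same comparison applies to methods (i) and (ii): inverting the Hessian of $L_{\max}(\beta,\lambda)$ gives the same matrix as taking the $(\beta,\lambda)$ block of the inverse of the full Hessian of $L(\beta,\lambda,\sigma)$.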
Table 1
Diagnostics for the Skeena River sockeye salmon data.

                                         Case Number
Diagnostics                        5        12        19        25

Residual                         917      -939      -882      -922
RSTUDENT                        2.25     -4.40     -1.93     -2.04
Hat diagonal                     .23       .68       .08       .08
Cook's D                         .43      8.09       .09       .11
DFFITS                          1.23     -6.49      -.55      -.62
DFBETAS-$\hat\beta_1$           -.46      -.71       .13       .19
DFBETAS-$\hat\beta_2$            .55      1.38      -.33      -.41
DFBETAS-$\hat\lambda$          -1.01      6.06      -.11      -.11
$\Delta\hat\lambda_{i,Q}$       -.31      1.56      -.04      -.04
$\Delta\hat\lambda_{i,N}$       -.17       .77      -.01      -.01
$\Delta\hat\lambda_{i,1}$       -.10       .51      -.03      -.03
Table 2
Maximum likelihood and bounded-influence estimates
for the Skeena River data. No cases deleted.

                       C=0 (MLE)         C=1              C=2              C=3

$\hat\beta_1$          3.295             3.590            3.619            3.622
$\hat\beta_2$          -6.9998x10^-4     -8.307x10^-4     -8.49x10^-4      -8.50x10^-4
$\hat\lambda$          .3141             .1921            .1329            .1138
$w_5$                  1.0               .448             .579             .647
$w_6$                  1.0               .931             1.0              1.0
$w_{12}$               1.0               .253             .188             .172
$w_{19}$               1.0               .811             .857             .874
$w_{25}$               1.0               .733             .776             .790
Table 3
Maximum likelihood and bounded-influence estimates for
the Skeena River data. Case #12 deleted.

                       C=0 (MLE)         C=1              C=2

$\hat\beta_1$          3.78              3.89             3.98
$\hat\beta_2$          -9.54x10^-4       -10.2x10^-4      -9.93x10^-4
$\hat\lambda$          -.199             -.254            -.235
$w_4$                  1.0               .377             .575
$w_5$                  1.0               .448             .753
$w_6$                  1.0               1.0              .946
$w_9$                  1.0               1.0              .954
$w_{12}$               This case is deleted
$w_{18}$               1.0               1.0              .904
$w_{19}$               1.0               .781             .860
$w_{25}$               1.0               .703             .846
Table 4
Maximum likelihood estimation for the
Skeena River data with selected cases removed

Cases Removed          $\hat\beta_1$     $\hat\beta_2$     $\hat\lambda$

#12                    3.78              -9.54x10^-4       -.199
#4, #12                4.20              -11.2x10^-4       -.428
#5, #12                3.89              -10.5x10^-4       -.126
#4, #5, #12            4.30              -12.1x10^-4       -.392
Table A.1
Estimated standard errors for the Skeena River
sockeye salmon data without case #12

Method       s.e.($\hat\beta_1$)     s.e.($\hat\beta_2$)     s.e.($\hat\lambda$)

(ii)         0.698                   3.17x10^-4              0.369
(iii)        0.711                   3.33x10^-4              0.624
(iv)         0.694                   3.06x10^-4
LIST OF FIGURES

Fig. 1 - Plot of returns (or recruits) against spawners with mean and median recruitment estimated without case #12. Selected cases are identified.
Fig. 2 - Square root of Cook's distance plotted against case number.
Fig. 3 - Residuals = $[R_i - f(S_i;\hat\beta)]$ from the full-data MLE plotted against spawners. Selected cases are identified.
Fig. 4 - Differences in mean and median recruitment estimated without and with case #12 plotted against spawners.
[Figure 2: square root of Cook's D plotted against case number.]
[Figure 3: residuals plotted against spawners (in thousands of fish); cases 4, 5, 6, 9, 12, 16, 19, 22 and 25 are labeled.]
[Figure 4: change in estimated recruitment without and with case #12, MEAN and MEDIAN curves, plotted against spawners (in thousands of fish).]

REFERENCES

ATKINSON, A. C. (1986). "Diagnostic tests for transformations." To appear in Technometrics.

BATES, D. M. and WATTS, D. G. (1980). "Relative curvature measures of nonlinearity." Journal of the Royal Statistical Society, Series B, 42, 1-25.

BATES, D. M., WOLF, D. A., and WATTS, D. G. (1986). "Nonlinear least squares and first-order kinetics." Manuscript.

BELSLEY, D. A., KUH, E., and WELSCH, R. E. (1980). Regression Diagnostics. New York: Wiley.

BEVERTON, R. J. H., and S. J. HOLT. (1957). "On the dynamics of exploited fish populations." Her Majesty's Stationery Office, London. 533 p.

BOX, G. E. P. and COX, D. R. (1964). "An analysis of transformations (with discussion)." Journal of the Royal Statistical Society, Series B, 26, 211-246.

BOX, G. E. P. and HILL, W. J. (1974). "Correcting inhomogeneity of variance with power transformation weighting." Technometrics, 16, 385-389.

CARR, N. L. (1960). "Kinetics of catalytic isomerization of n-pentane." Industrial and Engineering Chemistry, 52, 391-396.

CARROLL, R. J. (1982). "Prediction and power transformation when the choice of power is restricted to a finite set." Journal of the American Statistical Association, 77, 908-915.

CARROLL, R. J. and RUPPERT, D. (1984). "Power transformations when fitting theoretical models to data." Journal of the American Statistical Association, 79, 321-328.

CARROLL, R. J. and RUPPERT, D. (1985). "Transformations in regression: a robust analysis." Technometrics, 27, 1-12.

COOK, R. D. and WANG, P. C. (1983). "Transformation and influential cases in regression." Technometrics, 25, 337-343.

COOK, R. D. and WEISBERG, S. (1982). Residuals and Influence in Regression. New York and London: Chapman and Hall.

DRAPER, N. and SMITH, H. (1981). Applied Regression Analysis, 2nd Edition. Wiley: New York.

DUAN, N. (1983). "Smearing estimate: a nonparametric retransformation method." Journal of the American Statistical Association, 78, 605-610.

HAMPEL, F. R. (1968). Contributions to the Theory of Robust Estimation. Ph.D. Thesis. University of California, Berkeley.

HAMPEL, F. R. (1974). "The influence curve and its role in robust estimation." Journal of the American Statistical Association, 62, 1179-1186.

HAMPEL, F. R. (1978). "Optimally bounding the gross-error-sensitivity and the influence of position in factor space." 1978 Proceedings of the ASA Statistical Computing Section. ASA, Washington, D.C., 59-64.

HOAGLIN, D. C. and WELSCH, R. (1978). "The hat matrix in regression and ANOVA." The American Statistician, 32, 17-22.

KRASKER, W. S. (1980). "Estimation in linear regression models with disparate data points." Econometrica, 48, 1333-1346.

KRASKER, W. S. and WELSCH, R. E. (1982). "Efficient bounded-influence regression estimation." Journal of the American Statistical Association, 77, 595-604.

MILLER, D. M. (1984). "Reducing transformation bias in curve fitting." The American Statistician, 38, 124-126.

RAO, C. R. (1973). Linear statistical inference and its applications, second edition. Wiley: New York.

RICKER, W. E. (1954). "Stock and recruitment." Journal of the Fisheries Research Board of Canada, 11, 559-623.

RICKER, W. E. and SMITH, H. D. (1975). "A revised interpretation of the history of the Skeena River sockeye salmon (Oncorhynchus nerka)." Journal of the Fisheries Research Board of Canada, 32, 1369-1381.

RUPPERT, D. (1985). "On the bounded-influence regression estimator of Krasker and Welsch." Journal of the American Statistical Association, 80, 205-208.

RUPPERT, D. and CARROLL, R. J. (1986). "Data transformations in regression analysis with applications to stock-recruitment relationships." To appear in the Proceedings of the Ralph Yorque Conference of Natural Resource Management. Springer: New York.

SNEE, R. D. (1985). "An alternative approach to fitting models when re-expression of the response is useful." Manuscript.