.
INFLUENCE AND MEASUREMENT ERROR IN
LOGISTIC REGRESSION
by
Leonard A. Stefanski
-e
A dissertation submitted to the faculty of
the University of North Carolina at Chapel
Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy
in the Department of Statistics
Chapel Hill
1983
Approved by
1Lb~~(~-e/
Advis r
JfAAn-i~'
LEONARD A. STEFANSKI.
Regression.
Influence and Measurement Error in Logistic
(Under the direction of Raymond J. Carroll and David
Ruppert.)
This dissertation concerns the use of logistic regression when certain
standard model assumptions are violated.
Chapters I and II study the prob-
lem of estimating regression parameters when covariates are subject to
measurement error.
The latter chapters study robust methods applicable
to logistic regression.
To facilitate study of the errors-in-variables problem a small measurement error asymptotic theory is developed. This allows comparison of
certain estimators which have appeared in the literature and also suggests
new estimators which are shown to have better asymptotic properties.
A
small Monte-Carlo study confirms the superiority of the new estimators in
certain settings.
In the course of studying the asymptotic behavior of
the various estimators interesting use is made of some random convex analysis.
To deal with the problem of messy data, i.e. outliers and extreme covariables, several bounded influence estimators are proposed.
The optimal-
ity properties of these estimators are studied in Chapter III.
Asymptotic
theory for the robust procedures is given in Chapter IV.
Finally, Chapter V
concludes the thesis with an application of these methods to two sets of
data.
ii
ACKNOWLEDGEMENTS
I would like to express gratitude to my co-advisors Raymond Carroll
and David Ruppert for their time and effort in directing my work and preparing me for a career in academic research.
I am also indebted to the
other members of my dissertation committee, I. M. Chakravarti, Ed Davis,
and Steve Marron for their patience in reading the manuscript.
I have benefited greatly from my contact with the faculty of the
Statistics Department at the University of North Carolina.
In addition
to committee members I would like to acknowledge the roles played by
-e
Professors Leadbetter, Cambanis, and Kelly, and the Department Chairman,
Professor W. L. Smith, in my training as a research statistician.
Special thanks are due to Ruth Bahr for her meticulous typing of
this manuscript.
Finally, I cannot overstate the importance of the support I received
from my wife and family throughout my schooling.
During my graduate career financial support was provided in part by
a George E. Nicholson Memorial Fellowship and by the US Air Force Office
of Scientific Research under Grant AFOSR F49620-82-C-0009 and National
Science Foundation No. MCS-8l00748.
iii
TABLE OF CONTENTS
Notation . . . . . • . . . . . . . . . . . . . . . . v
CHAPTER I:
INTRODUCTION . . . .
1
1.1
Logistic Regression.
1
1.2
Summary . . . . .
1.3
EIV: Introduction and Preliminaries.
5
COMPARISON OF EIV ESTIMATORS . . . .
9
CHAPTER II:
-_
. . 4
2.1
Asymptotic Theory . . . . . . . . . .• . . . . . . 9
2.2
Monte-Carlo
.
15
2.3
Proofs of Primary Results.
22
BOUNDED INFLUENCE ESTIMATION . .
34
3.1
Motivation. .
34
3.2
Influence and Sensitivity.
• 36
3.3
Related Results
· 40
3.4
Bounded Influence Estimators
47
BOUNDED INFLUENCE ASYMPTOTICS
60
CHAPTER III:
CHAPTER IV:
.
4.1
Introductory Remarks
.
60
4.2
Consistency and Asymptotic Normality.
61
4.3
One-Step Estimators . . .
• 67
4.4
Proofs of Major Results.
· 68
iv
APPLICATIONS OF BOUNDED INFLUENCE ESTIMATORS.
.89
5.1
Introduction. . . . . .
.89
5.2
Computation. . . . . . . . . . . . . . . • . . . . . 90
5.3
Examples.
CHAPTER V:
Bibliography. • . • . .
-e
.93
• • .124
v
NOTATION
t
F(t) = l/(l+e- ).
F(k)
( 0)
= kth
derivative of F(o)
T
X.x ..
J.
~(o,o,o)
-e
J.
a generic estimating equation.
IC~
the influence function of the estimator corresponding to
PN
empirical measure which assigns mass l/N to the points xl' ... ,x .
N
EN
expectation operator associated with the measure P .
N
ret)
= min(l,t)
£
t
~
0
= £(y,x,8) = (y-F(xT 8))x
met)
= max(F(t),
I-F(t)).
~.
CHAPTER I
INTRODUCTION
1.1
Logistic regression
Binary regression is concerned with modeling the relationship between
an explanatory vector x and a response variable y indicating the occurrence
(y=l) or nonoccurrence (y=O) of some event.
The statistical model employed
is usually of the form
-e
(1.1.1)
Pr(y=llx)
=
T
g(x 8),
where g(.) is a known function and 8 is an unknown vector of parameters.
Estimates of
8 are derived from a random sample (y.,x.) i=l, ... ,N. Models
1.
1.
of this type are quite popular and have applications in both the physical
and social sciences.
Several examples are given in the monograph by Cox
(1970) .
The model makes sense only if the range of g(.) is restricted to the
unit interval.
In many respects the simplest way of doing this is to choose
g(.) from the class of probability distribution functions.
The two most
popular choices are
(1.1.2)
g(z) = 4>( z)
-
(l/2n)
!.:
2
z
2
J exp(-t /2)dt, probit regression
_00
e
(1.1.3)
g(z) = F(z)
-
l/(l+exp(-z)) ,
logistic regression.
2
Probit regression has its origins in the work of Fechner (1860) and Muller
(1879).
Much of its popularity is due to Finney (1947a).
Berkson (1944)
defined the logit transformation and his work contributed heavily to the
prevalence of the logistic model.
similar results.
In practice the two models often yield
However, certain theoretical considerations, to be dis-
cussed shortly, suggest that the latter is appropraite in a wider variety
of settings.
This provides the rationale for limiting our investigations
to the logistic model.
In the following discussion it is necessary to distinguish between
intercept and slope parameters. This is accomplished by writing
T
T
and x. = (l,c.).
1.
1.
eT=
T
(a,S)
The various notation is used interchangeably without fur-
ther comment.
Much of the motivation for logistic regression derives from its connec-
·e
tion to normal theory discriminant analysis.
Suppose the random vectors
(y.,c.) are i.i.d. with the following distributional properties:
1.
1.
(1.1.4)
(1.1.5)
(c·ly·=l)
"" N(lll'~)
1.
1.
(1.1.6)
(c.ly·=o)
""N(llO'~)'
1.
1.
From (1.1.4)-(1.1.6) one finds that
(1.1.7)
Pr(y.=llc.) = F(a+c!S)
1.
1.
1.
where
T -1
a = log ( TIl / TI O) - ( lll~
(1.1.8)
S
= ~-1 ell l -1l0)
.
T-l
1l1-llO~ llO)/2,
3
Thus whenever the normal discriminant hypothesis holds, the conditional
distribution of y. Ix. follows the logistic model.
l.
l.
A more convincing argu-
ment is obtained by noting that (1.1.5) and (1.1.6) can be replaced by some
general exponential family assumptions on the density of c.,(Efron (1975)).
l.
These take the form
fCc. Iy. =1)
(1.1.5')
l.
l.
(1.1.6')
where L , L ' and n are parameters similar to ~l' ~O' and~. The conditionl
O
al distribution of y. Ix. is again logistic although identification of the
l.
l.
parameters in (1.1.8) altered.
Another property of the logistic model suggesting its use is the exis-
·e
For a sample of size N the likelihood can
tence of sufficient statistics.
be written as
N
exp (a
(1.1.9)
N
L yi + jI(I1
1
N
y. c . .) s. )
l. l.J
J
T
II (1+exp(a+c.S))
1
l.
from which it follows that
N
(L1
is no sufficient statistic
y.,
l.
N
T
T
L
y.c.)
1 l. l.
is sufficient for (a,S).
in the probit model.
There
Both Cox (1970) and Berk-
son (1951) note this in their arguments for logistic regression.
As a conse-
quence of sufficiency the likelihood scores are reminiscent of those for
linear regression.
Differentiating the logarithm of (1.1.9) with respect
to e results in the equations
N
(1)
L(y.-F.) (F.
1 l. l.
l.
where F. = F(x!e) and
l.
l.
IF. (l-F.))x.
l.
l.
l.
F~l)=
F(l) (x!e). The logistic distribution function
l.
l.
4
satisfies the differential equation F(l)
=
F(l-F); thus the likelihood
equations reduce to
(1.1.10)
r (y.-F(x.6))x.1
111
N
T
o.
=
A common method of solving (1.1.10) is iterative weighted least squares,
Walker and Duncan (1967).
Under sufficient regularity conditions
(1.1.11)
where
(1.1.12)
IN
•
~
-1
N
T
(1) 2
l x.x.(F.
111 1
) /F.(l-F.)
1
1
r F(1)(x~60)
N
= N- l
1
1
T
x.x.
1
1
The second equality in (1.1.12) follows from the identity F(l) = F(l-F).
·e
One final attractive feature of the logistic model is the linearity
of the log-odds ratio, i.e.
log [
pr(Y.=llx.). )
1 ~ = x.T 6 •
Pr(y.=Olx~)
1
1
1
This often enhances interpretation of the estimated parameter vector.
1.2
Summary
Although logistic regression finds applications in diverse settings,
historically much of the data analyzed were obtained under controlled experimenta1 conditions, e.g. dose response and bio-assay applications.
The cur-
rent extent of applications include the analysis of data obtained in observational studies.
Such data, as opposed to experimental data, are often
far from ideal in regards to compliance with standard model assumptions.
Often there are explanatory variables which cannot be accurately measured.
5
Thus the reported predictor can differ significantly from the true predictor
creating an errors-in-variables (EIV) situation.
Introductory material
for this problem is given in Section 3 of the present chapter.
In Chap-
ter II an asymptotic theory is developed which facilitates comparison of
various EIV-estimators.
The theory suggests new estimators with better
asymptotic properties.
A small simulation study confirms the superiority
of the improved estimators in certain settings.
Also characteristic of observational data are outlying responses and
extreme design points, either of which can have a deleterious effect on the
quality of estimates obtained by maximizing the likelihood.
In Chapter III
this problem is discussed in detail and certain robust estimators are proposed.
The asymptotic behavior of these estimators is studied in Chapter IV.
Finally, Chapter V concludes the investigation with applications of the robust methods to two data sets.
1.3
E-I-V:
Introduction and preliminaries
In the measurement error model the response variable y. is related
1.
to the predictor x. via the logistic model, i.e.
1.
i=l, ... ,N
where 8 is unknown.
0
Observable data consist
N
of the pairs (Y.'X.)l'
1.
1.
where X. has the formal representation
1.
(1.3.2)
X.
1.
= x.1.
1~
1
'\1'''''2
/'2
+ if'
E. m
1.
1
In (1.3.2)
matrix
*and
~~ is the square root of the sYmmetric positive semi-definite
(E.) are i.i.d. random vectors with identity covariance and
1.
zero first moments.
apparent later.
The significance of m, a known real number, will become
6
In most cases some components of x. are measured without error. View~
ing these as the first r components it follows that
*= (~
(1.3.3)
where it is assumed that
** is positive definite.
measurement error covariance
*is unknown,
Also, in most cases, the
so where necessary it is assumed
A
that a sequence of estimators (*) exists satisfying
(1.3.4)
With the model and notation established we turn our attention to the problem
of estimating the parameter vector.
As an estimate of 80 one might simply choose to ignore the measurement
error and regress y.1 on the observed X..
1
The resulting estimator, desig-
nated 8L , is_obtained by solving (1.1.10) with Xi replacing xi' i.e.
eL
satisfies
N
(1.3.5)
T
A
L (y.-F(X. ell )x.1 = o.
111
There are three major effects of measurment error on this procedure.
is bias, which increases as measurement error increases.
First
This has been
noted by Clark (1982) for logistic regression and by Michalik and Tripathi
(1980) in the context of discriminant analysis.
The second, which is
actually a consequence of bias, is the tendency to underestimate the probability of events with high probability and overestimate the probability of
events with small probability; it will be said that measurement error attenuates predicted probabilities.
testing.
Finally there is a problem with hypothesis
In the following chapter it is shown that the usual asymptotic
tests for individual regression components can have level higher than ex-
7
pected.
An
example where this occurs is an unbalanced two-group analysis
of covariance, where interest lies in testing for treatment effect and the
covariable is measured with error.
The severity of these problems depends, of course, on the magnitude of
the measurement errors.
ily.
In some situations 8 L might perform satisfactor-
However, when measurement error is substantial the performance of 8
is suspect and alternative estimators are required.
L
Some proposals follow.
To a great extent the study of linear regression EIV models is facilitated by the existence of closed form solutions to certain likelihood equations.
In particular this allows one to show without difficulty that the
naive linear regression estimate, obtained by regressing on the observed
Xi
converges in probability not to 8
0
but rather, using established nota-
tion,
(1.3.6)
where
N
(1.3.7)
=
· N-l \"'L x.x.T
11m
N
1
1
1
The near linearity of F(·) on [-3,3] (Cox (1970), pages 89-80) hints that
a result similar to (1. 3.6) might hold approximately for logistic regression
as well.
Combining this with ideas of Fuller (1980) suggests estioating 80
c satisfying
by 8
(1.3.8)
x.
lC
=
0
where
(1.3.9)
A
and ~X and ~ are the sample covariance and mean of the (Xi)·
This estimator
8
has been studied by Clark (1982).
She motivates it by noting that x
iC
is a Bayes-type estimate of x.]. given X..
].
Assuming that the errors in (1.3.2) are normally distributed and
is known, one can maximize the joint likelihood for (y.]. ,X.).
].
mator is then obtained by replacing
t
N
I1
AT
A
A third esti-
A
with~.
maximum likelihood estimate designated 8 .
F
(1.3.10)
t
This is a type of functional
The defining equations are
A
(y.-F(x. 8 )) x. = 0,
].]. F
].
(1.3.11)
i=l, ... ,N.
The final estimator presented in this chapter is also obtained by
assuming the errors in (1.3.2) are normally distributed and
-e
In this case a sufficient statistic for xi' given
(1.3.12)
T.]. (80'~) = X.].
+
t
t
~
is known.
and 80, is
80(Y·-~)/m
.
].
Again assuming normal errors, the conditional probability that (y.=l)
given
].
(1.3.13)
It follows that
which suggests solving the equation
N
(1.3.14)
T
A
L
(y.-F(T.(8,~)8))
1
].].
A
T.(8,t) = 0
].
to obtain 8 ' the sufficiency estimator. The performance of the various
S
estimators is studied in the next chapter.
CHAPTER II
COMPARISON OF EIV ESTIMATORS
2.1
Asymptotic theory
In this section an asymptotic theory is pursued which provides a
basis for comparing the various EIV estimates.
fixed and N +
00,
If in model (1.3.1) m is
then all the estimators are generally inconsistent. Fur-
thermore, the resulting bias terms are complicated functionals of the
error law and give little insight into the construction of good
-e
tors.
estima-
Fixing N and letting m+ oo , all the estimators reduce to the same
quantity.
Thus, to obtain useful insight into the behavior of the esti-
mators, it seems reasonable to allow m, N +
00
simultaneously.
A similar
brand of asymptotics has been successfully employed by Wolter and Fuller
(1982) and Ameniya (1982) to nonlinear errors-in-variables models.
Model (1.3.1)-(1.3.2) is appropriate for two situations as min(m,N) +00:
(i)
m independent replicates of (x.)
exist in which case the (E.)
be1
1
come effectively normally distributed and
(ii)
a local model in which measurement error is small but nonnegligible.
In the latter case the moments of order greater than two of
diff~r
(E
i
) generally
from those of a normal variate.
The first problem to be dealt with is that of consistency.
tence of a consistent estimator is not an issue here.
The exis-
With a fairly simple
10
argument one can verify that 8
and 8
c
L
design conditions provided min(m,N)
+
00.
are consistent under reasonable
However, the dimensionality of
the functional likelihood grows like O(N) and it is not immediately evident that 8
as 8
F
is consistent.
Conditions insuring consistency of 8
F
as well
and 8 are given in Theorem 2.1. All proofs are deferred to SecC
tion 2.3. The sufficiency estimator presents a minor problem in that equaL
tion (1.3.4) can have multiple solutions, not all of which produce a consistent sequence.
To guarantee uniqueness and consistency as min(m,N)
+00
In the theois defined to be the solution to (1. 3.14) closest to 8 -.
L
S
rem IN(y) denotes the information matrix evaluated at y, i.e.
8
T
(2.1.1)
X.x.
1
Also, for
1
a positive definite matrix M, Al(M)
~ A2(M)~
.•. ~ AJM) denote the
ordered eigenvalues of M.
Theorem 2.1
(AI)
(A2)
N
2
Illx.11
1
Assume
=O(N),
1
max Ilx.11
1
2
=
o(N);
l~i~N
There exists some 0 > 0 such that
lim inf
N
and
If min(m,N)
+
00
then 8 ,
L
c,
8
8
F
and 8 converge in probability to 80 ,
S
The consistency results provide the first step toward computation of
11
asymptotic distributions.
In the next theorem limiting laws are given for
the various estimators assuming min(m,N)
-+
in such a way that N/m 2 -+ A2< 00
00
In this setup all the estimators have the same asymptotic covariance matrix
but different asymptotic biases.
These biases provide a way to compare the
estimators as well as to verify the attenuation and hypothesis testing difficulties mentioned earlier.
Theorem 2.2
1
(a)
Assume (AI), (A2), and (A3).
L
A
~(8 -8 ) => N(-AI
L
0
-1
~,
r -1 ) ,
~
f{F (1) (xiT80 ) t
(2.1.2)
If 8
If min(m,N)
-+
00
and
where
T
0 + 80
*
80 F
(2)
T
(xi 80 )x/2};
where
(b)
(2.1.3)
(c)
N(-AI
(2.1.4)
=
c
L
-
-1
1
cc' r-) where
-1 ~{
T
lim N
L A(x. -x) F (x. 8 )
NIl
1
0
+
- T T
(1) T
x. (x. -x) A 8 F
(x. 8 )}
11
0
1
0
N
x =
I1
x.
1
L:
=
x
lim N- l
N
N
T
L x.x.
1 1
1
(d)
Remarks
Two comments are in order.
First, Theorem 2.2(a) provides
an asymptotic theory for logistic regression when there is no measurement
error by simply taking ~ = O.
Second, in the asymptotic theory developed
•
12
here, the estimators differ only in their limiting bias.
From this per-
spective the sufficiency estimator is necessarily best since it has no
limiting bias.
However, the theorem also suggests a strategy for construct-
ing improved estimators which are capable of competing with the sufficiency
estimator.
This topic is pursued after a discussion of the attentuation
and hypothesis testing problems mentioned earlier.
Consider simple logistic regression through the origin with 8
0
>
o.
One expects to see attenuation in the naive estimator, i.e. a negative bias
term in (a).
For most designs this is true.
Somewhat surprisingly and
completely at odds with the linear regression case, -AI
tive.
L can be posiSince F(2) = F(1)(1-2F)
Consider an individual summand in (2.1.2).
(2.1.5)
2
t 8 + 8 ~
o t
oo
F ( 1) (x 8 )
-1
c
F( 2) (x 8 ) x/2
0
·e
The leading term ~ 80F(1)(X 8 0 ) is positive by assumption.
For It I large
the function (1+t(1-2F(t))/2) is negative, indicating that (2.1.5) is negative for IX801 large.
Thus one situation in which _AI-
l
c
L
is positive
arises when most events have very high and/or very low probabilities of
occurrence.
It is easy to show that -AI
-1
c
F is strictly positive.
Thus
Theorem 2.2(b) suggests that the functional m.l.e. tends to overcompensate
for the effect of measurement error.
T
i
Consider a two-group analysis of covariance, x. = (1,(-1) ,d.),
1
e~
1
(800,801,802). The covariable d i is measured with error having variance 0 2
Often interest lies in testing hypotheses about the treatment
=
. A standard method of testing 8
= 0 is to compute its logis01
01
tic regression estimate and compare it to the usual asymptotic standard
effect 8
13
error.
Theorem 2.2(a) suggests that this test approaches its nominal level
only if the second component of 1
-1
of 1
-1
c
L
is zero.
Denoting the second row
by s2 it is seen that the correct level is achieved only if
(2.1.6)
This will not hold in the common epidemiologic situation in which the true
covariables are not balanced across the two treatments.
Thus when substan-
tial measurement error occurs in a nonrandomized study one can expect bias
in the levels of the usual tests.
tic regression.
Similar results hold for multiple logis-
Of course, in a randomized study (2.1.6) will be true,
so that the ordinary tests would be appropriate.
Since (2.1.2) and (2.1.3) are fairly simple expressions it is possible
to construct corrected versions of
eL
under the conditions of Theorem 2.2.
and
er
which have no asymptotic bias
Several modifications are possible.
For the usual logistic regression estimate it is simplest merely to subtract an estimate of the bias.
(2.1.7)
-1
I (y) = N
N
Write
N
L F(l) (x!y)
1
1
T
X.X.
1
1
and
(2.1.8)
IN(y)
-1
=N
N
{IP(l)CX!y)
L
1
1
+
Cl/2)p C2 )cx!y)X.yT}
The modified usual logistic regression estimator,
1
e
1
is defined as
LM
(2.1.9)
A different approach is taken for the functional m.l.e.
tions (1.3.10) and (1.3.11) are modified so that
e
FM
The defining equa-
is obtained as the
14
solution to
N
AT
A
A
L
(y.-F(x· M8 ))
1
1
1
FM
(2.1. 10)
= X.1
(2.1.11)
"
A
TA
* 8
+
X1· M = 0,
FM
(y.-F(X.8 pM ))/m
1
1
i=l, ... ,N.
The proof that 6LM and 6
are asymptotically unbiased under the condiFM
tions of Theorem 2.2 is routine though rather tedious and will not be presented.
However, for future reference the result is stated as
Theorem 2.3
Under the assumptions of Theorem 2.2 the modified esti-
A
-e
A
mators 8 , 8 , and the sufficiency estimator 8 all have the same limi tFM
S
LM
ing distribution.
The modified estimators are viable competitors to 8 under the condiS
2
2
tion N/m + A. In the next result their performance is studied under the
weaker condition N/m
Theorem 2.4
4
A2
+
<
00
The proof is sketched in Section 2.3.
0
Assume (AI), (A2), (A3) and
(E.) have zero third moments and
(A4)
1
n
for some
4
2
If N/m + A <
A
00
as N +
then 8
00
>
LM
<00
o.
A
, 8
4 n
E(IIE:.11
+ )
1
A
(6(0)-6 0), are asymptotically normally distributed with covariance I
bias terms of the form -AI
-1
c(.).
For the sufficiency estimator,
s = (1/24)8~ ~ 80
(2.1. 12)
C
x lim N-
N
+
T
l
I {4F(3)(X~80) ~~(Q-3I)
1
k
1
k
1,
, and 8 ' when placed in the form N'2
FM
S
8 $z(Q-3I) *z8 0F
0
(4)
T
(xi 80 )xi }
-1
and
IS
where Q is a matrix of fourth moments satisfying
(2.1.13)
The other bias terms are extremely complex.
Theorem 2.4 provides information on two issues.
Remarks
First, one
can expect the modified estimators to improve on their unmodified versions.
This is confirmed to some extent in the simulation.
Second, the asymptotics
illustrate the effect of nonnormality on the sufficiency estimator.
the errors
(E
i
) are normally distributed, Q=31 and the bias term
When
s= 0 •
C
Thus in large scale studies with normally distributed measurement errors
the sufficiency estimator should perform quite well.
Equation (2.1.12)
suggests that the sufficiency estimator will have less optimum behavior
for decidedly nonnormal measurement errors.
Before discussing the Monte Carlo results a final comment is in order.
The modified estimators weaken the necessary condition for asymptotic normality from N/m
2
on the error law.
definitely.
=
0(1) to N/m- 4 = 0(1) at the expense of stronger conditions
As might be expected it is possible to play this game·in-
With appropriate assumptions on the first 2k moments of the
error law one can construct a modified version of the naive estimator, eL ,
2k
for any positive in= 0(1)
which is asymptotically normal provided N/m
teger k.
However, the potential for application of these results is margin-
al and they are not pursued further.
2.2
Monte-Carlo
A limited Monte Carlo study was performed to gain insight into the
following questions.
Do the corrected estimators have any practical value?
16
Under what conditions is 8C a viable choice?
To what extent is the asymp-
totic theory a reliable guide to the performance of the estimators, in particular that of
eS ?
The model for the study was
Pr(y.=llx.) = a+ Sx.,
i=l, ... ,N.
111
2
The following sampling situations were employed, where Xl denotes a chisquared random variable with one degree of freedom:
(I)
(II)
...
(a,S)=(-1.4, 1.4), (x.)
1
2
~N(O,cr
X
= .10), N = 300, 600;
2
(a,B)=(-1.4, 1.4), (x.)
~cr (Xl-l)/IZ,
1
X
N
= 300,
600.
2
For each case the measurement error variance cr was one third the
2 2 2
variance cr of the true predictors (cr = cr /3). Also, two sampling disx
x
2
tributions for the measurement errors (£.) were considered: (a) N(o,cr )
1
2
and (b) a contaminated normal distribution, which is N(O,cr ) with probabil2
ity 0.90 and N(O,25cr ) with probability 0.10.
These two sampling situations are felt to be realistic, but of course
their representativeness is limited by the size of the study.
sizes N '" 300, 600 may appear large.
TIle sample
However, there is much interest in
large scale epidemiologic studies where such sample sizes are common.
For
example, Clark (1982) was motivated by a study with N > 2000; Hauck (1983)
quotes a partially completed study with N > 340, and the present research
is in response to the need to analyze certain Framingham data (Gordon and
Kannel (1968)) where N > 550.
Furthermore, the results of the study sug-
gest that correcting for measurement error in most small sample situations
is unwarranted.
17
The values of the predictor variance
0
2
x
and the measurement error
variance are similar to those found in the Framingham data mentioned earlier
when the predictor was log ((systolic blood pressure-75)/3), a standard
e
2 2
transformation. The ratio 0 /0 ~ 1/3 is not uncommon; Clark (1982) finds
x
nearly this ratio in her study of triglyceride.
from Framingham data as well.
The choice of (a,S) comes
All experiments were repeated. 100 times.
In the experiment two independent replicates (X
each x.; thus m= 2.
1
,X ) were taken of
il i2
is in this case a scalar, estimated by the sam-
~*/2
pIe variance of (X. l -X. 2)/2, while ~* + **/2 is also a scalar, estimated
1
1
X
by the sample variance of (X
)/2.
i2
forms of the estimators were studied.
il
+X
The following simple computational
1)
Usual logistic regression solving (1.3.5);
2)
Clark's linearized estimator (1.3.8);
3)
A one-step version of the functional maximum likelihood estimate.
On the right side of (1.3.11) replace xi by Xi and 8
taining a new x.'
1
F
by 8 ,
L
ob-
Then solve (1.3.10);
4)
Corrected usual logistic regression (2.1.9);
5)
A one-step version of the corrected functional estimate.
On the
right hand side of (2.1.11) replace 8FM by 8
obtaining a new xiM '
L
Then solve (2.1.10).
6)
A version of the sufficiency estimator obtained by solving (1.3.14)
A
A
but with T.1 (8,*) replaced by T.1 (8 ,
L
A
t).
It can be shown that the one-step estimators defined in (3), (5), and (6)
and their fully iterated counterparts differ
asymptotically
only by a con-
stant term, e.g. the one-step version of 8 is also asymptotically normal
S
18
provid~d N/m4 ~ A2 ; however, the bias term generally differs from (2.1.12).
Results of the study are summarized in Tables 2.1-2.4 for the slope
parameter
B.
"Efficiency" refers to mean square error efficiency with re-
spect to ordinary logistic regression without correcting for error.
Although
sweeping conclusions are not possible, some qualitative suggestions can be
made.
First, the ordinary logistic estimator is less variable but more
biased than the others.
Large sample studies are such that bias dominates
mean square error and hence are candidates for using corrected estimators.
The opposite conclusion holds for small samples where variance dominates.
A second suggestion from the tables is that when the usual logistic
estimator loses efficiency (Case I (b), II (b), and when N = 600), the corrected
estimators perform quite well.
To some extent these numbers justify the
asymptotic theory of the previous section, without which the corrected
estimators would not have been found.
Clark's estimator performs very well in this study when the true predictors are normally distributed (Case I), but it does have a sizable drop
in efficiency when the predictors are highly skewed as in the chi-squared
Case II.
To some extent this is expected because the estimator is based on
an assumption of normally distributed predictors.
Somewhat surprising is
that the one-step functional and sufficiency estimators performed so well
when the measurement errors are not normally distributed (Cases I(b), II(b)),
as both are defined via an assumption of
norm~l
errors.
Before proceeding with the proofs of the next section some concluding
remarks are appropriate.
The asymptotic theory developed in this chapter,
which is interesting in itself, has proved useful in two ways. First, heuristically it provides a better understanding of the measurement error model and
it suggests a problem for further study, namely, in what situations can one
19
expect the usual inference ignoring measurement error to be of the wrong
level, i.e., at what point does increased bias overwhelm decreased variance?
Secondly, the theory was used to construct two new estimators with
reasonable large sample properties and enabled study of the sufficiency
estimator under nonnormal conditions.
All three of these along with Clark's
estimator performea well in the small
Monte-Carlo study.
The pressing practical problem now appears to be to delineate those
situations in which ordinary logistic regression should be corrected for its
bias.
Studies of inference and more detailed comparison of alternative es-
timators will be enhanced by the identification of problems where measurement error severely affects the usual estimation and inference.
TABLES
These are the results of the Monte-Carlo study.
"Efficiency"
refers to
mean square error efficiency with respect to ordinary logistic regression.
e
e
e
CASE lea)
TABLE 2.1
(0.,8)
2
= (-1.4,1.4), (x.)
are N(O,o = .10),
1
x
Ordinary
logistic
N
N
= 300
Corrected
logistic
Functional
(e.) are N(O,o
1
Corrected
functional
Clark
2
= a x2/3)
Sufficiency
Bias
-0.21
-0.04
-0.05
-0.06
-0.02
-0.06
Std. Dev.
0.52
0.61
0.61
0.60
0.61
0.60
Efficiency
100% *
85%
85%
87%
84%
88%
= 600 Bias
-0.22
-0.05
-0.05
-0.06
-0.02
-0.06
Std. Dev.
0.33
0.38
0.38
0.38
0.38
0.38
Efficiency
100% *
108%
106%
108%
107%
108%
CASE l(b)
TABLE 2.2
-------
Same as Case lea) but measurement errors (e.) have a contaminated
1
normal distribution
N= 300 Bias
-0.49
-0.16
-0.19
-0.20
0.02
-0.20
Std. Dev.
0.34
0.48
0.48
0.46
0.54
0.46
Efficiency
100%*
143%
139%
142%
121%
143%
N = 600 Bias
-0.53
-0.20
-0.21
-0.22
-0.03
-0.22
Std. Dev.
0.24
0.33
0.34
0.33
0.38
0.33
Efficiency
100%*
223%
215%
216%
234%
216%
t-J
o
*By definition
e
e
e
CASE II (a)
(a,S)
TABLE 2.3
N
=
2
2
are 0 (X1-1)/I2,o
(1.4,1.4), (x.)
1
x
X
=
0.1, (E: 1.) are N(0,0
2
Ordinary
logistic
Corrected
logistic
Functional
300 Bias
-0.28
-0.05
-0.07
-0.08
0.10
-0.08
Std. Dev.
0.47
0.58
0.57
0.56
0.66
0.56
Efficiency
100~~*
90~6
91%
93%
69%
93%
=
Corrected
Functional
Clark
=
i/3)
X
Sufficiency
N = 600 Bias
-0.27
-0.03
-0.04
-0.05
0.11
-0.05
Std. Dev.
0.33
0.41
0.41
0.40
0.45
0.40
Efficiency
100%*
111%
110%
112!1o
85%
112%
CASE IT(b)
TABLE 2.4
Same as Case II(a),
but measurement errors have a contaminated
normal distribution
N
=
300 Bias
Std. Dev.
Efficiency
N
=
I
-0.43
-0.13
-0.15
-0.17
0.12
-0.17
I
0.33
0.44
0.45
0.43
0.53
0.43
100%*
141 96
134%
140%
103%
141%
600 Bias
-0.46
-0.15
-0.16
-0.18
0.10
-0.18
Std. Dev.
0.25
0.33
0.34
0.33
0.40
0.33
Efficiency
100%*
201%
190%
193%
159%
194%
N
~
*By definition
22
2.3
Proofs of primary results
Because the number of unknown parameters increases with increasing
sample size the classical results on consistency and asymptotic normality
of maximum likelihood estimates are not immediately applicable.
The proof
of consistency involves an interesting application of random convex analysis, that for normality is accomplished with more usual large sample theory
techniques.
Proof of Theorem 2.1:
for 8
+
co
Of the four estimators, consistency is proved
and. 8 employing assumptions (A.l)-(A.3) and the condition min(N,m)
F
L
The proof for 8e is similar while the convention regarding multiple
solutions to (1.3.14) insures consistency of the sufficiency estimator.
l'
l'
l'
Write x. = (u.,v.)
111
where v. corresponds to the components of x. measured with error. Analo1
1
AT
l' AT
A
l'
l' l'
gously, write x. = (u.,v.) for x. given in (1.3.11) and X. = (u.,V.) where
1
1
1
1 1
1
1
To present the proofs some notation is needed.
l'
-1
V. = V.+E. and E(E. E. ) = m ~*' Let H(·) = log F(') and note that H(')
111m
1m 1m
1
has Lipschitz norm one and IH(t) I : ; 1 + It I for all t E R . Finally let
N
(2.3.1)
gN(Y)
l
= N-
(2.3.2)
GN(y)
= N-
(2.3.3)
N
L (y , {v i } 1)
N
N
l'
1
T
+ (1-y.) Il(-x.y}
= N- L {Yi If(x.y)
1
1
1.
i=l
l'
l'
l'
l'
+ F(-x 8 ) H(-x.y) }
{F (xi 80 ) H(x.y)
1
1
i 0
i=l
L
AT
AT
{yo H(x.y) + (l- Yi) H(-x. y)}
1
1
1
i=l
1 N
L
N
l' A_I
-1
N (m/2) L (V.-v.) ~* (V.1 -v.1 ) .
. 1 1 1
1=
In defining (2.3.1) and (2.3.3) and elsewhere in the proof the sequence
{vi}~ is used both to represent the true but unknown predictors, (2.3.1)
23
and as mathematical variables in the argument of the function L , C2.3.3).
N
The context should make clear which interpretation is appropriate.
L is
N
the normed functional log-likelihood assuming normal errors and replacing
** by its consistent estimate
~*; thus by definition
C2.3.4)
for all y
E
P
N
Rand {vi}l
E
R
N(p-r)
.
But C2.3.4) implies that
Thus Or maximizes the random concave function GNC·).
Note that since G C·)
N
is defined in terms of xi it depends explicitly on OF also.
However, this
does not affect the validity of the inequality
·e
C2.3.5)
The naive estimator 0L also maximizes a certain random function. Specifically,
The function gN C·) is concave for each Nand CA.I), via the inequality
IHCt)1 ~l+ Itj, implies that for each fixed y, {gN Cy )} is a bounded sequence
of real numbers. Although our assumptions do not imply that {gN C·)} converges,
it is true that every subsequence contains a further subsequence converging
uniformly on closed bounded subsets of R P to some finite concave function
(Rockafellar Thm. 10.9)
Assumption CA.2) insures that the limit of every
convergent subsequence possesses a unique maximum at
moment that GN(y) - gN Cy )
A
= 0pCl)
for each fixed y.
°
0
.
Suppose for the
Pick any subsequence
A
{OF N}
, k
from {OF N} and let {gN C·)} be the corresponding subsequence from
'
k
24
{gN (.)} .
Now from{gN (.)} one can always choose a further subsequence,
k
(.)} which converges to some concave function g(.) with a unique max-
{gN
k, j
imum at SO'
Of course this implies G
N
(y) - g(y) = Opel)
and
since
k ,j
S
maximizes G (.) an appeal to Theorem 11.1 of Anderson and Gill
F,Nk,j
Nkj
(1982) implies SF N - So = 0 (1).
, k,j
P
This shows that every subsequence of
{SF , N} contains a further subsequence which converges in probability to
'"
So which in turn implies SF,N- So = Opel).
Thus to prove consistency of
SF one need only show GN(y) - gN(y) = Opel) for fixed y.
Similarly, con-
N
sistency of 8
is established by showing LN(y,{Vi}l) - gN(y) = Opel).
L
complete the task we start with:
Proposition 2.3.1
Assume (A.l) and suppose min(N,m)
-+00 ,
To
then
·e
Proof:
TIle quantity under investigation may be written as T
l
+
T
2
where
N
T
T
-1
T
T = N
L {y.1 H(X.1y) - F(x.1SO) H(x.y)},
1
l
i=l
l
T = N2
N
L
i=l
T
T
T
{ (l-y . )H(- X. y) - F(- x. SO) H(-x.y)}.
111
1
Furthermore
~
T
T
-1 N
T
T·
) H(x.y)}
{y.[H(X.y)-H(x.y)]hN
L {(y.-F(x.S
1
1
O
'1
1
'111
1
1=
1=
L.
The Lipschitz condition on H implies that
ITlll::;N
-1 N
T
-1 N
L
!(X.-x.)
y/::;IIYIIN
. 1 1 1
. L1 11£·1m "
1=
1=
25
~
The last expression is Opel) provided min(m,N)
T has zero mean and
12
00.
variance
which vanishes in the limit in view of (A.l).
identical argument T
2
=
Thus T
l
= 0
p
(1) and by an
Opel) concluding the proof.
A
In addition to proving consistency of 8
Proposition 2.3.1 yields
V
the following two useful corollaries.
Corollary 2.3.1a
Proof:
From (2.3.1) and the definition of H(')
IgN (8 0 ) I ~ 2
Corollary 2.3 .lb
1
s up It log
O<t<l
N
Pr{N- (m/2)
A
L IIV.-v·11
i=l ]
t
2
~ Iltlj}~ 1.
1
Proof:
(2.3.6)
LN(eF'{~i}~) ~ LN(eO,{vi}~)
By definition
-1
N
I~ 1
NATA
L {y.H(X.8 F)
i=l 1
1
or equivalently
AT A
+ (l-y.)H(-X.8 )} ~
1
1
F
N
-1
N A T A_I
A
LN(8 0 ,{v'}1) +N (m/2) L (v.-v.) ~* (v.-v.).
1
"Ill
11
1=
N
Since the L.B.S. of (2.3.6) is almost surely nonpositive and Pr{L (8 ,{v }1)
i
N 0
< -I}
~
0
it must be that
Pr{N
-1
(m/2)
N
A A-1
L (v"-v.)~*
. 1 1 1
1=
A
(v.-v.) ~ I} ~ 1.
1
1
26
**
A
The conclusion follows from the consistency of
and an application of
2
T
the inequality IItl1 : : ; IIAII t A-I t, true for all positive definite matrices A.
We are now in a position to complete the proof of consistency for
Proposition 2.3.2
Proof:
In addition to (A.l) suppose min(m,N)
~
00
eF .
and
In light of Proposition 2.3.1 it suffices to show GN(Y) -
N
LN(y,{Vi}l) = op (1).
Write this last quantity as
l
WI = N-
N
L
AT
{y. (H(x. y)
1
i=l
1
where
Wl +\'I2'
H(X~Y)) },
1
N
T
W2 = N- l L {(l-y.1 )(H( -X.1 y)
i=l
AT
H( -x. y))}.
1
The Lipschitz condition on H(') and Schwarz's inequality imply that for
j=1,2
IW·I
J
(2.3.7)
::::;
l
= II y II NThe r.h.s. of (2.3.7) is
N
N- l
L
i=l
N
I
i=l
0
p
A T
I(X. -x.) y I
1
A
Ilv.-v·1I
1
1
1
::::;
IIYIIN-
l
A
Ilx.-x·II
1
1
tl
L
i=l
::::;
l
IIyll {N-
N
L
i=l
'" 2 1
Ilv. -v. II }~
1
1
(1) by virtue of Corollary 2.3.1 b, and this com-
p1etes the proof .
•
Next we prove Theorem 2.2, parts (a) and (b) and sketch the major steps
in the proof of Theorem 2.4 for the sufficiency estimator.
Proofs of the
other results are nearly identical.
The proof of Theorem 2.2 is developed via a series of lemmas.
In each
'"
case the conditions of Theorem 2.2 are assumed as is the consistency of ~.
27
Recall that x. is defined in (1.3.10) and (1.3.11).
1.
.
•
-1 N
Def1.ne DN = N
Lemma 2.3.1
A
L y.(x.-X.).
I 1. 1 1.
N
L p(l) (x!8 0 )
1
1.
+ 0
P
(1).
1
t
N~DN =
Proof:
1
A
-1 N
Write
N
where
Al
1/
N
AT A
8pm- N-~ L y.(y.-F(x.8 p ))'
I 1. 1.
1.
AT
A
L y.1. (y 1.. - F (x.1. 8 p ))
I
-1 N
=N
= Al + A
2
T
Ly.(y.-F(x.8 0 ))
1. 1.
1.
I
The difference between Al and its expectation is 0p(I),
(2.3.8)
Al = N-
I
that is
N
L p(I)
(Xr 80) + 0p(I)
.
I
Also
(2.3.9)
S;
N-
l
I
S;
N-
f ill\-xill 11 11 + II~eFll 11 11 /m+ Ilxill lloo-GplP
I {1I*~£ill 11801l/m~+ "*8 l 11 80 11 /m+ Ilx II 11 80- 8F II}
N
8
8
0
N I l
AA
A
Lemma 2.3.2
A
p
and clearly this last term is opel).
and the consistency of
0
t
Pinally an
appe~I
to (2.3.8), (2.3.9)
A
and 8p completes the proof.
.
Define ~-
.
i
-~
_IN
=N
~
f
T
AT
A
(P(X.S)X.-F(x.B)x.).
1.
1.
1.
1.
28
Proof:
yields
N~ = N-!z~1l
(2.3.10)
(X. -xA.) F(XT• 6 ) + N-!z~L (X.-X.)
A T 6 X.F (1) (X.6
T )+N _!zNL r.
0 1
11
1 0
111
1 0
1
1
where
* and 6F implies N-~ I
A
Consistency of
1
A
N
1
II r. II
= 0
1
P
(1).
The first term on the
r.h.s. of (2.3.10) equals
(2.3.11)
With an argument similar to the one used in Lemma 2.3.1 we can replace
ATA
T
x i 6 by x 6 in each of the summands in (2.3.11) altering it only by a
i 0
F
term which
(2.3.12)
The normed sum in (2.3.12) has zero mean and asymptotically negligible
variance; thus the first term in (2.3.10) is
0
one can show the remaining term in (2.3.10) is
1 N
p
(1).
0
p
In a similar fashion
(1), finishing the proof.
T
k
Define T N == N- L (y.-F(X.6 0))X .. Then N4r L N conL,
111
1
,
verges in law to a multivariate normal random variable with mean -Ac
L
and covariance matrix T.
Lemma 2.3.3
Proof:
kIN
N4r L,N
=
N-~
I
1
1
+ N-~
N
T
(y.-F(X.6
)(X.-x.)
1 101 1
T
L (y.-F(X.6
))x.
1
1
1 0
1
29
By expanding F(Xieo) in a Taylor series around xIe o in each of the above
sums we find, after recombining terms
k
1
N
1
N
- N
-'2 \'
(X.-x.)(X.-x.)T eoF(l)
~
1
1
- N
X.1.
1.
N
-'2 \'
~
1
where
T
-'2 \'
N~L , N = N 1~ (y.-F(x.eO))x.
1.
1.
1.
(2.3.13)
1.
1.
T
1.
(X.-x.)
eOF
1.
1.
(1)
(XT1.. eo)
T
(x.eO)x.
1.
1.
,....
and
X.1.
are on the line segment joining X. and x..
1.
1.
The third term
in (2.3.13) is 0 (1) by virtue of its zero mean and vanishing variance,
p
since m +
-e
(2.3.14)
The first term can be written as
00.
N -~
~~ (y.-F(x.eO))x.
T
T
+ N-~ ~~ (y.-F(x.eO))(x.-x.),
1
1.
1.
1.
1
1.
1.
1.
1.
and the second term in (2.3.14) also has zero mean and asymptotically negligible variance.
Assumptions (A.l), (A.2) and an appeal to the Lindeberg
Central Limit Theorem are used to show that the first term in (2.3.14) is
asymptotically normally distributed with zero mean and covariance matrix I.
Write the fourth term in (2.3.13) as B +B where
l 2
1
Bl - -N -'2
1
B2
=
-N -'2
IN ((X.-x.) Teo) 2
1
N
I
1.
1.
F(2) (xieo) x./2,
1.
T 2 (F(2) (ii'e ) - F(2)
((x.-x.)
8 )
1.
1.
o
0
1
(Xi80))
x./2
1.
.
Assumption (A.l) and the 2+0' moments of IIEill imply Bl-E(B l ) = 0p(l). As
for B , the inequality IF(2)(x)_F(2)(y) I ~ min(1,3Ix-YI) can be used to con2
clude
30
and hence (A.l) implies
The Dominated Convergence Theorem together with the Markov Inequality
are used to show that 8
2
converges to zero as m +
00
and thus
Similarly, the second term in (2.3.13) can be shown to equal
Combining the preceding facts and noting the definition of c
L
establishes
the desired result.
-e
Proof of Theorem 2.2:
In each summand appearing in (1.3.5) and (1.3.8)
apply the mean value theorem to P(·) to arrive at
where
N
l
.)
SL = N- L p(l) (x:e
1 L,1
~
I
~
Sp
Por each i ,
8L,1.
and
-1
=N
8p ,1.
N
p(l) (~:e .)
L
1 P,1
1
T
X.X.
1 1
" "T
1 1
x.x.
"
1 ie on the line segments joining 8 - and 8p to
L
,.."
80 respectively.
In light of the previous results we need only show that SL
,.."
and Sp converge to I in probability.
,.."
This is proved for SL
only, a similar
31
.....
demonstration works for SF as well.
= (;/m)
Since X.-x.
1
1
k
~£.
1
and by assumption m ~
00,
it is not difficult to
show
S _
N- l
L
N
\' F(l)
f
(x:eL, .)
1
1
x. x.T
1
1
= 0 P(1).
Thus, omitting terms of order 0 (1)
P
(2.3.15)
...
1N(6)
0 - SLU
N
=
1 (6)
N- l \' F(l) (xT......6
.) x.x.T
N 0 1 LU,l
1 1
f
The norm of the r.h.s. of (2.3.15) is bounded by
(2.3.16)
~
-e
A
Lemma 2.3.3 and (A.2) imply N (6 L - 60 ) = Opel) and hence (A.l) implies that
(2.3.16) is 0p(l). By assumption I (6 0 ) ~ 1 which in turn implies SL ~ 1
N
completing the proof.
We now outline the proof of Theorem 6 for the sufficiency estimator.
Proof of Theorem 2.4 (Sufficiency Estimator):
After a preliminary
expansion we arrive at
=
SI
+
SII
+
SIll
+
SIV
+
SV
where
1
(2.3.17)
SI
N
= N-~ L
1
SII
SIll
=
T
[y. -F(x. 6 )]X.
1
1 0
1
N
T )X.
_6 T ~ 6 m-1 N-~ L(y.-~)F (1) (X.8
a
0
1 1
1 0
1
I
T
)2m-2N-~ F(2)CX:6 )X./8
= (6 o ; 6O
Il 0 1
+ 0
P
(1)
32
S IV
=~
l
IN
eO m- N-'2
T
L (y.-~)
[y.-F(x.e )]
1
1
1
1 o
1
"'-
In deriving (2.3.17) we have used the fact that N'2(~-t)
= 0p (1).
By using
an argument similar to one employed in Lemma 2.3.3. we can write
(2.3.18)
SI
=
N
T
1
L (y.-F(X.e ))
111 O
1
N
N-'2
- N-'2
3
l. l
1 j=O
-
N-~
x.
1
T
C)
T
I 1
1
x.
((X1.-x1·)Teo)j F(j) (xT1.eO)/j ! + 0p(l) .
Ij=l
Because of the bound on the 4+<sth moment of
-e
.
(X.-x.)((X.-x.)
1 1
1 1 eo)J F J (x1.eO)/j!
liE.1 II
replacing the last two
terms in (2.3.18) by their expectations alters 51 only by a term which' is
0p(l).
Thus, writing F(k) for F(k) (xIe )
o
(6.15)
I {F(l)* eo + m- 1 ~~Q *~eo(e6*eo) F(3) /3!}
- N-~-le~teo I {xi F (2)/2 + m-1e6t!2Qt~eoXiF(4)/4!}
-
N
N-~m-1
1
+
0
p
(1)
.
Q is the matrix appearing in (2.1.13) and ZN has a limiting N(O,I) distribution.
Similarly,
33
SIll
S IV
SV
= -(eToteO)2
N-~-2
I x.p(2) /8
1
~
+ 0
P
I1 {p(1)-m-1(ebteo)p(2)(F-~)/2}
(1)
= -ebteoteo N-~-2 I p(l) /4
=
*eoN-~m-1
+ 0
1
P
Combining these terms and using the identities
p(3) = _3P(2)(p_~) _ p(l) /2
p(4) = _4P(3)(p_~) _ p(2),
we find
-e
(1)
proving the theorem.
+ 0
P
(1)
CHAPTER III
BOUNDED INFLUENCE ESTIMATION
3.1
Motivation
As in the preceding chapters our interest lies in estimating 8
0
in
the logistic regression model
(3.1.1)
PreY'1
T
-e
x.1
=
F(t)
= llx.)
1
C1,x.1, 1'··· ,x.1,p-1)
=
-t -1
(l+e)
i=l, ... ,N
.
.
.
. ( y.,x. ) were
h
cons1sts
0f t h e pa1r
y. t a k es th e va 1ue
Th e 1. th 0b servat10n
1 1
1
o or
1 and x. is a known vector of constants.
1
The maximum likelihood
estimate of 80 is obtained by solving the equation
(3.1.2)
and is, under certain design conditions optimal in the sense of being
asymptotically efficient or best asymptotically normal.
optimality properties
Of course these
hold only when the data are generated in strict
accordance with model (3.1.1).
In most applications, however, one can
claim only that (3.1.1) is a reasonable approximation.
A key principal
underlying the theory of robust statistics is that optimality at a model
35
often does not imply "approximate optimality" at an approximate model. Thus
one should not be wed to the theory of maximum likelihood and it is, in
fact, desirable to develop
procedures which perform well in situations
close to the idealized model.
In addition to model insensitivity an esti-
mator should not be overly influenced by any small subset of the data, i.e.
the
estima~or
should depend smoothly on the data.
Let G be the empirical
N
distribution function; that is, G assigns mass liN to each of the points
N
CYi'xi).
Then viewed as a functional of GN,
some reasonably defined sense.
e = 8(G N)
should be smooth in
Suppose such an estimator is found, then
alteration or deletion of some small subset of the data would produce only
A
moderate changes in the value of 8.
In much the same fashion this estima-
tor would be relatively insensitive to small perturbations in the under1y-
-e
ing model.
Pregibon (1981, 1982) has shown the the logistic regression
m.l.e. can be highly influenced by a relatively small subset of the data
and thus its performance in settings other than at the idealized model can
be severely impaired.
Although it is clear that 8CG) should depend "smoothly" on G it is not
as yet evident how to make this condition precise.
For a function of a real
variable f(t), a sufficient condition for smoothness of fC·) over an interval
is that its derivative, f'C·), be bounded over the interval.
that If(t)-f(s)
I
This insures
A
is small whenever It-sl is small.
A
By analogy 118CG)-8(H)
should be small whenever G is close to H. A reasonable strategy then is to
A
define an appropriate derivative of 8C·) and restrict consideration to
only those estimators which possess bounded derivatives.
discussed in the following section.
This approach is
II
36
3.2
Influence and sensitivity
In order to maintain the independent non-identically distributed nature
inherent in regression models no stochastic assumptions concerning the predictors xl, ... ,x
N
are made.
This presents a minor difficulty in that Ham-
pel's (1968, 1974) definition of the influence function is geared toward
i.i.d. settings.
However, since our interest lies in the class of M-esti-
mates, i.e. those defined as solutions to
N
L 1/J (y ].].
. , x. , 8)
(3.2.1)
1
= 0
an appropriate analogue to the influence function is
(3.2.2)
IC1/J(y,x,8) = -(E
•
N8
.
In (3.2.1) 1/J denotes (a/a8)1/J.
·e
(1/J(Y,X,8)))
Also
-1
1/J(y,x,8)
both 1/J and its influence curve IC1/J
are allowed to depend on N although the additional notation is often suppressed.
In (3.2.2) E denotes expectation with respect to P , a measure
N8
N8
to (l,x.) and mass (l-F(x~8))/N
to (O,x.),i=l,
.• ,N.
which assigns mass F(x!8)/N
] . ] .
].
].
Upper case (Y,X) represents
has distribution P .
N8
a
generic
response-predictor
pair
which
For future reference it is noted here that P is
N
used to represent the marginal distribution of P corresponding to X and
N8
EN is the expectation operator associated with P .
N
Thus for example x =
(xl+ ... +xN)/N can be written as EN(X).
The definition of influence, (3.2.2), is motivated more rigorously as
follows:
Suppose G assigns mass gi/N • . . to (l,x i ) and (l-gi)/N
N
(O,x ) for i=l, ..• ,N. For Fisher consistency 8 = 8(GN) must satisfy
i
to
(3.2.3)
Next contaminate G .
N
Define H
= (1-£) G + £ Q, then, writing 8(£) for
N,£
N
37
S(lI
N,£
) it follows that
f
(3.2.4)
W(Y,X,S(£)) dH N,£
= O.
Differentiate (3.2.4) with respect to £ and evaluate at £ = 0 to find
Finally, take Q to be point mass at (y,x) to arrive at (3.2.2) when G
N
P .
Since
NS
(3.2.5)
=
ENS(IjJ(Y,X,S)) = 0,
expression (3.2.5) can be differentiated to find
T
-ENS (~(Y , X, S) ) = ENS(W(Y,X,S) i (Y,X,S))
where i(Y,X,S) ~ (Y_F(STX))X
and T denotes transposition.
Thus
an a1ter-
native expression to (3.2.2) is
(3.2.6)
Under sufficient regularity conditions a solution SIjJ of (3.2.1) satisfies
= N-~
(3.2.7)
as N
-+
00.
N
L IC",(Y.,x.S o)
1
'Y
1
1
+
0p(l)
And, if
(3.2.8)
then
(3.2.9)
In the previous section it was argued that bounding an appropriate
derivative of S(·) should produce robust estimators.
This derivative of
38
course is just
the Gateaux derivative of e(·) in the direction
IC~(y,x,e),
of point mass at (y,x) evaluated at the logistic model, i.e. at P N' More
e,
intuitively,
IC~(y,x,e)
can be regarded as an approximate measure of the
normalized effect on the estimate of adding a new observation at (y,x).
That is, if e
solves
yx
N
L ~(y.,x.,e
)
1
1
yx
+ ~(y,x,e
1
yx
)
= 0
and e solves
N
'"
L ~(y.,x.,e) = 0
1
11
then under reasonable regularity conditions N(eyx-e)
as N
+
00.
Thus, selecting
~
so that
IC~
+
0p(l)
is bounded limits the amount of
'"
influence any observation can have in determining e.
·e
= IC~(y,x,e)
Such as estimator is
said to have bounded influence.
Following Hampel (1974) the gross-error-sensitivity is defined as
(3.2.10)
s1 (~)
=
sup
(y,x)
II
IC~(y ,x, e)
II
and is a measure of the maximum influence any observation can have on the
estimate e.
of coordinate
This definition of sensitivity is not invariant to a change
~ystem.
The following definition of sensitivity, made pop-
u1ar by Krasker and Welsch (1982), is invariant.
(3.2.11)
lyTIC~(y,x,8)1
sup sup
(y,x) ylO
=
sup
(y,x)
T
k
(y v~(e)y) z
T
-1
(IC~(y,x,e) v~
(8) IC~(y,x,8))
!<;
z
For this definition of sensitivity the estimator's influence function is
normed by the corresponding asymptotic covariance matrix.
Because of this
39
Stahel (1981) calls
s2(~)
the self-standardized sensitivity.
As a consequence of our definitions the maximum likelihood estimate has
influence function
(3.2.12)
and both sl and s2 are infinite.
identity F(l)
=
In deriving (3.2.12) use is made of the
F(l-F).
The goal in implementing a bounded influence estimator is to insure
that the fitted model is determined by the bulk of the data and not by a
few extreme observations.
As an alternative to bounded influence one
might therefore consider a procedure which 1) uses appropriate diagnostics
to detect influential observations, 2) establishes criteria for rejecting
·e
influential or outlying data, and 3) employs a standard analysis, e.g. maximum likelihood on the reduced data.
This would presumably result in a
fitted model similar to that obtained from a bounded influence estimator.
However, a rejection scheme is necessarily a discrete operation.
Huber
(1981) claims that robust procedures " ..• are superior because they can
make a smooth transition between full acceptance and full regection of an
observation".
Furthermore, the correct identification of outliers is often
enhanced by a preliminary robust fit, see Hampel (1976), Pregibon (1982).
Finally, a bounded influence estimator has associated with it a complete
theory for asymptotic inference.
The ability to continuously downweight observations and the associated
asymptotic theory are the two most often cited advantages of robust procedures.
Another less frequently noted advantage, yet one which is encoun-
tered later, lies in the fact that while maximum likelihood estimates may
exist for the complete data, the likelihood function for the "cleaned"
40
data may be very unstable and maximizing values may not exist.
uation the rejection approach fails.
In this sit-
However, a bounded influence estimator
may still yield meaningful results.
3.3
Related results
Unlike linear regression the problem of robust estimation in the logis-
tic model has received little attention to date.
In this section some re-
suIts for linear regression are reviewed with the intent of motivating certain proposals for logistic regression.
Next, those aspects of Beran's
(1982) theory which pertain to logistic regression are discussed and finally
some comments are made on specific proposals by Jones (1982) and Pregibon
•
-e
(1982) .
To facilitate the comparison between linear and logistic regression it
is assumed that scale is known and equal to one in the former.
Thus in
each case the maximum likelihood equations take the form
N
(3.3.1)
I
E.X.
11
1
=0
where
E.
1
= y.1
T
x.8
= y.1 -
T
F(x.8)
linear
1
1
logistic
In linear regression E. is usually assumed symmetric.
1
However, this is ob-
viously not the case in logistic regression.
Huber (1973) proposed solving for 8, the equations
N
(3.3.2)
I
1
~b(E.)x.
11
=0
where
~(E)
=
E min (l,b/IEI).
41
Por a certain matrix D this has influence function IC (y, x, 0)
b
=1j)b (y-xTO) D-Ix.
While 1j)b(·) limits the effect of large residuals IC b (y,x,8) is unbounded
since Ilxll can be arbitrarily large.
Something which is obvious but never-
theless worth noting is that in logistic regression the residuals are bounded
by one.
Thus it is only the predictor which prevents the m.l.e. from having
bounded influence.
With a common sense proposal Mallows (1973, 1975) suggested modifying
(3.3.2) to
N
L 1j)b(s.)
1.
(3.3.4)
1
w(x.)x.
1.
1.
=0
where w(·) is a weight function chosen so that Ilx II w(x) remains bounded.
While this yields bounded influence, the indiscriminate downweighting of
-e
x values can seriously affect the efficiency of such a procedure when the
model holds.
Intuitively, this is because the information contained in the
T
point (y,x) is proportional to xx • Thus, observations associated with
large
In logistic regression the
information in (y,x) is proportional to p(l) (xTO) x xT . Since p(l) (t) ,; (d/dt)
XIS
can contain substantial information.
F(t) decreases exponentially as It I
-+ 00,
T
F(l) (xTO) x x can be large only
when !lxll is large and IxTOI is small, i.e. x must be nearly orthogoanl to
O.
The point here is that downweighting large
XIS
should not affect effici-
ency in logistic regression as severely as in linear regression.
This fact
together with the natural bound on the residuals make Mallows type estimators
particularly adaptable to logistic regression.
Schweppe (Handschin et aI, 1975) proposed modifying (3.3.2) to
(3.3.5)
x ..
1.
42
In (3.3.5)
w(x.)
1
= (I-h.)
1
!.:
Z
where h.
1
= x.T1 (EN (XXT-l
))
x ..
1
This proposal
provides the protection of a Mallows but with less loss of efficiency.
Krasker (1978, 1980) considered the class of M-estimators and found
that the choice
(3.3.6)
~(y,x, 6)
where D satisfies
(3.3.7)
A
minimizes the trace of the asymptotic covariance matrix of IN (8-8 ) sub0
ject to a bound on sl.
Noting the lack of invariance in Krasker's proposal, Krasker and Welsch
-e
(1982) suggest solving
r
N
(3.3.8)
T
(y. -x. 6)
111
x. min (1,
1
b
) =
T I
-1
~.
ly.-x.6 (x.A x.)
111
1
0
where the matrix A satisfies
(3.3.9)
and
(3.3.10)
They prove that within the class of estimators obtained as solutions to
their proposal alone satisfies a first-order necessary condition for strong
optimality, i.e. minimizing the asymptotic covariance matrix in the sense
of positive definiteness, when a bound is placed on s2"
Subsequently,
43
Ruppert (1983) has shown that no such estimator need exist, i.e. any best
function w can depend on the particular linear combination yT 8 being
estimated.
Huber's proposal is a direct generalization of his work on estimating
location parameters.
The Mallows and Schweppe forms are common sense modi-
fications of the Huber estimator designed to insure bounded influence.
Krasker's approach of minimizing the trace of the asymptotic covariance
matrix subject to a bound on sl generalized to linear regression Hampel's
(1968) approach to the single parameter problem.
Finally, Krasker and
Welsch introduced an invariant version of the so-called Hampel's extremal
problem for linear regression and found an estimator which satisfies a necessary condition for optimality.
-e
The theme common to all these approaches
is the attempt to find estimators which perform efficiently at the model
while limiting the influence of any single observation.
Beran (1982),
arguing that " •.. efficiency at the model is logically important only if
one believes that the parametric model holds strictly for most samples •.• "
proposed a general theory for robust estimation in i.n.i.d. models which
avoids considerations of efficiency.
Following is a summary of his theory
only as it pertains to binary and in particular logistic regression.
Start
by assuming that the true distribution of y. given x. satisfies
:1.
(3.3.11)
Pr(y.=llx.) = P.
1.
1.
:1.
i=l, ... , N•
:1.
For each N define 8 , N as that value of 8 which minimizes
0
(3.3.12)
N
L II
1
T
2
:1.
:1.,
P. - F(x.8)11. 8 •
:1.
In (3.3.12) the choice of the norm
II-II.:1., 8
is made by the statistician.
44
Beran suggests taking 11·lli,8~
1·1
and for this choice 80 ,N exists uniquely
and there is no need to consider the local functionals Beran finds necessary
to introduce in his general theory.
The parameter 8 N minimizes a distance
0,
between the true distribution of the responses (P1, ... ,PN) and the modeled
T
T
distribution (F(x 8), ... ,F(x 8)).
N
1
8 N.
0,
Beran maintains that one should estimate
Since his theory distinguishes estimators only via their asymptotic
properties it can be shown that the value of e which minimizes
N
(3.3.13)
l
1
(y._F(x:e))2
1
1
peforms optimally in Beran's framework, i.e. satisfies his Theorem 2.
The
resulting estimator is a solution to
N
(3.3.14)
-e
I
1
(y.-F(x:e)) F(1)(x~8) x.
1
1
1
=
O.
1
The contamination neighborhood over which this estimator is optimal takes
the form
(3.3.15)
The solution to (3.3.14) is none other than the least squares estimator
of 8.
That such a simple estimator results from Beran's rather complex the-
ory derives from the fact that the regression function E(YIX) completely determines the distribution of ylX for binary models.
More specifically min-
imizing the distance between the observed or empirical distribution and the
modeled distribution is equivalent to minimizing the distance between the
observed regression function and the model regression function.
The influence function of the estimator determined by (3.3.14) is
(3.3.16)
45
Although the factor FCl)CxTe) limits the influence over much of the design
space, C3.3.l6) is bounded for all x only in the case of simple logistic regression with non-zero slope.
The least squares estimator puts relatively
little weight on observations associated with extreme probabilities and
from an operational perspective this is its major claim to robustness.
By taking other choices of
form
II' Iii , e
one can show that the most general
of a Beran estimator is asymptotically equivalent, in Beran's frame-
work, to the M-estimator defined as the solution to
r CY.-Fcx~e))
1 1F(1)cx~e)x.
1 1 ~e1.
C3.3.l7)
=0
1
N
where {~ei}l
are bounded non-negative weights.
In this case the contamin-
ation neighborhood over which the estimator is optimal takes the form
-e
C3.3.l8)
For arbitrary choices of
~ei
it is difficult to interpret BNceO'C) .
Studying problems in item response theory Jones (1982) suggested a
class of robust estimators applicable to logistic regression.
For a cer-
tain matrix D these have influence functions
N
C3.3.19)
where £ indexes the class.
Taking £
squares estimates respectively.
= 0,
1 yields the m.l.e. and least
To stretch an analogy one might claim that
the factor CF(1)CxTe))£ in (3.3.19) treats the residual in much the same
fashion that a so-called redescender acts on the residual in linear regression.
When ly.-F(X~e)
F(x~e)
1
+
1
1
I
approaches its maximum value of 1, it must be that
0 or 1 and in either case
ual gets near zero weighting whenever E > O.
Hence
an extreme resid-
The influence function (3.3.19)
46
remains unbounded for all
E
however.
Pregibon (1982) proposed fitting logistic regression models by choosing
8 to minimize
N
L A(-2L(y.,x.,8)j1T(x.))
(3.3.20)
1
~
~
T
where L(y,x,8) = Y log F(x 8)
t <
~
1T(X.)
~
T
(l-y) log(1-F(x 8)), A(t) is some increasing
+
, and 1T(X.) are positive weights. The choice of
function for 0
~
is left open.
However, he suggests taking
00
~
1T(.)
t
A(t)
= {
k
2t ~-b
2
2
where b > 0 is adjusted to provide various degrees of protection and efficiency.
·e
The resulting estimator solves
(3.3.21)
N
L
L t(y.,x.,8)min~(1,-b
1
~
~
2
1T(x.)j2L(y.,x.,8)) = 0
~
~
1
and is generally inconsistent at the logistic model.
In fact one can show
that minimizing (3.3.20) produces consistent estimates only when A' (-21og p) =
A'(-2log(l-p))
for
a
< p < 1.
In this case the estimate is a weighted like-
lihood type in which the weights do not depend ion the y. 'so
1
Although in any given situation it may be that model discrepancies
make the question of consistency meaningless, it nevertheless makes sense to
'center' estimation procedures at the model.
In the next section estimators
analogous to the Krasker and Krasker-We1sch proposals are derived and their
existence and optimality properties are discussed.
Attention is then
re-
stricted to a computationally simpler class of estimators a 1a Mallows type
and optimal bounded influence estimators are found within this class.
invariant and noninvariant procedures are considered.
Both
47
3.4
Bounded influence estimators
In this section the notation and conventiofts adopted earlier are employed
without further explanation.
Since
~
and its influence function
the same estimator there is no loss in generality in assuming
necessary and sufficient condition for
~
~
IC~
define
= IC~.
A
to be in canonical form is that
(3.4.1)
Henceforth only those
satisfying (3.4.1) are considered.
~'s
Recall that
under sufficient regularity conditions the asymptotic covariance matrix of
(3.4.2)
·e
Our task is to find that
~
function which minimizes
tr(V~(e))
subject
to the constraints
(3.4.3a)
sl (~) ,;
(3.4.3b)
sup Ill/J(y,x,e)
(y,x)
Suppose there exists a vector C
K
~
= MK, N(e)
II
~ b.
= CK, N(e)
and a positive definite matrix
(K designates Krasker type estimator) such that
II
~~l(9--CK)
II))
(3.4 .4a)
ENe ((9--C K)r (bl
(3.4.4b)
ENe ((9--C K) (9--C K) T r(bl IIMK- l (9--C K)
where ret)
= minC1,t),
Define
(3.4.5)
t
~
O.
= 0
II )) = MK
In (3.4.4) and elsewhere 9-
= 9-(Y,X,8)
-
48
T
EN8(~Kt
then as a consequence of (3.4.4) it follows that
~K
~K
is in canonical form. Clearly
To show that
~K
is optimal let
~
)
= I pxp '
i.e.
satisfies the constraints (3.4.3).
be any competitor and write
(3.4.6)
- E
N8
T
By supposition E(~ t )
=I
and E(~)
= O.
(t.1
-1 (t-C)~)
T
K
K
Therefore ~ persists on the right
hand side of (3.4.6) only in the term EN8(~ ~T)
= V~(8).
Thus to minimize
tr(V~(8)) it is sufficient to minimize EN8 (11 M.~lu.-CK) -~Ir) which is clearlyaccomplished, subject to (3.4.3), by taking ~ = ~K. This result is
stated as
Theorem 3.1
If for a given choice of b > 0 equation
sesses a solution (CK,MK) then
~K
minimizes
tr(V~(8))
(3.4.4) pos-
among all
~
satis-
fying (3.4.1) and (3.4.3).
~K
is the analogue of the Krasker (1980)
estimator.
The introduction
of the vector C is due to the asymmetric residuals in logistic regression.
K
As with the Krasker estimator neither
sl(~K)
nor the estimator itself is
invariant under a change of coordinate system.
Next
an invariant proce-
dure is considered analagous to that of Krasker and Welsch (1982).
The restrictions imposed on
e
(3.4.7a)
EN8(~(Y' X,
(3.4.7b)
s2(~) ==
8))
sup
(y,x)
in addition to (3.4.1) are
~
=0
T
~
-1
(y,x,8)VI/J
~(y ,x,
8) ::; b
2
.
49
Suppose that for fixed b > 0 there exist CKW
= CKW , N(6) and l\w = ~W , N(6)
.
satisfying
(3.4.8a)
(3.4.8b)
Define
(3.4.9)
l/iKW(Y' X, e)
where
(3.4.10)
·e
Clearly l/i , satisfies (3.4.7a) which in turn implies (3.4.1). Also, since
K
.
-1
-1.
T -1
T -1
VKW = Vl/i = DXW M. KW DKW l.t follows that l/i KW VKW l/iKW = (~-CKW) M'KW(~-CKW) x
KW
2
T -1
2
reb /(~-CKW) MKW(~-CK~) and hence s2(l/iKW) ~b.
In light of R~pert's
(1983) counterexample one might reasonably ask what optimality properties,
if any, does l/i. possess?
KW
To answer this let l/i be any competitor and write
(3.4.11)
Again l/i persists on the right hand side of (3.4.11) only in the term
T
-1
ENe(l/il/i ) = Vl/i(6). Therefore tr(Vl/iV ) is, up to an additive constant,
KW
proportional to
(3.4.12)
50
Let
</>
=
V~~
t/J
and note that
11</>11
2
=
In terms of
</>
(3.4.12)
becomes
(3.4.13)
which is clearly minimized, subject to the constraint
11</>11
2
~ b 2 , by taking
However, this implies that tr(vt/Jv~~) is minimized, subject to (3.4.1) and
2
the constraint t/JTV~~ t/J ~ b by taking t/J = t/JKW. This last implication follows
.
.
from the 1dent1ty
-1
MKw
-1 -1 -1
DKWVKWDKW and provides the following optimality re-
=
sult.
Theorem 3.2
the solution (c
If for a given choice of b > 0 equations (3.4.8) possess
,M.. ) then t/J
KW KW
(3.4.1), (3.4.7a), and
Remarks-1.
on
t/JKW
KW
T -1
V
KW
t/J
t/J
minimizes tr(V",v-l) among all ljJ satisfying
0/ KW
2
~ b .
In Theorem 3.2 the conditions for optimality of ljJKW depend
-1
itself through V .
KW
This is somewhat disconcerting and necessarily
weakens the conclusion of the theorem.
Nevertheless,
t/JKW
does satisfy an
optimality criterion and has the additional property of invariance.
2.
The theorem is a special case of a general theory outlined by
Ruppert (1982).
inite matrix.
Let 11-11* be any norm on lR p and Bpxp be any positive defIf £8 is the maximum likelihood score function then the ljJ
which minimizes tr(E(ljJljJT)B subject to Iit/J!I* ~ b is
where Pb(z), z € lR P is the projection of z onto {z€lR P : I\zll* ~ b}
with
51
T
k
Le. IIPb(Z) "* ~ b and
II x-Pb(z)
respect to the norm IIzllB
= (z Bz)
inf{ IIx-z 11 M: liz 11* s b}.
The vector C and matrix M are chosen to insure
2,
that "l/J* is unbiased and sat isfies "l/J*
T -1
(z VKWZ)
k
2
and R.
8
In Theorem 3.2 B =V~:,
= IC"l/J*'
II B
=
II z II * =
= R.(Y,X,8).
Both Theorems 3.1 and 3.2 assume the existence of a vector and positive definite matrix satisfying (3.4.4) and (3.4.8) respectively and this
puts a restriction on the bounding constant b.
One can prove directly or
appeal to the optimality of the m.l.e. to show that V ~ r-l where r =
K
(1) T
T
.
T
EN(F
(X 8)XX ) is the information matrix. S1nce VK = E ("l/J"l/J ) it follows
N8
that
tr(I-l) ~ tr(V )
K
= EN8 II M~l (R.-C K) II
2
2 r (bl
II M~l (R.-C K) II
)
s b2 •
-e
Thus
1
k
b> (tr(r (8)))2
(3.4.14)
is a necessary condition for the existence of a solution to (3.4.4). Furthermore, a fixed point argument similar to that employed by Krasker (1980) can
be used to show that a solution to (3.4.4) always exists for some sufficently large b, provided r(8) > O.
For the invariant estimator note that (3.4.8b) implies
and thus p
= tr(I) ~
b
2
so that
(3.4.15)
is a necessary condition for the existence of a solution to (3.4.8). Also,
52
it follows from results of Maronna (1976) that a sufficient condition for
existence of a solution is
(3.4.16)
b
2
> p/(l-Pr*(H))
where
Pr*(H) = sup PNe(H)
He:H
and
H is the set of all hyperplanes in
R P.
Although the predictors {xi}~
are supported entirely on a hyperplane, the variable
~
=
~(Y,X,e)
with mea-
sure P
is supported on at least two hyperplanes unless Y is constant a.s.
Ne
(P ). Thus (3.4.16) is not vacuous. However, it is not necessarily true
Ne
that Pr*(H)
+
0 as N
+
00.
For example, if e
and, because of the intercept, P*(H) = 1/2.
r
-e
=0
then
~
=±
X/2 a.s. (P
Ne
)
If, however, the non-intercept
components of X are generated from a continuous distribution on ~p-l and
at least one of the slope parameters is nonzero, then P converges weakly
Ne
to a measure which for fixed e is absolutely continuous with respect to
Lebesgue measure.
In this case p*(H)
+
r
0 resulting in the asymptotic nec-
!-,;
essary and sufficient condition b > p2 .
and ~
exist for suitable choices
K
KW
of b and that both satisfy an optimality property at the logistic model. It
. So far it has been shown that both
~
would appear as though our search for a robust estimator is over.
nately, there are considerable difficulties computing
these
Unfortu-
estimates.
Clearly, an iterative method is required and this may involve a good deal
of computing.
Particularly troublesome is the fact that the success of any
algorithm is likely to depend heavily on the choice of an appropriate preliminary estimate.
One recourse is to employ so called "one-step" estimates.
Starting with a root-N consistent estimate and performing one Newton-Raphson
iteration yields an estimate which behaves asymptotically like the fully
53
iterated solution.
Crucial to the viability of such a procedure, especially
for small to moderate sample sizes, is the quality of the initial estimate.
For the remainder of this chapter attention is restricted to a computationally simpler class of estimators.
Both optimal and optimal invariant members
These estimates may be employed in statu
are found within this class.
quo
or may serve as starting values for computation of 8 and 8 or one-step
K
KW
versions thereof.
The class of estimators considered have influence functions of the form
(3.4.17)
IClJi
= lJi
w
w
= (y_F(xT8))
t(x,8) w(x,8)
where t(·,·) is a scalar valued function chosen by the statistician to attenuate large residuals, e. g. one might take t (x, 8)
Jones (1982).
for all 8.
a:
(F (1) (xT8)) E, E > 0 a 1a.
It is convenient to normalize t(·,·) so that sup t(x,8) = 1,
x
In the example just cited this implies t(x,8)
is the appropriate choice.
(4F(1)(xT8))E
The vector w(.,.) is a function of x and 8,
but not y, and is varied to achieve optimality.
Since E(YIX)
members in this class yield Fisher consistent estimators.
= lJi w
=
= F(XT8),
all
The requirement
is satisfied whenever
(3.4.18)
In finding the vector function w(.,.) which minimizes tr(V
ation is restricted to only those w(·,·)'s
(3.4.19)
Generally, v 1 (w)
sup
(y,x)
=
II
satisfying
T
(y-F(x e))w(x,8)
sl(lJiw) only if t(x,8)
) considerlJiw
= 1.
II
~ b •
That is, (3.4.19) puts a
restriction on only a portion of the influence curve unless t(x,8)
= 1.
that this entails no loss of generality since one can always take t(x,8)
Note
=1.
S4
Furthermore, if t(.,.) appeared within the norm in (3.4.19), then the solution to the variational problem would be independent of t(·,·).
= t(x,8)w(x,8)
fically, by defining w*(x,8)
More spec i-
the variational problem could
be stated as: choose w* to
minimize
tr(v~
)
w*
where
=
~w*
T
(y-F(x 8))w*(x,8)
= IC~
.
w*
The solution to this problem w* t' which can also be obtained in the conop
text of the present theory by taking t:: 1, clearly does not depend on t ( . , . ) .
Note also that the normalization of t (. , .) guarantees that sl (~w)
~ VI
(w) .
Thus if (3.4.19) is replaced by the weaker condition
a stronger optimality result is obtained.
ing
tr(v~
All we are doing here is minimiz-
) over a broader class of vector functions w.
It follows that for
w
a fixed choice of b the optimal estimator corresponding to the choice
t(·,·)
- 1 is at least as efficient as any optimal estimator corresponding to some
t(·,·)
1- 1.
The problem is formulated in this fashion to allow greater flexibility
in designing optimal estimators.
In particular it will be possible to link
the present results with those of Beran (1982) by indicating the optimal
choice for the weights {~8i} appearing in (3.3.17).
Since w(·,·) is not a function of y, (3.4.19) is equivalent to
(3.4.20)
T
sup m(x 8)
x
II
w(x,8)
II ~
b
55
where
T
m(x B)
(3.4.21)
= max(P(xTB),
T
1-P(x B)).
Suppose there exists a positive definite matrix
~
= ~ , N(B)
such
that
(3.4.22)
In (3.4.22) and elsewhere it is to be undertsood that p(l)
t
= t(X,B),
and m
then from (3.4.22)
= m(XT B).
= p(l)(XTB),
If
EN(P(l)t XwT)
M
= I pxp
so that by (3.4.18)
(3.4.23)
is in canonical form.
Using the standard argument one finds that for any
allowable w(·,·)
(3.4.24)
- EN (P (1) tM;/ X wT )
- EN(P(l)twXT~l)
+
EN(P
(1) 2
T
t ww )
where w persists on the right hand side of (3.4.24) only in the term
2 T
E (P(1)t wW ) = V . Thus to minimize tr(V,!, ) it is enough to consider
N
I/J
'l'w
w
EN (P (1) t 2 11
M~/X/t-w 11 2 )
which is minimized, subject to (3.4.20) by taking w
=
w "
M
56
Theorem 3.3
~>o
If for some choice of b> 0 equation (3.4.22) has a solution
then lj!M minimizes tr(vlj!) among all lj!w of the form (3.4.17) satisfying
w
(3.4.20) .
At t
==
1 the estimator is a solution to the equation
N
TTl
L
(y.-F(x.e)) x. r(b/m(x.e) II M- (e)x.lI) = O.
1
~
~
~
~
-~
~
(3.4.25)
The weights r(b/m(x~e)
~
in light of (3.3.4).
II M-1 (e)x.11 ) might be called optimal
With t = 4F(1) the equations to solve
-~
~
Mallows weights
are
(3.4.26)
Comparing (3.4.26) to (3.3.17) suggests calling the weights
1
min(l/4 F(l) (x:e),
b/m(x~e)
II M(e)x.1I
)
~
~
-~
~
optimal Beran weights.
The resulting estimator satisfies two qualitatively
different optimality criteria.
this, however.
One should not assign undue importance to
The neighborhood over which the estimator is optimal in
Beran's theory is sufficiently complex as to be almost meaningless.
Also,
in Theorem 3.3 only a portion of the influence curve is bounded when
t~
1.
This is necessary to insure that the resulting estimator depends on t in
a non-trivial fashion.
Finally, as suggested earlier, it might be reasonable
to consider the family of estimators corresponsing to the choice t = (4F(1))E:.
E:
As with the Krasker estimator lj!M lacks the desirable property of invariance.
To obtain this replace (3.4.20) with
(3.4.27)
v 2 ( w)
= sup
x
m2 (xTe) wTVlj!-1 w
w
~
b2
and consider only t ( . , .) which are functions of xT e. Suppose that for b > 0
there exists
~I = ~I
, N(e) > 0 such that
57
(3.4.28)
Define
(3.4.29)
(3.4.30)
T
W
=
(y-F(x 8))
M1
(3.4 .31)
w (x,8).
M1
A by now familiar argument leads to the following result.
Theorem 3.4
the solution
(3 .4. 17)
If for a given choice of b> 0 equation (3.4.28) possesses
~I
> 0 then
. f·
sat~s y~ng
W
M1
m2wTV-l
Mlw
minimizes tr(V
~
b2 .
-1
WVM1 ) among all
Wof
the form
w
The first remark following Theorem 3.2 is applicable to the present result.
·e
This chapter ends by finding necessary and sufficient conditions on
b> 0 so that (3.4.22) and (3.4.28) have non-trivial solutions.
For
W
note t hat
M
since V > I -I ,
M
and thus
(3.4.32)
is a necessary condition for a solution to (3.4.22) to exist.
For
W
M1
(3.4.28) implies
I
= EN(F(l)~~xxT r(b2t2/m2xT~~x))
and after taking traces one finds that
58
(3.4.33)
is a necessary condition for the existence of a solution to (3.4.28).
To obtain sufficient conditions start by defining
T
B = EN(F(l)XX r(t/mI\X\i)),
M
2
T
2
2
MI = EN(F(l)xx r(t /m I\xI1 )).
B
For any positive semi-definite matrix A let Al(A)
~
A2 (A)
~ ... ~
Ap(A) denote
the ordered eigenvalues of A.
Theorem 3.5
M~
in this case B
-1
A solution to (3.4.22) exists provided b > Al (B ) and
M
~ ~
2
A solution to (3.4.28) exists whenever b >A
~ ~I ~
and in this case BMI
·e
I.
1
(B
MI )
I.
The proofs of the two assertions are similar so only that one
Proof:
for MMI is given.
Let M be any positive definite matrix with M > B
MI . Then
M-1 < B-1 and therefore
MI
2 T -1
mx M x
Thus, whenever M > B
MI
Let M(O)
-1
=I
and define recursively
59
Note that L(·) is an increasing function in the sense that for positive
definite matrices Al and A2 , Al
~
A2 => L(A2)
are in the sense of positive definiteness).
~
L(A l ) (all inequalities
Since M(l) = L(M(O))
~
M(O)
a routine inductive argument shows that the sequence {M(K)} is decreasing
in the sense of positive definiteness. It is also bounded below by B •
MI
Thus
l~
M(K) exists and is just
Remarks-I.
~I.
If p= 1, i.e. one slope parameter and no intercept, then
the necessary and sufficient conditions are equivalent.
In higher dimen-
sions there can be a substantial difference between the two.
It is conjec-
tured, and experience supports this to some extent, that in most cases
(3.4.32) and (3.4.33) are asymptotically sufficient as well as necessary.
-e
2.
~I(e)
The proof is constructive.
for a given 8.
It indicates a method of calculating
More is said on this in Chapter V.
In the next chapter consistency and asymptotic normality of the optimal
Mallows estimator (t= 1) is proved for both the invariant and noninvariant
cases.
Also the construction of one step versions of e and e
is outlined.
K
~
CHAPTER IV
BOUNDED INFLUENCE ASYMPTOTICS
4.1
Introductory remarks
In Chaper III the form of the optimal estimating equations was derived
under the supposition that the estimator was asymptotically normal with
known covariance matrix.
To justify these results it must now be proved
that the optimal estimators do in fact behave as expected.
In Section 4.2
sufficient conditions are found for consistency and normality of the optimal
·e
Proofs of these results are given in Section 4.4.
Mallows estimators.
The
construction of one-step versions of 6K and 8KW is outlined in Section 4.3.
The fact that x ,x , ... are not assumed to be i.i.d. random vectors
l 2
presented little difficulty in the previous chapter.
However, in Section
4.4 a price is paid for this added generality in the form of lengthy and detailed arguments.
This is an equitable tradeoff especially in the context
of robust estimation.
(4.1.1)
N
I
1
For example, the -assumption
II
~
Ilx.
2
= O(N)
allows a far greater range of behavior of the sequence {x.}
than does
~
(4.1. 2)
N
In particular (4.1.2) implies, via the SLLN, that N-ll Ilx. 11
1
almost surely which in turn implies
~
2
converges
61
max
(4.1. 3)
l~i~N
2
(!Ix.~ II IN) = 0(1).
Thus, assuming (4.1.2) obviates to a great extent the need for bounded influence.
Under (4.1.1), however,
max !Ix. 11
<' <N
1 -~-
~
2
can increase at rate O(N).
Assumption (4.1.1) models more closely those designs in which bounded influence estimators are necessary and hence is to be preferred over (4.1.2)
in theoretical investigations.
One consequence of our model is that it makes no sense to talk about
consistency of the sequence
{~(e)}.
The point of view is adopted that
(3.4.22) and (3.4.28) define non-random matrix valued functions of the
vector argument e.
Naturally, the behavior of these sequences must be lim-
ited to some extent and this is accomplished by imposing some mild design
conditions on {x.}.
~
·e
4.2
~onsistency
and asymptotic normality
Consider the invariant estimator e .
MI
One obstacle which must be over-
come in finding sufficient conditions for consistency is that of determining
the bound b.
matrix
~I'
The crucial requirement is the positive definiteness of the
i.e. b must be chosen to insure that
away from zero for all N.
ical formulation.
Al(~I)
remains bounded
This requirement is now given a precise mathemat-
To indicate the dependence of
~I
~I=~I(e,b).
on b write
There should be no confusion if the subscript MI is dropped from
e .
MI
and
Unless stated otherwise M will always designate the matrix correspond-
ing to the invariant procedure.
Al(A)
~I
~ ... ~
In keeping with
established
notation
Ap(A) denote the ordered eigenvalues of the matrix A.
From (3.4.28) i t is obvious that M>O if and only i f there exists
2
£ > 0 such that b A (M( e, b)) ~ £2.
l
Given any £ > 0 define
62
(4.2.1)
Let {eN} be any sequence of real numbers, possibly random, bounded below
by one.
Finally define
2
2
Obviously, b A (M(a,b))
1
(4.2.3a)
~ ~.
A
N
T
2
(y.-F(x.a))x.
1
~
~
~
Suppose now that a and M = M(a) satisfy
~ 2 2 T A_1
r (b Im.x. M x.) = 0
~
~
~
(4.2.3b)
TA
TA
where m. = m(x. a) and m = m(X a).
~
Since
~
~
> 0 is arbitrary
stating the
problem in this fashion merely makes precise the condition that b must
be chosen sufficiently large to insure M > O.
Consistency of a is proved
assuming
N
I Ilx.~ II
1
(C1)
2
=
O(N),
lim inf A (E (F(1) (xTao)xxTllxll- 1 )) > O.
1 N
(C2)
N
Suppose a is a solution to (4.2.3).
Theorem 4.1
If (C1) and (C2)
hold then 8 converges in probability to 80 ,
The proof of this result is given in Section 4.4. There is a similar result
for the non-invariant estimator which is presently designated S to distinguish
it from 8. Let b be defined by (4.2.2) and suppose Sand M=MCS) satisfy
N
(4.2.4a)
\
L (y.
1
~
T~
~-1
-F(x. 8))x. r(b/m.11 M x·ll)
~
~
~
~
(4.2.4b)
~
Theorem 4.2
Suppose 8 is a solution to (4.2.4).
If (C1) and (C2)
63
hold, then
e converges
Remarks-I.
cedure.
in probability to 8 ,
0
Assumption (C2) is necessary.
Consider the invariant pro-
From (3.4.28)
A (M) E (F(l)xxTllxll- l )
p
N
2
T
T
E ( II X11 ) EN(F (1) (X 8 ) XX II xii-I) .
N
0
In deriving this inequality it is helpful to recall that rea) ~ a, t
2
~ 1,
2
2
l
1/2 ~ m ~ I, XTM-IX > Al(M- ) IIxl1 2 = Ilx11 /A (M), IIxll ~ 1, and A (M) ~
P
p
E (II XI12). If (C2) is violated, then the sequence {b (8 ,E:)} must be unN
N 0
bounded for all E: > 0, Le. it is impossible to obtain a bounded influence
estimator.
2.
(C2' )
(Cl) and (C2) together imply (C2')
lim inf Al(EN(F(l)(XT8)XXTllxll-1)) > 0
N
for all 8 in some neighborhood of 8 ,
0
This follows from the fact that
In the proof of consistency use is made of (C2')
The consistency results are important although not sufficient for purposes of statistical inference.
Fixed sample size distribution theory is
prohibitively complex; thus only asymptotic results are possible.
In The-
orems 4.1 and 4.2 it was shown that as long as b is chosen sufficiently large
64
both 8
and 8 are consistent. In discussing distributional theory it is
M
MI
convenient to suppose that the bound b is fixed. To accommodate this simplification assume the existence of some 01 > 0 such that
lim inf
(Nl)
N
Some conditions restricting the behavior of
for distributional results.
and 0-pxp >
{~(.)}
are also necessary
Consider the matrix valued function of 8E RP
o.
(4.2.5)
Let I
N,
2(8,Q) be its partial derivative with respect to Q.
mality it is required that
·e
near 8 and N large.
0
IN,2(8,~(8))
In proving nor-
possess a bounded inverse for 8
That is, for some 02 > 0 it is assumed that
(N2)
That (Nl) and (N2) are required to hold for all 8 near 8 is a conse0
quence of our rather minimal assumptions on {xi}~.
E( Ilxlll) <
00
In an i.i.d. setting with
it should be possible to formulate these assumptions in terms
of the behaviour of
{~}
and {I
N,
2} at the point 8 alone.
0
The role played by (N2) is made apparent in the following proposition.
Proposition 4.1
If (Nl) and (N2) hold, then the sequence of matrix
valued functions {~(.)} is equi-Lipschitzian for 118-80" ~
Proof:
=
0, i.e.
(4.2.6)
°;. min(01,02)'
For all large N and 118-80 II ~ 0 ~(8) is a solution to I (8,Q)
N
65
Differentiate (4.2.6) with respect to e to find
where
IN,I(e,~(e))
=
ajay J(y,Q)
I
e ,~(e)
Without much difficulty one can show
so that (CI) implies
II
J (e,~(e))
I
II
is bounded independently of e.
This
fact, together with (N2) imply
IIM'(e)1I ~
const.
The conclusion follows from an appeal to the Mean Value Theorem for normed
o
spaces.
Remarks-I.
(NI) is just another restatement of the condition that
b must be chosen sufficiently large.
~
Assumption (N2), by allowing us to prove the proposition, plays
essentially the same role as assumption 7 of Krasker and Welsch (1982). Since
the xi's are non random it makes no sense to talk about consistency of
However, (N2) allows us to conclude that
any consistent eN.
In particualr, if
II
~(eN)-~(eO)
N~(eN-eO)
=
II
=
ope
Opel), then
II
~(e).
eN-eoll ) for
N~(~(eN)-~(eO))
= °P (1).
The main distributional results follow w
Theorem 4.3
Assume (CI), (C2), (NI), and (N2) and suppose e is a con-
66
sistent sequence of solutions to (4.2.3) where b is now fixed.
Then
(4.2.7)
where
~MI(""')
is defined in (3.4.31).
Two matrices appear in the definition of
The proof is given in Section 4.4.
~MI(·,·,8),
D N(8), see (3.4.28) and (3.4.29) respectively.
MI ,
namely
~I
, N(8) and
In the following corollary
it is assumed, for convenience, that these sequences converge.
Corollary 4.3.1
(4.2.8)
Proof:
In the right hand side of (4.2.7) DMI,N and
~I,N
can be re-
placed by their limits without altering the sense of the equality.
-e
~MI
Since
is bounded the proof is completed with a routine application of the Linde-
o
berg Central Limit Theorem and the Cramer-Wold device.
Theorem 4.3 and its corollary provide the justification for our definition of sensitivity and the construction of optimal score functions in
Chapter III.
Similar results hold for the non-invariant estimator.
Theorem 4.4
Assume (Cl), (C2), (Nl) and (N2) and suppose
sistent sequence of solutions to (4.2.4) where b is now fixed.
N
(4.2.9)
where
'"
- -~ \"
N~ (8-8
0) - N L ~M(y.,x.80) +
~MC""')
1
~
~
0
P
(1)
is defined in (3.4.23).
Corollary 4.4.1
Assume MM,N(80)
+
M(8 0)
and
8 is
Then
a con-
67
then
In the next section construction of one step versions of 8K and 8KW is
discussed.
4.3
One-step estimators
Consider the invariant estimator 8 •
KW
Under
sufficient regularity
conditions it is to be expected that
(4.3.1)
where
~KW(·'·'·)
procedure.
is defined in (3.4.9).
This suggests the following one-step
Compute 8 , the Mallows invariant estimator.
MI
Next compute
M (8MI ), CKW (8 MI ), and DKW (8MI ) from (3.4.8) and (3.4.10). This provides
KW
all the necessary building blocks for the one-step operation. The one step
•
A
(1)..
est1mate, 8KW
1S g1ven
b
y
(4.3.2)
This construction is similar to what Bickel (1975) calls a Type II one-step
procedure.
Since
~KW
is in canonical form, i.e.
~KW
= IC~
,the one-step
KW
operation in (4.3.2) appears deceptively simple.
Although the computational savings are apparent the ultimate justification for employing one-step estimators lies in showing that the estimate and
its one-step version are asymptotically indistinguishable in the sense that
N~(8(1)_8) = 0 p (1). Equivalently it is sufficient to show that (4.3.1) reA
A(l).
mains true when 8KW is replaced by 8KW '
i.e. that
68
N!zCe. (1 ) -8 )
KW
0
C4.3.3)
1
N
= N-~ L ~KWCy.,x.80)
11
1
+
0 (1).
P
Verification of C4.3.3) is long and tedious and will not be undertaken. The
crucial requirement is that the sequences {SKW,N(8)} and {Nkw,N(8)} must be
equi-Lipschitzian in a neighborhood of 8 • The analysis is similar to that
0
encountered in the next section in the course of proving Theorem 4.3
For the noninvariant estimator one first computes 8 , C C8M) , and
M K
A
A
MKC8M).
where
4.4
The one-step version of 8K is just
~KC·,·,·)
is defined in C3.4.5).
Proofs of major results
Throughout this section the reader should bear
in mind that often
"Mathematics consists in proving the most obvious thing in
the least obvious way" - Polya
In proving the theorems of Section 4.2 much use is made of the fol1owing result.
Theorem 4.5
For
f:!, E
R
m
Let {€.} be a sequence of independent random variables.
write XiN(f:!,)
1
= ¢iNC€i'f:!,)
where ¢iN(·'·) is a measurable function
=.
( 0) - 0
from R X R m to R p and'+'~iN·'
to exist and is denoted by
AiN(~).
Th e expec t a t·10n 0f XiN ( uA) 1S
• assume d
Define
Assume the existence of positive constants biN' b, c iN ' c, and dO such that
(4. i)
69
N
(4. ii)
Lb·N::;;Nb,
1
1.
2
(4.iii)
E(UiN(Ll,d)) ::;; ciNdN
L ciN
(4. iv)
::;; N c ,
1
then
Corollary 4.5.1
For any real B <
00,
N
sup
N-~I 1L (X·N(Ll)-A·N(Ll))
1
1.
1.
II Llil ::;;B/N'2
I
=
0
p
(1).
Theorem 4.5.1 and its corollary are essentially i.n.i.d. versions of
Huber's (1967) Lemma 3 and Theorem 3.
The proof given follows closely that
of Huber's.
Proof of Theorem 4.5:
In the proof we can work with any norm equiva-
lent to the Euclidean norm and it is convenient to take lyl=max(IY11, .. , IYrl)
for Y
E
:R
r.
Also suppose w.1.o.g. that dO
= 1. Write
N
L (X·N(Ll)-A·N(Ll))
1.
1.
1
ZN(.6.)
= --'k
----2
N +NILlI
We must show
as N +
00.
Let J be a large positive integer to be specified later.
define the sets
cK =
Set q = J
-1
and
70
The sequence of integers {K } is also specified later.
N
Given any
£
> 0 we
have the following inequality.
KN-l
IZN(ll) I ~ 2d
p{ sup
(4.4.1)
Illl~l
L
~
K=O
+ p{
p{ sup
llECk
IZN(ll) I
~
2£1
sup
IZN(ll)! ~ 2d .
Ill\EBK,N
Subdivide C into R smaller cubes with edges of length
K
2d. = (l-q) K q
(4.4.2)
K
such that the coordinates of their centers are odd multiples of d
K
and the
distance from the centers to the origin is
(l-q) K(1_q/2) .
(4.4.3)
·e
Label these cubes CK, 1"'" ~ , R'
of K.
Let
It turns out that R < (2J) m, independent
. denote the center of CK ., j=l, ... ,R.
,J
equality we have
~K
,J
~
(4.4.4)
Suppose now that II E
Using an obvious in-
R
I p{ sup IZN(~) I ~ 2£} •
j=l.6E1<,j
c.. , then we know
J(~
J
(4.4.5)
N
We can write
ZN(ll) =
L(X·N(ll) - X'N(~K .) + A·N(E: .) - A·N(ll))
1 1.
1.,J
1.
"1<,J
1.
-----~---------
k
N2 +Nllll
N
I(X'N(~K
.) - A·N(F,K .))
1.
,J
1.
,J
1
+
!<
N2
+
Nilli
71
and hence for
~ €
C . (4.4.5) implies
K,J
N
I1
IZN(~) I
U'N(~K
.,d ) + IA'N(~K .) - A'N(~)
1,J K
1,J
1
I
;s; - - - -......t : - - - - - - - Z
N + NI~I
•
Since for ~
d
K
€
C . we have IA'N(~K .) - A'N(~)
K,J
1,J
1
= (1_q)K(1_q/2) + (1_q)K q / 2 ;s; 1.
I :S;E(U'N(~K
.,d ))
1,J K
and I~K
·1
,J
Assumption (4.i) implies
(4.4.6)
I~ I € CK,J.
From (4.4.6) and the fact that
implies
I~ I > (l-q) K+l
we get
N
·e
(4.4. 7)
IZN(~)
I
.I_n(UiN(~K,j.
,d K) + biNd K)
K:--l=----N(l-q) +
:s;
N
I (X'N(~K .) - A'N(~K .))
1
,J
1
,J
1
+
whenever
~ €
N~
CK .
,J
Let TN K . and
, ,J
VN K . represent the two terms appearing in the r.h.s.
, ,J
of (4.4.7) respectively.
(4.4.8)
+ N(l_q)K+l
Then from an obvious inequality
IZN(~) I ~ 2d :s; Pr{T
p{ sup
~€CK .
K . ~
N, ,J
E:}
+ pr{V K . ~ d.
N, ,J
,J
But
(4.4.9)
Pr{T N K . ~ d
, ,J
N
:: pr{I(u'N(~K . ,d ) -If. N(~K . ,d K)) ~
1 1,J K
1"J
N(l-q)
where
UiN (·,·) =E(U iN (·,·)).
K+l
N
E:-
\
ebiNdK+UiN(~K,j,dK))}
Together (4.4.9), (4.i) and (4.ii) imply
+
72
N
P{TN K .;:: d ~ PO: (U·N(E,;K . ,d K) - U·N(E,;K . ,d K)) ;::
, ,J
1
~,J
N(l-q)
K+1
(N(l-q)
K
2 b N(l-q) q}.
E: -
We now choose J so that with q = J
(4.4.10)
~,J
K+1
-1
E: -
K
K~ ~
2 b N(l-q) q) > (N(l-q) q :lc 2 ) .
A little algebra will verify that
(4.4.11)
J >
c/E: 2
+ 4 +
8b/E:
With J chosen accordingly we get
is sufficient.
~
P{TN, K, ].;:: d
(4.4.12)
·e
The last inequality follows from Chebyshev's inequality, (4.iii), and (4.iv).
Returning to V K . note that from elementary considerations together
N, , ]
with Chebyshev's inequality we get
(4.4.13)
p{V K .;::
N, , ]
~
d
N
p{ IL (X·N(E,;K .) - A.N(E,;K .))
1
,J
~
N
~
,J
I ;::
E:
N (1_q)K+1} s:
2
L E(X·N(E,;K
.))
1
~
,J
E: 2N2 (1_q)2K+2
Obviously
IX.N(~K.)
I s: U.~ N(0,(1-q)K(1-q/2)).
~
,J
that E(X:N(~K .)) s:
1
,
J
C.
~
N(1-q)K(1-q/2).
Thus from (4.iii) it follows
Using this we have from (4.4.13)
73
(4.4.14)
p{V N K . ~ d
, ,J
~
c
2
K+2 .
£ N(l-q)
Together, (4.4.8), (4.4.12) and (4.4.14) imply
(4.4.15)
p{ sup
liEC
.
K, J
IZN(li) I ~ 2£ }
~
-1
N (l-q)
-K
(1
+
c
2
2 ),
£ (l-q)
which in light of (4.4.4) and the bound on R implies
(4.4.16)
Summing the appropriate geometric series one finds from (4.4.16)
K -1
N
(4.4.17)
1\
L.
p{ sup
K=O
IZN(li) I
< (23) m
~ 2d -
N
1
c
2) --""";K~-l-' .
q(l-q) N
£ (l-q)
(l + 2
liEC K
+
_
K
N
UiN(O,(l-q) ).
(4.4.18)
Apply the Markov Inequality to (4.4.18) and use (4.ii) to find
N_
(4.4.19)
p{
sup IZN(li)
liEB K, N
I
~ 2d ~
2I U· N(O,(l-q)
1 1
K
N
)
!-<
2£ N:l
Finally, from (4.4.1), (4.4.17), and (4.4.19), we get
c
(4.4.20)
(1 + 2
E
!-<
+ N :l(l_q)
2)
(l-q)
KN
b/E.
1
Thus
74
The proof is concluded by showing that we can choose
of (4.4.20) tends to zero as N ~
so that the r.h.s.
This requires
1
K
N~(1-q) N ~ 0,
(4.4.21)
K
N (l-q) N
The choice
~
~
00.
log N
will do.
= [(-3/4) log(l-q) ]
Proof of Corollary 4.5.1:
B2/d~,
00.
{~}
Let~'
D
have norm I~'
I
~
B.
For all N ~
N-~I~' I ~ dO; hence for all large N
(4.4.22)
·e
Note that for I~'
IS
B
(4.4.23)
From (4.4.22) and (4.4.23) we find
=
(4.4.24)
0
p
(1)
However, (4.4.24) is equivalent to
=
which proves the corollary.
0
p
(1)
D
75
Remark
=
0
p
The requirement that XiN(O)
==
-!q;,N
0 can be weakened to N 2LlXiN(O)
(1) by applying Theorem 4.5 to the random quantities X~N(~)=X.N(~)-X.N(O).
111
In proving consistency the following definitions are helpful.
Definition 4.1
A sequence of concave functions is said to be aC(I)
if every subsequence contains a further subsequence converging to some necessarily concave function.
When the sequence of functions is random and con-
vergence is in probability the sequence is said to be ac (1).
p
Definition 4.2
For a real function f(·) the set {yER P : fey) ~a} is
called Lev (a, f) .
We are now in a position to prove the consistency results.
The proofs
of Theroems 4.1 and 4.2 are nearly identical so only that for the invariant
·e
eatimator is given.
The proof is long and detailed and starts with some pre-
liminary lemmas.
Lemma 4.1
Suppose f ( .) and g (.) are finite concave functions on R p.
If f(.) has a unique maximum and f
y
E
~
g then g attains its maximum at some
R P and the sets Lev (a,g) are bounded for all a.
Proof:
Let YO be the maximum point of f.
Then Lev(f(YO),f) = {yO} and
by Rockafellar (1970) Corollary 8.7.1, Lev(a,f) is bounded for all a.
Since
f dominates g it follows that Lev(a,g) S Lev (a,f) and hence Lev(a,g)
is
bounded for all a.
The remaining conclusions follow
from the continuity
o
of g.
In the following lemma the space of all square pxp matrices M is normed
with the usual operator norm.
The product space RP x M has typical element
76
/). = (8,Q) where 8 E
Lemma 4.2
sup
II/).II~B
IN
Proof: Let
:R P ,
Q EM, and nonn 11/).11
Assume (Cl) and (C2).
_IN
= max(11811, IIQII).
Then for every y
:R P and B
T T l
--,-:::..---)1 =
L(y.-F(x.8 )) (log F(x.y)) min(l,
0
1
E.
E
1
1
m(x:e)IIQx.11
1
1
1
= (Y.-F(x~80))'
111
b.
<
(1),
0
P
= (8,Q), /)., = (8',Q') and define
T
~
k
t. (8,Q) = min(l, ljm(x.e N ) II N:lQx·1I )
1
N
1
1
(4.4.25)
Since Ilog F(t)
I
~ 1 + Itl,
(4.4.26)
Recall that met)
= max(F(t),
l-F(t)).
With this in mind differentiate
tiN(y,Q) with respect to y to find
By the Mean Value Theorem then
(4.4.27)
From the inequality Imin(l,s-l) - min(1,t(4.4.28)
1
)
I~
Is-tl we get
I
k
It.1 N(e' , Q) - t 1.N( e' , Q') I ~ Nk1n (x.T1 e 'N:l)
II Qx.1 II -
00
II Q' x.1 II
I
77
The triangle inequality, (4.4.27), and (4.4.28) imply
(4.4.29)
From (4.4.26) then we have
Ix.N(Ll)
-X.N(LlI)1
~
~
~ 41Ix.IIC1+ lyTx . !) II Ll-LlI II·
~
~
This implies
(4.4.30)
Let b'~ N
·e
sup
IILl-LlIII~d
=
IX.N(Ll) - X.N(LlI) I ~ 411x·11 (1 + JyTx .
~
~
~
~
l) II Ll-LlI II.
T
411x.11
(1+ Iy x.I),
then (C2) implies
~
~
N
(4.4.31)
I1
bON
~
= OCN)
2
From the inequality (s+t)2 ~ 4Cs 2+t ) we get
(4.4.32)
Since 0 ~ t iN (·,·) ~ 1, (4.4.27) and the ineqaulity min(l,t 2) ~ t
C4.4.33)
It·~ NC6,Q) - t 1 NC6 1 ,Q)
Similarly we get
(4.4 .34)
o
I2
2
k
~ min (1, 2N 2 1Ix·11116-6 1
~
II)
imply
78
Thus, (4.4.32), (4.4.33), and (4.4.34) imply
!t· (8,Q) - t· (8' ,Q')
1N
1. N
I2 .,;
!<
4N:lllx·11 (IIQ-Q'
1
II
+ 2118-8'
II)
and hence from (4.4.30)
U~N(~,d)";
(4.4.35)
Let c
N-
l
iN
1.
=
(N-l(l + lyTx.I)2)
1.
_!<
l6N :1(1 + lIyll)
2
3
Ilx.ll,
then
1
L~ N-~llxiIl3 = 0(1).
16N~lIx.lld
1.
16N-~(1
::;
N
L c.1.N = O(N)
1
+ IIYII)2 I1x . 11 3 d.
1
and
if
only
if
From the Cauchy-Schwartz inequality
(4.4.36)
·e
N
L c iN = O(N)
and thus (C1) insures
1
The assumptions of Theorem 4.5 are satisfied with A (6)
iN
=0
and dO
I
0
Thus, from Corollary 4.5.1 we know
which in turn implies
(4.4.37)
But (4.4.37) is equivalent to
(4.4.38)
sup IN
II ~ II ::;B
-1 N
L E. (log
1
1.
T
T
1.
1
F(x.y)) (1-min(1,1/m(x.8) IIQx·ll))
1
=
(1).
P
= 1.
79
-1 N
The proof of the lemma is completed by showing N
T
L £.log
F(x.y) = o
11
1
(1).
P
0
Verification of this last step is straightforward.
In Chapter II we appealed to some results by Andersen and Gill (1982)
concerning random concave functions.
In this section we need a slight gen-
eralization of these results.
Theorem 4.6
LetE be an open convex subset of RP and let F , F , ...
1
2
be a sequence of random concave functions on E such that V y E E, FN(Y) ~ F(y)
as N +
00
where F is some, possibly random, function on E.
Then F is also con-
cave and for all compact AcE,
sup IFN(Y) -F(Y)I ~ 0 as N
yEA
Proof:
+
00.
The only difference between the above result and that of Ander-
sen and Gill (1982) is that the limit function F is allowed to be random,
Le. for fixed Y E E F(y) may be a random variable such that FN(y) ~ F(y).
Andersen and Gill's proof works for this generalization without modification.
o
Our interest actually lies in the following simple corollary.
Corollary 4.6.1
Suppose that with probability equal to one F has a
A
unique maximum at Yo
Proof:
E
E.
Let YN maximize F ·
N
A
P
Then Y + Yo as N +
N
The proof, a simple exercise in analysis, is omitted.
00.
o
We are now in a position to prove the consistency results.
Proof of Theorem 4.1:
(4.4.39)
W.
1N
Write
. ~
2/ 2 T A_I
= m1n
(l,b m.x. M x.)
111
A
where b, M, and m.
1
=
T
m(x.6)
are defined in (4.2.2) and (4.2.3).
1
Define the
80
random function of y
E:
~-(y)
(4.4.40)
-N
R p.
IN
= N- I1
T
T
(y.log F(x.y) + (l-y.) log(l-F(x.y)))
1
1
Note that wiN does not depend on y.
(4.2.3a) implies GN(e)
= O.
1
1
W' '
1N
Since
Also. since
{GNCo)} is a sequence of random concave functions each attaining its maximum
A
at
A
ec=eN) .
The inequality Ilog F(t)1 s 1+ It I is used to show
s 2N-
C4.4.4l)
1 N
I
1
T
Cl+ \y x.
I)
1
Since ENCllxll) is bounded and the inequality in C4.4.4l) holds almost surely.
the sequence {GNCo)} is aCCl). Rockafellar (1970). Theorem 10.9 (see also
Definition 4.1).
In particular. this means that every subsequence of {GNCo)}
contains a further subsequence converging in probability
surely) to some. possibly random. concave function.
Cin fact
almost
Suppose for the moment
that the limit of every convergent subsequence of {GN(o)} has. with probability one. a unique maximum at 8 ,
0
Pick any subsequence {eN} and let {G Co)}
N
K
K
be the corresponding subsequence from {GNCo)}. By supposition we can find
GN Co) converging in probability to some G .Co) having. with probability
K.J
K.j
one. a unique maximum at 8 , Since 8
. maximizes G
it must be that
0
N
N
P
A
eN
.
K. J
~
K.J
80 by Corollary 4.6.1.
k.j
"Thus. every subsequence
{8
NK
} possesses
81
a further subsequence converging in probability to 8 •
0
However, this implies
that 6 converges in probability to 8 . Thus, to prove consistency, we need
0
N
only demonstrate that the limit of every convergent subsequence of {GN(e)}
possesses a unique
maximum at 8 with probability one.
0
Suppose therefore that
(4.4.42)
where G(e) is some, possibly random, concave function.
G(e) has a unique maximum at 6 .
0
We want to show that
With b and E as in (4.2.1) and (4.2.2)
note that the inequalities
1/2
T'"
~
~
m(x.6)
~
,?:
imply wiN ~ min (I, E/ II x i
I
2
E
II) .
Thus, since IIx.11 ~ 1 we can find €* > a such
1
that
~ €*/llx·11
.
1
Define
(4.4.43)
IN
=E*N- iLe(y.log
1
UN(Y)
and let UN(Y) ;. EN(UN(Y)).
opel), (ii)
{UN(o)}
T
T
F(x.y) + (l-y.)log(I-F(x.Y)))/llx.11
1
~
1
1
It should be apparent that (i) UN(Y) - UN(Y)
= aC(I),
and (iii) each UN(o) possesses a
=
maximum at 6 .
0
Furthermore, condition (C2') implies that the limit of every convergent subsequence of {UN(o)} has a unique maximum at 6 .
0
Returning to G (0) note that
NK
(4.4.44)
82
Pick a further subsequence such that U
(e)
+
U(o).
Then obviously
NK , j
G(o)
~
U(e).
Since G(') may be random this last inequality as well as (4.4.44) should be
interpreted as almost sure results.
By Lemma 4.1 G(') attains its maximum
and the sets L(a,G) are bounded for all real a (again these are almost sure
results).
Since eN
maximizes
~
K
(.) this implies the existence of a constant
K
R < 00 such that
A
(4.4.45)
II
Pre lieN
~ R)
+
1
as NK
K
+
00.
Define
(4.4.46)
IN(Y) =N
_IN
T
T
T
T
2(F(x.8 o)log F(x.Y) + (l-F(x.eO))log(l-F(X. y ))) w1.'N .
11.·1.1.
1.
Then
(4.4.47)
where
A_I
Let Q
N
l k
Tl(y) = N- L E.(log F(x:y)) w' N '
K 1 1.
1.
1. K
2A
= b M and note that by (4.2.2)
(4.4.48)
E
A
The matrix Q is positive definite.
(4.4.48) we have
(4.4.49)
o~
Al
Q~ ~ E
-1
I
pxp
2
I
pxp
A~
Denote its square root by Q , then from
83
Because of (4.4.45) it must be that
IT I (y) I
IT I (y) I
=
A
l(118N II~ R) +
0
K
where leA) = indicator of the event A.
P
(1),
Thus neglecting terms of vanishing
order,
(4.4.50)
sup
11811~R
1
IIQ~II~l/e:
Appealing to Lemma 4.2 we have that ITl(y)
IT (y)
2
I
I
= opel).
Similarly one finds
= opel) and thus from (4.4.42) and (4.4.47) it must be that
J
(.)
NK
e.
G(·).
However, for each N the concave function I (·) has a maximum at 8 ,
N
0
(C2')
together with the inequality w'
1. N
Also
~ e:*/lIx.11 imply that for all large N
1.
I N(') has a unique maximum at 80 and, more importantly, that the limit of
every convergent subsequence of I (') has a unique maximum at 8 ,
0
N
G(') also has a unique maximum at 8 with probability one.
0
Therefore,
This completes
the proof.
D
With the consistency results established we turn our attention to
Theorems 4.3 and 4.4.
Because of their similarity the proof is given for
the invariant estimator only.
For
II e-8 0 II
~ <5 define
. k
2 2 T
TA_l
w. (8) =ml.llz(l,b 1m (x.e) x.M x.),
1.N
1.
1.
1.
(4.4.51)
£.(8) = y. - F(x:8).
1.
1.
1.
84
In proving Theorem 4.3 we appeal to Theorem 4.5 with
C4.4.52)
XiNCli) = £i C6 0 +li) xiwiNC60+li) - £i (80)xiwiNC80),
=
\NCli)
T
T
CF Cx i 8 0 ) - FCxi(60+li)))xiwiNC80+li),
To simplify notation we set
~
AN(li)
N-
N
I
L AiN(li).
I
The following result is needed.
Lemma 4.3
Under the assumptions of Theorem 4.3 there exists a> 0 such
that
lilill ~ 0,
Proof:
Fix y
TN
,y
Since T
C.): :R p
N,y
E:
(8)
+
N large.
:R P and consider the function of 8
E:
:R P
_IN
T
T
T
L((F(x.8 ) - F(x. (8 +8))) w·NC80+li)x.y)
0
1
1 0
1
1
1
=N
lR 1
the Mean Value Theorem implies
(4.4.53)
~
~
for some 8 on the line segment joining 8 to the origin.
on the chosen vector y, however
(4.4.52) can be written as
Now take y
(4.4.54)
=8
= li to arrive at
II'SII
~ 11811 for all y.
Note that 8 depends
Since TN
,y (0)
=0
85
where
(4.4.55)
and I\'EII ~ b..
~
= N- 1
N
\'
L
I
F(l) (xTi(e
+ A)) w'
O0
1.N
(8 0+ 0A) x.x.T
1.1.
However, (4.4.54) implies 1~(b.)b.1 ~ 1\b.II
Z
AI(~) and hence
Assumption (NI) implies the existence of some positive a such that wiN(eO+l'.)
~ (if I\x.11 whenever 11b.11 ~ 0 and hence
1.
o
The existence of the constant a> 0 follows from (CZ').
We are now in a position to prove Theorem 4.3.
-e
Proof of Theorem 4.3:
Suppose for the moment that with XiN(l'.) defined
in (4.4.52)
(4.4.56)
II
X·N(b.)
- X·N(b.')
1.
1.
whenever 11b.11, Ill'.'
of Theorem 4.5.
II
~
With
o.
~
II
~ const. (1 + IIx·ll)
1Il'.-b.'
1.
Then, using (CI) one can verify the conditions
= eN-eO
it then follows that
(4.4.57)
For all large N Lemma 4.3 implies
and this in turn implies from (4.4.57)
N(4.4.58)
k2 N
II
A
L (X·N(l'.)
1
1.
A
- A·N(l'.))
1.
=
0
P
(1)
•
86
Note that
_~ ~
Z =N
N
•
(4.4.59)
L
1
A
XON(~)
1
= N_~
~
L £.(8 0)x.wO N(8 0)
1
1 1
1
and without much trouble one can show ZN
= Opel).
Equation (4.4.58) implies that for every
n
> 0
which in turn implies
Take n = a/2 to find
1
A
p{N~11 AN(~)
1
A
that is, N~ II ~ (~)
(4.4.58) is
0
p
II = 0p (1)
II
~ 211 zNII
also.
+ oj -+
1,
This implies that the numerator in
(1), Le.
(4.4.60)
= 0
P
(1).
But
1/
(4.4.61)
I
A
N'Z L(M =
.~
N->2
N
T
L (F(x.8
)
1
1 0
T
A
A
- F(x.8))x. woN(8)
1
1
1
and by applying the Mean Value Theorem to each summand in (4.4.61) we get
(4.4.62)
where
l
D* = NN
~
L F(l) (xT.8A.) x.x.T
11111
w·N(e)
A
1
and each 8 lies on the line segment joining 8 to 8 . With a routine analysis
0
i
(el), (Nl), (N2) and the Lipschitz property of
{~(.)}
are used to show
87
(4.4.63)
where D (8 ) is defined in (3.4.29).
M1 0
Finally, (4.4.60), (4.4.62) and (4.4.63) imply
which is the desired result.
Thus, to prove Theorem 4.3 it is sufficient to
verify (4.4.56) whenever 11611, 116 1 II
::;
0.
One finds
·e
and employing the Mean Value Theorem to F(') along with some obvious inequalities we have
(4.4.64)
IXiN (6)
- XiN (6 1)
I ::; II xi
III wiN ( 80+6) - wiN (8 0+6 1) I
+ Ilxill
Note that
2
wiN(80+61)116-6111.
II~(·) II : ; EN(ll x I1 2); thus by (el), 1I~(8) II is bounded for all 8.
Also, (Nl) implies that lI~l(8) II is bounded whenever 118-8011 ~ 0. Together
these two facts imply
(4.4.65)
Using (4.4.65) a simple argument shows that there exists Kl such that
(4.4.66)
l
The inequality Imin(l,s-l) -min(l,t- ) I ~ \s-tl/st, (4.4.65), and some
88
elementary inequalities are used to show
(4.4.67)
\W
iN
(6 +lI) - w (6 +lI')
iN 0
0
I
~
K2(llxilllm(xi(60+lI)) -m(xt(6 0+t1'))
+ Ilxillll
for some constant K .
2
~(eo+lI)
Since Im'(t)
-
~(60+lI') II
I
)/ll x i I1
2
I = F(l)(t)
(4.4.68)
Also from the Lipschitz property of
IIMN(-) II
we know there exists some K3
such that
(4.4.69)
-
~
However, (4.4.67)-(4.4.69) imply that there exists K4 such that
(4.4.70)
Finally, (4.4.64), (4.4.66) and (4.4.70) imply
II
X· ([\) - X· ([\')
~N
~N
II
~ K( Ilx·11
+ 1)
~
for some constant K, thus completing the proof.
11[\-[\' II
o
CHAPTER V
APPLICATIONS
5.1
Introduction
How does it work in practice?
After wading through the theory of the
previous chapters this is a natural and legitimate question.
In this chap-
ter a partial answer is provided by applying the proposed techniques to two
sets of data.
The first, originally due to Finney (1947b) has appeared
elsewhere in the context of diagnostics and robustness, Pregibon (1981,
·e
1982).
The response indicates the occurrence or nonoccurrence of vaso-con-
striction in the skin of the digits.
This is to be regressed on the loga-
rithms of the rate and volume of air inspired. The second data set is from
a study relating. coronary heart disease
pressure and cholesterol level.
sure-7S) and log(cho1esterol).
and is referred to as SVC.
(yes-no)
to
systolic
blood
The predictors are log(systolic blood presThe first data set contains 39 observations
The latter is designated CHD and contains 108
observations.
There is no attempt to draw conclusions from the data.
Interest lies
not in the analysis per se, but rather in the method of analysis.
In parti-
cular we want to determine what information, if any, the bounded influence
estimators tell us about the data, that we would not learn from the m.l.e.
Also, the data provide
the proposed procedures.
a means to examine and fine tune the performance of
After a discussion of computational strategy the
analyses appear in Section 5.3.
90
5.2
Computation
Several approaches to the computational problems are possible.
The
algorithm outlined here is that which was implemented to obtain the estimates in the following section.
efficient.
It is not necessarily best or the most
The sample size is now fixed and it should cause no confusion
if the subscript N is dropped from various vectors and matrices.
In its
place lower case In' is used to denote the nth iterate of an estimate.
First, the algorithm for obtaining the Mallows invariant estimate, 8 ,
MI
is discussed.
It follows from the results of Chapter III that
~I(8)
exists only
if b > L(8) where
(5.2.1)
In the algorithm the bound is chosen accordingly.
b
= b(8)
(5.2.2)
'More
specifically
has the form
bee)
= BL(e),
for B > 1.
The choice of B > 1 is made by the statistician.
Define
(5.2.3)
For now t(·) is assumed general.
Later we make some specific suggestions.
A
Our problem is to find 8 satisfying
(5.2.4)
N
TA
A
A
A
I(y.-F(x.8)) w(X.,e,M_ (8), b(e))x.
1 1
1
1
-M I
1
= o.
The iterative procedure used to solve (5.2.4) is suggested by (4.2.7). Pick
some initial estimate eO and define recursively
91
A
-1 A
_IN
TA
A
A
A
8 1=8 + D (8)N 2(y.-F(x.8 ))x.w(x. ,8 ,M (8 ) ,b(8 ))
MI n
n+
n
1 1
1 n
1
1
nM I n
n
A
(5.2.5)
The matrix D (8) is defined in (3.4.29).
MI
Were it not for the fact that
MMI(8) is implicitly defined this would be a fairly routine procedure.
A
For-
A
tunately, computation of MMI(8 ) is straightforward.
n
Let MO = 1(8n )
and
define recursively
(5.2.6)
This procedure necessarily converges (see the proof of Theorem 3.5) and
the limit is just MMI(8 ).
n
8 + from 8n
n l
Thus the major steps involved
in
computing
are:
A
-e
1)
Compute b(8 ) from (5.2.2);
n
2)
Compute MMI(8 ) according to (5.2.6);
n
3)
Compute D (8 ) from (3.4.29);
MI n
4)
Compute 8n+ 1 from (5.2.5).
A
(5.2.7)
A
Upon convergence (5.2.4) holds and
(5.2.8)
A
MMI(8) = EN(F
(1)
TA
T
(X 8)XX
2
A
A
A
w (X,8,MMI(8), b(8)).
A
A
The achieved or realized bound is b(8) = BL(8).
The construction of the one-step Krasker-Welsch estimator has already
been outlined in Section 4.3.
for 8 = 8.
The only problem is to solve equations (3.4.8)
In this procedure the bound remains fixed.
is discussed in the next section.
(5.2.9)
Define
Its determination
92
(5.2.10)
The problem is to solve
(5.2.11a)
J 1 (C,M)
=0
(5.2.11b)
for C and M.
The result will be C (8) and M (8). The point of view is
KW
KW
taken that (5.2.11b) defines M as a function of C. For any given C, M(C)
is found as follows.
Let
and define recursively
-e
(5.2.12)
The sequence {M } necessarily converges to M(C) (see the proof of Theorem
K
3.5, Section 3.4).
(5.2.13)
With this is mind set Co = 0 and define recursively
C 1 = C + (l/ENAe(r~(b2/(5I--C )TM-1(C )(5I--C ))))J (C ,M(C ))
n+
n
n
n
n
n
1 n
If C converges to some C* then it follows that.
n
J (C*,M(C*))
1
=0
and
M(C*)
=
J (C*,M(C*)).
2
A
Thus C* and M(C*) are just C (8) and
KW
~W(8),
the desired quantities.
The
motivation for (5.2.13) derives from the fact that ENe(r~(b2(5I--C)TM-1(C)(~-C)))
pxp is an approximation to -(a/aC)J 1 (C,M). Thus (5.2.13) is a type of
quasi Newton-Raphson iteration. Of course there is no guarantee that {C }
x I
n
93
converges.
Although no problems were encountered in the following examples
it is possible that other data sets might require a more sophisticated technique.
A
(1)
In summary, the major steps involved in computing 8
KW
5.3
-e
= 8MI ;
1)
Compute 8
2)
4)
Compute CKW (8), MKW (6) as outlined above;
A
Compute D (6) from (3.4.10);
KW
A
Compute ~KW("" 6) from (3.4.9);
5)
A(l)
Compute 8 KW from (4.3.2).
3)
are:
Examples
In the first pass at the data the optimal Mallows invariant estimate
(t:: 1) was computed for several choices of B> 1.
For each such estimate
the K-W one-step estimate was calculated using the realized bound from the
Mallows procedure.
estimates.
This standardization provides a basis for comparing the
Also computed was a one-step modified version of the K-W esti-
mate obtained by fixing C = O.
If this differs little from the full K-W
one-step, a case could be made for using the computationally simpler est i(
mate.
The results for the SVC data appear in Table 5.1 and those for the CHD
data in Table 5.2.
by example.
The significance of the table entries is best explained
Consider the SVC data and in particular the entry corresponding
to the Mallows estimate for B = 1.75.
(-4.27, 6.25, 6.99).
The estimated coefficient vector is
The numbers in parentheses are z statistics obtained
by dividing the estimate by its estimated asymptotic standard error. Finally
94
the seventh entry gives an approximate measure of efficiency relative to maximum likelihood.
8~
For an estimate
with asymptotic covariance
V~(8)
define
the trace efficiency as
tr(l- l (8))
- --
treff(8) =
(5.3.1)
The table entry is
tr(V~(8))
treff(8~)
where
8~is
the appropriate estimate.
The vertical trends in the tables will be discussed shortly.
For com-
parison of the different estimators the horizontal trends are more interesting.
The K-W estimator follows
to shrink back toward the m.l.e.
8 quite closely with a slight tendency
MI
The opposite is true of the modified K-W
estimator suggesting that the correction vector C(8) does have an effect on
the estimator.
-e
Furthermore, the discrepancy between
creases with decreasing B, as might be
expecte~.
e~~) and e~~)(c=O) in-
As B decreases all the
estimators tend to expand away from the m.l.e. with the trend most notable
for 8 (C=0). The z statistics remain fairly stable throughout the table
KW
with the exception of the jump from B= 1.50 to 1.25 for the CHD Data. With
B= 1.50 the coefficient of log cholesterol is not significant at the .05
level for any of the estimators.
tained by
eMI
and
e~~)(c=O)
However, at B= 1.25 significance is ob-
and very nearly so' by
~~~).
Having compared the estimators among themselves it is time to determine
if they are performing as intended, i.e. are they resistant to changes in
a small percentage of the data?
Pregibon (1981, 1982) has shown that obser-
vations 4 and 18 are influential points in the SVC data.
these are removed?
e
What happens if
A look at Figure 5.1 indicates that the cleaned data is
very nearly singular in the sense that almost perfect discrimination is possible.
Attempts at obtaining an m.l.e. fit to the reduced data with the SAS
PROC LOGIST procedure failed.
The near singularity of the data as (4, 18)
95
are progressively downweighted also explains the lack of convergence of
8 for B= 1. 25. By judicious choice (= lucky guess) of starting value a
MI
convergent solution to the likelihood equations was eventually obtained.
The estimated coefficients and z statistics are
-24.58 (-1. 75)
(5.3.2)
31. 94 ( 1. 80)
39.55 ( 1.70)
Points 4 and 18 are more than influential; together they virtually determine
the fit entirely!
The bounded influence procedure with B= 1.5 produced the
estimates
"
8
MI
(5.3.3)
-e
eel)
(C=O)
KW
"(1)
8
KW
-23.41 (-1. 72)
30.43 ( 1. 77)
-23.44 (-1. 72)
30.47 ( 1. 77)
-24.79 (-1. 68)
32.20 ( 1. 72)
37.60 (1.67)
37.66 ( 1.67)
39.89 ( 1. 63)
.93
.92
.93
Certainly it is unfair to judge the sensitivity of the estimators by their
performance on this unusual data set.
The CHD data provides a more reason-
able setting.
An application of Pregibon's diagnostics to the CHD data suggested
that observations 4 and 35 are influential.
After deleting these the m.l.e.
for the cleaned data is
(5.3.4)
-76.92 (-3.09)
5.48 ( 3.49)
9.27 ( 2.45)
~
The most dramatic change is in the coefficient for log cholesterol which is
not significant at 108 observations, yet is for 106 observations.
In this
96
example the bounded influence estimators perform more as expected.
For
B= 1. 2S the following estimates were found.
A
"'(1)
8
KW
8
MI
(5.3.5)
e(1) (C=O)
KW
-68.89 (-2.90)
-69.05 (-2.89)
-82.46 (-2.80)
4.86 ( 3.27)
4.93 ( 3.28)
8.30 ( 2.26)
5.29 ( 3.15)
8.33 ( 2.28)
.87
9.90 ( 2.27)
.88
.83
These should be compared with the estimates in Table 5.2 corresponding to B = 1.25.  For the coefficient of log cholesterol the least change occurred in the Mallows estimate, the greatest in θ̂_KW^(1)(C=0).  In terms of inference θ̂_KW^(1) was least stable.
In deriving the Mallows type estimators an effort was made to increase flexibility by introducing the attenuating function t(x, θ).  The reader no doubt is wondering if this aspect of the theory has any application.  The choice (F^(1)(x'θ))^ε, suggested earlier, proved unsatisfactory both in terms of protection and efficiency.  Output from the bounded influence procedures provides some insight into the problem of selecting t(x, θ).  This also presents the opportunity to comment on the diagnostic aspects of a bounded influence fit.  First the diagnostic plots are discussed, then selection of t(·, ·).
The output considered here corresponds to B = 1.50 for the SVC data and B = 1.25 for the CHD data.  All the estimators have a representation as weighted m.l.e.'s.  The Mallows weights, depending only on position in the design space, are not very informative as a diagnostic tool.  However, to give the reader a feel for the weighting pattern, Figure 5.2 contains a contour plot of the weights corresponding to all potential predictor pairs.  The general tendency, as expected, is for the weights to decrease as distance from the center of the design increases.
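Since each estimator can be written as a weighted m.l.e., the representation itself is easy to illustrate.  The sketch below is only a fixed-weight caricature, not the bounded influence algorithm: it solves the weighted likelihood equations by Newton-Raphson for weights supplied in advance, whereas in the actual procedures the weights depend on θ and are updated at each iteration.  The data and weights are placeholders.

```python
import numpy as np

def weighted_logistic_mle(X, y, w, iters=25):
    """Logistic fit in which observation i carries a fixed weight w_i:
    solves sum_i w_i (y_i - F(x_i'theta)) x_i = 0 by Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))            # F(x_i'theta)
        score = X.T @ (w * (y - p))                       # weighted score
        info = X.T @ (X * (w * p * (1.0 - p))[:, None])   # weighted information
        theta += np.linalg.solve(info, score)             # Newton step
    return theta

# Placeholder data: intercept plus one covariate; one point sharply downweighted.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))).astype(float)
w = np.ones(200)
w[3] = 0.3                      # e.g. downweight the 4th observation
print(weighted_logistic_mle(X, y, w))
```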
The K-W weights do provide a convenient diagnostic tool.  For the SVC data the only observations assigned weight less than one are 4 and 18, with weights of .34 and .39 respectively.  For the CHD data observations 4, 35, and 71 received weights of .72, .47, and .86 respectively.  Lowering the bound to 1.1 added observations 43, 92, and 105 to the list of those receiving weight less than one.
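Reading this diagnostic off a fitted object is then just a matter of listing the observations whose final weight fell below one; a minimal sketch, with the weight vector below purely hypothetical, is:

```python
import numpy as np

# Hypothetical final K-W weights for a data set of 25 observations.
kw_weights = np.ones(25)
kw_weights[[3, 17]] = [0.34, 0.39]      # observations 4 and 18 (1-based numbering)

downweighted = [(i + 1, round(w, 2)) for i, w in enumerate(kw_weights) if w < 1.0]
print(downweighted)                     # [(4, 0.34), (18, 0.39)]
```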
For the Mallows procedure define the observed influence of the i-th observation as

    (5.3.6)    OI_i

and the normalized observed influence as

    (5.3.7)    NOI_i = OI_i / (realized bound).

Theory suggests that observations with large NOI correspond to those having the greatest effect on the fitted model.
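Computationally the diagnostic is only a rescaling.  In the sketch below the observed influences are taken as given, since the defining expression in (5.3.6) is not reproduced here, and the influence values, bound, and fitted logits are all placeholders; the final lines produce the kind of display discussed next.

```python
import numpy as np
import matplotlib.pyplot as plt

def normalized_observed_influence(oi, realized_bound):
    """Normalized observed influence (5.3.7): OI_i / (realized bound)."""
    return np.asarray(oi, dtype=float) / realized_bound

# Placeholder per-observation influences, bound, and fitted logits x_i' theta_hat.
oi = np.array([0.30, 1.40, 0.20, 1.50, 0.40])
logit_fitted = np.array([-2.0, -8.5, 1.0, 7.9, 0.3])
noi = normalized_observed_influence(oi, realized_bound=1.5)

plt.scatter(logit_fitted, noi)          # NOI against fitted logits
plt.xlabel("LOGITF"); plt.ylabel("NOI")
plt.show()
```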
In Figure 5.3 the NOI for the SVC data is plotted against observation number.  It is evident that numbers 4 and 18 are influential.  However, from this plot one might also suspect #29.  In Figure 5.4 NOI is plotted against the fitted logits, i.e. NOI_i versus x_i'θ̂.  In comparison to neighboring values NOI_29 appears less extreme.  Furthermore, numbers 4 and 18 are revealed as truly exceptional observations.
Plots of NOI for the CHD data appear in Figures 5.5 and 5.6.  The influential points are evident.  While numbers 35 and 71 clearly do not fit the pattern of the data, the same cannot be said of observation 4.  The next plot illustrates why number 4, in conjunction with 35, has such a great effect on maximum likelihood estimation.  Figure 5.7 is an overlay plot of the fitted response function and the actual responses versus fitted logits.
With a listing of the output, sorted by response and fitted logits, important points on the plot are easily identified.  Number 35 has the largest residual among those observations with positive response; number 4 the largest residual among those with negative response.  Figure 5.7 illustrates graphically how 4 and 35 work together to attenuate the slope of the plotted response function.  This explains in part why maximum likelihood, which assigns equal weight to all observations, is so sensitive to these two points.  With the information obtained from the various plots it is now possible to construct an appropriate attenuating function.
To stay within the context of the Mallows theory the t-function must depend only on the design and θ.  It should distinguish between potentially troublesome data and that which is innocuous.  Let L_i(θ) = x_i'θ and L(θ) = (L_1(θ), ..., L_N(θ)).  Certainly any observation for which |L_i(θ)| is large is potentially influential and hence a candidate for downweighting.  Furthermore, these observations contain relatively little 'information'.  Thus downweighting them should have a minimal effect on efficiency.
A reasonable range for the cutoff value is (3, 4), as the logistic distribution function assigns most of its mass to the interval (-3, 3).  Other potentially troublesome observations are those associated with the extremes of the set {L_1(θ), ..., L_N(θ)}.  Often these are synonymous with large |L_i|, as in the SVC data, but, as evidenced by the CHD data, this need not be the case.
Let R_i be the rank of L_i.  For some cutoff value CV and trimming constant a, 0 ≤ a < 1/2, define the indicator function

    (5.3.8)    χ_i = 1  if |L_i| ≤ CV, R_i > aN + 1, R_i < N - aN;
                   = 0  otherwise.

Finally, for a weight 0 < ε ≤ 1, define t*(x_i, θ) = t_N*(x_i, θ) as

    (5.3.9)    t*(x_i, θ) = χ_i + ε(1 - χ_i).

Whenever an observation is potentially troublesome t* assigns it a weight of ε.  All others receive full weighting.
There are some potential problems with using the Mallows estimate corresponding to this choice of t(·, ·).  Before discussing these let's first determine if it performs as expected.  In the following examples the tuning parameters are set at CV = 3.5, a = .05, and ε = .5.
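Under those settings the attenuating function of (5.3.8)-(5.3.9) can be transcribed almost directly; the sketch below uses simple argsort-based ranks for R_i, and the design matrix and parameter value are placeholders.

```python
import numpy as np

def t_star(X, theta, CV=3.5, a=0.05, eps=0.5):
    """Attenuating function t*(x_i, theta) of (5.3.8)-(5.3.9):
    weight 1 for innocuous observations, weight eps for potentially
    troublesome ones (|L_i| beyond CV or L_i of extreme rank)."""
    L = X @ theta                                  # L_i(theta) = x_i' theta
    N = len(L)
    R = L.argsort().argsort() + 1                  # ranks R_1, ..., R_N
    chi = ((np.abs(L) <= CV)
           & (R > a * N + 1)
           & (R < N - a * N)).astype(float)        # indicator chi_i of (5.3.8)
    return chi + eps * (1.0 - chi)                 # t* = chi + eps(1 - chi), (5.3.9)

# Placeholder design (intercept plus one covariate) and parameter value.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
theta = np.array([0.5, 2.0])
print(t_star(X, theta))
```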
Although the SVC data is somewhat unusual it does provide a good arena
for testing the performance of the new estimator.
Setting B= 1.50 produced
the following estimates.
    (5.3.10)      θ̂_MI             θ̂_KW^(1)         θ̂_KW^(1)(C=0)

              -10.02 (-2.22)     -4.66 (-2.49)     -5.31 (-2.45)
               13.54 ( 2.29)      6.72 ( 2.63)      7.57 ( 2.57)
               15.61 ( 2.17)      7.11 ( 2.65)      8.24 ( 2.58)
                 .88               .97               .97
In light of the previous discussion the result for θ̂_MI is expected.  However, it is somewhat disconcerting that the other two estimators shrink back towards the m.l.e. so much.  There are two explanations for this.  First is the fact that it is no longer appropriate to use the realized bound from the Mallows procedure in the one-step operation.  Evidence of this is found in the approximate efficiencies with respect to maximum likelihood.  The jump from .88 to .97 indicates that a lesser bound should be employed for the one-steps.  Of course this would increase the downweighting of observations 4 and 18, bringing the estimate more in line with θ̂_MI.
The second explanation
concerns the viability of the one-step operation in this setting.
As 4 and
18 are downweighted the information matrix approaches near singularity, warning of computational instability.
In such a situation the performance of
a one-step operation is suspect.
Compare the plot in Figure 5.8 to that in Figure 5.4.  With the exception of the points corresponding to 4 and 18 the basic pattern of influence is unchanged.  However, the NOI of the maverick observations is substantially reduced.  Their resulting value of .5 is a consequence of the choice of ε.
Next consider the CHD data.
With B = 1.25 the new estimates are:
    (5.3.11)   108 obs.

                  θ̂_MI             θ̂_KW^(1)         θ̂_KW^(1)(C=0)

              -55.48 (-2.73)    -47.79 (-2.94)    -53.95 (-3.01)
                4.19 ( 3.05)      4.17 ( 3.41)      4.64 ( 3.42)
                6.45 ( 2.04)      5.08 ( 1.95)      5.80 ( 2.07)
                 .95               .94               .70
Again there is a tendency for the one-steps to shrink towards the m.l.e., although to a lesser degree.  The explanation for this behavior is the same as that for the SVC data.
The NOI for the Mallows estimate is plotted in Figure 5.9.
The pattern
of influence is similar to that in Figure 5.6 with the notable exception of
observations 4 and 35.
Apparently t* is performing as expected.
To test for stability the new estimate was calculated for the reduced data, sans 4 and 35.
    (5.3.12)   106 obs.

                  θ̂_MI             θ̂_KW^(1)         θ̂_KW^(1)(C=0)

              -69.54 (-2.64)    -69.95 (-2.94)    -79.37 (-2.91)
                4.89 ( 3.03)      5.00 ( 3.34)      5.66 ( 3.27)
                8.43 ( 2.12)      8.40 ( 2.29)      9.56 ( 2.34)
                 .71               .91               .88
Of the three procedures the Mallows is most stable both in terms of estimation and inference.
However, it also sacrifices the most efficiency.
It might be argued that t* has been tuned specifically to the two examples, thus diminishing its credibility in other situations.
I do not believe
this to be the case.
The motivation for classification as a potentially troublesome observation exists independently of the two examples.  Furthermore, the choices CV = 3.5, a = .05, and ε = 1/2 are quite natural, given the context of the problem.
However, this issue will only be resolved by fur-
ther application of these methods to a wide variety of data.
The potential problems with t* stem from its lack of continuity as a
function of θ.  First, this can cause computational problems.  In the results presented in this section no difficulties were encountered.  For the CHD data with B = 2.0, 1.25 and starting from the m.l.e., the routine converged in only a few steps.  However, for the intermediate value B = 1.50 the routine quickly moved to the appropriate neighborhood but then oscillated between two similar values indefinitely.  The second problem is theoretical.  Although Theorem 4.1 gives consistency for the Mallows estimate only for t ≡ 1, the proof can be pushed through for t* with minor modifications.  The key requirement is that ε > 0.  However, for asymptotic normality it is probably necessary to make some strong assumptions concerning the empirical measure of the predictors, specifically weak convergence to a continuous probability distribution.  This problem remains to be investigated.  Of course both problems could be resolved by obtaining a continuous function behaving similarly to t*.  Work on this approach continues.
TABLE 5.1

SVC DATA

  B           θ̂_MI              θ̂_KW^(1)           θ̂_KW^(1)(C=0)

  ∞ (mle)    -2.92 (-2.27)
              4.63 ( 2.59)
              5.22 ( 2.81)
              1.0

  2.0        -2.95 (-2.21)      -2.94 (-2.21)      -3.07 (-2.22)
              4.65 ( 2.50)       4.63 ( 2.51)       4.81 ( 2.51)
              5.24 ( 2.73)       5.23 ( 2.74)       5.43 ( 2.73)
               .94                .94                .94

  1.75       -4.27 (-2.33)      -4.18 (-2.33)      -4.51 (-2.32)
              6.25 ( 2.51)       6.14 ( 2.51)       6.58 ( 2.49)
              6.99 ( 2.60)       6.88 ( 2.62)       7.40 ( 2.57)
               .94                .94                .93

  1.50       -5.65 (-2.32)      -5.56 (-2.32)      -6.29 (-2.27)
              7.96 ( 2.43)       7.85 ( 2.44)       8.84 ( 2.37)
              8.91 ( 2.42)       8.81 ( 2.43)      10.02 ( 2.34)
               .88                .89                .87

  1.25       DID NOT CONVERGE

Table entries are in the format:
    Estimated coefficient vector (z-statistics)
    Estimated efficiency w.r.t. maximum likelihood
TABLE 5.2

CHD DATA

  B           θ̂_MI              θ̂_KW^(1)           θ̂_KW^(1)(C=0)

  ∞ (mle)   -46.70 (-2.96)
              4.24 ( 3.53)
              4.82 ( 1.88)
              1.0

  2.0       -46.05 (-2.93)     -46.06 (-2.93)     -46.72 (-2.94)
              4.19 ( 3.51)       4.19 ( 3.51)       4.25 ( 3.51)
              4.74 ( 1.86)       4.74 ( 1.86)       4.82 ( 1.88)
               .99                .99                .99

  1.75      -46.67 (-2.93)     -46.66 (-2.93)     -47.93 (-2.95)
              4.20 ( 3.48)       4.20 ( 3.48)       4.33 ( 3.50)
              4.85 ( 1.89)       4.84 ( 1.88)       4.96 ( 1.90)
               .98                .98                .98

  1.50      -47.16 (-2.92)     -47.20 (-2.92)     -50.46 (-2.93)
              4.19 ( 3.42)       4.20 ( 3.43)       4.53 ( 3.45)
              4.95 ( 1.90)       4.95 ( 1.90)       5.26 ( 1.94)
               .96                .96                .95

  1.25      -49.55 (-2.95)     -48.21 (-2.91)     -57.57 (-2.93)
              4.09 ( 3.31)       4.13 ( 3.33)       4.93 ( 3.30)
              5.46 ( 2.03)       5.19 ( 1.95)       6.22 ( 2.06)
               .91                .91                .87

  1.10      -49.64 (-2.83)     -48.17 (-2.83)     -65.22 (-2.78)
              3.97 ( 3.11)       4.07 ( 3.22)       5.39 ( 3.04)
              5.57 ( 1.99)       5.23 ( 1.92)       7.21 ( 2.07)
               .82                .86                .76

Table entries are in the format:
    Estimated coefficient vector (z-statistics)
    Estimated efficiency w.r.t. maximum likelihood
Figure 5.1

SVC data: This is a plot of the response (x = 1, 0 = 0) as a function of log(rate) = LRATE and log(volume) = LVOL.  For reference with Figure 5.2, the line corresponding to an estimated probability of response equal to one half, as determined by the Mallows estimate, is superimposed.  If observations #4 and #18 are removed the data are nearly singular.
[Figure 5.1: response versus LVOL and LRATE; observations 4 and 18 labeled.]
Figure 5.2

SVC data: This is a contour plot of the Mallows weights as a continuous function of the design variables log(rate) = LRATE and log(volume) = LVOL.  Superimposed is the line corresponding to an estimated probability of response equal to one half.
[Figure 5.2: contour plot of the Mallows weights over LVOL and LRATE.]
Figure 5.3

SVC data: This is a plot of the Mallows normalized observed influence (NOI) versus observation number.  Large values of NOI indicate influential observations.
[Figure 5.3: NOI versus observation number; observations 4, 18, and 29 labeled.]
Figure 5.4

SVC data: This is a plot of the Mallows normalized observed influence versus fitted logits (LOGITF).  Large values of NOI indicate influential observations.
[Figure 5.4: NOI versus LOGITF; observations 4, 18, and 29 labeled.]
Figure 5.5

CHD data: This is a plot of the response (x = 1, 0 = 0) as a function of log(cholesterol) = LCHOL and log(systolic blood pressure - 75) = LSBP.
Figure 5.6

CHD data: This is a plot of the Mallows normalized observed influence (NOI) versus observation number.  Large values of NOI indicate influential observations.
[Figure 5.6: NOI versus observation number; observations 4, 35, 71, 92, and 105 labeled.]
Figure 5.7

CHD data: This is a plot of the Mallows normalized observed influence (NOI) versus fitted logits (LOGITF).  Large values of NOI indicate influential observations.
[Figure 5.7: NOI versus LOGITF; observations 4, 35, 71, 92, and 105 labeled.]
Figure 5.8

CHD data: This is a plot of observed and modeled response curves as functions of the fitted logits (LOGITF).  The plot illustrates the effect of observations #4 and #35 on the fitting procedure.
[Figure 5.8: observed responses and fitted response curve versus LOGITF.]
Figure 5.9

SVC data: This is a plot of the Mallows (t*) normalized observed influence (NOI) versus fitted logits (LOGITF).  Large values of NOI indicate influential observations.
[Figure 5.9: NOI versus LOGITF; observations 4, 18, and 29 labeled.]
Figure 5.10

CHD data: This is a plot of the Mallows (t*) normalized observed influence (NOI) versus fitted logits (LOGITF).  Large values of NOI indicate influential observations.
[Figure 5.10: NOI versus LOGITF; observations 4, 35, 71, 92, and 105 labeled.]
BIBLIOGRAPHY

Amemiya, Y. (1982) "Estimators for the errors-in-variables model", unpublished Ph.D. thesis, Iowa State University, Ames.

Andersen, P. K. and Gill, R. D. (1982) Cox's regression model for counting processes: A large sample study. Ann. Statist. 10, 1100-1120.

Beran, R. (1982) Robust estimation in models for independent non-identically distributed data. Ann. Statist. 10, 415-428.

Berkson, J. (1944) Application of the logistic function to bio-assay. J. Amer. Statist. Assoc. 39, 357-365.

Berkson, J. (1951) Why I prefer logits to probits. Biometrics 7, 327-339.

Bickel, P. J. (1975) One-step Huber estimates in the linear model. J. Amer. Statist. Assoc. 70, 428-434.

Clark, R. R. (1982) "The errors-in-variables problem in the logistic regression model", unpublished Ph.D. thesis, University of North Carolina, Chapel Hill.

Cox, D. R. (1970) Analysis of Binary Data. Chapman and Hall Ltd., London.

Efron, B. (1975) The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc. 70, 892-898.

Fechner, G. T. (1860) Elemente der Psychophysik. Breitkopf und Härtel, Leipzig.

Finney, D. J. (1947a) Probit Analysis. A Statistical Treatment of the Sigmoid Response Curve. Cambridge University Press, New York.

Finney, D. J. (1947b) The estimation from individual records of the relationship between dose and quantal response. Biometrika 34, 320-334.

Fuller, W. A. (1980) Properties of some estimators for the errors-in-variables model. Ann. Statist. 8, 407-422.

Gordon, T. and Kannel, W. E. (1968) Introduction and general background in the Framingham study - The Framingham Study, Sections 1 and 2. National Heart, Lung and Blood Institute, Bethesda, Maryland.

Hauck, W. W. (1983) A note on confidence bands for the logistic response curve. The American Statistician 37, 158-160.

Hampel, F. R. (1968) Contributions to the theory of robust statistics. Ph.D. thesis, University of California, Berkeley.

Hampel, F. R. (1974) The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69, 383-394.

Hampel, F. R. (1976) On the breakdown points of some rejection rules with mean. Research Report No. 11, Fachgruppe für Statistik, Eidgenössische Technische Hochschule, Zürich.

Handschin, E., Kohlas, J., Fiechter, A., and Schweppe, F. (1975) Bad data analysis for power system state estimation. IEEE Trans. on Power Apparatus and Systems 2, 329-337.

Huber, P. J. (1967) The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 221-233, University of California Press.

Huber, P. J. (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1, 799-821.

Huber, P. J. (1981) Robust Statistics (p. 5). John Wiley & Sons, New York.

Jones, D. H. (1982) Tools of robustness for item response theory. Program Statistics Research Technical Report No. 82-36, Educational Testing Service, Princeton, New Jersey.

Krasker, W. S. (1978) Application of robust estimation to econometric problems. Ph.D. thesis, Massachusetts Institute of Technology, Department of Economics.

Krasker, W. S. (1980) Estimation in linear regression models with disparate data points. Econometrica 48, 1333-1346.

Krasker, W. S. and Welsch, R. E. (1982) Efficient bounded influence regression estimation using alternative definitions of sensitivity. J. Amer. Statist. Assoc. 77, 595-605.

Mallows, C. L. (1973) Influence functions. Presented at the National Bureau of Economic Research Conference on Robust Regression, Cambridge, Massachusetts.

Mallows, C. L. (1975) On some topics in robustness. Unpublished memorandum, Bell Telephone Laboratories, Murray Hill, New Jersey.

Maronna, R. A. (1976) Robust M-estimators of multivariate location and scatter. Ann. Statist. 4, 51-67.

Michalek, J. E. and Tripathi, R. C. (1980) The effect of errors in diagnosis and measurement on the estimation of the probability of an event. J. Amer. Statist. Assoc. 75, 713-721.

Müller, G. E. (1879) Über die Maassbestimmungen des Ortsinnes der Haut mittels der Methode der richtigen und falschen Fälle. Pflüg. Arch. ges. Physiol. 19, 191-235.

Pregibon, D. (1981) Logistic regression diagnostics. Ann. Statist. 9, 705-724.

Pregibon, D. (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics 38, 485-498.

Rockafellar, R. T. (1970) Convex Analysis. Princeton University Press, Princeton.

Ruppert, D. (1982) Some results on bounded influence estimation. Unpublished notes.

Ruppert, D. (1983) On the bounded-influence regression estimator of Krasker and Welsch.

Stahel, W. A. (1981) Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Ph.D. thesis, Swiss Federal Institute of Technology, Zürich.

Walker, S. H. and Duncan, D. B. (1967) Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 167-179.

Wolter, K. M. and Fuller, W. A. (1982) Estimation of nonlinear errors-in-variables models. Ann. Statist. 10, 539-548.