Simpson, Douglas G. (1985). "Some Contributions to Robust Inference for Discrete Probability Models."

SOME CONTRIBUTIONS TO ROBUST INFERENCE FOR
DISCRETE PROBABILITY MODELS

by
Douglas G. Simpson

Mimeo Series #1594T
December 1985

DEPARTMENT OF STATISTICS
Chapel Hill, North Carolina
SOME CONTRIBUTIONS TO ROBUST INFERENCE
FOR DISCRETE PROBABILITY MODELS

by
Douglas Gareth Simpson

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistics.

Chapel Hill
1985
DOUGLAS G. SIMPSON. Some Contributions to Robust Inference for Discrete Probability Models. (Under the direction of RAYMOND J. CARROLL and DAVID RUPPERT.)

Issues that arise in the theory and applications of robust inference for discrete data are investigated. This work is motivated by data from research in chemical mutagenicity at the National Institute of Environmental Health Sciences. The data consist of counts, rather than continuous measurements, and aberrant values are known to occur on occasion. Discrete probability distributions generally lack the symmetry and invariance properties inherent in the location-scale and regression frameworks, which have been the focus of much of the work on robust inference to date, and the discreteness itself has an impact. Two different approaches to robust estimation for discrete parametric distributions are considered: (1) M-estimation and (2) minimum Hellinger distance (MHD) estimation. The M-estimators comprise an extremely flexible class of estimators; that desired properties of an M-estimator can be specified in a direct way through its score function is well known. MHD estimation is especially suited for discrete models and has the advantage of intuitive interpretability when the model is inexact.

Extensions of the asymptotic distribution theory of M-estimators are developed that are relevant to models having a discrete component. Implications of the theory are discussed. In particular, smooth score functions are recommended. Improved results concerning the consistency and asymptotic normality of the MHD estimator are obtained for discrete models with infinite support. Breakdown properties of the MHD estimator are investigated in considerable generality. A new measure of the stability of an estimator, the probability breakdown point, is proposed for parametric estimation.
ACKNOWLEDGEMENTS

I would like to thank my advisors, Raymond J. Carroll and David Ruppert, for their guidance and support. Professors Carroll and Ruppert have provided constant intellectual stimulation, and their enthusiasm has made this work a pleasure. My educational experience has been enriched by many conversations with Barry H. Margolin, and his essential role in the formulation of this topic is gratefully acknowledged. My thanks go to the other members of the committee, Gary G. Koch and J. Stephen Marron, for their interest and for valuable suggestions.

I would like to thank my wife, Fung-Yin Kuo, for her patience and encouragement.

My graduate study has been supported in part by a George E. Nicholson, Jr., Memorial Fellowship, a University of North Carolina Fellowship, by the National Science Foundation under Contract DMS 8400602, and by Air Force Office of Scientific Research Contract AFOSR-S-49620-85-C-0144.

Finally, I am grateful to Ruth Bahr for her excellent typing of this dissertation.
TABLE OF CONTENTS

CHAPTER I     INTRODUCTION
    1.1  Background and summary
    1.2  An example from mutation research
    1.3  Remarks

CHAPTER II    ASPECTS OF M-ESTIMATION FOR DISCRETE DATA
    2.1  Introduction
    2.2  Parametric M-estimation: Definitions, optimality and examples
    2.3  Extended asymptotic distribution theory
    2.4  A counterexample
    2.5  Smooth score functions
    2.6  Further remarks
    APPENDIX 2.A
    APPENDIX 2.B
    APPENDIX 2.C

CHAPTER III   MINIMUM HELLINGER DISTANCE ESTIMATION FOR DISCRETE DATA
    3.1  Introduction
    3.2  Minimum Hellinger distance versus maximum likelihood
    3.3  Almost sure convergence
    3.4  Asymptotic normality and efficiency
    3.5  Discussion
    APPENDIX 3.A
    APPENDIX 3.B

CHAPTER IV    BREAKDOWN ANALYSIS OF THE MINIMUM HELLINGER DISTANCE ESTIMATOR
    4.1  Introduction
    4.2  Breakdown point
    4.3  Probability breakdown point
    4.4  Discussion
    APPENDIX

REFERENCES
CHAPTER I

INTRODUCTION

1.1  Background and summary
In the years since J. Tukey's (1960) paper on sampling from contaminated distributions, inference under non-standard conditions has received
considerable attention in the statistical literature.
It has been recog-
nized that the optimal performance of a classical procedure is frequently
sensitive to a seemingly minor misspecification of the parametric model,
and that such a procedure can be misleading in the presence of a few discordant observations in the data.
Considerable research effort has been expended in the development of
alternatives to the classical procedures, especially for estimation, to
accommodate aberrations of this type.
Such procedures, designed to be rel-
atively insensitive to mild deviations from the model, are said to be "robust", following Box (1953).
For a review of some of the techniques and
a thorough discussion of the issues of robust inference, see Huber (1981).
Much of the work on robust estimation is concerned with models having
symmetric continuous error laws.
For instance, numerous alternatives to
the sample mean have been proposed to estimate the center of a symmetric
continuous distribution.
See Andrews et al. (1972) for an extensive study.
More recently, robust alternatives to least squares have been proposed for
regression models.
See, for instance, Huber (1973) and Krasker and Welsch (1982).
The present investigation focuses on robust estimation for discrete
data.
Discrete probability distributions generally lack the symmetry and invariance properties of the location-scale and regression frameworks, and the discreteness itself has an impact. Few authors appear to have explicitly treated robust estimation for discrete data, although in principle existing techniques can be used.
One technique is to modify the maximum likelihood estimator in the
following way.
Replace the likelihood score in the estimating equation by a suitable bounded score function $\psi$ to limit the effect that an individual observation can have. The resulting estimate $T_n$ solves an equation of the form

(1.1.1)    $\sum_{i=1}^{n} \psi(X_i, T_n) = 0$,

where $X_1, X_2, \ldots, X_n$ comprise a sample of size $n$. Estimates solving equations of this form are proposed in Huber (1964).
They are known as M-estimates since they generalize the class of maximum likelihood estimates. An attractive feature of the class of M-estimates is its flexibility. Desired properties of an M-estimator are specified through the score function; see Hampel (1974) or Huber (1981). Moreover, M-estimation readily extends to higher dimensional and regression problems. Hampel (1968) briefly considers robust M-estimation of the Poisson and binomial parameters.
The literature on M-estimation in general is extensive. Further references can be found in Huber (1981).
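To make (1.1.1) concrete, the following minimal sketch (an illustration added here, not part of the original text) solves an M-estimating equation for a Poisson mean numerically. The standardized Huber-type score used is a simplified stand-in, not Hampel's optimal score of Chapter II, and the data are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def psi_c(u, c=1.5):
    """Huber's psi: the identity on [-c, c], truncated at +/- c."""
    return np.clip(u, -c, c)

def m_estimate(x, score, lo=1e-6, hi=None):
    """Solve (1.1.1): sum_i psi(X_i, t) = 0 for t by root-finding."""
    hi = hi if hi is not None else max(10.0 * x.mean(), 1.0)
    return brentq(lambda t: score(x, t).sum(), lo, hi)

# A simple bounded score for the Poisson mean: standardize, then truncate.
score = lambda x, t: psi_c((x - t) / np.sqrt(t))

x = np.array([0, 1, 0, 2, 0, 1, 0, 0, 91])  # a sample with one gross outlier
print(m_estimate(x, score))  # stays near the bulk of the data (about 0.65)
print(x.mean())              # the ML estimate is dragged up to about 10.6
```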
Some aspects of M-estimation in the context of discrete data are examined in Chapter II. By focusing on the asymptotic distribution theory, specific recommendations are made concerning the choice of the M-estimator score function. The results there also fill in some of the gaps in the theory of M-estimation and the associated optimality theory of Hampel (1968, 1974).

While M-estimation is extremely flexible, and construction of M-estimates with desired properties is relatively straightforward, the technique does have one minor drawback.
The quantities being estimated may not have
a simple interpretation in asymmetric situations if the model is not exactly correct.
In this regard minimum distance estimation is appealing.
A minimum
distance estimate is obtained by minimizing a measure of the discrepancy
between the assumed parametric model and the data.
The quantity being esti-
mated can always be interpreted as an index of the parametric distribution
giving the best fit to the actual distribution, according to the discrepancy
being used.
The resulting estimators are frequently robust in some sense.
The early work on minimum distance estimation precedes the literature
on robust inference.
Many references can be found in Parr (1981).
More recently, interest in inference under non-standard conditions has led to a resurgence of interest in minimum distance methods. Minimum distance approaches to robust estimation are advocated by a number of authors, including Holm (1976), Beran (1977a, 1977b, 1982), Parr and Schucany (1980), and Millar (1981).
For discrete data, minimum distance estimation provides another generalization of maximum likelihood estimation. Suppose $F_\theta$ is a parametric distribution on $\{0,1,2,\ldots\}$ and $f_\theta$ is the corresponding count density. If $f_n$ is the empirical density, i.e., for $x = 0,1,2,\ldots$, $f_n(x)$ is the proportion of $x$'s observed, then the maximum likelihood estimate of $\theta$ maximizes $\sum_{x=0}^{\infty} f_n(x)\log f_\theta(x)$. Since $\sum f_n(x)\log f_n(x)$ is constant in $\theta$ and finite, this is equivalent to minimizing

(1.1.2)    $\sum_{x=0}^{\infty} f_n(x)\,\log\{f_n(x)/f_\theta(x)\}$,

the Kullback-Leibler discrepancy between $f_n$ and $f_\theta$; see Kullback (1959) for properties of this discrepancy. Viewed as a minimum distance estimator, the well-known sensitivity of the maximum likelihood estimator, in many instances, to outliers in the data is a consequence of the weight given in (1.1.2) to observations that are improbable relative to the assumed model.
Chapter III is concerned with minimum Hellinger distance estimation, in which $\theta$ is estimated by minimizing the Hellinger distance

(1.1.3)    $\Bigl[\sum_{x=0}^{\infty} \{f_n^{1/2}(x) - f_\theta^{1/2}(x)\}^2\Bigr]^{1/2}$.

Unlike the Kullback-Leibler discrepancy, the Hellinger distance gives little weight to observations that are improbable relative to the model; outliers have little effect.
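A small numerical sketch (added for illustration; the sample is hypothetical) makes the contrast between (1.1.2) and (1.1.3) concrete: a single aberrant count inflates the Kullback-Leibler discrepancy essentially without bound, while the Hellinger distance remains bounded.

```python
import numpy as np
from scipy.stats import poisson

def empirical_density(x, support):
    return np.array([(x == k).mean() for k in support])

def kl_discrepancy(fn, ftheta):
    """(1.1.2): sum of f_n(x) log{f_n(x)/f_theta(x)} over observed x."""
    m = fn > 0
    return np.sum(fn[m] * np.log(fn[m] / ftheta[m]))

def hellinger(fn, ftheta):
    """(1.1.3), computed on a finite truncation of the support."""
    return np.sqrt(np.sum((np.sqrt(fn) - np.sqrt(ftheta)) ** 2))

x = np.array([0, 1, 0, 2, 0, 1, 0, 0, 91])
support = np.arange(101)
fn = empirical_density(x, support)
ftheta = poisson.pmf(support, mu=0.5)  # a Poisson fit to the bulk
print(kl_discrepancy(fn, ftheta))      # enormous: dominated by the 91
print(hellinger(fn, ftheta))           # modest, and bounded by sqrt(2)
```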
In addition to this robustness to outliers, minimum Hellinger distance estimation is known to be asymptotically efficient at the model in certain circumstances; see the references cited in Chapter III. Hence, minimum Hellinger distance estimation is promising as a means of providing robust and efficient estimates. Moreover, a distance between densities is especially suited for discrete distributions. In Chapter III improved results are obtained concerning asymptotic properties of the minimum Hellinger distance estimator, especially for discrete data.
Chapter IV is concerned with the breakdown of the estimator.
See Hampel
(1968), Huber (1981) and Donoho and Huber (1982) for a discussion of the
breakdown point as a measure of the stability of an estimator.
1.2  An example from mutation research
Screening chemicals for mutagenicity is an important first step in the identification of cancer-causing agents in the environment. Chemicals that cause mutations are suspect as potential carcinogens.
A well-known assay for chemical mutagenicity is the sex-linked recessive lethal test in Drosophila (fruit flies). In this experiment groups of male flies are exposed to different doses of a chemical to be screened. Each male is then mated to unexposed virgin females. The number of daughter flies carrying a recessive lethal mutation on the X chromosome is observed. For details see Woodruff et al. (1984) or Zimmering et al. (1985).
Table 1.1 presents control data from this assay (J. Mason, 1984, personal communication). For these data approximately one hundred daughters were sampled from each male. Each row in the table corresponds to one run of the assay and shows the frequency distribution for the numbers of recessive lethal daughters from individual males. Observe that the right-most column contains actual counts larger than six rather than frequencies.

Aside from the rarity of spontaneous recessive lethal mutations in general, the most striking feature of these data is the occasional occurrence of exceptionally large counts. For instance, in the second run on day 177, one male is reported to have produced 91 recessive lethal daughters, while none of the other males in the run produced more than two.
Woodruff et al. (1984) refer to these exceptional counts as "clusters". They conjecture that, unlike the majority of the lethals, which result from mutations during meiosis, a cluster results "from a single spontaneous premeiotic event." They advocate omitting from the analysis observations that are identified as clusters. This can be justified on the grounds that clusters reflect a different mechanism than the remaining observations, one that might well have acted prior to exposure.

That a given count is a product of clustering apparently can only be deduced from its discordance with the remaining counts. Woodruff et al.
(1984) and Zimmering et al. (1985) use a routine outlier screen for large counts, based on a Poisson model for the data. In effect, they compute a robust, although biased, estimate of the Poisson mean.
1.3  Remarks
Outlier rejection rules require a zero-one decision concerning each observation. Although this is not problematic for extreme cases, e.g., there is little doubt that the 91 and the 13 in Table 1.1 are outliers, in moderate cases the situation is less clear-cut. For instance, Zimmering et al. (1985, p. 93) are led to conclude in several instances that three lethals constitute a cluster, while two lethals do not.
The robust estimates studied in the remaining chapters avoid the need
to make arbitrary decisions concerning how extreme an observation can be
before it is excluded.
Instead, a discordant observation is downweighted
in a smooth manner according to its position relative to the bulk of the
data.
Moreover, by comparing a robust fit of the model to the observed
frequencies, diagnostic information is obtained concerning outliers in the
data; see Section 3.2.
Note that the methods investigated here are parametric. They do not avoid the need for a model that provides a reasonable summary of the data for the most part.
Table 1.1  Control runs from the Drosophila assay

             Number of parental males with i lethals

Run      0    1    2    3    4    5    6   >6
 17     10    0    0    0    0    0    0    0
 18     15    1    1    0    0    0    0    0
 20     26    4    0    0    0    0    0    0
 22     19    2    0    0    0    0    0    0
 24     23    3    0    0    0    0    0    0
 25     10    0    0    0    0    0    0    0
 26     24    0    1    0    0    0    0    0
 27     25    4    0    0    0    0    0    0
 28     23    3    0    1    1    0    0    0
 31     31    3    0    0    1    0    0    0
 32     29    2    0    0    0    0    0    0
 33     18    4    2    0    0    0    1    0
 35     17    2    0    0    0    0    0    0
 36     24    2    2    0    0    0    0    0
 37     15    5    0    0    0    0    0    0
 38     26    4    0    0    0    0    0    0
 39     17    3    0    0    0    0    0    0
 40     19    3    0    0    0    0    0    0
 41     27    1    0    1    0    0    0    0
 42     20    5    0    0    0    0    0    0
 43     13    4    2    0    0    0    0   13
 44     16    6    1    1    0    0    0    0
 45     20    2    0    0    0    0    0    0
 46     17    6    0    1    0    0    0    0
 47     25    2    0    0    0    0    0    0
 48     19    8    2    0    0    0    0    0
 49     16    5    0    0    0    0    0    0
 50     18    5    0    0    0    0    0    0
 51     15    9    0    0    0    0    0    0
 52     23    2    0    0    0    0    0    0
 53     20    3    0    0    0    0    0    0
 54     28    0    0    0    0    0    0    0
 55     15    5    1    0    0    0    0    0
 56     23    2    0    0    0    0    0    0
 57     23    2    0    0    0    0    0    0
 58     26    2    0    0    0    0    0    0
 59     28    6    0    0    0    0    0    0
 60     27    1    0    0    0    0    0    0
 61     20    0    0    0    0    0    0    0
 62     30    0    0    0    0    0    0    0
101     26    3    1    0    0    0    0    0
102     15    4    0    0    0    0    0    0
103     17    3    0    0    0    0    0    0
104     14    3    1    0    0    0    0    0
105     37    7    1    2    0    0    0    0
106     27    4    0    1    0    0    0    0
107     36    4    0    0    0    0    0    0
108     40    6    1    0    0    0    0    0
110     40    7    0    0    0    0    0    0
112     28    7    0    0    0    0    0    0
113     29    5    1    0    0    0    0    0
114     26    7    2    0    0    0    0    0
115     25    9    1    0    0    0    0    0
118     30    4    1    0    0    0    0    0
119     16    3    1    0    0    0    0    0
126     31    3    0    1    0    0    0    0
128     29    5    0    0    0    0    0    0
130     30    3    1    1    0    0    0    0
131     26    8    1    0    0    0    0    0
132     30    5    0    0    0    0    0    0
133     29    5    1    0    0    0    0    0
137     24    6    2    0    1    0    0    0
139     25    6    1    0    0    0    0    0
140     26    3    3    0    0    0    0    0
141     26    6    1    0    0    0    0    0
142     25    8    1    1    0    0    0    0
143      8    1    0    0    0    0    0    0
144     28    3    2    0    0    0    0    0
145     30    5    1    1    0    0    0    0
146     31   14    2    2    0    0    0    0
147     57    2    0    0    0    0    1    0
148     31   12    0    0    0    0    0    0
149     45    4    1    1    0    0    0    0
150     31   17    2    0    0    0    0    0
151     26    1    1    0    0    0    0    0
152     24    2    0    0    0    0    0    0
153     67   14    1    0    0    0    0    0
153     42    1    2    0    0    0    0    0
154     54    3    1    0    0    0    0    0
154     44    5    0    1    0    0    0    0
154     29    6    1    1    0    0    0    0
157     30   17    2    1    0    0    0    0
157     34   11    2    0    0    0    0    0
157     34   13    3    0    0    0    0    0
158     40    7    1    0    0    0    0    0
158     41    9    0    0    0    0    0    0
159     48    1    0    0    0    1    0    0
160     40    1    0    0    0    0    0    0
163     22    7    4    0    0    0    0    0
165     39   10    1    0    0    0    0    0
165     44    5    1    0    0    0    0    0
166     41    9    0    0    0    0    0    0
166     43    6    0    0    0    0    0    0
166     44    5    1    0    0    0    0    0
167     23   10    2    1    0    0    0    0
167     23    8    5    0    0    0    0    0
167     35   12    3    0    0    0    0    0
168     60    6    4    0    0    0    0    0
168     29    5    0    0    0    0    0    0
169     26    9    0    0    0    0    0    0
170     22   10    0    1    0    0    0    0
170     23   10    1    1    0    0    0    0
171     25    4    3    0    0    0    0    0
171     26    8    1    0    0    0    0    0
172     16   11    2    0    0    0    0    0
172     28    6    0    1    0    0    0    0
172     30    6    3    1    0    0    0    0
173     28    5    1    0    0    0    1    0
173     20    8    0    0    0    0    0    0
176     19    8    3    1    0    1    0    0
176     23    9    3    0    0    0    0    0
177     25    9    1    0    0    0    0    0
177     23    7    3    0    0    0    0   91
178     20   12    1    2    0    0    0    0
178     23   11    0    1    0    0    0    0
179     23    9    1    0    0    0    0    9
179     18    2    2    0    0    0    0    0
180     25    4    1    0    0    0    0    0
180     21   11    1    2    0    0    0    0
181     18   13    3    1    0    0    0    0
182     22   10    1    0    0    0    0    0
182     36   28    6    0    0    0    0    0
183     26   16    2    0    0    0    0    0
183     24    6    1    0    0    0    0    0
CHAPTER II

ASPECTS OF M-ESTIMATION FOR DISCRETE DATA

2.1  Introduction
M-estimation, originally proposed by Huber (1964) to estimate a location parameter robustly, has since been applied successfully to a variety of estimation problems where stability of the estimates is a concern. There is, for instance, a substantial body of literature on M-estimation for regression models; see Krasker and Welsch (1982) for a recent review. For a complete account of M-estimation see Huber (1981).

Much of the popularity of M-estimators can be attributed to their flexibility. That desired properties of an M-estimator, such as insensitivity to or rejection of extremely outlying data points, can be specified in a direct way through the M-estimator score function is well known. In particular, the influence function of an M-estimator is proportional to its score function; see Hampel (1974) or Huber (1981) for details.
Surprisingly, M-estimation for discrete data seems to have received
little attention.
Discrete data are no less prone than continuous data
to outliers or partial deviations from an otherwise reasonable model, as
evidenced by the data from the Drosophila assay described in Chapter I.
This chapter investigates some aspects of M-estimation for discrete data.
A useful optimality theory has been developed by Hampel (1968, 1974) for robust M-estimation of a univariate parameter. His general prescription facilitates the construction of robust M-estimators with nearly optimum efficiency at a specified model. Proposals for robust estimation of the binomial and Poisson parameters, for instance, can be found in Hampel (1968). Hampel's theory is briefly reviewed in Section 2.2. Extensions of this optimality theory to regression models are discussed in Krasker (1980), Krasker and Welsch (1982) and Ruppert (1985).
The score function for Hampel's optimal M-estimator is not smooth.
This can lead to complications in the asymptotic theory when the data are
discrete.
For instance, Huber (1981, p. 51) considers the case where the underlying distribution is a mixture of a smooth distribution and a point mass. He observes that if the point mass is at a discontinuity of the derivative of the score function, then an M-estimate for location has a non-normal limiting distribution. Along the same lines, Hampel (1968, p. 97) notes that the optimal M-estimate for the Poisson parameter is asymptotically normal at the Poisson distribution, provided the truncation points of the score function are not integers. He conjectures that "under any Poisson distribution it is asymptotically normal (with the usual variance); however, this remains to be shown."

This chapter provides extensions to the asymptotic distribution theory of M-estimators especially relevant to discrete data. The main results are given in Section 2.3. Among the applications of the theory are a more complete account of the asymptotics of the Huber M-estimate for location and a proof of Hampel's conjecture. Aside from providing a more complete asymptotic theory for M-estimation, the results have implications for choosing a score function when the data are discrete. These are discussed in the final sections. In particular, smooth score functions are proposed.

2.2  Parametric M-estimation: Definitions, optimality and examples
Suppose $X_1, X_2, \ldots$ are thought to be independent observations, each having distribution function (d.f.) $F_\theta$, where $\theta$ belongs to a parameter set $\Theta$; here $\Theta$ is a subset of $R^d$, $d \geq 1$. Define

(2.2.1)    $M(t; \psi, F) = \int \psi(x, t)\, dF(x)$,

where $F$ is a d.f. on $R^1$, $\psi(\cdot,\cdot)$ is a measurable real-valued function on $R^1 \times \Theta$, and $t \in \Theta$. Then $T_n$ is an M-estimator for $\theta$, based on a sample of size $n$, if it solves an equation of the form

(2.2.2)    $M(T_n; \psi, F_n) = 0$,

where $F_n$ is the empirical d.f.; (2.2.2) is just a restatement of (1.1.1). The standard requirement

(2.2.3)    $M(\theta; \psi, F_\theta) = 0$,  $\theta \in \Theta$,

ensures that $T_n$ estimates $\theta$ when the model is correct.

Suppose now that $\Theta \subset R^1$. The influence function at $F_\theta$ of an M-estimator for $\theta$ has the form

$\Omega(x,\theta) = \dfrac{\psi(x,\theta)}{-\int \{\frac{d}{d\theta}\psi(\cdot,\theta)\}\, dF_\theta}$,

provided this exists; see Hampel (1974). Subject to (2.2.3) and a bound on $\Omega$ (which guarantees the robustness of the estimator), Hampel (1968, 1974) obtains the general form of the M-estimator score function that minimizes

$\int \Omega^2(\cdot,\theta)\, dF_\theta$,

which is, under regularity conditions, the asymptotic variance of the estimator at $F_\theta$. Assume $F_\theta$ has a density $f_\theta$ with respect to a suitable measure, and assume the parametrization is smooth. Letting

$\ell(x,\theta) = \dfrac{d}{d\theta} \log f_\theta(x)$,

the optimal score according to Hampel's criterion has the form
(2.2.4)    $\psi_{c(\theta)}(\ell(x,\theta) - a(\theta))$,

where

$\psi_c(u) = \begin{cases} u, & |u| \leq c \\ c\,\mathrm{sign}(u), & |u| > c, \end{cases}$

and $a$ is defined implicitly by (2.2.3).

The truncation point $c(\theta)$ determines the bounds on $\Omega(\cdot,\theta)$ and hence the robustness of the estimator to outlying data points. The corresponding estimator is optimal in the sense that it has minimum asymptotic variance at the model among M-estimators with the same bounds on their influence functions, provided the estimator is indeed asymptotically normal. Observe that the maximum likelihood estimator has the form (2.2.4) with $c(\theta) \equiv \infty$ and $a(\theta) \equiv 0$.
Two examples given in Hampel (1968) will be of special interest here.

Example 2.1  If $F_\theta$ is the normal d.f. with mean $\theta$ and unit variance then $\ell(x,\theta) = x - \theta$. By symmetry $a(\theta) \equiv 0$, and constant variance suggests setting $c(\theta) \equiv c$. The resulting estimator, with score $\psi_c(x - \theta)$, is the Huber (1964) M-estimator for location.

Example 2.2  If $F_\theta$ is the Poisson d.f., with density $f_\theta(x) = e^{-\theta}\theta^x/x!$ on $x = 0,1,2,\ldots$, then $\ell(x,\theta) = x\theta^{-1} - 1$. Hampel (1968, p. 96) suggests taking $c(\theta) = c\theta^{-1/2}$ on the grounds that $\ell(x,\theta)$ has standard deviation $\theta^{-1/2}$. For this choice (2.2.4) is equivalent to $\psi_c(x\theta^{-1/2} - \theta^{1/2} - a(\theta))$. The version

(2.2.5)    $\psi_c(x\theta^{-1/2} - \beta(\theta))$,

where $\beta(\theta) = \theta^{1/2} + a(\theta)$ is defined by (2.2.3), is slightly more convenient.
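The centering constant $\beta(\theta)$ in (2.2.5) has no closed form in general. A minimal numerical sketch (added for illustration) computes it from the defining condition (2.2.3), truncating the Poisson support where the remaining mass is negligible:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import brentq

def beta(theta, c=1.5, xmax=200):
    """Solve sum_x f(x, theta) psi_c(x / sqrt(theta) - b) = 0 for b,
    the centering condition (2.2.3) for the score (2.2.5)."""
    x = np.arange(xmax)
    w = poisson.pmf(x, theta)
    z = x / np.sqrt(theta)
    B = lambda b: np.sum(w * np.clip(z - b, -c, c))
    # B is decreasing in b, from +c at b = -inf to -c at b = +inf.
    return brentq(B, z.min() - 2 * c, z.max() + 2 * c)

for theta in (0.25, 1.0, 4.0):
    print(theta, beta(theta))  # beta(theta) lies near sqrt(theta)
```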
2.3  Extended asymptotic distribution theory
As in the previous section $X_1, X_2, \ldots$ are independent observations thought to be from a parametric d.f. $F_\theta$, and interest is focused on estimating $\theta$. Initially $\theta$ may be a vector parameter. Later, to simplify the discussion, $\theta$ will be assumed to be a scalar. The actual d.f., which might not be in $\{F_\theta\}$, will be denoted by $G$. The results of this section are mainly relevant when $G$ has a discrete component, but Theorem 2.1 is somewhat broader in scope.

Conditions for consistency (convergence almost surely or in probability) of an M-estimator can be found in Huber (1964, 1967, 1981). Since the smoothness considerations of concern here play no role in the consistency proofs, consistency will usually be assumed as a condition.

In studying the asymptotic distribution theory of M-estimators, Huber shows under quite general conditions that if $T_n \to \theta = T(G)$ in probability as $n \to \infty$ then

(2.3.1)    $-n^{1/2}\, M(T_n; \psi, G) = n^{-1/2} \sum_{i=1}^{n} \psi(X_i, \theta) + o_p(1)$,

where $M$ is given by (2.2.1); see Theorems 3.2.4 and 6.3.1 of Huber (1981). In particular, $\psi$ need not be differentiable; monotonicity or Lipschitz integrability conditions are sufficient. That $T_n$ is asymptotically normal follows immediately from (2.3.1) provided $M(t; \psi, G)$ has a non-zero derivative at $\theta$ and $0 < \int \psi^2(\cdot,\theta)\, dG < \infty$; see Corollary 6.3.2 of Huber (1981). For stronger almost sure representations for $T_n$ under stronger conditions see Carroll (1978a, 1978b).

To avoid having to verify Lipschitz conditions for score functions like (2.2.5) that have implicitly defined centering parameters, it is useful to observe that (2.3.1) also holds under conditions like those in Boos and Serfling (1980).
Denote by $\|\cdot\|_V$ the total variation norm, given by

$\|h\|_V = \lim \sup \sum_{i=1}^{k} |h(x_i) - h(x_{i-1})|$,

where the supremum is over partitions $a = x_0 < x_1 < \cdots < x_k = b$ of $[a,b]$, and the limit is taken as $a \to -\infty$, $b \to \infty$. The proof of the following is contained in the proof of Theorem 2.2 of Boos and Serfling (1980).

Lemma 2.1  Let $X_1, X_2, \ldots$ be independent, each with d.f. $G$, and let $\theta = T(G)$. Suppose $\psi(x,t)$ is continuous in $x$ for $t \in \Theta \subset R^d$ and

$\lim_{t \to \theta} \|\psi(\cdot,t) - \psi(\cdot,\theta)\|_V = 0$.

If $T_n \to \theta$ in probability as $n \to \infty$ then (2.3.1) holds.

Remark  The score functions of Examples 2.1 and 2.2 are continuous in total variation. For the former see Boos and Serfling (1980). For the latter, which is slightly more complicated, see Appendix 2.B.
When the underlying distribution is discrete, points where $\psi$ fails to have a derivative can have positive probability for certain parameter values. In light of (2.3.1), it is natural to ask whether $M$ can have a derivative at such parameter values, and hence whether $T_n$ can be asymptotically normal. The following theorem addresses this question. For $\theta \in \Theta \subset R^d$, $F_\theta$ is assumed to have a density $f_\theta = f(\cdot,\theta)$ with respect to a $\sigma$-finite measure $\mu$, and $\psi_\theta = \psi(\cdot,\theta)$ is measurable for each $\theta$. Let $\|\cdot\|$ denote any norm on $R^d$ equivalent to the Euclidean norm. Some regularity conditions are needed:

(A1)  There are measurable functions $w_t = w(\cdot,t)$ and $g_t = g(\cdot,t)$ for which $\int w_t f_t\, d\mu$, $\int |\psi_t|\, g_t\, d\mu$ and $\int w_t g_t\, d\mu$ are finite and, for some $\delta > 0$,

$|\psi_s| \leq w_t$  and  $|f_s - f_t| \leq \|s-t\|\, g_t$

almost everywhere $[\mu]$ (a.e.) when $\|s - t\| \leq \delta$;
(A2)  There is a measurable $R^d$-valued function $\dot f_t = \dot f(\cdot,t)$ for which

$\|s-t\|^{-1}\, |f_s - f_t - \dot f_t^{\mathrm T}(s-t)| \to 0$  a.e.  as $s \to t$;

(A3)  $\psi_s \to \psi_t$ a.e. as $s \to t$.

Theorem 2.1  If for each $t \in \Theta$ (A1)-(A3) hold and

(2.3.2)    $M(t; F_t) = 0$,

then

(2.3.3)    $\dfrac{d}{ds^{\mathrm T}}\, M(s; F_t)\Big|_{s=t} = -\int \psi_t\, \dot f_t^{\mathrm T}\, d\mu$

(where the dependence of $M$ on $\psi$ has been suppressed).
Proof  For $s, t \in \Theta$, (2.3.2) gives

(2.3.4)    $M(s; F_t) = M(s; F_t) - M(s; F_s) = -\int \psi_t (f_s - f_t)\, d\mu + R_t(s)$,

where $R_t(s) = -\int (\psi_s - \psi_t)(f_s - f_t)\, d\mu$. By (A1), (A3) and Dominated Convergence,

(2.3.5)    $R_t(s) = o(\|s-t\|)$.

Similarly, (A2) and Dominated Convergence imply

(2.3.6)    $\int \psi_t \{f_s - f_t - \dot f_t^{\mathrm T}(s-t)\}\, d\mu = o(\|s-t\|)$  as $s \to t$,

since the integrand is dominated by $2\|s-t\|\, |\psi_t|\, g_t$ on $\|s-t\| \leq \delta$. From (2.3.4) to (2.3.6) conclude

$M(s; F_t) = -\Bigl(\int \psi_t\, \dot f_t^{\mathrm T}\, d\mu\Bigr)(s-t) + o(\|s-t\|)$.

Hence the derivative of $M(s; F_t)$ with respect to $s$ exists at $s = t$ and is given by (2.3.3).
Remarks

1.  Note that $\psi_t$ need not be differentiable.

2.  When $\psi_t = \ell_t = \dot f_t / f_t$, (2.3.3) generalizes the usual information identity.

3.  Huber (1981, p. 51) observes a special case, namely (2.3.3) holds when $\mu$ is Lebesgue measure, $\psi(x,t) = \psi(x-t)$, where $\psi(\cdot)$ is skew-symmetric about zero, and $f(x,t) = f(x-t)$, where $f(\cdot)$ is differentiable and symmetric about zero.

4.  Theorem 2.1, when it holds, also guarantees that the influence function at the model, given by

$\psi(x,t)\, \Bigl\{-\dfrac{d}{ds} M(s; F_t)\Big|_{s=t}\Bigr\}^{-1}$,

is defined for each $t \in \Theta$.

Example 2.2 (continued)  Suppose $f(x,t) = e^{-t} t^x / x!$ on $\{0,1,2,\ldots\}$, $t > 0$. Recall that the optimal M-estimator has the score $\psi(x,t) = \psi_c(x t^{-1/2} - \beta)$. This estimator is known to be asymptotically normal at the Poisson distribution when $t$ is in one of the open intervals where neither of the truncation points $t^{1/2}(\beta \pm c)$ is an integer; see Hampel (1968, p. 97). It can now be shown to be asymptotically normal at every Poisson distribution, as conjectured by Hampel.

First, the conditions of Theorem 2.1 are verified. For (A1) and (A2) use $g(x,t) = e^{2\delta} f(x-1, t+\delta) + \{1 + \delta^{-1}(e^{\delta} - 1 - \delta)\} f(x,t)$ (see Appendix 2.A), $w(x,t) \equiv c$, and $\dot f(x,t) = f(x-1, t) - f(x,t)$. In Appendix 2.B $\beta$ is shown to be continuous when (2.B.3) holds, for which $c \geq 1$ is sufficient; see Proposition 2.B.1. Hence (A3) holds for $c \geq 1$. Finally, (2.3.2) is satisfied because of the way $\beta$ is defined.

It has already been observed that Lemma 2.1 applies. Moreover, $0 < \int \psi_t^2\, f_t\, d\mu \leq c^2$ for $c \geq 1$. It follows that the estimator is asymptotically normal at every Poisson distribution if it is consistent. Consistency follows from the discussion on p. 96 of Hampel (1968) and Theorem 2 of Huber (1967).
In Theorem 2.1, (2.3.2) allows smoothness of the parametrization to be substituted for smoothness of $\psi$ within the assumed parametric model, so that the estimator is asymptotically normal under further conditions. Outside the assumed parametric model, however, this approach fails. In certain cases it is still possible to obtain the limiting distribution of $T_n$ from (2.3.1).

Assume for simplicity that $\Theta$ is an open subset of the real line. The score functions used for robust estimation are generally at least piecewise differentiable. In such cases the one-sided derivatives of $M(t; G)$ will usually exist when $M$ fails to be differentiable. Write

$m(t; G) = \dfrac{d}{dt}\, M(t; G)$

when the derivative exists. By a well-known result from calculus, if $m(\theta-; G)$ and $m(\theta+; G)$ exist, they are equal to the corresponding one-sided derivatives of $M(t; G)$ at $\theta$; see, e.g., Franklin (1940, p. 118).
Theorem 2.2  Suppose for some $\theta$ interior to $\Theta$ that $M(\theta; G) = 0$, and let $T_n$ be a zero of $M(t; F_n)$, $n = 1,2,\ldots$, where $F_n$ is the empirical d.f. Assume the following:

(B1)  $m(\theta-; G)$ and $m(\theta+; G)$ exist finitely and are non-zero and of the same sign;

(B2)  $0 < \sigma^2 < \infty$, where $\sigma^2 = \int \psi_\theta^2\, dG$;

(B3)  $T_n \to \theta$ in probability as $n \to \infty$, and (2.3.1) holds.

Then

(2.3.7)    $\lim_{n\to\infty}\; \sup_{-\infty < z < \infty} \bigl|\, \mathrm{pr}\{n^{1/2}(T_n - \theta) \leq z\} - H(z)\, \bigr| = 0$,

where

$H(z) = \begin{cases} \Phi(|m(\theta+; G)|\, z/\sigma), & z \geq 0 \\ \Phi(|m(\theta-; G)|\, z/\sigma), & z \leq 0, \end{cases}$

and $\Phi$ is the standard normal d.f.

Remarks

1.  Huber (1964, p. 78) alludes to a similar result for a location estimator.

2.  The requirement that $m(\theta\pm; G)$ have the same sign is actually implied by the remaining conditions. If the one-sided derivatives were to have opposite signs, $M(t; G)$ would not change signs in a neighborhood of $\theta$, and (2.3.1) could not hold.

The proof of Theorem 2.2 is deferred to Appendix 2.C.
Example 2.1 (continued)  Recall that the Huber M-estimator for location has the score $\psi(x,t) = \psi_c(x-t)$. For any d.f. $G$, $M(-\infty; G) = c = -M(\infty; G)$, and $M(t; G)$ is continuous in $t$, so it has a zero $\theta_0$. Assume $\theta_0 = 0$. This is unique if $G(c-) > G(-c+)$, in which case $T_n \to 0$ in probability by Proposition 2.2.1 of Huber (1981). Since $\psi_c$ is continuous in total variation, (2.3.1) holds by Lemma 2.1. Letting $\dot\psi(x,t) = \frac{d}{dt}\psi_c(x-t) = -\psi_c'(x-t)$ if it exists, observe that

$-\dot\psi(x,t-) = I(-c \leq x-t < c)$  and  $-\dot\psi(x,t+) = I(-c < x-t \leq c)$,

where $I(\cdot)$ denotes the indicator function. Bounded convergence yields

$-m(0-; G) = G(c-) - G(-c-)$  and  $-m(0+; G) = G(c+) - G(-c+)$.

Hence, by Theorem 2.2, $n^{1/2} T_n$ is asymptotically normal if $G(c+) - G(c-) = G(-c+) - G(-c-)$; otherwise it has a limiting distribution consisting of the left and right halves of two normal distributions with different variances (cf. Huber (1981, p. 51)).
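The split-normal limit is easy to see by simulation. In the sketch below (added for illustration) the sampling distribution $G$ is built so that $M(0; G) = 0$, with an atom of mass $\varepsilon$ sitting exactly at the truncation point $c$, balanced by an atom at $-3 < -c$; the two halves of $n^{1/2} T_n$ then show different spreads, as Theorem 2.2 predicts.

```python
import numpy as np

def huber_location(x, c=1.5, tol=1e-10):
    """Solve sum_i psi_c(x_i - t) = 0 for t by bisection;
    the sum is non-increasing in t, so bisection suffices."""
    lo, hi = x.min(), x.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.clip(x - mid, -c, c).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
c, eps, n, reps = 1.5, 0.2, 400, 2000
roots = []
for _ in range(reps):
    x = rng.normal(size=n)
    u = rng.random(n)
    x[u < eps] = c         # atom exactly at the truncation point c
    x[u > 1 - eps] = -3.0  # balancing atom beyond -c, so that M(0; G) = 0
    roots.append(huber_location(x, c))
z = np.sqrt(n) * np.array(roots)
# Unequal spreads of the two halves reflect |m(0-; G)| != |m(0+; G)|.
print(z[z < 0].std(ddof=1), z[z > 0].std(ddof=1))
```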
2.4  A counterexample
It is instructive to examine the extent of the non-normality that occurs in a specific example. Consider again the optimal M-estimator for the Poisson parameter. The score function is

$\psi(x,t) = \psi_c(x t^{-1/2} - \beta) = \begin{cases} -c, & x \leq \ell(t) \\ x t^{-1/2} - \beta, & \ell(t) < x < h(t) \\ c, & h(t) \leq x, \end{cases}$

where $\ell(t) = t^{1/2}(\beta(t) - c)$ and $h(t) = t^{1/2}(\beta(t) + c)$.

Let $G$ be the actual d.f. and let $\theta = T(G)$. The simplest situation is when $\theta$ is small. Assume henceforth that $\ell(\theta) < 0 < h(\theta) = 1$. Calculation yields $\beta(t) = c(e^t - 1)$ for $\ell(t) < 0$, $0 < h(t) \leq 1$, and $\beta(t) = c\{e^t(1+t)^{-1} - 1\} + t^{1/2}(1+t)^{-1}$ for $\ell(t) < 0$, $1 \leq h(t) \leq 2$. Since $\beta$ is continuous, equating the two expressions at $\theta$ gives

(2.4.1)    $\theta^{1/2} e^{\theta} = c^{-1}$.

The one-sided derivatives of $\beta$ at $\theta$ are $\beta'(\theta-) = c e^{\theta}$ and $\beta'(\theta+) = \frac{1}{2}\, c e^{\theta} (1+\theta)^{-1}$, where (2.4.1) was used. Note that $\beta$ is strictly increasing at $\theta$. Since $\psi_c'(c-) = 1$ and $\psi_c'(c+) = 0$,

(2.4.2)    $-\dot\psi(x,\theta-) = \begin{cases} c e^{\theta}, & x = 0 \\ 0, & x = 1,2,\ldots \end{cases}$

and

(2.4.3)    $-\dot\psi(x,\theta+) = \begin{cases} \frac{1}{2}\, c e^{\theta} (1+\theta)^{-1}, & x = 0 \\ \frac{1}{2}\, c e^{\theta} \{\theta^{-1} + (1+\theta)^{-1}\}, & x = 1 \\ 0, & x = 2,3,\ldots \end{cases}$

Suppose $G$ is a mixture of a Poisson distribution $F_t$ and a point mass at an integer $z$, i.e., $G = (1-\varepsilon) F_t + \varepsilon\, \delta_z$. Assume $z > h(t)$, so $\dot\psi(z,\theta\pm) = 0$. From (2.4.2) and (2.4.3),

(2.4.4)    $\dfrac{m(\theta+; G)}{m(\theta-; G)} = \dfrac{1}{2}\Bigl(\dfrac{t}{\theta} + \dfrac{1+t}{1+\theta}\Bigr)$,

where $m(\theta-; G) = -c e^{\theta - t}(1-\varepsilon)$. The ratio (2.4.4) is unity only when $t = \theta$, which corresponds to $\varepsilon = 0$. By Theorem 2.2, the limiting distribution of $n^{1/2}(T_n - \theta)$ consists of the right and left halves of two normal distributions. The ratio of their standard deviations is (2.4.4).

Solving $0 = M(\theta; G) = c\{1 - (1-\varepsilon)e^{\theta - t}\}$ yields $t = \theta + \log(1-\varepsilon)$. Table 2.1 shows the values of $t$ and (2.4.4) for several values of $\varepsilon$ when $\theta = 0.25$ (see (2.4.1)). In addition, the effect on a nominal .05 tail probability is shown.
For very small values of $\varepsilon$ the effect is minimal, which accords with the robustness of $T_n$ in the sense of weak* continuity (see Hampel (1971)), since it is asymptotically normal at the model. As $\varepsilon$ increases, however, the effect becomes more serious, and inference based on $T_n$ can be substantially biased.

For related work see Stigler (1973), who observes that a bias of this type can arise when the trimmed mean is used for discrete or grouped data.
Table 2.1  Effect of contaminating mass ε, with θ = 0.25 fixed

    ε         t       r = (2.4.4)    Φ(−1.645 r)
    0        0.25        1             .05
    0.01     0.24        0.976         .054
    0.05     0.199       0.877         .074
    0.10     0.145       0.748         .109
    0.15     0.087       0.610         .158
    0.20     0.027       0.465         .222
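The entries of Table 2.1 can be reproduced directly from (2.4.1), the relation $t = \theta + \log(1-\varepsilon)$, and the ratio (2.4.4); a short sketch (added) follows.

```python
import numpy as np
from scipy.stats import norm

theta = 0.25
c = 1.0 / (np.sqrt(theta) * np.exp(theta))         # from (2.4.1)
for eps in (0.0, 0.01, 0.05, 0.10, 0.15, 0.20):
    t = theta + np.log(1.0 - eps)                  # solves M(theta; G) = 0
    r = 0.5 * (t / theta + (1 + t) / (1 + theta))  # the ratio (2.4.4)
    tail = norm.cdf(-1.645 * r)  # actual size of a nominal .05 test
    print(f"{eps:4.2f}  t={t:6.3f}  r={r:5.3f}  tail={tail:.3f}")
```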
2.5  Smooth score functions
In the example of the preceding section, one might argue that the parameter values where problems arise are unlikely to occur in practice, or that $c$ can be changed slightly. It is not, however, the non-normal limiting distribution of $T_n$ at certain distributions that is of concern, but the instability of inference based on $T_n$ near those distributions. This phenomenon can alternatively be interpreted as a discontinuity of the asymptotic variance functional $V(T(G); G) = \{m(T(G); G)\}^{-2} \int \psi^2(\cdot, T(G))\, dG$; cf. Huber (1981, p. 51). In the neighborhood of a distribution where $V$ is discontinuous, estimates of the variance of $T_n$ may be highly unstable.

Instability of this type can be avoided by requiring the M-estimator score function to be smooth. This can be achieved with negligible loss of efficiency at the model, relative to Hampel's optimal estimator (2.2.4), by replacing $\psi_c(\cdot)$ with a smooth approximation. A natural way to construct such a function is by rescaling a smooth distribution function.

Suppose $F$ is an absolutely continuous d.f. with density $f$ symmetric about zero. Then

(2.5.1)    $\psi(x) = c\,[\,2F\{x/(2cf(0))\} - 1\,]$

is monotone increasing, skew-symmetric about zero, and satisfies $\psi(\infty) = c$ and $\psi'(0) = 1$. Observe that $\psi_c$ is obtained from (2.5.1) by taking $F$ to be the uniform distribution on $[-\frac{1}{2}, \frac{1}{2}]$. This can be approximated arbitrarily closely by a symmetric beta distribution with a small value for the shape parameter, i.e., $f(x) \propto \{(\frac{1}{2}+x)(\frac{1}{2}-x)\}^a$ on $[-\frac{1}{2}, \frac{1}{2}]$. The resulting score function is complicated, however, and its second derivative has jump discontinuities. A more convenient choice is the logistic distribution, which leads to the smooth function

(2.5.2)    $L_c(x) = c \tanh(x/c)$.

This has appeared previously: $L_c(x-t)$ is proportional to the maximum likelihood score for the location of a logistic distribution, and Holland and Welsch (1977) include a regression M-estimator using $L_c$ in a Monte Carlo study of robust regression estimates.

For the important special case of estimating a Poisson parameter robustly, a smooth version of the optimal M-estimator solves

(2.5.3)    $n^{-1} \sum_{i=1}^{n} L_c(X_i\, t^{-1/2} - \beta(t)) = 0$,

where $\beta$ is defined in the usual manner.
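A numerical sketch of (2.5.3) (added for illustration): $\beta(t)$ is recomputed from the centering condition (2.2.3) with $L_c$ in place of $\psi_c$, and the outer equation is then solved for $t$. The data are hypothetical.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import brentq

def L(u, c=1.5):
    return c * np.tanh(u / c)                  # the smooth score (2.5.2)

def beta_smooth(t, c=1.5, xmax=200):
    """Centering condition (2.2.3) with L_c: E_t L(X / sqrt(t) - b) = 0."""
    x = np.arange(xmax)
    w = poisson.pmf(x, t)
    z = x / np.sqrt(t)
    return brentq(lambda b: np.sum(w * L(z - b, c)), -50.0, z.max() + 50.0)

def smooth_poisson_estimate(data, c=1.5):
    """Solve (2.5.3): n^{-1} sum_i L_c(X_i t^{-1/2} - beta(t)) = 0."""
    g = lambda t: np.mean(L(data / np.sqrt(t) - beta_smooth(t, c), c))
    return brentq(g, 1e-3, 10.0 * max(data.mean(), 1.0))

data = np.array([0, 1, 0, 2, 0, 1, 0, 0, 91])
print(smooth_poisson_estimate(data))  # stays near the Poisson bulk
print(data.mean())                    # the sample mean is pulled to ~10.6
```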
Table 2.2 gives asymptotic variances $V_\theta$ and bounds $\gamma_\theta$ on influence functions for the estimator defined by (2.5.3), labeled $L_c$, and the optimal estimator, labeled $\psi_c$. In each case $c = 1.5$. The calculations are at the Poisson model, and $V_\theta$ and $\gamma_\theta$ are stabilized by dividing by $\theta$ and $\theta^{1/2}$, respectively. Note that $V_\theta/\theta$ is the asymptotic relative efficiency of the maximum likelihood estimator (sample mean) with respect to the corresponding M-estimator. The asymptotic variances for the logistic score are slightly smaller than those for the "optimal" score. This is possible because the bounds on the influence function of $L_c$ are slightly higher than for $\psi_c$.
2.6  Further remarks

The need for smooth score functions is most clear when the data consist of counts. In this case every deviation from the model involves point masses.

An important consequence of Theorem 2.1 is that Hampel's optimal estimator (2.2.4) is indeed optimal as claimed when the model distribution is discrete. It would be disturbing if the theory were to break down at a countable number of parameter values. Moreover, the smooth versions discussed in Section 2.5, which provide more stable inference, are justified for every parameter value as being nearly optimal.

Although the discussion has focused on the score functions arising from Hampel's optimality theory, it is not limited to that context. For instance, a score based on Hampel's three-part redescending $\psi$ (see Huber (1981, p. 102)) will be prone to the difficulties here, and a smooth version will be more stable.
APPENDIX

2.A  Lipschitz condition for the Poisson density

Let $f(x,t) = e^{-t} t^x / x!$, $x = 0,1,2,\ldots$, $t > 0$. Then $|f(x,s) - f(x,t)| = f(x,t)\, |e^{t-s}(s/t)^x - 1|$. Now

$|e^{t-s}(s/t)^x - 1| = \bigl| e^{t-s}(s^x - t^x)\, t^{-x} + e^{t-s} - 1 \bigr| \leq |s-t|\, t^{-x} e^{|t-s|} \sum_{i=0}^{x-1} s^i t^{x-1-i} + |s-t| + \sum_{j=2}^{\infty} \dfrac{|s-t|^{j}}{j!}$

for $|s-t| \leq \delta$, using $|s^x - t^x| \leq |s-t| \sum_{i=0}^{x-1} s^i t^{x-1-i}$ and the expansion of $e^{t-s} - 1$. Simplifying (bound $s \leq t + \delta$ in the first sum and $|s-t|^{j} \leq |s-t|\, \delta^{j-1}$ in the second) yields

$|f(x,s) - f(x,t)| \leq |s-t|\, g(x,t)$,

where $g(x,t) = e^{2\delta} f(x-1, t+\delta) + \{1 + \delta^{-1}(e^{\delta} - 1 - \delta)\}\, f(x,t)$.
2.B  Properties of the optimal score for the Poisson parameter

This section provides details concerning the M-estimator score function (2.2.5) for the Poisson parameter. Let

(2.B.1)    $B(t,u) = \sum_{x=0}^{\infty} f(x,t)\, \psi_c(x t^{-1/2} - u)$,

where $f(x,t) = e^{-t} t^x / x!$, $x = 0,1,2,\ldots$, $t > 0$. Then $\beta(t)$ is defined by

(2.B.2)    $B(t, \beta(t)) = 0$.

Note that $B$ is continuous in both arguments. Since $B(t,-\infty) = c = -B(t,\infty)$, $\beta(t)$ always exists. If, moreover,

(2.B.3)    $|x_0\, t^{-1/2} - \beta(t)| < c$

for some non-negative integer $x_0$, then $\beta$ is uniquely defined and continuous at $t$, since $B(t,u)$ is then strictly decreasing in $u$ at $\beta(t)$; see, for instance, Franklin (1940, p. 50). See also Hampel (1968, p. 96), who observes but does not prove the continuity of $a = \beta - t^{1/2}$.
Proposition 2.B.1  If $c \geq 1$ then for each $t > 0$ (2.B.3) is satisfied for some non-negative integer $x_0$.

Proof  (2.B.3) holds at $x_0$ if and only if $x_0$ is in the open interval $(\ell(t), h(t))$, where $\ell(t) = t^{1/2}(\beta(t) - c)$ and $h(t) = t^{1/2}(\beta(t) + c)$. It will be shown that this interval always contains at least one non-negative integer if $c \geq 1$.

Since $B(t,0) > 0$, $B(t,\beta(t)) = 0$ and $B(t,u)$ is decreasing in $u$, we have $\beta(t) > 0$ for $t > 0$, and hence $h(t) > 0$. The length of the interval is $2ct^{1/2} \geq 2$ for $t \geq 1$, $c \geq 1$, so the result holds for $t \geq 1$. Suppose $0 < t < 1$. Then $t - ct^{1/2} < 0$ if $c \geq 1$. This implies

$B(t, t^{1/2}) = \sum_{x=0}^{\infty} f(x,t)\, \min\{c,\; (x-t)t^{-1/2}\} \leq 0$,

since $\sum_x (x-t)\, f(x,t) = 0$. It follows that $\beta(t) \leq t^{1/2} < c$. Hence $\ell(t) < 0 < h(t)$.

Write $\psi(x,t) = \psi_c(x t^{-1/2} - \beta(t))$. Then the following holds.

Proposition 2.B.2  If $c \geq 1$ then

$\lim_{s \to t} \| \psi(\cdot, s) - \psi(\cdot, t) \|_V = 0$,

where $\|\cdot\|_V$ is the total variation norm.

Proof  Let $K(s,t)$ be the set $\{x: \min(\ell(s), \ell(t)) \leq x \leq \max(h(s), h(t))\}$, where $\ell$ and $h$ are as in the proof of Proposition 2.B.1. Then

$|\psi_c(x s^{-1/2} - \beta(s)) - \psi_c(x t^{-1/2} - \beta(t))| \leq \begin{cases} |\beta(s) - \beta(t)| + |s^{-1/2} - t^{-1/2}|\, |x|, & x \in K(s,t) \\ 0, & x \notin K(s,t). \end{cases}$

Hence $\|\psi(\cdot,s) - \psi(\cdot,t)\|_V \leq |\beta(s) - \beta(t)| + |s^{-1/2} - t^{-1/2}|\, \|g\|_V$, where $g(x) = |x|\, I_{K(s,t)}(x)$. Since $\|g\|_V$ is uniformly bounded for $s$ in a compact interval about $t$, the result follows from the continuity of $\beta$ for $c \geq 1$.
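A quick numerical check of Proposition 2.B.2 (added; the grid-based total variation below is only a finite approximation to $\|\cdot\|_V$): for $c \geq 1$ the map $t \mapsto \psi(\cdot,t)$ moves continuously in total variation.

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import brentq

C = 1.5

def beta(t, xmax=400):
    x = np.arange(xmax)
    w = poisson.pmf(x, t)
    B = lambda b: np.sum(w * np.clip(x / np.sqrt(t) - b, -C, C))
    return brentq(B, -100.0, 100.0)

def score(xgrid, t):
    return np.clip(xgrid / np.sqrt(t) - beta(t), -C, C)

def tv_norm(h):
    return np.abs(np.diff(h)).sum()   # total variation on a fine grid

xgrid = np.linspace(-5.0, 60.0, 20001)
t = 2.0
for s in (2.5, 2.1, 2.01, 2.001):
    print(s, tv_norm(score(xgrid, s) - score(xgrid, t)))  # -> 0 as s -> t
```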
2.C  Proof of Theorem 2.2

Since the d.f. $H$ is continuous, uniform convergence in (2.3.7) will follow from pointwise convergence via Polya's Theorem (Serfling (1980, p. 18)). Write $M(t)$ for $M(t; G)$ and $m(t)$ for $m(t; G)$. Denote by $U(\delta)$ the set $\{t: 0 < |t - \theta| < \delta\}$. By (B1), $m$ is defined on $U(\delta)$ if $\delta$ is sufficiently small. Moreover, given $\epsilon > 0$, there is a $\delta$ for which $t \in U(\delta)$ implies

$|m(t) - m(\theta-)| < \epsilon$ if $t < \theta$  and  $|m(t) - m(\theta+)| < \epsilon$ if $t > \theta$.

Choosing $\epsilon < \min\{|m(\theta-)|, |m(\theta+)|\}$ then guarantees that $|m(t)|$ is bounded away from zero on $U(\delta)$. Fix such a $\delta$.

Since $M(\theta) = 0$, $t \in U(\delta)$ implies

(2.C.1)    $M(t) = m(\tau)(t - \theta)$

for some $\tau$ strictly between $t$ and $\theta$, by the Mean Value Theorem (which only requires one-sided derivatives at the endpoints of the interval on which it is applied). Since $m$ is bounded away from zero on $U(\delta)$, (2.C.1) shows

$|t - \theta| = O(|M(t)|)$  as $t \to \theta$.

The right hand side of (2.C.1) equals

(2.C.2)    $D(t)(t-\theta) + R(t)$,

where

$D(t) = m(\theta+)\, I(t > \theta) + m(\theta-)\, I(t < \theta)$,
$R(t) = [\{m(\tau) - m(\theta+)\}\, I(t > \theta) + \{m(\tau) - m(\theta-)\}\, I(t < \theta)](t - \theta)$,

and $I(A)$ is the indicator for the set $A$. Note that (2.C.2) also holds if $t = \theta$. Since $R(t) = o(|t-\theta|) = o(|M(t)|)$, (2.C.1) and (2.C.2) yield

(2.C.3)    $D(T_n)\, n^{1/2}(T_n - \theta) = n^{1/2} M(T_n) + o_p(|n^{1/2}(T_n - \theta)|)$.

Because of (B2), (B3) and the Lindeberg-Levy central limit theorem, the right hand side of (2.C.3) converges in distribution to a $N(0, \sigma^2)$ random variable, and, hence, so does the left hand side.

To obtain the limiting distribution of $T_n$, partition its range and consider cases. If $z < 0$ then

$\mathrm{pr}\{n^{1/2}(T_n - \theta) \leq z,\; T_n > \theta\} = 0$,

while

$\mathrm{pr}\{n^{1/2}(T_n - \theta) \leq z,\; T_n < \theta\} = \mathrm{pr}\{|D(T_n)|\, n^{1/2}(T_n - \theta) \leq |D(T_n)|\, z\}$.

Since $D(T_n) = m(\theta-)$ when $T_n < \theta$, and $D(t)$ does not change sign on $(\theta-\delta, \theta+\delta)$ by (B1), (2.C.3) implies that this last probability converges to $\Phi(|m(\theta-)|\, z/\sigma)$ as $n \to \infty$. Similar arguments establish that, for $z > 0$,

$\mathrm{pr}\{n^{1/2}(T_n - \theta) \leq z,\; T_n > \theta\} \to \Phi(|m(\theta+)|\, z/\sigma) - \tfrac{1}{2}$,

and finally

$\mathrm{pr}\{n^{1/2}(T_n - \theta) \leq 0\} = 1 - \mathrm{pr}\{|m(\theta+)|\, n^{1/2}(T_n - \theta) > 0\} \to \tfrac{1}{2}$

as $n \to \infty$. The result follows by collecting terms.
CHAPTER III

MINIMUM HELLINGER DISTANCE ESTIMATION FOR DISCRETE DATA

3.1  Introduction
Stability of an M-estimate is achieved by bounding the score function in the estimating equation. An alternative approach to stable estimation is to estimate a parametric distribution directly by minimizing some discrepancy between the data and the model. For a suitably chosen discrepancy criterion the resulting estimator will be stable.

This chapter investigates minimum Hellinger distance (MHD) estimation as a means of dealing with outliers and minor deviations from a parametric model when the data consist of counts. The existence of a parametric model $\{F_\theta;\ \theta \in \Theta\}$ that is useful and reasonable is assumed. Suppose $F_\theta$ has a density $f_\theta$ with respect to a measure $\mu$, e.g., counting measure, Lebesgue measure. If $f_n$ is a non-parametric estimate of the density based on a sample of $n$ observations, then an MHD estimate of $\theta$ minimizes

$\| f_n^{1/2} - f_\theta^{1/2} \|_\mu$

as a function of $\theta$, where $\|\cdot\|_\mu$ is the $L^2$ norm with respect to $\mu$.

Originally introduced in the context of a parametric multinomial distribution, where $f_n$ is taken to be the vector of observed proportions (Matusita, 1954), MHD estimation has been of theoretical interest as a member of the class of "regular best asymptotically normal" estimates (Rao, 1963; 1973, p. 352). Beran (1977a, 1977b) is the first to propose the use of an MHD estimate outside the multinomial context and to suggest that the estimator is robust. Arguing for Hellinger differentiability of a statistical function as a criterion for robustness, he discusses a local asymptotic minimax property of the MHD estimator among Hellinger differentiable estimators of $\theta$.
Beran (1977b) is concerned with estimation of a continuous parametric distribution and proposes the use of a kernel density estimate in the estimation process. Here the focus is on parametric models for count data. To motivate further study of the MHD estimator for count models, data from the Drosophila assay for chemical mutagenicity, described in Chapter I, are used in Section 3.2 to contrast the behavior of this estimator with that of the maximum likelihood estimator.

Of special interest here are count models with infinite support. In this case, as in the finite dimensional case, taking $f_n$ to be the vector of observed proportions, i.e., the empirical count density, provides a consistent estimate of the actual density. See Section 3.3.

The MHD estimator for $\theta$ can be expressed as a functional of $F_n$, the distribution corresponding to $f_n$. Beran's (1977b) discussion of the existence, uniqueness and continuity of this functional assumes $\Theta$ to be compact and the Hellinger distance function to be continuous on $\Theta$. Although this requirement is satisfied by a continuous location-scale model, after a transformation of the parameter space (Beran, 1977b, p. 48), it is too stringent for certain count models, e.g., the two-parameter negative binomial model. Consequently, an extension of Beran's result is given in Section 3.3.

Beran (1977b) also shows that the MHD estimator for a continuous parametric distribution is asymptotically normal, as well as efficient at the model, but the result requires the model distribution to have compact support. This is a more difficult analogue of the known asymptotic normality of the MHD estimator for a finite multinomial distribution. Both results assume that the model distribution has bounded support and, more importantly, that the density is bounded away from zero on the region of support.

Stather (1981) demonstrates, both for continuous distributions and for count distributions, that asymptotic normality of the MHD estimator does not require the support of the model density to be bounded. His results, however, are obtained at the cost of extra conditions on the underlying density. In Section 3.4 it is shown that, for a count distribution with infinite support, the MHD estimator is asymptotically normal without extra conditions involving the underlying distribution. For this, an extra condition on the model is needed that, nevertheless, appears to be weaker than the conditions imposed by Stather when the model obtains.
3.2  Minimum Hellinger distance versus maximum likelihood
A minimum distance estimator is not guaranteed to have desirable robustness properties, nor is it guaranteed to be reasonably efficient at the model. In this regard it is instructive to compare the behavior of the MHD estimator to that of the maximum likelihood (ML) estimator for a count model. Recall that in this situation the ML estimator is also a minimum distance estimator.

The MHD and ML estimators are fruitfully compared via their estimating equations. Suppose $F_\theta$, $\theta \in \Theta \subset R^d$, is a parametric count distribution with density $f_\theta$ on the non-negative integers. For $x = 0,1,2,\ldots$ let $f_n(x)$ be the proportion of $x$'s observed in a sample of size $n$. Equating the derivative of (1.1.2) to zero yields the estimating equation for the ML estimator:

(3.2.1)    $\sum_{x=0}^{\infty} \dot\ell_\theta(x)\, f_n(x) = 0$,

where $\dot\ell_\theta(x) = \frac{\partial}{\partial\theta} \log f_\theta(x)$. Similarly (1.1.3) leads to the estimating equation for the MHD estimator:

(3.2.2)    $\sum_{x=0}^{\infty} \dot\ell_\theta(x)\, \{f_\theta(x)\, f_n(x)\}^{1/2} = 0$.

Since the expectation in (3.2.2) is taken with respect to the geometric mean of $f_\theta$ and $f_n$, rather than $f_n$ itself, observations that are improbable relative to the model have substantially less effect on the MHD estimate than on the ML estimate. On the other hand, when the model provides a good fit to the data, there will be close agreement between the solutions of (3.2.1) and (3.2.2); the ML and MHD estimates will be nearly the same.
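For count data the MHD estimate is easy to compute by direct minimization; the sketch below (added; the sample is synthetic, constructed in the spirit of the assay data) minimizes the squared Hellinger distance for a Poisson model and compares the result with the sample mean, the solution of (3.2.1).

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

def mhd_poisson(x, xmax=None):
    """Minimize the squared Hellinger distance (1.1.3) over theta."""
    xmax = xmax if xmax is not None else int(x.max()) + 20
    k = np.arange(xmax)
    fn = np.bincount(x, minlength=xmax)[:xmax] / len(x)
    d2 = lambda theta: np.sum((np.sqrt(fn) - np.sqrt(poisson.pmf(k, theta))) ** 2)
    return minimize_scalar(d2, bounds=(1e-6, x.mean() + 1.0), method="bounded").x

x = np.array([0] * 23 + [1] * 7 + [2] * 3 + [91])  # one aberrant male
print(mhd_poisson(x))  # near 13/33 = 0.39, the mean with the 91 removed
print(x.mean())        # about 3.06, dominated by the single outlier
```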
Consider the data in Table 3.1, which shows four control runs from the Drosophila assay described in Chapter I. A Poisson model has been fit within each run using the ML and MHD estimators (the ML estimate is just the sample mean, of course). The estimated standard deviations, shown in parentheses, are computed as

$\{n\, i(T_n)\}^{-1/2} = (T_n/n)^{1/2}$,

where $T_n$ is the estimate of the Poisson mean for a sample of size $n$ and $i(t) = t^{-1}$ is the Fisher information. This is a rough approximation, and it is biased if the Poisson model is inexact; however, it provides a scale on which the estimates can be compared.

For Day 27 and the first run on Day 177, the ML and MHD estimates are nearly the same, and the fitted frequencies for both agree quite well with the observed frequencies. The estimates differ substantially, however, for Day 28 and the second run on Day 177, both of which contain exceptionally large counts relative to the Poisson models indicated by the remaining observations. The ML models for these two runs fit the frequencies rather poorly.
In contrast, the MHD models, which give little weight to the improbable counts, provide a reasonable summary of the remaining frequencies. In addition, the extreme counts are readily identified as being unusual relative to the model because of their small fitted frequencies.

The observed similarity between the MHD and ML estimates when the model appears adequate, and the differences between them in the presence of outlying observations, lend credence to the claim that the MHD estimator is both efficient at the model and robust.

One could, of course, envision alternative approaches to the analysis of these data. The fitting of a Poisson model is used here to illustrate the potential usefulness of an MHD fit to count data prone to outliers.
3.3  Almost sure convergence
Implementing MHD estimation requires a non-parametric estimate of the density. The key requirement is that the estimate be consistent in the Hellinger metric. When the data consist of counts the empirical density is convenient. For a sample of $n$ observations $X_1, X_2, \ldots, X_n$, the empirical density is given by

(3.3.1)    $f_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} I(X_i = x)$,

where $I(\cdot)$ is the indicator set function. The estimate $f_n$ is strongly Hellinger consistent, as follows.

Proposition 3.1  Suppose $X_1, X_2, \ldots$ are independent random variables each having count density $g$ on $\{0,1,2,\ldots\}$. If $f_n$ is as in (3.3.1), then

(3.3.2)    $\Bigl[\sum_{x=0}^{\infty} \{f_n^{1/2}(x) - g^{1/2}(x)\}^2\Bigr]^{1/2} \to 0$  almost surely as $n \to \infty$.

Proof  The Strong Law of Large Numbers and Scheffe's Theorem (Scheffe, 1947; Billingsley, 1968, p. 224) imply that $\sum_x |f_n(x) - g(x)| \to 0$ a.s. Since

$\sum_x \{f_n^{1/2}(x) - g^{1/2}(x)\}^2 \leq \sum_x |f_n(x) - g(x)|$,

(3.3.2) follows.

Remark  Stather (1981) establishes the weak Hellinger consistency (convergence in probability) of $f_n$ assuming $g$ to have a finite first moment. This condition is unnecessary.
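Proposition 3.1 is easy to check by simulation; a brief sketch (added) follows.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
support = np.arange(60)
g = poisson.pmf(support, 2.0)            # the true count density
for n in (100, 1000, 10000, 100000):
    x = rng.poisson(2.0, size=n)
    fn = np.bincount(x, minlength=len(support))[:len(support)] / n
    h = np.sqrt(np.sum((np.sqrt(fn) - np.sqrt(g)) ** 2))
    print(n, h)   # the Hellinger distance (3.3.2) shrinks as n grows
```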
Ignoring momentarily questions of existence and uniqueness, the MHD estimator is a functional $T$ of $F_n$, the distribution corresponding to $f_n$. Suppose $F_\theta$ is absolutely continuous with respect to a measure $\mu$ and $f_\theta$ is its density, $\theta \in \Theta$. Then the MHD functional for $\theta$ is defined at a distribution $G$ by

(3.3.3)    $\| g^{1/2} - f^{1/2}(\cdot, T(G)) \|_\mu = \min_{t \in \Theta} \| g^{1/2} - f^{1/2}(\cdot, t) \|_\mu$,

where $g$ is the density of the absolutely continuous part of $G$ with respect to $\mu$, and where $\|\cdot\|_\mu$ denotes $L^2$ distance with respect to $\mu$. The subscript $\mu$ will be omitted in the sequel.

With $f_n$ as in (3.3.1), $F_n$ is the empirical distribution function. In view of the previous proposition, the almost sure convergence of $T_n = T(F_n)$ to $T(G)$, for a count distribution $G$, will follow from the Hellinger continuity of $T$ at $G$. When $\Theta$ is compact the existence and continuity of $T$ are characterized in Theorem 1 of Beran (1977b); the Lebesgue measure there can be replaced by any $\sigma$-finite measure on the real line. Beran notes that the theorem also applies when $\Theta$ can be embedded in a compact space $\bar\Theta$ and the distance function $d(t) = \| g^{1/2} - f_t^{1/2} \|$ extends continuously to $\bar\Theta$. This is the case, for instance, when $\{F_\theta\}$ is a continuous location-scale family of distributions. In some instances, however, $d$ does not extend to a continuous function.
Example 3.1  Consider the two-parameter negative binomial distribution with density

(3.3.4)    $f_\theta(x) = \dfrac{\Gamma(x + c^{-1})}{x!\, \Gamma(c^{-1})} \Bigl(\dfrac{mc}{1+mc}\Bigr)^{x} \Bigl(\dfrac{1}{1+mc}\Bigr)^{1/c}$,

$x = 0,1,2,\ldots$, where $\theta = (m,c)$ and $\Theta = (0,\infty) \times [0,\infty)$. This parametrization is discussed in Collings and Margolin (1985). That $c = 0$ corresponds to the Poisson distribution with mean $m$ is well known.

To compactify the parameter space first write $m = \tan(\rho)\cos(\eta)$ and $c = \tan(\rho)\sin(\eta)$, where $0 < \rho < \pi/2$ and $0 \leq \eta < \pi/2$. Then $\rho$ and $\eta$ parameterize the projection of $\Theta$ onto a sphere with unit diameter, as shown in Figure 3.1 (cf. Apostol, 1958, p. 11). A compact space $\bar\Theta$ results by including $\eta = \pi/2$, which corresponds to $m = 0$, and by mapping $(m,c) = (0,0)$ to the south pole of the sphere ($\rho = 0$), and points with $m = \infty$ or $c = \infty$ to the north pole ($\rho = \pi/2$). A related transformation is used by Beran (1977b, p. 48) for location and scale parameters.

To see that $d$ is not continuous on $\bar\Theta$, first note that

$\log f_\theta(0) = -c^{-1} \log(1+mc) = -\dfrac{\log\{1 + \frac{1}{2}\tan^2(\rho)\sin(2\eta)\}}{\tan(\rho)\sin(\eta)}$,

which converges to zero as $\rho \to \pi/2$ when $\eta$ is fixed and positive. Thus $f_\theta(0) \to 1$ and $d^2(\theta) \to 2 - 2g^{1/2}(0)$ as $\rho \to \pi/2$ with $\eta \neq 0$ fixed. On the other hand, $f_\theta(x) \to 0$, $x = 0,1,2,\ldots$, as $\rho \to \pi/2$ with $\eta = 0$, which implies that $d^2(\theta) \to 2$ (this corresponds to letting $m \to \infty$ in the Poisson model). Hence, if $g(0) > 0$, $d$ does not have a well-defined limit at $\rho = \pi/2$.

Figure 3.1  Projection of $\Theta$ onto a sphere
Although in Example 3.1 the attempt to embed $\Theta$ in a compact space $\bar\Theta$ such that $d$ is continuous on $\bar\Theta$ has failed, this does not show that such an embedding cannot be done. Rather it shows that such embeddings can be difficult, especially in multiparameter models, and that consistency results for non-compact parameter spaces can be useful.

The continuity of the distance function at the boundary of $\Theta$ is not essential to the existence or continuity of $T(\cdot)$ at $G$. It is enough that parameter values outside a sufficiently large compact set correspond to distributions in $\{F_\theta\}$ that fit $G$ poorly when fit is measured by the Hellinger distance. This is the idea of the next result, which extends Theorem 1 of Beran (1977b).
Using the notation of (3.3.3), define $\mathcal{G}$ to be the class of distributions $G$ for which there exist $t^* \in \Theta$, $\epsilon > 0$ and a compact subset $C$ of $\Theta$ satisfying

(3.3.5)    $\inf_{t \in \Theta - C} \| g^{1/2} - f^{1/2}(\cdot,t) \| \geq \| g^{1/2} - f^{1/2}(\cdot,t^*) \| + \epsilon$.

If $\Theta$ is compact then (3.3.5) is vacuously satisfied when $C = \Theta$. In this case $\mathcal{G}$ is the class of all distribution functions. Notice that if $G$ is singular with respect to $\mu$ then $\| g^{1/2} - f_t^{1/2} \| = 2^{1/2}$ for all $t \in \Theta$.

Theorem 3.1  If $f(\cdot,t)$ is continuous a.e. $[\mu]$ in $t$ then, for each $G \in \mathcal{G}$,

(i)  a solution $T(G) \in \Theta$ of (3.3.3) exists, and

(ii)  if $T(G)$ is unique, then for distributions $G_n$ having densities $g_n$ with respect to $\mu$, $n = 1,2,\ldots$, $\| g_n^{1/2} - g^{1/2} \| \to 0$ implies $T(G_n) \to T(G)$ as $n \to \infty$.

Proof  To prove (i) use the fact that, as in Theorem 1 of Beran (1977b), $\| g^{1/2} - f_t^{1/2} \|$ is continuous in $t$, so its minimum over $C$ is achieved, say, at $\bar t$. By (3.3.5) the value at $\bar t$ is also a global minimum over $\Theta$.
To establish (ii) it is first shown that $T(G_n)$ exists for large $n$ when $\| g_n^{1/2} - g^{1/2} \| \to 0$ and $G \in \mathcal{G}$. By (i) it is sufficient to show that $G_n \in \mathcal{G}$ eventually. The Minkowski inequality yields

$\| g_n^{1/2} - f_t^{1/2} \| \leq \| g_n^{1/2} - g^{1/2} \| + \| g^{1/2} - f_t^{1/2} \|$

and

$\| g^{1/2} - f_t^{1/2} \| \leq \| g_n^{1/2} - g^{1/2} \| + \| g_n^{1/2} - f_t^{1/2} \|$,

so that

(3.3.6)    $\bigl|\, \| g_n^{1/2} - f_t^{1/2} \| - \| g^{1/2} - f_t^{1/2} \|\, \bigr| \leq \| g_n^{1/2} - g^{1/2} \|$.

Since the right hand side of (3.3.6) is independent of $t$, the inequality holds uniformly in $t$. For large $n$, $\| g_n^{1/2} - g^{1/2} \| \leq \epsilon/4$ and, by (3.3.5),

$\inf \| g_n^{1/2} - f_t^{1/2} \| \geq \| g^{1/2} - f_{t^*}^{1/2} \| + 3\epsilon/4$,

where the infimum is over $\Theta - C$. Since, in addition, $\bigl|\, \| g_n^{1/2} - f_{t^*}^{1/2} \| - \| g^{1/2} - f_{t^*}^{1/2} \|\, \bigr| \leq \epsilon/4$ for large $n$, by (3.3.6), it follows that

$\inf_{t \in \Theta - C} \| g_n^{1/2} - f_t^{1/2} \| \geq \| g_n^{1/2} - f_{t^*}^{1/2} \| + \epsilon/2$

eventually. Hence $G_n \in \mathcal{G}$ and, moreover, $T(G_n) \in C$ eventually. Restricting attention to the compact set $C$, (ii) now follows from Theorem 1 of Beran (1977b) with Lebesgue measure replaced by $\mu$.
Identifiability of $\{F_\theta\}$ is equivalent to the requirement that $\theta \neq \theta'$ implies $\| f_\theta^{1/2} - f_{\theta'}^{1/2} \| > 0$; see Pitman (1979). Consequently $T(F_\theta) = \theta$ uniquely whenever $\{F_\theta\}$ is identifiable. This fact and Theorem 3.1 yield the following corollary.

Corollary 3.1  Suppose that $t \neq t'$ implies $\| f_t^{1/2} - f_{t'}^{1/2} \| > 0$ for $t, t' \in \Theta$.
Suppose a compact subset $C$ of $\Theta$ and $\epsilon > 0$ exist for which

(3.3.7)    $\inf_{t \in \Theta - C} \| f_\theta^{1/2} - f_t^{1/2} \| \geq \epsilon$,

where $\theta \in C$. Suppose also that $f_t$ is continuous a.e. $[\mu]$ as a function of $t \in C$. Then $\| g_n^{1/2} - f_\theta^{1/2} \| \to 0$ implies $T(G_n) \to \theta$ as $n \to \infty$.
A useful sufficient condition for (3.3.5) is the existence of an increasing sequence $C_1 \subset C_2 \subset \cdots$ of compact sets in $\Theta$ for which

(3.3.8)    $\lim_{n \to \infty}\; \inf_{t \in \Theta - C_n} \| g^{1/2} - f_t^{1/2} \| > \| g^{1/2} - f_{t^*}^{1/2} \|$

for some $t^* \in \Theta$. The left hand limit in (3.3.8) always exists since the infimum over $\Theta - C_n$ is increasing and bounded.
3.1 (continued)
and let C = {(].1,o): n
n
-1
Consider again the negative binomial model (3.3.4),
:::;m:::;n, O:::;C:::;n}, n=I,2, . . . .
The discussion follow-
ing (3.3.4) implies that
(3.~.5)
Hence
holds at G if at least one negative binomial distribution fits
G. better than a point mass at zero dqes.
At the model this amounts to
1
2- 2 ~(O»
0, equivalently, fe(O) < I, which holds for every eE 0.
It
follows that the MHO estimatOr is consistent at the negative binomial
model.
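The consistency argument can be illustrated numerically. The following sketch is not part of the original text; it assumes NumPy and SciPy are available, parameterizes the negative binomial by its mean m and dispersion c (so the variance is m(1 + cm)), and computes the MHD estimate by minimizing the Hellinger distance of (3.3.3) between the empirical density and the model.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import nbinom

    def nb_density(x, m, c):
        # negative binomial with mean m and variance m*(1 + c*m); size r = 1/c
        r = 1.0 / c
        return nbinom.pmf(x, r, r / (r + m))

    def mhd_estimate(sample, support_max=200):
        # minimize the squared Hellinger distance between the empirical
        # density f_hat and the model density, as in (3.3.3)
        x = np.arange(support_max + 1)
        f_hat = np.bincount(sample, minlength=support_max + 1)[:support_max + 1]
        f_hat = f_hat / len(sample)

        def objective(theta):
            m, c = theta
            if m <= 0 or c <= 0:
                return 2.0  # the maximal squared Hellinger distance
            return np.sum((np.sqrt(f_hat) - np.sqrt(nb_density(x, m, c))) ** 2)

        return minimize(objective, x0=[sample.mean(), 0.5],
                        method="Nelder-Mead").x

    rng = np.random.default_rng(0)
    r = 1.0 / 0.4  # true dispersion c = 0.4
    sample = rng.negative_binomial(r, r / (r + 5.0), size=500)  # true mean m = 5
    print(mhd_estimate(sample))  # roughly (5.0, 0.4)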
3.4  Asymptotic normality and efficiency

As in the previous section, θ is a d-dimensional vector parameter, and F_θ, θ ∈ Θ, and G have densities f_θ and g with respect to a σ-finite measure μ. Later μ = counting measure. If the parameterization is smooth and further regularity conditions are satisfied, the squared distance

(3.4.1)    H(t;G) = ||f_t^{1/2} − g^{1/2}||²

has a vector derivative Ḣ(t;G) with respect to t. The MHD functional at G is then a solution of

(3.4.2)    Ḣ(t;G) = 0.

This section treats the asymptotic distribution of sequences {T_n} of solutions to

(3.4.3)    Ḣ(T_n; F_n) = 0,

when μ is counting measure on {0,1,2,…} and F_n is the empirical distribution function.
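For a one-dimensional parameter, a root of (3.4.3) can be located by direct bracketing. The sketch below is illustrative only (not from the original text); it assumes NumPy/SciPy and uses the Poisson family, whose score function is x/t − 1.

    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import poisson

    def H_dot(t, f_hat, x):
        # Hdot(t; F_n) = -2 * sum_x sdot_t(x) * f_n(x)^{1/2}, where for the
        # Poisson sdot_t(x) = (1/2) * (x/t - 1) * f_t(x)^{1/2}
        f_t = poisson.pmf(x, t)
        return -np.sum((x / t - 1.0) * np.sqrt(f_t) * np.sqrt(f_hat))

    rng = np.random.default_rng(1)
    sample = rng.poisson(3.0, size=400)
    x = np.arange(sample.max() + 50)
    f_hat = np.bincount(sample, minlength=x.size)[:x.size] / len(sample)

    T_n = brentq(H_dot, 0.1, 50.0, args=(f_hat, x))  # the MHD estimate
    print(T_n)  # close to 3.0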
Using the following lemma, conditions can be stated under which
(3.4.1) is sufficiently differentiable for (3.4.2) to have a first order
expansion.
Lemma 3.1    For t ∈ Θ ⊂ R^d let a_t be an R^p-valued function, p ≥ 1, with components in L²(μ). Suppose there is a p × d matrix ȧ_t with elements in L²(μ) satisfying

|u|^{−1} {a_{t+u} − a_t − ȧ_t u} → 0

componentwise in L²(μ) as |u| → 0, where |·| is the infinity norm on R^d. If b ∈ L²(μ) then ∫ a_t b dμ has the derivative ∫ ȧ_t b dμ with respect to t.

Proof    Let a_{t,i} denote the i-th component of a_t and ȧ_{t,i} the i-th row of ȧ_t. Then the i-th component of

∫ (a_{t+u} − a_t − ȧ_t u) b dμ

is

∫ {a_{t+u,i} − a_{t,i} − ȧ_{t,i} u} b dμ.

By the Cauchy-Schwarz inequality this is dominated in absolute value by ||a_{t+u,i} − a_{t,i} − ȧ_{t,i} u|| ||b||. Hence

| ∫ a_{t+u} b dμ − ∫ a_t b dμ − ∫ ȧ_t u b dμ | ≤ max{ ||a_{t+u,i} − a_{t,i} − ȧ_{t,i} u|| ||b|| ; i = 1,…,p } = o(|u|).

For notational convenience let s_t = f_t^{1/2}. Following Beran (1977b), introduce smoothness conditions on s_t as follows.
For t interior to Θ suppose a (d×1) vector ṡ_t and a (d×d) matrix s̈_t with components in L²(μ) exist and satisfy

(3.4.4)    |u|^{−1} || s_{t+u} − s_t − u^T ṡ_t || → 0

as |u| → 0 and

(3.4.5)    |u|^{−1} {ṡ_{t+u} − ṡ_t − s̈_t u} → 0

componentwise in L²(μ) as |u| → 0.

Using Lemma 3.1 and the fact that H(t;G) = 2 − 2 ∫ s_t g^{1/2} dμ, condition (3.4.4) implies that H has the derivative

(3.4.6)    Ḣ(t;G) = −2 ∫ ṡ_t g^{1/2} dμ

with respect to t^T, and similarly (3.4.4) and (3.4.5) imply that Ḣ has the derivative

(3.4.7)    Ḧ(t;G) = −2 ∫ s̈_t g^{1/2} dμ

with respect to t. Observe that ṡ_t = ½ ℓ̇_t f_t^{1/2} a.e. [μ], where ℓ̇_t is the derivative of log f_t with respect to t^T.

The following theorem, which provides an expansion for the MHD functional, is a straightforward modification of Theorem 2 of Beran (1977b).
Theorem 3.2    Suppose (3.4.4) and (3.4.5) hold for each t interior to Θ. Suppose also that Ḣ(t;G) has a zero θ interior to Θ and that Ḧ(θ;G) is nonsingular. Let the distribution functions G_n have densities g_n with respect to μ, n = 1,2,…, and assume ||g_n^{1/2} − g^{1/2}|| → 0 as n → ∞. If θ_n is a zero of Ḣ(·;G_n), n = 1,2,…, and if θ_n → θ as n → ∞, then

(3.4.8)    θ_n = θ + {−Ḧ(θ;G)^{−1} + o(1)} Ḣ(θ;G_n).

Remarks    1. Beran (1977b) incorporates conditions sufficient for the Hellinger continuity of the MHD functional into the result. It is stated here in terms of sequences of solutions to the estimating equation in order to separate conditions for consistency and asymptotic normality.

2. A consequence of (3.4.8) is the Hellinger differentiability of the MHD functional. Implications of Hellinger differentiability are discussed in Beran (1977a, 1977b). This type of differentiability appears to be too weak to be of use in deriving the asymptotic distribution of an estimator.
The expansion (3.4.8) shows, however, that, if f_n is a consistent estimate of the density, the asymptotic distribution of n^{1/2}{T(F_n) − T(G)} is determined by that of n^{1/2} Ḣ(θ;F_n).
Specializing now to count distributions, let F_n and f_n be, respectively, the empirical distribution function and density. Then, under an additional condition on the model, namely that ṡ_θ has components in L¹(μ), Ḣ(θ;F_n) can be approximated by an average of independent, identically distributed random variables as follows.
Lemma 3.2    Let X_1, X_2, … be independent random variables, each having distribution G with density g on {0,1,2,…}. Suppose θ is a zero of Ḣ(·;G) and (3.4.4) holds at θ. If, in addition, the components of ṡ_θ are in L¹(μ), where μ is counting measure, then

(3.4.9)    −Ḣ(θ;F_n) = n^{−1} Σ_{i=1}^{n} ṡ_θ(X_i) g(X_i)^{−1/2} + o_p(n^{−1/2}).
Lemma 3.2 is central to the derivation of the limiting distribution for the MHD estimator. The proof is given in Appendix 3.A. The main result of this section, the asymptotic normality of the MHD estimator at a count distribution with infinite support, can now be stated.
Theorem 3.3    Let F_t, t ∈ Θ ⊂ R^d, and G have densities f_t and g with respect to counting measure μ on {0,1,2,…}. Suppose (3.4.4) and (3.4.5) hold and the components of ṡ_t are in L¹(μ) for each t interior to Θ. Suppose Ḣ(t;G) has a zero θ interior to Θ, and Ḧ(θ;G) is nonsingular.

If T_n is a zero of Ḣ(t;F_n), where F_n is the empirical distribution function, and T_n → θ in probability, then n^{1/2}(T_n − θ) converges in distribution to a d-variate normal random variable with mean zero and covariance matrix

(3.4.10)    ¼ Ḧ(θ;G)^{−1} i(θ) Ḧ(θ;G)^{−1},

where i(t) = 4 ∫ ṡ_t ṡ_t^T dμ is the Fisher information matrix for F_t. In the special case G = F_θ, (3.4.10) is i(θ)^{−1}.

Proof    If X ∼ G then E ṡ_θ(X) g(X)^{−1/2} = −½ Ḣ(θ;G) = 0 and E ṡ_θ(X) ṡ_θ(X)^T g(X)^{−1} = Σ_x ṡ_θ(x) ṡ_θ(x)^T = ¼ i(θ). Hence, by Lemma 3.2 and a multivariate version of the Lindeberg-Levy Central Limit Theorem, n^{1/2} Ḣ(θ;F_n) converges in distribution to a d-variate normal random variable with mean zero and covariance matrix ¼ i(θ). Theorem 3.2 yields

n^{1/2}(T(F_n) − θ) = {−Ḧ(θ;G)^{−1} + o_p(1)} n^{1/2} Ḣ(θ;F_n),

and (3.4.10) follows. By a standard argument (justified by Lemma 3.1),

∫ s̈_θ f_θ^{1/2} dμ = −∫ ṡ_θ ṡ_θ^T dμ    if G = F_θ.

Hence Ḧ(θ;F_θ) = ½ i(θ) and (3.4.10) reduces to i(θ)^{−1}.
Remark    When G = F_θ the MHD estimator has the same limiting distribution as the maximum likelihood estimator. In fact a stronger result holds, namely, the MHD estimator is first order efficient in the sense of Rao (1973, p. 348) and hence asymptotically equivalent to the maximum likelihood estimator at F_θ. To see this, note that (3.4.8) and (3.4.9) yield

T(F_n) = θ + n^{−1} i(θ)^{−1} Σ_{i=1}^{n} ℓ̇_θ(X_i) + o_p(n^{−1/2}),

which is the expansion for the maximum likelihood estimator at F_θ, up to terms of order o_p(n^{−1/2}).
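The reduction of (3.4.10) to i(θ)^{−1} at the model is easy to check numerically. The following sketch (an illustration under assumed NumPy/SciPy, not part of the original text) finite-differences Ḣ(t;F_θ) for the Poisson family, for which i(θ) = 1/θ, and confirms that Ḧ(θ;F_θ) ≈ ½ i(θ) and that the sandwich covariance equals θ = i(θ)^{−1}.

    import numpy as np
    from scipy.stats import poisson

    theta = 3.0
    x = np.arange(200)           # effectively the whole support
    g = poisson.pmf(x, theta)    # evaluate at the model, G = F_theta

    def H_dot(t):
        # Hdot(t; F_theta) = -sum_x (x/t - 1) * (f_t(x) * g(x))^{1/2}
        return -np.sum((x / t - 1.0) * np.sqrt(poisson.pmf(x, t) * g))

    h = 1e-5
    H_ddot = (H_dot(theta + h) - H_dot(theta - h)) / (2 * h)
    i_theta = 1.0 / theta        # Poisson Fisher information

    print(H_ddot, 0.5 * i_theta)                      # Hddot = i(theta)/2
    print(0.25 * i_theta / H_ddot**2, 1.0 / i_theta)  # (3.4.10) = i(theta)^{-1}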
The requirement that ṡ_θ have components in L¹(μ) is somewhat restrictive, but it is a condition on the model only, not on the actual distribution. The strong result obtained using this restriction is analogous to the strong results that are possible concerning the asymptotic distribution of an M-estimator with a smooth, bounded score function. It is easily demonstrated that ṡ_θ ∈ L¹(μ) for a number of common count models, e.g., the Poisson, the negative binomial of (3.3.4), and the log series. For an instance where the condition fails, see Example 3.2 below.
For comparison, Stather (1981) requires, at a count distribution G with density g, the existence of a sequence of numbers {K_n} for which

(3.4.11)    Σ_{x=0}^{K_n} |ℓ̇(x,θ)| = o(n^{1/2})

and

(3.4.12)    Σ_{x>K_n} g(x) = o(n^{−1})

as n → ∞, where θ = T(g). In the context of robust estimation a condition of this type is somewhat unsatisfactory since it involves the underlying distribution G. Nevertheless it is of interest to compare (3.4.11) and (3.4.12) at the model with the condition used here that ṡ_θ ∈ L¹(μ).
Example 3.2    Word frequencies in text are sometimes modeled using the zeta distribution (see Johnson and Kotz, 1969, p. 240) with density

(3.4.13)    f_θ(x) = c(θ) x^{−(1+θ)},    x = 1,2,…, θ > 0,

where c^{−1} = Σ_y y^{−(1+θ)}. In this case ℓ̇_θ(x) = −log x + c'/c, ṡ_θ = ½ f_θ^{1/2} ℓ̇_θ, and the condition ṡ_θ ∈ L¹ is equivalent to

Σ_{x=1}^{∞} x^{−(1+θ)/2} log x < ∞,

which holds if and only if θ > 1. On the other hand, for (3.4.13) Stather's conditions (3.4.11) and (3.4.12) reduce to

Σ_{x=1}^{K_n} log x = o(n^{1/2})    and    Σ_{x>K_n} x^{−(1+θ)} = o(n^{−1}).

The former implies K_n log K_n = o(n^{1/2}), while the latter entails n = o(K_n^θ). Together these imply that n = o(n^{θ/2}). Hence θ > 2 is necessary for Stather's conditions to hold at the model; an extra moment is required.
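The boundary at θ = 1 can be seen from partial sums. A minimal sketch (illustrative only, assuming NumPy):

    import numpy as np

    def partial_sum(theta, N):
        # partial sum of x^{-(1+theta)/2} * log(x), the L^1 condition series
        x = np.arange(1, N + 1, dtype=float)
        return np.sum(x ** (-(1 + theta) / 2) * np.log(x))

    for theta in (0.8, 1.5):
        print(theta, [round(partial_sum(theta, N), 2)
                      for N in (10**3, 10**5, 10**6)])
    # theta = 0.8: the partial sums keep growing (the series diverges);
    # theta = 1.5: they level off (the series converges, so sdot is in L^1)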
In Appendix 3.B general conditions are given under which Stather's conditions at the model imply ṡ_θ ∈ L¹(μ). The possibility that (3.4.11) and (3.4.12) hold for some G but not at the model seems of limited importance.
3.5  Discussion

The asymptotic properties of the MHD estimator for count data make it a serious competitor to the maximum likelihood estimator, even when a great deal of confidence can be placed in the model. Work remains to be done concerning the extent to which its asymptotic properties obtain in finite samples. For some preliminary work in this direction see Stather (1981) and, in a different context, Tamura and Boos (1985).

The most attractive feature of the MHD estimator, however, is its insensitivity to outlying observations, as illustrated in Section 3.2. The stability of the MHD estimator is examined further, via its breakdown point, in the next chapter.
APPENDIX

3.A  Proof of Lemma 3.2
Write ṡ(x,t) for ṡ_t(x). The sum on the right in (3.4.9) is equal to

Σ_{x≥0} ṡ(x,θ) g(x)^{−1/2} f_n(x).

Let

R_n = 2 Σ ṡ(x,θ) f_n^{1/2}(x) − Σ ṡ(x,θ) g(x)^{−1/2} f_n(x).

Since Ḣ(θ;G) = −2 Σ ṡ(x,θ) g^{1/2}(x) = 0,

R_n = 2 Σ ṡ(x,θ) {f_n^{1/2}(x) − g^{1/2}(x)} − Σ ṡ(x,θ) g(x)^{−1/2} {f_n(x) − g(x)}
    = −Σ ṡ(x,θ) g(x)^{−1/2} {f_n^{1/2}(x) − g^{1/2}(x)}²,

where the last equality follows from the identity 2b^{1/2}(a^{1/2} − b^{1/2}) − (a − b) = −(a^{1/2} − b^{1/2})², true for a ≥ 0, b ≥ 0. Let R_{n,i} denote the i-th component of R_n and ṡ_i(x,t) the i-th component of ṡ(x,t). Then

(3.A.1)    n^{1/2} E|R_{n,i}| ≤ n^{1/2} E Σ_x |ṡ_i(x,θ)| g(x)^{−1/2} {f_n^{1/2}(x) − g^{1/2}(x)}².

By Fubini's Theorem, the right hand side of (3.A.1) is

(3.A.2)    Σ_{x=0}^{∞} |ṡ_i(x,θ)| g(x)^{−1/2} n^{1/2} E{f_n^{1/2}(x) − g^{1/2}(x)}².

Now E{f_n^{1/2}(x) − g^{1/2}(x)}² ≤ E|f_n(x) − g(x)| ≤ [g(x){1 − g(x)}/n]^{1/2}, so the summand in (3.A.2) is dominated by |ṡ_i(x,θ)|. Since ṡ_i(·,θ) ∈ L¹(μ), it will follow, by Dominated Convergence, that (3.A.2) converges to zero as n → ∞, if it is demonstrated that E Y_n(x) → 0, x = 0,1,2,…, where Y_n(x) = n^{1/2} {f_n^{1/2}(x) − g^{1/2}(x)}². But Y_n(x) converges in distribution to zero, and, since

E{Y_n(x)}^{2(1+ε)} ≤ [g(x){1 − g(x)}]^{(1+ε)/2} < ∞

uniformly in n for 0 < ε < 1, {Y_n²(x)} is uniformly integrable. Hence the desired convergence holds, and n^{1/2} E|R_{n,i}| → 0 by (3.A.1). Finally, the Markov inequality yields, for each ε > 0,

pr(n^{1/2} |R_{n,i}| > ε) ≤ ε^{−1} n^{1/2} E|R_{n,i}|.

This implies (3.4.9) since

pr(n^{1/2} max_{1≤i≤p} |R_{n,i}| > ε) ≤ Σ_{i=1}^{p} pr(n^{1/2} |R_{n,i}| > ε) → 0

as n → ∞.
Remark    If g(x) = 0 for any non-negative integer x, omit that term from all sums in the proof.
3.B  Comparison of conditions for asymptotic normality

Let f(x,θ) be a parametric count density and let ℓ̇(x,θ) = (d/dθ) log f(x,θ) (assume this exists). For simplicity assume θ is a univariate parameter (Stather only considers this case). Write f(x) = f(x,θ) and ℓ̇(x) = ℓ̇(x,θ). At the model, (3.4.11) and (3.4.12) become

(3.B.1)    Σ_{x=0}^{K_n} |ℓ̇(x)| = o(n^{1/2})

and

(3.B.2)    Σ_{x>K_n} f(x) = o(n^{−1}).

Proposition    Suppose (i) f(x) is decreasing for x sufficiently large, (ii) |ℓ̇(x)| is increasing for large x, and (iii) at least one of the following is true:

a) f has a regularly varying tail at ∞; or

b) |ℓ̇(x)| is a function of slow growth or regular variation.

Then (3.B.1) and (3.B.2) imply that ṡ_θ = ½ f^{1/2} ℓ̇ ∈ L¹(μ), where μ is counting measure.

Proof    Square (3.B.1) and multiply by (3.B.2) to obtain

(3.B.3)    {Σ_{x=0}^{K_n} |ℓ̇(x)|}² Σ_{y>K_n} f(y) = o(1).

If f has infinite support then (3.B.2) entails K_n → ∞, so K_n can be replaced by 2n in (3.B.3). Hence

{Σ_{x=n}^{2n} |ℓ̇(x)|}² Σ_{y=2n+1}^{3n} f(y) = o(1),

which implies that n³ |ℓ̇(n)|² f(3n) = o(1) by (i) and (ii). This in turn implies n^{3/2} |ℓ̇(n)| f^{1/2}(n) = o(1) because of (iii). The result follows since Σ_{n=1}^{∞} n^{−3/2} < ∞.
CHAPTER IV

BREAKDOWN ANALYSIS OF THE MINIMUM HELLINGER DISTANCE ESTIMATOR

4.1  Introduction
As originally introduced by Hampel (1968), the breakdown point of an estimator is roughly defined to be the smallest fraction of contamination of the data by erroneous observations that can cause the estimator to take on arbitrarily large values. Breakdown analysis is advocated by Donoho and Huber (1982) as a means for examining the robustness of an estimator in small samples.
Section 4.2 presents results concerning the breakdown point, as defined by Donoho and Huber (1982), of the MHD estimator. These results are general in scope; they are not restricted to the discrete case. Interesting complications arise, however, when discrete data are considered.
The usual definition of the breakdown point is not invariant to reparameterizations.
In the context of estimation for parametric models
this is an unsatisfactory aspect of breakdown analysis, since the probability structure remains the same regardless of the parameterization.
Section 4.3 introduces a natural alternative definition of the breakdown
point, invariant to reparameterizations of the model, and applies it to
the MHD estimator.
4.2  Breakdown point

Mixture contamination models of the form

(4.2.1)    (1 − ε)F + εG

will be considered, where 0 ≤ ε ≤ 1, and F and G are distribution functions. An analysis could be carried through using other types of contamination, e.g., by considering distributions within a Hellinger neighborhood of F, but the interpretation of the corresponding breakdown point would not be as clear.

It should be observed that, since the estimators considered here are functionals of the distribution, the data contamination models of Donoho and Huber (1982) can be recovered by taking F to be the appropriate estimate of the distribution function, and by restricting the range of ε to rational fractions of the form m/(n+m), where n is the sample size and m is the number of contaminants.
Denote by F_ε = F_ε(F) the set of ε-mixtures of the form (4.2.1) with G an arbitrary distribution function. Note that ε < ε' implies F_ε ⊂ F_{ε'}, since (1−ε)F + εG = (1−ε')F + ε'{(1 − (ε/ε'))F + (ε/ε')G} and 0 < ε/ε' < 1. For a functional T taking values in Θ ⊂ R^d let

b(ε; T,F) = sup{ |T(G) − T(F)| : G ∈ F_ε(F) }.

If b(ε; T,F) = ∞ then T is said to break down at F under ε-mixtures. The breakdown point ε*(T,F) of T at F is given by

ε*(T,F) = inf{ ε : b(ε; T,F) = ∞ };

cf. Donoho and Huber (1982).
In studying the breakdown properties of the MHD estimator it is convenient to work with the affinity

ρ(F,G) = ∫ (fg)^{1/2} dμ,

where F and G are dominated by the measure μ and have densities f and g respectively, e.g., μ = ½(F+G). Since ||f^{1/2} − g^{1/2}||² = 2 − 2ρ(F,G) if F and G are probability measures, minimizing the Hellinger distance is equivalent to maximizing the affinity. Note that 0 ≤ ρ(F,G) ≤ 1, with equality on the right if and only if μ(f ≠ g) = 0, and with equality on the left if and only if F and G are mutually singular. Moreover, ρ(F,G) does not depend on the choice of the dominating measure. See, e.g., Pitman (1979).
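These properties are easy to verify numerically for count distributions. A minimal sketch (illustrative only, assuming NumPy/SciPy; the two Poisson densities are an arbitrary choice):

    import numpy as np
    from scipy.stats import poisson

    x = np.arange(200)
    f = poisson.pmf(x, 3.0)
    g = poisson.pmf(x, 8.0)

    rho = np.sum(np.sqrt(f * g))                       # affinity rho(F,G)
    hell_sq = np.sum((np.sqrt(f) - np.sqrt(g)) ** 2)   # squared Hellinger distance

    print(rho)                   # strictly between 0 and 1 here
    print(hell_sq, 2 - 2 * rho)  # agree up to truncation of the support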
The following result provides a lower bound for the breakdown point of the MHD estimator at an arbitrary distribution F.

Theorem 4.1    Let ρ̂ = ρ̂(F) = max{ρ(F,F_t) : t ∈ Θ}, and suppose the maximum occurs interior to Θ. Let ρ* = ρ*(F) = lim_{M→∞} sup_{|t|>M} ρ(F,F_t), where |·| is equivalent to the euclidean norm on Θ ⊂ R^d. Then

(4.2.2)    ε*(T,F) ≥ (ρ̂ − ρ*)² / {1 + (ρ̂ − ρ*)²}

if T is the MHD functional for Θ.
Proof    First note that b(ε; T,F) = ∞ if and only if a sequence {G_n} of distributions exists for which

(4.2.3)    |T((1−ε)F + εG_n) − T(F)| → ∞

as n → ∞. That (4.2.3) implies b = ∞ is obvious. The converse holds since G_n can be selected for which |T((1−ε)F + εG_n) − T(F)| ≥ n, n = 1,2,….

Next note that the following inequalities hold for distributions F, G, and H, and for 0 ≤ α ≤ 1:

max{ (1−α)^{1/2} ρ(F,H), α^{1/2} ρ(G,H) } ≤ ρ((1−α)F + αG, H) ≤ (1−α)^{1/2} ρ(F,H) + α^{1/2} ρ(G,H).

Fix F. Let θ̂ be a value of t that maximizes ρ(F,F_t). Since ρ((1−ε)F + εG_n, F_θ̂) ≥ (1−ε)^{1/2} ρ̂, the existence of a sequence {G_n} of distributions for which (4.2.3) holds entails the existence of a sequence {t_n} ⊂ Θ for which |t_n| → ∞ and

ρ((1−ε)F + εG_n, F_{t_n}) ≥ (1−ε)^{1/2} ρ̂

eventually. But the left side is at most ε^{1/2} + (1−ε)^{1/2} ρ(F,F_{t_n}), and ρ(F,F_{t_n}) ≤ ρ*(F) + δ eventually for every δ > 0. Hence T cannot break down at F if (1−ε)^{1/2} ρ̂ > ε^{1/2} + (1−ε)^{1/2} ρ*, or, equivalently, if ε is less than the right side of (4.2.2).
The quantity ρ̂(F) in (4.2.2) depends on the fit of the model to F. The closer the fit the larger the lower bound for ε*. Frequently ρ̂(F_θ) = 1, θ ∈ Θ, and ρ* = 0, in which case the breakdown point is at least ½ at the model. In checking whether ρ* = 0 it is useful to note that this is equivalent to the convergence of ρ(F,F_{t_n}) to zero for every sequence {t_n} ⊂ Θ with |t_n| → ∞. Finally, observe that replacing F by F_n, the distribution associated with f_n (see Chapter III), yields a lower bound for the finite sample breakdown point of the MHD estimator. For a discussion of the breakdown point as a finite concept see Donoho and Huber (1982).
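The bound (4.2.2) can be evaluated numerically for a given F. The sketch below is illustrative (it assumes NumPy/SciPy, and the contaminated Poisson mixture F is a made-up example); for the Poisson family ρ* = 0, so only ρ̂ must be computed.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import poisson

    x = np.arange(400)
    # a made-up F: a Poisson(3) contaminated by a Poisson(10) component
    f = 0.9 * poisson.pmf(x, 3.0) + 0.1 * poisson.pmf(x, 10.0)

    def neg_affinity(t):
        return -np.sum(np.sqrt(f * poisson.pmf(x, t)))

    res = minimize_scalar(neg_affinity, bounds=(0.01, 100.0), method="bounded")
    rho_hat = -res.fun                     # best affinity over the Poisson family
    bound = rho_hat**2 / (1 + rho_hat**2)  # (4.2.2) with rho* = 0
    print(rho_hat, bound)                  # rho_hat < 1 off the model, bound < 1/2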
Since the contaminating distribution could be in the model family, it seems obvious intuitively that the breakdown point of the MHD estimator cannot exceed ½. That this is indeed the case is shown, under a condition on the model, in the following theorem.
Theorem 4.2    Suppose a sequence {t_n} ⊂ Θ exists for which |t_n| → ∞ and

(4.2.4)    lim_{n→∞} ρ(F_{s_n}, F_{t_n}) = 0

for every bounded sequence {s_n} ⊂ Θ. Then ε*(T,F) ≤ ½ for each F if T is the MHD functional for Θ.

Proof    It is shown that T breaks down under ε-mixtures at any distribution F if ε > ½. Suppose ε > ½ and T does not break down. Since

ρ((1−ε)F + εF_{t_n}, F_{t_n}) ≥ ε^{1/2},

n = 1,2,…, and |t_n| → ∞, there must be a sequence {s_n} ⊂ Θ and M < ∞ for which |s_n| ≤ M infinitely often and

(4.2.5)    ρ((1−ε)F + εF_{t_n}, F_{s_n}) ≥ ε^{1/2}

eventually. By considering a subsequence assume without loss of generality that |s_n| ≤ M for each n. Then

ρ((1−ε)F + εF_{t_n}, F_{s_n}) ≤ (1−ε)^{1/2} + ε^{1/2} ρ(F_{t_n}, F_{s_n}) → (1−ε)^{1/2}

as n → ∞ by (4.2.4). Since ε > ½ implies (1−ε)^{1/2} < ε^{1/2}, this contradicts (4.2.5). Hence T breaks down if ε > ½.
The special status of models with ρ* = 0, indicated by (4.2.2), and the condition (4.2.4) raise the question of when ρ(P_n, Q_n) → 0 for two sequences of probability measures {P_n} and {Q_n}. Note that P_n and Q_n need not converge in any sense. An equivalent condition is given in the following lemma. The proof is deferred to the Appendix.

Lemma 4.1    Let P_n and Q_n be probability measures on the measurable spaces (Ω_n, 𝒜_n), n = 1,2,…. Then ρ(P_n, Q_n) → 0 as n → ∞ if and only if there is a sequence of sets {A_n}, A_n ∈ 𝒜_n, for which P_n(A_n) → 0 and Q_n(A_n) → 1 as n → ∞.
The lower bound (4.2.2) guarantees that the breakdown point of the
MHD estimator is positive, and in many instances (4.2.2) is nearly ½.
Complications arise, however, in the context of discrete probability
models, as illustrated in the following examples.
Example 4.1    Let p_θ = p(·;θ) denote the Poisson density with mean parameter θ, and let P_θ = P(·;θ) denote the corresponding probability measure and distribution function. Consider the estimation of θ by the MHD estimator. That ε* ≤ ½ can be verified as follows. Let {t_n} be a sequence of positive real numbers tending to infinity, and let A_n be the open interval (t_n − t_n^{3/4}, ∞). Let {s_n} be a bounded sequence of positive real numbers, say s_n ≤ M < ∞. Then

P(A_n; s_n) ≤ sup_{|s|≤M} P(A_n; s) ≤ Σ_{x ∈ A_n} sup_{|s|≤M} p(x;s) = P(A_n; M)

eventually, since p(x;s) is increasing in s for x > s and M < t_n − t_n^{3/4} for large n. Since P(A_n; M) → 0 it follows that P(A_n; s_n) → 0 as n → ∞. On the other hand, P(A_n; t_n) → 1 as n → ∞, by Chebyshev's inequality. Hence (4.2.4) holds, by Lemma 4.1, and ε*(T,F) ≤ ½ for each F.

For a lower bound on ε*, note that ρ(F, P_t) → 0 as t → ∞ for each fixed distribution F. Hence ρ*(F) = 0 and ε*(T,F) is bounded below by ρ̂²/(1 + ρ̂²) using Theorem 4.1. At the Poisson model one obtains ε* = ½.

Example 4.2    Let p_θ and P_θ be as in the previous example and consider the estimation of η = log θ. Breakdown occurs when |η| = ∞, which corresponds to both θ = ∞ and θ = 0. Letting δ_0 and 1_0 denote the point mass distribution and density concentrated at zero,

||p_t^{1/2} − 1_0^{1/2}|| → 0    as t → 0,

where ||·|| is the L² norm with respect to counting measure. By the Cauchy-Schwarz inequality,

|ρ(F, P_t) − ρ(F, δ_0)| ≤ ||p_t^{1/2} − 1_0^{1/2}||.

Hence ρ(F, P(·; e^η)) → f^{1/2}(0) = ρ(F, δ_0) as η → −∞, and ρ*(F) = f^{1/2}(0) with θ = e^η. By Theorem 4.1, ε* ≥ γ²/(1 + γ²) where γ = ρ̂(F) − f^{1/2}(0). At the model γ = 1 − e^{−θ/2}. This can be made arbitrarily close to zero by choosing θ near zero.
Different lower bounds for ε* are obtained in the two examples because an estimated θ = e^η = 0 is considered to be a breakdown of the estimator in Example 4.2 but not in Example 4.1. This, of course, does not imply that ε* is different in the two examples. In fact, calculation shows that the derivative of ρ(F, P_t) at zero is infinite if f(1) > 0, so an estimate of θ = 0 does not occur when 1's are observed in the sample. In particular, the asymptotic breakdown point is ½ at the model. Even if the underlying distribution has positive mass at 1, however, it is quite possible that no 1's will occur in a finite sample. In this case the finite sample breakdown points for θ and η = log θ will be different.
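The two limits of ρ(F, P_t) underlying these examples can be computed directly. A short sketch (illustrative only, assuming NumPy/SciPy; the choice F = Poisson(3) is arbitrary):

    import numpy as np
    from scipy.stats import poisson

    x = np.arange(500)
    f = poisson.pmf(x, 3.0)       # fix F = Poisson(3) for concreteness

    def rho(t):
        return np.sum(np.sqrt(f * poisson.pmf(x, t)))

    # rho -> sqrt(f(0)) as t -> 0, and rho -> 0 as t -> infinity
    print([round(rho(t), 4) for t in (1e-4, 0.01, 1.0, 3.0, 30.0, 300.0)])
    print(np.sqrt(f[0]))          # e^{-3/2}, the t -> 0 limit rho(F, delta_0)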
4.3  Probability breakdown point
Donoho and Huber (1982) implicitly assume that there is a natural scale in which the breakdown point of a parameter is most appropriately defined. For instance, they contend that the breakdown point of a scale parameter is best defined on the log scale, in which case a scale estimated to be zero is considered to be an instance of breakdown. They essentially consider an estimate to have broken down if it has been driven to one of the boundaries of the parameter space.

An alternative approach is to consider the probability structure directly rather than the parameter values. This approach is limited to situations where a probability model is available; however, it can be justified on the grounds that the probability structure is more basic than the parameterization.

Along these lines one can define the breakdown point at a given distribution to be the smallest fraction of contamination that can lead to an estimated distribution that is singular with respect to the original distribution.
Formally, let

ρ(ε; T,F) = inf{ ρ(F, F(·, T((1−ε)F + εG))) : G a distribution function }.

Define the probability breakdown point ε_p by

ε_p(T,F) = inf{ ε : ρ(ε; T,F) = 0 }.
This breakdown point is invariant to reparameterizations and is a reflection of the stability of the probability model obtained in the estimation process rather than the parameter value. Observe that ε_p yields the same breakdown point for a scale parameter as is obtained by Donoho and Huber (1982) after a log transformation. When the scale is estimated to be zero or infinity, the corresponding probability model is singular with respect to proper continuous distributions. See Beran (1977b, p. 48).
For the MHD estimator the probability breakdown point has the following lower bound.

Theorem 4.3    Let ρ̂ = ρ̂(F) be as in Theorem 4.1. Then

ε_p(T,F) ≥ ρ̂² / (1 + ρ̂²)

if T is the MHD functional.

Proof    The proof is similar to that of Theorem 4.1. Note that ρ(ε; T,F) = 0 if and only if a sequence {G_n} exists for which ρ(F, F(·, T((1−ε)F + εG_n))) → 0.
A consequence of Theorem 4.3 is that the MHD estimator always has an asymptotic probability breakdown point of at least ½ at the model. In Example 4.2, for instance, an estimated θ = 0 is not considered to be an instance of breakdown because the corresponding point mass at zero is a proper count distribution.
4.4  Discussion

Breakdown analysis provides a useful quantification of the stability of the MHD estimator. For related work see Tamura and Boos (1985). They derive a lower bound for the breakdown point of a certain affine invariant MHD estimator for multivariate location and scatter. This result is interesting in light of the known upper bound of 1/(d+1), where d is the dimension, for the breakdown point of an affine invariant M-estimator. For the estimator of Tamura and Boos it appears that a better lower bound of ½ can be obtained for the asymptotic breakdown point by using the methods of this chapter.

Regarding the usefulness of the probability breakdown point as a concept, a key question is whether it has the same upper bound as the ordinary breakdown point for an affine invariant M-estimator of multivariate location and scatter.
APPENDIX

4.A  Proof of Lemma 4.1
For probability measures P and Q on (Ω, 𝔅),

(4.A.1)    2 − 2ρ(P,Q) ≤ ∫ |dP − dQ| = 2 sup_{A ∈ 𝔅} |P(A) − Q(A)|;

see, e.g., Billingsley (1968, p. 224) for the equality.

Suppose ρ(P_n, Q_n) → 0 as n → ∞. Set s_n = sup |P_n(A) − Q_n(A)|, where the supremum is over A ∈ 𝒜_n. Then s_n → 1 as n → ∞. Select {A_n'}, A_n' ∈ 𝒜_n, such that

Q_n(A_n') − P_n(A_n') > s_n − n^{−1},    n = 1,2,….

Since Q_n(A_n') − P_n(A_n') → 1 and both probabilities lie in [0,1], it follows that P_n(A_n') → 0 and Q_n(A_n') → 1.

Conversely, suppose a sequence of sets {A_n} exists for which P_n(A_n) → 0 and Q_n(A_n) → 1. Let μ_n = ½(P_n + Q_n) and let p_n and q_n be the densities of P_n and Q_n with respect to μ_n. Then

ρ(P_n, Q_n) = ∫_{A_n} (p_n q_n)^{1/2} dμ_n + ∫_{A_n^c} (p_n q_n)^{1/2} dμ_n.

By the Cauchy-Schwarz inequality the right side is at most

{P_n(A_n) Q_n(A_n)}^{1/2} + {1 − P_n(A_n)}^{1/2} {1 − Q_n(A_n)}^{1/2},

which converges to zero as n → ∞.
REFERENCES

Andrews, D. F. et al. (1972). Robust Estimates of Location: Survey and Advances. Princeton University Press, Princeton, N.J.

Apostol, T. M. (1957). Mathematical Analysis. Addison-Wesley, Reading, Mass.

Beran, R. J. (1977a). Robust location estimates. Annals of Statistics 5, 431-444.

Beran, R. J. (1977b). Minimum Hellinger distance estimates for parametric models. Annals of Statistics 5, 445-463.

Beran, R. J. (1982). Robust estimation in models for independent non-identically distributed data. Annals of Statistics 10, 415-428.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Boos, D. D. and Serfling, R. J. (1980). A note on differentials and the CLT and LIL for statistical functions, with applications to M-estimates. Annals of Statistics 8, 618-624.

Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika 40, 318-335.

Carroll, R. J. (1978a). On almost sure expansions for M-estimates. Annals of Statistics 6, 314-318.

Carroll, R. J. (1978b). On the asymptotic distribution of multivariate M-estimates. Journal of Multivariate Analysis 8, 361-371.

Collings, B. J. and Margolin, B. M. (1985). Testing goodness-of-fit for the Poisson assumption when observations are not identically distributed. Journal of American Statistical Association 80, 411-418.

Donoho, D. L. and Huber, P. J. (1982). The notion of breakdown point. In A Festschrift for Erich L. Lehmann. Eds. P. Bickel, K. Doksum, and J. L. Hodges, Jr. Wadsworth, Belmont, Calif.

Franklin, P. (1940). A Treatise on Advanced Calculus. Wiley, New York.

Hampel, F. (1968). Contributions to the theory of robust estimation. Ph.D. Thesis, University of California, Berkeley.

Hampel, F. (1971). A general qualitative definition of robustness. Annals of Mathematical Statistics 42, 1887-1896.

Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of American Statistical Association 62, 1179-1186.

Holland, P. W. and Welsch, R. E. (1977). Robust regression using iteratively reweighted least-squares. Communications in Statistics A6, 813-827.

Holm, S. (1976). Discussion of paper by Peter Bickel. Scandinavian Journal of Statistics 3, 158-161.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35, 73-101.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley.

Huber, P. J. (1972). Robust statistics: a review. Annals of Mathematical Statistics 43, 1041-1067.

Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Annals of Statistics 1, 799-821.

Huber, P. J. (1981). Robust Statistics. Wiley, New York.

Johnson, N. L. and Kotz, S. (1969). Discrete Distributions. Wiley, New York.

Krasker, W. S. (1980). Estimation in linear regression models with disparate data points. Econometrica 48, 1333-1346.

Krasker, W. S. and Welsch, R. E. (1982). Efficient bounded-influence regression estimation. Journal of American Statistical Association 77, 595-604.

Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.

Matusita, K. (1954). On the estimation by the minimum distance methods. Annals of Institute of Statistical Mathematics 5, 59-65.

Millar, P. W. (1981). Robust estimation via minimum distance methods. Z. Wahr. v. Geb. 55, 73-89.

Parr, W. C. (1981). Minimum distance estimation: a bibliography. Communications in Statistics Theory Methods A10, 1205-1224.

Parr, W. C. and Schucany, W. R. (1980). Minimum distance and robust estimation. Journal of American Statistical Association 75, 616-624.

Pitman, E. J. (1979). Some Basic Theory of Statistics. Chapman-Hall, London.

Rao, C. R. (1963). Criteria of estimation in large samples. Sankhya A 25, 189-206.

Rao, C. R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.

Ruppert, D. (1985). On the bounded influence regression estimator of Krasker and Welsch. Journal of American Statistical Association 80, 205-208.

Scheffe, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics 18, 434-458.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

Stather, C. R. (1981). Robust Statistical Inference Using Hellinger Distance Methods. Ph.D. dissertation, La Trobe University, Australia.

Stigler, S. M. (1973). The asymptotic distribution of the trimmed mean. Annals of Statistics 1, 472-477.

Tamura, R. and Boos, D. (1985). Minimum Hellinger distance estimation for multivariate location and covariance. To appear in Journal of American Statistical Association.

Tukey, J. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics. I. Olkin, ed. Stanford University Press, Stanford.

Woodruff, R. C. et al. (1984). Chemical mutagenesis testing in Drosophila: I. Comparison of positive and negative control data for sex-linked recessive lethal mutations and reciprocal translocations in three laboratories. Environmental Mutagenicity 6, 189-202.

Zimmering, S. et al. (1985). Chemical mutagenesis testing in Drosophila: II. Results of 20 coded compounds tested for the National Toxicology Program. Environmental Mutagenicity 7, 87-100.