UNIVARIATE DATA MODELING WITH THE
ANDERSON-DARLING STATISTIC
A unified approach to data modeling is presented which relies on the Anderson-Darling distance $d(F_n, F_\theta)$ between the empirical distribution function $F_n$ and the hypothesized model $F_\theta$. First, parameters are estimated by minimizing $d(F_n, F_\theta)$, while goodness-of-fit is assessed with the minimized distance $d(F_n, F_{\hat\theta})$. Then, confidence regions are constructed by inverting $d(F_n, F_\theta)$ in the manner suggested by Easterling (1976). Primary attention is focused on the null distribution of $d(F_n, F_{\hat\theta})$ and the efficiency of the confidence procedures.

KEY WORDS
Anderson-Darling statistic
Minimum distance estimators
Goodness-of-fit
Confidence regions
1. INTRODUCTION
Recent papers by Parr and Schucany (1978), Parr and DeWet (1979), Millar (1979), and Boos (1980) have shown that the minimization of a weighted Cramer-von Mises distance between the hypothesized model distribution function $F_\theta$ and the empirical distribution function $F_n$ results in estimators $\hat\theta$ which are consistent, asymptotically normal, and robust and/or efficient if the weight function is chosen appropriately.
In this paper we focus on the Anderson-Darling (A-D) distance

$$d_{F_n}(\theta) = \int_{-\infty}^{\infty} [F_n(x) - F_\theta(x)]^2\, w_\theta(x)\, dF_\theta(x) \qquad (1.1)$$

because the weight function $w_\theta = [F_\theta(1 - F_\theta)]^{-1}$ allows a nice balance between robustness and efficiency in a variety of models, and $n\,d_{F_n}(\theta)$ has a manageable null distribution.
We are suggesting a general approach to univariate modeling: first $d_{F_n}(\theta)$ is minimized to obtain $\hat\theta$, and $n\,d_{F_n}(\hat\theta)$ is used to check the model validity; then, assuming the model is true, $\{\theta : n\,d_{F_n}(\theta) \le d_\alpha\}$ forms a confidence region for $\theta$, where $d_\alpha$ is the critical value of the A-D statistic when no parameters are estimated.
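All such computations can be performed easily on the same computer run using the well-known computing formula

$$n\,d_{F_n}(\theta) = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F_\theta(X_{(i)}) + \ln\{1 - F_\theta(X_{(n+1-i)})\}\right],$$

where $X_{(1)} \le \cdots \le X_{(n)}$ are the ordered sample values.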
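As an illustration of the computing formula for the A-D statistic in terms of the ordered sample values, the following is a minimal Python sketch. The function name `ad_statistic` and the callable argument `cdf` are hypothetical choices made for this example; the sketch assumes the hypothesized distribution function returns values strictly between 0 and 1 at every sample point.

```python
import math


def ad_statistic(sample, cdf):
    """Return n * d_{F_n}(theta) from the ordered-sample computing formula.

    `cdf` stands for the hypothesized distribution function F_theta
    (a hypothetical callable introduced for this sketch).
    """
    x = sorted(sample)
    n = len(x)
    f = [cdf(v) for v in x]
    total = 0.0
    for i in range(1, n + 1):
        # pair the i-th smallest value with the i-th largest value
        total += (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
    return -n - total / n


def uniform_cdf(v):
    """Uniform(0, 1) distribution function, used only as a simple check."""
    return v


single_point = ad_statistic([0.5], uniform_cdf)      # equals 2*ln(2) - 1
reordered = ad_statistic([0.8, 0.2, 0.5], uniform_cdf)
ordered = ad_statistic([0.2, 0.5, 0.8], uniform_cdf)
```

Because the sample is sorted inside the function, the result does not depend on the order in which the observations are supplied. For distributions with very light tails it may be safer to pass log-distribution and log-survival functions directly instead of taking `log(1 - F)`.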
Sections 2-4 focus on the three different uses of $d_{F_n}$. Section 2 describes briefly the minimum A-D distance estimates $\hat\theta$ for Weibullized models (detailed general treatments are available in the above references). Section 3 considers the null distribution of the minimized distance, and Section 4 examines confidence intervals obtained by inverting $d_{F_n}(\theta)$ in one parameter models. Section 5 gives a numerical example, and Section 6 is a brief summary.
2. ESTIMATION
Consider a random sample $X_1, \ldots, X_n$ with Weibullized distribution function

$$F_\theta(x) = F_0((x/\sigma)^c), \qquad x \ge 0,\ \sigma > 0,\ c > 0.$$

Here $F_0$ may be any suitable distribution function on $[0, \infty)$ with density $f_0(x)$, but $F_0(x) = 1 - \exp(-x)$ is the most important. The minimum A-D distance estimates $\hat\sigma$ and $\hat c$ obtained by minimizing (1.1) are also approximate solutions of $\sum_{i=1}^{n} \psi(X_i, \sigma, c) = (0, 0)^T$, where

$$\psi(x, \sigma, c) = \begin{pmatrix} \int_0^\infty [I(x \le \sigma y^{1/c}) - F_0(y)]\, q_1(y)\, dy \\ \int_0^\infty [I(x \le \sigma y^{1/c}) - F_0(y)]\, q_2(y)\, dy \end{pmatrix}$$

for weight functions $q_1$ and $q_2$. Using integration by parts one can see that $\hat\sigma$ and $\hat c$ have influence curves which are bounded as long as ...
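Moreover, the asymptotic covariance matrix of $n^{1/2}(\hat\sigma, \hat c)$ is given by $\Delta^{-1} C \Delta^{-1}$, where $C$ is the covariance matrix of $\psi(X_1, \sigma, c)$ and $\Delta$ is the matrix of expected values of the derivative of $\psi$, whose entries are built from the integrals $\int_0^\infty y\, q_j(y) f_0(y)\, dy$ and $\int_0^\infty y \ln y\; q_j(y) f_0(y)\, dy$, $j = 1, 2$. For $F_0(x) = 1 - \exp(-x)$ we find by numerical integration that

$$\Delta = \begin{pmatrix} .404\, c^2/\sigma^2 & .023/\sigma \\ .023/\sigma & .252/c^2 \end{pmatrix}, \qquad C = \begin{pmatrix} .192\, c^2/\sigma^2 & .039/\sigma \\ .039/\sigma & .049/c^2 \end{pmatrix},$$

and thus

$$\Delta^{-1} C \Delta^{-1} = \begin{pmatrix} 1.145\, \sigma^2/c^2 & .231\, \sigma \\ .231\, \sigma & .717\, c^2 \end{pmatrix}.$$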
We note that the asymptotic relative efficiencies of $\hat\sigma$ and $\hat c$ compared to the maximum likelihood estimates (m.l.e.'s) are .969 and .848 respectively. However, since the m.l.e.'s are seriously affected by outliers, one might be willing to trade in some efficiency for the robustness achieved by the minimum A-D estimates. We also note that the estimate of $\sigma$ when $c$ is known has asymptotic variance $.1918/(.4041)^2\ \sigma^2/c^2 = 1.174\ \sigma^2/c^2$, which is larger than the $1.145\ \sigma^2/c^2$ obtained when $c$ is not known. The same phenomenon occurs for estimating $c$ when $\sigma$ is known. Of course, such results are impossible when using fully efficient estimation schemes.
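The sandwich form of the asymptotic covariance matrix, and the comparison between the known-$c$ and unknown-$c$ variances, can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it sets $\sigma = c = 1$, and it assumes that the matrix of expected derivatives has entries (.404, .023, .023, .252) and the covariance matrix of $\psi$ has entries (.192, .039, .039, .049), which is the assignment of the printed numbers that reproduces the stated result. The helper names `inv2` and `matmul2` are hypothetical.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Assumed entries with sigma = c = 1 (three-decimal values as printed).
delta = [[0.404, 0.023], [0.023, 0.252]]
cov_psi = [[0.192, 0.039], [0.039, 0.049]]

delta_inv = inv2(delta)
sandwich = matmul2(matmul2(delta_inv, cov_psi), delta_inv)

# Variance of the sigma estimate when c is known: C11 / Delta11^2.
var_sigma_c_known = 0.1918 / 0.4041 ** 2
```

Because the inputs are rounded to three decimals, the entries of `sandwich` agree with the quoted 1.145, .231, and .717 only to about two decimals. The first diagonal entry is smaller than `var_sigma_c_known`, which is the phenomenon noted in the text.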
The minimum A-D estimators do well in other models as well. In the normal and logistic location-scale models the efficiencies of $(\hat\mu, \hat\sigma)$ were calculated in Boos (1980) to be (.966, .849) and (1.0, .923) respectively.
3. GOODNESS-OF-FIT
Consider the composite goodness-of-fit hypothesis

$H_0$: distribution function of the data $= F_\theta$, $\theta$ unknown but $F_\theta$ a member of some specified parametric family.

The minimized distance $d_{\min} = d_{F_n}(\hat\theta)$ is a natural statistic for testing $H_0$, although its null distribution is concentrated on much smaller values than that of $d_{F_n}(\theta)$ when $\theta$ is specified. A similar result holds if $\hat\theta$ is replaced by other estimators, and Stephens (1974, 1976, 1977, 1979) has published tables of the null distribution of $n\,d_{F_n}(\hat\theta_M)$, where $\hat\theta_M$ is the m.l.e.
Unfortunately, each parametric family requires a different table. In contrast, Boos (1980) considered location-scale models $F_\theta(x) = F_0((x - \mu)/\sigma)$ and conjectured that the null limiting distribution of $n\,d_{\min}$ could be reasonably approximated by the distribution of $A_2^2 = \sum_{i=3}^{\infty} Z_i^2 / i(i+1)$ for a range of symmetric distributions $F_0$, where the $Z_i$ are i.i.d. standard normal random variables. The Monte Carlo results of this section support that conjecture and suggest that the approximation is also valid for more than just symmetric location-scale models.
The general use of $A_k^2 = \sum_{i=k+1}^{\infty} Z_i^2 / i(i+1)$ for the case of $k$ estimated parameters can be motivated by analogy with the chi-square goodness-of-fit statistic, where typically the degrees of freedom are reduced by the number of estimated parameters. Here, the limiting distribution of $n\,d_{F_n}(\theta)$ when no parameters are estimated is $A_0^2 = \sum_{i=1}^{\infty} Z_i^2 / i(i+1)$, and estimation of parameters results in the approximate loss of unequal degrees of freedom corresponding to $Z_1^2/2$, $Z_2^2/6$, etc.
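The approximating variable $A_k^2$, a weighted sum of squared standard normals with weights $1/i(i+1)$ starting at $i = k+1$, is easy to simulate. The following is a rough Python sketch, not the Pearson-curve method used for the tabled values: the series is truncated at a hypothetical cutoff `terms`, and the expected value of the omitted tail, $1/(\text{terms}+1)$, is added back as a crude correction. The function names, the seed, and the simulation sizes are all assumptions made for the example.

```python
import random


def simulate_a2(k, draws=4000, terms=100, seed=0):
    """Simulate draws of A_k^2 = sum_{i > k} Z_i^2 / (i (i + 1)), truncated."""
    rng = random.Random(seed)
    weights = [1.0 / (i * (i + 1)) for i in range(k + 1, terms + 1)]
    tail_mean = 1.0 / (terms + 1)  # expected value of the terms with i > terms
    values = []
    for _ in range(draws):
        s = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        values.append(s + tail_mean)
    return sorted(values)


def upper_point(sorted_values, alpha):
    """Empirical upper-alpha percentage point of a sorted list."""
    idx = int((1.0 - alpha) * len(sorted_values))
    return sorted_values[min(idx, len(sorted_values) - 1)]


a2 = simulate_a2(k=2)
mean_a2 = sum(a2) / len(a2)   # should be near E A_2^2 = 1/3
p95_a2 = upper_point(a2, 0.05)
```

Since the weights telescope, the exact mean of $A_k^2$ is $1/(k+1)$, which gives a quick sanity check on the simulation. With a few thousand draws the simulated percentage points carry Monte Carlo error in the second decimal.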
Table 1 contains Monte Carlo estimates of the upper percentiles of $n\,d_{\min}$ for normal and logistic location-scale models and for the two parameter Weibull. Since smooth transformations of the data such as $Y_i = \ln X_i$ do not affect the distribution of $n\,d_{\min}$, the results apply as well to the lognormal, log-logistic (see Tadikamalla and Johnson (1979)) or Burr III with $k = 1$, and extreme value distributions respectively.
The IMSL minimization routine ZXMIN was used to find $d_{\min}$, and 1000 Monte Carlo samples were generated for each situation. If certain convergence criteria were not met, then those samples were not included in the final estimates, although their $d_{\min}$ values tended to be in the middle of the distribution and would have had very little effect on the percentile estimates. Simple order statistic estimators were used for starting parameter estimates, and the random numbers were generated by the McGill Super-Duper random number generator and analyzed by Dickey's (1978) Monte Carlo package. The errors in the estimates are generally in the range ±.01 to ±.04. The percentage points of $A_2^2$ were computed from Pearson curve approximations and should be accurate to the two decimals listed (see Solomon and Stephens (1978) for details).
--- insert Table 1 here ---
The percentiles of $A_2^2$ tend to be only a little smaller than the estimated percentiles in Table 1, and the usual fast convergence of A-D statistics to asymptotic values appears to hold even with estimated parameters. Thus, unless exact percentiles are required, we expect that the critical values of $A_2^2$ will be adequate for most applications.
--- insert Table 2 here ---
Table 2 gives results for the one parameter exponential and shows that the percentiles of $A_1^2 = \sum_{i=2}^{\infty} Z_i^2 / i(i+1)$ are from .03 to .11 smaller than the estimated percentiles. Since $E A_1^2 = .5$ and $E\, n\,d_{\min} \approx .5254$, a possible correction factor would be to multiply the percentiles of $A_1^2$ by $.5254/.5 \approx 1.051$ to obtain respectively .65, .91, 1.10, 1.30, and 1.57 for $\alpha = .25$ to $.01$. A similar correction could be applied to Table 1, where $E A_2^2 = 1/3$ and $E\, n\,d_{\min} \approx .3436$, .3371, and .3459 for the normal, logistic, and Weibull.
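The mean-ratio correction described above amounts to one multiplication per percentile. As a small worked sketch in Python, assuming the two-decimal $A_1^2$ percentage points from the last row of Table 2 as inputs (the dictionary and variable names are hypothetical):

```python
# Tabled A_1^2 percentage points (two decimals), keyed by alpha.
a1_points = {0.25: 0.62, 0.10: 0.86, 0.05: 1.05, 0.025: 1.24, 0.01: 1.50}

# Ratio of the Monte Carlo mean of n*d_min to E A_1^2 = 1/2.
factor_exponential = 0.5254 / 0.5
corrected = {alpha: round(q * factor_exponential, 2)
             for alpha, q in a1_points.items()}

# Analogous factors for Table 1, where E A_2^2 = 1/3.
table1_factors = {name: mean / (1.0 / 3.0)
                  for name, mean in [("normal", 0.3436),
                                     ("logistic", 0.3371),
                                     ("Weibull", 0.3459)]}
```

Because the inputs here are already rounded to two decimals, the corrected values can differ from those quoted in the text by one unit in the last place. The Table 1 factors come out between roughly 1.01 and 1.04, so the correction there is smaller than for the exponential.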
A further use of $n\,d_{\min}$ would be to compare values of $n\,d_{\min}$ for different models. This comparison would be meaningful but not necessarily the best method for choosing between models. The non-null asymptotic distribution of $d_{\min}$ and some examples of asymptotic power were given in Boos (1980).
4. CONFIDENCE REGIONS
If the data $X_1, \ldots, X_n$ come from $F_\theta$, where $F_\theta$ belongs to some specified parametric family, then $C = \{\theta : n\,d_{F_n}(\theta) \le d_\alpha\}$ forms a confidence region for $\theta$, where $d_\alpha$ is the critical value of the A-D statistic with parameters known. (Stephens (1974) notes that the asymptotic critical values can be used for $n \ge 5$.) Littell and Rao (1978) and Salvia (1979, 1980) have described methods for obtaining analogous regions from the Kolmogorov-Smirnov statistic $D_n = \sup_x |F_n(x) - F_\theta(x)|$. Easterling (1976) originally proposed the use of such regions obtained from goodness-of-fit statistics for model fitting: the bigger the region, the more ... i.e., such regions should cover the true parameter value $(1 - \alpha) \times 100$ percent of the time in repeated sampling and should be as small as possible.
This section is intended to shed a little more light on the sampling characteristics of these regions. In particular, it appears that the high efficiency of the minimum distance estimates carries over to confidence procedures for some parameters but not necessarily for others.
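To make the inversion concrete, the sketch below finds the interval of scale values $\theta$ for which the A-D statistic stays at or below a critical value in the one parameter exponential model $F_\theta(x) = 1 - \exp(-x/\theta)$. Everything specific here is an assumption for illustration: the critical value 2.492 is the standard tabled asymptotic 5 percent point for the case of no estimated parameters (it is not quoted in this paper), the "sample" is an artificial set of exponential quantiles rather than real data, and the helper names are hypothetical.

```python
import math

D_ALPHA = 2.492  # assumed: standard asymptotic 5% point, parameters known


def ad_exponential(sorted_x, theta):
    """n * d_{F_n}(theta) for the exponential scale model (sorted input)."""
    n = len(sorted_x)
    total = 0.0
    for i in range(1, n + 1):
        u_low = sorted_x[i - 1] / theta    # i-th smallest, rescaled
        u_high = sorted_x[n - i] / theta   # i-th largest, rescaled
        log_cdf = math.log(-math.expm1(-u_low))  # ln F_theta(X_(i))
        log_sf = -u_high                          # ln(1 - F_theta(X_(n+1-i)))
        total += (2 * i - 1) * (log_cdf + log_sf)
    return -n - total / n


def golden_min(f, lo, hi, tol=1e-9):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


def crossing(g, inside, outside, iters=200):
    """Bisection for g = 0, given g(inside) < 0 < g(outside)."""
    for _ in range(iters):
        mid = 0.5 * (inside + outside)
        if g(mid) < 0.0:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)


n = 20
sample = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]  # sorted


def objective(t):
    return ad_exponential(sample, t)


def excess(t):
    return objective(t) - D_ALPHA


theta_hat = golden_min(objective, 0.2, 5.0)       # minimum A-D estimate
theta_left = crossing(excess, theta_hat, 0.05)    # left endpoint
theta_right = crossing(excess, theta_hat, 50.0)   # right endpoint
```

The search brackets (0.2 to 5 for the minimizer, 0.05 and 50 for the endpoints) are arbitrary choices that happen to suit this artificial sample. If the minimized statistic exceeded the critical value the region would be empty, and `crossing` would not apply.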
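Consider a one parameter family $F_\theta(x)$ for which a random sample $X_1, \ldots, X_n$ is available. Let $\theta_{L,\alpha}$ and $\theta_{R,\alpha}$ be the left and right endpoints of the $(1 - \alpha) \times 100$ percent confidence interval constructed from $d_{F_n}(\theta)$, and let $\hat\theta$ be the minimum A-D estimate. A heuristic derivation of the asymptotic length $n^{1/2}(\theta_{R,\alpha} - \theta_{L,\alpha})$ is as follows. By Taylor expansion about $\hat\theta$, where the first derivative vanishes,

$$d_\alpha = n\,d_{F_n}(\theta_{L,\alpha}) = n\,d_{F_n}(\hat\theta) + \tfrac{n}{2}\, d''_{F_n}(\xi_{1,n})\,(\theta_{L,\alpha} - \hat\theta)^2,$$

with $\xi_{1,n}$ between $\theta_{L,\alpha}$ and $\hat\theta$, and similarly for $\theta_{R,\alpha}$ with $\xi_{2,n}$ between $\hat\theta$ and $\theta_{R,\alpha}$. Since $\theta_{L,\alpha} \le \hat\theta \le \theta_{R,\alpha}$ (if $d_\alpha \ge n\,d_{F_n}(\hat\theta)$), then

$$n^{1/2}(\theta_{R,\alpha} - \theta_{L,\alpha}) \xrightarrow{d} 2\left[\frac{d_\alpha - T_1(U)}{\tfrac12 d''_F(\theta)}\right]^{1/2}, \qquad (2.1)$$

where $T_1(U)$ denotes the limit in distribution of $n\,d_{F_n}(\hat\theta)$. One interesting feature is that (2.1) is a random variable instead of a constant.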
In contrast, the typical confidence interval for the scale parameter of the exponential distribution $F_\theta(x) = 1 - \exp(-x/\theta)$ has length which, when multiplied by $n^{1/2}/\theta$, converges to $2z_{\alpha/2}$, where $z_\alpha$ is the $(1 - \alpha) \times 100$ percentile of the standard normal (see Bain (1978), p. 129).
For comparison purposes we might use an approximate bound on the expectation of (2.1),

$$E(2.1) \le 2\left[\frac{d_\alpha - E\,T_1(U)}{\tfrac12 d''_F(\theta)}\right]^{1/2}. \qquad (2.2)$$

In Table 3 we compare $2z_{\alpha/2}$ with (2.2) for the exponential, where $\tfrac12 d''_F(\theta) = .4041/\theta^2$ and $E\,T_1(U) = .525$.
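The comparison for the exponential can be reproduced numerically. The sketch below reads (2.2) as $2[(d_\alpha - E\,T_1(U)) / (\tfrac12 d''_F(\theta))]^{1/2}$ with $\theta = 1$, $\tfrac12 d''_F = .4041$ and $E\,T_1(U) = .5254$ (the Monte Carlo mean quoted in Section 3). The A-D critical values $d_\alpha$ and the normal percentiles are standard tabled values supplied as assumptions, since the paper does not list them; the function name `length_bound` is hypothetical.

```python
import math

# Assumed standard asymptotic A-D critical values (no parameters estimated).
D_ALPHA = {0.01: 3.857, 0.05: 2.492, 0.10: 1.933, 0.25: 1.248}
# Assumed standard normal upper alpha/2 points.
Z_HALF = {0.01: 2.5758, 0.05: 1.9600, 0.10: 1.6449, 0.25: 1.1503}


def length_bound(d_alpha, mean_t1, half_second_derivative):
    """Approximate bound (2.2) on the expected asymptotic interval length."""
    return 2.0 * math.sqrt((d_alpha - mean_t1) / half_second_derivative)


rows = []
for alpha in (0.01, 0.05, 0.10, 0.25):
    normal_length = 2.0 * Z_HALF[alpha]
    ad_length = length_bound(D_ALPHA[alpha], 0.5254, 0.4041)
    rows.append((alpha,
                 round(normal_length, 2),
                 round(ad_length, 2),
                 round(normal_length / ad_length, 2)))
```

Each tuple in `rows` holds $\alpha$, $2z_{\alpha/2}$, the bound (2.2), and their ratio, which is the layout of Table 3. The same function with $\tfrac12 d''_F = .2519$ and $E\,T_1(U) = .8062$ gives the (2.2) column of Table 4.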
--- insert Table 3 here ---
These results are consistent with Table 7 of Easterling (1976) and help verify that the asymptotic efficiency .85 of $\hat\theta$ carries over to confidence interval construction as well. We now show that such efficiency need not carry over to confidence intervals in every situation.
Let $F_c(x) = 1 - \exp(-x^c)$, Weibull with scale equal to one. In Table 4 we compare (2.2) with the asymptotically efficient method based on the m.l.e. of $c$, which has asymptotic length, when multiplied by $n^{1/2}/c$, converging to $2z_{\alpha/2}(.608)^{1/2}$. Here $\tfrac12 d''_F(c) = .2519/c^2$ and $E\,T_1(U) = .8062$.
--- insert Table 4 here ---
Now the intervals derived from $d_{F_n}(\theta)$ are just not competitive with the intervals derived from the m.l.e., although the asymptotic relative efficiency of $\hat c$
to the m.l.e. is ,11.
The approximation (2.2) might not be as accurate for this situation, but a 1000 sample Monte Carlo estimate of the expected length for $\alpha = .05$, $n = 20$, yielded 4.96, which is reasonably close to the tabled value 5.17 of (2.2).
--- insert Tables 5 and 6 here ---
Tables 5 and 6 give analogous results for the normal location (scale known) and normal scale (location known) respectively, where the minimum A-D estimates have asymptotic relative efficiencies of .97 and .85. We infer the following basic principle: inversion of a Cramer-von Mises goodness-of-fit statistic will yield a competitive confidence interval (with regard to length) only for the parameter which corresponds to the first component of the statistic involved. Moreover, we expect the general principle to hold for confidence regions as well.
For example, in the two parameter Weibull $F_\theta(x) = 1 - \exp(-(x/\sigma)^c)$, we expect the confidence region to be relatively short in the $\sigma$ direction but quite a bit longer than necessary in the $c$ direction. The procedure is still useful, since exact confidence regions are obtained for a variety of models and the total area of such regions could easily be competitive with, say, a Bonferroni rectangular region obtained from individual confidence intervals derived from the m.l.e.'s. Lastly, the goodness-of-fit approach is adaptable to censored samples by use of a recent result of Michael and Schucany (1979).
5. NUMERICAL EXAMPLE
Salvia (1979) used the data in Table 7, originally found in Visscher and Goldman (1978), and constructed 80 percent confidence regions for the two parameter Weibull and exponential distributions. The conclusion was that the data were more "consonant" with a Weibull than with an exponential.
--- insert Table 7 here ---
In Table 8 we have computed the m.l.e.'s, minimum A-D estimates, and our goodness-of-fit statistic for a number of models.
--- insert Table 8 here ---
Recall from Tables 1 and 2 that the $\alpha = .05$ critical value for $n\,d_{\min}$ is approximately 1.05 and .63 for one and two parameter models respectively. The data speak strongly against a one or two parameter exponential, but the Weibull, normal, and logistic all fit the data fairly well.
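The one parameter exponential row of the analysis can be reproduced with a short script. The sketch below is only an illustration: it minimizes the A-D statistic over the scale of $1 - \exp(-x/\sigma)$ for the 29 casino earnings of Table 7, using a golden-section search in place of the IMSL routine used in the paper. The helper names and the search bracket 1000 to 10000 are assumptions chosen for this example.

```python
import math

# Table 7 casino earnings (Visscher and Goldman, 1978), already sorted.
EARNINGS = [416, 594, 1192, 1269, 1453, 1555, 2065, 2070, 2438, 2497,
            2595, 2845, 2967, 2999, 3130, 3162, 3251, 3283, 3414, 3467,
            3516, 3729, 3963, 4006, 4338, 5395, 5520, 5885, 7059]


def ad_exponential(sorted_x, sigma):
    """n * d_{F_n}(sigma) for F(x) = 1 - exp(-x / sigma) (sorted input)."""
    n = len(sorted_x)
    total = 0.0
    for i in range(1, n + 1):
        log_cdf = math.log(-math.expm1(-sorted_x[i - 1] / sigma))
        log_sf = -sorted_x[n - i] / sigma  # ln of the survival function
        total += (2 * i - 1) * (log_cdf + log_sf)
    return -n - total / n


def golden_min(f, lo, hi, tol=1e-6):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


mle = sum(EARNINGS) / len(EARNINGS)  # sample mean, the m.l.e. of sigma
sigma_hat = golden_min(lambda s: ad_exponential(EARNINGS, s), 1000.0, 10000.0)
nd_min = ad_exponential(EARNINGS, sigma_hat)
```

The minimized statistic lands well above the approximate one parameter critical value of 1.05, which is the sense in which the data speak against the exponential. Fitting the two parameter models would need a two-dimensional minimizer, which is left out to keep the sketch short.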
In Figure 1 we have drawn the 75 and 90 percent confidence regions for $(c, \sigma)$ under the Weibull model.
--- insert Figure 1 here ---
In addition, the dotted lines are individual 95 percent confidence intervals for $c$ and $\sigma$ computed from the m.l.e.'s using Bain (1978), Ch. 4. The dotted region thus forms a 90 percent Bonferroni rectangular region for $(c, \sigma)$ and illustrates the low efficiency of the A-D region in the $c$ direction. However, we note that the 90 percent A-D region is shorter in the $c$ direction than is the 80 percent Kolmogorov-Smirnov region calculated by Salvia (1979).
--- insert Table 9 here ---
Table 9 presents the altered calculations obtained when the observation 5885 in Table 7 is changed to 9885. As expected, the minimum distance estimates are more stable (robust!) than the m.l.e.'s. The normal model is no longer acceptable at the .05 level, although the Weibull and logistic remain acceptable.
6. SUMMARY AND CONCLUSIONS
A comprehensive approach to univariate data modeling has been suggested which includes estimation of parameters, testing goodness-of-fit, and construction of confidence regions, all based on the Anderson-Darling goodness-of-fit statistic. Results for the normal (lognormal), logistic (log-logistic), and Weibull (extreme value) indicate that the approach will be useful for a variety of possible model distributions.
TABLE 1-Monte Carlo estimates of the percentage points $d_\alpha$ such that $P(n\,d_{\min} \le d_\alpha) = 1 - \alpha$.

                                 α
                      .25    .10    .05    .025   .01
Normal      n = 20    .42    .55    .69    .79    .98
            n = 50    .41    .53    .63    .76    .90
Logistic    n = 20    .40    .53    .62    .72    .82
            n = 50    .42    .54    .63    .71    .84
Weibull     n = 20    .41    .54    .63    .75    .87
            n = 50    .42    .56    .66    .76    .87
$A_2^2 = \sum_{i=3}^{\infty} Z_i^2/i(i+1)$ *
                      .41    .54    .63    .73    .86

*Percentage points of $A_2^2$ calculated from Pearson curves, see Solomon and Stephens (1978).
TABLE 2-Monte Carlo estimates of the percentage points $d_\alpha$ such that $P(n\,d_{\min} \le d_\alpha) = 1 - \alpha$.

                                 α
                      .25    .10    .05    .025   .01
Exponential n = 20    .64    .91    1.13   1.35   1.56
            n = 50    .65    .91    1.11   1.33   1.61
$A_1^2 = \sum_{i=2}^{\infty} Z_i^2/i(i+1)$ *
                      .62    .86    1.05   1.24   1.50

*Percentage points of $A_1^2$ calculated from Pearson curves, see Solomon and Stephens (1978).
TABLE 3-Asymptotic lengths of confidence intervals for the exponential scale parameter.

α       $2z_{\alpha/2}$    (2.2)    Ratio
.01     5.15               5.74     .90
.05     3.92               4.41     .89
.10     3.29               3.73     .88
.25     2.30               2.67     .86

TABLE 4-Asymptotic lengths of confidence intervals for the Weibull shape parameter (scale known).

α       $2z_{\alpha/2}(.608)^{1/2}$    (2.2)    Ratio
.01     4.02                           6.96     .58
.05     3.06                           5.17     .59
.10     2.57                           4.23     .61
.25     1.79                           2.65     .68

TABLE 5-Asymptotic lengths of confidence intervals for a normal location parameter (scale known).

α       $2z_{\alpha/2}$    (2.2)    Ratio
.01     5.15               5.28     .97
.05     3.92               4.07     .96
.10     3.29               3.45     .95
.25     2.30               2.49     .92

TABLE 6-Asymptotic lengths of confidence intervals for a normal scale parameter (location known).

α       $2z_{\alpha/2}(.5)^{1/2}$    (2.2)    Ratio
.01     3.64                         6.68     .55
.05     2.77                         4.95     .56
.10     2.33                         4.02     .58
.25     1.63                         2.46     .66
TABLE 7-Casino earnings.¹

 416    594   1192   1269   1453   1555
2065   2070   2438   2497   2595   2845
2967   2999   3130   3162   3251   3283
3414   3467   3516   3729   3963   4006
4338   5395   5520   5885   7059

¹Data first appeared in Visscher and Goldman (1978).
TABLE 8-Analysis of Table 7.

Model                                          m.l.e.                                Min. A-D Est.    $n\,d_{\min}$
Exponential  $1 - \exp(-x/\sigma)$             $\hat\sigma = \bar X = 3106$          3872             2.75
Exponential  $1 - \exp(-(x-\mu)/\sigma)$       $(\hat\mu, \hat\sigma) = (416, 2690)$   (401, 3280)      1.95
Weibull      $1 - \exp(-(x/\sigma)^c)$         $(\hat c, \hat\sigma) = (2.13, 3501)$   (2.16, 3515)     .34
Normal       $\Phi((x-\mu)/\sigma)$            $(\bar X, S) = (3106, 1551)$          (3045, 1509)     .36
Logistic     $(1 + \exp(-(x-\mu)/\sigma))^{-1}$  $(\hat\mu, \hat\sigma) = (3033, 846)$   (3034, 861)      .27
TABLE 9-Analysis of Table 7 with 5885 changed to 9885.

Model                                          m.l.e.                                Min. A-D Est.    $n\,d_{\min}$
Exponential  $1 - \exp(-x/\sigma)$             $\hat\sigma = \bar X = 3244$          3941             2.48
Exponential  $1 - \exp(-(x-\mu)/\sigma)$       $(\hat\mu, \hat\sigma) = (416, 2828)$   (400, 3344)      1.72
Weibull      $1 - \exp(-(x/\sigma)^c)$         $(\hat c, \hat\sigma) = (1.79, 3653)$   (1.98, 3575)     .51
Normal       $\Phi((x-\mu)/\sigma)$            $(\bar X, S) = (3244, 1937)$          (3066, 1632)     .72
Logistic     $(1 + \exp(-(x-\mu)/\sigma))^{-1}$  $(\hat\mu, \hat\sigma) = (3047, 952)$   (3043, 900)      .45
FIGURE 1. Confidence regions for $(c, \sigma)$. [Horizontal axis: $c$, from 1.2 to 3.4; vertical axis: $\sigma$, from 2500 to 5000.]
REFERENCES

Bain, L. J. (1978). Statistical Analysis of Reliability and Life-Testing Models. New York: Marcel Dekker, Inc.

Boos, D. D. (1980). Minimum distance estimators for location and goodness-of-fit. To appear in the J. Amer. Statist. Assoc.

Dickey, D. A. (1978). A program for generating and analyzing large Monte Carlo studies. Unpublished.

Easterling, R. G. (1976). Goodness of fit and parameter estimation. Technometrics, 18, 1-9.

Littell, R. C. and Rao, P. V. (1978). Confidence regions for location and scale parameters based on the Kolmogorov-Smirnov goodness of fit statistic. Technometrics, 20, 23-27.

Michael, J. R. and Schucany, W. R. (1979). A new approach to testing goodness of fit for censored samples. Technometrics, 21, 435-441.

Millar, P. W. (1979). Robust estimation via minimum distance methods. Preprint.

Parr, W. C. and Schucany, W. R. (1978). Minimum distance and robust estimation. Preprint.

Parr, W. C. and DeWet, T. (1979). On minimum weighted Cramer von Mises statistic estimation. Preprint.

Salvia, A. A. (1979). Consonance sets for 2-parameter Weibull and exponential distributions. IEEE Trans. Reliability, R-28, 300-302.

Salvia, A. A. (1980). Some fundamental properties of Kolmogorov-Smirnov consonance sets. Technometrics, 22, 109-111.

Solomon, H. and Stephens, M. A. (1978). Approximations to density functions using Pearson curves. J. Amer. Statist. Assoc., 73, 153-160.

Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. J. Amer. Statist. Assoc., 69, 730-737.

Stephens, M. A. (1976). Asymptotic results for goodness-of-fit statistics with unknown parameters. Ann. Statist., 4, 357-369.

Stephens, M. A. (1977). Goodness of fit for the extreme value distribution. Biometrika, 64, 583-588.

Stephens, M. A. (1979). Tests of fit for the logistic distribution based on the empirical distribution function. Biometrika, 66, 591-595.

Tadikamalla, P. R. and Johnson, N. L. (1979). Systems of frequency curves generated by transformations of logistic variables. North Carolina Institute of Statistics Mimeo Series #1226.

Visscher, W. M. and Goldman, A. S. (1978). Optimization of earnings in stochastic industries, with applications to casinos. J. Amer. Statist. Assoc., 73, 499-503.