A NONPARAMETRIC MULTIVARIATE TEST OF HOMOGENEITY
BASED ON A U-STATISTIC OF DEGREE (2, 2)
by
Nelson F. de Oliveira
Department of Biostatistics, University of
North Carolina at Chapel Hill, NC.
Institute of Statistics Mimeo Series No. 2105T
September 1992
A NONPARAMETRIC MULTIVARIATE TEST OF HOMOGENEITY
BASED ON A U-STATISTIC OF DEGREE (2, 2)
by
Nelson F. de Oliveira
A dissertation submitted to the faculty of the University of North Carolina at
Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of
Philosophy in the Department of Biostatistics.
Chapel Hill
1992
Approved by:
ABSTRACT
NELSON F. DE OLIVEIRA. A Nonparametric Multivariate Test for Homogeneity Based on a U-statistic of Degree (2, 2). (Under the direction of Dana Quade and Pranab K. Sen.)
Nonparametric multivariate tests of homogeneity between two populations have been proposed in the literature. Some are extensions to the multivariate case of well-known univariate tests, such as the tests proposed by Bickel (1969) and Friedman and Rafsky (1979) as generalizations of the runs test and the Kolmogorov-Smirnov test. Others are based on the Euclidean distances between observations and their nearest neighbors, such as the tests proposed by Weiss (1960), Schilling (1986), Henze (1988) and Barakat (1989).
In this research we propose a new test based on a U-statistic of degree (2, 2). Consistency and asymptotic distribution of the test statistic are studied. The test is shown to be consistent against the general alternative of non-homogeneity in some specific cases. The power of the test is investigated through Monte Carlo simulation and the results are compared with Hotelling's T² and Schilling's and Barakat's tests. Finally, the test is used to analyze a subset of Fisher's Iris data by testing the hypothesis of homogeneity of the distributions of two sepal measurements of two species of Iris.
ACKNOWLEDGMENTS
First, I would like to thank my advisors Professor Dana Quade and Professor Pranab K. Sen. Dr. Quade suggested the topic and helped me constantly and patiently throughout this work. Dr. Sen gave me many suggestions during the development of this project. To them I would like to express my deepest gratitude.
I would also like to thank the other members of my committee, C. M. Suchindran, S. Bangdiwala and I. Salama, for their encouragement.
My thanks also go to the Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil, for financial assistance, and to my colleagues at the Departamento de Estatística, UFBA, Brazil, for their support.
Finally, I would like to thank Vera for taking care of our children during part of the time I was working on this project. To my children Genaro, Catarina and Lineu, I dedicate this work.
TABLE OF CONTENTS

                                                                        Page
List of Tables ....................................................... vii
List of Figures ..................................................... viii
Chapter 1: Introduction and Literature Review .......................... 1
    1.1 Introduction ................................................... 1
    1.2 U-statistics ................................................... 2
        1.2.1 Basic definitions ........................................ 2
        1.2.2 The H-decomposition and the S-decomposition .............. 4
        1.2.3 Large sample results ..................................... 5
        1.2.4 Two-sample U-statistics .................................. 6
    1.3 Lehmann's general two-sample test .............................. 9
    1.4 Multivariate nonparametric two-sample tests ................... 11
    1.5 A new test based on a U-statistic ............................. 16
Chapter 2: Consistency of V_mn ........................................ 18
    2.1 Introduction .................................................. 18
    2.2 Expected value of φ under the null hypothesis ................. 18
    2.3 Proof of consistency when X and Y are distributed over at most
        four points assuming distances between pairs of points to be
        all different ................................................. 19
        2.3.1 Expected value of φ when the closest and the second
              closest pairs do not have any point in common ........... 20
        2.3.2 Expected value of φ when the closest and the second
              closest pairs have one point in common .................. 25
            2.3.2.1 Expected value of η, assuming
                    d12 < d13 < d14 < d23 < d24 < d34 ................. 26
            2.3.2.2 Expected value of η, assuming
                    d12 < d13 < d14 < d24 < d23 < d34 ................. 28
            2.3.2.3 Expected value of η, assuming
                    d12 < d13 < d14 < d34 < d23 < d24 ................. 30
    2.4 Proof of consistency assuming distances between pairs of
        points are all equal .......................................... 32
    2.5 Proof of consistency for discrete distributions over n points
        assuming all distances equal .................................. 39
    2.6 Proof of consistency in the univariate Normal case ............ 43
    2.7 Proof of consistency in a continuous bivariate case ........... 45
Chapter 3: Asymptotic Distribution of V_mn ............................ 55
    3.1 Introduction .................................................. 55
    3.2 Permutational distribution of V_mn ............................ 55
    3.3 Asymptotic distribution of V_mn ............................... 56
Chapter 4: Simulation Study of the Power .............................. 74
    4.1 Introduction .................................................. 74
    4.2 Procedure ..................................................... 74
        4.2.1 Computation of δ to achieve a specified power for
              Hotelling's T² test ..................................... 76
    4.3 Conclusions ................................................... 85
Chapter 5: An Example ................................................. 86
    5.1 Introduction .................................................. 86
    5.2 Description of the data ....................................... 86
    5.3 Analysis of the data .......................................... 90
        5.3.1 Randomization tests ..................................... 90
        5.3.2 Tests based on the asymptotic distribution .............. 91
        5.3.3 Conclusions ............................................. 94
Chapter 6: Summary and Suggestions for Future Research ................ 98
    6.1 Summary ....................................................... 98
    6.2 Suggestions for future research ............................... 99
Appendices
    Appendix 1: A computer program to calculate E(φ) when the closest
        and the second closest pairs do not have any point in common . 100
    Appendix 2: A computer program to calculate E(φ) when the closest
        and the second closest pairs have one point in common ........ 101
    Appendix 3: A computer program for finding the asymptotic
        distributions of the test statistics V_mn and W .............. 103
    Appendix 4: A computer program for finding the permutational
        distribution of V_mn ......................................... 107
    Appendix 5: A computer program for finding the permutational
        distribution of W ............................................ 111
References ........................................................... 115
LIST OF TABLES

Table 2.1: Joint distribution of (D0, D1, D2), when X and Y have a
    discrete distribution over four points ............................ 21
Table 2.2: Distribution of η = η(X1, X2; Y1), assuming
    d12 < d34 < d13 < d14 < d23 < d24 ................................. 22
Table 2.3: Distribution of η = η(X1, X2; Y1), assuming
    d12 < d13 < d14 < d23 < d24 < d34 ................................. 27
Table 2.4: Distribution of η = η(X1, X2; Y1), assuming
    d12 < d13 < d14 < d24 < d23 < d34 ................................. 29
Table 2.5: Distribution of η = η(X1, X2; Y1), assuming
    d12 < d13 < d14 < d34 < d23 < d24 ................................. 31
Table 2.6: Distribution of η = η(X1, X2; Y1), assuming
    d12 = d13 = d14 = d23 = d24 = d34 ................................. 33
Table 2.7: Joint distribution of (D0, D1, D2), when X and Y have a
    discrete distribution over n points ............................... 35
Table 2.8: Distribution of η = η(X1, X2; Y1), assuming
    d12 = d13 = ... = d1n = d23 = ... = d2n = ... = d(n-1)n ........... 41
Table 2.9: Numerical evaluation of the functions in (2.8.3) for
    values of r close to 1 ............................................ 53
Table 3.1: Distribution of the maximum relative difference between
    exact and bootstrap values of ψn(x, y) when X and Y are
    Uniform (0, 1), for 200 pairs of bootstrap samples ................ 76
Table 4.1: Relationship between h and p, d, m, n, N = m + n ........... 78
Table 4.2: Values of h* to achieve Hotelling's T² power of 0.70
    and 0.90 .......................................................... 79
Table 4.3: Estimated power of V_mn when Hotelling's T² power is
    0.70 and 0.90 ..................................................... 82
Table 5.1: A subset of Fisher's Iris data ............................. 87
Table 5.2: Descriptive statistics of Fisher's Iris data ............... 89
Table 5.3: Analysis based on permutation tests ........................ 91
Table 5.4: Analysis based on asymptotic distribution .................. 92
LIST OF FIGURES

Figure 2.1: Picture illustrative of the computation of
    P{||X2 − X1|| < min[||X1 − Y||, ||X2 − Y||]} for X and Y
    uniformly distributed over circumferences ......................... 48
Figure 2.2: Picture illustrative of the computation of
    P{||Y2 − Y1|| < min[||Y1 − X||, ||Y2 − X||]} for X and Y
    uniformly distributed over circumferences ......................... 52
Figure 2.3: E(η) as a function of r ................................... 54
Figure 5.1: Confidence regions for μ1 − μ2 using Fisher's Iris data,
    based on Hotelling's T² and V_mn with an "anti-conservative"
    critical value .................................................... 96
Figure 5.2: Confidence regions for μ1 − μ2 using Fisher's Iris data,
    based on Hotelling's T² and V_mn with a "conservative"
    critical value .................................................... 97
CHAPTER 1
INTRODUCTION AND LITERATURE REVIEW
1.1 Introduction

The two-sample problem is a very important problem in statistics and has long been studied (Randles and Wolfe, 1979, Ch. 9; Gibbons, 1971, Ch. 7). Its general formulation is as follows. Let X and Y be random variables with values in R^d, d ≥ 1, the d-dimensional Euclidean space, with distribution functions F and G respectively. Let X1, X2, ..., Xm and Y1, Y2, ..., Yn be independent random samples from X and Y. The hypothesis of homogeneity of the two populations, H0: F(x) = G(x) ∀ x, may be tested against the completely general alternative of non-homogeneity, Ha: F(x) ≠ G(x) for some x; against the location alternative that X and Y differ only by a shift parameter, H1: F(x) = G(x + b) ∀ x, with b ≠ 0; or against the scale alternative that X and Y differ only by a scale parameter, H2: F(x) = G(bx) ∀ x, with b ≠ 1.
If assumptions can be made about the form of the distribution then well-known parametric tests can be used to test these hypotheses. For example, if normality can be assumed for X and Y, in the univariate case, Student's t-test and the F-test can be used to test H0 against H1 and H2 respectively. Nonparametric tests are used when no particular form of the distribution functions can be assumed. To test H0 against Ha, in the univariate case, the runs test [Wald and Wolfowitz (1940)], the Cramér-von Mises-Lehmann test [Cramér (1928), von Mises (1931), Lehmann (1951)] and the Kolmogorov-Smirnov two-sample test [Smirnov (1939)] are well known. The normal scores test [Fisher and Yates (1938), Terry (1952), Hoeffding (1950)], the Wilcoxon test [Wilcoxon (1945)], the van der Waerden test [van der Waerden (1952, 1953)] and the median test [Mood (1954)] are among the tests used to test H0 against location shift alternatives, and the Capon test [Capon (1961)], the Klotz test [Klotz (1962)], the quartile test [Hájek and Šidák (1967)] and the Siegel-Tukey test [Siegel and Tukey (1960)] are among the tests used to test H0 against scale alternatives.
In the multivariate case, nonparametric (and parametric) tests have been proposed, but to a lesser extent than in the univariate case. Puri and Sen (1971, Ch. 5) discuss a class of multivariate tests that are conditionally distribution free with asymptotically normal null distribution, as well as a class of multivariate tests based on U-statistics. It is the purpose of this work to present a multivariate nonparametric test, based on a U-statistic, of the hypothesis of homogeneity of X and Y against the general alternative of non-homogeneity.

In this chapter a brief review of U-statistics theory is presented in Section 1.2. In Section 1.3 Lehmann's two-sample test, which is a univariate test based on a U-statistic, is reviewed. In Section 1.4 a review of nonparametric multivariate two-sample tests proposed in the literature to test the hypothesis of homogeneity is presented. In Section 1.5 a new test for this hypothesis is proposed. The new test is based on a U-statistic and is consistent against all alternatives for a certain class of multivariate distributions.
1.2 U-Statistics

1.2.1 Basic Definitions

Let X1, X2, ..., Xn be a random sample from a population X, possibly multivariate, with distribution function F within a class of distribution functions 𝒢. θ(F) is a functional of F if ∀ F ∈ 𝒢 a real number θ(F) is assigned. A functional θ(F) is regular or estimable over 𝒢 if there exist a positive integer n and a function φ(x1, x2, ..., xn) such that E_F[φ(X1, X2, ..., Xn)] = θ(F) ∀ F ∈ 𝒢. The function φ(·) is an unbiased estimator of θ(F). If r is the smallest sample size for which there exists an unbiased estimator of θ(F), then r is the degree of θ(F) and φ(X1, X2, ..., Xr) is the kernel of θ(F). Without loss of generality it can always be assumed that the kernel is a symmetric function of x1, x2, ..., xr, since otherwise a symmetric kernel can be created as (r!)^(-1) Σ φ(x_α1, x_α2, ..., x_αr), where the summation is over all permutations α1, α2, ..., αr of the integers {1, 2, ..., r}. Now, following Hoeffding (1948), a U-statistic is defined by

    U_n = C(n, r)^(-1) Σ_(n,r) φ(X_α1, X_α2, ..., X_αr),                 (1.2.1.1)

where C(n, r) denotes the binomial coefficient and Σ_(n,r) extends over all subsets 1 ≤ α1 < α2 < ... < αr ≤ n of {1, 2, ..., n}. The notation Σ_(n,r) will be used repeatedly in this chapter. U_n is symmetric in its n arguments and is an unbiased estimator of θ(F). As an example of a U-statistic, let 𝒢 be the class of all distributions with a finite first moment and θ(F) = E_F(X1). Then the degree of θ(F) is 1, the kernel is φ(x1) = x1, and U_n = n^(-1) Σ_{i=1}^{n} X_i = X̄.
The variance of a U-statistic can be expressed in terms of certain conditional expectations. Define

    φ_c(x1, ..., xc) = E_F[φ(x1, ..., xc, X_{c+1}, ..., X_r)],
    ζ_c = Var[φ_c(X1, ..., Xc)],                                         (1.2.1.2)

for c = 0, 1, ..., r. Then φ_0 = θ(F), ζ_0 = 0, and [Hoeffding (1948)]

    Var(U_n) = C(n, r)^(-1) Σ_{c=1}^{r} C(r, c) C(n − r, r − c) ζ_c.     (1.2.1.3)
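Definition (1.2.1.1) can be illustrated by a brute-force computation (a sketch in Python; the kernel and data below are ours, chosen for illustration): with the symmetric degree-2 kernel φ(x1, x2) = (x1 − x2)²/2, whose expectation is Var(X), U_n reduces to the usual unbiased sample variance.

```python
from itertools import combinations
from math import comb

def u_statistic(sample, kernel, r):
    # (1.2.1.1): average the symmetric kernel over all C(n, r)
    # subsets {a1 < a2 < ... < ar} of the n observations.
    n = len(sample)
    total = sum(kernel(*(sample[i] for i in subset))
                for subset in combinations(range(n), r))
    return total / comb(n, r)

# phi(x1, x2) = (x1 - x2)^2 / 2 is a symmetric, unbiased kernel for
# the variance functional, so U_n equals the unbiased sample variance.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = u_statistic(data, lambda x1, x2: (x1 - x2) ** 2 / 2, 2)
```

Here u equals 32/7, the unbiased sample variance of the eight observations, confirming that the subset average and the classical estimator coincide.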
1.2.2 The H-decomposition and the S-decomposition

In this section two representations of a U-statistic are presented. First a representation of a U-statistic as a sum of uncorrelated U-statistics of degrees 1, 2, ..., r is given. This representation is very useful in studying the properties of a U-statistic and is due to Hoeffding (1961). To describe the decomposition, define φ^(1)(x1) = φ_1(x1) − θ(F) and

    φ^(c)(x1, ..., xc) = φ_c(x1, ..., xc) − θ(F) − Σ_{j=1}^{c−1} Σ_(c,j) φ^(j)(x_α1, ..., x_αj)    (1.2.2.1)

for c = 2, 3, ..., r, where φ_c(x1, x2, ..., xc) is given in (1.2.1.2) and Σ_(c,j) extends over all subsets 1 ≤ α1 < α2 < ... < αj ≤ c of {1, 2, ..., c}. Then Hoeffding (1961) shows that

    U_n = θ(F) + Σ_{j=1}^{r} C(r, j) H_n^(j),  where
    H_n^(j) = C(n, j)^(-1) Σ_(n,j) φ^(j)(X_α1, X_α2, ..., X_αj).          (1.2.2.2)

For a proof that the components H_n^(j) are uncorrelated with variances of decreasing order in n, see Hoeffding (1961).
The second decomposition, due to Sen (1960), is as follows. For each i = 1, 2, ..., n calculate U_in as

    U_in = C(n − 1, r − 1)^(-1) Σ φ(X_i, X_α2, ..., X_αr),               (1.2.2.3)

where the summation extends over all subsets {α2, α3, ..., αr} from the integers {1, 2, ..., i − 1, i + 1, ..., n}. Then U_n is given by

    U_n = n^(-1) Σ_{i=1}^{n} U_in.                                        (1.2.2.4)

Sen (1960) proved that U_in − φ_1(X_i) converges in probability to 0.
1.2.3 Large Sample Results

The following results are proved in Hoeffding (1948):

Theorem 1. If E_F[φ²(X1, X2, ..., Xr)] < ∞, then

    lim_{n→∞} n Var(U_n) = r² ζ_1.                                        (1.2.3.1)

Theorem 2. If E_F[φ²(X1, X2, ..., Xr)] < ∞ and ζ_1 > 0, then

    n^(1/2) [U_n − θ(F)] / (r ζ_1^(1/2)) → N(0, 1) in distribution.       (1.2.3.2)

If ζ_1 = 0 then the asymptotic distribution is degenerate.

The following result was proved by Sen (1960):

Theorem 3. Let Z_1² = (n − 1)^(-1) Σ_{i=1}^{n} (U_in − U_n)², where U_in is given in (1.2.2.3). Z_1² is a consistent estimator of ζ_1. It follows that

    n^(1/2) [U_n − θ(F)] / (r Z_1) → N(0, 1) in distribution,             (1.2.3.3)

where N(0, 1) represents a Normal distribution with mean 0 and variance 1.
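Sen's decomposition and the variance estimator of Theorem 3 can be sketched numerically (an illustration with a kernel and data of our own choosing): U_in of (1.2.2.3) is the average of the kernel over all subsets containing observation i, identity (1.2.2.4) recovers U_n, and Z_1² is computed from the U_in.

```python
from itertools import combinations
from math import comb

def sen_decomposition(sample, kernel, r):
    # U_in of (1.2.2.3): average of the kernel over the C(n-1, r-1)
    # size-r subsets that contain observation i.
    n = len(sample)
    sums = [0.0] * n
    for subset in combinations(range(n), r):
        value = kernel(*(sample[i] for i in subset))
        for i in subset:
            sums[i] += value
    u_in = [s / comb(n - 1, r - 1) for s in sums]
    u_n = sum(u_in) / n                                  # (1.2.2.4)
    z1_sq = sum((v - u_n) ** 2 for v in u_in) / (n - 1)  # Theorem 3
    return u_n, z1_sq

data = [1.0, 3.0, 2.0, 8.0, 5.0, 4.0, 9.0, 2.0]
u_n, z1_sq = sen_decomposition(data, lambda x1, x2: (x1 - x2) ** 2 / 2, 2)
```

By (1.2.3.3), sqrt(n)·(U_n − θ(F))/(r·sqrt(Z_1²)) is then approximately standard normal for large n.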
1.2.4 Two-Sample U-statistics

The definitions and results presented so far for one-sample U-statistics can be extended to two or more samples. To extend the results to a two-sample situation, let X1, X2, ..., Xm and Y1, Y2, ..., Yn be independent random samples from two distributions with distribution functions F and G respectively, within a class 𝒢 × 𝒢. The parameter θ = θ(F, G) is estimable of degree (r, s) if r and s are the smallest sample sizes for which E_{F,G}[φ(X1, X2, ..., Xr; Y1, Y2, ..., Ys)] = θ(F, G). φ(·;·) is the kernel of θ(F, G) and the two-sample U-statistic is defined by

    U_mn = C(m, r)^(-1) C(n, s)^(-1) Σ_(m,r) Σ_(n,s) φ(X_α1, ..., X_αr; Y_β1, ..., Y_βs).   (1.2.4.1)

Now, similarly as in Section 1.2.1, define

    φ_{c,d}(x1, ..., xc; y1, ..., yd) = E_{F,G}[φ(x1, ..., xc, X_{c+1}, ..., X_r; y1, ..., yd, Y_{d+1}, ..., Y_s)]

and

    ζ_{c,d} = Var[φ_{c,d}(X1, ..., Xc; Y1, ..., Yd)]                       (1.2.4.2)

for c = 0, 1, ..., r and d = 0, 1, ..., s. It follows that φ_{0,0} = θ(F, G) and ζ_{0,0} = 0.
To generalize the H-decomposition to the two-sample U-statistic, define

    φ^(c,d)(x1, ..., xc; y1, ..., yd) = ∫···∫ φ(u1, ..., ur; v1, ..., vs)
        Π_{i=1}^{c} d[M_{xi}(ui) − F(ui)] Π_{i=c+1}^{r} dF(ui)
        Π_{j=1}^{d} d[M_{yj}(vj) − G(vj)] Π_{j=d+1}^{s} dG(vj),

where M_x denotes the distribution function of a single point mass at x, and

    H_{m,n}^(c,d) = C(m, c)^(-1) C(n, d)^(-1) Σ_(m,c) Σ_(n,d) φ^(c,d)(X_α1, ..., X_αc; Y_β1, ..., Y_βd).

Then the H-decomposition is given by [see Hoeffding (1961)]

    U_mn = Σ_{c=0}^{r} Σ_{d=0}^{s} C(r, c) C(s, d) H_{m,n}^(c,d).          (1.2.4.3)
The S-decomposition can also be generalized to the two-sample case in the following way [Sen (1960)]. Define

    U_j^(1) = C(m − 1, r − 1)^(-1) C(n, s)^(-1) Σ_{C1} Σ_(n,s) φ(X_j, X_α2, ..., X_αr; Y_β1, ..., Y_βs),

where C1 extends over all subsets {α2 < α3 < ... < αr} from the integers {1, 2, ..., j − 1, j + 1, ..., m}, and

    U_j^(2) = C(m, r)^(-1) C(n − 1, s − 1)^(-1) Σ_(m,r) Σ_{C2} φ(X_α1, ..., X_αr; Y_j, Y_β2, ..., Y_βs),

where C2 extends over all subsets {β2 < β3 < ... < βs} from the integers {1, 2, ..., j − 1, j + 1, ..., n}. Then U_mn can be written as

    U_mn = m^(-1) Σ_{j=1}^{m} U_j^(1) = n^(-1) Σ_{j=1}^{n} U_j^(2).

The large sample results for the two-sample U-statistic equivalent to the ones presented for one sample in Section 1.2.3 are as follows.

Theorem 4. Let N = m + n and assume that lim_{N→∞} m/N = λ and lim_{N→∞} n/N = 1 − λ, where 0 < λ < 1. If E_{F,G}[φ²(X1, ..., Xr; Y1, ..., Ys)] < ∞, then

    lim_{N→∞} N Var(U_mn) = (r²/λ) ζ_{1,0} + (s²/(1 − λ)) ζ_{0,1}.

Theorem 5. Let N = m + n and assume that lim_{N→∞} m/N = λ and lim_{N→∞} n/N = 1 − λ, where 0 < λ < 1. If E_{F,G}[φ²(X1, ..., Xr; Y1, ..., Ys)] < ∞ and 0 < max(ζ_{1,0}, ζ_{0,1}) < ∞, then

    [U_mn − θ(F, G)] / (r² ζ_{1,0}/m + s² ζ_{0,1}/n)^(1/2) → N(0, 1) in distribution.

Theorem 6. Let Z_1² = (m − 1)^(-1) Σ_{j=1}^{m} (U_j^(1) − U_mn)² and Z_2² = (n − 1)^(-1) Σ_{j=1}^{n} (U_j^(2) − U_mn)². Under the conditions of Theorem 5, Z_1² and Z_2² are consistent estimators of ζ_{1,0} and ζ_{0,1}, and

    [U_mn − θ(F, G)] / (r² Z_1²/m + s² Z_2²/n)^(1/2) → N(0, 1) in distribution.
1.3 Lehmann's General Two-Sample Test

In this section Lehmann's two-sample test [Lehmann (1951)], to test the hypothesis of homogeneity between two univariate distributions, is reviewed. Let X1, X2, ..., Xm and Y1, Y2, ..., Yn be independent random samples from populations X and Y with continuous distribution functions F and G respectively. To test H0: F(x) = G(x) ∀ x against the alternative Ha: F(x) ≠ G(x) for some x, Lehmann (1951) proposed the parameter

    Δ = ∫ [F(x) − G(x)]² d[(F(x) + G(x))/2]

as a measure of the discrepancy between the two distributions. Lehmann showed that if a pair of X's, say X1 and X2, and a pair of Y's, say Y1 and Y2, are chosen at random, then

    P{max(X1, X2) < min(Y1, Y2) or max(Y1, Y2) < min(X1, X2)} = 1/3 + 2Δ.

Let λ = P{max(X1, X2) < min(Y1, Y2) or max(Y1, Y2) < min(X1, X2)}. An obvious estimator of λ is the two-sample U-statistic

    U_mn = C(m, 2)^(-1) C(n, 2)^(-1) Σ_{i<j} Σ_{k<l} φ(X_i, X_j; Y_k, Y_l),   (1.3.1)

with kernel φ(·;·) of degree (2, 2) defined as

    φ(X1, X2; Y1, Y2) = 1 if max(X1, X2) < min(Y1, Y2) or
                            max(Y1, Y2) < min(X1, X2),
                      = 0 otherwise.                                      (1.3.2)

Equivalently the kernel can be defined as

    φ(2 X's; 2 Y's) = 1 if the ordering of the observations is XXYY or YYXX,
                    = 0 otherwise.                                        (1.3.3)
Under H0, Δ = 0 and E(U_mn) = λ = 1/3. The null variance of U_mn, found by Sundrum (1954), is given by

    Var(U_mn) = 4 [(m + n)(m + n − 1) − 2] / [45 m n (m − 1)(n − 1)].
Since ζ_{0,1} and ζ_{1,0} are both equal to zero under H0, Theorem 5 of Section 1.2.4 does not apply. The asymptotic null distribution of U_mn is given by Wegner (1956):

    W_mn = m n (3 U_mn − 1) / [6 (m + n)]

has the same limiting distribution as the centered Cramér-von Mises goodness-of-fit statistic nω². As m, n → ∞, P{W_mn ≤ x} → P{Σ_{k=1}^{∞} (k² π²)^(-1) (Z_k² − 1) ≤ x}, where the Z_k are independent standardized normal random variables. Under Ha, U_mn is asymptotically normally distributed. This test is unconditionally distribution free and consistent against all alternatives.
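For small samples Lehmann's statistic (1.3.1) can be computed by direct enumeration. The following sketch implements kernel (1.3.2); the data in the usage example are ours.

```python
from itertools import combinations
from math import comb

def lehmann_u(xs, ys):
    # (1.3.1)-(1.3.2): the proportion of (X-pair, Y-pair) combinations
    # whose pooled ordering is XXYY or YYXX.
    count = 0
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            if max(x1, x2) < min(y1, y2) or max(y1, y2) < min(x1, x2):
                count += 1
    return count / (comb(len(xs), 2) * comb(len(ys), 2))
```

Completely separated samples give the maximum value 1, e.g. lehmann_u([1, 2], [3, 4]); under H0 the expectation is λ = 1/3, which the interleaved example lehmann_u([1, 2, 5], [3, 4, 6]) happens to attain exactly.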
The kernel (1.3.2) of degree (2, 2) can be written as a linear combination of kernels of degrees (2, 1) and (1, 2). For that, consider the four possible triplets in the two pairs of X's and Y's: XXY, XYX, YYX, YXY. Define the kernel of degree (2, 1)

    η(2 X's; 1 Y) = 1 if the ordering in the triplet is XYX,
                  = 0 otherwise.

Then

    φ(2 X's; 2 Y's) = 1 − (1/2) [η(2 X's; 1 Y) + η(2 Y's; 1 X)],

where η(2 Y's; 1 X), defined symmetrically as the indicator of the ordering YXY, and η(2 X's; 1 Y) are each summed over the two choices of the third observation.
1.4 Multivariate Nonparametric Two-Sample Tests

Generalizations to the multivariate case of univariate tests of homogeneity between two distributions have been proposed in the literature. Bickel (1969) presents a generalization of the Smirnov test, based on Fisher's permutation principle. The test is shown to be conditionally distribution free and consistent against all alternatives. However, the asymptotic null distribution of the test statistic is not given. Friedman and Rafsky (1979) propose tests that are generalizations of the Wald and Wolfowitz runs test and the Smirnov test. The tests are based on using the minimal spanning tree of the sample points as a generalization of the sorted list of the pooled sample of observations. For the multivariate runs test a test statistic R based on the number of subtrees is defined and the mean and variance of R under H0 are derived. The variance of R under H0 depends on the common distribution of the two populations. A conditionally distribution-free test is then constructed, conditioning on the empirical distribution of the pooled sample. The permutational conditional distribution of the test statistic is shown to be asymptotically normal. Consistency against general alternatives is not proved. The generalizations of the Smirnov test are based on two different procedures for ranking the multivariate observations and then applying the standard univariate Smirnov test to the resulting sequence. Whaley (1983) shows that the multidimensional runs statistic of Friedman and Rafsky is equivalent to three other independently derived test statistics.
Other tests proposed in the literature which are not attempts to generalize univariate tests are described below. Weiss (1960) proposes a test based on the Euclidean distance between observations. Let, as usual, X1, X2, ..., Xm and Y1, Y2, ..., Yn be independent random samples from two d-variate continuous distributions. Let 2R_j be the distance between X_j and its nearest neighbor (the point closest to X_j) among X1, ..., X_{j−1}, X_{j+1}, ..., Xm, and let S_j be the number of points Y1, Y2, ..., Yn contained in the open sphere {x: |x − X_j| < R_j}. The test statistic is then Q_m(0), the proportion of the values S1, S2, ..., Sm which are equal to zero. Under H0, as m → ∞, Q_m(0) converges in probability to Q(0), which is equal to 2^d α/(1 + 2^d α), where α is a constant determined by the limiting ratio of the two sample sizes. The test rejects H0 if Q_m(0) > 2^d α/(1 + 2^d α). The distribution of Q_m(0) depends on the common distribution of X and Y even under H0, and the test is not shown to be consistent against general alternatives.
Schilling (1986) proposes a test based on the proportion of all k nearest neighbors in which a point and its neighbor are members of the same sample. Again let X1, X2, ..., Xm and Y1, Y2, ..., Yn be independent random samples from two d-variate continuous distributions, and take ||·|| to be the Euclidean norm. Let N = m + n, O1 = {1, 2, ..., m}, O2 = {m+1, m+2, ..., m+n}, and define the pooled sample Z_i = X_i for i ∈ O1 and Z_i = Y_{i−m} for i ∈ O2. Let NN_i(r) represent the r-th nearest neighbor of the sample point Z_i, that is, the Z_j satisfying ||Z_{j'} − Z_i|| < ||Z_j − Z_i|| for exactly r − 1 values of j', 1 ≤ j' ≤ N, j' ≠ i, j, and let

    I_i(r) = 1 if NN_i(r) is in the same sample as Z_i,
           = 0 otherwise.                                                 (1.4.1)
The test statistic is

    T_{k,N} = (N k)^(-1) Σ_{i=1}^{N} Σ_{r=1}^{k} I_i(r).                  (1.4.2)

It is proved that as m, n → ∞ in such a way that lim m/N = λ1 and lim n/N = 1 − λ1, T_{k,N} has a limiting normal distribution under H0. It is also proved that the test is distribution free and consistent against all alternatives. However, it should be noted that for finite k the limiting distribution of the number of nearest neighbors is Poisson with parameter 2k + 1 [Jammalamadaka and Janson (1984)], which is not in concordance with the asymptotic results of Schilling. Thus for a fixed k and a small shift under the alternative the probability of accepting the null hypothesis may not go to zero, so that the test may not be consistent in this situation. Weighted versions of the test are also considered, using different weight functions. Power of the different versions of the test is calculated through Monte Carlo simulation.
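Schilling's statistic (1.4.2) can be sketched as follows (an illustration of ours; points are tuples in R^d, and distance ties, which occur with probability zero for continuous distributions, are broken arbitrarily by index order):

```python
from math import dist

def schilling_t(xs, ys, k):
    # T_{k,N} of (1.4.2): the proportion, over all N points Z_i and
    # ranks r = 1, ..., k, of r-th nearest neighbours NN_i(r) that
    # belong to the same sample as Z_i itself.
    pooled = [(p, 0) for p in xs] + [(p, 1) for p in ys]
    n_total = len(pooled)
    same = 0
    for i, (point, label) in enumerate(pooled):
        # sort all other points by distance; the first k are NN_i(1..k)
        neighbours = sorted((dist(point, q), j)
                            for j, (q, _) in enumerate(pooled) if j != i)
        same += sum(1 for _, j in neighbours[:k] if pooled[j][1] == label)
    return same / (n_total * k)
```

Two well-separated clusters give T_{k,N} near 1 (lack of mixing), while perfectly interleaved samples give small values, which is the behaviour the test exploits.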
For this same situation Friedman and Steppel (1974) propose a test based on the number of points C_i from, say, the first sample, X1, X2, ..., Xm, that are among the k nearest neighbors of each point Z_i in the combined sample. Separate frequency distributions can be calculated from the C_i for i ∈ O1 and i ∈ O2. Under H0, for large N, Friedman and Steppel suggest contrasting {C_i, i ∈ O1} with {C_i, i ∈ O2} by comparing them with the frequency distribution of all C_1, C_2, ..., C_N.
Henze (1988) considers Schilling's test with an arbitrary norm on R^d. The randomized version of Schilling's test is proved to be consistent against all alternatives, asymptotically normal, and conditionally distribution free under H0.
Barakat (1989) proposes a test related to Schilling's test, using all nearest neighbors, which is equivalent to a certain weighted average of Schilling's proportions. Barakat's test statistic is equivalent to

    W = N Σ_{k=1}^{N−1} k T_{k,N},                                        (1.4.3)

where I_i(r) and T_{k,N} are given in (1.4.1) and (1.4.2) respectively. The small sample properties of the test statistic are studied as well as asymptotic properties. Under H0, E(W) = N{m(m − 1) + n(n − 1)}/2, and Var(W) is a function of p1 = P[NN_1(r) = Z_2, NN_2(s) = Z_1] and p2 = P[NN_1(r) = NN_2(s)], quantities studied by Schilling (1986). One would expect larger values of W under Ha than under H0 because of a lack of complete mixing of the two samples under the alternative. Barakat, Quade and Salama (1990) show that W is equivalent to a sum of N Wilcoxon rank sums. W can also be expressed as a linear combination of two U-statistics of degrees (2, 1) and (1, 2) respectively. For X1, X2 and Y1 in R^d define the kernel function of degree (2, 1)

    η(X1, X2; Y1) = I{D(X1, X2) < D(X1, Y1)} + I{D(X1, X2) < D(X2, Y1)},   (1.4.4)

where D is Euclidean distance and I{·} is the indicator function. Consider the two U-statistics

    U1 = C(m, 2)^(-1) n^(-1) Σ_{i<j} Σ_k η(X_i, X_j; Y_k),
    U2 = C(n, 2)^(-1) m^(-1) Σ_{k<l} Σ_i η(Y_k, Y_l; X_i).

Then, as shown in Barakat, Quade and Salama (1990),

    W = [m(m − 1)/2] (m + n U1) + [n(n − 1)/2] (n + m U2).
When m = n, W is equivalent to a single U-statistic of degree (2, 2). Define the kernel of degree (2, 2)

    φ(X1, X2; Y1, Y2) = η(X1, X2; Y1) + η(X1, X2; Y2)
                      + η(Y1, Y2; X1) + η(Y1, Y2; X2)                     (1.4.5)

and

    U_mn = C(m, 2)^(-1) C(n, 2)^(-1) Σ_{i<j} Σ_{k<l} φ(X_i, X_j; Y_k, Y_l).   (1.4.6)

Then, for m = n, Barakat, Quade and Salama (1990) show that

    W = m² (m − 1) (1 + U_mm/4).

The relation between U_mm and W, and the fact, proved in Chapter 3, that U_mm is not asymptotically normally distributed, imply that W is not asymptotically normal under H0.
Barakat's test is not consistent against all alternatives, as the following counterexample shows. Consider the case where m = n and let S_i ⊂ R^d, i = 1, 2, 3, 4, be regions in d-dimensional Euclidean space such that P(X ∈ S1) = P(X ∈ S2) = 1/2 and P(Y ∈ S3) = P(Y ∈ S4) = 1/2. Assume that the smallest distance between two X points located in different regions is greater than the largest distance between an X and a Y, and conversely that the smallest distance between two Y points located in different regions is greater than the largest distance between a Y and an X. Then it is easy to see that the expected value of the kernel function (1.4.5) is E_{H0}(φ) = E_{Ha}(φ) = 4. Since in this case W is equivalent to a U-statistic, it can be concluded that the test is not consistent against this alternative.
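Barakat's kernel (1.4.4) and the U-statistics U1 and U2 can be computed by brute force; the sketch below (configurations in the examples are ours) makes the "mutually closest" counting explicit.

```python
from itertools import combinations
from math import comb, dist

def eta(x1, x2, y1):
    # Kernel (1.4.4) of degree (2, 1): counts for how many of the two
    # comparisons the X-pair is closer together than to the Y point.
    return ((dist(x1, x2) < dist(x1, y1)) +
            (dist(x1, x2) < dist(x2, y1)))

def u21(first, second):
    # U-statistic of degree (2, 1): average of eta over all pairs from
    # the first sample and all single points from the second sample.
    total = sum(eta(a, b, c)
                for a, b in combinations(first, 2) for c in second)
    return total / (comb(len(first), 2) * len(second))

xs = [(0.0,), (1.0,)]
u1 = u21(xs, [(5.0,)])          # the X's are mutually closest: eta = 2
u2 = u21([(5.0,), (6.0,)], xs)  # the Y's are mutually closest as well
```

With U1 = u21(X-sample, Y-sample) and U2 = u21(Y-sample, X-sample), W is then [m(m − 1)/2](m + n U1) + [n(n − 1)/2](n + m U2); under H0 and continuity, E(η) = 1.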
Jupp (1987) proposes a test based on the concept of orientation of a simplex and the idea of separation of two points by a hyperplane. The test statistic is a U-statistic of degree (p, 2), where p is the dimension of the data vectors. The test is not distribution free under H0. Consistency is not discussed and the asymptotic distribution is not found.
1.5 A New Test Based on a U-statistic

In this section a new test is proposed for the hypothesis of homogeneity between two multivariate distributions, which is based on a U-statistic and is consistent against all alternatives for a certain class of distributions. For the same situation as Barakat's test, define the kernel function of degree (2, 1) as

    η(X1, X2; Y1) = I{D(X1, X2) < min[D(X1, Y1), D(X2, Y1)]}
                  + (1/2) I{D(X1, X2) = D(X1, Y1) < D(X2, Y1)}
                  + (1/2) I{D(X1, X2) = D(X2, Y1) < D(X1, Y1)}
                  + (1/3) I{D(X1, X2) = D(X1, Y1) = D(X2, Y1)}.           (1.5.1)

Define the kernel function of degree (2, 2)

    φ(X1, X2; Y1, Y2) = η(X1, X2; Y1) + η(X1, X2; Y2)
                      + η(Y1, Y2; X1) + η(Y1, Y2; X2).                    (1.5.2)

The test statistic will be the U-statistic

    V_mn = C(m, 2)^(-1) C(n, 2)^(-1) Σ_{i<j} Σ_{k<l} φ(X_i, X_j; Y_k, Y_l).   (1.5.3)
Similarly as in Barakat's test, V_mn is expected to achieve larger values under Ha than under H0; hence large values of V_mn will lead to the rejection of H0. Under H0, as proved in Section 2.2, E(η) = 1/3 and hence E(φ) = E(V_mn) = 4/3.
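For small samples the proposed statistic can be computed directly. The sketch below implements kernel (1.5.1), kernel (1.5.2) and V_mn of (1.5.3) by brute force; exact ties are compared with ==, which is appropriate for the discrete configurations of Chapter 2, and the example data are ours.

```python
from itertools import combinations
from math import comb, dist

def eta(x1, x2, y1):
    # Kernel (1.5.1): weight 1 when the pair (x1, x2) is strictly
    # mutually closest, 1/2 for a single tie involving d0, and 1/3
    # when all three distances are equal.
    d0, d1, d2 = dist(x1, x2), dist(x1, y1), dist(x2, y1)
    if d0 < min(d1, d2):
        return 1.0
    if d0 == d1 == d2:
        return 1 / 3
    if d0 == d1 < d2 or d0 == d2 < d1:
        return 0.5
    return 0.0

def phi(x1, x2, y1, y2):
    # Kernel (1.5.2) of degree (2, 2).
    return (eta(x1, x2, y1) + eta(x1, x2, y2) +
            eta(y1, y2, x1) + eta(y1, y2, x2))

def v_mn(xs, ys):
    # (1.5.3): average of phi over all pairs of X's and pairs of Y's.
    total = sum(phi(xi, xj, yk, yl)
                for xi, xj in combinations(xs, 2)
                for yk, yl in combinations(ys, 2))
    return total / (comb(len(xs), 2) * comb(len(ys), 2))
```

For two well-separated samples every η equals 1 and V_mn attains its maximum value 4, against the null expectation 4/3.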
CHAPTER 2

CONSISTENCY OF V_mn

2.1 Introduction

To study the consistency of the test against the completely general alternative of non-homogeneity, first the case where X and Y are discrete random vectors distributed over four points in d-dimensional space is considered, in Sections 2.3 to 2.6. In Section 2.7 consistency in the univariate Normal case is studied, and in Section 2.8 a continuous bivariate case is considered. Since V_mn, defined in (1.5.3), is a U-statistic with a bounded kernel, moments of all orders exist, so that to prove consistency it will suffice to prove that the expected value of V_mn is greater under any alternative than it is under the null hypothesis, which is equivalent to proving that E_{Ha}(φ) > E_{H0}(φ), where φ is defined in (1.5.2). Initially, it is shown that in general E_{H0}(φ) = 4/3.
2.2 Expected value of φ under the null hypothesis

Let X and Y be discrete or continuous random vectors in d-dimensional space. Let X1, X2, Y1, Y2 be independent observations from X and Y respectively. Consider η(X1, X2; Y1) defined in (1.5.1). Denote the Euclidean distances among the three pairs by D0 = D(X1, X2), D1 = D(X1, Y1) and D2 = D(X2, Y1). Define the events:

    E1 = event that the three distances are all different;
    E2 = event that two distances are equal and the other is larger;
    E3 = event that two distances are equal and the other is smaller;
    E4 = event that the three distances are all equal.

Now,

    E[η(X1, X2; Y1)] = E[η(X1, X2; Y1) | E1] P(E1) + E[η(X1, X2; Y1) | E2] P(E2)
                     + E[η(X1, X2; Y1) | E3] P(E3) + E[η(X1, X2; Y1) | E4] P(E4).

Under H0 the three distances are exchangeable. If E1 occurs then E[η(X1, X2; Y1) | E1] = 1/3; if E2 occurs then η(X1, X2; Y1) = 0 with probability 1/3 and η(X1, X2; Y1) = 1/2 with probability 2/3, so that E[η(X1, X2; Y1) | E2] = 1/3; if E3 occurs then η(X1, X2; Y1) = 1 with probability 1/3 and η(X1, X2; Y1) = 0 with probability 2/3, so that E[η(X1, X2; Y1) | E3] = 1/3; and if E4 occurs then η(X1, X2; Y1) = 1/3, hence E[η(X1, X2; Y1) | E4] = 1/3. Thus,

    E[η(X1, X2; Y1)] = (1/3) P(E1) + (1/3) P(E2) + (1/3) P(E3) + (1/3) P(E4) = 1/3.

Then, by symmetry, E(φ) = 4 (1/3) = 4/3.
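The identity E_{H0}(η) = 1/3 can also be checked exactly, outside the continuous case, by enumerating all (X1, X2, Y1) triples for a discrete distribution with F = G. In the sketch below the four support points and probabilities are an arbitrary configuration of ours with all six pairwise distances different, and the kernel (1.5.1) is restated so that the snippet is self-contained.

```python
from itertools import product
from math import dist

def eta(x1, x2, y1):
    # Kernel (1.5.1), as defined in Section 1.5.
    d0, d1, d2 = dist(x1, x2), dist(x1, y1), dist(x2, y1)
    if d0 < min(d1, d2):
        return 1.0
    if d0 == d1 == d2:
        return 1 / 3
    if d0 == d1 < d2 or d0 == d2 < d1:
        return 0.5
    return 0.0

# A discrete distribution over four points in the plane; under H0 both
# X and Y follow it.  All six pairwise distances are different.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 5.0)]
probs = [0.1, 0.2, 0.3, 0.4]

# Exact expectation of eta under H0: weight each (X1, X2, Y1) triple
# by its probability, including triples with repeated points, where
# the tie branches of the kernel apply.
e_eta = sum(pi * pj * pk * eta(points[i], points[j], points[k])
            for (i, pi), (j, pj), (k, pk)
            in product(enumerate(probs), repeat=3))
```

The sum equals 1/3 up to rounding, in agreement with the derivation above, and by symmetry E_{H0}(φ) = E_{H0}(V_mn) = 4/3.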
2.3 Proof of consistency when X and Y are distributed over at most four points assuming distances between pairs of points to be all different

Let X and Y be random vectors distributed over four points in d-dimensional space, with probabilities p1, p2, p3, p4 and q1, q2, q3, q4 respectively. Let d_ij be the Euclidean distance between points i and j in d-dimensional space, i, j = 1, 2, 3, 4. Assume the distances d12, d13, d14, d23, d24, d34 are all different. Let X1, X2, ..., Xm and Y1, Y2, ..., Yn be two independent random samples from X and Y respectively. Consider a triplet of two X's, say X1 and X2, and one Y, say Y1, and the distances D0 = D(X1, X2), D1 = D(X1, Y1) and D2 = D(X2, Y1). The joint distribution of (D0, D1, D2) is shown in Table 2.1.

There are 6! = 720 possible orderings of the distances between the pairs of points, but we shall see that these orderings can be apportioned among four classes within each of which the expectation of the kernel function has only one value. Thus this expectation needs to be calculated only four times.
2.3.1 Expected value of φ when the closest and the second closest pairs do not have any point in common.

First the case where the closest pair of points and the second closest pair do not have any point in common is considered. Without loss of generality the points can be numbered so that the closest pair is (1, 2) and the second closest is (3, 4). The third closest must join 1 or 2 with 3 or 4. Without loss of generality, let the third closest pair be (1, 3). Then there are 3! = 6 possible matrices of ranks of distances between pairs of points. The program in Appendix 1 shows that the kernel function has the same value for all matrices. Thus without loss of generality the ordering d_12 < d_34 < d_13 < d_14 < d_23 < d_24 will be assumed. In the case under consideration the distribution of η = η(X_1, X_2; Y_1) is given in Table 2.2.
The expected value of η = η(X_1, X_2; Y_1) is then given by
Table 2.1. Joint distribution of (D_0, D_1, D_2) when X and Y have a discrete distribution over four points (i, j, k denote distinct points in {1, 2, 3, 4}):

(D_0, D_1, D_2)         Probabilities
(0, 0, 0)               p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
(0, d_ij, d_ij)         p_i^2 q_j + p_j^2 q_i
(d_ij, 0, d_ij)         p_i p_j q_i + p_i p_j q_j
(d_ij, d_ij, 0)         p_i p_j q_i + p_i p_j q_j
(d_ij, d_ik, d_jk)      p_i p_j q_k
Table 2.2. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 < d_34 < d_13 < d_14 < d_23 < d_24.

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_12, d_12)       p_1^2 q_2 + p_2^2 q_1
1      (0, d_13, d_13)       p_1^2 q_3 + p_3^2 q_1
1      (0, d_14, d_14)       p_1^2 q_4 + p_4^2 q_1
1      (0, d_23, d_23)       p_2^2 q_3 + p_3^2 q_2
1      (0, d_24, d_24)       p_2^2 q_4 + p_4^2 q_2
1      (0, d_34, d_34)       p_3^2 q_4 + p_4^2 q_3
1      (d_12, d_13, d_23)    p_1 p_2 q_3
1      (d_12, d_23, d_13)    p_1 p_2 q_3
1      (d_12, d_14, d_24)    p_1 p_2 q_4
1      (d_12, d_24, d_14)    p_1 p_2 q_4
1      (d_34, d_13, d_14)    p_3 p_4 q_1
1      (d_34, d_14, d_13)    p_3 p_4 q_1
1      (d_34, d_23, d_24)    p_3 p_4 q_2
1      (d_34, d_24, d_23)    p_3 p_4 q_2
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
E(η) = (1/3)(p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4)
       + p_1^2 q_2 + p_1^2 q_3 + p_1^2 q_4 + p_2^2 q_1 + p_2^2 q_3 + p_2^2 q_4
       + p_3^2 q_1 + p_3^2 q_2 + p_3^2 q_4 + p_4^2 q_1 + p_4^2 q_2 + p_4^2 q_3
       + 2 (p_1 p_2 q_3 + p_1 p_2 q_4 + p_3 p_4 q_1 + p_3 p_4 q_2).
Under the null hypothesis of homogeneity, p_i = q_i, i = 1, 2, 3, 4. Thus,

E_H0(η) = (1/3)(p_1^3 + p_2^3 + p_3^3 + p_4^3) + p_1^2 p_2 + p_1^2 p_3 + p_1^2 p_4 + p_2^2 p_1
          + p_2^2 p_3 + p_2^2 p_4 + p_3^2 p_1 + p_3^2 p_2 + p_3^2 p_4 + p_4^2 p_1 + p_4^2 p_2
          + p_4^2 p_3 + 2 (p_1 p_2 p_3 + p_1 p_2 p_4 + p_3 p_4 p_1 + p_3 p_4 p_2)
        = (1/3)(p_1 + p_2 + p_3 + p_4)^3 = 1/3,

as was proved, in general, in Section 2.2.
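The algebraic identity behind this null expectation can be checked numerically. The sketch below (plain Python; the probability vector p is an arbitrary illustrative choice, not from the text) expands the terms of E_H0(η) and compares the total with (1/3)(p_1 + p_2 + p_3 + p_4)^3:

```python
# Arbitrary probability vector over the four points (illustrative values only).
p = [0.1, 0.2, 0.3, 0.4]

# E_H0(eta) = (1/3) sum_i p_i^3 + sum_{i != j} p_i^2 p_j
#             + 2 (p1 p2 p3 + p1 p2 p4 + p3 p4 p1 + p3 p4 p2)
cubes = sum(pi**3 for pi in p) / 3.0
squares = sum(p[i]**2 * p[j] for i in range(4) for j in range(4) if i != j)
triples = 2.0 * (p[0]*p[1]*p[2] + p[0]*p[1]*p[3] + p[2]*p[3]*p[0] + p[2]*p[3]*p[1])
e_eta = cubes + squares + triples

# The identity: E_H0(eta) = (1/3)(p1 + p2 + p3 + p4)^3 = 1/3.
print(abs(e_eta - sum(p)**3 / 3.0) < 1e-12)   # True
```

The same expansion holds for any probability vector, since the three groups of terms are exactly the multinomial expansion of (Σ p_i)^3 with the cubic terms weighted by 1/3.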
Under the completely general alternative of non-homogeneity, p_i ≠ q_i for at least one i, i = 1, 2, 3, 4. Let q_i = p_i + δ_i, i = 1, 2, 3, 4, with δ_1 + δ_2 + δ_3 + δ_4 = 0. Then the average expected value of η,

η̄ = η̄(X_1, X_2; Y_1, Y_2) = (1/2){E[η(X_1, X_2; Y_1)] + E[η(Y_1, Y_2; X_1)]},

is
η̄ = (1/2){ (1/3)[p_1^2 (p_1 + δ_1) + p_2^2 (p_2 + δ_2) + p_3^2 (p_3 + δ_3) + p_4^2 (p_4 + δ_4)]
     + p_1^2 (p_2 + δ_2) + p_1^2 (p_3 + δ_3) + p_1^2 (p_4 + δ_4) + p_2^2 (p_1 + δ_1)
     + p_2^2 (p_3 + δ_3) + p_2^2 (p_4 + δ_4) + p_3^2 (p_1 + δ_1) + p_3^2 (p_2 + δ_2)
     + p_3^2 (p_4 + δ_4) + p_4^2 (p_1 + δ_1) + p_4^2 (p_2 + δ_2) + p_4^2 (p_3 + δ_3)
     + 2 [p_1 p_2 (p_3 + δ_3) + p_1 p_2 (p_4 + δ_4) + p_3 p_4 (p_1 + δ_1) + p_3 p_4 (p_2 + δ_2)]
     + (1/3)[(p_1 + δ_1)^2 p_1 + (p_2 + δ_2)^2 p_2 + (p_3 + δ_3)^2 p_3 + (p_4 + δ_4)^2 p_4]
     + (p_1 + δ_1)^2 p_2 + (p_1 + δ_1)^2 p_3 + (p_1 + δ_1)^2 p_4
     + (p_2 + δ_2)^2 p_1 + (p_2 + δ_2)^2 p_3 + (p_2 + δ_2)^2 p_4
     + (p_3 + δ_3)^2 p_1 + (p_3 + δ_3)^2 p_2 + (p_3 + δ_3)^2 p_4
     + (p_4 + δ_4)^2 p_1 + (p_4 + δ_4)^2 p_2 + (p_4 + δ_4)^2 p_3
     + 2 [(p_1 + δ_1)(p_2 + δ_2) p_3 + (p_1 + δ_1)(p_2 + δ_2) p_4
     + (p_3 + δ_3)(p_4 + δ_4) p_1 + (p_3 + δ_3)(p_4 + δ_4) p_2] }
   = (1/2){A + B + C + D + E}, say, where

A = 2 [(1/3)(p_1^3 + p_2^3 + p_3^3 + p_4^3) + p_1^2 p_2 + p_1^2 p_3 + p_1^2 p_4 + p_2^2 p_1
      + p_2^2 p_3 + p_2^2 p_4 + p_3^2 p_1 + p_3^2 p_2 + p_3^2 p_4 + p_4^2 p_1 + p_4^2 p_2
      + p_4^2 p_3 + 2 p_1 p_2 p_3 + 2 p_1 p_2 p_4 + 2 p_1 p_3 p_4 + 2 p_2 p_3 p_4],

B = p_1^2 δ_1 + p_2^2 δ_2 + p_3^2 δ_3 + p_4^2 δ_4 + p_1^2 δ_2 + p_1^2 δ_3 + p_1^2 δ_4 + p_2^2 δ_1
    + p_2^2 δ_3 + p_2^2 δ_4 + p_3^2 δ_1 + p_3^2 δ_2 + p_3^2 δ_4 + p_4^2 δ_1 + p_4^2 δ_2 + p_4^2 δ_3,

C = δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
    + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4),

D = 2 p_1 p_2 (δ_1 + δ_2 + δ_3 + δ_4) + 2 p_1 p_3 (δ_1 + δ_2 + δ_3 + δ_4)
    + 2 p_1 p_4 (δ_1 + δ_2 + δ_3 + δ_4) + 2 p_2 p_3 (δ_1 + δ_2 + δ_3 + δ_4)
    + 2 p_2 p_4 (δ_1 + δ_2 + δ_3 + δ_4) + 2 p_3 p_4 (δ_1 + δ_2 + δ_3 + δ_4),

E = 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_1 δ_3 δ_4 + p_2 δ_3 δ_4).
Now, since δ_1 + δ_2 + δ_3 + δ_4 = 0,

B = (p_1^2 + p_2^2 + p_3^2 + p_4^2)(δ_1 + δ_2 + δ_3 + δ_4) = 0   and   D = 0,

and A = 2/3, since each bracketed expression equals 1/3. Therefore,

η̄ = 1/3 + (1/2)[δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
     + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4)
     + 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_1 δ_3 δ_4 + p_2 δ_3 δ_4)]
   > 1/3, since the expression in brackets is greater than 0.
Thus, under H_a, by symmetry,

E(φ) = 2 η̄(X_1, X_2; Y_1, Y_2) + 2 η̄(Y_1, Y_2; X_1, X_2) > 2 (1/3) + 2 (1/3) = 4/3.
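The positivity of the bracketed expression for this ordering class can be probed numerically. A minimal sketch (Python; the random p and δ values are hypothetical, and `bracket_231` is a helper name introduced here, not from the text):

```python
import random

random.seed(0)

def bracket_231(p, d):
    """Bracketed expression of Section 2.3.1:
    sum_i d_i^2 (1 - (2/3) p_i) + 2 (p3 + p4) d1 d2 + 2 (p1 + p2) d3 d4."""
    s = sum(d[i]**2 * (1.0 - (2.0/3.0) * p[i]) for i in range(4))
    s += 2.0 * (p[2] + p[3]) * d[0] * d[1] + 2.0 * (p[0] + p[1]) * d[2] * d[3]
    return s

for _ in range(1000):
    w = [random.random() for _ in range(4)]
    p = [v / sum(w) for v in w]                       # random probability vector
    d = [random.uniform(-0.01, 0.01) for _ in range(3)]
    d.append(-sum(d))                                 # perturbations sum to zero
    if all(0.0 < p[i] + d[i] < 1.0 for i in range(4)):
        assert bracket_231(p, d) > 0.0
print("bracket positive in all sampled cases")
```

The check agrees with the algebra: the δ_i^2 coefficients 1 − (2/3) p_i are large enough to dominate the two cross-product terms whenever the perturbation is non-null.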
2.3.2 Expected value of φ when the closest and second closest pairs have one point in common.

Now the case where the closest pair of points and the second closest do have one point in common is considered. Without loss of generality the points can be numbered so that the closest pair is (1, 2) and the second closest is (1, 3). Then there are 4! = 24 possible matrices of ranks of distances between pairs of points. The program in Appendix 2 shows that for all the 24 matrices, the kernel function assumes only 3 different values, exemplified by the ones for which the orderings of distances between pairs are d_12 < d_13 < d_14 < d_23 < d_24 < d_34; d_12 < d_13 < d_14 < d_24 < d_23 < d_34; and d_12 < d_13 < d_14 < d_34 < d_23 < d_24, respectively. The expected value of the kernel function will be calculated for each one of these 3 cases.
2.3.2.1 Expected value of η, assuming d_12 < d_13 < d_14 < d_23 < d_24 < d_34.

The distribution of η = η(X_1, X_2; Y_1), in this case, is shown in Table 2.3. Thus, the expected value of η is given by

E(η) = (1/3)(p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4) + p_1^2 q_2 + p_1^2 q_3 + p_1^2 q_4
       + p_2^2 q_1 + p_2^2 q_3 + p_2^2 q_4 + p_3^2 q_1 + p_3^2 q_2 + p_3^2 q_4 + p_4^2 q_1
       + p_4^2 q_2 + p_4^2 q_3 + 2 (p_1 p_2 q_3 + p_1 p_2 q_4 + p_1 p_3 q_4 + p_2 p_3 q_4).
Under the alternative, η̄ is given, as before, by (1/2){A + B + C + D + E}, where A, B, C and D are as in Section 2.3.1, and

E = 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_4 δ_1 δ_3 + p_4 δ_2 δ_3).

As before, A = 2/3 and B = D = 0. Therefore,

η̄ = 1/3 + (1/2)[δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
     + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4)
     + 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_4 δ_1 δ_3 + p_4 δ_2 δ_3)]
   > 1/3, since the expression in brackets is greater than 0.

Thus, similarly to the previous case, E(φ) > 4/3.
Table 2.3. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 < d_13 < d_14 < d_23 < d_24 < d_34.

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_12, d_12)       p_1^2 q_2 + p_2^2 q_1
1      (0, d_13, d_13)       p_1^2 q_3 + p_3^2 q_1
1      (0, d_14, d_14)       p_1^2 q_4 + p_4^2 q_1
1      (0, d_23, d_23)       p_2^2 q_3 + p_3^2 q_2
1      (0, d_24, d_24)       p_2^2 q_4 + p_4^2 q_2
1      (0, d_34, d_34)       p_3^2 q_4 + p_4^2 q_3
1      (d_12, d_13, d_23)    p_1 p_2 q_3
1      (d_12, d_23, d_13)    p_1 p_2 q_3
1      (d_12, d_14, d_24)    p_1 p_2 q_4
1      (d_12, d_24, d_14)    p_1 p_2 q_4
1      (d_13, d_14, d_34)    p_1 p_3 q_4
1      (d_13, d_34, d_14)    p_1 p_3 q_4
1      (d_23, d_24, d_34)    p_2 p_3 q_4
1      (d_23, d_34, d_24)    p_2 p_3 q_4
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
2.3.2.2 Expected value of η, assuming d_12 < d_13 < d_14 < d_24 < d_23 < d_34.

The distribution of η = η(X_1, X_2; Y_1), in this case, is shown in Table 2.4. Thus, the expected value of η is given by
E(η) = (1/3)(p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4) + p_1^2 q_2 + p_1^2 q_3 + p_1^2 q_4
       + p_2^2 q_1 + p_2^2 q_3 + p_2^2 q_4 + p_3^2 q_1 + p_3^2 q_2 + p_3^2 q_4 + p_4^2 q_1
       + p_4^2 q_2 + p_4^2 q_3 + 2 (p_1 p_2 q_3 + p_1 p_2 q_4 + p_1 p_3 q_4 + p_2 p_4 q_3).

Under the alternative, as before, η̄ is equal to (1/2){A + B + C + D + E}, where A, B, C and D are as in Section 2.3.1, and
E = 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_4 δ_1 δ_3 + p_3 δ_2 δ_4).

As before, A = 2/3 and B = D = 0. Therefore,

η̄ = 1/3 + (1/2)[δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
     + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4)
     + 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_4 δ_1 δ_3 + p_3 δ_2 δ_4)].

The expression in brackets can be negative. For example, if p_1 = p_2 = 0.01, p_3 = p_4 = 0.49, δ_1 = δ_4 = −0.005, δ_2 = δ_3 = 0.005, the expression is equal to −0.733 × 10^−5. Thus for this group of orderings of distances the test is not consistent.
Table 2.4. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 < d_13 < d_14 < d_24 < d_23 < d_34.

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_12, d_12)       p_1^2 q_2 + p_2^2 q_1
1      (0, d_13, d_13)       p_1^2 q_3 + p_3^2 q_1
1      (0, d_14, d_14)       p_1^2 q_4 + p_4^2 q_1
1      (0, d_23, d_23)       p_2^2 q_3 + p_3^2 q_2
1      (0, d_24, d_24)       p_2^2 q_4 + p_4^2 q_2
1      (0, d_34, d_34)       p_3^2 q_4 + p_4^2 q_3
1      (d_12, d_13, d_23)    p_1 p_2 q_3
1      (d_12, d_23, d_13)    p_1 p_2 q_3
1      (d_12, d_14, d_24)    p_1 p_2 q_4
1      (d_12, d_24, d_14)    p_1 p_2 q_4
1      (d_13, d_14, d_34)    p_1 p_3 q_4
1      (d_13, d_34, d_14)    p_1 p_3 q_4
1      (d_24, d_23, d_34)    p_2 p_4 q_3
1      (d_24, d_34, d_23)    p_2 p_4 q_3
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
2.3.2.3 Expected value of η, assuming d_12 < d_13 < d_14 < d_34 < d_23 < d_24.

The distribution of η = η(X_1, X_2; Y_1), in this case, is shown in Table 2.5. Thus, the expected value of η is given by

E(η) = (1/3)(p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4) + p_1^2 q_2 + p_1^2 q_3 + p_1^2 q_4
       + p_2^2 q_1 + p_2^2 q_3 + p_2^2 q_4 + p_3^2 q_1 + p_3^2 q_2 + p_3^2 q_4 + p_4^2 q_1
       + p_4^2 q_2 + p_4^2 q_3 + 2 (p_1 p_2 q_3 + p_1 p_2 q_4 + p_1 p_3 q_4 + p_3 p_4 q_2).
Under the alternative, as previously, η̄ is equal to (1/2){A + B + C + D + E}, where A, B, C and D are as in Section 2.3.1, and

E = 2 (p_3 δ_1 δ_2 + p_4 δ_1 δ_2 + p_4 δ_1 δ_3 + p_2 δ_3 δ_4).

As before, A = 2/3 and B = D = 0. Therefore,

η̄ = 1/3 + (1/2)[δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
     + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4)
     + 2 p_3 δ_1 δ_2 + 2 p_4 δ_1 δ_2 + 2 p_4 δ_1 δ_3 + 2 p_2 δ_3 δ_4].

Similarly to the previous case, the expression in brackets can be negative. For example, if p_1 = p_2 = p_3 = 0.01, p_4 = 0.97, δ_1 = δ_4 = −0.005, δ_2 = δ_3 = 0.005, the expression is equal to −0.733 × 10^−5. Thus for this group of orderings of distances, the test is not consistent.
Table 2.5. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 < d_13 < d_14 < d_34 < d_23 < d_24.

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_12, d_12)       p_1^2 q_2 + p_2^2 q_1
1      (0, d_13, d_13)       p_1^2 q_3 + p_3^2 q_1
1      (0, d_14, d_14)       p_1^2 q_4 + p_4^2 q_1
1      (0, d_23, d_23)       p_2^2 q_3 + p_3^2 q_2
1      (0, d_24, d_24)       p_2^2 q_4 + p_4^2 q_2
1      (0, d_34, d_34)       p_3^2 q_4 + p_4^2 q_3
1      (d_12, d_13, d_23)    p_1 p_2 q_3
1      (d_12, d_23, d_13)    p_1 p_2 q_3
1      (d_12, d_14, d_24)    p_1 p_2 q_4
1      (d_12, d_24, d_14)    p_1 p_2 q_4
1      (d_13, d_14, d_34)    p_1 p_3 q_4
1      (d_13, d_34, d_14)    p_1 p_3 q_4
1      (d_34, d_23, d_24)    p_3 p_4 q_2
1      (d_34, d_24, d_23)    p_3 p_4 q_2
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
2.4 Proof of consistency assuming distances between pairs of points are all equal.

The distribution of η = η(X_1, X_2; Y_1) in this case is shown in Table 2.6. Thus, the expected value of η is given by

E(η) = (1/3)(p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4) + p_1^2 q_2 + p_1^2 q_3 + p_1^2 q_4
       + p_2^2 q_1 + p_2^2 q_3 + p_2^2 q_4 + p_3^2 q_1 + p_3^2 q_2 + p_3^2 q_4 + p_4^2 q_1
       + p_4^2 q_2 + p_4^2 q_3
       + (1/3)[2 (p_1 p_2 q_3 + p_1 p_2 q_4 + p_1 p_3 q_2 + p_1 p_3 q_4 + p_1 p_4 q_2 + p_1 p_4 q_3
       + p_2 p_3 q_1 + p_2 p_3 q_4 + p_2 p_4 q_1 + p_2 p_4 q_3 + p_3 p_4 q_1 + p_3 p_4 q_2)].
Under the alternative, as previously, η̄ can be written as (1/2){A + B + C + D + E}, where A, B, C and D are as in Section 2.3.1, and

E = (1/3)[2 (p_3 + p_4) δ_1 δ_2 + 2 (p_2 + p_4) δ_1 δ_3 + 2 (p_2 + p_3) δ_1 δ_4
     + 2 (p_1 + p_4) δ_2 δ_3 + 2 (p_1 + p_3) δ_2 δ_4 + 2 (p_1 + p_2) δ_3 δ_4].

Therefore,

η̄ = 1/3 + (1/2){δ_1^2 ((1/3) p_1 + p_2 + p_3 + p_4) + δ_2^2 (p_1 + (1/3) p_2 + p_3 + p_4)
     + δ_3^2 (p_1 + p_2 + (1/3) p_3 + p_4) + δ_4^2 (p_1 + p_2 + p_3 + (1/3) p_4)
     + (1/3)[2 (p_3 + p_4) δ_1 δ_2 + 2 (p_2 + p_4) δ_1 δ_3 + 2 (p_2 + p_3) δ_1 δ_4
     + 2 (p_1 + p_4) δ_2 δ_3 + 2 (p_1 + p_3) δ_2 δ_4 + 2 (p_1 + p_2) δ_3 δ_4]}
   > 1/3, since the expression in curly brackets is greater than 0.
Table 2.6. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 = d_13 = d_14 = d_23 = d_24 = d_34 (i, j, k denote distinct points in {1, 2, 3, 4}):

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_12, d_12)       p_1^2 q_2 + p_2^2 q_1
1      (0, d_13, d_13)       p_1^2 q_3 + p_3^2 q_1
1      (0, d_14, d_14)       p_1^2 q_4 + p_4^2 q_1
1      (0, d_23, d_23)       p_2^2 q_3 + p_3^2 q_2
1      (0, d_24, d_24)       p_2^2 q_4 + p_4^2 q_2
1      (0, d_34, d_34)       p_3^2 q_4 + p_4^2 q_3
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + p_3^2 q_3 + p_4^2 q_4
1/3    (d_ij, d_ik, d_jk)    p_i p_j q_k (each of the 24 ordered configurations)
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
Table 2.7. Joint distribution of (D_0, D_1, D_2) when X and Y have a discrete distribution over n points (i, j, k denote distinct points in {1, ..., n}):

(D_0, D_1, D_2)         Probabilities
(0, 0, 0)               p_1^2 q_1 + p_2^2 q_2 + ... + p_n^2 q_n
(0, d_ij, d_ij)         p_i^2 q_j + p_j^2 q_i
(d_ij, 0, d_ij)         p_i p_j q_i + p_i p_j q_j
(d_ij, d_ij, 0)         p_i p_j q_i + p_i p_j q_j
(d_ij, d_ik, d_jk)      p_i p_j q_k
The expression in curly brackets can be rewritten as

(1/3)(δ_1^2 + δ_2^2 + δ_3^2 + δ_4^2)
+ (1/3)[(p_3 + p_4)(δ_1 + δ_2)^2 + (p_2 + p_4)(δ_1 + δ_3)^2 + (p_2 + p_3)(δ_1 + δ_4)^2
+ (p_1 + p_4)(δ_2 + δ_3)^2 + (p_1 + p_3)(δ_2 + δ_4)^2 + (p_1 + p_2)(δ_3 + δ_4)^2],

which is > 0.
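The sum-of-squares rearrangement can be verified numerically. The sketch below (Python; the random p and δ values are illustrative, and the function names are introduced here) compares the curly-bracket expression of Section 2.4 with the sum-of-squares form and confirms positivity:

```python
import random

random.seed(1)

def curly(p, d):
    """Curly-bracket expression of Section 2.4 (all distances equal)."""
    s = sum(d[i]**2 * (1.0 - (2.0/3.0) * p[i]) for i in range(4))
    for i in range(4):
        for j in range(i + 1, 4):
            k, l = (m for m in range(4) if m not in (i, j))
            s += (2.0/3.0) * (p[k] + p[l]) * d[i] * d[j]
    return s

def sum_of_squares_form(p, d):
    """(1/3) sum_i d_i^2 + (1/3) sum_{i<j} (p_k + p_l)(d_i + d_j)^2."""
    s = sum(v**2 for v in d) / 3.0
    for i in range(4):
        for j in range(i + 1, 4):
            k, l = (m for m in range(4) if m not in (i, j))
            s += (p[k] + p[l]) * (d[i] + d[j])**2 / 3.0
    return s

for _ in range(200):
    w = [random.random() for _ in range(4)]
    p = [v / sum(w) for v in w]
    d = [random.uniform(-0.05, 0.05) for _ in range(3)]
    d.append(-sum(d))                       # perturbations sum to zero
    assert abs(curly(p, d) - sum_of_squares_form(p, d)) < 1e-12
    assert curly(p, d) > 0.0                # hence the test is consistent here
print("identity and positivity confirmed on sampled cases")
```

The identity only uses p_1 + p_2 + p_3 + p_4 = 1; the leading (1/3) Σ δ_i^2 term makes the whole expression strictly positive for any non-null perturbation.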
2.5 Proof of consistency for discrete distributions over n points assuming all distances equal.

The distribution of η in this case is given in Table 2.8. Under H_a, the average expected value of η, η̄, can be written as (1/2){A + B + C + D + E}, where

A = (2/3)(Σ_{i=1}^n p_i)^3 = 2/3,

B = (Σ_{i=1}^n p_i^2)(Σ_{j=1}^n δ_j) = 0,

C = Σ_{i=1}^n δ_i^2 ((1/3) p_i + Σ_{j≠i} p_j),

D = 2 (Σ_{i=1}^{n−1} Σ_{j=i+1}^n p_i p_j)(Σ_{k=1}^n δ_k) = 0,

E = (2/3) Σ_{i=1}^{n−1} Σ_{j=i+1}^n Σ_{k≠i, k≠j} p_k δ_i δ_j.

Therefore,

η̄ = 1/3 + (1/2){Σ_{i=1}^n δ_i^2 ((1/3) p_i + Σ_{j≠i} p_j)
     + (2/3) Σ_{i=1}^{n−1} Σ_{j=i+1}^n Σ_{k≠i, k≠j} p_k δ_i δ_j}
   > 1/3, since the expression between curly brackets is greater than 0.

Hence, as previously, E(φ) > 4/3.
Table 2.8. Distribution of η = η(X_1, X_2; Y_1), assuming d_12 = d_13 = ... = d_1n = d_23 = ... = d_2n = ... = d_{n−1,n} (i, j, k denote distinct points in {1, ..., n}):

η      (D_0, D_1, D_2)       Probabilities
1      (0, d_ij, d_ij)       p_i^2 q_j + p_j^2 q_i
1/3    (0, 0, 0)             p_1^2 q_1 + p_2^2 q_2 + ... + p_n^2 q_n
1/3    (d_ij, d_ik, d_jk)    p_i p_j q_k (each ordered configuration)
0      Otherwise             1 − P(η = 1) − P(η = 1/3)
Thus the consistency of the proposed test depends on the orderings and on the
pattern of ties of distances between points.
2.7 Proof of consistency in the univariate Normal case.

Let X and Y be Normal random variables with means μ_X, μ_Y and variances σ_X^2, σ_Y^2, and let X_1, X_2 and Y be independent observations from X and Y respectively. Then
P{|X_1 − X_2| < min[|X_1 − Y|, |X_2 − Y|]}
= P{(X_(2) − X_(1)) < (X_(1) − Y)} + P{(X_(2) − X_(1)) < (Y − X_(2))},

where X_(1) and X_(2) are the order statistics of X_1 and X_2. So,

E[η(X_1, X_2; Y)] = P{Y < 2 X_(1) − X_(2)} + P{Y > 2 X_(2) − X_(1)}
= P{(Y − X̄) < −(3/2)(X_(2) − X_(1))} + P{(Y − X̄) > (3/2)(X_(2) − X_(1))}
= P{(Y − X̄) < −(3/√2) S_X} + P{(Y − X̄) > (3/√2) S_X}
= 1 − P{−(3/√2) S_X < Y − X̄ < (3/√2) S_X}
= P{(Y − X̄)^2 > (9/2) S_X^2},

where X̄ = (X_(1) + X_(2))/2 and S_X = (X_(2) − X_(1))/√2. But Y, X̄ and S_X are mutually independent, so S_X^2 / σ_X^2 ~ χ_1^2 and Y − X̄ ~ N(δ, σ_Y^2 + σ_X^2/2), where δ = μ_Y − μ_X. Thus

(Y − X̄)^2 / (σ_Y^2 + σ_X^2/2) ~ χ^2_{1,δ^2},

where χ^2_{1,δ^2} represents a non-central chi-squared distribution with noncentrality parameter δ^2 (δ standardized by √(σ_Y^2 + σ_X^2/2)). Hence

E[η(X_1, X_2; Y)] = P{(Y − X̄)^2 / S_X^2 > 9/2}
                  = P{F(1,1,δ^2) > 9 σ_X^2 / (2 σ_Y^2 + σ_X^2)},

where F(1,1,δ^2) represents a non-central F distribution with degrees of freedom 1 and 1 and noncentrality parameter δ^2. Under H_0, μ_X = μ_Y, σ_X = σ_Y, δ = 0 and

E[η(X_1, X_2; Y)] = P{F(1,1,0) > 3} = 1/3.
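The central F(1, 1) tail probabilities used here have a simple closed form: if F has a central F distribution with (1, 1) degrees of freedom, then P{F > c} = (2/π) tan^−1(1/√c), a standard identity assumed in the sketch below (Python):

```python
import math

def f11_tail(c):
    """P{F(1,1,0) > c} for a central F distribution with (1, 1) df."""
    return (2.0 / math.pi) * math.atan(1.0 / math.sqrt(c))

print(abs(f11_tail(3.0) - 1.0 / 3.0) < 1e-12)   # True: P{F > 3} = 1/3
print(round(f11_tail(9.0), 4))                  # 0.2048
```

The second value is the one used in the next paragraph for the limiting case σ_Y^2/σ_X^2 → 0.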
Now, if μ_X = μ_Y and σ_Y^2/σ_X^2 → ∞, then E[η(X_1, X_2; Y)] → P{F(1,1,0) > 0} = 1. If μ_X = μ_Y and σ_Y^2/σ_X^2 → 0, then E[η(X_1, X_2; Y)] → P{F(1,1,0) > 9} = 0.2048, so that E[η(X_1, X_2; Y)] can be less than 1/3. However,

E[φ(X_1, X_2; Y_1, Y_2)] = 2 P{F(1,1,δ^2) > 9 / [2 (σ_Y^2/σ_X^2) + 1]}
                         + 2 P{F(1,1,δ^2) > 9 / [2 (σ_X^2/σ_Y^2) + 1]} > 4/3,

unless μ_X = μ_Y and σ_X = σ_Y. To prove this result, first observe that F(1,1,δ^2) is stochastically increasing in δ^2. Thus, the minimum of E[φ(X_1, X_2; Y_1, Y_2)] is obtained by minimizing
g(x) = P{F(1,1,0) > 9/(2x + 1)} + P{F(1,1,0) > 9/(2/x + 1)}
     = (2/π) {tan^−1 [√(2x + 1) / 3] + tan^−1 [√(2/x + 1) / 3]},

where x = (σ_Y / σ_X)^2. Now,

dg(x)/dx = (6/π) { 1 / [√(2x + 1) (10 + 2x)] − 1 / [√(2/x + 1) (10 x^2 + 2x)] }.

The equation dg(x)/dx = 0 has two real roots, x = 1 and x = −1. Since x cannot be negative, and d^2 g(x)/dx^2 evaluated at x = 1 equals √3/(6π) > 0, g(x) is minimized for x = 1, that is, for σ_X = σ_Y, which proves the result.
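Using the same closed form for the central F(1, 1) tail, g(x) can be evaluated and its minimum at x = 1 confirmed numerically (a sketch; the grid search merely illustrates the analytic argument above):

```python
import math

def g(x):
    """g(x) = P{F(1,1,0) > 9/(2x+1)} + P{F(1,1,0) > 9/(2/x+1)},
    with x = (sigma_Y / sigma_X)^2."""
    tail = lambda c: (2.0 / math.pi) * math.atan(1.0 / math.sqrt(c))
    return tail(9.0 / (2.0 * x + 1.0)) + tail(9.0 / (2.0 / x + 1.0))

# g(1) = 2/3, so E(phi) = 2 g(1) = 4/3 under H0.
assert abs(g(1.0) - 2.0 / 3.0) < 1e-12

grid = [0.05 * k for k in range(1, 200)]        # x in (0, 10)
assert all(g(x) >= g(1.0) - 1e-12 for x in grid)
print("g is minimized at x = 1 with g(1) = 2/3")
```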
2.8 Proof of consistency for X and Y uniformly distributed over circumferences.

Let X be uniformly distributed over the unit circumference (circumference of radius 1) and Y be uniformly distributed over the circumference of radius r (r > 1). Let X_1, X_2, Y_1, Y_2 be observations from X and Y respectively. Then

P{D(X_1, X_2) < min[D(X_1, Y), D(X_2, Y)]} = P{E_1}, say.

Without loss of generality rotate the axes such that the two X's fall symmetrically with respect to the horizontal axis as shown in Figure 2.1. Let θ be the angle formed by the line segment from the origin to X_1 with the horizontal axis; θ is uniformly distributed over (0, π/2). Let x_0 = cos(θ) and y_0 = sin(θ) be the coordinates of X_1. Let α (uniformly distributed over (0, 2π)) be the angle
formed by the line segment from the origin to Y with the horizontal axis. Consider now three circumferences: the circumference centered at (cos(θ_1), sin(θ_1)), where θ_1 is the angle such that 2 sin(θ_1) = r − 1; the circumference centered at (cos(θ_2), sin(θ_2)), where θ_2 is such that this circumference intersects the circumference of radius r at the point (r, 0); and the circumference centered at (cos(θ_3), sin(θ_3)), where θ_3 is such that this circumference intersects the circumference of radius r at the point (−r, 0). Finally consider the circumference centered at (cos(θ), sin(θ)) and let β be the angle between the line segment from the origin to X_1 and the line segment from the origin to the intersection of this circumference with the circumference of radius r, as shown in Figure 2.1. Note that D(X_1, X_2) = 2 sin(θ). Then

P{E_1} = P{E_1 | 0 < θ < θ_1} P{0 < θ < θ_1}
       + P{E_1 | θ_1 < θ < θ_2} P{θ_1 < θ < θ_2}
       + P{E_1 | θ_2 < θ < θ_3} P{θ_2 < θ < θ_3}
       + P{E_1 | θ_3 < θ < π/2} P{θ_3 < θ < π/2}.   (2.8.1)
By observing the geometric characteristics of Figure 2.1, it can be seen that

4 sin^2(θ_2) = 1 + r^2 − 2 r cos(θ_2)   and   4 sin^2(θ_3) = 1 + r^2 − 2 r cos(π − θ_3).

These relations, plus the relation 2 sin(θ_1) = r − 1 and the fact that 0 < θ_2 < θ_3 < π/2, imply that for r ≤ √3,

θ_1 = sin^−1 [(r − 1)/2],
θ_2 = cos^−1 {[r + √(3 (4 − r^2))] / 4},
θ_3 = cos^−1 {[−r + √(3 (4 − r^2))] / 4}.   (2.8.2)
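The closed forms in (2.8.2) can be checked against their defining equations; the sketch below (Python, for the illustrative radius r = 1.2) verifies each relation:

```python
import math

r = 1.2                                   # any radius in (1, sqrt(3))

theta1 = math.asin((r - 1.0) / 2.0)
theta2 = math.acos(( r + math.sqrt(3.0 * (4.0 - r * r))) / 4.0)
theta3 = math.acos((-r + math.sqrt(3.0 * (4.0 - r * r))) / 4.0)

# Defining relations: 2 sin(theta1) = r - 1,
# 4 sin^2(theta2) = 1 + r^2 - 2 r cos(theta2),
# 4 sin^2(theta3) = 1 + r^2 - 2 r cos(pi - theta3).
assert abs(2.0 * math.sin(theta1) - (r - 1.0)) < 1e-12
assert abs(4.0 * math.sin(theta2)**2 - (1.0 + r*r - 2.0*r*math.cos(theta2))) < 1e-12
assert abs(4.0 * math.sin(theta3)**2 - (1.0 + r*r - 2.0*r*math.cos(math.pi - theta3))) < 1e-12
assert 0.0 < theta2 < theta3 < math.pi / 2.0
print("closed forms satisfy the defining relations")
```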
Also 4 sin^2(θ) = 1 + r^2 − 2 r cos(β), which implies

β = β(θ, r) = cos^−1 {[1 + r^2 − 4 sin^2(θ)] / (2r)}.   (2.8.3)

Now, P{E_1 | 0 < θ < θ_1} = 1, since in this region D(X_1, X_2) is always less than min[D(X_1, Y), D(X_2, Y)], and P{E_1 | θ_3 < θ < π/2} = 0, since in this region D(X_1, X_2) is always greater than max[D(X_1, Y), D(X_2, Y)]. When θ_1 < θ < θ_2, and when θ_2 < θ < θ_3, the conditional probabilities can be expressed in terms of the angle β. Substituting these conditional probabilities into (2.8.1) gives the expression for P{E_1}.   (2.8.4)

Figure 2.1. Picture illustrative of the computation of P{D(X_1, X_2) < min[D(X_1, Y), D(X_2, Y)]} for X and Y uniformly distributed over circumferences.
Consider now

P{D(Y_1, Y_2) < min[D(Y_1, X), D(Y_2, X)]} = P{E_2}, say.

Similarly as before, without loss of generality, rotate the axes such that the two Y's fall symmetrically with respect to the horizontal axis as shown in Figure 2.2. Let θ* be the angle formed by the line segment from the origin to Y_1 with the horizontal axis; θ* is uniformly distributed over (0, π/2). Let x_0 = r cos(θ*) and y_0 = r sin(θ*) be the coordinates of Y_1. Let α* (uniformly distributed over (0, 2π)) be the angle formed by the line segment from the origin to X with the horizontal axis. Consider now the circumference centered at (r cos(θ_1*), r sin(θ_1*)), where θ_1* is the angle such that 2 r sin(θ_1*) = r − 1; the circumference centered at (r cos(θ_2*), r sin(θ_2*)), where θ_2* is such that this circumference intersects the unit circumference at the point (1, 0); and the circumference centered at (r cos(θ_3*), r sin(θ_3*)), where θ_3* is such that this circumference intersects the unit circumference at the point (−1, 0). Finally consider the circumference centered at (r cos(θ*), r sin(θ*)) and let β* be the angle between the line segment from the origin to Y_1 and the line segment from the origin to the intersection of this circumference with the unit circumference (see Figure 2.2). Note that D(Y_1, Y_2) = 2 r sin(θ*). The angles θ_1*, θ_2*, θ_3* and β* are similarly expressed as functions of r using the geometric properties in Figure 2.2: 2 r sin(θ_1*) = r − 1; 4 r^2 sin^2(θ_2*) = 1 + r^2 − 2 r cos(θ_2*), and 4 r^2 sin^2(θ_3*) = 1 + r^2 − 2 r cos(π − θ_3*). For r ≤ √3,
θ_1* = sin^−1 [(r − 1)/(2r)],
θ_2* = cos^−1 {[1 + √(12 r^2 − 3)] / (4r)},
θ_3* = cos^−1 {[−1 + √(12 r^2 − 3)] / (4r)}.   (2.8.5)

Also, 4 r^2 sin^2(θ*) = 1 + r^2 − 2 r cos(β*), which implies

β* = β*(θ*, r) = cos^−1 {[1 + r^2 − 4 r^2 sin^2(θ*)] / (2r)}.   (2.8.6)
Then P{E_2} is assembled as in (2.8.1), with P{E_2 | 0 < θ* < θ_1*} = 1, P{E_2 | θ_3* < θ* < π/2} = 0, and the conditional probabilities over (θ_1*, θ_2*) and (θ_2*, θ_3*) expressed through integrals of β* with respect to θ*.   (2.8.7)

Therefore E[η(Y_1, Y_2; X)] = P{E_2}.   (2.8.8)
The integrals in (2.8.4) and (2.8.7) were evaluated numerically, using IMSL subroutine DQDAG, and results are presented in Table 2.9. Note that, as a function of r, E[η(X_1, X_2; Y)] increases and E[η(Y_1, Y_2; X)] decreases, but E[η(X_1, X_2; Y)] + E[η(Y_1, Y_2; X)] increases (see Figure 2.3), so that to minimize this function it suffices to consider values of r close to 1. Numerical minimization was done using IMSL subroutine DUVMGS, and the minimum was found at r = 1 with function value 2/3. By interchanging the roles of X and Y the same results are obtained for r < 1, so that the test is consistent to test H_0: r = 1 vs H_a: r ≠ 1.
Figure 2.2. Picture illustrative of the computation of P{D(Y_1, Y_2) < min[D(Y_1, X), D(Y_2, X)]} for X and Y uniformly distributed over circumferences.
Table 2.9. Numerical evaluation of the functions in (2.8.4) and (2.8.7) for values of r close to 1.

r       E[η(X_1,X_2;Y)]    E[η(Y_1,Y_2;X)]    E[η(X_1,X_2;Y)] + E[η(Y_1,Y_2;X)]
1.00    0.33333316         0.33333333         0.66666649
1.05    0.34099165         0.32724485         0.66823650
1.10    0.34955042         0.32250085         0.67205127
1.15    0.35879903         0.31871992         0.67751895
1.20    0.36863258         0.31567433         0.68430691
1.25    0.37898579         0.31320554         0.69219133
1.30    0.38981582         0.31119706         0.70101288
1.35    0.40109478         0.30956058         0.71065536
1.40    0.41280597         0.30822772         0.72103369
1.45    0.42494168         0.30714468         0.73208636
1.50    0.43750209         0.30626860         0.74377069
Figure 2.3. E(η) as a function of r [curves for E[η(X_1, X_2; Y)], E[η(Y_1, Y_2; X)] and their sum].
CHAPTER 3
ASYMPTOTIC DISTRIBUTION OF V mn
3.1 Introduction
In this chapter the asymptotic distribution, under the null hypothesis, of the test statistic proposed in (1.5.3) is studied. Since the distribution of the test statistic depends on the underlying distribution of X and Y, the permutation principle, for small samples, and bootstrap methods, for large samples, are used to develop a conditionally asymptotically distribution-free test. The same methods are used to establish the null asymptotic distribution of the test statistic proposed in Barakat, Quade and Salama (1990) and given in (1.4.6).
3.2 Permutational Distribution of V_mn

Let X_1, X_2, ..., X_m and Y_1, Y_2, ..., Y_n be random samples from X and Y ∈ R^d, with distribution functions F(·) and G(·) respectively. Let N = m + n denote the size of the combined sample. Define

Z_i = X_i,       i = 1, 2, ..., m,
Z_i = Y_{i−m},   i = m + 1, m + 2, ..., N.   (3.2.1)

Under H_0, Z = (Z_1, Z_2, ..., Z_N) is composed of N i.i.d. random vectors. Thus, given Z, all possible permutations of the coordinates are equally likely, each with probability 1/N!. Hence all possible partitionings of the N observations into 2 groups, with m and n observations respectively, are conditionally equally likely, each having probability (N choose m)^−1, which is independent of Z and F = G. The permutation distribution of V_mn is induced by the probability measure (N choose m)^−1, and the test based on V_mn is conditionally distribution free. So, given Z, V_mn is computed for all (N choose m) possible partitionings of the N observations. Let M = (N choose m) and let the ordered values of V_mn be represented by V_(1), V_(2), ..., V_(M). Let V_0 be the observed value of the test statistic and M_0 be the number of V_(i)'s greater than or equal to V_0. Then the p-value is given by M_0/M. Let 0 < α < 1 be the level of significance. The test rejects if V_0 ≥ V_([M(1 − α) + 1]), where [M(1 − α) + 1] represents the greatest integer less than or equal to M(1 − α) + 1. The theory of permutation tests is discussed in Puri and Sen (1971).
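The permutation procedure described above can be sketched as follows (illustrative Python; the statistic used here is a simple difference of means standing in for V_mn, and sampled permutations approximate the full enumeration over the (N choose m) partitionings when that number is large):

```python
import random

random.seed(2)

def perm_pvalue(xs, ys, stat, n_perm=2000):
    """Permutation p-value: proportion of random partitionings of the
    pooled sample into groups of sizes m and n with statistic >= observed."""
    z = xs + ys
    m = len(xs)
    t_obs = stat(xs, ys)
    count = 0
    for _ in range(n_perm):
        random.shuffle(z)
        if stat(z[:m], z[m:]) >= t_obs:
            count += 1
    return count / n_perm

# Placeholder statistic (absolute difference of means), not V_mn itself.
diff_means = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))

xs = [random.gauss(0.0, 1.0) for _ in range(15)]
ys = [random.gauss(2.0, 1.0) for _ in range(15)]       # clearly shifted sample
print(perm_pvalue(xs, ys, diff_means) < 0.05)          # H0 rejected
```

Because the conditional null distribution is generated by the data themselves, the same skeleton applies to V_mn unchanged: only `stat` is replaced.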
3.3 Asymptotic distribution of V_mn

Consider the kernel defined in (1.5.2):

φ(X_1, X_2; Y_1, Y_2) = η(X_1, X_2; Y_1) + η(X_1, X_2; Y_2) + η(Y_1, Y_2; X_1) + η(Y_1, Y_2; X_2).

Under H_0, E[η(X_1, X_2; Y_1)] = 1/3 and E[φ(X_1, X_2; Y_1, Y_2)] = 4/3. Consider the test statistic defined in (1.5.3):

V_mn = (m choose 2)^−1 (n choose 2)^−1 Σ_{i<j} Σ_{k<l} φ(X_i, X_j; Y_k, Y_l).
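A direct, brute-force evaluation of V_mn from these definitions can be sketched for small samples (illustrative Python; Euclidean distance in R^d, with η taken as the indicator that the within-pair distance is strictly smallest — the 1/3 tie value is irrelevant for continuous data):

```python
import itertools
import math
import random

random.seed(3)

def dist(a, b):
    return math.sqrt(sum((u - v)**2 for u, v in zip(a, b)))

def eta(x1, x2, y):
    """eta(x1, x2; y) = I{D(x1, x2) < min[D(x1, y), D(x2, y)]}."""
    return 1.0 if dist(x1, x2) < min(dist(x1, y), dist(x2, y)) else 0.0

def phi(x1, x2, y1, y2):
    return eta(x1, x2, y1) + eta(x1, x2, y2) + eta(y1, y2, x1) + eta(y1, y2, x2)

def v_mn(xs, ys):
    m, n = len(xs), len(ys)
    total = sum(phi(xs[i], xs[j], ys[k], ys[l])
                for i, j in itertools.combinations(range(m), 2)
                for k, l in itertools.combinations(range(n), 2))
    return total / (math.comb(m, 2) * math.comb(n, 2))

# Under H0 (identical bivariate normal distributions), E(V_mn) = 4/3.
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]
ys = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]
print(abs(v_mn(xs, ys) - 4.0 / 3.0) < 0.2)
```

This O(m^2 n^2) enumeration is only practical for small samples, which is precisely the regime where the permutation distribution of Section 3.2 is used.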
Define the conditional expectations,
φ_10(x) = E[φ(x, X_2; Y_1, Y_2)],
φ_01(y) = E[φ(X_1, X_2; y, Y_2)],
φ_11(x, y) = E[φ(x, X_2; y, Y_2)],
φ_20(x_1, x_2) = E[φ(x_1, x_2; Y_1, Y_2)],
φ_02(y_1, y_2) = E[φ(X_1, X_2; y_1, y_2)],
φ_21(x_1, x_2; y) = E[φ(x_1, x_2; y, Y_2)],
φ_12(x; y_1, y_2) = E[φ(x, X_2; y_1, y_2)],
φ_22(x_1, x_2; y_1, y_2) = φ(x_1, x_2; y_1, y_2).   (3.3.1)
Define also,

ψ_10(x) = φ_10(x) − E(φ),
ψ_01(y) = φ_01(y) − E(φ),
ψ_11(x, y) = φ_11(x, y) − φ_10(x) − φ_01(y) + E(φ),
ψ_20(x_1, x_2) = φ_20(x_1, x_2) − φ_10(x_1) − φ_10(x_2) + E(φ),
ψ_02(y_1, y_2) = φ_02(y_1, y_2) − φ_01(y_1) − φ_01(y_2) + E(φ),
ψ_12(x; y_1, y_2) = φ_12(x; y_1, y_2) − φ_11(x, y_1) − φ_11(x, y_2) − φ_02(y_1, y_2)
                    + φ_10(x) + φ_01(y_1) + φ_01(y_2) − E(φ),
ψ_21(x_1, x_2; y) = φ_21(x_1, x_2; y) − φ_11(x_1, y) − φ_11(x_2, y) − φ_20(x_1, x_2)
                    + φ_01(y) + φ_10(x_1) + φ_10(x_2) − E(φ),
ψ_22(x_1, x_2; y_1, y_2) = φ_22(x_1, x_2; y_1, y_2) − φ_12(x_1; y_1, y_2) − φ_12(x_2; y_1, y_2)
                    − φ_21(x_1, x_2; y_1) − φ_21(x_1, x_2; y_2) + φ_02(y_1, y_2) + φ_20(x_1, x_2)
                    + φ_11(x_1, y_1) + φ_11(x_1, y_2) + φ_11(x_2, y_1) + φ_11(x_2, y_2)
                    − φ_01(y_1) − φ_01(y_2) − φ_10(x_1) − φ_10(x_2) + E(φ).   (3.3.2)
Then V_mn can be written in terms of its H-decomposition (1.2.4.3) as

V_mn = E(φ) + 2 H_mn^(10) + 2 H_mn^(01) + 4 H_mn^(11) + H_mn^(20) + H_mn^(02)
       + 2 H_mn^(21) + 2 H_mn^(12) + H_mn^(22),   (3.3.3)

where

H_mn^(10) = (1/m) Σ_{i=1}^m ψ_10(X_i),
H_mn^(01) = (1/n) Σ_{k=1}^n ψ_01(Y_k),
H_mn^(11) = [1/(m n)] Σ_{i=1}^m Σ_{k=1}^n ψ_11(X_i, Y_k),
H_mn^(20) = [2/(m (m − 1))] Σ_{1≤i<j≤m} ψ_20(X_i, X_j),
H_mn^(02) = [2/(n (n − 1))] Σ_{1≤k<l≤n} ψ_02(Y_k, Y_l),
H_mn^(21) = [2/(m (m − 1) n)] Σ_{1≤i<j≤m} Σ_{k=1}^n ψ_21(X_i, X_j; Y_k),
H_mn^(12) = [2/(n (n − 1) m)] Σ_{i=1}^m Σ_{1≤k<l≤n} ψ_12(X_i; Y_k, Y_l),
H_mn^(22) = [4/(m (m − 1) n (n − 1))] Σ_{1≤i<j≤m} Σ_{1≤k<l≤n} ψ_22(X_i, X_j; Y_k, Y_l).   (3.3.4)
Now, E(H_mn^(ij)) = 0 for all i, j, since E(φ_ij) = E(φ). Also,

Var(H_mn^(22)) = E[H_mn^(22)]^2 = O(m^−2 n^−2),
Var(H_mn^(12)) = O(m^−1 n^−2),
Var(H_mn^(21)) = O(m^−2 n^−1),   and
Var(H_mn^(11)) = O(m^−1 n^−1).   (3.3.5)
Furthermore, we have the following result.

Result 1: Under H_0, H_mn^(10) = H_mn^(01) = 0 a.s.
Proof:

φ_10(z) = E[φ(z, X_2; Y_1, Y_2)]
        = E[η(z, X_2; Y_1)] + E[η(z, X_2; Y_2)] + E[η(Y_1, Y_2; z)] + E[η(Y_1, Y_2; X_2)].

Now,

E[η(z, X_2; Y_1)] = P{D(z, X_2) < D(z, Y_1)} × P{D(z, X_2) < D(X_2, Y_1) | D(z, X_2) < D(z, Y_1)}

and, under H_0, clearly P{D(z, X_2) < D(z, Y_1)} = 1/2. Thus

E[η(z, X_2; Y_1)] = E[η(z, X_2; Y_2)] = (1/2) A(z),

where A(z) denotes P{D(z, X_2) < D(X_2, Y_1) | D(z, X_2) < D(z, Y_1)}. Now denote

A_1 = [D(Y_1, Y_2) < D(Y_1, z)],
A_2 = [D(Y_1, Y_2) < D(Y_2, z)],
A_3 = [D(z, Y_1) < D(z, Y_2)].

Then

E[η(Y_1, Y_2; z)] = P{A_3} P{A_1 | A_3} P{A_2 | A_3 ∩ A_1}
                  + P{A_3^c} P{A_2 | A_3^c} P{A_1 | A_3^c ∩ A_2}.

Under H_0, clearly P{A_3} = P{A_3^c} = 1/2, P{A_1 | A_3} = P{A_2 | A_3^c} = 1 − A(z), and P{A_2 | A_3 ∩ A_1} = P{A_1 | A_3^c ∩ A_2} = 1. Thus

E[η(Y_1, Y_2; z)] = (1/2)[1 − A(z)] + (1/2)[1 − A(z)] = 1 − A(z).

Also E[η(Y_1, Y_2; X_2)] = 1/3, since this term does not involve z. Therefore

φ_10(z) = (1/2) A(z) + (1/2) A(z) + [1 − A(z)] + 1/3 = 4/3.

The above result implies φ_10(X) = 4/3 a.s., ψ_10(X) = φ_10(X) − E(φ) = 0 a.s., ζ_10 = Var[φ_10(X)] = 0, and H_mn^(10) = 0 a.s. By symmetry, φ_01(Y) = 4/3 a.s., ψ_01(Y) = 0 a.s., ζ_01 = Var[φ_01(Y)] = 0, and H_mn^(01) = 0 a.s. Q.E.D.
Now,

φ_11(x, y) = E[φ(x, X_2; y, Y_2)]
= P{D(x, X_2) < min[D(x, y), D(X_2, y)]} + P{D(x, X_2) < min[D(x, Y_2), D(X_2, Y_2)]}
+ P{D(y, Y_2) < min[D(y, x), D(Y_2, x)]} + P{D(y, Y_2) < min[D(y, X_2), D(Y_2, X_2)]}.

Denote

B(z) = P{D(z, X_2) < min[D(X_1, X_2), D(z, X_1)]},
C(z) = P{D(X_1, X_2) < min[D(z, X_1), D(z, X_2)]},
D(z) = P{D(z, X_1) < min[D(X_1, X_2), D(z, X_2)]},
Q(x, y) = P{D(x, y) < min[D(x, X), D(y, X)]}.

Then B(z) + C(z) + D(z) = 1 and B(z) = D(z) = [1 − C(z)]/2. Under H_0,

P{D(x, X_2) < min[D(x, Y_2), D(X_2, Y_2)]} = B(x) = [1 − C(x)]/2,
P{D(y, Y_2) < min[D(y, X_2), D(Y_2, X_2)]} = B(y) = [1 − C(y)]/2,

and

P{D(x, X_2) < min[D(x, y), D(X_2, y)]} + P{D(y, Y_2) < min[D(y, x), D(Y_2, x)]} = 1 − Q(x, y).

So

φ_11(x, y) = 2 − C(x)/2 − C(y)/2 − Q(x, y),

and φ_11(x, x) = 1 − C(x) < 1, since Q(x, x) = 1. But, under H_0, E[φ_11(X, Y)] = 4/3. This implies that φ_11(x, y) varies with x and y, which in turn implies that ζ_11 = Var[φ_11(X, Y)] > 0, and

ψ_11(X, Y) = φ_11(X, Y) − φ_10(X) − φ_01(Y) + E(φ) = φ_11(X, Y) − E(φ) a.s.   (3.3.6)
In the same way, it is proved that

ψ_20(X_1, X_2) = φ_20(X_1, X_2) − φ_10(X_1) − φ_10(X_2) + E(φ) = φ_20(X_1, X_2) − E(φ) a.s.   (3.3.7)

and

ψ_02(Y_1, Y_2) = φ_02(Y_1, Y_2) − φ_01(Y_1) − φ_01(Y_2) + E(φ) = φ_02(Y_1, Y_2) − E(φ) a.s.   (3.3.8)
Similar results can be obtained for the test statistic proposed by Barakat, Quade and Salama (1990). In this case the kernel is given by (1.4.5):

φ(X_1, X_2; Y_1, Y_2) = η(X_1, X_2; Y_1) + η(X_1, X_2; Y_2) + η(Y_1, Y_2; X_1) + η(Y_1, Y_2; X_2),

with

η(X_1, X_2; Y_1) = I{D(X_1, X_2) < D(X_1, Y_1)} + I{D(X_1, X_2) < D(X_2, Y_1)}.

Under H_0, E[φ(X_1, X_2; Y_1, Y_2)] = 4. For this kernel a result similar to Result 1 holds.
Result 2: Under H_0, H_mn^(10) = H_mn^(01) = 0 a.s.
Proof:

φ_10(z) = E[φ(z, X_2; Y_1, Y_2)]
= P{D(z, X_2) < D(z, Y_1)} + P{D(z, X_2) < D(X_2, Y_1)}
+ P{D(z, X_2) < D(z, Y_2)} + P{D(z, X_2) < D(X_2, Y_2)}
+ P{D(Y_1, Y_2) < D(Y_1, z)} + P{D(Y_1, Y_2) < D(Y_2, z)}
+ P{D(Y_1, Y_2) < D(Y_1, X_2)} + P{D(Y_1, Y_2) < D(Y_2, X_2)}.

Denote

A(z) = P{D(z, X) < D(z, Y)},
B(z) = P{D(z, X) < D(X, Y)},
C(z) = P{D(z, Y_1) < D(Y_1, Y_2)},
λ = P{D(Y_1, Y_2) < D(Y_1, X)}.

Then

φ_10(z) = E[φ(z, X_2; Y_1, Y_2)] = 2 [A(z) + B(z) + (1 − C(z)) + λ].

Under H_0, clearly A(z) = 1/2, B(z) = C(z) for all z, and λ = 1/2. Hence E[φ(z, X_2; Y_1, Y_2)] = 4 for all z, which implies φ_10(X) = 4 a.s., ψ_10(X) = φ_10(X) − E(φ) = 0 a.s., ζ_10 = Var[φ_10(X)] = 0, and H_mn^(10) = 0 a.s. By symmetry, φ_01(Y) = 4 a.s., ψ_01(Y) = 0 a.s., ζ_01 = 0, and H_mn^(01) = 0 a.s. Q.E.D.
Now,

φ_11(x, y) = E[φ(x, X_2; y, Y_2)]
= P{D(x, X_2) < D(x, y)} + P{D(x, X_2) < D(X_2, y)}
+ P{D(x, X_2) < D(x, Y_2)} + P{D(x, X_2) < D(X_2, Y_2)}
+ P{D(y, Y_2) < D(y, x)} + P{D(y, Y_2) < D(y, X_2)}
+ P{D(y, Y_2) < D(Y_2, x)} + P{D(y, Y_2) < D(Y_2, X_2)}.

But under H_0,

P{D(x, X_2) < D(X_2, y)} = 1 − P{D(y, Y_2) < D(Y_2, x)},
P{D(x, X_2) < D(x, Y_2)} = P{D(y, Y_2) < D(y, X_2)} = 1/2.

Thus

φ_11(x, y) = 2 + B(x) + B(y) + Q(x, y) + Q(y, x),

where B(z) = P{D(z, X) < D(X, Y)} and Q(x, y) = P{D(x, X) < D(x, y)}. Therefore, since Q(x, x) = 0,

φ_11(x, x) = 2 + 2 B(x) < 4.

But, under H_0, E[φ_11(X, Y)] = 4. Thus φ_11(x, y) varies with the values x and y, which implies that ζ_11 = Var[φ_11(X, Y)] > 0, and

ψ_11(X, Y) = φ_11(X, Y) − φ_10(X) − φ_01(Y) + E(φ) = φ_11(X, Y) − E(φ) a.s.
In the same way it is proved that ζ_20 = Var[φ_20(X_1, X_2)] > 0,

ψ_20(X_1, X_2) = φ_20(X_1, X_2) − φ_10(X_1) − φ_10(X_2) + E(φ) = φ_20(X_1, X_2) − E(φ) a.s.,

and ζ_02 = Var[φ_02(Y_1, Y_2)] > 0,

ψ_02(Y_1, Y_2) = φ_02(Y_1, Y_2) − φ_01(Y_1) − φ_01(Y_2) + E(φ) = φ_02(Y_1, Y_2) − E(φ) a.s.
Therefore, in view of the above results, for both test statistics,

√(mn) [V_mn − E(V_mn)] = 4 √(mn) H_mn^(11) + √(n/m) m H_mn^(20) + √(m/n) n H_mn^(02) + o_p(1).

Hence the limiting distribution of 4 √(mn) H_mn^(11) + √(n/m) m H_mn^(20) + √(m/n) n H_mn^(02) provides the desired asymptotic distribution.
Denote
H(l)
= 4 'Jmn H(ll)
. H(2)
mn
mn'
mn
= 'Jffi
-rTf
m H(20)
mn
and H(3)
mn
= 'In
.rm n
H~~). Now, let F m and G n denote the empirical distribution functions of X and Y
respectively. Then
..
= 4 'ljmn
f f'l/Jll(X,
y) dF m(x) d[Gn(y) - G(y)
65
+
G(y)]
= 4 'Jmn
f f'l/Jn(x, y) dF m(x) d[Gn(y) - G(y)], since
f f'l/Jn(x, y) dF m(x) dG(y)
.
-
f f [¢In(x, y) - 9]dFm(x) dG(y)]
-
f f ¢In(x, y) dFm(x) dG(y) - 9
=
f: ~ E[¢J(x;, X
;=
2 ;
Y I , Y 2 )]
-
9
I
Thus
H~~
= 4 'Jmn
JJtPll(X, y) d[Fm(x)
- F(x)
+ F(x)]
d[Gn(y) - G(y)]
= 4 'Jmn f f'l/Jll(x, y) d[Fm(x) - F(x)] d[Gn(y) - G(y)]
= 4 f f'l/Jll(x, y) d{-.sm [Fm(x) - F(x)]} d{vr [Gn(y) - G(y)]}.
Now define
•
Then
66
Similarly,
= ~ f f tP20(Xl' X2)
d{-.sm [Fm(Xl) - F(Xl)]} x
d{-.sm[F m (X2) - F(X2)]
•
•
67
Bootstrap methods will be used to determine the asymptotic distribution. Let H_N denote the empirical distribution function of the combined sample Z1, Z2, ..., ZN:

    H_N(z) = (1/N) Σ_{k=1}^{N} I{Z_k ≤ z},

where I{·} is the indicator function, Z_k = X_k for k = 1, ..., m and Z_k = Y_{k−m} for k = m + 1, ..., N, with Z_k ≤ z meaning that every coordinate of Z_k is less than or equal to the corresponding coordinate of z.

Draw B pairs of bootstrap samples (simple random samples with replacement) from the combined sample. Let S*_{x,i} = {X*_{1i}, X*_{2i}, ..., X*_{mi}} and S*_{y,i} = {Y*_{1i}, Y*_{2i}, ..., Y*_{ni}} denote the i-th pair of bootstrap samples, i = 1, 2, ..., B.

Let F*_{m,i}(x) and G*_{n,i}(y) denote the empirical distribution functions for the i-th pair of bootstrap samples:

    F*_{m,i}(x) = (1/m) Σ_{k=1}^{m} I{X*_{ki} ≤ x},
    G*_{n,i}(y) = (1/n) Σ_{k=1}^{n} I{Y*_{ki} ≤ y}.

Then, the bootstrap estimators of W^(1)_m(x) and W^(2)_n(y) for the i-th pair of bootstrap samples, denoted by W*^(1)_{m,i}(x) and W*^(2)_{n,i}(y), are given respectively by

    W*^(1)_{m,i}(x) = √m [F*_{m,i}(x) − H_N(x)]   and   W*^(2)_{n,i}(y) = √n [G*_{n,i}(y) − H_N(y)].
Now let ψ̂11, ψ̂20 and ψ̂02 be the estimators of ψ11, ψ20 and ψ02 respectively, based on the combined sample Z1, Z2, ..., ZN, and let θ̂ be the corresponding estimator of θ = E(φ); from Result 1 it follows that θ̂ converges to θ. Since ψ̂11 is a bounded continuous function of F*_{m,i}(x) and G*_{n,i}(y), and F*_{m,i}(x) and G*_{n,i}(y) converge in probability to F(x) and G(y) respectively, it follows that as N → ∞, ψ̂11 →_p ψ11. In the same way, as N → ∞, ψ̂20 →_p ψ20 and ψ̂02 →_p ψ02.

Ties among the X*_i's and Y*_i's due to sampling with replacement may occur. Let t_{x,j} denote the number of repetitions of the distinct X*_{ji}'s, j = 1, 2, ..., k, where k is the number of distinct X*_{ji}'s in S*_{x,i}, and denote the set of distinct X*_{ji}'s by S*c_{x,i}. Also let t_{y,j} denote the number of repetitions of the distinct Y*_{ji}'s, j = 1, 2, ..., l, where l is the number of distinct Y*_{ji}'s in S*_{y,i}, and denote the set of distinct Y*_{ji}'s by S*c_{y,i}. If t_{x,j} = 1 for all j, then k = m and S*_{x,i} = S*c_{x,i}; similarly, if t_{y,j} = 1 for all j, then l = n and S*_{y,i} = S*c_{y,i}. Let 𝒩 denote the set of points in the original combined sample. With this notation, the jumps dW*^(1)_{m,i}(x) and dW*^(2)_{n,i}(y) are given by

    dW*^(1)_{m,i}(x) = (t_{x,j} N − m)/(N √m)  if x ∈ S*c_{x,i},
                     = −√m / N                 if x ∈ 𝒩 but x ∉ S*c_{x,i},
                     = 0                       if x ∉ 𝒩,

and

    dW*^(2)_{n,i}(y) = (t_{y,j} N − n)/(N √n)  if y ∈ S*c_{y,i},
                     = −√n / N                 if y ∈ 𝒩 but y ∉ S*c_{y,i},
                     = 0                       if y ∉ 𝒩.
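The jump formula can be illustrated numerically. The sketch below (an illustration, not the program of Appendix 3) computes the W*^(1) jumps for one bootstrap sample, using the same formula with t_{x,j} = 0 at combined-sample points that were not drawn; since F*_{m,i} and H_N are both proper distribution functions, the signed jumps over the combined sample must add to zero.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

m, n = 8, 12
N = m + n
pool = np.arange(N)                          # combined sample, distinct points
xs = rng.choice(pool, size=m, replace=True)  # bootstrap X-sample (ties possible)

t = Counter(xs)                              # multiplicities t_{x,j}
# jump of W*^(1) at each combined-sample point (t = 0 if not drawn)
jumps = {z: (t.get(z, 0) * N - m) / (N * np.sqrt(m)) for z in pool}

total = sum(jumps.values())
print(abs(total) < 1e-12)  # True: the jumps sum to zero
```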
Denoting by H*_{mn,i}, H**_{mn,i} and H***_{mn,i} the bootstrap estimators of H^(1)_mn, H^(2)_mn and H^(3)_mn for the i-th sample, it follows that

    H*_{mn,i} = 4 Σ_{x ∈ S*c_{x,i}} Σ_{y ∈ S*c_{y,i}}
                [(t_{x,j} N − m)/(N √m)] [(t_{y,j} N − n)/(N √n)] [φ̂11(x, y) − θ̂],

    H**_{mn,i} = √(n/m) Σ_{x1 ∈ S*c_{x,i}} Σ_{x2 ∈ S*c_{x,i}}
                 [(t_{x1,j} N − m)/(N √m)] [(t_{x2,j} N − m)/(N √m)] [φ̂20(x1, x2) − θ̂],

and

    H***_{mn,i} = √(m/n) Σ_{y1 ∈ S*c_{y,i}} Σ_{y2 ∈ S*c_{y,i}}
                  [(t_{y1,j} N − n)/(N √n)] [(t_{y2,j} N − n)/(N √n)] [φ̂02(y1, y2) − θ̂].
In the case of no ties, t_{x,j} = 1 and t_{y,j} = 1 for all j, so that

    dW*^(1)_{m,i}(x) = (N − m)/(N √m) = n/(N √m)  if x ∈ S*_{x,i},
                     = −√m / N                    if x ∈ 𝒩 but x ∉ S*_{x,i},
                     = 0                          if x ∉ 𝒩,

    dW*^(2)_{n,i}(y) = m/(N √n)                   if y ∈ S*_{y,i},
                     = −√n / N                    if y ∈ 𝒩 but y ∉ S*_{y,i},
                     = 0                          if y ∉ 𝒩,

and the above expressions reduce to

    H*_{mn,i} = [4 √(mn)/N²] Σ_{x ∈ S*_{x,i}} Σ_{y ∈ S*_{y,i}} [φ̂11(x, y) − θ̂],

    H**_{mn,i} = [√(n/m) n²/(N² m)] Σ_{x1 ∈ S*_{x,i}} Σ_{x2 ∈ S*_{x,i}} [φ̂20(x1, x2) − θ̂],

and

    H***_{mn,i} = [√(m/n) m²/(N² n)] Σ_{y1 ∈ S*_{y,i}} Σ_{y2 ∈ S*_{y,i}} [φ̂02(y1, y2) − θ̂].
Now denoting by V*_i the bootstrap estimator of the asymptotic value of the test statistic for the i-th pair of bootstrap samples {S*_{x,i}, S*_{y,i}}, by the above results, as m and n → ∞,

    √(mn) (V*_i − E(V_mn)) ≈ H*_{mn,i} + H**_{mn,i} + H***_{mn,i},   i = 1, 2, ..., B.

To determine the p-value, let V_0 be the observed value of the test statistic in the original combined sample and V_(1), V_(2), ..., V_(B) be the ordered bootstrap values of the test statistic. Let B_0 be the number of V_(i)'s greater than or equal to √(mn) (V_0 − E(V_mn)); then the p-value is given by B_0/B. For a level of significance α, the test rejects if √(mn) (V_0 − E(V_mn)) ≥ V_([B(1−α)+1]), where [B(1−α)+1] represents the greatest integer less than B(1−α) + 1. A computer program that produces the estimated bootstrap asymptotic distribution is presented in Appendix 3.
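A simplified sketch of this bootstrap p-value scheme is given below; it is an illustration only (the actual program appears in Appendix 3), it works with univariate data and a generic statistic passed in by the caller, and it recomputes the normalized statistic on each resampled pair directly rather than through the H-decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_p_value(x, y, stat, theta, B=200):
    """Draw B pairs of bootstrap samples from the combined sample and
    compare the normalized observed statistic with the bootstrap values."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y])
    observed = np.sqrt(m * n) * (stat(x, y) - theta)
    boot = np.empty(B)
    for i in range(B):
        xs = rng.choice(z, size=m, replace=True)
        ys = rng.choice(z, size=n, replace=True)
        boot[i] = np.sqrt(m * n) * (stat(xs, ys) - theta)
    # p-value: proportion of bootstrap values at least as large as observed
    return np.mean(boot >= observed)

x = rng.normal(size=15)
y = rng.normal(size=15) + 0.5
p = bootstrap_p_value(x, y, lambda a, b: abs(a.mean() - b.mean()), theta=0.0, B=200)
```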
CHAPTER 4
•
SIMULATION STUDY OF THE POWER
4.1 Introduction
In this chapter a Monte Carlo simulation is performed to study the power of the test based on V_mn. The asymptotic distribution developed in Chapter 3 is used to obtain critical values, and the procedure used is similar to that given in Whaley and Quade (1985) for estimating the power of the multidimensional runs test. Results are compared with Hotelling's T² test and Schilling's and Barakat's tests [Barakat (1989)].
4.2 Procedure
As mentioned before the procedure to compute the power is similar to that
used by Whaley and Quade (1985). The power is computed for shift alternatives.
Let X and Y be two d-dimensional populations with respective densities f(x) and g(x + δ), differing only by a location shift δ. Then the power is the probability of rejecting the null hypothesis when applying the test of homogeneity based on V_mn to these populations. As described in Whaley (1983), the power depends on the following factors:
1) The type I error. Let α = 0.05.
2) The sample sizes. Let m = n = 5, 10 and 25.
3) The number of dimensions. Let d = 2, 5, 10 and 20.
4) The common distribution of the two populations under H0. In order to be able to compare the power of this test with the power of Hotelling's T² test, the multivariate normal distribution with all variances equal to 1 and all correlations equal to ρ (ρ = 0 or 0.36) is considered. Thus the d × d covariance matrix is of the form Σ = ((σ_ij)) with σ_ii = 1 and σ_ij = ρ, i ≠ j = 1, 2, ..., d.
5) The size and direction of the shift. When ρ = 0 the power depends only on the size of the shift, but for ρ = 0.36 the power depends also on the direction of the shift. When ρ = 0 the same amount of shift in the same direction is applied to every dimension. Let δ = (δ*, δ*, ..., δ*)′ represent the d-dimensional shift vector. Term this shift a "same direction shift" (SDS). When ρ = 0.36, if d is even, δ = (δ*, ..., δ*, −δ*, ..., −δ*)′, where a positive shift is applied to the first d/2 coordinates and a negative shift is applied to the other d/2 coordinates. If d is odd, then δ = (δ*, ..., δ*, −δ*, ..., −δ*, 0)′, where a positive shift is applied to the first (d − 1)/2 coordinates, a negative shift is applied to the next (d − 1)/2 coordinates, and the last coordinate is not changed. Term this shift an "opposite direction shift" (ODS). The value of δ* is chosen such that the power of Hotelling's test is equal to 0.70 and 0.90. The way by which this can be done is described in Appendix 5.4.1 and Appendix 6.2.2 of Whaley (1983) and is summarized in Section 4.2.1.
To compute the power, 200 sets of N d-dimensional multivariate normal
random variates are generated for each combination of sample size, dimension,
correlation and shift, using the IMSL subroutines CHFAC and RNMVN. Due to
limitation of computing time, for N = 20, 100 sets are generated and for N
= 50,
50 sets are generated.
In each case the appropriate shift is added to the last n
members of the set.
The critical values are estimated using the asymptotic
distribution developed in Chapter 3.
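The generation step above can be sketched as follows, with NumPy in place of the IMSL routines (numpy.linalg.cholesky playing the role of CHFAC); the function name and argument defaults are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_samples(m, n, d, rho, delta_star, mode="SDS"):
    """Generate one pair of samples: X (m x d) from N(0, Sigma) and
    Y (n x d) from N(shift, Sigma), with Sigma equicorrelated
    (unit variances, common correlation rho)."""
    sigma = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    if mode == "SDS":                       # same direction shift
        shift = np.full(d, delta_star)
    else:                                   # opposite direction shift
        shift = np.zeros(d)
        half = d // 2
        shift[:half] = delta_star
        shift[half:2 * half] = -delta_star  # last coordinate stays 0 if d is odd
    L = np.linalg.cholesky(sigma)           # Cholesky factor, as CHFAC would give
    X = rng.standard_normal((m, d)) @ L.T
    Y = rng.standard_normal((n, d)) @ L.T + shift
    return X, Y

X, Y = simulate_samples(10, 10, 5, 0.36, 1.22, mode="SDS")
```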
4.2.1 Computation of δ to achieve a specified power for Hotelling's T² test.

Let X and Y be two d-variate populations such that X ~ N(μ1, Σ) and Y ~ N(μ2, Σ), where Σ = ((σ_ij)). Hotelling's T² statistic is commonly used to test the hypothesis that μ1 = μ2. The statistic is given by [Hotelling (1931)]

    T² = [mn/(m + n)] (X̄1 − X̄2)′ S⁻¹ (X̄1 − X̄2),                        (4.2.1.1)

where m = number of observations in sample 1, n = number of observations in sample 2, X̄_i is the mean vector of measurements in sample i, i = 1, 2, S = (m S1 + n S2)/(m + n − 2) is the pooled unbiased estimate of the common covariance matrix of the populations, and S_i is the estimate of the covariance matrix of population i, i = 1, 2. Assuming multivariate normality of the data, the distribution of T² is given by

    [(m + n − d − 1)/((m + n − 2) d)] T² ~ F_{d, m+n−d−1, Δ},             (4.2.1.2)

where Δ = (mn/N) (μ1 − μ2)′ Σ⁻¹ (μ1 − μ2), m and n are the sample sizes, and F_{a,b,c} represents the noncentral F distribution with a and b degrees of freedom and noncentrality parameter c. Using the notation δ = μ1 − μ2 = (δ*, δ*, ..., δ*)′,

    Δ = (mn/N) δ*² Σ_{ij} σ^{ij},

where the σ^{ij} are the elements of Σ⁻¹. For the particular covariance structure being considered here,

    Σ⁻¹ = [1/(1 − ρ)] {I − [ρ/(1 + (d − 1)ρ)] 11′}.
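Formula (4.2.1.1) and the F transformation in (4.2.1.2) can be sketched as follows. One assumption is made explicit here: the pooled covariance is computed with the standard (m − 1) and (n − 1) weights, which corresponds to taking the S_i as the usual unbiased sample covariance matrices.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 as in (4.2.1.1); x is m x d, y is n x d.
    Returns T^2 and its central-F-scaled version from (4.2.1.2)."""
    m, n = len(x), len(y)
    d = x.shape[1]
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((m - 1) * np.cov(x, rowvar=False)
              + (n - 1) * np.cov(y, rowvar=False)) / (m + n - 2)
    t2 = (m * n / (m + n)) * diff @ np.linalg.solve(pooled, diff)
    f_stat = (m + n - d - 1) / ((m + n - 2) * d) * t2  # ~ F_{d, m+n-d-1}
    return t2, f_stat

rng = np.random.default_rng(4)
Xs = rng.standard_normal((20, 3))
t2, f = hotelling_t2(Xs, Xs.copy())  # identical samples: T^2 = 0
```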
If ρ = 0, then Σ⁻¹ = I and Σ_{ij} σ^{ij} = d, which results in Δ = (mn/N) d δ*² and

    δ* = [N Δ/(m n d)]^{1/2}.

If ρ ≠ 0, in the "same direction shift" case δ′Σ⁻¹δ = d δ*²/[1 + (d − 1)ρ], so that

    δ* = {N Δ [1 + (d − 1)ρ]/(m n d)}^{1/2}.

In the "opposite direction shift" case, if d is even then 1′δ = 0 and δ′δ = d δ*², so that

    δ′Σ⁻¹δ = d δ*²/(1 − ρ)   and   δ* = [N Δ (1 − ρ)/(m n d)]^{1/2}.

If d is odd, then 1′δ = 0 and δ′δ = (d − 1) δ*², so that

    δ′Σ⁻¹δ = (d − 1) δ*²/(1 − ρ)   and   δ* = {N Δ (1 − ρ)/[m n (d − 1)]}^{1/2}.
The relationship between δ* and Δ, ρ, d and the sample sizes m, n and N = m + n is given in Table 4.1. Table 4.2 gives the values of δ* needed to achieve Hotelling's power of 0.70 and 0.90 for each combination of sample sizes, dimension, correlation and shift. Table 4.3 shows the estimated power.
Table 4.1. Relationship between δ* and Δ, ρ, d, m, n and N = m + n.

    ρ             d              δ*
    0.00          2, 5, 10, 20   [N Δ/(m n d)]^{1/2}
    0.36 (SDS)    2, 5, 10, 20   {N Δ [1 + (d − 1)ρ]/(m n d)}^{1/2}
    0.36 (ODS)    2, 10, 20      [N Δ (1 − ρ)/(m n d)]^{1/2}
    0.36 (ODS)    5              {N Δ (1 − ρ)/[m n (d − 1)]}^{1/2}
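The relationships in Table 4.1 can be collected in a small helper (hypothetical, for a given noncentrality Δ); for even d the ODS value is simply the ρ = 0 value scaled by √(1 − ρ), which is a useful check on the table.

```python
import math

def delta_star(Delta, m, n, d, rho, mode):
    """delta* as a function of the noncentrality Delta,
    following Table 4.1 (N = m + n)."""
    N = m + n
    if rho == 0:
        return math.sqrt(N * Delta / (m * n * d))
    if mode == "SDS":
        return math.sqrt(N * Delta * (1 + (d - 1) * rho) / (m * n * d))
    # ODS: for odd d only d - 1 coordinates are shifted
    k = d if d % 2 == 0 else d - 1
    return math.sqrt(N * Delta * (1 - rho) / (m * n * k))

print(delta_star(1.0, 25, 25, 10, 0.36, "ODS"))
```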
Table 4.2. Values of δ* to achieve Hotelling's power of 0.70 and 0.90

    m=n   d    ρ             δ*(0.70)   δ*(0.90)
    5     2    0.00          1.56       2.01
    5     2    0.36 (SDS)    1.81       2.35
    5     2    0.36 (ODS)    1.24       1.61
    5     5    0.00          1.69       2.21
    5     5    0.36 (SDS)    2.64       3.46
    5     5    0.36 (ODS)    1.51       1.98
    10    2    0.00          0.96       1.23
    10    2    0.36 (SDS)    1.12       1.44
    10    2    0.36 (ODS)    0.77       0.99
    10    5    0.00          0.87       1.10
    10    5    0.36 (SDS)    1.22       1.54
    10    5    0.36 (ODS)    0.70       0.88
Table 4.2. (Continued)

    m=n   d    ρ             δ*(0.70)   δ*(0.90)
    10    10   0.00          0.78       0.98
    10    10   0.36 (SDS)    1.60       2.01
    10    10   0.36 (ODS)    0.62       0.78
    25    2    0.00          0.57       0.73
    25    2    0.36 (SDS)    0.67       0.86
    25    2    0.36 (ODS)    0.46       0.59
    25    5    0.00          0.44       0.55
    25    5    0.36 (SDS)    0.68       0.85
    25    5    0.36 (ODS)    0.39       0.49
    25    10   0.00          0.37       0.45
    25    10   0.36 (SDS)    0.75       0.93
    25    10   0.36 (ODS)    0.30       0.36
Table 4.2. (Continued)

    m=n   d    ρ             δ*(0.70)   δ*(0.90)
    25    20   0.00          0.33       0.41
    25    20   0.36 (SDS)    0.92       1.13
    25    20   0.36 (ODS)    0.26       0.32
Table 4.3. Estimated power of V_mn when Hotelling's T² power is 0.70 and 0.90.

                               Power of T²
    m=n   d    ρ             0.70      0.90
    5     2    0.00          0.895     0.985
    5     2    0.36 (SDS)    0.890     0.980
    5     2    0.36 (ODS)    0.820     0.940
    5     5    0.00          0.995     1.000
    5     5    0.36 (SDS)    1.000     1.000
    5     5    0.36 (ODS)    0.990     1.000
    10    2    0.00          0.760     0.920
    10    2    0.36 (SDS)    0.800     0.940
    10    2    0.36 (ODS)    0.660     0.919
    10    5    0.00          0.930     0.990
    10    5    0.36 (SDS)    0.940     1.000
    10    5    0.36 (ODS)    0.800     0.940
Table 4.3. (Continued)

                               Power of T²
    m=n   d    ρ             0.70      0.90
    10    10   0.00          0.970     1.000
    10    10   0.36 (SDS)    1.000     1.000
    10    10   0.36 (ODS)    0.890     0.980
    25    2    0.00          0.640     0.860
    25    2    0.36 (SDS)    0.660     0.820
    25    2    0.36 (ODS)    0.620     0.920
    25    5    0.00          0.800     0.920
    25    5    0.36 (SDS)    0.840     0.980
    25    5    0.36 (ODS)    0.620     0.860
    25    10   0.00          0.860     0.980
    25    10   0.36 (SDS)    0.960     1.000
    25    10   0.36 (ODS)    0.500     0.780
Table 4.3. (Continued)

                               Power of T²
    m=n   d    ρ             0.70      0.90
    25    20   0.00          0.880     0.980
    25    20   0.36 (SDS)    1.000     1.000
    25    20   0.36 (ODS)    0.680     0.820
4.3 Conclusions

The simulation presented in this chapter was designed to study the power of the test based on the test statistic V_mn so that the results could be compared with Hotelling's T² and also with the results of Schilling's and Barakat's tests. Compared with the results presented in Tables 4.4.1 to 4.4.3 of Barakat (1989), the results in Table 4.3 show that the power of the test based on V_mn is in all cases greater than the power of Schilling's T_{k,N} (k = 1, 2, 3), and in several cases greater than the power of Barakat's test. In most cases the power was greater than Hotelling's. For the same sample sizes the power increases with the dimension, and for the same dimension the relative power (relative to Hotelling's test) decreases with the sample sizes. For samples of size 5 the power of V_mn is greater than Hotelling's; for samples of size 10 the power of V_mn is greater than Hotelling's except when d = 2 and ρ = 0.36 (ODS). For samples of size 25 the power of V_mn is greater than Hotelling's for d = 5, 10 and 20 [except when ρ = 0.36 (ODS)].

Since the power of V_mn is greater than Hotelling's in some cases, note that, as pointed out by Whaley (1983), Hotelling's T² is a uniformly most powerful test in the class of tests whose power depends only on the noncentrality parameter, or among tests invariant with respect to nonsingular linear transformations [Anderson (1958)]. The test based on V_mn depends on the distances between observations, which are not invariant with respect to linear transformations, and its power does not depend only on the noncentrality parameter, since in the simulations, within each size-dimension combination, the noncentrality parameter was held constant but the power varied considerably. Thus V_mn does not belong to the class for which Hotelling's T² is uniformly most powerful, and its power should be compared to other nonparametric tests such as, for example, the multivariate version of the Kolmogorov-Smirnov test.
CHAPTER 5
AN EXAMPLE
5.1 Introduction
In this chapter a subset of Fisher's Iris data [Fisher (1936)] is analyzed by applying the test statistic V_mn, defined in (1.5.3), and Barakat's test statistic W, defined in (1.4.3), to test the hypothesis of homogeneity of two bivariate distributions, using
the asymptotic distribution developed in Chapter 3. Randomization tests for both
test statistics are also performed and the results are compared.
5.2 Description of the data
Fisher's Iris data consist of measurements on sepal and petal widths and lengths of three species of iris (Iris setosa, Iris virginica, and Iris versicolor), and have been used by several authors to illustrate different multivariate statistical procedures. Anderson (1958, pp. 108-110) used the data to illustrate the use of Hotelling's T² statistic to test the hypothesis of equal means (assuming equal covariance matrices) between two bivariate normal populations. Mardia et al. (1980) used the data in several examples of applications of concepts of discriminant analysis and cluster analysis. Whaley (1983) used a subset of the data in connection with applications of a multidimensional runs test. Barakat (1989) analyzed the same subset of the data using multivariate tests based on nearest neighbors. The same subset, consisting of 50 measurements on sepal widths and lengths of the species Iris virginica and Iris versicolor, is used here to
Table 5.1 A subset of Fisher's Iris data.

              Iris virginica (m = 50)      Iris versicolor (n = 50)
              Sepal        Sepal           Sepal        Sepal
              Width (cm)   Length (cm)     Width (cm)   Length (cm)
              3.3          6.3             3.2          7.0
              2.7          5.8             3.2          6.4
              3.0          7.1             3.1          6.9
              2.9          6.3             2.3          5.5
              3.0          6.5             2.8          6.5
              3.0          7.6             2.8          5.7
              2.5          4.9             3.3          6.3
              2.9          7.3             2.4          4.9
              2.5          6.7             2.9          6.6
              3.6          7.2             2.7          5.2
              3.2          6.5             2.0          5.0
              2.7          6.4             3.0          5.9
              3.0          6.8             2.2          6.0
              2.5          5.7             2.9          6.1
              2.6          6.1             3.0          5.4
              2.8          5.8             2.9          5.6
              3.2          6.4             3.1          6.7
              3.0          6.5             3.0          5.6
              3.8          7.7             2.7          5.8
              2.6          7.7             2.2          6.2
              2.2          6.0             2.5          5.6
              3.2          6.9             3.2          5.9
              2.8          5.6             2.8          6.1
              2.8          7.7             2.5          6.3
              2.7          6.3             2.8          6.1
              3.3          6.7             2.9          6.4
              3.2          7.2             3.0          6.6
              2.8          6.2             2.8          6.8
              3.0          6.1             3.0          6.7
              2.8          6.4             2.9          6.0
              3.0          7.2             2.6          5.7
              2.8          7.4             2.4          5.5
              3.8          7.9             2.4          5.5
              2.8          6.4             2.7          5.8
              2.8          6.3             2.7          6.0
              3.0          7.7             3.4          6.0
              3.4          6.3             3.1          6.7
              3.1          6.4             2.3          6.3
              3.0          6.0             3.0          5.6
              3.1          6.9             2.5          5.5
              3.1          6.7             2.6          5.5
              3.1          6.9             3.0          6.1
              2.7          5.8             2.6          5.8
              3.2          6.8             2.3          5.0
              3.3          6.7             2.7          5.6
              3.0          6.7             3.0          5.7
              2.5          6.3             2.9          5.7
              3.0          6.5             2.9          6.2
              3.4          6.2             2.5          5.1
              3.0          5.9             2.8          5.7
test the same hypothesis applying both V mn and Barakat's test statistic, W, and
using the asymptotic distribution developed in Chapter 3 for both tests. A list of
the data is given in Table 5.1. Descriptive statistics of the data are presented in
Table 5.2.
Table 5.2 Descriptive statistics of Fisher's Iris data.

                Iris virginica (m = 50)      Iris versicolor (n = 50)
                Sepal        Sepal           Sepal        Sepal
                Width (cm)   Length (cm)     Width (cm)   Length (cm)
    Mean        2.97         6.59            2.77         5.94
    St. dev.    0.32         0.64            0.31         0.53
    Minimum     2.20         4.90            2.00         4.90
    Median      3.00         6.50            2.80         5.90
    Maximum     3.80         7.90            3.40         7.00
Whaley (1983) applied Hotelling's T² test to test the null hypothesis of homogeneity of the distributions of the sepal measurements of Iris virginica and Iris versicolor, after testing the normality of the sepal width and sepal length in each species. He found evidence to support the assumption of bivariate normality of the joint distribution. The p-value of Hotelling's T² test was 0.000001, leading to the rejection of the null hypothesis. He then applied his multidimensional runs test, also obtaining a highly significant p-value. To have a better comparison between his test and Hotelling's, he produced confidence regions for the difference between the mean vectors of the two populations. Barakat (1989) tested the same hypothesis using the same data, applying Schilling's test statistics T_{k,N} (k = 1, 2, 3) and the test statistic W, which is equivalent to the U-statistic U_mn defined in (1.4.6). The critical values were found both by assuming asymptotic normality of his test statistic and by simulation. For this analysis, the mean and variance of T_{k,N} (k = 1, 2, 3) and W were estimated through simulation. The p-values for k = 1 and 2 were < 0.000001 and for k = 3 it was 0.0008, all highly significant, leading to the rejection of the null hypothesis.
5.3 Analysis of the data

5.3.1 Randomization Tests

Edgington (1987) describes two basic methods of getting significance by the use of randomization tests: systematic data permutation and random data permutation. In the systematic data permutation method, all possible data permutations are used to determine the significance, as described in Section 3.2 of Chapter 3. Random data permutation uses a simple random sample, with replacement, of all possible permutations, and is useful when the sample sizes are moderate or large, so that the total number of permutations is very large. Edgington (1969, 1987) justifies the use of random data permutation and gives confidence intervals for random data permutation p-values based on r random data permutations. If p is the p-value obtained using all possible permutations, then the 100(1 − α)% confidence interval for random data permutation p-values based on r random permutations is given by

    ( [r p − z_{α/2} √(r p (1 − p)) + 1]/r ,  [r p + z_{α/2} √(r p (1 − p)) + 1]/r ).

In the example in this chapter the total sample size is N = 100, with m = 50 and n = 50. The application of the systematic data permutation method would require (N choose m) = (100 choose 50) data permutations, which is not feasible. Thus random data permutation is used in the following way: both test statistics are computed for the observed sample, and then 9,999 random samples, with replacement, are drawn from the population of all possible permutations. The test statistics are computed for each random permutation of the data and compared with the observed values. Let P be the number of values of the test statistics greater than or equal to the observed values. The estimated p-values are then equal to (P + 1)/10,000. The null hypothesis being tested is that the two populations have the same distribution. Results of this analysis are presented in Table 5.3. The observed values of the test statistics in the original sample were V_mn = 1.69332 and W = 4.71412. For none of the 9,999 random permutations were the values of either test statistic greater than or equal to the respective observed values, so that both p-values are equal to 1/10,000. Computer programs necessary for this analysis are presented in Appendices 4 and 5.
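The random data permutation scheme described above can be sketched as follows, with a generic statistic in place of V_mn and W (an illustration, not the programs of Appendices 4 and 5):

```python
import numpy as np

rng = np.random.default_rng(5)

def random_permutation_p(x, y, stat, r=9999):
    """Random data permutation: reassign the pooled observations to the
    two groups r times and count statistics at least as large as the
    observed one; the estimated p-value is (P + 1)/(r + 1)."""
    m = len(x)
    z = np.concatenate([x, y])
    observed = stat(x, y)
    count = 0
    for _ in range(r):
        perm = rng.permutation(z)
        if stat(perm[:m], perm[m:]) >= observed:
            count += 1
    return (count + 1) / (r + 1)

x = rng.normal(size=10)
y = rng.normal(size=10)
p = random_permutation_p(x, y, lambda a, b: abs(a.mean() - b.mean()), r=999)
```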
Table 5.3 Analysis based on permutation tests

    Test statistic    # of random permutations    p-value
    V_mn              10,000                      0.0001
    W                 10,000                      0.0001
5.3.2 Tests based on the asymptotic distribution

To perform the tests using the asymptotic distribution of the test statistics, 200 pairs of bootstrap samples are drawn from the observed sample and the asymptotic bootstrap distribution is determined for both test statistics, as described in Chapter 3. The normalized values of the test statistics in the original samples are

    √(mn) [V_mn − E(V_mn)] = 50 (1.69332 − 1.33333) = 17.9995

and

    √(mn) [W − E(W)] = 50 (4.71412 − 4) = 35.706,

respectively. For V_mn, this value is greater than the maximum value among the 200 values of the bootstrap asymptotic distribution, so that the p-value is ≈ 0. Similarly, the observed value of W is greater than the maximum value for the bootstrap asymptotic distribution of W, so that the p-value is also ≈ 0. The results of this analysis are summarized in Table 5.4. The computer program for this analysis is listed in Appendix 3.

Table 5.4 Analysis based on the asymptotic distribution

    Test statistic    p-value
    V_mn              ≈ 0
    W                 ≈ 0
Since the results are so highly significant in this example, it is likely that different tests will produce significant results. So it may be useful to get an upper bound on the p-value by using Chebyshev's inequality. For this, the asymptotic variance can be estimated from the bootstrap distribution. Based on 200 bootstrap pairs of samples of sizes equal to 50, the observed value of V_mn is 1.69332245 and its estimated variance is V[√(mn) (V_mn − 4/3)] = 2.7500, so that V(V_mn) = 0.0011 and σ(V_mn) = 0.0333. Then, using Chebyshev's inequality, P{|V_mn − 4/3| ≥ k σ(V_mn)} ≤ 1/k², which gives

    P{V_mn ≥ 1.69332245} + P{V_mn ≤ 0.97334422} ≤ 1/116.8666 = 0.0086.

Thus, 0.0086 is a conservative upper bound for the p-value in this example when using the test statistic V_mn. Using the test statistic W, similarly, the upper bound is given by

    P{W ≥ 4.71412231} + P{W ≤ 3.28587769} ≤ 1/149.7306 = 0.0067.
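The Chebyshev computation above can be reproduced as follows; carrying σ(V_mn) unrounded gives a bound of about 0.0085, while the 0.0086 in the text comes from rounding σ to 0.0333.

```python
import math

def chebyshev_p_bound(observed, mean, var_asymp, m, n):
    """Two-sided Chebyshev bound on the p-value, with Var(V) estimated
    as Var[sqrt(mn)(V - mean)] / (mn) from the bootstrap distribution."""
    var_v = var_asymp / (m * n)
    k = abs(observed - mean) / math.sqrt(var_v)
    return 1.0 / k ** 2

bound = chebyshev_p_bound(1.69332245, 4.0 / 3.0, 2.75, 50, 50)
print(round(bound, 4))  # about 0.0085
```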
It is also useful to construct a confidence region for the mean difference between the Iris virginica and Iris versicolor populations using the test statistic V_mn, and compare it with the confidence region given by Hotelling's T². Let us consider testing, at significance level α, the hypotheses

    H0: μ1 − μ2 = δ0   versus   H1: μ1 − μ2 ≠ δ0,

where μ1 = (μ11, μ12)′ is the mean vector of measurements in the Iris virginica population, with μ11 = mean value of sepal width and μ12 = mean value of sepal length, μ2 = (μ21, μ22)′ is the corresponding mean vector of measurements in the Iris versicolor population, and δ0 = (δ01, δ02)′ is the corresponding hypothesized mean difference. The set of values of δ0 for which the test accepts H0 constitutes the 100(1 − α)% confidence region.

The 100(1 − α)% confidence region for μ1 − μ2, using Hotelling's T² as defined in (4.2.1.1) and (4.2.1.2), is given by the values of δ0 satisfying the inequality

    [mn/(m + n)] (X̄1 − X̄2 − δ0)′ S⁻¹ (X̄1 − X̄2 − δ0) ≤ [(m + n − 2) d/(m + n − d − 1)] F_{1−α, d, m+n−d−1},

where m = 50, n = 50,

    S = [ 0.101  0.089
          0.089  0.335 ],   X̄1 = (2.97, 6.59)′,   X̄2 = (2.77, 5.94)′.

For α = 0.05, F_{0.95, 2, 97} = 3.090, and since

    S⁻¹ = [ 12.925  −3.448
            −3.448   3.902 ],

the inequality reduces to an elliptical region in (δ01, δ02). This confidence region is an ellipse centered at X̄1 − X̄2 = (0.204, 0.652)′ and bounded by sepal width differences of (0.204 ± 0.16) and sepal length differences of (0.652 ± 0.29).
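Membership of a hypothesized difference δ0 in the Hotelling region can be checked directly from the quantities above; the numbers below are those reported in the text.

```python
import numpy as np

m = n = 50
d = 2
S = np.array([[0.101, 0.089], [0.089, 0.335]])
center = np.array([0.204, 0.652])                 # xbar1 - xbar2
crit = (m + n - 2) * d / (m + n - d - 1) * 3.090  # F_{0.95, 2, 97} = 3.090

def in_region(delta0):
    """Is delta0 inside the 95% Hotelling T^2 confidence region?"""
    v = center - delta0
    q = (m * n / (m + n)) * v @ np.linalg.solve(S, v)
    return q <= crit

print(in_region(np.array([0.204, 0.652])))  # True: the center
print(in_region(np.array([0.0, 0.0])))      # False: equal means is rejected
```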
To construct a confidence region based on the test statistic V_mn, we tested H0: μ1 − μ2 = δ0 for several values of δ0. Since the testing procedure is very computer-time consuming, due to the bootstrapping techniques used for finding the critical value, and the construction of the confidence region requires performing hundreds of tests, we used two critical values: an "anti-conservative" critical value c1, obtained by testing H0: μ1 − μ2 = (0.204, 0.652)′, and a "conservative" critical value c2, obtained by testing H0: μ1 − μ2 = (0, 0)′. The values obtained were c1 = 2.95 and c2 = 3.55. Then tests were performed for values of δ01 and δ02 ranging from −0.5 to +0.5 in steps of 0.02, using both critical values. The two confidence regions, also with an elliptic shape, are shown in Figures 5.1 and 5.2. The internal ellipse is Hotelling's T² confidence region.
5.3.3 Conclusions

The p-values obtained using the permutation tests depend directly on the number of permutations used; the number used here, 10,000, was chosen based on limitations of computer time and such that the p-value obtained is a precise estimator of the p-value based on all possible permutations. The p-values obtained using the bootstrap asymptotic distribution also depend on the number of bootstrap samples drawn from the original sample. We chose 200 mainly based on limitations of computer time. The assumption of normality of the asymptotic distribution of W in Barakat's analysis is not correct, but since there is so much evidence in the data against the null hypothesis, all tests produce the same result. The area of both confidence regions obtained by using the test statistic V_mn is larger than the area of Hotelling's T² confidence region. The elliptical shape of the confidence region based on V_mn indicates that, similarly to Hotelling's T², the power is about the same for both the SDS and ODS shifts. As indicated before, it would also be useful to compare V_mn's confidence regions with confidence regions based on other nonparametric tests, as for example the bivariate Kolmogorov-Smirnov test, which will be considered in future research.
[Figure 5.1. Confidence regions for μ1 − μ2 using Fisher's Iris data, based on Hotelling's T² and V_mn with a "conservative" critical value.]
[Figure 5.2. Confidence regions for μ1 − μ2 using Fisher's Iris data, based on Hotelling's T² and V_mn with an "anti-conservative" critical value.]
CHAPTER 6
SUMMARY AND SUGGESTIONS FOR FUTURE RESEARCH
6.1 Summary
A nonparametric test for testing the hypothesis of homogeneity between
two multivariate distributions has been proposed. The test statistic, V_mn, is a U-statistic with a kernel of degree (2, 2). In Chapter 2, the consistency of the test was
studied first for a particular class of discrete distributions and then in two cases of
continuous distributions. The permutational and asymptotic distribution of the
test statistic was developed in Chapter 3.
Bootstrap methods were used to
estimate the asymptotic distribution and an assessment of the precision of the
approximation was made for a particular case. In Chapter 4, the power of the test
was studied via Monte Carlo simulation.
For different sample sizes and
dimensions, multivariate normal random variables were generated and the power
was estimated for location shift alternatives.
The results were compared with
Hotelling's T 2 theoretical power and with simulated results of Schilling's and
Barakat's tests, and in some cases the power of V mn compares favorably with
Hotelling's. In Chapter 5, the test was applied to a subset of Fisher's Iris data to
test the hypothesis of homogeneity of the two populations of sepal width and
length. Both the permutational and the bootstrap asymptotic distributions of the test statistic were used in the analysis, and confidence regions for μ1 − μ2 were provided.
The results are similar to results from previous analyses of the same
data set, done by Whaley (1983) and Barakat (1989).
6.2 Suggestions for future research
The main interest in this research was in the study of asymptotic
properties of the proposed test. Further research on these properties as well as on
practical aspects of the application of the test to real life data sets could improve
its usefulness. Some suggestions for future research on this topic are as follows:
1. A more detailed study on the consistency of the test in order to characterize a
class of continuous distributions for which the test is consistent, particularly
investigating the class of multivariate normal distributions;
2. Further work on the analytical expression of the asymptotic distribution of the
proposed test;
3. Comparing the performance of the proposed test in the univariate case with
other nonparametric tests such as Kolmogorov-Smirnov's test;
4. Optimizing the computer programs used to find critical points, since the bootstrap methods used are very computationally intensive;
5. Investigating the asymptotic non-null distribution and hence the power of the
proposed test theoretically;
6. Performing further Monte Carlo
simulations for different sample sizes and
other distributions and comparing the power with other nonparametric tests;
7. Constructing accurate confidence regions based on actual critical points and comparing with other tests;
8. Investigating the generalization of the proposed test to more than two
populations.
APPENDIX 1

A COMPUTER PROGRAM TO CALCULATE E(PHI) WHEN THE CLOSEST AND
THE SECOND CLOSEST PAIRS DO NOT HAVE ANY POINT IN COMMON

**********************************************************;
*WITHOUT LOSS OF GENERALITY IT IS ASSUMED THAT THE CLOSEST
PAIR IS THE PAIR (1,2) AND THE SECOND CLOSEST IS THE PAIR
(3,4). THE THIRD CLOSEST MUST JOIN 1 OR 2 WITH 3 OR 4.
WITHOUT LOSS OF GENERALITY LABEL THIS PAIR (1,3). THEN
ALL THE 3! = 6 MATRICES OF POSSIBLE RANKS OF DISTANCES
BETWEEN POINTS GIVE THE SAME VALUE FOR THE KERNEL
FUNCTION, SO WITHOUT LOSS OF GENERALITY THE ORDERING
D12 < D34 < D13 < D23 < D14 < D24 CAN BE ASSUMED;
**********************************************************;
OPTIONS LS=76;
PROC IML;
PSI=J(10,10,0);  /*MATRIX OF KERNEL FUNCTION VALUES*/
H={1 5 6 7}//{5 2 8 9}//{6 8 3 10}//{7 9 10 4};
NAME={"11" "22" "33" "44" "12" "13" "14" "23" "24" "34"};
DO DD=1 TO 6;
IF DD=1 THEN D={0 1 3 5}//{1 0 4 6}//{3 4 0 2}//{5 6 2 0};
IF DD=2 THEN D={0 1 3 6}//{1 0 5 4}//{3 5 0 2}//{6 4 2 0};
IF DD=3 THEN D={0 1 3 4}//{1 0 5 6}//{3 5 0 2}//{4 6 2 0};
IF DD=4 THEN D={0 1 3 5}//{1 0 6 4}//{3 6 0 2}//{5 4 2 0};
IF DD=5 THEN D={0 1 3 4}//{1 0 6 5}//{3 6 0 2}//{4 5 2 0};
IF DD=6 THEN D={0 1 3 6}//{1 0 4 5}//{3 4 0 2}//{6 5 2 0};
DO X1=1 TO 4;
DO X2=1 TO 4; ROW=H(|X1,X2|);
DO Y1=1 TO 4;
DO Y2=1 TO 4; COL=H(|Y1,Y2|); S=0;
W=D(|X1,X2|); Z=MIN(D(|X1,Y1|) || D(|X2,Y1|));
S=S+(W=Z)+3#(W<Z);
Z=MIN(D(|X1,Y2|) || D(|X2,Y2|));
S=S+(W=Z)+3#(W<Z);
W=D(|Y1,Y2|); Z=MIN(D(|X1,Y1|) || D(|X1,Y2|));
S=S+(W=Z)+3#(W<Z);
Z=MIN(D(|X2,Y1|) || D(|X2,Y2|));
S=S+(W=Z)+3#(W<Z);
PSI(|ROW,COL|)=S;
END;END;END;END;
PRINT D;
PRINT PSI [ROWNAME=NAME COLNAME=NAME];
END;
QUIT;
APPENDIX 2

A COMPUTER PROGRAM TO CALCULATE E(PHI) WHEN THE CLOSEST AND
THE SECOND CLOSEST PAIRS HAVE ONE POINT IN COMMON

**********************************************************;
*WITHOUT LOSS OF GENERALITY IT IS ASSUMED THAT THE CLOSEST
PAIR IS THE PAIR (1,2) AND THE SECOND CLOSEST IS THE PAIR
(1,3). THEN THE 24 POSSIBLE MATRICES OF RANKS OF
DISTANCES ARE GENERATED. ONLY THREE OF THOSE MATRICES
(FOR EXAMPLE THOSE FOR WHICH THE ORDERINGS OF THE
DISTANCES ARE D12 < D13 < D14 < D23 < D24 < D34, AND
D12 < D13 < D14 < D24 < D23 < D34, AND
D12 < D13 < D14 < D34 < D23 < D24)
SHOW DIFFERENT VALUES FOR THE KERNEL FUNCTION;
**********************************************************;
OPTIONS LS=76;
PROC IML;
PSI=J(10,10,0);  /*MATRIX OF KERNEL FUNCTION VALUES*/
H={1 5 6 7}//{5 2 8 9}//{6 8 3 10}//{7 9 10 4};
NAME={"11" "22" "33" "44" "12" "13" "14" "23" "24" "34"};
DO DD=1 TO 24;
IF DD=1 THEN D={0 1 2 3}//{1 0 4 5}//{2 4 0 6}//{3 5 6 0};
IF DD=2 THEN D={0 1 2 3}//{1 0 4 6}//{2 4 0 5}//{3 6 5 0};
IF DD=3 THEN D={0 1 2 3}//{1 0 5 4}//{2 5 0 6}//{3 4 6 0};
IF DD=4 THEN D={0 1 2 3}//{1 0 5 6}//{2 5 0 4}//{3 6 4 0};
IF DD=5 THEN D={0 1 2 3}//{1 0 6 4}//{2 6 0 5}//{3 4 5 0};
IF DD=6 THEN D={0 1 2 3}//{1 0 6 5}//{2 6 0 4}//{3 5 4 0};
IF DD=7 THEN D={0 1 2 4}//{1 0 3 5}//{2 3 0 6}//{4 5 6 0};
IF DD=8 THEN D={0 1 2 4}//{1 0 3 6}//{2 3 0 5}//{4 6 5 0};
IF DD=9 THEN D={0 1 2 4}//{1 0 5 3}//{2 5 0 6}//{4 3 6 0};
IF DD=10 THEN D={0 1 2 4}//{1 0 5 6}//{2 5 0 3}//{4 6 3 0};
IF DD=11 THEN D={0 1 2 4}//{1 0 6 3}//{2 6 0 5}//{4 3 5 0};
IF DD=12 THEN D={0 1 2 4}//{1 0 6 5}//{2 6 0 3}//{4 5 3 0};
IF DD=13 THEN D={0 1 2 5}//{1 0 3 4}//{2 3 0 6}//{5 4 6 0};
IF DD=14 THEN D={0 1 2 5}//{1 0 3 6}//{2 3 0 4}//{5 6 4 0};
IF DD=15 THEN D={0 1 2 5}//{1 0 4 3}//{2 4 0 6}//{5 3 6 0};
IF DD=16 THEN D={0 1 2 5}//{1 0 4 6}//{2 4 0 3}//{5 6 3 0};
IF DD=17 THEN D={0 1 2 5}//{1 0 6 3}//{2 6 0 4}//{5 3 4 0};
IF DD=18 THEN D={0 1 2 5}//{1 0 6 4}//{2 6 0 3}//{5 4 3 0};
IF DD=19 THEN D={0 1 2 6}//{1 0 3 4}//{2 3 0 5}//{6 4 5 0};
IF DD=20 THEN D={0 1 2 6}//{1 0 3 5}//{2 3 0 4}//{6 5 4 0};
IF DD=21 THEN D={0 1 2 6}//{1 0 4 3}//{2 4 0 5}//{6 3 5 0};
IF DD=22 THEN D={0 1 2 6}//{1 0 4 5}//{2 4 0 3}//{6 5 3 0};
IF DD=23 THEN D={0 1 2 6}//{1 0 5 3}//{2 5 0 4}//{6 3 4 0};
IF DD=24 THEN D={0 1 2 6}//{1 0 5 4}//{2 5 0 3}//{6 4 3 0};
DO X1=1 TO 4;
DO X2=1 TO 4; ROW=H(|X1,X2|);
DO Y1=1 TO 4;
DO Y2=1 TO 4; COL=H(|Y1,Y2|); S=0;
W=D(|X1,X2|); Z=MIN(D(|X1,Y1|) || D(|X2,Y1|));
S=S+(W=Z)+3#(W<Z);
Z=MIN(D(|X1,Y2|) || D(|X2,Y2|));
S=S+(W=Z)+3#(W<Z);
W=D(|Y1,Y2|); Z=MIN(D(|X1,Y1|) || D(|X1,Y2|));
S=S+(W=Z)+3#(W<Z);
Z=MIN(D(|X2,Y1|) || D(|X2,Y2|));
S=S+(W=Z)+3#(W<Z);
PSI(|ROW,COL|)=S;
END;END;END;END;
PRINT D;
PRINT PSI;
END;
QUIT;
APPENDIX 3
A COMPUTER PROGRAM FOR FINDING THE ASYMPTOTIC DISTRIBUTION OF
THE TEST STATISTICS V(MN) AND W
***********************************************************
*
*     MAIN PROGRAM
*
***********************************************************
      INTEGER D,B
      PARAMETER(N=100,M=50,D=2,B=200,THETA=4./3.)
      PARAMETER(ALPHA=0.05)
      REAL ZS(N,D),XS(M,D),YS(N-M,D),R(N),DZ(N,N),TMN(B)
      REAL V,BV
      INTEGER INDEXX(M),INDEXY(N-M)
      INTEGER INDEXXS(M),INDEXYS(N-M)
      EXTERNAL RNUN,RNSET,SVIGN,SVRGN
*     RNUN, RNSET, SVIGN AND SVRGN ARE SUBROUTINES FROM THE
*     IMSL LIBRARY
      REN=REAL(N)
*     -----------------------------------------------------
*     INPUT DATA
*     -----------------------------------------------------
      READ(5,88) ((XS(I,J),J=1,D),I=1,M)
   88 FORMAT(12F5.1)
      READ(5,88) ((YS(I,J),J=1,D),I=1,N-M)
      DO 10 I=1,N
      DO 10 J=1,D
      IF (I .LE. M) THEN
        ZS(I,J)=XS(I,J)
        INDEXXS(I)=I
      ELSE
        II=I-M
        ZS(I,J)=YS(II,J)
        INDEXYS(II)=I
      ENDIF
   10 CONTINUE
*     -----------------------------------------------------
*     "JITTER" THE DATA TO AVOID TIED DISTANCES
*     -----------------------------------------------------
      ISEED=1234567
      CALL RNSET(ISEED)
      NR=N
      CALL RNUN(NR,R)
      DO 11 I=1,N
      R(I)=R(I)/1000000.
      ZS(I,1)=ZS(I,1)+R(I)
   11 CONTINUE
*     -----------------------------------------------------
*     COMPUTE EUCLIDEAN DISTANCES
*     -----------------------------------------------------
      CALL DIST(N,D,ZS,DZ)
*     -----------------------------------------------------
*     COMPUTE THE OBSERVED VALUE OF THE TEST STATISTIC
*     -----------------------------------------------------
      CALL VMN(M,N,INDEXXS,INDEXYS,DZ,V)
      WRITE(6,2080)V
 2080 FORMAT(' OBSV=',F14.8)
*     -----------------------------------------------------
*     BOOTSTRAP ESTIMATION OF ASYMPTOTIC DISTRIBUTION:
*     GENERATE RANDOM PAIRS OF SAMPLES WITH REPLACEMENT,
*     X1, X2, ... ,XM AND Y1, ... ,YN FROM Z1, ... ,ZN
*     -----------------------------------------------------
      DO 1000 III=1,B
      NR=M
      ISEED=123457-2*III
      CALL RNSET(ISEED)
      CALL RNUN(NR,R)
*     -----------------------------------------------------
*     CREATE SET S(X): BOOTSTRAP X-SAMPLE INDICES
*     -----------------------------------------------------
      DO 50 I=1,M
      R(I)=REN*R(I)+1
      INDEXX(I)=INT(R(I))
   50 CONTINUE
      CALL SVIGN(M,INDEXX,INDEXXS)
*     -----------------------------------------------------
*     CREATE SET S(Y): BOOTSTRAP Y-SAMPLE INDICES
*     -----------------------------------------------------
      NR=N-M
      ISEED=123475+2*III
      CALL RNSET(ISEED)
      CALL RNUN(NR,R)
      DO 70 I=1,N-M
      R(I)=REN*R(I)+1
      INDEXY(I)=INT(R(I))
   70 CONTINUE
      CALL SVIGN(N-M,INDEXY,INDEXYS)
*     -----------------------------------------------------
*     COMPUTE BOOTSTRAP DISTRIBUTION
*     -----------------------------------------------------
      CALL VMN(M,N,INDEXXS,INDEXYS,DZ,BV)
 1000 TMN(III)=BV
      WRITE(6,2010) (TMN(I),I=1,B)
 2010 FORMAT(' TMN=',5(F12.8,1X))
      STOP
      END
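The bootstrap loop above can be summarized in Python. A minimal sketch, assuming the test statistic is supplied as a callable; `bootstrap_dist` and its argument names are mine, not from the dissertation.

```python
import random

def bootstrap_dist(stat, dz, m, n, b=200, seed=1234567):
    # Draw B bootstrap replicates: for each, sample M "x" indices and
    # N-M "y" indices with replacement from the pooled indices 0..N-1,
    # then evaluate the test statistic on the fixed distance matrix dz.
    rng = random.Random(seed)
    reps = []
    for _ in range(b):
        xs = [rng.randrange(n) for _ in range(m)]
        ys = [rng.randrange(n) for _ in range(n - m)]
        reps.append(stat(dz, xs, ys))
    return reps
```

Sorting the replicates and reading off the empirical upper quantile then gives a bootstrap critical value for the statistic.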
***********************************************************
*
*     SUBROUTINES
*
***********************************************************
*
*     SUBROUTINE VMN
*     THIS SUBROUTINE COMPUTES THE VALUE OF THE
*     TEST STATISTIC V(MN). TO FIND THE ASYMPTOTIC
*     DISTRIBUTION FOR BARAKAT'S TEST REPLACE THIS
*     SUBROUTINE BY SUBROUTINE W GIVEN IN APPENDIX 5.
*
      SUBROUTINE VMN(M,N,INDEXXS,INDEXYS,DZ,V)
      INTEGER M,N,INDEXXS(M),INDEXYS(N-M)
      REAL DZ(N,N),V
      PHI=0.
      DO 10 II=1,M-1
      DO 10 JJ=II+1,M
      I=INDEXXS(II)
      J=INDEXXS(JJ)
      DO 10 KK=1,N-M-1
      DO 10 LL=KK+1,N-M
      K=INDEXYS(KK)
      L=INDEXYS(LL)
      IF (DZ(I,J) .LT. AMIN1(DZ(I,K),DZ(J,K))) PHI=PHI+1.
      IF (DZ(I,J) .LT. AMIN1(DZ(I,L),DZ(J,L))) PHI=PHI+1.
      IF (DZ(K,L) .LT. AMIN1(DZ(K,I),DZ(L,I))) PHI=PHI+1.
      IF (DZ(K,L) .LT. AMIN1(DZ(K,J),DZ(L,J))) PHI=PHI+1.
   10 CONTINUE
      V=4.*PHI/(M*(M-1)*(N-M)*(N-M-1))
      RETURN
      END
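Subroutine VMN translates almost line for line. A Python sketch of the same degree-(2,2) statistic, with my own names and 0-based indices:

```python
from itertools import combinations

def vmn(dz, xs, ys):
    # For every pair (i,j) from the x-sample and every pair (k,l) from
    # the y-sample, count how many of the four "within-pair distance
    # beats both cross distances" comparisons hold, then normalize:
    # V = 4*phi / (m(m-1) * n(n-1)).
    phi = 0
    for i, j in combinations(xs, 2):
        for k, l in combinations(ys, 2):
            if dz[i][j] < min(dz[i][k], dz[j][k]): phi += 1
            if dz[i][j] < min(dz[i][l], dz[j][l]): phi += 1
            if dz[k][l] < min(dz[k][i], dz[l][i]): phi += 1
            if dz[k][l] < min(dz[k][j], dz[l][j]): phi += 1
    m, n = len(xs), len(ys)
    return 4.0 * phi / (m * (m - 1) * n * (n - 1))
```

For two well-separated samples every comparison holds and `vmn` attains its maximum value 4; for thoroughly mixed samples it falls toward its null expectation.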
*     -----------------------------------------------------
*     SUBROUTINE DIST
*     THIS SUBROUTINE COMPUTES (SQUARED) EUCLIDEAN
*     DISTANCES AMONG OBSERVATIONS; THE SQUARE ROOT IS
*     OMITTED SINCE ONLY ORDER COMPARISONS ARE NEEDED
*     -----------------------------------------------------
      SUBROUTINE DIST(N,D,ZS,DZ)
      INTEGER N,D
      REAL ZS(N,D),DZ(N,N)
      DO 20 I=1,N
      DO 20 J=1,N
      DZ(I,J)=0.
   20 CONTINUE
      DO 30 I=1,N-1
      DO 30 J=I+1,N
      DO 30 JJ=1,D
      DZ(I,J)=DZ(I,J)+(ZS(I,JJ)-ZS(J,JJ))**2
   30 CONTINUE
      DO 40 J=1,N-1
      DO 40 I=J+1,N
      DZ(I,J)=DZ(J,I)
   40 CONTINUE
      RETURN
      END
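Subroutine DIST never takes a square root: since V(MN) and W only compare distances, squared Euclidean distances preserve every comparison. A Python sketch (names mine):

```python
def squared_dist_matrix(pts):
    # pts: list of equal-length coordinate tuples. Returns the
    # symmetric N x N matrix of squared Euclidean distances with a
    # zero diagonal, as subroutine DIST does.
    n = len(pts)
    dz = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
            dz[i][j] = dz[j][i] = s
    return dz
```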
APPENDIX 4
A COMPUTER PROGRAM FOR FINDING THE PERMUTATIONAL
DISTRIBUTION OF V(MN)
***********************************************************
*
*     THIS PROGRAM COMPUTES THE PERMUTATIONAL DISTRIBUTION
*     OF THE TEST STATISTIC V(MN) USING RANDOM DATA
*     PERMUTATION (EDGINGTON (1980))
*
***********************************************************
*
*     MAIN PROGRAM
*
      INTEGER D,B
      PARAMETER(N=100,M=50,D=2,B=10000,THETA=4./3.)
      REAL ZS(N,D),XS(M,D),YS(N-M,D),R(N),DZ(N,N),PTMN(B)
      REAL V,PV,PVALUE
      INTEGER INDEXXS(M),INDEXYS(N-M),IPER(N),MO
      EXTERNAL RNUN,RNPER,RNSET
*     RNUN, RNPER AND RNSET ARE SUBROUTINES FROM THE
*     IMSL LIBRARY
*     -----------------------------------------------------
*     INPUT DATA
*     -----------------------------------------------------
      READ(5,88) ((XS(I,J),J=1,D),I=1,M)
   88 FORMAT(12F5.1)
      READ(5,88) ((YS(I,J),J=1,D),I=1,N-M)
      DO 10 I=1,N
      DO 10 J=1,D
      IF (I .LE. M) THEN
        ZS(I,J)=XS(I,J)
        INDEXXS(I)=I
      ELSE
        II=I-M
        ZS(I,J)=YS(II,J)
        INDEXYS(II)=I
      ENDIF
   10 CONTINUE
*     -----------------------------------------------------
*     "JITTER" THE DATA TO AVOID TIED DISTANCES
*     -----------------------------------------------------
      ISEED=1234567
      CALL RNSET(ISEED)
      NR=N
      CALL RNUN(NR,R)
      DO 11 I=1,N
      R(I)=R(I)/1000000.
      ZS(I,1)=ZS(I,1)+R(I)
   11 CONTINUE
*     -----------------------------------------------------
*     COMPUTE EUCLIDEAN DISTANCES
*     -----------------------------------------------------
      CALL DIST(N,D,ZS,DZ)
*     -----------------------------------------------------
*     COMPUTE THE OBSERVED VALUE OF THE TEST STATISTIC
*     -----------------------------------------------------
      CALL VMN(M,N,INDEXXS,INDEXYS,DZ,V)
      WRITE(6,2080)V
 2080 FORMAT(' OBSV=',F14.8)
*     -----------------------------------------------------
*     PERMUTATIONAL DISTRIBUTION (RANDOM DATA PERMUTATION):
*     GENERATE RANDOM PAIRS OF SAMPLES WITHOUT REPLACEMENT,
*     X1, X2, ... ,XM AND Y1, ... ,YN FROM Z1, ... ,ZN
*     -----------------------------------------------------
      MO=0
      DO 1000 III=1,B-1
      ISEED=123457+2*III
      CALL RNSET(ISEED)
      CALL RNPER(N,IPER)
*     -----------------------------------------------------
*     CREATE X AND Y PERMUTATIONAL SAMPLES
*     -----------------------------------------------------
      DO 50 I=1,M
   50 INDEXXS(I)=IPER(I)
      DO 51 I=1,N-M
   51 INDEXYS(I)=IPER(M+I)
*     -----------------------------------------------------
*     COMPUTE TEST STATISTIC
*     -----------------------------------------------------
      CALL VMN(M,N,INDEXXS,INDEXYS,DZ,PV)
      PTMN(III)=PV
      IF (PV .GE. V) MO=MO+1
 1000 CONTINUE
      PVALUE=REAL(MO)/B
      WRITE(6,4000)PVALUE
 4000 FORMAT(' PVALUE=',F14.8)
      STOP
      END
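The random-permutation scheme above (shuffle the pooled labels, recompute the statistic, count replicates at least as large as the observed value) can be sketched in Python. `perm_pvalue` and its argument names are mine; unlike the Fortran program, this sketch adds 1 to numerator and denominator so that the observed arrangement counts as one replicate and the p-value cannot be zero.

```python
import random

def perm_pvalue(stat, dz, m, n, b=199, seed=123457):
    # Observed statistic on the original labelling 0..m-1 vs m..n-1,
    # then B random permutations of the pooled indices; the p-value is
    # the fraction of replicates at least as large as the observed one.
    idx = list(range(n))
    obs = stat(dz, idx[:m], idx[m:])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(b):
        rng.shuffle(idx)
        if stat(dz, idx[:m], idx[m:]) >= obs:
            exceed += 1
    return (exceed + 1) / (b + 1)
```

Any label-invariant two-sample statistic can be plugged in for `stat`, including the `vmn` sketch given with subroutine VMN.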
***********************************************************
*
*     SUBROUTINES
*
***********************************************************
*
*     SUBROUTINE VMN
*     THIS SUBROUTINE COMPUTES THE VALUE OF THE
*     TEST STATISTIC V(MN).
*
      SUBROUTINE VMN(M,N,INDEXXS,INDEXYS,DZ,V)
      INTEGER M,N,INDEXXS(M),INDEXYS(N-M)
      REAL DZ(N,N),V
      PHI=0.
      DO 10 II=1,M-1
      DO 10 JJ=II+1,M
      I=INDEXXS(II)
      J=INDEXXS(JJ)
      DO 10 KK=1,N-M-1
      DO 10 LL=KK+1,N-M
      K=INDEXYS(KK)
      L=INDEXYS(LL)
      IF (DZ(I,J) .LT. AMIN1(DZ(I,K),DZ(J,K))) PHI=PHI+1.
      IF (DZ(I,J) .LT. AMIN1(DZ(I,L),DZ(J,L))) PHI=PHI+1.
      IF (DZ(K,L) .LT. AMIN1(DZ(K,I),DZ(L,I))) PHI=PHI+1.
      IF (DZ(K,L) .LT. AMIN1(DZ(K,J),DZ(L,J))) PHI=PHI+1.
   10 CONTINUE
      V=4.*PHI/(M*(M-1)*(N-M)*(N-M-1))
      RETURN
      END
*     -----------------------------------------------------
*     SUBROUTINE DIST
*     THIS SUBROUTINE COMPUTES (SQUARED) EUCLIDEAN
*     DISTANCES AMONG OBSERVATIONS; THE SQUARE ROOT IS
*     OMITTED SINCE ONLY ORDER COMPARISONS ARE NEEDED
*     -----------------------------------------------------
      SUBROUTINE DIST(N,D,ZS,DZ)
      INTEGER N,D
      REAL ZS(N,D),DZ(N,N)
      DO 20 I=1,N
      DO 20 J=1,N
      DZ(I,J)=0.
   20 CONTINUE
      DO 30 I=1,N-1
      DO 30 J=I+1,N
      DO 30 JJ=1,D
      DZ(I,J)=DZ(I,J)+(ZS(I,JJ)-ZS(J,JJ))**2
   30 CONTINUE
      DO 40 J=1,N-1
      DO 40 I=J+1,N
      DZ(I,J)=DZ(J,I)
   40 CONTINUE
      RETURN
      END
APPENDIX 5
A COMPUTER PROGRAM FOR FINDING THE PERMUTATIONAL
DISTRIBUTION OF W
***********************************************************
*
*     THIS PROGRAM COMPUTES THE PERMUTATIONAL DISTRIBUTION
*     OF BARAKAT'S TEST STATISTIC W USING RANDOM
*     DATA PERMUTATION (EDGINGTON (1980))
*
***********************************************************
*
*     MAIN PROGRAM
*
      INTEGER D,B
      PARAMETER(N=100,M=50,D=2,B=10000,THETA=4./3.)
      REAL ZS(N,D),XS(M,D),YS(N-M,D),R(N),DZ(N,N),PTMN(B)
      REAL V,PV,PVALUE
      INTEGER INDEXXS(M),INDEXYS(N-M),IPER(N),MO
      EXTERNAL RNUN,RNPER,RNSET
*     RNUN, RNPER AND RNSET ARE SUBROUTINES FROM THE
*     IMSL LIBRARY
*     -----------------------------------------------------
*     INPUT DATA
*     -----------------------------------------------------
      READ(5,88) ((XS(I,J),J=1,D),I=1,M)
   88 FORMAT(12F5.1)
      READ(5,88) ((YS(I,J),J=1,D),I=1,N-M)
      DO 10 I=1,N
      DO 10 J=1,D
      IF (I .LE. M) THEN
        ZS(I,J)=XS(I,J)
        INDEXXS(I)=I
      ELSE
        II=I-M
        ZS(I,J)=YS(II,J)
        INDEXYS(II)=I
      ENDIF
   10 CONTINUE
*     -----------------------------------------------------
*     "JITTER" THE DATA TO AVOID TIED DISTANCES
*     -----------------------------------------------------
      ISEED=1234567
      CALL RNSET(ISEED)
      NR=N
      CALL RNUN(NR,R)
      DO 11 I=1,N
      R(I)=R(I)/1000000.
      ZS(I,1)=ZS(I,1)+R(I)
   11 CONTINUE
*     -----------------------------------------------------
*     COMPUTE EUCLIDEAN DISTANCES
*     -----------------------------------------------------
      CALL DIST(N,D,ZS,DZ)
*     -----------------------------------------------------
*     COMPUTE THE OBSERVED VALUE OF THE TEST STATISTIC
*     -----------------------------------------------------
      CALL W(M,N,INDEXXS,INDEXYS,DZ,V)
      WRITE(6,2080)V
 2080 FORMAT(' OBSV=',F14.8)
*     -----------------------------------------------------
*     PERMUTATIONAL DISTRIBUTION (RANDOM DATA PERMUTATION):
*     GENERATE RANDOM PAIRS OF SAMPLES WITHOUT REPLACEMENT,
*     X1, X2, ... ,XM AND Y1, ... ,YN FROM Z1, ... ,ZN
*     -----------------------------------------------------
      MO=0
      DO 1000 III=1,B-1
      ISEED=123457+2*III
      CALL RNSET(ISEED)
      CALL RNPER(N,IPER)
*     -----------------------------------------------------
*     CREATE X AND Y PERMUTATIONAL SAMPLES
*     -----------------------------------------------------
      DO 50 I=1,M
   50 INDEXXS(I)=IPER(I)
      DO 51 I=1,N-M
   51 INDEXYS(I)=IPER(M+I)
*     -----------------------------------------------------
*     COMPUTE TEST STATISTIC
*     -----------------------------------------------------
      CALL W(M,N,INDEXXS,INDEXYS,DZ,PV)
      PTMN(III)=PV
      IF (PV .GE. V) MO=MO+1
 1000 CONTINUE
      PVALUE=REAL(MO)/B
      WRITE(6,4000)PVALUE
 4000 FORMAT(' PVALUE=',F14.8)
      STOP
      END
***********************************************************
*
*     SUBROUTINES
*
***********************************************************
*
*     SUBROUTINE W
*     THIS SUBROUTINE COMPUTES THE OBSERVED VALUE
*     OF THE TEST STATISTIC W
*
      SUBROUTINE W(M,N,INDEXXS,INDEXYS,DZ,BV)
      INTEGER M,N,INDEXXS(M),INDEXYS(N-M)
      REAL DZ(N,N),BV
      PHI=0.
      DO 10 II=1,M-1
      DO 10 JJ=II+1,M
      I=INDEXXS(II)
      J=INDEXXS(JJ)
      DO 10 KK=1,N-M-1
      DO 10 LL=KK+1,N-M
      K=INDEXYS(KK)
      L=INDEXYS(LL)
      IF (DZ(I,J) .LT. DZ(I,K)) PHI=PHI+1.
      IF (DZ(I,J) .LT. DZ(J,K)) PHI=PHI+1.
      IF (DZ(I,J) .LT. DZ(I,L)) PHI=PHI+1.
      IF (DZ(I,J) .LT. DZ(J,L)) PHI=PHI+1.
      IF (DZ(K,L) .LT. DZ(K,I)) PHI=PHI+1.
      IF (DZ(K,L) .LT. DZ(L,I)) PHI=PHI+1.
      IF (DZ(K,L) .LT. DZ(K,J)) PHI=PHI+1.
      IF (DZ(K,L) .LT. DZ(L,J)) PHI=PHI+1.
   10 CONTINUE
      BV=4.*PHI/(M*(M-1)*(N-M)*(N-M-1))
      RETURN
      END
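Subroutine W makes the eight cross comparisons one at a time, rather than against the minimum of two cross distances as VMN does. A Python sketch (names mine):

```python
from itertools import combinations

def w_stat(dz, xs, ys):
    # Same quadruple loop as V(MN), but each within-pair distance is
    # compared with each of the four cross distances individually, so
    # a single quadruple can contribute up to 8 to phi.
    phi = 0
    for i, j in combinations(xs, 2):
        for k, l in combinations(ys, 2):
            for a in (i, j):
                for b in (k, l):
                    if dz[i][j] < dz[a][b]:
                        phi += 1
                    if dz[k][l] < dz[a][b]:
                        phi += 1
    m, n = len(xs), len(ys)
    return 4.0 * phi / (m * (m - 1) * n * (n - 1))
```

For well-separated samples all eight comparisons hold in every quadruple and the statistic reaches its maximum value 8.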
..
----------------------------------------------------SUBROUTINE DIST(N,D,ZS,DZ)
-----------------------------------------------------
THIS SUBROUTINE COMPUTES EUCLIDEAN DISTANCES AMONG
OBSERVATIONS
INTEGER N,D
REAL ZS(N,D),DZ(N,N)
DO 20 I=l,N
DO 20 J=l,N
DZ(I,J)=O.
20 CONTINUE
DO 30 I=1,N-1
•
113
...
•
DO 30 J=I+l,N
DO 30 JJ=l,D
DZ(I,J)=DZ(I,J)+(ZS(I,JJ)-ZS(J,JJ»**2
30 CONTINUE
DO 40 J=l,N-l
DO 40 I=J+l,N
DZ(I,J)=DZ(J,I)
40 CONTINUE
RETURN
END
REFERENCES
Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis, John
Wiley & Sons, New York.
Barakat, A.S. (1989). A Nonparametric Multivariate Test for Homogeneity Based
on All Nearest Neighbors, Institute of Statistics Mimeo Series No. 1866T,
University of North Carolina.
Barakat, A.S., Quade, D., and Salama, I.A. (1990). A Multivariate Test of
Homogeneity Based on All Nearest Neighbors, Unpublished manuscript.
Bickel, P.J. (1969). A Distribution Free Version of the Smirnov Two Sample Test
in the p-Variate Case, Annals of Mathematical Statistics, 40, 1-23.
Capon, J. (1961). Asymptotic Efficiency of Certain Locally Most Powerful Rank
Tests, Annals of Mathematical Statistics, 32, 88-100.
Cramer, H. (1928). On the Composition of Elementary Errors, Skandinavisk
Aktuarietidskrift, 11, 13-74 and 141-180.
Edgington, E. S. (1969). Statistical Inference: The Distribution-free Approach,
McGraw-Hill, New York.
Edgington, E. S. (1987). Randomization Tests, Marcel Dekker, Inc., New York.
Fisher, R.A. (1936). The Use of Multiple Measurements in Taxonomic Problems,
Annals of Eugenics, 7, 179-188.
Fisher, R.A., and Yates, F. (1938). Statistical Tables for Biological, Agricultural
and Medical Research, Oliver & Boyd, Edinburgh-London.
Friedman, J.H., and Rafsky, L. (1979). Multivariate Generalizations of the
Wald-Wolfowitz and Smirnov Two-Sample Tests, Annals of Statistics, 7, 697-717.
Friedman, J.H., and Steppel, S. (1974). A Nonparametric Procedure for
Comparing Multivariate Point Sets, Unpublished manuscript.
Gibbons, J.D. (1985). Nonparametric Statistical Inference, 2nd edition, Marcel
Dekker, New York.
Hajek, J., and Sidak, Z. (1967). The Theory of Rank Tests, Academic Press,
New York.
Henze, N. (1988). A Multivariate Two-Sample Test Based on the Number of
Nearest Neighbor Coincidences, Annals of Statistics, 16, 772-783.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal
Distribution, Annals of Mathematical Statistics, 19, 296-325.
Hoeffding, W. (1950). "Optimum" Nonparametric tests. Proceedings of the 2nd.
Berkeley Symposium, 83-92.
Hoeffding, W. (1961). The Strong Law of Large Numbers for U-Statistics,
Institute of Statistics Mimeo Series No. 302, University of North Carolina.
Hotelling, H. (1931). The Generalization of Student's Ratio, Annals of
Mathematical Statistics, 2, 359-378.
Jammalamadaka, S.R., and Janson, S. (1984). On the Limiting Distribution of the
Number of 'Near Matches', Statistics & Probability Letters, 2, 353-355.
Jupp, P.E. (1987). A Nonparametric Correlation Coefficient and a Two-Sample
Test for Random Vectors or Directions, Biometrika, 74, 887-890.
Klotz, J. (1962). Nonparametric Tests for Scale, Annals of Mathematical
Statistics, 33, 498-512.
Lee, A.J. (1990). U-Statistics: Theory and Practice, Marcel Dekker, New York.
Lehmann, E.L. (1951). Consistency and Unbiasedness of Certain Nonparametric
Tests, Annals of Mathematical Statistics, 22, 165-179.
Mardia, K., Kent, J., and Bibby, J. (1980). Multivariate Analysis, Academic
Press, Inc., New York.
von Mises, R. (1931). Wahrscheinlichkeitsrechnung, Leipzig-Wien.
Mood, A.M. (1954). On the Asymptotic Efficiency of Certain Nonparametric
Tests, Annals of Mathematical Statistics, 25, 514-522.
Puri, M.L., and Sen, P.K. (1971). Nonparametric Methods in Multivariate
Analysis, Wiley, New York.
Randles, R.H., and Wolfe, D.A. (1979). Introduction to the Theory of
Nonparametric Statistics, Wiley, New York.
Schilling, M.F. (1986). Multivariate Two-Sample Tests Based on Nearest
Neighbors, Journal of the American Statistical Association, 81, 799-806.
Sen, P.K. (1960). On Some Convergence Properties of U-Statistics, Calcutta
Statistical Association Bulletin, 10, 1-18.
Siegel, S., and Tukey, J.W. (1960). A Nonparametric Sum of Ranks Procedure
for Relative Spread in Unpaired Samples, Journal of the American
Statistical Association, 55, 429-444.
Smirnov, N.V. (1939). On the Estimation of the Discrepancy Between Empirical
Curves of Distribution for Two Independent Samples, Bulletin of
Mathematics Moscow University, 2, 3-16.
Sundrum, R.M. (1954). On Lehmann's Two-sample Test, Annals of
Mathematical Statistics, 25, 139-145.
Terry, M.E. (1952). Some Rank Order Tests Which are Most Powerful Against
Specific Parametric Alternatives. Annals of Mathematical Statistics, 23,
617-624.
van der Waerden, B.L. (1952/1953). Order Tests for the Two-sample Problem and
Their Power, I, II, III, Indagationes Mathematicae 14, 453-458; 15, 303-310
and 311-316.
Wald, A., and Wolfowitz, J. (1940). On a Test Whether Two Samples Are From
the Same Population, Annals of Mathematical Statistics, 11, 147-162.
Weiss, L. (1958). A Test of Fit for Multivariate Distributions, Annals of
Mathematical Statistics, 29, 595-599.
Whaley, F.S. (1983). Some Properties of the Two-sample Multidimensional Runs
Statistic, Unpublished PhD Dissertation, Department of Biostatistics,
University of North Carolina at Chapel Hill.
Whaley, F.S. and Quade, D. (1985). Optimizing the Power of the Two-sample
Multidimensional Runs Statistic: Guidelines Based on Computer Simulation,
Communications in Statistics: Simulation and Computation, 14, 1-11.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods, Biometrics
Bulletin, 1, 80-83.