Sen, P.K. (1977). "The Extended Two-Sample Problem: Nonparametric Case."

BIOMATHEMATICS TRAINING PROGRAM
THE EXTENDED TWO-SAMPLE PROBLEM:
NONPARAMETRIC CASE
by
Pranab Kumar Sen
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1140
August 1977
THE EXTENDED TWO-SAMPLE PROBLEM:
NONPARAMETRIC CASE*
Pranab Kumar Sen
University of North Carolina, Chapel Hill
The classical two-sample problem is extended here to the case where
the distribution functions of the observable random variables are specified functions of unknown distribution functions and the null hypothesis
to be tested or the parameter to be estimated relates to these unknown
distributions.
Various properties of the proposed rank tests and derived
estimates are studied.
AMS 1970 Classification Nos: 62G02, 62G10
Key Words & Phrases: Asymptotic efficiency, asymptotic normality,
distribution-freeness, equality of basic distributions, rank based estimates, rank order tests.
* Work partially supported by the Air Force Office of Scientific Research,
U.S.A.F., Grant No. AFOSR 74-2736.
1. INTRODUCTION
In the classical two-sample problem, given two independent samples from two (continuous) distributions $F$ and $G$, we intend either to test for the identity of $F$ and $G$ or to estimate some parameter which relates $G$ to $F$ in a meaningful way [e.g., $G(x) = F(x-\theta)$, where $\theta$ is the difference of locations of the two distributions]. In an extended two-sample problem, we conceive of two independent samples from two continuous distributions $F^*$ and $G^*$ respectively, where
$$F^*(x) = Q_1(F(x)), \qquad G^*(x) = Q_2(G(x)), \tag{1.1}$$
$Q_1 = \{Q_1(u),\ 0 \le u \le 1\}$ and $Q_2 = \{Q_2(u),\ 0 \le u \le 1\}$ are known, non-decreasing functions (with $Q_1(0) = Q_2(0) = 0$, $Q_1(1) = Q_2(1) = 1$) and the basic (continuous) distributions $F$ and $G$ are not specified. We intend to test for
$$H_0:\ F = G \quad (\text{without necessarily assuming that } Q_1 = Q_2). \tag{1.2}$$
[If $Q_1 = Q_2$ then $F = G \Rightarrow F^* = G^*$, so that the classical theory holds.] Also, relating $G$ to $F$ in a meaningful way, we desire to estimate the allied parameters based on the observations from the distributions $F^*$ and $G^*$.
As illustration, we consider the following:
Example 1. Consider a system with $k\ (\ge 1)$ electric cells connected in series (i.e., the low potential pole of the $i$-th cell is connected to the high potential pole of the $(i+1)$-th one, for $i = 1, \ldots, k-1$) and assume that each cell has a life distribution $F$. Then, the life distribution of the system is $F^*(x) = 1 - [1-F(x)]^k$. Suppose that there is a second system with $\ell\ (\ge 1)$ cells in series and the life distribution of each cell is $G$, so that the system has a life distribution $G^*(x) = 1 - [1-G(x)]^\ell$. Thus, for samples relating to the two systems, (1.1) holds with
$$Q_1(u) = 1-(1-u)^k, \qquad Q_2(u) = 1-(1-u)^\ell.$$
Example 2. Suppose that in Example 1, the cells are connected in parallel (i.e., all the high potential poles are connected together by a conducting wire and all the low potential poles together in the same way). Then, $F^*(x) = [F(x)]^k$ and $G^*(x) = [G(x)]^\ell$. Also, if $C$ be the individual capacity of the cells and the system becomes inoperative when its capacity is less than $C^*$ (where, for some $r$: $1 \le r \le k$, $(k-r)C < C^* \le (k-r+1)C$), then
$$F^*(x) = \sum_{j=r}^{k} \binom{k}{j} [F(x)]^j [1-F(x)]^{k-j}.$$
A similar case holds for the second system. In both these examples, $k$ and $\ell$ need not be equal.
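The three link functions appearing in Examples 1 and 2 can be written down directly. The following is a minimal sketch (the function names are illustrative and not from the paper); it also makes visible that the capacity model contains the other two as the special cases $r = 1$ (series) and $r = k$ (parallel).

```python
from math import comb


def q_series(u, k):
    """Q(u) = 1 - (1 - u)^k: the system fails as soon as one of k cells fails."""
    return 1.0 - (1.0 - u) ** k


def q_parallel(u, k):
    """Q(u) = u^k: the system fails only when all k cells have failed."""
    return u ** k


def q_capacity(u, k, r):
    """Q(u) = sum_{j=r}^{k} C(k, j) u^j (1 - u)^(k - j):
    the system fails once at least r of the k cells have failed."""
    return sum(comb(k, j) * u ** j * (1.0 - u) ** (k - j) for j in range(r, k + 1))


if __name__ == "__main__":
    u = 0.3
    print(q_series(u, 4), q_capacity(u, 4, 1))    # the two values agree
    print(q_parallel(u, 4), q_capacity(u, 4, 4))  # the two values agree
```

Each function is non-decreasing on $[0,1]$ with value 0 at $u=0$ and 1 at $u=1$, as required of $Q_1$ and $Q_2$ in (1.1).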
Similar examples arise in problems of reliability theory and competing
risks.
When $Q_1 \equiv Q_2$, $F \equiv G \Rightarrow F^* \equiv G^*$, and hence, the classical nonparametric theory holds. We intend to show that so long as $Q_1$ and $Q_2$ are specified, the classical theory can readily be extended and suitable rank tests can be constructed. Towards this, note that under $H_0$ in (1.2),
$$G^*(x) = Q_0(F^*(x)), \quad \text{where } Q_0(u) = Q_2(Q_1^{-1}(u)),\ 0 \le u \le 1, \tag{1.3}$$
and $Q_0$ is a specified, non-decreasing function with $Q_0(0) = 0$ and $Q_0(1) = 1$. This representation (characterizing the distribution-free nature of rank based tests) is exploited in Section 2 in the proposal of suitable rank tests.
The asymptotic distribution theory is dealt with in Section 3 and this is incorporated in Section 4 in the study of optimal rank statistics for some local alternatives. Section 5 deals with the allied estimation problem and some general remarks are made in the concluding section.
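As an illustration of the representation (1.3), the following sketch builds $Q_0$ numerically as the composition of $Q_2$ with the inverse of $Q_1$, assuming $Q_1$ is strictly increasing on $[0,1]$ so that the inverse exists. The helper names `inverse_on_unit_interval` and `compose_q0` are hypothetical. For the series systems of Example 1 the result can be compared with the closed form $1-(1-u)^{\ell/k}$, and for the parallel systems of Example 2 with $u^{\ell/k}$.

```python
def inverse_on_unit_interval(q, v, tol=1e-12):
    """Solve q(u) = v for u in [0, 1] by bisection (q assumed strictly increasing)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def compose_q0(q1, q2):
    """Return the function u -> q2(q1^{-1}(u))."""
    return lambda u: q2(inverse_on_unit_interval(q1, u))


if __name__ == "__main__":
    k, l = 2, 3
    q0_series = compose_q0(lambda u: 1 - (1 - u) ** k, lambda u: 1 - (1 - u) ** l)
    q0_parallel = compose_q0(lambda u: u ** k, lambda u: u ** l)
    print(q0_series(0.4), 1 - 0.6 ** (l / k))
    print(q0_parallel(0.25), 0.25 ** (l / k))
```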
2. THE PROPOSED RANK TESTS
Let $X_1, \ldots, X_m$ be independent and identically distributed random variables (i.i.d.r.v.) with a continuous distribution function (df) $F^*(x)$, and let $Y_1, \ldots, Y_n$ be i.i.d.r.v. with a continuous df $G^*(x)$, where $F^*$ and $G^*$ satisfy (1.1). Let $N = m+n$ and $R_{N1}, \ldots, R_{Nm}, R_{Nm+1}, \ldots, R_{NN}$ be respectively the ranks of $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ (in the combined sample); by virtue of the assumed continuity of $F^*$ and $G^*$, ties among the observations may be neglected, in probability, so that $\mathbf{R}_N = (R_{N1}, \ldots, R_{NN})$ is a (random) permutation of $(1, \ldots, N)$. Let $a_N(1), \ldots, a_N(N)$ be a set of (real valued) scores and we consider the usual two-sample rank order statistic
$$T_N = m^{-1} \sum_{i=1}^{m} a_N(R_{Ni}) \tag{2.1}$$
and with suitable $\{a_N(i)\}$, we intend to use $T_N$ as a test statistic. Let $\mathbf{r}_N = (r_1, \ldots, r_N)$ be any permutation of $(1, \ldots, N)$ and let $\mathcal{R}_N$ be the set of all possible $N!$ permutations. Then, we have
$$P\{\mathbf{R}_N = \mathbf{r}_N \mid H_0\} = \int\cdots\int_{z_1 < \cdots < z_N} \prod_{i=1}^{m} dF^*(z_{r_i}) \prod_{j=1}^{n} dG^*(z_{r_{m+j}}) = \int\cdots\int_{0 < u_1 < \cdots < u_N < 1} \prod_{i=1}^{m} du_{r_i} \prod_{j=1}^{n} dQ_0(u_{r_{m+j}}), \tag{2.2}$$
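For concreteness, the statistic in (2.1) can be computed as below. This is a minimal sketch that assumes no ties and takes $T_N$ to be the average of the scores attached to the combined-sample ranks of the first sample; the function names and the particular score vector $i/(N+1)$ used in the usage lines are illustrative choices.

```python
def combined_ranks(xs, ys):
    """Ranks (1..N) of X_1..X_m, Y_1..Y_n in the pooled sample, in that order."""
    pooled = list(xs) + list(ys)
    order = sorted(range(len(pooled)), key=lambda idx: pooled[idx])
    ranks = [0] * len(pooled)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def rank_statistic(xs, ys, scores):
    """T_N = (1/m) * sum of scores[R_Ni - 1] over the first sample."""
    m = len(xs)
    ranks = combined_ranks(xs, ys)
    return sum(scores[r - 1] for r in ranks[:m]) / m


if __name__ == "__main__":
    xs = [0.2, 1.5]
    ys = [0.7, 0.9, 2.0]
    N = len(xs) + len(ys)
    scores = [i / (N + 1) for i in range(1, N + 1)]
    print(combined_ranks(xs, ys))          # the X's take ranks 1 and 4
    print(rank_statistic(xs, ys, scores))  # (1/6 + 4/6) / 2
```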
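where
$$[\,\cdots\,]_r = \begin{cases} [\,\cdots\,], & r = r_i \ \text{for } 1 \le i \le m, \\ [\,\cdots\,], & r = r_{m+i} \ \text{for } 1 \le i \le n, \end{cases} \qquad \text{for } 1 \le r \le N. \tag{2.3}$$
[Thus, there is a one-to-one correspondence between $\mathbf{r}_N$ and the quantities in (2.3) for $1 \le r \le N$.]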
Since $Q_0(u)$, $0 \le u \le 1$, is specified under $H_0$ in (1.2), the distribution of $\mathbf{R}_N$, under $H_0$, does not depend on $F$, so that a test based on $\mathbf{R}_N$ (and hence, on $T_N$ in (2.1)) is distribution-free. Note that for $Q_0(u) \not\equiv u$, the probability function in (2.2) need not relate to a (discrete) uniform distribution over the $N!$ realizations $\{\mathbf{r}_N \in \mathcal{R}_N\}$.
Nevertheless, the distribution can be enumerated by using (2.2) for a given $Q_0$ for finite $m$ and $n$. For instance, in Example 1, $Q_0(u) = 1-(1-u)^a$ where $a = \ell/k\ (>0)$ (as $1-G^*(x) = [1-G(x)]^\ell = ([1-F(x)]^k)^a = (1-F^*(x))^a$),
and hence, (2.2) reduces to
(2.4)
This familiar expression arises in the so-called Lehmann [5] alternative case for the classical two-sample problem (though, there it relates to the alternative hypothesis, whereas, here, it relates to the null case). Similar expressions hold for the second example.
We may note that for every $r$: $1 \le r \le N$,
$$P\{R_{N1} = r \mid H_0\} = \sum_{s=1}^{r} \binom{m-1}{s-1}\binom{n}{r-s} \left[\int_{-\infty}^{\infty} [F^*(x)]^{s-1}[1-F^*(x)]^{m-s}[G^*(x)]^{r-s}[1-G^*(x)]^{n-r+s}\,dF^*(x)\right]_{F=G}$$
$$= \sum_{s=1}^{r} \binom{m-1}{s-1}\binom{n}{r-s} \int_0^1 u^{s-1}(1-u)^{m-s}[Q_0(u)]^{r-s}[1-Q_0(u)]^{n-r+s}\,du \tag{2.5}$$
so that
(2.6)
Similarly,
(2.7)
where for every $1 \le r < s \le N$,
$$\cdots = \sum_{i=1}^{r}\sum_{j=1}^{s-r} \frac{(m-1)!}{(i-1)!\,(j-1)!\,(m-i-j)!}\cdot\frac{n!}{(r-i)!\,(s-r-j)!\,(n-s+i+j)!} \iint_{0<u<v<1} u^{i-1}(v-u)^{j-1}(1-v)^{m-i-j}\,[Q_0(u)]^{r-i}\,[Q_0(v)-Q_0(u)]^{s-r-j}\,[1-Q_0(v)]^{n-s+i+j}\,du\,dv. \tag{2.8}$$
As $m$ or $n$ increases, the computation of the exact null distribution of $T_N$ (or its mean and variance) becomes prohibitively laborious. For this reason, in the next section, we take recourse to the asymptotic case and provide simple approximations under appropriate regularity conditions.
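To illustrate the enumeration for small $m$ and $n$, the sketch below treats the Example 1 case $Q_0(u) = 1-(1-u)^a$. It relies on a closed form obtained here by integrating (2.2) from the largest observation downwards (not copied from the paper's display): if the $X$'s occupy a given set of ranks, the probability is $m!\,n!\,a^n / \prod_{j=1}^{N} S_j$, where $S_j$ adds 1 for each $X$ and $a$ for each $Y$ among the $j$ largest observations. The function names and the rounding of $T_N$ values used to merge equal outcomes are illustrative choices.

```python
from itertools import combinations
from math import factorial


def config_probability(x_ranks, m, n, a):
    """P{the X's occupy exactly the ranks in x_ranks} when Q_0(u) = 1 - (1 - u)^a."""
    x_set = set(x_ranks)
    prob = factorial(m) * factorial(n) * a ** n
    weight = 0.0
    for rank in range(m + n, 0, -1):  # integrate out the largest observation first
        weight += 1.0 if rank in x_set else a
        prob /= weight
    return prob


def exact_null_distribution(m, n, a, scores):
    """Exact null distribution of T_N = mean score of the X ranks, as {value: probability}."""
    dist = {}
    for x_ranks in combinations(range(1, m + n + 1), m):
        t = round(sum(scores[r - 1] for r in x_ranks) / m, 12)
        dist[t] = dist.get(t, 0.0) + config_probability(x_ranks, m, n, a)
    return dist


if __name__ == "__main__":
    m, n, a = 3, 2, 1.5
    scores = [i / (m + n + 1) for i in range(1, m + n + 1)]
    for value, prob in sorted(exact_null_distribution(m, n, a, scores).items()):
        print(f"T_N = {value:.4f}  prob = {prob:.5f}")
```

With $a = 1$ every configuration has probability $1/\binom{N}{m}$, which recovers the classical uniform null distribution. The number of configurations grows as $\binom{N}{m}$, which is the practical obstacle noted above.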
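3. ASYMPTOTIC DISTRIBUTION OF $T_N$
Let $u(t) = 1$ or $0$ according as $t$ is $\ge$ or $< 0$ and define
$$F_m^*(x) = m^{-1}\sum_{i=1}^{m} u(x - X_i), \qquad G_n^*(x) = n^{-1}\sum_{j=1}^{n} u(x - Y_j), \qquad (-\infty < x < \infty), \tag{3.1}$$
$$H_N^*(x) = \lambda_N F_m^*(x) + (1-\lambda_N)\,G_n^*(x), \qquad \lambda_N = m/N. \tag{3.2}$$
Thus, $F_m^*$, $G_n^*$ and $H_N^*$ are respectively the first, second and combined sample empirical df's. Let then
$$H^*(x) = \lambda_N F^*(x) + (1-\lambda_N)\,G^*(x), \qquad -\infty < x < \infty. \tag{3.3}$$
[Note that $H^*$ may depend on $N$ through $\lambda_N$; for notational simplicity, this dependence is understood.]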
We conceive of a score function $\phi = \{\phi(u),\ 0 < u < 1\}$ such that $\phi(u) = \phi_1(u) - \phi_2(u)$, $0 < u < 1$, where both $\phi_1$ and $\phi_2$ are absolutely continuous and non-decreasing with
$$\int_0^1 |\phi_j(t)|\,\{t(1-t)\}^{-1/2}\,dt < \infty \quad \text{for } j = 1, 2. \tag{3.4}$$
[This is slightly more restrictive than the square integrability condition of the $\phi_j$, but is less restrictive than $\int_0^1 |\phi_j(u)|^r\,du < \infty$ for some $r > 2$, $j = 1, 2$.] Then, we assume that the scores $\{a_N(i)\}$ are defined by
$$a_N(i) = \phi(i/(N+1)) \quad \text{or} \quad E\phi(U_{Ni}), \quad \text{for } i = 1, \ldots, N\ (\ge 1), \tag{3.5}$$
where $U_{N1} < \cdots < U_{NN}$ are the ordered random variables of a sample of size $N$ from the rectangular $(0,1)$ df. In particular, when $\phi(u) \equiv u$ (or the inverse of the standard normal df), $a_N(i) = i/(N+1)$ (or the normal scores).
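Let us then define
$$\mu = \int_{-\infty}^{\infty} \phi(H^*(x))\,dF^*(x), \tag{3.6}$$
$$\sigma_1^2 = 2\iint_{-\infty<x<y<\infty} F^*(x)\,[1-F^*(y)]\,\phi'(H^*(x))\,\phi'(H^*(y))\,dG^*(x)\,dG^*(y), \tag{3.7}$$
$$\sigma_2^2 = 2\iint_{-\infty<x<y<\infty} G^*(x)\,[1-G^*(y)]\,\phi'(H^*(x))\,\phi'(H^*(y))\,dF^*(x)\,dF^*(y). \tag{3.8}$$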
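Note that under $H_0$ in (1.2), we have
$$\mu = \mu_0 = \int_0^1 \phi(\lambda_N u + (1-\lambda_N)Q_0(u))\,du, \tag{3.9}$$
$$\sigma_1^2 = \sigma_{10}^2 = 2\iint_{0<u<v<1} u(1-v)\,\phi'(\lambda_N u + (1-\lambda_N)Q_0(u))\,\phi'(\lambda_N v + (1-\lambda_N)Q_0(v))\,dQ_0(u)\,dQ_0(v), \tag{3.10}$$
$$\sigma_2^2 = \sigma_{20}^2 = 2\iint_{0<u<v<1} Q_0(u)\,[1-Q_0(v)]\,\phi'(\lambda_N u + (1-\lambda_N)Q_0(u))\,\phi'(\lambda_N v + (1-\lambda_N)Q_0(v))\,du\,dv. \tag{3.11}$$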
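[The dependence of $\mu$, $\mu_0$, $\sigma_1^2$, $\sigma_{10}^2$, $\sigma_2^2$ and $\sigma_{20}^2$ on $N$ through $\lambda_N$ is understood.] Finally, we assume that there exist a $\lambda_0$ $(0 < \lambda_0 \le \tfrac{1}{2})$ and an $N_0\ (\ge 2)$, such that
$$\lambda_0 \le \lambda_N \le 1-\lambda_0, \qquad \forall\, N \ge N_0. \tag{3.12}$$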
Then we have the following.
Theorem 1. Under the assumptions made above, for every real $x\ (-\infty < x < \infty)$, as $N \to \infty$,
(3.13)
where
(3.14)
and further
(3.15)
The proof of (3.13) follows directly from Theorem 2.3 of Hajek [2] after noting that our $T_N$ is a special case of his statistic (where $c_1 = \cdots = c_m = 1$, $c_{m+1} = \cdots = c_N = 0$), so that his expressions in (2.9) and (2.10) simplify considerably and also our (3.4) ensures his square integrability condition. The second assertion in (3.15) also follows from Hajek's Theorem 2.3. In fact, we have strengthened his square integrability condition to (3.4) with the objective of using Theorem 1 of Hoeffding [4] which ensures the first assertion in (3.15). In view of these, the details are omitted.
Note that Theorem 1 covers both the null and non-null cases. In the null case, $Q_0$ is specified, so that by (3.9)-(3.11), all the quantities $\mu_0$, $\sigma_{10}^2$ and $\sigma_{20}^2$ are also specified, and hence if $\max(\sigma_{10}^2, \sigma_{20}^2) > 0$, then
(3.16)
where
(3.17)
Thus, a large sample test can be based on $Z_N$ using the appropriate percentile point of the standard normal df.
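A small Monte Carlo experiment gives a feel for the large-sample behaviour under $H_0$. The sketch below simulates Example 1 with exponential cells (an arbitrary choice of $F$; the null distribution does not depend on it), uses the scores $i/(N+1)$, and compares the simulated mean of $T_N$ with $\mu_0$ from (3.9). The variance it compares against, $(1-\lambda_N)^2\{\sigma_{10}^2/\lambda_N + \sigma_{20}^2/(1-\lambda_N)\}$, is an assumption reconstructed from the standard Chernoff-Savage type expansion of $N^{1/2}(T_N-\mu_0)$, not a formula copied from the displays above. All function names are hypothetical.

```python
import random


def simulate_t(m, n, k, l, rng):
    """One draw of T_N under H_0: series systems of k and l exponential cells."""
    xs = [min(rng.expovariate(1.0) for _ in range(k)) for _ in range(m)]
    ys = [min(rng.expovariate(1.0) for _ in range(l)) for _ in range(n)]
    pooled = sorted([(x, 0) for x in xs] + [(y, 1) for y in ys])
    N = m + n
    return sum((pos + 1) / (N + 1) for pos, (_, label) in enumerate(pooled) if label == 0) / m


def null_moments(lam, a):
    """mu_0 and the assumed asymptotic variance of N^(1/2) (T_N - mu_0) for phi(u) = u."""
    mu0 = 1 - 0.5 * lam - (1 - lam) / (a + 1)
    s10 = 2 * a * a / ((a + 1) * (2 * a + 1) * (2 * a + 2))
    s20 = 2 * a / ((a + 1) * (a + 2) * (2 * a + 2))
    return mu0, (1 - lam) ** 2 * (s10 / lam + s20 / (1 - lam))


def monte_carlo_check(reps, m, n, k, l, seed):
    """Return (simulated mean, N * simulated variance, mu_0, assumed variance)."""
    rng = random.Random(seed)
    draws = [simulate_t(m, n, k, l, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((t - mean) ** 2 for t in draws) / (reps - 1)
    mu0, sigma2 = null_moments(m / (m + n), l / k)
    return mean, (m + n) * var, mu0, sigma2


if __name__ == "__main__":
    print(monte_carlo_check(reps=400, m=50, n=50, k=2, l=3, seed=12345))
```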
As illustration, we consider the case in Example 1 and Wilcoxon statistic (i.e., $\phi(u) \equiv u$). Then, we have
$$\mu_0 = \int_0^1 \{\lambda_N u + (1-\lambda_N)[1-(1-u)^a]\}\,du = 1 - \tfrac{1}{2}\lambda_N - (1-\lambda_N)/(a+1), \tag{3.18}$$
$$\sigma_{10}^2 = 2a^2 \iint_{0<u<v<1} u(1-v)(1-u)^{a-1}(1-v)^{a-1}\,du\,dv = 2a^2\,[(a+1)(2a+1)(2a+2)]^{-1}, \tag{3.19}$$
$$\sigma_{20}^2 = 2 \iint_{0<u<v<1} [1-(1-u)^a]\,(1-v)^a\,du\,dv = 2a\,[(a+1)(a+2)(2a+2)]^{-1}, \tag{3.20}$$
where $a = \ell/k$. Hence, (3.17) holds whenever $0 < a < \infty$.
4. LOCALLY OPTIMAL RANK TESTS
Here, we shall consider some local alternative hypotheses (relating $G$ to $F$), and in this context, study the optimal choice of score functions. First, we consider a sequence $\{K_N\}$ of Pitman-type translation alternatives, where
$$K_N:\ G(x) = G_{(N)}(x) = F(x - N^{-1/2}\theta), \qquad \theta \text{ real (and fixed)}. \tag{4.1}$$
Also, we assume that
$$\lim_{N\to\infty} \lambda_N = \lambda:\ 0 < \lambda < 1. \tag{4.2}$$
Further, we assume that $\phi(u)$, $Q_1(u)$, $Q_2(u)$ have continuous first order derivatives $\phi'(u)$, $q_1(u)$ and $q_2(u)$ respectively for almost all $u\ (0 < u < 1)$ and $F(x)$ possesses an absolutely continuous probability density function $f(x)$ for almost all $x$. Let $\mu_{(N)}$ be the value of $\mu$ in (3.6) when $K_N$ holds. Then, by some standard steps, it follows that
(4.3)
$$\lim_{N\to\infty} \sigma_{1(N)}^2 = \bar\sigma_{10}^2 \quad \text{and} \quad \lim_{N\to\infty} \sigma_{2(N)}^2 = \bar\sigma_{20}^2, \tag{4.4}$$
where $\sigma_{1(N)}^2$ and $\sigma_{2(N)}^2$ are defined by (3.7) and (3.8) with $G^*(x)$ being replaced by $Q_2(F(x - N^{-1/2}\theta))$, and $\bar\sigma_{10}^2$ and $\bar\sigma_{20}^2$ are defined by (3.10), (3.11) with $\lambda_N$ being replaced by $\lambda$. Hence, under $\{K_N\}$, $N^{1/2}(T_N - \mu_0)/\sigma_{N0}$ has asymptotically a normal distribution with unit variance and mean
(4.5)
where both $\bar\sigma_{10}^2$ and $\bar\sigma_{20}^2$ also depend on $\lambda$. Thus, an optimal choice of $\phi$ should maximize (4.5) (for a fixed $\theta$) and, in general, this depends on $F$, $Q_1$, $Q_2$ and $\lambda$ in a rather involved way. In particular when $Q_1(u) \equiv Q_2(u) \equiv Q(u)$ $(\Rightarrow q_1(u) \equiv q_2(u) = q(u))$ for all $0 < u < 1$, but $q(u)$ not necessarily equal to 1, (4.5) reduces to
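$$\lambda(1-\lambda)\,\theta \left[\int_{-\infty}^{\infty} f^2(x)\,q^2(F(x))\,\phi'(Q(F(x)))\,dx\right] \Big/ \left\{\int_0^1 \phi^2(u)\,du - \left[\int_0^1 \phi(u)\,du\right]^2\right\}^{1/2}. \tag{4.6}$$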
If we define
$$\psi(u) = -\left[f(x)\,q'(F(x))/q(F(x)) + f'(x)/f(x)\right]_{Q(F(x))=u} = -\left[(d/dx)\log\{q(F(x))f(x)\}\right]_{Q(F(x))=u}, \qquad 0 < u < 1, \tag{4.7}$$
and note that $\int_0^1 \psi(u)\,du = 0$, we obtain then by partial integration on the numerator of (4.6) that (4.6) is equal to
(4.8)
where $\bar\phi = \int_0^1 \phi(u)\,du$. Thus, here an optimal choice of $\phi$ is
$$\phi(u) \equiv \psi(u), \qquad 0 < u < 1. \tag{4.9}$$
This is a direct extension of the parallel result for the classical two-sample problem where $q(u) \equiv 1$ so that $\psi(u) = -f'(F^{-1}(u))/f(F^{-1}(u))$, $0 < u < 1$.
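The score function in (4.7) can be evaluated numerically for a given pair $(F, Q)$. The sketch below does this for a logistic $F$ (an illustrative choice, not one made in the paper) by solving $Q(F(x)) = u$ with bisection and differentiating $\log\{q(F(x))f(x)\}$ by central differences. The function names, the bracketing interval and the step size are assumptions. For the logistic $F$ and $Q(u) = 1-(1-u)^k$, a direct calculation gives $\psi(u) = k - (k+1)(1-u)^{1/k}$, which reduces to the Wilcoxon-type score $2u - 1$ when $k = 1$.

```python
from math import exp, log


def logistic_cdf(x):
    return 1.0 / (1.0 + exp(-x))


def logistic_pdf(x):
    p = logistic_cdf(x)
    return p * (1.0 - p)


def optimal_score(u, F, f, Q, q, lo=-40.0, hi=40.0, step=1e-5):
    """psi(u) = -(d/dx) log{q(F(x)) f(x)} evaluated at the x solving Q(F(x)) = u."""
    for _ in range(200):  # bisection on x; Q(F(x)) is increasing in x
        mid = 0.5 * (lo + hi)
        if Q(F(mid)) < u:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)

    def log_density(t):
        return log(q(F(t)) * f(t))

    return -(log_density(x + step) - log_density(x - step)) / (2.0 * step)


if __name__ == "__main__":
    k = 3
    Q = lambda u: 1 - (1 - u) ** k
    q = lambda u: k * (1 - u) ** (k - 1)
    for u in (0.1, 0.5, 0.9):
        numeric = optimal_score(u, logistic_cdf, logistic_pdf, Q, q)
        print(u, numeric, k - (k + 1) * (1 - u) ** (1 / k))
```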
Let us next consider a sequence $\{K_N\}$ of scale alternatives, where
$$K_N:\ G(x) = G_{(N)}(x) = F(x(1 + N^{-1/2}\theta)), \qquad \theta \text{ real (and fixed)} \tag{4.10}$$
(with $\theta > -K$, $K \ge 1$, for every $N \ge K^2$). In this case, in (4.3), the integral has to be replaced by
(4.11)
and a similar change is needed in (4.6).
Similarly, in (4.7) and (4.9), $\psi(u)$ will have to change to
$$\psi^*(u) = -1 - \left[x\{(d/dx)\log q(F(x))f(x)\}\right]_{Q(F(x))=u} = -1 + \{F^{*-1}(u)\}\,\psi(u), \qquad 0 < u < 1, \ \text{where } F^*(x) = Q(F(x)). \tag{4.12}$$
In either case, it may be observed that for $Q_1 \equiv Q_2$, though the null distribution of $T_N$ agrees with that in the classical two-sample problem, the asymptotic power function and the optimal score function are generally different and depend on $q(u)$ as well as $f(x)$.
5. ESTIMATION BASED ON $T_N$
In the classical two-sample problem, the problem of estimation based on linear rank statistics has been treated by Hodges and Lehmann [3] and Sen [7]. In view of the results in Section 3, we shall extend it to the extended two-sample problem as follows.
Suppose now that (1.1) holds with
$$G(x) = F(x+\theta), \qquad \text{where } \theta \text{ (real) is unknown}, \tag{5.1}$$
and our problem is to estimate $\theta$ (based on $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$). Let us denote then
$$H_\theta(x) = \lambda_N F^*(x) + (1-\lambda_N)\,G^*(x) = \lambda_N Q_1(F(x)) + (1-\lambda_N)\,Q_2(F(x+\theta)), \tag{5.2}$$
and since $Q_1, Q_2$ are non-decreasing, it follows that $H_\theta(x)$ is also non-decreasing in $\theta$. Also, $P\{Y_i + \theta \le x\} = G^*(x-\theta) = Q_2(F(x))$, $\forall\, i \ge 1$. Thus, $T_N(X_1, \ldots, X_m, Y_1+\theta, \ldots, Y_n+\theta)$ has the same distribution as of $T_N = T(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ under $\theta = 0$. On the other hand, if $a_N(1) \le \cdots \le a_N(N)$, then the rank of $X_i$ among $X_1, \ldots, X_m, Y_1+a, \ldots, Y_n+a$ (denoted by $R_{Ni}(a)$) is non-increasing in $a\ (-\infty < a < \infty)$ for every $1 \le i \le m$. Hence, $T_N(a) = T_N(X_1, \ldots, X_m, Y_1+a, \ldots, Y_n+a)$ is also non-increasing in $a$. Thus, if we denote the right hand side of (2.6) by $\mu_{N0}$, then by alignment, we may consider the following estimator.
Let
(5.3)
(5.4)
The proposed estimator is
(5.5)
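The alignment idea can be sketched in code. The construction below is the usual Hodges-Lehmann type inversion, taken here as an assumption about the form of the estimator: it returns the midpoint of $\sup\{a: T_N(a) > c\}$ and $\inf\{a: T_N(a) < c\}$ for a supplied centering value $c$ (in the text, $\mu_{N0}$). Since $T_N(a)$ is a step function whose jumps occur only at the differences $X_i - Y_j$, it suffices to probe one point between consecutive differences. The function names, the tolerance and the no-ties assumption are illustrative.

```python
def rank_statistic_shifted(xs, ys, shift, score):
    """T_N(a): average score of the X ranks among X_1..X_m, Y_1+a..Y_n+a."""
    shifted = [y + shift for y in ys]
    total_n = len(xs) + len(ys)
    total = 0.0
    for x in xs:
        rank = 1 + sum(z < x for z in xs) + sum(z < x for z in shifted)
        total += score(rank, total_n)
    return total / len(xs)


def aligned_shift_estimate(xs, ys, center, score, tol=1e-12):
    """Midpoint of sup{a: T_N(a) > center} and inf{a: T_N(a) < center}."""
    cuts = sorted(x - y for x in xs for y in ys)  # T_N(a) can only jump at these points
    probes = [cuts[0] - 1.0]
    probes += [0.5 * (c1 + c2) for c1, c2 in zip(cuts, cuts[1:])]
    probes.append(cuts[-1] + 1.0)
    values = [rank_statistic_shifted(xs, ys, p, score) for p in probes]
    above = [idx for idx, v in enumerate(values) if v > center + tol]
    below = [idx for idx, v in enumerate(values) if v < center - tol]
    if not above or not below:
        raise ValueError("T_N(a) never crosses the centering value")
    sup_above = cuts[max(above)]      # right end of the last interval with T_N > center
    inf_below = cuts[min(below) - 1]  # left end of the first interval with T_N < center
    return 0.5 * (sup_above + inf_below)


if __name__ == "__main__":
    xs = [1.0, 2.0, 4.0]
    ys = [0.5, 3.0]
    wilcoxon = lambda rank, total_n: rank / (total_n + 1)
    # Classical case Q_1 = Q_2 with scores i/(N+1): the centering value is 1/2.
    print(aligned_shift_estimate(xs, ys, 0.5, wilcoxon))
```

In the classical case with these scores the result coincides with the median of the differences $X_i - Y_j$; in the extended problem only the centering value changes.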
As in the case of the classical two-sample problem, $\hat\theta_N$ is a translation-invariant, robust and consistent estimator of $\theta$. Further by virtue of Theorem 1 and the asymptotic simplifications made in Section 4, the proof of Theorem 4 of Hodges and Lehmann [3] can be directly adapted and this yields that under the assumptions made in Sections 3 and 4, as $N \to \infty$,
(5.6)
where $B(F, Q_1, Q_2, \phi, \lambda)$ is defined by (4.3). For instance, in Example 1, if we use the Wilcoxon scores (i.e., $\phi(u) = u$), we obtain from (3.19), (3.20), (4.3) and (5.6) that
[...]
where
$$\gamma_{k\ell} = \left[\frac{a}{(a+1)^2}\left\{\frac{a}{\lambda(2a+1)} + \frac{1}{(1-\lambda)(a+2)}\right\}\right]^{1/2} \Big/ \left\{k\ell \int_{-\infty}^{\infty} f^2(x)\,(1-F(x))^{k+\ell-2}\,dx\right\}$$
$$= \left[\frac{1}{k\ell(k+\ell)^2}\left\{\frac{\ell}{\lambda(2\ell+k)} + \frac{k}{(1-\lambda)(\ell+2k)}\right\}\right]^{1/2} \Big/ \left\{\int_{-\infty}^{\infty} f^2(x)\,(1-F(x))^{k+\ell-2}\,dx\right\}.$$
Similar expressions can be derived for the second example. In this context, we may refer to Brown [1] for some related estimation problems based on the maximum likelihood procedure. A natural question arises whether $\gamma_{k\ell}$ is a minimum for $k = \ell = 1$? In general, it need not be.
6. SOME GENERAL REMARKS
By virtue of (1.3), it is intuitively appealing to consider a Kolmogorov-Smirnov type test statistic:
$$D_N = \sup_x N^{1/2}\,\left|G_n^*(x) - Q_0(F_m^*(x))\right| \tag{6.1}$$
(or the one-sided case) where the empirical df's are defined by (3.1).
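The statistic in (6.1) is straightforward to compute, since both empirical df's are right-continuous step functions and the supremum is therefore attained at one of the observed values. The following is a minimal sketch under that observation; the function name is hypothetical and $Q_0$ is passed in as a callable with $Q_0(0) = 0$.

```python
from bisect import bisect_right
from math import sqrt


def ks_type_statistic(xs, ys, q0):
    """D_N = sup_x N^(1/2) |G_n*(x) - Q_0(F_m*(x))|, evaluated at the pooled sample points."""
    m, n = len(xs), len(ys)
    sx, sy = sorted(xs), sorted(ys)
    worst = 0.0  # value of the difference to the left of all observations
    for point in sx + sy:
        f_m = bisect_right(sx, point) / m  # empirical df of the X's at this point
        g_n = bisect_right(sy, point) / n  # empirical df of the Y's at this point
        worst = max(worst, abs(g_n - q0(f_m)))
    return sqrt(m + n) * worst


if __name__ == "__main__":
    xs = [1.0, 3.0]
    ys = [2.0, 4.0]
    print(ks_type_statistic(xs, ys, lambda u: u))                  # classical case
    print(ks_type_statistic(xs, ys, lambda u: 1 - (1 - u) ** 1.5))  # Example 1 with a = 1.5
```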
With the quantities defined in (2.3), we have
(6.2)
As such by using (2.2) and the one-to-one correspondence between $\mathbf{r}_N$ and the quantities in (2.3) for $1 \le r \le N$, the small sample distribution of $D_N$ can be obtained (under $H_0$). For large $m$, $n$ this becomes quite complicated. In fact, if we consider a stochastic process $W_N = \{W_N(u),\ 0 \le u \le 1\}$ by letting
(6.3)
then under $H_0$ in (1.2),
(6.4)
It can be shown by standard steps that under $H_0$ in (1.2),
$$W_N \xrightarrow{\ \mathcal{D}\ } W = \{W(u),\ 0 \le u \le 1\}, \qquad \text{as } N \to \infty,$$
where $W$ is Gaussian with $EW(u) = 0$, $0 \le u \le 1$, and the covariance $EW(u)W(v)$, for $0 \le u \le v \le 1$, is as in
(6.6)
In general (for $Q_0(u) \not\equiv u$), (6.6) differs from the covariance structure of a Brownian bridge, and hence, the asymptotic distribution theory of the classical two-sample Kolmogorov-Smirnov statistic does not hold in our case. For $W$, the boundary crossing probabilities for general $Q_0$ are not precisely known, and hence, the prospect of using $D_N$ as a large sample test statistic does not appear to be very bright. A similar criticism applies to the Cramér-von Mises type test based on $W_N$.
It may also be intuitively appealing to use $\int Q_0(F_m^*(x))\,dG_n^*(x)$ (or more generally, $\int \phi(Q_0(F_m^*(x)))\,dG_n^*(x)$) as a test statistic. The development of the asymptotic distribution theory of such a statistic poses no serious problem and can be made by the usual expansion of the functions $F_m^*$ and $G_n^*$ around $F^*$ and $G^*$ respectively, and then using the Gaussian nature of $m^{1/2}[F_m^* - F^*]$ and $n^{1/2}[G_n^* - G^*]$; the techniques are very similar to the ones employed in Section 3.6 of Puri and Sen [6] and hence, the details are omitted.
REFERENCES
[1] Brown, G.B., "Estimation from smallest values," Communications in Statistics (1978), to appear.
[2] Hajek, J., "Asymptotic normality of simple linear rank statistics," Annals of Mathematical Statistics 39 (1968), 324-46.
[3] Hodges, J.L. Jr., and Lehmann, E.L., "Estimates of location based on rank tests," Annals of Mathematical Statistics 34 (1963), 598-611.
[4] Hoeffding, W., "On the centering of a simple linear rank statistic," Annals of Statistics 1 (1973), 54-66.
[5] Lehmann, E.L., "The power of rank tests," Annals of Mathematical Statistics 24 (1953), 23-43.
[6] Puri, M.L., and Sen, P.K., Nonparametric Methods in Multivariate Analysis. John Wiley: New York, 1971.
[7] Sen, P.K., "On the estimation of relative potency in dilution (-direct) assays by distribution-free methods," Biometrics 19 (1963), 532-52.