Sen, P.K. and Chatterjee, S.K., "On the Kolmogorov-Smirnov-type test of symmetry."

ON KOLMOGOROV-SMIRNOV-TYPE TESTS FOR SYMMETRY
by
Shoutir Kishore Chatterjee and Pranab Kumar Sen
University of Calcutta, and University of North Carolina
Institute of Statistics Mimeo Series No. 645
ON KOLMOGOROV-SMIRNOV-TYPE TESTS FOR SYMMETRY*
BY SHOUTIR KISHORE CHATTERJEE AND PRANAB KUMAR SEN
University of Calcutta, and University of North Carolina
0. Summary. For testing the hypothesis of symmetry (about a specified point), a simple Kolmogorov-Smirnov-type test is proposed. The exact and asymptotic (null hypothesis) distributions of some allied statistics are obtained, and the Bahadur efficiency of the test is studied.

1. Introduction.
Let $\{X_i\}$ be a sequence of independent real-valued random variables with continuous distribution functions (df) $\{F_i(x)\}$, all defined on $(-\infty,\infty)$ and not necessarily identical. Based on a sample $(X_1,\ldots,X_n)$, we want to test the null hypothesis $(H_0)$ that all the df $F_1,\ldots,F_n$ are symmetric around their respective (specified) medians. Without any loss of generality, we may take all these medians to be equal to zero, and thus frame $H_0$ as

(1.1)  $H_0\colon\ F_i(x)+F_i(-x)=1,\ \forall\,x\ge 0$, and $i=1,\ldots,n$.

Let $c(u)$ be equal to 0 or 1 according as $u<$ or $\ge 0$, and let

(1.2)  $F_n^*(x)=n^{-1}\sum_{i=1}^{n}c(x-X_i),\qquad -\infty<x<\infty.$

Thus, $F_n^*$ is the empirical df and it estimates unbiasedly (a.e.) the average df $F_{(n)}=n^{-1}\sum_{i=1}^{n}F_i$. In testing the null hypothesis (1.1), we are interested in the following alternative hypotheses:

(1.3)

(1.4)
* Work supported by the National Institute of Health, Public Health Service,
Grant GM-12868.
When $F_1=\cdots=F_n=F$, (1.3) [(1.4)] relates to one- [two-] sided skewness. In many practical problems, though it may be unwise to impose the restriction that $F_1=\cdots=F_n=F$, it may not be unreasonable to assume that $F_1,\ldots,F_n$ have a common pattern of skewness when (1.3) does not hold. For example, let $F_i(x)=F_i^0((x-\mu_i)/\sigma_i)$ and $F_i^0\in\mathcal F_0$, $i=1,\ldots,n$, where $\mathcal F_0=\{F\colon F(x)+F(-x)=1$ a.e.$\}$, and the $\mu_i$ and $\sigma_i$ are the location and scale parameters. If the $\mu_i$ all have the same sign, $F_1,\ldots,F_n$ are either all positively or all negatively skewed, no matter what their forms and the $\sigma_i\,(>0)$ may be. We also note that whereas in the classical one-sample goodness-of-fit problem (where the Kolmogorov-Smirnov-type tests apply) we require to assume the form of the true df, the same is not needed here. Also, unlike the one-sample location problem, we are not necessarily confining ourselves only to translation alternatives. In fact, even if all the medians of the df $F_1,\ldots,F_n$ are equal to 0, the alternatives in (1.3) and (1.4) may hold. In addition, the homogeneity of $F_1,\ldots,F_n$ is totally inessential.
For testing the null hypothesis, we consider the following Kolmogorov-Smirnov-type statistics, whose appeals are evident from (1.3) and (1.4):

(1.5)  $D_n^+=\sup_{x>0}\,[F_n^*(x)+F_n^*(-x-)-1],\qquad D_n^-=\sup_{x>0}\,[1-F_n^*(x)-F_n^*(-x-)];$

(1.6)  $D_n=\max[D_n^+,D_n^-]=\sup_{x>0}\,|F_n^*(x)+F_n^*(-x-)-1|.$

Note that $F_n^*$ is a step function, and hence, to avoid some complications in the distribution theory, we have taken $F_n^*(-x-)$ for $F_n^*(-x)$, $x>0$.

The small sample null distributions of $D_n^+$, $D_n^-$ and $D_n$ are deduced in Section 2, and tabulated too, for $n\le 16$. Section 3 deals with asymptotic null distributions of these statistics. The last section is devoted to the study of the Bahadur efficiency of the test based on $D_n$ with respect to the sign test.
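Since $F_n^*$ is a step function, the suprema in (1.5)-(1.6) are attained among the points $x=|X_i|$, approached from either side; a minimal computational sketch (the function name is ours):

```python
from bisect import bisect_left, bisect_right

def d_statistics(x):
    """D_n^+, D_n^- and D_n of (1.5)-(1.6) for a sample x.

    F*_n(t) = #{X_i <= t}/n and F*_n(-t-) = #{X_i < -t}/n; the step
    function F*_n(x) + F*_n(-x-) - 1 (x > 0) changes only at x = |X_i|,
    so it suffices to examine those points from both sides.
    """
    n = len(x)
    sx = sorted(x)
    vals = [0.0]  # the step function vanishes for x beyond max |X_i|
    for a in sorted(abs(v) for v in x):
        # value at x = a ...
        vals.append(bisect_right(sx, a) / n + bisect_left(sx, -a) / n - 1.0)
        # ... and just below x = a
        vals.append(bisect_left(sx, a) / n + bisect_right(sx, -a) / n - 1.0)
    d_plus, d_minus = max(vals), -min(vals)
    return d_plus, d_minus, max(d_plus, d_minus)
```

For the sample $(1,-2,3)$ this gives $D_3^+=0$ and $D_3^-=D_3=1/3$.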
2. Since $F_n^*$ is a step function assuming the values $i/n$, $i=1,\ldots,n$, the process $\{n[F_n^*(x)+F_n^*(-x-)-1]\colon x>0\}$ can only assume the integer values between $-n$ and $n$. Thus, the permissible values of $nD_n^+$, $nD_n^-$ and $nD_n$ are the integers $0,1,\ldots,n$, but not all of these are admissible. We denote by $\mathbf F_n=(F_1,\ldots,F_n)$, and

(2.1)  $\mathcal F_n^0=\{\mathbf F_n\colon F_i\in\mathcal F_0,\ i=1,\ldots,n\}.$

Then, we have the following theorem.
Theorem 2.1. For every $\mathbf F_n\in\mathcal F_n^0$ and $k=1,\ldots,n$,

(2.2)  $P\{nD_n^+\ge k\}=P\{nD_n^-\ge k\}=2\sum_{i=0}^{s}\binom{n}{i}2^{-n}-\delta_k\binom{n}{s}2^{-n},$

where $s=[\tfrac12(n-k)]$ is the largest integer contained in $\tfrac12(n-k)$, and $\delta_k=0$ or 1 according as $n-k$ is odd or even, and

(2.3)  $P\{nD_n\ge k\}=1,\ k=0,1;\qquad P\{nD_n\ge k\}=2\sum_{j=0}^{u}(-1)^{j}P\{nD_n^+\ge(2j+1)k\},\ k>1,$

where $u=[(n-k)/2k]$ is the largest integer with $(2u+1)k\le n$.
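The probabilities (2.2)-(2.3) are directly computable; a sketch (the series in (2.3) is summed over all $j$ with $(2j+1)k\le n$, the later terms being zero):

```python
from math import comb

def p_plus(n, k):
    """P{n D_n^+ >= k} = P{n D_n^- >= k} of (2.2)."""
    if k <= 0:
        return 1.0
    s = (n - k) // 2                       # s = [ (n - k)/2 ]
    delta = 1 if (n - k) % 2 == 0 else 0   # delta_k
    return (2 * sum(comb(n, i) for i in range(s + 1)) - delta * comb(n, s)) / 2 ** n

def p_abs(n, k):
    """P{n D_n >= k} of (2.3)."""
    if k <= 1:
        return 1.0
    total, j = 0.0, 0
    while (2 * j + 1) * k <= n:            # remaining terms vanish
        total += (-1) ** j * p_plus(n, (2 * j + 1) * k)
        j += 1
    return 2 * total
```

For example, `p_plus(3, 1)` returns .625 and `p_abs(4, 4)` returns .125, agreeing with the tables below.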
Proof. Let $Y_1\ge\cdots\ge Y_n$ be the ordered values of $|X_1|,\ldots,|X_n|$, arranged in descending order of magnitude. Let $t_{n,i}=F_{(n)}(-Y_i)$ for $1\le i\le n$, so that $0\le t_{n,1}\le t_{n,2}\le\cdots\le t_{n,n}\le F_{(n)}(0)=\tfrac12$ (as $\mathbf F_n\in\mathcal F_n^0\Longrightarrow F_{(n)}\in\mathcal F_0$). Since $F_1,\ldots,F_n$ are symmetric and continuous, ties among $|X_1|,\ldots,|X_n|$, and hence among $t_{n,1},\ldots,t_{n,n}$, can be neglected in probability. Thus,

(2.4)  $0<t_{n,1}<t_{n,2}<\cdots<t_{n,n}<\tfrac12$, in probability.

Define then $V_n(t)=n^{1/2}[G_n^*(t)-t]$, $0\le t\le 1$, where $G_n^*(t)=n^{-1}\sum_{i=1}^{n}c(t-F_{(n)}(X_i))$, and let

(2.5)  $V_n^*(t)=V_n(t-)+V_n(1-t),\qquad 0\le t\le\tfrac12.$
For $t<t_{n,1}$, $n^{1/2}V_n^*(t)=0$. At $t=t_{n,1}+$, $n^{1/2}V_n^*(t)$ is either $+1$ or $-1$, depending upon whether the random variable $X_i$ associated with $Y_1$ has negative or positive sign. The process $n^{1/2}V_n^*(t)$ continues to have the same value until $t=t_{n,2}+$, where it makes another jump of $+1$ or $-1$, depending on whether the $X_i$ associated with $Y_2$ is negative or not. And thus the process continues. Hence, on $I=(0,\tfrac12)$, $n^{1/2}V_n^*(t)$ makes $n$ jumps (at $t_{n,1},\ldots,t_{n,n}$) and each jump is either $+1$ or $-1$.

Let $p_{ij}=P\{Y_{n-i+1}=|X_j|\}$, $i,j=1,\ldots,n$ (thus $\sum_{j=1}^{n}p_{ij}=1$, $i=1,\ldots,n$). Since, for $\mathbf F_n\in\mathcal F_n^0$, the df of $X_j$ is symmetric about 0, $1\le j\le n$,

(2.6)  $P\{Y_{n-i+1}\ \text{corresponds to a positive}\ X_j\}=\tfrac12,$

as the distribution of $\operatorname{sign}X_i$ is independent of $|X_i|$ when $F_i\in\mathcal F_0$, $i=1,\ldots,n$. Thus, the jumps ($+1$ or $-1$) at the $t_{n,i}$ are both equally likely, with probability $\tfrac12$. Moreover, for $\mathbf F_n\in\mathcal F_n^0$, the vector $(\operatorname{sign}X_1,\ldots,\operatorname{sign}X_n)$ is distributed independently of $(|X_1|,\ldots,|X_n|)$, and $\operatorname{sign}X_1,\ldots,\operatorname{sign}X_n$ are also mutually stochastically independent. Hence, the jumps of $n^{1/2}V_n^*(t)$ at $t_{n,1},\ldots,t_{n,n}$ are mutually independent. Finally, the values of $nD_n^+$ $(=\sup_{t\in I}n^{1/2}V_n^*(t))$, $nD_n^-$ $(=\sup_{t\in I}[-n^{1/2}V_n^*(t)])$ and $nD_n$ $(=\sup_{t\in I}|n^{1/2}V_n^*(t)|)$ are independent of the particular realization of $\mathbf t_n=(t_{n,1},\ldots,t_{n,n})\in I$. Hence, we conclude that (i) the distribution of $nD_n^+$ (or $nD_n^-$) (under $H_0$) is the same as that of the maximum positive (or negative) displacement in $n$ steps of a symmetric random walk starting from the origin, and (ii) the distribution of $nD_n$ agrees with that of the corresponding maximum absolute displacement. Thus, the distribution problem is reduced to that of a symmetric random walk problem. Note that $nD_n^+$ and $nD_n^-$ are both non-negative, and hence, $P\{nD_n^+\ge 0\}=P\{nD_n^-\ge 0\}=1$. Also, at $t_{n,1}$, $n^{1/2}V_n^*(t)$ is either $+1$ or $-1$. Hence, $nD_n\ge 1$ with probability one. So, to prove (2.2), we consider $k\ge 1$, and for (2.3), $k\ge 2$.
We now use Theorem 1 (Section 8) of Takács (1967, p. 24), and obtain that for $k\ge 1$,

(2.7)  $P\{nD_n^+\ge k\}=\sum_{j=k}^{n}\frac{k}{j}\,P\{N_j=j-k\},$

where

(2.8)  $P\{N_j=j-k\}=\binom{j}{r}2^{-j}$ if $j-k=2r$, $r=0,1,2,\ldots$, and $=0$ if $j-k=2r+1$; $j\ge k\ge 1$.

Using an alternative standard expression given in Uspensky (1937, p. 149), (2.7) can be written as

(2.9)  $P\{nD_n^+\ge k\}=2^{-(n-1)}\sum_{t=0}^{s}\binom{n}{t}-\delta_k\binom{n}{s}2^{-n},$

where $s$ and $\delta_k$ are defined after (2.2). Thus, the proof of (2.2) is completed.

Writing now $Q^+(a,n)$ (or $Q(a,n)$) for the probability that a particle starting a symmetric random walk at the origin with the absorbing barrier at $a$ (or barriers at $\pm a$), $a>0$, will be absorbed at the barrier in course of time $n$, we have

(2.10)  $P\{nD_n\le k'\}=1-2Q(k'+1,n);$

(2.11)  $P\{nD_n^+\le k'\}=1-Q^+(k'+1,n).$

Also, from Uspensky (1937, p. 156), we obtain that

(2.12)  $Q(k'+1,n)=Q^+(k'+1,n)-Q^+(3k'+3,n)+Q^+(5k'+5,n)-\cdots+(-1)^{u}Q^+((2u+1)(k'+1),n),$

where $u$ is as in (2.3) with $k=k'+1$. Then, (2.3) readily follows from (2.10), (2.11), (2.12) and (2.2), by letting $k'=k-1$. Q.E.D.

The probabilities in (2.2) and (2.3) are quite simple to compute, and are presented below for $n\le 16$.
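An independent spot check of individual entries follows from the random-walk reduction in the proof of Theorem 2.1: enumerate the walk exactly, by dynamic programming over (position, running maximum). A sketch:

```python
from collections import defaultdict

def walk_tail_exact(n, k):
    """P{max_{0<=i<=n} S_i >= k} for a symmetric +/-1 random walk of n
    steps, i.e. the null distribution of n D_n^+ by Theorem 2.1."""
    dist = {(0, 0): 1.0}                  # (position, max so far) -> probability
    for _ in range(n):
        new = defaultdict(float)
        for (s, m), p in dist.items():
            for step in (1, -1):
                t = s + step
                new[(t, max(m, t))] += p / 2
        dist = new
    return sum(p for (_, m), p in dist.items() if m >= k)
```

For instance, `walk_tail_exact(5, 1)` returns $11/16=.688$ and `walk_tail_exact(8, 3)` returns $74/256\approx .289$, agreeing with the corresponding entries of Table 1.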
TABLE 1

Values of $P\{nD_n^+\ge k\}=P\{nD_n^-\ge k\}$ for $k\le n\le 16$

 n\k |  0    1     2     3     4     5     6     7     8     9    10*    11     12     13     14      15
   2 |  1  .500  .250
   3 |  1  .625  .250  .125
   4 |  1  .625  .375  .125  .063
   5 |  1  .688  .375  .219  .063  .031
   6 |  1  .688  .453  .219  .125  .031  .016
   7 |  1  .727  .453  .289  .125  .072  .016  .008
   8 |  1  .727  .508  .289  .235  .072  .039  .008  .004
   9 |  1  .752  .508  .344  .235  .111  .039  .022  .004  .002
  10 |  1  .752  .549  .344  .282  .111  .065  .022  .012  .002  .0010
  11 |  1  .772  .549  .388  .282  .147  .065  .039  .012  .006  .0010  .0005
  12 |  1  .772  .581  .388  .317  .147  .092  .039  .023  .006  .0034  .0005  .0002
  13 |  1  .789  .581  .423  .317  .181  .092  .057  .023  .013  .0034  .0018  .0002  .0001
  14 |  1  .789  .607  .423  .329  .181  .119  .057  .035  .013  .0074  .0018  .0009  .0001  .00005
  15 |  1  .803  .607  .436  .329  .211  .119  .077  .035  .019  .0074  .0032  .0009  .0004  .00005  .00003
  16 |  1  .803  .629  .436  .340  .211  .143  .077  .049  .019  .0106  .0032  .0023  .0004  .00026  .00003

* Values are correct to 4 decimal places for $k\ge 10$, and three decimal places for $k\le 9$.
TABLE 2

Values of $P\{nD_n\ge k\}$ for $1\le k\le n\le 16$

 n\k |  1    2     3     4     5     6     7     8     9    10*    11     12     13     14     15
   2 |  1  .500
   3 |  1  .500  .250
   4 |  1  .750  .250  .125
   5 |  1  .750  .438  .125  .063
   6 |  1  .875  .438  .250  .063  .031
   7 |  1  .875  .598  .250  .143  .031  .015
   8 |  1  .938  .598  .470  .143  .078  .015  .008
   9 |  1  .938  .684  .470  .221  .078  .043  .008  .004
  10 |  1  .970  .684  .563  .221  .131  .043  .023  .004  .0020
  11 |  1  .970  .764  .563  .294  .131  .077  .023  .013  .0020  .0010
  12 |  1  .984  .764  .633  .294  .185  .077  .045  .013  .0068  .0010  .0005
  13 |  1  .984  .820  .633  .362  .185  .115  .045  .026  .0068  .0036  .0005  .0003
  14 |  1  .992  .820  .656  .362  .237  .115  .070  .026  .0148  .0036  .0018  .0003  .0001
  15 |  1  .992  .825  .656  .423  .237  .154  .070  .037  .0148  .0064  .0018  .0008  .0001  .0001
  16 |  1  .994  .825  .678  .423  .286  .154  .098  .037  .0212  .0064  .0046  .0008  .0003

* Correct to 4 decimal places for $k\ge 10$ and up to three decimal places for $k\le 9$.
3. We consider first the null case. Here, we provide asymptotic expressions for (a) $P\{n^{1/2}D_n^+\ge y\}$, $P\{n^{1/2}D_n^-\ge y\}$ and (b) $P\{D_n^+\ge\varepsilon\}$, $P\{D_n^-\ge\varepsilon\}$ and $P\{D_n\ge\varepsilon\}$, where $y\,(0<y<\infty)$ and $\varepsilon\,(0<\varepsilon<1)$ are fixed. For this, let

(3.1)  $\Phi(y)=(2\pi)^{-1/2}\int_{-\infty}^{y}\exp(-\tfrac12 t^{2})\,dt,\qquad -\infty<y<\infty.$

Then, we have the following theorem.

Theorem 3.1. For every fixed $y\,(0<y<\infty)$, under $H_0$ (i.e., for $\mathbf F_n\in\mathcal F_n^0$),

(3.2)  $\lim_{n\to\infty}P\{n^{1/2}D_n^+\ge y\}=2\Phi(-y);$

(3.3)  $\lim_{n\to\infty}P\{n^{1/2}D_n\ge y\}=4\bigl[\textstyle\sum_{k=1}^{\infty}(-1)^{k-1}\Phi(-(2k-1)y)\bigr].$

Proof. Let $r_n$ be the number of successes in $n$ independent Bernoullian trials with probability $\tfrac12$. Then, by (2.2),

(3.4)  $P\{n^{1/2}D_n^+\ge y\}=2P\{r_n\le s_n\}-\delta_k P\{r_n=s_n\}=2P\{n^{-1/2}(2r_n-n)\le n^{-1/2}(2s_n-n)\}-\delta_k P\{r_n=s_n\},$

where $s_n=[\tfrac12(n-n^{1/2}y)]$, so that $n^{-1/2}(2s_n-n)\to-y$. Also, by the De Moivre-Laplace theorem, $P\{n^{-1/2}(2r_n-n)\le n^{-1/2}(2s_n-n)\}$ tends to $\Phi(-y)$ as $n\to\infty$. Hence, (3.2) follows from (3.4). A similar proof applies to (3.3). Q.E.D.

Remark. By standard arguments [such as in Feller (1965, p. 230)], one could have approximated the random walk of Section 2 by a Brownian movement process, and then used the well-known results on the maximum (or absolute maximum) displacement of such a process [viz., Parthasarathy (1967, pp. 224-230), particularly Corollaries 5.1 and 5.2] to give an alternative derivation for the proofs of (3.2) and (3.3). For simplicity of presentation, we do not consider this approach.
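The limits (3.2)-(3.3) can be examined numerically against the exact tail (2.2) at $k=n^{1/2}y$; a sketch:

```python
from math import comb, erf, sqrt

def Phi(y):
    """Standard normal df of (3.1)."""
    return 0.5 * (1 + erf(y / sqrt(2)))

def p_plus(n, k):
    """Exact P{n D_n^+ >= k}, as in (2.2)."""
    s = (n - k) // 2
    delta = 1 if (n - k) % 2 == 0 else 0
    return (2 * sum(comb(n, i) for i in range(s + 1)) - delta * comb(n, s)) / 2 ** n

def limit_one_sided(y):
    """Right-hand side of (3.2)."""
    return 2 * Phi(-y)

def limit_two_sided(y, terms=50):
    """Right-hand side of (3.3); the series converges very fast."""
    return 4 * sum((-1) ** (k - 1) * Phi(-(2 * k - 1) * y) for k in range(1, terms + 1))

# e.g. p_plus(400, 20), the exact tail at k = sqrt(400) * 1, is already
# close to the limiting value 2 * Phi(-1)
```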
Let us now define, for every $\varepsilon\colon 0<\varepsilon<\tfrac12$,

(3.5)  $\rho(\varepsilon)=(1+2\varepsilon)^{-(1+2\varepsilon)/2}\,(1-2\varepsilon)^{-(1-2\varepsilon)/2};$

(3.6)  equivalently, $\rho(\varepsilon)=\inf_{t>0}\,\{e^{-2t\varepsilon}\cosh t\}.$

It is then easy to verify that $\rho(\varepsilon)$ is strictly decreasing in $\varepsilon\colon 0<\varepsilon<\tfrac12$, with $\rho(0)=1$ and $\lim_{\varepsilon\to 1/2}\rho(\varepsilon)=\tfrac12$. Hence, for any $\lambda>1$, $\rho(\lambda\varepsilon)/\rho(\varepsilon)<1$, for all $0<\varepsilon<\tfrac12\lambda^{-1}$.

Theorem 3.2. Under $H_0$, for every $\varepsilon\colon 0<\varepsilon<1$,

(3.7)  $\lim_{n\to\infty}\,[n^{-1}\log P\{D_n^+\ge\varepsilon\}]=\log\rho(\varepsilon/2);$

(3.8)  $\lim_{n\to\infty}\,[n^{-1}\log P\{D_n^-\ge\varepsilon\}]=\log\rho(\varepsilon/2);$

(3.9)  $\lim_{n\to\infty}\,[n^{-1}\log P\{D_n\ge\varepsilon\}]=\log\rho(\varepsilon/2).$

Proof. By (2.2) and (3.4), $P\{D_n^+\ge\varepsilon\}=P\{D_n^-\ge\varepsilon\}\le 2P\{r_n\le s_n^*\}$, where $s_n^*=[\tfrac12 n(1-\varepsilon)]$. Since $r_n$ is a sum of independent and bounded-valued random variables, (3.7) follows from Theorem 1 of Hoeffding (1963), and (3.8) follows from Lemma 1 of Abrahamson (1967), attributed to Bahadur and Ranga Rao (1960). Also, noting that for every $\varepsilon>0$ and $n\ge 1$,

(3.10)  $P\{D_n^+\ge\varepsilon\}\le P\{D_n\ge\varepsilon\}\le P\{D_n^+\ge\varepsilon\}+P\{D_n^-\ge\varepsilon\},$

(3.9) follows readily from (3.7) and (3.8). Q.E.D.

Let us now consider the non-null case. To simplify the expressions, we assume the homogeneity of the cdf's, viz., $F_1=\cdots=F_n=F$, for all $n\ge 1$. For a cdf $F(x)$, we define for $x\ge 0$ and $\varepsilon>0$,

(3.11)  $\rho_1(x,\varepsilon)=\inf_{t>0}\,\{e^{-t\varepsilon}[F(-x)e^{t}+\{F(x)-F(-x)\}+\{1-F(x)\}e^{-t}]\},$

(3.12)  $\rho_2(x,\varepsilon)=\inf_{t>0}\,\{e^{-t\varepsilon}[F(-x)e^{-t}+\{F(x)-F(-x)\}+\{1-F(x)\}e^{t}]\};$
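The rate in (3.7) can be checked against the exact tail (2.2), taking for $\rho$ the Chernoff form $\rho(\varepsilon)=\inf_{t>0}e^{-2t\varepsilon}\cosh t$ (an assumption here, consistent with the stated properties $\rho(0)=1$ and $\rho(\varepsilon)\to\tfrac12$). A sketch:

```python
from math import comb, cosh, exp, log

def rho(eps, steps=20000, tmax=20.0):
    """rho(eps) = inf_{t>0} exp(-2 t eps) cosh(t), by a grid over t
    (assumed Chernoff-bound form)."""
    return min(exp(-2 * (tmax * i / steps) * eps) * cosh(tmax * i / steps)
               for i in range(1, steps + 1))

def p_plus(n, k):
    """Exact P{n D_n^+ >= k}, as in (2.2)."""
    s = (n - k) // 2
    delta = 1 if (n - k) % 2 == 0 else 0
    return (2 * sum(comb(n, i) for i in range(s + 1)) - delta * comb(n, s)) / 2 ** n

# rate in (3.7) at eps = 0.2: n^{-1} log P{D_n^+ >= eps} against log rho(eps/2)
eps, n = 0.2, 1000
empirical_rate = log(p_plus(n, int(n * eps))) / n
limit_rate = log(rho(eps / 2))
```

Here `empirical_rate` and `limit_rate` agree to within a few thousandths, the residual gap being the usual $O(n^{-1}\log n)$ correction of large-deviation theory.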
(3.13)  $\rho^*(x,\varepsilon)=\max[\rho_1(x,\varepsilon),\rho_2(x,\varepsilon)],\qquad \rho^*(F,\varepsilon)=\sup_{x>0}\,\rho^*(x,\varepsilon).$

Note that when $F(x)\in\mathcal F_0$, $\rho_1(x,\varepsilon)=\rho_2(x,\varepsilon)$.

Theorem 3.3. For every continuous $F$ and every $\varepsilon\colon 0<\varepsilon<1$,

(3.14)  $\lim_{n\to\infty}\,[n^{-1}\log P\{D_n\ge\varepsilon\}]=\log\rho^*(F,\varepsilon).$

Outline of the proof. By definition in (1.6), for every $n$ and $\varepsilon>0$,

(3.15)  $P\{D_n\ge\varepsilon\}\ge P\{|F_n^*(x)+F_n^*(-x-)-1|\ge\varepsilon\}\quad\text{for any }x>0.$

Since $F_n^*(x)+F_n^*(-x-)-1=n^{-1}\sum_{i=1}^{n}g(X_i)$, where $g(u)=1$, 0 or $-1$ according as $u<-x$, $-x\le u\le x$, or $u>x$, and as $g(X_1),\ldots,g(X_n)$ are all independent and bounded-valued random variables, using the well-known results of Bahadur and Ranga Rao (1960) (see also Lemma 1 of Abrahamson (1967)), we obtain by some simple steps that

(3.16)  $\lim_{n\to\infty}\,[n^{-1}\log P\{|F_n^*(x)+F_n^*(-x-)-1|\ge\varepsilon\}]=\log\rho^*(x,\varepsilon),\quad x>0.$

Thus, by (3.15) and (3.16), we have

(3.17)  $\liminf_{n\to\infty}\,[n^{-1}\log P\{D_n\ge\varepsilon\}]\ge\sup_{x>0}\,\log\rho^*(x,\varepsilon)=\log\rho^*(F,\varepsilon).$

Hence, the proof of (3.14) will follow if we can show that

(3.18)  $\limsup_{n\to\infty}\,[n^{-1}\log P\{D_n\ge\varepsilon\}]\le\log\rho^*(F,\varepsilon).$

Since the proof of (3.18) follows by the same technique as in Theorem 1 of Abrahamson (1967) [namely, as in her (3.12)-(3.15)], we omit the details and terminate the proof here. Hence the theorem.
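The quantity $\rho^*(F,\varepsilon)$ of (3.11)-(3.13) can be evaluated by a crude grid search over $t$ and $x$; in the sketch below, the standard normal df and its shift by 0.5 are hypothetical examples of symmetric and asymmetric $F$, and the grid sizes are arbitrary choices:

```python
from math import erf, exp, sqrt

def Phi(y):
    return 0.5 * (1 + erf(y / sqrt(2)))

def rho_star(F, eps, nx=200, nt=800, xmax=6.0, tmax=8.0):
    """sup_{x>0} max[rho_1(x, eps), rho_2(x, eps)] of (3.11)-(3.13), on grids."""
    ts = [tmax * (i + 1) / nt for i in range(nt)]
    best = 0.0
    for j in range(nx):
        x = xmax * (j + 0.5) / nx
        a, b, c = F(-x), F(x) - F(-x), 1.0 - F(x)
        r1 = min(exp(-t * eps) * (a * exp(t) + b + c * exp(-t)) for t in ts)
        r2 = min(exp(-t * eps) * (a * exp(-t) + b + c * exp(t)) for t in ts)
        best = max(best, r1, r2)
    return best

sym = rho_star(Phi, 0.2)                        # symmetric null case
shifted = rho_star(lambda x: Phi(x - 0.5), 0.2)  # delta(F) > eps here
```

For the symmetric df the supremum is approached as $x\to 0$ and the value agrees with the rate of Theorem 3.2 at $\varepsilon=0.2$ ($\approx .9801$); for the shifted df, $2\Phi(.5)-1\approx .383>\varepsilon$, so $\rho^*=1$ and $P\{D_n\ge\varepsilon\}$ no longer decays exponentially.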
4. Following Abrahamson (1967), we briefly sketch the Bahadur (1960) efficiency of two sequences of statistics when, in particular, we are interested in the hypothesis of symmetry. Here also we assume that $F_1=\cdots=F_n=F$ for all $n\ge 1$. We define $\mathcal F_0$ as in Section 1, and let $\mathcal F_1$ be the class of all continuous (univariate) df's not symmetric about zero. Thus, if we let

(4.1)  $\delta(F)=\sup_{x\ge 0}\,|F(x)+F(-x)-1|,$

then $\delta(F)=0$, $\forall F\in\mathcal F_0$, while $\delta(F)>0$ for any $F\in\mathcal F_1$. Consider now two sequences $\{T_n^{(1)}\}$ and $\{T_n^{(2)}\}$ of non-negative real-valued statistics, satisfying the following three conditions:

(1) there exists a non-degenerate and continuous df $\Psi_i(x)$, such that for all $F\in\mathcal F_0$ and real $r\,(0<r<\infty)$,

(4.2)  $\lim_{n\to\infty}P_F\{T_n^{(i)}<r\}=\Psi_i(r);$

(2) there exists a non-negative function $\psi_i$ on $[0,\infty]$ such that (i) $\psi_i(z)>0$ for all $z\in(0,\infty)$, and (ii) whenever $\{u_n\}$ is a sequence of real numbers for which $n^{-1}u_n^{2}\to z\in(0,\infty)$, we have

(4.3)  $-\lim_{n\to\infty}(2/n)\log P_F\{T_n^{(i)}\ge u_n\}=\psi_i(z),$

uniformly in $F\in\mathcal F_0$; and

(3) for $F\in\mathcal F_1$,

(4.4)  $n^{-1/2}T_n^{(i)}\to b_i(F)>0$, a.s., as $n\to\infty$, for $i=1,2$.
Then, we define the exact asymptotic efficiency of $T_n^{(1)}$ with respect to $T_n^{(2)}$ as equal to

(4.5)  $e_{1,2}(F)=\psi_1(b_1^{2}(F))/\psi_2(b_2^{2}(F)),$

and, with the metric $\delta(F)$ defined by (4.1), the limit

(4.6)  $e_{1,2}^{*}=\lim_{\delta(F)\to 0}e_{1,2}(F)$

is defined the exact asymptotic limiting efficiency, both defined after Bahadur (1960), as further interpreted in Abrahamson (1967).

Let now $T_n^{(1)}=n^{1/2}D_n$. Under $H_0\colon F\in\mathcal F_0$, the distribution of $T_n^{(1)}$ is independent of $F$, and by (3.3), we have

(4.7)  $\Psi_1(r)=1-4\sum_{k=1}^{\infty}(-1)^{k-1}\Phi(-(2k-1)r),\quad 0<r<\infty,\ \forall F\in\mathcal F_0.$

Further, using Theorem 3.2, (3.5) and some standard computations, we obtain that for $\{u_n\}$ for which $u_n^{2}/n\to z\in(0,1)$,

(4.8)  $-\lim_{n\to\infty}(2/n)\log P\{T_n^{(1)}\ge u_n\}=\sum_{k=1}^{\infty}z^{k}/k(2k-1),\quad\forall F\in\mathcal F_0.$

Finally, by the Glivenko-Cantelli theorem, $\lim_{n\to\infty}\sup_x|F_n^*(x)-F(x)|=0$, a.s., and hence, by (1.6) and (4.1),

(4.9)  $n^{-1/2}T_n^{(1)}=D_n\to\delta(F)$, a.s., as $n\to\infty$.

So for $D_n$, all the three conditions are satisfied.

Let us now consider the sign statistic $S_n$, defined by

(4.10)  $S_n=n^{-1/2}(2r_n-n),\qquad r_n=\sum_{i=1}^{n}c(X_i),$

where $c(u)$ is defined after (1.1). If we then let $T_n^{(2)}=|S_n|$, we have

(4.11)  $\Psi_2(r)=\Phi(r)-\Phi(-r),\quad 0<r<\infty,\ \forall F\in\mathcal F_0.$

Also, using Lemma 1 of Abrahamson (1967) and some standard computations, we have, parallel to (4.8),

(4.12)  $-\lim_{n\to\infty}(2/n)\log P\{T_n^{(2)}\ge u_n\}=\sum_{k=1}^{\infty}z^{k}/k(2k-1),\quad\forall F\in\mathcal F_0.$
Finally, it is well-known that, as $n\to\infty$,

(4.13)  $n^{-1/2}T_n^{(2)}=|n^{-1}(2r_n-n)|\to|2F(0)-1|=\delta_0(F)$, a.s.

Hence, the conditions are also satisfied for the sign statistic. Thus, the asymptotic efficiencies of $D_n$ with respect to $S_n$, as defined by (4.5) and (4.6), are equal to

(4.14)  $e_{1,2}(F)=\Bigl[\sum_{k=1}^{\infty}\delta^{2k}(F)/k(2k-1)\Bigr]\Big/\Bigl[\sum_{k=1}^{\infty}\delta_0^{2k}(F)/k(2k-1)\Bigr],$

(4.15)  $e_{1,2}^{*}(F)=\delta^{2}(F)/\delta_0^{2}(F).$

Now, note that by (4.1) and (4.13), $\delta(F)\ge\delta_0(F)$, $\forall F\in\mathcal F_0\cup\mathcal F_1$. Hence, from (4.14) and (4.15), we arrive at the following:

(4.16)  $e_{1,2}(F)\ge 1$ and $e_{1,2}^{*}(F)\ge 1$, for all $F$.

Thus, the proposed test is at least as efficient (asymptotically) as the sign test for all $F$. In particular, if $F(x)\,(\in\mathcal F_0)$ is symmetric and unimodal, and we are interested only in shift alternatives, then $\delta(F)=\delta_0(F)$, so that in (4.16) the equality signs hold; the conclusion is not necessarily true when $F(x)$ is not strictly unimodal [viz., the uniform df]. On the other hand, for certain specific types of asymmetry (of $F$), $\delta_0(F)$ may be exactly or nearly equal to zero, but $\delta(F)$ can still be positive, making (4.14) or (4.15) either $\infty$ or indefinitely large.

For other tests for symmetry, the Bahadur efficiency of $D_n$ may be computed in a similar way; for brevity, the details are omitted.
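As an illustration of (4.14)-(4.16), the quantities $\delta(F)$, $\delta_0(F)$ and the series $\psi(z)=\sum_{k\ge 1}z^{k}/k(2k-1)$ of (4.8) and (4.12) can be computed for two hypothetical alternatives: a normal shift (symmetric unimodal, where equality holds), and a median-zero but asymmetric df, where the sign test is insensitive while $\delta(F)>0$. A sketch:

```python
from math import erf, sqrt

def Phi(y):
    return 0.5 * (1 + erf(y / sqrt(2)))

def psi(z, terms=80):
    """psi_1(z) = psi_2(z) = sum_{k>=1} z^k / k(2k-1), as in (4.8), (4.12)."""
    return sum(z ** k / (k * (2 * k - 1)) for k in range(1, terms + 1))

def delta(F, xs):
    """delta(F) of (4.1), approximated on a grid of x >= 0."""
    return max(abs(F(x) + F(-x) - 1.0) for x in xs)

xs = [8.0 * i / 2000 for i in range(2001)]

# shift alternative F(x) = Phi(x - 0.5): delta(F) = delta_0(F), so e_{1,2}(F) = 1
F1 = lambda x: Phi(x - 0.5)
d1, d01 = delta(F1, xs), abs(2 * F1(0) - 1.0)
e12 = psi(d1 ** 2) / psi(d01 ** 2)

# median-zero asymmetric F: X = |Z| - c with 2 Phi(c) - 1 = 1/2, so delta_0(F) = 0
c = 0.674489750196082   # median of |Z|, Z standard normal
F2 = lambda x: max(0.0, 2 * Phi(x + c) - 1.0)
d2, d02 = delta(F2, xs), abs(2 * F2(0) - 1.0)
# here psi(d02**2) = 0, and (4.14)-(4.15) are "indefinitely large", as noted above
```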
REFERENCES

[1] ABRAHAMSON, I. G. (1967). Exact Bahadur efficiencies for the Kolmogorov-Smirnov and Kuiper one- and two-sample statistics. Ann. Math. Statist. 38, 1475-1490.

[2] BAHADUR, R. R. (1960). Stochastic comparison of tests. Ann. Math. Statist. 31, 276-295.

[3] BAHADUR, R. R., AND RANGA RAO, R. (1960). On deviations of the sample mean. Ann. Math. Statist. 31, 1015-1027.

[4] FELLER, W. (1967). An Introduction to Probability Theory and Its Applications, Vol. 2. John Wiley, New York.

[5] HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. Jour. Amer. Statist. Assoc. 58, 13-30.

[6] PARTHASARATHY, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, New York.

[7] TAKACS, L. (1967). Combinatorial Methods in the Theory of Stochastic Processes. John Wiley, New York.

[8] USPENSKY, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill, New York.