Davis, C.E. and D. Quade; (1968)A distribution-free test of compaing the associations between two pairs of variables in a multivariate sample."

I
I
.e
I
I
I
I
I
I
This investigation was supported in part by a Public Health Service
Grant (number GM-38) from the Institute of General Medical Sciences,
Public Health Service.
Ie
I
I
I
I
I
I
I.
I
I
A Distribution-Free Test for Comparing the
Associations Between TWo Pairs of Variables in a
Multivariate Sample
by
C. E. Davis and Dana Quade
Department of Biostatistics
University of North Carolina School of Public Health
Institute of Statistics Mimeo Series No. 575
March 1968
Errata
Page 7
; the number of observations discordant with the ith observation is the
sum of the cells which are simultaneously above and to the right of the
(j,k) cell or below and to the left.
Page 12
where the summation (c) extends over 1 < c
Let
2
sk
= __1__ ~ (V(i)
n-l i=l
k
k = 1,2.
l
< c < ... < cm ~ n •
2
Sen shows that zk
Page 15
Since U converges in probability to 8 and U converges in probability to
1
2
l
As the asymptotic joint distribution of /n(U -8 ) and /n(U -8 ) is bivariate
l l
2 2
normal
k = 1,2,;
I
I
I·
I
I
I
I
I
I
A Distribution-Free Test for Comparing the Associations Between
TWo Pairs of Variables in a Multivariate Sample
by
c. E. Davis and Dana Quade
Department of Biostatistics
University of North Carolina School of Public Health
Sunnnary
Consider a random sample of observations on
~
=
(X, Y, W, z).
Let N(l) N(l) and N(l) be the number of pairs of observations which are
C ' D'
T
concordant, discordant, and tied, respectively, for variables X and Y; and
N(2)
C
'
N(2)
D'
and N(2) be the corresponding numbers for Wand Z.
T
of the association between X and Y is t
and similarly, t , for Wand Z.
2
l
=
A measure
(N(l) - N(l»/(N(l) + N(l) + N(l»
CDC
D
T
In this paper it is shown how, using t
l
and t , the association between X and Y may be compared with that between
2
Wand Z, if n is sufficiently large.
1.
I
I
I
I
I
Introduction
Let A. = (X., Y., W., Z.), 1 _< i <_ n, be a random sample of
-~
~
~
~
observations on a random variable
ordinal.
~
~,
where each component of
~
is at least
Based on this sample it is desired to test a hypothesis concern-
ing, or place a confidence interval on, the difference between the correlation of variables X and Y, and that of variables Wand Z, when no
assumption is made concerning the form of the underlying distribution.
is the purpose of this paper to indicate how this can be carried out for
large samples.
The measure of association used is the unconditional index of
I.
I
I
order association or Kendall's tau-a, hereafter referred to as the
It
I
I
,.
I
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-2-
unconditional index, which we now define.
OS. '
A pair of bivariate observations
Y ) and (X ' Y ) is said to be:
l
2
2
concordant i f
~ < X2 ' Y1 < Y2 or ~ >
discordant i f
~ < X2 ' Yl > Y2 or ~ > X2 , Y1 < Y2
tied
~ =
if
~, Y1 > Y2
or Y1 = Y2 or both.
~
Let PC' PD' and PT be the probabilities that a randomly chosen pair of
observations is concordant, discordant, or tied, respectively.
The uncondi-
tiona1 index between variables X and Y is then
T(X, Y) =
T
PC-PD
= PC+PD+PT = Pc - PD
2
Note that T(X,Y) = T(Y,X) = -T(-X,Y); that T (X,Y)
if X and Yare independent.
~
1; and that T(X,Y) = 0
Thus T is a standardized measure much like a
correlation coefficient.
If T
1
and T 2 are the unconditional indices for (X,Y) and (W,Z),
respectively, we shall show how to test hypotheses of the form H: T
1
- T
2
The procedure is outlined in Section 2; Section 3 consists of an
example; a discussion of the method is in Section 4; and theoretical considerations are presented in the Appendix.
2.
For 1
~
A Test for the Difference in Associations
i
~
n, let Q be the number of observations which are
1i
concordant with the ith observation, less the number which are discordant,
for variables (X,Y); and let Q be the corresponding difference for (W,Z).
2i
The estimates of T
1
and T
2
are then
1
~
Q
n(n-1) i=l 1i
o•
I
I
-3-
,.
and
I
respectively.
I
n
t
then
tk
Ie
I
I
n(n-l) i=l 2i
I~1
N
-
Tk
(~)
For k = 1,2, define
[n £
4
n 2 (n_l)3
£
Q2. _ (
Q .) 2]
i=l k~
i=l k~
Then the random variables
k = 1,2
are asymptotically standard normal random variables.
Hence, for large n,
we can construct confidence intervals for T or test hypotheses such as
k
H: T = T* ' where T* is some constant.
k
k
k
I
I
I
I
I
I
I.
L: Q
1
We note that if NTk,k = 1,2 is the number of tied observations,
I
I
I
I
2
=
s
[n i=l£ Q~~
2 = _......:4_ _
n 2 (n_l)3
_ (
~
Q.) 2 ]
i=l ~
Then for large n,
(t -t ) - (T -T )
z =
l
l
2
2
s
is also a standard normal random variable.
H: T
l
This enables us to test
- T = 0, or to construct a confidence interval for (T -T ) .
1 2
2
An estimate of the correlation between t
where
l
and t
2
is
I
-4-
I
I·
I
I
I
I
I
I
Note that s
222
= s1 + s2 - 2s l2
That these results are valid can be seen by noting that t
..
are U-stat~st~cs
an d appea l'~ng to t h e resu 1 ts
end, identify the general random variable
1
~l (A.
-~
,A.)
-J
I
I
I
I.
I
I
and t
To t h'~s
X used there with!, T
with 8 1 and 8 respectively, and note that m = 2.
2
l
and
T
2
Define
i f (X.,Y.) is concordant with (X.,Y.)
~
~
J
J
i f (Xi' Y ) is tied with (X.,Y.)
i
J J
0
-1
i f (Xi,Y ) is discordant with (X.,Y.)
i
1
i f (Wi,Zi) is concordant with (W.,Z.)
J J
0
i f (Wi,Zi) is tied with (W., Z.)
J J
J
J
and
~2(A.
-~
=
,A.)
-J
Ie
I
I
I
0 f t h e appen d'~x. 1
l
-1
~k(A.,A.)
-~
-J
, k
= 1,2
i f (W. , Z.) is discordant with (Wj,Zj)
~
~
is then a symmetric, unbiased estimator of T •
k
the U-statistic for estimating T
Uk
=
=
(2
1
is
k
n n- 1
Hence
)
L:
i<j
n
L:
n ( n- 1) i=l
~k(A.,A.)
-~ -J
Qki = t k
k = 1,2 •
The components of Uk are
n
L:
k = 1,2
j=l
j:fi
1Thos e readers who are not mathematically inclined may skip the remainder
of this section and the Appendix.
2
I
I
I·
I
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-5-
Moreover, V(i)
=
V~i) - V~i) and a little algebra then verifies the formulae
222
for sk' s , and s12.
Hence the results outlined above are seen to be special
cases of the results in the Appendix.
3.
Example
As an example we use a portion of the data collected in a dental
survey of North Carolina, described in detail by Fulton et al. [lJ.
The
observations consisted of noting whether a tooth was healthy (i.e. noncarious), carious, or missing.
for white males aged 15-19.
We shall use the data for the second molar,
As past experience has indicated that for this
age group most missing teeth are missing due to previous caries, we shall
consider a missing tooth to be a carious tooth.
Denoting a non-carious tooth by 0 and a carious tooth by X, there
are 16 possible outcomes for an individual (see Table 1); e.g. (outcome #9)
the second molar on the upper left quarter of the mouth may be carious while
the remaining 3 second molars are healthy.
Table 1
Possible Outcomes for an Individual
1-
2.
3.
4.
*
o
0
*~
*
o
X
X0
XX
5.
6.
7.
8.
o0
~
oX
~
XO
~
a
XX
9.
10.
11.
12.
o0
Wo
*Wo
*
o
X
X0
XX
13.
m
o
0
14.
$
oX
15.
X0
Wa
16.
iii
XX
The frequency distribution for the second molar is given in Table 2.
are two questions of interest:
There
I
I
••
I
I
I
I
I
-6-
(i)
Is there an association in caries rates between the second
molars on the left and those on the right of the mouth?
the
maxillary and mandibular halves?
(ii)
If the answer to both parts of (i) is yes, is there a difference
in the degree of association?
Table 2
Observation
Frequency
Ql i
Q2i
Qi
fQ
fQ
li
2i
fQ.
~
1
24
146
119
27
3504
2856
648
2
4
80
75
5
320
300
20
3
5
72
75
-3
360
375
-15
4
26
89
-44
133
2314
-1144
3458
I
5
3
80
50
30
240
150
90
6
4
-66
62
-128
-264
248
-512
Ie
I
I
I
I
I
I
7
5
89
62
27
445
310
135
8
13
17
12
5
221
156
65
9
6
72
50
22
432
300
132
10
5
89
62
27
445
310
135
11
0
74
62
12
0
0
0
12
7
9
12
-3
63
84
-21
13
5
89
-69
158
445
-345
790
14
9
17
-13
30
153
-117
270
15
7
9
-13
22
63
-91
154
16
69
83
56
27
5727
3864
1863
14,468
7256
7212
I.
I
I
Total
192
Let T and T be the population unconditional indices for the
1
2
left-right and maxillary-mandibular comparisons, respectively.
A computa-
tiona1 aid, which also indicates an important application of the method, is
to put the data into two contingency tables as in Table 3.
I
I
-7-
Ie
Table 3
Number of Carious Teeth
I
I
I
I
I
Left Side
(a)
o
I
Right
Side
24
11
0
35
1
7
41
14
62
2
4
22
69
95
35
74
83
192
Maxillary
(b)
o
1
2
Total
I
o
24
9
5
38
1
9
14
16
39
2
26
20
69
115
Total
59
43
90
192
Mandibular
Ie
I.
I
I
o
Total
I
I
I
I
I
I
I
Total
2
1
From these Q and Q can easily be determined.
1i
2i
Suppose the ith observa-
tion is in the jth row and the kth column, i.e. the (j,k) cell.
The number
of observations which are concordant with the ith observation is the sum of
all cells which are simultaneously above and to the left of the (j,k) cell
or below and to the right; the number of observations discordant with the
ith observation is the sum of the cells which are simultaneously below and
to the right of the (j,k) cell or above and to the left.
For example, con-
sider an observation in the second row and second column of the first
contingency table.
There are 41 such observations.
tions concordant with each of these is 69 + 24
cordant is 0 + 4
93 - 4
= 89. Q2i
= 4.
= 93,
The number of observawhile the number dis-
Hence Q for each of these 41 observations is
1i
can be computed in a similar manner after noting in which
cell of the second contingency table the ith observation lies.
I
I
I·
I
I
I
I
I
I
-8-
If the data cannot be put into contingency tables, no such easy
method of counting the number of concordant and discordant pairs is available.
From the computations indicated in Table 2 we have
t
t
2
14,468
= 0.39452
36,672
=
7,256
= 0.19786
36,672
and
s2 = 0.038225 •
Thus
0.39452
0.032266
= 12.23
and
0.19786
z2
= 0.038225
=
5.17
Hence we reject both hypotheses H: T = 0, k = 1,2; and conclude that there
k
is an association in caries rates for teeth on the left and right side of
the mouth and similarly for the maxillary and mandibular halves.
Further
computation yields
t
t
1 -
2 = 0.19666
and
s = 0.037570
Thus for the test of H: T
1
= T2 ,
z =
I.
I
I
=
sl = 0.032266
Ie
I
I
I
I
I
I
1
we have
0.19666
0.03575
= 5.23
and we conclude that there is a difference between T and T •
1
2
The estimate
I
I
I·
I
I
I
I
I
I
-9-
of the correlation between t
r
l
and t
2
is
0.000545
= ""'(0-.-0-3-2";;"26::";6;"':)~(;;;'0";':;.0;"'3"""8-22-5"""") = 0.44201
4.
Discussion
As was suggested in the example of Section 3, this method is
applicable when a set of multivariate observations is arranged in two or
more contingency tables, as is
often the case with sample survey data.
Since we can estimate the correlation between t
and t
l
2
we have an indica-
tion of the degree of dependency of the two tests of association.
Even if only one contingency table is being analyzed the uncondi2
tional index will be preferable to the usual X test, if an estimate of the
degree of association is desired.
Various coefficients of contingency,
Ie
which are functions of X2 , have been used for this in the past.
I
I
I
I
I
I
should be kept in mind that the test of no order association and the test of
I.
I
I
However,
2
these functions of X are difficult, if not impossible, to interpret.
2
independence based on X are not tests of the same hypothesis.
It
If the
variables are independent, the population unconditional index will be zero,
but the converse is not true since variables which have no order association
may still have some other kind of association.
As the small sample distribution of (t
l
- t ) is not known, a
2
question to which we have no answer is how large n should be in order that
the asymptotic results be applicable.
Further research will be required
before this question can be answered.
The computations involved in carrying out the tests for association
given here can be tedious for even moderately large n, however the methods
are readily adaptable to an electronic computer.
I
I
I·
I
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-10-
Since the only restriction on the components of
~
is that all are
at least ordinal, we may take W ~ X, in which case the procedure given here
allows us to compare the association of X and Y with that of X and Z.
ACKNOWLEDGMENT
The authors wish to express their appreciation to Dr. H. A. Tyroler
for making the data of the example available to them.
I
I
.e
I
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-11-
5.
[lJ
References
Fulton, J. T., Hughes, J. T., and Mercer, C. V. (1964) The Life
Cycle of Human Teeth, Department of Epidemiology, University of
North Carolina Publication.
[2J
Sen, P. K. (1960) "On Some Convergence Properties of U-Statistics",
Calcutta Statistical Association Bulletin, Vol. 10, pp. 1-18.
I
I
••
I
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-12-
Appendix:
Let
Xl'
The Asymptotic Distribution of the Difference
Between Two U-Statistics
X , ••. , X be independent and identically distributed
2
n
random variables with cumulative distribution function F(X).
estimable parameters 8 1 and 8 each of degree m (
2
and
O2
~
Consider two
(Xl' ••• , X ) be their symmetric unbiased estimators.
m
for estimating 8
1
and 8
2
... ,
n) and let 0 (Xl'
1
The U-statistics
are then
U = (n)-l ~ 0 (X , X ,
c
c
1
m
(c) 1
1
2
... ,
U = (n)-l ~
2
m
(c)
... ,
Xc )
m
and
O2
(X
c
1
where the summation (c) extends over 1
, X
c
~
,
2
c, < c
2
Xc )
m
< ... < c < n
m-
.
Define the components of the U-statistics as
... ,
y(i) =
k
1 < i < n,
k = 1,2 ,
where the summation (ci) extends over 1 ::: c < c < ... < c 1 < n
1
2
mthat c. = i must not occur for any j.
J
,
except
Note that for k = 1,2
1 ~ y(i)
Uk = -n
i=l k
2
1
Let sk = - - ~ (y(i)
n-1 i=l k
Y )2
k
k = 1,2.
Sen [2J shows that zk =
I"n (Yk-8k)
mS
k
converges in distribution to the standard normal distribution.
Since 8
8 = 8
1
8
2
1
and 8
2
are each estimable of degree m, it follows that
is estimable of degree less than or equal to m.
degree of 8 is exactly m.
Xm)
Assume that the
A symmetric unbiased estimator of 8 is
I
I
I-
-13-
and the U-statistic for estimating
I
I
I
I
I
I
e
is
... ,
xc )
m
Define
V(i) = (n- 1 5l L:: 0
(Xi' Xc
m-l (ci)
l
= v(i)
1
,
... ,
Xc _ )
m l
1 < i < n
v(i)
2
and
Ie
I
I
I
I
I
I
I.
I
I
It then follows immediately from Sen's result that
is asymptotically a standard normal random variable.
We now prove the following result:
is convergent, then s
probability to P12"
- __1__ ~ (V(i)
12 - n-l i=l 1
If the integral
U )(v(i) - U ) converges in
1
2
2
I
I
I-
I
I
I
I
I
I
-14-
Proof:
Note that s12 can be written as:
s
n
(')
(V ~
n-l i=l 1
1
12
= --- ~
I.
I
I
- 9 ) - n~l (U - 9 )(U - 9 ) •
l
l
2
2
2
Define
0 (i)
k
= f···S0k(X"X c , ... ,Xc _ )dF(Xc ) ••• dF(Xc _ );
l
l
ml
ml
~
where c,
J
=i
1 < i
~ n,
k
= 1,2
must not occur and consider
n
+
Ie
I
I
I
I
I
I
(')
9l)(V2~
~
i=1
Now by the Cauchy-Schwartz inequality
Sen shows that
-1-1 ~
i=l
Hence it follows that
n-
(v(i)_0(i))2 converges in probability to zero; k
k
k
converges in probability to zero.
and
A similar argument proves that both
= 1,2.
I
I
II
I
I
I
I
I
Ie
I
I
I
I
I
I
I.
I
I
-15-
converge in probability to zero.
By definition, 0~i) and 0~j), i#j; k
= 1,2;
are independent and identically distributed, thus by Khintchine's weak law
of large numbers it follows that
converges in probability to E[(0(i)-e )(0(i)_e )}
112
2
= P12.
Thus
we see that
converges in probability to P .
12
Since U converges in probability to e
l
l
and U converges in probability e , n~1(Ul-el)(U2-e2) converges in probability
2
2
to zero.
Hence s12 converges in probability to P •
12
As the asymptotic joint distribution of fn(Ul-el) and fn(U2-62) in
bivariate normal with variance matrix
follows that if n is sufficiently large a valid estimate of the correlation
between U and U is
l
2