ANALYSIS OF CATEGORICAL DATA
OBTAINED BY STRATIFIED
RANDOM SAMPLING I.
by
E. Sobel, M. Francis and P. Imrey*
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1065
March 1976
*E. Sobel is Research Fellow, Department of Epidemiology, and M. Francis is Assistant Professor, Department of Biostatistics, University of North Carolina, Chapel Hill, N.C. P. Imrey is Assistant Professor of Biostatistics, University of Illinois, Urbana, Ill.
ABSTRACT
A non-iterative method of analysis by linear models is
extended to categorical data obtained by stratified random sampling.
It is shown that, asymptotically, proportional allocation reduces
the variance of estimators over that obtained by simple random
sampling.
The difference between the asymptotic covariance matrices
of estimated parameter vectors obtained by simple random sampling and
stratified random sampling with proportional allocation is shown
to be positive definite under fairly non-restrictive conditions.
1. INTRODUCTION
Grizzle, Starmer and Koch (GSK, [4]) propose a weighted least-squares methodology for the analysis of a wide range of linear and non-linear models for categorical data, including testing of linear hypotheses within the underlying parameter space. Their approach is a synthesis of the work of Wald [11], Neyman [10] and Bhapkar [1,2]. The broad applicability of the method has been demonstrated in a continuing series of papers, e.g., Koch and Reinfurt [9], Forthofer and Koch [3], Koch, Johnson and Tolley [8], and Koch, Imrey and Reinfurt [7].
The research described above assumes throughout that the categorical data were obtained by independent simple random sampling within each of an arbitrary number of very large populations. Koch, Freeman and Freeman [6] discuss alternative sample designs for which this assumption may not be necessary due to the use of pseudo-replication or like methods for estimating the covariance matrix of the sample cell counts. Johnson and Koch [5], in the context of an example, show how proportions or mean scores from stratified random sampling designs may be analyzed within the original GSK context, through consideration of each stratum as a population and use of a "mediator matrix" to transform within-stratum cell counts or mean scores into the usual stratified estimates and their variances and covariances. Linear models are then applied to the statistics so derived.
In this paper we extend the suggestion of Johnson and Koch [5] to the full range of functional models described in GSK [4]. We then provide a multivariate extension of the well-known theoretic result that stratified sampling with proportional allocation provides an unbiased estimator of a population mean with smaller variance than the estimator obtained from simple random sampling. In particular we then show that this result carries over to stratified and non-stratified estimators of the parameters in the functional models mentioned above.
2. NOTATION and the PROBLEM
As far as possible, we follow the notation used in GSK. We assume that there are $s$ populations of interest. Within the $i$th population we are interested in a multinomial random (response) variable with response categories $C_{i1}, \ldots, C_{ir_i}$. Notice that we allow the number of categories to vary between populations. Let

$\pi_{ij}$ = proportion of the $i$th population "belonging" to $C_{ij}$,
$\pi_i' = (\pi_{i1}, \ldots, \pi_{ir_i})$,
$\pi' = (\pi_1', \ldots, \pi_s')$.

The unknown $\pi_{ij}$ are hypothesized to satisfy a possibly non-linear model of the form
$$\underset{u \times 1}{F(\pi)} = \underset{u \times v}{X}\;\underset{v \times 1}{\beta} \qquad (1)$$

where
i) $F(\pi) = (f_1(\pi), \ldots, f_u(\pi))'$,
ii) $u \le \sum_i (r_i - 1)$,
iii) each $f_m(\pi)$ has continuous second partials with respect to the $\pi_{ij}$,
iv) the $f_m(\cdot)$ are independent of one another, i.e., the rank of the $u \times \sum_i r_i$ matrix
$$H(\pi) = \left( \frac{\partial f_m(\pi)}{\partial \pi_{ij}} \right)$$
is equal to $u$,
v) $X$ is a known $u \times v$ design matrix of rank $v \le u$,
vi) $\beta$ is a $v \times 1$ vector of unknown parameters.
Assuming simple random sampling from each population, GSK treated (1) as a null hypothesis and presented an asymptotic $\chi^2$-test. Then, assuming (1) is valid, they gave a best asymptotically normal (BAN) estimate of $\beta$. Finally they gave an asymptotic $\chi^2$-test for the null hypothesis

$$\underset{d \times v}{C}\,\beta = 0 \qquad (2)$$

where
vii) $C$ is a known $d \times v$ matrix of rank $d \le v$.

A. NOTATION - Simple Random Sampling
Within the $i$th population, $i = 1, \ldots, s$, let

$n_{i\cdot}$ = total sample size,
$n_{ij}$ = number of $C_{ij}$ responses in the sample, $j = 1, \ldots, r_i$,
$p_{ij} = n_{ij}/n_{i\cdot}$,
$p_i' = (p_{i1}, \ldots, p_{ir_i})$,

$$\underset{r_i \times r_i}{V(\pi_i)} = \frac{1}{n_{i\cdot}}
\begin{pmatrix}
\pi_{i1}(1-\pi_{i1}) & -\pi_{i1}\pi_{i2} & \cdots & -\pi_{i1}\pi_{ir_i} \\
-\pi_{i2}\pi_{i1} & \pi_{i2}(1-\pi_{i2}) & \cdots & -\pi_{i2}\pi_{ir_i} \\
\vdots & \vdots & & \vdots \\
-\pi_{ir_i}\pi_{i1} & -\pi_{ir_i}\pi_{i2} & \cdots & \pi_{ir_i}(1-\pi_{ir_i})
\end{pmatrix}.$$
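As a concrete illustration of $V(\pi_i)$, the following Python sketch (NumPy assumed; the function name `multinomial_cov` and the numbers are hypothetical, not from GSK) builds the matrix for one population and checks that its rows sum to zero, which is why $V(\pi_i)$ is singular.

```python
import numpy as np

def multinomial_cov(pi, n):
    """Covariance matrix of the sample proportions from a multinomial
    with cell probabilities `pi` and sample size `n`."""
    pi = np.asarray(pi, dtype=float)
    return (np.diag(pi) - np.outer(pi, pi)) / n

pi_1 = [0.5, 0.3, 0.2]            # hypothetical cell probabilities, r_i = 3
V_1 = multinomial_cov(pi_1, n=100)
row_sums = V_1.sum(axis=1)        # zero up to rounding, so V_1 is singular
print(V_1)
print(row_sums)
```

The same function, applied to a vector of sample proportions, gives one diagonal block of the sample estimate of the covariance matrix.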
Also let

$p' = (p_1', \ldots, p_s')$ ($1 \times \sum_i r_i$),
$V(\pi)$ = block diagonal matrix of the $V(\pi_i)$,
$V(p)$ = usual sample estimate of $V(\pi)$,
$H = H(p)$,
$S = H\,V(p)\,H'$.

B. NOTATION - Stratified Random Sampling
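Within the $i$th population, $i = 1, \ldots, s$, let

$n_{i\cdot}$ = total sample size,
$D_i$ = number of strata,
$n_{i(h)}$ = sample size in stratum $h$,
$n_{i(h)j}$ = number of $C_{ij}$ responses in the sample from stratum $h$, $j = 1, \ldots, r_i$,
$N_{i\cdot}$ = total population size,
$N_{ih}$ = population size of stratum $h$,
$W_{ih} = N_{ih}/N_{i\cdot}$,
$W_i = (W_{i1}, \ldots, W_{iD_i})$,
$\alpha_{i(h)j}$ = proportion in stratum $h$ "belonging" to $C_{ij}$,
$a_{i(h)j} = n_{i(h)j}/n_{i(h)}$,
$\alpha_{i(h)}' = (\alpha_{i(h)1}, \ldots, \alpha_{i(h)r_i})$, $\alpha_i' = (\alpha_{i(1)}', \ldots, \alpha_{i(D_i)}')$,
$a_{i(h)}'$ and $a_i'$ defined analogously from the $a_{i(h)j}$,
$V(\alpha_{i(h)})$ = the $r_i \times r_i$ matrix obtained from $V(\pi_i)$ on replacing the $\pi_{ij}$ by the $\alpha_{i(h)j}$ and $n_{i\cdot}$ by $n_{i(h)}$,
$V(\alpha_i)$ = block diagonal matrix of the $V(\alpha_{i(h)})$ ($D_i r_i \times D_i r_i$).

Also let

$\alpha' = (\alpha_1', \ldots, \alpha_s')$ and $a' = (a_1', \ldots, a_s')$ ($1 \times \sum_i D_i r_i$),
$V(\alpha)$ = block diagonal matrix of the $V(\alpha_i)$ ($\sum_i D_i r_i \times \sum_i D_i r_i$),
$V(a)$ = usual sample estimate of $V(\alpha)$,
$Q_i = W_i \otimes I_{r_i}$ ($r_i \times D_i r_i$), where $\otimes$ is the Kronecker product and $I_{r_i}$ is the $r_i \times r_i$ identity matrix,
$Q$ = block diagonal matrix of the $Q_i$ ($\sum_i r_i \times \sum_i D_i r_i$).

Finally let

$S_{st} = H_{st}\,V(a)\,H_{st}'$,
$S_{st}(\alpha) = H_{st}(\alpha)\,V(\alpha)\,H_{st}(\alpha)'$,

where $H_{st} = H_{st}(a)$ and $H_{st}(\alpha)$ are defined in (3) in Section 4.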
We remark here that even though $V(\pi)$ and $V(\alpha)$ are singular and thus only positive semi-definite, both $S(\pi)$ and $S_{st}(\alpha)$ are non-singular and thus positive definite as a consequence of assumption iv) following equation (1). We shall use the positive definiteness in the proof of the theorem in Section 5.
3. UNDER SIMPLE RANDOM SAMPLING
GSK pointed out that if (1) is valid, then

$$SS[F(\pi) = X\beta] = (F(p) - Xb)'\,S^{-1}\,(F(p) - Xb)$$

is asymptotically a $\chi^2(u-v)$, where

$$b = (X'S^{-1}X)^{-1}X'S^{-1}F(p).$$

This provides a test of (1). Note that if $u = v$ then (1) must trivially be true. If (1) is true then $b$ is a BAN estimator of $\beta$. GSK suggested that to test (2) one use

$$SS[C\beta = 0] = b'C'\,[C(X'S^{-1}X)^{-1}C']^{-1}\,Cb$$

which, if (2) is true, is asymptotically a $\chi^2(d)$. Note that if (1) is true and $F(\pi)$ is linear, $S(\pi)$ is the covariance matrix of $F(p)$, and that if $F(\pi)$ is non-linear

$$\operatorname{var} F(p) \to S(\pi)$$

as the $n_{i\cdot}$ and $n_{i(h)} \to \infty$ so that $n_{i(h)}/n_{i\cdot}$ is constant. Similarly, $(X'S^{-1}X)^{-1}$ is asymptotically equal to the covariance matrix of $b$.
4. UNDER STRATIFIED RANDOM SAMPLING
In order to test (1), obtain a BAN estimator of $\beta$ and test (2) under stratified random sampling we basically treat this situation from the point of view of simple random sampling from $\sum_{i=1}^{s} D_i$ populations. Without loss of generality, we assume $\alpha_{i(h)j} > 0$ for each $h$, $j$, $i$. Since

$$\pi_{ij} = \sum_{h=1}^{D_i} W_{ih}\,\alpha_{i(h)j}$$

we have

$$Q\alpha = \pi$$

and thus if (1) is true

$$T(\alpha) \equiv F(Q\alpha) = X\beta. \qquad (1')$$
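The pooling relation $\pi = Q\alpha$ can be sketched numerically. The Python fragment below (NumPy assumed) builds $Q$ as a block diagonal matrix whose $i$th block is the Kronecker product of the stratum weights $W_i$ with an identity matrix; the helper name `build_Q` and all numbers are hypothetical illustrations, not part of the original development.

```python
import numpy as np

def build_Q(weights, r):
    """Block-diagonal Q with i-th block W_i (1 x D_i) kron I_{r_i}.
    `weights` is a list of stratum-weight vectors, `r` a list of the r_i."""
    blocks = [np.kron(np.asarray(w, dtype=float)[None, :], np.eye(ri))
              for w, ri in zip(weights, r)]
    Q = np.zeros((sum(b.shape[0] for b in blocks),
                  sum(b.shape[1] for b in blocks)))
    i = j = 0
    for b in blocks:
        Q[i:i + b.shape[0], j:j + b.shape[1]] = b
        i += b.shape[0]
        j += b.shape[1]
    return Q

# One population (s = 1), D_1 = 2 strata, r_1 = 3 categories.
W = [[0.4, 0.6]]
alpha = np.array([0.5, 0.3, 0.2,     # stratum 1
                  0.2, 0.3, 0.5])    # stratum 2
Q = build_Q(W, r=[3])
pi = Q @ alpha                       # pi_j = sum_h W_h * alpha_(h)j
print(Q.shape, pi)
```

Each column of this `Q` has exactly one non-zero entry, the property used later in this section to obtain the rank of $Q$.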
-12-
Note that this·diffels from Johnson and Koch [5], who calculate
functions within each stratum before pooling using stratum weights.
Such
an approach is inappropriate if ,...,
F(.)
is non-linear, while it produces
......
results identical to ours in the linear case.
It is clear that, with T(.)
""'"
""
= F(Q.),
""'"
(1) and
(l~)
are equivalent.
"-N
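Next we show that the conditions corresponding to iii) and iv) of Section 2 are valid for

$$T(\alpha) = (t_1(\alpha), \ldots, t_u(\alpha))'.$$

We have

$$\frac{\partial t_m(\alpha)}{\partial \alpha_{i(h)j}} = \frac{\partial f_m(Q\alpha)}{\partial \alpha_{i(h)j}} = \frac{\partial f_m(\pi)}{\partial \pi_{ij}}\,\frac{\partial \pi_{ij}}{\partial \alpha_{i(h)j}} = W_{ih}\,\frac{\partial f_m(\pi)}{\partial \pi_{ij}}$$

and

$$\frac{\partial^2 t_m(\alpha)}{\partial \alpha_{i(h)j}\,\partial \alpha_{i'(h')j'}} = \frac{\partial\,[\,W_{ih}\,\partial f_m(Q\alpha)/\partial \pi_{ij}\,]}{\partial \alpha_{i'(h')j'}} = W_{ih}W_{i'h'}\,\frac{\partial^2 f_m(\pi)}{\partial \pi_{ij}\,\partial \pi_{i'j'}}.$$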
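Thus each $t_m(\alpha)$ has continuous second partials. Let

$$\underset{u \times \sum_i D_i r_i}{H_{st}(\alpha)} = \left( \frac{\partial t_m(\alpha)}{\partial \alpha_{i(h)j}} \right) \qquad (3)$$

and $H_{st} = H_{st}(a)$. Then, writing $\partial f_m(\pi)/\partial \pi_i$ for the $r_i \times 1$ vector of partials of $f_m$ with respect to $\pi_{i1}, \ldots, \pi_{ir_i}$,

$$H_{st}(\alpha) =
\begin{pmatrix}
\left(\dfrac{\partial f_1(\pi)}{\partial \pi_1}\right)' W_{11} & \cdots & \left(\dfrac{\partial f_1(\pi)}{\partial \pi_1}\right)' W_{1D_1} & \cdots & \left(\dfrac{\partial f_1(\pi)}{\partial \pi_s}\right)' W_{s1} & \cdots & \left(\dfrac{\partial f_1(\pi)}{\partial \pi_s}\right)' W_{sD_s} \\
\vdots & & \vdots & & \vdots & & \vdots \\
\left(\dfrac{\partial f_u(\pi)}{\partial \pi_1}\right)' W_{11} & \cdots & \left(\dfrac{\partial f_u(\pi)}{\partial \pi_1}\right)' W_{1D_1} & \cdots & \left(\dfrac{\partial f_u(\pi)}{\partial \pi_s}\right)' W_{s1} & \cdots & \left(\dfrac{\partial f_u(\pi)}{\partial \pi_s}\right)' W_{sD_s}
\end{pmatrix}$$

$$= H(\pi)
\begin{pmatrix}
W_1 \otimes I_{r_1} & & 0 \\
& \ddots & \\
0 & & W_s \otimes I_{r_s}
\end{pmatrix}
= H(\pi)\,Q.$$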
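The chain-rule identity $H_{st}(\alpha) = H(\pi)Q$ can be checked numerically for a non-linear $F$. The sketch below (NumPy assumed) uses two log-ratios as response functions; this choice of `F`, its analytic Jacobian `H`, and the numbers are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

def F(pi):
    """Hypothetical non-linear response functions: two log-ratios."""
    return np.array([np.log(pi[0] / pi[2]), np.log(pi[1] / pi[2])])

def H(pi):
    """Analytic Jacobian of F with respect to pi (u x r, here 2 x 3)."""
    return np.array([[1 / pi[0], 0.0, -1 / pi[2]],
                     [0.0, 1 / pi[1], -1 / pi[2]]])

W = np.array([[0.4, 0.6]])              # stratum weights, D = 2
Q = np.kron(W, np.eye(3))               # 3 x 6
alpha = np.array([0.5, 0.3, 0.2, 0.2, 0.3, 0.5])

def T(a):
    """T(alpha) = F(Q alpha)."""
    return F(Q @ a)

# Central-difference Jacobian of T at alpha.
eps = 1e-6
H_st_numeric = np.zeros((2, 6))
for k in range(6):
    step = np.zeros(6)
    step[k] = eps
    H_st_numeric[:, k] = (T(alpha + step) - T(alpha - step)) / (2 * eps)

H_st_analytic = H(Q @ alpha) @ Q
max_abs_diff = np.abs(H_st_numeric - H_st_analytic).max()
rank_H_st = np.linalg.matrix_rank(H_st_analytic)
print(max_abs_diff, rank_H_st)
```

In this example the stratified Jacobian has full row rank $u = 2$, in line with the rank argument that follows.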
s
Since $Q$ has only one non-zero entry in each column and is $\sum_i r_i \times \sum_i D_i r_i$, $\operatorname{rk} Q = \sum_i r_i$. Thus

$$\operatorname{rk} H_{st}(\alpha) = u.$$

Consequently

$$SS[T(\alpha) = X\beta] = (T(a) - Xb_{st})'\,S_{st}^{-1}\,(T(a) - Xb_{st})$$

is asymptotically $\chi^2(u-v)$, where

$$b_{st} = (X'S_{st}^{-1}X)^{-1}X'S_{st}^{-1}T(a).$$

Thus if (1') is valid, $b_{st}$ is a BAN estimator of $\beta$. The test suggested by GSK for (2) then uses

$$SS[C\beta = 0] = b_{st}'C'\,[C(X'S_{st}^{-1}X)^{-1}C']^{-1}\,Cb_{st}$$

which is asymptotically $\chi^2(d)$ if (2) is true.
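A minimal sketch of the stratified computation, assuming NumPy and a linear $F(\pi) = A\pi$ so that $H_{st} = AQ$. The counts, weights, matrices `A` and `X`, and the helper name `gsk_wls` are hypothetical; the same helper applies to the simple random sampling case of Section 3 when it is given $F(p)$, $H$ and $V(p)$.

```python
import numpy as np

def gsk_wls(F_hat, H_hat, V_hat, X):
    """Weighted least squares for F = X beta with S = H V H'.
    Returns b, (X'S^-1 X)^-1 and the statistic on u - v degrees of freedom."""
    S = H_hat @ V_hat @ H_hat.T
    S_inv = np.linalg.inv(S)
    cov_b = np.linalg.inv(X.T @ S_inv @ X)
    b = cov_b @ X.T @ S_inv @ F_hat
    resid = F_hat - X @ b
    return b, cov_b, float(resid @ S_inv @ resid)

# Hypothetical counts n_i(h)j: 2 populations x 2 strata x 2 categories.
counts = np.array([[[20, 30], [10, 40]],
                   [[30, 20], [15, 35]]], dtype=float)
W = np.array([[0.5, 0.5], [0.5, 0.5]])          # stratum weights W_ih

s, D, r = counts.shape
n_h = counts.sum(axis=2)                         # n_i(h)
a = (counts / n_h[:, :, None]).reshape(-1)       # stacked a_i(h)j

# V(a): block diagonal, one multinomial block per (population, stratum).
V_a = np.zeros((s * D * r, s * D * r))
for i in range(s):
    for h in range(D):
        k = (i * D + h) * r
        a_h = a[k:k + r]
        V_a[k:k + r, k:k + r] = (np.diag(a_h) - np.outer(a_h, a_h)) / n_h[i, h]

# Q = diag(W_i kron I_r), so that p_st = Q a.
Q = np.zeros((s * r, s * D * r))
for i in range(s):
    Q[i * r:(i + 1) * r, i * D * r:(i + 1) * D * r] = np.kron(W[i][None, :], np.eye(r))

A = np.array([[1.0, 0, 0, 0], [0, 0, 1.0, 0]])   # first-category proportions
X = np.array([[1.0], [1.0]])                     # model: equal across populations
T_a = A @ Q @ a                                  # T(a) = F(Qa)
H_st = A @ Q                                     # H_st = H Q for linear F
b_st, cov_b_st, ss = gsk_wls(T_a, H_st, V_a, X)
print(b_st, cov_b_st, ss)
```

Here `ss` would be referred to a chi-square distribution on $u - v = 1$ degree of freedom.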
5. COMPARISON OF $b$ and $b_{st}$ with PROPORTIONAL ALLOCATION
We now assume that

$$n_{i(h)} = W_{ih}\,n_{i\cdot}, \quad h = 1, \ldots, D_i, \; i = 1, \ldots, s.$$

Then the usual estimate of $\pi_{ij}$ is

$$\hat\pi_{ij} = \sum_{h=1}^{D_i} W_{ih}\,a_{i(h)j}.$$

Put

$$\hat\pi_i' = (\hat\pi_{i1}, \ldots, \hat\pi_{ir_i})$$

and

$$\underset{1 \times \sum_i r_i}{\hat\pi'} = (\hat\pi_1', \ldots, \hat\pi_s').$$

Then

$$p_{st} = \hat\pi = Qa$$

and

$$\operatorname{var}(\hat\pi) = \operatorname{var}(p_{st}) = \big(\operatorname{cov}(\hat\pi_{ij}, \hat\pi_{i'j'})\big).$$

Also, in this case, since $H_{st}(a) = H(p_{st})\,Q$,

$$S_{st} = H(p_{st})\,V(p_{st})\,H(p_{st})'$$

where $V(p_{st}) = Q\,V(a)\,Q'$ = usual estimate of $\operatorname{var}(p_{st})$.
As a notational convenience, $A \le B$ and $A < B$, where $A$ and $B$ are square matrices, means that $B - A$ is positive semi-definite and positive definite, respectively. $\ge$ and $>$ have equivalent meanings.
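Lemma
With proportional allocation, $\operatorname{var}(\hat\pi) \le V(\pi)$.

Proof:
First note that both $V(\pi)$ and $\operatorname{var}(\hat\pi)$ are composed of (square) block diagonal matrices:

$$V(\pi) = \begin{pmatrix} V(\pi_1) & & 0 \\ & \ddots & \\ 0 & & V(\pi_s) \end{pmatrix}$$

and

$$\operatorname{var}(\hat\pi) = \begin{pmatrix} \operatorname{var}(\hat\pi_1) & & 0 \\ & \ddots & \\ 0 & & \operatorname{var}(\hat\pi_s) \end{pmatrix}.$$

Thus if we show that

$$\operatorname{var}(\hat\pi_i) \le V(\pi_i), \quad i = 1, \ldots, s,$$

we are done. But $n_{i\cdot}\,(V(\pi_i) - \operatorname{var}(\hat\pi_i))$ is the $r_i \times r_i$ matrix whose $(j,k)$ element is

$$\sum_{h=1}^{D_i} W_{ih}\,(\alpha_{i(h)j} - \pi_{ij})(\alpha_{i(h)k} - \pi_{ik}).$$

Now let $X_i = (X_{i1}, \ldots, X_{ir_i})'$ with $X_{ij} = \alpha_{i(h)j}$, $j = 1, \ldots, r_i$, where $h$ is a randomly chosen integer, $1 \le h \le D_i$, with $P(h = h_0) = W_{ih_0}$. Then

$$E(X_{ij}) = \pi_{ij}, \qquad \operatorname{var}(X_{ij}) = \sum_h W_{ih}\,(\alpha_{i(h)j} - \pi_{ij})^2$$

and

$$\operatorname{cov}(X_{ij}, X_{ik}) = \sum_h W_{ih}\,(\alpha_{i(h)j} - \pi_{ij})(\alpha_{i(h)k} - \pi_{ik}).$$

Thus the covariance matrix of $X_i$ is $n_{i\cdot}\,(V(\pi_i) - \operatorname{var}(\hat\pi_i))$, which is therefore positive semi-definite.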
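The identity at the heart of the lemma can be verified numerically for a single population. The following sketch (NumPy assumed; weights, stratum proportions and sample size are hypothetical) compares $n_{i\cdot}(V(\pi_i) - \operatorname{var}(\hat\pi_i))$ under proportional allocation with the covariance matrix of the random vector $X_i$ used in the proof.

```python
import numpy as np

W = np.array([0.4, 0.6])                   # stratum weights W_ih
alpha = np.array([[0.5, 0.3, 0.2],         # alpha_i(1)
                  [0.2, 0.3, 0.5]])        # alpha_i(2)
n = 200                                    # total sample size n_i.
pi = W @ alpha                             # pi_ij = sum_h W_ih alpha_i(h)j

def mult_cov(p, m):
    """Multinomial covariance matrix of proportions for sample size m."""
    return (np.diag(p) - np.outer(p, p)) / m

V_pi = mult_cov(pi, n)                     # simple random sampling
n_h = W * n                                # proportional allocation
var_pi_hat = sum(W[h] ** 2 * mult_cov(alpha[h], n_h[h]) for h in range(2))

# Covariance matrix of X_i: stratum h is drawn with probability W_ih.
dev = alpha - pi
cov_X = sum(W[h] * np.outer(dev[h], dev[h]) for h in range(2))

lhs = n * (V_pi - var_pi_hat)
eigenvalues = np.linalg.eigvalsh(lhs)      # non-negative up to rounding
print(np.allclose(lhs, cov_X), eigenvalues)
```

With only two strata the difference has rank one, so it is positive semi-definite but not positive definite.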
Theorem
Asymptotically, with proportional allocation,

$$\operatorname{var} c'b_{st} \le \operatorname{var} c'b$$

for all vectors $c \ne 0$. This is equivalent to

$$(X'S_{st}^{-1}(\alpha)X)^{-1} \le (X'S^{-1}(\pi)X)^{-1}.$$

Proof:
Since

$$\operatorname{var} c'b = c'(X'S^{-1}X)^{-1}c$$

the equivalence is clear. Now from the lemma and the fact that $H(\pi)$ has linearly independent rows it follows that

$$S(\pi) - S_{st}(\alpha) = H(\pi)\,[V(\pi) - \operatorname{var}(\hat\pi)]\,H'(\pi) \ge 0.$$

Since $S_{st}(\alpha) > 0$, its inverse possesses a (symmetric) square root $P_1(\alpha) = P_1$, say for convenience. Then $P_1 S(\pi) P_1 > 0$ and by the Principal Axis Theorem there exists an orthogonal matrix $P_2(\alpha) = P_2$, say, such that

$$P_2 P_1 S(\pi) P_1 P_2' = D$$

where $D$ is diagonal of full rank. Thus, with $P(\alpha) = P = P_2 P_1$,

$$P\,S(\pi)\,P' = D, \qquad P\,S_{st}(\alpha)\,P' = P_2 P_2' = I.$$

But $D - I = P\,[S(\pi) - S_{st}(\alpha)]\,P' \ge 0$ and therefore every diagonal element of $D$ is greater than or equal to 1.
Let $d_k$ denote the $k$th diagonal element of $D$. Then

$$S^{-1}(\pi) = P'D^{-1}P, \qquad S_{st}^{-1}(\alpha) = P'P.$$

Put

$$U(\delta) = [X'P'(I - E(\delta))PX]^{-1}$$

where
$\delta \in \mathcal{I} = \{\delta = (\delta_1, \ldots, \delta_u) \mid 0 \le \delta_k < 1, \; k = 1, \ldots, u\}$,
$E(\delta)$ = diagonal matrix with $\delta$ as the diagonal vector.
Thus

$$(X'S_{st}^{-1}(\alpha)X)^{-1} = U(0), \qquad (X'S^{-1}(\pi)X)^{-1} = U(\delta_\pi)$$

for some $\delta_\pi \in \mathcal{I}$, namely $\delta_{\pi k} = 1 - 1/d_k$. We are then finished once we prove that $x'U(\delta)x$, $x \ne 0$, is a non-decreasing function of each $\delta_k$ for $\delta \in \mathcal{I}$. But

$$\frac{\partial}{\partial \delta_k}\,x'U(\delta)x = -x'U(\delta)\left[\frac{\partial}{\partial \delta_k}(PX)'(I - E(\delta))(PX)\right]U(\delta)x = (x'U(\delta)\,p_k)^2 \ge 0$$

where $p_k$ is the $k$th column of $(PX)'$. This completes the proof.
Corollary
Let $X' = (X_1', \ldots, X_s')$ where $X_i$ is defined in the proof of the lemma. Let $\mathcal{X}$ be a matrix whose $\prod_{i=1}^{s} D_i$ rows are the realizations of $X$. Then $\operatorname{var} c'b_{st} < \operatorname{var} c'b$ for all $c \ne 0$ if and only if

$$\operatorname{rank}\,[\,\mathbf{1}, \mathcal{X}H'\,] = u + 1,$$

where $\mathbf{1}$ is a column vector of 1's.

Proof:
Since $S(\pi) - S_{st}(\alpha)$ is the covariance matrix of $HX$ under the probability mass function

$$P\big(X = (\alpha_{1(h_1)}, \ldots, \alpha_{s(h_s)})\big) = \prod_{i=1}^{s} P\big(X_i = \alpha_{i(h_i)}\big)$$

the condition given is equivalent to $S(\pi) - S_{st}(\alpha) > 0$. The corollary then follows directly from the proof of the theorem.
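The ordering in the theorem and the rank condition in the corollary can be explored numerically. The sketch below (NumPy assumed) treats a single population with a linear $F(\pi) = A\pi$ and population values in place of estimates; the helper `compare`, the matrices `A` and `X`, and all numbers are hypothetical. It contrasts a two-stratum design, where the rank condition fails, with a three-stratum design, where it holds.

```python
import numpy as np

def mult_cov(p, m):
    """Multinomial covariance matrix of proportions for sample size m."""
    return (np.diag(p) - np.outer(p, p)) / m

def compare(W, alpha, n, A, X):
    """One population, linear F(pi) = A pi, proportional allocation.
    Returns min eigenvalue of S - S_st, min eigenvalue of the difference
    of the covariance matrices of b and b_st, and rank [1, alpha A']."""
    W, alpha = np.asarray(W, float), np.asarray(alpha, float)
    pi = W @ alpha
    V_pi = mult_cov(pi, n)
    var_pi_hat = sum(w ** 2 * mult_cov(a_h, w * n) for w, a_h in zip(W, alpha))
    S, S_st = A @ V_pi @ A.T, A @ var_pi_hat @ A.T
    cov_b = np.linalg.inv(X.T @ np.linalg.inv(S) @ X)
    cov_b_st = np.linalg.inv(X.T @ np.linalg.inv(S_st) @ X)
    # Rows of alpha are the realizations of X_i; prepend a column of ones.
    M = np.hstack([np.ones((len(W), 1)), alpha @ A.T])
    return (np.linalg.eigvalsh(S - S_st).min(),
            np.linalg.eigvalsh(cov_b - cov_b_st).min(),
            np.linalg.matrix_rank(M))

A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # u = 2
X = np.array([[1.0], [1.0]])                       # v = 1

two = compare([0.4, 0.6],
              [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]], 200, A, X)
three = compare([0.3, 0.3, 0.4],
                [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [0.3, 0.5, 0.2]], 200, A, X)
print(two)
print(three)
```

In the two-stratum case the rank is 2, less than $u + 1 = 3$, and the difference of the two $S$ matrices is singular; in the three-stratum case the rank is 3 and the difference is positive definite. In both cases the covariance matrix of $b$ is no smaller than that of $b_{st}$.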
6. DISCUSSION
Under the model $F(\pi) = X\beta$, where $F(\cdot)$ satisfies the conditions given in Section 2, the estimator of $\beta$ under the least-squares algorithm behaves asymptotically as a linear function of the estimators of the cell probabilities (see [10]). Consequently if $F(\pi)$ is such that its components, $f_k(\cdot)$, are derived from non-overlapping groups of the $s$ populations, then the results in Section 5 are rather obvious. We have shown that this extends to rather arbitrary linear models.
The approach taken above ignores the finite population correction factor. However, as Johnson and Koch [5] remark, multiplication of $V(p)$ by $(1-f)$, where $f$ is the sampling fraction, results in valid large-sample statistics based on the work of Wald [11] when the population is considered finite. If the fpc were considered here the basic results would clearly remain valid when all covariance matrices are multiplied by the appropriate $(1-f_i)$. This is easily seen by consideration of sequences of problems with stratum sizes $N_{ih}$ and sample sizes $n_{i(h)}$, with both $N_{ih}$ and $n_{i(h)}$ increasing such that their ratio approaches $f_i = n_{i\cdot}/N_{i\cdot}$. The resulting hypergeometric distributions obey the same central limit theory as their multinomial counterparts.
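The effect of the finite population correction is easiest to see in the special case of a common sampling fraction $f$ for all populations, which is an assumption made here only for illustration. Scaling $S$ by $(1-f)$ then leaves the estimate unchanged, scales its covariance matrix by $(1-f)$ and divides the test statistic by $(1-f)$. The sketch below (NumPy assumed; all numbers and the helper `wls_from_S` are hypothetical) checks this.

```python
import numpy as np

def wls_from_S(F_hat, S, X):
    """Weighted least squares given the covariance matrix S of F_hat."""
    S_inv = np.linalg.inv(S)
    cov_b = np.linalg.inv(X.T @ S_inv @ X)
    b = cov_b @ X.T @ S_inv @ F_hat
    resid = F_hat - X @ b
    return b, cov_b, float(resid @ S_inv @ resid)

F_hat = np.array([0.30, 0.45])        # hypothetical estimated functions
S = np.diag([0.002, 0.00225])         # hypothetical covariance matrix
X = np.array([[1.0], [1.0]])
f = 0.1                               # assumed common sampling fraction

b, cov_b, ss = wls_from_S(F_hat, S, X)
b_fpc, cov_b_fpc, ss_fpc = wls_from_S(F_hat, (1 - f) * S, X)
print(b, b_fpc)
print(cov_b, cov_b_fpc)
print(ss, ss_fpc)
```

When the sampling fractions differ between populations the weights change as well, so the estimate itself may move.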
The weighted least-squares procedure is vulnerable to small sample problems due to observed zero cell counts. In the unstratified case if this happens $S$ will not be of full rank because the corresponding $\pi_{ij}$ will be estimated by 0. GSK [4] suggest using an "estimate" $1/(r_i n_{i\cdot})$. However, in the stratified case the invertibility of $S_{st}$ depends upon the covariance matrix of the stratified estimates of the $\pi_{ij}$, not upon the covariance matrix from the individual strata, i.e., the covariance matrix of the estimates of the $\alpha_{i(h)j}$ for fixed $i$, $h$. Thus, problems will be encountered when all strata within a given population have the same cell empty. Following GSK, each corresponding $\alpha_{i(h)j}$ can be estimated [...]. Such an occurrence presumably would arise no more often than having a zero cell count using simple random sampling with identical total sample size.
Finally, the formulation and results in Sections 4 and 5 allow for a general study of optimal allocation as a function of costs associated with sampling from each stratum, the model $F(\pi) = X\beta$ and the variance-covariance matrix of the estimators. Results along these lines will be presented in a sequel.
REFERENCES
[1] Bhapkar, V.P., "Notes on Analysis of Categorical Data," Series No. 477, Institute of Statistics, University of North Carolina, 1966 (mimeographed).
[2] Bhapkar, V.P., "A Note on the Equivalence of Two Test Criteria for Hypotheses in Categorical Data," Journal of the American Statistical Association, 61 (March 1966), 228-35.
[3] Forthofer, Ronald N. and Koch, Gary G., "An Analysis for Compounded Functions of Categorical Data," Biometrics, 29 (March 1973), 143-57.
[4] Grizzle, James E., Starmer, C. Frank and Koch, Gary G., "Analysis of Categorical Data by Linear Models," Biometrics, 25 (September 1969), 489-503.
[5] Johnson, William D. and Koch, Gary G., "Analysis of Qualitative Data: Linear Models," Health Sciences Research, (Winter 1970), 358-69.
[6] Koch, Gary G., Freeman, Daniel H. Jr., and Freeman, Jean L., "Strategies in the Multivariate Analysis of Data from Complex Surveys," International Statistical Review, 43 (No. 1, 1975), 59-78.
[7] Koch, Gary G., Imrey, Peter B., and Reinfurt, Donald W., "Linear Model Analysis of Categorical Data with Incomplete Response Vectors," Biometrics, 28 (September 1972), 663-92.
[8] Koch, Gary G., Johnson, William D., and Tolley, H. Dennis, "A Linear Models Approach to the Analysis of Survival and Extent of Disease in Multidimensional Contingency Tables," Journal of the American Statistical Association, 67 (December 1972), 783-96.
[9] Koch, Gary G. and Reinfurt, Donald W., "The Analysis of Categorical Data from Mixed Models," Biometrics, 27 (March 1971), 157-73.
[10] Neyman, Jerzy, "Contribution to the Theory of the $\chi^2$ Test," in J. Neyman, Ed., Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley and Los Angeles, 1949, 239-73.
[11] Wald, Abraham, "Tests of Statistical Hypotheses Concerning Several Parameters when the Number of Observations Is Large," Transactions of the American Mathematical Society, 54 (November 1943), 426-482.