LOGISTIC REGRESSION ANALYSIS FOR COMPLEX SAMPLE DATA
by
Lloyd E. Chambless
and
Kerrie E. Boyle
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series Number 1440
June 1983
LOGISTIC REGRESSION ANALYSIS FOR COMPLEX SAMPLE DATA
Lloyd E. Chambless
Kerrie E. Boyle
Lipid Research Clinics Program
Department of Biostatistics
University of North Carolina
Chapel Hill, NC 27514
Key Words: Asymptotic normality, Estimation for subdomains,
Horvitz-Thompson estimators, Maximum likelihood estimation,
Modeling for finite populations, Stratified sampling,
Taylor series expansion, Variance estimation
This research was supported by NHLBI Contract No.
M01~HV-1-223-L.
SUMMARY
The estimation of parameters in linear and logistic regression
based on a stratified random sample from a finite population is
investigated. The notion of an infinite superpopulation facilitates
the development of asymptotic normal distribution theory. Procedures
for parameter estimation and statistical inference on a subdomain of
the finite population are also developed. We illustrate these modeling
techniques, showing the effects on inference of ignoring the sampling
design in estimation of model coefficients. Finally, we extend our
methods to logistic regression on data gathered in a two-stage
stratified sample design.
1. INTRODUCTION
In the situation where data arise from a sampling scheme more complex
than simple random sampling, we would like to account for the sampling
design in our analysis of the data. While this is often done when the
analysis entails only estimates of population means or totals, the
situation is otherwise for more complex analyses. However, interest in
accounting for the sampling design in analysis has increased in recent
years; see Fuller (1975), Särndal (1978), O'Brien (1981), Binder (1981),
and DuMouchel and Duncan (1981). In this study we consider primarily
logistic regression for stratified random sample data, though we also
include an extension to a two-stage stratified sampling situation.
That it is desirable to account for the sampling design in estimating
population means seems generally accepted. In a linear model setting,
we would be "modeling" $Y = X\beta$, where $Y$ is the variable of
interest, and $X$ is identically one. The controversy begins when more
variables are included in $X$. Suppose $X$ contains only an intercept
term and an indicator variable, say for smoking versus non-smoking. If
the sampling of the U.S. population were done by race, with equal
numbers of whites and blacks, little argument would be raised in
accounting for the sampling scheme, via weighted averages, when
estimating the population means of $Y$ for each level of the smoking
variable. Suppose we want to adjust for age differences, by adding age
as an independent variable in our budding model. At this point there
are those who would argue for a return to the ordinary least squares
(OLS) fit of the homoscedastic model to "test" the difference between
smokers and non-smokers. Clearly if the mean of $Y$ differs between the
races, this method will lead to "adjusted means" by smoking class that
are "biased" toward the mean for blacks, who are oversampled. Note that
such a "bias" may or may not be present in the estimate of the
difference in expected $Y$ between smokers and non-smokers.
DuMouchel and Duncan (1981) present several models for consideration.
One is the OLS model just discussed. Another, the "mixture model", is
essentially OLS within each sampling stratum, then a weighted average
thereof; i.e., a $\beta_j$ is estimated for each stratum, and the
parameter of interest is the $\bar{\beta} = \sum_j w_j \beta_j$, where
the weights $w_j$ are proportional to the inverse sampling proportion.
But $\bar{\beta}$ is problematic when there are large numbers of small
strata, since stratum-specific variable estimates become unstable. In
addition, this "mean of parameters" parameter may not be the desired
one, i.e., there are instances where an "overall population" parameter
is the object of interest. Another model, of the type considered in
this paper, defines $\beta$ as giving the best linear predictor
$X\beta$ for $Y$ in the sense of minimizing the expected squared error
of prediction. Or in logistic regression, our main focus, $\beta$ will
be defined to give the best linear predictor $X\beta$ of
logit $\{\Pr(Y=1)\}$ in the sense of maximizing the expected
"likelihood" for the finite population. Of course if the relationship
of interest varies too much by sampling strata, an "overall" $\beta$
may be inappropriate. How much variation is too much for an overall
measure to be useful? How feasible is it to show stratum-specific
results? The answers to such questions are not fixed, but depend on the
particular problem at hand. In a broader context, whether to include
some of the stratum-defining variables in the model is really just part
of a more general model building procedure.
In addition to concern about the validity of estimation which
ignores the sampling design, one must also be concerned with variance
estimates. Even if assuming away the sampling design has little effect
on the estimates of the parameters of interest, there may be a
substantial effect on the estimated variances of the estimators
(O'Brien, 1981).
The analytic model of interest in this paper is logistic regression;
however, the basic techniques are applicable to other transformations
of binary data. After introducing some notation, we define the
parameter of interest and develop large sample distribution theory,
including hypothesis testing procedures. In applications we have
applied an iteratively reweighted least squares algorithm to maximum
likelihood estimation of the model's coefficients, and we briefly
discuss this. We also investigate logistic regression modeling on a
subdomain of the population. We then utilize our modeling techniques in
an example. Parameter estimates and statistical inferences obtained
from a logistic regression analysis that accounts for the stratified
random sampling scheme are compared to those from an analogous model
that assumes a simple random sample. Finally, we present an extension
of our techniques for two-stage stratified sampling.
2. NOTATION
We consider a population
\[ \Omega = \bigcup_{i=1}^{K} \Omega_i , \]
a disjoint union of $K$ strata. We also consider subsets
$S_i \subseteq R_i \subseteq \Omega_i$ for each $i$, where
\[ n_i = \text{size of } S_i, \qquad n = \sum_{i=1}^{K} n_i, \qquad S = \bigcup_{i=1}^{K} S_i, \]
\[ N_i = \text{size of } R_i, \qquad N = \sum_{i=1}^{K} N_i, \qquad R = \bigcup_{i=1}^{K} R_i . \]
If $Y$ is a variable of interest, we will use $Y_{ij}$ to designate the
$j$th observation on $Y$ in $R_i$ or $S_i$. We use "tilded" capital
script characters to designate the whole vector of observations on $Y$
in $R$:
\[ \tilde{\mathcal{Y}}' = (Y_{11} \cdots Y_{1N_1} \cdots Y_{KN_K}). \]
For the vector of observations on $Y$ in $S$ we use "tilded" caps:
\[ \tilde{Y}' = (Y_{11} \cdots Y_{1n_1} \cdots Y_{Kn_K}). \]
We will interpret $R$ as the finite population of interest, considered
as a random sample of the infinite superpopulation $\Omega$. $S$ will
be taken as a stratified random (non-replacement) sample from $R$. Our
interest is in statistical inference from $S$ to $R$.
3. DEFINING THE PARAMETER OF INTEREST
Suppose we had variables $Y$ and $X$ on $\Omega$, and some hypothesized
"relation" $F(Y) = G(X;\beta)$ with $G$ depending on parameter $\beta$.
The usual problem is to estimate $\beta$, an $r \times 1$ vector, from
sample data. Examples of such relations are linear regression models
$Y = X\beta$, or, more generally, nonlinear regression models
$Y = G(X;\beta)$. Another example is the case where $F$ is a density or
probability function we want to estimate. This paper deals primarily
with this latter example, though we will briefly mention the
"regression" example for motivational purposes.
3.1 Linear Regression
Consider the model $E(Y) = X\beta$ (or $Y = X\beta + \epsilon$) on $R$.
Using $R$ we would estimate $\beta$, using the least squares criterion,
by
\[ B = (\mathcal{X}'\mathcal{X})^{-1} \mathcal{X}'\mathcal{Y} \qquad (1) \]
where $X_{ij}$ is an $r$-dimensional row vector $(X_{ijt})$ of
"independent" variables. Actually our main interest in this study does
not lie with $\beta$ and the "superpopulation" $\Omega$. Rather, we are
concerned with $B$ and the finite population $R$, and so take (1) as
the definition of a parameter on the finite population $R$. Not
actually having $R$, we estimate $B$ by
\[ \hat{B} = (X'WX)^{-1} X'WY, \]
where $Y$ and $X$ are defined on the sample $S$ similarly to
$\mathcal{Y}$ and $\mathcal{X}$, and $W = \mathrm{diag}(w_{ij})$ for
$w_{ij} = f_i^{-1}$, where $f_i$ is the sampling proportion in the
$i$th stratum. Then $X'WY$ and $X'WX$ are the (unbiased)
Horvitz-Thompson estimators (Cochran, 1977) of
$\mathcal{X}'\mathcal{Y}$ and $\mathcal{X}'\mathcal{X}$, respectively.
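The weighted estimator $(X'WX)^{-1}X'WY$ can be sketched for the special case of an intercept and a single covariate. This is only an illustration under stated assumptions, not the authors' software: the function name `ht_weighted_ls` and the toy data are hypothetical, and every unit in stratum $i$ is simply given the weight $1/f_i$.

```python
def ht_weighted_ls(strata):
    """Fit y = b0 + b1*x by least squares with weights 1/f_i.

    `strata` is a list of (f_i, observations) pairs, where f_i is the
    sampling proportion of stratum i and observations is a list of (x, y).
    """
    sw = swx = swy = swxx = swxy = 0.0
    for f, obs in strata:
        w = 1.0 / f  # Horvitz-Thompson weight shared by the whole stratum
        for x, y in obs:
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y
    # Solve the 2x2 weighted normal equations (X'WX) b = X'WY.
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1


# Hypothetical sample: stratum 1 sampled at 50%, stratum 2 at 10%.
toy = [(0.5, [(0.0, 1.0), (1.0, 3.0)]), (0.1, [(2.0, 5.0), (3.0, 7.0)])]
b0, b1 = ht_weighted_ls(toy)
```

Because the toy responses lie exactly on the line y = 1 + 2x, any choice of weights recovers the same coefficients; with noisy data the weights matter whenever the relation differs across strata.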
3.2 Nonlinear Regression
For a least squares fit of a nonlinear regression model
$E(Y) = G(X;\beta)$, the normal equations on $R$ are
\[ \mathcal{Z}'\mathcal{G} = \mathcal{Z}'\mathcal{Y}, \]
where by $\mathcal{Z}$ we mean the $N \times r$ matrix with entries
$\partial G(X_{ij};\beta) / \partial \beta_t$. We can then "estimate"
$\mathcal{Z}'\mathcal{G}$ and $\mathcal{Z}'\mathcal{Y}$ by
Horvitz-Thompson estimators $Z'WG$ and $Z'WY$, respectively, with $W$
as before and $Z$, $G$, and $Y$ the stratified random sample relatives
of $\mathcal{Z}$, $\mathcal{G}$, and $\mathcal{Y}$. To estimate $B$, we
then equate $Z'WG$ and $Z'WY$ and solve for $\hat{B}$, generally by
iterative methods.
3.3 Density
Suppose now that we hypothesize $E\{F(Y)\} = X\beta$. If $F$ is
invertible, we could rewrite this, heuristically,
$E(Y) = F^{-1}(X\beta)$ and apply the methods of the previous
paragraph. However, if $Y$ is a zero-one random variable, so that
$E(Y) = \Pr(Y=1) = F^{-1}(X\beta)$, instead of applying nonlinear least
squares, we take a maximum likelihood approach.
The likelihood of our sample $R$ is:
\[ L(\beta) = \prod_{i=1}^{K} \prod_{j=1}^{N_i} F^{-1}(X_{ij}\beta)^{Y_{ij}} \{1 - F^{-1}(X_{ij}\beta)\}^{1-Y_{ij}}
            = \prod_{i=1}^{K} \prod_{j=1}^{N_i} P_{ij}(\beta)^{Y_{ij}} \{1 - P_{ij}(\beta)\}^{1-Y_{ij}} \qquad (2) \]
where $P_{ij}(\beta) = F^{-1}(X_{ij}\beta)$.
The resulting likelihood equations are
\[ 0 = U = \frac{\partial \log L}{\partial \beta} = \mathcal{Z}'(\mathcal{Y} - \mathcal{P}), \qquad (3) \]
where by $\mathcal{Z}$ we mean the $N \times r$ matrix with entries
$\{P_{ij}(1-P_{ij})\}^{-1} \, \partial P_{ij} / \partial \beta_t$.
The quantities $\mathcal{Z}$, $\mathcal{Y}$, and $\mathcal{P}$ are on
the finite population, and we assume there is a solution $B$. This we
take as defining the parameter of interest on the finite population
$R$. Not having $R$ but only the stratified random sample
$S \subset R$, we utilize $S$ to compute the "Horvitz-Thompson"
estimators $\widehat{Z'Y}$ and $\widehat{Z'P}$, where
\[ \widehat{Z'Y} = \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} Z_{ij}' Y_{ij}, \]
and similarly for $\widehat{Z'P}$. Then the equations
\[ 0 = \hat{U} = Z'WY - Z'WP \qquad (4) \]
are solved, iteratively, for $\hat{B}$. Note that these "likelihood"
equations can be written $\partial \log \hat{L} / \partial \beta = 0$,
where
\[ \hat{L} = \prod_{i=1}^{K} \prod_{j=1}^{n_i} \left[ F^{-1}(X_{ij}\beta)^{Y_{ij}} \{1 - F^{-1}(X_{ij}\beta)\}^{1-Y_{ij}} \right]^{f_i^{-1}} . \]
For the rest of the paper we will restrict our attention to logistic
regression, where
\[ F^{-1}(X_{ij}\beta) = \exp(X_{ij}\beta) / \{1 + \exp(X_{ij}\beta)\}. \]
In this case the likelihood equations are
\[ \mathcal{X}'(\mathcal{Y} - \mathcal{P}) = 0, \qquad (3') \]
or using the stratified random sample,
\[ X'W(Y - P) = 0. \qquad (4') \]
Before proceeding to derive the large sample distribution of the
estimators of $B$, we point out that (4') reduces to the "right"
estimates in the simple case where $X$ is a zero-one variable.
Specifically, for the $i$th stratum consider the 2x2 table

              Y = 1    Y = 0
    X = 1      a_i      b_i
    X = 0      c_i      d_i

where $a_i + b_i + c_i + d_i = n_i$. The model is
\[ \Pr(Y=1 \mid X) = \frac{\exp(\alpha + \beta X)}{1 + \exp(\alpha + \beta X)} \]
and the maximum likelihood estimates for $\alpha$ and $\beta$ are
\[ \exp(\hat{\alpha}) = \frac{\sum_{i=1}^{K} f_i^{-1} c_i}{\sum_{i=1}^{K} f_i^{-1} d_i} \]
and
\[ \exp(\hat{\beta}) = \frac{\left(\sum_{i=1}^{K} f_i^{-1} a_i\right)\left(\sum_{i=1}^{K} f_i^{-1} d_i\right)}
                            {\left(\sum_{i=1}^{K} f_i^{-1} b_i\right)\left(\sum_{i=1}^{K} f_i^{-1} c_i\right)} . \]
If we then estimate $\Pr(Y=1 \mid X=0) = P_0$ by
\[ \hat{P}_0 = \frac{e^{\hat{\alpha}}}{1 + e^{\hat{\alpha}}}, \]
we can show that
\[ \hat{P}_0 = \frac{\sum_{i=1}^{K} \sum_{j=1,\, X=0}^{n_i} f_i^{-1} Y_{ij}}
                    {\sum_{i=1}^{K} \sum_{j=1,\, X=0}^{n_i} f_i^{-1}} . \]
This is just the estimate we would get from a weighted mean approach
in which the sampling proportions on the $X = 0$ subpopulation are
assumed to be the same as those on the full population.
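The closed-form estimates for the zero-one covariate case can be checked numerically against the weighted likelihood equations (4'). The sketch below is illustrative only; the function names and the two toy strata are assumptions, with the cell labels a = (X=1, Y=1), b = (X=1, Y=0), c = (X=0, Y=1), d = (X=0, Y=0).

```python
import math


def closed_form_2x2(tables):
    """tables: list of (f_i, a_i, b_i, c_i, d_i), one entry per stratum."""
    A = sum(a / f for f, a, b, c, d in tables)
    B = sum(b / f for f, a, b, c, d in tables)
    C = sum(c / f for f, a, b, c, d in tables)
    D = sum(d / f for f, a, b, c, d in tables)
    alpha = math.log(C / D)
    beta = math.log(A * D / (B * C))
    return alpha, beta


def weighted_score(alpha, beta, tables):
    """The two components of X'W(Y - P) for logit P = alpha + beta*X."""
    p1 = 1.0 / (1.0 + math.exp(-(alpha + beta)))
    p0 = 1.0 / (1.0 + math.exp(-alpha))
    u_int = u_slope = 0.0
    for f, a, b, c, d in tables:
        w = 1.0 / f
        u_slope += w * (a - (a + b) * p1)
        u_int += w * (a - (a + b) * p1 + c - (c + d) * p0)
    return u_int, u_slope


# Hypothetical strata sampled at 50% and 10%.
toy_tables = [(0.5, 3, 2, 1, 4), (0.1, 2, 3, 2, 3)]
alpha_hat, beta_hat = closed_form_2x2(toy_tables)
score = weighted_score(alpha_hat, beta_hat, toy_tables)
p0_hat = math.exp(alpha_hat) / (1.0 + math.exp(alpha_hat))
```

Both score components vanish at the closed-form values, and `p0_hat` equals the weighted proportion of Y = 1 among the X = 0 sample units.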
4. LARGE-SAMPLE THEORY
As the next step in our investigation of the parameter of interest
$B$, we will find the asymptotic distribution of $\hat{B} - B$. To
obtain our asymptotic results we use an approach similar to Fuller
(1975): Take a sequence $\{R_m\}$ of finite populations, with strictly
increasing size $N_m$. Each $R_m$ is considered as a random sample from
the infinite superpopulation $\Omega$. Let $R_{mi} = R_m \cap \Omega_i$.
Write $N_{mi}$ for the size of $R_{mi}$. For each $m$ let a stratified
random non-replacement sample $S_m$ of size $n_m$ be selected from
$R_m$, where $S_{mi}$ is the sample from the $i$th stratum. Write
$n_{mi}$ for the size of $S_{mi}$ so that $n_m = \sum_i n_{mi}$.
Further, write $n_{mi}/N_{mi} = f_{mi}$ and let
$\lim f_{mi} = f_i$, for $0 < f_i < 1$. Finally, write
$g_i = 1 - f_i$. Also, let $\lim N_{mi}/N_m = \pi_i$, where
$0 < \pi_i < 1$.
Theorem 1. Assuming the existence on each $\Omega_i$ of the second
moment of the random variable $X'(Y-P)$ and the first moment equal to
zero, we have
\[ n_m^{-1/2} (\hat{U} - U) \to N(0, \Sigma) \]
where
\[ \Sigma = \sum_{i=1}^{K} \pi_i f^{-1} \frac{g_i}{f_i} \Sigma_i , \]
\[ f = \sum_i f_i \pi_i = \sum_i \lim f_{mi} \frac{N_{mi}}{N_m} = \lim \sum_i \frac{n_{mi}}{N_m} = \lim \frac{n_m}{N_m} , \]
and $\Sigma_i$ is the variance of $X'(Y-P)$ on $\Omega_i$.
Proof: Note that we will generally drop the subscript $m$ from the
$R_m$, $n_{mi}$, $N_{mi}$, $f_{mi}$, etc.
\[ n^{-1/2}(\hat{U} - U)
   = n^{-1/2} \left\{ \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} X_{ij}'(Y_{ij} - P_{ij})
     - \sum_{i=1}^{K} \sum_{j=1}^{N_i} X_{ij}'(Y_{ij} - P_{ij}) \right\} \]
\[ = n^{-1/2} \sum_{i=1}^{K} (f_i^{-1} - 1) \sum_{j=1}^{n_i} X_{ij}'(Y_{ij} - P_{ij})
     - n^{-1/2} \sum_{i=1}^{K} \sum_{j=n_i+1}^{N_i} X_{ij}'(Y_{ij} - P_{ij}) . \]
We have
\[ n_i^{-1/2} (f_i^{-1} - 1) \sum_{j=1}^{n_i} X_{ij}'(Y_{ij} - P_{ij}) \to (f_i^{-1} - 1) \, N(0, \Sigma_i), \]
and
\[ n_i^{-1/2} \sum_{j=n_i+1}^{N_i} X_{ij}'(Y_{ij} - P_{ij})
   = (1 - f_i)^{1/2} f_i^{-1/2} (N_i - n_i)^{-1/2} \sum_{j=n_i+1}^{N_i} X_{ij}'(Y_{ij} - P_{ij})
   \to (1 - f_i)^{1/2} f_i^{-1/2} \, N(0, \Sigma_i) \]
by the Lindeberg-Lévy Central Limit Theorem. Thus,
\[ A_i = n_i^{-1/2} \left\{ f_i^{-1} \sum_{j=1}^{n_i} X_{ij}'(Y_{ij} - P_{ij}) - \sum_{j=1}^{N_i} X_{ij}'(Y_{ij} - P_{ij}) \right\}
   \to N\!\left(0, \{ (f_i^{-1} - 1)^2 + (1 - f_i) f_i^{-1} \} \Sigma_i \right) = N\!\left(0, \frac{g_i}{f_i^2} \Sigma_i \right). \]
So:
\[ n^{-1/2}(\hat{U} - U) = \sum_{i=1}^{K} \left( \frac{n_i}{n} \right)^{1/2} A_i
   \to N\!\left(0, \sum_{i=1}^{K} \pi_i f^{-1} \frac{g_i}{f_i} \Sigma_i \right). \]
Theorem 2 (see Binder (1981)). Under the assumptions of Theorem 1,
plus an additional regularity condition as specified in Rao (1973),
p. 293, on the third-order partial derivatives (with respect to
$\beta$) of $P_{ij}(\beta)$, i.e., on the
$X_{ijr} X_{ijs} X_{ijt} P(1-P)(1-2P)$, $n^{1/2}(\hat{B} - B)$ is
asymptotically $N(0, I^{-1} \Sigma I^{-1})$, where
\[ I = f^{-1} \sum_{i=1}^{K} \pi_i I_i , \]
and $I_i$ is the expected value of $X'X P(1-P)$ on $\Omega_i$.
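Proof: By a Taylor series expansion, for any $j = 1, \dots, r$,
\[ n^{-1/2} U_j(\hat{B}) = n^{-1/2} U_j(B) + n^{-1/2} \frac{\partial U_j}{\partial \beta}(B) \, (\hat{B} - B) + o_p(1), \]
using the additional regularity condition and the fact that $\hat{B}$
is a consistent estimator of $B$. But since $U(B) = 0$, only the
derivative term remains. We have
\[ \frac{\partial U}{\partial \beta} = - \sum_{i=1}^{K} \sum_{j=1}^{N_i} X_{ij}' X_{ij} P_{ij}(\beta) \{1 - P_{ij}(\beta)\}, \]
so,
\[ \mathrm{PLIM} \left( N^{-1} \frac{\partial U}{\partial \beta} \right)
   = - \mathrm{PLIM} \sum_{i=1}^{K} \frac{N_i}{N} \frac{1}{N_i} \sum_{j=1}^{N_i} X_{ij}' X_{ij} P_{ij}(\beta) \{1 - P_{ij}(\beta)\}
   = - \sum_{i=1}^{K} \pi_i I_i = - f I . \]
Thus,
\[ \mathrm{PLIM} \left\{ n^{-1/2} U(\hat{B}) - \frac{N}{n} \left( N^{-1} \frac{\partial U}{\partial \beta} \right) n^{1/2} (\hat{B} - B) \right\}
   = \mathrm{PLIM} \left\{ n^{-1/2} U(\hat{B}) - f^{-1} (-fI) \, n^{1/2} (\hat{B} - B) \right\} \]
\[ = \mathrm{PLIM} \left\{ n^{-1/2} U(\hat{B}) + I \, n^{1/2} (\hat{B} - B) \right\} = 0. \qquad (a) \]
Also, expanding $\hat{U}$ about $B$ in the same way and using
$\hat{U}(\hat{B}) = 0$,
\[ \mathrm{PLIM} \, n^{-1} \frac{\partial \hat{U}}{\partial \beta}
   = \mathrm{PLIM} \left\{ - n^{-1} \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} X_{ij}' X_{ij} P_{ij} (1 - P_{ij}) \right\}
   = - \sum_{i=1}^{K} \mathrm{PLIM} \left\{ \frac{N_i}{N} \frac{N}{n} \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}' X_{ij} P_{ij} (1 - P_{ij}) \right\}
   = - f^{-1} \sum_{i=1}^{K} \pi_i I_i = - I . \]
So,
\[ \mathrm{PLIM} \left\{ n^{-1/2} \hat{U}(B) - I \, n^{1/2} (\hat{B} - B) \right\} = 0. \qquad (b) \]
Since $U(B) = 0$, we obtain from (b):
\[ \mathrm{PLIM} \left[ n^{1/2} (\hat{B} - B) - I^{-1} n^{-1/2} \{ \hat{U}(B) - U(B) \} \right]
   = \mathrm{PLIM} \left\{ n^{1/2} (\hat{B} - B) - I^{-1} n^{-1/2} \hat{U}(B) \right\} = 0, \]
and the result follows from Theorem 1.
Theorem 3. Under the hypothesis of Theorems 1 and 2,
(i) a consistent estimator for $\Sigma$ is:
\[ \hat{\Sigma} = \sum_{i=1}^{K} \frac{N_i}{n} \frac{g_i}{f_i} \hat{\Sigma}_i = \sum_{i=1}^{K} \frac{n_i g_i}{n f_i^2} \hat{\Sigma}_i , \]
where
\[ \hat{\Sigma}_i = \frac{1}{n_i - 1} \left[ \sum_{j=1}^{n_i} X_{ij}' X_{ij} \{ Y_{ij} - P_{ij}(\hat{B}) \}^2
   - \frac{1}{n_i} \left( \sum_{j=1}^{n_i} X_{ij}' \{ Y_{ij} - P_{ij}(\hat{B}) \} \right)
     \left( \sum_{j=1}^{n_i} X_{ij} \{ Y_{ij} - P_{ij}(\hat{B}) \} \right) \right], \]
and
(ii) a consistent estimator for $I^{-1} \Sigma I^{-1}$ is
$\hat{I}^{-1} \hat{\Sigma} \hat{I}^{-1}$, where
\[ \hat{I} = \sum_{i=1}^{K} \frac{N_i}{n} \hat{I}_i
   = n^{-1} \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} X_{ij}' X_{ij} P_{ij}(\hat{B}) \{ 1 - P_{ij}(\hat{B}) \}, \qquad
   \hat{I}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}' X_{ij} P_{ij}(\hat{B}) \{ 1 - P_{ij}(\hat{B}) \}. \]
Proof: This follows from the weak law of large numbers and the fact
that $\hat{B}$ converges to $B$ in probability.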
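For an intercept-only model (X identically one) the variance estimator of Theorem 3 collapses to scalars, which makes it easy to sketch. The code below is an illustration under that simplifying assumption; the function name and the toy strata are hypothetical. It forms the estimates of $I$ and $\Sigma$ as in (i) and (ii) and returns their combination divided by $n$ as the estimated variance of $\hat{B}$.

```python
import math


def intercept_only_variance(strata):
    """strata: list of (N_i, ys), ys being the 0/1 sample from stratum i.

    Returns (B_hat, estimated variance of B_hat) for logit P = B.
    """
    n = sum(len(ys) for _, ys in strata)
    N = sum(N_i for N_i, _ in strata)
    # Weighted prevalence: solves sum_i f_i^{-1} sum_j (Y_ij - P) = 0.
    p = sum(N_i / len(ys) * sum(ys) for N_i, ys in strata) / N
    I_hat = 0.0
    Sigma_hat = 0.0
    for N_i, ys in strata:
        n_i = len(ys)
        f_i = n_i / N_i
        g_i = 1.0 - f_i
        # I_hat = n^{-1} sum_i f_i^{-1} sum_j P(1 - P)
        I_hat += (1.0 / f_i) * n_i * p * (1.0 - p) / n
        # Sample variance of the residuals Y - P within the stratum.
        resid = [y - p for y in ys]
        s2 = (sum(r * r for r in resid) - sum(resid) ** 2 / n_i) / (n_i - 1)
        # Sigma_hat = sum_i n_i g_i / (n f_i^2) * Sigma_hat_i
        Sigma_hat += n_i * g_i / (n * f_i ** 2) * s2
    b_hat = math.log(p / (1.0 - p))
    return b_hat, Sigma_hat / (I_hat ** 2 * n)


# Hypothetical strata of sizes 20 and 100 with samples of 4 and 5 units.
b_hat, var_b = intercept_only_variance([(20, [1, 0, 1, 1]), (100, [0, 0, 1, 0, 0])])
```

In this scalar case the result agrees with the familiar stratified-sampling variance of a weighted proportion, carried to the logit scale by the delta method.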
Note that the proofs of Theorems 1 through 3 had nothing to do
with the particular form of $P = F^{-1}(XB)$. The conclusions of the
theorems hold whatever the form of these functions, as long as the
related regularity conditions on the functions hold and we replace
$\mathcal{X}$ by the $N \times r$ matrix $\mathcal{Z}$ of (3) in $U$
and change $\partial U / \partial \beta$ accordingly.
5. HYPOTHESIS TESTING
We wish to show that the usual type of test statistics for testing
hypotheses of the form $H_0$: $AB = C$ are valid in our sampling
framework.
1) Wald Statistic:
\[ W = n (A\hat{B} - C)' (A I^{-1} \Sigma I^{-1} A')^{-1} (A\hat{B} - C). \]
Under $H_0$, $n^{1/2}(A\hat{B} - C) \to N(0, A I^{-1} \Sigma I^{-1} A')$,
so the distribution of $W$ is asymptotically $\chi^2$ with
rank$(A I^{-1} \Sigma I^{-1} A')$ = rank$(A)$ degrees of freedom.
Of course, we use the estimate
$A \hat{I}^{-1} \hat{\Sigma} \hat{I}^{-1} A'$ for the variance.
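When $A$ picks out a single coefficient, the Wald statistic reduces to the squared ratio of the estimate to its standard error, with one degree of freedom. The helper below is a small sketch of that special case; the function name is an assumption, and the inputs are the race coefficients and standard errors reported in Table 1.

```python
def wald_single_coefficient(b_hat, std_err, c=0.0):
    """Wald statistic for H0: B_k = c (A a unit row vector), 1 d.f."""
    return ((b_hat - c) / std_err) ** 2


# Race coefficient and standard error as reported in Table 1.
w_design = wald_single_coefficient(0.70, 0.24)    # accounting for the design
w_ignoring = wald_single_coefficient(0.43, 0.19)  # ignoring the design
```

Each value would be referred to a chi-square distribution with one degree of freedom, whose upper 5% point is about 3.84.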
2) Likelihood Ratio Statistic:
\[ LR = -2 \log \{ \hat{L}(\hat{B}^*) / \hat{L}(\hat{B}) \}, \]
where $\hat{B}^*$ maximizes $\hat{L}(\beta)$ under the restriction
$A\beta = C$. One can show that under $H_0$,
\[ \mathrm{PLIM} \left[ -2 \{ \hat{\ell}(\hat{B}^*) - \hat{\ell}(\hat{B}) \} \right]
   = \mathrm{PLIM} \left[ -2 \{ \ell(\hat{B}^*) - \ell(\hat{B}) \} \right], \]
where $\ell$ is the log-likelihood and $\hat{\ell}$ its sample
estimate. By the usual MLE theory the latter is $\chi^2$ with
rank$(A)$ d.f.
3) Score Statistic:
\[ S = n^{-1} \hat{U}(\hat{B}^*)' \, \Sigma(\hat{B}^*)^{-1} \, \hat{U}(\hat{B}^*). \]
Now $n^{-1/2} \{ \hat{U}(B) - U(B) \}$ is asymptotically
$N(0, \Sigma(B))$. But $U(B) = 0$, so $n^{-1/2} \hat{U}(B)$ is
asymptotically $N(0, \Sigma(B))$.
We can write $AB = C$ equivalently, with a possible re-ordering of
the components of $B$, as $B = (B_1', f(B_1)')'$, where $f$ is an
affine function, and $B_1$ has dimension $r$ - rank$(A)$. Then
$\hat{B}^* = (\hat{B}_1', f(\hat{B}_1)')'$, where $\hat{B}_1$ is the
MLE for $\hat{L}$ considered as a function of $r$ - rank$(A)$
parameters through the relation $\beta = (\beta_1', f(\beta_1)')'$.
Write $\hat{U} = (\hat{U}_1', \hat{U}_2')'$, where $\hat{U}_2$ is
(rank $A \times 1$). By definition $\hat{U}_1(\hat{B}_1) = 0$. So
$n^{-1/2} \hat{U}(\hat{B}^*) = n^{-1/2} (0', \hat{U}_2(\hat{B}^*)')'$
is asymptotically $N(0, \Sigma(\hat{B}^*))$, under $H_0$, since
$(\hat{B}^* - B) \to 0$ in probability under $H_0$. Thus,
\[ n^{-1} \hat{U}(\hat{B}^*)' \, \Sigma(\hat{B}^*)^{-1} \, \hat{U}(\hat{B}^*) \to \chi^2_{\mathrm{rank}(A)} . \]
6. FITTING A MODEL ON A SUBDOMAIN
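Suppose we are interested in fitting a model on a subpopulation
$R'$ of our finite population $R$. Designate with primes the
parameters associated with $R'$, e.g., $N_i'$ is the size of the
$i$th stratum of subpopulation $R'$. Let $S' = S \cap R'$. If the
$N_i'$ are known, then we could assume that $S'$ is a stratified
random sample of $R'$, with sampling proportion
$f_i' = n_i'/N_i'$, and proceed as before. However, if the $N_i'$ are
not known, we must proceed otherwise. We can estimate $N_i'$ as
$\hat{N}_i' = (n_i'/n_i) N_i$. Suppose we wanted to fit a linear model
$E(Y) = X\beta$ on $R'$. Write
\[ \mathcal{X} = \begin{bmatrix} \mathcal{X}_1 \\ \mathcal{X}_2 \end{bmatrix}, \]
with $\mathcal{X}_1$ on $R'$ and $\mathcal{X}_2$ on $R - R'$, with
similar notation on $S$. We want the least squares estimate of
$\mathcal{Y}_1 = \mathcal{X}_1 B$, that is, we want to solve the
equation $(\mathcal{X}_1'\mathcal{X}_1) B = \mathcal{X}_1'\mathcal{Y}_1$.
If we knew the sampling proportions $f_i'$ for the subpopulation, we
would estimate $\mathcal{X}_1'\mathcal{X}_1$ and
$\mathcal{X}_1'\mathcal{Y}_1$ by the Horvitz-Thompson estimators as we
did on the entire finite population. Not knowing the $f_i'$, we
immediately think of estimating them by:
\[ \hat{f}_i' = \frac{n_i'}{\hat{N}_i'} = \frac{n_i'}{(n_i'/n_i) N_i} = \frac{n_i}{N_i} = f_i , \]
and then employing these in estimating $\mathcal{X}_1'\mathcal{X}_1$
and $\mathcal{X}_1'\mathcal{Y}_1$. Writing the weight matrix $W$ as
\[ W = \begin{bmatrix} W_1 & 0 \\ 0 & W_2 \end{bmatrix}, \]
where $W_1$ corresponds to the subpopulation of interest, the
estimates for $\mathcal{X}_1'\mathcal{X}_1$ and
$\mathcal{X}_1'\mathcal{Y}_1$ are then $X_1' W_1 X_1$ and
$X_1' W_1 Y_1$, so that $\hat{B} = (X_1' W_1 X_1)^{-1} (X_1' W_1 Y_1)$.
To get the distribution of our estimator, however, we need to
return to the sampling framework of the entire finite population $R$.
Define $\mathcal{Y}^*$ to equal $\mathcal{Y}$ on $R'$ and zero on
$R - R'$, with $\mathcal{X}^*$ defined similarly. Then
\[ Y^* = \begin{bmatrix} Y_1 \\ 0 \end{bmatrix} \quad \text{and} \quad X^* = \begin{bmatrix} X_1 \\ 0 \end{bmatrix} \]
on $S$. If we fit the model $Y^* = X^* B^*$ on $R$, we get
\[ \hat{B}^* = (X^{*\prime} W X^*)^{-1} X^{*\prime} W Y^*
   = \left[ (X_1' \;\; 0) \begin{bmatrix} W_1 & 0 \\ 0 & W_2 \end{bmatrix} \begin{bmatrix} X_1 \\ 0 \end{bmatrix} \right]^{-1}
     \left[ (X_1' \;\; 0) \begin{bmatrix} W_1 & 0 \\ 0 & W_2 \end{bmatrix} \begin{bmatrix} Y_1 \\ 0 \end{bmatrix} \right]
   = (X_1' W_1 X_1)^{-1} X_1' W_1 Y_1 = \hat{B}, \]
which is the estimator derived above.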
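The equality of the two fits can be seen in a small sketch: fitting on the subdomain sample with the full-sample weights gives the same coefficient as fitting on the whole sample after setting both variables to zero off the subdomain. The single-regressor, no-intercept model, the function name, and the toy rows are assumptions chosen only to keep the algebra scalar.

```python
def weighted_slope(rows):
    """No-intercept weighted least squares: sum(w*x*y) / sum(w*x*x)."""
    num = sum(w * x * y for w, x, y in rows)
    den = sum(w * x * x for w, x, y in rows)
    return num / den


# Hypothetical sample rows: (weight 1/f_i, x, y, in_subdomain).
sample = [
    (2.0, 1.0, 3.0, True),
    (2.0, 2.0, 5.0, False),
    (10.0, 1.5, 4.0, True),
    (10.0, 0.5, 1.0, False),
    (10.0, 3.0, 8.0, True),
]

# Fit on S' only, keeping the full-sample weights f_i^{-1}.
fit_subdomain = weighted_slope([(w, x, y) for w, x, y, d in sample if d])

# Fit on all of S after setting X* = 0 and Y* = 0 off the subdomain.
fit_zeroed = weighted_slope(
    [(w, x if d else 0.0, y if d else 0.0) for w, x, y, d in sample]
)
```

The point estimates coincide; what the zeroed-out formulation adds is the full-sample framework in which the variance is derived.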
Now we consider the estimated variance of $\hat{B} - B$:
\[ \mathrm{var}(\hat{B} - B) = \mathrm{var}(\hat{B}^* - B^*) = (X^{*\prime} W X^*)^{-1} \, G \, (X^{*\prime} W X^*)^{-1} \qquad (5) \]
where
\[ G = \sum_{i=1}^{K} \frac{n_i}{n_i - 1} \frac{1 - f_i}{f_i^2} \,
       X_i^{*\prime} \left( V_i^{*2} - \frac{1}{n_i} V_i^* J_i V_i^* \right) X_i^* , \]
$J_i$ is an $n_i \times n_i$ matrix of units, and
$V_i^* = \mathrm{diag}(Y_i^* - X_i^* \hat{B}^*)$ is the diagonal matrix
formed from $Y_i^* - X_i^* \hat{B}^*$ (see Fuller (1975)).
If the stratum-specific subpopulation sampling proportions $f_i'$
were presumed to equal $f_i$, then the corresponding estimated
variance, denoted $\hat{V}$, would "underestimate"
var $\hat{B}^*$. For large $n_i'$ we have approximately
\[ \mathrm{var}\, \hat{B}^* - \hat{V} = (X^{*\prime} W X^*)^{-1}
   \left[ \sum_{i=1}^{K} \frac{1 - f_i}{f_i^2} \left( \frac{1}{n_i'} - \frac{1}{n_i} \right)
   X_i^{*\prime} V_i^* J_i V_i^* X_i^* \right] (X^{*\prime} W X^*)^{-1} . \]
It is clear that the inaccuracy in using $\hat{V}$ as an estimate of
var $\hat{B}^*$ worsens as the disparity between the subpopulation
sample size and the full sample size increases within each stratum.
To bring the reader to more familiar ground, we can take $B$ as a
scalar parameter $B$, and $X = 1$, and it can then be shown that (5)
simplifies to the estimate in Cochran (1977) of the variance of the
ratio estimator of domain means.
Now consider logistic regression on the finite subpopulation $R'$.
With $X^*$ being 0 off $R'$, we get
\[ L(\beta; X^*) = \prod_{i=1}^{K} \prod_{j \in R_i}
   \left\{ \frac{\exp(X_{ij}^* \beta)}{1 + \exp(X_{ij}^* \beta)} \right\}^{Y_{ij}}
   \left\{ \frac{1}{1 + \exp(X_{ij}^* \beta)} \right\}^{1 - Y_{ij}} \]
\[ = 2^{-(N - N')} \prod_{i=1}^{K} \prod_{j \in R_i'}
   \left\{ \frac{\exp(X_{ij} \beta)}{1 + \exp(X_{ij} \beta)} \right\}^{Y_{ij}}
   \left\{ \frac{1}{1 + \exp(X_{ij} \beta)} \right\}^{1 - Y_{ij}}
   = 2^{-(N - N')} L_1(\beta), \]
where $L_1(\beta)$ is the likelihood on $R'$. We see that the device
of equating $X$ to 0 off $R'$ gives the "right" likelihood, up to a
constant, and the "right" likelihood equation,
\[ U_1 = \sum_{i=1}^{K} \sum_{j=1}^{N_i'} X_{ij}' (Y_{ij} - P_{ij})
       = \sum_{i=1}^{K} \sum_{j=1}^{N_i} X_{ij}^{*\prime} (Y_{ij} - P_{ij}) . \]
The sample estimate of $U_1$ is the same, using $X^*$ (on $R$), or
$X$ (on $R'$) with the estimated weights $f_i^{-1}$:
\[ \hat{U}_1 = \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i'} X_{ij}' (Y_{ij} - P_{ij})
             = \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} X_{ij}^{*\prime} (Y_{ij} - P_{ij}) . \]
Then, from the right-hand expression and Theorem 1,
$n^{-1/2} (\hat{U}_1 - U_1) \to N(0, \Sigma_1)$, where
\[ \Sigma_1 = \sum_{i=1}^{K} \pi_i f^{-1} \frac{g_i}{f_i} \Sigma_{1i} , \]
and
\[ \Sigma_{1i} = \mathrm{var} \{ X^{*\prime} (Y - P) \} \quad \text{on } \Omega_i . \]
Moreover, a consistent estimator for $\Sigma_1$ is:
\[ \hat{\Sigma}_1 = \sum_{i=1}^{K} \frac{N_i}{n} \frac{g_i}{f_i} \hat{\Sigma}_{1i} , \]
where $\hat{\Sigma}_{1i}$ is the usual sample covariance of
$X_i^{*\prime} (Y_i - P_i)$ on the $i$th stratum:
\[ \hat{\Sigma}_{1i} = \frac{1}{n_i - 1} \left[ \sum_{j=1}^{n_i'} X_{ij}' X_{ij} (Y_{ij} - P_{ij})^2
   - \frac{1}{n_i} \left\{ \sum_{j=1}^{n_i'} X_{ij}' (Y_{ij} - P_{ij}) \right\}
     \left\{ \sum_{j=1}^{n_i'} X_{ij} (Y_{ij} - P_{ij}) \right\} \right] . \]
If we assumed the subpopulation sampling proportions to be known, with
$f_i' = f_i$, then the corresponding estimated variance
$\hat{\Sigma}^*$ would be like $\hat{\Sigma}_1$, with every $n_i$ in
the formula replaced by $n_i'$. For $n_i'$ large, we get
$\hat{\Sigma}_1 - \hat{\Sigma}^*$ approximately as
\[ \sum_{i=1}^{K} \frac{g_i}{n f_i^2} \left( \frac{1}{n_i'} - \frac{1}{n_i} \right)
   \left\{ \sum_{j=1}^{n_i'} X_{ij}' (Y_{ij} - P_{ij}) \right\}
   \left\{ \sum_{j=1}^{n_i'} X_{ij} (Y_{ij} - P_{ij}) \right\} . \]
This difference then becomes smaller as $n_i' \to n_i$.
Writing $B^*$ for the parameter to be estimated on the finite
subpopulation, we have from Theorem 2
\[ n^{1/2} (\hat{B}^* - B^*) \to N(0, I_1^{-1} \Sigma_1 I_1^{-1}), \]
where $I_1^{-1} \Sigma_1 I_1^{-1}$ is consistently estimated by
$\hat{I}_1^{-1} \hat{\Sigma}_1 \hat{I}_1^{-1}$ with
\[ \hat{I}_1 = n^{-1} \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i} X_{ij}^{*\prime} X_{ij}^* P_{ij} (1 - P_{ij})
             = n^{-1} \sum_{i=1}^{K} f_i^{-1} \sum_{j=1}^{n_i'} X_{ij}' X_{ij} P_{ij} (1 - P_{ij}) . \]
Hence, the total error in estimating var$(\hat{B}^* - B^*)$ by taking
the subpopulation weights as $f_i^{-1}$ and known comes in the
estimate of $\Sigma_1$.
The procedure discussed above to estimate $B$ on a subpopulation
$R'$ made no use of the value of $Y$ off the subpopulation. Since
$P_{ij} = \tfrac{1}{2}$ off $R'$, it may be of use to set
$Y = \tfrac{1}{2}$ on $R - R'$ so that the residual $Y - P$ is
identically zero off $R'$. This would have particular application for
goodness-of-fit tests which involved the residuals.
One may also want to fit a separate model on each of $R'$ and
$R - R'$. Defining an indicator variable $I$ to be identically zero on
$R'$ and unity off, we could then take
\[ P_{ij} = \frac{\exp \{ X_{ij} \beta_1 + (X_{ij} I_{ij}) \beta_2 \}}
                 {1 + \exp \{ X_{ij} \beta_1 + (X_{ij} I_{ij}) \beta_2 \}} . \]
We assume here that $X$ contains a column of ones, i.e., that the
model has an intercept term, so that $X \cdot I$ includes $I$ as a
variable in the model. Then $\beta_1$ corresponds to the subpopulation
of interest, and $\beta_3 = \beta_1 + \beta_2$ to its complement. The
likelihood of $R$ then breaks up into two factors, one a function of
$\beta_1$ and the other of $\beta_3$, and maximization of the product
is equivalent to maximization of each factor separately, so that we
get the same estimates of $B$ on $R'$ as before. This approach will
often be impractical when it comes to computations, either because of
the increase in the number of variables or because of a lack of data
in $R - R'$.
7. ALGORITHMS FOR ESTIMATION
In order to estimate the coefficients of the logistic regression
model, an iteratively reweighted least squares program was used. As
in Jennrich and Moore (1975), one can show that the "normal" equations
from maximum likelihood estimation of our model are the same as those
for a least squares estimation for a related nonlinear model. Write
\[ f(\beta) = (Y - P(\beta))' \, W_T \, (Y - P(\beta)) \]
for the weighted sums of squares of residuals to be minimized. Taking
partial derivatives of $f$ with respect to $\beta$ we get the normal
equations
\[ 0 = \left( \frac{\partial P}{\partial \beta} \right)' W_T \, (Y - P). \]
Since $\partial P_{ij} / \partial \beta_t = X_{ijt} P_{ij} (1 - P_{ij})$,
if we take $W_T$ as a diagonal matrix with diagonal elements
$f_i^{-1} P_{ij}^{-1} (1 - P_{ij})^{-1}$ we see that the normal
equations (4') result. Of course the weights depend on $\beta$, so
that at each stage of the iteratively weighted least squares routine,
the weights are recalculated using the most recent estimates of the
$\beta$. We have adapted PROC NLIN in SAS (1979) to get our estimates
of $B$, and have written our own software to utilize these estimates
in calculating variances and test statistics.
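The iteration described above can be sketched in a few lines for a model with an intercept and one covariate. This is not the authors' PROC NLIN adaptation; it is a minimal illustration in which each step solves a weighted least squares system with weights recomputed from the current estimates, which for the logistic model coincides with a Newton-Raphson step. The function names, the fixed iteration count, and the toy 2x2 data are assumptions.

```python
import math


def fit_weighted_logistic(rows, n_iter=25):
    """rows: list of (w, x, y) with w = 1/f_i; model logit P = b0 + b1*x.

    Each step solves (X'WVX) delta = X'W(Y - P), V = diag{P(1 - P)}.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for w, x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            v = w * p * (1.0 - p)  # weight recomputed at the current estimate
            u0 += w * (y - p)
            u1 += w * x * (y - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * u0 - h01 * u1) / det
        b1 += (h00 * u1 - h01 * u0) / det
    return b0, b1


def expand(tables):
    """Turn per-stratum 2x2 counts (f, a, b, c, d) into (w, x, y) rows."""
    rows = []
    for f, a, b, c, d in tables:
        w = 1.0 / f
        rows += [(w, 1.0, 1.0)] * a + [(w, 1.0, 0.0)] * b
        rows += [(w, 0.0, 1.0)] * c + [(w, 0.0, 0.0)] * d
    return rows


# Hypothetical strata sampled at 50% and 10%, cells as in the 2x2 example.
toy_tables = [(0.5, 3, 2, 1, 4), (0.1, 2, 3, 2, 3)]
b0_hat, b1_hat = fit_weighted_logistic(expand(toy_tables))
```

For a zero-one covariate the iterates converge to the closed-form weighted estimates given earlier for the 2x2 case, which provides a convenient check.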
8. ILLUSTRATION
Using data from the Lipid Research Clinics Program Prevalence
Study, parameter estimates from a weighted logistic regression model
are compared to those from an analogous model that ignores the
stratified random sampling scheme. The Prevalence Study consisted of
two screening visits. The first, Visit 1, was designed as a complete
screen of ten well-defined target populations; we consider these
Visit 1 subjects as the finite population $R$. The details of the
actual complex sampling strategy are not presented; but, with some
simplification, we take the Visit 2 sample as a stratified random
sample ($S$) of the finite Visit 1 population, where the strata are
defined by race and lipid zone. This latter variable, lipid zone, has
three categories: elevated levels of either serum cholesterol or
triglyceride at Visit 1; borderline elevated levels of cholesterol or
triglyceride; and normal serum cholesterol and triglyceride levels.
The sampling proportions for the first two zones were 100% and 15% for
each race and for the third zone, 25% for whites and 32% for blacks.
The logistic regression framework is used to estimate age-adjusted
race-specific Visit 1 prevalences of Type IIa dyslipoproteinemia
(essentially, elevated serum low density lipoprotein cholesterol
(LDL)), for males aged 6-19 years from one urban clinic. Define the
dependent variable $Y$ to be one for Type IIa subjects and zero
otherwise. Variable $R$ is a race variable with value zero for whites
and one for blacks, whereas $A$ is an age variable, equal to age minus
12. The model of interest is
\[ \Pr(Y = 1 \mid A, R) = \frac{\exp(\beta_0 + \beta_1 A + \beta_2 R)}{1 + \exp(\beta_0 + \beta_1 A + \beta_2 R)} . \]
Parameter estimates ($\hat{B}$) when this model is considered in a
finite simple random sampling scheme as well as estimates obtained
within a finite stratified random sampling scheme are displayed in
Table 1. The parameter estimates and the estimated standard errors for
the simple random sampling scheme have been obtained from PROC LOGIST
in SAS (1979). However, the standard errors that appear in Table 1
have been modified by a finite population correction factor
($\sqrt{1-f}$) to improve comparability.
Inspection of Table 1 reveals a notable difference between the
two estimates of the intercept: the estimate calculated within the
stratified random sampling scheme is much smaller than that calculated
in a simple random sampling framework (-2.95 vs -1.93). The estimated
prevalences of phenotype IIa are shown in Table 2. There are
substantial differences between these two sets of probabilities. Since
LDL is defined as "elevated" if above the age-specific 95th percentile
for the white population, the "weighted" estimates must be the correct
ones. Ignoring the stratified random sampling design in this nonlinear
regression model results in an overestimation of these two
prevalences. Such an error would have serious implications, for
example, in health planning if this proportion were used to calculate
an expected number of individuals in a population requiring specific
health resources. The number of blacks would be overestimated by 90%
and the number of whites by 130%, resulting in overestimation of
required health services.
TABLE 1
Coefficient Estimates for Logistic Regression Models
Ignoring or Accounting for Sampling Design

                 Ignoring Design            Considering Design
Variable       B-hat   Std.Err.(B-hat)    B-hat   Std.Err.(B-hat)
Intercept      -1.93        0.11          -2.95        0.13
Age            -0.08        0.03          -0.05        0.03
Race            0.43        0.19           0.70        0.24
TABLE 2
Estimated Age-Adjusted Race-Specific Prevalences of Type IIa
Dyslipoproteinemia Ignoring or Accounting for Sampling Design

                       Ignoring Design    Considering Design
P(Type IIa | White)*        0.13                0.05
P(Type IIa | Black)*        0.18                0.10

*Calculated at Age = 12.
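The entries of Table 2 can be recovered from the coefficients in Table 1 by evaluating the fitted logistic model at age 12 (so the age term is zero) for each race. The short sketch below does this; the helper names are assumptions, while the coefficients are those reported in Table 1.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients (intercept, age, race) as reported in Table 1.
ignoring = (-1.93, -0.08, 0.43)
design = (-2.95, -0.05, 0.70)


def prevalence(coef, age_minus_12, race):
    b0, b_age, b_race = coef
    return expit(b0 + b_age * age_minus_12 + b_race * race)


white_ignoring = prevalence(ignoring, 0, 0)
white_design = prevalence(design, 0, 0)
black_ignoring = prevalence(ignoring, 0, 1)
black_design = prevalence(design, 0, 1)

# Black-white odds ratios implied by the race coefficients.
odds_ratio_design = math.exp(design[2])
odds_ratio_ignoring = math.exp(ignoring[2])
```

Rounded to two decimals, the four prevalences match the table; the odds ratios computed from the two-decimal coefficients may differ slightly in the last digit from values based on unrounded estimates.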
Suppose the primary hypothesis of interest is a black-white
comparison in the prevalence of phenotype IIa. The two approaches
yield roughly the same conclusion that blacks have an elevated
prevalence of phenotype IIa, but the magnitude of this excess differs.
Accounting for the sampling scheme, we find that juvenile black males
have 2.0 (= exp(0.70)) times the odds of being Type IIa as whites,
while if we fit the model ignoring the sampling scheme we get an odds
ratio of 1.53 (= exp(0.43)).
9. EXTENSION OF LARGE SAMPLE THEORY TO A TWO-STAGE STRATIFIED SAMPLE
In this section the proposed methods are extended to a two-stage
stratified sampling design. Firstly, a choice must be made on what
types of "asymptotics" to consider. We have followed Krewski and Rao
(1981) in focusing "on surveys with large numbers of strata with
relatively few primary sampling units selected within each stratum,"
so that the limiting processes consider the number of strata as
increasing. We begin by presenting the framework for the asymptotic
theory.
Consider the sequence $\{R_K\}$ of finite populations such that $R_K$
has $K$ strata with $N_i$ primary sampling units in the $i$th stratum,
$i = 1, \dots, K$. Suppose $n_i$ primary sampling units (PSU's) are
selected from the $i$th stratum, with single-draw sampling
probabilities $\pi_{ij} > 0$ ($j = 1, \dots, N_i$), where
$\sum_j \pi_{ij} = 1$ for all $i$. The $j$th primary sampling unit of
the $i$th stratum contains $M_{ij}$ subjects, of which a simple random
non-replacement sample of $m_{ij}$ units is drawn. Assume the primary
sampling units are selected with replacement and that independent
samples are selected within those PSU's selected more than once. Note
that in the above and subsequent notation for parameters for the
$K$th population $R_K$, $K$ has consistently been suppressed for
simplicity.
When the finite population in the $j$th PSU of the $i$th stratum of
the finite population $R_K$ is considered as a random sample from an
infinite superpopulation, the likelihood on $R_K$ can be written as
\[ L(\beta) = \prod_{i=1}^{K} \prod_{j=1}^{N_i} \prod_{k=1}^{M_{ij}}
   \left\{ \frac{\exp(X_{ijk}\beta)}{1 + \exp(X_{ijk}\beta)} \right\}^{Y_{ijk}}
   \left\{ \frac{1}{1 + \exp(X_{ijk}\beta)} \right\}^{1 - Y_{ijk}}
   = \prod_{i=1}^{K} \prod_{j=1}^{N_i} \prod_{k=1}^{M_{ij}} P_{ijk}^{Y_{ijk}} (1 - P_{ijk})^{1 - Y_{ijk}} \]
where
\[ P_{ijk} = \frac{\exp(X_{ijk}\beta)}{1 + \exp(X_{ijk}\beta)} . \]
Of particular interest are the likelihood equations:
\[ U_s = \frac{\partial \log L(\beta)}{\partial \beta_s}
       = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \sum_{k=1}^{M_{ij}} X_{ijks} (Y_{ijk} - P_{ijk}) = 0. \]
Let
\[ w_{ij} = M_{ij} (\pi_{ij} m_{ij} n_i)^{-1}, \]
\[ \hat{L}(\beta) = \prod_{i=1}^{K} \prod_{j=1}^{n_i} \prod_{k=1}^{m_{ij}}
   \left\{ P_{ijk}^{Y_{ijk}} (1 - P_{ijk})^{1 - Y_{ijk}} \right\}^{w_{ij}} , \]
and
\[ \hat{U}_s = \frac{\partial \log \hat{L}(\beta)}{\partial \beta_s}
   = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} w_{ij} X_{ijks} (Y_{ijk} - P_{ijk}) . \]
Then $\hat{U}_s$ is an unbiased estimator of $U_s$.
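The role of the weight $w_{ij} = M_{ij}/(\pi_{ij} m_{ij} n_i)$ can be illustrated by enumerating a very small design. The sketch below is a simplified check, not the general argument: it assumes one stratum, a single PSU draw ($n_i = 1$), and complete enumeration within the drawn PSU ($m_{ij} = M_{ij}$), so the expectation is a sum over the possible draws. All names and numbers are hypothetical.

```python
def two_stage_weight(M_ij, pi_ij, m_ij, n_i):
    """w_ij = M_ij / (pi_ij * m_ij * n_i)."""
    return M_ij / (pi_ij * m_ij * n_i)


# Hypothetical stratum with three PSUs; the z-values play the role of
# X_ijk (Y_ijk - P_ijk) for a single coefficient.
psus = [[1.0, 2.0], [0.5, 0.5, 3.0], [4.0]]
pis = [0.2, 0.5, 0.3]  # single-draw selection probabilities, summing to one
true_total = sum(sum(z) for z in psus)

# One PSU is drawn and fully enumerated, so the expected value of the
# weighted sample total is a probability-weighted sum over the draws.
expected_estimate = sum(
    pi * two_stage_weight(len(z), pi, len(z), 1) * sum(z)
    for pi, z in zip(pis, psus)
)

# A weight for a subsampled PSU: 100 subjects, 10 sampled, 2 draws.
example_weight = two_stage_weight(100, 0.25, 10, 2)
```

In this reduced setting the expected weighted total equals the population total; with subsampling inside PSUs the factor $M_{ij}/m_{ij}$ plays the same role at the second stage.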
The solution to the equations $U = 0$ defines the parameter of
interest $B$, and the solution to $\hat{U} = 0$ provides our estimate
$\hat{B}$. Asymptotic results are expressed in the following two
theorems.
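Theorem 4. Under suitable regularity conditions,
$K^{-1/2} (\hat{U} - U) \to N(0, \Gamma)$ as $K \to \infty$, where
$\Gamma$ is defined below.
Proof:
\[ \hat{U} - U = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} w_{ij} X_{ijk}' (Y_{ijk} - P_{ijk})
   - \sum_{i=1}^{K} \sum_{j=1}^{N_i} \sum_{k=1}^{M_{ij}} X_{ijk}' (Y_{ijk} - P_{ijk}) \]
\[ = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} (w_{ij} - 1) X_{ijk}' (Y_{ijk} - P_{ijk})
   - \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=m_{ij}+1}^{M_{ij}} X_{ijk}' (Y_{ijk} - P_{ijk})
   - \sum_{i=1}^{K} \sum_{j=n_i+1}^{N_i} \sum_{k=1}^{M_{ij}} X_{ijk}' (Y_{ijk} - P_{ijk}) \]
\[ = \sum_{i=1}^{K} A_i - \sum_{i=1}^{K} B_i - \sum_{i=1}^{K} C_i , \]
where the definitions of $A_i$, $B_i$, $C_i$ are the obvious ones.
Consider $A_i$ first. Let
\[ Z_{ijk} = X_{ijk}' (Y_{ijk} - P_{ijk}). \]
Then
\[ A_i = \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} Z_{ijk} (w_{ij} - 1) = \sum_{j=1}^{n_i} A_{ij} , \]
and the expected value of $A_i$ given the $n_i$ PSU is
\[ E(A_i \mid \text{PSU}) = \sum_{j=1}^{n_i} (w_{ij} - 1) \sum_{k=1}^{m_{ij}} E(Z_{ijk}) = 0, \]
assuming $E(Z_{ijk}) = 0$ (regularity condition number one). Clearly
$E(A_i) = 0$. Now examine the variance of $A_i$:
\[ \mathrm{var}(A_i) = E(A_i A_i') = E \left( \sum_{j=1}^{n_i} A_{ij} A_{ij}' + 2 \sum_{j} \sum_{j' > j} A_{ij} A_{ij'}' \right). \]
We know $E(A_{ij} A_{ij'}' \mid j \ne j') = 0$, and
\[ E(A_{ij} A_{ij}' \mid \text{PSU}) = (w_{ij} - 1)^2 \, E \left\{ \left( \sum_{k=1}^{m_{ij}} Z_{ijk} \right) \left( \sum_{k=1}^{m_{ij}} Z_{ijk}' \right) \right\}
   = (w_{ij} - 1)^2 \sum_{k=1}^{m_{ij}} E(Z_{ijk} Z_{ijk}') = (w_{ij} - 1)^2 \, m_{ij} \Sigma_{ij} . \]
Averaging over all PSU's produces the desired result:
\[ \mathrm{var}(A_i) = \sum_{j=1}^{N_i} \frac{(M_{ij} - \pi_{ij} m_{ij} n_i)^2}{\pi_{ij} m_{ij} n_i} \Sigma_{ij} . \]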
With suitable conditions on moments of the $A_i$,

$$K^{-1/2} \sum_{i=1}^{K} A_i \to N(0, \Gamma_1),$$

where

$$\Gamma_1 = \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \mathrm{var}(A_i)$$

is assumed to be positive definite.
Now we repeat this process for the $B_i$ and $C_i$.

$$B_i = \sum_{j=1}^{n_i} \sum_{k=m_{ij}+1}^{M_{ij}} Z_{ijk} = \sum_{j=1}^{n_i} B_{ij}.$$

As before, $E(B_i) = 0$, and

$$\mathrm{var}(B_i) = \sum_{j=1}^{N_i} \pi_{ij} n_i E(B_{ij} B_{ij}' \mid j) = \sum_{j=1}^{N_i} \pi_{ij} n_i (M_{ij} - m_{ij}) \Sigma_{ij}.$$

Again, with the right regularity conditions,

$$K^{-1/2} \sum_{i=1}^{K} B_i \to N(0, \Gamma_2),$$

where

$$\Gamma_2 = \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \mathrm{var}(B_i)$$

is assumed positive definite.
Further,

$$K^{-1/2} \sum_{i=1}^{K} C_i \to N(0, \Gamma_3),$$

where

$$\Gamma_3 = \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \mathrm{var}(C_i), \qquad C_i = \sum_{j=n_i+1}^{N_i} \sum_{k=1}^{M_{ij}} Z_{ijk},$$
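and

$$\mathrm{var}(C_i) = \sum_{j=1}^{N_i} \pi_{ij} (N_i - n_i) M_{ij} \Sigma_{ij}.$$

Thus $K^{-1/2} (\hat{U} - U) \to N(0, \Gamma)$, for

$$\Gamma = \sum_{t=1}^{3} \Gamma_t = \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \{ \mathrm{var}(A_i) + \mathrm{var}(B_i) + \mathrm{var}(C_i) \}$$

$$= \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left\{ \frac{(M_{ij} - \pi_{ij} m_{ij} n_i)^2}{n_i \pi_{ij} m_{ij}} + n_i \pi_{ij} (M_{ij} - m_{ij}) + \pi_{ij} (N_i - n_i) M_{ij} \right\} \Sigma_{ij}$$

$$= \mathop{\mathrm{PLIM}}_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \frac{M_{ij} (M_{ij} - 2 \pi_{ij} m_{ij} n_i + n_i m_{ij} \pi_{ij}^2 N_i)}{\pi_{ij} m_{ij} n_i} \Sigma_{ij}.$$

Theorem 5.
Under suitable regularity conditions,

$$K^{1/2} (\hat{B} - B) \to N(0, \Delta^{-1} \Gamma \Delta^{-1}) \quad \text{as } K \to \infty,$$

where $\Delta$ is defined below.

Proof: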
As in Theorem 2,
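Now

$$\frac{\partial U}{\partial \beta} = - \sum_{i=1}^{K} \sum_{j=1}^{N_i} \sum_{k=1}^{M_{ij}} X_{ijk}' X_{ijk} P_{ijk} (1 - P_{ijk}).$$

Write $D_{ijk} = X_{ijk}' X_{ijk} P_{ijk} (1 - P_{ijk})$ and $\nu_{ij} = E(D_{ijk})$.
Then

$$E\Big( \sum_{j=1}^{N_i} \sum_{k=1}^{M_{ij}} D_{ijk} \Big) = \sum_{j=1}^{N_i} M_{ij} \nu_{ij}.$$

So,

$$\mathop{\mathrm{PLIM}}_{K \to \infty} \Big( \frac{1}{K} \frac{\partial U}{\partial \beta} + \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{N_i} M_{ij} \nu_{ij} \Big) = 0$$

under suitable regularity conditions.
We assume that

$$\Delta = \lim_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{N_i} M_{ij} \nu_{ij}$$

is positive definite.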
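Thus,

$$0 = \mathrm{PLIM} \Big\{ K^{-1/2} \hat{U}(B) + K^{-1/2} \Big( \frac{\partial \hat{U}}{\partial \beta} \Big) (\hat{B} - B) \Big\}$$

$$= \mathrm{PLIM} \Big\{ K^{-1/2} \hat{U}(B) + \Big( K^{-1} \frac{\partial \hat{U}}{\partial \beta} \Big) K^{1/2} (\hat{B} - B) \Big\} \qquad \text{(c)}$$

$$= \mathrm{PLIM} \big\{ K^{-1/2} \hat{U}(B) - \Delta K^{1/2} (\hat{B} - B) \big\}.$$

Also,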
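Now

$$\frac{1}{K} \frac{\partial \hat{U}}{\partial \beta} = - \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} w_{ij} D_{ijk},$$

and

$$E\Big( \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} w_{ij} D_{ijk} \Big) = \sum_{j=1}^{N_i} M_{ij} \nu_{ij}.$$

So,

$$\mathop{\mathrm{PLIM}}_{K \to \infty} \Big( \frac{1}{K} \frac{\partial \hat{U}}{\partial \beta} \Big) = - \Delta,$$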
and
(d)
As in Theorem (3), from (c) and (d) we get
hence
For a consistent estimator of $\Delta$, we take

$$\hat{\Delta} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \sum_{k=1}^{m_{ij}} w_{ij} \hat{D}_{ijk},$$

where $\hat{D}_{ijk}$ is $D_{ijk}$ evaluated at the MLE. For $\Gamma$, take

$$\hat{\Gamma} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \frac{M_{ij} (M_{ij} - 2 \pi_{ij} m_{ij} n_i + n_i m_{ij} \pi_{ij}^2 N_i)}{\pi_{ij}^2 m_{ij} n_i^2} \hat{\Sigma}_{ij},$$

for $\hat{\Sigma}_{ij}$ the sample covariance of $Z_{ijk}$ on the $j$th PSU of the $i$th stratum.
The hypothesis tests which follow from these theorems can then
be derived just as in Section 5.
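The quantities entering these estimators can be checked numerically. The sketch below, in Python, is an illustration only: it assumes $D_{ijk} = X_{ijk}' X_{ijk} P_{ijk} (1 - P_{ijk})$, as the derivative of the score suggests, and all function names and numeric inputs are hypothetical. It compares the three-term coefficient of $\Sigma_{ij}$ (from the variances of $A_i$, $B_i$ and $C_i$) with its simplified closed form, and computes a weighted estimate of $\Delta$ from sampled rows.

```python
import numpy as np

def gamma_terms(M, m, n, N, pi):
    """Coefficient of Sigma_ij as the sum of the three variance pieces."""
    a = (M - pi * m * n) ** 2 / (n * pi * m)   # from var(A_i)
    b = n * pi * (M - m)                       # from var(B_i)
    c = pi * (N - n) * M                       # from var(C_i)
    return a + b + c

def gamma_closed_form(M, m, n, N, pi):
    """Simplified coefficient M (M - 2 pi m n + n m pi^2 N) / (pi m n)."""
    return M * (M - 2 * pi * m * n + n * m * pi ** 2 * N) / (pi * m * n)

def delta_hat(X, p_hat, w, K):
    """(1/K) * sum over sampled subjects of w * p (1 - p) * x' x."""
    return (X * (w * p_hat * (1.0 - p_hat))[:, None]).T @ X / K

# Hypothetical design values for a single PSU.
lhs = gamma_terms(M=50.0, m=10.0, n=3.0, N=8.0, pi=0.2)
rhs = gamma_closed_form(M=50.0, m=10.0, n=3.0, N=8.0, pi=0.2)

# Hypothetical two-subject sample with fitted probabilities of 0.5.
X = np.array([[1.0, 0.0], [1.0, 1.0]])
d_hat = delta_hat(X, p_hat=np.array([0.5, 0.5]), w=np.array([2.0, 4.0]), K=2)
```

With these inputs the two coefficient expressions agree, which is a quick sanity check on the algebra rather than a proof; `d_hat` is a symmetric 2 x 2 matrix.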
ACKNOWLEDGEMENT
The authors are grateful to Dawn Stewart for her programming
assistance. We also wish to thank Dave DeLong for helpful discussions
regarding PROC NLIN and Paul Stewart for his contribution to the
subdomain methodology. The skillful typing of this manuscript by
Ernestine Bland is acknowledged.
REFERENCES
Barr, A.J., Goodnight, J.H., and Sall, J.P. (1979). SAS Users Guide,
SAS Institute Inc.: North Carolina.

Binder, D.A. (1981). "On the variance of asymptotically normal estimators
from complex surveys", Statistics Canada, Industrial and
Agricultural Survey Methods Division, Ottawa.

Cochran, W.G. (1977). Sampling Techniques, 3rd edition, John Wiley
and Sons: New York.

DuMouchel, W.H. and Duncan, G.J. (1981). "Using sample survey weights
in multiple regression analysis of stratified samples", American
Statistical Association, 1981 Proceedings of the Section on Survey
Research Methods, 629-637.

Fuller, W.A. (1975). "Regression analysis for sample survey", Sankhya 37,
117-132.

Jennrich, R.I. and Moore, R.H. (1975). "Maximum likelihood estimation
by means of nonlinear least squares", American Statistical Association,
1975 Proceedings of the Statistical Computing Section, 57-65.

Krewski, D. and Rao, J.N.K. (1981). "Inference from stratified samples:
Properties of the linearization, jackknife and balanced repeated
replication methods", The Annals of Statistics 9, 1010-1019.

O'Brien, K.F. (1981). "Life table analysis for complex survey data",
Ph.D. Thesis, Department of Biostatistics, School of Public Health,
University of North Carolina, Chapel Hill, North Carolina.

Rao, C.R. (1973). Linear Statistical Inference and Its Applications,
2nd edition, John Wiley and Sons: New York.

Särndal, C.E. (1978). "Design-based and model-based inference in
survey sampling", Scand. J. Statist. 5, 25-52.