Causal inference in outcome-dependent two

J. R. Statist. Soc. B (2009)
71, Part 5, pp. 947–969
Causal inference in outcome-dependent two-phase
sampling designs
Weiwei Wang,
Princeton University, USA
Daniel Scharfstein,
Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
Zhiqiang Tan
Rutgers University, Piscataway, USA
and Ellen J. MacKenzie
Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
[Received June 2008. Revised March 2009]
Summary. We consider estimation of the causal effect of a treatment on an outcome from
observational data collected in two phases. In the first phase, a simple random sample of individuals is drawn from a population. On these individuals, information is obtained on treatment,
outcome and a few low dimensional covariates. These individuals are then stratified according
to these factors. In the second phase, a random subsample of individuals is drawn from each
stratum, with known stratum-specific selection probabilities. On these individuals, a rich set of
covariates is collected. In this setting, we introduce five estimators: simple inverse weighted;
simple doubly robust; enriched inverse weighted; enriched doubly robust; locally efficient. We
evaluate the finite sample performance of these estimators in a simulation study. We also use
our methodology to estimate the causal effect of trauma care on in-hospital mortality by using
data from the National Study of Cost and Outcomes of Trauma.
Keywords: Doubly robust estimator; Outcome-dependent sampling; Two-phase sampling
1.
Introduction
In some observational studies, it may be relatively inexpensive to measure the outcome Y , the
treatment T and a low dimensional set of covariates S, while the remaining set of covariates W is
expensive to obtain. In such settings, an outcome-dependent two-phase sampling design, which
was first introduced by Neyman (1938) and discussed in Cochran (1963), can significantly reduce
the cost of the study. In the first phase, Y , T and S are collected on all subjects. In the second
phase, the subjects are divided into strata according to .Y , T , S/ and a random subsample,
which is also called a validation sample, is randomly drawn from each stratum. The expensive
covariates W are measured only on the validation sample.
Existing statistical methods for outcome-dependent two-phase sampling studies are primarily
focused on obtaining consistent and efficient estimates for the regression parameters in a
Address for correspondence: Weiwei Wang, Department of Operations Research and Financial Engineering,
215 Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA.
E-mail: [email protected]
© 2009 Royal Statistical Society
1369–7412/09/71947
948
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
population level model for the distribution of Y given T , S and W ; see, for example, Cosslett
(1981, 1983), White (1982), Fears and Brown (1986), Breslow and Cain (1988), Pepe and Fleming (1991), Carroll and Wand (1991), Schill et al. (1993), Scott and Wild (1991, 1997), Robins
et al. (1995), Breslow and Holubkov (1997a, b), Lawless et al. (1999), Breslow (2000), Breslow
et al. (2003), Chatterjee et al. (2003) and Weaver and Zhou (2005).
In contrast with these methods, we are interested in estimating the causal effect of a nonrandomized treatment on the outcome. Specifically, we want to obtain estimators for the marginal
mean of the ‘potential’ outcome that would have been observed if all subjects had received a
specified treatment. Our work is motivated, in part, by the National Study on the Costs and
Outcomes of Trauma (NSCOT) (MacKenzie et al., 2006), in which an outcome-dependent
sample of severely injured patients treated at trauma and non-trauma centres is prospectively
followed for survival and functional status. One of the primary aims is to estimate the causal
effect of trauma centre versus non-trauma-centre care on in-hospital mortality, accounting for
(a) outcome dependent, two-phase sampling and
(b) selection bias due to non-randomized assignment of type of care.
The paper is organized as follows. Section 2 introduces the notation and data structure.
Section 3 introduces five estimators:
(a)
(b)
(c)
(d)
(e)
simple inverse weighted (SIW);
simple doubly robust (SDR);
enriched inverse weighted (EIW);
enriched doubly robust (EDR);
locally efficient (LE).
Section 4 provides the results of a comprehensive simulation study. Section 5 presents an analysis
of the NSCOT data by using our estimators. The last section is devoted to a discussion.
2.
Notation and framework
For an individual, let X = .S , W / denote covariates and let Y be the observed outcome. On all
subjects, .S , Y , T/ is collected whereas W is collected only on the validation sample. Let V be
the validation indicator and T be the treatment indicator. On the basis of the study design, the
probability of being part of the validation sample depends on .S, Y , T/, but not W. We denote
q.S, Y , T/ = P.V = 1|S, Y , T/, which is known by study design. For clarity, we further denote
q1 .S, Y/ = q.S, Y , 1/ and q0 .S, Y/ = q.S, Y , 0/. Let Y1 and Y0 be the potential outcomes under
treatment 1 and 0 respectively. We make the stable unit treatment assumption (Rubin, 1986), so
that
Y = TY1 + .1 − T/Y0 :
The observed data for an individual are O = .S , Y , T , V , V · W / . We are interested in using
n independent and identically distributed copies of O to draw inference about μtÅ = E[Yt ]. To
identify μtÅ from the observed data, we assume that T is independent of .Y0 , Y1 / given X. In other
words, we assume that, conditional on X = x, Y0 and Y1 , treatment assignment is an independent
Bernoulli process with assignment probabilities πtÅ .x/ = P.T = t|X = x/, t = 0, 1. We refer to
this assumption as no unmeasured confounding. We further make the positivity assumption, i.e.
πtÅ .x/ > 0 for all x and t = 0, 1. For convenience, Table 1 summarizes our notation. When it is
important, we shall emphasize the dependence of functions on model parameters; otherwise,
we shall suppress such dependence.
Causal Inference in Sampling Designs
Table 1.
Notation
Symbol
Y
T
Y1
Y0
μ1Å
μ0Å
S
W
X
V
μ1Å .X/
μ0Å .X/
q.S, Y , T/
q1 .S, Y/
q0 .S, Y/
π1Å .X/
π0Å .X/
3.
949
Description
Observed outcome
Treatment (0–1)
Potential outcome under treatment 1
Potential outcome under treatment 0
E[Y1 ]
E[Y0 ]
Covariates observed for all subjects
Covariates observed only on validation sample
.S , W /
Validation indicator
E[Y1 |X]
E[Y0 |X]
P.V = 1|S, Y , T/
q.S, Y , 1/
q.S, Y , 0/
P.T = 1|X/
P.T = 0|X/
Estimation
3.1. Simple inverse-weighted estimation
Under the assumptions in Section 2, it can be shown that the observed data estimating function
VI{T =t }
UtSIW .O; μt , πt / =
.Y − μt /
qt .S, Y/ πt .X/
has mean 0 when evaluated at μtÅ and πtÅ .X/.√Without additional modelling assumptions, this
estimating function cannot be used to draw n-inference about μÅt . This is because we would
need a non-parametric estimator of π Å .X/ converging at a rate faster than n1=4 , which cant
not be achieved when X is high dimensional (Newey, 1994). This has been called the ‘curse of
dimensionality’ (Robins and Ritov, 1997).
To proceed, we can parametrically model πtÅ .X/ = πt .X; αÅ /, where αÅ denotes the true value
of the model parameters α. Without loss of generality, we shall assume that
logit{π1 .X; αÅ /} = l.X; αÅ /
where l.X, α/ is a specified function of X and α, which is differentiable in α. This model implies
that
exp{t l.X; αÅ /}
πt .X; αÅ / =
:
1 + exp{l.X; αÅ /}
Under this model, αÅ can be estimated as the solution α̂ to the score function of the logistic
regression T given X where each individual receives weight V=qT .S, Y/. Letting
@l.X; α/
Sα .T , X; α/ =
{T − π1 .X; α/},
@α
then α̂ is the solution to
V
En
Sα .T , X; α/ = 0,
qT .S, Y/
where En [·] is the empirical expectation operator.
950
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
In terms of estimating μÅt , we note that
UtSIW .O; μt , α/ =
VI{T =t }
qt .S, Y/πt .X; α/
.Y − μt /
has mean 0 when evaluated at μtÅ and αÅ . This suggests that we can estimate μtÅ as the solution
En [UtSIW .O; μt , α̂/] = 0. The resulting estimator is then
VI{T =t }
VI{T =t }
=
E
Y
E
μ̂SIW
:
n
n
t
qt .S, Y/ πt .X; α̂/
qt .S, Y/ πt .X; α̂/
The advantage of the SIW estimator is that it is simple to compute and only requires modelling
of πtÅ .X/. The disadvantage of this estimator is twofold. First, it does not take full advantage
of the data that are recorded on subjects who are not in the validation sample. It only uses
information on .S, Y , T/ through the construction of the sampling weights. Second, the SIW
estimator is inefficient, even if we ignore data on those who are not validated.
3.2. Regular and asymptotically linear estimators
Under correct specification of the logistic regression model for πtÅ .X/ that was discussed in
the previous subsection and additional smoothness conditions, μ̂SIW
can be shown to be a regular
t
and asymptotically linear (RAL) estimator of μtÅ . To review, an estimator μ̂t is said to be asymptotically linear if there is a random variable IFt .O/ such that
n
√
n.μ̂t − μtÅ / = n−1=2
IFt .Oi / + op .1/
i=1
with E[IFt .O/] = 0 and 0 < E[IFt .O/2 ] < ∞. The function IFt .O/ is referred to as an influence
function (Hampel, 1974). By the central limit theorem, it then follows that
√
D
n.μ̂ − μÅ / → N{0, E [IF .O/2 ]}:
t
t
θ0
t
An estimator is said to be regular if its asymptotic distribution does not change in a shrinking
neighbourhood of the true data-generating distribution (van der Vaart, 1998). An asymptotically
linear estimator of μtÅ will be regular if and only if its influence function is orthogonal to the
model’s nuisance tangent space and its inner product with the score for μÅt is equal to 1. In a
semiparametric model, like ours, the nuisance tangent space is defined as the mean-square
closure of the nuisance tangent spaces of all parametric submodels, where a parametric submodel
is a model that
(a) is indexed by a finite dimensional parameter vector and
(b) contains the true data-generating distribution,
and where the nuisance tangent space of the parametric submodel is all linear combinations of
the submodel’s nuisance score vector.
As discussed in Tsiatis (2006), elements of the orthogonal complement of the nuisance tangent space can be used to construct estimating functions, which can in turn be used to generate
RAL estimators of μtÅ . In general, the influence functions of the resulting estimators will be
normalized versions of the elements that are selected from this space.
The influence function for μ̂SIW
can be shown to be
t
SIW
@Ut .O; μt ,αÅ /
V
@Sα .T , X; αÅ / −1
SIW
SIW
Å
Å
Å
Å
IFt .O; μt , α / = Ut .O; μt , α / − E
E
@α
qT .S,Y/
@α
V
×
Sα .T ,X; αÅ /:
qT .S, Y/
Causal Inference in Sampling Designs
951
This implies that the asymptotic variance of μ̂SIW
is equal to E[IFtSIW .O; μÅt , αÅ /2 ]. A natural
t
SIW
SIW .O; μt , α/ is the
estimator for the asymptotic variance is En [IFt .O; μ̂SIW
, α̂/2 ], where IF
t
t
SIW
same as IFt .O; μt , α/ except that the expectations are replaced with empirical averages.
3.3. Representation of influence functions
Using the semiparametric theory of inference in coarsened data problems (Bickel et al., 1993;
Robins et al., 1994; Scharfstein et al., 1999; van der Laan and Robins, 2003; Tsiatis, 2006), we
derive the class of all influence functions for RAL estimators of μÅt under correct specification
of the logistic regression model for πtÅ .X/. In Appendix A, we show that the class of all influence
functions consists of elements of the form
1
IF {O; μÅ , αÅ , μÅ .X/, g , g , b} = φ {O; μÅ , αÅ , μÅ .X/} +
φ .O; g / + φ .O; αÅ , b/ .1/
t
t
t
1
0
t
1
t
τ
2,τ
τ =0
3
where
φ1 {O; μt , α, μt .X/} =
VI{T =t }
V
T − π1 .X; α/
.Yt − μt / + .−1/t
{μt .X/ − μt },
qT .S, YT / πt .X; α/
V
φ2,τ .O; gτ / = I{T =τ } 1 −
gτ .S, Yτ /,
qτ .S, Yτ /
V
T − π1 .X; α/
φ3 .O; α, b/ =
b.X/,
qT .S, YT / R.X; α/
qt .S, Yt / πt .X; α/
g1 and g0 are arbitrary functions of their arguments, R.X; α/ = π1 .X; α/ π0 .X; α/, b.X/ ∈ T ⊥
and
@l.X; αÅ /
T = k
: k is an arbitrary constant vector :
@α
The influence function for the SIW estimator is a member of this class with g1 .S, Y1 / = 0,
g0 .S, Y0 / = 0 and b.X/ equal to
@l.X; αÅ /
SIW
t+1
Å
Å
Å
Å
Å
Å
b
.X/ = .−1/ π1−t .X; α /{μt .X/ − μt } − R.X; α / E π1−t .X; α /.Yt − μt /
@α
−1
Å
Å
Å
@l.X; α /
@l.X; α / @l.X; α /
:
× E R.X, αÅ /
@α
@α
@α
Letting
Λgτ = {φ2,τ .O; gτ / : gτ is arbitrary},
Λb = {φ3 .O; αÅ , b/ : b.X/ ∈ T ⊥ },
Λg = Λg0 ⊕ Λg1 and Λ = Λg + Λb , the class of all influence functions can be written as a linear
variety of form
{IF {O; μÅ , αÅ , μÅ .X/, g , g , b} : g , g , b} = φ {O; μÅ , αÅ , μÅ .X/} + Λ:
t
t
t
1
0
1
0
1
t
t
Λg0 and Λg1 are orthogonal, whereas Λg and Λb are not.
3.4. Locally efficient estimation
The variance of an influence function in the above class depends on the choice of g1 , g0 and
b. The influence function with the smallest variance (i.e the most efficient) is the projection
Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λ⊥ ], where Π[·|·] is the projection operator (Fig. 1).
952
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
f1 + Λ
IFt
Λ
f1
P [f1 | L^]
Fig. 1.
Optimal influence function
The required projection does not have a closed form solution; however, we can use the alternating projection theorem (Bickel et al., 1993) to approximate computationally the solution.
We reproduce a version of the theorem here.
Theorem 1 (alternating projection theorem). Let SA and SB be two non-orthogonal spaces.
Let h denote a vector. Let Q[·], QA [·] and QB [·] be the projection operator to the spaces
⊥ and S ⊥ respectively. Let Q.1/ [h] = Q [Q [h]] and Q.j/ [h] = Q.1/ [Q.j−1/ [h]],
.SA + SB /⊥ , SA
B
A
B
j > 1. Then,
lim {Q.j/ [h]} = Q[h]:
j→∞
In our context SA = Λg , SB = Λb and h.O/ = φ1 {O; μÅt , αÅ , μtÅ .X/}. Fig. 2 illustrates the alternating projection theorem in our context. The key then is to derive closed form expressions for
the projection operators QA and QB . To do so, we need to know how to project a function of
the observed data h.O/ onto Λg and onto Λb . It can be shown that
1
E[h.O/I{T =τ } {1 − V=qτ .S, Yτ /}|S, Yτ ]
V
I{T =τ } 1 −
Π{h.O/|Λg } =
.2/
q
.S,
Y
/
E[I{T =τ } {1 − V=qτ .S, Yτ /}2 |S, Yτ ]
τ
τ
τ =0
and
Π{h.O/|Λb } =
where
Å V
T − π1 .X; αÅ /
−1
@l.X; α /
C
.X/
−
k
A.X/
,
h
h
qT .S, YT / R.X; αÅ /
@α
Ch .X/ = E h.O/
V
T − π1 .X; αÅ / X,
qT .S, YT / R.X; αÅ / A.X/ = π1 .X; αÅ /−1 E[q1 .S, Y1 /−1 |X] + π0 .X; αÅ /−1E[q0 .S, Y0 /−1 |X]
and
Å
Å −1
@l.X; αÅ / −1
−1 @l.X; α / @l.X; α /
E A.X/
A.X/ Ch .X/
:
@α
@α
@α
kh = E
.3/
Causal Inference in Sampling Designs
953
Λ⊥
b
f1
Λ⊥
g
Λ⊥
g
Fig. 2.
Λ⊥b
Alternating projection theorem
To find the projection, Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λ⊥ ], we use the following algorithm. Let
denote the result of the jth iteration.
Step 1: compute
.1/
ht .O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ]:
.4/
.j/
ht .O/
Step 2: compute
.2/
ht .O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ] − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λb ]
.5/
+ Π.Π[φ {O; μÅ , αÅ , μÅ .X/}|Λ ]|Λ /:
t
1
g
t
b
Step 3: on iteration 2k + 1 (k 1), compute
.2k+1/
ht
.2k/
.O/ = ht
.2k/
= ht
.2k/
.O/ − Π{ht
.O/|Λg }
.2k−1/
.O/ + Π[Π{ht
.O/|Λb }|Λg ]:
.6/
.O/ = ht
.O/ − Π{ht
.O/|Λb }
.2k+1/
.2k/
= ht
.O/ + Π[Π{ht .O/|Λg }|Λb ]:
.7/
Step 4: on iteration 2k + 2 (k 1), compute
.2k+2/
ht
.2k+1/
.2k+1/
Equations (6) and (7) show that, for k 3, the variance reduction that is achieved on iteration
k is always smaller than on iteration k − 1. The algorithm stops when the variance reduction is
smaller than some specified tolerance.
Although the general projection formulae are given above, the following special cases are
used in the algorithm. When h.O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} (as in equations (4) and (5)), then
Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ] takes the form in equation (2) where the ratio of the conditional
expectations is equal to
954
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
I{τ =t } .Yt − μtÅ / + .−1/t E[{πτ .X; αÅ /=πt .X; αÅ /}{τ − π1 .X; αÅ /}{μÅt .X/ − μtÅ }|S, Yτ ]
E[πτ .X; αÅ /|S, Yτ ]
Å
Å
Å
and Π[φ1 {O; μt , α , μt .X/}|Λb ] takes the form in equation (3) with
.−1/t+1
Yt − μtÅ Cφ1 .X/ =
E
X + .−1/t A.X/ π1−t .X; αÅ /{μÅt .X/ − μtÅ }:
πt .X; αÅ / qt .S, Yt / When h.O/ = φ .O; αÅ , b† / ∈ Λ (as in equation (6)), then Π{φ .O; αÅ , b† /|Λ } takes the form
−
b
3
3
in equation (2) with the ratio of the conditional expectations for τ equal to
.−1/τ E[π .X; αÅ /|S, Y ]−1 E[b† .X/|S, Y ]:
τ
τ
g
τ
†
†
When h.O/ = Σ1τ =0 φ2,τ .O; gτ / ∈ Λg (as in equations (5) and (7)), then Π{Σ1τ =0 φ2,τ .O; gτ /|Λb }
takes the form in equation (3) with
1
1
τ +1
†
X :
.−1/ E gτ .S, Yτ / 1 −
CΣ1 φ2,τ .X/ =
τ =0
q .S, Y / τ
τ =0
τ
.j/ .j/
IFt .O; μÅt , αÅ , g1 , g0 , b.j/ /
.j/
ht .O/
.j/
.j/
On each iteration,
can be expressed as
for g1 , g0 and
b.j/ determined via the above projection formulae. Thus, for j Å large, the solution to
Å
.j Å / .j Å /
E [IF {O; μ , α̂, μÅ .X/, g , g , b.j / }] = 0
n
t
t
t
1
0
will be an efficient estimator of μtÅ whose influence function approximates Π[φ1 {O; μÅt , αÅ ,
Å
Å
Å
μtÅ .X/}|Λ⊥ ]. This estimator, however, depends on μtÅ .X/ and .g1.j / , g0.j / , b.j / /, which as seen
in the preceding displays depends on conditional expectations of the form E[u.X, Yt /|X] and
E[v.X/|S, Yt ] and marginal expectations of the form E[u.X, Yt /] and E[v.X/] where u.·/ and v.·/
are functions of .X, Yt / and X respectively.
To proceed, we shall need to utilize a working model for μtÅ .X/. Regardless of whether this
working model is correctly specified and how many iterations are used to compute the projection,
the resulting estimators will be consistent, since it can be shown that E[IFt {O; μÅt , αÅ , f.X/,
g1 , g0 , b}] = 0, whatever the choice of f , g1 , g0 and b. However, if this working model is correctly
specified, its parameters can be estimated at n1=4+" -rates and the number of iterations in the
projection is large, the resulting estimator will be efficient, achieving the semiparametric variance
bound. As a result, our estimator is said to be ‘LE’. Next, we discuss the implementation of the
LE estimator for a binary outcome and a continuous outcome separately.
3.4.1. Binary outcomes
For a binary outcome, we can posit a working logistic regression model for μtÅ .X/ = μt .X; η Å /,
where
logit{μ .X; η Å /} = j.t, X; η Å /
t
and j.t, X; η/ is a specified function of t, X and η. To estimate η Å , we note that
μ .X; η Å / = E[Y |X, T = t] = E[Y |X, T = t],
t
t
where the first equality follows by the assumption of no unmeasured confounding and the second
by the stable unit treatment assumption. Thus, η Å can be estimated as the solution η̂ to the
weighted Yt given X score equations where each individual receives weight VI{T =t } =qT .S, YT /.
Let
@j.t, X; η/
Stη .Y , X; η/ =
{Y − μt .X; η/};
@η
then η̂ is the solution to
Causal Inference in Sampling Designs
En
VI{T =t }
qT .S, YT /
955
Stη .Y , X; η/ = 0:
The projections above require estimation of conditional and marginal expectations of the
form E[u.X, Yt /|X], E[v.X/|S, Yt ], E[u.X, Yt /] and E[v.X/]. We can estimate the conditional
expectation E[u.X, Yt /|X] by u.X, 1/ μt .X; η̂/ + u.X, 0/{1 − μt .X; η̂/}, where
μt .X; η/ =
exp{j.t, X; η/}
:
1 + exp{j.t, X; η/}
This estimator presumes, as we do, that u.x, 1/ and u.x, 0/ are well defined for all x. If S
contains only categorical variables, the conditional expectation E[v.X/|S, Yt ] can be estimated by
weighted, saturated regression with individual weights VI{T =t } ={qt .S,Yt / πt .X; α̂/}. If S contains
continuous variables, then the bounded conditional expectations such as E[πτ .X; αÅ /|S, Yτ ]
need careful modelling. Alternatively, one may choose to categorize continuous variables in S.
This may cause some loss of efficiency but can avoid instability and model incompatibility due
to misspecification of the models for, for example, E[πτ .X; αÅ /|S, Yτ ]. The marginal expectations E[u.X, Yt /] and E[v.X/] can be estimated by weighted averages with individual weights
VI{T =t } ={qt .S,Yt / πt .X; α̂/} and V=qT .S,YT / respectively.
Å
Our resulting estimator will be the solution μ̂t.j / to
.j Å /
.j Å /
Å
En [IFt {O; μt , α̂, μt .X; η̂/, ĝ1 , ĝ0 , b̂.j / }] = 0
.j Å /
.j Å /
.j Å /
are obtained by replacing conditional and marginal expectations by
where ĝ 1 , ĝ 0 and b̂
their estimates described above,
and j Å is an iteration number in which the iterative algorithm
.j Å /
LE
stops. We define μ̂t = μ̂t . When the working model for
μÅt .X/ is correctly specified, the influÅ/
.j Å / .j Å / .j Å /
.j
Å
Å
Å
ence function of μ̂t will be IFt {O; μt , α , μt .X; η /, g1 , g0 , b }. TheÅ asymptotic variance
.j Å / .j / .j Å / }2 ].
of this estimator can be estimated by En [IFt {O; μ̂LE
t , α̂, μt .X; η̂/, ĝ 1 , ĝ 0 , b̂
Å
If, however, the working model for μt .X/ is incorrectly specified, then η̂ converges to η̃ = η Å .
Consequently, IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b} is not an influence function because it is no
longer orthogonal to the nuisance tangent space. To obtain the influence function, we must
subtract its projection onto the space that is spanned by the score functions for α. The resulting
influence function is
Å
.j Å / .j Å / .j Å /
.j Å / .j Å /
Å Å
IFR
/ = IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b.j / }
t .O; μt , α , η̃, g1 , g0 , b
Å
.j Å / .j Å /
@IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b.j / }
−E
@α
V
@Sα .T , X; α/ −1 V
×E
Sα .T , X; α/:
qT .S, Y/
@α
qT .S, Y/
.8/
It is important to note that the second term in this influence function
willÅ be 0 if η̃ = η Å .
Å/
.j
LE
LE
R
.O; μ̂ , α̂, η̂, ĝ , ĝ .j / , b̂.j Å / /2 ], where
We estimate the asymptotic
variance of μ̂ by En [IF
t
t
1
0
Å/
Å/
Å/
.j
.j
Å
Å
R
.j
R
.O; μ , α , η̃, g , g , b / is the same as IF .O; μÅ , αÅ , η̃, g .j Å / , g.j Å / , b.j Å / / except that
IF
t
t
t
t
1
0
1
0
the expectations are replaced by empirical averages. The asymptotic variance that is calculated
using equation (8) and its associated estimator will be referred to as the ‘robust variance’ since
it is robust to misspecification of the working model for μtÅ .X/.
3.4.2 Continuous outcomes
For a continuous outcome, we can specify a working model for μÅt .X/ in one of two ways.
956
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
First, we can specify a fully parametric model for the conditional distribution of Yt given X and
estimate the parameters η Å by using weighted score equations as above. Second, we can specify a
(semiparametric) conditional mean model for Yt given X and estimate the parameters η Å by using
weighted estimating equations. One must also estimate conditional expectations of the form
E[u.X, Yt /|X]. If the first approach of specifying the model for μÅt .X/ is adopted, then we can
estimate these expectations by using the estimated conditional distribution of Yt given X. Since
the second approach leaves the distribution form of Yt given X unspecified, we must introduce
additional modelling assumptions here. Specifically, we can model the conditional expectations
directly by specifying a regression model for u.X, Yt / given X and estimate the parameters by
using additional weighted estimating equations. As the conditional expectations of this form
must be repeatedly estimated, this approach can lead to issues of model incompatibility. The
first approach does not suffer from this issue. If, from either approach, the resulting design
matrix for the regression is the same as that for the model of T given X , the right-hand side of
equation (3) will be equal to 0 and the LE estimator will reduce to the EDR estimator (see the
next section).
For a continuous outcome, smoothing will be required to estimate E[v.X/|S, Yt ]. As discussed
in binary outcomes, one may choose to categorize S and Y to avoid instability and model incompatibility.
As in binary outcomes, if the working model for μtÅ .X/ is misspecified, the robust variance
should be calculated by using equation (8). Misspecification of E[u.X, Yt /|X] or E[v.X/|S, Yt ]
will not affect the consistency of the resulting estimator: just the efficiency.
3.5. Doubly robust estimations
In this section, we consider the subclass of influence functions with b.X/ = 0, i.e. φ1 {O; μÅt , αÅ ,
μtÅ .X/} + Λg . This class is generated by the model in which no restrictions are placed on the
distribution of πtÅ .X/. Interestingly, this class of influence functions has the following properties:
(a) E[IFt {O; μÅt , α̃, μt .X; η Å /, g1 , g0 , 0}] = 0 whatever α̃ is;
(b) E[IF {O; μÅ , αÅ , μ .X; η̃/, g , g , 0}] = 0 whatever η̃ is.
t
t
t
1
0
As a result, members of this subclass are said to be ‘doubly robust’ in the sense that the resulting
estimators will be consistent and asymptotically normal when either the model for πtÅ .X/
or the model for μtÅ .X/ is correctly specified. In our setting, there are an infinite number of
doubly robust estimators, depending on the selection of the functions g1 and g0 . The doubly
robust estimator with the smallest variance will have influence function equal to the projection of φ1 {O; μÅt , αÅ , μt .X; η Å /} onto Λ⊥
g . Using previous results, this projection is equal to
.1/
ht {O; μÅt , αÅ , μt .X; η Å /} in equation (4), which is the first iteration in the alternating projection algorithm. The previous subsection has provided a detailed illustration of the computation
.1/
of ht {O; μÅt , αÅ , μt .X; η Å /}. It only involves conditional expectations of the form E[v.X/|S, Yt ]
and is computationally easier than the LE estimator. Although misspecification of E[v.X/|S, Yt ]
may result in a loss of efficiency, the estimator of μtÅ will still remain doubly robust. We denote
the most efficient doubly robust estimator as the EDR estimator, μ̂EDR
, with the word enriched
t
to emphasize that it uses information from the non-validation sample. The LE estimator is
not doubly robust, in contrast with many other cases discussed in van der Laan and Robins
(2003).
When either the model for πtÅ .X/ or the model for μÅt .X/ is incorrectly specified, the influence
function for the resulting estimator can be shown to be
Causal Inference in Sampling Designs
.1/
IFEDR
.O; μÅt , α̃, η̃/ = ht {O; μÅt , α̃, μt .X; η̃/}
t
.1/
@ht {O; μÅt , α̃, μt .X; η̃/}
−E
@α
E
V
@Sα .T , X; α̃/
qT .S, Y/
@α
.1/
@h {O; μÅ , α̃, μ .X; η̃/}
V
t
t
t
Sα .T , X; α̃/ − E
qT .S, Y/
@η
VI{T =t } @Stη .Y , X; η̃/ −1 VI{T =t }
Stη .Y , X; η̃/ ,
×E
qT .S, Y/
@η
qT .S, Y/
×
957
−1
.9/
where either α̃ = αÅ (the treatment regression model is correct) or η̃ = η Å (the outcome regression
model is correct). In equation (9), the second term vanishes if η̃ = η Å (the outcome regression
model is correct) and the third term vanishes if α̃ = αÅ (the treatment regression model is
EDR .O; μ̂EDR , α̂, η̂/2 ], where
correct). We estimate the asymptotic variance of μ̂EDR by En [IF
t
t
EDR
Å
Å
EDR
.O; μt , α̃, η̃/ is the same as IFt
.O; μt , α̃, η̃/ except that the expectations are replaced
IFt
by empirical averages. The asymptotic variance that is calculated by using equation (9) and
its associated estimator will be referred to as the ‘doubly robust variance’ since it is robust to
misspecification of the model for μtÅ .X/ or the model for πtÅ .X/.
Another special choice of g1 and g0 is to set both equal to 0. The corresponding estimator does
not use information on the non-validation sample and we refer to it as the SDR estimator. The
.1/
robust variance for the SDR estimator takes the same form as equation (9) with ht replaced by
φ1 . In the extreme case that every subject is selected to the validation sample (e.g. the sampling
probability is 1), the EDR and LE estimators both become the same as the SDR estimator.
3.6. Enriched inverse-weighted estimator
When compared with the SIW estimator, the LE estimator will always have a smaller variance
but it requires a working model for μÅt .X/ and is computationally more intensive. The simple or
EDR estimators are computationally easier and have the nice property of double robustness, but
they are not guaranteed to have smaller variances than the SIW estimator and a working model
for μtÅ .X/ is also required. This motivates our search for a computationally feasible estimator,
which always has a smaller variance than the SIW estimator. Here, we restrict attention to the
influence functions that are elements of the linear variety: IFSIW
.O; μÅt , αÅ / + Λg . Among this
t
.O; μÅt , αÅ / onto
set of influence functions, the most efficient is equal to the projection of IFSIW
t
⊥
Λg . We define
.O; μÅt , αÅ / = IFSIW
.O; μÅt , αÅ / − Π[IFSIW
.O; μÅt , αÅ /|Λg ]:
IFEIW
t
t
t
Equation (2) shows how to project a general function of the observed data onto Λg . When h.O/ =
IFSIW
.O; μÅt , αÅ /, the ratio of the conditional expectations for τ in equation (2) takes the form
t
−
I{τ =t } .Yt − μtÅ /
−.−1/τ E[@UtSIW .O; μÅt , αÅ /=@α] E[{V=qT .S, Y/}@Sα .T , X; αÅ /=@α]−1
E[πτ .X; αÅ /|S, Yτ ]
× E[{@l.X; αÅ /=@α}R.X; α/|S, Yτ ] E[πτ .X; αÅ /|S, Yτ ]−1 :
The EIW estimator does not require a model for μtÅ .X/. It does, however, require estimation of
conditional expectations of the form E[v.X/|S, Yt ] (see Sections 3.4.1 and 3.4.2 for estimation
strategies; model misspecification may result in loss of efficiency, in which case the EIW estimator
958
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
may not always have a smaller variance than the SIW estimator). This EIW estimator is not as
efficient as the LE estimator and is not doubly robust. The asymptotic variance of this estimator
EIW .O; μ̂EIW , α̂/2 ] where IF
EIW .O; μ̂EIW , α̂/ is the same as IFEIW .O;
can be estimated by En [IF
t
t
t
t
t
EIW
μ̂t , α̂/ with expectations replaced by empirical averages.
3.7. Summarizing remarks
We summarize the relationships between the five estimators that were presented above. By
the representation in equation (1), we obtain the SIW estimator by setting g1 = 0, g0 = 0 and
b = bSIW . Fixing b = bSIW and optimizing efficiency over .g1 , g0 /, we obtain the EIW estimator.
With g1 , g0 and b all equal to 0, we obtain the SDR estimator. With b = 0 and optimizing efficiency over .g1 , g0 /, we obtain the EDR estimator. Optimizing the efficiency over functions
.g1 , g0 , b/, we obtain the LE estimator.
4.
Simulation studies
Sections 4.1 and 4.2 present simulation studies for binary and continuous outcomes respectively.
4.1. Binary outcomes
We assess the finite sample performance of the five proposed estimation procedures for μ1 and
μ0 . In Section 4.1.1, we evaluate the performance when we correctly specify regression models
for the outcome and for treatment. In Section 4.1.2, we evaluate the performance under various
types of model misspecification. In our simulations, we only implemented, because of computational complexity, the first two iterations in the computation of the LE estimator (i.e. j Å = 2).
We present simulation results for two sample sizes: 800 and 2000. For each simulation, 2000
data sets were generated and summarized.
4.1.1. Efficiency
In this simulation, S = .S1 , S2 /, S1 ∼ Bernoulli.0:5/, S2 ∼ Bernoulli.0:3/ and W = .W1 , W2 , W3 / ∼
N.0, I/. The five covariates were independent. Further, Y1 and Y0 given S and W were assumed
to be independent Bernoulli distributed with probabilities of success equal to expit.S1 + 0:3S2 +
W1 + 0:2W2 + 0:1W3 / and expit.1 + 1:5S1 + 0:5S2 + 1:5W1 + 0:2W2 + 0:2W3 / respectively. The
distribution of T given X, Y1 and Y0 was assumed to be Bernoulli{π1Å .X/}, where π1Å .X/ =
expit.S1 + 0:2S2 + 0:8W1 + 0:05W2 + 0:05W3 /. Finally, V given T , X, Y1 and Y0 was assumed to
be Bernoulli{qT .S, Y/}, where qT .S, Y/ was selected so that around 25% of the total sample is
selected and the number of individuals selected from each of the 16 strata (defined by T , Y , S1
and S2 ) was set equal to the minimum of the stratum size and the total selected divided by 16.
The simulation was designed to have strong selection bias. The true values of μ1 and μ0 were
determined from a simulation with sample size equal to 108 (μ1Å = 0:6145 and μÅ0 = 0:7824/.
From the same simulation, the true values of E[Y |T = 1] and E[Y |T = 0] were found to be 0.6828
and 0.6755 respectively.
In the top section of Table 2, we summarize the results of the simulation. Table 2 shows
that, for each sample size, all estimators are unbiased (see the third and seventh columns). The
Monte Carlo variance (the fourth and eighth columns) agrees very well with the average of
the estimated variances (the fifth and ninth columns) for sample size 2000. For this sample size,
the coverage of the estimated 95% confidence intervals is close to their nominal level. For sample
size 800, the variance estimator appears to be slightly biased low and the estimated confidence
Causal Inference in Sampling Designs
959
Table 2.
Finite sample properties, efficiency and robustness comparisons for a binary outcome
Sample
size
Estimator
μ̂1
var
×104
var
×104
95% coverage
probability
μ̂0
var
×104
var
×104
95% coverage
probability
800
SIW
EIW
SDR
EDR
LE
SIW
EIW
SDR
EDR
LE
0.612
0.613
0.612
0.612
0.612
0.613
0.614
0.613
0.613
0.613
28.3
10.2
26.7
9.3
9.0
11.3
3.8
10.7
3.3
3.3
27.0
9.6
25.5
8.4
8.2
11.0
3.8
10.5
3.3
3.2
93.1
93.0
93.5
92.5
92.0
94.2
94.0
94.8
94.3
94.1
0.777
0.782
0.777
0.781
0.782
0.781
0.783
0.781
0.782
0.783
16.1
7.2
15.5
6.7
6.4
5.8
2.6
5.7
2.4
2.3
15.2
7.4
14.6
6.7
6.3
5.8
2.7
5.6
2.4
2.3
93.6
94.5
93.7
94.3
94.0
94.7
94.7
94.2
94.4
94.2
μtÅ (X) is misspecified
800
SIW
EIW
SDR
EDR
LE
2000
SIW
EIW
SDR
EDR
LE
0.608
0.611
0.609
0.610
0.610
0.613
0.613
0.613
0.613
0.613
29.5
11.0
29.5
13.5
14.0
11.4
3.9
11.4
4.4
4.4
27.5
10.0
27.6
11.8
11.9
11.0
3.8
10.9
4.1
4.2
93.5
92.9
93.7
94.3
93.7
94.5
94.2
94.5
94.2
93.7
0.779
0.783
0.776
0.787
0.789
0.781
0.782
0.780
0.783
0.784
15.9
7.2
17.5
15.6
14.5
5.8
2.5
6.0
3.9
3.7
15.2
7.4
17.6
13.0
11.8
5.8
2.7
6.2
3.9
3.7
93.8
93.8
94.3
94.3
93.5
95.0
95.1
95.0
95.8
95.1
π1Å (X) is misspecified
800
SIW
EIW
SDR
EDR
LE
2000
SIW
EIW
SDR
EDR
LE
0.665
0.668
0.611
0.613
0.647
0.666
0.668
0.612
0.614
0.648
23.6
5.1
27.0
8.4
5.8
9.4
1.8
10.4
2.9
2.0
23.0
5.3
25.4
8.0
5.4
9.2
2.0
10.3
3.1
2.0
78.6
35.6
93.8
93.5
69.9
60.8
3.0
94.3
95.4
34.9
0.699
0.705
0.779
0.783
0.735
0.701
0.704
0.780
0.782
0.733
21.5
7.8
15.1
7.5
6.8
9.1
2.9
6.0
2.6
2.5
22.1
8.1
14.8
7.0
7.5
8.6
2.9
5.8
2.5
2.8
60.1
20.0
94.3
93.4
60.7
20.5
0.4
94.0
95.0
14.2
2000
intervals tend to undercover slightly. In terms of efficiency, the SIW and SDR estimators are very
inefficient. The other three estimators perform comparably, among which the EIW estimator
has the biggest variance and the LE estimator has the smallest variance.
Another interesting question that is addressed in our simulation is how much efficiency was
lost because of not collecting W on the non-validation sample. As mentioned in Section 3.5,
if W is collected on the entire sample (i.e. the sampling probability is 1), the LE and EDR
estimators both reduce to the SDR estimator. In other words, the SDR estimator achieves full
efficiency. We estimated the SDR estimator’s performance at sample size 2000 when both the
models for μÅt .X/ and πtÅ .X/ are correctly specified, on the same simulated data sets that were
used in the 6th–10th rows of Table 2, except that we retained W on all subjects. In this situation, the SDR estimator for μ1Å and μ0Å had Monte Carlo variances equal to 2:0 × 10−4 and
1:7 × 10−4 respectively. In comparison, when W was not retained on the non-validation sample,
the Monte Carlo variances of the LE estimators were 3:3 × 10−4 for μ1Å and 2:3 × 10−4 for μ0Å .
The loss of efficiency is 40% and 26% respectively.
960
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
4.1.2. Robustness
We generated data as in the previous subsection; however, we analysed them with various
misspecified models. The robust variance was used for the LE estimator and the doubly robust
variance was used for the SDR and EDR estimators. In the middle section of Table 2, we consider the case where the model for μÅt .X/ is misspecified. In our analysis, we incorrectly dropped
W1 from the model for both μÅ1 .X/ and μÅ0 .X/. The SIW and EIW estimators remain unchanged
since they do not utilize μt .X/. The other three estimators, as expected, remained unbiased (see
the third and seventh columns) but lost some efficiency (fifth and ninth columns). The coverage
rates of the estimated 95% confidence intervals are good at both sample sizes (the sixth and 10th
columns), which indicates that our robust variance estimators have good performance.
In the bottom section of Table 2, we consider the case where the model for πtÅ .X/ is misspecified. In our analysis, we incorrectly dropped W1 in the model for πt .X/. As expected, we see
that only the two doubly robust estimators are consistent and only their associated confidence
intervals achieve the nominal coverage rate.
4.2. Continuous outcomes
We assess the finite sample performance of four proposed estimation procedures for μ1Å and
μ0Å . We did not implement the LE estimator owing to its computational complexity. In Section
4.2.1, we evaluate the performance when we correctly specify regression models for the outcome
and treatment. In Section 4.2.2, we evaluate the performance under various types of model
misspecification. We present simulation results for three sample sizes: 800, 2000 and 10 000, and
for each simulation we generated 2000 data sets.
4.2.1. Efficiency
We generated S and W as in Section 4.1.1. We assumed that Y1 and Y0 , given S and W ,
followed independent normal distributions with means equal to S1 + 0:3S2 + W1 + 0:2W2 +
0:1W3 and 1 + 1:5S1 + 0:5S2 + 1:5W1 + 0:2W2 + 0:2W3 respectively, and common variance of
1=25. We assumed the same model for π1Å .X/ as in Section 4.1.1. Under these distributional
assumptions, the true values of μÅ1 and μ0Å are 0.59 and 1.9 respectively. Owing to strong selection
bias, E[Y |T = 1] and E[Y |T = 0] are biased estimates for μ1Å and μ0Å . With a sample size 108 ,
we estimated E[Y |T = 1] = 0:94 and E[Y |T = 0] = 1:057. In generating the validation sample,
we dichotomized Y at 2. The validation sample was constructed by using the same idea as in
Section 4.1.1.
In the estimation of the conditional expectations of the form of E[v.X/|S, Yt ], semiparametric
modelling of E[v.X/|S, Yt ] was initially attempted by using the GAM package in R (Hastie, 1992;
Hastie and Tibshirani, 1990) but it produced unstable results. Instead, we assumed that the
conditional expectation depended on Yt only through I.Yt > 2/ and we fitted a saturated regression model.
In the top section of Table 3, we present the results when there is no model misspecification.
In this simulation setting, the coverage rates of the SIW and EIW estimators for μ1 were substantially lower than 95% at the sample size 800. Some observations in the treated (T = 1) group
had very large weights and caused instability in the SIW or EIW estimators. In such a situation,
a much larger sample size would be needed to have the coverage rate close to 95%. As expected,
the coverage rates improved as the sample size increased from 800 to 10 000. In contrast, the
SDR and EDR estimators’ 95% confidence intervals have reasonable coverage rates when the
sample size is 800 and were very close to 95% at sample size 2000. This shows that including
Causal Inference in Sampling Designs
961
Table 3.
Finite sample properties, efficiency and robustness comparisons for a continuous outcome
Sample
size
Estimator
μ̂1
var
×103
var
×103
95% coverage
probability
μ̂0
var
×103
var
×103
95% coverage
probability
800
SIW
EIW
SDR
EDR
SIW
EIW
SDR
EDR
SIW
EIW
SDR
EDR
0.614
0.641
0.589
0.614
0.599
0.607
0.591
0.599
0.592
0.593
0.590
0.591
18.3
12.0
10.1
9.6
6.7
4.1
3.9
2.9
1.2
0.7
0.7
0.5
12.9
9.8
10.2
7.2
5.8
3.8
4.0
2.6
1.2
0.8
0.8
0.5
87.5
83.7
94.2
92.0
91.6
88.8
95.2
94.2
94.5
93.5
96.0
95.2
1.953
1.920
1.899
1.882
1.922
1.910
1.901
1.895
1.906
1.905
1.900
1.900
71.4
67.9
21.8
18.8
24.7
21.6
8.3
6.0
4.0
3.7
1.5
1.1
76.7
70.8
22.1
16.6
24.6
22.0
8.6
6.1
4.2
3.7
1.6
1.2
94.7
92.3
94.8
94.1
94.8
93.8
95.2
95.0
95.5
95.0
96.7
95.3
μÅt (X) is misspecified
800
SIW
EIW
SDR
EDR
2000
SIW
EIW
SDR
EDR
10000
SIW
EIW
SDR
EDR
0.621
0.641
0.627
0.622
0.599
0.607
0.601
0.600
0.591
0.593
0.592
0.592
20.1
13.9
21.9
18.1
6.7
4.1
6.4
4.2
1.2
0.8
1.1
0.8
14.4
12.4
18.3
17.5
5.8
3.8
5.6
4.0
1.2
0.8
1.2
0.8
87.2
81.2
87.4
86.7
91.6
88.8
92.1
91.8
95.0
92.5
94.7
93.8
1.967
1.934
1.980
1.970
1.922
1.910
1.923
1.921
1.904
1.902
1.903
1.904
78.0
74.7
113.9
125.7
24.7
21.6
25.9
27.2
4.1
3.6
4.0
4.2
84.4
80.4
116.9
131.7
24.6
22.0
25.2
27.4
4.2
3.8
4.1
4.4
94.8
90.8
95.2
94.0
94.8
93.8
95.5
94.9
95.2
94.8
95.5
95.1
π1Å (X) is misspecified
800
SIW
EIW
SDR
EDR
2000
SIW
EIW
SDR
EDR
10000
SIW
EIW
SDR
EDR
0.875
0.863
0.594
0.591
0.866
0.861
0.593
0.590
0.862
0.861
0.590
0.590
15.4
2.8
9.6
6.0
6.6
0.9
4.1
1.9
1.2
0.2
0.7
0.3
15.0
2.9
10.2
4.8
6.2
1.0
4.0
1.9
1.2
0.2
0.8
0.4
37.2
0.5
94.8
93.2
7.8
0.0
94.4
94.5
0.0
0.0
95.3
95.7
1.269
1.248
1.907
1.903
1.257
1.246
1.904
1.899
1.247
1.246
1.900
1.900
27.8
10.8
20.1
17.1
11.4
3.6
8.7
6.4
2.0
0.7
1.6
1.2
29.7
11.3
22.0
15.0
11.0
3.8
8.5
6.0
2.0
0.7
1.6
1.2
4.8
0.2
95.3
92.8
0.0
0.0
94.5
94.5
0.0
0.0
94.8
95.2
2000
10000
the correct μtÅ .X/ can help with the instability owing to extreme weights. Comparing the four
estimators, the EDR estimator is the most efficient. As theoretically predicted, the enriched
estimator is always more efficient than the corresponding simple estimator.
As for a binary outcome, we also evaluated how much efficiency was lost owing to not
collecting W on the entire sample. At a sample size 10 000 and with models for μtÅ .X/ and
πtÅ .X/ correctly specified, if W were available on the entire sample, the EDR estimator (which
is the same as the SDR estimator) for μ1Å and μ0Å had Monte Carlo variances of 0:1 × 10−3 and
0:2 × 10−3 respectively. In contrast, as shown in Table 3, the EDR estimator had a Monte Carlo
variance of 0:5 × 10−3 for μ1Å and 1:1 × 10−3 for μ0Å . The loss of efficiency is 80% and 82% respectively.
962
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
4.2.2. Robustness
In the middle section of Table 3, we consider the effects of misspecification of the model for
μtÅ .X/. In our analysis, we incorrectly dropped W1 in the models for μtÅ .X/.
Since the model for μtÅ .X/ is misspecified, the SDR and EDR estimators lost protection
against instability of extreme weights, which is shown by the poor coverage rate of the 95% confidence intervals at sample size 800. The two doubly robust estimators also lost some efficiency.
Finally, in the bottom section of Table 3, we consider the effects of misspecification of the
model for πtÅ .X/. In our analysis, we incorrectly dropped W1 from this model. Here, only the
two doubly robust estimators are consistent, and only for these two estimators are the coverage
rates for the 95% confidence intervals near the nominal level.
5.
Data analysis
The NSCOT was designed to evaluate the effect of trauma centre care on mortality, functional
outcomes and cost. An outcome-dependent, two-stage sampling procedure was used to sample
individuals with moderately severe to severe injuries who were treated at trauma and non-trauma
centres (MacKenzie et al., 2006). In this paper, we evaluate the causal effect of trauma centre
versus non-trauma-centre care on in-hospital mortality.
Here, T denotes the indicator of receiving trauma centre care; Y1 and Y0 denote the indicator
of in-hospital death under trauma and non-trauma care respectively. Our goal is to estimate
P.Yt = 1/ for t = 0, 1 and the relative risk P.Y1 = 1/=P.Y0 = 1/.
5.1. Sampling process
The sample of patients was selected and eligibility determined in two stages. First, administrative discharge records and emergency department logs were prospectively reviewed to identify
all patients ages 18–84 years who were treated at one of the 69 participating hospitals (18 trauma
centres and 51 non trauma centres) for a moderately severe to severe injury. Several exclusion
criteria were applied. A total of 18 198 patients met the initial eligibility criteria.
In the second stage, all 1 438 hospital deaths and a random sample of 8021 patients discharged
alive were stratified within hospitals on
(a) age 18–64 years and 65–84 years,
(b) international classification of diseases injury severity scores (Baker et al., 1974; MacKenzie
et al., 1989) 15 or lower and greater than 15 and
(c) principal body region injured, head, arms and legs, and other regions.
Stratifying the sample of live hospital discharges was necessary to facilitate balance in baseline
risk between trauma centres and non-trauma centres.
Medical records were obtained for 1391 of the deaths in hospital. On the basis of a detailed
medical record review, 1104 deaths met the final eligibility criteria. Patients who were discharged
alive and selected for the study were contacted by mail at 3 months post injury. Those who did
not refuse by mail were contacted by phone and consent was obtained for access to the medical
record and interviews at 3 and 12 months. Of the 8021 live discharges that were selected for the
study, 4866 were enrolled. Of those enrolled, 779 were determined ineligible on further review of
their medical record (usually because their discharge diagnoses did not meet eligibility criteria),
leaving 4087 live discharges who were finally eligible and for whom complete medical record
data were abstracted.
There are two differences in the sampling design of the NSCOT and that described in Section 2.
First, the sampling strata are defined at individual hospital level, not the dichotomized treat-
Causal Inference in Sampling Designs
963
ment level (e.g. trauma or non-trauma centres). Second, the inclusion eligibility was determined
in two steps. Final eligibility could only be determined with a detailed medical record review
and therefore was only known on the validation sample. Thus, the non-validation sample also
contains ineligible individuals who should be excluded from the study but could not be identified.
The five estimators that are proposed in this paper can be extended to accommodate these
differences with some additional assumptions. Such extensions are beyond the scope of this
paper and will be the subject of future work. For illustration, we modified the NSCOT data so
that they fit into the framework that is described in Section 2. First, we aggregated the hospital
level strata into trauma–non-trauma strata and recalculated the sampling weights. Second, we
assumed that, within each sampling strata in the second stage of sampling, the final eligibility
of an individual does not depend on whether he or she was sampled for validation, agreed
to enrol or whether his or her complete medical record could be found. With this assumption, we then estimated the number of finally eligible individuals in the non-validation sample.
Only these non-validated individuals were retained in the data set. At the end of this process, a
total number of n = 15 617 validated and non-validated individuals were considered part of our
data set.
5.2. Analysis
We used the same treatment regression model for πt .X/ as in MacKenzie et al. (2006). Separate regression models were fitted for μ1Å .X/ and μ0Å .X/. The models included main effect
terms for the following covariates: age (natural cubic spline), gender, race, Charlson score,
mechanism of injury, first emergency department shock, field Glasgow coma scale motor, first
emergency department Glasgow coma scale motor, maximum abbreviated injury scale, maximum abbreviated injury scale for head injury, thorax injury, abdomen injury and extremity
injury, first emergency department pupils, coagulopathy, obesity, flail chest, open skull, paralysis, long bone fractures or amputation, midline shift and insurance status. An interaction
between mechanism of injury and insurance status was also included.
The estimated in-hospital mortality and relative risk from the five different estimation procedures are shown in Table 4. The five different estimates of in-hospital mortality under trauma
care are similar and comparable with the observed mortality among those who were treated at
trauma centres. The standard errors are also comparable. For non-trauma-centre care, the estimators of in-hospital mortality are almost twice as large as the observed mortality among those
Table 4.
In-hospital mortality from different estimators
Result for the following estimators:
Trauma centre (%)
Non-trauma centre (%)
Relative risk
Observed
SIW
EIW
SDR
EDR
LE
7:64†
.0:28/‡
11.18
(2.36)
0.68
(0.40–0.97)§
7.55
(0.25)
10.82
(2.36)
0.70
(0.40–1.00)
7.56
(0.28)
10.40
(1.82)
0.73
(0.52–1.03)
7.47
(0.25)
10.19
(1.81)
0.73
(0.48–0.99)
7.47
(0.25)
10.50
(1.90)
0.71
(0.50–1.02)
†Estimated in-hospital mortality.
‡Standard error.
§95% confidence interval of relative risk.
7.88
5.77
1.37
964
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
treated at non-trauma centres. The EDR estimator yields the lowest point estimate and also
has the lowest standard error. All five estimation procedures yield similar relative risk estimates.
We assumed asymptotic normality for the logarithm of relative risk and derived its confidence
interval by using the delta method. This confidence interval was then exponentiated to obtain
the confidence interval for the relative risk. The SDR, EDR and LE estimators had similar
confidence interval lengths, which were more than 15% shorter than those of the SIW and EIW
estimators.
6.
Discussion
In this paper, we presented five estimators for the causal effect of a binary treatment in the
outcome-dependent, two-phase sampling design.
On the basis of simulation and empirical work, we recommend use of the EDR estimator
because of its reasonable efficiency, robustness to model misspecification and computational
tractability. The SIW estimator is less efficient than the EIW estimator. The SDR estimator is
less efficient than the EDR estimator. Compared with the EDR estimator, the EIW estimator
does not require models for μÅt .X/. However, it is more sensitive to extreme weights and is
less efficient when ‘good’ models for μtÅ .X/ are available. The LE estimator requires complex
statistical modelling to gain its full efficiency, which we view as impractical in real data analyses.
We implement a two-step LE estimator for a binary outcome and have found no evidence that
it performs better than the EDR estimators in our simulation study.
There are several directions for future research. First, in this paper, we assumed that the
sampling probability qT .S, Y/ is known. However, in many studies the sampling weights are not
known a priori and must be estimated. Even if they are known, it can be shown that treating
them as estimates can yield more efficient estimators. In doing so, the influence functions no
longer take the form of equation (1). To derive the new class of influence functions, we must take
the influence functions in equation (1) and subtract their projection onto the space spanned by
the score function from a model for the sampling probability.
Second, in the presence of model misspecification, the EIW, SDR, EDR and LE estimators
may not have smaller variances than the SIW estimator. Tan (2006) introduced a way to construct a doubly robust estimator which always improves over the SIW estimator. Future work
could extend his method to outcome-dependent, two-phase sampling designs.
Finally, we have used the independent and identically distributed observations framework
for semiparametric inference that was developed by Bickel et al. (1993). The NSCOT study does
not fit ideally into this framework. This stems from the fact that the NSCOT employed binomial
sampling within strata. Future work should focus on causal inference methods under binomial
and other sampling schemes.
Appendix A
Proposition 1. Define "1 = Y1 − μÅ1 and "0 = Y0 − μÅ0 and reparameterize the full data .X, Y0 , Y1 / to
.X, "0 , "1 /. The nuisance tangent space of the full data is
ΛF = {a."0 , "1 , X/ :
E[a."0 , "1 , X/] = 0,
E["0 a."0 , "1 , X/] = 0,
E["1 a."0 , "1 , X/] = 0}:
The orthogonal nuisance tangent space of the full data is
.10/
ΛF, ⊥ = {B2×2
Causal Inference in Sampling Designs
"0
: B is any 2 × 2 matrix of constants}:
"1
965
.11/
Proof. The joint distribution f.X, "1 , "0 / is any distribution satisfying E["1 ] = 0 and E["0 ] = 0. The result
follows.
Under the assumption of no unmeasured confounding and that the sampling probability only depends
on .S, T , Y/, the complete data L = .X, Y0 , Y1 , V , T/ has the likelihood
f.X, Y0 , Y1 / P.T |X, Y0 , Y1 / P.V |T , X, Y0 , Y1 / = f.X, Y0 , Y1 / P.T |X/ P.V |T , S, Y/:
Thus, the nuisance tangent space of the complete-data likelihood can be factorized as
Λcomp = ΛF ⊕ ΛT |X ⊕ ΛV |T , S, Y :
Since P.V |T , S, Y/ is known by study design, we have
Λcomp = ΛF ⊕ ΛT |X :
We shall also assume that πt .X/ is known up to a finite number of parameters with αÅ denoting the true
parameter, i.e.
π Å .X/ = π .X; αÅ /:
t
t
Without loss of generality, we shall assume that
logit{π1 .X; αÅ /} = l.X, αÅ /
where l.X, αÅ / is a specified function of X and α which is differentiable in α. This model implies that
exp{t l.X; αÅ /}
πt .X; αÅ / =
:
1 + exp{l.X; αÅ /}
Let r be the length of αÅ .
Proposition 2.
@l.X; αÅ /
ΛT |X = B2×r
{T − π1 .X; αÅ /}: for all B :
@α
Proof. The likelihood is
pT |X .t, x; α/ = π1 .x; α/t {1 − π1 .x; α/}1−t :
The log-likelihood is equal to
log{pT |X .t, x; α/} = t log{π1 .x; α/} + .1 − t/ log{1 − π1 .x; α/}:
The score function is
Sα =
t
1−t
−
π1 .x; α/ 1 − π1 .x; α/
Table 5.
(V, T)
.0, 1/
.0, 0/
.1, 1/
.1, 0/
@π1 .x; αÅ / @l.x; αÅ /
=
{t − π1 .x; α/}:
@α
@α
Missingness patterns
O
P(V ,T |X,Y0 ,Y1 )
.S, Y1 /
.S, Y0 /
.X, Y1 /
.X, Y0 /
{1 − q1 .S, Y1 /} π1Å .X/
{1 − q0 .S, Y0 /} π0Å .X/
q1 .S, Y1 / π1Å .X/
q0 .S, Y0 / π0Å .X/
966
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
Since ΛT |X is the space that is spanned by Sα , the result follows.
The observed data are O = .S, VW , TY1 , .1 − T/Y0 , V , T/, indicating that there are missing data. The
missingness patterns are listed in Table 5. The data are not missing at random because the probability of
missingness depends on W which is not always observed.
Let Λ1 = ΛF and Λ2 = ΛT |X . The nuisance tangent space for the observed data is equal to
O
ΛO = ΛO
1 + Λ2 ,
ΛO
j = range.g ◦ Πj /, j = 1, 2,
where
range.·/ is the range of an operator. g : ΩL → ΩO , g.·/ = E[·|O], ΩL and
O
Ω are spaces of mean 0 random functions of L and O respectively, Πj is the projection operator from ΩL
to Λj and S̄ denotes the closed linear span of the set S (Bickel et al., 1993). Since the data are not missing
O
at random, ΛO
1 and Λ2 are not orthogonal. Regardless, we still have the result
⊥
⊥
ΛO, ⊥ = ΛO,
∩ ΛO,
:
1
2
⊥
Rotnitzky and Robins (1997) showed how to compute ΛO,
.
1
Proposition 3.
⎫
⎧
⎞
⎛
VT
Å/
⎪
⎪
−
μ
.Y
⎬
⎨
1
1
Å
⎟
O, ⊥
2×2 ⎜ q1 .S, Y1 / π1 .X; α /
.2/
.2/
.2/
+
A
,
:
for
all
B,
A
∈
Λ
Λ1 = B ⎝
⎠
V.1 − T/
⎪
⎪
⎭
⎩
.Y0 − μÅ0 /
Å
q0 .S, Y0 / π0 .X; α /
where Λ.2/ = {b.O/ : E[b.O/|L] = 0}.
Proposition 4.
Λ
where
.2/
=
1
τ =0
φ2, τ .O; gτ / + φ3 .O; αÅ , b/ : for all g0 , g1 , b
V
gτ .S, Yτ /,
qτ .S, Yτ /
V
T − π1 .X; αÅ /
b.X/,
φ3 .O; αÅ , h/ =
qT .S, YT / R.X; αÅ /
φ2, τ .O; gτ / = 1{T =τ } 1 −
g1 , g0 and b are arbitrary 2 × 1 vectors of functions of their arguments and R.X; αÅ / = π1 .X; αÅ / π0 .X; αÅ /.
Proof. Consider a function of the observed data
h.O/ = .1 − V/T g1 .S, Y1 / + .1 − V/.1 − T/ g2 .S, Y0 / + VT g3 .X, Y1 / + V.1 − T/ g4 .X, Y0 /:
Then,
E[h.O/|L] = {1 − q1 .S, Y1 /} π1Å .X/ g1 .S, Y1 / + {1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / + q1 .S, Y1 / π1Å .X/ g3 .X, Y1 /
+ q0 .S, Y0 / π0Å .X/ g4 .X, Y0 /:
Setting the expectation to zero, we have
{1 − q1 .S, Y1 /} π1Å .X/ g1 .S, Y1 / + q1 .S, Y1 / π1Å .X/ g3 .X, Y1 / = −{1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / − q0 .S, Y0 /
× π0Å .X/ g4 .X, Y0 /:
The left-hand side is a function of .X, Y1 / and the right-hand side is a function of .X, Y0 /. For them to be
equal, both must be a function of X alone. Thus,
{1 − q .S, Y /} π Å .X/ g .S, Y / + q .S, Y / π Å .X/ g .X, Y / = b.X/,
1
1
1
1
1
1
1
1
3
1
−{1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / − q0 .S, Y0 / π0Å .X/ g4 .X, Y0 / = b.X/
where b.X/ is a function of X. Then,
Causal Inference in Sampling Designs
g3 .X, Y1 / = −
1 − q1 .S, Y1 /
1
b.X/
g1 .S, Y1 / +
q1 .S, Y1 /
q1 .S, Y1 / π1Å .X/
g4 .X, Y0 / = −
1 − q0 .S, Y0 /
1
b.X/
g2 .S, Y0 / −
q0 .S, Y0 /
q0 .S, Y0 / π0Å .X/
967
and
Simple algebra leads to the proposition.
Scharfstein et al. (1999) showed the following result.
Proposition 5.
⊥
= {h.O/ : Π[h.O/|ΛT |X ] = 0}
ΛO,
2
= {h.O/ : h.O/ ∈ ΛT |X, ⊥ }:
For simplicity of notation, we consider the estimation of μt from now on.
Proposition 6. The orthogonal nuisance tangent space of the observed data is equal to
ΛO, ⊥ = {B φ1 {O; μÅt , αÅ , μtÅ .X/} +
1
τ =0
φ2, τ .O; gτ / + φ3 .O; αÅ , b/, : for all B}
where
φ1 {O; μÅt , αÅ, μtÅ .X/} =
VI{T =t}
V
T −π1 .X; αÅ /
π1−t .X; αÅ /{μtÅ .X/−μÅt },
.Yt − μÅt /−.−1/t
Å
qt .S, Yt / πt .X; α /
qT .S,YT / R.X; αÅ /
V
φ2, τ .O; gτ / = 1{T =τ } 1 −
gτ .S, Yτ /,
qτ .S, Yτ /
V
T − π1 .X; αÅ /
b.X/,
φ3 .O; αÅ , b/ =
qT .S, YT / R.X; αÅ /
g1 and g0 are arbitrary functions of their arguments, b.X/ ∈ T ⊥ and
@l.X; αÅ /
T = k
: k is an arbitrary constant vector :
@α
⊥
and find out what conditions it must satisfy so that it also belongs
Proof. We take an element from ΛO,
1
⊥
O, ⊥
⊥
to ΛO,
.
Note
that
φ
.O;
g
/
∈
Λ
and
that, for any h.O/ ∈ ΛO,
, we have
2, τ
τ
2
2
1
VI{T =t}
@l.X; αÅ /
Å / + φ .O; αÅ , b /
E h.O/
{T − π1 .X; αÅ /} = E
−
μ
.Y
t
3
t
t
@α
qt .S, Yt / πt .X; αÅ /
Å
@l.X; α /
×
{T − π1 .X; αÅ /}
@α
@l.X; αÅ /
=E
[.−1/t−1 {μÅt .X/ − μtÅ } π1−t .X; αÅ / + bt .X/] :
@α
Therefore,
bt .X/ = .−1/t π1−t .X; αÅ /{μtÅ .X/ − μtÅ } + b.X/,
where E[{@l.X; αÅ /=@α} b.X/] = 0, i.e. b.X/ ∈ T ⊥ .
Proposition 7. The set of influence functions of all RAL estimators of μÅt is
⊥
Å Å Å
ΛO,
Å = {φ1 {O; μt , α , μt .X/} +
1
τ =0
φ2, τ .O; gτ / + φ3 .O; αÅ , b/}:
Proof. The set of influence functions of all RAL estimators is equal to
968
W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie
@h{O; μÅt , αÅ , μtÅ .X/}
⊥
Å Å Å
O, ⊥
=
−1
:
ΛO,
Å = h{O; μt , α , μt .X/} ∈ Λ : E
@μÅt
It is straightforward to show that, for any h{O; μÅt , αÅ , μtÅ .X/} ∈ ΛO, ⊥ ,
@h{O; μÅt , αÅ , μtÅ .X/}
E
= −B:
@μÅ
t
Therefore, B should be 1.
Proposition 8. For an arbitrary function of observed data h.O/,
1
E[h.O/I{T =τ } {1 − V=qτ .S, Yτ /} |S, Yτ ]
V
:
I{T =τ } 1 −
Π[h.O/|Λg ] =
q
.S,
Y
/
E[I{T =τ } {1 − V=qτ .S, Yτ /}2 |S, Yτ ]
τ
τ
τ =0
Proof. Suppose that Π[h.O/|Λgτ ] = φ2, τ .O; gτÅ /; then, for any gτ ,
E[{h.O/ − φ .O; g Å /} φ .O; g /] = 0:
2, τ
gτÅ can be directly solved from the above equation.
τ
2, τ
τ
Proposition 9. For an arbitrary function of observed data h.O/,
Å V T − π1 .X; αÅ /
−1
@l.X; α /
A.X/
,
Π[h.O/|Λb ] =
Ch .X/ − kh
qT R.X; αÅ /
@α
where
Ch .X/ = E h.O/
.12/
V
T − π1 .X; αÅ / ,
X
qT .X, YT / R.X; αÅ /
A.X/ = π1 .X; αÅ /−1 E[q1 .S, Y1 /−1 |X] + π0 .X; αÅ /−1 E[q0 .S, Y0 /−1 |X]
and
@l.X; αÅ /
@l.X; αÅ / @l.X; αÅ / −1
:
E A.X/−1
kh = E A.X/−1 Ch .X/
@α
@α
@α
Proof. Suppose that Π[h.O/|Λb ] = φ3 .O; αÅ , bÅ /; bÅ can be determined by the following two restrictions:
(a) for any b, E[{h.O/ − φ3 .O; αÅ , bÅ /} φ3 .O; αÅ , b/] = 0;
(b) bÅ ∈ T ⊥ .
References
Baker, S. P., O’Neill, B., Haddon, Jr, W. and Long, W. B. (1974) The Injury Severity Scores: a method for describing
patients with multiple injuries and evaluating emergency care. J. Trauma, 14, 187–196.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semi
Parametric Models. Baltimore: Johns Hopkins University Press.
Breslow, N. E. (2000) Statistics in the life and medical sciences. J. Am. Statist. Ass., 95, 281–282.
Breslow, N. E. and Cain, K. C. (1988) Logistic regression for two-stage case-control data. Biometrika, 75, 11–
20.
Breslow, N. E. and Holubkov, R. (1997a) Maximum likelihood estimation of logistic regression parameters under
two-phase, outcome-dependent sampling. J. R. Statist. Soc. B, 59, 447–461.
Breslow, N. E. and Holubkov, R. (1997b) Weighted likelihood pseudo-likelihood and maximum likelihood
methods for logistic regression analysis of two-stage data. Statist. Med., 16, 103–116.
Breslow, N., McNeney, B. and Wellner, J. A. (2003) Large sample theory for semiparametric regression models
with two-phase outcome dependent sampling. Ann. Statist., 31, 1110–1139.
Carroll, R. J. and Wand, M. P. (1991) Semiparametric estimation in logistic measurement error models. J. R.
Statist. Soc. B, 53, 573–585.
Chatterjee, N., Chen, Y.-H. and Breslow, N. E. (2003) A pseudoscore estimator for regression problems with
two-phase sampling. J. Am. Statist. Ass., 98, 158–168.
Causal Inference in Sampling Designs
969
Cochran, W. G. (1963) Sampling Techniques, p. 412. New York: Wiley.
Cosslett, S. R. (1981) Maximum likelihood estimator for choice-based samples. Econometrica, 49, 1289–1316.
Cosslett, S. R. (1983) Distribution-free maximum likelihood estimator of the binary choice model. Econometrica,
51, 765–782.
Fears, T. R. and Brown, C. C. (1986) Logistic regression methods for retrospective case-control studies using
complex sampling procedures. Biometrics, 42, 955–960.
Hampel, F. R. (1974) The influence curve and its role in robust estimation. J. Am. Statist. Ass., 69, 383–393.
Hastie, T. J. (1992) Generalized additive models. In Statistical Models in S (eds J. M. Chambers and T. J. Hastie),
ch. 7. Pacific Grove: Wadsworth and Brooks/Cole.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall.
van der Laan, M. J. and Robins, J. M. (2003) Unified Methods for Censored Longitudinal Data and Causality. New
York: Springer.
Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999) Semiparametric methods for response-selective and missing
data problems in regression. J. R. Statist. Soc. B, 61, 413–438.
MacKenzie, E. J., Rivara, F. P., Jurkovich, G. J., Nathens, A. B., Frey, K. P., Egleston, B. L., Salkever, D. S. and
Scharfstein, D. O. (2006) A national evaluation of the effect of trauma-center care on mortality. New Engl.
J. Med., 354, 366–378.
MacKenzie, E. J., Steinwachs, I. M. and Shankar, B. (1989) Classifying trauma severity based on hospital discharge
diagnoses: validation of an ICD-9CM to AIS-85 conversion table. Med. Care, 27, 412–422.
Newey, W. K. (1994) The asymptotic variance of semiparametric estimators. Econometrica, 62, 1349–1382.
Neyman, J. (1938) Contribution to the theory of sampling human populations. J. Am. Statist. Ass., 33, 101–116.
Pepe, M. S. and Fleming, T. R. (1991) A nonparametric method for dealing with mismeasured covariate data.
J. Am. Statist. Ass., 86, 108–113.
Robins, J. M., Hsieh F. and Newey, W. (1995) Semiparametric efficient estimation of a conditional density with
missing or mismeasured covariates. J. R. Statist. Soc. B, 57, 409–424.
Robins, J. M. and Ritov, Y. (1997) Toward a curse of dimensionality appropriate (CODA) asymptotic theory for
semi-parametric models. Statist. Med., 16, 285–319.
Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994) Estimation of regression coefficients when some regressors
are not always observed. J. Am. Statist. Ass., 89, 846–866.
Rotnitzky, A. and Robins, J. M. (1997) Analysis of semiparametric regression models with non-ignorable nonresponse. Statist. Med., 16, 81–102.
Rubin, D. B. (1986) Comments on “Statistics and causal inference”. J. Am. Statist. Ass., 81, 961–962.
Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999) Reply to comments on “Adjusting for nonignorable
drop-out using semiparametric nonresponse models”. J. Am. Statist. Ass., 94, 1135–1146.
Schill, W., Jöckel, K.-H., Drescher, K. and Timm, J. (1993) Logistic analysis in case-control studies under validation sampling. Biometrika, 80, 339–352.
Scott, A. J. and Wild, C. J. (1991) Fitting logistic regression models in stratified case-control studies. Biometrics,
47, 497–510.
Scott, A. J. and Wild, C. J. (1997) Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71.
Tan, Z. (2006) A distributional approach for causal inference using propensity scores. J. Am. Statist. Ass., 101,
1619–1637.
Tsiatis, A. A. (2006) Semiparametric Theory and Missing Data. New York: Springer.
van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge: Cambridge University Press.
Weaver, M. A. and Zhou, H. (2005) An estimated likelihood method for continuous outcome regression models
with outcome-dependent sampling. J. Am. Statist. Ass., 100, 459–469.
White, J. E. (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease.
Am. J. Epidem., 115, 119–128.