Estimation Based on Case-Control Designs with Known

University of California, Berkeley
U.C. Berkeley Division of Biostatistics Working Paper Series
Year 
Paper 
Estimation Based on Case-Control Designs
with Known Incidence Probability
Mark J. van der Laan∗
∗
Division of Biostatistics and Department of Statistics, University of California, Berkeley,
[email protected]
This working paper is hosted by The Berkeley Electronic Press (bepress) and may not be commercially reproduced without the permission of the copyright holder.
http://biostats.bepress.com/ucbbiostat/paper234
c
Copyright 2008
by the author.
Estimation Based on Case-Control Designs
with Known Incidence Probability
Mark J. van der Laan
Abstract
Case-control sampling is an extremely common design used to generate data to
estimate effects of exposures or treatments on a binary outcome of interest when
the proportion of cases (i.e., binary outcome equal to 1) in the population of interest is low. Case-control sampling represents a biased sample of a target population of interest by sampling a disproportional number of cases. Case-control
studies are also commonly employed to estimate the effects of genetic markers or
biomarkers on phenotypes. The typical approach used in practice is to fit (conditional) logistic regression models, ignoring the case-control sampling, in order
to estimate the conditional odds ratios of being a case, given baseline covariates
and the exposure of interest. Although these methods do not rely on knowing the
true incidence probability (i.e, probability of being a case), and provide valid logistic regression model based estimates of the conditional effect of exposure on
odds ratio scale, they do not provide an estimate of a marginal causal odds ratio or causal relative risk, which are causal parameters representing the typical
parameters of interest in randomized trials comparing different treatment or exposure levels. By the same argument, these methods do not provide measures of
marginal variable importance. In this article we focus on methods for causal inference and variable importance analysis for matched and unmatched case-control
studies relying on knowing the incidence probability, conditional on the matching
variable if matching is used. We start out with presenting, for both case-control
designs, a simple intercept adjustment method that deterministically maps a, possibly weighted for matched case-control designs, logistic regression fit into a valid
model based fit of the actual conditional probability on being a case, given the covariates. The resulting estimate of the conditional probability of being a case has
now the important property that its standard error is proportional to the incidence
probability (divided by the square root of the sample size) so that the obtained
precision is good enough for accurately estimating marginal causal relative risks
or causal odds-ratios even when the probability of being a case is extremely rare.
Subsequently, we present our general proposed methodology, involving a simple
weighting scheme of cases and controls, that maps any estimation method for a parameter developed for prospective sampling from the population of interest into an
estimation method based on case-control sampling from this population. For regular case-control designs the weighting only relies on knowing the true population
proportion of cases or, equivalenty, the true probability of being a case, and for
matched case-control sampling it also relies on knowing this proportion of cases
within each population strata of the matching variable. We show that this casecontrol weighting of an efficient estimator for a prospective sample from the target
population of interest maps into an efficient estimator for matched and unmatched
case-control sampling. We show how application of this generic methodology
provides us with double robust locally efficient targeted maximum likelihood estimators of the causal relative risk and causal odds ratio for regular case control
sampling and matched case control sampling. We also illustrate such double robust targeted maximum likelihood estimators in marginal structural models and
semi-parametric logistic regression models. Finally, we show that case-control
studies nested in randomized trials allow estimation, based on inverse probability
of treatment weighted (IPTW) estimators of the marginal causal relative risk or
odds ratios without the need to know the incidence probability, and we present
the simple implications for observational case-control studies in which this incidence probability is not known but known to be close to zero. By comparing
these methods with the efficient method for the case that the incidence probability
is known, it follows that even in randomized trials the knowledge of the incidence
probability allows for significantly more precise estimation of causal parameters.
1
Introduction
Case-control sampling is an extremely common design used to generate data
to estimate effects of exposures or treatments on a binary outcome of interest
when the actual population proportion of cases (i.e. binary outcome equal to
1) is small. As a consequence, it is of interest to present estimators of causal
effects or variable importance parameters based on case-control data.
1.1
Formulation of case-control estimation problem.
Let’s first formulate the statistical problem. For the sake of concreteness and
illustration, our formulation will focus on a case-control point treatment data
structure with baseline covariates in which one is concerned with estimation
of the causal effect or variable importance of the treatment variable on the
binary outcome. Our initial formulation will assume that the variables are
not subject to missingness or censoring. Our general methods are straightforward extensions and apply to general case control data structures, including
censored data structures and time-dependent longitudinal data structures.
Experimental unit of interest. Let O∗ = (W, A, Y ) ∼ P0∗ represent the
experimental unit and corresponding distribution P0∗ of interest, consisting of
baseline covariates W , a subsequent monitored treatment/exposure variable
A, and a ”final” binary outcome Y .
Causal or variable importance parameter of interest. Suppose one is
concerned with statistical inference regarding a particular euclidean valued
variable importance or causal effect parameter ψ0∗ = Ψ∗ (P0∗ ) ∈ IRd of this
distribution P0∗ . For example, one might be interested in the marginal causal
additive effect of a binary treatment A ∈ {0, 1} defined as
ψ0∗ ≡ E0∗ E0∗ (Y | A = 1, W ) − E0∗ (Y | A = 0, W ) = E0∗ (Y1 ) − E0∗ (Y0 )
= P0∗ (Y1 = 1) − P0∗ (Y0 = 1),
where the latter causal effect interpretation of this parameter of P0∗ requires the notion of treatment specific counterfactual outcomes Y0 , Y1 , viewing (W, A, Y = YA ) as a time-ordered missing data structure on the full data
structure (W, Y0 , Y1 ), and one needs to assume the randomization assumption
stating that A is independent of Y0 , Y1 , given W . The latter causal parameter formulation ψ0∗ can also be viewed as a W -adjusted variable importance
(of variable A) parameter of the true regression of Y on A, W , in which case
1
Hosted by The Berkeley Electronic Press
there is no need to assume the time ordering (W ⇒ A ⇒ Y ), the missing
data structure assumption, or the randomization assumption, and the adjustment set W is user supplied (and does thus not need to correspond with
the set of all confounders of A): see van der Laan (2006) for a general formulation of variable importance parameters and its direct relation to causal
effect parameters.
One can also define the parameter of interest as a causal relative risk
ψ0∗ =
EY1
P (Y1 = 1)
E0∗ E0∗ (Y | A = 1, W )
=
=
,
∗ ∗
E0 E0 (Y | A = 0, W )
EY0
P (Y0 = 1)
or a causal odds ratio,
ψ0∗ =
P (Y1 = 1)P (Y0 = 0)
,
P (Y1 = 0)P (Y0 = 1)
or their variable importance analogue.
We will use these particular marginal causal effects or marginal variable
importance parameters as our main examples in order to illustrate our proposed methodology for case-control data, including our proposed targeted
maximum likelihood estimation methodology.
Model for target probability distribution. A model for O∗ is obtained
by modelling this distribution of O∗ : for example, one might know that A is
independent of W , one might know the actual distribution (treatment mechanism) P0∗ (A = a | W ), or one might assume a marginal structural model
E0∗ (Ya | V ) = E0∗ (E0∗ (Y | A = a, W ) | V ) = m(a, V | β0∗ ),
where V ⊂ W denotes some user supplied potential effect modifier of interest, and m(· | β) some parameterization modelling the causal effect of the
intervention A = a on the outcome Y , conditional on V . We will denote such
a model for P0∗ with M∗ : i.e., it is assumed that P0∗ ∈ M∗ .
Case-control sampling and its probability distribution. If one would
sample n i.i.d. observations O1∗ , . . . , On∗ ∼ P0∗ , then we could (e.g.) apply
the locally efficient targeted MLE of ψ0∗ (see e.g. van der Laan and Rubin
(2006) or Moore and van der Laan (2007)), or one could use double robust
estimating function methodology (van der Laan and Robins (2002), van der
Laan (2006)).
However, this so called prospective sampling scheme is often considered
impractical and ineffective in situations in which the probability P0∗ (Y = 1)
2
http://biostats.bepress.com/ucbbiostat/paper234
on the event Y = 1 (say disease) is very small. For example, if the proportion
of diseased in the population of interest is one in hundred thousand, then one
would have to sample millions of observations in order to have some cases
(i.e, Yi = 1) in the sample. This sparsity of cases in the population of interest
is precisely the typical motivation for case-control sampling.
We will distinguish between two types of case-control sampling: independent or un-matched case-control sampling and matched case-control sampling. In both cases, the marginal distribution of the cases and the marginal
distribution of the controls is completely determined by the population (i.e.
prospective sampling) distribution P0∗ of the random variable (W, A, Y ) of
interest.
Independent Case-Control Sampling: One first samples a case by sampling (W1 , A1 ) from the conditional distribution of (W, A), given Y = 1.
Subsequently, one samples J controls (W0j , Aj0 ) from the conditional distribution of (W, A), given Y = 0, j = 1, . . . , J. It is allowed that these
J control observations are dependent as long as their marginal distributions are indeed equal to the conditional distribution of W, A, given
Y = 0.
This results in an experimental unit observed data structure:
O = ((W1 , A1 ), (W0j , Aj0 : j = 1, . . . , J)) ∼ P0 ,
where we denote the sampling distribution of this data structure O
described above with P0 . Thus, a case control data set will consists of
n independent and identically distributed observations O1 , . . . , On with
sampling distribution P0 described above. That is, we treat the cluster
consisting of one case and J controls as the experimental unit, and the
marginal distribution of the case and controls are specified as above by
P0∗ .
Matched Case-Control Sampling: One specifies a categorical matching
variable M ⊂ W . One first samples a case by sampling (M1 , W1 , A1 )
from the conditional distribution of (M, W, A), given Y = 1. Subsequently, one samples J controls (M0j , W0j , Aj0 ) from the conditional
distribution of (M, W, A), given Y = 0, M = M1 . That is, with probability equal to 1 we have M0j = M1 , j = 1, . . . , J. It is allowed that
these J control observations are dependent as long as their marginal distributions are indeed equal to the conditional distribution of M, W, A,
given Y = 0, M = M1 .
3
Hosted by The Berkeley Electronic Press
This results in an experimental unit data structure:
O = ((M1 , W1 , A1 ), (M0j = M1 , W0j , Aj0 : j = 1, . . . , J)) ∼ P0 ,
where we denote the sampling distribution of this data structure O described above with P0 . Thus, a matched case-control data set will consists of n independent and identically distributed observations O1 , . . . , On
with sampling distribution P0 described above. That is, we treat the
cluster consisting of one case and the J matched controls as the experimental unit, and the marginal distribution of the case and J controls
are specified as above by P0∗
We will also refer to the independent case-control experiment and the matched
case-control experiments as Case-Control Design I and Case-Control Design
II, respectively.
Extensions. Our methods naturally handle the case that J is random and
thus varies per experimental unit, assuming that the marginal distributions of
cases and controls, conditional on J = j, do not depend on j. In the situation
that a case was never coupled to a set of controls one can artificially create
such couplings. Our estimators are not sensitive to the particular choice of
coupling. In the discussion we show the simple extension of our methods to
some variations on these case-control designs I and II, such as pair-matched
case-control designs, case-control sampling within strata, and counter-match
case control designs.
The estimation problem: The statistical problem is now to estimate the
parameter ψ0 = Ψ∗ (P0∗ ) of the population distribution P0∗ ∈ M∗ of (W, A, Y ),
known to be an element of some specified model M∗ , based on the casecontrol data set O1 , . . . , On ∼ P0 .
Known or sensitivity analysis parameters/weights: We define
q0 ≡ P0∗ (Y = 1) and q0 (δ | M ) ≡ P0∗ (Y = δ | M ),
as the marginal probability of being a case, and the conditional probability of
being a case/non-case, conditional on the matching variable. It is assumed
that these probabilities are between 0 and 1. In addition, we define the
quantity
q0 (0 | M )
P ∗ (Y = 0 | M )
= q0
.
q̄0 (M ) ≡ q0 0∗
P0 (Y = 1 | M )
q0 (1 | M )
4
http://biostats.bepress.com/ucbbiostat/paper234
We note that q̄0 (M ) is determined by q0 and q0 (1 | M ) = P0∗ (Y = 1 | M ),
and we also note that E0 q̄0 (M1 ) = 1 − q0 . These two quantities q0 and
q̄0 (M ) (for matched case-control studies) will be used to weight the cases
and controls to obtain valid estimation procedures.
In order to be able to identify the wished causal parameters, for casecontrol design I, we only need to assume q0 is known, and, for matched casecontrol design II, we assume q0 and q̄0 (m) for each m are known. However,
we note here that for matched case-control designs one can also assume that
q0 and
r0 (m) ≡ P0∗ (Y = 0, M = m)
(instead of q̄0 (1 | m)) are known We note that, given r0 (m), q̄0 (m) is known
up till a simple to estimate nuisance parameter P (M1 = m):
q̄0 (m) =
r0 (m)
.
P0 (M1 = m)
As a consequence, our case-control weighted estimation procedures using
q0 , q̄0 (m) still apply in settings in which one assumes q0 and r0 (m) are known,
(m)
by replacing q̄0 (m) by its estimate 1 Pn r0I(M
.
1i =m)
n
i=1
Observed data model. In this article, we will typically assume that q0
is known, and that, for matched case-control designs we also assume that
q̄0 (M ), or equivalently, q0 (1 | m) = P0∗ (Y = 1 | M = m) is known for each
m. In Section 8 we show that if the ”treatment mechanism” g0∗ (a | w) =
P0∗ (A = a | W = w) is known, as it would be in a case control study nested
in a randomized trial, then we can estimate the relative risk or odds ratio
parameters without a need to know (any of) q0 or q̄0 (M ).
The model M∗ , possibly including the knowledge q0 or q̄0 (M ), imply
now models for the marginal distribution of the cases (M1 , W1 , A1 ) and the
marginal distributions of the controls (M1 , W2j , Aj2 ), j = 1, . . . , J. The model
M∗ does not imply much, if anything, about the dependence structure among
(M1 , W1 , A1 ), (M1 , W2j , Aj2 ), j = 1, . . . , J, beyond the fact that, for matched
case-control studies, all its components (i.e., the case and control observations) share a common variable M1 . Let M be the model for the observed
data distribution P0 compatible with M∗ (i.e., its marginals are specified by
P0∗ ).
One possible and probably very common model M is to assume that,
given the first draw (M1 , W1 , A1 ) from (M, W, A), given Y = 1, the control
5
Hosted by The Berkeley Electronic Press
observations are all independent draws from the specified conditional distributions. Note that in this latter model the marginal distributions for the case
and control observations implied by P ∗ describe now the whole case-control
sampling distribution P , so that we can write M = {P (P ∗ ) : P ∗ ∈ M},
where P (P ∗ ) is the distribution of O implied by P ∗ .
Other possible models might specify in another manner, or not specify
at all, the dependence structure and could, for example, be represented as
{P (P ∗ , η) : P ∗ ∈ M∗ , η}, where the nuisance parameter η in combination
with P ∗ describes the complete joint distribution of case and control observations (M1 , Z1 ), (M1 , Z2j : j = 1, . . . , J) compatible with its marginal
distributions implied by P ∗ .
We note that knowing q0 does not put restrictions on the data generating
distribution P0 since one conditions on Y = 1, but for case-control design
I it does allow identification of the wished parameters by expressing them
as a function of the distribution of the observed case-control data-structure
and q0 . Similarly, for matched case-control designs, knowing q0 and r0 (·)
does not put restrictions on the data generating distribution P0 for matched
case-control designs, but it allows one to express the wished parameter as
a function of the distribution of the data and (q0 , r0 ). It remains to be
investigated if knowing q0 and q̄0 puts a restriction on the data generating
distribution for matched-case-control designs.
1.2
General formulation of case-control sampling.
Above we provided the case control sampling framework for the data structure O∗ = (M, W, A, Y ) ∼ P0∗ . In general, we have O∗ = (M, Z, Y ) ∼ P0∗ ,
M the matching variable (which can be chosen to be empty for case control design I), the distribution of interest P0∗ is known to be an element of
a model M∗ , ψ0∗ = Ψ∗ (P0∗ ) is a particular parameter of this distribution P0∗
of interest, Ψ∗ : M∗ → IRd is a euclidean parameter defined on the model
M∗ , (M1 , Z1 ) is a draw from the conditional distribution of (M, Z), given
Y = 1, (M1 , Z2j ) is a draw from the conditional distribution of (M, Z), given
Y = 0, M = M1 (or just Y = 0 in case control design I), j = 1, . . . , J, and
the experimental unit observed data structure for the case-control design is
defined as O = ((M1 , Z1 ), ((M1 , Z2j ) : j = 1, . . . , J)) ∼ P0 .
The model M∗ , and possibly knowing q0 or q̄0 (M ), imply now models
for the marginal distribution of (M1 , Z1 ) and the marginal distributions of
(M1 , Z2j ), j = 1, . . . , J. The model M∗ does not imply much, if anything,
6
http://biostats.bepress.com/ucbbiostat/paper234
about the dependence structure among (M1 , Z1 ), (M1 , Z2j ), j = 1, . . . , J,
beyond the fact that they share a common variable M1 for m atched case
control studies. Let M be the model for the observed data distribution P0
compatible with M∗ . One possible and probably the most common model
M is obtained by assuming that, given the first draw (M1 , Z1 ) from (M, Z),
given Y = 1, the control observations are all independent draws from the
specified conditional distributions. Note that in this latter independence
model we can write M = {P (P ∗ ) : P ∗ ∈ M}, where P (P ∗ ) is the distribution
of O implied by P ∗ . Other possible models might specify or not specify
at all the dependence structure and could, for example, be represented as
{P (P ∗ , η) : P ∗ ∈ M∗ , η}, where the nuisance parameter η in combination
with P ∗ describes the complete joint distribution of (M1 , Z1 ), (M1 , Z2j : j =
1, . . . , J) compatible with its marginal distributions implied by P ∗ .
In this case Z could include general data structures including censoring
or missingness: e..g Z = (W, ∆, ∆A) for a missingness variable ∆ on the
exposure or treatment of interest A. In particular, this general formulation
includes that O∗ = (M, L(0), A(0), . . . , L(K), A(K), Y ) ∼ P0∗ is a longitudinal data structure with a time dependent treatment Ā = (A(0), . . . , A(K)),
and ψ0∗ = β0 is the unknown parameter of a marginal structural model
EP0∗ (Yā | V ) = m(a, V | β0 ), where we have to assume the time-ordering
assumption, the consistency assumption and the sequential randomization
assumption in causal inference (see e.g. van der Laan and Robins (2002) for
an overview of causal inference methods).
Since this generalization is completely straightforward, i.e., just replace
(W, A) by a general Z, for the sake of presentation, we focus here on the
point treatment data structure in which Z = (W, A) is defined as a set of
baseline covariates and a point-treatment/exposure variable of interest.
1.3
Overview of article
In Section 2 we start out with presenting for independent case-control sampling a simple method that deterministically maps the commonly employed
logistic regression fit that ignores the case-control sampling into a valid
model-based fit of the actual conditional probability on being a case, given
the covariates. We extend this methodology to matched case-control designs
as well in which case the initial logistic regression fit needs to be based on
weighted control observations. For both case-control designs this mapping
simply adds an intercept c0 = log q0 /1−q0 to the standard or control-weighted
7
Hosted by The Berkeley Electronic Press
logistic regression fit.
The resulting estimate of the conditional probability of being a case has
now the important property that its standard error is proportional to the
marginal probability of being a case (divided by the square root of the sample
size n) so that the obtained precision is good enough for accurately estimating
marginal causal relative risks or causal odds-ratios, even when the probability
of being a case is extremely rare. We present the formal identifiability results
and corresponding method for both matched and unmatched case-control
studies.
In Section 3 we present our general solution to the estimation problem
for these two types of case control designs I and II, which weights the cases
and controls with q0 and (1 − q0 )/J (q̄0 (M )/J for case control design II),
respectively, and then applies a method developed for prospective sampling
to estimate the parameter of interest (e.g., targeted maximum likelihood estimators or estimating equations for the causal effect or variable importance
parameter ψ0 of interest), as if the data was directly drawn from the population distribution P0∗ of interest. In other words, each estimating function
for ψ0∗ or likelihood for P0∗ in the underlying model M∗ maps into a ”casecontrol”-weighted estimating function or likelihood for the observed data
model M (whatever nuisance parameter specification P (P ∗ , η) it might have
beyond the description of its marginal distributions in terms of P ∗ ).
Beyond the weighting, we point out that one should aim to select the
best among these case-control weighted estimating equations/procedures for
the observed case-control data. We show the important and convenient result that case-control weighting of the efficient procedure for the parameter
of interest (as formalized by the efficient influence curve) in the prospective
sampling model M∗ maps into the efficient procedure for the observed casecontrol data model M. This implies, in particular, that case-control weighting of the locally efficient targeted maximum likelihood estimator developed
for prospective sampling model M∗ results in a locally efficient targeted
maximum likelihood estimation procedure for case-control sampling. In general, the power of our generic method is that one can map the estimation
procedures developed for prospective sampling into highly or fully efficient
estimation procedures for case-control sampling. In particular, our method
is now able to fully exploit software developed for prospective sampling.
To summarize, in Section 3 and Section 4 we establish general properties
of our case-control weighted mapping from estimating functions/influence
curves/gradients for the parameter of interest for model M∗ into estimat8
http://biostats.bepress.com/ucbbiostat/paper234
ing functions/influence curves/gradients for the parameter of interest for
the observed data model M, showing that 1) the case-control weighting
does map each parameter-specific influence curve for the model M∗ into a
parameter-specific influence curve for model M, 2) it maps the efficient influence curve/canonical gradient for model M∗ into the efficient influence
curve/canonical gradient for model M, and 3) that our case-control weighting inherits any robustness of estimating functions/influence curves for model
M∗ .
We suggest that even in cases that q0 (or q0 (1 | M ) for matched case
control designs) is unknown, it is of interest to present these estimators and
inferences for an interval of possible q0 -values, thereby presenting a sensitivity
analysis.
In the subsequent three sections we present various applications of casecontrol weighted targeted maximum likelihood estimators. In Section 5 we
show explicitly (i.e., by example) our general result for case-control design
I, that, if ψ0∗ is a marginal causal effect (on the additive, multiplicative or
odds ratio scale) and M∗ is the nonparametric model for P0∗ , then, for casecontrol design I, the case-control weighted efficient influence curve for the
parameter ψ0∗ and underlying nonparametric model M∗ equals the wished
efficient influence curve of ψ0∗ for the observed data model.
As a consequence of this result, we can show that indeed for case-control
design I the case-control weighted targeted maximum likelihood estimator
is indeed a locally efficient double robust estimator. This implementation
of a targeted maximum likelihood estimators needs to guarantee that the
initial maximum likelihood fit of the logistic regression P0∗ (Y = 1 | A, W ) is
proportional to q0 , which is a requirement for these double robust estimators
to not suffer from a large variance due to the singularity q0 ≈ 0. The latter
is precisely guaranteed by our method presented in Section 2.
In Section 6 we apply our case-control weighted double robust targeted
maximum likelihood and case-control weighted double robust estimating function methodology to estimate causal or variable importance parameters based
on assuming a semi-parametric logistic regression model, thereby avoiding the
need for inverse weighting by a fit of the treatment mechanism g0∗ .
In Section 7 we apply our case-control weighted targeted maximum likelihood estimator to fit a marginal structural model.
These double robust targeted maximum likelihood estimators rely on
knowing the incidence probability q0 and, for case-control design II, q̄0 (M ),
beyond either a correctly specified model for Q∗ (A, W ) = P0∗ (Y = 1 | A, W )
9
Hosted by The Berkeley Electronic Press
or a correctly specified model for g0∗ (a | W ) = P0∗ (A = a | W ).
For case-control design I, In Section 8 we address inverse-probability of
treatment weighed (IPTW) estimation of the causal relative risk and causal
odds ratio based on linear and logistic marginal structural models in the case
that q0 is not known, but the treatment mechanism g0∗ is known, as it would
be in a case-control study which is nested within a randomized trial.
In Section 8, we also show that, if g0∗ is unknown but modelled, and
q0 ≈ 0, one can estimate g0∗ based on the control observations only, which
extends the IPTW estimators to estimators which still do not require knowing
q0 . However, since not knowing q0 and not knowing g0∗ makes these causal
parameters non identifiable, we are concerned with the statistical properties
of these IPTW- estimators and their potential strong sensitivity to model
misspecification for g0∗ so that further (practical) study of this estimator
will be needed. That is, these IPTW-estimators target a nonparametrically
non-identifiable parameter, which suggests strong sensitivity towards model
misspecification for the treatment mechanism. Such sensitivity seems to have
been observed in the simulation studies of Mansson et al. (2007) investigating
various methods based on the propensity score including the IPTW-estimator
for a logistic marginal structural model.
In Section 9, we end this article with a discussion. Various technical
proofs are deferred to the Appendix.
Some relevant literature.
Case-control studies are probably one of the most commonly used designs, if
not the most used design. For example, searching for case-control analysis
on the internet resulted in a list of 56,000 articles. Logistic regression is
the most commonly used model in the literature for case-control studies.
Conditional logistic regression is the prominent method in the literature for
matched case-control studies and the statistical methodology goes back to
the early 80’s. It goes without saying that an overview of the literature in
this area is not possible. However, our proposed general methodology is not
covered by the current literature, as far as we know.
Some of the key papers on logistic regression in standard case-control
studies are Anderson (1972), Prentice and Pyke (1979), Breslow (1996), and
Breslow and Day (1980). Breslow et al. (2000) establish asymptotic efficiency
of the standard maximum likelihood estimator ignoring the case-control sam10
http://biostats.bepress.com/ucbbiostat/paper234
pling. The most frequently cited sources for conditional logistic regression
for matched case-control studies are Breslow and Day (1980), Holford et al.
(1978), and Breslow et al. (1978). Various books considering case-control
studies are Schlesselman (1982), Collett (1991), Jewell (2004) and Hosmer
and Lemeshow (2000), among others.
The method of adding an intercept to a standard logistic regression fit
based on case-control design I, and, in that manner, estimating effects different from the odds-ratio has been presented in the literature (see e.g., Greenland (2004), A.P. Morise (1996)), Wachholder (1996)). The corresponding
method presented in this article for matched case-control studies has not
been addressed in the literature, as far as we know.
Matched case-control studies can be handled with conditional logistic regression models, but these designs and methods also have limitations. Firstly,
it does not allow estimation of the effect of the matching variable on the disease (see, Jewell (2004), Schlesselman (1982)): Any variable used for matching cannot be studied as a risk factor, since cases and controls are constrained
to be equal with respect to the variables that are matched. Secondly, matching can hurt the precision if the matching variable is correlated with the exposure variable, which is often called over-matching. Finally, as we remarked
from the start, these methods are by necessity heavily model based, while the
methods presented here, relying on knowing the case-control weights, allow
double robust locally efficient estimation in semiparametric models, thereby
allowing the use of methods which minimize the reliance of the inference on
unknown model assumptions.
Robins (1999) discusses the approximately correct IPTW-method for estimation of the unknown parameters in a marginal structural logistic regression
model for a direct effect analysis based on standard case-control data under
the assumption that the population proportion of cases, q0 , is small. We
also refer to Newman (2006) for an IPTW-type approach for fitting marginal
structural models based on case-control data. Mansson et al. (2007) investigate a variety of IPTW and propensity score methods in case-control studies
through a simulation study, which includes the IPTW estimator for the logistic marginal structural model.
11
Hosted by The Berkeley Electronic Press
2
Identifiability and estimation of causal effects based on logistic regressions.
We present identifiability results of the conditional probability P0∗ (Y = 1 |
A, W ) based on case-control data and knowing q0 and, for case-control design
II, also knowing q̄0 (M ). These identifiability results are based on first identifying the conditional odds-ratio, and subsequently mapping this into the
wished conditional probability by using q0 . We present these results first for
unmatched case-control designs, and subsequently for matched case-control
designs.
2.1
Case-Control Design I.
A typical approach for case control studies concerns the use of logistic regression models for P (Y = 1 | A, W ):
Q∗0 (A, W ) = P (Y = 1 | A, W ) = Q∗β0 (A, W ) =
1
, (1)
1 + exp(−mβ0 (A, W ))
for some parametrization of mβ (A, W ) such as β > (1, A, W ).
Let
OR(Q∗0 (a, w)) ≡ {Q∗0 (a + 1, w)/(1 − Q∗0 (a + 1, w)}/{Q∗0 (a, w)/(1 − Q∗0 (a, w))}
be the Odds-ratio at (a, w) measuring the effect of an increase in A from a
to a + 1. Due to the identifiability result
P0∗ (A = a + 1 | Y = 1, W = w)P0∗ (A = a | Y = 0, W = w)
P0∗ (A = a + 1 | Y = 0, W = w)P0∗ (A = a | Y = 1, W = w)
P0 (A1 = a + 1 | W1 = w)P0 (A2 = a | W2 = w)
=
P0 (A2 = a + 1 | W2 = w)P0 (A1 = a | W1 = w)
OR(Q∗0 (a, w)) =
it follows that any conditional odds ratio comparing the odds at A = a + 1
with the odds at A = a can be identified from the case-control sampling
distribution.
For case control design I these odds ratios can be estimated with standard
logistic regression ignoring the case control sampling, as is well known. In the
next theorem this identifiability result of the odds ratio is stated in terms of
our statistical formulation. In addition, the next theorem states that adding
an intercept log q0 /(1 − q0 ) to the logistic regression targeted by this method
yields the true logistic regression function P0∗ (Y = 1 | A, W ).
12
http://biostats.bepress.com/ucbbiostat/paper234
Theorem 1 Given arbitrary constants c, dj , j = 1 . . . , J, define
Q̃∗0 ≡ arg max
EP0 c log Q∗ (W1 , A1 ) +
∗
Q
J
1X
dj log(1 − Q∗ (W2j , Aj2 )),
J j=1
where Q∗ ranges over all positive functions of (W, A) mapping into (0, 1).
Let Q∗0 (w, a) = P0∗ (Y = 1 | W = w, A = a). We have
Q∗0 (w, a)
q0 d¯ Q̃∗0 (w, a)
,
(2)
=
1 − Q∗0 (w, a)
1 − q0 c 1 − Q̃∗0 (w, a)
P
where d¯ = 1/J j dj .
If q0 is known, the identifiability relation (2) immediately implies a corresponding identifiability relation for Q∗0 itself:
Q∗0 = Q∗0,q0
≡
c(q0 )Q̃∗0 /(1 − Q̃∗0 )
,
1 + c(q0 )Q̃∗0 /(1 − Q̃∗0 )
where c(q0 ) ≡ q0 /(1−q0 ). Equivalently, if we represent Q̃∗0 = 1/(1+exp(h̃0 )),
Q∗0 = 1/(1+exp(h0 )) for functions h̃0 = log Q̃∗0 /(1− Q̃∗0 ) and h0 = log Q∗0 /(1−
Q∗0 ), respectively, then
h0 = log c0 + h̃0 .
The resulting identifiability result for (e.g.) ψ0 (a) ≡ EYa = EP0∗ Q∗0 (W, a)
is obtained by averaging Q∗0,q0 over the case control weighted distribution of
W , Q∗W = q0 Q1 + (1 − q0 )Q0 , where Q1 is the marginal distribution of W1
and Q0 is the marginal distribution of W2 . Thus,
ψ0 (a) = ψ0,q0 (a)




1 − q0 X ∗
≡ E0 q0 Q∗0,q0 (W1 , a) +
Q0,q0 (W2j , a) ,


J
j
which on its turn maps into an identifiability result for the causal relative
risk,
ψ0,RR ≡
=
ψ0 (1)
ψ0,q0 (1)
=
ψ0 (0)
ψ0,q0 (0)
n
P
n
P
E0 q0 Q∗0,q0 (W1 , 1) + (1 − q0 ) J1
E0 q0 Q∗0,q0 (W1 , 0) + (1 − q0 ) J1
o
j
∗
j Q0,q0 (W2 , 1)
o.
j
∗
j Q0,q0 (W2 , 0)
13
Hosted by The Berkeley Electronic Press
For q0 ≈ 0, the case-control weighted distribution of W is well approximated
by the distribution of the covariate W for controls, so that this causal relative
risk relation is well approximated by
ψ0,RR ≈
E0
P
Q∗0,q0 (W2j , 1)
E0
P
Q∗0,q0 (W2j , 0)
j
j
.
For the validity of case-control weighting , we refer to the general method of
case-control weighting as presented in the next section. Proof of Theorem
1. Consider fluctuations Q̃∗0 ()(Y | A, W ) of Q̃∗0 (Y | A, W ) with parameter and score at = 0 equal to h(Y | A, W ) for arbitrary functions h(Y | A, W )
with mean zero w.r.t. Q̃∗0 (Y | A, W ). Since h has mean zero, it follows that
Q̃∗
h(0 | A, W ) = − 1−Q̃0 ∗ (A, W )h(1 | A, W ). Substitution of these fluctuations
0
into the log likelihood criterion in Q∗ and taking the derivative w.r.t at
= 0 yields the score functions:
Q̃∗0
1X
dj
(W2j , Aj2 )h(1 | W2j , Aj2 ),
0 = E0 ch(1 | W1 , A1 ) −
∗
J j
1 − Q̃0
where h(1 | w, a) is now an arbitrary function. The right hand side can be
worked out as:
∗
0 = EA,W



J
1 − Q∗0 (W, A) 
Q∗ (W, A) 1 X
Q̃∗0
−
(W,
A)h(1
|
W,
A)
ch(1 | W, A) 0
dj
.


q0
J j=1 1 − Q̃∗0
1 − q0
Since this equation needs to hold for all functions h(1 | W, A) it follows that
c
1 − Q∗0 (W, A)
Q∗0 (W, A) ¯ Q̃∗0
(W,
A)
−d
=0
q0
1 − q0
1 − Q̃∗0
or equivalently,
Q∗0
d¯ q0
Q̃∗0
=
.
1 − Q∗0
c 1 − q0 1 − Q̃∗0
2
Estimation of causal effects based on logistic regression.
This teaches us that we can define (setting, c = 1, d¯ = 1 in the theorem)
Q̃∗n ≡ arg max
β
n
X
log Q∗β (W1i , A1i ) +
i=1
J
1X
log(1 − Q∗β (W2ij , Aj2i )),
J j=1
14
http://biostats.bepress.com/ucbbiostat/paper234
which can be computed with standard logistic regression software.
Let h̃n = log Q̃∗n /(1 − Q̃∗n ) be the log-odds of Q̃∗n . In addition, one can
use this standard log likelihood loss function to carry out model selection
based on cross-validation and one can apply data adaptive logistic regression
algorithms.
Clearly, the variance of the resulting estimator of the odds ratio OR(Q∗0 (w, a))
at a particular w, a does not suffer from the singularity q0 ≈ 0. In addition,
if q0 is known, the identifiability relation (2) immediately implies a corresponding estimator of Q∗0 given by
Q∗n,q0 ≡ c(q0 )
Q̃∗n /(1 − Q̃∗n )
,
1 + c(q0 )Q̃∗n /(1 − Q̃∗n )
which is equivalent with adding an intercept log c0 to the log odds fit h̃n =
log Q̃∗n /(1 − Q̃∗n ):
hn ≡ log Q∗n,q0 /(1 − Q∗n,q0 ) = log c0 + h̃n .
Since the standard error of this estimator Q∗n,q0 is proportional to c0 and
thus q0 , this estimator will result in stable estimators of the causal relative
risk or causal odds ratio not suffering from the typical singularity q0 ≈ 0.
The resulting estimator of EYa is obtained by averaging Q∗n,q0 over the
case-control weighted empirical distribution of W , and is thus given by
ψn,q0 (a) =
n
1X ∗
1X
q0 Q∗n,q0 (W1i , a) + (1 − q0 )
Q (W j , a),
n i=1
J j n,q0 2i
which now maps into an estimator of the causal relative risk,
ψn,RR =
=
ψn,q0 (1)
ψn,q0 (0)
1
n
1
n
Pn
∗
i=1 q0 Qn,q0 (W1i , 1)
Pn
∗
i=1 q0 Qn,q0 (W1i , 0)
+ (1 − q0 ) J1
P
+ (1 − q0 ) J1
P
j
Q∗n,q0 (W2ij , 1)
j
Q∗n,q0 (W2ij , 0)
For q0 ≈ 0, the case control weighted empirical distribution of W is well
approximated by the pooled empirical distribution of the controls, so that
this causal relative risk estimator is well approximated by (for variable J one
replaces J by Ji )
Pn 1 P
j
∗
j Qn,q0 (W2i , 1)
i=1 J
ψn,RR ≈ Pn 1 P ∗
j
j Qn,q0 (W2i , 0)
i=1 J
15
Hosted by The Berkeley Electronic Press
It is important to note that without knowing q0 it would not have been
possible to map the standard logistic regression fit Q̃∗n , that ignores the casecontrol sampling and yields a robust (against q0 ≈ 0) and consistent estimator
of the odds ratio of Q∗0 , into an estimator of EYa which has a standard error
proportional to q0 , and thereby in a robust (against q0 ≈ 0) estimator of the
causal relative risk or causal odds ratio.
Notation: We introduce now some useful notation. Given a function
D∗ (O∗ ), we define P0,q0 D∗ = P0 Dq0 , where Dq0 (O) ≡ q0 D∗ (W1 , A1 , 1) +
j
j
1 PJ
∗
∗
j=1 q̄0 (M1 )D (W2 , A2 , 0). Similarly, we define Pn,q0 D = Pn Dq0 , where
J
Pn is the empirical distribution of O1 , . . . , On . We apply this notation to
both case-control designs, where for case-control design I q̄0 (M1 ) reduces to
1 − q0 . We refer to Dq0 as the case-control weighted version of D∗ .
2.2
Matched case-control design II.
Consider a logistic regression model mβ (A, W ) (1) for P0∗ (Y = 1 | A, W ).
For the matched case control design II the odds ratios can be estimated with
standard logistic regression assigning the weights 1 to the cases and weighting
the controls by q̄0 (M1 ), as we show in the next theorem. In the next theorem
we also show how this odds ratio can be mapped into P0∗ (Y = 1 | A, W ) by
adding the intercept log c0 , as in the previous Theorem 1.
Theorem 2 Given arbitrary constants c, dj , j = 1 . . . , J, define
Q̃∗0 ≡ arg max
E0 c log Q∗ (M1 , W1 , A1 )+q̄0 (M1 )
∗
Q
J
1X
dj log(1−Q∗ (M1 , W2j , Aj2 )),
J j=1
where Q∗ varies over positive valued functions mapping into (0, 1).
Let Q∗0 (a, w) = P0∗ (Y = 1 | A = a, W = w). We have
Q∗0 (m, w, a)
d¯ Q̃∗0 (m, w, a)
= q0
,
1 − Q∗0 (m, w, a)
c 1 − Q̃∗0 (m, w, a)
(3)
where d¯ = 1/J j dj .
For c = q0 , this implies Q∗0 = Q̃∗0 .
¯ to the log odds h̃0 = log Q̃∗ /(1−
Equivalently, one adds an intercept log q0 d/c
0
Q̃∗0 ) of Q̃∗0 :
¯ + h̃0 .
h0 ≡ log Q∗0,q0 /(1 − Q∗0,q0 ) = log q0 d/c
P
16
http://biostats.bepress.com/ucbbiostat/paper234
The resulting identifiability result for ψ0 (a) = EYa = EP0∗ Q∗0 (M, W, a)
is obtained by averaging Q∗0,q0 over the case control weighted distribution
of M, W , Q∗M,W = q0 Q1 + q̄0 Q0 , where Q1 is the marginal distribution of
(M1 , W1 ) and Q0 is the marginal distribution of (M1 , W2 ):
ψ0 (a) = ψ0,q0 (a)




1X ∗
= E0 q0 Q∗0,q0 (M1 , W1 , a) + q̄0 (M1 )
Q0,q0 (M1 , W2j , a) . (4)
J j
This implies an identifiability result for the causal relative risk,
n
ψ0,RR =
E0 q0 Q∗0,q0 (M1 , W1 , 1) + q̄0 (M1 ) J1
o
j
∗
j Q0,q0 (M1 , W2 , 1)
P
ψ0,q0 (1)
n
o.
=
P
ψ0,q0 (0)
E0 q0 Q∗0,q0 (M1 , W1 , 0) + q̄0 (M1 ) J1 j Q∗0,q0 (M1 , W2j , 0)
For q0 ≈ 0, the case control weighted distribution of M, W is well approximated by Q0 , i.e. the distribution of the covariate W for controls, so that
this causal relative risk relation is well approximated by
ψ0,RR ≈
E0 J1
P
Q∗0,q0 (M1 , W2j , 1)
E0 J1
P
Q∗0,q0 (M1 , W2j , 0)
j
j
Proof of Theorem 2. Consider fluctuations Q̃∗0 ()(Y | A, M, W ) of Q̃∗0 (Y |
A, M, W ) with parameter with score at = 0 equal to h(Y | A, M, W ) for
arbitrary functions h(Y | A, M, W ) with mean zero w.r.t. Q̃∗0 (Y | M, A, W ).
Q̃∗
Note that h(0 | A, M, W ) = − 1−Q̃0 ∗ (A, M, W )h(1 | A, W ). Substitution
0
of these fluctuations into the log likelihood criterion in Q∗ and taking the
derivative w.r.t at = 0 yields the score functions:
0 = E0 ch(1 | M1 , W1 , A1 ) −
1X
Q̃∗0
dj q̄0 (M1 )
(M1 , W2j , Aj2 )h(1 | W2j , Aj2 ),
∗
J j
1 − Q̃0
where h(1 | w, a) is now an arbitrary function. The right hand side can be
worked out as:
n
∗
0 = EA,M,W
ch(1 | M, W, A)
1
a,m,w J
∗
= EA,M,W
R
− a,m,w J1
−
=
R
∗
EA,M,W
PJ
j=1
Q̃∗
0
ch(1 | M, W, A)
j=1
o
dj 1−Q̃0 ∗ (m, w, a)h(1 | m, w, a)q̄0 (m)PM1 (m)P0∗ (w, a | Y = 0, m)
n
PJ
Q∗0 (M,W,A)
q0
Q̃∗
Q∗0 (M,W,A)
q0
o
dj 1−Q̃0 ∗ (m, w, a)h(1 | m, w, a)P (M = m, W = w, A = a, Y = 0)
0
h(1 | M, W, A)
Q∗ (M,W,A)
c 0 q0
Q̃∗
− d¯1−Q̃0 ∗ (M, W, A)(1 − Q∗0 (M, W, A))
.
0
17
Hosted by The Berkeley Electronic Press
Since this equation needs to hold for all functions h(1 | W, A) it follows
that
Q∗0 (M, W, A) ¯ Q̃∗0
−d
c
(M, W, A)(1 − Q∗0 (M, W, A)) = 0,
∗
q0
1 − Q̃0
or equivalently,
d¯
Q∗0
Q̃∗0
=
q
.
0
1 − Q∗0
c 1 − Q̃∗0
2
Estimation of marginal causal effects based on a logistic
regression fit.
This teaches us that we can define (e.g., c = 1, d¯ = 1)
Q̃∗n ≡ arg max
β
n
X
log Q∗β (M1i , W1i , A1i )+q̄0 (M1i )
i=1
J
1X
log(1−Q∗β (M1i , W2ij , Aj2i )),
J j=1
which can be computed with standard logistic regression software using
weights for the control observations. In addition, one can use this log likelihood loss function to carry out model selection based on cross-validation
and one can apply data adaptive logistic regression algorithms. Clearly,
the variance of the resulting estimator of the odds ratio OR(Q∗0 (m, w, a)) =
ODDS(Q∗0 (m, w, a + 1))/ODDS(Q∗0 (m, w, a)) at a particular m, w, a does
not suffer from the singularity q0 ≈ 0.
In addition, the identifiability relation (3) immediately implies a corresponding estimator of Q∗0 itself given by
Q∗n,q0 ≡ c(q0 )
Q̃∗n /(1 − Q̃∗n )
,
1 + c(q0 )Q̃∗n /(1 − Q̃∗n )
or, equivalently, one adds an intercept log c0 to the log odds fit h̃n = log Q̃∗n /(1−
Q̃∗n ):
hn ≡ log Q∗n,q0 /(1 − Q∗n,q0 ) = log c0 + h̃n .
Since the standard error of this estimator Q∗n,q0 is proportional to q0 divided by the square root of the sample size, this estimator will result in stable
18
http://biostats.bepress.com/ucbbiostat/paper234
estimators of the causal relative risk or causal odds ratio not suffering from
the singularity q0 ≈ 0.
The resulting estimator of EYa is obtained by averaging Q∗n,q0 over the
case-control weighted empirical distribution of W , and is thus given by
ψn,q0 (a) =
n
1X
1X ∗
q0 Q∗n,q0 (M1i , W1i , a) + q̄0 (M1i )
Qn,q0 (M1i , W2ij , a),
n i=1
J j
which now maps into an estimator of the causal relative risk,
ψn,RR =
=
ψn,q0 (1)
ψn,q0 (0)
1
n
1
n
Pn
∗
i=1 q0 Qn,q0 (M1i , W1i , 1)
Pn
∗
i=1 q0 Qn,q0 (M1i , W1i , 0)
+ (1 − q0 ) J1
P
+ (1 − q0 ) J1
P
j
Q∗n,q0 (M1i , W2ij , 1)
j
Q∗n,q0 (M1i , W2ij , 0)
For q0 ≈ 0, the case-control weighted empirical distribution of W is well
approximated by the empirical distribution Q0n of the controls, so that this
causal relative risk estimator is well approximated by
ψn,RR ≈
3
1
n
j
1 P
∗
j Qn,q0 (M1i , W2i , 1)
i=1 J
j
1 ∗
1 Pn
i=1 J Qn,q0 (M1i , W2i , 0)
n
Pn
Case-Control weighting of estimation procedures developed for prospective sampling.
Throughout this section, we will make the convention that q̄0 (M ) reduces to
1 − q0 in the case control design I, so that we can state our results for both
the regular case-control design I and the matched case-control design II in
one formula.
We start out with stating the theorem which proves that the case-control
weighting maps a function of O∗ into a function of the case-control data
structure O, while preserving the expectation of the function.
Definition 1 (Case-control weighted function) Given a D∗ (O∗ ) = D∗ (W, A, Y )
we define the case-control weighted version of D∗ as
J
1X
q̄0 (M1 )D∗ (M1 , W2j , Aj2 , 0),
Dq0 (O) ≡ q0 D (M1 , W1 , A1 , 1) +
J j=1
∗
where in the special case of Case Control Design I, we have q̄0 (M ) = 1 − q0 .
19
Hosted by The Berkeley Electronic Press
Theorem 3 (Unbiased estimating function mapping) Let D∗ (O∗ ) = D∗ (W, A, Y )
be a function so that P0∗ D∗ ≡ EP0∗ D∗ (O∗ ) = 0. Then P0 Dq0 = 0. In particular, in Case Control Design I,
Dq0 (0) ≡ q0 D∗ (W1 , A1 , 1) + (1 − q0 )
J
1X
D∗ (W2j , Aj2 , 0)
J j=1
satisfies P0 Dq0 = 0.
In more generality, for any function D∗ and corresponding case control
weighted function Dq0 , we have
P0 Dq0 = P0∗ D∗ .
Proof: We provide the proof for case-control design II and we suppress the
index q0 in Dq0 . The same proofR applies to case-control design I. First, we note
that P0 q0 D(M1 , W1 , A1 , 1) = M1 ,W1 ,A1 D(M1 , W1 , A1 , 1)P0∗ (M1 , W1 , A1 , Y =
1). Secondly, we note that
P
q̄ (M1 )D(M1 , W2j , Aj2 , 0) =
R0 0
∗
m,w,a D(m, w, a, 0)q̄0 (m)P0 (M1 = m)P0 (W = w, A = a | M = m, Y = 0),
where we also need to note that P0 (M1 = m) = P0∗ (M = m | Y = 1). We
have
q̄0 (m)P0 (M1 = m)P0∗ (W = w, A = a | M = m, Y = 0)
= q̄0 (m)P0∗ (M = m | Y = 1)P0∗ (W = w, A = a, M = m, Y = 0)/P0∗ (Y = 0, M = m)
= P0∗ (M = m, W = w, A = a, Y = 0).
This proves that
P0 D = M1 ,W1 ,A1 D(M1 , W1 , A1 , 1)P0∗ (M1 , W1 , A1 , Y = 1)
R
P
+ J1 Jj=1 M1 ,W2 ,A2 D(M1 , W2 , A2 , 0)P0∗ (M1 , W2 , A2 , Y = 0)
= P0∗ D = 0.
R
This completes the proof. 2
Before we proceed with presenting the statistical implications of this mapping for the analysis of case-control data, we first establish some general
properties of this mapping which help us to understand the generality and
optimality of the statistical approach for dealing with case-control sampling
implied by this mapping.
20
http://biostats.bepress.com/ucbbiostat/paper234
3.1
Case-control weighted mapping maps gradients into
gradients.
Consider a target parameter Ψ∗ : M∗ → IRd at P ∗ in model M∗ . The class
of all regular asymptotically linear estimators of Ψ∗ (P ∗ ) at P ∗ can be characterized by their influence curves, and their influence curves constitute the
set of gradients of the pathwise derivative of Ψ∗ at P ∗ given a rich class of
parametric fluctuations through P ∗ . In particular, an estimator is asymptotically efficient at P ∗ if and only if its influence curve equals the canonical
gradient, that is, the unique gradient which is also an element of the tangent
space generated by the scores of the class of parametric fluctuations. As a
consequence of these general and powerful results an estimation problem is
essentially characterized by the class of gradients and the canonical gradient. In particular, the class of gradients yields the class of wished estimating
functions to construct double robust locally efficient estimators (van der Laan
and Robins (2002)) and the canonical gradient provides the fundamental ingredient of the double robust locally efficient targeted maximum likelihood
estimator.
This motivates us to identify the class of gradients, and, in particular, the
canonical gradient, of the parameter Ψ∗ in the case-control sampling model
M = {P (P ∗ , η) : P ∗ ∈ M∗ , η} implied by the model M∗ for the probability
distribution P ∗ of interest and possible specification of dependence as identified by the η parameter, assuming that this parameter Ψ∗ can be identified
from case-control sampling.
The following theorem establishes that the case-control weighting does
provide a mapping from the set of all gradients of the parameter Ψ∗ : M∗ →
IRd at P ∗ in model M∗ into a set of gradients of Ψ : M → IRd defined as
Ψ(P (P ∗ , η)) = Ψ∗ (P ∗ ) at P (P ∗ , η) in model M = {P (P ∗ , η) : P ∗ ∈ M∗ , η}
for parameters Ψ∗ which are identifiable from P (P ∗ , η) (e.g. by being a
function of q0 or q̄0 (M )). Since the class of all gradients of a parameter
defined on a model represents the class of all possible influence curves of
regular asymptotically linear estimators (see e.g, Bickel et al. (1993)), this
result teaches us that the case-control weighting does map any estimation
procedure developed for ψ0∗ based on prospective data into a corresponding
estimation procedure based on case-control data, at least, from an asymptotic
point of view.
In addition, since the case-control weighted mapping is 1-1, it also teaches
us that it maps into a very rich set of estimation procedures for case-control
21
Hosted by The Berkeley Electronic Press
data, if not all estimation procedures of interest: Indeed, we will show in
the next section that the case-control weighted gradient mapping maps, in
particular, into the optimal canonical gradient/efficient influence curve.
If the parameter of interest Ψ∗ (P ∗ ) is only identified from P = P (P ∗ , η)
if q0 and (for matched case-control designs) q̄0 is known, then one needs to
define the parameter as a parameter indexed by the known q0 and q̄0 (M ):
Ψ∗ = Ψ∗q0 .
We start with providing a useful definition of a gradient of a pathwise
derivative.
Definition 2 We define a gradient of pathwise derivative of the parameter
Ψ∗ : M∗ → IRd at P ∗ in model M∗ as a function D∗ (P ∗ ) satisfying for each
of the submodels {PS∗∗ () : } ⊂ M∗ through P ∗ at = 0 with score S ∗ at
= 0 (within the class of submodels through P ∗ specified)
d
d ∗ ∗
= − P ∗ D(PS∗∗ ()) .
Ψ (PS ∗ ())
d
d
=0
=0
Consider a parameter Ψ∗ : M∗ → IRd which is identified in model M =
{P = P (P ∗ , η) : P ∗ ∈ M∗ , η}, and corresponding parameter Ψ : M → IRd
defined as Ψ(P (P ∗ , η)) = Ψ∗ (P ∗ ).
By the same definition of a gradient above, a gradient of the pathwise
derivative of the parameter Ψ : M → IRd at P = P (P ∗ , η) in model M is defined as a function D(P ∗ , η) of O satisfying for each sub-model {P (PS∗∗ (), ηS1 ()) :
} ⊂ M implied by a submodel {PS∗∗ () : } through P ∗ and a nuisance submodel {ηS1 () : } through η indexed by S1 ,
Ψ∗ (PS∗∗ ())|=0
d
= − P D(PS∗∗ (), ηS1 ()) .
d
=0
Given this definition of a gradient we obtain the following theorem.
Theorem 4 Given a P ∗ ∈ M∗ , a class of sub-models {PS∗∗ () : } ⊂ M∗
through P ∗ at = 0 indexed by S ∗ , with score S ∗ , we have for each of these
submodels
d
d ∗ ∗ ∗
∗
P Dq0 (PS ∗ ())
= P D (PS ∗ ()) ,
(5)
d
d
=0
=0
where it is assumed that the left and right derivative exist.
By (5) it follows that any gradient D∗ (P ∗ ) of Ψ∗ : M∗ → IRd at P ∗ ∈ M∗
is mapped into a gradient Dq0 (P ∗ ) of Ψ : M → IRd at P = P (P ∗ , η) (for
each η) in the model M.
22
http://biostats.bepress.com/ucbbiostat/paper234
This last statement is an immediate consequence of (5) and the fact that
Dq0 (P ∗ ) does only depend on P = P (P ∗ , η) through P ∗ (and thus not through
η), so that the derivatives along nuisance models {η() : } are zero, as
required.
We now note that under extremely weak regularity conditions, the above
definition of a gradient D∗ (P ∗ ) of the pathwise derivative exactly agrees with
the definition of a gradient of the pathwise derivative of Ψ∗ : M∗ → IRd in
efficiency theory (e.g., Bickel et al. (1993)), and similarly for Ψ. Namely,
the equivalence follows if the second equality below holds (the first follows
since D∗ (P ∗ ) ∈ L20 (P ∗ )): for the function P ∗ → D∗ (P ∗ ) ∈ L20 (P ∗ ) and each
submodel {P ∗ () : } (for each P ∗ ∈ M∗ ) we have
1Z ∗ ∗
dP ∗ () − dP ∗ ∗
1 ∗ ∗ ∗
P D (P ()) = −
D (P ())
dP ()
dP ∗ ()
= −P ∗ D∗ (P ∗ )S(P ∗ ) + o(1),
d
where S(P ∗ ) is the score d
log dP ∗ ()/dP ∗ of the submodel {P ∗ () : }.
=0
For the interested reader, the following analogue theorem states the result
in terms of the gradient of the pathwise derivative as in efficiency theory.
That is, it provides the regularity condition under which we have that if
D∗ (P ∗ ) is a gradient of Ψ∗ at P ∗ , then Dq0 (P ∗ ) is a gradient of the path-wise
derivative of Ψ at P (P ∗ , η).
Theorem 5 Assume Ψ : M → IRd satisfies Ψ(P (P ∗ , η)) = Ψ∗ (P ∗ ) for all
P ∗ ∈ M∗ and η.
Assume P ∗ → D∗ (P ∗ ) is a gradient of the pathwise derivative of Ψ∗ :
∗
M → IRd in the sense that it satisfies for each member of a class of submodels {PS∗∗ () : } through P ∗ ∈ M∗ at = 0 with score S ∗
d
d ∗ ∗
Ψ (PS ∗ ())
= − P ∗ D∗ (PS∗∗ ()) ,
d
d
=0
=0
and the right-hand side equals P ∗ D∗ (P ∗ )S ∗ , where it is assumed the derivative on the left and right-hand side exist.
Assume P ∗ → Dq0 (P ∗ ) satisfies for each submodel {P () = P (P ∗ (), η()) :
} ⊂ M through P (P ∗ , η) at = 0 (implied by the class of submodels {PS∗∗ ()}
and {ηS1 ()}) with score S(P ) that
d
− P Dq0 (P ∗ ())
= P Dq0 (P ∗ )S(P ).
d
=0
23
Hosted by The Berkeley Electronic Press
The latter is a regularity condition since
1Z
dP () − dP
1
∗
P Dq0 (P ()) = −
Dq0 (P ∗ ())
dP ()
dP ()
= −P Dq0 (P ∗ )S(P ) + o(1),
where S(P ) is the score
d
d
log dP ()/dP =0
of the submodel {P () : }.
Then, Ψ : M → IRd is pathwise differentiable in the sense that for each
of the submodels {P () = P (P ∗ (), η()) : } ⊂ M through P (P ∗ , η) at = 0
with score S(P ) we have
d
Ψ(P ())
= P Dq0 (P )S(P ),
d
=0
and Dq0 (P ) is a gradient of the pathwise derivative.
Thus, for each gradient D∗ (P ∗ ) of the pathwise derivative of Ψ∗ : M∗ →
IRd satisfying the above mentioned regularity conditions, the corresponding
Dq0 (P ∗ ) is a gradient of the pathwise derivative of Ψ : M → IRd .
Proof. We have
Ψ∗ (P ∗ ()) − Ψ∗ (P ∗ )
Ψ(P ()) − Ψ(P )
=
d ∗ ∗ ∗
= − P D (P ()) + o(1)
d
=0
d
= − P Dq0 (P ∗ ()) + o(1)
d
=0
= P Dq0 (P ∗ )S(P ) + o(1).
This proves that Ψ : M → IRd defined as Ψ(P (P ∗ , η)) = Ψ∗ (P ∗ ) is pathwise
differentiable at P = P (P ∗ , η) ∈ M and that Dq0 (P ∗ ) is a gradient of this
pathwise derivative. 2
Thus, the above result shows that each gradient D∗ (P ∗ ) for Ψ∗ : M∗ →
d
IR is mapped into a gradient Dq0 (P ∗ ) for Ψ : M = {P (P ∗ , η) : P ∗ ∈
M∗ , η} → IRd defined as Ψ(P (P ∗ , η)) = Ψ∗ (P ∗ ). We note that this gradient
mapping is not affected by the particular choice (i.e., model of dependence
structure of case and control observations) of model M = {P (P ∗ , η) : P ∗ ∈
M∗ , η} compatible with M∗ . Thus, for example, for case-control design I,
our mapping from gradients into gradients for model M is the same for the
24
http://biostats.bepress.com/ucbbiostat/paper234
independence model assuming the case and controls are all independent as
it is for a particular dependence model.
A particular case is that Ψ∗ : M∗ → IRd is defined on a nonparametric
model M∗ . In this case, there exists only one gradient for model M∗ so that
one just needs to determine the canonical gradient D∗ (P ∗ ) of Ψ∗ at P ∗ and
map it into its case-control weighted version Dq0 (P ∗ ), which, by our results
in the next section, equals the canonical gradient of Ψ at P (P ∗ , η).
Remark. Since q0 is a non-identifiable parameter for both case-control designs (so that knowledge of q0 does not restrict the distribution of the data
structure O), this implies that 1) for each gradient D∗ (P ∗ ) for model M∗ ,
the corresponding Dq0 (P ∗ ) is a gradient in the model M also including the
knowledge that q0 is known (even if that knowledge was not included in M∗ ),
or, equivalently, the class of all gradients {Dh∗ (P ∗ ) : h} at P ∗ for model M∗
is mapped into a class {Dh,q0 : h} of gradients at P = P (P ∗ ) for model M
also including q0 is known.
For matched case-control design II, if we define our parameter as Ψ∗q0 ,
indexed by q0 and q̄0 (M ) (treating them as known and fixed), then the casecontrol weighting maps the class of all gradients of this parameter for model
M∗ into the class of gradients of this parameter for model M = {P (P ∗ , η) :
P ∗ ∈ M∗ , η}. If the observed data model is the same with and without the
restriction that (q0 , q̄0 (M )) is known in the model M∗ , then the canonical
gradient in the model M will be the same as the canonical gradient of the
model also including the knowledge of (q0 , q̄0 (M )).
3.2
Preservation of robustness of the case-control weighted
functions.
If a function D∗ satisfying P0∗ D(P0∗ ) = 0 also satisfies the robustness property
P0∗ (D(P ∗ )) = 0 for any P ∗ ∈ M∗1 ⊂ M∗ for a submodel M∗1 , then the same
robustness w.r.t. to misspecification of P0∗ applies to Dq0 since, for P ∗ ∈ M∗1 ,
P0 Dq0 (P ∗ ) = P0∗ D(P ∗ ) = 0 .
In particular, double robust estimating functions for censored and causal
inference data structures and models M∗ , as presented in general in van der
Laan and Robins (2002), are mapped into double robust case-control weighted
estimating functions.
25
Hosted by The Berkeley Electronic Press
In the remainder of this section we outline the general statistical methods
implied by the case-control weighted mapping.
3.3
Case-control weighted estimating functions and locally efficient estimation.
Estimating function methodology developed for prospective sampling immediately implies now, through the case-control weighted mapping, estimating
function methodology for case-control sampling.
For sake of presentation, let’s start with estimating functions without
nuisance parameters. That is, let {Dh∗ (ψ) : h} denote a class of estimating
functions of parameter Ψ∗ : M∗ → IRd in model M∗ indexed by functions
h ranging over a particular index set. This class maps into a class of casecontrol weighted estimating functions



J

1X
j
∗
∗
D
,
0)
:
h
.
(ψ)(M
,
Z
D
1
h,q0 (ψ)(O) = q0 Dh (ψ)(M1 , Z1 , 1) + q̄0 (M1 )
2


J j=1 h
Let {ch Dh,q0 (ψ) ) : h} be the corresponding set of gradients/influence
curves (i.e., the influence curve of the estimator defined as solution of estimating equation implied by Dh,q0 ), where ch = − dψd 0 E0 Dh,q0 (ψ0 )−1 . We can now
apply (for example) Theorem 2.9 in van der Laan and Robins (2002) to determine the optimal choice hopt of estimating function minimizing a> Σ0 (h)a for
all a, where Σ0 (h) denotes the covariance of the influence curve ch Dh.q0 (ψ0 ).
This optimal estimating function can now be used to construct locally
P
efficient estimators by defining ψn as a solution of 0 = i Dhn ,q0 (ψ)(Oi )
for an estimator hn of hopt (or by using corresponding one-step NewtonRaphson estimators), or by constructing a targeted maximum likelihood type
estimator Pn∗ (van der Laan and Rubin (2006), and see later subsection), and
corresponding targeted maximum likelihood estimator Ψ∗ (Pn∗ ) of ψ0∗ solving
P
this equation 0 = i Dh(P ∗ ),q0 (P ∗ )(Oi ) viewed as a function in P ∗ ∈ M∗ ,
where h(P ∗ ) is a representation of hopt as a function of P ∗ .
By our Theorems 7 and 8 in the next section it follows that selecting hopt
so that Dh∗opt (P ∗ ) is the efficient influence curve in model M∗ actually results
in the wished optimal choice, and corresponding locally efficient estimating
function procedure.
This template for construction of locally efficient estimators can be generalized to estimating functions which also depend on nuisance parameters,
26
http://biostats.bepress.com/ucbbiostat/paper234
as follows. Let {Dh∗ (P ∗ ) : h} denote a class of estimating functions of parameter Ψ∗ : M∗ → IRd in model M∗ , and let Dh∗ (P ∗ ) = Dh∗ (ψ ∗ , η ∗ ) for variation
independent parameters ψ ∗ and η ∗ of P ∗ . Suppose that these estimating
functions are orthogonalized
to nuisance parameters in the sense that they
d
= 0 for nuisance fluctuations P0∗ () through P0∗ at
satisfy d
E0∗ Dh∗ (P0∗ ())
=0
= 0 (i.e. Ψ∗ (P0∗ ()) has derivative zero w.r.t. at = 0). In addition, assume that the estimating functions are standardized to have derivative w.r.t.
ψ minus the identify matrix:
d ∗ ∗ ∗
d
= − Ψ∗ (P0∗ ())
E0 Dh (P0 ())
d
d
=0
=0
for fluctuations P0∗ () changing ψ0∗ .
This defines Dh∗ (P ∗ ) as a gradient/influence curve of Ψ∗ in model M∗
at P ∗ : see Chapter 1 van der Laan and Robins (2002). This class of gradients/influence curves for model M∗ now maps into the class of case-control
weighted functions


J
1X

J
Dh,q0 (P ∗ )(O) = q0 Dh∗ (P ∗ )(M1 , Z1 , 1) + q̄0 (M1 )


Dh∗ (P ∗ )(M1 , Z2j , 0) : h .
j=1

Application of Theorem 3 yields that
P0 Dh,q0 (P0∗ ()) = P0∗ Dh∗ (P ∗ ()),
so that it follows that also the case-control weighted Dh,q0 is an influence
curve. Thus this shows that our mapping indeed maps the gradients for
parameter Ψ∗ for model M∗ into gradients of Ψ∗ for model M.
We can now apply Theorem 2.9 in van der Laan and Robins (2002) to
determine the optimal choice hopt = h(P0∗ ) minimizing a> Σ0 (h)a for all vectors a, where Σ0 (h) denotes the covariance of the influence curve Dh (P0∗ ). By
our Theorems 7 and 8 it follows that selecting hopt so that Dh∗opt (P ∗ ) is the
efficient influence curve in model M∗ results in the wished optimal choice.
This optimal estimating function Dh(P0 ) (ψ0 , η0∗ ) can now be used to construct
locally efficient estimators by, given an estimator ηn∗ , defining ψn as a solution
P
of 0 = i Dhn ,q0 (ψn , ηn )(Oi ) or by constructing a targeted maximum likelihood type estimator Pn∗ and corresponding substitution estimator Ψ∗ (Pn∗ ) of
P
ψ0∗ solving this equation 0 = i Dh(P ∗ ),q0 (P ∗ )(Oi ) viewed as a function in
P ∗ ∈ M∗ .
We note that this approach can be further generalized to estimating functions Dh∗ with non-variation independent nuisance parameters.
27
Hosted by The Berkeley Electronic Press
3.4
Example: Case-control weighted double robust estimating function.
Let’s illustrate this estimating function method by constructing a double
robust estimator of the additive causal effect ψ0∗ = E(Y1 − Y0 ) for a nonparametric model M∗ for the distribution P0∗ of (W, A, Y ).
The double robust efficient estimating function for sampling from P0∗ is
given by
(
)
I(A = 1)
I(A = 0)
D (ψ , g , Q )(O ) =
− ∗
(Y − Q∗ (M, W, A))
∗
g (1 | M, W ) g (0 | M, W )
+Q∗ (M, W, 1) − Q∗ (M, W, 0) − ψ ∗ .
(6)
∗
∗
∗
∗
∗
It is double robust in the sense that
E0∗ D∗ (ψ0∗ , g ∗ , Q∗ )(O∗ ) = 0 if either g ∗ = g0∗ or Q∗ = Q∗0 ,
and in both cases one needs that g ∗ (1 | W )g ∗ (0 | W ) > 0 a.e. Let D∗ (g ∗ , Q∗ )
be defined so that D∗ (ψ ∗ , g ∗ , Q∗ ) = D∗ (g ∗ , Q∗ ) − ψ ∗ .
The weighted double robust estimating function for case-control data is
thus given by:
Dq0 (ψ ∗ , g ∗ , Q∗ )(O) = q0 D∗ (ψ ∗ , g ∗ , Q∗ )(M1 , W1 , A1 , 1)
+
J
q̄0 (M1 ) X
D∗ (ψ ∗ , g ∗ , Q∗ )(M1 , W2j , Aj2 , 0),
J
j=1
or we can define it as
Dq0 (ψ ∗ , g ∗ , Q∗ )(O) = q0 D∗ (g ∗ , Q∗ )(M1 , W1 , A1 , 1)
+
J
q̄0 (M1 ) X
D∗ (g ∗ , Q∗ )(M1 , W2j , Aj2 , 0) − ψ ∗ .
J
j=1
This estimating function is now also double robust for case control data:
E0 Dq0 (ψ0∗ , g ∗ , Q∗ ) = 0 if either g ∗ = g0∗ or Q∗ = Q∗0 ,
and in both cases one needs that g ∗ (1 | W )g ∗ (0 | W ) > 0 a.e.
28
http://biostats.bepress.com/ucbbiostat/paper234
The solution ψn of the case-control weighted estimating equation Pn Dq0 (gn∗ , Q∗n )−
ψ = 0 exists in closed form and is given by:
∗
ψn
n
1X
=
q0 D∗ (gn∗ , Q∗n )(M1i , W1i , A1i , 1)
n i=1
+
J
q̄0 (M1i ) X
D∗ (gn∗ , Q∗n )(M1i , W2ij , Aj2i , 0).
J
j=1
This estimator is now consistent if either gn∗ consistently estimates g0∗ or Q∗n
consistently estimates Q∗0 , which explains why it is called double robust.
Under some extra appropriate regularity conditions, this estimator is also
asymptotically linear and thereby has a normal limit distribution (see van der
Laan and Robins (2002) for general ”central limit” theorems for solutions
of estimating equations). In particular, if gn∗ consistently estimates g0∗ and
Q∗n consistently estimates Q∗0 , then, under appropriate regularity conditions,
ψn is asymptotically linear with influence curve Dq0 (g0∗ , Q∗0 , ψ0 ) and is thus
asymptotically efficient.
Statistical behavior of double robust estimator when cases are rare.
Inspection of this influence curve Dq0 sheds some light on the statistical
behavior of this double robust estimator for the important case that q0 ≈ 0
is very small. In particular, we are interested in how well one can estimate the
relative effect ψ0 /q0 , since ψ0 is itself very small. It follows that, in general,
the influence curve of ψn /q0 as an estimator of ψ0 /q0 will blow up for small
values q0 , except if it guaranteed that Q∗n = q0 Q#
n for some bounded estimator
#
Qn . Therefore, in our proposed targeted maximum likelihood or double
robust estimator we propose such estimators based on logistic regression fits
as presented in Section 2.
3.5
Case-control weighted loss functions.
Our case-control weighting can also be used to map loss functions for the
underlying model M∗ into loss functions for the observed data model M.
In particular, we can construct a case-control weighted log likelihood loss
function.
Theorem 6 (Case Control Weighted Log-Likelihood Loss function)
29
Hosted by The Berkeley Electronic Press
Define the following case-control weighted log-likelihood loss function for
the density p∗0 of O∗ under sampling of O ∼ P0 :
L(p∗ , O) = q0 log p∗ (M1 , Z1 , 1) + q̄0 (M1 )
J
1X
log p∗ (M1 , Z2j , 0).
J j=1
In particular, in Case Control Design I, we have
J
1X
L(p , O) = q0 log p (M1 , Z1 , 1) + (1 − q0 )
log p∗ (M1 , Z2j , 0).
J j=1
∗
∗
We have
p∗0 = arg max
E0 L(p∗ , O),
∗
p
where the argmax is taken over all densities p∗ . That is, the density maximizing the expectation of the loss function L(p∗ , O) is unique and given by
the density p∗0 of O∗ .
The proof of this theorem is similar to the proof of Theorem 3 and is
therefore omitted.
3.6
Case-control weighted maximum likelihood estimation.
Given a specified model M∗ for p∗0 , we can estimate P0∗ with the case-control
weighted maximum likelihood estimator:
p∗n = arg max
∗
∗
n
X
p ∈M
L(Oi , p∗ ).
i=1
The implementation of this weighted maximum likelihood estimator simply
involves assigning weights q0 to the cases, assigning weights q̄0 (M1i )/J to the
corresponding J controls, and then implementing the maximum likelihood
estimator for prospective sampling (i.e. treating the sample of cases and
controls as an i.i.d sample of P0∗ ), thus ignoring the case control sampling.
For example, let’s consider the point treatment data structure O∗ =
(M, W, A, Y ). Consider a nonparametric model for the marginal distribution of W , Q∗W , a model {gη∗ : η} for g0∗ (A | M, W ), and a model {Q∗θ : θ} for
the conditional distribution P0∗ (Y = 1 | M, W, A) = Q∗0 (M, W, A).
30
http://biostats.bepress.com/ucbbiostat/paper234
The case-control weighted maximum likelihood estimator of the marginal
distribution of W is now the weighted empirical distribution of the pooled
sample (W1i , (W2ij : j = 1, . . . , J)). Similarly, the case-control weighted maximum likelihood estimator of g0∗ (A | W ) is given by
ηn = arg max
η
n
X
i=1
q0 log gη∗ (A1i
J
q̄0 (M1i ) X
| M1i , W1i ) +
log gη∗ (Aj2i | M1i , W2ij ),
J
j=1
and the case-control weighted maximum likelihood estimator of Q∗0 (M, W, A)
is given by
θn = arg max
θ
n
X
q0 log Q(M1i , W1i , A1i )+
i=1
J
q̄0 (M1i ) X
log(1−Q(M1i , W2ij , Aj2i )).
J
j=1
Indeed, it follows that each of these case-control weighted maximum likelihood estimators can be implemented by assigning the two weights q0 and
q̄0 (M1 ) to the cases and controls, respectively, and apply the standard maximum likelihood estimator of the density p∗0 under prospective sampling.
Given the weighted maximum likelihood estimators Q∗1n and Q∗n , described above, the corresponding substitution estimator of EYa = EQ∗1 Q∗ (W, a)
is given by
n
J
X
1
q̄0 (M1i ) X
q0 Q∗n (M1i , W1i , a)+
Q∗n (M1i , W2ij , a).
{q
+
q̄
(M
))}
J
0
1i
i=1 0
i=1
j=1
ψn (a) = Pn
In particular, these estimators of EY0 and EY1 now map into an estimator
ψn (1)/ψn (0) of the relative risk EY1 /EY0 .
3.7
Case-control weighted targeted maximum likelihood estimation.
Targeted maximum likelihood estimation is a general methodology introduced in van der Laan and Rubin (2006) and illustrated with a variety of
examples. The case-control weighting allows us now to provide a case-control
weighted targeted maximum likelihood estimation methodology targeting the
parameter of interest.
Specifically, let D∗ (P0∗ ) be the efficient influence curve of the parameter
Ψ∗ : M∗ → IRd . Consider an initial estimator Pn∗0 of P0∗ based on O1 , . . . , On
31
Hosted by The Berkeley Electronic Press
such as a case-control weighted maximum likelihood estimator according to
a working model within M∗ . Let {Pn∗ () : } be a submodel of M∗ with
parameter satisfying that the linear span of its score at = 0 includes
D∗ (Pn∗0 ). Let 1n be the case-control weighted maximum likelihood estimator
of :
1n = arg max Pn,q0 log p∗0
n ().
This yields an update Pn∗1 = Pn∗0 (1n ) of the initial estimator Pn∗0 . We iterate
this updating process till step k at which kn ≈ 0 and we denote the final
update with Pn∗ . By the score condition, this final estimator solves the casecontrol weighted efficient influence curve:
0 = Pn,q0 D∗ (Pn∗ ) = Pn Dq0 (Pn∗ )
up till numerical precision (see van der Laan and Rubin (2006)). We refer
to ψn = Ψ∗ (Pn∗ ) as the case-control weighted targeted maximum likelihood
estimator of ψ0 .
One particular approach for establishing the asymptotics of this estimator is obtained under the assumption that D∗ (P ∗ ) = D∗ (ψ ∗ , η ∗ ) for some
nuisance parameter, thereby assuming an estimating function representation
for the efficient influence curve. (This assumption is not necessary at all
to establish the same asymptotics: see van der Laan and Rubin (2006).)
In this case, it follows that the targeted maximum likelihood estimator ψn
solves Pn Dq0 (ψn , ηn∗ ) = 0 so that one can establish asymptotic linearity of ψn
and derive its influence curve under relatively standard differentiability and
empirical process conditions.
In particular, if ηn∗ is a consistent estimator of a η0∗ satisfying P0 Dq0 (ψ0 , η0∗ ) =
0, then under such standard conditions, asymptotic consistency and asymptotic linearity can be established. For example, if η0∗ = η(P0∗ ) is the true
parameter, then ψn will have influence curve given by Dq0 (ψ0 , η0∗ ).
3.8
Case-control weighted targeted MLE of marginal
causal effect for case control data.
We will illustrate the targeted maximum likelihood estimator for the parameter ψ0 = EY1 − EY0 and the nonparametric model M∗ for the point
treatment data structure (W, A, Y ) ∼ P0∗ .
32
http://biostats.bepress.com/ucbbiostat/paper234
Recall that the double robust estimating function/efficient influence curve
of Ψ under i.i.d sampling from P0∗ is given by
)
(
I(A = 0)
I(A = 1)
− ∗
(Y − Q∗2 (M, W, A))
D (g , Q )(M, W, A, Y ) =
∗
g (1 | M, W ) g (0 | M, W )
+Q∗2 (M, W, 1) − Q∗2 (M, W, 0) − Ψ(Q∗ )
≡ D1∗ (g ∗ , Q∗ )(M, W, A, Y ) + D2∗ (Q∗ )(M, W ),
∗
∗
∗
where Q∗ = (Q∗1 , Q∗2 ) represents both the marginal distribution Q∗1 of W and
the conditional distribution Q∗2 of Y , given A, W . We note that D∗ (g ∗ , Q∗ )
can also be represented as an estimating function for ψ since D∗ (g ∗ , Q∗ ) =
D∗ (Ψ(Q∗ ), g ∗ , Q∗ ), as we did above.
∗
∗
Let Q∗0
2n be an initial estimator of Q20 (A, W ) = P0 (Y = 1 | A, W ) according to a particular working model Qw for Q∗20 : for example,
Q∗0
2n = arg max
∗
w
Q2 ∈Q
n
X
q0 log Q∗2 (A1i , W1i ) +
i=1
J
q̄0 (M1i ) X
log(1 − Q∗2 (Aj2i , W2ij )),
J
j=1
or the logistic regression based estimator Q∗n,q0 presented in Section 2.
Given a model G for g0∗ , let gn∗ be the corresponding weighted MLE:
gn∗ = arg max
g∈G
n
X
q0 log g(A1i | W1i ) +
i=1
J
q̄0 (M1i ) X
log g(Aj2i | W2ij ).
J
j=1
Similarly, let Q∗1n be the nonparametric weighted MLE:
Q∗1n = arg max
Q1
n
X
q0 log dQ1 (W1i ) +
i=1
J
q̄0 (M1i ) X
log dQ1 (W2ij ),
J
j=1
where the maximum is over all discrete distributions which put mass on W1i
and W2i , i = 1, . . . , n. It follows that Q∗1n is a discrete distribution which
puts mass q0 /n on W1i , i = 1, . . . , n, and puts mass q̄0 (M1i ))/(nJ) on W2ij ,
j = 1, . . . , j, i = 1, . . . , n.
Given any Q∗ , g ∗ , let {Q∗2g∗ () : } be a model through Q∗2 at = 0
and satisfying that the span of its score at = 0 includes the component
D1∗ (g ∗ , Q∗ ) of the efficient influence curve of Ψ under i.i.d. sampling from
PQ∗ ∗ ,g∗ . For example,
n
o
d
log Q∗2g∗ ()Y (1 − Q∗2g∗ ())1−Y = D1∗ (g ∗ , Q∗ ).
d
=0
33
Hosted by The Berkeley Electronic Press
This can be achieved with the following fluctuation function of Q∗2 :
logitQ∗2g∗ () = logitQ∗2 + Z(g ∗ ),
where
(
∗
Z(g ) ≡
)
I(A = 1)
I(A = 0)
.
−
g ∗ (1 | M, W ) g ∗ (0 | M, W )
Given the estimator gn∗ of g0∗ , consider the fluctuation function {Q∗0
∗ () :
2ngn
0
} and let n be its weighted MLE:
0n
= arg max
n
X
q̄0 (M1i )
q0 log Q∗0
∗ ()(A1i , W1i )+
2ngn
J
i=1
J
X
j
j
log(1−Q∗0
∗ ()(A2i , W2i )),
2ngn
j=1
which can be computed with standard logistic regression software.
0
0
∗
∗
The first step targeted MLE is now defined as (gn∗ , Q∗1n , Q∗1
2n = (gn , Q1n , Q2n (n )).
∗k−1 k−1
∗k
∗
∗
The k-th step targeted MLE is given by (gn , Q1n , Q2n = Q2n (n )), where,
for k = 0, . . .
kn
= arg max
n
X
q̄0 (M1i )
q0 log Q∗k
∗ ()(A1i , W1i )+
2ngn
J
i=1
J
X
j
j
log(1−Q∗k
∗ ()(A2i , W2i )).
2ngn
j=1
The corresponding k-th step targeted MLE of ψ0 is defined as ψnk = Ψ(Q∗k
n ) ≡
).
In
this
particular
application,
it
follows
that
convergence
occurs
Ψ(Q∗1n , Q∗k
2n
in one step so that ψn = Ψ(Q∗1
n ).
The case-control weighted double robust estimating function for case control data is given by:
Dq0 (g ∗ , Q∗ )(O) = q0 D∗ (g ∗ , Q∗ )(M1 , W1 , A1 , 1)
+
J
q̄0 (M1 ) X
D∗ (g ∗ , Q∗ )(M1 , W2j , Aj2 , 0),
J
j=1
and the targeted MLE (gn∗ , Q∗n ) solves
0=
n
X
Dq0 (gn∗ , Q∗n )(Oi ).
i=1
Statistical inference for ψn can be derived from the corresponding estimatP
ing equation 0 = ni=1 D(ψn , gn∗ , Q∗n )(Oi ) solved by the targeted MLE ψn =
Ψ(Q∗n ).
34
http://biostats.bepress.com/ucbbiostat/paper234
4
Case-control weighting of efficient procedure yields an efficient procedure for both
case-control designs I and II.
In this section we state and show the remarkable nice result that assigning
the case-control weights to the case-control sample and then applying an
efficient procedure developed for prospective sampling actually yields an efficient procedure. These results are presented and derived for both case-control
designs.
4.1
Independence models for case-control designs I and
II to derive efficiency results.
We consider the independence model M so that M = {P (P ∗ ) : P ∗ ∈ M∗ },
where for case-control design I, we have
dP (P ∗ )(W1 , A1 , (W2j , Aj2 : j)) = dP ∗ (W1 , A1 | Y = 1)
J
Y
dP ∗ (W2j , Aj2 | Y = 0),
j=1
(7)
and, for case-control design II, we have
dP (P ∗ )(M1 , W1 , A1 , (M1 , W2j , Aj2 : j)) = dP ∗ (M1 , W1 , A1 | Y = 1)
J
Y
dP ∗ (W2j , Aj2 | M = M1 , Y = 0).
j=1
∗
= dPM
(M1 )dP ∗ (W1 , A1 | M = M1 , Y = 1)
J
Y
dP ∗ (W2j , Aj2 | M = M1 , Y = 0).
(8)
j=1
Our results immediately generalize to models M for which the densities
of the distributions P (P ∗ , η) factorize as
dP (P ∗ , η) = dP1 (P ∗ )dP2 (η),
where dP1 (P ∗ ) is given by the independence likelihood (7) or (8), and P ∗ and
η are variation independent. This follows from the fact that such models the
tangent space contains the tangent space of the independence model, and our
35
Hosted by The Berkeley Electronic Press
proof of the wished result is based on showing that the case-control weighted
efficient influence curve is a member of the tangent space and thereby equals
the efficient influence curve for the model M.
Our results in this section show that the case-control weighting of the
canonical gradient for the prospective sampling model M∗ yields the canonical gradient for the parameter of interest Ψ based on case-control sampling
model M. Our results rely on the assumption that (the typically very
large/semiparametric) M∗ corresponds with (i.e., equals the intersection of)
separate models for P0∗ (W, A | Y = δ) for δ ∈ {0, 1} for case-control design
I, and that M∗ corresponds with (i.e., equals the intersection of) separate
models for P0∗ (W, A | Y = δ, M = m) for δ ∈ {0, 1} and m varying over the
support of the matching variable M .
As a consequence of our results, our proposed case-control weighted targeted maximum likelihood estimator for variable importance and causal effect
parameters, involving selecting estimators of Q∗0 and g0∗ , under appropriate
regularity conditions guaranteeing the wished convergence of the standardized estimator to a normal limit distribution, is efficient if both of these
estimators are consistent, and remains consistent if one of these estimators
is consistent.
We note that the working-model to obtain the initial model based maximum likelihood estimators in our double robust targeted maximum likelihood estimator is obtained by modeling the factors of dP ∗ (W, A, Y ) =
dP ∗ (W )dP ∗ (A | W )dP ∗ (Y | A, W ), which does thus not correspond with
separate models for dP ∗ (W, A | Y = δ) as we ”required” for the actual
model M∗ in order to make sure that the case-control weighted canonical
gradient is a canonical gradient. In order to understand the rational of this
discrepancy we provide the following explanation.
It happens to be that the efficient influence curve for our parameter of
interest Ψ for an underlying model M∗ identified by separate models for
P (W, A | Y = δ) has a double robust representation in terms of Q∗0 and g0∗ ,
while it does not have a double robust representation w.r.t. to say P (W, A |
Y ) or factors thereof. To fully exploit this double robust representation of
the efficient influence curve of our parameter of interest, one should base
estimation of the unknowns parameters of the efficient influence curve on
the latter representation, and that is why we proposed our particular double
robust locally efficient targeted maximum likelihood estimators.
Alternatively, we could use a targeted maximum likelihood estimator
based on initial estimators based on working models for P (W, A | Y = δ),
36
http://biostats.bepress.com/ucbbiostat/paper234
δ ∈ {0, 1}: in this manner we would obtain generalized locally efficient double robust estimators where the double robustness is stated in terms of the
models for Q∗0 and g0∗ implied by the models for P (W, A | Y = δ).
4.2
Case-control weighting of canonical gradient yields
canonical gradient: Case Control Design I.
Firstly, we present the theorem for case-control design I.
Theorem 7 Consider case-control design I. Assume that the model M∗ allows independent variation of P ∗ (W, A | Y = 1) and P ∗ (W, A | Y = 0).
Let D∗ (P ∗ ) be the canonical gradient of the pathwise derivative Ψ∗ :
∗
M → IRd at P ∗ ∈ M∗ , let M = {P (P ∗ ) : P ∗ ∈ M∗ } be the independence model defined by (7), and let Ψ : M → IRd satisfy Ψ(P (P ∗ )) = Ψ∗ (P ∗ )
for all P ∗ ∈ M∗ . Assume the regularity conditions for P ∗ → D∗ (P ∗ ) of
Theorem 5 apply so that it follows that Ψ is pathwise differentiable at P ∗ and
Dq0 (P ∗ ) is a gradient of this pathwise derivative.
We have that Dq0 (P ∗ ) is the canonical gradient of the pathwise derivative
of Ψ : M → IRd .
We already knew that, if we set D∗ (P ∗ ) equal to the canonical gradient
(or any other gradient) of Ψ∗ : M∗ → IRd , then its case-control weighted
version Dq0 (P ∗ ) is a gradient of Ψ : M → IRd . The surprising and important
extra result is that this Dq0 (P ∗ ) actually equals the canonical gradient. That
is, for case-control design I, the case-control weighted gradient mapping does
not only map gradients into gradients, it also maps the optimal canonical
gradient for model M∗ into the optimal canonical gradient for the observed
data model M for case-control data.
Remark regarding q0 known in model M∗ . Since q0 is a non-identifiable
parameter based on case-control sampling (design I), assuming q0 is known
in model M∗ puts no restriction on the observed data model M. As a consequence, the efficient influence curve for the parameter Ψ : M → IRd is the
same for the model M∗ in which this quantity is known as it is in the model
in which this quantity is unknown.
37
Hosted by The Berkeley Electronic Press
4.3
Example of efficient method for case-control design II based on stratified efficient method for casecontrol design I.
Before we present our general analogue result for case-control design II, it
is helpful to consider an example for case-control design II. Consider the
data structure O∗ = (M, W, A, Y ) ∼ P0∗ and let M∗ be a nonparametric
model. Consider case-control design II, in which our observed data O =
((M1 , W1 , A1 ), ((W2j , Aj2 ) : j = 1, . . . , J)). Suppose we wish to estimate ψ0∗ =
E0∗ Y1 = E0∗ E0∗ (Y | A = 1, M, W ) and that q0 (δ | m) = δP0∗ (Y = 1 | M =
m) + (1 − δ)P0∗ (Y = 0 | M = m) is known. Recall that the efficient influence
curve for this parameter Ψ∗ : M∗ → IR in model M∗ at P ∗ is given by
D∗ (Q∗ , g ∗ )−ψ ∗ = I(A = 1)/g ∗ (1 | M, W )(Y −Q∗ (M, W, A))+Q∗ (M, W, 1)−
ψ∗.
Consider the following general approach for estimation of ψ0∗ based on
data generated by a case-control design II:
• Apply the case-control weighted targeted MLE for case-control design
I to the subsample {i : M1i = m} to estimate the conditional version
ψ0∗ (m) = E ∗ (Y1 | M = m) of the parameter ψ0∗ . Thus this corresponds
with weighting the cases with q0 (1 | m) = P0∗ (Y = 1 | M = m) and
the controls with q0 (0 | m) = P0∗ (Y = 0 | M = m) and applying the
standard prospective targeted MLE based on an initial estimator of
Q∗0 (m, a, w) = P0∗ (Y = 1 | m, a, w) and g0∗ (a | m, w) = P0∗ (A = a |
M = m, W = w). By our results for case-control design I, we know
that this estimator yields a double robust locally efficient estimator of
ψ0 (m).
This case-control weighted targeted maximum likelihood estimator of
ψ0 (m) based on the subsample {i : M1i = m} solves the m-specific case∗
control weighted efficient influence curve equation 0 = Pn Dm,q
(Q∗n , gn∗ )−
0
∗
∗
Ψ (Qn )(m) and can thus be represented as
P
ψn (m) =
i
I(M1i = m)Dm,q0 (Q∗n , gn∗ )(Oi )
,
P
i I(M1i = m)
(9)
where
Dm,q0 (Q∗ , g ∗ )(O) = q0 (1 | m)
+ q0 (0|m)
J
I(Aj2 =1)
(0
g ∗ (1|m,W2j )
∗
−Q
n
I(A1 =1)
(1
g0∗ (1|m,W1 )
(m, W2j , Aj2 , 1))
o
− Q∗ (m, W1 , 1)) + Q∗ (m, W1 , 1)
∗
+Q
(m, W2j , Aj2 , 1)
.
38
http://biostats.bepress.com/ucbbiostat/paper234
The rational behind the consistency of this estimator ψn (m) follows
directly from the identity
E(Y1 | M = m) =
E0 Dm,q0 (Q∗0 , g0∗ )(O)I(M1 = m)
.
P0 (M1 = m)
• Now, note that
P0∗ (M = m) = P0 (M1 = m)
q0
.
q0 (1 | m)
Thus, one maps ψn (m) into an estimator of ψ0 by averaging it w.r.t.
to q0 /q0 (1 | M1i )Pn (M1 = m):
(
X
ψn =
m
n
1X
q0
ψn (m)
I(M1i = m)
n i=1
q0 (1 | M1i )
)
n X
1X
q0
I(M1i = m)Dm,q0 (Q∗n , gn∗ )(Oi ),
n i=1 m q0 (1 | m)
=
where we used (9).
Again, the rational of this estimator of ψ0 follows immediately from
the following derivation:
q0
I(M1 = m)Dm,q0 (Q∗0 , g0∗ )
E0 m q0 (1|m)
q0
= E0 q0 (1|M
DM1 ,q0 (Q∗0 , g0∗ )
1)
P
n
q0
1)
= E0 q0 (1|M
q0 (1 | M1 )D∗ (M1 , W1 , A1 , 1) + j q0 (0|M
D∗ (M1 , W2j , Aj2 , 0)
J
1)
P
j
j
1)
∗
= E0 q0 D∗ (M1 , W1 , A1 , 1) + q̄0 (M
j D (M1 , W2 , A2 , 0)
J
= E0∗ Y1 ,
P
o
where we suppressed the dependence of D∗ = D∗ (Q∗ , g ∗ ) on Q∗ , g ∗ .
• We conclude that this estimator ψn of ψ0∗ corresponds with solving
our proposed case-control weighted efficient influence curve equation
Pn Dq0 ,q̄0 − ψ = 0, where
Dq0 ,q̄0 (O) = q0 D∗ (M1 , W1 , A1 , 1) +
q̄0 (M1 ) X ∗
D (M1 , W2j , Aj2 , 0).
J
j
39
Hosted by The Berkeley Electronic Press
We conclude that this general approach for estimation of ψ0∗ of applying
the case-control weighted targeted MLE ψn (m) of case-control design I to the
sub-sample {i : M1i = m} to estimate the analogue ψ0∗ (m) of the parameter of
interest ψ0∗ (i.e., the same function but now applied to the conditional P0∗ (· |
M = m)), and subsequently averaging ψn (m) w.r.t. q0 /q0 (1 | m)Pn (M1 =
m), corresponds with using our for case-control design II proposed casecontrol weighting Dq0 ,q̄0 of the efficient influence curve D∗ for model M∗ .
This suggests that Dq0 ,q̄0 is indeed also, just as we showed for case-control
design I, the efficient influence curve. Our results below confirm this.
4.4
Case-control weighting of canonical gradient yields
canonical gradient: Matched Case Control Design.
For case-control design II, we establish the same result.
Theorem 8 Consider case-control design II.
In this theorem we use the notation: Dq0 ,q̄0 (P ∗ ) = q0 D∗ (P ∗ )(M1 , W1 , A1 , 1)+
j
j
q̄0 (M1 ) P
∗
∗
j D (P )(M1 , W2 , A2 , 0).
J
Assume that the model M∗ allows independent variation of P ∗ (W, A |
Y = δ, M = m) for δ ∈ {0, 1} and possible outcomes m of M under P0∗ .
Let D∗ (P ∗ ) be the canonical gradient of the pathwise derivative Ψ∗ :
∗
M → IRd at P ∗ ∈ M∗ , let M = {P (P ∗ ) : P ∗ ∈ M∗ } be the independence model defined by (8), and let Ψ : M → IRd satisfy Ψ(P (P ∗ )) = Ψ∗ (P ∗ )
for all P ∗ ∈ M∗ .
Assume the regularity conditions for P ∗ → D∗ (P ∗ ) of Theorem 5 apply so
that it follows that Ψ is pathwise differentiable and Dq0 ,q̄0 (P ∗ ) is a gradient
of this pathwise derivative at P (P ∗ ) ∈ M.
Then, Dq0 ,q̄0 (P ∗ ) is the canonical gradient of the pathwise derivative of
Ψ : M → IRd .
4.5
Selecting the efficient influence curve of unrestricted
target parameter.
In order to define an identifiable parameter Ψ(P (P ∗ )) = Ψ∗ (P ∗ ) of the casecontrol data generating distribution, one often needs to define Ψ∗ as indexed
by the known q0 and possibly q̄0 parameters. We denote such a parameter
with Ψ∗q0 : M∗ → IRd to stress its dependence on these known fixed quantities.
40
http://biostats.bepress.com/ucbbiostat/paper234
Our results above for case-control designs I and II above prove that if D∗ (P ∗ )
is the canonical gradient of Ψ∗q0 at P ∗ , then the case-control weighted Dq0 (P ∗ )
is the canonical gradient of Ψ : M → IRd , where Ψ(P (P ∗ )) = Ψ∗q0 (P ∗ ) for
all P ∗ ∈ M. The following theorem shows that one can typically replace
D∗ (P ∗ ) by the canonical gradient of the path-wise derivative of the unrestricted Ψ∗ (P ∗ ) = Ψq(P ∗ ) (P ∗ ).
Theorem 9 Consider the two pathwise differentiable parameters Ψ∗r0 : M∗ →
IRd indexed by a fixed r0 = r(P0∗ ) (e..g, representing q0 and q̄0 ), and a corresponding parameter Ψ∗ : M∗ → IRd defined as Ψ∗ (P ∗ ) = Ψ∗r(P ∗ ) (P ∗ ). Thus,
Ψ∗r0 (P0∗ ) = Ψ∗ (P0 ).
d
r(P ∗ ())
= 0,
Assume that for all the sub-models P ∗ () for which d
=0
we have
d ∗ ∗
d ∗
Ψ (P ())
= Ψr0 (P ∗ ()) .
d
d
=0
=0
Assume that the fixed parameter r0 in Ψ∗r0 is locally non-identifiable at P ∗
in the model M in the sense that the tangent space at P (P ∗ ) ∈ M generated
d
by the submodels {P ∗ () : } at P ∗ for which d
r(P ∗ ())
= 0 equals the
=0
∗
tangent space at P (P ) ∈ M generated by all submodels used in definition of
pathwise derivative of Ψ∗r0 : M∗ → IRd .
If the conditions of Theorem 7 or Theorem 8 apply for this choice Ψ∗r0 :
M∗ → IRd , then we also have, if D∗ (P ∗ ) is the canonical gradient of Ψ∗
at P ∗ , then the case-control weighted Dq0 (P ∗ ) is the canonical gradient of
Ψ : M → IRd .
Proof. This result is shown as follows. Let D∗ be the canonical gradient of
Ψ∗ : M∗ → IRd and let D1∗ be the canonical gradient of Ψ∗q0 : M∗ → IRd . As
a consequence of the first assumption, we have for all scores S of all these
submodels P ∗ () not changing q0 (in first order),
hD∗ , SiP ∗ = hD1∗ , SiP ∗ .
So, if we restrict our class of sub-models at P ∗ in the definition of the pathwise derivative to these sub-models in M∗ not varying r0 (which globally
corresponds with restricting M∗ to all P ∗ with r(P ∗ ) = r0 , but path-wise
differentiability at P ∗ only depends on local thickness of model at P ∗ ), then
we have that the canonical gradient for the corresponding class of submodels
for the observed data model is given by the case-control weighted Dq0 (P ∗ )
41
Hosted by The Berkeley Electronic Press
and the latter also equals the case-control weighted D1q0 (P ∗ ). So under this
restriction on the class of submodels through P ∗ we have equality of the
two case-control weighted canonical gradients corresponding with D∗ and
D1∗ . Now, by using that this restriction on the class of submodels does not
change the tangent space for the observed data models, and therefore does
not affect the canonical gradient representation at P (P ∗ ) of the parameter Ψ
in the observed data model M. Thus this Dq0 (P ∗ ), which equals D1q0 (P ∗ ),
also equals the canonical gradient for the class of all submodels used in the
actual definition of the pathwise derivative. This completes the proof of the
theorem. 2
Since q0 is non-identifiable for case-control design I it follows that casecontrol weighting of the canonical gradient of the unrestricted parameter Ψ∗
also yields the wished canonical gradient of Ψ. The same would apply for
the matched case-control design, if enforcing the restriction (q0 , q0 (1 | m) =
P0∗ (Y = 1 | M = m)) in M∗ does not reduce the observed data tangent
space, but this remains to be verified.
Proof of Theorems 7 and 8.
We already know that for both designs Dq0 (P ∗ ) (defined as Dq0 ,1−q0 (P ∗ ) for
design I and defined as Dq0 ,q̄0 for design II) is a gradient of the pathwise
derivative of Ψ at P (P ∗ ). Therefore, it remains to show that Dq0 (P ∗ ) is an
element of the tangent space T (P (P ∗ )) ⊂ L20 (P (P ∗ )) defined as the closure
of the linear span of the scores of each of the submodels {P () : } within
the Hilbert space L20 (P (P ∗ )).
In the Appendix we have a separate section establishing these results for
both designs, stating that if we select D∗ (P ∗ ) as the canonical gradient of Ψ∗
at P ∗ and the model M∗ allows independent variation of P (W, A | Y = δ) for
Design I and independent variation of P (W, A | M = m, Y = δ) for Design
II, then Dq0 (P ∗ ) is an element of the tangent space at P (P ∗ ) in the observed
case-control data model M.
Here we provide a summary of the proof for case-control design I in order
to provide the reader with an understanding of these results.
d
dP ∗ ()/dP ∗ Since D∗ (P ∗ ) is a canonical gradient it equals a score d
=0
for a particular submodel {P ∗ () : } at = 0, or it can be arbitrarily well
approximated by such a sequence of scores. We first consider the case that
D∗ (P ∗ ) is itself a score.
42
http://biostats.bepress.com/ucbbiostat/paper234
The tangent space under the independence model for a nonparametric
model M∗ is an orthogonal sum of the Hilbert space T1 (P ) = {S1 (W1 , A1 ) :
S1 } of functions of (W1 , A1 ) with mean zero, and the Hilbert space T2 (P ) =
P
{ j S2 (W2j , Aj2 ) : S2 } with S2 (W2j , Aj2 ) having mean zero, j = 1, . . . , J. For
an actual model M∗ these two Hilbert spaces are replaced by sub-spaces
spanned by the scores of the allowed sub-models {P ∗ () : } through P ∗ .
d dP ∗ ()
(W
,
A
|
Y
=
1)
That is, T1 (P ) consists of (and is generated by) functions d
1
1
∗
dP
P
j
j
d dP ∗ ()
j d dP ∗ (W2 , A2
∗
,
=0
| Y = 0) ,
and T2 (P ) consists of (and is generated by) functions
j = 1, . . . , J. We assumed that the marginal distributions P (W, A | Y = 1)
and P ∗ (W, A | Y = 0) are independently varied by these submodels, so that
indeed the tangent space is an orthogonal sum of T1 (P ) and T2 (P ).
For notational convenience, we introduce the notation 0 = 0. Let D∗ (P ∗ ) =
d dP ∗ (0 )
(W, A, Y ) be a score. Since q0 is non-identifiable, we can assume that
d0 dP ∗
p∗ ()(Y = 1) = q0 for all . It follows that
=0
d ∗
1
p (0 )(W1 , A1 , 1)
1 , A1 , 1) d0
1
d ∗
p (0 )(W1 , A1 | Y = 1)q0
= q0 ∗
p (W1 , A1 | Y = 1)q0 d0
1
d ∗
= q0 ∗
p (0 )(W1 , A1 | Y = 1)
p (W1 , A1 | Y = 1) d0
∈ T1 (P ∗ ),
q0 D∗ (P ∗ )(W1 , A1 , 1) = q0
p∗ (W
since the latter term equals q0 times a score of P ()(W1 , A1 ) at = 0 (which
in particular has mean zero).
Again, using that P ∗ ()(Y = 0) = 1 − q0 for all ,
(1 − q0 )D∗ (P ∗ )(W2j , Aj2 , 0) = (1 − q0 ) p∗ (W j1,Aj ,0) dd0 p∗ (0 )(W2j , Aj2 , 0)
2
= (1 −
≡ (1 −
2
d ∗
p (0 )(W2j , Aj2 |
d0
)
0
2
2
q0 ) p∗ (W j ,A1 j |Y =0) dd0 p∗ (0 )(W2j , Aj2 | Y =
2
2
q0 )S2 (W2j , Aj2 ),
= (1 − q0 ) p∗ (W j ,Aj |Y1 =0)(1−q
Y = 0)p∗ ()(Y = 0)
0)
where the latter term equals is 1 − q0 times a score of P ()(W2j , Aj2 ) at = 0
(which, in particular, has mean zero). It follows that
(1 − q0 ) X ∗ ∗
1 − q0 X
D (P )(W2j , Aj2 , 0) =
S2 (W2j , Aj2 ) ∈ T2 (P (P ∗ )).
J
J
j
j
43
Hosted by The Berkeley Electronic Press
This proves that for case-control design I, if D∗ (P ∗ ) is a score, then Dq0 (P ∗ )(O) =
P
j
j
∗
∗
0
q0 D∗ (P ∗ )(W1 , A1 ) + 1−q
j D (P )(W2 , A2 ) is a score itself, and thus an elJ
ement of the tangent space T (P ).
∗
Suppose now that D∗ (P ∗ ) = limm→∞ Dm
(P ∗ ) ∈ T ∗ (P ∗ ), where Dm (P ∗ ) ∈
∗
2
L0 (P ) is a score. Then, for each m, we have Dmq0 (P ∗ ) ∈ L20 (P (P ∗ )) is a
score. To show that Dq0 (P ∗ ) ∈ L20 (P ∗ ) is a score requires thus that the casecontrol mapping D∗ → Dq0 , as a mapping from L20 (P ∗ ) into L20 (P (P ∗ )) is
continuous. This is trivially established. This proves that indeed Dq0 (P ∗ )
is an element of the tangent space T (P (P ∗ )). This completes the proof for
case-control design I.
The proof for case-control design II is more delicate and provided in detail
in the Appendix.
5
5.1
Double robust locally efficient estimation
of marginal causal effects for case-control
design I.
Efficient influence curve of marginal causal effects
for case-control design I.
In this subsection we establish that the efficient influence curve of the marginal
causal effects defined on a nonparametric model M∗ and indexed by a fixed
known q0 for case-control design I can be represented as a case-control weighted
Dq0 = q0 D∗ (·, 1) + (1 − q0 )D∗ (·, 0), with D∗ being the efficient influence curve
of the marginal causal effect defined on the nonparametric model M∗ (and
not indexed by q0 ).
Our general theorem 4 and Theorem 7 teach us that this should indeed
be true but with D∗ being the efficient influence curve of the marginal causal
effect defined as Ψ∗q0 indexed by a fixed q0 . Theorem 9 proves that it should
also hold for the choice D∗ being the efficient influence curve of the marginal
causal effect defined as the unrestricted parameter Ψ∗ (P ∗ ) = Ψ∗q(P ∗ ) . Thus,
the result in this subsection is completely predicted by our theorems presented in previous sections.
Theorem 10 (Efficient influence curve for case control design I)
Consider Case Control Design I with data structure O = ((W1 , A1 ), ((W2j , Aj2 ) :
j)), where (W1 , A1 ) has distribution Q1 ∼ (W, A) | Y = 1 and (W2j , Aj2 ) has
44
http://biostats.bepress.com/ucbbiostat/paper234
distribution Q0 ∼ (W, A) | Y = 0. Let Q1 (w, a) = P (W1 = w, A1 = a) and
let Q0 (w, a) = P (W2 = w, A2 = a). Similarly, we define Q1 (w) = P (W1 =
w) and Q0 (w) = P (W2 = w). We also define Q∗ (a, W ) = P ∗ (Y = 1 | A =
a, W ), Q∗W (w) = P ∗ (W = w). Define
ψ0 = f (Q10 , Q11 , Q1 , Q0 )
≡
X
{q0 Q1 (w) + (1 − q0 )Q0 (w)}
w
Q1 (w, 1)q0
.
Q1 (w, 1)q0 + Q0 (w, 1)(1 − q0 )
The efficient influence curve of ψ0 is given by
Dq0 (P0 )(O) = q0 D∗ (P0∗ )(W1 , A1 , 1) + (1 − q0 )
J
1X
D∗ (P0∗ )(W2j , Aj2 , 0)
J j=1
This result teaches us that the variance under P0 of the weighted double
P
robust estimating function Dq0 (O) = q0 D∗ (W1 , A1 , 1)+ J1 Jj=1 (1−q0 )D∗ (W2j , Aj2 , 0)
is the Cramer Rao lower bound for regular asymptotically linear estimators of
Ψq0 (P0 )(1) = EY1 . For example, application of the theorem to ψ0 (1) = EY1
and ψ0 (0) = EY0 , and the delta-method to ψ0 = ψ0 (1)/ψ0 (0) = EY1 /EY0 ,
teaches us that the efficient influence curve of the causal relative risk ψ0 =
Ψq0 (P0 )(1)/Ψq0 (P0 )(0) (indexed by q0 ) can be represented as
DRR,q0 =
ψ0 (1)
1
D1,q0 −
D0,q0 ,
ψ0 (0)
ψ0 (0)2
where D1,q0 denotes the efficient influence curve of ψ0 (1) = EY1 and D0,q0
denotes the efficient influence curve of ψ0 (0) = EY0 . Using that each of the
two efficient influence curves D0,q0 , D1,q0 are case control weighted versions
of the efficient influence curves D0∗ and D1∗ in model M∗ , it follows that we
can also represent the efficient influence curve DRR,q0 as
1
{q0 D1∗ (W1 , A1 , 1) + (1 − q0 )D1∗ (W2 , A2 , 0)}
ψ0 (0)
ψ0 (1)
{q0 D0∗ (W1 , A1 , 1) + (1 − q0 )D0∗ (W2 , A2 , 0)}
−
2
ψ0 (0)
∗
∗
= q0 DRR
(W1 , A1 , 1) + (1 − q0 )DRR
(W2 , A2 , 0),
DRR,q0 =
where
∗
DRR
(W, A, Y ) =
ψ0 (1) ∗
1
D1∗ (W, A, Y ) −
D (W, A, Y ).
ψ0 (0)
ψ0 (0)2 0
45
Hosted by The Berkeley Electronic Press
∗
We note that DRR
is the efficient influence curve of ψ0 = EY1 /EY0 in the
nonparametric model M∗ , not treating the parameter ψ0 as being indexed
by q0 .
5.2
Variance of efficient influence curve of causal relative risk at q0 ≈ 0.
From the representation of the efficient influence curve of the relative risk,
it follows that if q0 and thereby ψ0 (0) is very small, then the variance of
the efficient influence curve DRR can be very large if D0∗ (W2 , A2 , 0) is a non
constant random variable with a variance which is not proportional to 1/q02 or
1/ψ0 (0)2 . As a consequence, it is fundamental that Q∗0 is itself proportional
to q0 , which is a reasonable assumption, and, for the sake of estimation, the
estimators Q∗n need to be proportional to q0 as well.
We have the following result formalizing this.
Theorem 11 We refer to the definition Da∗ being the efficient influence
curve of ψ0∗ (a) = E0∗ Ya , a ∈ {0, 1} provided in previous subsection. Let
Daq0 (Q∗0 , g0∗ , ψ0 (a))(O) ≡ q0 Da∗ (Q∗0 , g0∗ , ψ0 (a))(W1 , A1 , 1)
+(1 − q0 )
J
1X
D∗ (Q∗ , g ∗ , ψ0 (a))(W2j , Aj2 , 0).
J j=1 a 0 0
We have that Daq0 (Q∗0 , g0∗ , ψ0 (a)) equals the efficient influence curve of the
parameter ψq0 (a) and the model implied by the nonparametric model M∗ and
knowing q0 .
We have that Daq0 is double robust:
E0 Daq0 (Q∗ , g ∗ , ψ0 (a)) = 0 if either g ∗ = g0∗ or Q∗ = Q∗0 ,
and in both cases we need that g ∗ (1 | W ) > 0 a.e.
The variance of Daq0 (Q∗ , g0∗ , ψ0 (a)) under P0 is O(q0 ) if
Q∗ = q0
Q̃/(1 − Q̃)
1
,
1 − q0 1 + c0 Q̃/(1 − Q̃)
or equivalently Q∗ /(1−Q∗ ) = q0 /(1−q0 )Q̃/(1− Q̃), and Q̃/(1− Q̃) is bounded.
46
http://biostats.bepress.com/ucbbiostat/paper234
Thus, in order to construct estimators of ψ0 (a) that have standard error
proportional to q0 it is crucial that we restrict our estimators of Q∗0 to estima#
tors of the form q0 Q#
n for nicely bounded Qn or, that it can be represented
as a logistic regression with an intercept log c0 and bounded function in the
covariates. In that manner, our resulting double robust locally efficient estimators of the causal relative risk or odds ratio will have a bounded influence
curve at q0 ≈ 0.
5.3
Double robust locally efficient estimator of treatment specific mean and causal relative risk/odds
ratio for case control design I.
So we can construct a double robust locally efficient estimator of ψ0 (a) as
follows. Let Q̃n be an estimator based on fitting a logistic regression model
for Q∗0 ignoring the case control sampling: i.e., for some working model Q∗
for Q∗0 (A, W ) let
Q̃n = arg max
∗
∗
Q ∈Q
n
X
log Q∗ (W1i , A1i ) +
i=1
J
1X
log(1 − Q∗ (W2ij , Aj2i )).
J j=1
We can now map this into an estimator Q∗n,q0 of Q∗0 :
c0 Q̃n /(1 − Q̃n )
1 + c0 Q̃n /(1 − Q̃n )
Q̃n /(1 − Q̃n )
1
= q0
1 − q0 1 + c0 Q̃n /(1 − Q̃n )
≡ q0 Q#
n,q0 ,
Q∗n,q0 =
Equivalently, we can add the intercept log c(q0 ) to the log odds of the fit Q̃n .
Let gn∗ be an estimator of g0∗ . For example, given a working model G ∗ for
g0∗ , let
gn∗
= arg max
∗
∗
g ∈G
n
X
q0 g ∗ (A1i | W1i ) + (1 − q0 )g ∗ (A2i | W2i ).
i=1
Now, define ψn (a) as the solution of the efficient influence curve estimating
equation Pn Daq0 (Q∗n,q0 , gn∗ , ψ(a)) = 0 given by
ψn (a) =
n
1X
Daq0 (Q∗n,q0 , gn∗ )(Oi )
n i=1
47
Hosted by The Berkeley Electronic Press
n
1X
I(A1i = a)
I(A1i = a)
= q0
+ q0 ∗
Q∗n,q0 (W1i , a) 1 − ∗
n i=1
gn (a | W1i )
gn (a | W1i )
!
1X ∗
I(Aj = a)
+(1 − q0 )
.
Qn,q0 (W2ij , a) 1 − ∗ 2i j
J j
gn (a | W2i )
!
Note that for q0 ≈ 0, this estimator is well approximated (also for purpose
of causal relative risk of odds ratio) by
n
1X
I(A1i = a)
1X #
I(Aj2i = a)
j
ψn (a) ≈ q0
.
+
Q (W , a) 1 − ∗
n i=1 gn∗ (a | W1i ) J j n,q0 2i
gn (a | W2ij )
!
5.4
Double robust locally efficient targeted MLE of
treatment specific mean, causal relative risk and
odds ratio for case control design I.
Let Q̃∗n be defined as a standard logistic regression fit ignoring the case control
sampling. Subsequently, we map this into our estimator Q∗n,q0 of Q∗0 by adding
the intercept log c(q0 ) to the log odds of Q̃∗n .
We now construct an -fluctuation Q∗n,q0 () through the corresponding
logistic regression fit Q∗n,q0 (Y | A, W ) satisfying
d
log Q∗n,q0 () = D∗ (Q∗n,q0 , gn∗ ),
d
where D∗ (Q∗ , g ∗ ) is the efficient influence curve of the bivariate parameter
(Ψ(Q∗ )(0), Ψ(Q∗ )(1)) (i.e. EY0 , EY1 ). This can be done by adding a two
dimensional extension (I(A = 1)/gn∗ (1 | W ), I(A = 0)/gn∗ (0 | W )) to the log
odds of the logistic regression fit Q∗n,q0 .
Let
n = arg max
X
q0 log Q∗ (W1i , A1i ) + (1 − q0 )
i
1X
log(1 − Q∗ (W2ij , Aj2i ))
J j
be the case control weighted maximum likelihood estimator of , which can be
fitted with standard logistic regression software again. The one-step targeted
MLE of Q∗0 is now defined as Q∗n ≡ Q∗n,q0 (n ).
Since the update of the MLE Q∗n,q0 only depends on gn∗ which does not
change, it follows that this one-step targeted MLE Q∗n already solves the
48
http://biostats.bepress.com/ucbbiostat/paper234
case-control weighted efficient influence curve estimating equation:
0 =
X
≡
X
q0 D∗ (Q∗n , gn∗ )(W1i , A1i , 1) + (1 − q0 )
i
1X ∗ ∗ ∗
D (Qn , gn )(W2ij , Aj2i , 0)
J j
Dq0 (Q∗n , gn∗ )(Oi ),
i
so that the generally prescribed iteration for targeted MLE is not needed.
The resulting targeted maximum likelihood estimator Ψ(Q∗n ) = EQ∗W,n Q∗n (a, W ),
with Q∗W,n = q0 Q∗W1 ,n + (1 − q0 )Q∗W2 ,n being the case control weighted empirical distribution of the covariate vector W , solves now the double robust
P
estimating equation 0 = i Dq0 (Q∗n , gn∗ , Ψ(Q∗n ))(Oi ) (where we now use the
estimating function representation of Dq∗0 ), and is therefore a double robust
estimator in the sense that it is consistent and asymptotically linear if either
Q∗n is consistent or gn∗ is consistent.
The same statistical properties are now established for the corresponding
causal relative risks and odds ratios, where one uses that Q∗n = Q∗n,q0 (n ),
just like Q∗n,q0 , equals q0 times a bounded estimator Q#
n so that the standard
error
of this double robust targeted MLE is proportional to q0 (divided by
√
n).
6
Estimation of semi-parametric logistic regression models based on case-control sampling.
Let O = (W, A, Y ) ∼ P0∗ . Assume
Q∗0 (A, W ) ≡ P0∗ (Y = 1 | A, W ) =
1
1 + exp(−{Aβ0 W + r0 (W )})
for some β0 and unspecified function r0 . We refer to this model {Qβ,r : β, r}
for Q∗0 as a semi-parametric logistic regression model.
We first wish to construct the iterative targeted MLE of β0 based on an
i.i.d. sample O1 , . . . , On from P0∗ and after that we consider the corresponding
case-control weighted targeted MLE for case-control sampling from P0∗ .
Firstly, we are concerned with construction of the nuisance tangent space
of the unspecified r0 so that we can find the efficient influence curve and
49
Hosted by The Berkeley Electronic Press
corresponding hardest sub-model through a current fit, as needed to define
∗
the targeted MLE. For that purpose, we can consider -paths P0
(Y = 1 |
1
for
arbitrary
functions
h.
This
results
in
A, W ) = 1+exp(−{Aβ0 W +r
0 (W )+h(W )})
the nuisance tangent space
Tnuis,r0 (P0∗ ) = {h(W )(Y − Q∗0 (A, W )) : h}.
In order to find the efficient score we wish to construct a path Q∗0 () through
Q∗0 at = 0 so that its score at = 0 is orthogonal to the nuisance tangent
space. Since any score is already orthogonal to the nuisance scores generated
by the distribution of (A, W ), it follows that is suffices to establish that this
score is orthogonal to Tnuis,r0 (P0∗ ). Consider the candidate paths
Q∗0h1 ()(Y = 1 | A, W ) =
1
.
1 + exp(−{A(β0 + )W + r0 (W ) + h1 (W )})
The score of this path at = 0 equals
S(h1 ) ≡ (AW + h1 (W ))(Y − Q∗0 (A, W )).
We now need to select h1 so that for each h(W ) we have
0 = E0∗ (AW + h1 (W ))(Y − Q∗0 (A, W ))h(W )(Y − Q∗0 (A, W ))
= E0∗ (AW + h1 (W ))h(W )σ02 (A, W )),
where σ02 (A, W ) = Q∗0 (A, W )(1 − Q∗0 (A, W )). It follows that the unique
solution is given by
h∗1 (Q∗0 , g0∗ )(W ) = −
E0∗ {AW Q∗0 (1 − Q∗0 )(A, W ) | W }
,
E0∗ {Q∗0 (1 − Q∗0 )(A, W ) | W }
where the conditional expectation is w.r.t. the conditional distribution g0∗ of
A, given W . In particular, this shows that the efficient influence curve is up
till a scaling matrix given by:
D∗ (Q∗0 , g0∗ )(O) = {AW + h∗1 (Q∗0 , g0∗ )(W )} (Y − Q∗0 (A, W )).
We note that one can also represent D∗ as function in g0∗ , β0 , r0 :
D∗ (β0 , r0 , g0∗ )(O∗ ) = {AW + h∗1 (β0 , r0 , g0∗ )(W )} (Y − Qβ0 ,r0 (A, W )).
50
http://biostats.bepress.com/ucbbiostat/paper234
We are now ready to define the targeted MLE based on a sample of P0∗ .
Let Q∗0 , g ∗0 be initial estimators of Q∗0 , g0∗ , where Q∗0 is defined by (β 0 , r0 ).
0
Construct the path Q∗0
h∗1 (Q∗0 ,g ∗0 ) () and compute the MLE n of . This corresponds with fitting a logistic regression model in covariate AW +h∗1 (Q∗0 , g ∗0 ),
0
with offset β 0 AW +r0 (W ). We now update Q∗1 = Q∗0
h∗1 (Q∗0 ,g ∗0 ) (n ). We iterate
this updating process till kn ≈ 0 at which point we have
0=
n
X
D∗ (βnk , rnk , gn∗0 )(Oi )
i=1
up till a user supplied numerical precision. The estimator βnk ’s influence curve
can now be derived from the fact that it solves this estimating equation and
statistical inference proceeds accordingly.
Let’s now consider a case-control sample. We now set our initial estimate Q∗0 above equal to Q∗n,q0 obtained by adding an intercept log c0 into a
weighted logistic regression fit Q̃n in which the cases get weight 1 and the
controls receive a weight q̄0 (M1 ). Secondly, in each estimation step of the
iterative targeted MLE we assign weights q0 and q̄0 (M1 ) to the cases and controls, respectively. The resulting case-control weighted targeted MLE βnk , rnk
now solves
0=
n
X
i=1
q0 D∗ (βnk , rnk , gn∗0 )(Wi1 , A1i , 1) +
J
q̄0 (M1i ) X
D∗ (βnk , rnk , gn∗0 )(W2ij , Aj2i , 0)
J
j=1
up till a user supplied numerical precision. Statistical inference for this estimator βn can now be based on the fact that it solves this estimating equation,
or one could apply the bootstrap.
7
Targeted MLE of realistic marginal structural models for case-control studies.
The data structure on each experimental unit is O∗ = (W, A, Y ) ∼ P0∗ , where
W is a collection of baseline covariates, A is a treatment variable, and Y is an
outcome of interest. Given a case control sample of n i.i.d. copies O1 , . . . , On ,
the goal is to estimate the causal effect of treatment on the outcome within
subgroups defined by the strata of a baseline covariate V included in W .
This has important applications in causal effect estimation of a drug (e.g.
51
Hosted by The Berkeley Electronic Press
dose) in clinical trials as well for observational (e.g post market) studies.
The full data structure and parameter of interest: Let Y (a) represent
a treatment specific outcome one would observe if the randomly sampled subject would be assigned a treatment coded as a ∈ A, and let X = (W, (Y (a) :
∗
represent the full data structure of interest on the randomly
a ∈ A)) ∼ PX0
sampled subject consisting of the treatment specific outcomes, and baseline
covariates W . Let A1 denote an index set of a set of dynamic point treatment
rules
D = {W → d(a)(W ) ∈ A : a ∈ A1 },
where each rule in this set D of rules, represents a rule for assigning treatment
in response to the subject’s/experimental unit’s baseline covariates W .
A special case is that A1 = A and d(a) denotes a rule which aims to assign
a but if a is such that the conditional probability g0∗ (a | W ) of treatment being
equal to a, given the baseline covariates W , is too close to zero, then it assigns
a treatment in the set of realistic treatment options closest to a, where the
latter ”realistic” and ”closest” need to be defined appropriately. We refer to
such rules avoiding treatment assignments which are not supported by the
treatment mechanism g0∗ as realistic treatment rules. We refer to van der
Laan and Petersen (2007) for a general class of causal models for realistic
treatment rules.
∗
is unspecified.
We consider a model in which the full data distribution PX0
A scientific parameter of interest is a realistic causal treatment curve defined
as the mean ψ0 (a) = E0∗ Y (d(a)) of the treatment specific outcome Y (d(a)),
where d(a) is a dynamic point treatment rule W → d(a)(W ), and Y (d(a))
represents the outcome one would observe if the subject follows this rule. In
addition, we are also concerned with the V -adjusted causal response curve
for a V ⊂ W defined as
ψ0 (a, v) = E0∗ (Y (d(a)) | V = v),
where V represents a baseline characteristic which might potentially strongly
affect the causal response curve.
Here d(a) is a dynamic point treatment rule W → d(a)(W ) mapping the
baseline covariates in the set A of treatment options satisfying for some user
supplied δ > 0 the following condition:
P0∗ (A = d(a)(W ) | W ) > δ almost everywhere, for all a ∈ A1 .
(10)
A counterfactual Y (d(a)) indexed by a dynamic treatment rule d(a) is a well
defined function of the complete set of counterfactuals (Y (a) : a ∈ A) and
52
http://biostats.bepress.com/ucbbiostat/paper234
baseline covariates W , and the rule d(a): Y (d(a)) = Y (d(a)(W )).
Missing data structure representation of observed data on experimental unit: It is assumed that O∗ = (W, A, Y = Y (A)) with probability
1.
Randomization assumption: We also assume that A is randomized conditional on W :
g0∗ (a | X) = P0∗ (A = a | X) = P0∗ (A = a | W ).
The assumption (10) guarantees that the distribution of the counterfactual
Y (d(a)) is identifiable from the observed data structure O = (W, A, Y =
Y (A)).
Working model: We consider a working model m(a, v | β) for the treatment
specific mean ψ0 (a, v), and define the target parameter as
∗
β0 = arg min E0V
β
X
(m(a, V | β) − ψ0 (a, V ))2 h(a, V ),
a∈A1
where h is a user supplied weight function. For simplicity, we assume here
that A1 is discrete, but if A1 is a continuous set, then one can replace it by
a discrete approximation in the above definition.
Typical models are models m(a, v | β) = β(a, V ), m(a, v | β) = exp β(a, v)),
and m(a, v | β) = 1/(1 + exp(−β(1, a, v)) for additive effects, multiplicative
effects, and odds-ratio effects, respectively.
The summary measure ψ̃0 (a, v) = m(a, v | β0 ) of ψ0 implied by the working model {m(· | β) : β} provides now a model based approximation of the
true causal response curve ψ0 . Note that β0 is a parameter of ψ0 and the
∗
marginal distribution P0V
of V . Although, we will consider the model for
∗
the full data distribution PX0
to be nonparametric and the working model
as an approximation of the true causal response curve, our proposed estimators are valid if one actually assumes the working model m(a, V | β0 ) to be
correctly specified. Our goal is to construct a targeted MLE of β0 based on
a case-control sample.
Important identity: Under a mild regularity condition, it follows that
β0 = β(Q∗01 , Q∗02 ) solves
∗
0 = E0V
X
h(a, V )
a∈A1
= EQ∗01
X
a∈A1
d
m(a, V | β0 )(E0∗ (Y (d(a)) | V ) − m(a, V | β0 ))
dβ0
h(a, V )
d
m(a, V | β0 )(E0∗ (Y (d(a)) | W ) − m(a, V | β0 ))
dβ0
53
Hosted by The Berkeley Electronic Press
= EQ∗01
X
a∈A1
h(a, V )
d
m(a, V | β0 )(Q∗02 (d(a), W ) − m(a, V | β0 )),
dβ0
where we defined Q∗02 (d(a), W ) = E0∗ (Y | A = d(a)(W ), W ) and Q∗01 is the
marginal distribution of W . This identity will be assumed to hold.
Optimal treatment: We are also concerned with statistical inference for
the optimal treatment for subgroup v
a∗ (β0 )(v) = arg max m(a, v | β0 ),
a∈A1
and, in case V is chosen to be the empty set, then this reduces to the marginal
optimal treatment
a∗ (β0 ) = arg max m(a | β0 ).
a∈A1
A particular working model of interest for determining an optimal treatment
among a continuous set A1 is given by a quadratic model
m(a, v | β0 ) = β0 (0)(v) + β0 (1)(v)a + β0 (2)(v)a2 ,
where, for example, β0 (j)(v) = β0 (j)(0) + β0 (j)(1)v, j = 0, 1, 2. Such a
quadratic model allows for applications in which the optimal dose is neither
the maximum value nor the minimum, but something in between. For this
choice of working model we have that the optimal dose for subgroup V = v
is given by:
−β0 (1)(v)
.
a∗ (β0 )(v) =
2β0 (2)(v)
In particular, the optimal marginal dose is given by
a∗ (β0 ) =
−β0 (1)
.
2β0 (2)
Likelihood and Identifiability: Firstly, we note that the likelihood of the
observed data set O∗ factorizes as:
dPQ∗ ∗0 ,g0∗ (O∗ ) = Q∗01 (W )Q∗02 (Y | A, W )g0∗ (A | W ),
where the conditional density of Y , given A = a, W , Q∗20 (· | a, W ), equals the
conditional density of Y (a), given W , and Q∗10 denotes the marginal density
54
http://biostats.bepress.com/ucbbiostat/paper234
of W . In particular, it follows that the marginal causal dose response curve
ψ0 (a) is identified by the Q∗0 -factor of the likelihood by the following relation:
ψ0 (a) = E0∗ E0∗ (Y | A = d(a)(W ), W ).
In general, under this same condition,
ψ0 (a, v) = E0∗ {E0∗ (Y | A = d(a)(W ), W ) | V = v}.
Case-control weighted maximum likelihood estimation: Consider a
logistic regression model {Q∗2θ : θ} for the distribution of Y (a), given W , or
equivalently, the distribution Q∗02 of Y , given A, W , and the corresponding
case control weighted maximum likelihood estimator θn :
θn = arg max
θ
n
X
q0 log Q∗2θ (1 | A1i , W1i ) + q̄0 (M1i )
i=1
1X
log Q∗2θ (0 | Aj2i , W2ij ).
J j
Alternatively, we use the estimator Q∗n,q0 which adds the intercept log c0 into
a logistic regression fit Q̃n only using the weights q̄0 (M1i )/J for the controls,
while using weight 1 for the cases.
We will leave the marginal distribution of W unspecified, so that this is
estimated with the case control weighted empirical probability distribution
Q∗1n . The model {Q∗2θ : θ} defines a working model Qw for the unknown
components Q∗0 = (Q∗10 , Q∗20 ) of the likelihood of the observed data. Given
an estimator θn , we will use the short-hand notation Q∗θn = (Q1n , Q∗2θn ) for
the estimate of both the marginal distribution of W as well as the conditional
distribution of Y , given A, W . We also assume that we are given an estimate
gn∗ of the treatment mechanism g0∗ (A | W ) in the case that the latter is
not known by design, such as the case control weighted maximum likelihood
estimator according to a model G ∗ for g0∗ .
We wish to compute the case-control weighted targeted MLE for the
nonparametric model targeting β0 , based on the case-control weighted initial
maximum likelihood estimator Q∗θn based on this working model Qw . For
this purpose, we first need to know the efficient influence curve of β0 in our
nonparametric model for the observed data O.
Efficient influence curve: The efficient influence curve for β0 at PQ∗ ∗0 ,g0∗ is,
55
Hosted by The Berkeley Electronic Press
up till a normalizing matrix, given by
D
∗
(Q∗0 , g0∗ )(O∗ )
=
X
I(A = d(a)(W ))
a∈A1
+
≡
X
h(a, V )
h(a, V ) dβd0 m(a, V | β0 )
g0∗ (A | X)
(Y − Q∗02 (A, W ))
d
m(a, V | β0 )(Q∗02 (d(a), W ) − m(a, V | β0 ))
dβ0
a∈A1
∗
D1 (Q∗0 , g0∗ )(W, A, Y
) + D2∗ (Q∗0 )(W ),
where we defined Q∗02 (d(a), W ) = EQ∗0 (Y | A = d(a)(W ), W ) and Q∗02 (a, W ) =
E0∗ (Y | A = a, W ), and we note that β0 = β(Q∗0 ) is a parameter of Q∗0 =
(Q∗01 , Q∗02 ).
P
The IPTW component of D∗ (Q∗0 , g0∗ ) is DIP T W (g0∗ , β0 ) = a∈A1 I(A =
h(a,V )
d
m(a,V |β0 )
dβ0
(Y −m(a, V | β0∗ )) and we have the usual DR-IPTW
d(a)(W ))
g0∗ (A|X)
representation D∗ = DIP T W − E(DIP T W | A, W ) + E(DIP T W | W ) of D∗ .
This insight allows us to define the normalizing matrix as
c(PQ∗ ∗0 ,g0∗ , g0∗ , β0 ) =
P
) d
d
>
PQ∗ ∗0 ,g0∗ a∈A1 I(A = d(a)(W )) gh(a,V
∗ (A|X) dβ m(a, V | β0 ) dβ m(a, V | β0 )
0
0
0
P
= EQ∗0 a∈A1 h(a, V ) dβd0 m(a, V | β0 ) dβd0 m(a, V | β0 )> .
The efficient influence curve for β0 at P0∗ in the nonparametric model M∗ is
given by c(PQ∗ ∗0 ,g0∗ , g0∗ , β0 )−1 D∗ (Q∗0 , g0∗ ). The efficient influence curve for a (e.g.
lower dimensional) function of β0 in model M∗ can be derived (as a linear
mapping applied to the vector efficient influence curve D∗ ) based on the δmethod. The targeted MLE presented below could be equally well developed
for this function by the efficient influence curve of the lower dimensional
function instead, possibly up till a normalizing matrix. Below, we present
the targeted MLE for the whole vector β0 .
Epsilon-fluctuation for Targeted MLE: Let {Q∗2θ () : } be a path
d
through Q∗2θ at = 0 and satisfy the score condition d
log Q2θ ()
=
=0
∗
∗
∗
D1 (Q2θ , g0 ). (For the targeted MLE for functions of β0 we would also decompose its efficient influence curve in a D1∗ component representing its projection on functions of O∗ with conditional mean zero, given A, W , and D2∗
component representing its projections on the functions of W with mean
zero). If Q∗2θ is a logistic regression of a binary Y on A, W , then we simply
56
http://biostats.bepress.com/ucbbiostat/paper234
add C(A, W ), where
C(A, W ) ≡
X h(a, V ) dβd m(a, V | β0 )
0
g0∗ (A | X)
a∈A1
to the logit of Q∗2θ (1 | A, W ). In other words,
logitEQ∗2θ () (Y | A, W ) = logitEQ∗2θ (Y | A, W ) + C(A, W ).
In both cases, these extensions have a score at = 0 equal to D1∗ (Q∗2θ , g0∗ ).
Making the epsilon-covariate extension independent of β0 : The
targeted MLE can be obtained in one maximum likelihood step determining
the maximum likelihood estimator of in the case that the epsilon-covariate
C(A, W ) does not depend on β0 . In the case that m(a, V | β) is a logistic
linear regression model, say, m(a, V | β0 ) = 1/(1 + exp(−β0 (a, V )), then we
recommend to select h(a, V ) = h1 (a, V )/(m(a, V | β0 )(1 − m(a, V | β0 )) for
some h1 so that h(a, V ) dβd0 m(a, V | β0 ) reduces to h1 (a, V )(a, V )> and is
thus independent of β0 . Similarly, if m(a, V | β) is a log linear regression
model (modelling a causal relative risk), say m(a, V | β) = exp(β(a, V )),
then we could select h(a, V ) = h1 (a, V )/m(a, V | β0 ) for some h1 so that
h(a, V ) dβd0 m(a, V | β0 ) reduces to h1 (a, V )(a, V )> so that the -covariate is
thus independent of β0 , again. If m(a, V | β) is a linear model, then we can
choose h(a, V ) = h1 (a, V ) with (e.g.) h1 (a, V ) = g0∗ (a | V ).
The one-step targeted MLE: Given an estimate gn∗ of the treatment mechanism g0∗ , let n be the case-control weighted maximum likelihood estimator
n = arg max
n
X
log q0 Q∗2θn ()(1 | W1i , A1i )+
i=1
J
q̄0 (M1i ) X
log Q∗2θn ()(0 | W2ij , Aj2i ).
J
j=1
We call βn = β(Q∗1n , Q∗2θn (n )) corresponding with the updated Q∗θn (n )
the one step targeted MLE of β0 and if C(a, W ) does not depend on Q∗ ,
then this is also the iterative targeted MLE (since the next update gives an
n = 0 and thereby does not change the estimate). Either way, we let βn
be the final update after convergence of the iterative updating process has
been achieved, which, for -extensions of the type presented above, occurs in
a single step.
Recall the above mentioned identity
0 = EQ∗01
X
a∈A1
h(a, V )
d
m(a, V | β0 )(Q∗02 (d(a), W ) − m(a, V | β0 )),
dβ0
57
Hosted by The Berkeley Electronic Press
which defines β0 = β(Q∗01 , Q∗02 ) as a function of the marginal distribution Q∗01
of W and the conditional distribution (i.e., mean) Q∗02 , of Y , given A, W .
Let βn = β(Q∗1n , Q∗2θn (n )) be the targeted MLE, where Q∗1n is the empirical
probability distribution for the marginal distribution of W . It follows that,
given Q∗1n and Q∗2θn (n ), βn can be defined as the solution of
0=
n
1X
q̄0 (M1i ) X ∗
q0 D2∗ (βn , Q∗θn (n ))(W1i ) +
D2 (βn , Q∗θn (n ))(W2ij ),
n i=1
J
j
where
D2∗ (β, Q∗ )(W ) =
X
h(a, V )
a∈A1
d
m(a, V | β)(m(a, V | β) − Q∗2 (d(a), W ))).
dβ
Equivalently, one can view βn as a case-control weighted least squares
solution of the regression of Q∗2θn (n )(d(a), Wi ) on the realistic MSM m(a, Vi |
β):
βn = arg min
β
X
Pn,q0 h(a, V )(Q∗2θn (n )(d(a), W ) − m(a, V | β))2 ,
a∈A1
where Pn,q0 denotes the case-control weighted empirical distribution putting
mass q0 /n on (W1i , A1i ) and q̄0 (M1i )/(nJ) on (W2ij , Aj2i ), j = 1, . . . , J. Recall
that for a function D∗ (O∗ ) and corresponding Dq0 (O) we have Pn,q0 D∗ =
Pn Dq0 .
The targeted MLE as double robust estimating function based estimator: For the purpose of statistical inference it is also helpful to note
that
P
0 = ni=1 q0 D1∗ (Q∗θn (n ), gn∗ )(W1i , A1i , 1)+
Pn
j
j
1 PJ
∗
∗
∗
i=1 q̄0 (M1i ) J
j=1 D1 (Qθn (n ), gn )(W2i , A2i ),
so that we also have
0=
X
q0 D∗ (Q∗θn (n ), gn∗ )(W1i , A1i , 1)+
i
q̄0 (M1i ) X ∗ ∗
D (Qθn (n ), gn∗ )(W2ij , Aj2i , 0)).
J
j
Let’s now use the estimating function representation of the efficient influence curve in model M∗ ,
∗
∗
∗
D (β, Q , g ) =
X
I(A = d(a)(W ))
a∈A1
+
X
a∈A1
h(a, V )
d
h(a, V ) dβ
m(a, V | β)
g ∗ (A | X)
(Y − Q∗2 (A, W ))
d
m(a, V | β)(Q∗2 (d(a), W ) − m(a, V | β)),
dβ
58
http://biostats.bepress.com/ucbbiostat/paper234
where Q∗2 (a, W ) = EQ∗ (Y | A = a, W ) and Q∗2 (d(a), W ) = EQ∗ (Y | A =
d(a)(W ), W ). We have D∗ (Q∗ , g ∗ ) = D∗ (β(Q∗ ), Q∗ , g ∗ ) so that the fact that
the targeted MLE Q∗n = Q∗θn (n ) solves the case-control weighted efficient
influence curve equation, Pn Dq0 (Q∗θn (n ), gn∗ ) = 0 implies that βn solves the
case-control weighted Pn Dq0 (βn , Q∗θn (n ), gn∗ ) = 0. Thus the targeted MLE
βn is a solution of the double robust IPTW estimating function:
0=
X
Dq0 (βn , Q∗θn (n ), gn∗ ).
i
As a consequence, we can analyze βn in the same manner as we analyze
P
the double robust IPTW estimator βnDR solving 0 = i Dq0 (β, Q∗n , gn∗ ) for
a given estimator Q∗n , but where Q∗n is now simply playing the role of the
updated Q∗θn (n ) (van der Laan and Robins (2002)).
Statistical Inference: Thus (see van der Laan and Robins (2002)), if
gn∗ = g0∗ , under regularity conditions, we have that the targeted MLE βn =
β(Q∗1n , Q∗2θn (n )) is consistent and asymptotically linear with influence curve
∗
∗ ∗
∗
c−1
0 Dq0 (β0 , Q , g0 ), where c0 = c(PQ∗0 ,g0∗ , g0 , β0 ) is the derivative matrix defined above and Q∗ denotes the limit of Q∗θn (n ) (which is allowed to be
misspecified):
βn − β0 =
n
√
1X
∗ ∗
c−1
0 Dq0 (β0 , Q , g0 )(Oi ) + oP (1/ n).
n i=1
We suggest that this influence curve, as in prospective sampling, can also be
used for conservative inference in the case that g0∗ is estimated according to a
model, though this will need to be formally verified. If one wants statistical
inference in the double robust model only assuming that either gn∗ or Q∗θn (n )
is consistent, then we recommend to use the bootstrap.
Variance of estimator of P0∗ (Ya = 1 | V ) at q0 ≈ 0: Let D∗ (Q∗0 , g0∗ )(O∗ ) be
the non-standardized efficient influence curve in model M∗ presented above,
∗ ∗
thus not multiplied with c−1
0 . Let Dq0 (Q0 , g0 )(O) be the corresponding casecontrol weighted function. We note that if Q∗02 = q0 Q#
02 for some bounded
#
∗ ∗
Q02 , then we have that Dq0 (Q0 , g0 ) can be represented as q0 times a bounded
function. As a consequence, the variance of Dq0 (Q∗ , g0∗ ) for such Q∗ is proportional to q02 , so that the variance of the targeted MLE of m(a, v | β0 ) behaves
as O(q02 /n). As a consequence, we can robustly estimate causal relative risk
and odds ratio effects at q0 ≈ 0.
59
Hosted by The Berkeley Electronic Press
8
Estimation of causal parameters for case
control design I nested in a randomized
trial with unknown incidence probability.
In this section we address estimation of causal parameters when one does not
know q0 , but one knows the treatment mechanism. The estimators imply immediate analogues for observational case-control studies in which q0 ≈ 0, by
simply estimating the treatment mechanism based on the control observations only.
8.1
Randomized trial example.
In previous sections, in the case that q0 is known, we presented double robust
locally efficient estimators of causal parameters. In the special case that
g0∗ is known these estimators are guaranteed to always be consistent and
asymptotically linear.
The IPTW estimators presented in this section are appropriate in the
case that q0 is unknown and g0∗ is known, in which case the proposed IPTW
estimators of the causal relative risk and odds ratio are also always consistent
and asymptotically linear. In this first subsection we discuss a randomized
trial application.
Consider a randomized trial in which one samples i.i.d. Oi∗ = (Wi , Ai , Yi =
Yi (Ai )), i = 1, . . . , N , Ai ∈ {0, 1} is binary, and P0∗ (Ai = 1 | Xi ) = 0.5
(say). Suppose now that at baseline one has taken a tissue sample from
each patient which can be utilized to measure various markers of interest.
After having run the randomized trial one wishes to design a follow up study
in which one determines markers which are strong effect modifiers of the
effect of treatment. Let’s denote these J markers with Vj , j = 1, . . . , J.
For the sake of illustration suppose that the parameters targeted in such a
follow up study are defined by the linear marginal structural model E(Y (a) |
Vj ) = β0 + β1 a + β2 Vj + β3 aVj , where β3 defines the parameter representing
treatment effect modification of Vj . Such a study can be useful to determine
future phase III trials of interest, since biomarkers which are strong effect
modifiers might define sub-populations in which the treatment is particularly
effective. Similarly, one might wish to assume a logistic marginal structural
model.
Suppose now that it is actually very expensive to do this testing so that
60
http://biostats.bepress.com/ucbbiostat/paper234
one does not wish to run the biomarker assays on each patient, but on a
relatively small set of patients. However, the proportion of patients with
Yi = 1 among the N patients is small so that selecting a random sample of
the N patients for the biomarker follow up study would be a very ineffective
study.
Therefore, one might run a case-control study by randomly sampling a
case, and for each case one samples J controls, and one repeats this experiment n times to obtain a new sample consisting of n cases and nk controls.
This data set can now be represented as (V1i , W1i , A1i ) ∼ P (V, W, A | Y = 1),
(V0ij , W2ij , Aj2i ) ∼ P (V, W, A | Y = 0), j = 1, . . . , J, i = 1, . . . , n.
It should be remarked that an application of the IPTW-estimators presented in this section would ignore the controls in the N i.i.d observations
Oi∗ which are not sampled, while an efficient analysis would also use these
controls to improve efficiency (Molinaro et al. (2005)).
8.2
Marginal structural logistic regression models for
the causal odds ratio.
In this subsection we address estimation of the unknown parametes in a
marginal structural logistic regression model modelling the causal effect of
treatment on the odds-ratio scale. We have the following formal result.
Theorem 12 Consider a marginal structural logistic regression model
E(Ya | V ) = m(a, V | β0 ) =
1
,
1 + exp(−β0 C(a, V ))
for a vector-valued C(a, V ) with C(a, V )(0) = 1 so that β0 (0) denotes the
intercept. Let O∗ = (W, A, Y ) and let P0∗ be its distribution, and let g0∗ (a |
w) = P0∗ (A = a | W = w) be known.
Let O = ((W1 , A1 ), (W2 , A2 )) be the experimental unit generated by casecontrol design I, where, for simplicity, we consider the case that we have one
control for each case (i.e., J = 1).
Consider the following IPTW-estimating functions of O∗ for β0 indexed
by h:
h(A, V )
Dh∗ (β0 )(O∗ ) = ∗
(Y − m(A, V | β0 )).
g0 (A | W )
We note that P0∗ Dh (β0 ) = 0 for all h, under the assumption that supa h(a, V )/g0∗ (a |
W ) is bounded a.e.
61
Hosted by The Berkeley Electronic Press
Consider the following class of corresponding IPTW-estimating functions
of O for β0
Dh (β)(O) = Dh∗ (β)(W1 , A1 , 1) + Dh∗ (β)(W2 , A2 , 0).
If P0 Dh∗ (β0 ) = 0, then there exists a β00 with β00 (1, . . . , d) = β0 (1, . . . , d)
satisfying
P0 Dh (β00 ) = 0,
As a consequence, for case-control design I, one can estimate the nonintercept coefficients in the marginal structural logistic regression model with
weighted maximum likelihood estimation for the logistic regression of Y on
A, V according to the MSM model, using weights g0∗ (A1 | V1 )/g0∗ (A1 | W1 ) for
the cases and g0∗ (A2 | V2 )/g0∗ (A2 | W2 ) for the controls and further ignoring
the case-control sampling. That is, we can estimate β0 with the solution of
0=
n
X
Dh (βn )(Oi ),
i=1
or equivalently, we can set
βn = arg maxβ
g0∗ (A1i |V1i )
i g ∗ (A1i |W1i )
0
P
log mβ (V1i , A1i ) +
g0∗ (A2i |V2i )
g0∗ (A2i |W2i )
log(1 − mβ (V2i , A2i )).
In other words, one can fit the (odds-ratio part) of the marginal structural
logistic regression model with the IPTW estimator ignoring the case-control
sampling.
We note that this result naturally generalizes to data structures O∗ including a time-dependent treatment and marginal structural logistic regression
models modelling the causal effect of the multiple time point treatment on
the binary outcome.
This result is related and based on the same principles as the remark on
case-control studies in Robins (1999) in the context of direct effect estimation
and this IPTW-estimator is practically investigated in Mansson et al. (2007).
For the sake of completeness, we will now prove this result.
Proof of Theorem 12: Firstly, we note that replacing g0∗ (A | W ) by a g0∗ (A |
V ) in P0∗ corresponds with assuming that A is independent of (Y0 , Y1 ), given
V , and we denote the corresponding manipulated version of P0∗ with P0∗m .
We first define a correct estimation procedure under case-control sampling
from this manipulated population distribution P0∗m .
62
http://biostats.bepress.com/ucbbiostat/paper234
In that case, E0∗m (Ya | V ) = E0∗m (Y | A = a, V ). Since a standard
maximum likelihood estimator of the logistic regression model for E(Y |
A = a, V ) correctly estimates the odds-ratio part of E(Y | A, V ) it follows
that the marginal structural model can be fitted with standard maximum
likelihood estimation, and that the resulting estimator is consistent for the
non-intercept components of β0 . This maximum likelihood estimator solves
the estimating equation corresponding with the estimating function of the
form Dhm (β0 ) ≡ h(A, V )(Y − m(A, V | β0 )), where h = d/dβm/(m(1 − m)).
Thus, it follows that for each h P0∗m Dhm (β00 ) = 0 for a β00 which agrees with
the true β0 up till its intercept, and for this same β00 we have
0 = P0m h(A1 , V1 )(1 − m(A1 , V1 | β00 )) + h(A2 , V2 )(0 − m(A2 , V2 | β00 ))
P
= q0 h(a, v)(1 − m(a, v | β00 ))dP0∗m (w, a, 1)
P
+(1 − q0 ) h(a, v)(0 − m(a, v | β00 ))dP0∗m (w, a, 0).
In other words, the empirical summation i Dhm (β00 )(W1i , A1i , 1)+Dhm (β00 )(W2i , A2i , 0)
ignoring the case-control sampling has mean zero at this β00 under the manipulated case-control samping distribution P0m .
This basic latter identity will now be used to show that the Inverse Probability of Treatment Weighted (IPTW) estimating function Dh (β00 ) is unbiased
under P0 . Note that
P
g ∗ (A |V )
P0 g∗0(A11|W11 ) h(A1 , V1 )(1 − m(A1 , V1 | β00 ))
0
g ∗ (A |V )
+P0 g∗0(A22|W22 ) h(A2 , V2 )(0 − m(A2 , V2 | β00 )) =
0
g0∗ (a|v)
h(a, v)(1 − m(a, v | β00 ))dP0∗ (w, a, 1)
g0∗ (a|w)
R g∗ (a|v)
+(1 − q0 ) g∗0(a|w) h(a, v)(0 − m(a, v | β00 )dP0∗ (w, a, 0)
0
R
= q0 h(a,R v)(1 − m(a, v | β00 ))dP0∗m (w, a, 1)
+(1 − q0 ) h(a, v)(0 − m(a, v | β00 ))dP0∗m (w, a, 0)
q0
R
= 0.
by our previously established identity.
The more general result also follows by noting that the above result applies to each h. This completes the proof. 2
8.3
Case-only IPTW estimators for the causal relative
risk.
In this subsection we consider a simple estimator of a causal relative risk in
a nonparametric model.
63
Hosted by The Berkeley Electronic Press
We start out with proving the following theorem which is the basis of the
IPTW-estimator only using the cases, as presented below.
Theorem 13 Suppose D∗ (P0∗ ) = D∗ (ψ0 , η0∗ ) is an estimating function for
ψ0 with nuisance parameter η0∗ . Assume that
D∗ (ψ, η) = D1∗ (η) − ψ
for some D1∗ . Then ψ0 satisfying P0 Dq0 (ψ0 , η0 ) = 0, with Dq0 (O) = q0 D∗ (W1 , A1 , 1)+
j
j
1−q0 P
∗
j D (W0 , A0 , 0), is given by
J
ψ0 = P0 D1∗ (W2 , A2 , 0) + q0 P0 {D1∗ (W1 , A1 , 1) −
1X ∗ j j
D (W , A , 0)}.
J j 1 2 2
In particular, if D1∗ (W, A, 0) = 0 a.e., then
ψ0 = q0 P0 D1∗ (W1 , A1 , 1).
One can apply this theorem to EY1 and EY0 for D∗ satisfying D1∗ (W, A, 0) =
0. Consider ψ0 (1) = EY1 = EEP0∗ (Y | A = 1, W ) and consider the estimating
function D∗ (P0∗ )(W, A, Y ) = Y A/g0∗ (1 | W )−ψ0 (1). Then D(ψ0 (1), g0∗ )(W, A, 0) =
−ψ0 (1) so that it follows that
ψ0 (1) = q0 P0
A1
.
| W1 )
g0∗ (1
This yields the following identifiability result:
ψ0 (1)
EY (1)
A1
=
= P0 ∗
.
q0
q0
g0 (1 | W1 )
Note that this parameter represents a relative risk measure representing an
effect of an intervention relative to the current population proportion and
note that the identifiability result does not require knowing q0 . We refer to
the resulting estimator as the case-only IPTW estimator.
Similarly,
1 − A1
ψ0 (0) = q0 P0 ∗
,
g0 (0 | W1 )
giving us the identifiability result
ψ0 (0)/q0 = P0 (1 − A1 )/g0∗ (0 | W1 ).
64
http://biostats.bepress.com/ucbbiostat/paper234
Thus, if g0∗ is known, then one can identify the following causal parameters
of interest
E(Y (1) − Y (0))
ψ0 (1) − ψ0 (0)
=
q0
E ∗Y
ψ0 (1)
EY (1)
=
q0
E ∗Y
ψ0 (0)
EY (0)
=
.
q0
E ∗Y
Similarly, it follows that we can identify the causal relative risk ψ0RR ≡
ψ0 (1)/ψ0 (0):
P0 A1 /g0∗ (1 | W1 )
ψ0 (1)
=
.
ψRR0 ≡
ψ0 (0)
P0 (1 − A1 )/g0∗ (0 | W1 )
One can also identify the causal odds ratio:
ψOR0
ψ0 (1)/(1 − ψ0 (1))
P0 A1 /g0∗ (1 | W1 )/(1 − P0 A1 /g0∗ (1 | W1 ))
≡
=
,
ψ0 (0)/(1 − ψ0 (0))
P0 (1 − A1 )/g0∗ (0 | W1 )/(1 − P0 (1 − A1 )/g0∗ (0 | W1 ))
but estimation of the causal odds-ratio was addressed in the previous subsection with a preferred estimation strategy.
The corresponding case only IPTW estimators of these causal parameters
have robust influence curves even if q0 ≈ 0. For example, the influence curve
of the case-only IPTW estimator
Pn
I(A1i = 1)/g0∗ (1 | W1i )
∗
i=1 I(A1i = 0)/g0 (0 | W1i )
ψn = Pni=1
of the causal relative risk ψ0 (1)/ψ0 (0) = E(Y (1))/E(Y (0)) is given by
q0
IC1 (O) ≡
ψ0 (0)
(
)
I(A1 = 1) ψ0 (1) I(A1 = 0)
−
.
g0∗ (1 | W1 ) ψ0 (0) g0∗ (0 | W1 )
(11)
Note that this influence curve remains bounded for rare diseases as long as
q0 /ψ0 (0) remains bounded.
We note that if q0 ≈ 0, one can also estimate the causal relative risk
with the IPTW estimator of the saturated marginal structural logistic model
for P (Ya = 1), by using that the causal odds ratio approximates the causal
relative risk. We believe, if q0 ≈ 0, then the latter approach will result in a
more precise estimator.
65
Hosted by The Berkeley Electronic Press
8.4
Inverse probability of treatment weighted estimators of linear marginal structural models for rare
outcomes.
The previous theorem provided us for the case that q0 is unknown and g0∗
known with simple inverse probability of treatment weighted identifiability
results for the causal relative risk parameters. These IPTW estimators appeared to be relatively stable estimation procedures, even at small q0 . In
the case that q0 ≈ 0, but still unknown, these IPTW-estimators could be
extended by replacing g0∗ by a model based estimate based on the control
observations only.
We will now extend these results to linear marginal structural models by
using the following theorem.
Theorem 14 Consider an estimating function D∗ (ψ)(W, A, Y ) for a k-dimensional
parameter ψ satisfying
D∗ (ψ)(W, A, Y ) = Y D1∗ (A, W ) + D2∗ (A, W )ψ
(12)
for some k × 1 vector function D1∗ and k × k matrix function D2∗ . Let ψ0
solve P0∗ D∗ (ψ0 ) = 0. In this case,



J

1 − q0 X
0 = P0 q0 D∗ (ψ)(M1 , W1 , A1 , 1) +
D∗ (ψ)(M1 , W2j , Aj2 , 0)


J j=1
implies
ψ0
= {E0∗ D2 (W, A)}−1 E0 D1∗ (A1 , W1 ),
q0
where


E0∗ D2 (W, A) = E q0 D2 (W1 , A1 ) +

J
1 − q0 X
J
j=1


D2 (W2j , Aj2 ) .

The proof of Theorem 14 is straightforward and therefore omitted. This
theorem teaches us that by using estimating functions D∗ satisfying the mentioned structure (12), one can obtain closed estimators of ψ0 /q0 which are
stable even for small values of q0 : i.e., these estimators have bounded influence curve uniformly in q0 ≈ 0). In addition, one can obtain these estimators
without knowing the vale of q0 as long as one knows that q0 ≈ 0.
66
http://biostats.bepress.com/ucbbiostat/paper234
As an example, let’s consider a causal effect model describing how the
treatment specific mean changes as a function of treatment and an adjustment variable V :
E0∗ (Y (a) | V ) = β0> m(a, V ),
where, for example, m(a, V )> = (1, a, V, aV ). Let β0 denote the true vector
of values.
A least squares IPTW estimating function for β based on i.i.d sampling
of O∗ = (W, A, Y ) is given by
D∗ (β)(W, A, Y ) =
m(A, V )
(Y − β > m(A, V )).
g0∗ (A | W )
which satisfies that P0∗ D∗ (β0 ) = 0 if supa m(a, V )/g0∗ (a | W ) < ∞ a.e. In
general, we can choose
D∗ (β)(W, A, Y ) =
h(A, V )
(Y − β > m(A, V )),
g0∗ (A | W )
for arbitrary function h, where, for example, one can select
h(A, V ) =
β > m(A, V
m(A, V )
,
)(1 − β > m(A, V ))
which corresponds with weighted least squares. For simplicity, let’s consider
the case that h(A, V ) = m(A, V ).
We have that D∗ indeed satisfies the wished linear structure (12):
D∗ (β)(O∗ ) = Y
m(A, V )m> (A, V )
m(A, V )
−
β.
g0∗ (A | W )
g0∗ (A | W )
Thus, an application of Theorem 14 teaches us that
m(A, V )m> (A, V )
β0
= E0∗
q0
g0∗ (A | W )
(
)−1
E0
m(A1 , V1 )
,
g0∗ (A1 | W1 )
where the normalizing matrix
>
∗ m(A, V )m (A, V
E0
g0 (A | W )
(
c0 =
)
)
= E0∗
X
m(a, V )m> (a, V )
a
67
Hosted by The Berkeley Electronic Press
is identified as
E0 q0 a m(a, V1 )m> (a, V1 )
PJ P
j
j
>
0
+ 1−q
a m(a, V2 )m (a, V2 ).
j=1
J
P
Since the latter depends on q0 this is not really providing the wished identifiability result. But, if q0 ≈ 0, then the estimator βn can be defined as:
n
X
βn
m(A1i , V1i )
−1 1
= cn
,
q0
n i=1 g0∗ (A1i | W1i )
with
cn =
1
n
Pn
1
i=1 J
PJ
j=1
P
a
m(a, V2ij )m> (a, V2ij ),
which does not rely on knowing q0 .
These results can now be used to estimate a number of relative risk parameters of interest. We have E(Y1 | V ) = β0> m(1, V ), E(Y0 | V ) = β0> m(0, V ),
so that
E(Y1 | V ) − E(Y0 | V )
β > (m(1, V ) − m(0, V ))
β0> /q0 (m(1, V ) − m(0, V ))
= 0
=
.
E(Y0 | V )
β0> m(0, V )
β0> /q0 m(0, V )
Thus, this conditional relative causal effect of treatment versus control is a
simple function of β0 /q0 , and can thus be estimated as above.
Another possible relative risk parameter is
E(Y1 | V ) − E(Y0 | V )
β>
= 0 (m(1, V ) − m(0, V )).
EY
q0
8.5
Deriving the efficient influence curve for case-control
design I and unknown incidence probability
In this subsection we indicate how one would go about deriving an efficient
estimator in the model in which the incidence probability is unknown, but
g0∗ is known.
Consider the model M∗ only assuming that g0∗ is known, J = 1, and
case-control design I so that
1
Q∗ (W1 , A1 ))(1−Q∗ (W2 , A2 ))Q∗1 (W1 )Q∗1 (W2 )
− q(Q∗ , Q∗1 ))
(13)
∗
∗
is indexed by infinite dimensional parameters Q (a, w) = P (Y = 1 | A =
a, W = w) and Q∗1 (w) = P (W = w).
∗
dP,Q
∗ ,Q∗ (O) =
1
q(Q∗ , Q∗1 )(1
68
http://biostats.bepress.com/ucbbiostat/paper234
We stress that the following efficiency calculations apply to this particular
model in which g0∗ is known, and q0 is unknown (and thus not to the case
that q0 is known). We expect that the following can be easily generalized to
general J.
We first determine the so called tangent space for this case-control design
I model. We will first determine the score operator which maps the scores
h(Y | A, W ) (satisfying h(0 | A, W ) = Q∗ (A, W )/(1−Q∗ (A, W ))h(1 | A, W ))
and h1 (W ) (mean zero) of fluctuations Q∗ (Y | A, W ) and Q∗1 (W ) through
Q∗ (Y | A, W ) and Q∗ (W ) into the score of dPQ∗ ,Q∗1 . This score operator is
given by:
A∗ (h, h1 ) = h(1 | A1 , W1 ) −
Q∗ (A2 , W2 )
h(1 | A2 , W2 ) + h(W1 ) + h(W2 ),
1 − Q∗ (A2 , W2 )
where h(1 | A1 , W1 ) can be an arbitrary function with mean zero, and h is an
arbitrary functions of W with mean zero w.r.t. Q̄(W ) = Q1 (W ) + Q0 (W ),
where Q1 (w) = P (W = w | Y = 1), Q0 (w) = P (W = w | Y = 0).
Secondly, it is well known (Bickel et al. (1993)) that the efficient influence
curve of a parameter ψ0∗ of P0∗ , such as the marginal causal odds ratio in the
nonparametric model, can be obtained by projecting the influence curve IC1
of an initial regular asymptotically linear estimator of this parameter ψ0∗ ,
such as the IPTW estimator for the saturated logistic marginal structural
model ignoring the case-control sampling as above, onto the closure of the
range of the score operator A∗ in the Hilbert space L20 (P0 ) endowed with inner
product hV1 , V2 iP0 = EP0 V1 (O)V2 (O). This insight allows us in principle to
calculate the efficient influence curve, and thereby obtain a locally efficient
estimator improving on the initial estimator. Since the resulting calculations
are quite extensive, this is not reported here. Our intuition is that the IPTW
estimator for the logistic marginal structural model is highly efficient in the
case that q0 is unknown, since not knowing q0 makes it essentially impossible
(at q0 ≈ 0) to estimate E0∗ (Y | A, W ) which is needed to fully exploit the
covariates for efficiency gains, as in the double robust targeted MLE for the
case that q0 is known.
8.6
Discussion.
For the sake of discussion consider the case that g0∗ is known. We suggest
that the mentioned IPTW-estimators for the logistic and linear marginal
structural model are highly efficient in the model in which the incidence
69
Hosted by The Berkeley Electronic Press
probability is unknown. Based on that premise one needs to conclude that
somehow knowing q0 , even when it is known that it is very small q0 ≈ 0, can
truly help to gain efficiency relative to these IPTW estimators, while without
knowing q0 this gain in efficiency cannot be achieved. We believe that the
crucial reason is that it requires knowing q0 to convert a fit of a logistic
regression Q̃∗n (that does not suffer from q0 ≈ 0 for the sake of estimation of
the conditional odds-ratio), obtained with the maximum likelihood estimator
ignoring the case control sampling,
into a valid estimator of Q∗0 with standard
√
error proportional to q0 / n (uniform in q0 ≈ 0). In this manner one can
use covariate information, as utilized by fitting a logistic regression model
correctly targeting the conditional odds ratio and then adding the intercept
log c0 into the obtained logistic regression fit, to improve efficiency. Without
knowing q0 , an estimator of Q∗0 obtained by using the known g0∗ will have a
standard error which is too large since it will not be proportional to q0 .
8.7
Estimation of treatment mechanism.
Consider now the case that one does not know the treatment mechanism g0∗ ,
but that one is willing to assume a model for g0 . If q0 would be known,
then, for all these causal parameters, one would estimate g0∗ with weighted
maximum likelihood estimation: given a model G for g0∗
gn∗ = arg max
g∈G
n
X
q0 log g(A1i | W1i +
i=1
J
1 − q0 X
log g(Aj2i | W2ij ).
J j=1
Note that in the case that q0 ≈ 0, the case-control weighted maximum
likelihood estimator of g0∗ would essentially only use the control observations
with weight 1 and thus yield an estimate of the conditional distribution of
A2 , given W2 .
Inspection of possible bias in estimator of g0∗ due to only using
control observations. We note
P ∗ (Y = 0 | A = 1, W )
P0∗ (A = 1 | W, Y = 0) = P0∗ (A = 1 | W ) 0 ∗
P0 (Y = 0 | W )
∗
≡ P0 (A = 1 | W )r0 (1, W )−1 .
So we need that for q0 small the ratio
r0 (1, W ) =
P0∗ (Y = 0 | W )
≈ 1.
P0∗ (Y = 0 | A = 1, W )
70
http://biostats.bepress.com/ucbbiostat/paper234
This seems a weak assumption since one also knows that
EW P0∗ (Y = 0 | W ) = 1 − q0 ≈ 1 and EW P0∗ (Y = 1 | A = 1, W ) = O(q0 ),
assuming that g0∗ (1 | W ) is bounded away from zero and 1. Thus, one expects
that both numerator and denominator of r0 (1, W ) are close to 1 for almost
all W , and thereby that r0 ≈ 1. Having said this, it should theoretically be
possible to have certain integrals be close to 1 while another being very large,
so that strictly speaking it seems that we need to make an assumption in order
to guarantee that for q0 small the required integrals behave as expected.
For each specific parameter and corresponding IPTW-estimator, we can
generate more precise statements regarding this approximation of r0 (a, W )
in a context in which q0 → 0. For example, we wish to replace the causal
relative risk by the approximation
I(A1 =1)
E0 g∗ (1|W
1 ,Y =0)
0
I(A1 =0)
E0 g∗ (0|W
1 ,Y =0)
0
=
I(A1 =1)
E0 g∗ (1|W
1 )r0 (1,W1 )
0
I(A1 =0)
E0 g∗ (0|W
1 )r0 (0,W1 )
.
0
For example, if
r0 (a, W ) = 1 + r(q0 ),
for a constant (in W ) remainder term r(q0 ) which converges to zero when
q0 → 0, then under a weak regularity condition, it follows that
I(A1 =1)
E0 g∗ (1|W
1 ,Y =0)
0
I(A1 =0)
E0 g∗ (0|W
1 ,Y =0)
0
=
1 =1)
E0 gI(A
∗ (1|W )
1
0
1 =0)
E0 gI(A
∗ (0|W )
1
+ O(r(q0 )).
0
However, since r0 (a, W ) is averaged over W , one should only need that an
integral of the remainder r(q0 ) as a function of W is small (e.g., O(q0 )).
This does not seem to be a strong assumption at al given that an integral of
numerator and denominator of r(q0 ) have to be O(q0 ).
To conclude, in order to have that gn∗ based on control observations only
is a practically unbiased estimator of g0 when the disease studied is very rare,
P ∗ (Y =0|W )
we need that r0 (a, W ) = P ∗ (Y0 =0|A=a,W ) ≈ 1 for a ∈ {0, 1} for almost every
0
W . We note that this does allow that the treatment A = 1 increases the
risk on Y = 1 by a factor relative to A = 0. In other words, this condition
requires that for most W , P0∗ (Y = 1 | A = a, W ) is small (say of the order
q0 ).
71
Hosted by The Berkeley Electronic Press
9
Summary, discussion and simple extensions.
We provide a generic approach for locally efficient estimation such as targeted
maximum likelihood estimation of any parameter based on matched and unmatched case-control designs, which relies on specification of one or two
non-identifiable parameters/scalars q0 and, for matched case-control designs,
q0 (1 | m) = P0∗ (Y = 1 | M = m).
These non-identifiable parameters could be known or they could be set
in a sensitivity analysis, for example, in the case that these parameters are
known to be contained in a particular interval. Our approach is remarkably
simple since it only requires weighting the cases by q0 and the controls by
1 − q0 or q̄0 (M1 ) and then applying a method developed for prospective sampling. Moreover, our approach has the remarkable convenient feature that
applying the case-control weighting to an optimal method for the prospective sample results in an optimal method for independent and matched casecontrol designs.
We also showed how the case-control weighting for matched case-control
designs corresponds with applying the case-control weighting for the standard
unmatched case-control design for each sub-sample defined by a category for
the matching variable to obtain the analogue conditional parameter, conditional on the matching variable category, and subsequently averaging these
results over the matching variable categories to get the wished marginal parameter. This helps us to understand that our somewhat strange looking
weights for the control observations in a matched case-control study are actually just as sensible as the much easier to understand weights for standard
case-control designs.
We worked out the case-control weighted targeted maximum likelihood
estimators in a number of important applications involving estimation of
variable importance and causal effect parameters.
In addition, we showed for both types of case-control designs how standard maximum likelihood logistic regression fits can be adjusted by using
these known quantities to estimate conditional probabilities P0∗ (Y = 1 |
A, W ) with a standard error which is proportional to q0 divided by the square
root of the sample size, so that the acquired precision results in stable estimators of such challenging parameters as relative risk and odds-ratios at
q0 ≈ 0.
We believe that in most applications the marginal population proportion
of cases, q0 , should be known, at least within close approximation, assuming
72
http://biostats.bepress.com/ucbbiostat/paper234
one has made an effort to understand the target population the cases are
sampled from. In matched case-control studies in which one uses a matching
variable with a large number of categories, then the value of the population
proportion of cases within each matching category might not be known. In
that case, if the number of matching categories is large, a sensitivity analysis
would likely be too cumbersome. On the other hand, even for such matched
case-control samples, using the case-control weighting for design I might already provide an important bias reduction so that our methods only relying
on q0 will likely still provide a useful set of tools. Off course, this would
require some validation that ignoring the matching does not cause severe
bias.
During the design of a case-control study, we recommend to keep in mind
that knowing these population proportion of cases for each matching category make the convenient and double robust efficient estimation of any causal
effect and variable importance parameter possible (through the methods presented here) without restrictive assumptions such as the no-interaction assumption and parametric model form for conditional logistic regression models. This insight might help and motivate people to design case-control studies in which the required case-control weights are known or approximately
known so that a sensitivity analysis is possible.
Allthough the main purpose of our article is the introduction, study, and
application of the general methodology for analyzing case-control studies
based on a known (or set value in a sensitivity analysis) incidence probability,
for the sake of completeness, we also wanted to consider the case that the
population proportion of cases q0 is unknown in case-control design I. We
presented various IPTW-type estimators of causal parameters relying on q0 ≈
0 or that the treatment mechanism g0∗ (A | W ) is known. We also highlighted
the known result (Robins (1999) and Mansson et al. (2007)) showing that one
can estimate a marginal structural logistic regression model with standard
IPTW-logistic regression software, again either assuming g0∗ is known or that
q0 ≈ 0. These IPTW-estimators rely on a correctly specified model for g0∗
and require q0 ≈ 0. Since, without knowing q0 and without knowing g0∗ ,
the IPTW-estimators target a non-identifiable parameter, we are concerned
about the sensitivity of these IPTW estimators w.r.t. misspecifying g0∗ .
To summarize, , by knowing q0 , one has available more efficient and more
robust (i.e., double robust) targeted maximum likelihood estimators, targeting an identifiable parameter, and one does not have to restrict oneselves to
odds-ratio parameters.
73
Hosted by The Berkeley Electronic Press
We now consider a few direct extensions and applications of our methodology.
Frequency matching: Frequency matching in case-control studies is
typically defined as running a case-control design I within each strata M = m.
In this case one can estimate any causal parameter ψ0 (m) of the conditional
distribution of O∗ , given M = m, by assigning weights q0 (1 | m) to the cases
and q0 (0 | m)/J to the corresponding J controls. Thus our methods for casecontrol design I can be applied to each strata M = m. In particular, this
yields a locally efficient double robust targeted maximum likelihood estimator
of ψ0 (m) for each m. In order to estimate the marginal parameter ψ0 one
would need an estimate of the marginal distribution of M , which cannot be
identified based on knowing q0 (1 | m) only, so that other knowledge will be
needed such as the marginal population distribution of M . Either way, one
can always estimate causal parameters such as E(Ya | M = m) for each m or
the corresponding variable importance measure. If the number of categories
of the matching variable is large, then a sensible strategy for estimation of
ψ0 (m) is to assume a model ψ0 (m) = f (m | β0 ) and obtain a pooled locally
efficient targeted maximum likelihood of β0 based on all observations.
Pair matching: Pair matching in case-control studies is typically described as, for each matching category, sample a case and a set of controls.
So this description agrees with frequency matching except that the number
of categories can be very large. Therefore, we should now always assume
a model ψ0 (m) = f (m | β0 ) and obtain a pooled locally efficient targeted
maximum likelihood of β0 based on all observations.
Without the knowledge of q0 (1 | m), one would use conditional logistic
regression models, and, as noted in Jewell (2006) page 258, these methods
do not allow estimation of the association of M with Y , while if one knows
the population proportion q0 (1 | m) we can estimate every parameter of the
population distribution, conditional on M = m.
Counter matching: Finally, another type of matching in case-control
studies is called counter-matching, which involves sampling a control with an
exposure (maximally) different from the exposure of the case. Formally, we
can define this sampling scheme as follows. The observation O = ((M1 , Z1 ), (M2 , Z2 ))
on each experimental unit is generated as 1) sample (M1 , Z1 ) from the conditional distribution of (M, Z), given Y = 1, and 2) sample (M2 , Z2 ) from
the conditional distribution of (M, Z), given M = m∗ (M1 ) and Y = 0, where
m∗ (m) maps a particular outcome m into a counter-match m∗ (m) in the
outcome space for M . Similarly, this is defined for the case that one sam74
http://biostats.bepress.com/ucbbiostat/paper234
ples J controls counter-matched to the case. The population distribution
of interest is the distribution P0∗ of O∗ = (M, Z, Y ) and we are concerned
with estimation of a particular parameter ψ0∗ of this distribution P0∗ based on
a counter-matched case-control sample O1 , . . . , On . In this case, given that
D∗ (M, Z, Y ) satisfies P0∗ D∗ = 0, we have
E0 Dq0 ,q̄0∗ (O) = 0,
where the case-control weighted version of D∗ is defined as
Dq0 ,q̄0 (O) = q0 D∗ (M1 , Z1 , 1) + q̄0∗ (M )D∗ (m∗ (M1 ), Z2 , 0),
with
q̄0∗ (m) = (1 − q0 )
P0∗ (M = m∗ (m) | Y = 0)
.
P0∗ (M = m | Y = 1)
Note that if m∗ (m) = m is the identity function, then indeed q̄0∗ = q̄0 . The
non-identifiable component of the control-weight q̄0∗ is P0∗ (M = m∗ (m), Y =
0), or, assuming q0 is known, P0∗ (M = m∗ (m) | Y = 0), while the denominator P0∗ (M = m | Y = 1) = P0 (M1 = m) can be empirically estimated. Since
in many applications the control observations are relatively easily accessible,
one might use a separate sample of controls to estimate these proportions
P0∗ (M = · | Y = 0) having a certain value for the (counter-)matching variable M among the controls. So under the condition that these weights q0 , q̄0∗
are known (or set in a sensitivity analysis), our results in this article can be
applied to counter-matched case-control designs by just replacing q̄0 by q̄0∗ .
Propensity score matching design: A commonly used design is the
following. One samples from the units that received treatment. For each
treated unit, one finds a matched non-treated unit, where the matching is
done based on a fit of the so called propensity score. The goal of this design is
to create a sample in which the confounders are reasonably balanced between
the treated and untreated units. This design can formally be described as
follows. The random variable of interest is O∗ = (W, A, Y ) ∼ P0∗ , and one
is typically concerned with estimation of a causal effect such as E0∗ {E0∗ (Y |
A = 1, W ) − E0∗ (Y | A = 0, W )}. Let M ≡ Π∗ (W ) be a summary measure of
W which is supposedly an approximation of the propensity score Π∗0 (W ) =
P0 (A = 1 | W ) (e.g., estimated from external data). One samples (M1 =
Π∗ (W1 ), W1 , Y1 ) from the conditional distribution of (W, Y ), given A = 1,
and one samples one or more (M2 = Π∗ (W2 ), W2 , Y2 ) from the conditional
distribution of (W, Y ), given M = M1 and A = 0.
75
Hosted by The Berkeley Electronic Press
One now wishes to use n i.i.d. observations on the observed experimental
unit O = ((W1 , Y1 ), (W2j , Y2j : j)) representing a treated unit and one or more
propensity score matched untreated units to estimate the causal parameter
of interest.
Notice that we can immediately apply the methodology presented in this
article by defining the Y as the A and the matching variable M is playing
the role of Π∗ (W ). As a consequence, one can use any method developed for
sampling from (W, A, Y ) by using our ”case control” weights q0 = P0∗ (A = 1)
P ∗ (A=0|M )
for the treated units, and q̄0 (W ) = q0 P0∗ (A=1|M ) for the untreated units. Thus,
0
to correct for the biased sampling one will need to know the actual true
treatment mechanism/propensity score P0∗ (A = 1 | W ). Thus, under the
assumption that this propensity score is known or can be estimated based
on an external data source, one can apply any method for estimation of
the wished causal effect for standard sampling by applying these weights
to the treated and untreated units. Off course, for the sake of statistical
inference and model selection (say, based on cross-validation) one should
respect the fact that the independent and identically distributed observations
are O1 , . . . , On , and not the treated and untreated units.
General biased sampling: Finally, we like to discuss the implications
of the proposed optimal case-control weighting for general biased sampling
models with known probabilities for the conditioning events, where optimal
refers to the fact that the case-control weighting maps an efficient procedure
for an unbiased sample into an efficient procedure for the biased sample. The
following generalization of our method for case-control design I applies to general biased sampling. Consider a particular target probability distribution
P0∗ representing the unbiased sampling distribution and its corresponding
random variable O∗ ∼ P0∗ . Suppose now that the outcome space for the
random variable O∗ is partitioned by a union of events Aj , j = 1, . . . , J: i.e.
P r(O∗ ∈ ∪j Aj ) = 1 and the sets Aj are pairwise disjoint. Let the experimental unit for the observed data be (O1 , . . . , OJ ), where Oj ∼ O∗ | O∗ ∈ Aj
is a draw from the conditional distribution, given O∗ ∈ Aj , j = 1, . . . , J.
For simplicity, we enforced here equal number of draws, but this can be
generalized to having different number of draws from each conditional distribution. Let q0 (j) = P0∗ (O∗ ∈ Aj ) ∈ (0, 1) and suppose these probabilities
are known. Weighting observation Oj with q0 (j) for j = 1, . . . , J, and applying a method developed for the unbiased sample will yield valid estimators.
We also conjecture that under appropriate similar conditions as we assumed
76
http://biostats.bepress.com/ucbbiostat/paper234
for case-control sampling, this weighting will be optimal in the sense that assigning these weights to an efficient estimation procedure for i.i.d. samples of
P0∗ will yield an efficient estimation procedure based on the biased sampling
model. Given our interpretation of case-control weighting for matched casecontrol sampling in terms of case-control weighting for standard case-control
studies conditional on the matching category, we suggest that weighting for
matched case-control sampling can be generalized to matched biased sampling in general (say matched on a draw M1 from the first biased sampling
distribution).
Appendix: Tangent space results proving casecontrol weighted canonical gradient of prospective sampling model equals canonical gradient.
Our results in this section show that the case-control weighted canonical
gradient for the prospective sampling model M∗ yields the canonical gradient for the parameter of interest Ψ in the actual case-control sampling
model. These results rely on the following assumption. The (typically very
large/semiparametric) model M∗ corresponds with (i.e., equals the intersection of) separate models for P0∗ (W, A | Y = δ) for δ ∈ {0, 1} for case-control
design I, and, for case-control design II, M∗ corresponds with (i.e., equals the
intersection of) separate models for P0∗ (W, A | Y = δ, M = m) for δ ∈ {0, 1}
and m varying over the support of the matching variable M . As a consequence of this canonical gradient representation our proposed case-control
weighted targeted maximum likelihood estimator, involving selecting estimators of Q∗0 and g0∗ , under appropriate regularity conditions guaranteeing the
wished convergence to a normal limit distribution, is efficient if both of these
estimators are consistent, and remains consistent if one of these estimators
is consistent.
The results are stated in an incremental fashion thereby building up the
proof of the final wished result. As a consequence, most stated results do
not require a proof but can be straightforwardly verified.
Tangent space for case-control design I: We start out with presenting
the tangent space for case-control design I.
Theorem 15 (Tangent space for case-control design I) Consider case77
Hosted by The Berkeley Electronic Press
control design I and the independence model M described by (7),
dP (P ∗ )(O) = P ∗ (W1 , A1 | Y = 1)
Y
P ∗ (W2j , Aj2 | Y = 0),
j
and let T ∗ (P ∗ ) denote the tangent space at P ∗ in model M∗ . The tangent
space at P (P ∗ ) in model M is given by


TI (P ∗ ) = S ∗ (W1 , A1 , 1) − E ∗ (S ∗ | Y = 1) +
X


j

{S ∗ (W2j , Aj2 , 0) − E ∗ (S ∗ | Y = 0)} ,
where S ∗ varies across T ∗ (P ∗ ).
Since this tangent space is expressed in terms of the tangent space of the
underlying model M∗ we now need to understand the tangent space of M∗ .
The following theorem fully characterizes this tangent space for models M∗
described by separate models for P (W, A | Y = δ) for δ ∈ {0, 1}.
Theorem 16 (Tangent space for underlying model M∗ ) Consider the
data structure O∗ = (W, A, Y ) and model M∗ for its probability distribution.
We make the following assumption on M∗ : Let M∗ = ∩δ M∗ (δ), where
M∗ (δ) is a model for P0∗ (W, A | Y = δ) indexed by (possibly infinite dimensional) parameter θ(δ), for each δ ∈ {0, 1}, and assume that θ(δ) for different
choices of δ are variation independent parameters.
If the marginal distribution q0 (δ) = P (Y = δ) of Y is known in model
M∗ , then, we can represent T ∗ (P ∗ ) as
T ∗ (P ∗ ) =
X
Tδ∗ (P ∗ ),
(14)
δ
where the latter sum-space is an orthogonal sum, and Tδ∗ (P ∗ ) denotes the
tangent space generated by θ(δ), which can be represented as
Tδ∗ (P ∗ ) = {I(Y = δ) (S ∗ (W, A, δ) − E(S ∗ | Y )) : S ∗ ∈ T ∗ (P ∗ )} .
If q0 (δ) is unknown and modelled, then
T ∗ (P ∗ ) = L20 (PY∗ ) ⊕
X
Tδ∗ (P ∗ ),
(15)
δ
78
http://biostats.bepress.com/ucbbiostat/paper234
where L20 (PY∗ ) is the Hilbert space of functions of Y with mean zero and finite
variance w.r.t. P ∗ . We also note that for a S ∗ ∈ L20 (P ∗ ), the projection of
S ∗ on Tδ∗ (P ∗ ) is given by
Π(S ∗ | Tδ∗ (P ∗ )) = I(Y = δ) (S ∗ (W, A, δ) − E(S ∗ | Y )) ,
and the projection of S ∗ onto T ∗ (P ∗ ) described by the orthogonal decomposition (15) is given by
S ∗ = E(S ∗ | Y ) +
X
Π(S ∗ | Tδ∗ (P ∗ )).
δ
Tangent space for case-control design II: We now present the tangent space for matched case-control design II.
Theorem 17 (Tangent space for case-control design II) Consider casecontrol design II and the independence model M described by (8),
dP (P ∗ )(O) = P ∗ (M1 )P ∗ (A1 , W1 | Y = 1, M1 )
Y
P ∗ (Aj2 , W2j | Y = 0, M1 ),
j
and let T ∗ (P ∗ ) denote the tangent space at P ∗ in model M∗ . The tangent
space at P (P ∗ ) in model M is given by
∗
2
T
nII (P ) = L0 (M1 )⊕
o
P
S ∗ (Z1 , 1) − E ∗ (S ∗ | M = M1 , Y = 1) + j {S ∗ (Z2j , 0) − E ∗ (S ∗ | M = M1 , Y = 0)} ,
where S ∗ varies across T ∗ (P ∗ ), Z1 = (M1 , W1 , A1 ) and Z2j = (M1 , W2j , Aj2 ).
Since this tangent space is characterized in terms of the underlying tangent
space T ∗ (P ∗ ) for model M∗ we now fully characterize the latter tangent space
for models M∗ described by separate models for P ∗ (W, A | M = m, Y = δ)
for the different values of m and δ.
Theorem 18 (Tangent space for model M∗ including matching variable)
We make the following assumption on M∗ : Suppose that M∗ = ∩m,δ M∗ (m, δ),
where M∗ (m, δ) is a model for P0∗ (W, A | M = m, Y = δ) indexed by (e.g.,
infinite dimensional) parameter θ(m, δ), for each δ ∈ {0, 1} and possible outcome m for M , and it is assumed that θ(m, δ) are variation independent
parameters.
79
Hosted by The Berkeley Electronic Press
If q0 (δ | m) = P (Y = δ | M = m) is known and the marginal distribution
of M is unspecified in model M∗ , then, we can represent T ∗ (P ∗ ) as
T ∗ (P ∗ ) = L20 (M ) ⊕
X
∗
Tm,δ
(P ∗ ),
(16)
m,δ
∗
where the latter sum-space is an orthogonal sum, and Tm,δ
(P ∗ ) denotes the
tangent space generated by θ(m, δ), which can be represented as
∗
Tm,δ
(P ∗ ) = {I(M = m, Y = δ) (S ∗ (m, W, A, δ) − E(S ∗ | M, Y )) : S ∗ ∈ T ∗ (P ∗ )} .
If the conditional distribution q0 (δ | m) of Y , given M , is unknown and
modeled, then
∗
T ∗ (P ∗ ) = L20 (PM
) ⊕ T ∗ (q0 ) ⊕
X
∗
Tm,δ
(P ∗ ),
(17)
m,δ
where T ∗ (q0 ) denotes the tangent space generated by the scores of the parameters of q0 (δ | m). We also note that for a S ∗ ∈ L20 (P ∗ ), the projection onto
∗
Tm,δ
(P ∗ ) is given by
∗
Π(S ∗ | Tm,δ
(P ∗ )) = I(M = m, Y = δ) (S ∗ (m, W, A, δ) − E(S ∗ | M, Y )) ,
and, under the assumption that q0 (δ | m) is unspecified, the projection of S ∗
onto T ∗ (P ∗ ) described by the orthogonal decomposition (17) is given by
S ∗ = E(S ∗ | M ) + {E(S ∗ | Y, M ) − E(S ∗ | M )} +
X
∗
Π(S ∗ | Tm,δ
(P ∗ )).
m,δ
Special score for case-control design I: We will later show that the
case-control weighted canonical gradient is in the tangent space TI (P ∗ ) by selecting a special choice S ∗ ∈ T ∗ (P ∗ ) defined in the next result. The following
result shows that this special choice is indeed a member of T ∗ (P ∗ ).
Result 1 Let O∗ = (W, A, Y ) ∼ P0∗ ∈ M∗ and assume that the tangent
space T ∗ (P ∗ ) at P ∗ ∈ M∗ is given by orthogonal decomposition (15). Given
a D∗ ∈ T ∗ (P ∗ ), we have
S ∗ (W, A, Y ) = q0 (Y ) {D∗ (W, A, Y ) − E ∗ (D∗ | Y )}
∈ T ∗ (P ∗ ).
The same applies if q0 (0) is replaced by q0 (0)/J.
80
http://biostats.bepress.com/ucbbiostat/paper234
Proof. Firstly, we note that for each δ, Π(D∗ | Tδ (P ∗ )) ∈ T ∗ (P ∗ ), and by
linearity of the space Tδ (P ∗ ) (i.e., closure under multiplication by scalar) we
have that q0 (δ)Π(D∗ | Tδ∗ (P ∗ )) ∈ T ∗ (P ∗ ). By linearity of T ∗ (P ∗ ), it follows
thus that
| Tδ∗ (P ∗ ))
= δ q0 (δ)I(Y = δ) (D∗ (W, A, δ) − E ∗ (D∗ | Y ))
= q0 (Y ) (D∗ (W, A, Y ) − E ∗ (D∗ | Y ))
= S ∗ (W, A, Y )
∈ T ∗ (P ∗ ).
P
q0 (δ)Π(D
δP
∗
This completes the proof. 2
Special score for case-control design II: For case-control design II,
we need a similar result.
Result 2 Consider the model O∗ = (M, W, A, Y ) ∼ P0∗ ∈ M∗ and let
T ∗ (P ∗ ) denote the tangent space at P ∗ ∈ M∗ and assume it satisfies orthogonal decomposition (17). Given a D∗ ∈ T ∗ (P ∗ ), we have
∗
Sm
(M, W, A, Y ) ≡ I(M = m)q0 (Y | m) {D∗ (m, W, A, Y ) − E ∗ (D∗ | M, Y )}
∈ T ∗ (P ∗ ).
(18)
The same result applies if we replace q0 (0 | m) by q0 (0 | m)/J.
Proof. Firstly, we note that for each m, δ, Π(D∗ | Tm,δ (P ∗ )) ∈ T ∗ (P ∗ ),
and by linearity of the space Tm,δ (P ∗ ) (i.e., closure under multiplication by
∗
scalar) we have that q0 (δ | m)Π(D∗ | Tm,δ
(P ∗ )) ∈ T ∗ (P ∗ ). By linearity of
∗
∗
T (P ), it follows thus that
∗
q (δ | m)Π(D∗ | Tm,δ
(P ∗ ))
= δ q0 (δ | m)I(M = m, Y = δ) (D∗ (m, W, A, δ) − E ∗ (D∗ | M, Y ))
= I(M = m)q0 (Y | m) (D∗ (m, W, A, Y ) − E ∗ (D∗ | M, Y ))
∗
= Sm
(M, W, A, Y )
∗
∈ T (P ∗ ).
P
P0
δ
This completes the proof. 2
Case-control weighted score equals a score, case-control design
I: We are now ready to establish our wished results showing that the casecontrol weighted canonical gradient of the prospective sampling model is an
element of the tangent space for the observed data model M.
81
Hosted by The Berkeley Electronic Press
Theorem 19 (Case-control weighted score is a score, Design I)
Consider case-control design I, its independence model M described by
(7), and assume the tangent space T ∗ (P ∗ ) of M∗ at P ∗ satisfies the orthogonal decomposition (15).
If D∗ ∈ T ∗ (P ∗ ), then
Dq0 (O) = q0 D∗ (W1 , A1 , 1) +
(1 − q0 ) X ∗ j j
D (W2 , A2 , 0) ∈ TI (P ∗ ).
J
j
Specifically, if we set
S ∗ (W, A, Y ) = q0 (Y ) {D∗ (W, A, Y ) − E ∗ (D∗ | Y )} ∈ T ∗ (P ∗ ),
where q0 (Y ) = I(Y = 1)q0 + I(Y = 0)(1 − q0 )/J, then
Dq0 (O) = S ∗ (W1 , A1 , 1) − E ∗ (S ∗ (W, A, Y ) | Y = 1)
X
+ {S ∗ (W2j , Aj2 , 0) − E ∗ (S ∗ (W, A, Y ) | Y = 0)}.
j
(Here, we use the fact for J = 1, E ∗ (S ∗ | Y = 1) + E ∗ (S ∗ | Y = 0) = 0.)
This establishes the wished corollary stating that the case-control weighted
canonical gradient for the prospective sampling model yields the canonical
gradient for the case-control sampling model M.
Corollary 1 Consider case-control design I, its independence model M described by (7), and assume the tangent space T ∗ (P ∗ ) of M∗ at P ∗ satisfies
the orthogonal decomposition (15).
Suppose that D∗ (P ∗ ) is the canonical gradient of Ψ∗ : M∗ → IRd Ψ∗ , and
let Ψ : M → IRd at P (P ∗ ) ∈ M, satisfy Ψ(P (P ∗ )) = Ψ∗ (P ∗ ).
Assume that the corresponding case-control weighted Dq0 (satisfies the
regularity conditions such that it) is a gradient for Ψ at P (P ∗ ). Then Dq0 is
the canonical gradient of Ψ at P (P ∗ ).
Case-control weighted score is a score, Case-Control Design II:
We establish the same type result for case-control design II.
Theorem 20 (Case-control weighted score is a score, Design II)
Consider case-control design II, its independence model M described by
(8), and assume the tangent space T ∗ (P ∗ ) of M∗ at P ∗ satisfies the orthogonal decomposition (17).
82
http://biostats.bepress.com/ucbbiostat/paper234
For any D∗ ∈ L2 (P ∗ ), we have
Dq0 ,q̄0 (O) ≡ q0 D∗ (M1 , W1 , A1 , 1) + q̄0 (M1 )
=
X
m
1X ∗
D (M1 , W2j , Aj2 , 0)
J j
q0
∗
I(M1 = m)Dm,q
,
0
q0 (1 | m)
where
∗
Dm,q
(O) ≡ q0 (1 | m)D∗ (m, W1 , A1 , 1) +
0
q0 (0 | m) ∗
D (m, W2j , Aj2 , 0).
J
For each m, and D∗ ∈ T ∗ (P ∗ ), we have
∗
I(M1 = m)Dm,q
∈ TII (P ∗ )
0
so that it follows that
Dq0 ,q̄0 (P ∗ ) ∈ TII (P ∗ ).
Let q0J (δ | m) = q0 (1 | m)δ + (1 − δ)q0 (0 | m)/J. Specifically, if we set
∗
Sm
(M, W, A, Y ) = I(M = m)q0J (Y | m) {D∗ (m, W, A, Y ) − E ∗ (D∗ | M, Y )} ,
which is an element of T ∗ (P ∗ ) by (18) above, then
∗
∗
∗
I(M1 = m)Dm,q
(O) = Sm
(M1 , W1 , A1 , 1) − E ∗ (Sm
| M, Y = 1)
0
+
∗
{Sm
(M1 , W2j , Aj2 , 0) − E(S ∗ | M, Y = 0)}
X
j
∈ TII (P ∗ ).
Here we use that for any D∗ ∈ L20 (P ∗ ),
q0 (1 | m)E ∗ (D∗ | M = m, Y = 1) + q0 (0 | m)E(D∗ | M = m, Y = 0) = 0.
This gives us the wished result.
Corollary 2 (Case-control weighted canonical gradient is a canonical gradient, Design II)
Consider case-control design II, its independence model M described by
(8), and assume the tangent space T ∗ (P ∗ ) of M∗ at P ∗ satisfies the orthogonal decomposition (17).
83
Hosted by The Berkeley Electronic Press
If D∗ (P ∗ ) is the canonical gradient of Ψ∗ : M∗ → IRd at P ∗ , then
q0
∗
I(M1 = m)Dm,q
0
q
(1
|
m)
m 0
∈ TII (P ∗ ).
Dq0 ,q̄0 ≡
X
Thus, under the conditions for which which Dq0 ,q̄0 (P ∗ ) is a gradient of
Ψ : M → IRd at P (P ∗ ) ∈ M, satisfying Ψ(P (P ∗ )) = Ψ∗ (P ∗ ) for specified
parameter Ψ∗ : M∗ → IRd , we also have that Dq0 ,q̄0 (P ∗ ) is the canonical
gradient of Ψ at P (P ∗ ).
Appendix: Efficient influence curve of marginal
causal effect in nonparametric model for casecontrol design I
In this section we establish that the efficient influence curve of the marginal
causal effects defined on a nonparametric model M∗ and indexed by a fixed
known q0 for case-control design I can be represented as a case-control weighted
Dq0 = q0 D∗ (·, 1) + (1 − q0 )D∗ (·, 0), with D∗ being the efficient influence curve
of the marginal causal effect defined on the nonparametric model M∗ (and
not indexed by q0 ).
Our general theorems 4 and 7 teaches us that this should indeed be true
but with D∗ being the efficient influence curve of the marginal causal effect
defined as Ψ∗q0 indexed by a fixed q0 , which can be shown as well (and follows
from Theorem 9).
Theorem 21 (Efficient influence curve for case control design I)
Consider Case Control Design I with data structure O = ((W1 , A1 ), ((W2j , Aj2 ) :
j)), where (W1 , A1 ) has distribution Q1 ∼ (W, A) | Y = 1 and (W2 , A2 ) has
distribution Q0 ∼ (W, A) | Y = 0. Let Q1 (w, a) = P (W1 = w, A1 = a) and
let Q0 (w, a) = P (W2 = w, A2 = a). Similarly, we define Q1 (w) = P (W1 =
w) and Q0 (w) = P (W2 = w). We also define Q∗ (a, W ) = P ∗ (Y = 1 | A =
a, W ), Q∗W (w) = P ∗ (W = w). Let J = 1.
Consider the working model that (W1 , A1 ) is independent of (W2 , A2 ), and
no further assumptions, so that the likelihood is simply
p0 (O) = Q1 (W1 , A1 )Q0 (W2 , A2 ),
84
http://biostats.bepress.com/ucbbiostat/paper234
and Q1 and Q2 are unspecified. Then, under appropriate regularity conditions, the nonparametric maximum likelihood estimator, defined by the empirical distributions of Q1n and Q0n of Q1 and Q0 , of the parameter Ψq0 (P0 )(1)
defined by
P0 (Y1 = 1) = EW E(Y | A = 1, W )
X
=
Q∗ (w, 1)Q∗W (w)
w
Q1 (W, 1)q0
{q0 Q1 (w) + (1 − q0 )Q0 (w)}
w Q1 (W, 1)q0 + Q0 (W, 1)(1 − q0 )
≡ Ψq0 (P0 )
=
X
is regular and asymptotically linear with influence curve
Dq0 (P0 )(O) = q0 D∗ (P0∗ )(W1 , A1 , 1) + (1 − q0 )D∗ (P0∗ )(W2 , A2 , 0)
with D∗ (P0∗ )(W, A, Y ) = (Y − Q∗0 (1, W ))I(A = 1)/g0∗ (1 | W ) + Q∗0 (1, W ) −
Ψ(Q∗0 ), Q∗0 (a, W ) = P0∗ (Y = 1 | A = a, W ) and g0∗ (a | W ) = P0∗ (A = a | W ).
This shows that in this independence model the efficient influence curve
is given by Dq0 (P0 ).
Since, also under dependence of (W1 , A1 ) and (W2 , A2 ), this nonparametric NPMLE is a regular consistent and asymptotically linear estimator
of Ψq0 (P0 ), it follows that Dq0 (P0 ) is also the efficient influence curve in the
model in which one allows dependence between (W1 , A1 ) and (W2 , A2 ) (as long
as they have the specified marginal distributions Q1 and Q0 , respectively).
In general, for Case Control I, we have that the efficient influence curve
for this independence or the bigger arbitrary dependence model (or any model
in between) is given by
Dq0 (P0 )(O) = q0 D
∗
(P0∗ )(W1 , A1 , 1)
J
1X
D∗ (P0∗ )(W2j , Aj2 , 0)
+ (1 − q0 )
J j=1
Proof. We provide the proof for J = 1. Let Q1n (w, a) = 1/n ni=1 I(W1i =
P
w, A1i = a), Q0n = 1/n ni=1 I(W2i = w, A2i = a) be the empirical distriP
butions of Q1 and Q0 . Similarly, let Q11n (w) = 1/n ni=1 I(W1i = w) and
P
Q10n (w) = 1/n ni=1 I(W2i = w) be the marginal empirical distributions of
Q11 and Q10 .
P
85
Hosted by The Berkeley Electronic Press
Note the NPMLE of ψ0
ψn = f (Q10n , Q11n , Q1n , Q0n )
≡
X
{q0 Q1n (w) + (1 − q0 )Q0n (w)}
w
Q1n (w, 1)q0
.
Q1n (w, 1)q0 + Q0n (w, 1)(1 − q0 )
We also have ψ0 = f (Q10 , Q11 , Q1 , Q0 ). Thus, the first order linear expansion
of ψn is given by:
ψn − ψ0 ≈ df (Q10 , Q11 , Q1 , Q0 )(Q10n − Q10 , Q11n − Q11 , Q1n − Q1 , Q0n − Q0 ),
which provides the influence curve of ψn as an estimator of ψ0 . Thus, we
should first determine these derivative of f . We define Q̄(w) = q0 Q1 (w) +
(1 − q0 )Q0 (w) and Q̄(w, 1) = q0 Q1 (w, 1) + (1 − q0 )Q0 (w, 1). It follows
ψn − ψ0 ≈
X Q(w, 1)q02
(Q1n − Q1 )(w)
Q̄(w, 1)
X Q1 (w, 1)q0 (1 − q0 )
(Q0n − Q0 )(w)
+
Q̄(w, 1)
w
X
q0
(Q1n − Q1 )(w, 1)
+
Q̄(w)
Q̄(w, 1)
w
X
q 2 Q1 (w, 1)
−
Q̄(w) 0 2
(Q1n − Q1 )(w, 1)
Q̄ (w, 1)
w
X
q0 (1 − q0 )Q1 (w, 1)
Q̄(w)
−
(Q0n − Q0 )(w, 1).
Q̄2 (w, 1)
w
w
Substitution of the empirical distributions for a single observation Oi results
in the wished influence curve
D =
q02 Q1 (W1 , 1) Q1 (W2 , 1)q0 (1 − q0 )
+
Q̄(W1 , 1)
Q̄(W2 , 1)
q0 Q̄(W1 )I(A1 = 1) q02 Q̄(W1 )I(A1 = 1)Q1 (W1 , 1)
+
−
Q̄(W1 , 1)
Q̄(W1 , 1)2
q0 (1 − q0 )Q̄(W2 )I(A2 = 1)Q1 (W2 , 1)
−
−c
Q̄(W2 , 1)2
where c denotes the constant guaranteeing that D has mean zero. Now, we
use the following identities:
1
Q̄(w)
= ∗
Q̄(w, 1)
g0 (1 | w)
86
http://biostats.bepress.com/ucbbiostat/paper234
Q1 (w, 1)q0
= Q∗ (1, w) = P ∗ (Y = 1 | A = 1, W = w).
Q̄(w, 1)
With these identities it follows
I(A1 = 1)
(1 − Q∗ (1, W1 ) + q0 Q∗ (1, W1 )
g0∗ (1 | W1 )
I(A2 = 1)
+(1 − q0 ) ∗
(0 − Q∗ (1, W2 ) + (1 − q0 )Q∗ (1, W2 ) − c.
g0 (1 | W2 )
D = q0
Since D needs to have mean zero under P0 it follows that c = ψ0 . 2
9.1
Derivation of influence curve of estimator of causal
parameter for case-control Design I based on saturated logistic regression model.
Consider the standard logistic regression estimator Q̃∗n ignoring the casecontrol sampling for a saturated logistic regression model and its corresponding estimator ψn (a) = EQ∗W,n Q∗q0 ,n (a, W ) of ψ0 (a) = E0 Ya . For the sake of
presentation, we consider the simple case with c = 1, d¯ = 1. The proof is
¯
trivially generalized to general c, d.
∗
Firstly, we note that Q̃n solves the score equations
0=
X
i
J
1X
Q̃∗n
h(W1i , A1i ) −
(W2ij , Aj2i )h(W2ij , Aj2i )
∗
J j=1 1 − Q̃n
indexed by arbitrary functions h(W, A).
Now, set h(W, A) = Iw,a (W, A) be equal to the indicator that W = w and
A = a and we can choose w, a arbitrary. This gives us the equations
Q̃∗n (w, a)
Q1n (w, a)
,
=
Q0n (w, a)
1 − Q̃∗n (w, a)
j
1
where Q1n (w, a) = n1 i I(W1i = w, A1i = a) and Q0n (w, a) = nJ
j
i I(W2i =
w, Aj2i = a) are the empirical distributions of the cases and controls, respectively. It is of interest to note that this empirical relation corresponds with
the true known relation
P
P P
Q∗0
1 − q0 Q1 (w, a)
(w, a) =
.
∗
1 − Q0
q0 Q0 (w, a)
87
Hosted by The Berkeley Electronic Press
Thus,
ψn (a) =
Z
w
c0 Q1n /Q0n (w, a)
{q0 Q1n (w) + (1 − q0 )Q0n (w)} ≡ Φ(Q1n , Q0n ).
1 + c0 Q1n /Q0n (w, a)
We also have
ψ0 (a) =
Z
w
c0 Q1 /Q0 (w, a)
{q0 Q1 (w) + (1 − q0 )Q0 (w)} ≡ Φ(Q1 , Q0 ).
1 + c0 Q1 /Q0 (w, a)
Note that the function f (x) = c0 x/(1 + c0 x) has derivative c0 /(1 + c0 x)2 .
So
c0 Q1n /Q0n
c0 Q1 /Q0
−
≈ c0 /(1 + c0 Q1 /Q0 )2 (Q1n /Q0n − Q1 /Q0 )
1 + c0 Q1n /Q0n 1 + c0 Q1 /Q0
(
)
c0
1
Q1
=
(Q1n − Q1 ) − 2 (Q0n − Q0 ).
(1 + c0 Q1 /Q0 )2 Q0
Q0
Thus,
Φ(Q1n , Q0n ) − Φ(Q1 , Q0 ) ≈
+
c0 Q1 /Q0 (w,a)
w 1+c0 Q1 /Q0 (w,a)
R
c0
w (1+c0 Q1 /Q0 )2
R
n
1
(Q1n
Q0
− Q1 ) −
Q1
(Q0n
Q20
o
− Q0 ) Q̄(w)
{q0 (Q1n − Q1 )(w) + (1 − q0 )(Q0n − Q0 )(w)} ,
where we used the notation Q̄(w) = q0 Q1 (w) + (1 − q0 )Q0 (w). Therefore, the
influence curve of the estimator Φ(Q1n , Q0n ) of Φ(Q1 , Q0 ) is given by
ICa (O) =
Q̄(W1 )
c0
(W1 , a)
I(A1 = a)
2
(1 + c0 Q1 /Q0 )
Q0 (W1 , a)
Q̄(W2j )Q1 (W2j , a)
1X
c0
j
,
a)
(W
−
I(Aj2 = a)
j
2
J j (1 + c0 Q1 /Q0 )2 2
Q0 (W2 , a)
+q0
c0 Q1 /Q0
1 X c0 Q1 /Q0
(W1 , a) + (1 − q0 )
(W j , a) − ψ0 (a).
1 + c0 Q1 /Q0
J j 1 + c0 Q1 /Q0 2
Let R0∗ ≡ Q1 /Q0 which is estimated with the logistic regression fit Rn∗ =
Q̃∗n /(1 − Q̃∗n ). Let K0 (W, A) ≡ Q̄(W )/Q0 (W, A). Note,
c0
ICa (R0∗ , K0 , ψ0 (a))(O) =
(W1 , a)K0 (W1 , a)I(A1 = a)
(1 + c0 R0∗ )2
1X
c0 R0∗
(W2j , a)K0 (W2j , a)I(Aj2 = a)
−
∗ 2
J j (1 + c0 R0 )
+q0
c0 R0∗
1X
c0 R0∗
(W
,
a)
+
(1
−
q
)
(W j , a) − ψ0 (a).
1
0
1 + c0 R0∗
J j
1 + c0 R0∗ 2
88
http://biostats.bepress.com/ucbbiostat/paper234
It follows immediately that
E0 IC(R0 , K, ψ0 (a)) = 0 for all K.
We will now use another representation of the influence curve ICa which
establishes the wished double robustness. We use
Q∗0 = c0 R0∗ /(1 + c0 R0∗ )
c0 R0∗ = Q∗0 /(1 − Q∗0 )
1 − q0
K0 (w, a) =
.
∗
(1 − Q0 (w, a))g0∗ (a | w)
We have
I(A1 = a)
(1 − Q∗0 (W1 , a))
∗
g0 (a | W1 )
I(Aj = a) ∗ j
1X
Q0 (W2 , a)
(1 − q0 ) ∗ 2
−
J j
g0 (a | W2j )
1X
+q0 Q∗0 (W1 , a) +
(1 − q0 )Q∗0 (W2j , a) − ψ0 (a).
J j
ICa (Q∗0 , g0∗ , ψ0 (a)) = q0
The double robustness for this representation can now be stated as
E0 ICa (Q∗ , g ∗ , ψ0 (a)) = 0 if either g ∗ = g0∗ or Q∗ = Q∗0 ,
and in both cases we need that g ∗ (1 | W ) > 0 a.e.
Appendix: Derivation of influence curve of particular nonparametric maximum likelihood estimator of causal relative risk for case-control
design II.
In this section of the Appendix we consider a causal parameter specified in
terms of r0 (m) = P0∗ (Y = 0, M = m), assuming r0 (m) is known. We note
that our influence curve results for matched-case-control designs relied on
q0 (1 | m) being known instead of r0 (m) being known. As a consequence, one
now anticipates an additional component (beyond the case-control weighted
89
Hosted by The Berkeley Electronic Press
influence curve) to the influence curve due to the estimation of q0 (1 | m).
Indeed, below we derive the influence curve of the nonparametric maximum
likelihood estimator and show it involves now an additional term only being
a function of M1 .
To start with we derive the identifiability result for EY1 and subsequently
we define the corresponding nonparametric maximum likelihood estimator
and derive its influence curve.
We define Q1 (a, m, w) ≡ P0∗ (M = m, W = w, A = a | Y = 1), and
Q0 (a, w | m) = P0∗ (A = a, W = w | M = m, Y = 0)
P0 (M2 = m, W2 = w, A2 = a)
=
P0 (M = m)
Q0 (a, w, m)
≡
.
Q0 (m)
We also define Q0 (w | m) = P (W = w | M = m, Y = 0), and, we have
Q0 (w | m) = Q0 (w, m)/Q0 (m). Let r0 (m) = P0 (Y = 0, M = m). We wish
to establish an identifiability result of P (Y1 = 1) from the distribution P0 of
O: that is, we wish to write P (Y1 = 1) as a function of Q0 and Q1 . Firstly,
we note
P (A = 1, M, W | Y = 1)q0
.
P (Y1 = 1) = EM,W P (Y = 1 | A = 1, M, W ) = EW
P (A = 1, M, W )
Secondly, we note that
P (A = 1, M = m, W = w) = P (A = 1, M = m, W = w | Y = 1)q0
+P (A = 1, W = w | Y = 0, M = m)P (Y = 0, M = m).
Finally, we have that
P (M = m, W = w) = q0 p(M = m, W = w | Y = 1)
+P (Y = 0, M = m)p(W = w | M = m, Y = 0).
Thus, we have shown that
ψ0 (1) = P (Y1 = 1)
=
X
(q0 Q1 (m, w) + r0 (m)Q0 (w | m))
m,w
≡
X
Q̄(m, w)
m,w
Q1 (1, m, w)q0
q0 Q1 (1, m, w) + r0 (m)Q0 (1, w | m)
q0 Q1 (1, m, w)
Q̄(1, m, w)
= f (Q1 , Q0 ).
90
http://biostats.bepress.com/ucbbiostat/paper234
We note that this is an identifiability result relying on knowing r0 (m) and q0
instead of q̄0 (m) and q0 .
P
Let Q1n (a, w, m) = n1 ni=1 I(A1i = a, M1i = m, W1i = w),
Q0n (a, w, m) =
Q0n (m) =
n
1X
I(A2i = a, M1i = m, W2i = w)
n i=1
n
1X
I(M1i = m),
n i=1
and Q0n (a, w | m) = Q0n (a, w, m)/Q0n (m). The nonparametric maximum
likelihood estimator for the likelihood
p0 (O) = Q1 (A1 , M1 , W1 )Q0 (A2 , W2 | M1 ),
is given by these empirical distribution functions, and the corresponding
nonparametric maximum likelihood estimator of EY1 is thus
ψn (1) = f (Q0n , Q1n ).
In order to determine the efficient influence curve of EY1 , we will derive the
influence curve of the nonparametric maximum likelihood estimator ψn (1) as
an estimator of ψ0 (1).
We will determine the derivatives of f as a function of Q11 (w, m), Q10 (w, m),
Q1 (1, w, m), Q0 (1, w, m). We note
(a,w,m)
Q0n (a, w | m) − Q0 (a, w | m) = Q0n
− Q0Q(a,w,m)
Q0n (m)
0 (m)
(Q
(m)
− Q0 (m)
≈ Q01(m) (Q0n − Q0 )(a, w, m) − QQ0 (a,w,m)
0n
2
0 (m)
1 Pn
1
= n i=1 Q0 (m) {I(A2i = a, W2i = w, M2i = m) − Q0 (a, w, m)}
P
− n1 ni=1 QQ0 (a,w,m)
2 (I(M2i = m) − Q0 (m)).
0 (m)
Similarly,
Q0n (w | m) − Q0 (w | m) = QQ0n0n(w,m)
− QQ0 (w,m)
(m)
0 (m)
≈ Q01(m) (Q0n − Q0 )(w, m) − QQ00(w,m)
(Q
(m)
− Q0 (m))
0n
(m)2
P
n
= n1 i=1 Q01(m) {I(W2i = w, M2i = m) − Q0 (w, m)} − QQ00(w,m)
(I(M2i = m) − Q0 (m)).
(m)2
The first order linear expansion of ψn is given by:
ψn − ψ0 ≈ df (Q10 , Q11 , Q1 , Q0 )(Q10n − Q10 , Q11n − Q11 , Q1n − Q1 , Q0n − Q0 ),
91
Hosted by The Berkeley Electronic Press
which provides the influence curve of ψn as an estimator of ψ0 . Thus, we
should first determine these derivative of f . Recall Q̄(m, w) = q0 Q1 (m, w) +
r0 (m)Q0 (w | m) and Q̄(1, m, w) = q0 Q1 (1, m, w) + r0 (m)Q0 (1, w | m). It
follows
ψn − ψ0 ≈
X Q1 (1, m, w)q02
m,w
Q̄(1, m, w)
(Q1n − Q1 )(m, w)
X Q1 (1, m, w)q0 r0 (m)
(Q0n − Q0 )(w | m)
Q̄(1, m, w)
X
q0
(Q1n − Q1 )(1, m, w)
+
Q̄(m, w)
Q̄(1, m, w)
m,w
+
m,w
−
X
Q̄(m, w)
q02 Q1 (1, m, w)
(Q1n − Q1 )(1, m, w)
Q̄2 (1, m, w)
Q̄(m, w)
q0 r0 (m)Q1 (1, m, w)
(Q0n − Q0 )(1, w | m).
Q̄2 (1, m, w)
m,w
−
X
m,w
We note Q0 (m) = P (M2 = m) = P (M1 = m) = Q1 (m). We also note
r0 (m)/Q1 (m) = q̄0 (m). Now, we substitute the empirical distributions and
the above empirical approximation for Q0n (1, w | m) and Q0n (w | m). This
yields the influence curve
IC(Oi ) = m,w q0 Q∗ (1, m, w){I(W1i = w, M1i = m) − Q1 (w, m)}
P
+ m,w q̄0 (m)Q∗ (1, m, w) {I(W2i = w, M2i = m) − Q0 (w | m)I(M2i = m)}
P
q0
{I(A1i = 1, M1i = m, W1i = w) − Q1 (1, m, w)}
+ m,w g∗ (1|m,w)
P
0
Q∗ (1,m,w)
m,w q0 g ∗ (1|m,w) {I(A1i = 1, M1i = m, W1i = w) − Q1 (1, m, w)}
0
∗ (1,m,w)
P
{I(A2i = 1, W2i = w, M2i = m)}
− m,w q̄0 (m) Qg∗ (1|m,w)
P
Q∗ (1,m,w)
− m,w q̄0 (m) g∗ (1|m,w) {Q0 (1, w | m)I(M2i = m)} ,
−
P
So
IC(Oi ) = q0 Q∗ (1, M1 , W1 ) − Q∗ (1, m, w)P (M = m, W = w, Y = 1)
∗
+q̄0 (M1 )Q
(1, M1 , W2 )
R
−q̄0 (M1 ) w Q∗ (1, M1 , w)P (W = w | Y = 0, M = M1 )
R
1 =1)
+q0 g∗I(A
− Q∗ (1, m, w)P (M = m, W = w)
(1|M1 ,W1 )
R
∗ (1,M ,W )
1
1
−q0 I(A1 = 1) Qg∗ (1|M
+ Q∗2 (1, m, w)P (M = m, W = w)
1 ,W1 )
∗ (1,M ,W )
1
2
−q̄0 (M1 ) Qg∗ (1|M
I(A2 = 1)
1 ,W2 )
R
q̄0 (M1 )
+ r0 (M1 ) P (M = M1 ) w Q∗ (1, M1 , w)(1 − Q∗ (1, M1 , w))P (W = w | M = M1 ),
R
92
http://biostats.bepress.com/ucbbiostat/paper234
where we should read P (W = w | M = M1 ) as P (W = w | M = m)
evaluated at m = M1 . So
o
n
I(A1 =1)
(1 − Q∗ (1, M1 , W1 )) + Q∗ (1, M1 , W1 )
g ∗ (1|M1 ,W1 )
n
o
2 =1)
∗
∗
+q̄0 (M1 ) g∗I(A
(0
−
Q
(1,
M
,
W
))
+
Q
(1,
M
,
W
)
1
2
1
2
(1|MR
1 ,W2 )
q0
∗
∗
∗
+ P (Y =1|M
{
w Q (1, M1 , w)(Q (w, M1 ) − Q (1, M1 , w))P (W
=M1 )
R ∗
R ∗
IC(O) = q0
= w | M = M1 )}
− R Q (1, m, w)P (M = m, W = w, Y = 1) − Q (1, m, w)P (M = m, W = w)
+ Q∗2 (1, m, w)P (M = m, W = w)
References
R.Detrano M. Bobbio E. Gunel A.P. Morise, G.A. Diamond. The effect
of disease-prevalence adjustments on the accuracy of a logistic prediction
model. Med Decis Making, 16:133–142, 1996.
P.J. Bickel, C.A. J. Klaassen, Y. Ritov, and J.A. Wellner. Efficient and
adaptive estimation for semiparametric models. Johns Hopkins University
Press, Baltimore, MD, 1993. ISBN 0-8018-4541-6.
N.E. Breslow. Statistics in epidemiology: the case-control study. J Am Stat
Soc, 91:14–28, 1996.
N.E. Breslow and N.E. Day. Statistical Methods in Cancer Research: Volume
1 – The analysis of case-control studies. International Agency for Research
on Cancer, Lyon, 1980.
N.E. Breslow, N.E. Day, K.T. Halvorsen, R.L. Prentice, and C. Sabal. Estimation of multiple relative risk functions in matched case-control studies.
Am J Epid, 108(4):299–307, 1978.
N.E. Breslow, J.M. Robins, and J.A. Wellner. On the semi-parametric efficiency of logistic regression under case-control sampling. Bernoulli, 6(3):
447–455, 2000.
D. Collett. Modeling Binary Data. Chapman and Hall, London, 1991.
S. Greenland. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies.
Am J Epidemiol, 160(4):301–305, 2004.
93
Hosted by The Berkeley Electronic Press
T.R. Holford, C. White, and J.L. Kelsey. Multivariate analysis for matched
case-control studies. Am J Epid, 107(3):245–255, 1978.
D.W. Hosmer and S. Lemeshow. Applied Logistic Regression. John Wiley
and Sons, New York, 2nd edition, 2000.
N.P. Jewell. Statistics for Epidemiology. Chapman and Hall/CRC, Boca
Raton, 2004.
R. Mansson, M.M. Joffe, W. Sun, and S. Hennessy. On the estimation and use
of propensity scores in case-control and cohort studies. American Journal
of Epidemiology, 00:1–8, 2007.
A.M. Molinaro, M.J. van der Laan, D.H. Moore, and K. Kerlikowske. Survival point estimate prediction in matched and nonmatched case-control subsample designed studies. Technical report,
Division of Biostatistics, University of California, Berkeley, 2005.
http://www.bepress.com/ucbbiostat/paper149.
K.L. Moore and M.J. van der Laan. Covariate adjustment in randomized
trials with binary outcomes. Technical report 215, Division of Biostatistics,
University of California, Berkeley, April 2007.
S. Newman. Causal analysis of case-control data. Epid Persp Innov, 3:2,
2006. URL http://www.epi-perspectives.com/content/3/1/2.
R.L. Prentice and R. Pyke. Logistic disease incidence models and case-control
studies. Biometrika, 66:403–411, 1979.
J.M. Robins. [choice as an alternative to control in observational studies]:
Comment. Statistical Science, 14(3):281–293, 1999.
J.J. Schlesselman. Case-Control Studies: Design, Conduct, Analysis. Oxford
University Press, Oxford, 1982.
M.J. van der Laan. Causal effect models for intention to treat and realistic
individualized treatment rules. Technical report 203, Division of Biostatistics, University of California, Berkeley, 2006.
M.J. van der Laan and M.L. Petersen. Causal effect models for realistic
individualized treatment and intention to treat rules. International Journal
of Biostatistics, 3(1), 2007.
94
http://biostats.bepress.com/ucbbiostat/paper234
M.J. van der Laan and J.M. Robins. Unified methods for censored longitudinal data and causality. Springer, New York, 2002.
M.J. van der Laan and D. Rubin. Targeted maximum likelihood learning.
The International Journal of Biostatistics, 2(1), 2006.
S. Wachholder. The case-control study as data missing by design: Estimating
risk differences. Epidemiology, 7(2):144–150, 1996.
95
Hosted by The Berkeley Electronic Press