Deductive Derivation and Computerization of Compatible

University of California, Berkeley
U.C. Berkeley Division of Biostatistics Working Paper Series
Year 
Paper 
Deductive Derivation and Computerization of
Compatible Semiparametric Efficient
Estimation
Constantine E. Frangakis∗
Tianchen Qian†
Zhenke Wu‡
Ivan Diaz∗∗
∗
Johns Hopkins University, [email protected]
Johns Hopkins University, [email protected]
‡
Johns Hopkins University, [email protected]
∗∗
Johns Hopkins University, [email protected]
This working paper is hosted by The Berkeley Electronic Press (bepress) and may not be commercially reproduced without the permission of the copyright holder.
†
http://biostats.bepress.com/ucbbiostat/paper324
c
Copyright 2014
by the authors.
Deductive Derivation and Computerization of
Compatible Semiparametric Efficient
Estimation
Constantine E. Frangakis, Tianchen Qian, Zhenke Wu, and Ivan Diaz
Abstract
Researchers often seek robust inference for a parameter through semiparametric estimation. Efficient semiparametric estimation currently requires theoretical
derivation of the efficient influence function (EIF), which can be a challenging
and time-consuming task. If this task can be computerized, it can save dramatic
human effort, which can be transferred, for example, to the design of new studies.
Although the EIF is, in principle, a derivative, simple numerical differentiation to
calculate the EIF by a computer masks the EIF’s functional dependence on the parameter of interest. For this reason, the standard approach to obtaining the EIF has
been the theoretical construction of the space of scores under all possible parametric submodels. This process currently depends on the correctness of conjectures
about these spaces, and the correct verification of such conjectures. The correct
guessing of such conjectures, though successful in some problems, is a nondeductive process, i.e., is not guaranteed to succeed (e.g., is not computerizable), and
the verification of conjectures is generally susceptible to mistakes. We propose a
method that can deductively produce semiparametric locally efficient estimators.
The proposed method is computerizable, meaning that it does not need either conjecturing for, or otherwise theoretically deriving the functional form of the EIF,
and is guaranteed to produce the result. The method is demonstared through an
example.
1.
Introduction
The desire for estimation that is robust to model assumptions has led to a growing literature
on semiparametric estimation. Approximately efficient estimators can be obtained in general
as the zeros of an approximation to the efficient influence function (EIF). Semiparametric
estimation is useful, for example, for survival analysis (Cox, 1972), for estimating growth
parameters in longitudinal studies (Liang and Zeger, 1986), and for estimating quantities under
missing data, including treatment effects based on potential outcomes (Crump et al., 2009).
Here, we focus on problems in which the distribution of the observed data is, in principle,
unrestricted, but where estimability requires use of lower dimensional working models.
Theoretical derivation of the EIF in such problems can be challenging. If this task can be
computerized, it can save dramatic human effort, which can then be transferred, for example,
to designing new studies. The EIF for the unrestricted problem can be written, in general,
as a Gateaux derivative (Hampel, 1974). However, if simple numerical differentiation is used
to calculate the EIF by a computer to avoid theoretical derivations, then the EIF’s functional
dependence on the parameter of interest is not revealed. For this reason, the derivative approach
has not been generally used. Instead, the standard approach to obtaining the EIF has been
the theoretical construction of the space of scores under all possible parametric submodels
(Begun et al., 1983). This process currently depends on the correctness of conjectures about
these spaces and the correctness of their verification. The correct guessing of such conjectures
can succeed in some problems, but is a nondeductive process, i.e., is not guaranteed to succeed
(e.g., is not computerizable) and, as with their verification, is generally susceptible to mistakes.
We propose a method that can deductively produce semiparametric locally efficient estimators. In Section 2, we formulate the goal of a deductive method and show that it essentially
requires numerical access to the functional dependence of the EIF on the parameter of interest.
Section 3 shows how the concept of compatibility solves the functional dependence problem,
2
Hosted by The Berkeley Electronic Press
and derives a deductive method. Throughout, we use the two-phase design as a test problem
where the EIF is known theoretically, and we demonstrate our method with a study on asthma
as an example. Section 4 discusses extensions, and Section 5 concludes with remarks.
2.
The problem of deductive computerization of semiparametric estimators
2.1 The goal of a deductive method
Suppose we conduct a study to measure data Di , i = 1, ..., n, independent and identically distributed (iid) from an unknown distribution F , in order to estimate a feature of the
distribution
τ (F ).
(1)
Suppose τ has a nonparametric EIF denoted by φ(Di , F − τ, τ ), where F − τ denotes the
remaining components of the distribution, other than τ . The goal is to find a deductive
method that can derive φ and can compute estimators τ̂ that solve
X
φ{Di , (F − τ )w , τ } = 0
(2)
i
for working estimators of (F − τ )w . In general, estimators solving (2) are consistent and
locally efficient if the working estimators of (F − τ )w are consistent with rates larger than
n1/4 (Van der Vaart, 2000). Our specific requirement that the method be “deductive and
computerizable”, means that the method should need neither conjecturing for, nor otherwise
theoretically deriving the functional form of φ, and should be guaranteed to produce the result,
in the sense of Turing (1937).
3
http://biostats.bepress.com/ucbbiostat/paper324
2.2 Conjecturing and functional form as limitations to a deductive method
A test problem: estimating the mean in a two-phase design. To help make arguments concrete,
we consider he following example where the EIF is known. Suppose that in order to estimate the
mean τ = E(Y ) in a population, first the study obtains a simple random sample of individuals
and records an easily measured covariate Xi . Then, the study is to measure the main outcome
Yi only for a subset denoted with Ri = 1, where the missing data mechanism is ignorable given
X, i.e., pr(Ri = 1 | Yi , Xi ) = pr(Ri = 1 | Xi ). The final data Di are (Xi , Ri , Yi Ri ), i = 1, ..., n,
iid from a distribution F , and, by ignorablity, the parameter τ is identified from F as
Z
τ (F ) = y(x)p(x)dx,
(3)
where p(x) is the density of Xi ; and y(x) is the conditional expectation E(Yi | Ri = 1, Xi = x).
For this problem, the EIF is known (e.g., Robins and Rotnitzky (1995) and Hahn (1998)) to
be
φ{Di , (F − τ ), τ } =
Ri · {Yi − y(Xi )}
+ y(Xi ) − τ,
e(Xi )
(4)
where e(x) is the propensity score pr(Ri = 1 | Xi = x). The derivation has, so far, been
nondeductive because it is first based on conjectures on the score space over all submodels,
which are then verified to be true (e.g., Hahn (1998)).
Current estimation methods need the functional form of EIF. Most existing approaches to using
(2), first isolate a dependence of φ on τ , then replace the remaining dependence on F with
P
a working model, and finally solve i φ for τ . In the test problem above, the most common
approach to using (4) to estimate τ first obtains working functions yw (Xi ) and ew (Xi ), for
example using parametric MLEs, and estimates τ as the zero of the empirical sum of (4), to
obtain:
τ̂
nondeductive
=
1 X Ri · {Yi − yw (Xi )}
+ yw (Xi );
n i
ew (Xi )
(5)
4
Hosted by The Berkeley Electronic Press
see, for example, Robins et al. (1994), Davidian et al. (2005), and Kang et al. (2007). While
there also exist modified estimators like the targeted minimum loss estimator (TMLE) (van der
Laan and Rubin, 2006), all methods that have been presented so far have advocated that it is
critical to know the functional form dependence of φ on F , and so are nondeductive, hence,
noncomputerizable without prior knowledge of the functional form.
The Gateaux derivative approach to EIF. For a general parameter τ , the EIF evaluated at an
observation d0 can be obtained as the Gateaux derivative
τ (Fd0 , ) − τ (F )
where
→0
φ(d0 , F ) = lim
(6)
Fd0 , =(1 − )F + · 1 < d0 >
(7)
where 1 < d0 > denotes a point mass at d0 (Hampel, 1974). Calculating this derivative at a
given d0 and F is a deductive and computerizable operation. To demonstrate the ease of its
derivation, consider again the test problem with missing data.
Specifically, for a given observation d0 = [x0 , r0 , y 0 r0 ] and a distribution F , it follows from
(3), (7), and Bayes rule, that
Z
τ (Fd0 , ) =
yd0 , (x)pd0 , (x)dx
where pd0 , (x) = (1 − )p(x) + 1(x = x0 )
and yd0 , (x) =
· 1(x = x0 , r0 = 1) · y 0 + (1 − ) · p(x)e(x)y(x)
.
· 1(x = x0 , r0 = 1) + (1 − ) · p(x)e(x)
Then, (6) becomes
0
φ(d , F ) =
Z Z ∂yd0 , (x)
∂pd0 , (x)
pd0 , (x) dx +
yd0 , (x)
dx
∂
∂
=0
=0
The first and second terms of the above are
r0 {y 0 −y(x0 )}
e(x0 )
and y(x0 ) − τ , respectively, which is the
result (4) above.
The problem with the derivative operation is that if simple numerical differentiation is used
5
http://biostats.bepress.com/ucbbiostat/paper324
to calculate the EIF by a computer to avoid theoretical derivations, then the EIF’s functional
dependence on the parameter of interest is not revealed.
3.
A deductive estimation method
Method
A start to finding a deductive method is to appreciate from a new perspective a problem that nondeductive estimators such as (5) have. Specifically, nondeductive estimators are
usually constructed from a dependence of the EIF φ on τ that is different from the variationindependent partition into [(F −τ ), τ ] (this is probably because of the limitations of closed-form
expressions). For example, the estimator τ̂
nondeductive
of (5) is a sample analogue of (a) the expres-
sion of the last appearance “τ ” in the right hand side of (4), using (b) a working expectation
yw (x); and (c) the empirical estimator for p(x) to average over quantities of Xi . However,
the parameters underlying (a) (namely, τ ), (b) (namely, y(x)) and (c) (namely, p(x)) are not
variation-independent, because τ is the average of y(x) over p(x). This creates an incompatibility: the value of the estimator τ̂
nondeductive
from this method differs (almost surely) from its defining
expression τ (F ) if for F we use the estimates in (b) and (c) that are used to produce τ̂
nondeductive
.
The problem of incompatibility has been noted before as a nuisance and has motivated
compatible estimators like the TMLE (e.g., van der Laan and Rubin (2006)). Here, we show
that, more fundamentally, the concept of incompatibility together with the Gateaux derivative
create a solution to the problem of deductive estimation. In particular, the previous section
noted that evaluation of the Gateaux derivative at a working distribution Fw masks the dependence on τ . However, the same evaluation does contain evidence of whether parts of the
working distribution Fw are misspecified, if the empirical sum of the Gateaux derivative is
not zero. This evidence of misspecified Fw can be turned, by “εις άτοπον απαγωγή” 1 , into
1
“Reduction to the absurd”.
6
Hosted by The Berkeley Electronic Press
estimation for τ , where plausible values of τ are values τ (F ) for distributions F for which the
empirical sum of the Gateaux derivative eliminates any evidence of misspecification.
Based on the above argument, we have devised the following method that solves the deductive computarization problem by addressing the above compatibility problem.
(step 1) Extend the working distribution Fw to a parametric model, say, Fw (δ), around
Fw (i.e., so that Fw (0) = Fw ), where δ is a finite dimensional vector. In this extension,
we can keep unmodified the part of Fw that is known to be most reliably estimated
(e.g., a propensity score elicited by physicians).
(step 2) Using the Gateaux numerical difference derivative
Gateaux{τ, Fw (δ), Di , } := τ {Fw(Di ,) (δ)} − τ {Fw (δ)} /
for a machine-small , to deduce the value of φ{Di , Fw (δ)} for arbitrary δ, find
δ̂ opt that minimizes the empirical variance of τ {Fw (δ̂)}
(8)
among all roots {δ̂} that are subject to the condition
i
Xh
φ{Di , Fw (δ̂)} ←− Gateaux{τ, Fw (δ̂), Di , } = 0.
(9)
i
where “ ←−” means “computed as”. Property (9) is the empirical analogue of the
central, mean-zero property if the evaluated φ at Fw (δ̂) is the true influence function
of τ . An average of the EIF at a Fw (δ) that deviates from zero is evidence that the
working distribution is incorrect. This step finds a distribution Fw (δ̂) that eliminates
such evidence. Technically, there may be no zeros, in which case δ̂ can be defined as the
minimizer of the absolute value of (9), although a better solution would be to make the
model Fw (δ) more flexible (see below). More realistically, for a working model Fw (δ),
7
http://biostats.bepress.com/ucbbiostat/paper324
we can expect as many zeros as the dimension of δ, and so condition (8) selects the
best one.
(step 3) Calculate the parameter at the EIF-fitted distribution Fw (δ̂) as
τ̂ deductive := τ {Fw (δ̂ opt )}
(10)
Properties
The above method is deductive because step 2 does not need the functional form of φ,
but deduces it by the numerical Gateaux derivative (6). If δ is 1-dimensional, then (9) is
expected to have one root, and this can be found by numerical root-finding methods such as
in Brent (1973) or Newton-Raphson. If δ has more dimensions, then δ̂ opt can be found by
numerical Lagrange multipliers, where (8) can be coded as the jackknife variance. Also, the
above estimates for τ and the remaining model parameters are compatible, by construction.
The deductive estimator shares useful properties of so-far known, nondeductive estimators
that take φ as given. Notably, if the actual expectation of φ(Di , Fw ) is zero for a working distribution when, say part1 (Fw ) = part1 (F ), or,...,or partK (Fw ) = partK (F ), then the deductive
estimator above will be consistent as would be usual, nondeductive estimators (e.g., Scharfstein
et al. (1999)). Also, the deductive estimator above shares with the TMLE the idea of extending
the working model (Chaffee and van der Laan (2011)), and with other estimators the idea of
empirical maximization (e.g., Rubin and van der Laan (2008)). However, to our knowledge, all
such existing work for local efficiency has considered it critical to have the theoretically derived
form of the EIF based on the score theory. The contribution of the proposed method above is
to show that this theory can be translated to estimation that can be computerized in general,
by combining model extension with the Gateaux derivative.
The extension in step 1 can take different forms. For example, for the two-phase problem,
suppose an original working function yw (x) has been obtained as the OLS fit x0 β̂ ols of a linear
8
Hosted by The Berkeley Electronic Press
regression model x0 β for E(Y | R = 1, X = x). Then, a simple model extension is to free-up
(again) the intercept of x0 β̂ ols and let it be the parameter δ. This, as well as any analogous
such extension, results in an estimator, as defined by the above method, that is consistent
if the propensity score model is correct, and locally efficient under the working model with
parameter δ.
Feasibility evaluations
To evaluate the feasibility of our method, we applied it to the study reported by Huang et al.
(2005), as an example of the two-phase design. The goal of that study was to compare rates of
patient satisfaction for asthma care as the outcome (yes/no) among different physician groups
(treatments). Physician groups differed in their distribution of patient covariates. So, in order
to compare between, say, two physician groups, we set the goal to estimate the average (3) of
patient satisfaction for each group, standardized by the distribution of patient covariates in
the combined population of the two groups. The following covariates X were considered: age,
gender, race, education, health insurance, drug insurance coverage; asthma severity; number
of comorbidities, and SF-36 physical and mental scores.
We tested feasibility of the above method for the comparison within two pairs of groups,
denoted in Table 1 as a1 vs. b1 and a2 vs. b2 (actual ids omitted). We chose (a1 , b1 ) as a
pair for which the usual estimator τ̂
nondeductive
produces values diverging from the unadjusted rates
for a1 and b1 ; and we chose (a2 , b2 ) as a pair for which the usual estimator produces values
shrinking from a1 and b1 . The nondeductive estimator used as propensity score the quintiles of
the logistic regression of group membership conditionally on X; and a working expectation yw
as the prediction from the logistic regression of patient satisfaction conditionally on X within
each group. The deductive estimator uses the same propensity score, and, for step 1 of the
method, extended the working expectation yw by including back the intercept in the logistic
regression for each group as a free parameter δ. The computation of φ for each δ in (9) was
9
http://biostats.bepress.com/ucbbiostat/paper324
obtained by straight-forward numerical differentiation for the Gateaux derivative; and the root
δ̂ was found by the method of Brent (1973) implemented by the function “uniroot” in R.
In all cases, the deductive estimator gives answers close to the nondeductive estimator.
Moreover, as discussed earlier, both are doubly-robust for the same model. Both estimators
produced their answers in less than a second for each group, although the nondeductive estimator used the a-priori knowledge of the closed form expression (4) for φ, whereas the deductive
estimator did not. Finally, the standard errors (based on jackknife) are also comparable. A
guarantee that the deductive estimator be more precise can be incorporated by extending δ to
two dimensions (two coefficients) and minimize the empirical variance as in step 2.
4.
Extensions
In complex problems, it is possible that standard root finding methods for (9) are unstable.
In this section we show that the Gateaux numerical derivative may still be used to construct
a deductive estimation method that does not rely on solving an estimating equation.
Suppose that the parameter τ (F ) depends on F only through a set of variation independent
parameters qj (F ) : j = 1, . . . , J. Such is the case of parameter (3) in our example, with
q1 (F ; x) = y(x) and q2 (F ; x) = p(x). In an abuse of notation, let τ (q1 (F ), . . . , qJ (F )) := τ (F ).
Since the parameters qj are variation independent, the Gateaux derivative expression of φ in
(6) reduces to
0
φ(d , F ) =
J
X
τ (q1 (F ), . . . , qj (Fd0 , ), . . . , qJ (F )) − τ (F )
.
→0
j=1
lim
This expression provides the decomposition φ(d0 , F ) =
PJ
j=1
(11)
φj (d0 , F ), where φj is the non-
parametric efficient score associated to qj . Once the Gateaux numerical derivatives φj have
been computed, it is possible to implement a standard TMLE. We only provide a brief recap
10
Hosted by The Berkeley Electronic Press
of the TMLE template since extensive discussions are presented elsewhere van der Laan and
Rubin (2006); van der Laan and Rose (2011). For each qj , consider a loss function Lj (qj ; D)
whose expectation is minimized at the true value of qj . Consider also a working model qjw and
a parametric extension qjw (δ) satisfying
d
0 L(qjw (δ); d )
= φj (d0 ).
dδ
δ=0
In our example, since qj are components of the likelihood, the negative log-likelihood loss
function and the exponential family may be used in this step:
L(qj ; d) = − log qj (d)
qjw (δ; d) ∝ exp(δφj (d))qjw (d).
(12)
The TMLE is then defined by an iterative procedure that, at each step, estimates δ by minimizing the expected sum of the loss functions Lj (qjw (δ); ·). An update of the working model
is then computed as qjw ← qjw (δ̂), and the process is repeated until convergence. The TMLE
?
?
?
denotes the estimate obtained in the last step of
), where qjw
, . . . , qJw
is defined by τ̂ = τ (q1w
the iteration. Like the estimator presented in Section 3, the TMLE is a compatible estimator,
and solves the EIF estimating equation. Unlike the estimator of Section 3, the TMLE does
not require direct solution of that equation. However, the TMLE may be computationally
more intensive, as it is iterative and may require numerical integration for computation of the
proportionality constant in (12).
5.
Remarks
We proposed a deductive method to produce semiparametric estimators that are locally
efficient. The method does not rely on conjectures of tangent spaces and is not susceptible
11
http://biostats.bepress.com/ucbbiostat/paper324
to possible errors in the verification of such conjectures. Instead, the new method relies on
computability of the estimand τ for specified working distributions of the observed data F ,
and on numerical methods for differentiation and for root finding.
Although we have focused on local efficiency of originally unrestricted problems, one can
see a path towards finding a deductive method also for problems with restrictions set a priori.
Such a path can explore, first, nesting the restricted problem within an unrestricted one, and
then, making use of the proposed deductive method for the unrestricted problem, modified to
impose numerically the nested restrictions. Such deductive methods can save dramatic amounts
of human effort on essentially computerizable processes, and allow the transfer of that effort to
other statistically demanding parts of the scientific process such as the efficient design of new
studies.
Acknowledgments
We thank Mark van der Laan, Michael Rosenblum, Daniel Scharfstein, and Stijn Vansteelandt for helpful discussions.
References
Begun, J. M., Hall, W., Huang, W.-M., and Wellner, J. A. (1983). Information and asymptotic
efficiency in parametric-nonparametric models. The Annals of Statistics 11, 432–452.
Brent, R. (1973). Algorithms for minimization without derivatives. Englewood Cliffs, NJ:
Prentice-Hall .
Chaffee, P. and van der Laan, M. J. (2011). Targeted minimum loss based estimation based on
directly solving the efficient influence curve equation. UC Berkeley Division of Biostatistics
Working Paper Series .
12
Hosted by The Berkeley Electronic Press
Cox, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society,
Ser. B 34, 187–220.
Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2009). Dealing with limited
overlap in estimation of average treatment effects. Biometrika 96, 187–199.
Davidian, M., Tsiatis, A. A., and Leon, S. (2005). Semiparametric estimation of treatment
effect in a pretest–posttest study with missing data. Statistical science: a review journal of
the Institute of Mathematical Statistics 20, 261.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of
average treatment effects. Econometrica pages 315–331.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the
American Statistical Association 69, 383–393.
Huang, I., Frangakis, C., Dominici, F., Diette, G., and Wu, A. (2005). Application of a propensity score approach for risk adjustment in profiling multiple physician groups on asthma care.
Health Services Research 40, 253–278.
Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of
alternative strategies for estimating a population mean from incomplete data. Statistical
science 22, 523–539.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika 73, 13–22.
Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression
models with missing data. Journal of the American Statistical Association 90, 122–129.
13
http://biostats.bepress.com/ucbbiostat/paper324
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when
some regressors are not always observed. Journal of the American Statistical Association 89,
846–866.
Rubin, D. B. and van der Laan, M. J. (2008). Empirical efficiency maximization: improved
locally efficient covariate adjustment in randomized experiments and survival analysis. The
International Journal of Biostatistics 4,.
Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association 94, 1096–1120.
Turing, A. (1937). On computable numbers, with an application to the entscheidungs problem.
Proceedings of the London Mathematical Society 42, 230–265.
van der Laan, M. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational
and Experimental Data. Springer, New York.
van der Laan, M. J. and Rubin, D. B. (2006). Targeted maximum likelihood learning. The
International Journal of Biostatistics 2, 1–40.
Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge University Press.
14
Hosted by The Berkeley Electronic Press
Table 1: Feasibility of the deductive method for estimating the probability of patient satisfaction adjusted for covariates for two physician group pairs of the asthma study of Huang et al.
(2005)
R
estimates of τ (F ) = y(x)p(x)dx
nondeτ̂ deductive (%) se (%)
τ̂ ductive (%) se (%)
physician
group g
n
unadjusted
pr(Y | G) (%)
a1
b1
177
86
60.5
59.3
63.1
52.0
4.5
8.8
64.3
51.4
4.1
9.1
a2
b2
110
194
78.2
48.5
72.1
49.4
8.2
4.5
69.4
48.8
7.9
4.3
15
http://biostats.bepress.com/ucbbiostat/paper324