J. R. Statist. Soc. B (2009) 71, Part 5, pp. 947–969 Causal inference in outcome-dependent two-phase sampling designs Weiwei Wang, Princeton University, USA Daniel Scharfstein, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA Zhiqiang Tan Rutgers University, Piscataway, USA and Ellen J. MacKenzie Johns Hopkins Bloomberg School of Public Health, Baltimore, USA [Received June 2008. Revised March 2009] Summary. We consider estimation of the causal effect of a treatment on an outcome from observational data collected in two phases. In the first phase, a simple random sample of individuals is drawn from a population. On these individuals, information is obtained on treatment, outcome and a few low dimensional covariates. These individuals are then stratified according to these factors. In the second phase, a random subsample of individuals is drawn from each stratum, with known stratum-specific selection probabilities. On these individuals, a rich set of covariates is collected. In this setting, we introduce five estimators: simple inverse weighted; simple doubly robust; enriched inverse weighted; enriched doubly robust; locally efficient. We evaluate the finite sample performance of these estimators in a simulation study. We also use our methodology to estimate the causal effect of trauma care on in-hospital mortality by using data from the National Study of Cost and Outcomes of Trauma. Keywords: Doubly robust estimator; Outcome-dependent sampling; Two-phase sampling 1. Introduction In some observational studies, it may be relatively inexpensive to measure the outcome Y , the treatment T and a low dimensional set of covariates S, while the remaining set of covariates W is expensive to obtain. In such settings, an outcome-dependent two-phase sampling design, which was first introduced by Neyman (1938) and discussed in Cochran (1963), can significantly reduce the cost of the study. In the first phase, Y , T and S are collected on all subjects. In the second phase, the subjects are divided into strata according to .Y , T , S/ and a random subsample, which is also called a validation sample, is randomly drawn from each stratum. The expensive covariates W are measured only on the validation sample. Existing statistical methods for outcome-dependent two-phase sampling studies are primarily focused on obtaining consistent and efficient estimates for the regression parameters in a Address for correspondence: Weiwei Wang, Department of Operations Research and Financial Engineering, 215 Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected] © 2009 Royal Statistical Society 1369–7412/09/71947 948 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie population level model for the distribution of Y given T , S and W ; see, for example, Cosslett (1981, 1983), White (1982), Fears and Brown (1986), Breslow and Cain (1988), Pepe and Fleming (1991), Carroll and Wand (1991), Schill et al. (1993), Scott and Wild (1991, 1997), Robins et al. (1995), Breslow and Holubkov (1997a, b), Lawless et al. (1999), Breslow (2000), Breslow et al. (2003), Chatterjee et al. (2003) and Weaver and Zhou (2005). In contrast with these methods, we are interested in estimating the causal effect of a nonrandomized treatment on the outcome. Specifically, we want to obtain estimators for the marginal mean of the ‘potential’ outcome that would have been observed if all subjects had received a specified treatment. Our work is motivated, in part, by the National Study on the Costs and Outcomes of Trauma (NSCOT) (MacKenzie et al., 2006), in which an outcome-dependent sample of severely injured patients treated at trauma and non-trauma centres is prospectively followed for survival and functional status. One of the primary aims is to estimate the causal effect of trauma centre versus non-trauma-centre care on in-hospital mortality, accounting for (a) outcome dependent, two-phase sampling and (b) selection bias due to non-randomized assignment of type of care. The paper is organized as follows. Section 2 introduces the notation and data structure. Section 3 introduces five estimators: (a) (b) (c) (d) (e) simple inverse weighted (SIW); simple doubly robust (SDR); enriched inverse weighted (EIW); enriched doubly robust (EDR); locally efficient (LE). Section 4 provides the results of a comprehensive simulation study. Section 5 presents an analysis of the NSCOT data by using our estimators. The last section is devoted to a discussion. 2. Notation and framework For an individual, let X = .S , W / denote covariates and let Y be the observed outcome. On all subjects, .S , Y , T/ is collected whereas W is collected only on the validation sample. Let V be the validation indicator and T be the treatment indicator. On the basis of the study design, the probability of being part of the validation sample depends on .S, Y , T/, but not W. We denote q.S, Y , T/ = P.V = 1|S, Y , T/, which is known by study design. For clarity, we further denote q1 .S, Y/ = q.S, Y , 1/ and q0 .S, Y/ = q.S, Y , 0/. Let Y1 and Y0 be the potential outcomes under treatment 1 and 0 respectively. We make the stable unit treatment assumption (Rubin, 1986), so that Y = TY1 + .1 − T/Y0 : The observed data for an individual are O = .S , Y , T , V , V · W / . We are interested in using n independent and identically distributed copies of O to draw inference about μtÅ = E[Yt ]. To identify μtÅ from the observed data, we assume that T is independent of .Y0 , Y1 / given X. In other words, we assume that, conditional on X = x, Y0 and Y1 , treatment assignment is an independent Bernoulli process with assignment probabilities πtÅ .x/ = P.T = t|X = x/, t = 0, 1. We refer to this assumption as no unmeasured confounding. We further make the positivity assumption, i.e. πtÅ .x/ > 0 for all x and t = 0, 1. For convenience, Table 1 summarizes our notation. When it is important, we shall emphasize the dependence of functions on model parameters; otherwise, we shall suppress such dependence. Causal Inference in Sampling Designs Table 1. Notation Symbol Y T Y1 Y0 μ1Å μ0Å S W X V μ1Å .X/ μ0Å .X/ q.S, Y , T/ q1 .S, Y/ q0 .S, Y/ π1Å .X/ π0Å .X/ 3. 949 Description Observed outcome Treatment (0–1) Potential outcome under treatment 1 Potential outcome under treatment 0 E[Y1 ] E[Y0 ] Covariates observed for all subjects Covariates observed only on validation sample .S , W / Validation indicator E[Y1 |X] E[Y0 |X] P.V = 1|S, Y , T/ q.S, Y , 1/ q.S, Y , 0/ P.T = 1|X/ P.T = 0|X/ Estimation 3.1. Simple inverse-weighted estimation Under the assumptions in Section 2, it can be shown that the observed data estimating function VI{T =t } UtSIW .O; μt , πt / = .Y − μt / qt .S, Y/ πt .X/ has mean 0 when evaluated at μtÅ and πtÅ .X/.√Without additional modelling assumptions, this estimating function cannot be used to draw n-inference about μÅt . This is because we would need a non-parametric estimator of π Å .X/ converging at a rate faster than n1=4 , which cant not be achieved when X is high dimensional (Newey, 1994). This has been called the ‘curse of dimensionality’ (Robins and Ritov, 1997). To proceed, we can parametrically model πtÅ .X/ = πt .X; αÅ /, where αÅ denotes the true value of the model parameters α. Without loss of generality, we shall assume that logit{π1 .X; αÅ /} = l.X; αÅ / where l.X, α/ is a specified function of X and α, which is differentiable in α. This model implies that exp{t l.X; αÅ /} πt .X; αÅ / = : 1 + exp{l.X; αÅ /} Under this model, αÅ can be estimated as the solution α̂ to the score function of the logistic regression T given X where each individual receives weight V=qT .S, Y/. Letting @l.X; α/ Sα .T , X; α/ = {T − π1 .X; α/}, @α then α̂ is the solution to V En Sα .T , X; α/ = 0, qT .S, Y/ where En [·] is the empirical expectation operator. 950 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie In terms of estimating μÅt , we note that UtSIW .O; μt , α/ = VI{T =t } qt .S, Y/πt .X; α/ .Y − μt / has mean 0 when evaluated at μtÅ and αÅ . This suggests that we can estimate μtÅ as the solution En [UtSIW .O; μt , α̂/] = 0. The resulting estimator is then VI{T =t } VI{T =t } = E Y E μ̂SIW : n n t qt .S, Y/ πt .X; α̂/ qt .S, Y/ πt .X; α̂/ The advantage of the SIW estimator is that it is simple to compute and only requires modelling of πtÅ .X/. The disadvantage of this estimator is twofold. First, it does not take full advantage of the data that are recorded on subjects who are not in the validation sample. It only uses information on .S, Y , T/ through the construction of the sampling weights. Second, the SIW estimator is inefficient, even if we ignore data on those who are not validated. 3.2. Regular and asymptotically linear estimators Under correct specification of the logistic regression model for πtÅ .X/ that was discussed in the previous subsection and additional smoothness conditions, μ̂SIW can be shown to be a regular t and asymptotically linear (RAL) estimator of μtÅ . To review, an estimator μ̂t is said to be asymptotically linear if there is a random variable IFt .O/ such that n √ n.μ̂t − μtÅ / = n−1=2 IFt .Oi / + op .1/ i=1 with E[IFt .O/] = 0 and 0 < E[IFt .O/2 ] < ∞. The function IFt .O/ is referred to as an influence function (Hampel, 1974). By the central limit theorem, it then follows that √ D n.μ̂ − μÅ / → N{0, E [IF .O/2 ]}: t t θ0 t An estimator is said to be regular if its asymptotic distribution does not change in a shrinking neighbourhood of the true data-generating distribution (van der Vaart, 1998). An asymptotically linear estimator of μtÅ will be regular if and only if its influence function is orthogonal to the model’s nuisance tangent space and its inner product with the score for μÅt is equal to 1. In a semiparametric model, like ours, the nuisance tangent space is defined as the mean-square closure of the nuisance tangent spaces of all parametric submodels, where a parametric submodel is a model that (a) is indexed by a finite dimensional parameter vector and (b) contains the true data-generating distribution, and where the nuisance tangent space of the parametric submodel is all linear combinations of the submodel’s nuisance score vector. As discussed in Tsiatis (2006), elements of the orthogonal complement of the nuisance tangent space can be used to construct estimating functions, which can in turn be used to generate RAL estimators of μtÅ . In general, the influence functions of the resulting estimators will be normalized versions of the elements that are selected from this space. The influence function for μ̂SIW can be shown to be t SIW @Ut .O; μt ,αÅ / V @Sα .T , X; αÅ / −1 SIW SIW Å Å Å Å IFt .O; μt , α / = Ut .O; μt , α / − E E @α qT .S,Y/ @α V × Sα .T ,X; αÅ /: qT .S, Y/ Causal Inference in Sampling Designs 951 This implies that the asymptotic variance of μ̂SIW is equal to E[IFtSIW .O; μÅt , αÅ /2 ]. A natural t SIW SIW .O; μt , α/ is the estimator for the asymptotic variance is En [IFt .O; μ̂SIW , α̂/2 ], where IF t t SIW same as IFt .O; μt , α/ except that the expectations are replaced with empirical averages. 3.3. Representation of influence functions Using the semiparametric theory of inference in coarsened data problems (Bickel et al., 1993; Robins et al., 1994; Scharfstein et al., 1999; van der Laan and Robins, 2003; Tsiatis, 2006), we derive the class of all influence functions for RAL estimators of μÅt under correct specification of the logistic regression model for πtÅ .X/. In Appendix A, we show that the class of all influence functions consists of elements of the form 1 IF {O; μÅ , αÅ , μÅ .X/, g , g , b} = φ {O; μÅ , αÅ , μÅ .X/} + φ .O; g / + φ .O; αÅ , b/ .1/ t t t 1 0 t 1 t τ 2,τ τ =0 3 where φ1 {O; μt , α, μt .X/} = VI{T =t } V T − π1 .X; α/ .Yt − μt / + .−1/t {μt .X/ − μt }, qT .S, YT / πt .X; α/ V φ2,τ .O; gτ / = I{T =τ } 1 − gτ .S, Yτ /, qτ .S, Yτ / V T − π1 .X; α/ φ3 .O; α, b/ = b.X/, qT .S, YT / R.X; α/ qt .S, Yt / πt .X; α/ g1 and g0 are arbitrary functions of their arguments, R.X; α/ = π1 .X; α/ π0 .X; α/, b.X/ ∈ T ⊥ and @l.X; αÅ / T = k : k is an arbitrary constant vector : @α The influence function for the SIW estimator is a member of this class with g1 .S, Y1 / = 0, g0 .S, Y0 / = 0 and b.X/ equal to @l.X; αÅ / SIW t+1 Å Å Å Å Å Å b .X/ = .−1/ π1−t .X; α /{μt .X/ − μt } − R.X; α / E π1−t .X; α /.Yt − μt / @α −1 Å Å Å @l.X; α / @l.X; α / @l.X; α / : × E R.X, αÅ / @α @α @α Letting Λgτ = {φ2,τ .O; gτ / : gτ is arbitrary}, Λb = {φ3 .O; αÅ , b/ : b.X/ ∈ T ⊥ }, Λg = Λg0 ⊕ Λg1 and Λ = Λg + Λb , the class of all influence functions can be written as a linear variety of form {IF {O; μÅ , αÅ , μÅ .X/, g , g , b} : g , g , b} = φ {O; μÅ , αÅ , μÅ .X/} + Λ: t t t 1 0 1 0 1 t t Λg0 and Λg1 are orthogonal, whereas Λg and Λb are not. 3.4. Locally efficient estimation The variance of an influence function in the above class depends on the choice of g1 , g0 and b. The influence function with the smallest variance (i.e the most efficient) is the projection Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λ⊥ ], where Π[·|·] is the projection operator (Fig. 1). 952 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie f1 + Λ IFt Λ f1 P [f1 | L^] Fig. 1. Optimal influence function The required projection does not have a closed form solution; however, we can use the alternating projection theorem (Bickel et al., 1993) to approximate computationally the solution. We reproduce a version of the theorem here. Theorem 1 (alternating projection theorem). Let SA and SB be two non-orthogonal spaces. Let h denote a vector. Let Q[·], QA [·] and QB [·] be the projection operator to the spaces ⊥ and S ⊥ respectively. Let Q.1/ [h] = Q [Q [h]] and Q.j/ [h] = Q.1/ [Q.j−1/ [h]], .SA + SB /⊥ , SA B A B j > 1. Then, lim {Q.j/ [h]} = Q[h]: j→∞ In our context SA = Λg , SB = Λb and h.O/ = φ1 {O; μÅt , αÅ , μtÅ .X/}. Fig. 2 illustrates the alternating projection theorem in our context. The key then is to derive closed form expressions for the projection operators QA and QB . To do so, we need to know how to project a function of the observed data h.O/ onto Λg and onto Λb . It can be shown that 1 E[h.O/I{T =τ } {1 − V=qτ .S, Yτ /}|S, Yτ ] V I{T =τ } 1 − Π{h.O/|Λg } = .2/ q .S, Y / E[I{T =τ } {1 − V=qτ .S, Yτ /}2 |S, Yτ ] τ τ τ =0 and Π{h.O/|Λb } = where Å V T − π1 .X; αÅ / −1 @l.X; α / C .X/ − k A.X/ , h h qT .S, YT / R.X; αÅ / @α Ch .X/ = E h.O/ V T − π1 .X; αÅ / X, qT .S, YT / R.X; αÅ / A.X/ = π1 .X; αÅ /−1 E[q1 .S, Y1 /−1 |X] + π0 .X; αÅ /−1E[q0 .S, Y0 /−1 |X] and Å Å −1 @l.X; αÅ / −1 −1 @l.X; α / @l.X; α / E A.X/ A.X/ Ch .X/ : @α @α @α kh = E .3/ Causal Inference in Sampling Designs 953 Λ⊥ b f1 Λ⊥ g Λ⊥ g Fig. 2. Λ⊥b Alternating projection theorem To find the projection, Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λ⊥ ], we use the following algorithm. Let denote the result of the jth iteration. Step 1: compute .1/ ht .O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ]: .4/ .j/ ht .O/ Step 2: compute .2/ ht .O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ] − Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λb ] .5/ + Π.Π[φ {O; μÅ , αÅ , μÅ .X/}|Λ ]|Λ /: t 1 g t b Step 3: on iteration 2k + 1 (k 1), compute .2k+1/ ht .2k/ .O/ = ht .2k/ = ht .2k/ .O/ − Π{ht .O/|Λg } .2k−1/ .O/ + Π[Π{ht .O/|Λb }|Λg ]: .6/ .O/ = ht .O/ − Π{ht .O/|Λb } .2k+1/ .2k/ = ht .O/ + Π[Π{ht .O/|Λg }|Λb ]: .7/ Step 4: on iteration 2k + 2 (k 1), compute .2k+2/ ht .2k+1/ .2k+1/ Equations (6) and (7) show that, for k 3, the variance reduction that is achieved on iteration k is always smaller than on iteration k − 1. The algorithm stops when the variance reduction is smaller than some specified tolerance. Although the general projection formulae are given above, the following special cases are used in the algorithm. When h.O/ = φ1 {O; μÅt , αÅ , μtÅ .X/} (as in equations (4) and (5)), then Π[φ1 {O; μÅt , αÅ , μtÅ .X/}|Λg ] takes the form in equation (2) where the ratio of the conditional expectations is equal to 954 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie I{τ =t } .Yt − μtÅ / + .−1/t E[{πτ .X; αÅ /=πt .X; αÅ /}{τ − π1 .X; αÅ /}{μÅt .X/ − μtÅ }|S, Yτ ] E[πτ .X; αÅ /|S, Yτ ] Å Å Å and Π[φ1 {O; μt , α , μt .X/}|Λb ] takes the form in equation (3) with .−1/t+1 Yt − μtÅ Cφ1 .X/ = E X + .−1/t A.X/ π1−t .X; αÅ /{μÅt .X/ − μtÅ }: πt .X; αÅ / qt .S, Yt / When h.O/ = φ .O; αÅ , b† / ∈ Λ (as in equation (6)), then Π{φ .O; αÅ , b† /|Λ } takes the form − b 3 3 in equation (2) with the ratio of the conditional expectations for τ equal to .−1/τ E[π .X; αÅ /|S, Y ]−1 E[b† .X/|S, Y ]: τ τ g τ † † When h.O/ = Σ1τ =0 φ2,τ .O; gτ / ∈ Λg (as in equations (5) and (7)), then Π{Σ1τ =0 φ2,τ .O; gτ /|Λb } takes the form in equation (3) with 1 1 τ +1 † X : .−1/ E gτ .S, Yτ / 1 − CΣ1 φ2,τ .X/ = τ =0 q .S, Y / τ τ =0 τ .j/ .j/ IFt .O; μÅt , αÅ , g1 , g0 , b.j/ / .j/ ht .O/ .j/ .j/ On each iteration, can be expressed as for g1 , g0 and b.j/ determined via the above projection formulae. Thus, for j Å large, the solution to Å .j Å / .j Å / E [IF {O; μ , α̂, μÅ .X/, g , g , b.j / }] = 0 n t t t 1 0 will be an efficient estimator of μtÅ whose influence function approximates Π[φ1 {O; μÅt , αÅ , Å Å Å μtÅ .X/}|Λ⊥ ]. This estimator, however, depends on μtÅ .X/ and .g1.j / , g0.j / , b.j / /, which as seen in the preceding displays depends on conditional expectations of the form E[u.X, Yt /|X] and E[v.X/|S, Yt ] and marginal expectations of the form E[u.X, Yt /] and E[v.X/] where u.·/ and v.·/ are functions of .X, Yt / and X respectively. To proceed, we shall need to utilize a working model for μtÅ .X/. Regardless of whether this working model is correctly specified and how many iterations are used to compute the projection, the resulting estimators will be consistent, since it can be shown that E[IFt {O; μÅt , αÅ , f.X/, g1 , g0 , b}] = 0, whatever the choice of f , g1 , g0 and b. However, if this working model is correctly specified, its parameters can be estimated at n1=4+" -rates and the number of iterations in the projection is large, the resulting estimator will be efficient, achieving the semiparametric variance bound. As a result, our estimator is said to be ‘LE’. Next, we discuss the implementation of the LE estimator for a binary outcome and a continuous outcome separately. 3.4.1. Binary outcomes For a binary outcome, we can posit a working logistic regression model for μtÅ .X/ = μt .X; η Å /, where logit{μ .X; η Å /} = j.t, X; η Å / t and j.t, X; η/ is a specified function of t, X and η. To estimate η Å , we note that μ .X; η Å / = E[Y |X, T = t] = E[Y |X, T = t], t t where the first equality follows by the assumption of no unmeasured confounding and the second by the stable unit treatment assumption. Thus, η Å can be estimated as the solution η̂ to the weighted Yt given X score equations where each individual receives weight VI{T =t } =qT .S, YT /. Let @j.t, X; η/ Stη .Y , X; η/ = {Y − μt .X; η/}; @η then η̂ is the solution to Causal Inference in Sampling Designs En VI{T =t } qT .S, YT / 955 Stη .Y , X; η/ = 0: The projections above require estimation of conditional and marginal expectations of the form E[u.X, Yt /|X], E[v.X/|S, Yt ], E[u.X, Yt /] and E[v.X/]. We can estimate the conditional expectation E[u.X, Yt /|X] by u.X, 1/ μt .X; η̂/ + u.X, 0/{1 − μt .X; η̂/}, where μt .X; η/ = exp{j.t, X; η/} : 1 + exp{j.t, X; η/} This estimator presumes, as we do, that u.x, 1/ and u.x, 0/ are well defined for all x. If S contains only categorical variables, the conditional expectation E[v.X/|S, Yt ] can be estimated by weighted, saturated regression with individual weights VI{T =t } ={qt .S,Yt / πt .X; α̂/}. If S contains continuous variables, then the bounded conditional expectations such as E[πτ .X; αÅ /|S, Yτ ] need careful modelling. Alternatively, one may choose to categorize continuous variables in S. This may cause some loss of efficiency but can avoid instability and model incompatibility due to misspecification of the models for, for example, E[πτ .X; αÅ /|S, Yτ ]. The marginal expectations E[u.X, Yt /] and E[v.X/] can be estimated by weighted averages with individual weights VI{T =t } ={qt .S,Yt / πt .X; α̂/} and V=qT .S,YT / respectively. Å Our resulting estimator will be the solution μ̂t.j / to .j Å / .j Å / Å En [IFt {O; μt , α̂, μt .X; η̂/, ĝ1 , ĝ0 , b̂.j / }] = 0 .j Å / .j Å / .j Å / are obtained by replacing conditional and marginal expectations by where ĝ 1 , ĝ 0 and b̂ their estimates described above, and j Å is an iteration number in which the iterative algorithm .j Å / LE stops. We define μ̂t = μ̂t . When the working model for μÅt .X/ is correctly specified, the influÅ/ .j Å / .j Å / .j Å / .j Å Å Å ence function of μ̂t will be IFt {O; μt , α , μt .X; η /, g1 , g0 , b }. TheÅ asymptotic variance .j Å / .j / .j Å / }2 ]. of this estimator can be estimated by En [IFt {O; μ̂LE t , α̂, μt .X; η̂/, ĝ 1 , ĝ 0 , b̂ Å If, however, the working model for μt .X/ is incorrectly specified, then η̂ converges to η̃ = η Å . Consequently, IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b} is not an influence function because it is no longer orthogonal to the nuisance tangent space. To obtain the influence function, we must subtract its projection onto the space that is spanned by the score functions for α. The resulting influence function is Å .j Å / .j Å / .j Å / .j Å / .j Å / Å Å IFR / = IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b.j / } t .O; μt , α , η̃, g1 , g0 , b Å .j Å / .j Å / @IFt {O; μÅt , αÅ , μt .X; η̃/, g1 , g0 , b.j / } −E @α V @Sα .T , X; α/ −1 V ×E Sα .T , X; α/: qT .S, Y/ @α qT .S, Y/ .8/ It is important to note that the second term in this influence function willÅ be 0 if η̃ = η Å . Å/ .j LE LE R .O; μ̂ , α̂, η̂, ĝ , ĝ .j / , b̂.j Å / /2 ], where We estimate the asymptotic variance of μ̂ by En [IF t t 1 0 Å/ Å/ Å/ .j .j Å Å R .j R .O; μ , α , η̃, g , g , b / is the same as IF .O; μÅ , αÅ , η̃, g .j Å / , g.j Å / , b.j Å / / except that IF t t t t 1 0 1 0 the expectations are replaced by empirical averages. The asymptotic variance that is calculated using equation (8) and its associated estimator will be referred to as the ‘robust variance’ since it is robust to misspecification of the working model for μtÅ .X/. 3.4.2 Continuous outcomes For a continuous outcome, we can specify a working model for μÅt .X/ in one of two ways. 956 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie First, we can specify a fully parametric model for the conditional distribution of Yt given X and estimate the parameters η Å by using weighted score equations as above. Second, we can specify a (semiparametric) conditional mean model for Yt given X and estimate the parameters η Å by using weighted estimating equations. One must also estimate conditional expectations of the form E[u.X, Yt /|X]. If the first approach of specifying the model for μÅt .X/ is adopted, then we can estimate these expectations by using the estimated conditional distribution of Yt given X. Since the second approach leaves the distribution form of Yt given X unspecified, we must introduce additional modelling assumptions here. Specifically, we can model the conditional expectations directly by specifying a regression model for u.X, Yt / given X and estimate the parameters by using additional weighted estimating equations. As the conditional expectations of this form must be repeatedly estimated, this approach can lead to issues of model incompatibility. The first approach does not suffer from this issue. If, from either approach, the resulting design matrix for the regression is the same as that for the model of T given X , the right-hand side of equation (3) will be equal to 0 and the LE estimator will reduce to the EDR estimator (see the next section). For a continuous outcome, smoothing will be required to estimate E[v.X/|S, Yt ]. As discussed in binary outcomes, one may choose to categorize S and Y to avoid instability and model incompatibility. As in binary outcomes, if the working model for μtÅ .X/ is misspecified, the robust variance should be calculated by using equation (8). Misspecification of E[u.X, Yt /|X] or E[v.X/|S, Yt ] will not affect the consistency of the resulting estimator: just the efficiency. 3.5. Doubly robust estimations In this section, we consider the subclass of influence functions with b.X/ = 0, i.e. φ1 {O; μÅt , αÅ , μtÅ .X/} + Λg . This class is generated by the model in which no restrictions are placed on the distribution of πtÅ .X/. Interestingly, this class of influence functions has the following properties: (a) E[IFt {O; μÅt , α̃, μt .X; η Å /, g1 , g0 , 0}] = 0 whatever α̃ is; (b) E[IF {O; μÅ , αÅ , μ .X; η̃/, g , g , 0}] = 0 whatever η̃ is. t t t 1 0 As a result, members of this subclass are said to be ‘doubly robust’ in the sense that the resulting estimators will be consistent and asymptotically normal when either the model for πtÅ .X/ or the model for μtÅ .X/ is correctly specified. In our setting, there are an infinite number of doubly robust estimators, depending on the selection of the functions g1 and g0 . The doubly robust estimator with the smallest variance will have influence function equal to the projection of φ1 {O; μÅt , αÅ , μt .X; η Å /} onto Λ⊥ g . Using previous results, this projection is equal to .1/ ht {O; μÅt , αÅ , μt .X; η Å /} in equation (4), which is the first iteration in the alternating projection algorithm. The previous subsection has provided a detailed illustration of the computation .1/ of ht {O; μÅt , αÅ , μt .X; η Å /}. It only involves conditional expectations of the form E[v.X/|S, Yt ] and is computationally easier than the LE estimator. Although misspecification of E[v.X/|S, Yt ] may result in a loss of efficiency, the estimator of μtÅ will still remain doubly robust. We denote the most efficient doubly robust estimator as the EDR estimator, μ̂EDR , with the word enriched t to emphasize that it uses information from the non-validation sample. The LE estimator is not doubly robust, in contrast with many other cases discussed in van der Laan and Robins (2003). When either the model for πtÅ .X/ or the model for μÅt .X/ is incorrectly specified, the influence function for the resulting estimator can be shown to be Causal Inference in Sampling Designs .1/ IFEDR .O; μÅt , α̃, η̃/ = ht {O; μÅt , α̃, μt .X; η̃/} t .1/ @ht {O; μÅt , α̃, μt .X; η̃/} −E @α E V @Sα .T , X; α̃/ qT .S, Y/ @α .1/ @h {O; μÅ , α̃, μ .X; η̃/} V t t t Sα .T , X; α̃/ − E qT .S, Y/ @η VI{T =t } @Stη .Y , X; η̃/ −1 VI{T =t } Stη .Y , X; η̃/ , ×E qT .S, Y/ @η qT .S, Y/ × 957 −1 .9/ where either α̃ = αÅ (the treatment regression model is correct) or η̃ = η Å (the outcome regression model is correct). In equation (9), the second term vanishes if η̃ = η Å (the outcome regression model is correct) and the third term vanishes if α̃ = αÅ (the treatment regression model is EDR .O; μ̂EDR , α̂, η̂/2 ], where correct). We estimate the asymptotic variance of μ̂EDR by En [IF t t EDR Å Å EDR .O; μt , α̃, η̃/ is the same as IFt .O; μt , α̃, η̃/ except that the expectations are replaced IFt by empirical averages. The asymptotic variance that is calculated by using equation (9) and its associated estimator will be referred to as the ‘doubly robust variance’ since it is robust to misspecification of the model for μtÅ .X/ or the model for πtÅ .X/. Another special choice of g1 and g0 is to set both equal to 0. The corresponding estimator does not use information on the non-validation sample and we refer to it as the SDR estimator. The .1/ robust variance for the SDR estimator takes the same form as equation (9) with ht replaced by φ1 . In the extreme case that every subject is selected to the validation sample (e.g. the sampling probability is 1), the EDR and LE estimators both become the same as the SDR estimator. 3.6. Enriched inverse-weighted estimator When compared with the SIW estimator, the LE estimator will always have a smaller variance but it requires a working model for μÅt .X/ and is computationally more intensive. The simple or EDR estimators are computationally easier and have the nice property of double robustness, but they are not guaranteed to have smaller variances than the SIW estimator and a working model for μtÅ .X/ is also required. This motivates our search for a computationally feasible estimator, which always has a smaller variance than the SIW estimator. Here, we restrict attention to the influence functions that are elements of the linear variety: IFSIW .O; μÅt , αÅ / + Λg . Among this t .O; μÅt , αÅ / onto set of influence functions, the most efficient is equal to the projection of IFSIW t ⊥ Λg . We define .O; μÅt , αÅ / = IFSIW .O; μÅt , αÅ / − Π[IFSIW .O; μÅt , αÅ /|Λg ]: IFEIW t t t Equation (2) shows how to project a general function of the observed data onto Λg . When h.O/ = IFSIW .O; μÅt , αÅ /, the ratio of the conditional expectations for τ in equation (2) takes the form t − I{τ =t } .Yt − μtÅ / −.−1/τ E[@UtSIW .O; μÅt , αÅ /=@α] E[{V=qT .S, Y/}@Sα .T , X; αÅ /=@α]−1 E[πτ .X; αÅ /|S, Yτ ] × E[{@l.X; αÅ /=@α}R.X; α/|S, Yτ ] E[πτ .X; αÅ /|S, Yτ ]−1 : The EIW estimator does not require a model for μtÅ .X/. It does, however, require estimation of conditional expectations of the form E[v.X/|S, Yt ] (see Sections 3.4.1 and 3.4.2 for estimation strategies; model misspecification may result in loss of efficiency, in which case the EIW estimator 958 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie may not always have a smaller variance than the SIW estimator). This EIW estimator is not as efficient as the LE estimator and is not doubly robust. The asymptotic variance of this estimator EIW .O; μ̂EIW , α̂/2 ] where IF EIW .O; μ̂EIW , α̂/ is the same as IFEIW .O; can be estimated by En [IF t t t t t EIW μ̂t , α̂/ with expectations replaced by empirical averages. 3.7. Summarizing remarks We summarize the relationships between the five estimators that were presented above. By the representation in equation (1), we obtain the SIW estimator by setting g1 = 0, g0 = 0 and b = bSIW . Fixing b = bSIW and optimizing efficiency over .g1 , g0 /, we obtain the EIW estimator. With g1 , g0 and b all equal to 0, we obtain the SDR estimator. With b = 0 and optimizing efficiency over .g1 , g0 /, we obtain the EDR estimator. Optimizing the efficiency over functions .g1 , g0 , b/, we obtain the LE estimator. 4. Simulation studies Sections 4.1 and 4.2 present simulation studies for binary and continuous outcomes respectively. 4.1. Binary outcomes We assess the finite sample performance of the five proposed estimation procedures for μ1 and μ0 . In Section 4.1.1, we evaluate the performance when we correctly specify regression models for the outcome and for treatment. In Section 4.1.2, we evaluate the performance under various types of model misspecification. In our simulations, we only implemented, because of computational complexity, the first two iterations in the computation of the LE estimator (i.e. j Å = 2). We present simulation results for two sample sizes: 800 and 2000. For each simulation, 2000 data sets were generated and summarized. 4.1.1. Efficiency In this simulation, S = .S1 , S2 /, S1 ∼ Bernoulli.0:5/, S2 ∼ Bernoulli.0:3/ and W = .W1 , W2 , W3 / ∼ N.0, I/. The five covariates were independent. Further, Y1 and Y0 given S and W were assumed to be independent Bernoulli distributed with probabilities of success equal to expit.S1 + 0:3S2 + W1 + 0:2W2 + 0:1W3 / and expit.1 + 1:5S1 + 0:5S2 + 1:5W1 + 0:2W2 + 0:2W3 / respectively. The distribution of T given X, Y1 and Y0 was assumed to be Bernoulli{π1Å .X/}, where π1Å .X/ = expit.S1 + 0:2S2 + 0:8W1 + 0:05W2 + 0:05W3 /. Finally, V given T , X, Y1 and Y0 was assumed to be Bernoulli{qT .S, Y/}, where qT .S, Y/ was selected so that around 25% of the total sample is selected and the number of individuals selected from each of the 16 strata (defined by T , Y , S1 and S2 ) was set equal to the minimum of the stratum size and the total selected divided by 16. The simulation was designed to have strong selection bias. The true values of μ1 and μ0 were determined from a simulation with sample size equal to 108 (μ1Å = 0:6145 and μÅ0 = 0:7824/. From the same simulation, the true values of E[Y |T = 1] and E[Y |T = 0] were found to be 0.6828 and 0.6755 respectively. In the top section of Table 2, we summarize the results of the simulation. Table 2 shows that, for each sample size, all estimators are unbiased (see the third and seventh columns). The Monte Carlo variance (the fourth and eighth columns) agrees very well with the average of the estimated variances (the fifth and ninth columns) for sample size 2000. For this sample size, the coverage of the estimated 95% confidence intervals is close to their nominal level. For sample size 800, the variance estimator appears to be slightly biased low and the estimated confidence Causal Inference in Sampling Designs 959 Table 2. Finite sample properties, efficiency and robustness comparisons for a binary outcome Sample size Estimator μ̂1 var ×104 var ×104 95% coverage probability μ̂0 var ×104 var ×104 95% coverage probability 800 SIW EIW SDR EDR LE SIW EIW SDR EDR LE 0.612 0.613 0.612 0.612 0.612 0.613 0.614 0.613 0.613 0.613 28.3 10.2 26.7 9.3 9.0 11.3 3.8 10.7 3.3 3.3 27.0 9.6 25.5 8.4 8.2 11.0 3.8 10.5 3.3 3.2 93.1 93.0 93.5 92.5 92.0 94.2 94.0 94.8 94.3 94.1 0.777 0.782 0.777 0.781 0.782 0.781 0.783 0.781 0.782 0.783 16.1 7.2 15.5 6.7 6.4 5.8 2.6 5.7 2.4 2.3 15.2 7.4 14.6 6.7 6.3 5.8 2.7 5.6 2.4 2.3 93.6 94.5 93.7 94.3 94.0 94.7 94.7 94.2 94.4 94.2 μtÅ (X) is misspecified 800 SIW EIW SDR EDR LE 2000 SIW EIW SDR EDR LE 0.608 0.611 0.609 0.610 0.610 0.613 0.613 0.613 0.613 0.613 29.5 11.0 29.5 13.5 14.0 11.4 3.9 11.4 4.4 4.4 27.5 10.0 27.6 11.8 11.9 11.0 3.8 10.9 4.1 4.2 93.5 92.9 93.7 94.3 93.7 94.5 94.2 94.5 94.2 93.7 0.779 0.783 0.776 0.787 0.789 0.781 0.782 0.780 0.783 0.784 15.9 7.2 17.5 15.6 14.5 5.8 2.5 6.0 3.9 3.7 15.2 7.4 17.6 13.0 11.8 5.8 2.7 6.2 3.9 3.7 93.8 93.8 94.3 94.3 93.5 95.0 95.1 95.0 95.8 95.1 π1Å (X) is misspecified 800 SIW EIW SDR EDR LE 2000 SIW EIW SDR EDR LE 0.665 0.668 0.611 0.613 0.647 0.666 0.668 0.612 0.614 0.648 23.6 5.1 27.0 8.4 5.8 9.4 1.8 10.4 2.9 2.0 23.0 5.3 25.4 8.0 5.4 9.2 2.0 10.3 3.1 2.0 78.6 35.6 93.8 93.5 69.9 60.8 3.0 94.3 95.4 34.9 0.699 0.705 0.779 0.783 0.735 0.701 0.704 0.780 0.782 0.733 21.5 7.8 15.1 7.5 6.8 9.1 2.9 6.0 2.6 2.5 22.1 8.1 14.8 7.0 7.5 8.6 2.9 5.8 2.5 2.8 60.1 20.0 94.3 93.4 60.7 20.5 0.4 94.0 95.0 14.2 2000 intervals tend to undercover slightly. In terms of efficiency, the SIW and SDR estimators are very inefficient. The other three estimators perform comparably, among which the EIW estimator has the biggest variance and the LE estimator has the smallest variance. Another interesting question that is addressed in our simulation is how much efficiency was lost because of not collecting W on the non-validation sample. As mentioned in Section 3.5, if W is collected on the entire sample (i.e. the sampling probability is 1), the LE and EDR estimators both reduce to the SDR estimator. In other words, the SDR estimator achieves full efficiency. We estimated the SDR estimator’s performance at sample size 2000 when both the models for μÅt .X/ and πtÅ .X/ are correctly specified, on the same simulated data sets that were used in the 6th–10th rows of Table 2, except that we retained W on all subjects. In this situation, the SDR estimator for μ1Å and μ0Å had Monte Carlo variances equal to 2:0 × 10−4 and 1:7 × 10−4 respectively. In comparison, when W was not retained on the non-validation sample, the Monte Carlo variances of the LE estimators were 3:3 × 10−4 for μ1Å and 2:3 × 10−4 for μ0Å . The loss of efficiency is 40% and 26% respectively. 960 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie 4.1.2. Robustness We generated data as in the previous subsection; however, we analysed them with various misspecified models. The robust variance was used for the LE estimator and the doubly robust variance was used for the SDR and EDR estimators. In the middle section of Table 2, we consider the case where the model for μÅt .X/ is misspecified. In our analysis, we incorrectly dropped W1 from the model for both μÅ1 .X/ and μÅ0 .X/. The SIW and EIW estimators remain unchanged since they do not utilize μt .X/. The other three estimators, as expected, remained unbiased (see the third and seventh columns) but lost some efficiency (fifth and ninth columns). The coverage rates of the estimated 95% confidence intervals are good at both sample sizes (the sixth and 10th columns), which indicates that our robust variance estimators have good performance. In the bottom section of Table 2, we consider the case where the model for πtÅ .X/ is misspecified. In our analysis, we incorrectly dropped W1 in the model for πt .X/. As expected, we see that only the two doubly robust estimators are consistent and only their associated confidence intervals achieve the nominal coverage rate. 4.2. Continuous outcomes We assess the finite sample performance of four proposed estimation procedures for μ1Å and μ0Å . We did not implement the LE estimator owing to its computational complexity. In Section 4.2.1, we evaluate the performance when we correctly specify regression models for the outcome and treatment. In Section 4.2.2, we evaluate the performance under various types of model misspecification. We present simulation results for three sample sizes: 800, 2000 and 10 000, and for each simulation we generated 2000 data sets. 4.2.1. Efficiency We generated S and W as in Section 4.1.1. We assumed that Y1 and Y0 , given S and W , followed independent normal distributions with means equal to S1 + 0:3S2 + W1 + 0:2W2 + 0:1W3 and 1 + 1:5S1 + 0:5S2 + 1:5W1 + 0:2W2 + 0:2W3 respectively, and common variance of 1=25. We assumed the same model for π1Å .X/ as in Section 4.1.1. Under these distributional assumptions, the true values of μÅ1 and μ0Å are 0.59 and 1.9 respectively. Owing to strong selection bias, E[Y |T = 1] and E[Y |T = 0] are biased estimates for μ1Å and μ0Å . With a sample size 108 , we estimated E[Y |T = 1] = 0:94 and E[Y |T = 0] = 1:057. In generating the validation sample, we dichotomized Y at 2. The validation sample was constructed by using the same idea as in Section 4.1.1. In the estimation of the conditional expectations of the form of E[v.X/|S, Yt ], semiparametric modelling of E[v.X/|S, Yt ] was initially attempted by using the GAM package in R (Hastie, 1992; Hastie and Tibshirani, 1990) but it produced unstable results. Instead, we assumed that the conditional expectation depended on Yt only through I.Yt > 2/ and we fitted a saturated regression model. In the top section of Table 3, we present the results when there is no model misspecification. In this simulation setting, the coverage rates of the SIW and EIW estimators for μ1 were substantially lower than 95% at the sample size 800. Some observations in the treated (T = 1) group had very large weights and caused instability in the SIW or EIW estimators. In such a situation, a much larger sample size would be needed to have the coverage rate close to 95%. As expected, the coverage rates improved as the sample size increased from 800 to 10 000. In contrast, the SDR and EDR estimators’ 95% confidence intervals have reasonable coverage rates when the sample size is 800 and were very close to 95% at sample size 2000. This shows that including Causal Inference in Sampling Designs 961 Table 3. Finite sample properties, efficiency and robustness comparisons for a continuous outcome Sample size Estimator μ̂1 var ×103 var ×103 95% coverage probability μ̂0 var ×103 var ×103 95% coverage probability 800 SIW EIW SDR EDR SIW EIW SDR EDR SIW EIW SDR EDR 0.614 0.641 0.589 0.614 0.599 0.607 0.591 0.599 0.592 0.593 0.590 0.591 18.3 12.0 10.1 9.6 6.7 4.1 3.9 2.9 1.2 0.7 0.7 0.5 12.9 9.8 10.2 7.2 5.8 3.8 4.0 2.6 1.2 0.8 0.8 0.5 87.5 83.7 94.2 92.0 91.6 88.8 95.2 94.2 94.5 93.5 96.0 95.2 1.953 1.920 1.899 1.882 1.922 1.910 1.901 1.895 1.906 1.905 1.900 1.900 71.4 67.9 21.8 18.8 24.7 21.6 8.3 6.0 4.0 3.7 1.5 1.1 76.7 70.8 22.1 16.6 24.6 22.0 8.6 6.1 4.2 3.7 1.6 1.2 94.7 92.3 94.8 94.1 94.8 93.8 95.2 95.0 95.5 95.0 96.7 95.3 μÅt (X) is misspecified 800 SIW EIW SDR EDR 2000 SIW EIW SDR EDR 10000 SIW EIW SDR EDR 0.621 0.641 0.627 0.622 0.599 0.607 0.601 0.600 0.591 0.593 0.592 0.592 20.1 13.9 21.9 18.1 6.7 4.1 6.4 4.2 1.2 0.8 1.1 0.8 14.4 12.4 18.3 17.5 5.8 3.8 5.6 4.0 1.2 0.8 1.2 0.8 87.2 81.2 87.4 86.7 91.6 88.8 92.1 91.8 95.0 92.5 94.7 93.8 1.967 1.934 1.980 1.970 1.922 1.910 1.923 1.921 1.904 1.902 1.903 1.904 78.0 74.7 113.9 125.7 24.7 21.6 25.9 27.2 4.1 3.6 4.0 4.2 84.4 80.4 116.9 131.7 24.6 22.0 25.2 27.4 4.2 3.8 4.1 4.4 94.8 90.8 95.2 94.0 94.8 93.8 95.5 94.9 95.2 94.8 95.5 95.1 π1Å (X) is misspecified 800 SIW EIW SDR EDR 2000 SIW EIW SDR EDR 10000 SIW EIW SDR EDR 0.875 0.863 0.594 0.591 0.866 0.861 0.593 0.590 0.862 0.861 0.590 0.590 15.4 2.8 9.6 6.0 6.6 0.9 4.1 1.9 1.2 0.2 0.7 0.3 15.0 2.9 10.2 4.8 6.2 1.0 4.0 1.9 1.2 0.2 0.8 0.4 37.2 0.5 94.8 93.2 7.8 0.0 94.4 94.5 0.0 0.0 95.3 95.7 1.269 1.248 1.907 1.903 1.257 1.246 1.904 1.899 1.247 1.246 1.900 1.900 27.8 10.8 20.1 17.1 11.4 3.6 8.7 6.4 2.0 0.7 1.6 1.2 29.7 11.3 22.0 15.0 11.0 3.8 8.5 6.0 2.0 0.7 1.6 1.2 4.8 0.2 95.3 92.8 0.0 0.0 94.5 94.5 0.0 0.0 94.8 95.2 2000 10000 the correct μtÅ .X/ can help with the instability owing to extreme weights. Comparing the four estimators, the EDR estimator is the most efficient. As theoretically predicted, the enriched estimator is always more efficient than the corresponding simple estimator. As for a binary outcome, we also evaluated how much efficiency was lost owing to not collecting W on the entire sample. At a sample size 10 000 and with models for μtÅ .X/ and πtÅ .X/ correctly specified, if W were available on the entire sample, the EDR estimator (which is the same as the SDR estimator) for μ1Å and μ0Å had Monte Carlo variances of 0:1 × 10−3 and 0:2 × 10−3 respectively. In contrast, as shown in Table 3, the EDR estimator had a Monte Carlo variance of 0:5 × 10−3 for μ1Å and 1:1 × 10−3 for μ0Å . The loss of efficiency is 80% and 82% respectively. 962 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie 4.2.2. Robustness In the middle section of Table 3, we consider the effects of misspecification of the model for μtÅ .X/. In our analysis, we incorrectly dropped W1 in the models for μtÅ .X/. Since the model for μtÅ .X/ is misspecified, the SDR and EDR estimators lost protection against instability of extreme weights, which is shown by the poor coverage rate of the 95% confidence intervals at sample size 800. The two doubly robust estimators also lost some efficiency. Finally, in the bottom section of Table 3, we consider the effects of misspecification of the model for πtÅ .X/. In our analysis, we incorrectly dropped W1 from this model. Here, only the two doubly robust estimators are consistent, and only for these two estimators are the coverage rates for the 95% confidence intervals near the nominal level. 5. Data analysis The NSCOT was designed to evaluate the effect of trauma centre care on mortality, functional outcomes and cost. An outcome-dependent, two-stage sampling procedure was used to sample individuals with moderately severe to severe injuries who were treated at trauma and non-trauma centres (MacKenzie et al., 2006). In this paper, we evaluate the causal effect of trauma centre versus non-trauma-centre care on in-hospital mortality. Here, T denotes the indicator of receiving trauma centre care; Y1 and Y0 denote the indicator of in-hospital death under trauma and non-trauma care respectively. Our goal is to estimate P.Yt = 1/ for t = 0, 1 and the relative risk P.Y1 = 1/=P.Y0 = 1/. 5.1. Sampling process The sample of patients was selected and eligibility determined in two stages. First, administrative discharge records and emergency department logs were prospectively reviewed to identify all patients ages 18–84 years who were treated at one of the 69 participating hospitals (18 trauma centres and 51 non trauma centres) for a moderately severe to severe injury. Several exclusion criteria were applied. A total of 18 198 patients met the initial eligibility criteria. In the second stage, all 1 438 hospital deaths and a random sample of 8021 patients discharged alive were stratified within hospitals on (a) age 18–64 years and 65–84 years, (b) international classification of diseases injury severity scores (Baker et al., 1974; MacKenzie et al., 1989) 15 or lower and greater than 15 and (c) principal body region injured, head, arms and legs, and other regions. Stratifying the sample of live hospital discharges was necessary to facilitate balance in baseline risk between trauma centres and non-trauma centres. Medical records were obtained for 1391 of the deaths in hospital. On the basis of a detailed medical record review, 1104 deaths met the final eligibility criteria. Patients who were discharged alive and selected for the study were contacted by mail at 3 months post injury. Those who did not refuse by mail were contacted by phone and consent was obtained for access to the medical record and interviews at 3 and 12 months. Of the 8021 live discharges that were selected for the study, 4866 were enrolled. Of those enrolled, 779 were determined ineligible on further review of their medical record (usually because their discharge diagnoses did not meet eligibility criteria), leaving 4087 live discharges who were finally eligible and for whom complete medical record data were abstracted. There are two differences in the sampling design of the NSCOT and that described in Section 2. First, the sampling strata are defined at individual hospital level, not the dichotomized treat- Causal Inference in Sampling Designs 963 ment level (e.g. trauma or non-trauma centres). Second, the inclusion eligibility was determined in two steps. Final eligibility could only be determined with a detailed medical record review and therefore was only known on the validation sample. Thus, the non-validation sample also contains ineligible individuals who should be excluded from the study but could not be identified. The five estimators that are proposed in this paper can be extended to accommodate these differences with some additional assumptions. Such extensions are beyond the scope of this paper and will be the subject of future work. For illustration, we modified the NSCOT data so that they fit into the framework that is described in Section 2. First, we aggregated the hospital level strata into trauma–non-trauma strata and recalculated the sampling weights. Second, we assumed that, within each sampling strata in the second stage of sampling, the final eligibility of an individual does not depend on whether he or she was sampled for validation, agreed to enrol or whether his or her complete medical record could be found. With this assumption, we then estimated the number of finally eligible individuals in the non-validation sample. Only these non-validated individuals were retained in the data set. At the end of this process, a total number of n = 15 617 validated and non-validated individuals were considered part of our data set. 5.2. Analysis We used the same treatment regression model for πt .X/ as in MacKenzie et al. (2006). Separate regression models were fitted for μ1Å .X/ and μ0Å .X/. The models included main effect terms for the following covariates: age (natural cubic spline), gender, race, Charlson score, mechanism of injury, first emergency department shock, field Glasgow coma scale motor, first emergency department Glasgow coma scale motor, maximum abbreviated injury scale, maximum abbreviated injury scale for head injury, thorax injury, abdomen injury and extremity injury, first emergency department pupils, coagulopathy, obesity, flail chest, open skull, paralysis, long bone fractures or amputation, midline shift and insurance status. An interaction between mechanism of injury and insurance status was also included. The estimated in-hospital mortality and relative risk from the five different estimation procedures are shown in Table 4. The five different estimates of in-hospital mortality under trauma care are similar and comparable with the observed mortality among those who were treated at trauma centres. The standard errors are also comparable. For non-trauma-centre care, the estimators of in-hospital mortality are almost twice as large as the observed mortality among those Table 4. In-hospital mortality from different estimators Result for the following estimators: Trauma centre (%) Non-trauma centre (%) Relative risk Observed SIW EIW SDR EDR LE 7:64† .0:28/‡ 11.18 (2.36) 0.68 (0.40–0.97)§ 7.55 (0.25) 10.82 (2.36) 0.70 (0.40–1.00) 7.56 (0.28) 10.40 (1.82) 0.73 (0.52–1.03) 7.47 (0.25) 10.19 (1.81) 0.73 (0.48–0.99) 7.47 (0.25) 10.50 (1.90) 0.71 (0.50–1.02) †Estimated in-hospital mortality. ‡Standard error. §95% confidence interval of relative risk. 7.88 5.77 1.37 964 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie treated at non-trauma centres. The EDR estimator yields the lowest point estimate and also has the lowest standard error. All five estimation procedures yield similar relative risk estimates. We assumed asymptotic normality for the logarithm of relative risk and derived its confidence interval by using the delta method. This confidence interval was then exponentiated to obtain the confidence interval for the relative risk. The SDR, EDR and LE estimators had similar confidence interval lengths, which were more than 15% shorter than those of the SIW and EIW estimators. 6. Discussion In this paper, we presented five estimators for the causal effect of a binary treatment in the outcome-dependent, two-phase sampling design. On the basis of simulation and empirical work, we recommend use of the EDR estimator because of its reasonable efficiency, robustness to model misspecification and computational tractability. The SIW estimator is less efficient than the EIW estimator. The SDR estimator is less efficient than the EDR estimator. Compared with the EDR estimator, the EIW estimator does not require models for μÅt .X/. However, it is more sensitive to extreme weights and is less efficient when ‘good’ models for μtÅ .X/ are available. The LE estimator requires complex statistical modelling to gain its full efficiency, which we view as impractical in real data analyses. We implement a two-step LE estimator for a binary outcome and have found no evidence that it performs better than the EDR estimators in our simulation study. There are several directions for future research. First, in this paper, we assumed that the sampling probability qT .S, Y/ is known. However, in many studies the sampling weights are not known a priori and must be estimated. Even if they are known, it can be shown that treating them as estimates can yield more efficient estimators. In doing so, the influence functions no longer take the form of equation (1). To derive the new class of influence functions, we must take the influence functions in equation (1) and subtract their projection onto the space spanned by the score function from a model for the sampling probability. Second, in the presence of model misspecification, the EIW, SDR, EDR and LE estimators may not have smaller variances than the SIW estimator. Tan (2006) introduced a way to construct a doubly robust estimator which always improves over the SIW estimator. Future work could extend his method to outcome-dependent, two-phase sampling designs. Finally, we have used the independent and identically distributed observations framework for semiparametric inference that was developed by Bickel et al. (1993). The NSCOT study does not fit ideally into this framework. This stems from the fact that the NSCOT employed binomial sampling within strata. Future work should focus on causal inference methods under binomial and other sampling schemes. Appendix A Proposition 1. Define "1 = Y1 − μÅ1 and "0 = Y0 − μÅ0 and reparameterize the full data .X, Y0 , Y1 / to .X, "0 , "1 /. The nuisance tangent space of the full data is ΛF = {a."0 , "1 , X/ : E[a."0 , "1 , X/] = 0, E["0 a."0 , "1 , X/] = 0, E["1 a."0 , "1 , X/] = 0}: The orthogonal nuisance tangent space of the full data is .10/ ΛF, ⊥ = {B2×2 Causal Inference in Sampling Designs "0 : B is any 2 × 2 matrix of constants}: "1 965 .11/ Proof. The joint distribution f.X, "1 , "0 / is any distribution satisfying E["1 ] = 0 and E["0 ] = 0. The result follows. Under the assumption of no unmeasured confounding and that the sampling probability only depends on .S, T , Y/, the complete data L = .X, Y0 , Y1 , V , T/ has the likelihood f.X, Y0 , Y1 / P.T |X, Y0 , Y1 / P.V |T , X, Y0 , Y1 / = f.X, Y0 , Y1 / P.T |X/ P.V |T , S, Y/: Thus, the nuisance tangent space of the complete-data likelihood can be factorized as Λcomp = ΛF ⊕ ΛT |X ⊕ ΛV |T , S, Y : Since P.V |T , S, Y/ is known by study design, we have Λcomp = ΛF ⊕ ΛT |X : We shall also assume that πt .X/ is known up to a finite number of parameters with αÅ denoting the true parameter, i.e. π Å .X/ = π .X; αÅ /: t t Without loss of generality, we shall assume that logit{π1 .X; αÅ /} = l.X, αÅ / where l.X, αÅ / is a specified function of X and α which is differentiable in α. This model implies that exp{t l.X; αÅ /} πt .X; αÅ / = : 1 + exp{l.X; αÅ /} Let r be the length of αÅ . Proposition 2. @l.X; αÅ / ΛT |X = B2×r {T − π1 .X; αÅ /}: for all B : @α Proof. The likelihood is pT |X .t, x; α/ = π1 .x; α/t {1 − π1 .x; α/}1−t : The log-likelihood is equal to log{pT |X .t, x; α/} = t log{π1 .x; α/} + .1 − t/ log{1 − π1 .x; α/}: The score function is Sα = t 1−t − π1 .x; α/ 1 − π1 .x; α/ Table 5. (V, T) .0, 1/ .0, 0/ .1, 1/ .1, 0/ @π1 .x; αÅ / @l.x; αÅ / = {t − π1 .x; α/}: @α @α Missingness patterns O P(V ,T |X,Y0 ,Y1 ) .S, Y1 / .S, Y0 / .X, Y1 / .X, Y0 / {1 − q1 .S, Y1 /} π1Å .X/ {1 − q0 .S, Y0 /} π0Å .X/ q1 .S, Y1 / π1Å .X/ q0 .S, Y0 / π0Å .X/ 966 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie Since ΛT |X is the space that is spanned by Sα , the result follows. The observed data are O = .S, VW , TY1 , .1 − T/Y0 , V , T/, indicating that there are missing data. The missingness patterns are listed in Table 5. The data are not missing at random because the probability of missingness depends on W which is not always observed. Let Λ1 = ΛF and Λ2 = ΛT |X . The nuisance tangent space for the observed data is equal to O ΛO = ΛO 1 + Λ2 , ΛO j = range.g ◦ Πj /, j = 1, 2, where range.·/ is the range of an operator. g : ΩL → ΩO , g.·/ = E[·|O], ΩL and O Ω are spaces of mean 0 random functions of L and O respectively, Πj is the projection operator from ΩL to Λj and S̄ denotes the closed linear span of the set S (Bickel et al., 1993). Since the data are not missing O at random, ΛO 1 and Λ2 are not orthogonal. Regardless, we still have the result ⊥ ⊥ ΛO, ⊥ = ΛO, ∩ ΛO, : 1 2 ⊥ Rotnitzky and Robins (1997) showed how to compute ΛO, . 1 Proposition 3. ⎫ ⎧ ⎞ ⎛ VT Å/ ⎪ ⎪ − μ .Y ⎬ ⎨ 1 1 Å ⎟ O, ⊥ 2×2 ⎜ q1 .S, Y1 / π1 .X; α / .2/ .2/ .2/ + A , : for all B, A ∈ Λ Λ1 = B ⎝ ⎠ V.1 − T/ ⎪ ⎪ ⎭ ⎩ .Y0 − μÅ0 / Å q0 .S, Y0 / π0 .X; α / where Λ.2/ = {b.O/ : E[b.O/|L] = 0}. Proposition 4. Λ where .2/ = 1 τ =0 φ2, τ .O; gτ / + φ3 .O; αÅ , b/ : for all g0 , g1 , b V gτ .S, Yτ /, qτ .S, Yτ / V T − π1 .X; αÅ / b.X/, φ3 .O; αÅ , h/ = qT .S, YT / R.X; αÅ / φ2, τ .O; gτ / = 1{T =τ } 1 − g1 , g0 and b are arbitrary 2 × 1 vectors of functions of their arguments and R.X; αÅ / = π1 .X; αÅ / π0 .X; αÅ /. Proof. Consider a function of the observed data h.O/ = .1 − V/T g1 .S, Y1 / + .1 − V/.1 − T/ g2 .S, Y0 / + VT g3 .X, Y1 / + V.1 − T/ g4 .X, Y0 /: Then, E[h.O/|L] = {1 − q1 .S, Y1 /} π1Å .X/ g1 .S, Y1 / + {1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / + q1 .S, Y1 / π1Å .X/ g3 .X, Y1 / + q0 .S, Y0 / π0Å .X/ g4 .X, Y0 /: Setting the expectation to zero, we have {1 − q1 .S, Y1 /} π1Å .X/ g1 .S, Y1 / + q1 .S, Y1 / π1Å .X/ g3 .X, Y1 / = −{1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / − q0 .S, Y0 / × π0Å .X/ g4 .X, Y0 /: The left-hand side is a function of .X, Y1 / and the right-hand side is a function of .X, Y0 /. For them to be equal, both must be a function of X alone. Thus, {1 − q .S, Y /} π Å .X/ g .S, Y / + q .S, Y / π Å .X/ g .X, Y / = b.X/, 1 1 1 1 1 1 1 1 3 1 −{1 − q0 .S, Y0 /} π0Å .X/ g2 .S, Y0 / − q0 .S, Y0 / π0Å .X/ g4 .X, Y0 / = b.X/ where b.X/ is a function of X. Then, Causal Inference in Sampling Designs g3 .X, Y1 / = − 1 − q1 .S, Y1 / 1 b.X/ g1 .S, Y1 / + q1 .S, Y1 / q1 .S, Y1 / π1Å .X/ g4 .X, Y0 / = − 1 − q0 .S, Y0 / 1 b.X/ g2 .S, Y0 / − q0 .S, Y0 / q0 .S, Y0 / π0Å .X/ 967 and Simple algebra leads to the proposition. Scharfstein et al. (1999) showed the following result. Proposition 5. ⊥ = {h.O/ : Π[h.O/|ΛT |X ] = 0} ΛO, 2 = {h.O/ : h.O/ ∈ ΛT |X, ⊥ }: For simplicity of notation, we consider the estimation of μt from now on. Proposition 6. The orthogonal nuisance tangent space of the observed data is equal to ΛO, ⊥ = {B φ1 {O; μÅt , αÅ , μtÅ .X/} + 1 τ =0 φ2, τ .O; gτ / + φ3 .O; αÅ , b/, : for all B} where φ1 {O; μÅt , αÅ, μtÅ .X/} = VI{T =t} V T −π1 .X; αÅ / π1−t .X; αÅ /{μtÅ .X/−μÅt }, .Yt − μÅt /−.−1/t Å qt .S, Yt / πt .X; α / qT .S,YT / R.X; αÅ / V φ2, τ .O; gτ / = 1{T =τ } 1 − gτ .S, Yτ /, qτ .S, Yτ / V T − π1 .X; αÅ / b.X/, φ3 .O; αÅ , b/ = qT .S, YT / R.X; αÅ / g1 and g0 are arbitrary functions of their arguments, b.X/ ∈ T ⊥ and @l.X; αÅ / T = k : k is an arbitrary constant vector : @α ⊥ and find out what conditions it must satisfy so that it also belongs Proof. We take an element from ΛO, 1 ⊥ O, ⊥ ⊥ to ΛO, . Note that φ .O; g / ∈ Λ and that, for any h.O/ ∈ ΛO, , we have 2, τ τ 2 2 1 VI{T =t} @l.X; αÅ / Å / + φ .O; αÅ , b / E h.O/ {T − π1 .X; αÅ /} = E − μ .Y t 3 t t @α qt .S, Yt / πt .X; αÅ / Å @l.X; α / × {T − π1 .X; αÅ /} @α @l.X; αÅ / =E [.−1/t−1 {μÅt .X/ − μtÅ } π1−t .X; αÅ / + bt .X/] : @α Therefore, bt .X/ = .−1/t π1−t .X; αÅ /{μtÅ .X/ − μtÅ } + b.X/, where E[{@l.X; αÅ /=@α} b.X/] = 0, i.e. b.X/ ∈ T ⊥ . Proposition 7. The set of influence functions of all RAL estimators of μÅt is ⊥ Å Å Å ΛO, Å = {φ1 {O; μt , α , μt .X/} + 1 τ =0 φ2, τ .O; gτ / + φ3 .O; αÅ , b/}: Proof. The set of influence functions of all RAL estimators is equal to 968 W. Wang, D. Scharfstein, Z. Tan and E. J. MacKenzie @h{O; μÅt , αÅ , μtÅ .X/} ⊥ Å Å Å O, ⊥ = −1 : ΛO, Å = h{O; μt , α , μt .X/} ∈ Λ : E @μÅt It is straightforward to show that, for any h{O; μÅt , αÅ , μtÅ .X/} ∈ ΛO, ⊥ , @h{O; μÅt , αÅ , μtÅ .X/} E = −B: @μÅ t Therefore, B should be 1. Proposition 8. For an arbitrary function of observed data h.O/, 1 E[h.O/I{T =τ } {1 − V=qτ .S, Yτ /} |S, Yτ ] V : I{T =τ } 1 − Π[h.O/|Λg ] = q .S, Y / E[I{T =τ } {1 − V=qτ .S, Yτ /}2 |S, Yτ ] τ τ τ =0 Proof. Suppose that Π[h.O/|Λgτ ] = φ2, τ .O; gτÅ /; then, for any gτ , E[{h.O/ − φ .O; g Å /} φ .O; g /] = 0: 2, τ gτÅ can be directly solved from the above equation. τ 2, τ τ Proposition 9. For an arbitrary function of observed data h.O/, Å V T − π1 .X; αÅ / −1 @l.X; α / A.X/ , Π[h.O/|Λb ] = Ch .X/ − kh qT R.X; αÅ / @α where Ch .X/ = E h.O/ .12/ V T − π1 .X; αÅ / , X qT .X, YT / R.X; αÅ / A.X/ = π1 .X; αÅ /−1 E[q1 .S, Y1 /−1 |X] + π0 .X; αÅ /−1 E[q0 .S, Y0 /−1 |X] and @l.X; αÅ / @l.X; αÅ / @l.X; αÅ / −1 : E A.X/−1 kh = E A.X/−1 Ch .X/ @α @α @α Proof. Suppose that Π[h.O/|Λb ] = φ3 .O; αÅ , bÅ /; bÅ can be determined by the following two restrictions: (a) for any b, E[{h.O/ − φ3 .O; αÅ , bÅ /} φ3 .O; αÅ , b/] = 0; (b) bÅ ∈ T ⊥ . References Baker, S. P., O’Neill, B., Haddon, Jr, W. and Long, W. B. (1974) The Injury Severity Scores: a method for describing patients with multiple injuries and evaluating emergency care. J. Trauma, 14, 187–196. Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semi Parametric Models. Baltimore: Johns Hopkins University Press. Breslow, N. E. (2000) Statistics in the life and medical sciences. J. Am. Statist. Ass., 95, 281–282. Breslow, N. E. and Cain, K. C. (1988) Logistic regression for two-stage case-control data. Biometrika, 75, 11– 20. Breslow, N. E. and Holubkov, R. (1997a) Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. R. Statist. Soc. B, 59, 447–461. Breslow, N. E. and Holubkov, R. (1997b) Weighted likelihood pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Statist. Med., 16, 103–116. Breslow, N., McNeney, B. and Wellner, J. A. (2003) Large sample theory for semiparametric regression models with two-phase outcome dependent sampling. Ann. Statist., 31, 1110–1139. Carroll, R. J. and Wand, M. P. (1991) Semiparametric estimation in logistic measurement error models. J. R. Statist. Soc. B, 53, 573–585. Chatterjee, N., Chen, Y.-H. and Breslow, N. E. (2003) A pseudoscore estimator for regression problems with two-phase sampling. J. Am. Statist. Ass., 98, 158–168. Causal Inference in Sampling Designs 969 Cochran, W. G. (1963) Sampling Techniques, p. 412. New York: Wiley. Cosslett, S. R. (1981) Maximum likelihood estimator for choice-based samples. Econometrica, 49, 1289–1316. Cosslett, S. R. (1983) Distribution-free maximum likelihood estimator of the binary choice model. Econometrica, 51, 765–782. Fears, T. R. and Brown, C. C. (1986) Logistic regression methods for retrospective case-control studies using complex sampling procedures. Biometrics, 42, 955–960. Hampel, F. R. (1974) The influence curve and its role in robust estimation. J. Am. Statist. Ass., 69, 383–393. Hastie, T. J. (1992) Generalized additive models. In Statistical Models in S (eds J. M. Chambers and T. J. Hastie), ch. 7. Pacific Grove: Wadsworth and Brooks/Cole. Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. London: Chapman and Hall. van der Laan, M. J. and Robins, J. M. (2003) Unified Methods for Censored Longitudinal Data and Causality. New York: Springer. Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999) Semiparametric methods for response-selective and missing data problems in regression. J. R. Statist. Soc. B, 61, 413–438. MacKenzie, E. J., Rivara, F. P., Jurkovich, G. J., Nathens, A. B., Frey, K. P., Egleston, B. L., Salkever, D. S. and Scharfstein, D. O. (2006) A national evaluation of the effect of trauma-center care on mortality. New Engl. J. Med., 354, 366–378. MacKenzie, E. J., Steinwachs, I. M. and Shankar, B. (1989) Classifying trauma severity based on hospital discharge diagnoses: validation of an ICD-9CM to AIS-85 conversion table. Med. Care, 27, 412–422. Newey, W. K. (1994) The asymptotic variance of semiparametric estimators. Econometrica, 62, 1349–1382. Neyman, J. (1938) Contribution to the theory of sampling human populations. J. Am. Statist. Ass., 33, 101–116. Pepe, M. S. and Fleming, T. R. (1991) A nonparametric method for dealing with mismeasured covariate data. J. Am. Statist. Ass., 86, 108–113. Robins, J. M., Hsieh F. and Newey, W. (1995) Semiparametric efficient estimation of a conditional density with missing or mismeasured covariates. J. R. Statist. Soc. B, 57, 409–424. Robins, J. M. and Ritov, Y. (1997) Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statist. Med., 16, 285–319. Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994) Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Ass., 89, 846–866. Rotnitzky, A. and Robins, J. M. (1997) Analysis of semiparametric regression models with non-ignorable nonresponse. Statist. Med., 16, 81–102. Rubin, D. B. (1986) Comments on “Statistics and causal inference”. J. Am. Statist. Ass., 81, 961–962. Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999) Reply to comments on “Adjusting for nonignorable drop-out using semiparametric nonresponse models”. J. Am. Statist. Ass., 94, 1135–1146. Schill, W., Jöckel, K.-H., Drescher, K. and Timm, J. (1993) Logistic analysis in case-control studies under validation sampling. Biometrika, 80, 339–352. Scott, A. J. and Wild, C. J. (1991) Fitting logistic regression models in stratified case-control studies. Biometrics, 47, 497–510. Scott, A. J. and Wild, C. J. (1997) Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71. Tan, Z. (2006) A distributional approach for causal inference using propensity scores. J. Am. Statist. Ass., 101, 1619–1637. Tsiatis, A. A. (2006) Semiparametric Theory and Missing Data. New York: Springer. van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge: Cambridge University Press. Weaver, M. A. and Zhou, H. (2005) An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J. Am. Statist. Ass., 100, 459–469. White, J. E. (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease. Am. J. Epidem., 115, 119–128.
© Copyright 2026 Paperzz