Biostatistics (2012), 13, 3, pp. 455–467
doi:10.1093/biostatistics/kxr038
Advance Access publication on November 13, 2011

Doubly robust estimation of the generalized impact fraction

MASATAKA TAGURI∗
Department of Biostatistics and Epidemiology, Graduate School of Medicine, Yokohama City University, 4-57 Urafune, Minami-ku, Yokohama 232-0024, Japan
[email protected]

YUTAKA MATSUYAMA AND YASUO OHASHI
Department of Biostatistics, School of Public Health, University of Tokyo, Tokyo, Japan

AKIKO HARADA
Stress Research Institute, Public Health Research Foundation, Tokyo, Japan

HIROTSUGU UESHIMA
Lifestyle-Related Disease Prevention Center, Shiga University of Medical Science, Shiga, Japan

SUMMARY

The attributable fraction (AF) is commonly used in epidemiology to quantify the impact of an exposure on a disease. Recently, Sjölander and Vansteelandt (2011. Doubly robust estimation of attributable fractions. Biostatistics 12, 112–121) introduced the doubly robust (DR) estimator of the AF, which involves positing models for both the exposure and the outcome and is consistent if at least one of these models is correct. In this article, we derived a DR estimator of the generalized impact fraction (IF) with a polytomous exposure. The IF is a measure that generalizes the AF by allowing the possibility of incomplete removal of the exposure. We demonstrated the performance of the proposed estimator via a simulation study and by application to data from a large prospective cohort study conducted in Japan.

Keywords: Attributable fraction; Causal inference; Doubly robust estimation; Generalized impact fraction; Inverse probability weighting.

1. INTRODUCTION

From a public health viewpoint, accurately quantifying the risk attributable to an exposure is especially important. Levin (1953) introduced the concept of the population attributable fraction (AF), which is defined as the fractional reduction of disease events that would occur if the exposure were eliminated.
To estimate the AF with adjustment for confounding, regression model-based maximum likelihood (ML) approaches have been proposed (Greenland and Drescher, 1993). While the ML approaches rely on a regression model for predicting the outcome (outcome regression model), Sjölander (2011) demonstrated that estimation of the AF can also be accomplished through inverse probability weighting (IPW), which is based on a regression model for the exposure (exposure regression model). Recently, Sjölander and Vansteelandt (2011) introduced a doubly robust (DR) estimator of the AF. The especially useful property of the DR estimator in this context is that it is consistent if either the outcome regression model or the exposure regression model is correct. See Bang and Robins (2005) and the references cited therein for details on the development of DR estimators for various parameters.

However, Sjölander and Vansteelandt (2011) considered only the case in which the exposure is dichotomous and completely removed. Although the AF is useful for measuring the entire disease burden attributable to the exposure, the complete removal of an exposure is often unrealistic when the aim is to quantify the effect of a population intervention. For polytomous or continuous exposures, such as blood pressure or body mass index (BMI), complete removal is especially unrealistic, and for continuous exposures it may even be impossible. A measure that generalizes the AF by allowing the possibility of an incomplete intervention is the generalized impact fraction (IF) (Morgenstern and Bursic, 1982; Drescher and Becher, 1997). The IF is defined as the fractional reduction of disease events if the exposure distribution is changed from an observed pattern to a particular modified pattern.

∗ To whom correspondence should be addressed.
© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].
The goal of this study was to develop a DR estimator of the IF with a polytomous exposure for cohort/cross-sectional studies. In Section 2, we describe the notation, definition, and assumptions. In Section 3, we describe the method for obtaining a DR estimator of the IF and introduce its asymptotic properties. In Section 4, we investigate the finite sample behavior of the proposed estimator in a simulation study. In Section 5, we illustrate the methods using data from a large prospective cohort study. Finally, we provide some concluding remarks and discussion in the final section.

2. DEFINITION

Let $A$ denote an observed discrete or continuous exposure of interest and $Y$ denote an observed dichotomous outcome. Also, let $L$ be a vector of confounding factors, with components being discrete, continuous, or both. We assume that $L$ is a sufficient set of confounders for the association between $A$ and $Y$ (see Assumption (2.2) below). Without loss of generality, we use $p_{\cdot}(\cdot)$ for both probabilities and densities. We express the unconditional probability of disease as follows:

$$\Pr(Y = 1) = \int_{a,l} \Pr(Y = 1 \mid A = a, L = l)\, p_{A,L}(a, l)\, d\nu_{A,L}(a, l),$$

where integrals are taken with respect to the appropriate dominating measure $\nu$. Typically, $\nu$ is a counting measure if the variable is discrete, whereas it is a Lebesgue measure if the variable is continuous. Furthermore, suppose a hypothesized intervention in the population under study would result in a modified distribution of the exposure, as given by the distribution $p^{*}_{A,L}(a, l)$. Then, the unconditional probability of disease under this modification is as follows:

$$\Pr{}^{*}(Y = 1) = \int_{a,l} \Pr(Y = 1 \mid A = a, L = l)\, p^{*}_{A,L}(a, l)\, d\nu_{A,L}(a, l).$$

The IF is defined by the fractional reduction of a disease resulting from changing the current distribution of the exposure to some modified distribution as follows:

$$\mathrm{IF} = 1 - \frac{\Pr{}^{*}(Y = 1)}{\Pr(Y = 1)}. \tag{2.1}$$

Note that expression (2.1) is equivalent to the definition of the IF given by Drescher and Becher (1997).
One typical case of $p^{*}_{A,L}(a, l)$ is a reweighting of the original exposure distribution, that is,

$$p^{*}_{A,L}(a, l) = \int_{b} g(a \mid b, l)\, p_{A,L}(b, l)\, d\nu_{A}(b),$$

where $g(a \mid b, l)$ is the probability or probability density function of the modified target exposure distribution defined for each level of the participants' actual exposure and covariates $(A = b, L = l)$, and $p_{A,L}(b, l)$ denotes the probability (density) function of $(A, L)$ evaluated at $(b, l)$. For simplicity, we assume $g(a \mid b, l)$ is not dependent on $l$ and thus use $g(a \mid b)$ hereafter. The more general case will be briefly discussed at the end of this section. Also, we assume that the function $g(a \mid b)$ is completely specified by the analysts depending on the objectives.

For illustration, consider the example of Drescher and Becher (1997). In this case, the exposure is discrete with 4 levels (see Table 1). We assume that $100r_1\%$ of subjects change from their current level to the next lower level, whereas an additional $100r_2\%$ of subjects change to the lowest risk level. Moreover, subjects at the lowest level of exposure remain at that level.

For estimation of the IF defined above, we adopt the potential outcome framework (Rubin, 1974). Specifically, we postulate the existence of a set of potential outcomes $Y_a$ for $a \in \mathcal{A}$, where $\mathcal{A}$ is the set of all possible values of the exposure. $Y_a$ is the potential outcome that would be observed if $A$ were set to $a$. We make the consistency assumption that if $A = a$, then $Y = Y_a$. This assumption means that an individual's potential outcome under his/her observed exposure is precisely his/her observed outcome (VanderWeele, 2009). We also make the conditional exchangeability assumption, which states that conditional on $L$, $Y_a$ is independent of the exposure:

$$Y_a \perp\!\!\!\perp A \mid L, \quad \text{for all } a \in \mathcal{A}. \tag{2.2}$$

This assumption implies no unmeasured confounding within levels of the measured covariates. Assumption (2.2) is equivalent to the weak ignorability assumption (e.g.
Stone, 1993). Using the 2 causal assumptions mentioned above, the IF (2.1) can be expressed as follows (see Section A of the Supplementary Material available at Biostatistics online for the derivation):

$$\mathrm{IF} = 1 - \frac{\int_{a}\int_{b} g(a \mid b)\, p_{A}(b)\, \mu(a \mid b)\, d\nu_{A}(b)\, d\nu_{A}(a)}{\Pr(Y = 1)}, \tag{2.3}$$

where $p_{A}(b)$ is the probability (density) function of the exposure at $b$ and $\mu(a \mid b) = \Pr(Y_a = 1 \mid A = b)$ is the probability of disease occurrence under the exposure level $a$ in the subpopulation of participants whose actual exposure level is $b$. The numerator of the second term of (2.3) is the weighted average of $\mu(a \mid b)$, using the joint probability of target and actual exposure levels $\{g(a \mid b)\, p_{A}(b)\}$ as the weight. In the special case that $A$ is dichotomous and $g(a \mid b)$ puts unit mass on $a = 0$, (2.3) can be rewritten as follows:

$$\mathrm{AF} = 1 - \frac{\Pr(Y_0 = 1)}{\Pr(Y = 1)}.$$

This is the definition of the AF in Sjölander and Vansteelandt (2011).

Table 1. Example from Drescher and Becher (1997). Each column shows the probability function of the modified target exposure level conditional on the actual exposure level, $g(a \mid b)$

| Target exposure level $a$ | Actual exposure level $b = 0$ | $b = 1$ | $b = 2$ | $b = 3$ |
|---|---|---|---|---|
| 0 | 1 | $r_1 + r_2$ | $r_2$ | $r_2$ |
| 1 | 0 | $1 - (r_1 + r_2)$ | $r_1$ | 0 |
| 2 | 0 | 0 | $1 - (r_1 + r_2)$ | $r_1$ |
| 3 | 0 | 0 | 0 | $1 - (r_1 + r_2)$ |

Interestingly, using the facts that $\int_{a} g(a \mid b)\, d\nu_{A}(a) = 1$ for all $b \in \mathcal{A}$, and $\Pr(Y = 1) = \int_{b} p_{A}(b)\, \mu(b \mid b)\, d\nu_{A}(b)$, we can represent (2.3) as follows:

$$\mathrm{IF} = \frac{\int_{a}\int_{b} g(a \mid b)\, p_{A}(b)\, \Delta(a \mid b)\, d\nu_{A}(b)\, d\nu_{A}(a)}{\Pr(Y = 1)},$$

where $\Delta(a \mid b) = \mu(b \mid b) - \mu(a \mid b)$ is the causal contrast of the target and actual exposure levels in the subpopulation $A = b$. This indicates that the IF can be expressed as the weighted mixture of the causal effects in various subpopulations.

If $g(a \mid b)$ depends on a discrete univariate random variable $Z$ defined as a function of $L$, we only need to consider a table like Table 1 for each $Z = z$ and specify $g(a \mid b, z)$.
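As a rough illustration of how Table 1 and expression (2.3) fit together for a discrete exposure, the following Python sketch builds the matrix $g(a \mid b)$ and evaluates the IF from given values of $p_A(b)$ and $\mu(a \mid b)$. The function names, the array layout (rows indexed by the target level $a$, columns by the actual level $b$), and the numbers in the usage note are assumptions made for illustration; they are not taken from the article.

```python
import numpy as np


def table1_g(r1, r2, n_levels=4):
    """Build g[a, b] as in Table 1 (hypothetical helper, not from the article).

    For each actual level b >= 1, a fraction r1 moves to level b - 1,
    a fraction r2 moves to level 0, and the rest stay at level b.
    Level 0 is left unchanged.
    """
    g = np.zeros((n_levels, n_levels))
    g[0, 0] = 1.0
    for b in range(1, n_levels):
        g[b - 1, b] += r1          # next lower level (equals level 0 when b = 1)
        g[0, b] += r2              # lowest level
        g[b, b] += 1.0 - (r1 + r2)
    return g


def impact_fraction(g, p_A, mu):
    """Evaluate (2.3) for a discrete exposure.

    g[a, b]  : modified target distribution g(a|b); each column sums to 1
    p_A[b]   : Pr(A = b)
    mu[a, b] : Pr(Y_a = 1 | A = b)
    """
    p_A = np.asarray(p_A, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # By consistency, Pr(Y = 1) = sum_b p_A(b) * mu(b|b).
    p_y = np.sum(p_A * np.diag(mu))
    # Numerator of the second term of (2.3).
    p_y_star = np.sum(g * mu * p_A[np.newaxis, :])
    return 1.0 - p_y_star / p_y
```

For instance, with four equally likely exposure levels and made-up risks 0.1, 0.2, 0.3, 0.4 that do not depend on the actual level $b$, `impact_fraction(table1_g(0.2, 0.2), [0.25] * 4, mu)` with `mu[a, b] = risk[a]` evaluates to about 0.18, and setting $r_1 = r_2 = 0$ gives an IF of 0 because nobody's exposure changes.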
Then, the numerator of the second term of (2.3) can be replaced with $\int_{z} \Pr^{*}(Y = 1 \mid Z = z) \Pr(Z = z)\, d\nu_{Z}(z)$, where $\Pr^{*}(Y = 1 \mid Z = z) = \int_{a}\int_{b} g(a \mid b, z)\, p_{A \mid Z}(b \mid z)\, \mu(a \mid b, z)\, d\nu_{A}(b)\, d\nu_{A}(a)$, $\mu(a \mid b, z) = \Pr(Y_a = 1 \mid A = b, Z = z)$, and $p_{A \mid Z}(b \mid z)$ is the probability (density) function of the exposure at $b$ conditional on $Z = z$.

3. ESTIMATION AND INFERENCE

Although the IF in (2.3) is generally defined for both discrete and continuous exposures, here we focus on the discrete case. Extension to the continuous case is under investigation. See Section B of the Supplementary Material available at Biostatistics online for a brief description of the estimation strategy in the continuous case.

Suppose we observed $(Y_i, A_i, L_i)$ for each subject $i = 1, \ldots, n$, with $\mathcal{A} = \{0, 1, \ldots, K\}$. Since $\Pr(Y = 1)$ and $p_{A}(b)$, respectively, denote the disease and exposure probability under the current condition, they can be estimated with the simple sample proportions: $\widehat{\Pr}(Y = 1) = \sum_{i} Y_i / n = \bar{Y}$ and $\hat{p}_{A}(b) = \sum_{i} I(A_i = b)/n = n_b/n$, where $I(\cdot)$ denotes the indicator function. Therefore, we focus attention on estimating $\mu(a \mid b)$ for $a, b \in \mathcal{A}$ in (2.3). We consider the following 3 types of estimators of $\mu(a \mid b)$, as in Bang and Robins (2005), although they consider estimating the marginal mean of the potential outcomes, $E(Y_a)$.

The ML method uses models for $\Pr(Y = 1 \mid A = a, L) = m(a \mid L; \beta_a)$ for $a \in \mathcal{A}$. We denote by $\mathcal{M}_{I}$ the model that assumes $m(a \mid L; \beta_a)$ are true. Under the conditional exchangeability assumption, $\mu(a \mid b) = E\{E(Y \mid A = a, L) \mid A = b\}$. Based on this formula, we obtain the ML estimator as follows:

$$\hat{\mu}_{\mathrm{ML}}(a \mid b) = n_b^{-1} \sum_{i} I(A_i = b)\, m(a \mid L_i; \hat{\beta}_a),$$

where $\hat{\beta}_a$ denotes the ML estimator of the parameter for the outcome regression model $m(a \mid L; \beta_a)$. The ML estimator is identical to Robins' g-formula (Van Der Laan and Robins, 2003).

Next, the IPW method uses models for $\Pr(A = a \mid L) = \pi(a \mid L; \gamma_a)$ for $a = 1, \ldots, K$, with the category $a = 0$ as a reference.
We denote by $\mathcal{M}_{II}$ the model that assumes $\pi(a \mid L; \gamma_a)$ are true. Following the result of Sato and Matsuyama (2003), we obtain the IPW estimator as follows:

$$\hat{\mu}_{\mathrm{IPW}}(a \mid b) = n_b^{-1} \sum_{i} I(A_i = a)\, \pi^{-1}(a \mid L_i; \hat{\gamma}_a)\, \pi(b \mid L_i; \hat{\gamma}_b)\, Y_i,$$

where $\hat{\gamma}_a$ denotes the ML estimator of the parameter for the exposure regression model $\pi(a \mid L; \gamma_a)$.

To obtain the DR estimator, which is consistent as long as the union model $\mathcal{M}_{I} \cup \mathcal{M}_{II}$ holds, we follow the semiparametric missing data theory of Robins (Robins and others, 1994; Van Der Laan and Robins, 2003; Tsiatis, 2006), treating $\pi(a \mid L)$ as known for $a \in \mathcal{A}$. Using similar arguments to those in Van Der Laan and Robins (2003), the estimating function that yields the locally semiparametric efficient estimator of $\mu(a \mid b)$ is given by

$$
\begin{aligned}
\phi_{\mathrm{EFF}}\{\mu(a \mid b), \gamma_a, \gamma_b, \beta_a\}
&= \frac{I(A = a)\, \pi(b \mid L; \gamma_b)\{Y - \mu(a \mid b)\}}{\pi(a \mid L; \gamma_a)}
 - \frac{\{I(A = a) - \pi(a \mid L; \gamma_a)\}\, \pi(b \mid L; \gamma_b)\{m(a \mid L; \beta_a) - \mu(a \mid b)\}}{\pi(a \mid L; \gamma_a)} \\
&= \frac{I(A = a)\, \pi(b \mid L; \gamma_b)\{Y - \mu(a \mid b)\}}{\pi(a \mid L; \gamma_a)}
 - \left\{\frac{I(A = a)\, \pi(b \mid L; \gamma_b)}{\pi(a \mid L; \gamma_a)} - \pi(b \mid L; \gamma_b)\right\}\{m(a \mid L; \beta_a) - \mu(a \mid b)\}.
\end{aligned}
\tag{3.1}
$$

The resulting estimator $\hat{\mu}_{\mathrm{EFF}}(a \mid b)$ is locally efficient under the submodel $\mathcal{M}_{I}$ with known (or correctly estimated) $\pi(a \mid L)$. Unfortunately, unlike the standard marginal structural models (Van Der Laan and Robins, 2003), $\phi_{\mathrm{EFF}}\{\mu(a \mid b), \gamma_a, \gamma_b, \beta_a\}$ does not yield a DR estimator. The reason for this is that we use $\pi(b \mid L; \gamma_b)\{Y_a - \mu(a \mid b)\}$ (instead of $I(A = b)\{Y_a - \mu(a \mid b)\}$) as the full data estimating function, which includes $\pi(b \mid L; \gamma_b)$ by construction. Thus, the consistency of the resulting estimator crucially depends on the correct specification of $\pi(b \mid L; \gamma_b)$, regardless of the correctness of $m(a \mid L; \beta_a)$. However, we can use (3.1) with minor modifications to obtain the DR estimator.
Specifically, by replacing $\pi(b \mid L; \gamma_b)$ in the second term of the second braces in the last expression of (3.1) with $I(A = b)$, we get the DR estimating function as follows:

$$
\phi_{\mathrm{DR}}\{\mu(a \mid b), \gamma_a, \gamma_b, \beta_a\}
= \frac{I(A = a)\, \pi(b \mid L; \gamma_b)\{Y - \mu(a \mid b)\}}{\pi(a \mid L; \gamma_a)}
 - \left\{\frac{I(A = a)\, \pi(b \mid L; \gamma_b)}{\pi(a \mid L; \gamma_a)} - I(A = b)\right\}\{m(a \mid L; \beta_a) - \mu(a \mid b)\}.
$$

It is easy to show that $E[\phi_{\mathrm{DR}}\{\mu(a \mid b), \gamma_a, \gamma_b, \beta_a\}] = 0$ if either $\Pr(A = k \mid L) = \pi(k \mid L; \gamma_k)$ for $k = a, b$ or $\Pr(Y = 1 \mid A = a, L) = m(a \mid L; \beta_a)$. By solving the estimating equation $\sum_{i} \phi_{\mathrm{DR},i}\{\mu(a \mid b), \hat{\gamma}_a, \hat{\gamma}_b, \hat{\beta}_a\} = 0$, we get

$$
\hat{\mu}_{\mathrm{DR}}(a \mid b) = n_b^{-1} \sum_{i=1}^{n} \left[\frac{I(A_i = a)\, \pi(b \mid L_i; \hat{\gamma}_b)\, Y_i}{\pi(a \mid L_i; \hat{\gamma}_a)}
 - \left\{\frac{I(A_i = a)\, \pi(b \mid L_i; \hat{\gamma}_b)}{\pi(a \mid L_i; \hat{\gamma}_a)} - I(A_i = b)\right\} m(a \mid L_i; \hat{\beta}_a)\right].
$$

$\hat{\mu}_{\mathrm{DR}}(a \mid b)$ is DR in the sense that it is consistent if either $\pi(k \mid L; \gamma_k)$ for $k = a, b$ or $m(a \mid L; \beta_a)$ is correct.

Because the estimating functions for $\hat{\mu}_{\mathrm{EFF}}(a \mid b)$ and $\hat{\mu}_{\mathrm{DR}}(a \mid b)$ are similar, one may ask whether there are any circumstances in which $\hat{\mu}_{\mathrm{EFF}}(a \mid b)$ can be expected to perform much worse than $\hat{\mu}_{\mathrm{DR}}(a \mid b)$. We therefore evaluate the asymptotic bias of $\hat{\mu}_{\mathrm{EFF}}(a \mid b)$ when the working model $\pi(b \mid L; \gamma_b)$ is misspecified while $m(a \mid L; \beta_a)$ is correctly specified. Note that $\hat{\mu}_{\mathrm{DR}}(a \mid b)$ is consistent in this case. Let $\pi^{*}(b \mid L)$ be the probability limit of $\pi(b \mid L; \hat{\gamma}_b)$. Also, let $m(a \mid L)$ and $\pi(b \mid L)$ be the true functions $\Pr(Y = 1 \mid A = a, L)$ and $\Pr(A = b \mid L)$, respectively. Using the technique described in Vansteelandt and others (2010), we obtain the asymptotic bias of $\hat{\mu}_{\mathrm{EFF}}(a \mid b)$ as follows:

$$\frac{E[\{\pi(b \mid L) - \pi^{*}(b \mid L)\}\, m(a \mid L)]}{E[\pi^{*}(b \mid L)]}.$$

It can be seen that the contribution to the bias at a given $L$ is proportional to $\{\pi(b \mid L) - \pi^{*}(b \mid L)\}$. The bias can be relatively large if $m(a \mid L)$ is large.

By substituting estimators of $\Pr(Y = 1)$, $p_{A}(b)$, and $\mu(a \mid b)$ into (2.3), we obtain the ML ($\widehat{\mathrm{IF}}_{\mathrm{ML}}$), IPW ($\widehat{\mathrm{IF}}_{\mathrm{IPW}}$), and DR ($\widehat{\mathrm{IF}}_{\mathrm{DR}}$) estimators of the IF, respectively.
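The three estimators of $\mu(a \mid b)$ and the plug-in IF can be sketched in a few lines of Python once fitted probabilities are available. This is a minimal sketch under stated assumptions: the fitted values $\pi(k \mid L_i; \hat{\gamma}_k)$ and $m(k \mid L_i; \hat{\beta}_k)$ are assumed to have been computed elsewhere (for example by multinomial and binary logistic regression) and stored in arrays `pi_hat` and `m_hat`; the function names and array layout are hypothetical and not part of the article.

```python
import numpy as np


def mu_hat(A, Y, pi_hat, m_hat, a, b):
    """Return (ML, IPW, DR) estimates of mu(a|b) = Pr(Y_a = 1 | A = b).

    A      : (n,) observed exposure levels coded 0..K
    Y      : (n,) observed binary outcomes
    pi_hat : (n, K+1) fitted exposure probabilities, pi_hat[i, k] = pi(k | L_i)
    m_hat  : (n, K+1) fitted outcome probabilities, m_hat[i, k] = m(k | L_i)
    """
    A = np.asarray(A)
    Y = np.asarray(Y, dtype=float)
    is_a = (A == a).astype(float)
    is_b = (A == b).astype(float)
    n_b = is_b.sum()
    # Weight I(A_i = a) * pi(b | L_i) / pi(a | L_i) shared by IPW and DR.
    w = is_a * pi_hat[:, b] / pi_hat[:, a]
    ml = np.sum(is_b * m_hat[:, a]) / n_b
    ipw = np.sum(w * Y) / n_b
    dr = np.sum(w * Y - (w - is_b) * m_hat[:, a]) / n_b
    return ml, ipw, dr


def if_hat(A, Y, pi_hat, m_hat, g, method="dr"):
    """Plug-in estimate of the IF in (2.3); g[a, b] is the specified g(a|b)."""
    idx = {"ml": 0, "ipw": 1, "dr": 2}[method]
    A = np.asarray(A)
    numer = 0.0
    for b in range(g.shape[1]):
        p_b = np.mean(A == b)          # sample proportion n_b / n
        if p_b == 0:
            continue                   # no subjects at this actual level
        for a in range(g.shape[0]):
            if g[a, b] > 0:
                numer += g[a, b] * p_b * mu_hat(A, Y, pi_hat, m_hat, a, b)[idx]
    return 1.0 - numer / np.mean(Y)
```

One property that is easy to check from the formulas: when $a = b$ the weight reduces to $I(A_i = b)$, so both the IPW and DR estimates of $\mu(b \mid b)$ equal the sample mean of $Y$ among subjects with $A = b$, whatever the fitted models are. Standard errors are not covered by this sketch.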
$\widehat{\mathrm{IF}}_{\mathrm{DR}}$ is consistent for the true IF in (2.3) if either $\pi(a \mid L; \gamma_a)$ for all $a \in \mathcal{A}$ or $m(a \mid L; \beta_a)$ for all $a \in \mathcal{A}$ are correct (i.e. if the union model $\mathcal{M}_{I} \cup \mathcal{M}_{II}$ holds). Note that if the exposure is completely removed, that is, $g(a \mid b)$ puts unit mass on $a = 0$ for all $b$, $\widehat{\mathrm{IF}}_{\mathrm{DR}}$ is identical to the DR estimator of the AF by Sjölander and Vansteelandt (2011). This is because $\sum_{b} \hat{p}_{A}(b)\, \hat{\mu}_{\mathrm{DR}}(0 \mid b)$ (our estimator for the numerator of the second term of (2.3)) is the same as the DR estimator of $\Pr(Y_0 = 1)$ in Sjölander and Vansteelandt (2011), provided that we use the same models for the exposure and outcome in both methods. Note that the equality holds even if the exposure is polytomous (not dichotomous). This fact is related to the "distributive property" or "distributivity" of the AF. Distributivity is the property that the sum of the category-specific AFs equals the AF calculated from combining those categories into a single exposed category, regardless of the number and divisions of the categories, provided the reference category remains the same (Walter, 1976; Benichou, 2001).

The asymptotic distributions of $\widehat{\mathrm{IF}}_{\mathrm{ML}}$ and $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$ can be obtained by applying the delta method, as described in Greenland and Drescher (1993) and Sjölander (2011). In Section C of the Supplementary Material available at Biostatistics online, we derive the asymptotic distribution of $\widehat{\mathrm{IF}}_{\mathrm{DR}}$.

4. SIMULATION

To examine the finite sample performance of $\widehat{\mathrm{IF}}_{\mathrm{DR}}$, we carried out a simulation study based on 10 000 Monte Carlo data sets, each including $n = 2000$ i.i.d. observations. The data $O = \{A, Y, L = (L_1, L_2)\}$ were generated in a manner similar to that employed by Sjölander and Vansteelandt (2011) as follows: first, we generated baseline confounders $L$, where $L_1 \sim N(0, 1)$ and $L_2 \sim \mathrm{Bernoulli}(0.5)$.
Next, we assigned the exposure $A$ with $\mathcal{A} = \{0, 1, 2\}$ according to the following multinomial logistic model:

$$\log\{\pi(a \mid L_i)/\pi(0 \mid L_i)\} = \gamma_{0a} + \gamma_{1a} L_1 + \gamma_{2a} L_2 + \gamma_{3a} L_1 L_2 \quad \text{for } a = 1, 2, \tag{4.1}$$

where $(\gamma_{01}, \gamma_{11}, \gamma_{21}, \gamma_{31}, \gamma_{02}, \gamma_{12}, \gamma_{22}, \gamma_{32}) = (0.25, 0.5, -1, 1, -0.25, 1, -1.5, 1.5)$. This choice corresponds to $\{p_{A}(0), p_{A}(1), p_{A}(2)\} = (0.450, 0.322, 0.228)$. Subsequently, outcome $Y$ was generated from the logistic model:

$$\mathrm{logit}\{m(A_i \mid L_i)\} = \beta_0 + \beta_1 L_1 + \beta_2 L_2 + \beta_3 L_1 L_2 + \beta_4 I(A_i = 1) + \beta_5 I(A_i = 2), \tag{4.2}$$

where $(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5) = (-2, 1, 1, -1.5, 1, 1.5)$. The probability that $Y = 1$ was 0.251.

For each Monte Carlo data set, we fitted both the true model and the wrong model that assumed no interactions of covariates in (4.1) and (4.2), as in Sjölander and Vansteelandt (2011). The population probabilities that the inverse probability weights $\pi^{-1}(a \mid L)$ exceed 20 for the true model are (0.00, 0.02, 0.13) for $a = 0, 1, 2$, respectively, while those for the wrong model are also (0.00, 0.02, 0.13). We investigated 4 estimators: $\widehat{\mathrm{IF}}_{\mathrm{ML}}$, $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$, $\widehat{\mathrm{IF}}_{\mathrm{DR}}$, and $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$, where $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$ is the estimator based on $\phi_{\mathrm{EFF}}\{\mu(a \mid b), \hat{\gamma}_a, \hat{\gamma}_b, \hat{\beta}_a\}$. The modified distribution of the exposure is specified as in Table 1, and we estimate the IF for the 2 combinations of the probabilities $(r_1, r_2) = (0.2, 0.2)$ and $(0.8, 0.2)$. The corresponding values of the true IF were 0.135 and 0.297, respectively.

We evaluated the simulations in terms of bias, empirical standard error (SE), empirical mean squared error (MSE), and coverage probability of the corresponding 95% confidence intervals (95% CP). We constructed 2 types of Wald confidence intervals: the first type was the untransformed interval $\widehat{\mathrm{IF}} \pm 1.96\, \widehat{\mathrm{SE}}$, and the second type was the interval based on the $\log(1 - \mathrm{IF})$ transformation, $1 - (1 - \widehat{\mathrm{IF}}) \exp\{\pm 1.96\, \widehat{\mathrm{SE}}/(1 - \widehat{\mathrm{IF}})\}$, by Greenland and Drescher (1993). Since the performances of the 2 intervals were similar, we omitted the results from the first type.
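The data-generating process in (4.1) and (4.2) can be sketched as follows. This is only an illustrative reading of the design described above, not the authors' simulation code; the function name `simulate`, the use of NumPy's `default_rng`, and the inverse-CDF draw for the multinomial exposure are assumptions of this sketch.

```python
import numpy as np

# Rows: exposure category a = 0 (reference), 1, 2.
# Columns: intercept, L1, L2, L1*L2, as in (4.1).
GAMMA = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.25, 0.5, -1.0, 1.0],
    [-0.25, 1.0, -1.5, 1.5],
])
# (beta0, ..., beta5) in (4.2).
BETA = np.array([-2.0, 1.0, 1.0, -1.5, 1.0, 1.5])


def simulate(n, rng):
    """Draw one data set (L1, L2, A, Y) plus the true exposure probabilities."""
    L1 = rng.normal(size=n)
    L2 = rng.binomial(1, 0.5, size=n)
    X = np.column_stack([np.ones(n), L1, L2, L1 * L2])

    # Multinomial logistic exposure model (4.1).
    pi = np.exp(X @ GAMMA.T)
    pi /= pi.sum(axis=1, keepdims=True)
    # Inverse-CDF draw: count how many cumulative probabilities u exceeds.
    u = rng.random(n)
    A = (u[:, None] > np.cumsum(pi, axis=1)).sum(axis=1)
    A = np.minimum(A, 2)  # guard against round-off in the last cumulative sum

    # Logistic outcome model (4.2).
    lin = X @ BETA[:4] + BETA[4] * (A == 1) + BETA[5] * (A == 2)
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
    return L1, L2, A, Y, pi
```

A call such as `simulate(2000, np.random.default_rng(0))` would produce one data set of the size used in the study; repeating it with different seeds and fitting the working models to each draw is left out of the sketch.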
Table 2 summarizes the results of the simulations. As expected, $\widehat{\mathrm{IF}}_{\mathrm{ML}}$ and $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$ were asymptotically unbiased when the outcome regression model and the exposure regression model, respectively, were correctly specified. $\widehat{\mathrm{IF}}_{\mathrm{DR}}$ was asymptotically unbiased when either model was correct. For the cases with correct specification, the coverage probabilities of the confidence intervals were close to the nominal level. The empirical SE of $\widehat{\mathrm{IF}}_{\mathrm{ML}}$ was consistently the smallest among all the methods. $\widehat{\mathrm{IF}}_{\mathrm{DR}}$ was less variable than $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$. The performance of $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$ was almost the same as that of $\widehat{\mathrm{IF}}_{\mathrm{DR}}$, although it was slightly biased if the exposure regression model was wrong, even if the outcome regression model was correct. Note that the bias of $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$ could be accidentally small because biases of estimators of $\mu(a \mid b)$ for different $(a, b)$ are not always in the same direction. In our simulations, some cancelation was indeed happening. For instance, the bias of $\hat{\mu}_{\mathrm{EFF}}(1 \mid 1)$ was 1.0, whereas that of $\hat{\mu}_{\mathrm{EFF}}(2 \mid 2)$ was −2.5. From these results, we would generally recommend using $\widehat{\mathrm{IF}}_{\mathrm{DR}}$ rather than $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$.

Table 2. Simulation results from 10 000 simulation runs. Bias is 100 × Monte Carlo bias, SE is 100 × Monte Carlo standard deviation, MSE is 100 × Monte Carlo mean squared error, and 95% CP is Monte Carlo coverage of 95% Wald confidence intervals. The first four result columns are for $(r_1, r_2) = (0.2, 0.2)$ and the last four for $(r_1, r_2) = (0.8, 0.2)$

| Method | Bias | SE | MSE | 95% CP | Bias | SE | MSE | 95% CP |
|---|---|---|---|---|---|---|---|---|
| Scenario 1: OR correct, ER correct | | | | | | | | |
| ML | 0.0 | 1.3 | 1.8 | 95.2 | 0.0 | 3.0 | 8.8 | 95.4 |
| IPW | 0.1 | 1.7 | 2.8 | 94.6 | 0.1 | 3.4 | 11.8 | 95.0 |
| EFF | 0.0 | 1.5 | 2.4 | 95.3 | 0.0 | 3.2 | 10.2 | 96.3 |
| DR | 0.0 | 1.5 | 2.4 | 95.2 | 0.0 | 3.2 | 10.2 | 95.2 |
| Scenario 2: OR correct, ER incorrect | | | | | | | | |
| ML | 0.0 | 1.3 | 1.8 | 95.2 | 0.0 | 3.0 | 8.8 | 95.4 |
| IPW | −4.4 | 4.8 | 42.0 | 74.8 | −8.4 | 7.0 | 119.9 | 65.4 |
| EFF | −1.1 | 2.3 | 6.4 | 93.3 | −0.5 | 4.2 | 17.9 | 95.7 |
| DR | 0.0 | 2.3 | 5.4 | 94.4 | 0.0 | 4.3 | 18.1 | 94.9 |
| Scenario 3: OR incorrect, ER correct | | | | | | | | |
| ML | −3.1 | 1.5 | 11.8 | 44.1 | −6.4 | 3.3 | 51.7 | 47.8 |
| IPW | 0.1 | 1.7 | 2.8 | 94.6 | 0.1 | 3.4 | 11.8 | 95.0 |
| EFF | 0.0 | 1.7 | 2.8 | 95.3 | 0.0 | 3.4 | 11.4 | 96.3 |
| DR | 0.0 | 1.7 | 2.8 | 94.9 | 0.0 | 3.4 | 11.4 | 95.3 |
| Scenario 4: OR incorrect, ER incorrect | | | | | | | | |
| ML | −3.1 | 1.5 | 11.8 | 44.1 | −6.4 | 3.3 | 51.7 | 47.8 |
| IPW | −4.4 | 4.8 | 42.0 | 74.8 | −8.4 | 7.0 | 119.9 | 65.4 |
| EFF | −3.3 | 3.9 | 26.3 | 81.4 | −6.7 | 6.0 | 80.9 | 76.3 |
| DR | −3.3 | 3.9 | 26.3 | 80.9 | −6.7 | 6.0 | 80.9 | 73.8 |

OR, outcome regression model; ER, exposure regression model.

5. DATA ANALYSIS

The Japan Arteriosclerosis Longitudinal Study-Existing Cohorts Combine (JALS-ECC) is a pooling project based on individual data from existing prospective cohort studies in Japan. The rationale, study design, and methods of the JALS-ECC have been described elsewhere (JALS Group, 2008). Briefly, the baseline data were surveyed from 1985 to 2005. Of the 66 691 participants in 21 cohort studies, our analysis included all men and women (in 10 cohorts) who had participated at baseline, were aged 40–89 years, had no history of stroke, and had no missing information on confounders and stroke events (n = 19 173). A total of 374 stroke events were recorded during the first 10 years of follow-up.

In this data analysis, we were interested in the stroke incidence attributable to metabolic risk factors. Metabolic syndrome is a cluster of risk factors comprising insulin resistance, abdominal fat distribution, dyslipidemia, and hypertension. Several institutions have established their own criteria for the diagnosis of metabolic syndrome. For instance, the National Cholesterol Education Program considers that each metabolic factor has the same importance (Expert Panel on Detection Evaluation and Treatment of High Blood Cholesterol in Adults, 2001), whereas the International Diabetes Federation and the Japanese guidelines require central obesity to be present for diagnosis of metabolic syndrome (Alberti and others, 2006; The Examination Committee of Criteria for 'Obesity Disease' in Japan, Japan Society for the Study of Obesity, 2002).
Based on these different guidelines, there has been controversy as to whether obesity should be considered a prerequisite for the diagnosis of metabolic syndrome. Thus, to investigate the possible impact of the management of metabolic risk factors with or without obesity, we considered the following 2 types of IF: one for intervention among the obese and the other for intervention among the nonobese.

We defined metabolic risk factors as follows: obesity (BMI ≥ 25 kg/m²), high blood pressure (systolic blood pressure ≥ 130 mmHg or diastolic blood pressure ≥ 85 mmHg or administration of antihypertensive agents), high blood glucose (fasting serum glucose ≥ 110 mg/dL or medication for diabetes), high triglycerides (fasting serum triglyceride ≥ 150 mg/dL), and low high density lipoprotein (HDL) cholesterol (serum HDL cholesterol < 40 mg/dL). As in previous research (Kadota and others, 2007), participants were subdivided into five categories according to their levels of obesity and their other metabolic risk factors as follows. Exposure level 0: BMI < 25 with 0 risk, 1: BMI < 25 with 1 risk, 2: BMI < 25 with 2 or 3 risks, 3: BMI ≥ 25 with 0 or 1 risk, and 4: BMI ≥ 25 with 2 or 3 risks.

The modified distribution of metabolic risk factors is listed in Table 3(A) and (B) for the obese and nonobese groups, respectively. Under the hypothetical obese intervention, we assume that $100r_1\%$ of subjects within category 4 (obese with 2 or 3 risks) change from their current level to category 3 (obese with 0 or 1 risk), and an additional $100r_2\%$ of subjects change to category 0 (nonobese with 0 risk), whereas $100(r_1 + r_2)\%$ of subjects within category 3 change from their current level to category 0.
Similarly, under the hypothetical nonobese intervention, we assume that $100r_1\%$ of subjects within category 2 (nonobese with 2 or 3 risks) change from their current level to category 1 (nonobese with 1 risk), and an additional $100r_2\%$ of subjects change to category 0, whereas $100(r_1 + r_2)\%$ of subjects within category 1 change from their current level to category 0. We could consider $r_1$ and $r_2$ as the degree of effectiveness of an intervention to reduce the number of metabolic risk factors. For each fixed $r_1$ ($r_2$), a larger $r_2$ ($r_1$) implies a higher degree of effectiveness.

Table 3. Distribution of the modified target exposure conditional on the actual exposure level for JALS data. Each column shows the probability function of the modified target exposure level conditional on the actual exposure level, $g(a \mid b)$, for (A) hypothetical obese intervention and (B) hypothetical nonobese intervention, respectively

(A) IF for obese intervention

| Target exposure level $a$ | Actual exposure level $b = 0$ | $b = 1$ | $b = 2$ | $b = 3$ | $b = 4$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | $r_1 + r_2$ | $r_2$ |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | $1 - (r_1 + r_2)$ | $r_1$ |
| 4 | 0 | 0 | 0 | 0 | $1 - (r_1 + r_2)$ |

(B) IF for nonobese intervention

| Target exposure level $a$ | Actual exposure level $b = 0$ | $b = 1$ | $b = 2$ | $b = 3$ | $b = 4$ |
|---|---|---|---|---|---|
| 0 | 1 | $r_1 + r_2$ | $r_2$ | 0 | 0 |
| 1 | 0 | $1 - (r_1 + r_2)$ | $r_1$ | 0 | 0 |
| 2 | 0 | 0 | $1 - (r_1 + r_2)$ | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 |

The proportion of subjects in each category was 30.8% for category 0, 31.6% for category 1, 13.8% for category 2, 14.8% for category 3, and 9.0% for category 4. We modeled $m(a \mid L)$ with logistic regression and modeled $\pi(a \mid L)$ with multinomial logistic regression. In each model, we included the linear terms of sex, age, total cholesterol, smoking habits (yes/no), and alcohol habits (yes/no) as confounders. We also included the indicator variables of metabolic risk categories in the former model.
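To make the two interventions in Table 3 concrete, the sketch below encodes each $g(a \mid b)$ as a 5 × 5 matrix and applies it to the reported category proportions to obtain the exposure distribution after the hypothetical intervention. The helper names are invented for this illustration, and the choice $(r_1, r_2) = (0.2, 0.2)$ in the usage note is just one of the scenarios considered, not a recommended value.

```python
import numpy as np

# Reported proportions of subjects in categories 0..4.
P_OBSERVED = np.array([0.308, 0.316, 0.138, 0.148, 0.090])


def g_obese(r1, r2):
    """Table 3(A): only the obese categories 3 and 4 are modified."""
    g = np.eye(5)                                   # columns 0, 1, 2 unchanged
    g[:, 3] = [r1 + r2, 0, 0, 1 - (r1 + r2), 0]
    g[:, 4] = [r2, 0, 0, r1, 1 - (r1 + r2)]
    return g


def g_nonobese(r1, r2):
    """Table 3(B): only the nonobese categories 1 and 2 are modified."""
    g = np.eye(5)                                   # columns 0, 3, 4 unchanged
    g[:, 1] = [r1 + r2, 1 - (r1 + r2), 0, 0, 0]
    g[:, 2] = [r2, r1, 1 - (r1 + r2), 0, 0]
    return g


def modified_distribution(g, p):
    """Marginal distribution of the target exposure: p*(a) = sum_b g(a|b) p(b)."""
    return g @ np.asarray(p, dtype=float)
```

With $(r_1, r_2) = (0.2, 0.2)$, `modified_distribution(g_obese(0.2, 0.2), P_OBSERVED)` gives (0.3852, 0.316, 0.138, 0.1068, 0.054): the share in category 0 rises from 30.8% to about 38.5%. This shows only how the exposure distribution shifts; the IF itself additionally requires the estimates of $\mu(a \mid b)$.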
Although we tried to fit more complex models, which also contained confounder–confounder or confounder–exposure interactions, the resulting estimates of the IF in these models were quite similar to the results presented here.

Taking category 0 as a reference, the estimated odds ratios of stroke for BMI < 25 were 2.30 (95% CI: 1.63, 3.24) for 1 risk (category 1) and 3.41 (2.36, 4.94) for 2 or 3 risks (category 2). The corresponding estimates for BMI ≥ 25 were 2.25 (1.50, 3.37) for 0 or 1 risk (category 3) and 2.91 (1.90, 4.47) for 2 or 3 risks (category 4). The empirical probabilities that the inverse probability weights $\pi^{-1}(a \mid L)$ exceed 20 were (0.00, 0.00, 0.05, 0.00, 0.17) for $a = 0, 1, 2, 3, 4$, respectively. This condition was nearly the same as in our simulations.

Tables 4 and 5 show the estimates of the IF for different probabilities $r_1$ and $r_2$, with estimated SEs, by the 4 methods ($\widehat{\mathrm{IF}}_{\mathrm{ML}}$, $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$, $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$, $\widehat{\mathrm{IF}}_{\mathrm{DR}}$). There were no large differences among the results of the different estimation methods, and every method indicated that the IF for nonobesity was more than 2 times higher than that for obesity, for each combination of $r_1$ and $r_2$. These findings indicate that population interventions on metabolic risk factors only for obese people are of limited value for the Japanese general population, regardless of the effectiveness of interventions (at least under the assumption that the effectiveness of an intervention is the same between obese and nonobese people). Presumably, the fact that the estimates were close implies that there was not much model misspecification in the outcome or exposure models under the conditional exchangeability assumption (2.2). The SEs of $\widehat{\mathrm{IF}}_{\mathrm{ML}}$ were consistently the smallest and those of $\widehat{\mathrm{IF}}_{\mathrm{EFF}}$ and $\widehat{\mathrm{IF}}_{\mathrm{DR}}$ were slightly smaller than those of $\widehat{\mathrm{IF}}_{\mathrm{IPW}}$. One could conclude that $\widehat{\mathrm{IF}}_{\mathrm{ML}}$ is most appropriate here because of the lower estimated SEs. However, we note that the use of DR estimators may provide major improvements in robustness against bias while incurring little efficiency loss.

Table 4. Results of IF estimation for obesity by JALS data. The table displays the estimated IF and SE values for each combination of $r_1$ and $r_2$. Estimate is the value of 100 × IF estimate and SE is the value of 100 × estimated SE of the IF estimate

| $r_1$ | $r_2$ | ML Estimate | ML SE | IPW Estimate | IPW SE | EFF Estimate | EFF SE | DR Estimate | DR SE |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −0.4 | 0.0 | 0.0 | 0.0 |
| 0.0 | 0.2 | 2.9 | 0.6 | 2.8 | 0.7 | 2.7 | 0.7 | 3.1 | 0.7 |
| 0.2 | 0.0 | 2.0 | 0.4 | 2.1 | 0.4 | 1.8 | 0.4 | 2.2 | 0.4 |
| 0.0 | 0.4 | 5.9 | 1.2 | 5.6 | 1.3 | 5.8 | 1.3 | 6.1 | 1.3 |
| 0.2 | 0.2 | 5.0 | 1.0 | 4.9 | 1.1 | 4.9 | 1.1 | 5.2 | 1.1 |
| 0.4 | 0.0 | 4.0 | 0.8 | 4.2 | 0.9 | 4.0 | 0.9 | 4.4 | 0.9 |
| 0.2 | 0.4 | 7.9 | 1.6 | 7.7 | 1.8 | 8.0 | 1.7 | 8.3 | 1.7 |
| 0.4 | 0.2 | 7.0 | 1.4 | 7.0 | 1.5 | 7.1 | 1.5 | 7.4 | 1.5 |
| 0.0 | 0.8 | 11.8 | 2.4 | 11.2 | 2.7 | 12.0 | 2.7 | 12.2 | 2.7 |
| 0.4 | 0.4 | 9.9 | 2.0 | 9.8 | 2.2 | 10.2 | 2.2 | 10.5 | 2.2 |
| 0.0 | 1.0 | 14.7 | 2.9 | 14.0 | 3.4 | 15.1 | 3.3 | 15.3 | 3.3 |
| 0.8 | 0.0 | 8.0 | 1.6 | 8.4 | 1.8 | 8.4 | 1.8 | 8.7 | 1.8 |
| 0.2 | 0.8 | 13.8 | 2.7 | 13.3 | 3.1 | 14.2 | 3.1 | 14.4 | 3.1 |
| 0.8 | 0.2 | 11.0 | 2.2 | 11.2 | 2.4 | 11.5 | 2.4 | 11.8 | 2.4 |
| 1.0 | 0.0 | 10.0 | 2.1 | 10.5 | 2.2 | 10.6 | 2.2 | 10.9 | 2.2 |

Table 5. Results of IF estimation for nonobesity by JALS data. The table displays the estimated IF and SE values for each combination of $r_1$ and $r_2$. Estimate is the value of 100 × IF estimate and SE is the value of 100 × estimated SE of the IF estimate

| $r_1$ | $r_2$ | ML Estimate | ML SE | IPW Estimate | IPW SE | EFF Estimate | EFF SE | DR Estimate | DR SE |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −0.4 | 0.0 | 0.0 | 0.0 |
| 0.0 | 0.2 | 7.7 | 1.0 | 6.7 | 1.6 | 7.2 | 1.5 | 7.5 | 1.5 |
| 0.2 | 0.0 | 5.8 | 0.8 | 5.0 | 1.1 | 5.2 | 1.0 | 5.6 | 1.0 |
| 0.0 | 0.4 | 15.4 | 2.1 | 13.4 | 3.1 | 14.7 | 3.0 | 15.1 | 3.0 |
| 0.2 | 0.2 | 13.5 | 1.8 | 11.7 | 2.6 | 12.8 | 2.5 | 13.1 | 2.5 |
| 0.4 | 0.0 | 11.6 | 1.5 | 10.0 | 2.1 | 10.8 | 2.0 | 11.2 | 2.0 |
| 0.2 | 0.4 | 21.2 | 2.8 | 18.4 | 4.2 | 20.4 | 4.0 | 20.7 | 4.0 |
| 0.4 | 0.2 | 19.3 | 2.5 | 16.7 | 3.7 | 18.4 | 3.5 | 18.8 | 3.5 |
| 0.0 | 0.8 | 30.8 | 4.2 | 26.8 | 6.2 | 29.9 | 6.0 | 30.1 | 6.0 |
| 0.4 | 0.4 | 27.0 | 3.6 | 23.4 | 5.2 | 26.0 | 5.0 | 26.3 | 5.0 |
| 0.0 | 1.0 | 38.5 | 5.2 | 33.5 | 7.8 | 37.5 | 7.5 | 37.7 | 7.5 |
| 0.8 | 0.0 | 23.2 | 3.0 | 20.1 | 4.2 | 22.0 | 4.0 | 22.4 | 4.0 |
| 0.2 | 0.8 | 36.6 | 4.9 | 31.8 | 7.3 | 35.5 | 7.0 | 35.8 | 7.0 |
| 0.8 | 0.2 | 30.9 | 4.1 | 26.8 | 5.8 | 29.6 | 5.5 | 30.0 | 5.5 |
| 1.0 | 0.0 | 29.0 | 3.8 | 25.1 | 5.3 | 27.6 | 5.0 | 28.0 | 5.0 |

For these data, the odds ratios for nonobesity (categories 1 and 2) were not so different from those for obesity (categories 3 and 4). However, the prevalence of category 1, which comprises subjects who have one risk factor without obesity, was much higher than those of categories 2–4. As a result, the estimates of the IF for nonobesity were larger in all intervention scenarios. Thus, we conclude that metabolic risk factors should be carefully managed even among individuals without obesity.

6. DISCUSSION

In this study, we proposed the DR estimator of the IF, which includes the AF as a special case. In almost all previous studies, model-based estimation of the IF or AF has been based only on the outcome regression model, and correct specification of that model is one of the most fundamental assumptions for estimating it consistently. In contrast, the DR procedure applies both the exposure regression model and the outcome regression model and produces a consistent estimator of the IF as long as either of the 2 models has been correctly specified. Because of its robustness to model misspecification, inference based on the DR estimator should improve on previous approaches for obtaining an unbiased (consistent) estimator of the effect or impact of exposure, which is the overarching goal of epidemiological research.

In practice, however, it is unlikely for any model to be exactly correct. Recently, several authors have demonstrated how to construct DR estimators that exhibit superior robustness to slight mismodeling, relative to other DR estimators (Tan, 2006; Cao and others, 2009). Future research will focus on whether one can adapt these methods to the DR estimator presented in this article.
Throughout, we assume conditional exchangeability (Assumption (2.2)), that is, that all confounders L have been measured and included in at least one of the models. A violation of this assumption is a different problem from the type of model misspecification considered here, for example, in the simulations, where the misspecification was generated by excluding the interaction term. Unfortunately, Assumption (2.2) is nonidentifiable and is not testable from the observed data. The JALS data used in our analysis include confounders that have been known in cardiovascular research for many years. Even so, there may be utility in conducting sensitivity analyses that examine the effect of violations of Assumption (2.2). Robins (1999), Robins and others (1999), and Brumback and others (2004) discuss strategies for conducting sensitivity analyses under the semiparametric missing data framework.

For the analysis of real data, some recommendations are needed regarding how to specify the modified distribution of exposure, g(a|b). It should depend on only a few parameters; otherwise, g(a|b) is too rich to specify. For this purpose, there are 2 distinct approaches. First, one may consider the modified exposure as a random variable and determine g(a|b) in a parsimonious fashion. We took this approach using (r1, r2) as parameters. This approach is used, for example, by Wahrendorf (1987), who examined the impact of various changes in dietary habits on colorectal and stomach cancers. Second, one may consider the modified exposure as a constant (deterministic) conditional on actual exposure levels. Graham (2000) proposed the use of a function τ(b), called a target value function, which indicates the deterministic target exposure level conditional on the actual exposure level.
Possible choices of τ(b) include (Hoffmann and others, 2003): (i) max(0, b − 1) (decrease by one level), (ii) b − I(b ≥ c) (c > 0) (decrease by one level among high exposure subjects whose exposure levels are equal to or higher than c), and (iii) min(b, t) (decrease to a target level t). The target value function (ii) is helpful if high exposure subjects could be specifically targeted by interventions rather than all exposed subjects. Of course, the former approach is more flexible and realistic, especially when compliance with the intervention would be poor (e.g. a smoking cessation program). However, there is often insufficient information to determine the parameter values uniquely. In that case, one needs to consider various possible scenarios, as we did in our example, and discuss the sensitivity of the results. On the other hand, if compliance with the intervention can be expected to be sufficiently good, the latter approach is preferable because it is easier to specify.

We have proposed a plug-in estimator of the IF, which uses a DR estimator of μ(a|b). Recently, Diaz Munoz and Van Der Laan (2011) proposed to estimate Pr∗(Y = 1) directly, and their estimators have been proved to be efficient and DR. We found that their augmented inverse probability of treatment weighted (IPTW) estimator is the same as our DR estimator for Pr∗(Y = 1) if we use the simple target value function τ(b) = b − 1 under the condition p(0) = 0. More extensive research is required to determine whether important differences in efficiency and robustness exist in more general cases.

In this article, we only considered the cohort/cross-sectional study settings.
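The three target value functions of Hoffmann and others (2003) described earlier in this section can be written down directly. The sketch below is illustrative only: the function names and the choice of five exposure levels coded 0 to 4 are hypothetical, and the first function is written with the unexposed level 0 as a floor so that the target level is never negative. The helper degenerate_g shows how a deterministic τ(b) corresponds to a modified exposure distribution g(a|b) that puts all its mass on a = τ(b).

```python
def decrease_one_level(b):
    """(i) One level lower, with the unexposed level 0 as a floor."""
    return max(0, b - 1)


def decrease_high_exposure(b, c):
    """(ii) b - I(b >= c) with c > 0: only subjects at level c or above move down."""
    return b - int(b >= c)


def decrease_to_target(b, t):
    """(iii) Levels above the target t are lowered to t; others are unchanged."""
    return min(b, t)


def degenerate_g(tau, n_levels):
    """Return g as a nested list with g[b][a] = 1 if a == tau(b), else 0."""
    return [[1.0 if a == tau(b) else 0.0 for a in range(n_levels)]
            for b in range(n_levels)]


if __name__ == "__main__":
    levels = range(5)  # hypothetical exposure levels 0, 1, 2, 3, 4
    print("(i)  ", [decrease_one_level(b) for b in levels])
    print("(ii) ", [decrease_high_exposure(b, c=3) for b in levels])
    print("(iii)", [decrease_to_target(b, t=1) for b in levels])
    for row in degenerate_g(decrease_one_level, 5):
        print(row)
```

Each row of the matrix returned by degenerate_g sums to one, so it is a valid conditional distribution g(·|b); the stochastic approach replaces these 0/1 rows with probabilities governed by a few parameters.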
Sjölander and Vansteelandt (2011) also considered the case–control estimator of the AF, which is approximately consistent if either the standard logistic regression of the outcome given the exposure and confounders Pr(Y = 1|A = 0, L) is correct or the model for the exposure probability given the outcome and confounders Pr(A|Y = 0, L) is correct, but not necessarily both, under the rare disease assumption (Tchetgen and others, 2010; Tchetgen and Rotnitzky, 2011). Extension of their approach to the IF estimation is straightforward because one can express the IF in (2.1) with $p^{*}_{A,L}(a, l) = \int_{b} g(a|b)\, p_{A,L}(b, l)\, d\nu_{A}(b)$ as follows (Drescher and Becher, 1997):

$$\mathrm{IF} = 1 - \int_{a,l} \frac{R^{*}(a, l)}{R(a, l)}\, p_{A,L|Y=1}(a, l \mid Y = 1)\, d\nu_{A,L}(a, l), \tag{6.1}$$

where $R(a, l) = \Pr(Y = 1 \mid A = a, L = l)/\Pr(Y = 1 \mid A = 0, L = l)$ is the relative risk of exposure level a given confounders and $R^{*}(a, l) = \int_{b} g(b|a)\, R(b, l)\, d\nu_{A}(b)$ is the expected relative risk of exposure level a after application of the modification rule to the actual exposure level b. Subsequently, the DR estimator of the IF is approximately obtained by substituting the DR estimators of the conditional odds ratio function, which was proposed by Tchetgen and others (2010), for R(a, l) and R(b, l) in (6.1). Note that the double robustness of the case–control estimator is preserved even when the conditional exchangeability assumption in (2.2) does not hold. However, in this case, the estimator does not have the causal interpretation presented in Section 2. See Tchetgen and Rotnitzky (2011) for related discussions.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

We express our appreciation to the JALS group for allowing us to use their valuable data. We thank Hiroshi Ohtsu for his helpful comments and encouragement.
We are also grateful to the co-editor, associate editor, and 2 referees for giving us many valuable comments. Conflict of Interest: None declared.

REFERENCES

ALBERTI, K. G. M. M., ZIMMET, P. AND SHAW, J. (2006). Metabolic syndrome - a new world-wide definition: a consensus statement from the International Diabetes Federation. Diabetic Medicine 23, 469–480.

BANG, H. AND ROBINS, J. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–972.

BENICHOU, J. (2001). A review of adjusted estimators of attributable risk. Statistical Methods in Medical Research 10, 195–216.

BRUMBACK, B. A., HERNÁN, M. A., HANEUSE, S. J. AND ROBINS, J. M. (2004). Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures. Statistics in Medicine 23, 749–767.

CAO, W., TSIATIS, A. A. AND DAVIDIAN, M. (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.

DIAZ MUNOZ, I. AND VAN DER LAAN, M. J. (2011). Population intervention causal effects based on stochastic interventions. Biometrics (in press).

DRESCHER, K. AND BECHER, H. (1997). Estimating the generalized impact fraction from case–control data. Biometrics 53, 1170–1176.

EXPERT PANEL ON DETECTION, EVALUATION, AND TREATMENT OF HIGH BLOOD CHOLESTEROL IN ADULTS. (2001). Executive summary of the third report of the National Cholesterol Education Program (NCEP) expert panel on detection, evaluation, and treatment of high blood cholesterol in adults (Adult Treatment Panel III). Journal of the American Medical Association 285, 2486–2497.

GRAHAM, P. (2000). Bayesian inference for a generalized population attributable fraction: the impact of early vitamin A levels on chronic lung disease in very low birth weight infants. Statistics in Medicine 19, 937–956.

GREENLAND, S. AND DRESCHER, K. (1993). Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics 49, 865–872.

HOFFMANN, K., BOEING, H., VOLATIER, J. AND BECKER, W. (2003). Evaluating the potential health gain of the World Health Organization’s recommendation concerning vegetable and fruit consumption. Public Health Nutrition 6, 765–772.

JAPAN ARTERIOSCLEROSIS LONGITUDINAL STUDY GROUP (JALS GROUP). (2008). Japan Arteriosclerosis Longitudinal Study-Existing Cohorts Combine (JALS-ECC): rationale, design, and population characteristics. Circulation Journal 72, 1563–1568.

KADOTA, A., HOZAWA, A., OKAMURA, T., KADOWAKI, T., NAKAMURA, K., MURAKAMI, Y., HAYAKAWA, T., KITA, Y., OKAYAMA, A., NAKAMURA, Y. and others (2007). Relationship between metabolic risk factor clustering and cardiovascular mortality stratified by high blood glucose and obesity: NIPPON DATA90, 1990–2000. Diabetes Care 30, 1533–1538.

LEVIN, M. L. (1953). The occurrence of lung cancer in man. Acta Unio Internationalis Contra Cancrum 9, 531–541.

MORGENSTERN, H. AND BURSIC, E. S. (1982). A method for using epidemiologic data to estimate the potential impact of an intervention on the health status of a target population. Journal of Community Health 7, 292–309.

ROBINS, J. M. (1999). Association, causation, and marginal structural models. Synthese 121, 151–179.

ROBINS, J. M., ROTNITZKY, A. AND ZHAO, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.

ROBINS, J. M., ROTNITZKY, A. AND SCHARFSTEIN, D. (1999). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran, M. E. and Berry, D. (editors), Statistical Models in Epidemiology: The Environment and Clinical Trials. New York: Springer, pp. 1–92.

RUBIN, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688–701.

SATO, T. AND MATSUYAMA, Y. (2003). Marginal structural models as a tool for standardization. Epidemiology 14, 680–686.

SJÖLANDER, A. (2011). Estimation of attributable fractions using inverse probability weighting. Statistical Methods in Medical Research 20, 415–428.

SJÖLANDER, A. AND VANSTEELANDT, S. (2011). Doubly robust estimation of attributable fractions. Biostatistics 12, 112–121.

STONE, R. (1993). The assumptions on which causal inferences rest. Journal of the Royal Statistical Society, Series B 55, 455–466.

TAN, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101, 1619–1637.

TCHETGEN, T. E., ROBINS, J. M. AND ROTNITZKY, A. (2010). On doubly robust estimation in a semi-parametric odds ratio model. Biometrika 97, 171–180.

TCHETGEN, T. E. AND ROTNITZKY, A. (2011). Double-robust estimation of an exposure-outcome odds ratio adjusting for confounding in cohort and case-control studies. Statistics in Medicine 30, 335–347.

THE EXAMINATION COMMITTEE OF CRITERIA FOR ‘OBESITY DISEASE’ IN JAPAN, JAPAN SOCIETY FOR THE STUDY OF OBESITY. (2002). New criteria for ‘obesity disease’ in Japan. Circulation Journal 66, 987–992.

TSIATIS, A. A. (2006). Semiparametric Theory and Missing Data. New York: Springer.

VAN DER LAAN, M. J. AND ROBINS, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. New York: Springer.

VANDERWEELE, T. J. (2009). Concerning the consistency assumption in causal inference. Epidemiology 20, 880–883.

VANSTEELANDT, S., BEKAERT, M. AND CLAESKENS, G. (2010). On model selection and model misspecification in causal inference. Statistical Methods in Medical Research (in press).

WAHRENDORF, J. (1987). An estimate of the proportion of colorectal and stomach cancers which might be prevented by certain changes in dietary habits. International Journal of Cancer 40, 625–628.

WALTER, S. D. (1976). The estimation and interpretation of attributable risk in health research. Biometrics 32, 829–849.

[Received May 8, 2011; revised September 16, 2011; accepted for publication September 20, 2011]