TWO-STAGE CASE-CONTROL
STUDIES USING EXPOSURE
ESTIMATES FROM A GEOGRAPHICAL
INFORMATION SYSTEM
Jonas Björk (1) & Ulf Strömberg (2)
(1) Competence Center for Clinical Research
(2) Occupational and Environmental Medicine
Lund University Hospital
OUTLINE OF TALK
• Previous project: What have we done?
(Jonas Björk)
• Ongoing project: What shall we do?
(Ulf Strömberg)
Two-stage procedure for case-control studies
1st stage
Complete data obtained from registries
Disease status
General characteristics
Group affiliation
(e.g. occupation or residential area)
→ Group-level exposure XG
2nd stage
Individual exposure data for a subset of
the 1st stage sample
Exposure database → group-level exposure
• JEM = Job Exposure Matrix
Occupational group → proportion exposed
• GIS = Geographical Information System
Residential group (area) → average concentration of an air pollutant
JEM - proportion exposed
[Figure: proportion exposed (axis from 0 to 0.5) for Groups 0 to 4. Most data are typically in groups with low XG.]
Linear Relation between Proportion Exposed
and Relative Risk
• No confounding between/within groups
Example: RR (exposed vs. unexposed) = 2.0

Proportion exposed XG    Average RR
0%                       1.0
10%                      0.10 * 2.0 + 0.90 * 1.0 = 1.1
50%                      1.5
100%                     2.0
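
The averaging in the example can be checked numerically. This is a minimal sketch, assuming no confounding between or within groups as stated above; the function name average_rr and its parameters are illustrative and do not come from the slides.

```python
def average_rr(x_g, rr_exposed=2.0):
    """Average relative risk in a group where a fraction x_g is exposed.

    Unexposed subjects have RR = 1 by definition, so the group average is a
    weighted mean of rr_exposed and 1.
    """
    if not 0.0 <= x_g <= 1.0:
        raise ValueError("x_g must be a proportion between 0 and 1")
    return x_g * rr_exposed + (1.0 - x_g) * 1.0


# Reproduce the four rows of the example (RR = 2.0 for exposed vs. unexposed).
for x_g in (0.0, 0.10, 0.50, 1.0):
    print(f"XG = {x_g:.0%}: average RR = {average_rr(x_g):.2f}")
```

The weighted mean equals 1 + (RR - 1) * XG, so the average RR is linear in XG with slope RR - 1.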
Linear OR model:
OR(XG) = 1 + β XG
XG = Exposure proportion
OR for exposed vs. unexposed = OR(1) = 1 + β
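[Figure: OR plotted against XG on the interval 0 to 1, a straight line from OR = 1 at XG = 0 to OR(1) at XG = 1. Most data are typically in groups with low XG.]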
Confounding between groups
• General confounders (eg, gender and age) can
normally be adjusted for
• Assuming no confounding within groups and
no effect modification by any of the stratification variables s1, ..., sk:
OR(XG; s1, ..., sk) = (1 + β XG) exp(γ1 s1 + ... + γk sk)
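
As an illustration of how the two factors of this model combine, here is a minimal sketch. The function name odds_ratio and the argument names are hypothetical; the OR is taken relative to a reference subject with XG = 0 and all stratum variables equal to 0, which is an assumption about the coding.

```python
import math


def odds_ratio(x_g, beta, gammas=(), strata=()):
    """OR under a linear exposure effect and log-linear confounder effects."""
    if len(gammas) != len(strata):
        raise ValueError("gammas and strata must have the same length")
    linear_part = 1.0 + beta * x_g  # exposure effect, linear in XG
    confounder_part = math.exp(sum(g * s for g, s in zip(gammas, strata)))
    return linear_part * confounder_part


# Example: beta = 2 (OR of 3 for exposed vs. unexposed), XG = 0.5,
# and one binary confounder that doubles the odds when present.
print(odds_ratio(0.5, 2.0, gammas=[math.log(2.0)], strata=[1]))
```

Because the confounder term multiplies the exposure term, the exposure effect 1 + β XG is the same in every stratum, which is the "no effect modification" assumption.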
Combining 1st and 2nd stage data
• Assumption: 2nd stage data are missing at random (MAR),
conditional on disease status and 1st stage group
affiliation
• For subjects with missing 2nd stage data:
Use 1st stage data to calculate expected number of
exposed/unexposed
• Expectation-maximization (EM) algorithm
EM-algorithm
(Wacholder & Weinberg 1994)
1. Select a starting value, e.g. OR=1
2. E-step
Among the non-participants, calculate expected number of
exposed/unexposed cases and controls in each group
3. M-step
Maximize the likelihood for observed+expected cell frequencies
using the chosen risk model for individual-level data
(not necessarily linear)
→ New OR estimate
4. Repeat 2. and 3. until convergence
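
The four steps can be sketched in a few lines for the simplest setting: a dichotomous exposure, no covariates, and no confounding between groups. Under those assumptions the M-step reduces to the cross-product ratio of the completed 2x2 table. The data layout, the function names em_update and em_odds_ratio, and the counts are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
import math


def em_update(groups, or_hat):
    """One EM iteration: complete the data (E-step), re-estimate OR (M-step)."""
    a1 = b1 = a0 = b0 = 0.0  # exposed/unexposed cases, exposed/unexposed controls
    for g in groups:
        x_g = g["x_g"]
        # E-step: expected number of exposed among subjects lacking 2nd stage data.
        p_case = x_g * or_hat / (1.0 + (or_hat - 1.0) * x_g)
        exp_cases = g["m_cases"] * p_case
        exp_controls = g["m_controls"] * x_g
        a1 += g["cases_exposed"] + exp_cases
        b1 += g["cases_unexposed"] + g["m_cases"] - exp_cases
        a0 += g["controls_exposed"] + exp_controls
        b0 += g["controls_unexposed"] + g["m_controls"] - exp_controls
    # M-step: with one dichotomous exposure and no covariates, the logistic
    # maximum-likelihood estimate is the cross-product ratio.
    return (a1 * b0) / (b1 * a0)


def em_odds_ratio(groups, or_start=1.0, tol=1e-10, max_iter=10_000):
    """Iterate E- and M-steps until ln(OR) changes by less than tol."""
    or_hat = or_start
    for _ in range(max_iter):
        or_new = em_update(groups, or_hat)
        if abs(math.log(or_new) - math.log(or_hat)) < tol:
            return or_new
        or_hat = or_new
    return or_hat


# Hypothetical counts for two groups, for illustration only.
groups = [
    {"x_g": 0.1, "cases_exposed": 10, "cases_unexposed": 40,
     "controls_exposed": 5, "controls_unexposed": 45,
     "m_cases": 20, "m_controls": 100},
    {"x_g": 0.4, "cases_exposed": 30, "cases_unexposed": 20,
     "controls_exposed": 20, "controls_unexposed": 30,
     "m_cases": 15, "m_controls": 120},
]
print(round(em_odds_ratio(groups), 3))
```

With a richer risk model (covariates, or a non-linear exposure effect), the M-step would instead be a weighted model fit on the completed cell frequencies.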
E-step in our situation
(Strömberg & Björk, submitted)
ÔR = Current OR-estimate
Complete the data in each group G:
• m0 controls with missing 2nd stage data
→ m0 * XG = expected number of exposed
• m1 cases with missing 2nd stage data
→ m1 * XG * ÔR / [1 + (ÔR - 1) * XG] = expected number of exposed
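
The expression for cases follows from the definition of the odds ratio if XG is read as the exposure prevalence among controls in group G. That reading is an assumption made explicit here; the slides only call XG the proportion exposed. A short derivation, as a sketch, with E denoting "exposed":

```latex
\begin{align*}
\frac{P(E \mid \text{case}, G)}{1 - P(E \mid \text{case}, G)}
  &= \widehat{OR} \cdot \frac{X_G}{1 - X_G}, \\
P(E \mid \text{case}, G)
  &= \frac{\widehat{OR}\, X_G}{\widehat{OR}\, X_G + (1 - X_G)}
   = \frac{\widehat{OR}\, X_G}{1 + (\widehat{OR} - 1)\, X_G}.
\end{align*}
```

Multiplying by m1 gives the expected number of exposed cases. For example, with XG = 0.2 and ÔR = 3 the probability is 0.6 / 1.4, about 0.43. With ÔR = 1 the expression reduces to XG, the same probability as for controls.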
Simulated case-control studies
• 400 cases, 1200 controls in the 1st stage
• 2nd stage participation
75% of the cases
25% of the controls
• Selective participation of 2nd stage controls
Corr(Participation, XG) = 0, > 0, < 0
• 1000 replications in each scenario
• True OR = 3
Simulations - Results

                      1st stage data only     2nd stage data only     EM method
                      (400 + 1200)            (300 + 300)             (400 + 1200)
Participation         OR    SD    Coverage    OR    SD    Coverage    OR    SD    Coverage
Corr(Part., XG) = 0   3.0   0.18  95.0%       3.0   0.23  95.6%       3.0   0.15  95.5%
Corr(Part., XG) < 0   3.0   0.18  95.0%       5.3   0.29  45.8%       3.0   0.15  95.0%
Corr(Part., XG) > 0   3.0   0.18  95.0%       1.8   0.20  32.9%       3.0   0.15  95.5%

SD = Empirical standard deviation of the ln(OR) estimates
Coverage = Coverage of 95% confidence intervals
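
The two summary measures can be computed from a set of replications as follows. This is a minimal sketch: the function name summarize_replications is hypothetical, and the intervals are assumed to be Wald intervals on the ln(OR) scale, which the slides do not specify.

```python
import math
import statistics


def summarize_replications(or_estimates, standard_errors, true_or, z=1.96):
    """Return (empirical SD of ln(OR), coverage of the 95% intervals).

    standard_errors are the estimated standard errors of ln(OR), one per
    replication. An interval covers when ln(true_or) lies within z standard
    errors of the estimated ln(OR).
    """
    log_ors = [math.log(v) for v in or_estimates]
    log_true = math.log(true_or)
    empirical_sd = statistics.stdev(log_ors)
    covered = sum(
        1 for log_or, se in zip(log_ors, standard_errors)
        if abs(log_or - log_true) <= z * se
    )
    return empirical_sd, covered / len(log_ors)


# Two made-up replications: one interval covers the true OR of 3, one does not.
sd, coverage = summarize_replications([3.0, 6.0], [0.1, 0.1], true_or=3.0)
print(sd, coverage)
```

A coverage far below 95%, as in the "2nd stage data only" column under selective participation, signals a biased point estimate rather than merely wide intervals.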
Simulations - Conclusions
Combining 1st and 2nd stage data,
using the EM method can:
1. Improve precision
2. Remove bias from selective participation
Method is sensitive to errors in the
(1st stage) external exposure data!
Simulations – Conclusions II
EM-method is sensitive to
1. Violations of the MAR-assumption
(conditional on disease status and 1st stage group affiliation)
2. Errors in the (1st stage) external exposure data
Ongoing methodological
research project
• Focus on exposure estimates from a GIS
GIS data: NO2 (Scania)
Two-stage exposure assessment
procedure
1st stage: XG represents mean exposure levels
rather than proportion exposed
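[Figure: geographical groups with mean exposure levels XG = 4.8, XG = 10.1, XG = 20.1, ...; xi denotes an individual exposure within a group]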
2nd stage: xi is a continuous, rather than a
dichotomous, exposure variable
Assume a linear relation between xi and disease
odds (cf. radon exposure and lung cancer [Weinberg
et al., 1996]).
[Figure: disease odds plotted against xi]
For subjects with 1st stage data only: no bias is expected from
using their XG values (Berkson errors), provided the data are MAR in
each group, independent of disease status.
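
A brief sketch of why group means need not bias a linear model, under assumptions not spelled out on the slide: individual exposure is written as x_i = X_G + e_i with E[e_i | G] = 0 (the Berkson structure), the odds are linear in x_i with baseline odds α, and the disease is rare enough for the average odds in a group to approximate the group-level odds.

```latex
\begin{align*}
\text{odds}(x_i) &= \alpha\,(1 + \beta x_i), \qquad
  x_i = X_G + e_i, \quad E[e_i \mid G] = 0, \\
E[\text{odds}(x_i) \mid G] &= \alpha\,\bigl(1 + \beta\, E[x_i \mid G]\bigr)
  = \alpha\,(1 + \beta X_G).
\end{align*}
```

The group-level relation has the same slope β as the individual-level one. The argument relies on linearity in x_i; it does not carry over directly to, for example, a log-linear model.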
EM method? Exposure variation in each group?
Two-stage exposure assessment
procedure –
related work
• Multilevel studies with applications to a study
of air pollution [Navidi et al., 1994]:
pooling exposure effect estimates based on
individual-level and group-level models,
respectively
Collecting data on confounders or effect
modifiers at 2nd stage
1st stage: XG = mean exposure levels
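[Figure: geographical groups with mean exposure levels XG = 4.8, XG = 10.1, XG = 20.1, ...; ci denotes an individual covariate value within a group]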
2nd stage: ci is a covariate, e.g. smoking history
Data on confounders or effect
modifiers at 2nd stage –
estimation of exposure effect
• Confounder adjustment based on logistic
regression: pseudo-likelihood approach [Cain
& Breslow, 1988]
• More general approach: EM method
[Wacholder & Weinberg, 1994]
Design stage (“stage 0”)
1st stage: How many geographical areas (groups)? How many subjects in each group?
[Figure: Group 1, Group 2, Group 3, ..., each with an undetermined number of subjects]
2nd stage: Fractions of the 1st stage cases and controls?
Design stage – related work
• Two-stage exposure assessment: power
depends more strongly on the number of
groups than on the number of subjects per
group [Navidi et al., 1994]
References I
• Björk & Strömberg. Int J Epidemiol
2002;31:154-60.
• Strömberg & Björk. “Incorporating group-level
exposure information in case-control
studies with missing data on dichotomous
exposures”. Submitted.
References II
• Cain & Breslow. Am J Epidemiol 1988;128:1198-1206.
• Navidi et al. Environ Health Perspect
1994;102(Suppl 8):25-32.
• Wacholder & Weinberg. Biometrics 1994;50:350-7.
• Weinberg et al. Epidemiology 1996;7:190-7.