Propensity Scores and Quasi-Experiments

A Randomized Experiment
Comparing Random to
Nonrandom Assignment
William R Shadish
University of California, Merced
and
M.H. Clark
Southern Illinois University, Carbondale
Paper presented at the Society for Research on Educational Effectiveness,
December 2006.
Nonrandomized Experiments
• A central hypothesis about the use of nonrandomized
experiments is that their results can well-approximate
results from randomized experiments
– especially when the results of the nonrandomized experiment
are appropriately adjusted by, for example, selection bias
modeling or propensity score analysis.
– I take the goal of such adjustments to be: to estimate what the
effect would have been if the nonrandomly assigned
participants had instead been randomly assigned to the
same conditions and assessed on the same outcome
measures.
– The latter is a counterfactual that cannot actually be observed
– So how is it possible to study whether these adjustments work?
Kinds of Empirical Comparisons of
Randomized to Nonrandomized
Experiments
• In general, there have been three different ways
to study this question:
– Computer Simulations
– Single Study Comparisons
– Meta-Analytic Comparisons
• None of these methods provide a good estimate
of that ideal counterfactual.
• We can, however, estimate that counterfactual in
the usual recommended way, with a randomized
experiment:
N = 445 Undergrad Psych Students
Random Assignment
Randomized
Experiment
N = 235
Randomly Assigned to
Mathematics
Training
N = 119
Vocabulary
Training
N = 116
Nonrandomized
Experiment
N = 210
Self-Selected into
Mathematics
Training
N = 79
Vocabulary
Training
N = 131
All Participants Post-tested on both Vocabulary and Mathematics Outcomes
More on the Design
• All participants pretested on a host of covariates
• Chose math and vocab training because
–
–
–
–
Good analogue to educational interventions
Relevant to college students
Easy to control effect size with item difficulty
Math phobias cause plausible selection bias
• All participants treated together without
knowledge of the different conditions.
• All participants posttested on both math and
vocab outcomes.
Unadjusted Results:
Effects of Math Training on Math Outcome
Math
Tng
Vocab
Tng
Mean
Diff
Mean
Mean
Unadjusted Randomized Experiment
11.35
7.16
4.19
Unadjusted Quasi-Experiment
12.38
7.37
5.01
Absolute
Bias
.82
Conclusions:
1. The effect of math training on math scores was larger when participants could
self-select into math training.
2. The 4.19 point effect (out of 18 possible points) in the randomized experiment was
overestimated by 19.6% (.82 points) in the nonrandomized experiment
Unadjusted Results:
Effects of Vocab Training on Vocab Outcome
Vocab
Tng
Math
Tng
Mean
Diff
Mean
Mean
Unadjusted Randomized Experiment
16.19
8.08
8.11
Unadjusted Quasi-Experiment
16.75
7.75
9.00
Absolute
Bias
.89
Conclusions:
1. The effect of vocab training on vocab scores was larger (9 of 30 points) when
participants could self-select into vocab training.
2. The 8.11 point effect (out of 30 possible points) in the randomized experiment was
overestimated by 11% (.89 points) in the nonrandomized experiment.
Adjusted Quasi-Experiments
• It is no surprise that randomized and
nonrandomized experiments might yield
different answers.
• The more important question is whether
statistical adjustments can improve the
quasi-experimental estimate—in this case,
propensity score adjustments
Estimation of Propensity Scores
• Used stepwise logistic regression with subsequent
forced entry of variables out of balance
– For example: Math and vocabulary pretest scores, ACT, GPA,
measures of previous exposure to math courses, math anxiety,
Demographics
– But also “Big 5” personality traits (extraversion, emotional
stability, agreeableness, intellect, and conscientiousness)
• Checked balance with 2 x 5 ANOVA, recomputed,
rechecked.
– Initially 10 of 30 covariates out of balance.
– At the end, 1 of 30 interactions were significant and 0 of 30 main
effects were significant.
Methods for Propensity Score
Adjustments
• ANCOVA
– Sensitive to nonlinearities
• Stratification on propensity score quintiles.
• Weighting
– Each observation is weighted by the inverse of its
propensity score (tmt) or of (1 – ps) for control, and
then a standard weighted average is computed
• Here are results of three adjustments
100%
80%
Vocabulary
Outcome
Mathematics
Outcome
60%
40%
Propensity
Score
Weighting
Propensity
Score
ANCOVA
0%
Propensity
Score
Stratification
20%
Unadjusted
QuasiExperiment
Percent Bias Remaining
After Adjustment
Percent Bias Remaining After
Adjustments
Propensity Score Adjustment
Method
There are a number of interesting features of these results:
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
Vocabulary Outcome
Propensity
Score
Weighting
Propensity
Score
ANCOVA
Propensity
Score
Stratification
Mathematics Outcome
Unadjusted
QuasiExperiment
Percent Bias Remaining After
Adjustment
Percent Bias Remaining After
Adjustments: Stratification
Propensity Score Adjustment
Method
Rosenbaum and Rubin (1984) note that propensity score stratification should
yield approximately a 90% reduction in bias in outcome due to each of the
variables contributing to the propensity score. We obtained bias reduction of
90% and 91% for mathematics and vocabulary outcomes, respectively.
(When was the last time you saw a point prediction like that be confirmed?)
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
Vocabulary Outcome
Propensity
Score
Weighting
Propensity
Score
ANCOVA
Propensity
Score
Stratification
Mathematics Outcome
Unadjusted
QuasiExperiment
Percent Bias Remaining After
Adjustment
Percent Bias Remaining After
Adjustments: ANCOVA
Propensity Score Adjustment
Method
Bias reduction (in percentage terms) was approximately the same for each of the
two outcomes for both propensity score stratification (90%, 91%) and propensity
score weighting (70%, 74%). Bias reduction only differed over outcomes when we
used propensity scores as a covariate. The reason is the relationship between
propensity scores and mathematics outcome was not linear:
Relationship Between Propensity
Scores and Outcomes
For the mathematics outcome
variable, the relationship had a
cubic component.
For the vocabulary outcome, even
a lowess smoother could not
identify a function that fit much
better than a linear one.
So, redoing the ANCOVA for math to include a quadratic and cubic component yielded
much better adjustment:
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
Vocabulary Outcome
Propensity
Score
Weighting
Propensity
Score
ANCOVA
Propensity
Score
Stratification
Mathematics Outcome
Unadjusted
QuasiExperiment
Percent Bias Remaining After
Adjustment
Percent Bias Remaining After
Adjustments: Nonlinear ANCOVA
Propensity Score Adjustment
Method
Bias in the mathematics outcome was reduced by 97.5%. Adding the same terms
to the adjustment for vocabulary outcome had virtually no effect—though perhaps
we did not add the “right” nonlinear terms.
And now, each method yielded about the same bias reduction for each dependent
variable. But why did weighting do comparatively poorly?
Extreme Propensity Scores
The reason may be that n = 15 propensity scores were less than .01 (they were
arbitrarily truncated to .01) and another n = 15 were greater than .9999 (they were
arbitrarily truncated to .9999). Rubin (2001) notes that the estimation of propensity
scores “can generate unrealistically extreme weights when an estimated propensity
score is near zero or one” (p. 184). I’m open to suggestions and comments on this or
other possible reasons for the poor performance of weighting.
What Is Impressive Is That Four Different
Rubin Predictions were Supported
• A specific point prediction that propensity score
stratification should reduce bias by 90%, which it
did, twice (math and vocab).
– Indeed, if it were always 90%, we would be able to
compute the exact effect (bias reduction/.9).
• When propensity scores have no nonlinearities
with outcome, ANCOVA will reduce bias (it did at
90% with vocab).
• When propensity scores have nonlinearities that
are not modelled, ANCOVA will perform more
poorly (math 1)
• When those nonlinearities are modeled well,
ANCOVA will reduce bias (98% in math 2)
Predictors of Convenience
• Bad practice: We also tested the effectiveness of
propensity score adjustments based only on
predictors of convenience (sex, age, ethnicity,
marital status)
• Depending on how we did the analyses bias
reduction ranged from 12-30% at best.
• The importance of thoughtful selection of
covariates in the design of the study.
Discussion
• Results suggest more optimistic conclusions
than those reached by many recent researchers
• Propensity score adjustments to nonrandomized
experiments might yield a reasonable estimate
of what the effect would have been if these
same participants had instead been
randomly assigned to these same conditions
using these same measures.
• Jason Luellen Dissertation
– Methods for constructing propensity scores crucial
– With large sample size, logistic regression nearly
always got closer to the right answer.
Conclusions
• The laboratory analogue method is a novel
test of the effects of assignment method,
and of adjustments to quasi-experiments,
with some advantages over other
methods.
• Propensity score adjustments have
promise