Analysing Results of a Randomised Experiment

Political Analysis 2, Lab 3
1. Pre-lab assignment
• Skim: Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and
Voter Turnout: Evidence from a Large-Scale Experiment.” American Political Science Review 102(1):
33-48. Identify the main research question(s) posed by the authors and describe the methods used to
answer the question(s).
• For more background: Read Chapter 1 of Angrist, Joshua D., and Jörn-Steffen Pischke. 2015. Mastering
’Metrics. Princeton, NJ: Princeton University Press.
2. Loading and inspecting the data
Open RStudio. Open a new script so you can save your work: use the button at the top left of the window,
or File > New File > R Script.
In your new script, write the following command to load the dataset into R:
load(url("https://goo.gl/7cLAKV")) # link points to the dataset on the OQC website
The load() command is a way to load data into R that has been saved with the save() command. The
save() command lets you save many R objects (dataframes, matrices, regression results) in one file. When
you load() that file, all of those objects are put in your memory. Check that you have the ggl object (short
for Gerber, Green, and Larimer) in your environment (the upper right window of RStudio).
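To see how save() and load() fit together, here is a minimal sketch of a round trip using a temporary file. The objects toy_df and toy_vec are made up for illustration and are not part of the lab data.

```r
# Two toy objects (hypothetical names, not part of the lab dataset)
toy_df <- data.frame(id = 1:3, voted = c(1, 0, 1))
toy_vec <- c(a = 1, b = 2)

# save() writes several objects into a single file
path <- tempfile(fileext = ".RData")
save(toy_df, toy_vec, file = path)

# Remove the objects from memory, then restore them with load()
rm(toy_df, toy_vec)
loaded_names <- load(path)  # load() invisibly returns the names of the restored objects
print(loaded_names)

# Both objects are back in the environment
exists("toy_df")
exists("toy_vec")
```

Because load() restores objects under the names they were saved with, printing its return value is a quick way to find out what a file contained.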
• ggl is a dataframe that has one row for every voter in the study — how many in total?
Apply the following commands to get a feel for the data:
dim(ggl) #dimensions of the dataset
head(ggl) #first six rows of data
names(ggl) #variable names
str(ggl) #gives the structure of the dataset and variable types
summary(ggl) #summarizes dataset
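Beyond the whole-dataset commands above, it can help to look at a single categorical column. The sketch below uses a small made-up data frame called toy (an assumption for illustration) and assumes the column is stored as a factor. The same calls could be tried on a column of ggl once str() has confirmed its type.

```r
# Toy data frame (hypothetical) with a factor column and a binary outcome
toy <- data.frame(
  treatment = factor(c("Control", "Neighbors", "Control", "Self", "Control"),
                     levels = c("Control", "Self", "Neighbors")),
  voted = c(0, 1, 0, 1, 1)
)

levels(toy$treatment)            # category labels, in their stored order
counts <- table(toy$treatment)   # number of rows in each category
counts
```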
The variables are as follows:
• sex: male or female
• yob: year of birth
• g2000, g2002, g2004: did this voter vote in the general elections in November of 2000, 2002, 2004?
(binary)
• p2000, p2002, p2004: did this voter vote in the primary elections of August 2000, 2002, 2004? (binary)
• treatment: which of the five treatments did this voter’s household receive?
– “Control”: No mailing
– “CivicDuty”: A mailing encouraging voting
– “Hawthorne”: A mailing encouraging voting and saying that the sender will ‘be studying voter
turnout in the August 8 primary election’
– “Self”: A mailing encouraging voting and showing the recipients’ past turnout, saying ‘We intend
to mail you an updated chart when we have that information’
– “Neighbors”: Same thing, except including information on turnout by neighbors as well
• cluster: in what cluster of households was this voter’s house located?
• voted: did the voter vote in the primary election of 2006?
• hh_id: what is the id number of this voter’s household?
• hh_size: how many voters are in this household?
3. Summary statistics
Let’s compare the individuals who received different treatments. First we’ll make a table like Table 1 in the
published paper.
• Use the mean() function to calculate turnout in the 2002 primary elections among individuals who were in households assigned to the “Control” group. (Hint: the syntax will look like
mean(ggl$variable_name[ggl$treatment == "X"]).)
• Do the same thing for individuals assigned to the “Neighbors” group.
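The hint relies on two ideas: a logical test picks out rows, and the mean of a 0/1 variable is a proportion. Here is a minimal sketch on a made-up data frame (toy and its values are assumptions for illustration only).

```r
# Toy data (hypothetical values): five voters, two groups
toy <- data.frame(
  treatment = c("Control", "Control", "Control", "Neighbors", "Neighbors"),
  p2002 = c(1, 0, 0, 1, 1)
)

# The comparison returns TRUE/FALSE for every row
is_control <- toy$treatment == "Control"
is_control

# Subsetting with that logical vector keeps only the Control rows;
# the mean of a 0/1 variable is the share of ones
control_rate <- mean(toy$p2002[is_control])
control_rate
```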
You should get turnout rates that are a little lower than the corresponding numbers in Table 1 from the
published paper. The reason is that the unit of analysis in Table 1 is the household, while our unit of analysis is
the individual (see footnote 1).
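The difference between averaging over individuals and averaging over households can be seen on a small made-up example. The data frame toy and its values are assumptions for illustration. The sketch uses tapply(), which is introduced in the next subsection, to get one average per household.

```r
# Toy data (hypothetical): one three-person household and two single-person households
toy <- data.frame(
  hh_id = c(1, 1, 1, 2, 3),
  voted = c(0, 0, 1, 1, 1)
)

# Individual-level average: every person counts once
individual_mean <- mean(toy$voted)

# Household-level average: first average within each household,
# then average those household averages so every household counts once
hh_means <- tapply(toy$voted, toy$hh_id, mean)
household_mean <- mean(hh_means)

individual_mean
household_mean
```

The two numbers differ because large households carry more weight in the individual-level average than in the household-level one.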
We could carry on like this to average all of the variables in the dataset by treatment status and produce a
table like Table 1. Let’s look at a couple of shortcuts.
tapply()
We want to calculate the average turnout in the 2002 primaries by treatment group. We can do it group by
group, but the tapply() function is a shortcut. The syntax is
tapply(vector_you_want_to_analyse, vector_you_want_to_group_by,
function_you_want_to_apply_to_each_group)
For example,
tapply(ggl$yob, ggl$sex, summary)
applies the summary() command to the ggl$yob variable separately by sex.
• Use the tapply() command to calculate the turnout in the 2002 primary elections by treatment group.
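Any function that takes a vector can be passed as the third argument, not just summary(). Here is a minimal sketch with mean() on a made-up data frame (toy and its values are assumptions, not the lab data).

```r
# Toy data (hypothetical values): six voters in three groups
toy <- data.frame(
  treatment = c("Control", "Control", "Self", "Self", "Neighbors", "Neighbors"),
  p2002 = c(1, 0, 0, 0, 1, 1)
)

# One mean per group; the result is a named vector
rates <- tapply(toy$p2002, toy$treatment, mean)
rates

# Individual groups can be pulled out by name
rates["Self"]
```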
Footnote 1: To reproduce one row of Table 1 exactly, you could do the following: first, get the average value of the variable for each
household; second, get the average of these averages for the households in each treatment group. Challenge question: Turnout
is higher when you average across households than when you average across individuals. What does this mean about the
relationship between turnout and household size?
aggregate()
We want to calculate the average of many variables by treatment group. We can do it group by group and
variable by variable, but the aggregate() function is a shortcut. The syntax is
aggregate(dataframe_you_want_to_analyse, list_of_vectors_you_want_to_group_by,
function_you_want_to_apply_to_each_group_for_each_variable)
For example,
aggregate(ggl[,c("p2002", "yob")], list(sex = ggl$sex), summary)
applies the summary() command to the ggl$p2002 and ggl$yob variables separately by sex.
• Use the aggregate() command to calculate the turnout rate in the 2000, 2002, and 2004 primary
elections by treatment group.
• Are the turnout rates in past elections similar across treatment groups? Is this what you would expect?
How does this relate to the advantages of experiments compared to observational studies?
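As a sketch of the same pattern with mean() instead of summary(), the example below averages two columns by group on a made-up data frame (toy and its values are assumptions for illustration).

```r
# Toy data (hypothetical values): four voters in two groups
toy <- data.frame(
  treatment = c("Control", "Control", "Neighbors", "Neighbors"),
  p2000 = c(1, 0, 1, 1),
  p2002 = c(0, 0, 1, 0)
)

# One row per group, one column of means per variable;
# naming the list element sets the name of the grouping column
by_group <- aggregate(toy[, c("p2000", "p2002")],
                      list(treatment = toy$treatment),
                      mean)
by_group
```

The result is itself a data frame, so it can be indexed or printed like any other.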
4. Treatment effects
Now we will assess how the different mailings affected turnout in the 2006 primary.
• Use the tapply() command to calculate the turnout rate in the 2006 primary elections by treatment
group. Which group has the highest turnout? Which group has the lowest turnout?
• Use the lm() command (possibly in conjunction with the summary() command) to compare the
difference in turnout across treatment groups. (Hint: when an independent variable in a regression is
categorical (as is the case with treatment), R makes a separate dummy variable for each level of the
variable.) (Another hint: the dependent variable of your regression is voted and your independent
variable is treatment.)
• Compare the turnout rates you calculated with tapply() and the regression coefficients you produced
with lm(). Do you see the connection?
• Still using the lm() command, measure the effects of the different treatments while controlling for
turnout in the 2002 and 2004 primaries.
– Look at the coefficients on turnout in 2002 and 2004. How are these variables related to turnout
in 2006?
– Compare the coefficients from the original regression with the coefficients from the regression with
these controls included. Are they different? Do you expect them to be different?
– Do you think that controlling for any of the other variables in the dataset would change our
estimates of the effect of the different mailings?
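The following sketch shows, on a small made-up data frame, how a regression on a categorical variable relates to group means, and how a control variable is added with +. The data frame toy and its values are assumptions for illustration. In particular, p2004 is constructed to be balanced across the two groups, which need not hold exactly in real data.

```r
# Toy data (hypothetical values): two groups of four voters each
toy <- data.frame(
  treatment = factor(rep(c("Control", "Neighbors"), each = 4),
                     levels = c("Control", "Neighbors")),
  voted = c(0, 0, 1, 0, 1, 1, 0, 1),
  p2004 = c(0, 1, 1, 0, 1, 0, 0, 1)
)

# Group means, computed directly
group_means <- tapply(toy$voted, toy$treatment, mean)
group_means

# Regression on the factor: R creates a dummy for each non-reference level.
# The intercept is the mean of the reference level (here "Control");
# the dummy coefficient is the difference from that mean.
fit <- lm(voted ~ treatment, data = toy)
coef(fit)

# Adding a control variable with +
fit_controls <- lm(voted ~ treatment + p2004, data = toy)
coef(fit_controls)
```

Calling summary(fit) instead of coef(fit) adds standard errors and p-values to the same coefficients.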
5. Treatment effect heterogeneity
Now let’s examine whether the effect of receiving a mailer differs across different types of individuals.
First, let’s simplify the analysis by only comparing the “Control” group with the “Neighbors” group. Start by
using the lm() command to run a regression in which the dependent variable is voted and the independent
variable is treatment, but the analysis is restricted to individuals where treatment is either “Control” or
“Neighbors”. Here are three ways to do this:
# specifying the rows
summary(lm(voted ~ treatment, data = ggl[ggl$treatment %in% c("Control", "Neighbors"),]))
# specifying a subset
summary(lm(voted ~ treatment, data = ggl, subset = ggl$treatment %in% c("Control", "Neighbors")))
# making a separate dataset first, and then regressing
ggl2 = subset(ggl, ggl$treatment %in% c("Control", "Neighbors"))
summary(lm(voted ~ treatment, data = ggl2))
Now, using interactions, assess the following:
• Is the effect of the “Neighbors” treatment larger for men or women? What is the effect for each group?
• Is the effect of the “Neighbors” treatment larger for people who voted in the 2002 primaries, or for the
people who did not vote in the 2002 primaries?
• Is the effect of the “Neighbors” treatment larger for younger people or older people?
• (If time:) Try the same regression (i.e. with the interaction) with the full sample, i.e. not just focusing
on “Neighbors” vs. “Control”. Try to interpret all of the coefficients.
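As a sketch of how an interaction is specified and read, the example below uses a made-up data frame. The name toy, its values, and the level labels "female" and "male" are assumptions; check levels(ggl$sex) for the labels in the real data. In a formula, a * b expands to a + b + a:b.

```r
# Toy data (hypothetical values): two voters in each treatment-by-sex cell
toy <- data.frame(
  treatment = factor(rep(c("Control", "Neighbors"), each = 4),
                     levels = c("Control", "Neighbors")),
  sex = factor(rep(c("female", "female", "male", "male"), times = 2),
               levels = c("female", "male")),
  voted = c(0, 1, 0, 0, 1, 1, 1, 1)
)

# treatment * sex includes both main effects and their interaction
fit <- lm(voted ~ treatment * sex, data = toy)
b <- coef(fit)
b

# Effect of "Neighbors" for the reference sex (here "female")
effect_female <- b[["treatmentNeighbors"]]

# Effect for the other sex: main effect plus the interaction term
effect_male <- b[["treatmentNeighbors"]] + b[["treatmentNeighbors:sexmale"]]

effect_female
effect_male
```

The interaction coefficient on its own is the difference between the two effects, so its sign indicates which group responds more in this coding.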
6. Generalizability (no programming!)
Gerber and colleagues (2008) undertook a very rigorous analysis of the effect of different mailings on voter
turnout and uncovered some rather large treatment effects. However, it is important to consider whether the
findings would be similar if the experiment were repeated in a different time and place. Generalizability refers
to whether the same findings would hold in other settings or with a different sample. Why might the results
not generalize?
7. Saving your work and closing
• At the end of the lab, save your script and either copy it to a USB drive or email it to yourself. To save
the script, in RStudio click on File > Save As..., then type a filename such as Experiments1.R,
and press the Return key.
• Clear your workspace at the very end of the lab. Either click on the Broom in the upper right quadrant
of your screen, or click on Session > Clear Workspace....