Political Analysis 2, Lab 3
Analysing Results of a Randomised Experiment

1. Pre-lab assignment

• Skim: Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Experiment.” American Political Science Review 102(1): 33-48. Identify the main research question(s) posed by the authors and describe the methods used to answer the question(s).
• For more background: Read Chapter 1 of Angrist, Joshua D., and Jörn-Steffen Pischke. 2015. Mastering ’Metrics. Princeton, NJ: Princeton University Press.

2. Loading and inspecting the data

Open RStudio. Open a new script so you can save your work: use the button at the top left of the window, or File > New File > R Script. In your new script, write the following command to load the dataset into R:

    load(url("https://goo.gl/7cLAKV")) # link points to the dataset on the OQC website

The load() command loads data that was saved with the save() command. The save() command lets you save many R objects (dataframes, matrices, regression results) in one file. When you load() that file, all of those objects are placed in memory. Check that you have the ggl object (short for Gerber, Green, and Larimer) in your environment (the upper right window of RStudio).

• ggl is a dataframe that has one row for every voter in the study. How many voters are there in total?

Apply the following commands to get a feel for the data:

    dim(ggl)     # dimensions of the dataset
    head(ggl)    # first six rows of data
    names(ggl)   # variable names
    str(ggl)     # structure of the dataset and variable types
    summary(ggl) # summary of each variable

The variables are as follows:

• sex: male or female
• yob: year of birth
• g2000, g2002, g2004: did this voter vote in the general elections in November of 2000, 2002, 2004? (binary)
• p2000, p2002, p2004: did this voter vote in the primary elections of August 2000, 2002, 2004? (binary)
• treatment: which of the five treatments did this voter’s household receive?
  – “Control”: No mailing
  – “CivicDuty”: A mailing encouraging voting
  – “Hawthorne”: A mailing encouraging voting and saying that the sender will ‘be studying voter turnout in the August 8 primary election’
  – “Self”: A mailing encouraging voting and showing the recipients’ past turnout, saying ‘We intend to mail you an updated chart when we have that information’
  – “Neighbors”: Same as “Self”, except that it also includes information on turnout by neighbors
• cluster: in what cluster of households was this voter’s house located?
• voted: did the voter vote in the primary election of 2006?
• hh_id: what is the id number of this voter’s household?
• hh_size: how many voters are in this household?

3. Summary statistics

Let’s compare the individuals who received different treatments. First we’ll make a table like Table 1 in the published paper.

• Use the mean() function to calculate turnout in the 2002 primary elections among individuals who were in households assigned to the “Control” group. (Hint: the syntax will look like mean(ggl$variable_name[ggl$treatment == "X"]).)
• Do the same thing for individuals assigned to the “Neighbors” group.

You should get turnout rates that are a little lower than the corresponding numbers in Table 1 from the published paper. The reason is that the unit of analysis in Table 1 is the household, while our unit of analysis is the individual.[1]

We could carry on like this to average all of the variables in the dataset by treatment status and produce a table like Table 1. Let’s look at a couple of shortcuts.

tapply()

We want to calculate the average turnout in the 2002 primaries by treatment group. We can do it group by group, but the tapply() function is a shortcut.
The syntax is

    tapply(vector_you_want_to_analyse,
           vector_you_want_to_group_by,
           function_you_want_to_apply_to_each_group)

For example, tapply(ggl$yob, ggl$sex, summary) applies the summary() command to the ggl$yob variable separately by sex.

• Use the tapply() command to calculate the turnout in the 2002 primary elections by treatment group.

aggregate()

We want to calculate the average of many variables by treatment group. We can do it group by group and variable by variable, but the aggregate() function is a shortcut. The syntax is

    aggregate(dataframe_you_want_to_analyse,
              list_of_vectors_you_want_to_group_by,
              function_you_want_to_apply_to_each_group_for_each_variable)

For example, aggregate(ggl[,c("p2002", "yob")], list(sex = ggl$sex), summary) applies the summary() command to the ggl$p2002 and ggl$yob variables separately by sex.

• Use the aggregate() command to calculate the turnout rate in the 2000, 2002, and 2004 primary elections by treatment group.
• Are the turnout rates in past elections similar across treatment groups? Is this what you would expect? How does this relate to the advantages of experiments compared to observational studies?

[1] To reproduce one row of Table 1 exactly, you could do the following: first, get the average value of the variable for each household; second, get the average of these averages for the households in each treatment group. Challenge question: Turnout is higher when you average across households than when you average across individuals. What does this mean about the relationship between turnout and household size?

4. Treatment effects

Now we will assess how the different mailings affected turnout in the 2006 primary.

• Use the tapply() command to calculate the turnout rate in the 2006 primary elections by treatment group. What group has the highest turnout? What group has the lowest turnout?
• Use the lm() command (possibly in conjunction with the summary() command) to compare turnout across treatment groups. (Hint: when an independent variable in a regression is categorical, as is the case with treatment, R makes a separate dummy variable for each level of the variable except one baseline level, which is absorbed into the intercept.) (Another hint: the dependent variable of your regression is voted and your independent variable is treatment.)
• Compare the turnout rates you calculated with tapply() and the regression coefficients you produced with lm(). Do you see the connection?
• Still using the lm() command, measure the effects of the different treatments while controlling for turnout in the 2002 and 2004 primaries.
  – Look at the coefficients on turnout in 2002 and 2004. How are these variables related to turnout in 2006?
  – Compare the coefficients from the original regression with the coefficients from the regression with these controls included. Are they different? Do you expect them to be different?
  – Do you think that controlling for any of the other variables in the dataset would change our estimates of the effect of the different mailings?

5. Treatment effect heterogeneity

Now let’s examine whether the effect of receiving a mailer differs across different types of individuals. First, let’s simplify the analysis by only comparing the “Control” group with the “Neighbors” group.

Start by using the lm() command to run a regression in which the dependent variable is voted and the independent variable is treatment, but with the analysis restricted to individuals whose treatment is either “Control” or “Neighbors”.
Here are three ways to do this:

    # specifying the rows
    summary(lm(voted ~ treatment,
               data = ggl[ggl$treatment %in% c("Control", "Neighbors"),]))

    # specifying a subset
    summary(lm(voted ~ treatment, data = ggl,
               subset = ggl$treatment %in% c("Control", "Neighbors")))

    # making a separate dataset first, and then regressing
    ggl2 = subset(ggl, ggl$treatment %in% c("Control", "Neighbors"))
    summary(lm(voted ~ treatment, data = ggl2))

Now, using interactions, assess the following:

• Is the effect of the “Neighbors” treatment larger for men or women? What is the effect for each group?
• Is the effect of the “Neighbors” treatment larger for people who voted in the 2002 primaries, or for the people who did not vote in the 2002 primaries?
• Is the effect of the “Neighbors” treatment larger for younger people or older people?
• (If time:) Try the same regression (i.e. with the interaction) on the full sample, i.e. not just focusing on “Neighbors” vs. “Control”. Try to interpret all of the coefficients.

6. Generalizability (no programming!)

Gerber and colleagues (2008) undertook a very rigorous analysis of the effect of different mailings on voter turnout and uncovered some rather large treatment effects. However, it is important to consider whether the findings would be similar if the experiment were repeated in a different time and place. Generalizability refers to whether the same findings would hold in other settings or with a different sample. Why might the results not generalize?

7. Saving your work and closing

• At the end of the lab, save your script and either copy it to a USB drive or email it to yourself. To save the script, in RStudio click on File > Save As..., then type a filename such as Experiments1.R, and press the Return key.
• Clear your workspace at the very end of the lab. Either click on the broom icon in the upper right quadrant of your screen, or click on Session > Clear Workspace....
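For reference, the tapply(), aggregate(), and lm() steps from Sections 3 to 5 can be sketched on simulated data. This is only an illustrative sketch: the data frame toy, its sample size, and the turnout probabilities are invented for the example and are not taken from the ggl dataset or the published paper. It shows how group means relate to regression coefficients, and how an interaction term yields a separate treatment effect for each group.

```r
# Simulated stand-in for the lab data. All names and numbers below are
# hypothetical and chosen only for illustration.
set.seed(1)
n <- 2000
toy <- data.frame(
  treatment = factor(sample(c("Control", "Neighbors"), n, replace = TRUE),
                     levels = c("Control", "Neighbors")),
  sex       = factor(sample(c("female", "male"), n, replace = TRUE)),
  p2002     = rbinom(n, 1, 0.4)
)
# Assumed turnout model: baseline 0.30, plus 0.08 if treated,
# plus 0.10 if the person voted in the (simulated) 2002 primary.
toy$voted <- rbinom(n, 1,
                    0.30 + 0.08 * (toy$treatment == "Neighbors") +
                      0.10 * toy$p2002)

# Mean of one variable by group
group_means <- tapply(toy$voted, toy$treatment, mean)

# Means of several variables by group
by_group <- aggregate(toy[, c("p2002", "voted")],
                      list(treatment = toy$treatment), mean)

# Regression on the treatment factor: the intercept is the mean of the
# baseline level, and the coefficient is the difference in group means.
fit <- lm(voted ~ treatment, data = toy)
control_mean <- as.numeric(group_means["Control"])
diff_means   <- as.numeric(group_means["Neighbors"]) - control_mean

# Interaction: treatment * sex expands to both main effects plus
# their product, so the treatment effect can differ by sex.
fit_int <- lm(voted ~ treatment * sex, data = toy)
b <- coef(fit_int)
effect_female <- as.numeric(b["treatmentNeighbors"])
effect_male   <- as.numeric(b["treatmentNeighbors"] +
                              b["treatmentNeighbors:sexmale"])

print(by_group)
print(coef(fit))
print(c(female = effect_female, male = effect_male))
```

Because sex is a factor whose first level is "female", the coefficient on treatmentNeighbors in the interaction model is the effect among women, and adding the interaction coefficient gives the effect among men. With a different baseline level the same quantities would appear under different coefficient names.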