Worksheet

Political Analysis II
Lab 6: Interpretation of regression results
Week 7, Michaelmas 2016
For this lab we will be using data from a paper by Fails and Krieckhaus (F&K) called “Colonialism, Property
Rights and the Modern World Income Distribution” to give you more practice making predictions using OLS
regression results.
F&K take issue with the conclusions of Acemoglu, Johnson and Robinson (AJR), who posit that European
settlement and extraction patterns shaped the property rights regimes that were set up in European colonies.
Because such institutions tend to be sticky over time—and because they encourage or discourage investment—
settlement and extraction patterns from centuries ago can, in AJR’s opinion, be said to influence development,
as measured by modern-day GDP-per-capita.
Specifically, AJR argue that where European colonists settled in large numbers, they established strong
property rights that encouraged investment over time, leading to high modern-day GDP-per-capita. Conversely,
colonists designed institutions to maximize extraction in colonies with few inhabitants of European origin,
which discouraged investment in those places and led to low modern-day GDP-per-capita.
1. Loading and examining the data
F&K and AJR use historical and recent data from 92 countries that were initially colonized around the 15th
and 16th centuries.
The data are in Stata format, so we must first load the foreign package in R:
library(foreign)
And read the data using the function read.dta(), assigning it to a new dataframe, colonial:
colonial <- read.dta("http://andy.egge.rs/data/FailsKrieckhaus.dta")
Remember that you can view the data by double-clicking on the colonial object in your Data window (upper
right) or by typing View(colonial) in the console.
Here’s a guide to the key variables:
• The dependent variable:
– combrisk: the combined modern-day expropriation risk (combined because F&K use AJR data
on expropriation risk for 63 countries, and fill in the gaps using another data source). Higher
combrisk scores indicate stronger property rights.
• Three variables capture the favourability of a colony for settlement in 1500:
– urb1500: the percentage of the population living in urban centres in 1500.
– lden1500: the log base 10 population density in 1500 (people per square kilometer).
– mort: settler mortality in 1500 (deaths per thousand individuals per year).
• Some dummy variables may be helpful for identifying categories of colonies:
– neobrit: dummy distinguishing British colonies with lots of immigration from Britain (reference
category is non-British colonies and British colonies with little immigration from Britain).
– nbcs: dummy distinguishing British colonies with lots of immigration from Britain and city states.
• These additional variables may be useful for exploring the dataset:
– countryn: country name.
– WBgdp: modern-day GDP per capita (US$, ppp) according to the World Bank.
Get a feel for the dataset using the dim() and str() commands that you met in previous weeks.
• What is the range of the dependent variable?
• How many British colonies are there with lots of immigration from Britain? Which are they? How
many city states are there, and what are their names? It may be helpful to create one or more tables
for this, storing each table as a variable using the syntax below. Note that you may wish to use the
subset() command (subset(dataset, variable.name == variable.value)) to create a table of a
subset of the dataset.
table.variable.name <- table(dataset$rows.variable.name, dataset$columns.variable.name)
After storing the table in a variable, you can add variable labels:
names(dimnames(table.variable.name)) <- c("Rows Title", "Columns Title")
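To make this pattern concrete, here is a toy sketch; the colony data frame and all of its values below are invented for illustration and are not the real colonial dataset:

```r
# Toy illustration of the table()/subset() pattern above.
# The 'colony' data frame and its values are invented, not the real data.
colony <- data.frame(countryn = c("A", "B", "C", "D"),
                     neobrit  = c(1, 0, 1, 0),
                     nbcs     = c(1, 0, 1, 1))

colony.tab <- table(colony$neobrit, colony$nbcs)
names(dimnames(colony.tab)) <- c("Neo-British", "NBCS")
colony.tab                            # cross-tabulation with labelled dimensions

subset(colony, nbcs == 1)$countryn    # country names in one category
```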
2. Transforming variables
In this lab we are going to focus on one part of F&K’s argument: the question of whether the favourability of
a colony in 1500 to en-masse settlement by Europeans predicts the strength of modern-day property rights.
If you were to run bivariate regressions predicting modern-day property rights using each of the three variables
that capture the favourability of a colony to en-masse settlement by Europeans in 1500, you would find
that all three independent variables are statistically significant at the p = 0.05 level. (Feel free to check this
yourself!)
lm(name.of.dependent.var ~ name.of.independent.var, data = name.of.dataset)
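As a sketch of this syntax on simulated data (the variables x and y below are invented; substitute your own variable and dataset names):

```r
# Simulated bivariate regression: the fitted slope should recover
# (approximately) the true value of 2 used to generate the data.
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x, data = toy)
summary(fit)      # coefficient table with standard errors and p-values
coef(fit)["x"]    # slope estimate, close to 2
```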
R Exercise 1:
• Using the lm() and summary() commands that you have met in previous weeks, run a multivariate
regression with combrisk as the dependent variable and all three settlement-favourability variables
(urb1500, lden1500, mort) as independent variables. Store the result in a variable called lm.mort.
• You might hypothesize that the opportunity to live in a city as opposed to the middle of nowhere might
attract European settlers, whilst a high mortality rate would put them off moving to a colony. Given
what you know about when to include control variables in a model, why might it be a good idea to
include all three of these variables here?
• Interpret the coefficient(s) of the variable(s) that are significant. Remember that the measure of
modern-day property rights, combrisk, is an inverse risk-of-expropriation score with 10 being the
strongest possible modern-day property rights (or a non-existent risk of expropriation), and 0 indicating
weakest possible modern-day property rights (certainty of expropriation).
You may notice that the population density variable in the dataset is logged. In the next exercise, you will
experiment with different transformations of the mortality rate variable.
As you know, OLS assumes a (conditionally) linear relationship between each independent variable and the
dependent variable. Recall that the coefficient for each independent variable is an estimate of the change in
the dependent variable that occurs in response to a one-unit increase in that independent variable, holding
all of the other independent variables constant—and importantly, this is true whatever the value of the
independent variable of interest.
Sometimes, however, an independent variable X doesn’t have a linear relationship with the dependent variable.
We can address this by replacing X in the regression with transformations of X (e.g. the square of X or the
natural logarithm of X) that do have a linear relationship with the dependent variable.
Note that an OLS regression equation containing a quadratic term (X^2) should also include the untransformed
variable, X, in the same way that an OLS regression containing an interaction term (X*Z) should also
include the constitutive terms, X and Z.
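This can be sketched on simulated data (all variables below are invented). Note that in an R formula, I() is needed so that ^2 is treated as arithmetic, and that x * z automatically includes the constitutive terms:

```r
# Quadratic term entered alongside the untransformed variable.
set.seed(2)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x - 0.04 * x^2 + rnorm(100, sd = 0.2)

quad.fit <- lm(y ~ x + I(x^2))
coef(quad.fit)        # intercept, X, and X^2 terms all estimated

# An interaction written as x * z expands to x + z + x:z,
# so the constitutive terms are included automatically.
z <- rnorm(100)
int.fit <- lm(y ~ x * z)
names(coef(int.fit))
```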
R Exercise 2:
• Create two variables: 1) the square of mortality, which you should store using the label sqdmort; 2)
mortality logged to the base e, which you should store using the label logmort. Do this by specifying that
each of these is a new column within the colonial dataframe, and then apply the relevant transformation,
squaring or taking the log base e, to (colonial$mort). This is what you should type to create sqdmort:
colonial$sqdmort <- (colonial$mort)^2
• Now rerun the multivariate regression from Exercise 1, trying first the sqdmort transformation, and
then the logmort transformation. Store these models as lm.sqd and lm.log respectively. Which
transformations of the mortality variable generate a statistically significant coefficient for the mortality
rate?
• Leaving aside considerations about which is the preferred model, you have probably realised that each
of these models will generate different predictions. Calculate the predicted modern-day property rights
scores for Bolivia using the results of each of your three models, lm.mort, lm.sqd and lm.log. In 1500,
Bolivia had a settler mortality rate of 70 (deaths per thousand individuals per year), an urbanization
score of 10.6 (the percentage of people living in urban centres), and an untransformed population
density of 0.651134 (people per square kilometer).
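One way to calculate such a prediction by hand is to plug the country’s values into the fitted equation. The coefficients below are hypothetical placeholders, not the real estimates; note also that Bolivia’s untransformed density must be logged to base 10 before it enters the equation, because the model’s predictor is lden1500:

```r
# Hand-prediction pattern with HYPOTHETICAL coefficients.
# Replace these with the estimates from your own model summary.
b0     <-  5.0     # hypothetical intercept
b.urb  <-  0.05    # hypothetical urb1500 coefficient
b.lden <-  0.3     # hypothetical lden1500 coefficient
b.mort <- -0.004   # hypothetical mort coefficient

# Bolivia's values from the worksheet; density is logged to base 10
# because the model's predictor is lden1500, not raw density.
pred <- b0 + b.urb * 10.6 + b.lden * log10(0.651134) + b.mort * 70
pred
```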
3. Assessing models
We are now going to assess the appropriateness of the different transformations of the mortality variable.
There are various considerations here, such as linearity, model fit, the skewness of the data, and patterns
that may be evident in the residuals.
When it comes to assessing linearity, it helps to consider the direction of the ‘bulge’ in the best-fit line when
X is plotted against Y . You can use Tukey and Mosteller’s ‘bulging rule’ to figure out how to straighten this.
A ‘bulge’ down and to the right can be linearized either by transforming Y down the ladder of powers (by
logging Y ), or by transforming X up the ladder of powers (by squaring, cubing, etc. X). This spreads out
large values of X relative to the small ones, effectively stretching the curve into a straight line.
On the other hand, a ‘bulge’ up and to the left can be linearized either by transforming Y up the ladder of
powers (by squaring, cubing, etc. Y ), or by transforming X down the ladder of powers (by logging X). This
squashes large values of X relative to small ones.
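A small simulated demonstration of this idea, with invented data in which Y is linear in the natural log of X:

```r
# Y is generated as linear in log(x), so it looks curved against x
# but nearly straight against log(x).
set.seed(3)
x <- exp(runif(200, 0, 3))
y <- 1 + 2 * log(x) + rnorm(200, sd = 0.2)

cor(x, y)         # weaker linear association on the raw scale
cor(log(x), y)    # near-perfect linear association after logging
```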
R Exercise 3:
• Interpret the Adjusted R² of the three models.
• Plot a graph of the untransformed mortality variable against the dependent variable, using the syntax
below. Store the graph using the variable name mort.graph.
scatter.smooth(x = dataset$x.variable, y = dataset$y.variable, xlab = "X Title",
ylab = "Y Title")
• Do you notice a ‘bulge’ in the data? If so, what does it suggest about the appropriate transformation
of the mortality variable?
• Plot graphs of the squared mortality variable against the dependent variable, and the logged mortality
variable against the dependent variable. Store these graphs using the variable names sqd.graph and
log.graph respectively.
• What do you notice about the approximate linearity of the plot lines of your three graphs?
• Highly skewed distributions are more difficult to interpret than symmetric distributions because most
of the observations fall within a small range of the data. Looking at your graphs, which of logging or
squaring a variable pulls in the left tail of the mortality variable? Which of your three graphs spreads
the mortality observations most symmetrically?
• Finally, plot the residuals for each of your three models against the fitted values, using the syntax
below. Do you notice any nonrandom patterns?
plot(model.var.name$fitted.values, model.var.name$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0)
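A residuals-versus-fitted plot can be sketched on simulated data (everything below is invented for illustration):

```r
# With an intercept in the model, OLS residuals average to
# (numerically) zero, and a well-specified model should show no
# pattern around the horizontal zero line.
set.seed(4)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

plot(fit$fitted.values, fit$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

mean(fit$residuals)
```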
4. Generating predictions in R
By now you should have found that one of the versions of the mortality variable is generally preferable to the
other two. We will use the model containing this variable for the rest of the worksheet.
While it is helpful to be able to calculate predicted values by hand, as you did in Exercise 2, there is a
function in R that offers a quick way to do this. The following syntax tells R to give you predicted values
and 95% confidence intervals for each observation:
predict(model.var.name, interval = "confidence")
But to generate predictions for any set of values of the independent variables that you wish to specify (these
may be values from real observations or hypothetical ones), you must first create a dataframe with these
values:
newdata <- data.frame(indep.var.name.1 = value.indep.var.1, indep.var.name.2 = value.indep.var.2)
Then you run the predict() function, using the following syntax:
predict(model.var.name, newdata)
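A self-contained sketch of predict() with a newdata frame, using simulated variables rather than the colonial dataset:

```r
# Fit a toy model, then predict at hypothetical values of the
# independent variables supplied via a newdata frame.
set.seed(5)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(60, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = d)

newdata <- data.frame(x1 = 0.5, x2 = -1)          # hypothetical values
predict(fit, newdata)                             # point prediction
predict(fit, newdata, interval = "confidence")    # with 95% CI
```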
R Exercise 4:
• Use the predict() function to check your earlier calculation for the predicted property rights score for
Bolivia, using the model with your preferred version of the mortality variable.
• Find the predicted property rights score for Jamaica. Jamaica has a settler mortality rate of 130 (deaths
per thousand individuals per year), an urbanization score of 3 (the percentage of people living in urban
centres), and a log10 population density of 1.5304 (people per square kilometer).
• Run your model with the preferred version of the mortality variable again, but now include the dummy
variable denoting settled British colonies and city states. Store this model using the variable name
lm.nbcs. What do you notice happens to the significance of the coefficients and the Adjusted R²?
• What do these results mean substantively for the theory being tested?
• Let’s imagine for a moment that we could rerun history and that Jamaica had not become one big
sugar cane farm for the British Empire (to which few people moved from Britain). Rather, Jamaica
developed as a city state. How different does the lm.nbcs model predict Jamaica’s modern-day property
rights score would be had Jamaica developed as a city state?
• Either by casting an eye over the property rights scores for different countries in the dataset and their
present-day level of development (or, better still, by running a bivariate model), comment on whether
you think that this alternative version of Jamaican history would imply a substantive difference for
Jamaica’s modern-day level of development. (Note that the World Bank GDP-per-capita variable
is based on purchasing power parity, and has units ‘thousands of US dollars’.)
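A useful check when comparing such scenarios: in a toy model, flipping a 0/1 dummy in newdata while holding the other predictors fixed changes the prediction by exactly the dummy’s coefficient. All values below are simulated:

```r
# Compare predictions for the same hypothetical observation with the
# dummy switched off and on; the gap equals the dummy's coefficient.
set.seed(6)
d <- data.frame(x = rnorm(80), dummy = rbinom(80, 1, 0.5))
d$y <- 1 + d$x + 2 * d$dummy + rnorm(80, sd = 0.1)
fit <- lm(y ~ x + dummy, data = d)

p0 <- predict(fit, data.frame(x = 0.3, dummy = 0))
p1 <- predict(fit, data.frame(x = 0.3, dummy = 1))
p1 - p0        # equals coef(fit)["dummy"]
```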
5. Homework
Read Kellstedt and Whitten on standardized coefficients, pages 206 to 209. Standardized coefficients offer a
means of comparing the impact of different independent variables on the dependent variable. They can be
especially helpful when the metrics of the independent variables span wildly different scales. The models run
in this Worksheet are not ideal for demonstrating standardized coefficients, because none have more than one
statistically significant independent variable to compare. However, to practice the calculation, you could
work out the standardized coefficient of the mortality variable in the lm.mort model. Recall that you can
find the standard deviation of a variable by typing sd(dataset$var.name, na.rm=TRUE).
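The arithmetic can be sketched with invented placeholder numbers (these are not estimates from any of the worksheet’s models):

```r
# Standardized coefficient = b * sd(X) / sd(Y).
# All values here are hypothetical placeholders.
b    <- -0.004   # hypothetical unstandardized coefficient
sd.x <- 120      # hypothetical sd of the independent variable
sd.y <- 2.1      # hypothetical sd of the dependent variable

beta.std <- b * sd.x / sd.y
beta.std
```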
If you have more time, load the bcg dataset from last week, run a model that predicts legparties using the
variables fragmentation, concentration, proximity and prescandidate, and then calculate standardized
coefficients for the significant (p<0.05) variables. Compare the influence of these variables on the effective
number of parties.