Political Analysis II
Lab 5: Model Diagnostics and OLS Assumptions
Michaelmas 2016
1. Pre-lab assignment
• For more background on model diagnostics: Read Chapter 10 “Multiple Regression Model Specification”
of Paul Kellstedt and Guy Whitten. 2013. The Fundamentals of Political Science Research. New York,
NY: Cambridge University Press.
• For more on case selection using regression: Read Chapter 5 “Techniques for Choosing Cases” by John
Gerring and Jason Seawright, in John Gerring. 2007. Case Study Research: Principles and Practices.
New York, NY: Cambridge University Press.
• For more background on OLS assumptions: Read §8.5 “Assumptions, More Assumptions, and Minimal
Mathematical Requirements” of Kellstedt and Whitten. 2013.
2. Introduction
In previous weeks, we tested and interpreted the effects of ethnic fragmentation and ethnic concentration on
the number of effective legislative parties using data from Brambor et al.’s (BCG) paper “Are African party
systems different?”. Today, we will check for unusual observations in our data. This is important because
such observations may drive our findings or they may suggest that our model has omitted important variables.
Model diagnostics can thus help us understand our results better. We will examine our data for three types
of unusual observations: regression outliers, high leverage observations, and influential observations. We will
also learn how to test for another data issue, namely multicollinearity.
Read the data using the function read.dta() and assign it to a new dataframe, bcg:
library(foreign) # Because the data is in Stata format
bcg = read.dta("http://andy.egge.rs/data/STATA_mozaffar.dta")
BCG use data based on 62 elections in 34 African countries between 1980 and 2000. Here is a guide to the
key variables:
• Dependent variable
– legparties: effective number of legislative parties
• Independent variables
a) Institutional variables
– logmag10: the log of the average district magnitude (base 10)
– proximity: measure of the temporal proximity of presidential and legislative elections
– prescandidate: effective number of presidential candidates
b) Sociological variables
– concentration: a measure of how geographically concentrated a country’s ethnopolitical
groups are.
– fragmentation: a measure of how fragmented a country’s ethnopolitical groups are, where
maximum fragmentation means that everyone is in her own group and minimum fragmentation
means that there is only one group.
3. Data management and replicating the model
Before we continue, we have to create a variable to label our observations in the plots that we will be making.
We do that because our observations are elections and we have multiple observations for some countries. We
create our id variable label by pasting country and year_nyu together:
bcg$label = paste(bcg$country, bcg$year_nyu)
• To check whether label uniquely identifies each observation, use View(bcg) or duplicated(bcg$label),
which returns a logical vector indicating which rows are duplicates.
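For example, a quick check using duplicated() might look like this:
# Count duplicate labels: 0 means each election is uniquely identified
sum(duplicated(bcg$label))
# List any labels that occur more than once
unique(bcg$label[duplicated(bcg$label)])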
The next step is to run the linear model on which we will perform our diagnostics. We will replicate the
“Fully-specified” model in the “Additive socio-institutional” column of Table 1, which includes:
• legparties as the dependent variable;
• and the following independent variables:
– fragmentation, concentration, and their interaction
– logmag10
– proximity, prescandidate, and their interaction
Recall from Lab 2 that a simple way to run a regression with an interaction in R is to multiply the two
variables together in the lm() formula:
lm(name.of.dependent.var ~ name.of.independent.var.1 * name.of.independent.var.2,
data = name.of.dataset)
• Run the multivariate regression and store the results of the model in a variable called lm.full
• Does our model fit the data well? (hint: use summary())
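For reference, a sketch of this call (it uses the same formula as the subset model in section 7 below):
lm.full = lm(legparties ~ fragmentation * concentration + logmag10 + proximity * prescandidate,
             data = bcg)
summary(lm.full) # Inspect the coefficients, standard errors, and R-squared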
4. Residuals
We use residuals to perform model diagnostics and test OLS assumptions (more in section 9). Remember
that residuals are the difference between the observed values and the fitted values.[1][2] Let us make a plot of
the observed values of our dependent variable legparties and the fitted values from our linear regression.
The closer the dots are to the diagonal line, the smaller the residuals and the better our model predicts the
outcome. The syntax is
plot(name.of.dataset$name.of.dependent.var, name.of.model$fitted.values,
xlab = "myxlab", ylab = "myylab")
abline(a = 0, b = 1)
# With abline, we add a 45-degree line passing through the origin:
# a is the intercept, b is the slope.
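For instance, filling in our names (this assumes lm.full from section 3 and that no observations were dropped for missing data):
plot(bcg$legparties, lm.full$fitted.values,
     xlab = "Observed number of legislative parties", ylab = "Fitted values")
abline(a = 0, b = 1)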
• Does our model predict the effective number of legislative parties well?
[1] Residuals are not to be confused with errors (or disturbances). The latter are the difference between the observed values and the (unobservable) true values from the population model.
[2] Fitted values are also referred to as estimated or predicted values. They are a weighted sum of the predictors.
5. Regression outliers
Regression outliers are observations with extreme values on the dependent variable given their values on the
independent variables.[3] They do not necessarily affect the regression coefficients, but they may indicate that
our model is missing important variables that could explain why these observations are outliers. Outliers
have large residuals. Unstandardized residuals are difficult to interpret because they are measured in the
units of the dependent variable. To assess outlierness, we therefore divide the residuals by an estimate of
their standard deviation. This gives us the studentized residuals (which follow a t-distribution with n-k-2
degrees of freedom). Studentized residuals with an absolute value of two or more are often considered outliers.
(Remember that 95% of all residuals should fall within two standard deviations of the mean.)
• Plot the fitted values against the studentized residuals and use label as our id variable to add labels
to the plot.
plot(name.of.model$fitted.values, rstudent(name.of.model),
xlab = "myxlab", ylab = "myylab")
text(fitted.values(name.of.model), rstudent(name.of.model),
labels = name.of.dataset$id.variable, cex = 0.7, pos = 2)
name.of.dataset[abs(rstudent(name.of.model)) >= 2, "id.variable"]
# 'cex = 0.7' specifies the font size of the labels.
# 'pos = 2' specifies the position of the labels. 1=below, 2=left, 3=above, 4=right
# The last line asks R to print the labels for all observations
# with residuals with an absolute value of two or more.
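Filling in our names, for example (a sketch assuming lm.full and label as above, and that no observations were dropped for missing data):
plot(lm.full$fitted.values, rstudent(lm.full),
     xlab = "Fitted values", ylab = "Studentized residuals")
text(fitted.values(lm.full), rstudent(lm.full),
     labels = bcg$label, cex = 0.7, pos = 2)
bcg[abs(rstudent(lm.full)) >= 2, "label"]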
• Which elections are regression outliers?
6. High leverage observations
Observations have high leverage (i.e. the potential to influence the regression line) if they have extreme values
for an independent variable or a combination of independent variables. To identify high leverage observations,
we use hat-values, which R stores automatically after running the linear regression. Hat-values measure
an observation’s distance from the means of the independent variables. As a rule of thumb, observations have
high leverage if their hat-values are more than two (or, in small samples, three) times the average hat-value
in the sample. The syntax to plot hat-values is
plot(hatvalues(name.of.model))
abline(h = c(2, 3) * mean(hatvalues(name.of.model)))
text(hatvalues(name.of.model),
labels = name.of.dataset$id.variable, cex = 0.7, pos = 2)
name.of.dataset[hatvalues(name.of.model) >= 3 * mean(hatvalues(name.of.model)), "id.variable"]
# Abline draws horizontal lines at two and three times the average hat-values
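With our model and labels filled in, for example:
plot(hatvalues(lm.full), ylab = "Hat-values")
abline(h = c(2, 3) * mean(hatvalues(lm.full)))
text(hatvalues(lm.full), labels = bcg$label, cex = 0.7, pos = 2)
bcg[hatvalues(lm.full) >= 3 * mean(hatvalues(lm.full)), "label"]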
• How many high leverage observations do we have?
• Which election has the highest leverage?
[3] When observations have extreme values on one variable (either an independent or the dependent variable), we call them univariate outliers.
7. Influential observations
Observations are influential if they have a strong influence on the regression line; in other words, the results
of the regression would change drastically if we excluded them from the model. It is important to know
if our results are driven by a handful of cases. Influential cases have extreme values for the independent
variable(s) and for the dependent variable. We can identify them as observations with large residuals and
high leverage. A nice way to examine influential observations is with a “bubble plot”. To make one, we plot
studentized residuals against hat-values and scale the dot of each observation proportional to its Cook’s
Distance value. (Cook’s Distance is a summary statistic of the influence of each observation: a larger Cook’s
D means more influence.)
# Plot hat-values and studentized residuals
plot(hatvalues(name.of.model), rstudent(name.of.model),
cex = 1 + 10 * cooks.distance(name.of.model), # Scale dots to their Cook's D value
xlab = "myxlab", ylab = "myylab")
# Add vertical reference lines for high hat-values
abline(v = c(2, 3) * mean(hatvalues(name.of.model)))
# Add horizontal reference lines for large residuals
abline(h = c(-2, 0, 2))
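A labelled version with our names filled in (a sketch assuming lm.full and bcg$label as above):
plot(hatvalues(lm.full), rstudent(lm.full),
     cex = 1 + 10 * cooks.distance(lm.full),
     xlab = "Hat-values", ylab = "Studentized residuals")
abline(v = c(2, 3) * mean(hatvalues(lm.full)))
abline(h = c(-2, 0, 2))
text(hatvalues(lm.full), rstudent(lm.full), labels = bcg$label, cex = 0.7, pos = 2)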
• Do we have any influential observations? (hint: label the observations)
• What is the relationship between regression outliers, high leverage observations, and influential observations?
We can see how influential an observation is by dropping it from our sample and re-running our model on
the subset. We then compare these results to our original results from the full sample.
lm.subset = lm(legparties ~ fragmentation * concentration + logmag10 + proximity * prescandidate,
data = bcg, subset = bcg$label != "name.of.influential.case")
# '!=' tells R to include all observations except for the influential case we identified.
# Hint: with mtable() from the memisc package you can display tables side-by-side
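For example, once lm.subset has been estimated (this assumes the memisc package is installed):
library(memisc)
mtable("Full sample" = lm.full, "Without influential case" = lm.subset)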
• What is the effect of excluding this observation? (hint: inspect the coefficients, R-squared, and p-values)
8. Multicollinearity
Multiple regression is good at separating the effects of different independent variables on the dependent
variable. However, it cannot do that when two or more independent variables are perfectly or very highly
correlated. We call this multicollinearity. Severe multicollinearity means that we cannot tell which of the
independent variables has a true effect on the dependent variable. It increases our standard errors, which
means that we might mistakenly conclude that there is no effect (Type II error). To test for multicollinearity,
we use the variance-inflation factor VIF_j = 1 / (1 - R_j^2), which we compute with vif() from the car package.
(R_j^2 is the R^2 from the regression of X_j on all the other independent variables in the model.) The VIF for
a variable tells us by how much the variance of its coefficient estimate is inflated compared to a situation in
which the variable is uncorrelated with the other variables in the model (its standard error is inflated by the
square root of the VIF). As a rule of thumb, values above 10 suggest severe multicollinearity.
library(car)
vif(name.of.model)
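To see where these numbers come from, here is a sketch (assuming lm.full as above) that computes the VIF for logmag10 by hand:
# Auxiliary regression of logmag10 on the other predictors in lm.full
aux = lm(logmag10 ~ fragmentation * concentration + proximity * prescandidate, data = bcg)
1 / (1 - summary(aux)$r.squared) # Compare with vif(lm.full)["logmag10"]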
• Does our model suffer from multicollinearity?
• Should we be worried and which measures can we take?
9. (If time) Check OLS assumptions
Ordinary Least Squares (OLS) regression makes a number of assumptions about the data generation process,
such as mean independence, linearity, homoscedasticity, no autocorrelation, and normally distributed errors
(see Kellstedt and Whitten, 2013: §8.5). These assumptions refer to the ideal conditions for OLS. Most
real data do not meet (all of) these conditions. By testing our assumptions, we know to what degree our
data deviate from the ideal conditions (our benchmark). If the assumptions are violated, we can use this
information to improve our model and our interpretation of the results.
9.1 The assumption of linearity
The linearity assumption states that the dependent variable is a linear function of the independent variables
and a random error U. If it does not hold, our estimates will be biased. Luckily, we can easily include nonlinear
transformations of our independent variables (spoiler: we will try that next week). To assess linearity, we
plot our partial residuals (which show the relationship between the independent variable and the dependent
variable given the other independent variables in the model) against the independent variable. To this partial
residuals plot, we add a linear fit in red. If the independent variable has a linear effect on the dependent
variable, the dots should follow the red line.
(Since partial residual plots do not work with interaction terms, we will first run an additive model and use
it for this exercise only.)
lm.add = lm(legparties ~ fragmentation + concentration + logmag10 + proximity
+ prescandidate, data = bcg)
# Store the partial residuals for independent.var.1 in a new variable (e.g. pr.var1)
pr.var1 = residuals(name.of.model, type = "partial")[,"independent.var.1"]
# Plot the independent variable against its partial residuals
plot(name.of.dataset$independent.var.1, pr.var1,
xlab = "myxlab", ylab = "myylab")
# Add a linear line
abline(lm(pr.var1 ~ name.of.dataset$independent.var.1), col = "red", lty = 2)
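For concentration, for instance (a sketch assuming lm.add was estimated as above and that no observations were dropped for missing data):
pr.conc = residuals(lm.add, type = "partial")[, "concentration"]
plot(bcg$concentration, pr.conc,
     xlab = "Ethnic concentration", ylab = "Partial residuals")
abline(lm(pr.conc ~ bcg$concentration), col = "red", lty = 2)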
• Make a partial residuals plot for concentration. Does concentration have a linear effect on the
effective number of legislative parties?
• And fragmentation?
9.2 The assumption of mean independence
The assumption of mean independence states that the mean of our error U does not depend on our independent
variables. This is the most important assumption of OLS and one that we cannot test. It can be violated by
omitted variables, reverse causation, and measurement error in the independent variables. The more it is
violated, the more biased our estimates will be.
9.3 The assumption of homoscedasticity
Homoscedasticity (or constant error variance) means that the variance of the error term is unrelated to
our independent variables. If they are related, we have heteroscedasticity (or nonconstant error variance).
Heteroscedasticity leads to biased estimates of our standard errors. Remember that we use standard errors to
calculate our t-statistic and p-values. If our standard errors are too small, we may incorrectly reject the null
hypothesis (Type I error). If they are too large, we may incorrectly retain a false null hypothesis (Type II
error). To detect heteroscedasticity, we plot the studentized residuals against individual predictors or the
fitted values. Ideally, the residuals would be randomly distributed around zero.[4] Heteroscedasticity is very
common, especially in smaller samples. Note: use our original model lm.full for this and the following
exercises.
plot(name.of.model$fitted.values, rstudent(name.of.model),
xlab = "myxlab", ylab = "myylab")
abline(h = 0)
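Filling in our model, for example:
plot(lm.full$fitted.values, rstudent(lm.full),
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0)
# To check an individual predictor instead, put it on the x-axis, e.g. bcg$fragmentation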
• Plot the fitted values against the residuals. Would you say that the residuals are homoscedastic or
heteroscedastic?
• Now plot one of the independent variables against the residuals. What do you see?
9.4 The assumption of no autocorrelation
This assumption states that the error of one observation is not correlated with the error of other observations
(no autocorrelation). If this assumption is violated, we have autocorrelated errors. Autocorrelation typically
leads to standard errors that are smaller than they should be. If we underestimate our standard errors, we get larger t-values,
and are more likely to reject the null hypothesis even if it is true. Autocorrelation occurs when we have
multiple observations for the same individual or country (this is called longitudinal data) or when some
observations share the same context (e.g. members of the same household). As long as observations are
randomly sampled, it should not be a big concern.
9.5 The assumption of normally distributed errors
The assumption of normally distributed errors is perhaps the least important assumption (and some would
not consider it one of the assumptions at all). We only need it to calculate our t-statistic and p-values. It
does not affect our coefficients or standard errors. As you know, the Central Limit Theorem ensures that the
sampling distribution of our coefficient estimates will be approximately normal when samples are sufficiently
large, even if the errors themselves are not.
For small samples (a rough guideline: N < 100), we should check the normality assumption. We do that
by plotting the quantiles of our studentized residuals against the quantiles of a normal distribution. This is
called a Normal Quantile-Quantile (QQ) plot. If the two distributions are similar, our residuals will be on or
very close to the diagonal line and can be considered normally distributed. If they are not, the normality
assumption is violated. We should then use a more stringent threshold for significance (e.g. an alpha of 0.01
instead of 0.05) to avoid making Type I errors.
qqnorm(rstudent(name.of.model))
abline(a = 0, b = 1)
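With our model, for instance:
qqnorm(rstudent(lm.full), main = "Normal Q-Q plot of studentized residuals")
abline(a = 0, b = 1)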
• Does the assumption of normality hold for our data?
• Why (not)?
[4] Plots are not always easy to interpret. We can therefore also run a nonconstant variance test, ncvTest(name.of.model), from the car package. The p-value tells us whether we can reject the null hypothesis of homoscedasticity (constant variance). If the p-value is significant, we conclude that there is heteroscedasticity.