
Topic 11. Multiple Regression Assumptions
OLS regression assumptions
From: Hamilton, Lawrence C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury Press.
1. The values of the independent variables are fixed in repeated sampling
2. The mean of the errors (in the population) is zero
If assumptions 1 and 2 are valid, then the errors and independent variables are independent, which means that
the estimates of the population intercept and slopes are unbiased.
3. The errors have constant variance (homoskedasticity)
4. The errors are uncorrelated with each other (no autocorrelation)
If assumptions 1-4 are valid, then
 The standard error estimates are unbiased
 The OLS estimator is more efficient than any other linear unbiased estimator
5. The errors are normally distributed
If the 5th assumption is also valid:
 It is safe to use the t and F distributions in confidence intervals and hypothesis testing, even with small
samples
 The OLS estimator is more efficient than any other unbiased estimator (linear or not)
Assumptions 1-5 can be summarized as follows:
We assume the linear model is correct, with normal, independent, and identically distributed errors.
 We have already discussed linearity – you should always evaluate your model for nonlinearity
 ‘Correct’ means that no relevant independent variables have been excluded and no irrelevant
independent variables have been included; we should always strive for parsimonious models
Why are these assumptions important?
To draw inferences about a population, we gather a probability sample from that population and calculate
statistics, which are estimates of the true population parameters:
Population Regression Function (PRF): Yi = α + β1X1i + β2X2i + εi
Sample Regression Function (SRF): yi = a + b1x1i + b2x2i + ei
Ordinary least squares (OLS) regression is one method of calculating the slopes and intercept. If one or
more of the assumptions is not valid, then it is possible that our OLS estimator is biased and/or not the most
efficient (i.e., other methods of calculating the slopes and intercept might be better).
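For reference, estimating the SRF in SPSS comes down to a single REGRESSION command; here is a minimal syntax sketch (y, x1, and x2 are placeholder variable names, not variables from the examples below):

* Estimate the sample regression function: an intercept plus slopes for x1 and x2.
regression
  /dependent y
  /method=enter x1 x2.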
If our estimator is biased and/or not efficient, we are more likely to draw incorrect conclusions about the
relationships between variables in the population. We may conclude that there is a relationship between
variables that does not really exist or we may conclude that there is no relationship between variables that really
does exist – this is always a possibility, but it is more likely when assumptions are violated. In other words, we
are going to be wrong more often than we expect.
The figure below contains sampling distributions for four different estimators (i.e., ways of estimating), which
differ in unbiasedness and efficiency. Which estimator would you rather use?
[Figure: sampling distributions of the estimated slope for four estimators: (1) unbiased and efficient, (2) biased and efficient, (3) unbiased and inefficient, (4) biased and inefficient. X axis: estimated slope; y axis: frequency]

t = (b − β)/sb; when testing the null hypothesis β = 0 this simplifies to t = b/sb.
There are two ways that this two-tailed null hypothesis can be wrong: the true β may be less than 0 or greater than 0.
[Figure: sampling distribution of t under H0: β = 0 with two-tailed rejection regions. In either tail (β < 0 or β > 0) we reject the null because the sample data are improbable if the null is true; in the middle region we fail to reject the null because the sample data are not sufficiently improbable.]
Biased: If the estimate of β (i.e., b) is too big, t is too big and it is easier to reject the null hypothesis.
Biased: If the estimate of β (i.e., b) is too small, t is too small and it is easier to fail to reject the null hypothesis.
Inefficient: If the estimated standard error of b (i.e., sb) is too big, t is too small and it is easier to fail to reject the null hypothesis.
Inefficient: If the estimated standard error of b (i.e., sb) is too small, t is too big and it is easier to reject the null hypothesis.
OLS regression assumptions in more detail
1. Fixed X: We could generate many random samples; each of these would contain the same X values but
different Yi due to different εi values.
If we assume the values of the independent variables are fixed in repeated sampling, then we can assume they
are distributed independently of the errors (i.e., that the independent variables are not correlated with the errors),
which is one of the conditions that leads to the OLS estimator being unbiased.
If the error term is correlated with an independent variable, then our independent variable (X1) is correlated with
another cause of the dependent variable (X2) that is not in our model. Since X2 is not in our model, its
association with X1 has not been removed (the association is represented by areas c and d below) and we may be
over or under estimating the slope for X1 (the estimate is, therefore, biased).
[Figure: Venn diagram of Y, X1, and X2 with overlapping areas labeled a through e; areas c and d represent the association between X1 and X2.]
It is not possible to test this assumption. OLS estimation ensures that the independent variables and the
prediction errors are uncorrelated in the sample. This is true even when there is a correlation between the
independent variables and the errors in the population.
In some situations, however, we know that the values of the independent variables are not fixed – for example,
when a lagged dependent variable is included as an independent variable (using the 2009 unemployment rate as
an independent variable to predict the 2010 unemployment rate). In this case, the unemployment rate is
assumed to be randomly distributed because it is a dependent variable; it cannot simultaneously be fixed. Other
estimators are available for situations like these; Peter Kennedy discusses this in more detail in Chapter 9
(Kennedy 1998, p. 137).
2. The mean of the errors (in the population) is zero.
There is no way to test this assumption. OLS estimation ensures that the mean of the prediction errors in the
sample is zero. This is true even when the mean of the errors in the population is not zero.
3. The errors have constant variance (homoskedasticity)
Homoskedasticity means that the predicted values of the dependent variable are as good (or as bad) at all levels
of the independent variable(s). Heteroskedasticity means that we are better able to predict the dependent
variable for some scores on the independent variable(s). You can evaluate homoskedasticity using scatterplots
containing predicted values and residuals:


Heteroskedasticity: ŷ(Income) = a + b x(Education)
[Scatterplot: unstandardized residuals (y axis) against unstandardized predicted values (x axis)]
The variance of the prediction errors (residuals) is greater at higher predicted incomes. In other words, income is less predictable for people with more education.

Homoskedasticity: ŷ(Stereotyping) = a + b x(Age)
[Scatterplot: unstandardized residuals (y axis) against unstandardized predicted values (x axis)]
The variance of the prediction errors (residuals) appears to be similar for all predicted values. In other words, anti-Black stereotyping is predicted as well for all ages.
To get these scatterplots in SPSS, you have to save the predicted values and residuals by checking the 'Unstandardized' boxes under Predicted Values and Residuals in the Save dialog of the Linear Regression procedure. My scatterplots include the unstandardized predicted values on the x axis and the unstandardized residuals on the y axis.
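The same thing can be done with syntax. Here is a minimal sketch for the income model (income is the variable used in the transformation syntax later in this handout; educ is an assumed name for the education variable):

* Fit the model and save the unstandardized predicted values and residuals.
* SPSS names the saved variables PRE_1 and RES_1.
regression
  /dependent income
  /method=enter educ
  /save pred resid.
* Scatterplot of the saved predicted values and residuals.
graph
  /scatterplot(bivar)=pre_1 with res_1.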
Several statistical tests also exist to evaluate homoskedasticity (e.g., the Park test, the Glejser test, Spearman’s
Rank Correlation test, the White test, the Breusch-Pagan-Godfrey test, and the Goldfeld-Quandt test), although
these are not available in SPSS.
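You can, however, approximate one of them by hand with syntax. The sketch below is a rough Breusch-Pagan-style check for the income model (it assumes the residuals from the regression of income on education have been saved as RES_1 and that education is named educ):

* Square the saved residuals.
compute res_sq=res_1**2.
execute.
* Auxiliary regression: do the predictors explain the squared residuals?
regression
  /dependent res_sq
  /method=enter educ.
* Under homoskedasticity the squared residuals should be unrelated to the predictors.
* N times the R-square of this auxiliary regression is roughly chi-square distributed
* with df equal to the number of predictors.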
Consequences:
 When you have heteroskedasticity, the slope estimates remain unbiased
 The standard errors, however, are mis-estimated and the OLS estimates are no longer the most efficient
What can you do?
 You can often correct heteroskedasticity by transforming your variables
The main source of the heteroskedasticity for the income model is probably income’s skewed distribution. We
may be able to transform it such that the heteroskedasticity will disappear or at least be reduced.
Three Transformations – SPSS Syntax to reduce positive skew:
compute inc_2rt=(income+1)**.5.
compute inc_3rt=(income+1)**(1/3).
compute inc_ln=ln(income+1).
A few comments on the transformations:
 Some people have incomes of 0, and the natural logarithm is undefined at 0; I have added 1 (in all three transformations, for consistency) so that 0 is never logged
 In SPSS, '**' means to raise something to a power; so 3**2 = 3² = 9
 Raising a number to the .5 power is the same as taking the square root; so 3**.5 = √3 ≈ 1.732
 Raising a number to the 1/3 power is the same as taking the cube root; so 3**(1/3) = ∛3 ≈ 1.442
 'Ln' is the natural logarithm
Here are the before and after pictures:
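These histograms can be requested with a FREQUENCIES command, for example:

* Histograms of the original and cube-root-transformed income variable.
frequencies variables=income inc_3rt
  /format=notable
  /histogram.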
[Histograms: the original income variable (left) and the cube root of income (right); y axis: frequency]
Notice that the distribution is more normal and also
that the outliers have been pulled in.
Using the cube root of income is clearly an improvement, but heteroskedasticity is still evident:
[Scatterplots of unstandardized residuals against unstandardized predicted values: before the transformation (left) and after the transformation (right)]
If you cannot transform the variable or variables to deal with the problem (I could have also tried transforming
education, which is slightly negatively skewed), then you can use generalized least squares (GLS), rather than
ordinary least squares (OLS), to get more efficient estimates of the slopes and to correctly calculate the standard errors.
GLS weights cases depending on the size of their error – cases with smaller prediction errors are given greater
weight in the calculation of the estimates. Note – GLS is 'general'; weighted least squares (WLS) is a specific
form of GLS. It is also possible to compute 'robust' standard errors that are corrected for heteroskedasticity.
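As an illustration only, here is a rough sketch of WLS syntax for the income model (educ is an assumed variable name, and the weight used here is purely illustrative; in practice the weights should reflect how the error variance actually changes):

* Weight each case by the inverse of the variable thought to drive the heteroskedasticity.
* Adding 1 avoids dividing by zero for cases with no education.
compute wls_wt=1/(educ+1).
execute.
regression
  /regwgt=wls_wt
  /dependent inc_3rt
  /method=enter educ.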
Heteroskedasticity is common in certain research situations – for example:
 When you are analyzing an aggregated dependent variable for aggregate units
o The poverty rate for a city is an aggregated variable
o It is aggregated up to the city level from individual-level data on income
 When you are analyzing distinct social subgroups (i.e., nested data)
o When you have data for students from different classrooms or people from different countries
o We are often better able to predict test scores for students in some classrooms than for students
in other classrooms
4. The errors are uncorrelated with each other (no autocorrelated errors)
Error terms can be correlated when cases share something that is not an independent variable in a regression
model (e.g., temporal, physical, or social proximity). These similarities disappear into the error term of a
regression equation.
Think of students in different classrooms. Class 1 has Mr. Jones as a teacher and class 2 has Mr. Smith as a
teacher. All of the students in class 1 share something in common – the same teacher. This teacher is likely to
influence them in a way that makes them different than the students in class 2. If these similarities are not
controlled for in the model, they disappear into the error term (which you can think of as all the variables not
controlled for in the model). The errors will be correlated for students in class 1 and they will be correlated for
the students in class 2, but not across classes 1 and 2.
Thus, a key source of correlated errors is the inability to control for all possible causes of an outcome.
The true model:
Yi = α + β1X1i + β2X2i + β3X3i + εi
The estimated model:
Yi = a + b1X1i + b2X2i + ei*, where ei* = ei + b3X3i
If X3i is excluded from the regression model, then it is, by definition, included in the residual. As a result, cases
with high values on X3 will tend to have high values on the residual and cases with low values on X3 will tend
to have low values on the residual. Thus, there is a systematic relationship between the residuals – that is, they
are correlated.
When is it common?
Autocorrelated errors are common with:
 Time-series data – for example, when you have repeated measures for the same case (very common in
economics: studying trends in the unemployment rate, etc.)
 Nested data – for example, when people are nested within countries, students are nested within
classrooms, etc.
 It is possible to obtain nested data by chance, but more often it is built into the sampling design:
o We often use multistage cluster sampling to reduce costs and/or because there does not exist a
list from which to sample cases
 There is not a list of students in the US, so how would you generate a simple random
sample from the population of students?
 You could randomly sample: states, school districts within states, schools within school
districts, classrooms within schools, and students within classrooms
 You would probably administer the questionnaire to multiple students from the same
classroom and/or school to reduce costs
Autocorrelated errors are not common when:
 Cases are randomly sampled from a large population (because you do not typically end up with cases
sharing proximity)
o You could, for example, generate a list of random telephone numbers and administer the
questionnaire over the telephone
How can you detect it?
 You can use statistics (e.g., the Durbin-Watson test statistic) and graphs (e.g., correlograms)
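For example, the Durbin-Watson statistic can be requested along with the regression output; a minimal sketch (y and x1 are placeholder names):

* Request the Durbin-Watson statistic with the usual regression output.
regression
  /dependent y
  /method=enter x1
  /residuals durbin.
* Values near 2 suggest little first-order autocorrelation.
* Values near 0 or 4 suggest positive or negative autocorrelation, respectively.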
Consequences:
 The standard error of the slope (sb) is misestimated
o sb is used to calculate the t value (t = b/sb), which is used in hypothesis testing
 OLS is no longer the most efficient
What can you do?
 You can remove the correlation between the errors by adding the ‘responsible’ variable to the model
(although this is not always possible nor is it often easy to identify the responsible variable)
 Similar to heteroskedasticity, you can control for correlated errors by using GLS estimation:
o In time series analysis, you can explore the data and specify how the errors are related in your
model
o With nested data, you can use an estimation method that assumes the errors are correlated
5. The errors are normally distributed
This is a more stringent assumption than the other error assumptions (i.e., zero mean and constant variance) – it is
possible to have errors with a zero mean and constant variance whose distribution is nevertheless not normal.
When this assumption is valid, we are able to use the t and F distributions in hypothesis testing and in the
generation of confidence intervals in small samples. We can still use t and F in large samples when this
assumption is violated because the OLS estimator is normally distributed in large samples (i.e., asymptotically).
The reasoning:
1. By the central limit theorem, the distribution of a sum of independently and identically distributed variables tends toward a normal distribution as the number of variables increases; even if the variables are not very numerous or not strictly independent, their sum still tends to be approximately normal.
2. Any linear function of normally distributed variables is itself normally distributed.
3. Thus, if the errors are normally distributed, the OLS estimators (which are linear functions of the errors) are also normally distributed, and we can use the normal distribution in hypothesis testing.
“Some researchers believe that linear regression requires that the outcome (dependent) and predictor variables
be normally distributed. In actuality, it is the residuals that need to be normally distributed. In fact, the
residuals need to be normal only for the t-tests to be valid. The estimation of the regression coefficients do not
require normally distributed residuals”
(Source: http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter1/spssreg1.htm).
You can obtain plots of the residuals from SPSS by checking the 'Histogram' and 'Normal probability plot' boxes under 'Standardized Residual Plots' in the Plots dialog.
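The equivalent syntax is sketched below (STEREOB is the dependent variable shown in the output; age and educ are assumed names for the two predictors):

* Histogram and normal P-P plot of the standardized residuals, plus a saved copy of them.
* SPSS names the saved standardized residual ZRE_1.
regression
  /dependent stereob
  /method=enter age educ
  /residuals histogram(zresid) normprob(zresid)
  /save zresid.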
Here are the standardized residuals for anti-Black stereotyping (predicted by age and education)
[Figure: histogram and normal P-P plot of the regression standardized residuals (dependent variable: STEREOB); Mean = 0.00, Std. Dev = 1.00, N = 1024]
If you save the residuals, you can also generate basic statistics like these:

Statistics: ZRE_1 Standardized Residual
N (Valid)                  1024
N (Missing)                1214
Mean                       .0000000
Std. Error of Mean         .03121944
Median                     -.0663556
Mode                       -.42780
Std. Deviation             .99902200
Variance                   .99804497
Skewness                   .203
Std. Error of Skewness     .076
Kurtosis                   1.103
Std. Error of Kurtosis     .153
Range                      7.98951
Minimum                    -4.67248
Maximum                    3.31703
Sum                        .00000
25th percentile            -.5720011
50th percentile            -.0663556
75th percentile            .5182858
These suggest:
 The non-zero, negative median indicates mild positive skew (it is positive skew because the mean, 0, is greater than the median)
 In a normal distribution, the standard deviation ≈ IQR/1.35
 The standard deviation of the errors (se) = .999
 The IQR = 1.090 (.5182858 − (−.5720011))
 1.090/1.35 = .807
 se > IQR/1.35, which suggests heavier-than-normal tails
 You can also see this in the graphs above
 Despite this, the errors don't look too bad; by comparison, check out the errors for income below
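The statistics above can be generated with syntax like the following (ZRE_1 is the name SPSS gives the saved standardized residual):

* Descriptive statistics and percentiles for the saved standardized residual.
frequencies variables=zre_1
  /format=notable
  /statistics=mean semean median mode stddev variance skewness seskew kurtosis sekurt range minimum maximum sum
  /percentiles=25 50 75.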
These are plots of the standardized residuals for income (predicted by education); assuming normal errors is a
bit more of a stretch here:
[Figure: histogram and normal P-P plot of the regression standardized residuals (dependent variable: INC_3RT); Mean = 0.00, Std. Dev = 1.00, N = 5140]
What can you do if the errors deviate substantially from a normal distribution?
 You can transform your variables – this often helps with a number of problems
 If that does not help, you can use other estimators (it is possible to compute 'robust' standard errors in some software packages)
Other things to look for:
1. Multicollinearity
When there is perfect multicollinearity among the independent variables (e.g., a correlation of 1 or −1 between two of them, or an exact linear relationship among several), an infinite number of regression estimates fit the observed data equally well. In other words, it is impossible to calculate the least squares estimate because there is no single best least squares estimate.
Think of it this way: b2 is interpreted as the change in Y for a one-unit increase in X2 when X1 is held constant.
If X1 and X2 are perfectly correlated, it is impossible to hold X1 constant and change the value of X2.
Consider also the pictures below. If you partial out the portion of variance that X1 and X2 share, there is no
variation left. If X1 and X2 are perfectly correlated, we cannot calculate their unique effects on Y because they
do not have unique effects.
[Venn diagrams of Y, X1, and X2. Left (X1 and X2 are correlated): overlapping areas labeled a, b, c, d, and e. Right (X1 and X2 are perfectly correlated): a = b = c, so neither X has any unique overlap with Y.]
In practice, perfect multicollinearity is very rare – usually it is the result of a mistake:
 Including a built-in linear dependency in the model:
o including as independent variables stratification position (composed of a linear combination of
education, income, and prestige), education, income, and prestige
o including dummy variables indicating membership in organizations A, B, and C (with D left out
as the reference category) as well as a variable indicating the number of memberships (ranging
from 0-4)
 Including n dummy variables for n groups (i.e., forgetting to exclude one category as the reference)
 Having more independent variables than cases
However, even high (but not perfect) levels of multicollinearity can affect estimation. In general, multicollinearity leads to inflated standard errors. Inflated standard errors make confidence intervals wider and observed t values smaller. Both of these things make it more difficult to reject null hypotheses. HOWEVER, the OLS estimator is still the best linear unbiased estimator (BLUE).
How do you know if you have multicollinearity?
 You have it if:
o You have bivariate correlations between independent variables of .7 or higher
o You have a high R2, but few, if any, significant t values
o Collinearity diagnostics from SPSS suggest that you do:
 A tolerance below .2 / a variance inflation factor above 5
 Condition indices between 10 and 30 indicate moderate to strong multicollinearity, those
above 30 indicate severe multicollinearity
You can get these SPSS diagnostics by clicking on the Statistics box and checking “Collinearity Diagnostics:”
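Or, in syntax, a sketch using the variable names from the infant mortality example below:

* Add tolerance/VIF and the condition-index table to the regression output.
regression
  /statistics coeff outs r anova collin tol
  /dependent im_3rt
  /method=enter fertilty gdp_ln urban.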
Tolerance – calculated by regressing each X on all other Xs; tolerance is equal to 1-R2 from these regressions.
A high tolerance is good – it indicates a high degree of unexplained variance in X. A tolerance of .2 indicates
that the other Xs account for 80% of the variance in X.
VIF – this is the inverse of the tolerance: VIF = 1/(1 − R²).
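To see where the tolerance comes from, you can run the auxiliary regression yourself (again using the variable names from the infant mortality example below):

* Tolerance for GDP_LN by hand: regress it on the other independent variables.
* Tolerance = 1 minus the R-square from this regression; VIF = 1/tolerance.
regression
  /dependent gdp_ln
  /method=enter fertilty urban.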
Infant Mortality Example (DV = cube root of infant mortality; the cube root is used because infant mortality is not normally distributed):
Correlations (Pearson r; N = 106 for every pair; all Sig. (1-tailed) = .000)

                                       IM_3RT   FERTILTY   GDP_LN   URBAN
IM_3RT                                  1.000      .842     -.873    -.732
FERTILTY (average number of kids)        .842     1.000     -.691    -.619
GDP_LN                                  -.873     -.691     1.000     .757
URBAN (people living in cities, %)      -.732     -.619      .757    1.000
The bivariate associations suggest that multicollinearity is present. We want high correlations between the
dependent variable and the independent variables. It is the high correlations between the independent variables
that are problematic – 1 out of 3 exceeds ±.7.
Coefficients (dependent variable: IM_3RT)

                B      Std. Error   Beta      t       Sig.   95% CI for B      Tolerance   VIF
(Constant)    5.433      .358               15.196    .000   (4.724, 6.142)
FERTILTY       .244      .027       .443     8.884    .000   (.190, .299)        .501      1.998
GDP_LN        -.375      .044      -.516    -8.598    .000   (-.462, -.289)      .347      2.886
URBAN         -.003      .002      -.068    -1.232    .221   (-.008, .002)       .410      2.440

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
The tolerance and VIF statistics indicate that multicollinearity is not present. The tolerance for GDP per capita
(.347), which is the lowest tolerance, suggests that fertility and percent urban explain 65.3% of the variance in
GDP per capita. This is quite a bit, but still within acceptable limits.
Collinearity Diagnostics (dependent variable: IM_3RT)

                                          Variance Proportions
Dimension   Eigenvalue   Condition Index   (Constant)   FERTILTY   GDP_LN   URBAN
    1          3.635          1.000            .00          .01       .00      .00
    2           .325          3.345            .00          .22       .00      .06
    3           .035         10.203            .07          .38       .07      .80
    4           .006         25.339            .93          .40       .93      .14
The condition indices indicate moderate multicollinearity.
What can you do? Here are a few things (we’d cover more if this was an econometrics course)
 Do nothing – OLS is still BLUE
 Find another indicator of the concept that is not as highly correlated with the other independent variable
 Create an index composed of the problem variables (e.g., combine education, income, and prestige into
“stratification position”)
 Present results for multiple models alternately excluding the problem variables
 Increase the sample size
Influential Cases
Best Practices
1. Screen your data using univariate descriptive statistics:
 Missing data
o How much is missing?
o Why is it missing?
o Is there a pattern to the missing data?
 Examine the distribution of your variables
o Search for outliers
o Search for non-normal distributions and use transformations where appropriate
2. Examine relationships using bivariate statistics:
 Start with scatterplots to search for bivariate outliers, nonlinear relationships, possible
heteroskedasticity, etc.
 Compute bivariate correlations to search for possible multicollinearity
 Compute bivariate regressions to see ‘baseline’ slopes
Only after knowing your data well should you begin multiple regression!
3. Use plots to evaluate
 the normality of your errors
 possible heteroskedasticity
4. Search for influential cases…
Suggested Readings
 Berk, Richard A. 2003. Regression Analysis: A Constructive Critique. Thousand Oaks: Sage.
 Gujarati, Damodar N. 1988. Basic Econometrics. New York: McGraw Hill, Inc.
 Kennedy, Peter. 1998. A Guide to Econometrics. (Fourth edition). Cambridge: The MIT Press.
 Chen, Xiao, Phil Ender, Michael Mitchell, and Christine Wells (in alphabetical order). SPSS Web Books:
Regression with SPSS. http://www.ats.ucla.edu/stat/spss/webbooks/reg/