
EEP/IAS 118 - Introductory Applied Econometrics
Spring 2015
Sylvan Herskowitz
Section Handout 5

1 Simple and Multiple Linear Regression Assumptions
The assumptions for simple linear regression are in fact special cases of the assumptions for multiple linear regression:
Check:
1. What is external validity? Which assumption is critical for external validity?
2. What is internal validity? Which assumption is critical for internal validity?
3. What null hypothesis are we typically testing? Which assumption is critical for hypothesis
testing?
4. What happens when a dummy variable for male is included in a regression alongside a
variable for non-male? Which assumption is violated?
2 Omitted Variable Bias
i. Practice: Signing the bias
A classmate of mine is working on a project in Mexico looking at how homicide rates (per capita)
are affected by changes in police financing (per capita). Presumably, giving police more resources
with which to fight crime would lower the rate of homicides in a given area. Let us imagine
that the population model of homicides looks like this, with an index of gang presence in a given
district as an additional explanatory variable.
homicide = β₀ + β₁ policefinance + β₂ gangs + u    (1)
However, let’s pretend that she didn’t think to collect data on prevalence of gangs in each
district, so that the model she estimates is:
homicide = β̃₀ + β̃₁ policefinance + ũ    (2)
If we think there's an important variable missing, like gangs above, we can sign the bias we expect from leaving gangs out of the regression simply by determining the signs of two correlations:
1. Cov(homicide, gangs), or Cov(y, omitted variable)
2. Cov(policefinance, gangs), or Cov(x, omitted variable)
Often we leave out only the unimportant variables. An unimportant variable is one that we're not interested in and one that will not induce bias (i.e., bias is zero) in our coefficients of interest if we leave it out.
The bias due to the omitted variable will be zero when:
1.
2.
On a problem set or exam, when you're trying to decide how omitted variable bias might be affecting your estimation of a parameter, it helps to consult the following table:
                    Cov(y, x_ov) > 0    Cov(y, x_ov) < 0
Cov(x, x_ov) > 0    Upward bias         Downward bias
Cov(x, x_ov) < 0    Downward bias       Upward bias
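The table can be sanity-checked with a quick simulation. Below is a sketch in Python with numpy (the coefficients and variable names are made up for illustration), constructed so that both Cov(y, x_ov) and Cov(x, x_ov) are positive; the table then predicts upward bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Omitted variable, positively correlated with the regressor of interest
x_ov = rng.normal(size=n)
x = 0.8 * x_ov + rng.normal(size=n)              # Cov(x, x_ov) > 0
y = 2.0 * x + 3.0 * x_ov + rng.normal(size=n)    # Cov(y, x_ov) > 0

# Short regression of y on x alone (x_ov omitted): slope = Cov(x, y) / Var(x)
C = np.cov(x, y)
beta_tilde = C[0, 1] / C[0, 0]

print(beta_tilde)  # lands well above the true slope of 2.0: upward bias
```

With these made-up coefficients the short-regression slope is about 3.46 in expectation rather than the true 2.0, exactly the upward bias the table predicts.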
ii. OVB: Derivation/Calculation
SLR4 fails because of an omitted variable: E[u|X] ≠ 0
Reviewing lecture from last Thursday:
Population Model: y = β₀ + β₁x₁ + β₂x₂ + u
Model with omitted x₂ variable: y = β̃₀ + β̃₁x₁ + ũ
Suppose x₁ is correlated with x₂ in the following way: x₂ = α + ρx₁ + v
Substituting this equation into the true population model we get:

y = β₀ + β₁x₁ + β₂(α + ρx₁ + v) + u
  =
  =
There are extra terms! If we take the expectation of β̃̂₁:

E[β̃̂₁] =

If E[β̂₁] ≠ β₁ then we say β̂₁ is biased. What this means is that on average, our regression estimate is going to miss the true population parameter by _______.
3 Clean and Dirty Variation
In class we heard the concepts of "clean" and "dirty" variation mentioned. What do they mean and why do we care? This is a very intuitive lens through which to think about the variation of certain variables. Suppose an estimation is suffering from omitted variable bias. This is a problem because, as a result, we are not getting the true relationship, β, between our explanatory variable of interest and our outcome.
In the example above we wanted to explain homicides with police funding. However, we
had biased results because areas with higher police funding also had higher levels of gang presence. This correlation between police funding and gang presence is considered “dirty” variation
in police funding. It is variation in police funding that cannot be disentangled from gang presence
and results in biasing our estimates. However, police funding and gang presence are not perfectly
collinear. There is other variation in police funding that is not directly explained by gang presence
and vice versa.
If we could isolate only the “clean” variation in police funding that was independent of
gang presence, then we could end up getting an unbiased estimate of the relationship between
homicides and police funding! But this is impossible, right?
Sort of. In the example from class, we actually had data for our omitted variable. Analogously, if we had data on gang presence, we could be clever and use Stata to isolate only the clean variation in police funding and obtain an unbiased estimate of β.
To do this, we generate a new model that uses gang presence to predict police funding by
regressing police funding on gang presence. This predicted police funding is the “dirty” portion
of variation that we can’t use. Instead we extract the “clean” variation in police funding that has
nothing to do with gang presence. This clean variation is captured in the residuals of our model.
What does this do to our results? We saw that this new estimate from regressing homicides
on these residuals is essentially the same as when we estimate the true population model and
include the omitted variable directly in the equation! We now have E[ β̂] = β!
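The two-step procedure described here (residualize the regressor, then regress the outcome on the residuals) is the Frisch-Waugh-Lovell idea. The handout does this in Stata; below is a minimal sketch in Python with numpy, using simulated data since the class dataset isn't bundled here, with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data: gang presence drives both police funding and homicides
gangs = rng.normal(size=n)
police = 1.0 + 0.7 * gangs + rng.normal(size=n)
homicide = 2.0 - 0.5 * police + 1.5 * gangs + rng.normal(size=n)

def ols(y, X):
    """OLS with an intercept; returns the coefficient vector (intercept first)."""
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Step 1: regress police funding on gangs; the residuals are the
# "clean" variation in funding, orthogonal to gang presence.
g = ols(police, gangs)
clean = police - (g[0] + g[1] * gangs)

# Step 2: regress homicides on the clean variation only.
beta_clean = ols(homicide, clean)[1]

# Compare with the long regression that includes gangs directly.
beta_long = ols(homicide, np.column_stack([police, gangs]))[1]

print(beta_clean, beta_long)  # identical up to floating-point error
```

The two coefficients agree exactly (not just approximately): regressing the outcome on the residualized regressor reproduces the multiple-regression coefficient, which is the sense in which the "clean" variation recovers the unbiased estimate.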
So why don’t we just include this variable in the first place!?! Well... Good question. I’m
glad you asked. Of course we would love to always be blessed with the complete set of data
and variables so that we could directly estimate the true population model, but that is very rarely
the case. This example was constructed to illustrate two things: 1) how we can focus on clean
variation to get accurate estimates and 2) how isolating only clean variation leads to an increase in
our standard errors relative to our biased estimate. The latter point is because we are now using
less of the total variation in our police funding (our X variable).
The notion of clean and dirty variation will come back later in the course. Many impact
evaluation methods are tools designed to help us isolate the clean variation of our variables of
interest so that we can get unbiased estimates of β.
Variance of β̂
Bringing this home (hopefully), let's remember our variance for β̂:

Var(β̂) = σ̂² / (SST_x(1 − R²_x))
Check:
1. What happened to n in this formula? Don't we still care about our sample size?
2. What would happen to R²_x if we add an additional variable into our regression that is highly correlated with X?
3. What happens to σ̂² if the newly added variable explains a lot of variation in Y?
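The variance formula above can be verified numerically. A sketch in Python with numpy (simulated data, illustrative coefficients): the x₁ entry of the usual matrix formula σ̂²(X'X)⁻¹ should equal σ̂²/(SST_x(1 − R²_x)), where R²_x comes from regressing x₁ on the other regressors.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)        # x1 and x2 are correlated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])       # sigma-hat^2 with n - k - 1 df

# Matrix formula: Var(beta-hat) = sigma^2 (X'X)^{-1}; take the x1 entry
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

# Handout formula: sigma^2 / (SST_x1 * (1 - R^2_x1))
sst = ((x1 - x1.mean()) ** 2).sum()
Z = np.column_stack([np.ones(n), x2])
gamma = np.linalg.lstsq(Z, x1, rcond=None)[0]
r2 = 1 - ((x1 - Z @ gamma) ** 2).sum() / sst
var_formula = sigma2 / (sst * (1 - r2))

print(var_matrix, var_formula)  # the two agree
```

This also makes question 1 concrete: n hasn't disappeared; it is hiding inside SST_x, which grows with the sample size.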
4 Example: OVB in Action
In this section, I use the wage data (WAGE1.dta) from your textbook to demonstrate the evils of
omitted variable bias and show you that the OVB formula works. Let’s pretend that this sample
of 500 people is our whole population of interest, so that when we run our regressions, we are
actually revealing the true parameters instead of just estimates. We’re interested in the relationship
between wages and gender, and our “omitted” variable will be tenure (how long the person has
been at his/her job). Suppose our population model is:
log(wage)ᵢ = β₀ + β₁ femaleᵢ + β₂ tenureᵢ + uᵢ    (1)
First let’s look at the correlations between our variables and see if we can predict how omitting
tenure will bias β̂ 1 :
. corr lwage female tenure
             |    lwage   female   tenure
-------------+---------------------------
       lwage |   1.0000
      female |  -0.3737   1.0000
      tenure |   0.3255  -0.1979   1.0000
If we ran the regression:
log(wage)ᵢ = β̃₀ + β̃₁ femaleᵢ + eᵢ    (2)

...then the information above tells us that β̃₁ ______ β₁. Let's see if we were right. Imagine we ran the regressions in Stata (we did) and we get the results below for our two models:

log(wage)ᵢ = 1.6888 − 0.3421 femaleᵢ + 0.0192 tenureᵢ + uᵢ    (1)
log(wage)ᵢ = 1.8136 − 0.3972 femaleᵢ + eᵢ    (2)
From these results we now "know" that β₁ = ______ and β̃₁ = ______.
This means that our BIAS is equal to: ______
There's one more parameter missing from our OVB formula. What regression do we have to run to find its value?
tenure = ρ₀ + ρ₁ female + v

The Stata results give us:

tenure = 6.4745 − 2.8594 female + v
Now we can plug all of our parameters into the bias formula to check that it does in fact give us the bias from leaving tenure out of our wage regression:

β̃₁ = E[β̃̂₁] = β₁ + β₂ρ₁ = ______ = ______
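The plug-in can be checked with a few lines of arithmetic (Python; the numbers are the regression coefficients reported above):

```python
beta1 = -0.3421   # coefficient on female in the long regression
beta2 = 0.0192    # coefficient on tenure in the long regression
rho1 = -2.8594    # coefficient on female when regressing tenure on female

beta1_tilde = beta1 + beta2 * rho1   # OVB formula
print(round(beta1_tilde, 4))
```

The result is about −0.3970, matching the short regression's −0.3972 up to rounding of the reported coefficients.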
5 OVB Intuition (For your own reference)
For further intuition on omitted variable bias, I like to think of an archer. When our MLR1-4 hold,
the archer is aiming the arrow directly at the center of the target—if he/she misses, it’s due to
random fluctuations in the air that push the arrow around, or maybe imperfections in the arrow
that send it a little off course. When MLR1-4 do not all hold, like when we have an omitted
variable, the archer is no longer aiming at the center of the target. There are still puffs of air and
feather imperfections that send the arrow off course, but the course wasn’t even the right one to
begin with! The arrow (which you should think of as our β̂) misses the center of the target (which
you should think of as our true β) systematically.
To demonstrate this, I did the following:
• Take a random sample of 150 people out of the 500 that are in WAGE1.dta
• Estimate β̂₁ using OLS, controlling for tenure, with these 150 people.
• Estimate α̂₁ using OLS (NOT controlling for tenure) with these 150 people.
• Repeat 6000 times.
At the end of all of the above, I end up with 6000 biased estimates (α̂₁) and 6000 unbiased estimates (β̂₁). I plotted the kernel density of the biased estimates alongside that of the unbiased estimates. You can see how the biased distribution is shifted to the left, indicating a downward bias!
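The repeated-sampling exercise can be sketched as follows (Python with numpy; simulated data stand in for WAGE1.dta, so the variable names mirror the handout but the coefficients are illustrative, and fewer repetitions are used for speed):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated "population" of 500 people, mimicking the wage example
N = 500
female = rng.integers(0, 2, size=N).astype(float)
tenure = np.clip(6.5 - 2.9 * female + rng.normal(scale=4, size=N), 0, None)
lwage = 1.69 - 0.34 * female + 0.02 * tenure + rng.normal(scale=0.4, size=N)

def slope(y, X):
    """OLS with an intercept; returns the coefficient on the first column of X."""
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

betas, alphas = [], []
for _ in range(2000):                      # 6000 in the handout
    idx = rng.choice(N, size=150, replace=False)
    # Long regression: controls for tenure
    betas.append(slope(lwage[idx], np.column_stack([female[idx], tenure[idx]])))
    # Short regression: omits tenure, so it is biased
    alphas.append(slope(lwage[idx], female[idx]))

print(np.mean(betas), np.mean(alphas))  # alphas center below betas: downward bias
```

Because tenure is negatively correlated with female and positively related to wages, the short-regression estimates α̂₁ center below the β̂₁ estimates, reproducing the leftward shift in the kernel density plot.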
[Figure 1. Kernel densities for biased (alphahat_1) and unbiased (betahat_1) estimates of the effect of female on ln(wage).]