EEP/IAS 118 - Introductory Applied Econometrics                          Spring 2015
Sylvan Herskowitz                                                 Section Handout 5

1  Simple and Multiple Linear Regression Assumptions

The assumptions for simple linear regression are in fact special cases of the assumptions for multiple linear regression:

Check:
1. What is external validity? Which assumption is critical for external validity?
2. What is internal validity? Which assumption is critical for internal validity?
3. What null hypothesis are we typically testing? Which assumption is critical for hypothesis testing?
4. What happens when a dummy variable for male is included in a regression alongside a variable for non-male? Which assumption is violated?

2  Omitted Variable Bias

i. Practice: Signing the bias

A classmate of mine is working on a project in Mexico looking at how homicide rates (per capita) are affected by changes in police financing (per capita). Presumably, giving police more resources with which to fight crime would lower the rate of homicides in a given area. Let us imagine that the population model of homicides looks like this, with an index of gang presence in a given district as an additional explanatory variable:

    homicide = β₀ + β₁ policefinance + β₂ gangs + u                (1)

However, let's pretend that she didn't think to collect data on the prevalence of gangs in each district, so that the model she estimates is:

    homicide = β̃₀ + β̃₁ policefinance + ũ                          (2)

If we think there is an important variable missing, like gangs above, we can sign the bias we expect from leaving gangs out of the regression simply by determining the signs of two correlations:

1. Cov(homicide, gangs), i.e. Cov(y, omitted variable)
2. Cov(policefinance, gangs), i.e. Cov(x, omitted variable)

Often we leave out unimportant variables. An unimportant variable is one that we are not interested in and one that will not induce bias (i.e. the bias is zero) in our coefficients of interest if we leave it out. The bias due to the omitted variable will be zero when:

1. ______
2. ______

On a problem set or exam, when you are trying to decide how omitted variable bias might be affecting your estimate of a parameter, it helps to think of the following table (x_ov denotes the omitted variable):

                          Cov(y, x_ov) > 0      Cov(y, x_ov) < 0
    Cov(x, x_ov) > 0      Upward bias           Downward bias
    Cov(x, x_ov) < 0      Downward bias         Upward bias

ii. OVB: Derivation/Calculation

SLR4 fails because of an omitted variable: E[u|X] ≠ 0.

Reviewing lecture from last Thursday:

    Population model:            y = β₀ + β₁x₁ + β₂x₂ + u
    Model with x₂ omitted:       y = β̃₀ + β̃₁x₁ + ũ

Suppose x₁ is correlated with x₂ in the following way:

    x₂ = α + ρx₁ + v

Substituting this equation into the true population model, we get:

    y = β₀ + β₁x₁ + β₂(α + ρx₁ + v) + u
      = ______
      = ______

There are extra terms! If we take the expectation of β̃̂₁, the OLS estimate from the short regression:

    E[β̃̂₁] = ______

If E[β̂₁] ≠ β₁ then we say β̂₁ is biased. What this means is that, on average, our regression estimate is going to miss the true population parameter by ______.
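If you want to convince yourself that this formula works, a quick way is to simulate it. Below is a minimal Stata sketch; every number in it (the coefficients, ρ = 0.5, the sample size, the seed) is made up purely for illustration and does not come from any of our datasets.

    * Minimal simulated check of the OVB formula (all parameter values are made up)
    clear
    set seed 12345
    set obs 10000

    gen x1 = rnormal()
    gen v  = rnormal()
    gen x2 = 1 + 0.5*x1 + v        // x2 = alpha + rho*x1 + v, with rho = 0.5
    gen u  = rnormal()
    gen y  = 2 + 1*x1 + 3*x2 + u   // true model with beta1 = 1 and beta2 = 3

    regress y x1 x2                // long regression: coefficient on x1 is about 1
    regress y x1                   // short regression: coefficient on x1 is about
                                   // beta1 + beta2*rho = 1 + 3*0.5 = 2.5

The gap between the two coefficients on x1 is exactly the extra β₂ρ term that the algebra above asks you to find.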
3  Clean and Dirty Variation

In class we heard the concept of "clean" and "dirty" variation mentioned. What does it mean and why do we care? It is a very intuitive lens through which to think about the variation in certain variables. Suppose an estimation is suffering from omitted variable bias. This is a problem because, as a result, we are not getting the true relationship β between our explanatory variable of interest and our outcome.

In the example above we wanted to explain homicides with police funding. However, we had biased results because areas with higher police funding also had higher levels of gang presence. This correlation between police funding and gang presence is considered "dirty" variation in police funding: it is variation in police funding that cannot be disentangled from gang presence, and it biases our estimates. However, police funding and gang presence are not perfectly collinear. There is other variation in police funding that is not directly explained by gang presence, and vice versa. If we could isolate only the "clean" variation in police funding that is independent of gang presence, then we could get an unbiased estimate of the relationship between homicides and police funding!

But this is impossible, right? Sort of. In the example from class, we actually had data for our omitted variable. Analogously, if we had data on gang presence, we could be clever and use Stata to isolate only the clean variation in police funding, which would give us an unbiased estimate of β. To do this, we generate a new model that uses gang presence to predict police funding by regressing police funding on gang presence. This predicted police funding is the "dirty" portion of the variation that we can't use. Instead, we extract the "clean" variation in police funding that has nothing to do with gang presence: this clean variation is captured in the residuals of our model.

What does this do to our results? We saw that the estimate from regressing homicides on these residuals is essentially the same as when we estimate the true population model and include the omitted variable directly in the equation! We now have E[β̂] = β!

So why don't we just include this variable in the first place!?! Well... good question. I'm glad you asked. Of course we would love to always be blessed with the complete set of data and variables so that we could directly estimate the true population model, but that is very rarely the case. This example was constructed to illustrate two things: 1) how we can focus on clean variation to get accurate estimates, and 2) how isolating only the clean variation leads to an increase in our standard errors relative to our biased estimate. The latter point is because we are now using less of the total variation in police funding (our X variable). The notion of clean and dirty variation will come back later in the course: many impact evaluation methods are tools designed to help us isolate the clean variation of our variables of interest so that we can get unbiased estimates of β.
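We don't have the homicide data here, but the same trick can be sketched with the wage data used in the next section, letting tenure play the role of the omitted variable that we happen to observe. This is only a sketch: it assumes WAGE1.dta is sitting in your working directory, and female_clean is just an arbitrary name for the residuals.

    * Sketch: isolating the "clean" variation in female, treating tenure as the
    * observed "omitted" variable (assumes WAGE1.dta is in the working directory)
    use WAGE1.dta, clear

    regress female tenure               // the fitted values are the "dirty" variation
    predict female_clean, residuals     // the residuals are the "clean" variation
                                        // (female_clean is just an arbitrary name)

    regress lwage female_clean          // same slope on female as the long regression
                                        // of lwage on female and tenure, but with a
                                        // larger standard error than the biased short
                                        // regression, since less variation is used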
Variance of β̂

Bringing this home (hopefully), let's remember our formula for the variance of β̂:

    Var(β̂) = σ̂² / [ SST_x (1 − R²_x) ]

Check:
1. What happened to n in this formula? Don't we still care about our sample size?
2. What would happen to R²_x if we add an additional variable into our regression that is highly correlated with X?
3. What happens to σ² if the newly added variable explains a lot of variation in Y?

4  Example: OVB in Action

In this section, I use the wage data (WAGE1.dta) from your textbook to demonstrate the evils of omitted variable bias and show you that the OVB formula works. Let's pretend that this sample of 500 people is our whole population of interest, so that when we run our regressions, we are actually revealing the true parameters instead of just estimates. We're interested in the relationship between wages and gender, and our "omitted" variable will be tenure (how long the person has been at his/her job). Suppose our population model is:

    log(wage)ᵢ = β₀ + β₁ femaleᵢ + β₂ tenureᵢ + uᵢ                  (1)

First let's look at the correlations between our variables and see if we can predict how omitting tenure will bias β̂₁:

    . corr lwage female tenure

                 |    lwage   female   tenure
    -------------+---------------------------
           lwage |   1.0000
          female |  -0.3737   1.0000
          tenure |   0.3255  -0.1979   1.0000

If we ran the regression:

    log(wage)ᵢ = β̃₀ + β̃₁ femaleᵢ + eᵢ                              (2)

...then the information above tells us that β̃₁ ______ β₁.

Let's see if we were right. Imagine we ran the regressions in Stata (we did) and we get the results below for our two models:

    (1)  log(wage)ᵢ = 1.6888 − 0.3421 femaleᵢ + 0.0192 tenureᵢ + uᵢ
    (2)  log(wage)ᵢ = 1.8136 − 0.3972 femaleᵢ + eᵢ

From these results we now "know" that β₁ = ______ and β̃₁ = ______. This means that our BIAS is equal to: ______

There's one more parameter missing from our OVB formula. What regression do we have to run to find its value?

    tenure = ρ₀ + ρ₁ female + v

The Stata results give us:

    tenure = 6.4745 − 2.8594 female + v

Now we can plug all of our parameters into the bias formula to check that it in fact gives us the bias from leaving tenure out of our wage regression:

    β̃₁ = E[β̃̂₁] = β₁ + β₂ρ₁ = ______ = ______

(A sketch of the Stata commands behind these numbers appears at the end of this handout.)

5  OVB Intuition (For your own reference)

For further intuition on omitted variable bias, I like to think of an archer. When our MLR1-4 hold, the archer is aiming the arrow directly at the center of the target; if he/she misses, it's due to random fluctuations in the air that push the arrow around, or maybe imperfections in the arrow that send it a little off course. When MLR1-4 do not all hold, like when we have an omitted variable, the archer is no longer aiming at the center of the target. There are still puffs of air and feather imperfections that send the arrow off course, but the course wasn't even the right one to begin with! The arrow (which you should think of as our β̂) misses the center of the target (which you should think of as our true β) systematically.

To demonstrate this, I did the following:
• Take a random sample of 150 people out of the 500 that are in WAGE1.dta.
• Estimate β̂₁ using OLS, controlling for tenure, with these 150 people.
• Estimate α̂₁ using OLS (NOT controlling for tenure) with these 150 people.
• Repeat 6000 times.

At the end of all of the above, I end up with 6000 biased and 6000 unbiased estimates of β₁. I plotted the kernel density of the biased estimates alongside that of the unbiased estimates. You can see how the biased distribution is shifted to the left, indicating a downward bias!

    [Figure 1. Kernel densities for the biased (alphahat_1) and unbiased (betahat_1) estimates of the effect of female on ln(wage).]
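As promised in Section 4, here is a rough sketch of the Stata commands behind those numbers. It assumes WAGE1.dta is in the working directory; the scalar names (b1, b2, b1_tilde, rho1) are just labels chosen here.

    * Sketch: the regressions behind Section 4 and the bias-formula check
    use WAGE1.dta, clear

    corr lwage female tenure            // sign the bias from the raw correlations

    regress lwage female tenure         // "population" model (1)
    scalar b1 = _b[female]              // beta_1
    scalar b2 = _b[tenure]              // beta_2

    regress lwage female                // short model (2), tenure omitted
    scalar b1_tilde = _b[female]        // beta_1-tilde

    regress tenure female               // auxiliary regression: tenure on female
    scalar rho1 = _b[female]            // rho_1

    display "beta_1 + beta_2 * rho_1 = " b1 + b2*rho1
    display "short-regression slope  = " b1_tilde

The two displayed values should agree up to rounding, which is exactly the check the bias formula asks you to do by hand.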
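The repeated-sampling exercise behind Figure 1 can be set up along the lines below. Again, this is only a sketch: the program name ovbdemo and the seed are arbitrary, and the options used to draw the original figure may have differed.

    * Sketch: repeated subsamples of 150 people, with and without controlling for tenure
    capture program drop ovbdemo
    program define ovbdemo, rclass
        preserve
        sample 150, count               // draw 150 people without replacement
        regress lwage female tenure     // unbiased: controls for tenure
        return scalar betahat_1 = _b[female]
        regress lwage female            // biased: omits tenure
        return scalar alphahat_1 = _b[female]
        restore
    end

    use WAGE1.dta, clear
    simulate betahat_1=r(betahat_1) alphahat_1=r(alphahat_1), ///
        reps(6000) seed(12345): ovbdemo

    * Overlay the two kernel densities, as in Figure 1
    twoway (kdensity betahat_1) (kdensity alphahat_1), ///
        xtitle("effect of female on ln(wage)") ytitle("Density") ///
        legend(order(1 "betahat_1" 2 "alphahat_1"))

With 6000 replications this takes a little while to run, but the two densities should come out looking like Figure 1, with the alphahat_1 curve shifted to the left of the betahat_1 curve.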