Econ 1629: Applied Research Methods
Assignment 4 Suggested Solutions

Section I – Traffic Fatalities

Variable   Description
state      State ID (FIPS) Code
allmort    Number of Vehicle Fatalities per year
allnite    Number of Night-time Vehicle Fatalities per year
pop        Population
yngdrv     Proportion of Drivers Age 15-24
breath     Breath Test Law for Suspected Drunk Drivers [Dummy Variable]
jaild      Mandatory Jail Sentence for Drunk Driving [Dummy Variable]
comserd    Mandatory Community Service for Drunk Driving [Dummy Variable]
beertax    Tax on Case of Beer (in dollars)
mlda       Minimum Legal Drinking Age

(1) Create summary statistics for the above variables, and write one paragraph summarizing the main findings. The goal of this paragraph is to provide the background you would like the reader to have before you present any information about the effect of certain policies in reducing traffic fatalities.

. *Question 1
. sum state allmort allnite pop yngdrv breath jaild comserd beertax mlda

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
       state |        48     30.1875    15.44883           1          56
     allmort |        48      974.75    981.0172         104        5390
     allnite |        48    179.5625    186.9059          18         936
         pop |        48     5074334     5347035    478999.7    2.83e+07
      yngdrv |        48     .161999    .0221605     .073137     .220724
-------------+----------------------------------------------------------
      breath |        48          .5    .5052912           0           1
       jaild |        47    .2978723    .4622673           0           1
     comserd |        47     .212766    .4136881           0           1
     beertax |        48    .4798154     .434836    .0433109    2.194418
        mlda |        48    20.96875    .2165064        19.5          21

Here is an example of a summary paragraph describing the data:

The data for this study provide information about traffic fatality rates and selected policies to address driving under the influence of alcohol in 48 U.S. states for 1988. The study also provides population figures and the percentage of young drivers for each of the participating states.
In 1988, the average number of vehicle-related fatalities per state was 974.8, ranging from a minimum of 104 to a maximum of 5,390. Part of this wide range is due to variation in state populations, which range from fewer than 500,000 residents to over 28 million, as well as variation in the proportion of young drivers (minimum of 7%, maximum of 22%), who may be particularly likely to be involved in traffic fatalities. Policies related to driving under the influence of alcohol varied across states: 50% required a breath test for suspected drunk drivers, 30% had a mandatory jail sentence for convicted drunk drivers, and 21% required community service for convicted drunk drivers. All states have a beer tax, which averaged about 48 cents per case of beer but varied widely across the states, from less than a nickel to over $2.

(2) Create a new variable (called vfr) that indicates the vehicle fatality rate (defined as the number of vehicle fatalities per 100,000 people living in the state). What is the average vehicle fatality rate in this sample?

. *Question 2
. gen vfr=(allmort/pop)*100000
. sum vfr

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
         vfr |        48    20.69594    5.211829     12.3111     32.3591

The average fatality rate, by state, in this sample is 20.7 fatalities per 100,000 people.

(3) Create a new variable (called nightvfr) that indicates the night-time vehicle fatality rate (defined as the number of night-time vehicle fatalities per 100,000 people living in the state). What is the average night-time vehicle fatality rate in this sample?

. *Question 3
. gen nightvfr=(allnite/pop)*100000
. sum nightvfr

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
    nightvfr |        48    3.680209    .9816015     1.71598    6.104852

The average night-time fatality rate, by state, in the sample is 3.7 fatalities per 100,000 people.
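The rate construction in Questions 2 and 3 can be sketched outside Stata. The Python snippet below is only an illustration: the helper name rate_per_100k is an assumption of this sketch, and the two inputs are the sample means of allmort and pop from the Question 1 table. It also shows why the ratio of the average counts (about 19.2) is not the same number as the average of the state rates (20.7): the ratio of averages is a population-weighted average of state rates, while the mean of vfr weights every state equally.

```python
def rate_per_100k(count, population):
    """Return the number of events per 100,000 residents."""
    return count / population * 100_000


# Sample means of allmort and pop, copied from the Question 1 summary table.
mean_allmort = 974.75
mean_pop = 5074334

# Ratio of average counts: a population-weighted average of the state rates.
pooled_rate = rate_per_100k(mean_allmort, mean_pop)

# Roughly 19.2, compared with 20.7 for the unweighted mean of vfr across states.
print(round(pooled_rate, 2))
```

The gap between the two figures suggests that less populous states tend to have somewhat higher fatality rates in this sample, which is worth keeping in mind when describing "the average state".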
(4) Many people believe that younger drivers are more likely to get into accidents, especially when road conditions are more challenging, such as after dark (auto insurers underwrite policies reflecting this assumption). Which states have the lowest and highest proportions of young drivers, those drivers aged 15-24? Where do these states rank in terms of vfr and nightvfr?

Indiana has the lowest percentage of young drivers, at 7.3%, and New Mexico has the highest percentage of young drivers, at 22.1%. New Mexico has the second-highest rate of overall vehicle fatalities and the highest rate of night-time vehicle fatalities. Indiana is in the middle for both of these statistics: 25th highest (or 24th lowest) for overall vehicle fatality rate and 21st highest (or 28th lowest) for night-time vehicle fatality rate.

(5) Many night-time car accidents involve alcohol. We are interested in estimating the relationship between the night-time fatality rate (nightvfr) and whether the state has a breath test law for suspected drunk drivers (breath) using the following PRF:

nightvfr = β0 + β1 breath + u

(a) What sign would you expect β1 to have? Explain.

Most people expect β1 to be negative, since they expect that having a mandatory breath test should discourage drunk driving and should therefore lead to lower traffic fatalities. Other people would argue that states that have high traffic fatality rates may be more likely to establish a breath test law, which would lead to β1 being positive. Both answers are reasonable.

(b) Estimate the above PRF using the data provided. Remember to use the option for robust standard errors. Interpret the coefficient on the variable breath. Is the coefficient on the variable breath statistically significant at the 5% level?

. *Question 5
. regress nightvfr breath, robust

Linear regression                               Number of obs     =         48
                                                F(1, 46)          =       2.46
                                                Prob > F          =     0.1236
                                                R-squared         =     0.0508
                                                Root MSE          =      .9667

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.4377213   .2790617    -1.57   0.124    -.9994434    .1240009
       _cons |    3.89907   .2289213    17.03   0.000     3.438275    4.359865
------------------------------------------------------------------------------

On average, states with breath test laws have a night-time vehicle fatality rate that is 0.44 lower (per 100,000 people) than states that do not have this policy. The t-statistic on the breath coefficient is -1.57, which in absolute value does not exceed the critical value of 1.96 (equivalently, the p-value of 0.124 is above 0.05). Therefore, the coefficient is not statistically significant at the 5% level.

(c) Is the sign of β̂1 the one you expected? Explain.

The answer to this question depends on which of the two explanations was offered in part (a). If a negative coefficient was predicted in part (a), then the sign is as expected. If a positive coefficient was predicted, then we can reconcile our results with reasoning along the lines of the first explanation: breath test laws discourage drunk driving and thus are associated with fewer night-time traffic fatalities (per 100,000 people).

(d) Is β1 likely to represent the causal effect of a breath-test law on the night-time vehicle fatality rate? Explain.

No. We can say that having a breath test law is associated with a lower fatality rate, but we cannot say that we observe a lower vehicle fatality rate in states with breath test laws because of those laws. States that establish breath test laws may be extremely vigilant about drunk driving or want to discourage any tendencies for people to drive while under the influence. They may also be more generally safety-conscious with a population of more cautious drivers.
States without breath test laws may have more lenient attitudes about drinking and driving and more popular opposition to these types of laws, though these same "cultural" issues are likely to contribute to a higher vehicle fatality rate. In other words, states that have these laws may be different from states that do not have these laws in ways that also contribute to vehicle fatality rates.

(e) From a policymaker's perspective, why would we care if β1 represented the causal effect of a breath-test law on the night-time vehicle fatality rate? [1 paragraph]

If this were indeed a causal effect, then establishing a breath test law could be an effective way of reducing the night-time traffic fatality rate (as long as the associated benefits are larger than the costs of implementing such policies). If β1 simply reflected an association that is not causal, then implementing a breath test law on its own would not be likely to reduce the night-time traffic fatality rate. So from a policymaker's perspective, it would be crucial to know whether this association is causal or not.

(6) Other policies that could be used to reduce night-time traffic fatalities, particularly those linked to alcohol use, include: mandatory jail sentences for drunk driving (jaild), mandatory community service for drunk-driving offenders (comserd), and a beer tax (beertax). This question asks you to examine how these policies and the proportion of the state's drivers who are "young" drivers relate to the night-time vehicle fatality rate.

(a) Estimate the following PRF (remember to specify robust standard errors):

nightvfr = β0 + β1 breath + β2 jaild + β3 comserd + β4 beertax + β5 yngdrv + u

. *Question 6
. regress nightvfr breath jaild comserd beertax yngdrv, robust

Linear regression                               Number of obs     =         47
                                                F(5, 41)          =       0.90
                                                Prob > F          =     0.4912
                                                R-squared         =     0.1223
                                                Root MSE          =     .98306

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.2212992   .3078117    -0.72   0.476    -.8429375    .4003391
       jaild |   .6884527   .4825511     1.43   0.161     -.286079    1.662984
     comserd |  -.3982458   .5069437    -0.79   0.437    -1.422039    .6255477
     beertax |   .2700878   .3547532     0.76   0.451    -.4463508    .9865265
      yngdrv |   2.170482   9.186893     0.24   0.814    -16.38282    20.72379
       _cons |   3.196759   1.247599     2.56   0.014     .6771815    5.716337
------------------------------------------------------------------------------

(b) Interpret the coefficient on jaild

States with a mandatory jail sentence for drunk drivers have a night-time vehicle fatality rate that is 0.69 deaths higher (per 100,000 people) than states without this policy, holding breath, comserd, beertax, and yngdrv constant.

(c) Interpret the coefficient on beertax

An additional dollar in the beer tax per case is associated with an increase in the night-time vehicle fatality rate of 0.27 fatalities per 100,000 people, holding breath, jaild, comserd, and yngdrv constant.

(d) Interpret the coefficient on yngdrv

An increase of 1 percentage point in the proportion of a state's drivers who are age 15-24 (a change of 0.01 in yngdrv) is associated with 2.17 × 0.01 ≈ 0.022 additional night-time vehicle fatalities per 100,000 residents, controlling for breath, jaild, comserd, and beertax.

(7) An alternative way to examine the association between yngdrv and nightvfr is to divide states into 3 groups according to the proportion of young drivers.

(a) Create the following 3 variables – lowyng, medyng, and highyng:

gen lowyng=(yngdrv<0.15) if yngdrv!=.
gen medyng=(yngdrv>=0.15 & yngdrv<0.17) if yngdrv!=.
gen highyng=(yngdrv>=0.17) if yngdrv!=.

Be sure to check your results as suggested in the assignment to make sure that each state is coded for only one of the young driver classifications!
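The check recommended in (7)(a) amounts to confirming that the three dummies sum to exactly 1 for every state. The Python sketch below mirrors the three gen commands for illustration only. The function name young_driver_group and the use of None for a missing value are assumptions of this sketch; 0.073137, 0.161999, and 0.220724 are the minimum, mean, and maximum of yngdrv from the summary table, and the remaining test values are hypothetical points placed at and next to the cutoffs.

```python
def young_driver_group(yngdrv):
    """Return (lowyng, medyng, highyng) using the cutoffs 0.15 and 0.17."""
    if yngdrv is None:
        # Mirrors "if yngdrv!=." in Stata: a missing input gives missing dummies.
        return (None, None, None)
    lowyng = int(yngdrv < 0.15)
    medyng = int(0.15 <= yngdrv < 0.17)
    highyng = int(yngdrv >= 0.17)
    return (lowyng, medyng, highyng)


# Minimum, mean, and maximum of yngdrv plus hypothetical boundary values.
shares = [0.073137, 0.149, 0.15, 0.161999, 0.17, 0.220724]

for share in shares:
    dummies = young_driver_group(share)
    # Each state must fall into exactly one category.
    assert sum(dummies) == 1
    print(share, dummies)
```

Testing values exactly at 0.15 and 0.17 is the useful part: those are the points where a mistaken inequality (for example > instead of >=) would leave a state in no group or in two groups.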
(b) Estimate the following PRF (remember to specify robust standard errors):

nightvfr = β0 + β1 breath + β2 jaild + β3 comserd + β4 beertax + β5 lowyng + β6 medyng + u

. regress nightvfr breath jaild comserd beertax lowyng medyng, robust

Linear regression                               Number of obs     =         47
                                                F(6, 40)          =       1.83
                                                Prob > F          =     0.1169
                                                R-squared         =     0.1613
                                                Root MSE          =     .97292

------------------------------------------------------------------------------
             |               Robust
    nightvfr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      breath |  -.1972487   .3184678    -0.62   0.539    -.8408961    .4463987
       jaild |   .7008364    .457141     1.53   0.133    -.2230801    1.624753
     comserd |  -.3977197   .4788624    -0.83   0.411    -1.365537    .5700973
     beertax |    .336772   .2974011     1.13   0.264    -.2642982    .9378421
      lowyng |   .0236589   .4775119     0.05   0.961    -.9414287    .9887465
      medyng |   .4055903    .376392     1.08   0.288    -.3551264    1.166307
       _cons |   3.289217   .5665373     5.81   0.000     2.144203    4.434232
------------------------------------------------------------------------------

(c) Interpret the coefficient on medyng

Relative to states with high proportions of young drivers (17% or more of drivers age 15-24), states with medium proportions of young drivers (at least 15% but less than 17%) experience 0.41 more night-time vehicle fatalities per 100,000 residents, holding breath, jaild, comserd, and beertax constant.

(d) Why did we not include highyng in the PRF above?

Highyng is our "base" or "omitted" category, so the coefficients on medyng and lowyng are relative to that highyng group. [Intuitively, we are not leaving out anything, because the states with highyng are just represented by the combination (medyng=0 and lowyng=0). The states with medyng are represented by (medyng=1 and lowyng=0) and the states with lowyng are represented by (medyng=0 and lowyng=1). So the combination of these two dummy variables accounts for all three of our categories. In fact, if we included highyng in the regression, this would be a violation of assumption 4 (no perfect multicollinearity).
To see that it violates assumption 4, notice that lowyng+medyng+highyng=1 for every state, so the three dummies together are perfectly collinear with the constant term. If you try it in Stata, you will see that Stata does not estimate the regression as specified: it drops one of the collinear variables.]

(8) One variable that might also contribute to vehicle fatalities is the minimum legal drinking age. The dataset does include such a variable (mlda). In practice, however, this variable is of limited use in estimating a regression equation explaining nightvfr. Why is this the case?

. *Question 8
. sum mlda

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+----------------------------------------------------------
        mlda |        48    20.96875    .2165064        19.5          21

. tab mlda

    minimum |
      legal |
   drinking |
        age |      Freq.     Percent        Cum.
------------+-----------------------------------
       19.5 |          1        2.08        2.08
         21 |         47       97.92      100.00
------------+-----------------------------------
      Total |         48      100.00

There is so little variation in the minimum legal drinking age that it is of little value in explaining the variation in the outcome variable, night-time vehicle fatality rates. Only one state does not have a minimum legal drinking age of 21.

Section II – Errors in Variables

(Your numeric results may differ because of different random draws, but results should be qualitatively similar.)

No Errors

The first worksheet shows the true data from a population (X and Y), and the true results of regressing a model of the form: Y = bX + a

(1) Graph the true relationship between X and Y. Make sure your axes are labeled.

(2) Write the true values of a and b (shown on the graph)

a = 2.0297
b = 0.4589

Random Error

Now let's see what happens to results when there is error in the variables. We'll start by exploring the case where the error is totally random (each observation is equally likely to have the same error). For example, if we were measuring people's heights and were equally likely to over- or under-estimate the height of tall and short people by the same amount, the error would be random.
(3) How do you think error will affect estimates? (Be specific.)

Here your initial intuitions regarding the effect of error on the estimates may diverge. It is important to keep in mind that errors in the dependent variable (Y) and in the independent variable (X) will have different kinds of effects on both the level and the variance of the estimates. We explore this in greater detail below.

Error in X variable

              Error                            Estimate
Trial    Mean    Correlation with True X       a         b
1       -0.01        -0.24                   2.1121    0.2722
2       -0.03         0.04                   2.2224    0.0133
3       -0.07         0.03                   2.1306    0.2690
4       -0.01        -0.03                   2.0865    0.3328
5       -0.03        -0.27                   2.1072    0.2980
6        0.00         0.19                   2.0944    0.3096
7       -0.09        -0.13                   2.1629    0.1900
8       -0.03        -0.15                   2.1340    0.2357
Mean    -0.03        -0.07                   2.1313    0.2401

(4) Compare the mean estimates to the true values. Does random error in the X variable seem to affect the estimates of a and b? How?

Keep in mind that your values will differ from those in the table above. In these results, a appears to be slightly overestimated, by about 5%, and b appears to be substantially underestimated, by nearly 50%. Let us focus on the difference between the true parameter b and its estimated value. When N is large, it can be shown that the estimated value for b can be written as follows:

\[ \hat{b} = b \cdot \left( \frac{\sigma_X}{\sigma_X + \sigma_\varepsilon} \right) \]

where σ_X is the variance of X and σ_ε is the variance of the error added to X. Notice that if σ_ε = 0, so that there is no noise, then the fraction to the right of b is equal to one and the estimated value is correct. Whenever σ_ε > 0, so that noise is added to X, the fraction to the right of b is less than one and the estimated value is biased towards zero. This is commonly referred to as "Attenuation Bias", since it attenuates the effect of X on Y.
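The attenuation result can be checked with a small simulation. The Python sketch below is not the spreadsheet exercise itself: the sample size, the seed, the noise variances, and the rounded "true" values a = 2.0 and b = 0.46 are all assumptions chosen for illustration, and the helper ols is written out by hand. It adds random noise to X, re-estimates the slope, and compares it with b multiplied by the ratio of the variance of X to the sum of the variance of X and the variance of the noise.

```python
import random
import statistics


def ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(1629)   # arbitrary seed so the run is reproducible
n = 20000           # large N, since the formula is a large-sample result
a_true, b_true = 2.0, 0.46   # illustrative values close to the true a and b

x = [random.gauss(0, 1) for _ in range(n)]
y = [a_true + b_true * xi + random.gauss(0, 0.5) for xi in x]

# Measurement error that is independent of both X and Y.
noise = [random.gauss(0, 1) for _ in range(n)]
x_observed = [xi + ei for xi, ei in zip(x, noise)]

_, b_clean = ols(x, y)
_, b_noisy = ols(x_observed, y)

# Shrinkage factor: variance of X over (variance of X + variance of the noise).
shrink = statistics.pvariance(x) / (statistics.pvariance(x) + statistics.pvariance(noise))

print("slope with true X:    ", round(b_clean, 3))
print("slope with noisy X:   ", round(b_noisy, 3))
print("predicted attenuation:", round(b_true * shrink, 3))
```

Because the noise variance is set equal to the variance of X here, the predicted slope is about half of the true slope, which is similar in size to the underestimate of nearly 50% in the table for (4).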
Error in Y variable

              Error                            Estimate
Trial    Mean    Correlation with True Y       a         b
1        0.12        -0.20                   2.0473    0.7056
2        0.06         0.02                   2.1719    0.2737
3       -0.04        -0.27                   2.1609    0.0727
4        0.10         0.14                   2.0862    0.5626
5       -0.07         0.02                   1.9487    0.4810
6       -0.03        -0.09                   2.2329   -0.0901
7        0.00        -0.35                   2.1465    0.1852
8       -0.05        -0.02                   1.8430    0.7829
Mean     0.01        -0.09                   2.0797    0.3717

(5) Does random error in the Y variable seem to affect the estimates of a and b? How?

Compared to the solutions in (4), the random error in Y does not have a substantive effect on the levels of our estimates. It does, however, increase the variance of these estimates, as evidenced by the larger spread in recorded values for both a and b in the table above. To understand this better, recall that the standard errors for OLS estimates are directly proportional to the standard deviation of the error term in the PRF. Since the error added to Y in this question is uncorrelated with X, OLS simply treats it as part of the PRF error term, leading to more variance in our estimates and larger standard errors.

Error in X and Y variables

             Estimate
Trial      a         b
1        2.2340   -0.0973
2        2.1282    0.1891
3        2.1792    0.2051
4        2.1957    0.0579
5        2.1279    0.0195
6        2.2621    0.1007
7        2.1192    0.1803
8        2.2627    0.0868
Mean     2.1886    0.0928

(6) Does random error in the X and Y variable seem to affect the estimates of a and b? How?

In these results, we can see the combined effects seen in (4) and (5) above. Not only is b underestimated as in (4), but the variance of both estimates is also high, as in (5).

(7) Over the past 3 tests, the error has been random. What do you notice about the mean error and the correlation of the error with the variable?

The mean error tends to be close to zero, relative to the size of our estimates. This is also true of the mean correlation between the errors and the variables. This is what we would expect, because the errors in the preceding cases are independent of both X and Y.
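The point made in (5), that random error in Y leaves the estimates centered on the truth but makes them more variable, can also be illustrated by repeating the experiment many times. The Python sketch below is an assumption-laden stand-in for the worksheet: the function names, the 400 trials of 50 observations, the seed, and the noise standard deviation of 2.0 are all hypothetical choices, and the true slope is rounded to 0.46.

```python
import random
import statistics


def ols_slope(x, y):
    """Return the least-squares slope from regressing y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def simulate_slopes(y_noise_sd, trials=400, n=50, seed=0):
    """Return (mean, standard deviation) of the slope across repeated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # True model plus its own error term, plus extra measurement error in Y.
        y = [2.0 + 0.46 * xi + rng.gauss(0, 0.5) + rng.gauss(0, y_noise_sd) for xi in x]
        slopes.append(ols_slope(x, y))
    return statistics.fmean(slopes), statistics.stdev(slopes)


clean_mean, clean_sd = simulate_slopes(y_noise_sd=0.0)
noisy_mean, noisy_sd = simulate_slopes(y_noise_sd=2.0)

print("no error in Y:     mean slope", round(clean_mean, 3), "spread", round(clean_sd, 3))
print("random error in Y: mean slope", round(noisy_mean, 3), "spread", round(noisy_sd, 3))
```

In both settings the average slope should sit near 0.46, while the spread across trials is several times larger once noise is added to Y. With only eight trials, as in the worksheet, the mean of b can land noticeably away from the true value for exactly this reason.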
Systematic Error

              Error                            Estimate
Trial    Mean    Correlation with True Y       a         b
1        0.48         0.18                   2.0557    1.5101
2        0.47         0.16                   2.1728    1.2223
3        0.49         0.46                   2.1999    1.1903
4        0.46         0.60                   2.1502    1.2541
5        0.63         0.34                   1.9942    2.0071
6        0.54         0.37                   2.1844    1.3404
7        0.42         0.28                   2.1734    1.1048
8        0.50         0.17                   2.2259    1.1731
Mean     0.50         0.32                   2.1446    1.3503

(8) Does systematic error in the Y variable seem to affect the estimates of a and b?

The estimates of a seem to be slightly over-estimated, as in previous cases. The most noticeable effect of the error in this case, however, is on b: the mean estimate of 1.3503 is roughly three times the true value of 0.4589. Intuitively, this is because OLS is treating the error correlated with Y as an effect of X. As we can see in the table above, the error is positively correlated with the true value of Y (mean correlation of 0.32), and this causes the substantial over-estimation of b that we see in the final column.

Conclusions

(9) In what cases would you expect errors in a variable to affect regression estimates?

Measurement error is inescapable, but when we understand its effects on our estimates, it does not prevent us from making valid inferences. Random errors in Y only have the effect of increasing our standard errors, which reduces the precision of our estimates but does not bias them in any way. Random errors in X bias our slope estimates toward zero, but we can easily incorporate this into our interpretation of our results because we know the direction in which the bias occurs. Systematic measurement errors are the kind that we have to be most concerned about, because they can substantially bias our estimates. Moreover, if we do not know the characteristics of such errors, we will not even be able to say anything about the sign, let alone the size, of the bias that they cause.

Section III – Omitted Variable Bias

As indicated in class, the sign of the bias (positive or negative) of β̂1 when omitting X2 in the estimation of Y = β0 + β1X1 + β2X2 + u can be summarized in the table below.
                    β2>0    β2<0
Corr (X1,X2) >0      +       -
Corr (X1,X2) <0      -       +

Illustrate each of the 4 cases in the table with an example that does not draw on the material in section I of this problem set or specific examples we have discussed in class (in other words, please think of examples other than vehicle fatality rates; reperfusion treatment after a heart attack; training programs, education, and wages; and test scores and class size!). Assume the regression equation you would like to estimate is Y = β0 + β1X1 + β2X2 + u, but for some reason (lack of data or knowledge, for example), you end up omitting X2 from the regression. For each example:

1) Describe the variables Y, X1, and X2.
2) Indicate the sign of the correlation between X1 and X2, and explain why you would expect this sign.
3) Indicate the sign of β2, and explain why you would expect this sign.
4) Indicate the sign of the bias if you were to omit X2 from the regression.
5) Indicate how your estimated β̂1 is likely to change when omitting X2 from the regression, i.e. will it get closer to or further from zero relative to when you estimate the full regression (with both X1 and X2 as explanatory variables)? Explain your reasoning both in technical terms (using your answers to the previous questions) and in terms a policymaker can understand.

Case 1: β2>0, Corr (X1,X2)>0

Suppose we would like to estimate the following equation:

wage = β0 + β1 educ + β2 faminc + u

(1) So:
Y = wage = wage earned (measured in $/hour)
X1 = educ = completed education (measured in years of schooling)
X2 = faminc = family income (measured in thousands of dollars)

(2) Corr (educ, faminc) is probably positive since individuals with high family income are more likely to be able to afford going to college than individuals with low family income.
(3) β2 is probably positive because individuals with high family income may have certain characteristics (such as good family connections) that may allow them to get better-paying jobs more easily than individuals with low family income.

(4) Since β2>0 and Corr (educ, faminc)>0, then bias>0. This means that if we were to estimate:

wage = α0 + α1 educ + v

then the expected value of α̂1 will be higher than β1, i.e. E[α̂1] > β1

(5) Given that β1 is likely to be positive (since individuals with high levels of schooling are more likely to get high wages than individuals with low levels of schooling), by omitting faminc from the regression we are likely to overestimate the effect of education on wages: α̂1 will tend to be further from zero than β1. We would be attributing to education what is really due to family income.

Case 2: β2<0, Corr (X1, X2)>0

Suppose we would like to estimate the following equation:

birthweight = β0 + β1 cigs + β2 alcohol + u

(1) So:
Y = birthweight = weight at birth of a newborn baby (usually thought of as a measure of a baby's health/development)
X1 = cigs = number of cigarettes mother smoked during pregnancy
X2 = alcohol = number of alcoholic drinks mother drank during pregnancy

(2) Corr (cigs, alcohol) is probably positive since mothers who smoked during pregnancy were probably more likely to engage in other behavior that is potentially dangerous to the child, such as drinking.

(3) β2 is probably negative given that doctors advise women not to drink frequently during pregnancy because of the possible harm this may cause to the baby's development.

(4) Since β2<0 and Corr (cigs, alcohol)>0, then bias<0. This means that if we were to estimate:

birthweight = α0 + α1 cigs + v

then the expected value of α̂1 will be lower than β1, i.e.
E[α̂1] < β1

(5) Given that β1 is likely to be negative (since mothers who smoke during pregnancy are more likely to have low birthweight babies than mothers who don't smoke), by omitting alcohol from the regression we are likely to overstate the harmful effect of smoking on birthweight: α̂1 will tend to be more negative, and so further from zero, than β1. We would be attributing to smoking what is really due to alcohol drinking.

Case 3: β2>0, Corr (X1, X2)<0

Suppose we would like to estimate the following equation:

life expectancy = β0 + β1 cigs + β2 exercise + u

(1) So:
Y = life expectancy
X1 = cigs = annual average cigarette consumption
X2 = exercise = annual average hours of exercise

(2) Corr (cigs, exercise) is probably negative since people who exercise are less likely to smoke.

(3) β2 is probably positive given that frequent exercise may increase life expectancy.

(4) Since β2>0 and Corr (cigs, exercise)<0, then bias<0. This means that if we were to estimate:

life expectancy = α0 + α1 cigs + v

then the expected value of α̂1 will be lower than β1, i.e. E[α̂1] < β1

(5) Given that β1 is likely to be negative (since smokers on average live fewer years than nonsmokers), by omitting exercise from the regression we are likely to overstate the harmful effect of smoking on life expectancy: α̂1 will tend to be more negative, and so further from zero, than β1. We would be attributing to smoking what is really due to lack of exercise.

Case 4: β2<0, Corr (X1, X2)<0

Suppose we would like to estimate the following equation:

testscorei = β0 + β1 schoolspendingi + β2 povratei + u

(1) So:
Y = testscorei = average test score of students in school district i
X1 = schoolspendingi = spending of school district i
X2 = povratei = poverty rate in school district i

(2) Corr (schoolspending, povrate) is probably negative since communities that have high poverty rates are likely to have school districts that have low levels of resources.

(3) β2 is probably negative since poor communities are usually associated with low academic performance.

(4) Since β2<0 and Corr (schoolspending, povrate)<0, then bias>0.
This means that if we were to estimate:

testscorei = α0 + α1 schoolspendingi + v

then the expected value of α̂1 will be higher than β1, i.e. E[α̂1] > β1

(5) Given that β1 is likely to be positive (since we would think that higher spending is associated with higher, or at least not lower, test scores), by omitting povrate from the regression we are likely to overestimate the effect of school spending on test scores: α̂1 will tend to be further from zero than β1. We would be attributing to school spending what is really due to better socio-economic conditions (as measured by a lower povrate).
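The sign table at the start of this section can be reproduced with a short simulation. The Python sketch below is only an illustration under stated assumptions: the data are simulated rather than drawn from any of the four examples, the true β1 is fixed at 1.0, β2 is set to plus or minus 1.0, and the helper names, seed, and sample size are hypothetical. For each of the four cases it regresses Y on X1 alone and reports how far the estimated slope lands from the true β1.

```python
import random
import statistics


def slope(x, y):
    """Return the least-squares slope from regressing y on x alone."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def short_regression_bias(beta2, corr_sign, n=20000, seed=4):
    """Estimated slope on X1 minus the true beta1 when X2 is omitted."""
    rng = random.Random(seed)
    beta1 = 1.0  # assumed true effect of X1
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # X2 moves with X1 (corr_sign = +1) or against it (corr_sign = -1).
    x2 = [corr_sign * 0.5 * v + rng.gauss(0, 1) for v in x1]
    y = [beta1 * v1 + beta2 * v2 + rng.gauss(0, 1) for v1, v2 in zip(x1, x2)]
    return slope(x1, y) - beta1


results = {}
for beta2 in (1.0, -1.0):
    for corr_sign in (1, -1):
        bias = short_regression_bias(beta2, corr_sign)
        results[(beta2, corr_sign)] = bias
        sign = "+" if bias > 0 else "-"
        print(f"beta2={beta2:+.0f}  Corr sign={corr_sign:+d}  bias={bias:+.3f}  ({sign})")
```

The printed signs should match the table: positive when β2 and Corr (X1,X2) have the same sign, negative when they differ. In this setup the size of the bias is close to β2 times 0.5, the slope of X2 on X1 built into the simulated data.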