Problem Set 2: Suggested Solution

Objectives
• To be able to run regressions using Stata or R.
• To be able to interpret regression results.

Earnings and Height

It has long been recognized that taller adults hold jobs of higher status and, on average, earn more than other workers. This is a simple fact. Economists Anne Case (Princeton University) and Christina Paxson (Brown University), however, provide a causal story for this fact. If you’re interested, you can find it in their paper, Case and Paxson (2008).

In this problem set, you will investigate the relationship between earnings and height. In particular, we will use their data and apply regression techniques to (1) get the simple fact right, and (2) consider a simple matching strategy.

To get started, download the dataset earnings_height.dta from the course website http://econ300.com and load it into R, Stata, or other software of your choice. Variable and sample descriptions are also available on the course website.

1. What is the mean value of height in the sample?

    library(rio)
    height = import("http://econ300.com/earnings_height.dta")
    mean(height$height, na.rm = TRUE)
    [1] 66.96335

2. Naive mean comparison:
    1. Estimate average earnings for workers whose height is at most 67 inches.
    2. Estimate average earnings for workers whose height is greater than 67 inches.
    3. On average, do taller workers earn more than shorter workers? How much more? Perform a t-test to show whether the difference in sample averages is statistically significantly different from 0.

    library(dplyr)
    height %>%
      group_by(height > 67) %>%
      summarise("Average Earning" = mean(earnings, na.rm = TRUE))
    Source: local data frame [2 x 2]

      height > 67 Average Earning
    1       FALSE        44488.44
    2        TRUE        49987.88

    ## On average, taller workers earn more than shorter workers.
    ## Based on the sample, the difference is 49987.88 - 44488.44.
    49987.88 - 44488.44
    [1] 5499.44
    ## To perform a t-test, we first generate a new variable `tall`.
    height = mutate(height, tall = ifelse(height > 67, 1, 0))
    ## We perform a t-test that allows the two groups to have unequal variances.
    with(height, t.test(earnings ~ tall, var.equal = FALSE))

        Welch Two Sample t-test

    data:  earnings by tall
    t = -13.5898, df = 16624.47, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -6292.643 -4706.237
    sample estimates:
    mean in group 0 mean in group 1
           44488.44        49987.88

    ## Since |t| = 13.59 > 2, the difference is statistically significant at the 5% level.

3. Construct a scatter plot of annual earnings (earnings) on height (height). Notice that the points on the plot fall along horizontal lines. (There are only 23 distinct values of earnings.)

    library(ggplot2)
    ggplot(height, aes(x = height, y = earnings)) + geom_point()

    [Figure: scatter plot of earnings (vertical axis) against height (horizontal axis)]

4. Run a regression of earnings on height.
    1. Display regression results in a function or a table.
    2. What is the estimated slope?
    3. Interpret the meaning of the estimated slope in words.
    4. Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker who is 70 inches tall, and for a worker who is 65 inches tall.

    # 0. Run a regression
    reg <- lm(earnings ~ height, data = height)
    # Display the results in a table
    library(broom)
    tidy(reg)
             term  estimate  std.error  statistic      p.value
    1 (Intercept) -512.7336 3386.85615 -0.1513892 8.796704e-01
    2      height  707.6716   50.48922 14.0162889 2.129867e-44

    # The estimated slope is 707.6716.
    # Interpretation: On average, a one-inch increase in height is associated with
    # about a $700 increase in annual earnings.
    # We can use the intercept and slope to predict the three typical workers' earnings:
    b = 707.6716
    a = -512.7336
    a + b * 67
    [1] 46901.26
    a + b * 70
    [1] 49024.28
    a + b * 65
    [1] 45485.92

5. Suppose you want to do a simple matching based on people’s race. Describe the steps you would take to get the final matching estimate of the effect of being tall. (We cover this in Lecture 5.)
• To match on race, compare the earnings of tall and short workers within each race, and then average the within-race differences.

6. Use multivariate regression as an automated matchmaker to find the estimate you wanted in question 5. That is, run a regression of earnings on height and race.
    1. Report the regression results.
    2. How is it different from the regression of earnings on height?

    reg2 <- lm(earnings ~ height + factor(race), data = height)
    tidy(reg2)
               term    estimate  std.error  statistic      p.value
    1   (Intercept)   9691.0749 3394.34198   2.855067 4.307726e-03
    2        height    590.7414   50.35766  11.730915 1.156363e-31
    3 factor(race)2 -12919.1002  627.58523 -20.585412 4.453424e-93
    4 factor(race)3 -10816.3441  757.70029 -14.275228 5.590604e-46
    5 factor(race)4  -2227.9660 1010.77244  -2.204221 2.752150e-02

• The new coefficient on height is 590.7, which is lower than in the first regression (707.7), but it still indicates a sizable association between height and earnings. Even so, selection bias may remain, because tall and short people can differ in many ways even within a race; our simple matching strategy does not reveal the underlying causal story.

Reference

Case, Anne, and Christina Paxson. 2008. “Stature and Status: Height, Ability, and Labor Market Outcomes.” Journal of Political Economy 116 (3).
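The matching steps described in question 5 can be sketched in R as follows. This is only a minimal sketch: the data frame `toy` and its values are hypothetical (made up for illustration, not the course dataset), and weighting each race by its share of the sample is an assumption, since the answer above only says to average the differences.

```r
# Hypothetical toy data (NOT the course dataset): a race code,
# a tall indicator (1 = tall, 0 = short), and annual earnings.
toy <- data.frame(
  race     = c(1, 1, 1, 1, 2, 2, 2, 2),
  tall     = c(0, 0, 1, 1, 0, 1, 1, 1),
  earnings = c(40000, 44000, 50000, 52000, 30000, 36000, 38000, 40000)
)

# Step 1: within each race, mean earnings of tall workers
# minus mean earnings of short workers.
gap_by_race <- sapply(split(toy, toy$race), function(d) {
  mean(d$earnings[d$tall == 1]) - mean(d$earnings[d$tall == 0])
})

# Step 2: average the within-race gaps. Weighting by each race's
# sample share is an assumption; an unweighted mean is another option.
weights <- table(toy$race) / nrow(toy)
matching_estimate <- sum(gap_by_race * as.numeric(weights[names(gap_by_race)]))

print(gap_by_race)
print(matching_estimate)
```

On the course data, the same two steps would use the `tall` indicator and the `race` variable in the `height` data frame. Note that the regression in question 6 uses height in inches rather than the `tall` indicator, so its coefficient is measured per inch and is not directly comparable to a tall-versus-short gap.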