Problem Set 2: Suggested Solution

Objectives
• To be able to run regressions using Stata or R.
• To be able to interpret regression results.
Earnings and Height
It has long been recognized that taller adults hold jobs of higher status and, on average, earn more than
other workers. This is a simple fact. Economists Anne Case (Princeton University) and Christina Paxson
(Brown University), however, propose a causal explanation for it. If you are interested in their argument,
see their paper, Case and Paxson (2008).
In this problem set, you will investigate the relationship between earnings and height. In particular, we will
use their data and apply regression techniques to (1) get the simple fact right, and (2) consider a simple
matching strategy.
To get started, download the dataset earnings_height.dta from the course website http://econ300.com
and load it into R, Stata, or other software of your choice. Variable definitions and a description of the
sample are also available on the course website.
1. What is the mean value of height in the sample?
library(rio)
height = import("http://econ300.com/earnings_height.dta")
mean(height$height, na.rm = TRUE)
[1] 66.96335
2. Naive mean comparison:
1. Estimate average earnings for workers whose height is at most 67 inches.
2. Estimate average earnings for workers whose height is greater than 67 inches.
3. On average, do taller workers earn more than shorter workers? How much more? Perform a t-test
to show whether the difference in sample averages is statistically significantly different from 0.
library(dplyr)
height %>%
  group_by(height > 67) %>%
  summarise("Average Earning" = mean(earnings, na.rm = TRUE))
Source: local data frame [2 x 2]

  height > 67 Average Earning
1       FALSE        44488.44
2        TRUE        49987.88
## On average, taller workers earn more than shorter workers.
## Based on the sample, the difference is 49987.88 - 44488.44.
49987.88 - 44488.44
[1] 5499.44
## To perform a t-test, we first generate a new variable `tall`.
height = mutate(height, tall = ifelse(height > 67, 1, 0))
## We can perform a t-test with group variance being unequal.
with(height, t.test(earnings ~ tall, var.equal = FALSE))
Welch Two Sample t-test
data: earnings by tall
t = -13.5898, df = 16624.47, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6292.643 -4706.237
sample estimates:
mean in group 0 mean in group 1
       44488.44        49987.88
## Since |t| = 13.59 is far above the 5% critical value of about 1.96 (p-value < 2.2e-16),
## the difference is statistically significant.
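The Welch statistic reported above is the difference in group means divided by its standard error, where each group's variance is estimated separately. With the reported numbers, the implied standard error is about 5499.44 / 13.5898 ≈ 404.7. The following is a minimal sketch of that calculation on simulated data; the vectors `short` and `tall` and their parameters are hypothetical and are not taken from the course dataset.

```r
# Simulated earnings for two height groups (hypothetical values).
set.seed(1)
short <- rnorm(200, mean = 44000, sd = 26000)  # plays the role of tall == 0
tall  <- rnorm(150, mean = 50000, sd = 27000)  # plays the role of tall == 1

# Standard error of the difference in means with unequal group variances.
se_diff <- sqrt(var(short) / length(short) + var(tall) / length(tall))

# Welch t-statistic: group 0 mean minus group 1 mean, over its standard error.
t_manual <- (mean(short) - mean(tall)) / se_diff

# The same statistic from t.test() with var.equal = FALSE.
t_builtin <- unname(t.test(short, tall, var.equal = FALSE)$statistic)

all.equal(t_manual, t_builtin)
```

The sign of the statistic depends only on the order of the groups, which is why the solution above reports a negative t for group 0 minus group 1.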
3. Construct a scatter plot of annual earnings (earnings) on height (height). Notice that the points on
the plot fall along horizontal lines. (There are only 23 distinct values of earnings.)
library(ggplot2)
ggplot(height, aes(x = height, y = earnings)) +
  geom_point()
[Figure: scatter plot of earnings (vertical axis, ticks from 20000 to 80000) against height (horizontal axis, ticks from 50 to 80).]
4. Run a regression of earnings on height.
1. Display regression results in a function or a table.
2. What is the estimated slope?
3. Interpret the meaning of the estimated slope in words.
4. Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker
who is 70 inches tall, and for a worker who is 65 inches tall.
# 0. Run a regression
lm(earnings ~ height, data = height) -> reg
# display in a table
library(broom)
tidy(reg)
         term  estimate  std.error  statistic      p.value
1 (Intercept) -512.7336 3386.85615 -0.1513892 8.796704e-01
2      height  707.6716   50.48922 14.0162889 2.129867e-44
# The estimated slope is 707.6716
# Interpretation: On average, a one-inch increase in height is associated with about a $700 increase in annual earnings.
# We can use the intercept and slope to predict the three typical workers' earnings:
b = 707.6716
a = -512.7336
a + b * 67
[1] 46901.26
a + b * 70
[1] 49024.28
a + b * 65
[1] 45485.92
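The three predictions can also be computed in a single vectorized step. The sketch below reuses the intercept and slope reported by tidy(reg); the names `heights` and `predicted` are introduced here only for illustration.

```r
# Coefficients as reported in the regression table above.
a <- -512.7336   # intercept
b <- 707.6716    # slope on height

# One prediction per height, in the order asked: 67, 70, 65 inches.
heights <- c(67, 70, 65)
predicted <- a + b * heights
round(predicted, 2)
# [1] 46901.26 49024.28 45485.92

# The gap between two predictions depends only on the slope:
# moving from 67 to 70 inches adds 3 * b.
predicted[2] - predicted[1]

# With the fitted model `reg` in the workspace, the same predictions should
# come from: predict(reg, newdata = data.frame(height = heights))
```

Using predict() on the fitted model avoids copying rounded coefficients by hand, which matters more once the regression has several terms.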
5. Suppose you want to do a simple matching based on people’s race. Describe the steps you would take
to get the final matching estimate of the effect of being tall. (We cover this in Lecture 5.)
• To match on race, compare the average earnings of tall workers and short workers within
each race, and then average those within-race differences across races.
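The two matching steps can be sketched in base R as follows. The data frame `toy` and all of its values are hypothetical and serve only to illustrate the calculation. Weighting each race by its share of the sample is one possible choice, assumed here, and other weights are defensible.

```r
# Toy data (hypothetical values, not the course dataset).
toy <- data.frame(
  race     = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  tall     = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1),
  earnings = c(40, 44, 50, 54, 30, 32, 34, 36, 38, 40) * 1000
)

# Step 1: within each race, mean earnings of tall workers minus
# mean earnings of short workers.
cells <- split(toy, toy$race)
diff_by_race <- sapply(cells, function(d) {
  mean(d$earnings[d$tall == 1]) - mean(d$earnings[d$tall == 0])
})

# Step 2: average the within-race differences, weighting each race
# by its share of the sample (an assumption of this sketch).
weights <- sapply(cells, nrow) / nrow(toy)
matching_estimate <- sum(weights * diff_by_race)

diff_by_race        # 10000 for race 1, 6000 for race 2
matching_estimate   # 0.4 * 10000 + 0.6 * 6000 = 7600
```

In this toy example the raw difference in means between tall and short workers would mix the within-race gaps with differences in racial composition across the two height groups, which is what matching is meant to remove.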
6. Use multivariate regression as an automated matchmaker to find the estimate you want to get in
question 5. That is, run a regression of earnings on height and race.
1. Report the regression results.
2. How is it different from the regression of earnings on height?
lm(earnings ~ height + factor(race), data = height) -> reg2
tidy(reg2)
           term    estimate  std.error  statistic      p.value
1   (Intercept)   9691.0749 3394.34198   2.855067 4.307726e-03
2        height    590.7414   50.35766  11.730915 1.156363e-31
3 factor(race)2 -12919.1002  627.58523 -20.585412 4.453424e-93
4 factor(race)3 -10816.3441  757.70029 -14.275228 5.590604e-46
5 factor(race)4  -2227.9660 1010.77244  -2.204221 2.752150e-02
• The new coefficient on height is 590.7, which is lower than the 707.7 from the first
regression, but it still indicates a sizable association between height and earnings.
There may still be selection bias, because even within a race, tall people and short people
can be very different; our simple matching strategy does not reveal the underlying causal story.
Reference
Case, Anne, and Christina Paxson. 2008. “Stature and Status: Height, Ability, and Labor Market Outcomes.”
Journal of Political Economy 116 (3).