BSC Case Competition

QM222 Class 7 Section D1
1. Finding the right statistics – examples (chapter 5)
2. New Material: Regression coefficient statistics:
standard errors, t-statistics, p-values (chapter 7)
3. Review of inputting data and group breakouts
Reminders:
Are you doing the reading?
Did you participate in your first URO experiment?
QM222 Fall 2016 Section D1
Scheduling reminders:
• Assignment 2 due Wednesday: It’s best to have your
data downloaded into Stata to do this.
• I have office hours today from 2:15-3:30.
• I need to change my Wednesday office hours (normally 11-12:15),
so I am adding Tuesday (tomorrow) 2-3:15 and Thursday 10-12.
• Reminder: No class next Monday Oct. 3
• Did you participate in your first URO experiment?
Examples of Finding the Right Statistic
Always ask yourself “Is this the right measure to
answer my question?”
Example: Unemployment Rate. What Rate Should
you measure?
Anybody know how the unemployment rate is calculated?
Unemployment Rate =
[# surveyed in the Current Population Survey who have no job and looked for work over the past 4 weeks] /
[total # surveyed who are in the labor force, defined as either employed or unemployed]
This graph shows that the unemployment rate rises during recessions (it is
negatively correlated with economic growth). But is the unemployment rate the
best measure of weak employment markets?
http://www.shadowstats.com/alternate_data/unemployment-charts
Red: The official unemployment rate.
Grey: The broader government unemployment rate that
includes short-term discouraged workers (who are no longer
looking for jobs) and those forced to work part-time because
they cannot find full-time employment.
Blue: Further adds long-term discouraged workers.
The ShadowStats Alternate Unemployment Rate for August 2015 is 22.9%.
Often (as with unemployment) the right statistic is
a rate or percentage. But a percentage of what?
• California has more teachers than Arizona. Does that mean
that California citizens care more about education? What is
the right statistic to make this comparison of CA and AZ?
• A friend tells you that the number of people who die in fatal
car accidents while wearing their seatbelts at the time of the
accident is much higher than the number who die with their
seatbelts unbuckled. Therefore (your friend says), you
shouldn’t bother with your seatbelt. How would you
respond?
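The seatbelt claim compares counts when it should compare rates. A minimal Python sketch with made-up numbers (every figure below is hypothetical, chosen only to illustrate the arithmetic) shows how belted deaths can outnumber unbelted deaths even when belted occupants are far less likely to die:

```python
# Hypothetical counts of people involved in serious crashes (NOT real data).
crashes = {
    "belted":   {"people": 90_000, "deaths": 450},
    "unbelted": {"people": 10_000, "deaths": 300},
}

def death_rate(group):
    """Deaths as a fraction of the people in that group (the right baseline)."""
    return group["deaths"] / group["people"]

for name, group in crashes.items():
    print(f"{name:9s} deaths = {group['deaths']:4d}   death rate = {death_rate(group):.1%}")
```

In this made-up example more belted people die simply because far more people wear seatbelts; the death rate, which divides by the size of each group, points the other way.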
Choosing the right baseline
• Verizon ran a large advertising campaign displaying a map of
the United States that was shaded to show the huge fraction
of the country that was covered by Verizon cellular service.
The ads contrasted this coverage to the notably smaller
shaded geographic area covered by the AT&T network. AT&T
then ran a counterattack with advertisements stating “AT&T
covers 97% of Americans.”
• Why is there a discrepancy between these two advertising
campaigns? Which statistic is most important for potential
cell phone customers?
If we look at the undergraduate majors of top 1% earners in the US, Biology
majors account for the highest share at 6.6% of all top 1% earners!
Does that mean you should choose a biology major for the best shot of
reaching the top 1% of earners?
Undergraduate Degree                            Total   Share of All 1 Percenters
Biology                                     1,864,666   6.60%
Economics                                   1,237,863   5.40%
Political Science and Government            1,427,224   4.70%
History                                     1,351,368   3.30%
Finance                                     1,071,812   2.70%
Chemistry                                     780,783   2.40%
Health and Medical Preparatory Programs       142,345   0.90%
Biochemical Sciences                          193,769   0.70%
Zoology                                       159,935   0.60%
International Relations                       146,781   0.50%
Area, Ethnic and Civilization Studies         184,906   0.50%
Art History and Criticism                     137,357   0.40%
Physiology                                     98,181   0.30%
Molecular Biology                              64,951   0.20%
NO! We need to measure the # of 1 percenters relative to the total size of
the undergrad major.
Undergraduate Degree                            Total   Share of All 1 Percenters   % Who Are 1 Percenters
Biology                                     1,864,666   6.60%                        6.70%
Economics                                   1,237,863   5.40%                        8.20%
Political Science and Government            1,427,224   4.70%                        6.20%
History                                     1,351,368   3.30%                        4.70%
Finance                                     1,071,812   2.70%                        4.80%
Chemistry                                     780,783   2.40%                        5.70%
Health and Medical Preparatory Programs       142,345   0.90%                       11.80%
Biochemical Sciences                          193,769   0.70%                        7.20%
Zoology                                       159,935   0.60%                        6.90%
International Relations                       146,781   0.50%                        6.70%
Area, Ethnic and Civilization Studies         184,906   0.50%                        5.20%
Art History and Criticism                     137,357   0.40%                        5.90%
Physiology                                     98,181   0.30%                        6.00%
Molecular Biology                              64,951   0.20%                        5.60%
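The two columns answer different questions, and ranking the majors by each one makes that visible. The following is a minimal Python sketch (Python is used here only for the arithmetic; the tuple layout and variable names are my own, the numbers are copied from the table above):

```python
# (major, total graduates, share of all 1 percenters in %, % of the major who are 1 percenters)
majors = [
    ("Biology", 1_864_666, 6.6, 6.7),
    ("Economics", 1_237_863, 5.4, 8.2),
    ("Political Science and Government", 1_427_224, 4.7, 6.2),
    ("History", 1_351_368, 3.3, 4.7),
    ("Finance", 1_071_812, 2.7, 4.8),
    ("Chemistry", 780_783, 2.4, 5.7),
    ("Health and Medical Preparatory Programs", 142_345, 0.9, 11.8),
    ("Biochemical Sciences", 193_769, 0.7, 7.2),
    ("Zoology", 159_935, 0.6, 6.9),
    ("International Relations", 146_781, 0.5, 6.7),
    ("Area, Ethnic and Civilization Studies", 184_906, 0.5, 5.2),
    ("Art History and Criticism", 137_357, 0.4, 5.9),
    ("Physiology", 98_181, 0.3, 6.0),
    ("Molecular Biology", 64_951, 0.2, 5.6),
]

# "Share of all 1 percenters" is driven largely by how big the major is.
top_by_share = max(majors, key=lambda m: m[2])
# "% who are 1 percenters" divides by the size of the major, so it measures your odds.
top_by_rate = max(majors, key=lambda m: m[3])

print("Largest share of all 1 percenters:", top_by_share[0])
print("Highest % of its graduates in the top 1%:", top_by_rate[0])
```

Biology tops the first ranking mostly because it is the largest major in the table; the second ranking is the one that speaks to an individual student's chances.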
Regression Coefficients’
Statistics
Review slide: Notation for a regression equation
Ŷ = b0 + b1 X    (the regression line)
• Ŷ: predicted Y (Y-hat, Y’)
• b0: intercept
• b1: coefficient, slope
• Y: Y variable, LHS variable, dependent variable
• X: X variable, RHS variable, independent variable, explanatory variable
Review slide: Regression in Stata
regress yvariablename xvariablename
For instance, to run a regression of price on size, type:
regress price size
Table 6.1

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

In this output:
• price (top left of the coefficient table) is the name of the dependent variable.
• size (first column, below price) is the name of the explanatory variable(s).
• The coefficient on size is b1.
• The coefficient on _cons is b0.
We have a total of 1085 observations.
Note: _cons is constant, intercept
price = 12934 + 407.45 size
If size increases by 1 sqft, sales price increases by $407
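To see how the fitted equation is used, here is a minimal Python sketch (Python is used only for the arithmetic; the course itself uses Stata). The coefficients are copied from the regression output; the 1,000 sqft condo and the function name are hypothetical illustrations, not taken from the data.

```python
B0 = 12934.12   # _cons: the intercept from the regression output
B1 = 407.4513   # coefficient on size (dollars per square foot)

def predicted_price(size_sqft):
    """Predicted condo price from the fitted line: price = b0 + b1 * size."""
    return B0 + B1 * size_sqft

size = 1000  # hypothetical condo size in square feet
print(f"Predicted price at {size} sqft: ${predicted_price(size):,.0f}")

# One extra square foot changes the prediction by exactly the slope b1.
print(f"Change for +1 sqft: ${predicted_price(size + 1) - predicted_price(size):,.2f}")
```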
Result of regress with an indicator (dummy)
variable as X.
. regress price Beacon_Street
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

-------------------------------------------------------------------------------
         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
 Beacon_Street |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
         _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
-------------------------------------------------------------------------------
• Write the regression equation: price = 520729 – 46969 Beacon_Street
• What is the predicted price of a condo on Beacon Street? 520729–46969=$473,760
• What is the predicted price of a condo that’s not on Beacon Street? $520,729
• What is the difference in prices between those on Beacon St. and NOT? -$46,969
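The answers to these questions are just the regression equation evaluated at Beacon_Street = 1 and Beacon_Street = 0. A minimal Python sketch of that arithmetic (coefficients copied from the output; the function name is a hypothetical helper):

```python
B0 = 520728.9    # _cons: predicted price when Beacon_Street = 0
B1 = -46969.18   # coefficient on the Beacon_Street indicator

def predicted_price(beacon_street):
    """beacon_street is 1 for a condo on Beacon Street, 0 otherwise."""
    return B0 + B1 * beacon_street

on_beacon = predicted_price(1)
not_on_beacon = predicted_price(0)

print(f"On Beacon Street:     ${on_beacon:,.0f}")
print(f"Not on Beacon Street: ${not_on_beacon:,.0f}")
print(f"Difference:           ${on_beacon - not_on_beacon:,.0f}")
```

With an indicator variable, the intercept is the prediction for the reference category and the coefficient is the difference between the two categories.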
HARD but important question: Answer these same questions if
the indicator variable were “Not_Beacon_Street” instead.
You get the same predictions whichever category
you choose to exclude (i.e. to be the reference
category)
• What is the predicted price of a condo on Beacon Street? $473,760
• What is the predicted price of a condo that’s not on Beacon Street? $520,729
• What is the difference in prices between those on Beacon St. and NOT? -$46,969
. generate Not_Beacon_Street= streetname!="BEACON ST"
. regress price Not_Beacon_Street
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

--------------------------------------------------------------------------------
          price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
 Not_Beacon_S~t |   46969.18   25798.41     1.82   0.069    -3651.345    97589.71
          _cons |   473759.7   24380.35    19.43   0.000     425921.6    521597.8
--------------------------------------------------------------------------------

Price = 473,760 + 46,969 * Not_Beacon_Street
• What is the coefficient on Not_Beacon_Street? +$46,969
• What is the intercept? $473,760. The intercept is the predicted price on Beacon
Street (i.e. what you get when Not_Beacon_Street = 0).
• Regression equation: Price = 473,760 + 46,969 * Not_Beacon_Street
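As a quick check that the choice of reference category does not change the predictions, this Python sketch evaluates both regression equations for the same condo. The coefficients are copied from the two Stata outputs; the function names are hypothetical. Differences of a few cents come only from rounding in the printed output.

```python
def price_using_beacon_dummy(on_beacon):
    """Equation from `regress price Beacon_Street` (reference: not on Beacon)."""
    beacon_street = 1 if on_beacon else 0
    return 520728.9 - 46969.18 * beacon_street

def price_using_not_beacon_dummy(on_beacon):
    """Equation from `regress price Not_Beacon_Street` (reference: on Beacon)."""
    not_beacon_street = 0 if on_beacon else 1
    return 473759.7 + 46969.18 * not_beacon_street

for on_beacon in (True, False):
    a = price_using_beacon_dummy(on_beacon)
    b = price_using_not_beacon_dummy(on_beacon)
    label = "on Beacon St." if on_beacon else "not on Beacon St."
    print(f"{label:18s} {a:12,.2f} {b:12,.2f}")
```

Only the intercept and the sign of the coefficient change; the standard error, t-statistic magnitude, and p-value of the indicator are identical in the two outputs.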
How certain are we that the coefficients
we measured are accurate
in light of the fact that we have limited
numbers of observations?
Let’s remember means and standard deviations
with normally distributed variables
• Approximately 68% (or around 2/3rds) of a variable’s
values are within one standard deviation of the mean.
• We call this the 68% confidence interval (CI),
because 68% of the time, the value falls in this range.
• Approximately 95% of the values are within two standard
deviations of the mean.
• We call this the 95% confidence interval.
• 2.0 is just 1.96 rounded. Use either!
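The 68% and 95% figures can be checked directly from the normal distribution. A minimal Python sketch using only the standard library (the function name is a hypothetical helper):

```python
import math

def coverage(k):
    """Probability that a normally distributed value falls within k
    standard deviations of its mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2):
    print(f"within {k} standard deviations: {coverage(k):.1%}")
```

Two standard deviations cover slightly more than 95%, while 1.96 gives 95% almost exactly, which is why either multiplier is acceptable in practice.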
Central Limit Theorem (QM221)
• The Central Limit Theorem tells us that if you took many
samples from a population, the sample means are (for reasonably
large samples) distributed approximately according to a normal
distribution curve, whatever the shape of the population
• The average of the sample means (across many samples)
is the same as the population mean (μ)
• The standard deviation of the sample means (across many
samples) is the standard error (se)
[Figure: normal curve of the sample means, centered at μ, with the horizontal axis marked at -3SE, -2SE, -1SE, μ, +1SE, +2SE, +3SE]
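A small simulation can make the Central Limit Theorem concrete. This Python sketch is an illustration under stated assumptions, not part of the course's Stata workflow: the population is exponential with mean 1 (deliberately not normal), the sample size of 50 and the 5,000 repeated samples are arbitrary choices, and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

random.seed(222)      # fixed seed so the run is repeatable

SAMPLE_SIZE = 50      # observations per sample (arbitrary choice)
NUM_SAMPLES = 5000    # number of repeated samples (arbitrary choice)

def one_sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

sample_means = [one_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)   # should be close to the population mean
sd_of_means = statistics.stdev(sample_means)     # this is the standard error
theoretical_se = 1.0 / SAMPLE_SIZE ** 0.5        # population sd / sqrt(n)

print(f"average of the sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of the sample means:      {sd_of_means:.3f}")
print(f"population sd / sqrt(n):     {theoretical_se:.3f}")
```

The standard deviation of the simulated sample means should land close to the population standard deviation divided by √n, which is the standard error of the mean.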
Standard errors more generally
• Sample means have a standard error that tells you how much the
means would vary if you had lots of different samples.
• Any statistic estimated on a sample has a standard error that tells
you how much that statistic would vary if you had lots of different
samples.
• Regression coefficients also have standard errors.
• We are (approximately) 68% certain that the true regression
coefficient (if estimated on the entire population) will be within
one standard error of the estimated coefficient.
• We are (approximately) 95% certain that the true regression
coefficient (if estimated on the entire population) will be within
two standard errors of the estimated coefficient.
Standard errors of coefficients
price = 12934 + 407.45 size
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------
• Next to each coefficient is a standard error.
• We are approximately 68% certain that the true coefficient (with an infinitely
large sample) is within one standard error of this coefficient.
407.45 +/- 7.167
• We are approximately 95% certain that the true coefficient (with an infinitely
large sample) is within two standard errors of this coefficient.
407.45 +/- 2 * 7.167
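The two ranges can be worked out by hand or with a few lines of code. A minimal Python sketch of the arithmetic for the coefficient on size (values copied from the regression output; the variable names are my own):

```python
b = 407.4513    # estimated coefficient on size
se = 7.166659   # its standard error

ci68 = (b - se, b + se)            # one standard error on each side
ci95 = (b - 2 * se, b + 2 * se)    # two standard errors on each side

print(f"approx. 68% interval: {ci68[0]:.2f} to {ci68[1]:.2f}")
print(f"approx. 95% interval: {ci95[0]:.2f} to {ci95[1]:.2f}")
```

The 95% interval computed this way is slightly wider than the one Stata prints (393.3892 to 421.5134) because Stata uses a multiplier of about 1.96 rather than the rounded 2.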
The regression results even give you the 95%
confidence interval for each coefficient
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------
The 95% confidence interval for each coefficient is given on the right of that
coefficient’s line.
One way you can use this 95% confidence interval
of the coefficient
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------
If the 95% confidence interval of a coefficient does not include zero,
we are at least 95% confident that the coefficient is NOT zero,
so we are at least 95% certain of the sign of that coefficient
and we are at least 95% certain that size affects price.
For instance, we are more than 95% confident that the variable size has a
positive “impact” on the condo price
(if we don’t worry about confounding factors).
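The "does the interval include zero?" check is mechanical, so it is easy to sketch in code. The Python snippet below applies it to the 95% confidence intervals printed in the Stata outputs in these slides; the dictionary and function names are hypothetical.

```python
# 95% confidence intervals copied from the regression outputs in these slides.
intervals = {
    "size":          (393.3892, 421.5134),
    "_cons":         (-6110.006, 31978.25),
    "Beacon_Street": (-97589.71, 3651.345),
}

def excludes_zero(low, high):
    """True when both ends of the interval have the same sign."""
    return low > 0 or high < 0

for name, (low, high) in intervals.items():
    verdict = "does NOT include zero" if excludes_zero(low, high) else "includes zero"
    print(f"{name:14s} [{low:>12,.3f}, {high:>12,.3f}]  {verdict}")
```

Only the interval for size excludes zero, so size is the only one of these coefficients whose sign we can be at least 95% confident about.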
t-statistics
In QM221, you tested hypotheses regarding the mean with
the t-test calculation.
t = (X̄ - hypothesized value of the mean) / (standard error of the mean)
where X̄ is the sample mean and the standard error of the mean is s/√n.
When the absolute value of the t-statistic is greater than 2.0
(or 1.96), we reject the hypothesis at the 5% level (2-tailed test).
t-statistics of coefficients
Similarly, we might have hypotheses about a regression
coefficient. If you had a hypothesis that a regression
coefficient was truly β, you’d make this t-test:
t = (b - hypothesized true β) / (standard error of b)
where b is the estimated coefficient.
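As a worked example of this formula, the Python sketch below tests a made-up hypothesis: that the true coefficient on size is $400 per square foot. The value 400 is purely hypothetical (it does not come from the slides); b and its standard error are copied from the regression of price on size.

```python
b = 407.4513                 # estimated coefficient on size
se = 7.166659                # standard error of that coefficient
hypothesized_beta = 400.0    # hypothetical value chosen for illustration

t = (b - hypothesized_beta) / se
print(f"t = {t:.2f}")

if abs(t) > 2:
    print("Reject the hypothesis that the true coefficient is 400.")
else:
    print("Cannot reject the hypothesis that the true coefficient is 400.")
```

Here |t| is about 1.04, so the estimate is only about one standard error away from 400 and the hypothesis cannot be rejected.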
What hypothesis is the t-statistic of the
coefficient in the regression output about?
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------
The t-stat next to the coefficient in the regression output
tests the hypothesis that:
H0: β = 0
i.e. that the true coefficient is actually zero:
t-statistic = (b - 0) / s.e.
Note that this simplifies to:
t-statistic = b / s.e.
If | t | > 2 , we are more than 95% certain that the true
coefficient is NOT zero.
If | t | < 2 , we are NOT more than 95% certain that the true
coefficient is NOT zero.
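You can reproduce the t column of the Stata output by dividing each coefficient by its standard error. A minimal Python sketch (coefficients and standard errors copied from the outputs in these slides; the dictionary and variable names are my own):

```python
# (coefficient, standard error) pairs from the regression outputs.
estimates = {
    "size":          (407.4513, 7.166659),
    "_cons":         (12934.12, 9705.712),
    "Beacon_Street": (-46969.18, 25798.41),
}

t_stats = {name: b / se for name, (b, se) in estimates.items()}

for name, t in t_stats.items():
    verdict = "|t| > 2" if abs(t) > 2 else "|t| < 2"
    print(f"{name:14s} t = {t:7.2f}   {verdict}")
```

The results match the t column Stata reports: 56.85 for size, 1.33 for _cons in the price-on-size regression, and -1.82 for Beacon_Street.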
The t-statistic on size is > 2 (in absolute value).
We are more than 95% certain that this coefficient is not zero.
In other words, we are more than 95% certain that size has
a positive “impact” on the condo price.
More
• If I know with at least 95% certainty that a coefficient is NOT
zero, I say it is statistically significant.
p-values: The p–value tells us exactly how probable it is
that the coefficient is 0 or of the opposite sign.
      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F( 1, 1083)   = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------
This p-value says that it is less than .0005 (or .05%) likely that the
coefficient on size is 0 or negative. (If it were any higher, it would be rounded to .001.)
I am more than 100% - .05% = 99.95% certain that the coefficient is not zero.
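The P>|t| column follows from the t-statistic. The Python sketch below approximates it with the normal distribution, which is an assumption on my part: Stata uses the t distribution with 1083 degrees of freedom, but with this many observations the two are nearly identical. The function name is hypothetical.

```python
import math

def two_sided_p_normal(t):
    """Approximate two-sided p-value for a t-statistic, using the normal
    distribution in place of the t distribution (fine for large samples)."""
    return math.erfc(abs(t) / math.sqrt(2))

p_size = two_sided_p_normal(407.4513 / 7.166659)       # coefficient on size
p_cons = two_sided_p_normal(12934.12 / 9705.712)       # intercept
p_beacon = two_sided_p_normal(-46969.18 / 25798.41)    # Beacon_Street indicator

print(f"size:          p = {p_size:.3f}")
print(f"_cons:         p = {p_cons:.3f}")
print(f"Beacon_Street: p = {p_beacon:.3f}")
```

These approximations agree with the 0.000, 0.183, and 0.069 that Stata prints to three decimals.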
Inputting data in Stata review:
Make groups as follows
• Who is using Compustat/CRSP/WRDS?
• Who is using NHANES?
• Who will convert (or has converted) Excel data to Stata?
• Who will convert (or has converted) .csv data to Stata?
• Who has a different kind of data file?
• Who downloaded Stata data (other than NHANES)?
Inputting data in Stata: review
Who will convert (or has converted) Excel data to Stata?
• File→Import→Excel spreadsheet: opens an Excel data file.
• If importing an Excel file doesn’t work, save the Excel file first as
csv.
Who will convert (or has converted) .csv data to Stata?
• File→Import→Text file (delimited, *.csv): opens a text file with
columns divided by commas.
Who downloaded Stata data?
• Did it have a “Stata” or “do-file” program also? Download that as
well and talk to me/TA.
EVERYONE: Download the codebook or data dictionary or definition
of variables as well.
Other commands at this point:
• count (with “if” statements)
• cd drive:\foldername (e.g. cd C:\QM222)
• File→Save or File→Save as
• exit, clear