QM222 Class 7 Section D1

1. Finding the right statistics – examples (chapter 5)
2. New Material: Regression coefficient statistics: standard errors, t-statistics, p-values (chapter 7)
3. Review of inputting data and group breakouts

Reminders: Are you doing the reading? Did you participate in your first URO experiment?

QM222 Fall 2016 Section D1

Scheduling reminders:
• Assignment 2 due Wednesday: It's best to have your data downloaded into Stata to do this.
• I have office hours today from 2:15-3:30.
• I need to change my Wednesday office hours (normally 11-12:15). So let me add: Tuesday (tomorrow) 2-3:15. Also Thursday 10-12.
• Reminder: No class next Monday Oct. 3
• Did you participate in your first URO experiment?

Examples of Finding the Right Statistic

Always ask yourself "Is this the right measure to answer my question?"

Example: Unemployment Rate. What Rate Should you measure?

Anybody know how the unemployment rate is calculated?

Example: Unemployment Rate. What Rate Should you measure?

Unemployment Rate = [# in Current Population Survey who are not working and looked for work over the past 4 weeks] / [total # surveyed who are in the labor force, defined as either employed or unemployed]

This graph shows that the unemployment rate rises in recessions (it is negatively correlated with economic growth). But is the unemployment rate the best measure of weak employment markets?

Chart source: http://www.shadowstats.com/alternate_data/unemployment-charts
• Red: The official unemployment rate.
• Grey: The broader government unemployment rate that includes short-term discouraged workers (who are no longer looking for jobs) and those forced to work part-time because they cannot find full-time employment.
• Blue: Further adds long-term discouraged workers.

The ShadowStats Alternate Unemployment Rate for August 2015 is 22.9%.
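How much the choice of numerator and denominator changes the rate can be sketched with a little Stata arithmetic. The head counts below are made-up illustrative numbers, not survey data, the scalar names are hypothetical, and the "broader" rate is only a simplified stand-in for the grey line's definition.

```stata
* Hypothetical head counts (illustrative only, NOT real survey data)
* employed includes people working part-time involuntarily
scalar employed       = 9000
* no job, looked for work in the past 4 weeks
scalar unemployed     = 500
* want a job but have stopped looking
scalar discouraged    = 200
* part-time only because no full-time job was found
scalar invol_parttime = 300

* Official-style rate: unemployed / labor force (employed + unemployed)
scalar official = unemployed / (employed + unemployed)

* Broader rate (simplified assumption): add discouraged workers and
* involuntary part-timers to the numerator, and discouraged workers
* to the denominator
scalar broader = (unemployed + discouraged + invol_parttime) / (employed + unemployed + discouraged)

display "Official-style rate: " %5.1f 100*official "%"
display "Broader rate:        " %5.1f 100*broader "%"
```

With these invented counts the same labor market shows a rate of about 5.3% under the official-style definition and about 10.3% under the broader one, which is why the definition has to match the question being asked.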
Often (like with unemployment) the right statistic is a rate or percentage: But percentage of what?
• California has more teachers than Arizona. Does that mean that California citizens care more about education? What is the right statistic to make this comparison of CA and AZ?
• A friend tells you that the number of people who die in fatal car accidents while wearing their seatbelts at the time of the accident is much higher than the number who die with their seatbelts unbuckled. Therefore (your friend says), you shouldn't bother with your seatbelt. How would you respond?

Choosing the right baseline
• Verizon ran a large advertising campaign displaying a map of the United States that was shaded to show the huge fraction of the country that was covered by Verizon cellular service. The ads contrasted this coverage to the notably smaller shaded geographic area covered by the AT&T network. AT&T then ran a counterattack with advertisements stating "AT&T covers 97% of Americans."
• Why is there a discrepancy between these two advertising campaigns? Which statistic is most important for potential cell phone customers?

If we look at the undergraduate majors of top 1% earners in the US, Biology majors account for the highest share at 6.6% of all top 1% earners! Does that mean you should choose a biology major for the best shot of reaching the top 1% of earners?
Undergraduate Degree                          Total        Share of All 1 Percenters
Biology                                       1,864,666    6.60%
Economics                                     1,237,863    5.40%
Political Science and Government              1,427,224    4.70%
History                                       1,351,368    3.30%
Finance                                       1,071,812    2.70%
Chemistry                                       780,783    2.40%
Health and Medical Preparatory Programs         142,345    0.90%
Biochemical Sciences                            193,769    0.70%
Zoology                                         159,935    0.60%
International Relations                         146,781    0.50%
Area, Ethnic and Civilization Studies           184,906    0.50%
Art History and Criticism                       137,357    0.40%
Physiology                                       98,181    0.30%
Molecular Biology                                64,951    0.20%

NO! We need to measure the # of 1 percenters relative to the total size of the undergrad major.

Undergraduate Degree                          Total        Share of All 1 Percenters    % Who Are 1 Percenters
Biology                                       1,864,666    6.60%                         6.70%
Economics                                     1,237,863    5.40%                         8.20%
Political Science and Government              1,427,224    4.70%                         6.20%
History                                       1,351,368    3.30%                         4.70%
Finance                                       1,071,812    2.70%                         4.80%
Chemistry                                       780,783    2.40%                         5.70%
Health and Medical Preparatory Programs         142,345    0.90%                        11.80%
Biochemical Sciences                            193,769    0.70%                         7.20%
Zoology                                         159,935    0.60%                         6.90%
International Relations                         146,781    0.50%                         6.70%
Area, Ethnic and Civilization Studies           184,906    0.50%                         5.20%
Art History and Criticism                       137,357    0.40%                         5.90%
Physiology                                       98,181    0.30%                         6.00%
Molecular Biology                                64,951    0.20%                         5.60%

Regression Coefficients' Statistics

Review slide: Notation for a regression equation

Ŷ = b0 + b1 X

• Ŷ: predicted Y (Y-hat, Y')
• b0: intercept
• b1: coefficient, slope
• Y: Y variable, LHS variable, dependent variable
• X: X variable, RHS variable, independent variable, explanatory variable
• The equation as a whole describes the regression line.

Review slide: Regression in Stata

regress yvariablename xvariablename

For instance, to run a regression of price on size, type:

regress price size

Table 6.1

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

• price (top left of the coefficient table) is the name of the dependent variable.
• size is the name of the explanatory variable; its coefficient is b1.
• _cons is the constant (intercept); its coefficient is b0.
• We have a total of 1085 observations.

price = 12934 + 407.45 size

If size increases by 1 sqft, the predicted sales price increases by about $407.

Result of regress with an indicator (dummy) variable as X.

. regress price Beacon_Street

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
        _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
-------------------------------------------------------------------------------

• Write the regression equation: price = 520729 – 46969 Beacon_Street
• What is the predicted price of a condo on Beacon Street? 520729 – 46969 = $473,760
• What is the predicted price of a condo that's not on Beacon Street? $520,729
• What is the difference in prices between those on Beacon St. and NOT? -$46,969

HARD but important question: Answer these same questions if the indicator variable was "NOT_Beacon_Street".

You get the same predictions whichever category you choose to exclude (i.e. to be the reference category)
• What is the predicted price of a condo on Beacon Street? $473,760
• What is the predicted price of a condo that's not on Beacon Street? $520,729
• What is the difference in prices between those on Beacon St. and NOT? $46,969

. generate Not_Beacon_Street = streetname!="BEACON ST"
. regress price Not_Beacon_Street

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

--------------------------------------------------------------------------------
         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
Not_Beacon_S~t |   46969.18   25798.41     1.82   0.069    -3651.345    97589.71
         _cons |   473759.7   24380.35    19.43   0.000     425921.6    521597.8
--------------------------------------------------------------------------------

• What is the coefficient on Not_Beacon_Street? +$46,969
• What is the intercept? The intercept is the predicted price on Beacon Street (i.e. what you get when Not_Beacon_Street = 0): 473,760
• Regression equation: Price = 473,760 + 46,969 * Not_Beacon_Street

How certain are we that the coefficients we measured are accurate in light of the fact that we have limited numbers of observations?

Let's remember means and standard deviations with normally distributed variables
• Approximately 68% (or around 2/3rds) of a variable's values are within one standard deviation of the mean.
• We call this the 68% confidence interval (CI), because 68% of the time, the value falls in this range.
• Approximately 95% of the values are within two standard deviations of the mean.
• We call this the 95% confidence interval.
• 2.0 is just 1.96 rounded. Use either!
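The 68% and 95% figures, and the 1.96 that 2.0 rounds, can be checked directly with Stata's built-in normal distribution functions. This is a quick sketch that needs no data loaded.

```stata
* Share of a normal distribution within 1 standard deviation of the mean
display normal(1) - normal(-1)

* Share within 2 standard deviations of the mean
display normal(2) - normal(-2)

* Exact multiplier that leaves 2.5% in each tail (95% in the middle)
display invnormal(0.975)
```

The first two lines should print roughly 0.683 and 0.954, and the last roughly 1.96, so using 2.0 gives an interval that is very slightly wider than an exact 95% interval.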
Central Limit Theorem (QM221)
• The Central Limit Theorem tells us that if you took many samples from a population, the sample means are distributed approximately according to a normal distribution curve (as long as each sample is reasonably large).
• The average of the sample means (across many samples) is the same as the population mean (μ).
• The standard deviation of the sample means (across many samples) is the standard error (se).

[Figure: normal curve of sample means centered at μ, horizontal axis marked from -3SE to +3SE]

Standard errors more generally
• Sample means have a standard error that tells you how much the means would vary if you had lots of different samples.
• Any statistic estimated on a sample has a standard error that tells you how much that statistic would vary if you had lots of different samples.
• Regression coefficients also have standard errors.
• We are (approximately) 68% certain that the true regression coefficient (if estimated on the entire population) will be within one standard error of the estimated coefficient.
• We are (approximately) 95% certain that the true regression coefficient (if estimated on the entire population) will be within two standard errors of the estimated coefficient.

Standard errors of coefficients

price = 12934 + 407.45 size

Coefficient rows of the regress price size output (full output in Table 6.1):

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

• Next to each coefficient is a standard error.
• We are approximately 68% certain that the true coefficient (with an infinitely large sample) is within one standard error of this coefficient: 407.45 +/- 7.167
• We are approximately 95% certain that the true coefficient (with an infinitely large sample) is within two standard errors of this coefficient: 407.45 +/- 2 * 7.167

The regression results even give you the 95% confidence interval for each coefficient

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The 95% confidence interval for each coefficient is given on the right of that coefficient's line.

One way you can use this 95% confidence interval of the coefficient

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

If the 95% confidence interval of a coefficient does not include zero, we are at least 95% confident that the coefficient is NOT zero. That means we are at least 95% certain of the sign of that coefficient, and here at least 95% certain that size affects price.

For instance, we are more than 95% confident that the variable size has a positive "impact" on the condo price (if we don't worry about confounding factors).

t-statistics

In QM221, you tested hypotheses regarding the mean with the t-test calculation:

t = (X̄ - hypothesized value of mean) / (standard error of the mean, or s/√n)

(where X̄ is the sample mean)

When the absolute value of the t-statistic is greater than 2.0 (or 1.96), we reject the hypothesis (2-tailed test).

t-statistics of coefficients

Similarly, we might have hypotheses about a regression coefficient. If you had a hypothesis that a regression coefficient was truly β, you'd make this t-test:

t = (b - hypothesized true β) / (standard error of the coefficient)

where b is the estimated coefficient.

What hypothesis is the t-statistic of the coefficient in the regression output about?

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The t-stat next to the coefficient in the regression output tests the hypothesis that:

H0: β = 0

i.e. that the true coefficient is actually zero:

t-statistic = (b – 0) / s.e.

Note that this simplifies to:

t-statistic = b / s.e.

If | t | > 2, we are more than 95% certain that the true coefficient is NOT zero.
If | t | < 2, we are NOT more than 95% certain that the true coefficient is NOT zero.

Applied to the regression output:

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The t-statistic on size is > 2 (in absolute value). We are more than 95% certain that this coefficient is not zero. In other words, we are more than 95% certain that size has a positive "impact" on the condo price.

More
• If I know with at least 95% certainty that a coefficient is NOT zero, I say it is statistically significant.
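The t-statistic and the 95% confidence interval on the size coefficient can be recomputed by hand from the coefficient and its standard error. The sketch below copies the numbers from the regress price size output; the scalar names (b_size, se_size, and so on) are made up for this example.

```stata
* Values copied from the "regress price size" output (Table 6.1)
scalar b_size  = 407.4513
scalar se_size = 7.166659
* Residual degrees of freedom: 1085 observations minus 2 estimated coefficients
scalar df_res  = 1083

* t-statistic for the hypothesis that the true coefficient is zero
scalar t_size = (b_size - 0) / se_size
display "t-statistic: " %6.2f t_size

* 95% confidence interval using the exact t multiplier instead of 2.0
scalar tcrit = invttail(df_res, 0.025)
display "t multiplier: " %6.4f tcrit
display "95% CI: " %9.4f (b_size - tcrit*se_size) " to " %9.4f (b_size + tcrit*se_size)

* Rule of thumb from the slides: 1 means |t| > 2, 0 means it is not
display "|t| > 2 ? " (abs(t_size) > 2)
```

This should reproduce the t of 56.85 and, up to rounding, the interval of about 393.39 to 421.51 that Stata prints. If the condo data are loaded and the regression has just been run, the same quantities are stored by Stata as _b[size], _se[size] and e(df_r), so nothing needs to be typed in by hand.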
p-values: The p-value tells us how probable it is that the coefficient is 0 or of the opposite sign. (Strictly, it is the probability of getting an estimate at least this far from zero if the true coefficient were zero.)

Coefficient rows of the regress price size output (full output in Table 6.1):

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

This p-value says that it is less than .0005 (or .05%) likely that the coefficient on size is 0 or negative. (If it were higher than this, it would round to .001.)

I am more than 100% - .05% = 99.95% certain that the coefficient is not zero.

Inputting data in Stata review: Make groups as follows
• Who is using Compustat/CRSP/WRDS?
• Who is using NHANES?
• Who will convert (or has converted) Excel data to Stata?
• Who will convert (or has converted) .csv data to Stata?
• Who has a different kind of data file?
• Who downloaded Stata data? (except NHANES)

Inputting data in Stata: review

Who will convert (or has converted) Excel data to Stata?
• File→Import→Excel spreadsheet opens an Excel datafile.
• If importing an Excel file doesn't work, save the Excel file first as csv.

Who will convert (or has converted) .csv data to Stata?
• File→Import→Text file (delimited, *.csv) opens a text file with columns divided by commas.

Who downloaded Stata data?
• Did it have a "Stata" or "do-file" program also? Download that as well and talk to me/TA.

EVERYONE: Download the codebook or data dictionary or definition of variables as well.

Other commands at this point:
• count (with "if" statements)
• cd drive:\foldername (e.g. C:\QM222)
• File→Save or File→Save as
• exit, clear
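The menu steps above also have command-line equivalents. The sketch below is self-contained: it starts from the auto dataset that ships with Stata, writes it out as Excel and .csv, and reads each back in. The file names are hypothetical, and the files are written to the current working directory, so set that first with cd if needed.

```stata
* Start from a dataset that ships with Stata so the example needs no downloads
sysuse auto, clear

* Write it out as Excel and as .csv (hypothetical file names)
export excel using "qm222_example.xlsx", firstrow(variables) replace
export delimited using "qm222_example.csv", replace

* Command equivalent of File > Import > Excel spreadsheet
* firstrow treats the first row of the sheet as variable names
import excel using "qm222_example.xlsx", firstrow clear
count

* Command equivalent of File > Import > Text file (delimited, *.csv)
import delimited using "qm222_example.csv", clear

* count with an "if" statement: how many cars cost more than $10,000?
count if price > 10000

* Command equivalent of File > Save as (writes a Stata .dta file)
save "qm222_example.dta", replace
```

The plain count should report 74 observations, the size of the auto dataset. For your own project, replace the example file names with the path to the file you downloaded.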