Lecture 28 - Wharton Statistics Department

Lecture 28
• Categorical variables:
– Review of slides from lecture 27 (reprint of
lecture 27 categorical variables slides with
typos corrected)
– Practice Problem
• Review of main themes from course
• Directions for future study
Sex discrimination revisited
• At the beginning of the class, in case study
1.2, we examined data from a sex
discrimination case.
Oneway Analysis of Salaries By Sex
[Plot: Salaries (axis 4000 to 8000) by Sex (Female, Male)]
t-Test
             Estimate   Std Error   Lower 95%   Upper 95%
Difference   -818.0     130.0       -1076.2     -559.8

t-Test   DF   Prob > |t|
-6.293   91   <.0001
Strong evidence that male clerks are
paid more than female clerks. But the
bank’s defense lawyers say that this is
because the males have more education
and experience, i.e., that there
are omitted confounding variables.
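The numbers in the t-Test output are tied together by simple arithmetic, which the short Python sketch below checks. It uses only the values printed in the output; the variable names are illustrative. Because the printed estimate and standard error are rounded, the recomputed t ratio can differ from the reported -6.293 in the third decimal place.

```python
# Values copied from the JMP t-Test output for Salaries by Sex.
estimate = -818.0      # Difference estimate
std_error = 130.0      # Std Error of the difference
lower_95 = -1076.2     # Lower 95% limit
upper_95 = -559.8      # Upper 95% limit

# The t statistic is the estimate divided by its standard error.
t_ratio = estimate / std_error

# A 95% interval has the form estimate +/- multiplier * std_error,
# so the multiplier can be backed out from the interval's half-width.
half_width = (upper_95 - lower_95) / 2
multiplier = half_width / std_error

# The interval is symmetric, so its midpoint should be the estimate.
midpoint = (upper_95 + lower_95) / 2

print(f"t ratio    = {t_ratio:.3f}")
print(f"multiplier = {multiplier:.3f}")
print(f"midpoint   = {midpoint:.1f}")
```

The backed-out multiplier is close to 2, as expected for a 95% interval based on a t distribution with 91 degrees of freedom.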
Multiple regression model for sex discrimination
• Let’s look at controlling for education level first.
• To examine bank’s claim, we want to look at
$\mu\{\text{Salary} \mid \text{Education}, \text{Sex}\}$
and compare
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\}$
to
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\}$
• How do we incorporate a categorical explanatory
variable into multiple regression? Dummy
variables.
Dummy variables
• Define
$\text{Sexdummy}_i = 1$ if person $i$ is male
$\text{Sexdummy}_i = 0$ if person $i$ is female
• Multiple regression model:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2$
• $\beta_2$, the coefficient on the dummy variable for sex,
is the difference in mean earnings between the
populations of men and women with the same
education level.
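A minimal Python sketch of the dummy-variable idea follows. The eight records are hypothetical (they are NOT the bank data) and were constructed without noise so that salary = 3800 + 100 × education + 700 × Sexdummy exactly. The helper functions `solve_linear_system` and `least_squares` are illustrative names, not part of JMP or any library, and the fit uses the normal equations only to keep the example self-contained.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical records: (education in years, sex, salary).
people = [
    (8, "Female", 4600), (12, "Female", 5000), (16, "Female", 5400),
    (8, "Male", 5300), (12, "Male", 5700), (16, "Male", 6100),
    (10, "Female", 4800), (15, "Male", 6000),
]

# Sexdummy is 1 for males and 0 for females, as defined on the slide.
design = [[1.0, educ, 1.0 if sex == "Male" else 0.0] for educ, sex, _ in people]
salary = [s for _, _, s in people]

b0, b1, b2 = least_squares(design, salary)
print(f"intercept = {b0:.2f}, EDUC = {b1:.2f}, Sexdummy = {b2:.2f}")
```

In this constructed example the coefficient on Sexdummy recovers the built-in 700 difference between males and females with the same education.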
Categorical variables in JMP
• To color and mark the points by a
categorical variable such as Sex, click the red
triangle to the left of the first column and select
Color or Mark by Column. Select Set
Marker by Value to use a different marker for
each value of the column.
Response Salary
Summary of Fit
RSquare                      0.363354
RSquare Adj                  0.349206
Root Mean Square Error       572.4368
Mean of Response             5420.323
Observations (or Sum Wgts)   93
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept   4173.1251   339.1811    12.30     <.0001     3499.2827   4846.9676
EDUC        80.697765   27.67291    2.92      0.0045     25.720708   135.67482
Sexdummy    691.80826   132.2319    5.23      <.0001     429.10655   954.50997
There is strong evidence that males and females of the same education level have
different salaries (p-value < .0001). The 95% confidence interval for the difference in
mean salaries between males and females of the same education level is
($429.11,$954.51).
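Parallel Regression Lines
• The model
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
implies that
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Male}\} = (\beta_0 + \beta_2) + \beta_1 x_1$
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Female}\} = \beta_0 + \beta_1 x_1$
• Regression lines for males and females as
education varies are parallel.
• No interaction between sex and education.
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2$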
Response Salary
Whole Model
Regression Plot
[Plot: Salary (axis 4000 to 8000) versus EDUC (axis 7 to 17), with points and fitted lines for FEMALE and MALE]
Plot produced by JMP version 5 in Fit Model output that shows
the parallel regression lines and the actual observations.
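The two parallel lines can be written out from the Parameter Estimates for the model with EDUC and Sexdummy. The Python sketch below plugs in those reported estimates; the function name and the education levels 8, 12, and 16 are illustrative choices.

```python
# Estimates copied from the Parameter Estimates table
# for the model with EDUC and Sexdummy.
B_INTERCEPT = 4173.1251
B_EDUC = 80.697765
B_SEXDUMMY = 691.80826


def fitted_mean_salary(educ_years, sexdummy):
    """Fitted mean salary; sexdummy is 1 for male, 0 for female."""
    return B_INTERCEPT + B_EDUC * educ_years + B_SEXDUMMY * sexdummy


for educ_years in (8, 12, 16):
    female = fitted_mean_salary(educ_years, 0)
    male = fitted_mean_salary(educ_years, 1)
    print(f"EDUC = {educ_years:2d}: female {female:8.2f}, "
          f"male {male:8.2f}, gap {male - female:7.2f}")
```

The gap equals the Sexdummy coefficient at every education level, which is what "parallel lines" means in this model.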
Interactions with Dummy Variables
• The model
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
assumes that the difference between men’s and
women’s mean salaries for fixed levels of
education is the same for all levels of education.
• There might be an interaction between sex and
education. The difference between men and women
might differ depending on the level of education.
Interaction Model
• Multiple regression model that allows for
interaction between sex and education:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sexdummy} = x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2)$
$\mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Education} = x_1, \text{Sex} = \text{Female}\} = \beta_2 + \beta_3 x_1$
Difference in mean salary between men and women of the same
education level depends on the education level.
• To add an interaction in JMP, create a new column
sexdummy*educ. Right click on the column, select
Formula, and use the formula sexdummy*educ.
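Substituting $x_2 = 1$ (male) and $x_2 = 0$ (female) into the interaction model makes the dependence on education explicit. The derivation below is a sketch that uses only the model as stated on this slide:

```latex
\begin{align*}
\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Male}\}
  &= \beta_0 + \beta_1 x_1 + \beta_2 (1) + \beta_3 (x_1 \times 1)
   = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 \\
\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Sex} = \text{Female}\}
  &= \beta_0 + \beta_1 x_1 + \beta_2 (0) + \beta_3 (x_1 \times 0)
   = \beta_0 + \beta_1 x_1 \\
\text{Male} - \text{Female}
  &= \beta_2 + \beta_3 x_1
\end{align*}
```

The male and female lines therefore have different intercepts and different slopes, and the gap between them changes by $\beta_3$ for each additional year of education.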
Response Salary
Summary of Fit
RSquare                      0.37279
RSquare Adj                  0.351648
Root Mean Square Error       571.3617
Mean of Response             5420.323
Observations (or Sum Wgts)   93
Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       4395.3228   389.21      11.29     <.0001
EDUC            62.13056    31.94336    1.95      0.0549
Sexdummy        -274.8597   845.7489    -0.32     0.7460
Sexdummy*EDUC   73.585794   63.59228    1.16      0.2503
There is no strong evidence of an interaction (p-value = .2503). Note that in the
interaction model, the coefficient on Sexdummy is not easily interpretable on its own: it is
the estimated male-female difference in mean salary at zero years of education. The best
way to understand the estimates of an interaction model is to plot the two separate
regression lines as shown on the next slide.
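As a numeric companion to the plot, the estimated male-minus-female gap implied by the interaction fit can be evaluated at a few education levels. The Python sketch below uses the reported Sexdummy and Sexdummy*EDUC estimates; the function name and the chosen education levels are illustrative. Since the interaction term is not statistically significant, the change in the gap across education levels should be read as a description of the fitted lines, not as an established effect.

```python
# Estimates copied from the interaction-model Parameter Estimates table.
B_SEXDUMMY = -274.8597        # coefficient on Sexdummy
B_SEXDUMMY_EDUC = 73.585794   # coefficient on Sexdummy*EDUC


def estimated_gap(educ_years):
    """Estimated male-minus-female difference in mean salary at educ_years."""
    return B_SEXDUMMY + B_SEXDUMMY_EDUC * educ_years


for educ_years in (0, 8, 12, 16):
    print(f"EDUC = {educ_years:2d}: estimated gap = {estimated_gap(educ_years):8.2f}")
```

At EDUC = 0 the gap equals the Sexdummy coefficient itself, which is why that coefficient is hard to interpret on its own.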
Response Salary
Whole Model
Regression Plot
[Plot: Salary (axis 4000 to 8000) versus EDUC (axis 7 to 17), with separate fitted lines for FEMALE and MALE]
The model with one continuous explanatory variable, one categorical
variable and an interaction is called the separate regression lines model
because the regression lines of y on the continuous explanatory variable for
the two levels of the dummy variable are “separate,”
neither coincident nor parallel.
Multiple regression with education, experience and sex
• We can easily control for both education and
experience in the sex discrimination case by
adding them both to the multiple regression. A
model without interactions is:
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sexdummy} = x_3\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
• Note that
$\mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Male}\} - \mu\{\text{Salary} \mid \text{Educ} = x_1, \text{Exper} = x_2, \text{Sex} = \text{Female}\} = \beta_3$
• $\beta_3$ is the difference between mean salaries of males
and females of the same education and experience
level.
Response Salary
Summary of Fit
RSquare                      0.398068
RSquare Adj                  0.377778
Root Mean Square Error       559.7297
Mean of Response             5420.323
Observations (or Sum Wgts)   93
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept   3943.6599   346.7729    11.37     <.0001     3254.6295   4632.6903
EDUC        87.667515   27.23294    3.22      0.0018     33.556244   141.77879
EXPER       1.4632657   0.645874    2.27      0.0259     0.1799284   2.746603
Sexdummy    676.17906   129.4805    5.22      <.0001     418.9041    933.45402
Strong evidence that males and females have different mean salaries for same level of
education and experience. 95% confidence interval for difference between mean male
and mean female salaries for same level of education and experience is ($418.90,
$933.45).
Response Salary
Summary of Fit
RSquare                      0.42282
RSquare Adj                  0.389648
Root Mean Square Error       554.3651
Mean of Response             5420.323
Observations (or Sum Wgts)   93
Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept        4169.488    387.3417    10.76     <.0001     3399.6045   4939.3715
EDUC             62.685629   30.99385    2.02      0.0462     1.0819919   124.28927
EXPER            2.1959717   0.838037    2.62      0.0104     0.530283    3.8616604
Sexdummy         -344.7752   900.2427    -0.38     0.7027     -2134.105   1444.5546
Sexdummy*EDUC    88.42193    64.48341    1.37      0.1738     -39.74584   216.5897
Sexdummy*EXPER   -1.346959   1.330667    -1.01     0.3142     -3.991804   1.2978853
There is not strong evidence of an interaction between sex and education (p-value=.1738)
nor between sex and experience (p-value=.3142). We do not need to use interaction
terms in the model.
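A descriptive way to see how little the interaction terms add is to compare RSquare Adj across the four fits in this lecture. The Python sketch below recomputes RSquare Adj from the reported RSquare values using the standard adjustment formula, which is an assumption here in the sense that the slides do not state it. The model labels and function name are illustrative, and this comparison supplements the t-tests above without replacing them.

```python
N = 93  # observations in every fit

# (model terms, number of explanatory terms, reported RSquare, reported RSquare Adj)
fits = [
    ("EDUC + Sexdummy", 2, 0.363354, 0.349206),
    ("EDUC + Sexdummy + Sexdummy*EDUC", 3, 0.37279, 0.351648),
    ("EDUC + EXPER + Sexdummy", 3, 0.398068, 0.377778),
    ("EDUC + EXPER + Sexdummy + both interactions", 5, 0.42282, 0.389648),
]


def adjusted_r_squared(r_squared, n, n_terms):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - n_terms - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_terms - 1)


for label, n_terms, r_squared, reported in fits:
    recomputed = adjusted_r_squared(r_squared, N, n_terms)
    print(f"{label}: recomputed {recomputed:.6f}, reported {reported:.6f}")
```

Adding the interaction terms raises RSquare Adj only slightly in each case, in line with the conclusion that they are not needed.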
Course Summary
• Techniques:
– Methods for comparing two groups
– Methods for comparing more than two groups (one-way
ANOVA F test, multiple comparisons)
– Method for testing hypotheses about the distribution of a
nominal variable in one population (chi-squared test)
– Simple and multiple linear regression for predicting a
response variable based on explanatory variables and
(with a randomized experiment or no omitted confounding
variables) finding the causal effect of explanatory
variables on a response variable.
Course Summary Cont.
• Key messages:
– Always do a randomized experiment if possible.
Inferences about causal effects from observational
studies require the always questionable assumption that
there are no omitted confounding variables. Similarly,
always take a random sample if possible.
– p-values only assess whether there is strong evidence
against the null hypothesis. They do not provide
information about practical significance.
– Always form confidence intervals for the parameters
(e.g., difference in means, regression coefficients) in
addition to making point estimates and doing
hypothesis tests. Confidence intervals provide
information about the accuracy of the estimate and the
practical significance of the finding.
Course Summary Cont.
• Key messages:
– Beware of multiple comparisons and data snooping.
Use Tukey-Kramer method or Bonferroni to adjust for
multiple comparisons.
– Simple/multiple linear regression is a powerful method
for making predictions of a variable y based on
explanatory variables. However, beware of
extrapolation.
– Multiple regression can be used to control for known
confounding variables in order to obtain good estimates
of the causal effect of a variable on an outcome.
However, if there are omitted confounding variables,
the estimate of the causal effect will be biased. The
sign and magnitude of the bias are indicated by the
omitted variable bias formula.
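For reference, one common statement of the omitted variable bias formula is sketched below for the case of one included variable $x_1$ and one omitted variable $x_2$. The symbols $\delta_0$ and $\delta_1$ are notation assumed here, and the version given earlier in the course may be written differently.

```latex
\begin{align*}
\text{Full model:}\quad
  & \mu\{y \mid x_1, x_2\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{Omitted variable given included variable:}\quad
  & \mu\{x_2 \mid x_1\} = \delta_0 + \delta_1 x_1 \\
\text{Model that omits } x_2\text{:}\quad
  & \mu\{y \mid x_1\} = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1)\, x_1 \\
\text{Bias in the coefficient on } x_1\text{:}\quad
  & \beta_2 \delta_1
\end{align*}
```

The bias is positive when $\beta_2$ and $\delta_1$ have the same sign and negative when they differ. In the sex discrimination case, the estimated difference shrinking in magnitude from about 818 to about 692 once education is included is consistent with this pattern.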
Directions for Future Study
• Stat 500: Applied Regression and Analysis of
Variance. Offered next fall. Natural follow-up to
Stat 112, giving a more advanced treatment of the
topics in 112.
• Stat 501: Introduction to Nonparametric Methods
and Log-linear models. Offered this spring.
Follow-up to Stat 500.
• Stat 430: Probability. Will be offered next fall and
next spring.
• Stat 431: Statistical Inference. Will be offered
next fall and next spring.
• Stat 210: Sample Survey Design.
• Stat 202: Intermediate Statistics.