Math 141 - Lecture 30: GLM Examples

Albyn Jones
Library 304
[email protected]
www.people.reed.edu/~jones/courses/141
Generalized Linear Models
Binomial GLM: Logistic Regression
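Given X,
\[ Y \sim \mathrm{Binomial}(n, p) \]
and
\[ \log(\text{odds}) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X \]
or equivalently:
\[ p = \frac{\text{odds}}{1 + \text{odds}} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]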
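As a small illustration of the two equivalent forms above, the following R sketch converts between log-odds and probabilities. The values of b0 and b1 are made up for illustration only; plogis() and qlogis() are base R's inverse-logit and logit functions.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -1.0
b1 <- 0.5
x  <- c(0, 1, 2, 4)

log_odds <- b0 + b1 * x            # linear predictor: log(p / (1 - p))
odds     <- exp(log_odds)
p        <- odds / (1 + odds)      # inverse logit, written out by hand

# plogis() computes the same inverse logit; qlogis() recovers the log-odds
stopifnot(isTRUE(all.equal(p, plogis(log_odds))))
stopifnot(isTRUE(all.equal(qlogis(p), log_odds)))
print(round(p, 3))
```

At x = 2 the linear predictor is zero, so the odds are 1 and p is exactly one half.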
Example: Death in the North Atlantic
> titanic
   Surv   N  Class   Age    Sex
1    20  23   Crew Adult Female
2   192 862   Crew Adult   Male
3     1   1  First Child Female
4     5   5  First Child   Male
5   140 144  First Adult Female
6    57 175  First Adult   Male
7    13  13 Second Child Female
8    11  11 Second Child   Male
9    80  93 Second Adult Female
10   14 168 Second Adult   Male
11   14  31  Third Child Female
12   13  48  Third Child   Male
13   76 165  Third Adult Female
14   75 462  Third Adult   Male
Additive Model
Ignoring the Age category, we fit the additive model.
Call:
glm(cbind(Surv, N - Surv) ~ Class + Sex,
    family = binomial, data = titanic)
Coefficients:
            Estimate Std. Error z value
(Intercept)  1.18740    0.15746   7.541
ClassFirst   0.88081    0.15697   5.611
ClassSecond -0.07178    0.17093  -0.420
ClassThird  -0.77742    0.14231  -5.463
SexMale     -2.42133    0.13909 -17.408
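To make the fit above easy to try, here is a minimal sketch that re-enters the counts from the titanic table and refits the additive model. The object name T1 follows the name used later in these slides for a fitted model; the data frame is transcribed by hand from the printed table, and it assumes the default alphabetical factor ordering, which makes Crew and Female the baseline levels.

```r
# Counts of survivors (Surv) out of N, transcribed from the printed table
titanic <- data.frame(
  Surv  = c(20, 192, 1, 5, 140, 57, 13, 11, 80, 14, 14, 13, 76, 75),
  N     = c(23, 862, 1, 5, 144, 175, 13, 11, 93, 168, 31, 48, 165, 462),
  Class = factor(rep(c("Crew", "First", "Second", "Third"),
                     times = c(2, 4, 4, 4))),
  Age   = c("Adult", "Adult", rep(c("Child", "Child", "Adult", "Adult"), 3)),
  Sex   = factor(rep(c("Female", "Male"), 7))
)

# Additive model: the response is a two-column matrix (successes, failures)
T1 <- glm(cbind(Surv, N - Surv) ~ Class + Sex,
          family = binomial, data = titanic)
print(round(coef(summary(T1))[, 1:3], 3))
```

The two-column response cbind(Surv, N - Surv) tells glm how many survivors and non-survivors each row represents, so no per-person records are needed.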
Interaction Model
Call:
glm(cbind(Surv, N - Surv) ~ Class * Sex,
    family = binomial, data = titanic)
Coefficients:
                    Estimate Std. Error z value
(Intercept)          1.89712    0.61914   3.064
ClassFirst           1.66535    0.80027   2.081
ClassSecond          0.07053    0.68630   0.103
ClassThird          -2.06075    0.63551  -3.243
SexMale             -3.14690    0.62453  -5.039
ClassFirst:SexMale  -1.05911    0.81959  -1.292
ClassSecond:SexMale -0.63882    0.72402  -0.882
ClassThird:SexMale   1.74286    0.65139   2.676
Interpretation of Coefficients: Third Class Males
The interaction coefficient for Third Class Males was about
1.74. What does that mean?
\[ \log\left(\frac{p}{1-p}\right) = 1.897 + (-2.061) + (-3.147) + 1.743 \approx -1.568 \]
Add up the coefficients for the intercept (baseline group), the
dummy variable for third class, the dummy variable for males,
and the interaction term (third class and male).
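The sum described above can be carried through to a probability. This sketch uses the four coefficients printed in the interaction model output; the vector name b and its element names are just labels chosen here.

```r
# Coefficients copied from the interaction model output
b <- c(intercept = 1.89712, third = -2.06075, male = -3.14690,
       third_male = 1.74286)

log_odds <- sum(b)          # log(p / (1 - p)) for third class males
p <- plogis(log_odds)       # inverse logit
print(round(c(log_odds = log_odds, p = p), 3))
```

The resulting probability, about 0.173, agrees with the pooled observed survival proportion for third class males in the data table, (13 + 75) / (48 + 462).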
Compute predicted probabilities
The positive interaction coefficient tells us that third class
males did better than the additive model predicts. Third class
females must have done worse; otherwise the additive model
would fit well, and the effect would be entirely captured by the
coefficient for the Third Class dummy variable.
> # T1 is the additive fit, T2 the interaction fit
> P1 <- round(predict(T1,type="response"),3)
> P2 <- round(predict(T2,type="response"),3)
Compare predicted probabilities
> data.frame(titanic, P1, P2)
   Surv   N  Class   Age    Sex    P1    P2
1    20  23   Crew Adult Female 0.766 0.870
2   192 862   Crew Adult   Male 0.225 0.223
3     1   1  First Child Female 0.888 0.972
4     5   5  First Child   Male 0.413 0.344
5   140 144  First Adult Female 0.888 0.972
6    57 175  First Adult   Male 0.413 0.344
7    13  13 Second Child Female 0.753 0.877
8    11  11 Second Child   Male 0.213 0.140
9    80  93 Second Adult Female 0.753 0.877
10   14 168 Second Adult   Male 0.213 0.140
11   14  31  Third Child Female 0.601 0.459
12   13  48  Third Child   Male 0.118 0.173
13   76 165  Third Adult Female 0.601 0.459
14   75 462  Third Adult   Male 0.118 0.173
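Because Age is not in either model, the predictions repeat within each Class by Sex group. The interaction model has one parameter per Class by Sex cell, so its fitted values should equal the pooled observed proportions in those cells. The following sketch checks that directly from the counts, without fitting a model; the vector names are chosen here for illustration.

```r
# Counts transcribed from the titanic table, in row order
surv <- c(20, 192, 1, 5, 140, 57, 13, 11, 80, 14, 14, 13, 76, 75)
n    <- c(23, 862, 1, 5, 144, 175, 13, 11, 93, 168, 31, 48, 165, 462)
cls  <- rep(c("Crew", "First", "Second", "Third"), times = c(2, 4, 4, 4))
sex  <- rep(c("Female", "Male"), 7)

# Pool the Age groups: total survivors / total at risk per Class x Sex cell
pooled <- tapply(surv, list(cls, sex), sum) / tapply(n, list(cls, sex), sum)
print(round(pooled, 3))
```

The eight pooled proportions can be compared entry by entry with the P2 column.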
Example: Death in the Snow
From the alr3 package, the donner dataset, with 91
observations.
The Donner Party was the most famous tragedy in the
history of the westward migration in the United States.
In the winter of 1846-47, about ninety wagon train
emigrants were unable to cross the Sierra Nevada
Mountains of California before winter, and almost
one-half starved to death... These data include some
information about each of the members of the party
from Johnson (1996).
The Donner Dataset
Variables
Age: Approximate age in 1846.
Outcome: 1 if survived, 0 if died.
Sex: Male or Female.
Family.name: family name, hired or single.
Status: Family, single or hired.
First Try: 3-way interactions
> donner.glm0 <- glm(Outcome ~ Age*Sex*Status,
                     data=donner,family=binomial)
Coefficients: (3 not defined because of singularities)
                            Est    SE  z value  Pr(>|z|)
<...omitted table entries...>
SexMale:StatusHired      -19.06  3956    -.005      0.99
SexMale:StatusSingle         NA    NA       NA        NA
Age:SexMale:StatusHired      NA    NA       NA        NA
Age:SexMale:StatusSingle     NA    NA       NA        NA
What happened?
‘Singularities’ means exact collinearity!
> with(donner,table(Sex,Status))
        Status
Sex      Family Hired Single
  Female     34     1      0
  Male       34    17      5
There is no way to estimate the interaction between Sex and
Status, let alone the higher order interactions. In fact, we have
almost no data on Females outside of families.
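The same symptom can be reproduced on a tiny example. The sketch below uses made-up toy data (not the donner data) with an empty Female/Single cell; the names toy and fit are hypothetical. With that cell empty, the interaction column of the design matrix duplicates the Status column, so glm cannot estimate it and reports NA.

```r
# Made-up toy data with an empty Female/Single cell (not the donner data)
toy <- data.frame(
  Sex    = factor(rep(c("F", "M", "M"), each = 4)),
  Status = factor(rep(c("Family", "Family", "Single"), each = 4)),
  y      = c(1, 1, 0, 1,   1, 0, 0, 1,   0, 1, 0, 0)
)
print(with(toy, table(Sex, Status)))   # note the zero count

fit <- glm(y ~ Sex * Status, family = binomial, data = toy)
print(coef(fit))                       # the interaction is reported as NA
```

Only three Sex by Status cells contain data, so at most three coefficients can be estimated from the four the formula asks for.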
Second Try: Omit the troublesome terms
glm(formula = Outcome ~ Age * Sex + Age * Status,
    family = binomial, data = donner)
                      Est  Std. Error  z value  Pr(>|z|)
(Intercept)         1.885        0.67    2.810     0.005
Age                -0.046        0.02   -1.833     0.067
SexMale            -1.476        0.84   -1.755     0.079
StatusHired         1.036        1.96    0.527     0.598
StatusSingle      -17.975    14764.86   -0.001     0.999
Age:SexMale         0.036        0.03    1.130     0.259
Age:StatusHired    -0.070        0.07   -0.889     0.374
Age:StatusSingle    0.010      472.85    0.000     1.000
Sequential Anova Table
> anova(donner.glm0,test="Chisq")
Analysis of Deviance Table
Terms added sequentially (first to last)
           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                          87     120.86
Age         1   6.8368        86     114.02  0.00893
Sex         1   5.1509        85     108.87  0.02323
Status      2   5.8995        83     102.97  0.05235
Age:Sex     1   0.9841        82     101.98  0.32120
Age:Status  2   1.0131        80     100.97  0.60258
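The Pr(>Chi) column in a sequential table like this one is the upper-tail chi-squared probability of each drop in deviance on its degrees of freedom. A short sketch that recomputes the column from the printed Df and Deviance values follows; the vector names are chosen here for illustration.

```r
# Deviance drops and their degrees of freedom, copied from the table
term     <- c("Age", "Sex", "Status", "Age:Sex", "Age:Status")
dev_drop <- c(6.8368, 5.1509, 5.8995, 0.9841, 1.0131)
dfs      <- c(1, 1, 2, 1, 2)

# Upper-tail chi-squared probability of each drop in deviance
p_value <- pchisq(dev_drop, df = dfs, lower.tail = FALSE)
print(data.frame(term, dfs, dev_drop, p_value = round(p_value, 5)))
```

Because each row conditions on the terms listed before it, changing the order of terms in the formula can change these values.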
What Next?
When dropping terms from complex models, start at the bottom
of the anova table; that is, drop interactions before main
effects.
Neither interaction appears statistically significant, so let’s drop
the two-way interactions!
Reduced Model: only main effects
> donner.glm1 <- glm(Outcome ~ Age + Sex + Status,
                     data=donner,family=binomial)
> anova(donner.glm1,donner.glm0)
Analysis of Deviance Table
Model 1: Outcome ~ Age + Sex + Status
Model 2: Outcome ~ Age * Sex + Age * Status
  Resid. Df Resid. Dev Df Deviance
1        83     102.97
2        80     100.97  3   1.9971
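The comparison table above does not print a p-value. One way to get an approximate one from the printed numbers is sketched below; the variable names are chosen here, and the residual deviances are rounded to two decimals as printed, so the result is approximate.

```r
# Residual deviances and residual df as printed in the comparison table
dev_reduced <- 102.97; df_reduced <- 83   # Age + Sex + Status
dev_full    <- 100.97; df_full    <- 80   # Age * Sex + Age * Status

drop_dev <- dev_reduced - dev_full        # about 2.00 (1.9971 before rounding)
drop_df  <- df_reduced - df_full          # 3
p_value  <- pchisq(drop_dev, df = drop_df, lower.tail = FALSE)
print(round(p_value, 3))
```

With the fitted model objects in hand, passing test = "Chisq" to anova() should report this kind of p-value directly.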
Coefficients
glm(Outcome ~ Age + Sex + Status,
    data=donner,family=binomial)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)      1.487      0.493   3.019    0.003
Age             -0.028      0.015  -1.868    0.062
SexMale         -0.728      0.517  -1.407    0.159
StatusHired     -0.599      0.628  -0.953    0.341
StatusSingle   -17.456   1765.537  -0.010    0.992
Do we see anything peculiar here?
Gigantic SE’s are another symptom of approximate
confounding/collinearity!
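To see how approximate collinearity can inflate standard errors in general, here is a sketch on simulated data (not the donner data). Everything in it is an assumption made for illustration: the seed, the sample size, the true slope, and the near-duplicate predictor x2.

```r
set.seed(141)                          # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)         # x2 is almost an exact copy of x1
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1))

fit_one  <- glm(y ~ x1,      family = binomial)
fit_both <- glm(y ~ x1 + x2, family = binomial)

# Standard error of the x1 coefficient, alone and alongside its near-copy
se_one  <- coef(summary(fit_one))["x1", "Std. Error"]
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
print(c(alone = se_one, with_near_copy = se_both))
```

When two predictors carry nearly the same information, the data cannot say how to split the effect between them, and the standard errors grow accordingly.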
Reduce again!
The non-Family dummy variables are not statistically
significant. Combine?
> Family <- donner$Status == "Family"
> donner.glm2 <- glm(Outcome ~ Age+Sex+Family,
                     data=donner,family=binomial)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.59543    0.81523   0.730   0.4652
Age         -0.02984    0.01519  -1.964   0.0495
SexMale     -0.75663    0.51985  -1.455   0.1455
FamilyTRUE   0.94053    0.60548   1.553   0.1203
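The recoding step above collapses the three Status levels into a single logical indicator. The sketch below shows the effect on group sizes, using Status counts taken from the margins of the Sex by Status table shown earlier (34 + 34, 1 + 17, 0 + 5); the vector names are hypothetical, and the counts ignore the observations later dropped for missingness.

```r
# Status counts implied by the Sex x Status table (all 91 observations)
status <- factor(rep(c("Family", "Hired", "Single"), times = c(68, 18, 5)))

in_family <- status == "Family"        # TRUE for Family, FALSE for Hired or Single
print(table(status))
print(table(in_family))
```

A logical predictor enters the model as a single dummy variable, which is why the coefficient table shows one row labelled FamilyTRUE in place of the two Status rows.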
Check the Fine Print!
Null deviance: 120.86 on 87 df
Residual deviance: 106.37 on 84 df
(3 observations deleted due to missingness)
AIC: 114.37
Number of Fisher Scoring iterations: 4
> qchisq(.95,84)
[1] 106.39484
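The qchisq call above gives a critical value; the matching upper-tail probability for the observed residual deviance can be computed the same way. This sketch treats the residual deviance as approximately chi-squared on its degrees of freedom, which is itself an approximation; the variable names are chosen here.

```r
resid_dev <- 106.37    # residual deviance reported in the summary
resid_df  <- 84        # its degrees of freedom

crit   <- qchisq(0.95, df = resid_df)                         # 95th percentile
p_tail <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(c(critical_value = crit, upper_tail_prob = p_tail))
```

Since the residual deviance sits just below the 95th percentile, the tail probability comes out just above 0.05.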
What does a large residual deviance mean?
Final Points
Analysis of Deviance Table
Model 1: Outcome ~ Age
Model 2: Outcome ~ Age + Sex + Family
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        86     114.02
2        84     106.37  2   7.6468  0.02185
Sex and Family were strongly associated, so it isn’t clear we
can sort out their individual contributions!
There may be within-family correlation, so a mixed model might
be in order!
Poisson Generalized Linear Models
The Poisson GLM: a loglinear model
Given X,
\[ Y \sim \mathrm{Poisson}(\mu) \]
and
\[ \log(\mu) = \beta_0 + \beta_1 X \]
or equivalently:
\[ \mu = e^{\beta_0 + \beta_1 X} \]
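A minimal sketch of the loglinear model in R, using simulated counts. The seed, sample size, and the "true" coefficients b0 and b1 are assumptions made up for this illustration; glm with family = poisson uses the log link by default.

```r
set.seed(30)                           # arbitrary seed
b0 <- 0.5; b1 <- 0.3                   # hypothetical true coefficients
x  <- runif(500, min = 0, max = 5)
mu <- exp(b0 + b1 * x)                 # log(mu) = b0 + b1 * x
y  <- rpois(500, lambda = mu)

fit <- glm(y ~ x, family = poisson)    # log link is the default for poisson
print(round(coef(fit), 3))
```

With a sample of this size the estimated coefficients should land near the values used to generate the data.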
Poisson Regression Models: Interpretation
Since the exponential function
\[ \mu(X) = e^{\beta_0 + \beta_1 X} = e^{\beta_0} e^{\beta_1 X} \]
is monotone increasing if β1 > 0 and decreasing if β1 < 0,
qualitative interpretation is again completely analogous to
interpreting coefficients for the linear model:
β1 > 0 implies positive association with X
β1 < 0 implies negative association with X
β1 = 0 implies no association with X
The intercept term β0 determines the mean value when X = 0:
\[ \mu(0) = e^{\beta_0} \]
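The interpretation of both coefficients can be checked numerically. In this sketch b0 and b1 are hypothetical values, with b1 negative to show a decreasing mean.

```r
b0 <- 1.2                              # hypothetical intercept
b1 <- -0.4                             # hypothetical slope (negative association)
mu <- function(x) exp(b0 + b1 * x)

x <- 0:4
print(round(mu(x), 3))

# Each unit increase in X multiplies the mean by exp(b1)
ratios <- mu(x + 1) / mu(x)
stopifnot(isTRUE(all.equal(ratios, rep(exp(b1), length(x)))))

# At X = 0 the mean is exp(b0)
stopifnot(isTRUE(all.equal(mu(0), exp(b0))))
```

The effect of X is multiplicative on the mean: exp(b1) is the ratio of means for a one-unit increase in X, and it is below 1 exactly when b1 is negative.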
Summary
GLM’s are a lot like linear models!!