Math 155, Spring 2005
Introduction to Statistical Modeling
Daniel Kaplan, Macalester College
Lab 9 – Interaction and Logistic Regression
This lab is about two topics in modeling. Interaction refers
to a particular way of structuring models; it is a concept that
applies to a wide variety of models. Logistic Regression is a
form of nonlinear modeling that is particularly useful when
the response variable is binomial, e.g., yes or no, alive or dead,
success or failure. The two concepts are brought together in this lab partly because we want to cover both of them in the course, and partly to show how interaction applies in both linear and nonlinear modeling.
Interaction
The data set noise.csv is from an experiment to study how noise affects the performance of children. In the experiment, second graders were given a test of math problems and their score was recorded. The children were either “normal” or “hyperactive,” and the test conditions were either high noise or low noise. Each child took just one test.
We want to see how the noise level affects the test score and whether hyperactivity plays a role as well. Test score (score), a quantitative variable, will be the response variable. Group and noise level (group and noise) will be the explanatory variables. Although group could obviously not be manipulated experimentally, the noise level could be. (See Exercise 1.)
To start, we read in the data.
> a = read.csv('noise.csv')
It's often helpful to look at the data graphically. In this case, a boxplot of the test score as a function of the explanatory variables is appropriate:
> boxplot(score ~ group, data=a)
> boxplot(score ~ noise, data=a)
It looks like the hyperactive children score lower, and that
high noise levels increase the spread of scores compared to low
noise levels, but don’t alter the median score by very much.
We can confirm these visual impressions by constructing
linear models:
> lm(score ~ group, data=a)
Coefficients:
     (Intercept)  grouphyperactive
          187.80            -57.95

The negative coefficient on group indicates that being hyperactive is associated with lower scores. The effect is statistically significant.
The level of noise also appears to have an effect on the score, although it is only about 1/5 as strong as that of being hyperactive.

> lm(score ~ noise, data=a)
Coefficients:
(Intercept)    noiseLow
     164.68      -11.72

We can also look at the effects of both explanatory variables at the same time:

> lm(score ~ group+noise, data=a)
Coefficients:
     (Intercept)  grouphyperactive          noiseLow
          193.65            -57.95            -11.71

Perhaps surprisingly, the coefficients on each variable are identical in the two-variable model and in the one-variable models. It's easy to understand why this is: the assignment of noise level was made to be orthogonal to group membership. This was done by blocking on the group.
This orthogonality shows up in the ANOVA report; the order of the variables makes no difference:

> summary(aov(score ~ group + noise, data=a))
             Df Sum Sq Mean Sq F value    Pr(>F)
group         1 335762  335762 481.521 < 2.2e-16
noise         1  13724   13724  19.682 1.186e-05
Residuals   397 276826     697

> summary(aov(score ~ noise + group, data=a))
             Df Sum Sq Mean Sq F value    Pr(>F)
noise         1  13724   13724  19.682 1.186e-05
group         1 335762  335762 481.521 < 2.2e-16
Residuals   397 276826     697

The orthogonality was achieved by making the study balanced: the assignment to noise levels was split evenly within each of the two groups. This can be seen in a simple table of the number of cases with each combination of noise level and group.

> table(a$group, a$noise)
              High Low
  control      100 100
  hyperactive  100 100

There are 400 cases altogether, evenly split among the four different possible combinations of noise level and group membership.
Look back at the ANOVA reports, paying particular attention to the degrees of freedom. Note that each of the variables, group and noise, contributes one degree of freedom. This makes sense because each of those variables has two levels, and the number of degrees of freedom is one less than the number of levels (since there is redundancy with the 1s vector). But the table shows that there are four different explanatory conditions: normal with low noise, normal with high noise, hyperactive with low noise, and hyperactive with high noise. Seen this way, it seems that there should be three degrees of freedom in the model (since there are 4 different explanatory conditions).
What's happened to the 3rd degree of freedom in the model?
The answer is that there is nothing in the model score ~ group + noise to reflect the possibility that noise might affect the different groups differently. Here is a boxplot showing the score for each of the four different explanatory conditions:
[Figure: boxplots of score for the four group-by-noise combinations]
Notice that for the hyperactive subjects, high noise is associated with lower scores, while for the control subjects the opposite is true.
This is an interaction between group and noise. An interaction term, in general, describes a situation where the association of a response variable with one explanatory variable depends on the level of another explanatory variable.
An interaction term, in the R-language syntax, is specified using a colon. Here is the model that includes the interaction between group and noise:

> lm(a$score ~ a$group + a$noise + a$group:a$noise)
Coefficients:
                  (Intercept)             a$grouphyperactive
                       210.65                         -91.94
                   a$noiseLow  a$grouphyperactive:a$noiseLow
                       -45.71                          67.99

The fourth coefficient gives the amount to be added for hyperactive kids who are in a low-noise environment. Adding up the coefficients appropriately for the four levels of the crossed factors gives the same model values as the four group means we encountered in the t-tests.
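To check this arithmetic, here is a small sketch (the model object m and the name cf are introduced just for this sketch; the coefficients are the ones reported above):
> m = lm(a$score ~ a$group + a$noise + a$group:a$noise)
> cf = coef(m)
> cf[1]                           # control, high noise: 210.65
> cf[1] + cf[2]                   # hyperactive, high noise: 210.65 - 91.94
> cf[1] + cf[3]                   # control, low noise: 210.65 - 45.71
> cf[1] + cf[2] + cf[3] + cf[4]   # hyperactive, low noise: all four coefficients added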
A shorthand for including the main effects of both factors along with their interaction is to use the multiplication
symbol: a$score ~ a$group*a$noise
In constructing the model matrix, each of the main effects in the model corresponds to one or more vectors. The interaction term is constructed by taking the products of the vectors that arise from the levels of the main effects.
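As a small illustration (these indicator vectors are made up for the sketch, not taken from the noise data), the interaction vector is just the elementwise product of the two main-effect indicator vectors:
> hyper = c(0, 0, 1, 1)   # indicator vector: 1 = hyperactive
> low   = c(0, 1, 0, 1)   # indicator vector: 1 = low noise
> hyper * low             # interaction vector: 1 only where both indicators are 1
[1] 0 0 0 1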
Interaction effects apply not just to nominal variables, but to quantitative variables as well. For example, in the hotdog data, we might be interested in constructing a model of calories as a function of hotdog type and of the sodium content:
> lm(hotdogs$cals ~ hotdogs$type + hotdogs$sodium)
Coefficients:
   (Intercept)  hotdogs$typeM  hotdogs$typeP  hotdogs$sodium
       80.0990        -7.3516       -45.4590          0.1913
The interpretation of these coefficients is that a B-type hotdog has model value 80.0990 + 0.1913 × sodium, an M-type hotdog has 80.0990 − 7.3516 + 0.1913 × sodium, and so on. Notice that the effect of one unit of sodium is the same for all types of hotdogs. This is because we have not included an interaction term in the model. Here is the model with the interaction terms included:
> lm(hotdogs$cals ~ hotdogs$type * hotdogs$sodium)
Coefficients:
                 (Intercept)                 hotdogs$typeM
                    78.19241                       4.93594
               hotdogs$typeP                hotdogs$sodium
                   -53.72468                       0.19608
hotdogs$typeM:hotdogs$sodium  hotdogs$typeP:hotdogs$sodium
                    -0.02956                       0.01741
The additional coefficients describe how the effect of sodium
differs for different types of hotdogs. For example, for B-type hotdogs the model value is 78.19241 + 0.19608 × sodium
while for M-type hotdogs it is 78.19241 + 4.93594 + (0.19608 −
0.02956) × sodium. (Not all of these digits have significance;
see Exercise 5. We’ve included all the digits here, in violation
of good scientific practice, to make sure that you can see at
a glance where each number is coming from in the R-stats
report.)
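To see where these sums come from, here is a small sketch that extracts the coefficients from the fitted model and adds them up for the M-type hotdogs (hmod and cf are names introduced just for this sketch; the hotdogs data frame is assumed to be loaded as above):
> hmod = lm(hotdogs$cals ~ hotdogs$type * hotdogs$sodium)
> cf = coef(hmod)
> cf['(Intercept)'] + cf['hotdogs$typeM']                    # M-type intercept: 78.19241 + 4.93594
> cf['hotdogs$sodium'] + cf['hotdogs$typeM:hotdogs$sodium']  # M-type sodium slope: 0.19608 - 0.02956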
Logistic Regression
It often happens that the response variable we are interested
in is nominal rather than quantitative. We want to explain
the level of the response variable for each case as a function
of the values or levels of explanatory variables. For example,
in the Whickham mortality/smoking/age data we would like
to explain the nominal mortality outcome as a function of the
quantitative age data and the nominal smoking data.
A simple way to organize such models is to ask how the probability of each level of the response variable depends on the explanatory variables. When the response variable is binomial, taking on only two values, say “yes” and “no,” the probability representation is very simple: just give the probability of the “yes” outcome. (The “no” outcome will, of course, have probability one minus this.)
For example, in the Whickham data, we might ask how the probability of a subject having died by the time of the follow-up study depends on the subject's age and smoking status. We intuitively expect this probability to be an
increasing function of age — older people are more likely to
have died than younger people.
The accepted approach to fitting such probability-of-outcome models for a two-level nominal variable is logistic
regression. The technology behind logistic regression is quite
advanced, but there is a relatively simple interface that allows
you to construct logistic regression models in much the same
way as linear models. Here is the command for finding the
probability of death as a function of age in the Whickham
data.
> w = read.csv('Whickham.csv')
> mod1 = glm(w$outcome ~ w$age, family='binomial')
The argument family='binomial' is always the same for logistic modeling; it instructs the glm program to construct
a logistic model. It’s important to note that the w$outcome
variable has just two levels; the logistic regression always gives
the probability of the second level.
> levels(w$outcome)
[1] "Alive" "Dead"
The second level is “Dead.”
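As a side note (not part of the original lab): if you wanted the model to give the probability of the other level, you could reorder the factor's levels, for example with relevel. The sketch below works on a copy so that the rest of the lab is unaffected:
> o2 = relevel(w$outcome, ref='Dead')   # a copy with "Dead" moved to the first position
> levels(o2)
[1] "Dead"  "Alive"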
The fitted model looks superficially like the linear models
we have already constructed:
> summary(mod1)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8951  -0.5539  -0.2294   0.4278   3.2291

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.401106   0.396563  -18.66   <2e-16
w$age        0.121829   0.006832   17.83   <2e-16
There are p-values on each of the coefficients. Some hint
of the differences between logistic regression and linear regression is given by the term “Deviance Residuals” in the
report. In linear regression we use the term “Residual” to
refer to the difference between the measured value and the
model value. But in logistic regression this doesn’t make
much sense. Although the model is giving us the probability of dying, the measured value is not a probability; it is just
“Alive” or “Dead.” The fitting done in logistic regression sets
the parameters to maximize the “likelihood” of the data, that
is p(data|model parameters).
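As a rough illustration (a sketch added here, not part of the original lab), that likelihood can be computed on the log scale from the fitted probabilities and the observed outcomes:
> p = mod1$fitted.values              # fitted probability of "Dead" for each subject
> dead = (w$outcome == 'Dead')
> sum(log(ifelse(dead, p, 1 - p)))    # log-likelihood; should agree with logLik(mod1)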
It’s a bit hard to interpret directly the numerical values
of the parameters in a logistic regression model. From the
above example, the parameter 0.122 on age indicates that the
probability of dying increases with age. It's more directly useful to plot out the fitted probability of dying as a function of age. To do this, we need to extract the fitted values from the model:
> plot(w$age, mod1$fitted.values, ylim=c(0,1),
+      ylab='p(dying)', xlab='Age')
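The fitted probabilities can also be computed directly from the coefficients, since the logistic model gives p = 1/(1 + exp(-(intercept + slope × age))). Here is a small sketch; the age of 60 is just a hypothetical value:
> cf = coef(mod1)
> plogis(cf[1] + cf[2] * 60)   # fitted probability of death at age 60; roughly 0.48 with the coefficients above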
We can similarly fit a model of mortality determined by smoking status:
> mod2 = glm(w$outcome ~ w$smoker,
+            family='binomial')
Since smoking is a nominal variable, a boxplot of the fitted probability is appropriate:
> boxplot(mod2$fitted ~ w$smoker)
Actually, the same information could be had just by making an appropriate table:
> prop.table(table(w$outcome, w$smoker), 2)
             No       Yes
  Alive 0.6857923 0.7611684
  Dead  0.3142077 0.2388316
A more meaningful model comes from combining smoking and age as explanatory variables:
> mod3 = glm(w$outcome ~ w$age + w$smoker,
+            family='binomial')
> plot(w$age, mod3$fitted, pch='.')
For plotting these data, we'll define some plotting symbols to use for the living and dead subjects: “+” for living and “.” for dead.
> symbols = c('+','.')[w$smoker]
> text(w$age, mod3$fitted, symbols)
In specifying the form of the logistic model, you can include interaction terms, change quantitative variables to factors, and use all the other “tricks” we have already seen in linear models.
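For instance, a logistic model with an interaction between age and smoking status could be specified like this (mod4 is a name introduced just for this sketch):
> mod4 = glm(w$outcome ~ w$age * w$smoker, family='binomial')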
Exercise 1
Construct a linear model for the noise.csv data to study the
effect of noise level and hyperactivity on the response variable
score. Use lm to find the coefficients on each of the model
terms. Explain in layman’s language what the relationship is
between test score and the noise and group variables.
Exercise 2
Is there an interaction between smoking and age in the Whickham data? Make a plot of the probability of dying as a function of age for the smoker and non-smoker groups. Think creatively about why the interaction might exist and give your
hypothesis.
Exercise 3
For the I-95 data, construct a logistic model of the probability of ticket vs warning as a function of excess speed. Do
sex and/or race have an influence? Is there any interaction
among the various terms?
Exercise 4
For the test-score vs hyperactivity and noise data described in the text, explain the difference between the following two models.

> lm(a$score ~ a$group:a$noise - 1)
Coefficients:
a$groupcontrol:a$noiseHigh  a$grouphyperactive:a$noiseHigh
                     210.6                           118.7
 a$groupcontrol:a$noiseLow   a$grouphyperactive:a$noiseLow
                     164.9                           141.0

> lm(a$score ~ a$group*a$noise - 1)
Coefficients:
    a$groupcontrol             a$grouphyperactive
            210.65                         118.71
        a$noiseLow  a$grouphyperactive:a$noiseLow
            -45.71                          67.99

Exercise 5
1. For the hotdog data, calculate an approximate 95% confidence interval on each of the coefficients in the model
> lm(hotdogs$cals ~ hotdogs$type * hotdogs$sodium)
(Hint: you can use summary to get a report of the standard error of each coefficient.)
2. Is there any substantial evidence that sodium is associated with calorie level in the hotdog data? Is there any evidence that the effect of sodium depends on the type of hotdog?

Exercise 6
In the data in the file simple.csv, write out by hand the model vectors that are relevant to each of these models:
value ~ one
value ~ two
value ~ one + two
value ~ one:two
value ~ one*two

Exercise 7: Case Study
Look at the GSS data set. Construct a logistic model to determine what factors are associated with support for the death penalty. Since the GSS data includes people who answered “Don't know,” we'll exclude these people:
> a = read.csv('GSSdata.csv')
> dk = a$CAPPUN=='DK'
> b = a[!dk, ]
Now, create a simple yes-or-no variable to reflect support for the death penalty:
> b$dp = b$CAPPUN=='FAVOR'
Note that we have added a dp component to the b table. With this, we can construct models using whatever variables seem appropriate. For example:
> mod1 = glm(dp ~ POSTLIFE+SEX, data=b, family='binomial')
A positive coefficient indicates increased probability of support for the death penalty.
A simple way to get an idea of how much of the variation in support for the death penalty is captured by the model is to make a histogram of the fitted values:
> hist(mod1$fitted)
It also can be helpful to plot out the fitted values as a function of another variable. This helps us see both the trends and the variability. Since the GSS data includes only nominal variables, the plot should be in the form of a box plot. In making the box plot, we need to deal with the missing data in the GSS data. This can be done as follows:
> boxplot(mod1$fit ~ b$RELIG[-mod1$na.action])
The contents of the square brackets instruct R to leave out the cases for which the model could not be fitted due to missing data.
Before looking at the data, frame some hypotheses. What variables do you think, based on conventional wisdom and your knowledge of the body politic, will explain whether a given person supports the death penalty or not? Detail your hypotheses, for example saying whether you expect men or women to be more likely to support the death penalty. Your hypotheses might include interactions between variables.
Once you have written down your hypotheses, test them. Explain quantitatively whether there is support in the data for your hypotheses. Indicate whether the data are inconclusive or whether they point in the same or opposite direction as your hypotheses. Support your argument quantitatively, taking into account p-values, the strength of the effect, and the context that one variable creates for another.
Take one hypothesis for which the model coefficients indicated an interesting relationship, but for which the p-values were insignificant. Using the resampling technique from Lab 7, estimate how much data you would need to collect (assuming the new data were just like the old data) in order to establish a significant effect. For instance, to see what might have happened if you had collected 100 samples, look at the p-value from:
> modnew = glm(dp ~ POSTLIFE+SEX, data=resample(b,100),
+              family='binomial')
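The resample function is the one from Lab 7. If it is not already loaded, a minimal version consistent with how it is used here might look like the following (this is an assumption about Lab 7's definition, not a quotation of it):
> # draw n rows from a data frame, with replacement
> resample = function(data, n) data[sample(nrow(data), n, replace=TRUE), ]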