Math 155, Spring 2005
Introduction to Statistical Modeling
Daniel Kaplan, Macalester College

Lab 9 – Interaction and Logistic Regression

This lab is about two topics in modeling. Interaction refers to a particular way of structuring models; it is a concept that applies to a wide variety of models. Logistic regression is a form of nonlinear modeling that is particularly useful when the response variable is binomial, e.g., yes or no, alive or dead, success or failure. The two concepts are brought together in this lab partly because we want to cover both of them in the course and partly to show how interaction applies in both linear and nonlinear modeling.

Interaction

The data set noise.csv is from an experiment to study how noise affects the performance of children. In the experiment, second graders were given a test of math problems and their score was recorded. The children were either "normal" or "hyperactive," and the test conditions were either high noise or low noise. Each child took just one test. The test score (score), a quantitative variable, will be the response variable. Group and noise level (group and noise) will be the explanatory variables. Although group obviously could not be manipulated experimentally, the noise level could be. (See Exercise 1.)

To start, we read in the data:

> a = read.csv('noise.csv')

It's often helpful to look at the data graphically. In this case, boxplots of the test score as a function of each explanatory variable are appropriate:

> boxplot(score ~ group, data=a)
> boxplot(score ~ noise, data=a)

[Figure: boxplots of test score by group and by noise level.]

We want to see how the noise level affects the test score and whether hyperactivity plays a role as well. It looks like the hyperactive children score lower, and that high noise levels increase the spread of scores compared to low noise levels but don't alter the median score by very much. We can confirm these visual impressions by constructing linear models:

> lm(score ~ group, data=a)

Coefficients:
     (Intercept)  grouphyperactive
          187.80            -57.95

The negative coefficient on group indicates that being hyperactive is associated with lower scores. The effect is statistically significant. The level of noise also appears to have an effect on the score, although it is only about 1/5 as strong as that of being hyperactive.

> lm(score ~ noise, data=a)

Coefficients:
(Intercept)     noiseLow
     164.68       -11.72

We can also look at the effects of both explanatory variables at the same time:

> lm(score ~ group+noise, data=a)

Coefficients:
     (Intercept)  grouphyperactive          noiseLow
          193.65            -57.95            -11.71

Perhaps surprisingly, the coefficients on each variable are identical in the two-variable model and in the one-variable models. It's easy to understand why this is: the assignment to noise levels was made to be orthogonal to group membership. This was done by blocking on the group. This orthogonality shows up in the ANOVA report; the order of the variables makes no difference:

> summary(aov(score ~ group + noise, data=a))
             Df Sum Sq Mean Sq F value    Pr(>F)
group         1 335762  335762 481.521 < 2.2e-16
noise         1  13724   13724  19.682 1.186e-05
Residuals   397 276826     697

> summary(aov(score ~ noise + group, data=a))
             Df Sum Sq Mean Sq F value    Pr(>F)
noise         1  13724   13724  19.682 1.186e-05
group         1 335762  335762 481.521 < 2.2e-16
Residuals   397 276826     697

The orthogonality was achieved by making the study balanced, evenly splitting the assignment to noise levels for each of the two groups. This can be seen by a simple table of the number of cases with each combination of noise level and group:

> table(a$group, a$noise)

               High  Low
  control       100  100
  hyperactive   100  100

There are 400 cases altogether, evenly split among the four different possible combinations of noise level and group membership.

Look back at the ANOVA reports, paying particular attention to the degrees of freedom. Note that each of the variables, group and noise, contributes one degree of freedom. This makes sense because each of those variables has two levels, and the number of degrees of freedom is one less than the number of levels (since there is redundancy with the 1s vector). But the table indicates that there are four different explanatory conditions: normal with low noise, normal with high noise, hyperactive with low noise, and hyperactive with high noise. Seen this way, it seems that there should be three degrees of freedom in the model (since there are 4 different explanatory conditions).

What's happened to the 3rd degree of freedom in the model? The answer is that there is nothing in the model score ~ group + noise to reflect the possibility that noise might affect the different groups differently.
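If you want to see this concretely, you can inspect the model matrix that lm builds. This is just a quick sketch; the column names reflect R's default treatment coding of the two factors.

> # Sketch: the model matrix for score ~ group + noise has only three
> # columns: the intercept plus one indicator vector for each two-level
> # factor. Three columns means three fitted parameters, so the model
> # cannot assign an arbitrary mean to each of the four conditions.
> mm = model.matrix(score ~ group + noise, data=a)
> dim(mm)        # 400 rows, 3 columns
> colnames(mm)   # "(Intercept)" "grouphyperactive" "noiseLow"
> head(mm)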
Here is a boxplot showing the score for each of the four different explanatory conditions:

[Figure: boxplots of score for the four group-by-noise combinations.]

Notice that for the hyperactive subjects, high noise is associated with lower scores, while for the control subjects the opposite is true. This is an interaction between group and noise. An interaction term, in general, describes a situation where the association of a response variable with one explanatory variable depends on the level of another explanatory variable.

An interaction term, in the R-language syntax, is specified using a colon. Here is the model that includes the interaction between group and noise:

> lm(a$score ~ a$group + a$noise + a$group:a$noise)

Coefficients:
                  (Intercept)             a$grouphyperactive
                       210.65                         -91.94
                   a$noiseLow  a$grouphyperactive:a$noiseLow
                       -45.71                          67.99

The fourth coefficient gives the amount to be added for hyperactive kids who are in a low-noise environment. Adding up the coefficients appropriately for the four levels of the crossed factors gives the same model values as the four group means we encountered in the t-tests.
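A quick way to check this is to compute the four cell means directly from the data and compare them to the sums of the coefficients (a sketch):

> # Sketch: the four cell means computed directly from the data. They should
> # match the coefficient sums, e.g. hyperactive kids in low noise:
> # 210.65 - 91.94 - 45.71 + 67.99 = 140.99
> tapply(a$score, list(a$group, a$noise), mean)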
A shorthand for including the main effects of both factors along with their interaction is to use the multiplication symbol:

a$score ~ a$group*a$noise

In constructing the model matrix, each of the main effects in the model corresponds to one or more vectors. The interaction term is constructed by taking the products of the vectors that arise from each level of the main effects.

Interaction effects apply not just to nominal variables, but to quantitative variables as well. For example, in the hotdog data, we might be interested in constructing a model of calories as a function of hotdog type and of the sodium content:

> lm(hotdogs$cals ~ hotdogs$type + hotdogs$sodium)

Coefficients:
   (Intercept)   hotdogs$typeM   hotdogs$typeP  hotdogs$sodium
       80.0990         -7.3516        -45.4590          0.1913

The interpretation of these coefficients is that a B-type hotdog has 80.0990 + 0.1913 × sodium calories, an M-type hotdog has 80.0990 − 7.3516 + 0.1913 × sodium, and so on. Notice that the effect of one unit of sodium is the same for all types of hotdogs. This is because we have not included an interaction term in the model.

Here is the model including the additional interaction terms:

> lm(hotdogs$cals ~ hotdogs$type * hotdogs$sodium)

Coefficients:
                 (Intercept)                 hotdogs$typeM
                    78.19241                       4.93594
               hotdogs$typeP                hotdogs$sodium
                   -53.72468                       0.19608
hotdogs$typeM:hotdogs$sodium  hotdogs$typeP:hotdogs$sodium
                    -0.02956                       0.01741

The additional coefficients describe how the effect of sodium differs for different types of hotdogs. For example, for B-type hotdogs the model value is 78.19241 + 0.19608 × sodium, while for M-type hotdogs it is 78.19241 + 4.93594 + (0.19608 − 0.02956) × sodium. (Not all of these digits have significance; see Exercise 5. We've included all the digits here, in violation of good scientific practice, to make sure that you can see at a glance where each number is coming from in the R report.)
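To make the comparison concrete, you can evaluate these expressions at a particular sodium level. This is a sketch; the value 400 is just an illustration, not a value taken from the data.

> # Sketch: calories predicted by the interaction model at a hypothetical
> # sodium level of 400, one value per hotdog type.
> sod = 400
> 78.19241 + 0.19608*sod                          # B-type, about 157
> 78.19241 + 4.93594 + (0.19608 - 0.02956)*sod    # M-type, about 150
> 78.19241 - 53.72468 + (0.19608 + 0.01741)*sod   # P-type, about 110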
Logistic Regression

It often happens that the response variable we are interested in is nominal rather than quantitative. We want to explain the level of the response variable for each case as a function of the values or levels of explanatory variables. For example, in the Whickham mortality/smoking/age data we would like to explain the nominal mortality outcome as a function of the quantitative age variable and the nominal smoking variable.

A simple way to organize such models is to ask what the probability of each level of the response variable is as a function of the explanatory variables. When the response variable is binomial, taking on only two values, say "yes" and "no," the probability representation is very simple: just give the probability of the "yes" outcome. (The "no" outcome will, of course, have one minus this probability.) For example, in the Whickham data, we might ask how the probability of a subject having died by the time of the follow-up study depends on the subject's age and smoking status. We intuitively expect this probability to be an increasing function of age: older people are more likely to have died than younger people.

The accepted approach to fitting such probability-of-outcome models for a two-level nominal variable is logistic regression. The technology behind logistic regression is quite advanced, but there is a relatively simple interface that allows you to construct logistic regression models in much the same way as linear models. Here is the command for finding the probability of death as a function of age in the Whickham data:

> w = read.csv('Whickham.csv')
> mod1 = glm(w$outcome ~ w$age, family='binomial')

The argument family='binomial' is always the same for logistic modeling; it instructs the glm program to construct a logistic model. It's important to note that the w$outcome variable has just two levels; the logistic regression always gives the probability of the second level.

> levels(w$outcome)
[1] "Alive" "Dead"

The second level is "Dead." The fitted model looks superficially like the linear models we have already constructed:

> summary(mod1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8951  -0.5539  -0.2294   0.4278   3.2291

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.401106   0.396563  -18.66   <2e-16
w$age        0.121829   0.006832   17.83   <2e-16

There are p-values on each of the coefficients. Some hint of the differences between logistic regression and linear regression is given by the term "Deviance Residuals" in the report. In linear regression we use the term "residual" to refer to the difference between the measured value and the model value, but in logistic regression this doesn't make much sense. Although the model is giving us the probability of dying, the measured value is not a probability; it is just "Alive" or "Dead." The fitting done in logistic regression instead sets the parameters to maximize the "likelihood" of the data, that is, p(data | model parameters).

It's a bit hard to interpret directly the numerical values of the parameters in a logistic regression model. From the above example, the parameter 0.122 on age indicates that the probability of dying increases with age.
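For a single case you can work out the fitted probability by hand: the model combines the coefficients linearly and then passes the result through the inverse logit function, which R supplies as plogis. A sketch (the age of 60 is just an illustrative value):

> # Sketch: fitted probability of death for a hypothetical 60-year-old,
> # computed from mod1's coefficients with the inverse logit (plogis).
> plogis(-7.401106 + 0.121829*60)    # about 0.48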
It's more directly useful to plot out the fitted probability of dying as a function of age. To do this, we need to extract the fitted values from the model:

> plot(w$age, mod1$fitted.values, ylim=c(0,1),
+      ylab='p(dying)', xlab='Age')

We can similarly fit a model of mortality as a function of smoking status:

> mod2 = glm(w$outcome ~ w$smoker,
+      family='binomial')

Since smoking is a nominal variable, a boxplot of the fitted probability is appropriate:

> boxplot(mod2$fitted ~ w$smoker)

Actually, the same information could be had just by making an appropriate table:

> prop.table(table(w$outcome, w$smoker), 2)

                 No       Yes
  Alive   0.6857923 0.7611684
  Dead    0.3142077 0.2388316

A more meaningful model comes from combining smoking and age as explanatory variables:

> mod3 = glm(w$outcome ~ w$age + w$smoker,
+      family='binomial')
> plot(w$age, mod3$fitted, pch='.')

For plotting these data, we'll define some plotting symbols to use for the living and dead subjects: "+" for living and "." for dead.

> symbols = c('+','.')[w$smoker]
> text(w$age, mod3$fitted, symbols)

In specifying the form of the logistic model, you can include interaction terms, change quantitative variables to factors, and use all the other "tricks" we have already seen in linear models.

Exercise 1

Construct a linear model for the noise.csv data to study the effect of noise level and hyperactivity on the response variable score. Use lm to find the coefficients on each of the model terms. Explain in layman's language what the relationship is between test score and the noise and group variables.

Exercise 2

Is there an interaction between smoking and age in the Whickham data? Make a plot of the probability of dying as a function of age for the smoker and non-smoker groups. Think creatively about why the interaction might exist and give your hypothesis.

Exercise 3

For the I-95 data, construct a logistic model of the probability of ticket vs warning as a function of excess speed. Do sex and/or race have an influence? Is there any interaction among the various terms?

Exercise 4

For the test-score vs hyperactivity and noise data described in the text, explain the difference between the following two models.

> lm(a$score ~ a$group:a$noise - 1)

Coefficients:
a$groupcontrol:a$noiseHigh    a$grouphyper:a$noiseHigh
                     210.6                       118.7
 a$groupcontrol:a$noiseLow     a$grouphyper:a$noiseLow
                     164.9                       141.0

> lm(a$score ~ a$group*a$noise - 1)

Coefficients:
               a$groupcontrol             a$grouphyperactive
                       210.65                         118.71
                   a$noiseLow  a$grouphyperactive:a$noiseLow
                       -45.71                          67.99

Exercise 5

1. For the hotdog data, calculate an approximate 95% confidence interval on each of the coefficients in the model

   > lm(hotdogs$cals ~ hotdogs$type * hotdogs$sodium)

   (Hint: you can use summary to get a report of the standard error of each coefficient.)

2. Is there any substantial evidence that sodium is associated with calorie level in the hotdog data? Is there any evidence that the effect of sodium depends on the type of hotdog?

Exercise 6

In the data in the file simple.csv, write out by hand the model vectors that are relevant to each of these models:

value ~ one
value ~ two
value ~ one + two
value ~ one:two
value ~ one*two

Exercise 7: Case Study

Look at the GSS data set. Construct a logistic model to determine what factors are associated with support for the death penalty. Since the GSS data includes people who answered "Don't know," we'll exclude these people:

> a = read.csv('GSSdata.csv')
> dk = a$CAPPUN=='DK'
> b = a[!dk, ]

Now, create a simple yes-or-no variable to reflect support for the death penalty:

> b$dp = b$CAPPUN=='FAVOR'

Note that we have added a dp component to the b table. With this, we can construct models using whatever variables seem appropriate. For example:

> mod1 = glm(dp ~ POSTLIFE+SEX, data=b, family='binomial')

A positive coefficient indicates increased probability of support for the death penalty. A simple way to get an idea of how much of the variation in support for the death penalty is captured by the model is to make a histogram of the fitted values:

> hist(mod1$fitted)

It also can be helpful to plot out the fitted values as a function of another variable. This helps us see both the trends and the variability. Since the GSS data includes only nominal data, the plot should be in the form of a box plot. In making the box plot, we need to deal with the missing data in the GSS data. This can be done as follows:

> boxplot(mod1$fit ~ b$RELIG[-mod1$na.action])

The contents of the square brackets instruct R to leave out the cases for which the model could not be fitted due to missing data.

Before looking at the data, frame some hypotheses. What variables do you think, based on conventional wisdom and your knowledge of the body politic, will explain whether a given person supports the death penalty or not? Detail your hypotheses, for example saying whether you expect men or women to be more likely to support the death penalty. Your hypotheses might include interactions between variables.

Once you have written down your hypotheses, test them. Explain quantitatively whether there is support in the data for your hypotheses. Indicate whether the data are inconclusive or whether they point in the same or the opposite direction as your hypotheses. Support your argument quantitatively, taking into account p-values, the strength of the effect, and the context that one variable creates for another.

Take one hypothesis for which the model coefficients indicated an interesting relationship, but for which the p-values were insignificant. Using the resampling technique from Lab 7, estimate how much data you would need to collect (assuming the new data were just like the old data) in order to establish a significant effect. You can do this using resampling. For instance, to see what might have happened if you had collected 100 samples, look at the p-value from:

> modnew = glm(dp ~ POSTLIFE+SEX, data=resample(b,100), family='binomial')
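The resample function here is the one from Lab 7, not part of base R. If you don't have it handy, a minimal stand-in consistent with how it is used above might look like this sketch:

> # Sketch of a stand-in for Lab 7's resample(): draw n rows from a data
> # frame with replacement. (This is an assumption about what resample
> # does, based on how it is called above.)
> resample = function(data, n) data[sample(nrow(data), n, replace=TRUE), ]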