University of North Carolina
Chapel Hill
Soci252-002 Data Analysis in Sociological Research
Spring 2013
Professor François Nielsen
Class Activity 21 – Regression With a Categorical Variable (Back to Melanoma
and Latitude)
Introduction: The USmelanoma Data Set Again
We come back to the USmelanoma data set discussed by Everitt and Hothorn (2010) to illustrate
the use of an indicator variable. Units are the mainland states of the U.S. and the District
of Columbia. Variables are mortality, the mortality rate due to malignant melanoma of the
skin for white males during the period 1950–1969; latitude (degrees North) and longitude
(degrees West) of the geographical center of each state; and ocean (no, yes), a dichotomous
variable indicating contiguity to an ocean.
The following R commands load the HSAUR2 package and the data set and use the head
function to list the first six cases. Alternatively, if you don’t have the HSAUR2 package installed,
you can uncomment the read.csv() command below to read the data directly from the course
site.
> library(HSAUR2)
> data(USmelanoma)
> # alternatively, uncomment the following line to read data from the course site
> # USmelanoma <- read.csv("http://www.unc.edu/%7Enielsen/soci252/activities/USmelanoma.csv")
> head(USmelanoma)
            mortality latitude longitude ocean
Alabama           219     33.0      87.0   yes
Arizona           160     34.5     112.0    no
Arkansas          170     35.0      92.5    no
California        182     37.5     119.5   yes
Colorado          149     39.0     105.5    no
Connecticut       159     41.8      72.8   yes
> attach(USmelanoma)
Regression I
In Activity 3 we found a strong negative relationship between mortality and latitude: the
farther North the state, the lower the mortality due to melanoma among white males.
Now we estimate a multiple regression model by adding longitude and ocean to the
regression model. ocean is a categorical variable with two categories (no, yes). With many
statistical programs, to represent this variable in a regression model we would have to explicitly
create an indicator that is 1 for states contiguous to an ocean and 0 for those that are not. We
do not have to do this in R, as the program automatically creates the indicator we need. The
regression is run as follows.
> melan1 <- lm(mortality ~ latitude + longitude + ocean)
> summary(melan1)
Call:
lm(formula = mortality ~ latitude + longitude + ocean)

Residuals:
    Min      1Q  Median      3Q     Max
-30.171 -10.070  -2.607   8.781  41.805

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 349.2369    27.0596  12.906  < 2e-16 ***
latitude     -5.4950     0.5289 -10.390 1.55e-13 ***
longitude     0.1219     0.1732   0.704 0.485245
oceanyes     21.7976     5.2263   4.171 0.000137 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.48 on 45 degrees of freedom
Multiple R-squared:  0.7721,    Adjusted R-squared:  0.7569
F-statistic: 50.83 on 3 and 45 DF,  p-value: 1.7e-14
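To see the indicator that R creates for ocean, one option is to inspect the design matrix with model.matrix(). The sketch below is only an illustration: the data frame six is a hypothetical name holding the six cases copied from the head(USmelanoma) listing, so that the example runs without the HSAUR2 package.

```r
# First six cases, copied from the head(USmelanoma) listing
# (the name `six` is ours, used for illustration only)
six <- data.frame(
  mortality = c(219, 160, 170, 182, 149, 159),
  latitude  = c(33.0, 34.5, 35.0, 37.5, 39.0, 41.8),
  longitude = c(87.0, 112.0, 92.5, 119.5, 105.5, 72.8),
  ocean     = factor(c("yes", "no", "no", "yes", "no", "yes")),
  row.names = c("Alabama", "Arizona", "Arkansas",
                "California", "Colorado", "Connecticut")
)

# model.matrix() returns the columns lm() would use for this formula;
# the factor ocean appears as a single 0/1 column named oceanyes
X <- model.matrix(mortality ~ latitude + longitude + ocean, data = six)
print(X)
```

The column oceanyes takes its name from the factor (ocean) and the level coded 1 (yes), which is why the coefficient table labels the row oceanyes rather than ocean.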
Problems
(1) Using the symbols X1, X2 and X3 for latitude, longitude, and oceanyes, respectively,
write down the prediction equation for ŷ.
(2) For b2 and b3 test the hypothesis H0: βj = 0 against the alternative HA: βj ≠ 0.
(3) What is the substantive meaning of the estimated coefficient b3 (the coefficient of oceanyes)?
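As a reminder of how the columns of the coefficient table relate to one another, the following sketch recomputes the t value and two-sided p-value for the latitude row (b1, which the problems above do not ask about) from the printed estimate and standard error. The variable names are ours, and because the inputs are the rounded values from the table, the results agree with the printed ones only up to rounding.

```r
# Values read off the latitude row of summary(melan1)
est <- -5.4950   # Estimate
se  <- 0.5289    # Std. Error
df  <- 45        # residual degrees of freedom

# t value = estimate divided by its standard error
t_value <- est / se

# Two-sided p-value from the t distribution with 45 df
p_value <- 2 * pt(-abs(t_value), df)

print(t_value)
print(p_value)
```

The same two steps apply to any other row of the table.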
Regression II
We saw that in Regression I longitude is a washout, but the coefficient of ocean is highly
significant, representing a positive shift in the intercept of the regression line for states contiguous
to an ocean compared to those that are not. However, we may wonder whether contiguity to an
ocean affects the slope of the relationship of mortality with latitude. In other words, is there
an interaction of contiguity with latitude? To check this possibility we include the ocean ×
latitude interaction as the term ocean:latitude, after dropping longitude from the model,
as follows.
> # drop longitude, add ocean x latitude interaction
> melan2 <- lm(mortality ~ latitude + ocean + ocean:latitude)
> summary(melan2)
Call:
lm(formula = mortality ~ latitude + ocean + ocean:latitude)

Residuals:
    Min      1Q  Median      3Q     Max
-30.457 -11.307  -1.881   9.318  44.377

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       360.549460  35.498266  10.157 3.19e-13 ***
latitude           -5.485286   0.874315  -6.274 1.22e-07 ***
oceanyes           20.650094  43.987853   0.469    0.641
latitude:oceanyes  -0.005534   1.101391  -0.005    0.996
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.57 on 45 degrees of freedom
Multiple R-squared:  0.7696,    Adjusted R-squared:  0.7543
F-statistic: 50.11 on 3 and 45 DF,  p-value: 2.171e-14
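To see how R encodes the ocean:latitude term, one can again look at the design matrix. This is a minimal sketch, assuming a small hypothetical data frame six that holds the six cases from the head(USmelanoma) listing; it shows the columns R builds, not a fit to the full data.

```r
# First six cases, copied from the head(USmelanoma) listing
# (the name `six` is ours, used for illustration only)
six <- data.frame(
  mortality = c(219, 160, 170, 182, 149, 159),
  latitude  = c(33.0, 34.5, 35.0, 37.5, 39.0, 41.8),
  ocean     = factor(c("yes", "no", "no", "yes", "no", "yes")),
  row.names = c("Alabama", "Arizona", "Arkansas",
                "California", "Colorado", "Connecticut")
)

# Design matrix for the Regression II formula
X2 <- model.matrix(mortality ~ latitude + ocean + ocean:latitude, data = six)
print(X2)

# Compare the interaction column with latitude times the 0/1 indicator
same <- all.equal(unname(X2[, "latitude:oceanyes"]),
                  six$latitude * (six$ocean == "yes"))
print(same)
```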
Problems
(4) Using the symbols X1, X3 and X4 for latitude, oceanyes and latitude:oceanyes, respectively,
write down the estimated prediction equation for ŷ. Decompose X4 further as a function
of X1 and X3.
(5) Test the null hypothesis that β4 = 0.
(6) Compare the Adjusted R-squared values for Regression I and Regression II. What does this
comparison suggest about the usefulness of including the latitude:oceanyes interaction term
in the model?
Graphical Representation
We can represent the full complexity of a model with both an indicator and a quantitative
explanatory variable using the scatterplot function provided by the car package. This is done
as follows.
> library(car)
> # plot separate regression lines by ocean
> scatterplot(mortality ~ latitude + ocean + ocean:latitude, smooth=FALSE,
+     col=c("blue", "red"), legend.coords="bottomleft")
[Plot: mortality (vertical axis) against latitude (horizontal axis), with points and regression lines shown separately for ocean = no and ocean = yes.]
Figure 1: Scatterplot of state mortality rate from skin melanoma among white males by latitude
and contiguity to an ocean.
The plot is shown in Figure 1.
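The idea of separate regression lines by ocean can also be checked numerically. The sketch below assumes a small hypothetical data frame six holding only the six cases from the head(USmelanoma) listing, so the numbers it prints are not those of the full data set. It shows that the interaction model implies one line for each ocean group, and that these coincide with simple regressions run within each group.

```r
# First six cases, copied from the head(USmelanoma) listing
# (the name `six` is ours; this is NOT the full data set)
six <- data.frame(
  mortality = c(219, 160, 170, 182, 149, 159),
  latitude  = c(33.0, 34.5, 35.0, 37.5, 39.0, 41.8),
  ocean     = factor(c("yes", "no", "no", "yes", "no", "yes"))
)

# Interaction model, same formula as Regression II
fit <- lm(mortality ~ latitude + ocean + ocean:latitude, data = six)
b <- coef(fit)

# Implied line for ocean = no: baseline intercept and slope
line_no <- c(unname(b["(Intercept)"]), unname(b["latitude"]))
# Implied line for ocean = yes: both are shifted by the oceanyes terms
line_yes <- c(unname(b["(Intercept)"] + b["oceanyes"]),
              unname(b["latitude"] + b["latitude:oceanyes"]))

# Simple regressions fitted separately within each group
sep_no  <- coef(lm(mortality ~ latitude, data = subset(six, ocean == "no")))
sep_yes <- coef(lm(mortality ~ latitude, data = subset(six, ocean == "yes")))

print(rbind(line_no, line_yes))
print(all.equal(line_no, unname(sep_no)))
print(all.equal(line_yes, unname(sep_yes)))
```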
Problems
(7) On the basis of Figure 1 explain the relationship between mortality from melanoma among
white males and latitude and contiguity to an ocean. Make sure to comment on the presence (or
not) of an interaction between latitude and contiguity to an ocean. Also comment on whether
we can view the relationship between mortality and latitude as causal.