Handout

Log-linear modelling of contingency tables: an example using R
Peter Craig (December 2012, revised December 2016)
Contents
1 Data
2 Fitting models
3 Finding out more about a fitted model
4 Selecting a model
5 Interpretation of data/model
An R script containing largely the same commands as used in this handout is available as
LOGC.R from the handouts section in DUO.
1 Data
The data used here refer to the passengers on the Titanic which sank on its maiden voyage in
1912. The data are available in R as Titanic:
> data(Titanic)
> Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

This is a four-dimensional contingency table (p = 4) and the four variables, together with their
levels, are:
> dimnames(Titanic)
$Class
[1] "1st"  "2nd"  "3rd"  "Crew"

$Sex
[1] "Male"   "Female"

$Age
[1] "Child" "Adult"

$Survived
[1] "No"  "Yes"

Now this is clearly not a random sample of people and it is also fairly clear that the interest
probably lies in how the probability of survival relates to the other variables, i.e. we would
want to treat Survived as the response [1]. Nevertheless, it is interesting to understand what
statistical model adequately describes the data.
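Before fitting anything, it can help to look at a few marginal tables. The following is a minimal sketch using only base R; the object names (n, by_survival, by_class_survival) are illustrative and do not appear in LOGC.R.

```r
# Titanic ships with base R (package 'datasets'), so no extra packages are needed.
data(Titanic)

n <- sum(Titanic)                                    # total number of people in the table
by_survival <- margin.table(Titanic, 4)              # one-way margin over Survived
by_class_survival <- margin.table(Titanic, c(1, 4))  # two-way margin: Class by Survived

print(n)
print(by_survival)
# Proportion surviving / not surviving within each Class (rows sum to 1)
print(round(prop.table(by_class_survival, 1), 2))
```

The dimension numbers passed to margin.table follow the order shown by dimnames(Titanic): 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.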
2 Fitting models
It’s easy to fit any particular hierarchical log-linear model to the data. For the independence
model, we do
> library(MASS)
> indep = loglm(~Class+Sex+Age+Survived, data=Titanic)
> indep
Call:
loglm(formula = ~Class + Sex + Age + Survived, data = Titanic)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 1243.663 25        0
Pearson          1637.445 25        0
or
> loglm(~1+2+3+4, data=Titanic)
Call:
loglm(formula = ~1 + 2 + 3 + 4, data = Titanic)

Statistics:
                      X^2 df P(> X^2)
Likelihood Ratio 1243.663 25        0
Pearson          1637.445 25        0
which differ only in the way the variables are specified: by name or by (dimension) number.
Two arguments (pieces of information) are being provided to the function: a “formula” and the
name of the data-set. The bit after ~ in the “formula” is a short-hand way of describing the set
B defining the log-linear model; terms in the model are separated by + in the formula. In the
notation of the lectures this particular example specifies
B = {{i}, {j}, {k}, {l}}
or equivalently
B = {{i_1}, {i_2}, {i_3}, {i_4}}
[1] In Epiphany term you will learn about logistic regression, which is a tool specifically designed for situations
with a binary response variable, but here we will apply log-linear modelling.
However, it’s clearly easier to use names or numbers of the variables rather than the indices.
So we might write this as
B = {Class, Sex, Age, Survived}
or
B = {1, 2, 3, 4}
We also need to be able to specify terms involving more than one variable. In R, the explicit way to do this is to use a colon (:) to join variables together. So to include the term
{i, j} (or i:j) of B in the model in R, we would replace ~1+2+3+4 by ~1+2+3+4+1:2
or ~Class+Sex+Age+Survived by ~Class+Sex+Age+Survived+Class:Sex.
R has some further short-hand to reduce the typing:
• Class*Sex is equivalent to writing Class+Sex+Class:Sex
• Class*Sex*Age is equivalent to writing
Class+Sex+Age+Class:Sex+Class:Age+Sex:Age+Class:Sex:Age
• (Class+Sex+Age)^2 is equivalent to writing
Class+Sex+Age+Class:Sex+Class:Age+Sex:Age, i.e. include all terms involving up to 2 of the variables but omit any term involving more than 2.
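One way to check what a short-hand formula expands to, without fitting anything, is to ask R for the term labels of the formula. This is a small sketch using the base functions terms and attr; the helper name expand is made up for illustration.

```r
# Hypothetical helper: list the individual terms that a formula expands to.
expand <- function(f) attr(terms(f), "term.labels")

print(expand(~ Class*Sex))
print(expand(~ Class*Sex*Age))
print(expand(~ (Class + Sex + Age)^2))

# The short-hand and the long-hand versions give the same set of terms
same_terms <- identical(expand(~ Class*Sex),
                        expand(~ Class + Sex + Class:Sex))
print(same_terms)
```

No data are needed here because terms only manipulates the formula symbolically.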
So, there are often many ways to specify any particular model, some involving less typing than
others. For example, the full/saturated model for our data could be fitted by
> loglm(~Class*Sex*Age*Survived, data=Titanic)
Call:
loglm(formula = ~Class * Sex * Age * Survived, data = Titanic)

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson          NaN  0        1

or by

> loglm(~(Class+Sex+Age+Survived)^4, data=Titanic)
Call:
loglm(formula = ~(Class + Sex + Age + Survived)^4, data = Titanic)

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson          NaN  0        1
3 Finding out more about a fitted model
R initially provides very little information when we fit a log-linear model. It tells us what the
model was and what data were used; it also reports the results of two hypothesis tests. Both
compare the fitted model to the saturated model and test the null hypothesis that the fitted model
is correct. One is the likelihood-ratio test and the other is based on Pearson’s Σ (O − E)^2 / E
statistic, where the O values are the observed data and the E values their expected values under
the fitted model. Both test statistics are approximately chi-squared under the null hypothesis, and
R also reports the degrees of freedom and the P-value for each test. However, all this does is give
some measure of the quality of fit of the model, in the sense that it tells us whether we
might prefer the saturated model. In our example, the P-values for the independence model are
reported as zero, i.e. the data strongly conflict with the independence model.
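For the independence model, both statistics can be reproduced by hand, which may help to make the output less mysterious. The sketch below uses only base R and assumes that the expected counts under independence are n times the product of the one-way marginal proportions; the names O, E, p, lrt, pearson and df are illustrative.

```r
data(Titanic)
O <- unclass(Titanic)   # observed counts as a plain 4-dimensional array
n <- sum(O)

# One-way marginal proportions for each of the four variables
p <- lapply(1:4, function(d) as.numeric(apply(O, d, sum)) / n)

# Expected counts under independence: n times the product of the margins
E <- n * outer(outer(outer(p[[1]], p[[2]]), p[[3]]), p[[4]])

pearson <- sum((O - E)^2 / E)
# Cells with O = 0 contribute nothing to the likelihood-ratio statistic
lrt <- 2 * sum(ifelse(O > 0, O * log(O / E), 0))

# Degrees of freedom: (number of cells - 1) minus the free main-effect parameters
df <- prod(dim(O)) - 1 - sum(dim(O) - 1)

print(c(LR = lrt, Pearson = pearson, df = df))
print(pchisq(c(lrt, pearson), df, lower.tail = FALSE))   # both P-values are tiny
```

The degrees of freedom work out as 32 − 1 − (3 + 1 + 1 + 1) = 25, matching the df column reported by loglm.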
What else can we get from the fitted model? One idea is to estimate the various “β” tables:
> coef(indep)
$`(Intercept)`
[1] 3.015185

$Class
       1st        2nd        3rd       Crew
-0.4115541 -0.5428901  0.3642359  0.5902083

$Sex
      Male     Female
 0.6518609 -0.6518609

$Age
    Child     Adult
-1.477264  1.477264

$Survived
        No        Yes
 0.3699295 -0.3699295

Observe the constraint in action: each one-dimensional table of β values sums to zero.
Recall from lectures that β values lead to odds and odds-ratios if we handle them correctly. For
example, in the independence model, the odds of surviving versus not surviving are
exp([−0.3699] − [0.3699]) = 0.477
which means that roughly half as many people survived as died. In the independence model,
survival probability is not related to the other variables.
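As a quick check, the same odds can be computed directly from the one-way Survived margin of the data. This is a sketch in base R; the β values are copied from the coef(indep) output above and the variable names are illustrative.

```r
data(Titanic)

beta_no  <-  0.3699295    # Survived = No,  copied from coef(indep)$Survived
beta_yes <- -0.3699295    # Survived = Yes, copied from coef(indep)$Survived
odds_from_beta <- exp(beta_yes - beta_no)

surv <- margin.table(Titanic, 4)   # counts of No and Yes
odds_from_counts <- as.numeric(surv["Yes"]) / as.numeric(surv["No"])

print(c(from_beta = odds_from_beta, from_counts = odds_from_counts))
```

Under the independence model the two agree (up to rounding of the printed coefficients) because the fitted one-way margins equal the observed ones.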
We can also obtain the full table of “expected values”, n p̂_{i_1...i_p}, by using the fitted function
in R:
> round(fitted(indep), 1)
Re-fitting to get fitted values
, , Age = Child, Survived = No

      Sex
Class   Male Female
  1st    8.6    2.3
  2nd    7.5    2.0
  3rd   18.6    5.1
  Crew  23.3    6.3

, , Age = Child, Survived = Yes

      Sex
Class   Male Female
  1st    4.1    1.1
  2nd    3.6    1.0
  3rd    8.9    2.4
  Crew  11.1    3.0

, , Age = Adult, Survived = No

      Sex
Class   Male Female
  1st  164.5   44.7
  2nd  144.2   39.2
  3rd  357.3   97.0
  Crew 447.8  121.6

, , Age = Adult, Survived = Yes

      Sex
Class   Male Female
  1st   78.5   21.3
  2nd   68.8   18.7
  3rd  170.5   46.3
  Crew 213.7   58.0

If we preferred to look at probabilities, we could divide everything here by n which can be
computed in R as sum(Titanic).
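The fitted values and the β tables are two views of the same thing: in the independence model, the log of each expected count is the intercept plus one β from each one-dimensional table. The sketch below rebuilds the expected counts from the coefficients printed by coef(indep) above; the vector names (b0, b_class and so on) are illustrative.

```r
# Coefficients copied from the coef(indep) output
b0      <- 3.015185
b_class <- c("1st" = -0.4115541, "2nd" = -0.5428901,
             "3rd" = 0.3642359, "Crew" = 0.5902083)
b_sex   <- c(Male = 0.6518609, Female = -0.6518609)
b_age   <- c(Child = -1.477264, Adult = 1.477264)
b_surv  <- c(No = 0.3699295, Yes = -0.3699295)

# log(expected count) = intercept + sum of the relevant main effects
log_mu <- b0 + outer(outer(outer(b_class, b_sex, "+"), b_age, "+"), b_surv, "+")
mu <- exp(log_mu)

print(round(mu["1st", "Male", "Child", "No"], 1))    # compare with the fitted table
print(round(mu["Crew", "Male", "Adult", "No"], 1))   # compare with the fitted table
print(sum(mu))                                       # close to n
```

Small discrepancies from the fitted table are to be expected because the coefficients above are rounded.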
Before moving to trying to find a better model, let’s just add a couple of arbitrary terms to the
model to see what happens:
> bigger = loglm(~Class+Sex+Age+Survived+Class:Sex+Sex:Survived,
+                data=Titanic)
> coef(bigger)
$`(Intercept)`
[1] 2.795941

$Class
       1st        2nd        3rd       Crew
-0.0936285 -0.2530613  0.5777889 -0.2310991

$Sex
      Male     Female
 0.6248212 -0.6248212

$Age
    Child     Adult
-1.477264  1.477264

$Survived
         No         Yes
 0.07711381 -0.07711381

$Class.Sex
      Sex
Class        Male     Female
  1st  -0.5569168  0.5569168
  2nd  -0.4030550  0.4030550
  3rd  -0.1868803  0.1868803
  Crew  1.1468522 -1.1468522

$Sex.Survived
        Survived
Sex              No        Yes
  Male    0.5792937 -0.5792937
  Female -0.5792937  0.5792937

You see that we get extra “β” tables corresponding to the extra terms in the model. Notice that each
row and column of the extra tables sums to zero and also that the one-dimensional tables have
changed as a result of adding terms to the model. If we trusted this model, we could read off
from the Sex.Survived table a strong dependence of survival on the sex of the passenger,
with females having much better survival prospects than males: the log-odds-ratio of survival
versus death for females versus males is [0.579 − (−0.579)] − [(−0.579) − 0.579] = 2.316, so
that the odds-ratio is exp(2.316) ≈ 10: the odds of survival were approximately 10 times better
for females than males.
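The odds-ratio arithmetic can be repeated in R and compared with the raw Sex by Survived margin. This is a sketch in base R; the value b is copied from the Sex.Survived table above. In this particular model Survived interacts only with Sex, so the model-based odds-ratio should agree with the one computed from the two-way margin, up to rounding.

```r
data(Titanic)

b <- 0.5792937                       # Male/No entry of the Sex.Survived table
# (Female,Yes - Female,No) - (Male,Yes - Male,No) on the log scale
log_or <- (b - (-b)) - ((-b) - b)    # equals 4 * b
or_model <- exp(log_or)

sex_surv <- margin.table(Titanic, c(2, 4))   # 2 x 2 table: Sex by Survived
or_counts <- (sex_surv["Female", "Yes"] / sex_surv["Female", "No"]) /
             (sex_surv["Male", "Yes"]   / sex_surv["Male", "No"])

print(c(model = or_model, counts = or_counts))
```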
4 Selecting a model
We can ask R to:
• do forwards selection or backwards selection or both;
• start from a model of our choice;
• use AIC (the default) or BIC.
The key function is step. We provide step with the model we want to start with and (optionally) an additional argument called scope, which controls whether R does forwards or backwards selection or both, and also what range of models is to be considered. This is quite a complex area,
so I am only going to show you how to get R to do both, starting from the independence model.
Here is the result using AIC:
> model.aic = step(indep, scope=list(lower=~., upper=~.^4))
Start:  AIC=1257.66
~Class + Sex + Age + Survived

                 Df     AIC
+ Sex:Survived    1  825.19
+ Class:Sex       3  851.06
+ Class:Survived  3 1082.76
+ Class:Age       3 1115.34
+ Sex:Age         1 1236.38
+ Age:Survived    1 1240.10
<none>              1257.66

Step:  AIC=825.19
~Class + Sex + Age + Survived + Sex:Survived

                 Df     AIC
+ Class:Sex       3  418.59
+ Class:Survived  3  650.29
+ Class:Age       3  682.87
+ Sex:Age         1  803.91
+ Age:Survived    1  807.63
<none>               825.19
- Sex:Survived    1 1257.66

Step:  AIC=418.59
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex

                 Df    AIC
+ Class:Age       3 276.27
+ Class:Survived  3 318.52
+ Sex:Age         1 397.31
+ Age:Survived    1 401.03
<none>              418.59
- Class:Sex       3 825.19
- Sex:Survived    1 851.06

Step:  AIC=276.27
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex +
    Class:Age

                 Df    AIC
+ Class:Survived  3 176.19
+ Age:Survived    1 267.26
+ Sex:Age         1 272.18
<none>              276.27
- Class:Age       3 418.59
- Class:Sex       3 682.87
- Sex:Survived    1 708.73

Step:  AIC=176.19
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex +
    Class:Age + Class:Survived

                     Df    AIC
+ Class:Sex:Survived  3 117.01
+ Age:Survived        1 152.61
+ Sex:Age             1 172.10
<none>                  176.19
- Class:Survived      3 276.27
- Class:Age           3 318.52
- Class:Sex           3 507.97
- Sex:Survived        1 533.83

Step:  AIC=117.01
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex +
    Class:Age + Class:Survived + Class:Sex:Survived

                     Df     AIC
+ Age:Survived        1  93.428
+ Sex:Age             1 112.925
<none>                  117.011
- Class:Sex:Survived  3 176.191
- Class:Age           3 259.338

Step:  AIC=93.43
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex +
    Class:Age + Class:Survived + Age:Survived +
    Class:Sex:Survived

                     Df     AIC
+ Class:Age:Survived  3  70.222
<none>                   93.428
+ Sex:Age             1  95.292
- Age:Survived        1 117.011
- Class:Sex:Survived  3 152.608
- Class:Age           3 241.778

Step:  AIC=70.22
~Class + Sex + Age + Survived + Sex:Survived + Class:Sex +
    Class:Age + Class:Survived + Age:Survived +
    Class:Sex:Survived + Class:Age:Survived

                     Df     AIC
<none>                   70.222
+ Sex:Age             1  71.953
- Class:Age:Survived  3  93.428
- Class:Sex:Survived  3 129.402

At each step, R reports the current model and its AIC value; it then produces a table listing
possible changes to the model and their AIC values. The possible changes are limited by the
requirement that any model fitted be hierarchical (and also by the scope argument). R uses
AIC defined as
AIC = Deviance + 2p
where, as described in lectures, Deviance = 2(L̂_full − L̂) is twice the difference between
the maximum of the log-likelihood for the full model and the maximum of the log-likelihood
for the model being considered. As usual, p is the number of (free) parameters in the model
and we are interested in models with low AIC. R also reports the number of parameters added
or removed for each model change in the Df column. The process stops when all the possible
changes to the current model increase AIC. Otherwise the change giving the lowest AIC is
accepted and the process recommences with the new model.
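The starting AIC value can be checked by hand from numbers already seen. The sketch below takes the deviance of the independence model to be the likelihood-ratio statistic reported when the model was fitted (1243.663) and counts p as one intercept plus (number of levels − 1) for each variable; the variable names are illustrative.

```r
deviance_indep <- 1243.663    # likelihood-ratio statistic for the independence model
levels_per_var <- c(Class = 4, Sex = 2, Age = 2, Survived = 2)

# Free parameters: intercept + (levels - 1) for each main effect
p <- 1 + sum(levels_per_var - 1)

aic_indep <- deviance_indep + 2 * p
print(p)
print(round(aic_indep, 2))    # compare with "Start: AIC=..." in the step output
```

The same bookkeeping explains the Df column: adding Sex:Survived adds (2 − 1)(2 − 1) = 1 parameter, while adding Class:Sex adds (4 − 1)(2 − 1) = 3.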
To get R to use BIC we need to change 2p in the definition of AIC to p log n, where n is the
total number of observations. In R, we do:
> model.bic = step(indep, scope=list(lower=~., upper=~.^4),
+                  k=log(sum(Titanic)))
Start:  AIC=1297.54
~Class + Sex + Age + Survived

                 Df     AIC
+ Sex:Survived    1  870.77
+ Class:Sex       3  908.03
+ Class:Survived  3 1139.73
+ Class:Age       3 1172.30
+ Sex:Age         1 1281.95
+ Age:Survived    1 1285.68
<none>              1297.54
.
.
.
Step:  AIC=217.19
~Survived + Class + Sex + Age + Class:Sex + Class:Age + Sex:Age +
    Survived:Sex + Survived:Class + Survived:Age + Class:Sex:Age +
    Survived:Class:Sex + Survived:Class:Age

                     Df    AIC
<none>                  217.19
+ Survived:Sex:Age    1 223.20
- Survived:Class:Age  3 238.32
- Survived:Class:Sex  3 269.32

I have omitted most of the output as it is very similar to the AIC version. Note that R continues
to use the term AIC even though we are actually computing BIC.
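The starting BIC value can be checked in the same way as the AIC value, replacing the penalty 2p by p log n. This is a sketch using the deviance (1243.663) and parameter count (p = 7) of the independence model; the variable names are illustrative.

```r
data(Titanic)

n <- sum(Titanic)             # total number of observations
deviance_indep <- 1243.663    # likelihood-ratio statistic for the independence model
p <- 7                        # intercept + 3 + 1 + 1 + 1 main-effect parameters

bic_indep <- deviance_indep + p * log(n)
print(round(bic_indep, 2))    # compare with "Start: AIC=..." in the BIC run

# Penalty per parameter under each criterion
print(c(AIC = 2, BIC = log(n)))
```

Since log n is much larger than 2 here, BIC penalises extra parameters more heavily than AIC does.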
5 Interpretation of data/model
Fundamentally, from the perspective of survival as the response variable, the structure was the
same using AIC or BIC and whether or not we required the full interaction between Class, Age
and Sex to be in the model.
Survival is linked to Class:Sex and to Class:Age but the full model is not preferred. This means
that there is a somewhat complex dependence of Survival on all three variables. Technically,
the way the dependence of survival on Class is modified by Sex (as measured by odds-ratios) does
not involve Age, and vice versa. To get a deeper insight, we would need to spend more time studying
conditional independence structures.
In principle, we would like to look at the estimated “βs” in order to gain more insight. That
actually fails for this example as some of the estimated probabilities p̂_{i_1 i_2 i_3 i_4} are zero,
making some of the β values infinite; in such cases R does not provide any numbers when we
use the coef function.
However we can at least look at the conditional probability of survival given the other variables
and think about what it means:
> phat.bic = fitted(rmodel.bic)/sum(Titanic)
Re-fitting to get fitted values
> round(prop.table(phat.bic, c(1,2,3))[,,,"Yes"], 2)
, , Age = Child

      Sex
Class  Male Female
  1st  1.00   1.00
  2nd  1.00   1.00
  3rd  0.22   0.53
  Crew  NaN    NaN

, , Age = Adult

      Sex
Class  Male Female
  1st  0.33   0.97
  2nd  0.08   0.86
  3rd  0.17   0.45
  Crew 0.22   0.87

Several things are clear from this table: 3rd class passengers did not fare well, females fared
better than males and children better than adults.
I have to say that it’s not clear that the log-linear analysis has delivered much more information
than just calculating % survival from the original data. However, in general it’s a powerful tool
for simplifying the presentation of multi-dimensional contingency tables.
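For comparison, the raw proportions surviving can be computed from the original data with the same prop.table call, applied to Titanic itself rather than to fitted probabilities. This is a sketch in base R; the name raw is illustrative.

```r
data(Titanic)

# Proportion surviving within each Class x Sex x Age combination
raw <- prop.table(Titanic, c(1, 2, 3))[, , , "Yes"]
print(round(raw, 2))
```

Cells with no people at all (children among the crew) come out as NaN, just as in the model-based table.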
There’s one final feature of this example which further complicates matters: some of the entries
in the original data are what are known as “structural zeros”. By definition, there were no
children who were crew. The analysis should in principle be adapted to take this into account.
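One possible way to respect structural zeros, sketched here only for the independence model, is the start argument of the base R function loglin: cells set to zero in start stay at zero in the fitted table. This is an illustrative sketch, not the analysis used in this handout; the names start0 and fit0 are made up, and, as I understand the loglin documentation, the reported df is not adjusted for structural zeros, so it would need correcting by hand.

```r
data(Titanic)

# Starting table: 1 everywhere except the impossible cells (crew who are children)
start0 <- array(1, dim = dim(Titanic), dimnames = dimnames(Titanic))
start0["Crew", , "Child", ] <- 0

# Main-effects-only model (margins 1, 2, 3, 4), keeping the zero cells fixed at zero
fit0 <- loglin(Titanic, margin = list(1, 2, 3, 4), start = start0,
               fit = TRUE, iter = 50, print = FALSE)

print(fit0$fit["Crew", , "Child", ])   # fitted counts for the structural-zero cells
print(fit0$lrt)                        # likelihood-ratio statistic of this fit
```

With the zero cells excluded, this is no longer the plain independence model, so its statistics are not directly comparable with those of indep.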