330.Lect28 - Department of Statistics

Stats 330: Lecture 28
© Department of Statistics 2012
STATS 330 Lecture 28: Slide 1
Plan of the day
In today’s lecture we apply Poisson regression to
the analysis of contingency tables having 3 or 4
dimensions.
Topics
– Types of independence
– Connection between independence and interactions for
3 and 4 dimensional tables
– Hierarchical models
– Graphical models
– Examples
Reference: Coursebook, sections 5.3, 5.3.1
Example: the Florida murder data
• 326 convicted murderers in Florida, 1976-1977
• Classified by
– Death penalty (n/y)
– Victim's race (black/white)
– Defendant's race (black/white)
Contingency table
                      Victim's race
                   Black            White
Defendant's     Death Penalty    Death Penalty
race             Yes     No       Yes     No
Black             13    195        23    105
White              1     19        39    265
Getting the data into R
murder.df<-data.frame(expand.grid(
defendant=c("b","w"),dp = c("y","n"),
victim=c("b","w")),
counts=c(13,1,195,19,23,39,105,265))
Note the use of the function expand.grid:
expand.grid(defendant=c("b","w"),
dp = c("y","n"),victim=c("b","w"))
  defendant dp victim
1         b  y      b
2         w  y      b
3         b  n      b
4         w  n      b
5         b  y      w
6         w  y      w
7         b  n      w
8         w  n      w
The data frame
  defendant dp victim counts
1         b  y      b     13
2         w  y      b      1
3         b  n      b    195
4         w  n      b     19
5         b  y      w     23
6         w  y      w     39
7         b  n      w    105
8         w  n      w    265
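As a quick check that the data frame matches the contingency table, the counts can be cross-tabulated again. This is a minimal sketch; the object name murder.tab is an illustrative choice and is not part of the lecture code.

```r
# Rebuild the data frame exactly as entered above.
murder.df <- data.frame(
  expand.grid(defendant = c("b", "w"), dp = c("y", "n"), victim = c("b", "w")),
  counts = c(13, 1, 195, 19, 23, 39, 105, 265))

# Cross-tabulate the counts by the three factors (murder.tab is a hypothetical name).
murder.tab <- xtabs(counts ~ defendant + dp + victim, data = murder.df)

# ftable() flattens the 3-way array: one row per defendant race,
# columns indexed by victim race and death penalty.
ftable(murder.tab, row.vars = "defendant", col.vars = c("victim", "dp"))
```

The printed layout should have the same row and column structure as the contingency table, which makes data-entry slips easy to spot.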
Questions
• Is death penalty independent of race?
• What is the role of victim’s race?
• What does “independent” mean when we
have 3 factors?
Types of independence
• Suppose we have 3 criteria (factors) A, B
and C
• In a multinomial sampling context, let
pijk = Pr(A=i, B=j, C=k)
• Various forms of independence may be of
interest. These can be expressed in terms
of the probabilities pijk
Marginal probabilities
$$\{A = i, B = j\} = \bigcup_k \{A = i, B = j, C = k\}$$
$$\Rightarrow\; P(A = i, B = j) = \sum_k P(A = i, B = j, C = k) = \sum_k p_{ijk} = p_{ij\cdot}$$
Marginal probabilities (ii)
$$\{A = i\} = \bigcup_{j,k} \{A = i, B = j, C = k\}$$
$$\Rightarrow\; P(A = i) = \sum_j \sum_k P(A = i, B = j, C = k) = \sum_j \sum_k p_{ijk} = p_{i\cdot\cdot}$$
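The marginal sums can be computed directly in R. The sketch below uses the Florida counts with A = defendant, B = dp and C = victim (an arbitrary labelling chosen here for illustration) and estimates the cell probabilities by the observed proportions.

```r
# Florida data as entered earlier in the lecture.
murder.df <- data.frame(
  expand.grid(defendant = c("b", "w"), dp = c("y", "n"), victim = c("b", "w")),
  counts = c(13, 1, 195, 19, 23, 39, 105, 265))

n <- xtabs(counts ~ defendant + dp + victim, data = murder.df)
p <- n / sum(n)                    # estimated cell probabilities p_ijk

# Two-way margin: sum over the third index k.
p.ij <- apply(p, c(1, 2), sum)

# One-way margin: sum over both j and k.
p.i <- apply(p, 1, sum)

p.ij
p.i
```

Summing the rows of p.ij gives p.i again, which is the second slide's double sum done in two steps.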
All three factors independent
• This can be expressed as
$$p_{ijk} = \Pr(A = i, B = j, C = k) = \Pr(A = i)\,\Pr(B = j)\,\Pr(C = k) = p_{i\cdot\cdot}\,p_{\cdot j\cdot}\,p_{\cdot\cdot k}$$
One factor independent of the other two
• This can be expressed as
$$p_{ijk} = \Pr(A = i, B = j, C = k) = \Pr(A = i)\,\Pr(B = j, C = k) = p_{i\cdot\cdot}\,p_{\cdot jk}$$
Two factors conditionally independent, given a third
This can be expressed as
$$\Pr(A = i, B = j \mid C = k) = \Pr(A = i \mid C = k)\,\Pr(B = j \mid C = k)$$
Equivalent to
$$p_{ijk} = \frac{p_{i\cdot k}\,p_{\cdot jk}}{p_{\cdot\cdot k}}$$
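A small numerical check may help fix the idea of conditional independence. The probabilities below are made-up values chosen purely for illustration (they are not from the lecture): a 2 x 2 x 2 table is built so that A and B are independent given C, the product identity is verified, and A and B are shown to be dependent marginally.

```r
# Hypothetical ingredients: Pr(C = k), Pr(A = i | C = k), Pr(B = j | C = k).
pC   <- c(0.4, 0.6)
pA.C <- cbind(c(0.2, 0.8), c(0.7, 0.3))   # rows index i, columns index k
pB.C <- cbind(c(0.5, 0.5), c(0.1, 0.9))   # rows index j, columns index k

# p_ijk = Pr(A=i | C=k) Pr(B=j | C=k) Pr(C=k), so A and B are
# conditionally independent given C by construction.
p <- array(0, dim = c(2, 2, 2))
for (k in 1:2) p[, , k] <- outer(pA.C[, k], pB.C[, k]) * pC[k]

# Margins that appear in the identity.
p.ik <- apply(p, c(1, 3), sum)   # sum over j
p.jk <- apply(p, c(2, 3), sum)   # sum over i
p.k  <- apply(p, 3, sum)         # sum over i and j

# Right-hand side of the identity, slice by slice.
rhs <- array(0, dim = c(2, 2, 2))
for (k in 1:2) rhs[, , k] <- outer(p.ik[, k], p.jk[, k]) / p.k[k]
isTRUE(all.equal(p, rhs))        # the identity holds

# Marginally, A and B are not independent in this table.
p.ij <- apply(p, c(1, 2), sum)
isTRUE(all.equal(p.ij, outer(rowSums(p.ij), colSums(p.ij))))
```

The second check fails because collapsing over C mixes two different conditional distributions, so conditional independence does not imply marginal independence.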
Parameterising tables of Poisson means
with main effects and interactions
• Recall (see Slides 22 and 23 of lecture 19) that
in ordinary 3-way ANOVA we split up the table
of means into main effects and interactions:
$$\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk} + (\alpha\gamma)_{ik} + (\alpha\beta\gamma)_{ijk}$$
• We can do exactly the same thing with the
logs of the Poisson means in a 3-way table:
$$\log(\mu_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk} + (\alpha\gamma)_{ik} + (\alpha\beta\gamma)_{ijk}$$
Parameterising tables of probabilities with
main effects and interactions
• Corresponding multinomial probabilities are
given by the model
$$\log(p_{ijk} / p_{111}) = \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk} + (\alpha\gamma)_{ik} + (\alpha\beta\gamma)_{ijk}$$
Parameterising tables with main
effects and interactions (cont)
• The interactions are related to the different forms of
independence:
– if all the interactions are zero, then the 3 factors are
mutually independent
– If the ABC and AB interactions are zero, then A and B are
independent, given C
– If the ABC, AB and AC interactions are zero, then A is
independent of B and C
• Using the connection between multinomial and
Poisson sampling, we can test for the various types
of independence by fitting a Poisson regression with
A, B and C as explanatory variables, and testing for
interactions.
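The testing idea above can be sketched with the Florida data entered earlier. Each independence hypothesis corresponds to one Poisson regression, and its residual deviance is compared with a chi-squared distribution. The names fits and gof, and the choice of which factor plays the role of A, are illustrative assumptions.

```r
# Florida data as entered earlier in the lecture.
murder.df <- data.frame(
  expand.grid(defendant = c("b", "w"), dp = c("y", "n"), victim = c("b", "w")),
  counts = c(13, 1, 195, 19, 23, 39, 105, 265))

# One Poisson regression per independence hypothesis.
fits <- list(
  # all three factors mutually independent
  mutual     = glm(counts ~ victim + defendant + dp,
                   family = poisson, data = murder.df),
  # dp independent of (victim, defendant) jointly
  dp.vs.rest = glm(counts ~ dp + victim * defendant,
                   family = poisson, data = murder.df),
  # defendant and dp conditionally independent, given victim
  cond.indep = glm(counts ~ victim * defendant + victim * dp,
                   family = poisson, data = murder.df))

# Residual deviance, its degrees of freedom, and the chi-squared p-value.
gof <- t(sapply(fits, function(f) c(
  deviance = deviance(f),
  df       = df.residual(f),
  p.value  = pchisq(deviance(f), df.residual(f), lower.tail = FALSE))))
round(gof, 4)
```

A small p-value means the interactions left out of that model are needed, so the corresponding form of independence is rejected.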
Summary of independence
models
• All 3 factors independent in
multinomial model
– Equivalent to all interactions zero in
Poisson model
– Poisson Model is count ~ A + B + C
• A independent of B and C
– Equivalent to all interactions between A
and the others zero
– Poisson Model is count ~ A + B + C + B:C
or count ~ A + B*C
Summary of independence
models (cont)
• Factors A and B conditionally independent
given C
– Equivalent to all interactions containing both A
and B zero
– Poisson Model is
count ~ A + B + C + A:C + B:C
– Equivalent to count ~ A*C + B*C
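That the long and short spellings describe the same model can be checked by expanding each formula into its terms. A minimal sketch follows; term.set is a hypothetical helper written for this illustration, and no data are needed because only the formulas are inspected.

```r
# Hypothetical helper: expand a model formula into a sorted set of term labels.
term.set <- function(f) {
  labs <- attr(terms(f), "term.labels")
  # Sort the factor names inside each interaction so that, for example,
  # "C:B" and "B:C" are treated as the same term.
  sort(vapply(strsplit(labs, ":", fixed = TRUE),
              function(v) paste(sort(v), collapse = ":"),
              character(1)))
}

long  <- term.set(count ~ A + B + C + A:C + B:C)
short <- term.set(count ~ A * C + B * C)
identical(long, short)

# The model with A independent of B and C, for comparison.
term.set(count ~ A + B * C)
```

The `*` operator always adds the main effects along with the interaction, which is why the short forms are automatically hierarchical.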
Analysis strategy: Florida data
• We will fit some models to this data and
investigate the pattern of independence/
dependence between the factors.
• We fit a maximal model, and then
investigate suitable submodels.
The analysis
> maximal.glm<-glm(counts~victim*defendant*dp, family=poisson,
     data=murder.df)
> anova(maximal.glm, test="Chisq")
                    Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                                    7     774.73
victim               1    64.10         6     710.63 1.183e-15
defendant            1     0.22         5     710.42      0.64
dp                   1   443.51         4     266.90 1.861e-98
victim:defendant     1   254.15         3      12.75 3.230e-57
victim:dp            1    10.83         2       1.92 9.999e-04
defendant:dp         1     1.90         1       0.02      0.17
victim:defendant:dp  1     0.02         0 -2.442e-15      0.89
There is no evidence of an interaction between defendant's race and death
penalty (p = 0.17), nor of a three-factor interaction (p = 0.89). The model
victim*dp + victim*defendant therefore seems appropriate. This is the
conditional independence model: "given the victim's race, defendant's race
and death penalty are independent". The same model is also selected by
stepwise methods and by other anovas.
Testing if the model is adequate
> submodel.glm<-glm(counts~victim*defendant + victim*dp,
     family=poisson, data=murder.df)
> summary(submodel.glm)
Deviance Residuals:
      1        2        3        4        5        6        7        8
 0.0636  -0.2127  -0.0163   0.0525   1.0389  -0.7138  -0.4453   0.2860
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          2.5472     0.2680   9.503  < 2e-16 ***
victimw              0.3635     0.3057   1.189  0.23449
defendantw          -2.3418     0.2341 -10.003  < 2e-16 ***
dpn                  2.7269     0.2759   9.885  < 2e-16 ***
victimw:defendantw   3.2068     0.2567  12.491  < 2e-16 ***
victimw:dpn         -0.9406     0.3081  -3.053  0.00227 **
Testing if the model is adequate (cont)
    Null deviance: 774.7325  on 7  degrees of freedom
Residual deviance:   1.9216  on 2  degrees of freedom
AIC: 56.638
Number of Fisher Scoring iterations: 4
> 1-pchisq(1.9216,2)
[1] 0.3825867
The model seems OK. There is some hint that cell 5 is not well fitted
(it has the largest deviance residual, 1.04).
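The adequacy check can also be read off the fitted object instead of retyping numbers from the summary. The sketch below refits the conditional independence model and lists observed counts, fitted counts and deviance residuals cell by cell; the data frame name check is an illustrative choice.

```r
# Florida data as entered earlier in the lecture.
murder.df <- data.frame(
  expand.grid(defendant = c("b", "w"), dp = c("y", "n"), victim = c("b", "w")),
  counts = c(13, 1, 195, 19, 23, 39, 105, 265))

submodel.glm <- glm(counts ~ victim * defendant + victim * dp,
                    family = poisson, data = murder.df)

# Goodness-of-fit p-value taken from the fitted object.
gof.p <- pchisq(deviance(submodel.glm), df.residual(submodel.glm),
                lower.tail = FALSE)
gof.p

# Observed counts next to fitted counts and deviance residuals.
check <- cbind(murder.df,
               fitted  = round(fitted(submodel.glm), 2),
               dev.res = round(residuals(submodel.glm, type = "deviance"), 4))
check
```

Cells whose deviance residual is large in absolute value are the ones the model reproduces least well.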
Example: the Copenhagen housing data
• 317 apartment residents in Copenhagen
were surveyed on their housing. 3
variables were measured:
– sat: satisfaction with housing (Low, Medium, High)
– cont: amount of contact with other residents (Low, High)
– infl: influence on management decisions (Low, Medium, High)
Copenhagen housing data
      sat    infl  cont  count
      Low     Low   Low     61
   Medium     Low   Low     23
     High     Low   Low     17
      Low  Medium   Low     43
   Medium  Medium   Low     35
     High  Medium   Low     40
      Low    High   Low     26
   Medium    High   Low     18
     High    High   Low     54
9 more lines…
Copenhagen housing data
infl = Low
               cont
sat          Low   High
Low           61     78
Medium        23     46
High          17     43

infl = Med
               cont
sat          Low   High
Low           43     48
Medium        35     45
High          40     86

infl = High
               cont
sat          Low   High
Low           26     15
Medium        18     25
High          54     62
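The lecture does not show how housing.df is created. One possible way, sketched here, is to reuse expand.grid with the counts read off the three tables above. The row order (sat varying fastest, then infl, then cont) and the level label "Medium" for infl are assumptions chosen to match the listing of the first nine rows.

```r
# Counts taken from the three tables above; the ordering is an assumption.
housing.df <- data.frame(
  expand.grid(sat  = c("Low", "Medium", "High"),
              infl = c("Low", "Medium", "High"),
              cont = c("Low", "High")),
  count = c(61, 23, 17, 43, 35, 40, 26, 18, 54,    # cont = Low
            78, 46, 43, 48, 45, 86, 15, 25, 62))   # cont = High

# Cross-tabulate to compare against the three tables.
ftable(xtabs(count ~ infl + sat + cont, data = housing.df))
```

Because expand.grid builds factors with levels in the order given, Low is the baseline level of each factor in any model fitted to this data frame.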
The analysis
> housing.glm<-glm(count~sat*infl*cont, family=poisson,
     data=housing.df)
> anova(housing.glm, test="Chisq")
              Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                             17    166.757
sat            2   26.191        15    140.566 2.054e-06
infl           2   20.040        13    120.526 4.451e-05
cont           1   22.544        12     97.983 2.054e-06
sat:infl       4   75.577         8     22.406 1.504e-15
sat:cont       2    7.745         6     14.661     0.021
infl:cont      2   11.986         4      2.675     0.002
sat:infl:cont  4    2.675         0  9.546e-15     0.614
This time, only the 3-factor interaction is not significant.
This is the "homogeneous association model".
Homogeneous association model
• Association between two factors measured
by sets of odds ratios
$$OR_{ij} = \frac{p_{ij}\,p_{11}}{p_{i1}\,p_{1j}}$$
Copenhagen housing OR's
infl = Low
               cont
sat          Low   High
Low            *      *
Medium         *   1.56
High           *   1.97

infl = Med
               cont
sat          Low   High
Low            *      *
Medium         *   1.15
High           *   1.92

infl = High
               cont
sat          Low   High
Low            *      *
Medium         *   2.40
High           *   1.99

For infl = High the counts are
               cont
sat          Low   High
Low           26     15
Medium        18     25
High          54     62
so the two ORs are computed as 26*25/(18*15) and 26*62/(54*15).
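The observed conditional odds ratios can be reproduced from the counts. This is a sketch only: housing.df is rebuilt from the tables (the row order is an assumption), and cond.or is a hypothetical helper that applies the OR formula with the first row and column as baseline.

```r
# Housing counts read off the tables; the ordering is an assumption.
housing.df <- data.frame(
  expand.grid(sat  = c("Low", "Medium", "High"),
              infl = c("Low", "Medium", "High"),
              cont = c("Low", "High")),
  count = c(61, 23, 17, 43, 35, 40, 26, 18, 54,
            78, 46, 43, 48, 45, 86, 15, 25, 62))

# sat x cont tables, one slice per level of infl.
tab <- xtabs(count ~ sat + cont + infl, data = housing.df)

# Hypothetical helper: odds ratios of a sat x cont table relative to
# the baseline cell (sat = Low, cont = Low).
cond.or <- function(m) (m[-1, 2] * m[1, 1]) / (m[-1, 1] * m[1, 2])

# Rows: sat = Medium, High; columns: infl = Low, Medium, High.
ors <- apply(tab, 3, cond.or)
round(ors, 2)
```

If the homogeneous association model holds, the entries within each row of this matrix should differ only by sampling variation.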
Homogeneous association
model (2)
• For the homogeneous association model,
the conditional odds ratios for A and B (ie
using the conditional distributions of A and
B given C=k) do not depend on k. That is,
the pattern of association between A and
B is the same for all levels of C.
• The common values of the conditional AB log
ORs are estimated by the AB interactions
Estimating conditional OR for
housing data
• The sat-cont conditional log ORs are
estimated by the sat:cont interactions in the
homogeneous association model
> homogen.glm<-glm(count~sat*infl*cont-sat:infl:cont,
     family=poisson, data=housing.df)
> summary(homogen.glm)
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
satMedium:contHigh   0.4157     0.1948   2.134 0.032818 *
satHigh:contHigh     0.6496     0.1823   3.563 0.000367 ***
Estimating conditional OR for
housing data (2)
> 0.4157+c(-1,1)*1.96*0.1948
[1] 0.033892 0.797508
> exp(0.4157+c(-1,1)*1.96*0.1948)
[1] 1.034473 2.220002
> exp(0.4157)
[1] 1.515431
> exp(0.6496)
[1] 1.914775
• For Medium/High, the estimate of the log OR is 0.4157, with std error 0.1948
• 95% CI for the OR is exp(0.4157 +/- 1.96*0.1948),
i.e. (1.034, 2.220)
• Estimated OR for High/High is exp(0.6496) = 1.915
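The same intervals can be computed from the fitted model rather than by hand. The sketch below rebuilds housing.df from the tables (the row order is an assumption), refits the homogeneous association model, and forms Wald intervals with the same 1.96 multiplier; the object names est and ci are illustrative.

```r
# Housing counts read off the tables; the ordering is an assumption.
housing.df <- data.frame(
  expand.grid(sat  = c("Low", "Medium", "High"),
              infl = c("Low", "Medium", "High"),
              cont = c("Low", "High")),
  count = c(61, 23, 17, 43, 35, 40, 26, 18, 54,
            78, 46, 43, 48, 45, 86, 15, 25, 62))

homogen.glm <- glm(count ~ sat * infl * cont - sat:infl:cont,
                   family = poisson, data = housing.df)

# Estimates and standard errors of the sat:cont interactions (log ORs).
est <- coef(summary(homogen.glm))[c("satMedium:contHigh", "satHigh:contHigh"),
                                  c("Estimate", "Std. Error")]

# Wald intervals on the log scale, then exponentiate to the OR scale.
ci <- cbind(OR    = exp(est[, 1]),
            lower = exp(est[, 1] - 1.96 * est[, 2]),
            upper = exp(est[, 1] + 1.96 * est[, 2]))
round(ci, 3)
```

Working on the log scale and exponentiating at the end keeps the interval inside the valid range for an odds ratio.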
4 dimensional tables
• Similar results apply for 4-dimensional
tables
• For example, models for 4 factors A, B, C
and D
– A, B, C and D all independent: A + B + C +D
– A, B independent of C and D: A*B + C*D
– D conditionally independent of C, given A and
B: A*B*C + A*B*D
Hierarchical models
• We will assume that all models are
hierarchical: if the model includes an
interaction with factors A1, … Ak, then it
includes all main effects and interactions
that can be formed from A1, … Ak
• We can represent hierarchical models by
listing these “maximal interactions”
Examples
• 2 factor model A + B + A:B is hierarchical
– Hierarchical notation: [AB]
• 3 factor model A + B + C + A:B is
hierarchical
– Hierarchical notation: [AB][C]
• 3 factor model A + B + C + A:B + A:C is
hierarchical
– Hierarchical notation: [AB][AC]
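The bracket notation translates mechanically into an R model formula, because joining the factors of each maximal interaction with `*` brings in all the lower-order terms. A minimal sketch, where generators.to.formula is a hypothetical helper and "count" is an assumed response name:

```r
# Hypothetical helper: turn a list of maximal interactions (generators),
# e.g. list(c("A", "B"), c("A", "C")) for [AB][AC], into a model formula.
generators.to.formula <- function(gens, response = "count") {
  rhs <- vapply(gens, paste, character(1), collapse = "*")
  as.formula(paste(response, "~", paste(rhs, collapse = " + ")))
}

f <- generators.to.formula(list(c("A", "B"), c("A", "C")))
f

# The terms implied by the formula: main effects plus A:B and A:C.
attr(terms(f), "term.labels")

# [AB][C] gives A, B, C and A:B.
attr(terms(generators.to.formula(list(c("A", "B"), "C"))), "term.labels")
```

Writing models through their generators makes it hard to specify a non-hierarchical model by accident.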
Graphical Models
A way of visualising independence patterns:
• A subset of hierarchical models
• Each factor represented by the vertex of an
“association” graph
• Two vertices connected by edges if they have a
non-zero interaction
• Then
– A vertex not connected to any other vertex is
independent of the other vertices
– two vertices not directly connected are conditionally
independent, given the connecting vertices
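The association graph can be built from the maximal interactions of a hierarchical model. The sketch below uses a hypothetical helper, assoc.graph, which returns an adjacency matrix; missing edges and isolated vertices are then read off it.

```r
# Hypothetical helper: adjacency matrix of the association graph, given
# the maximal interactions (generators) of a hierarchical model.
assoc.graph <- function(gens) {
  v <- sort(unique(unlist(gens)))
  adj <- matrix(FALSE, length(v), length(v), dimnames = list(v, v))
  for (g in gens) {
    if (length(g) > 1) {
      pairs <- combn(g, 2)               # every pair of factors in the term
      for (j in seq_len(ncol(pairs))) {
        adj[pairs[1, j], pairs[2, j]] <- TRUE
        adj[pairs[2, j], pairs[1, j]] <- TRUE
      }
    }
  }
  adj
}

# Model [AB][BC][AD]: edges A-B, B-C and A-D.
adj <- assoc.graph(list(c("A", "B"), c("B", "C"), c("A", "D")))

# Pairs of vertices with no edge between them.
missing <- which(!adj & upper.tri(adj), arr.ind = TRUE)
data.frame(v1 = rownames(adj)[missing[, "row"]],
           v2 = colnames(adj)[missing[, "col"]])

# Model [AB][C]: C has no edges at all.
isolated <- names(which(rowSums(assoc.graph(list(c("A", "B"), "C"))) == 0))
isolated
```

Each missing edge corresponds to a conditional independence statement, and each isolated vertex to a factor independent of all the others.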
Examples
• 2 factor model
A + B + A:B or [AB]
Graph: A -- B
• 3 factor model
A + B + C + A:B or [AB][C]
Graph: A -- B, with C isolated
C independent of A and B
• 3 factor model
A + B + C + A:B + A:C + B:C
or [AB][AC][BC]
Graph: A -- B, A -- C, B -- C
More Examples
• 3 factor model
A + B + C + A:B + A:C
[AB][AC]
Graph: B -- A -- C
• 4 factor model
A + B + C + D + A:B +
A:C + B:C
[AB][AC][BC][D]
Graph: A -- B, A -- C, B -- C, with D isolated
Another Example
4 factor model
A + B + C + D + A:B +
B:C + A:D
[AB][BC][AD]
Graph: D -- A -- B -- C
C and D are conditionally independent given A and B
Another Example
4 factor model
A*B*C + A*B*D
[ABC][ABD]
Graph: A -- B, A -- C, B -- C, A -- D, B -- D (no edge between C and D)
C and D are conditionally independent given A and B