Lecture 25 Generalized Linear Models: Binomial Error Distribution

Introduction
• Count the number of “successes” and “failures”.
• Record the proportion of “successes”.
Examples
Trying to model proportions using linear
models creates problems:
• The errors are not normally distributed.
• The variance is not constant.
• The response is bounded (0 ≤ p ≤ 1).
• By calculating p, information about the
sample size, n, from which the proportion is
calculated is lost.
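The boundedness problem can be illustrated with a small sketch. It fits an ordinary linear model to the proportions from the beetle example used later in this lecture (values copied from that table, with ldose rounded to 2 dp) and predicts at two doses just outside the observed range. The doses 1.60 and 1.95 and the object names are arbitrary choices made here for illustration only.

```r
# Beetle data as tabulated later in this lecture (ldose rounded to 2 dp).
beetle <- data.frame(
  ldose = c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88),
  n     = c(59, 60, 62, 56, 63, 59, 62, 60),
  r     = c(6, 13, 18, 28, 52, 53, 61, 60)
)

# Ordinary least squares on the raw proportions (ignores n and the bounds).
beetle.lm <- lm(r / n ~ ldose, data = beetle)

# Predict at two illustrative doses just outside the observed range.
new.doses <- data.frame(ldose = c(1.60, 1.95))
lm.pred <- predict(beetle.lm, newdata = new.doses)
print(cbind(new.doses, lm.pred))

# TRUE where the fitted "proportion" falls outside [0, 1].
out.of.range <- lm.pred < 0 | lm.pred > 1
print(out.of.range)
```

A straight line has no way of respecting the bounds, which is one reason for working on a transformed scale.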
Transformations
• The arcsine transform, $\sin^{-1}(\sqrt{p})$, stabilizes
the variance for binomial data.
• In bioassays the probit transformation,
$\Phi^{-1}(p)$, was used to linearize the
relationship between % mortality and
log(dose) and to ensure predictions were
bounded by 0 and 1.
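A brief delta-method sketch of why the arcsine transform stabilizes the variance. It assumes $\hat p = r/n$ with $r \sim B(n, p)$, so that $\operatorname{Var}(\hat p) = p(1-p)/n$; the symbol $g$ is introduced here only for this illustration.

```latex
\begin{aligned}
g(p) &= \sin^{-1}\!\left(\sqrt{p}\right), \qquad
g'(p) = \frac{1}{2\sqrt{p(1-p)}}, \\
\operatorname{Var}\{g(\hat p)\} &\approx \{g'(p)\}^{2}\,\operatorname{Var}(\hat p)
 = \frac{1}{4\,p(1-p)} \cdot \frac{p(1-p)}{n} = \frac{1}{4n}.
\end{aligned}
```

To first order the variance on the transformed scale no longer depends on p, only on the sample size n.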
GLM
Specify binomial errors and one of the available
link functions. Several link functions are
available in R for binomial errors.
• Probit link:
$$\Phi^{-1}(p) = \eta$$
This link function is rarely used in modern
statistics due to the availability of the logit
link.
• Logit link:
$$\log_e\left(\frac{p}{1-p}\right) = \eta$$
It gives fitted values that are almost always
very close to those from the probit link and is easier to
interpret. This is the default link function in
R when binomial errors are specified.
• Complementary log-log link:
$$\log_e\left[-\log_e(1-p)\right] = \eta$$
This link function is not symmetrical about
p = 0.5 and is often used in the analysis of
simple dilution assays.
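The three links can be compared numerically. The sketch below evaluates each one on a small grid of probabilities and checks the symmetry property g(p) = −g(1 − p); the grid and the helper name `is.symmetric` are illustrative choices, not part of the lecture.

```r
# Evaluate the three link functions on a grid of probabilities.
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)

links <- data.frame(
  p       = p,
  probit  = qnorm(p),            # Phi^{-1}(p)
  logit   = qlogis(p),           # log(p / (1 - p))
  cloglog = log(-log(1 - p))     # complementary log-log
)
print(round(links, 3))

# A link g is symmetric about p = 0.5 if g(p) = -g(1 - p) for all p.
is.symmetric <- function(g) isTRUE(all.equal(g(p), -g(1 - p)))
sym <- c(
  probit  = is.symmetric(qnorm),
  logit   = is.symmetric(qlogis),
  cloglog = is.symmetric(function(p) log(-log(1 - p)))
)
print(sym)
```

The probit and logit links pass the symmetry check, while the complementary log-log link does not.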
The Model
$$Y = \mu + \varepsilon$$
where
Y is a binomial response variable ∼ B(n, µ),
µ is the probability of “success”, and ε is the error term.
The logit link function is:
$$\eta = X\beta = g(\mu) = \log_e\left(\frac{\mu}{1-\mu}\right) \qquad (1)$$
and hence the proportion of “successes” can be
derived from the linear predictor (1):
$$\mu = \frac{e^{\eta}}{1+e^{\eta}}$$
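As a quick check of equation (1) and its inverse, the following sketch writes both out by hand and compares them with base R's `qlogis()` and `plogis()`. The function names `logit` and `inv.logit` and the values of `eta` are illustrative assumptions.

```r
# Logit link g(mu) and its inverse, written out from equation (1).
logit     <- function(mu)  log(mu / (1 - mu))
inv.logit <- function(eta) exp(eta) / (1 + exp(eta))

eta <- c(-2, 0, 2)          # illustrative linear predictor values
mu  <- inv.logit(eta)
print(mu)

# Round trip, and agreement with base R's built-in versions.
round.trip   <- isTRUE(all.equal(logit(mu), eta))
same.as.base <- isTRUE(all.equal(mu, plogis(eta))) &&
  isTRUE(all.equal(logit(mu), qlogis(mu)))
print(c(round.trip = round.trip, same.as.base = same.as.base))
```

Whatever the value of the linear predictor, the back-transformed µ always lies strictly between 0 and 1.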
Example
The following table shows the number of dead
insects (r) out of each sample (n) after five
hours exposure to various dosages of gaseous
carbon disulphide. (ldose is log10 (dosage) at
various concentrations.)
   ldose   n   r
1   1.69  59   6
2   1.72  60  13
3   1.76  62  18
4   1.78  56  28
5   1.81  63  52
6   1.84  59  53
7   1.86  62  61
8   1.88  60  60
A plot of the proportion killed against log(dose)
suggests a strong association.
[Figure: proportion killed (r/n) plotted against ldose.]
A binomial response is assumed together with a
logit link:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{ldose} \qquad (2)$$
(this is often called the simple logistic
regression model).
beetle.df <- read.table("beetle.txt", header = TRUE)
attach(beetle.df)
# Fit GLM with binomial error and print results.
# The response is the proportion r/n; weights = n supplies the sample sizes.
beetle.glm <- glm(r/n ~ ldose, family = binomial,
                  weights = n, data = beetle.df)
print(anova(beetle.glm, test = "Chisq"))
###########################
Analysis of Deviance Table
Binomial model
Response: r/n
Terms added sequentially (first to last)
      Df Deviance Res Df  Res Dev P
NULL                   7 284.2024
ldose  1 272.9702      6  11.2322 0
Comments:
• With proportion data you need to take
the sample sizes into account, so when the
response is supplied as the proportion r/n
you should include the argument weights=n,
which gives more weight to larger samples.
• However, if the data are stored as two
columns of successes and failures rather
than sample size and successes, then the
weights argument is not required.
# Two-column response: successes (r) and failures (n - r).
count <- with(beetle.df, cbind(r, n - r))
beetle.glm <- glm(count ~ ldose, family = binomial,
                  data = beetle.df)
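The equivalence of the two ways of specifying the response can be checked directly. This sketch enters the data from the table by hand (ldose rounded to 2 dp, so the estimates may differ slightly from the printed output, which was presumably produced from the original file) and compares the two fits; the object names are illustrative.

```r
# Beetle data as tabulated in this lecture (ldose rounded to 2 dp).
beetle <- data.frame(
  ldose = c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88),
  n     = c(59, 60, 62, 56, 63, 59, 62, 60),
  r     = c(6, 13, 18, 28, 52, 53, 61, 60)
)

# Form 1: proportion response with the sample sizes as prior weights.
fit.prop <- glm(r / n ~ ldose, family = binomial,
                weights = n, data = beetle)

# Form 2: two-column (successes, failures) response; no weights needed.
fit.cbind <- glm(cbind(r, n - r) ~ ldose, family = binomial,
                 data = beetle)

print(rbind(proportion = coef(fit.prop), two.column = coef(fit.cbind)))

# Both the coefficients and the residual deviance should agree.
same.fit <- isTRUE(all.equal(coef(fit.prop), coef(fit.cbind))) &&
  isTRUE(all.equal(deviance(fit.prop), deviance(fit.cbind)))
print(same.fit)
```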
Interpreting Results
• The residual deviance (= 11.232) has
an approximate chi-square distribution on 6
df and is not significant (P = 0.08). This
tests the fit of the model. It appears the
model proposed does give an adequate fit to
the data.
• The deviance associated with ldose (=
272.97) has an approximate $\chi^2$ distribution
on 1 df and tests the hypothesis that $\beta_1 = 0$.
It is highly significant (P ≈ 0) indicating
that the coefficient of ldose is significantly
different from zero.
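The two P values quoted above can be reproduced from the printed deviances with `pchisq()`. This is a minimal sketch using the numbers from the Analysis of Deviance table; the variable names are illustrative.

```r
# Deviances and degrees of freedom as printed in the Analysis of Deviance table.
resid.dev <- 11.2322
resid.df  <- 6
ldose.dev <- 272.9702
ldose.df  <- 1

# Upper-tail chi-square probabilities.
p.fit   <- pchisq(resid.dev, df = resid.df, lower.tail = FALSE)
p.ldose <- pchisq(ldose.dev, df = ldose.df, lower.tail = FALSE)
print(c(goodness.of.fit = p.fit, ldose = p.ldose))
```

The first value rounds to the P = 0.08 quoted for the residual deviance; the second is effectively zero.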
The linear predictor has a slope of 34.27
(coefficient of ldose) and an intercept of −60.72.
beetle.sum <- summary(beetle.glm)
print(beetle.sum)
###########################
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -60.72       5.18   -11.7   <2e-16
ldose          34.27       2.91    11.8   <2e-16
(Dispersion parameter for binomial family
taken to be 1)
Null deviance: 284.202 on 7 df
Residual deviance: 11.232 on 6 df
Median Lethal Dose, LD50
The ‘dose’ required to kill half the animals,
which is called the median lethal dose or simply
the LD(50), is often of interest.
For the special case of simple logistic regression
$$\log_e\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_i.$$
The LD(50) is the value of X for which p = 1/2.
When p = 1/2,
$$\log_e\left(\frac{1/2}{1-1/2}\right) = \log_e(1) = 0.$$
Thus, the LD(50) is the value of X for which $\beta_0 + \beta_1 X = 0$,
i.e. $X = -\beta_0/\beta_1$.
This median lethal dose is shown in the generic figure:
[Figure: proportion (ppn) plotted against Dose, for doses from 2 to 10.]
So we see that in this case, the LD50 corresponds to a
dose of about 4.
Sometimes the data may not cover dose values
corresponding to the estimated LD50.
In this case, we may want to estimate, say, an LD25 or an
LD75, or some other value away from the 50% mark. This is a
simple extension of the LD50.
From the graph we can see that the LD25 is below 4
while the LD75 is above 4, as expected.
How do we calculate these values, as a check?
(Note that for the graph shown, the linear predictor is
−20 + 5 × dose.)
LD25:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X$$
Using p = 1/4 gives
$$\ln\left(\frac{1/4}{1-1/4}\right) = \ln\left(\frac{1}{4-1}\right) = -\ln 3 = -20 + 5x$$
and so
$$5x = 20 - \ln 3 = 20 - 1.0986 = 18.9014 \;\Rightarrow\; x = 3.78$$
You should verify that the LD75 is 4.22.
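The same calculation works for any percentile: solving $\ln\{p/(1-p)\} = \beta_0 + \beta_1 x$ for x gives $x = \{\ln(p/(1-p)) - \beta_0\}/\beta_1$. The sketch below wraps this in a small helper (the name `ld` is a hypothetical choice made here) and applies it to the generic figure's linear predictor, −20 + 5 × dose.

```r
# Dose at which a proportion p responds, from logit(p) = b0 + b1 * x.
ld <- function(p, b0, b1) (qlogis(p) - b0) / b1

# Coefficients of the generic figure's linear predictor, -20 + 5 * dose.
ld.values <- sapply(c(LD25 = 0.25, LD50 = 0.50, LD75 = 0.75),
                    ld, b0 = -20, b1 = 5)
print(round(ld.values, 2))
```

This reproduces the LD25 of 3.78 worked out above, an LD50 of 4 and an LD75 of 4.22.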
Example:
Calculate the LD50 associated with the model
in the beetle example.
The log of the dose at which 50% of beetles die
is
$$\mathrm{ldose} = -(-60.72/34.27) = 1.772.$$
This is actually $\log_{10}(\mathrm{dose})$, so
the dose at which 50% mortality is achieved is
$$10^{1.772} = 59.1\ \mathrm{mg\ l^{-1}}.$$
Exercise:
Confirm the predicted values.
Solution:
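For this model
$$\hat\eta = \log_e\left(\frac{\hat p}{1-\hat p}\right) = \hat\beta_0 + \hat\beta_1\,\mathrm{ldose}$$
We can rearrange to find $\hat p$:
$$\hat p = \frac{e^{\hat\eta}}{1+e^{\hat\eta}} = \frac{\exp(\hat\beta_0 + \hat\beta_1\,\mathrm{ldose})}{1+\exp(\hat\beta_0 + \hat\beta_1\,\mathrm{ldose})}$$
So for ldose = 1.69,
$$\hat p = \frac{\exp(-60.7 + 34.27 \times 1.69)}{1+\exp(-60.7 + 34.27 \times 1.69)} = 0.0618/1.0618 = 0.0582$$
Hence the predicted number of dead insects out
of a sample of 59, when ldose = 1.69 is
$$\hat r = \hat p \times n = 0.0582 \times 59 \approx 3.43$$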
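The hand calculation can be repeated for every dose at once. This sketch uses the coefficients as printed in the summary table (−60.72 and 34.27, rounded to 2 dp) and the rounded ldose values from the data table, so the results are only approximate: they will not match either the hand calculation above (which uses −60.7) or R's own fitted values to the last digit.

```r
# Coefficients as printed in the summary table (rounded to 2 dp).
b0 <- -60.72
b1 <- 34.27

# Doses and sample sizes from the data table (ldose rounded to 2 dp).
ldose <- c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88)
n     <- c(59, 60, 62, 56, 63, 59, 62, 60)

eta   <- b0 + b1 * ldose     # linear predictor
p.hat <- plogis(eta)         # exp(eta) / (1 + exp(eta))
r.hat <- n * p.hat           # expected number killed

print(data.frame(ldose, n, p.hat = round(p.hat, 4), r.hat = round(r.hat, 2)))
```

Because the linear predictor is large in magnitude, small rounding differences in the coefficients or in ldose shift the fitted values noticeably, which is why exact agreement should not be expected.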
# Deviance residuals from the fitted model.
residual <- residuals(beetle.glm, type = "deviance")
# Fitted probabilities on the response scale, with standard errors.
beetle.pred <- predict(beetle.glm,
                       se.fit = TRUE, type = "response")
# Expected number killed = fitted probability x sample size.
fit <- beetle.pred$fit * beetle.df$n
print(cbind(beetle.df[, c("n", "r")],
            fit, residual))
######################
   n  r       fit   residual
1 59  6  3.457461  1.2836776
2 60 13  9.841672  1.0596899
3 62 18 22.451378 -1.1961123
4 56 28 33.897635 -1.5941243
5 63 52 50.095821  0.6061406
6 59 53 53.290913 -0.1271582
7 62 61 59.222158  1.2510712
8 60 60 58.742961  1.5939851
Residuals
The standardized residuals given above are
defined by $\operatorname{sign}(y_i - n_i\hat p_i)\sqrt{d_i}$, where
$$d_i = 2\left[\, y_i \log\left(\frac{y_i}{n_i\hat p_i}\right) + (n_i - y_i)\log\left(\frac{n_i - y_i}{n_i - n_i\hat p_i}\right)\right] \qquad (3)$$
that is, they are the signed square roots of the contributions
to the scaled deviance.
For this reason they are often called Deviance Residuals
and are approximately normally distributed for a well
fitting model.
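Equation (3) can be checked against R's own deviance residuals. The sketch below re-enters the data from the table (ldose rounded to 2 dp, so the numbers will differ a little from the printed output), computes $d_i$ by hand and compares the signed square roots with `residuals(..., type = "deviance")`. The helper `xlogx` is a hypothetical name; it applies the usual convention 0 × log(0) = 0, needed for the last sample, where every insect died.

```r
# Beetle data as tabulated in this lecture (ldose rounded to 2 dp).
beetle <- data.frame(
  ldose = c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88),
  n     = c(59, 60, 62, 56, 63, 59, 62, 60),
  r     = c(6, 13, 18, 28, 52, 53, 61, 60)
)
fit <- glm(cbind(r, n - r) ~ ldose, family = binomial, data = beetle)

y     <- beetle$r
n     <- beetle$n
p.hat <- unname(fitted(fit))

# x * log(x / m), with the convention 0 * log(0) = 0.
xlogx <- function(x, m) ifelse(x == 0, 0, x * log(x / m))

# Contributions to the deviance, equation (3), and their signed square roots.
d       <- 2 * (xlogx(y, n * p.hat) + xlogx(n - y, n - n * p.hat))
dev.res <- sign(y - n * p.hat) * sqrt(d)

agree <- isTRUE(all.equal(dev.res,
                          unname(residuals(fit, type = "deviance"))))
print(agree)
print(c(sum.d = sum(d), residual.deviance = deviance(fit)))
```

The contributions $d_i$ add up to the residual deviance, which is why these residuals carry that name.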
We can inspect the residual plots with
plot(beetle.glm). Note, however, that
although the Normal Q-Q plot looks quite
skewed, the Shapiro-Wilk test for normality is
not significant; with such a small sample the
test has little power.
[Figure: Normal Q−Q Plot of the residuals, Sample Quantiles against Theoretical Quantiles.]
print(shapiro.test(residual))
#######################
Shapiro-Wilk normality test
data: residual
W = 0.871, p-value = 0.1543
Pearson Residuals are also standardized
residuals but are defined by
$$r_i = \frac{p_i - \hat p_i}{\sqrt{\hat p_i(1-\hat p_i)/n_i}}. \qquad (4)$$
While they are useful in examining the fit of
the model graphically they are not in general
normally distributed.
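Equation (4) can be verified in the same spirit. This sketch refits the beetle model from the tabulated data (ldose rounded to 2 dp), computes the Pearson residuals by hand and compares them with `residuals(..., type = "pearson")`; the object names are illustrative.

```r
# Beetle data as tabulated in this lecture (ldose rounded to 2 dp).
beetle <- data.frame(
  ldose = c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88),
  n     = c(59, 60, 62, 56, 63, 59, 62, 60),
  r     = c(6, 13, 18, 28, 52, 53, 61, 60)
)
fit <- glm(cbind(r, n - r) ~ ldose, family = binomial, data = beetle)

p.obs <- beetle$r / beetle$n       # observed proportions
p.hat <- unname(fitted(fit))       # fitted proportions

# Pearson residuals, equation (4).
pearson <- (p.obs - p.hat) / sqrt(p.hat * (1 - p.hat) / beetle$n)

agree <- isTRUE(all.equal(pearson,
                          unname(residuals(fit, type = "pearson"))))
print(agree)

# Their sum of squares is the Pearson chi-square statistic.
X2 <- sum(pearson^2)
print(X2)
```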
Example
An experiment is conducted to examine the sex
ratios in different genotypes of insect. Is the
proportion of male offspring produced
the same for different genotypes?
genotype   A   A   A   A   B   B
male       3   8   1   2   0   5
total     17  22  11   9   3  28

genotype   C   C   C   D   D   D   D   D
male       7   6  10   1   0   0   3   4
total     14  14  21  17   8   9  15  12
We will assume a binomial error distribution for
the proportion of males (= male/total) and use
a logit link to fit a model of the form
$$\log_e\left(\frac{p}{1-p}\right) = \mu + \mathrm{genotype}_i.$$
We first draw density plots for the different
genotypes.
[Figure: density plots of the proportion of males (p) for genotypes A, B, C and D.]
ratio <- read.table("sexratio.txt", header = TRUE)
# Proportion of males in each sample.
ratio$p <- ratio$male / ratio$total
# Fit GLM and get Analysis of Deviance Table
# and summary table.
ratio.glm <- glm(p ~ genotype, family = binomial,
                 weights = total, data = ratio)
print(anova(ratio.glm, test = "Chisq"))
print(summary(ratio.glm))
Analysis of Deviance Table
Model: binomial, link: logit Response: male/total
         Df Deviance Res Df Res Dev      P
NULL                     13  32.247
genotype  3   17.737     10  14.510 0.0005
Coefficients:
            Estimate Std.Error z value      P
(Intercept)  -1.1676    0.3060  -3.815 0.0001
genotypeB    -0.4811    0.5763  -0.835 0.4039
genotypeC     1.0450    0.4190   2.494 0.0126
genotypeD    -0.7232    0.4873  -1.484 0.1378
Interpreting Results
• The model appears to provide an adequate
fit (Resid.dev=14.51, df=10, P = 0.15).
• The effect of genotype is highly significant,
(Deviance=17.74, df = 3, P = 0.0005).
• The z tests are an approximate test of the
differences in sex ratio between genotype A
and each of the others. Thus genotype A is
significantly different from genotype C
(p-value = 0.013) but not significantly
different from genotypes B and D (p-values =
0.40 and 0.14, respectively).
• The plots also suggest that C differs
from both B and D, although the z tests above
compare each genotype only with A.
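One way to test C against B and D directly is to refit the model with C as the reference level, so that the z tests compare every other genotype with C. The sketch below enters the data from the table by hand; the column name `geno` and the object names are hypothetical choices made for this illustration.

```r
# Sex-ratio data as tabulated in this example.
ratio <- data.frame(
  genotype = factor(rep(c("A", "B", "C", "D"), times = c(4, 2, 3, 5))),
  male  = c(3, 8, 1, 2,    0, 5,    7, 6, 10,    1, 0, 0, 3, 4),
  total = c(17, 22, 11, 9, 3, 28,   14, 14, 21,  17, 8, 9, 15, 12)
)

# Same model as in the lecture, but with genotype C as the baseline.
ratio$geno <- relevel(ratio$genotype, ref = "C")
ratio.C.glm <- glm(cbind(male, total - male) ~ geno,
                   family = binomial, data = ratio)

# Rows genoA, genoB and genoD now test A, B and D against C.
coefs.C <- summary(ratio.C.glm)$coefficients
print(round(coefs.C, 4))
```

Changing the reference level does not change the fitted model or its deviance; it only changes which pairwise comparisons the coefficient table reports.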
Binomial errors and log–linear models for
contingency tables
In some cases, the binomial glm can be used to
corroborate the results of glm used to model contingency
table data.
Consider the aphid example of Lecture 24.
Since the outcome variable (hole) was binary, we can
formulate the problem as a binomial glm.
Thus we get:
Tree  Aphid  Hole  Total
1     no       35   1785
1     yes      23   1169
2     no      146   1788
2     yes      30    363
Fitting the glm gives:
# binomial error structure
# order is important - taking tree differences into account
# fit tree first
defbin <- read.table("aphidbin.txt", header = TRUE)
defbin$tree <- factor(defbin$tree)
defbin$aphid <- factor(defbin$aphid)
defencebin.glm <- glm(hole/total ~ tree + aphid, data = defbin,
                      family = binomial, weights = total)
print(anova(defencebin.glm, test = "Chisq"))
Analysis of Deviance Table
Model: binomial, link: logit
Response: hole/total
Terms added sequentially (first to last)
      Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      3    110.687
tree   1  110.683         2      0.004 6.944e-26
aphid  1    0.003         1      0.001     0.954
We see that the last two lines in the AOD table are the
same (to 2 dp) as those from the log–linear model using Poisson errors
(multinomial distribution).
See page 246 (Fig 24.5) and page 260 (Fig 25.8) of the
SG.
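The correspondence between the two approaches can be sketched as follows. The data are entered from the table above; the Poisson model formulas are an assumption about how the log–linear model of Lecture 24 was specified (tree × aphid margins fixed, then hole × tree, then hole × aphid), and the names `cells` and `holed` are hypothetical.

```r
# Aphid data from the table above.
defbin <- data.frame(
  tree  = factor(c(1, 1, 2, 2)),
  aphid = factor(c("no", "yes", "no", "yes")),
  hole  = c(35, 23, 146, 30),
  total = c(1785, 1169, 1788, 363)
)

# Binomial GLMs fitted sequentially: tree, then tree + aphid.
bin.tree <- glm(cbind(hole, total - hole) ~ tree,
                family = binomial, data = defbin)
bin.both <- update(bin.tree, . ~ . + aphid)

# The same data as a 2 x 2 x 2 table of counts for the log-linear model.
cells <- rbind(
  data.frame(defbin[c("tree", "aphid")], holed = "yes",
             count = defbin$hole),
  data.frame(defbin[c("tree", "aphid")], holed = "no",
             count = defbin$total - defbin$hole)
)
pois.tree <- glm(count ~ tree * aphid + holed * tree,
                 family = poisson, data = cells)
pois.both <- update(pois.tree, . ~ . + holed:aphid)

# Residual deviances of the corresponding models.
dev.table <- rbind(
  binomial = c(tree = deviance(bin.tree), tree.aphid = deviance(bin.both)),
  poisson  = c(tree = deviance(pois.tree), tree.aphid = deviance(pois.both))
)
print(round(dev.table, 3))
```

With the explanatory margins (tree × aphid) included in the Poisson model, each binomial model has a log–linear counterpart with the same residual deviance.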