Compare observed and fitted proportions

Categorize/group the continuous x-variables into intervals.
Calculate the observed rate of success in each interval.
Compare with the estimated proportion of success from the model.
[Figure: Observed and predicted probabilities; prop (0 to 1) vs x (0 to 100)]
See towards the end of f8a.R for an example.
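A minimal sketch of this check in R, using simulated data (all names here are illustrative and not taken from f8a.R):

```r
## Simulate binary data and fit a logistic regression
set.seed(1)
x <- runif(200, 0, 100)
p <- plogis(-3 + 0.06 * x)            # true success probability
y <- rbinom(200, size = 1, prob = p)
model <- glm(y ~ x, family = binomial)

## Group x into intervals and compare observed rates with fitted ones
breaks <- seq(0, 100, by = 10)
grp <- cut(x, breaks, include.lowest = TRUE)
obs <- tapply(y, grp, mean)           # observed proportion per interval
fit <- tapply(fitted(model), grp, mean)  # average fitted p-hat per interval
round(cbind(observed = obs, fitted = fit), 3)
```

If the model fits well, the two columns should track each other across the intervals.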
Residuals in logistic regression
Pearson residuals
Simple standardization, since Yi ∼ Bin(1, pi) with E(Yi) = pi and V(Yi) = pi(1 − pi):

    r̃i = (Yi − p̂i) / √( p̂i (1 − p̂i) )

(these are not N(·, ·), not even asymptotically!)
In R: infl <- influence(model), then infl$pear.res.
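For illustration, the Pearson residuals can be computed by hand and compared with what influence() returns (simulated data; names are illustrative):

```r
## Fit a logistic regression on simulated binary data
set.seed(1)
x <- runif(100, 0, 100)
y <- rbinom(100, 1, plogis(-3 + 0.06 * x))
model <- glm(y ~ x, family = binomial)

## Pearson residuals: simple standardization by sqrt(p-hat * (1 - p-hat))
p.hat <- fitted(model)
r.tilde <- (y - p.hat) / sqrt(p.hat * (1 - p.hat))

## Compare with the residuals from influence()
infl <- influence(model)
all.equal(unname(r.tilde), unname(infl$pear.res))  # should be TRUE
```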
However, in general the problem with residual analysis for logistic regression is that such plots are not very revealing, because of the binary nature of Y.
Also, the leverage values, i.e. the diagonal elements of P = X(X′WX)⁻¹X′W, now depend on both X and Y (through W), and as such they are no longer indicators of outliers w.r.t. X alone.
Residuals in logistic regression
Standardized residuals
Take leverages vii from the diagonal elements of P = X(X′WX)⁻¹X′W:

    ri = (Yi − p̂i) / √( p̂i (1 − p̂i)(1 − vii) ) ≈ N(0, 1)   (for large n)
If |ri| > λα/2 (the standard normal α/2 quantile), it might be considered suspiciously large.
Plots of ri vs i can be useful, although it is sometimes more revealing to plot their squares, ri² vs i.
Notice vii can be obtained using infl <- influence(model), then infl$hat.
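A short sketch of the standardized residuals in R (simulated data; names are illustrative):

```r
## Fit a logistic regression on simulated binary data
set.seed(1)
x <- runif(100, 0, 100)
y <- rbinom(100, 1, plogis(-3 + 0.06 * x))
model <- glm(y ~ x, family = binomial)

## Standardized residuals using the leverages v_ii from influence()
infl  <- influence(model)
p.hat <- fitted(model)
v     <- infl$hat
r <- (y - p.hat) / sqrt(p.hat * (1 - p.hat) * (1 - v))

## Flag suspiciously large residuals at alpha = 0.05
lambda <- qnorm(1 - 0.05 / 2)
which(abs(r) > lambda)
```

These should agree with R's built-in rstandard(model, type = "pearson").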
Residual plots
[Figure: data$y with p̂i and 95% CI vs x; ri and 95% CI vs i; ri² vs i]

The residuals in logistic regression always show a pattern, but with few extreme values!
Cook’s distance
There is a version of Cook’s distance for logistic regression:
    DiCook = ri² / (p + 1) · vii / (1 − vii)

Hosmer & Lemeshow consider influential cases those with DiCook > 1.

[Figure: Di vs i (0 to 100)]
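This version of Cook's distance is easy to compute from the standardized residuals and leverages (simulated data; names are illustrative; note that R's built-in cooks.distance() uses a slightly different formula for GLMs):

```r
## Fit a logistic regression on simulated binary data
set.seed(1)
x <- runif(100, 0, 100)
y <- rbinom(100, 1, plogis(-3 + 0.06 * x))
model <- glm(y ~ x, family = binomial)

## Standardized residuals and leverages
infl  <- influence(model)
v     <- infl$hat
p.hat <- fitted(model)
r <- (y - p.hat) / sqrt(p.hat * (1 - p.hat) * (1 - v))

## Cook's distance: r_i^2 / (p + 1) * v_ii / (1 - v_ii),
## where p + 1 is the number of estimated coefficients
n.par <- length(coef(model))          # = p + 1
D <- r^2 / n.par * v / (1 - v)

which(D > 1)                          # Hosmer & Lemeshow cut-off
plot(D, type = "h", xlab = "i", ylab = "D_i")
```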