Handout 15, Part 2

Logistic Regression with a Single Dichotomous Predictor
EXAMPLE: Consider the data in the file CHD.csv. Instead of examining the relationship
between the continuous variable age and the presence or absence of evidence of coronary heart
disease (CHD), we could instead consider a dichotomous predictor:
$$\text{Over55} = \begin{cases} 0 & \text{if Age} < 55 \\ 1 & \text{if Age} \geq 55 \end{cases}$$
In R:
> Over55 = as.numeric(Age >=55)
> table(Over55,CHD)
      CHD
Over55  0  1
     0 51 22
     1  6 21
Find the odds ratio for having CHD associated with being 55 or over:
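One way to carry out this calculation is sketched below in R, using the cell counts from the table above. The object names (`tab`, `odds_over55`, `odds_under55`, `odds_ratio`) are illustrative and not part of the handout.

```r
# 2x2 table of counts copied from the table(Over55, CHD) output
# (matrix() fills column by column: CHD = 0 column first, then CHD = 1)
tab <- matrix(c(51, 6, 22, 21), nrow = 2,
              dimnames = list(Over55 = c("0", "1"), CHD = c("0", "1")))

# Odds of CHD within each age group
odds_over55  <- tab["1", "1"] / tab["1", "0"]   # 21 / 6
odds_under55 <- tab["0", "1"] / tab["0", "0"]   # 22 / 51

# Odds ratio for CHD associated with being 55 or over
odds_ratio <- odds_over55 / odds_under55
print(odds_ratio)                               # about 8.11
```

The same value results from the cross-product ratio (21 × 51) / (6 × 22).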
Next, consider our logistic regression model:
$$E(y_i \mid x_i) = \theta(x_i) = \frac{\exp(\eta_0 + \eta_1 x_i)}{1 + \exp(\eta_0 + \eta_1 x_i)}.$$
If we let $x_i$ be our indicator variable, Over55, for each observation, we can construct the following table:

           Age ≥ 55 (x = 1)                                                                 Age < 55 (x = 0)
CHD = 1    $\theta(x_i = 1) = \frac{\exp(\eta_0 + \eta_1)}{1 + \exp(\eta_0 + \eta_1)}$      $\theta(x_i = 0) = \frac{\exp(\eta_0)}{1 + \exp(\eta_0)}$
CHD = 0    $1 - \theta(x_i = 1) = \frac{1}{1 + \exp(\eta_0 + \eta_1)}$                      $1 - \theta(x_i = 0) = \frac{1}{1 + \exp(\eta_0)}$
Estimate the model parameters “by hand”:
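A sketch of the by-hand calculation in R follows. It relies on the fact that a model with one dichotomous predictor and two parameters reproduces the observed proportion with CHD in each group, so the two expressions for θ can be set equal to those proportions and solved. The variable names are illustrative.

```r
# Observed proportions with CHD in each age group, from the 2x2 table
p_under55 <- 22 / (51 + 22)   # estimate of theta(x = 0)
p_over55  <- 21 / (6 + 21)    # estimate of theta(x = 1)

# theta(0) = exp(eta0) / (1 + exp(eta0)), so eta0 is the log odds when x = 0
eta0 <- log(p_under55 / (1 - p_under55))        # log(22/51)

# eta0 + eta1 is the log odds when x = 1, so eta1 is the log odds ratio
eta1 <- log(p_over55 / (1 - p_over55)) - eta0   # log((21/6) / (22/51))

print(round(c(eta0 = eta0, eta1 = eta1), 4))
```

These values can be compared with the Estimate column of the glm output.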
Verify the estimates of these model parameters using both R and SAS PROC LOGISTIC:
> chd.glm <- glm(CHD~Over55,family="binomial")
> summary(chd.glm)
Call:
glm(formula = CHD ~ Over55, family = "binomial")
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7344  -0.8469  -0.8469   0.7090   1.5488

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8408     0.2551  -3.296  0.00098 ***
Over55        2.0935     0.5285   3.961 7.46e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 136.66  on 99  degrees of freedom
Residual deviance: 117.96  on 98  degrees of freedom
AIC: 121.96

Number of Fisher Scoring iterations: 4
proc logistic data=CHD descending;
model CHD = Over55 / link=logit;
output out=probs predicted=predicted_probabilities;
run;
Questions:
1. Use the model parameters to predict the probability of having CHD for a person who is 55
or over and for a person who is younger than 55.
2. Given only the estimates of the model parameters, find the odds ratio for having CHD
associated with being 55 or over.
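Both questions can be checked with a short R sketch that starts from the coefficient estimates reported in the summary(chd.glm) output. The variable names are illustrative; `plogis()` is base R's inverse logit, exp(x) / (1 + exp(x)).

```r
# Coefficient estimates copied from the summary(chd.glm) output
eta0 <- -0.8408
eta1 <-  2.0935

# Question 1: predicted probability of CHD in each age group
p_over55  <- plogis(eta0 + eta1)   # x = 1
p_under55 <- plogis(eta0)          # x = 0

# Question 2: the odds ratio is exp(eta1)
odds_ratio <- exp(eta1)

print(round(c(p_over55 = p_over55, p_under55 = p_under55,
              odds_ratio = odds_ratio), 3))
```

Because the estimates are rounded to four decimals, the results agree with the observed proportions 21/27 and 22/73 only to about three decimal places.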
Verify the predicted probabilities from both the R and SAS output:
> prob.CHD <- fitted(chd.glm)
> cbind(Age,prob.CHD)
    Age  prob.CHD
1    20 0.3013699
2    23 0.3013699
3    24 0.3013699
4    25 0.3013699
5    25 0.3013699
6    26 0.3013699
7    26 0.3013699
8    28 0.3013699
9    28 0.3013699
10   29 0.3013699
11   30 0.3013699
12   30 0.3013699
13   30 0.3013699
14   30 0.3013699
15   30 0.3013699
16   30 0.3013699
17   32 0.3013699
18   32 0.3013699
19   33 0.3013699
20   33 0.3013699
21   34 0.3013699
22   34 0.3013699
23   34 0.3013699
24   34 0.3013699
25   34 0.3013699
26   35 0.3013699
27   35 0.3013699
28   36 0.3013699
29   36 0.3013699
30   36 0.3013699
31   37 0.3013699
32   37 0.3013699
33   37 0.3013699
34   38 0.3013699
35   38 0.3013699
36   39 0.3013699
37   39 0.3013699
.
.
.
74   55 0.7777778
proc print data=probs;
run;
Statistics Measuring Predictive Power
Once again, consider the model using the continuous variable age to predict CHD:
proc logistic descending;
model CHD = age / link=logit;
output out=get_values predicted=predicted_probabilities;
run;
Recall that the p-values shown above are used to test the usefulness of the logistic regression
model. We can also consider a few other statistics to investigate the model’s predictive power:

Generalized $R^2$
This is calculated as follows:
$$R^2 = 1 - \exp\left(-\frac{\text{Likelihood Ratio Chi-Square}}{n}\right) = \underline{\hspace{2cm}}$$


You can also request this quantity from SAS:
proc logistic descending;
model CHD = age / link=logit rsq;
run;
Note that the upper bound of the generalized $R^2$ is less than 1. Therefore, PROC
LOGISTIC also reports a quantity labeled the “Max-rescaled R-Square,” which divides
the original generalized $R^2$ by its upper bound.
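The arithmetic can be illustrated in R. Since this handout reports deviances only for the Over55 model, the sketch below uses those values rather than the age model fitted above, so its results will differ from the SAS output for age. Two points are not stated in the handout: for 0/1 responses the likelihood ratio chi-square equals the null deviance minus the residual deviance, and the upper bound is taken to be 1 − exp(−null deviance / n). Variable names are illustrative.

```r
# Deviances and sample size from the summary(chd.glm) output (Over55 model)
null_dev  <- 136.66
resid_dev <- 117.96
n         <- 100

# Likelihood ratio chi-square for the model
lr_chisq <- null_dev - resid_dev

# Generalized R-square
r2 <- 1 - exp(-lr_chisq / n)

# Upper bound of the generalized R-square (assumption: for 0/1 data the
# null deviance equals -2 times the log-likelihood of the intercept-only model)
r2_max <- 1 - exp(-null_dev / n)

# Max-rescaled R-square
r2_rescaled <- r2 / r2_max

print(round(c(r2 = r2, r2_max = r2_max, r2_rescaled = r2_rescaled), 4))
```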

Ordinal Measures of Association
SAS PROC LOGISTIC also reports the following statistics:
The idea behind these statistics is as follows. For the 100 observations in the data set, there exist
100×(99)/2 = 4,950 different ways to pair them up (without pairing an observation with itself).
Of these pairs, 2,499 have either both 1s or both 0s for an observed response. These are ignored,
leaving 2,451 pairs in which one case has a 0 and the other case has a 1. For these pairs, SAS
determines whether the observation with a 1 has a higher predicted value (based on the model)
than does the observation with a 0. If this is the case, the pair is called concordant. If not, the
pair is discordant.
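The pair counts quoted above can be reproduced with a few lines of R. The group sizes are taken from the margins of the table(Over55, CHD) output earlier in this handout (57 observations with CHD = 0 and 43 with CHD = 1); the variable names are illustrative.

```r
# Number of observations with each observed response
n0 <- 51 + 6    # CHD = 0
n1 <- 22 + 21   # CHD = 1

# All distinct pairs among the 100 observations
total_pairs <- choose(n0 + n1, 2)

# Pairs in which both responses are 0 or both are 1 (ignored)
same_response <- choose(n0, 2) + choose(n1, 2)

# Pairs with one 0 and one 1 (used for concordance)
mixed_pairs <- n0 * n1

print(c(total = total_pairs, same = same_response, mixed = mixed_pairs))
```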
Let C = the number of concordant pairs =
D = the number of discordant pairs =
T = the number of ties =
N = the total number of pairs (before eliminating any) =
The four measures of association are given as
1. Somers’ D = $\frac{C - D}{C + D + T}$
2. Gamma = $\frac{C - D}{C + D}$
3. Tau-a = $\frac{C - D}{N}$
4. C = .5×(1 + Somers’ D)
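As a check on these formulas, the sketch below evaluates them in R. The counts are not given directly in the handout; they are recovered from the percentages in the OptimisedConc output shown at the end of this handout (for example, 78.98817% of 2,451 pairs is 1,936 concordant pairs). Variable names are illustrative.

```r
# Pair counts recovered from the OptimisedConc output percentages
n_conc <- 1936              # concordant pairs (C)
n_disc <- 466               # discordant pairs (D)
n_tied <- 49                # tied pairs (T)
N      <- choose(100, 2)    # all pairs before eliminating any

somers_d   <- (n_conc - n_disc) / (n_conc + n_disc + n_tied)
gamma_stat <- (n_conc - n_disc) / (n_conc + n_disc)
tau_a      <- (n_conc - n_disc) / N
c_stat     <- 0.5 * (1 + somers_d)

print(round(c(somers_d = somers_d, gamma = gamma_stat,
              tau_a = tau_a, c = c_stat), 4))
```

Note that C = .5×(1 + Somers’ D) is the same as counting each tied pair as half a concordant pair: (C + T/2) / (C + D + T).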
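Somers’ D, Gamma, and Tau-a can in principle range from −1 to 1, and C from 0 to 1. For a
model whose predictions are positively associated with the observed responses, the first three
fall between 0 and 1 and C falls between .5 and 1. In every case, larger values correspond to
stronger associations between the predicted and observed values. Finally, note that the measure
known as C has another familiar interpretation. Consider the following programming statements.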
ods html;
ods graphics on;
proc logistic data=CHD descending;
model CHD = age / link=logit outroc=roc_data;
run;
ods graphics off;
ods html close;
proc print data=roc_data; run;
These statements request the following output.
The ROC curve is obtained by varying the cutoff on the estimated probability that is used to
classify an observation as CHD = 1, and plotting sensitivity against 1 − specificity at each
cutoff. Note that the area under the ROC curve is the same as C.
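The equality between the area under the ROC curve and C can be illustrated with a small base R sketch. The data below are hypothetical toy values, not the CHD data, and all variable names are illustrative. The area is computed with the trapezoidal rule, and C is computed from pairwise comparisons with each tie counted as half a concordant pair.

```r
# Hypothetical toy data: observed 0/1 responses and predicted probabilities
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
p <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.70, 0.90)

# Classify as 1 when p >= cutoff, for every distinct cutoff (Inf gives the origin)
cutoffs <- c(Inf, sort(unique(p), decreasing = TRUE))
sens <- sapply(cutoffs, function(k) mean(p[y == 1] >= k))   # sensitivity
fpr  <- sapply(cutoffs, function(k) mean(p[y == 0] >= k))   # 1 - specificity

# Area under the ROC curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)

# C from all (1, 0) pairs: concordant pairs plus half of the ties
d <- outer(p[y == 1], p[y == 0], "-")
c_stat <- (sum(d > 0) + 0.5 * sum(d == 0)) / length(d)

print(c(auc = auc, c = c_stat))
```

For these toy values there are 12 concordant pairs, 3 discordant pairs, and 1 tie among the 16 pairs, so both quantities equal 12.5/16.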
Finding the ROC curve using R
You must first install the Deducer package. Then, you can create the ROC curve and compute
the area under the curve using the following commands.
> library(Deducer)
> rocplot(chd.glm)
Finding the Concordance/Discordance Model Fit Measures in R
The following function can be used to find Somers’ D, Gamma, Kendall’s Tau-a, and the C-statistic in R (http://statour.blogspot.in/2012/12/concordance-and-discordance-in-logistic.html).
###########################################################
# Function OptimisedConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# Although it still uses two for-loops, it optimises the code
# by creating initial zero matrices
###########################################################
OptimisedConc <- function(model)
{
  # Column 1: observed 0/1 response; column 2: fitted probability
  Data  <- cbind(model$y, model$fitted.values)
  # drop = FALSE keeps a matrix even when a group has only one row
  ones  <- Data[Data[,1] == 1, , drop = FALSE]
  zeros <- Data[Data[,1] == 0, , drop = FALSE]
  # One cell per (zero, one) pair
  conc <- matrix(0, nrow(zeros), nrow(ones))
  disc <- matrix(0, nrow(zeros), nrow(ones))
  ties <- matrix(0, nrow(zeros), nrow(ones))
  for (j in seq_len(nrow(zeros)))
  {
    for (i in seq_len(nrow(ones)))
    {
      if (ones[i,2] > zeros[j,2])
        {conc[j,i] <- 1}
      else if (ones[i,2] < zeros[j,2])
        {disc[j,i] <- 1}
      else if (ones[i,2] == zeros[j,2])
        {ties[j,i] <- 1}
    }
  }
  # Number of pairs with one observed 0 and one observed 1
  Pairs <- nrow(zeros) * nrow(ones)
  PercentConcordance <- (sum(conc)/Pairs)*100
  PercentDiscordance <- (sum(disc)/Pairs)*100
  PercentTied        <- (sum(ties)/Pairs)*100
  # N observations give N*(N-1)/2 pairs in total (used for Tau-a)
  N <- length(model$y)
  Somers_D <- (sum(conc)-sum(disc))/Pairs
  gamma    <- (sum(conc)-sum(disc))/(Pairs-sum(ties))
  k_tau_a  <- 2*(sum(conc)-sum(disc))/(N*(N-1))
  C        <- .5*(1+Somers_D)
  return(list("Percent Concordance"=PercentConcordance,
              "Percent Discordance"=PercentDiscordance,
              "Percent Tied"=PercentTied,
              "Pairs"=Pairs,
              "Somer's D"=Somers_D,
              "Gamma"=gamma,
              "Kendall's Tau A"=k_tau_a,
              "C"=C))
}
To call this function, enter the following:
> OptimisedConc(chd.glm)
$`Percent Concordance`
[1] 78.98817
$`Percent Discordance`
[1] 19.01265
$`Percent Tied`
[1] 1.999184
$Pairs
[1] 2451
$`Somer's D`
[1] 0.5997552
$Gamma
[1] 0.61199
$`Kendall's Tau A`
[1] 0.2969697
$C
[1] 0.7998776