Logistic regression R code

We have two variables:
1. X is a continuous variable, such as gene expression, drug dose, or a mother’s
weight. X can also be a binary categorical variable (exposed/not exposed).
2. Y is a binary variable, such as cancer/not, premature/not, response/not
[Figure: example logistic regression curve, with response probability (0 to 1) on the y-axis and dose (0 to 3) on the x-axis.]
We encode Y as a Bernoulli random variable with success = 1 and failure = 0. The
probability of success is pi, the Greek letter π.
In logistic regression, rather than modeling Y, which can only take the values 0 or 1, we
instead model the probability of success pi, which can take any value between 0 and 1.
It is most convenient to model the probabilities on a transformed scale:
logit(pi) = log( pi / (1 - pi) ) = beta0 + beta1 * x.
logit(pi) = log( pi / (1 - pi) ) is the log odds of pi, the probability of success, which is the
probability that Y = 1.
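As a small aside (an illustration added here, not part of the original code), the logit and its inverse can be written as two one-line R functions; base R's qlogis() and plogis() compute the same quantities, as does ilogit() from the faraway package used further below.
my.logit = function(p) log(p / (1 - p))              # probability -> log odds
my.ilogit = function(eta) exp(eta) / (1 + exp(eta))  # log odds -> probability
my.logit(0.5)             # 0: a probability of 0.5 is a log odds of 0
my.ilogit(my.logit(0.73)) # recovers 0.73, confirming the two functions are inverses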
# Plot of a logistic regression curve, where the probability of success P(Y=1) increases
# with X.
beta0 = 0
beta1 = 1
x.range = -10:10
Pi.result.vector = c()
for (index in 1:length(x.range))
{
  Pi.result.vector[index] = exp(beta0 + beta1 * x.range[index]) /
    (1 + exp(beta0 + beta1 * x.range[index]))
}
plot(x.range, Pi.result.vector)
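Because R arithmetic is vectorized, the same curve can be drawn without an explicit loop; this is just an alternative sketch of the block above (eta and Pi.vectorized are illustrative names), with type="l" so the points are joined into a line.
eta = beta0 + beta1 * x.range              # linear predictor for every x at once
Pi.vectorized = exp(eta) / (1 + exp(eta))  # inverse logit, applied element-wise
plot(x.range, Pi.vectorized, type="l")     # same curve, drawn as a line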
The mathematical form of the logistic regression line is the model
pi = exp(beta0 + beta1 * x) / ( 1 + exp(beta0 + beta1 * x) )
This mathematical form is equivalent to modeling the log odds of pi as a linear function of x:
logit(pi) = log( pi / (1 - pi) ) = beta0 + beta1 * x.
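A quick numeric check (with arbitrary illustrative values, not taken from the data below) confirms the two forms agree: applying the log odds transformation to the modeled probability returns the linear predictor beta0 + beta1 * x.
b0.check = -2; b1.check = 0.8; x.check = 1.5   # arbitrary values for illustration
pi.check = exp(b0.check + b1.check * x.check) / (1 + exp(b0.check + b1.check * x.check))
log(pi.check / (1 - pi.check))   # -0.8, which equals b0.check + b1.check * x.check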
Plots of logistic regression to demonstrate effects of coefficients
library(faraway)
# Create blank graph for plotting
prob=c()
plot(prob, xlim=c(-10,10), ylim=c(0,1), xlab="X variable", ylab="response")
x=seq(-10,10,.1)
# The intercept (b0) is the log odds that response Y=1 when X=0.
# The intercept (b0) moves the curve to the left or right. A positive intercept moves the
# curve left, a negative intercept moves the curve right.
lines(x, ilogit(0+1*x), lty=1, col="black")
lines(x, ilogit(1+1*x), lty=2, col="red")
lines(x, ilogit(2+1*x), lty=3, col="blue")
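Since the intercept is the log odds that Y=1 at X=0, evaluating each curve at X=0 (a quick check using faraway's ilogit) gives the corresponding probabilities:
ilogit(0)   # 0.50: probability that Y=1 at X=0 when b0 = 0
ilogit(1)   # approximately 0.73 when b0 = 1
ilogit(2)   # approximately 0.88 when b0 = 2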
# The slope (b1) changes the slope (steepness) of the curve.
# Larger numbers => steeper slope. Smaller numbers => shallower slope.
prob=c()
plot(prob, xlim=c(-10,10), ylim=c(0,1), xlab="X variable", ylab="response")
x=seq(-10,10,.1)
lines(x, ilogit(0+1*x), lty=1, col=1)
lines(x, ilogit(0+2*x), lty=2, col=2)
lines(x, ilogit(0+3*x), lty=3, col=4)
# Smaller numbers => shallower slope.
prob=c()
plot(prob, xlim=c(-10,10), ylim=c(0,1), xlab="X variable", ylab="response")
x=seq(-10,10,.1)
lines(x, ilogit(0+1*x), lty=1, col=1)
lines(x, ilogit(0+.5*x), lty=2, col=2)
lines(x, ilogit(0+.1*x), lty=3, col=4)
# If the coefficient (b1) of x1 is zero, then the slope of the line is zero
lines(x, ilogit(0+0*x), lty=4, col=5)
# The sign of the slope determines whether the curve rises or falls as x values increase
prob=c()
plot(prob, xlim=c(-10,10), ylim=c(0,1), xlab="X variable", ylab="response")
x=seq(-10,10,.1)
lines(x, ilogit(0+1*x), lty=1, col=1)
lines(x, ilogit(0-1*x), lty=2, col=2)
# In some cases, X is a binary variable which we encode as {0,1}.
prob=c()
plot(prob, xlim=c(0,1), ylim=c(0,1), xlab="X variable", ylab="response")
x=seq(0,1)
lines(x, ilogit(0+1*x), lty=1, col=1)
lines(x, ilogit(0+.5*x), lty=2, col=2)
lines(x, ilogit(0+.1*x), lty=3, col=4)
lines(x, ilogit(0+0*x), lty=4, col=5)
# The slope coefficient (b1) is the change in the log(odds) that Y=1 for a unit change in X
# (for example, a change from X=0 to X=1). It is the difference between the log(odds) at X=1
# and the log(odds) at X=0, so exp(b1) is the odds ratio comparing X=1 to X=0.
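The odds-ratio reading of b1 can be checked numerically; a small sketch with arbitrary illustrative coefficients (b0.ex and b1.ex are not taken from any model below):
b0.ex = 0; b1.ex = 1
odds.at.0 = exp(b0.ex + b1.ex * 0)   # odds that Y=1 when X=0
odds.at.1 = exp(b0.ex + b1.ex * 1)   # odds that Y=1 when X=1
log(odds.at.1) - log(odds.at.0)      # equals b1.ex
odds.at.1 / odds.at.0                # equals exp(b1.ex), the odds ratio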
########## Dose response example. Response is binary {0,1} meaning No or Yes.
mystring="patient,response,dose,age,weight
1,0,0.0,4,28
2,0,1.5,5,35
3,1,2.5,8,55
4,0,1.0,9,76
5,0,0.5,5,31
6,1,2.0,5,27
7,0,1.0,6,35
8,1,2.5,6,47
9,0,0.0,9,59
10,1,3.0,8,50
11,0,2.0,7,50
12,1,2.5,7,46
13,1,1.5,4,33
14,1,3.0,8,59
15,0,1.5,6,40
16,0,0.0,8,58
17,0,1.0,7,55
18,0,2.0,10,76
19,0,0.5,9,66
20,0,1.0,6,43
21,1,1.0,6,48
22,1,2.5,7,50
23,0,1.5,5,29
24,1,2.5,11,64
25,1,3.0,9,61
26,1,2.5,10,71
27,0,1.5,4,26
28,1,2.0,3,27
29,0,0.0,9,56
30,1,2.5,8,57
31,0,1.0,3,22
32,0,0.5,5,37
33,0,0.5,6,44
34,1,3.0,5,45
35,0,1.5,8,53
36,1,2.0,4,29"
pediatric.data = read.table(textConnection(mystring), header=TRUE, sep=",", row.names="patient")
model1=glm(response ~ dose, family=binomial, data= pediatric.data)
summary(model1)
Call:
glm(formula = response ~ dose, family = binomial, data = pediatric.data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.54658  -0.34206  -0.05608   0.36801   2.39490

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.454      2.171  -2.973  0.00294 **
dose           3.645      1.187   3.070  0.00214 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 49.461  on 35  degrees of freedom
Residual deviance: 20.169  on 34  degrees of freedom
AIC: 24.169

Number of Fisher Scoring iterations: 6
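One natural follow-up to this output (a sketch, not part of the original code) is to move the estimates from the log odds scale to the odds scale:
exp(coef(model1))     # odds ratios: exp(intercept) and exp(dose coefficient)
exp(confint(model1))  # profile-likelihood 95% confidence intervals, on the odds-ratio scale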
plot(response ~ dose, data= pediatric.data, xlim=c(0,3), ylim=c(0,1))
x=seq(0,3,.1)
lines(x, ilogit(-6.454+3.645*x))
[Figure: observed responses (0 or 1) plotted against dose (0 to 3), with the fitted logistic curve ilogit(-6.454 + 3.645*dose) overlaid.]
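Instead of typing the estimated coefficients by hand, the same fitted curve can be drawn directly from model1 with predict(); a sketch:
x = seq(0, 3, .1)
fitted.prob = predict(model1, newdata=data.frame(dose=x), type="response")
plot(response ~ dose, data=pediatric.data, xlim=c(0,3), ylim=c(0,1))
lines(x, fitted.prob)   # same curve as ilogit(-6.454 + 3.645*x), up to rounding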