ST430: Introduction to Regression Analysis
Chapter 9, Section 6
Luo Xiao
November 18, 2015
Special Topics
Logistic regression
Linear regression methods are used to evaluate the impact of various factors
on a response.
When the response Y is binary (0 or 1), linear methods have problems.
Because E (Y |X) = P(Y = 1|X), the linear regression model
E (Y |X ) = β0 + β1 X
will often predict probabilities that are either negative or greater than 1.
The most common alternative is based on modeling the log odds. Write

π(X) = P(Y = 1|X) = E(Y|X).

Then

π(X) / (1 − π(X)) = the odds,

log[ π(X) / (1 − π(X)) ] = the log odds (sometimes loosely called the log odds ratio).

In the logistic regression model, we assume

log[ π(X) / (1 − π(X)) ] = β0 + β1 X1 + · · · + βk Xk.
Solving for π(X), we find

P(Y = 1|X) = π(X) = exp(β0 + β1 X1 + · · · + βk Xk) / [1 + exp(β0 + β1 X1 + · · · + βk Xk)].

Consequently,

P(Y = 0|X) = 1 − π(X) = 1 / [1 + exp(β0 + β1 X1 + · · · + βk Xk)].
As a function of any Xj , π(X) changes smoothly from 0 to 1.
It is increasing if βj > 0, and decreasing if βj < 0.
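These properties can be checked numerically; R's plogis() computes exactly this logistic transform. The coefficient values below are made up for illustration:

```r
# pi(X) for a single predictor, with made-up coefficients b0 = -1, b1 = 2
b0 <- -1; b1 <- 2
x <- seq(-3, 3, by = 0.5)
pi_x <- exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))

# plogis() is R's logistic cdf, so it gives the same probabilities
all.equal(pi_x, plogis(b0 + b1 * x))   # TRUE

# With b1 > 0, pi(X) is increasing in x, and stays strictly inside (0, 1)
all(diff(pi_x) > 0)                    # TRUE
all(pi_x > 0 & pi_x < 1)               # TRUE
```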
The function

F(x) = exp(x) / (1 + exp(x))

is the cdf of the logistic distribution. It is similar to the cdf of the normal distribution with mean zero and the matching variance, π²/3.
# CDF of logistic distribution
curve(exp(x)/(1 + exp(x)), from = -5, to = 5)
# CDF of normal distribution with the matching variance
curve(pnorm(x, 0, sqrt(pi^2/3)), add = TRUE, col = "red")
legend("topleft", c("logistic", "normal"),
       col = c("black", "red"), lwd = 1)
[Figure: cdf of the logistic distribution (black) and of the matching normal distribution (red), plotted for x from −5 to 5.]
Interpreting the parameters

The coefficient βj measures the change in the log odds associated with a change of +1 in Xj.

So e^βj is the multiplicative change in the odds, and e^βj − 1 is the proportional change in the odds, associated with the same change in Xj.

When Xj is an indicator variable, e^βj is often interpreted as the relative risk that Y = 1 for the group where Xj = 1, relative to the group where Xj = 0.
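A quick numerical illustration (the coefficient value here is made up):

```r
# Hypothetical coefficient, for illustration only
bj <- 0.4
exp(bj)       # multiplicative change in the odds for a +1 change in Xj: about 1.49
exp(bj) - 1   # proportional change in the odds: about 0.49, i.e. a 49% increase
```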
Example: fraud detection
Data are credit card transactions.
The response is Y, where

Y = 1 if the transaction is fraudulent,
Y = 0 otherwise.
The predictors are information about the card holder (credit limit, etc.) and
about the transaction (amount, etc.).
The fitted π̂(X) can be used to predict the probability that a new
transaction will prove to be fraudulent.
Model estimation
The usual approach to estimating β0 , β1 , . . . , βk is by maximum likelihood.
It is implemented in proc logistic and proc genmod in SAS, and in the
glm() function in R.
The names “genmod” and “glm” are abbreviations of generalized linear
model, of which logistic regression is a particular case.
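A sketch of what "maximum likelihood" means here, on simulated data (the data-generating coefficients are made up): glm() should recover, up to numerical error, the same estimates as directly maximizing the Bernoulli log-likelihood.

```r
# Simulate binary data from a logistic model with made-up coefficients
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))

# Fit by glm(), which maximizes the likelihood internally
g <- glm(y ~ x, family = binomial)

# The same fit by direct numerical maximization of the log-likelihood
negloglik <- function(b) {
  eta <- b[1] + b[2] * x
  -sum(y * eta - log(1 + exp(eta)))   # minus the Bernoulli log-likelihood
}
fit <- optim(c(0, 0), negloglik)
cbind(glm = coef(g), optim = fit$par) # the two columns agree closely
```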
Example: collusive bidding in Florida road construction.
Y : binary indicator of whether the bid was fixed (Y = 1) or competitive (Y = 0);
X1 : number of bidders;
X2 : difference between winning bid and estimated competitive bid.
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("ROADBIDS.RData")
pairs(ROADBIDS)
[Figure: pairs plot of the ROADBIDS variables STATUS, NUMBIDS, and DOTEST.]
Using glm() is very similar to using lm():
g = glm(STATUS ~ NUMBIDS + DOTEST, ROADBIDS, family = binomial)
summary(g)
The argument family = binomial specifies that the response, STATUS,
has the binomial (strictly, the Bernoulli) distribution.
See "output1.txt"
The output is also similar to that of lm().
Note that instead of a column of t-values, there is a column of z-values.
Like a t-value, a z-value is the ratio of a parameter estimate to its standard
error.
The label indicates that you test the significance of the parameter using the
normal distribution, not the t-distribution.
Because this is not a least-squares fit, there are no sums of squares. Deviance plays a similar role: it measures how well the model fits the data.

For example, to test the utility of the model, use the statistic

Null deviance − Residual deviance = 41.381 − 22.843 = 18.538,

which, under H0 : β1 = β2 = 0, is χ²-distributed with 30 − 28 = 2 degrees of freedom.

P(χ²₂ ≥ 18.538) < 0.0001, so we reject H0.
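The p-value can be computed in R from the deviances reported in the glm() output:

```r
# Model-utility test: drop in deviance compared to a chi-square distribution
dev_drop <- 41.381 - 22.843                    # 18.538
pchisq(dev_drop, df = 2, lower.tail = FALSE)   # p-value, below 0.0001
```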
You can also use deviance to compare nested models, such as the first-order model

log[ π(X) / (1 − π(X)) ] = β0 + β1 X1 + β2 X2

against the complete second-order model

log[ π(X) / (1 − π(X)) ] = β0 + β1 X1 + β2 X2 + β3 X1 X2 + β4 X1² + β5 X2².
g2 = glm(STATUS ~ NUMBIDS * DOTEST + I(NUMBIDS^2)
         + I(DOTEST^2), ROADBIDS, family = binomial)
summary(g2)

See “output2.txt”.
To test H0 : β3 = β4 = β5 = 0, the test statistic is

(Residual deviance for reduced model) − (Residual deviance for complete model).

Under H0 , this statistic has the χ²-distribution with 28 − 25 = 3 degrees of freedom.

Here we have 22.843 − 13.820 = 9.023, which we compare with the χ²₃-distribution.

We find P(χ²₃ ≥ 9.023) = .029, so we would reject H0 at α = .05 but not at α = .01.

That is, there is some evidence that we need second-order terms.
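The same comparison in R; with both fitted objects available, anova(g, g2, test = "Chisq") reports the identical test:

```r
# Nested-model test by hand, from the two residual deviances
stat <- 22.843 - 13.820                    # 9.023
pchisq(stat, df = 3, lower.tail = FALSE)   # about 0.029
```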
Comparing non-nested models

There is no R² or adjusted R² (R²ₐ). Deviance can be used and plays a role similar to R².

AIC is also defined here:

AIC = Deviance + 2(k + 1),

where k is the number of predictors.
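This identity can be verified on simulated data: for a binary (Bernoulli) response the saturated model's log-likelihood is zero, so R's AIC() equals the residual deviance plus 2(k + 1). (The simulated data below are made up for the check.)

```r
# Simulated binary data, for illustration only
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
g <- glm(y ~ x, family = binomial)

k <- 1                                        # one predictor
all.equal(AIC(g), deviance(g) + 2 * (k + 1))  # TRUE
```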
Prediction
Suppose that a new auction has 4 bidders, and the difference between the
winning bid and the engineer’s estimate is 30%. What is the probability
that the auction was collusive?
predict(g, data.frame(NUMBIDS = 4, DOTEST = 30),
type = "response", se.fit = TRUE)
The probability is .85, but the standard error of .13 shows that it is not very
well quantified.
If you do not specify “type = "response"”, the prediction is on the scale
of the log odds, not the probability itself.
Do not use the standard error of the predicted probability to construct a
confidence interval!
You can use a confidence interval for the log odds to construct a
corresponding confidence interval for the probability:
p = predict(g, data.frame(NUMBIDS = 4, DOTEST = 30),
se.fit = TRUE)
logOdds = p$fit + qnorm(c(.025, .5, .975)) * p$se.fit
exp(logOdds)/(1 + exp(logOdds))
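The final transform is just the logistic cdf again, so plogis() gives the same probabilities; a small check with made-up log-odds values:

```r
# plogis() maps log odds back to probabilities
logOdds <- c(0.5, 1.7, 2.9)   # hypothetical values, for illustration
all.equal(plogis(logOdds), exp(logOdds)/(1 + exp(logOdds)))  # TRUE
```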