Logistic Regression - UF Department of Statistics

Case Study – Logistic Regression
NFL Field Goal Attempts – 2003
Dependent Variable: Outcome of Field Goal Attempt (1=Success,
0=Failure)
Independent Variable: Distance (Yards)
Data:
YARDS * SUCCESS Crosstabulation
Count
SUCCESS
0
YARDS
Total
18.0
19.0
20.0
21.0
22.0
23.0
24.0
25.0
26.0
27.0
28.0
29.0
30.0
31.0
32.0
33.0
34.0
35.0
36.0
37.0
38.0
39.0
40.0
41.0
42.0
43.0
44.0
45.0
46.0
47.0
48.0
49.0
50.0
51.0
52.0
53.0
54.0
55.0
56.0
57.0
58.0
60.0
62.0
1
0
0
1
1
0
0
2
0
1
0
1
4
3
3
2
3
6
7
4
6
7
7
5
10
2
7
11
8
9
7
15
13
11
6
9
7
5
5
0
1
0
2
1
192
Total
1
5
18
17
36
33
28
17
26
32
31
33
19
23
16
30
19
20
21
26
26
26
24
22
26
21
20
13
25
20
21
16
14
12
5
7
3
1
1
1
1
0
0
756
1
5
19
18
36
33
30
17
27
32
32
37
22
26
18
33
25
27
25
32
33
33
29
32
28
28
31
21
34
27
36
29
25
18
14
14
8
6
1
2
1
2
1
948
Random Component: Binomial Distribution
Systematic Component: g() =  + X
Link Function: logit
  

g() = log 
1  
e  x
Regression Equation:  ( x) 
1  e  x
Estimated Regression Equation:
Variables in the Equation
Step
a
1
YARDS
Constant
B
-.110
5.698
S.E.
.011
.451
Wald
107.856
159.545
df
1
1
Sig.
.000
.000
Exp(B)
.896
298.235
95.0% C.I.for EXP(B)
Lower
Upper
.878
.915
a. Variable(s) entered on s tep 1: YARDS.
e 5.6980.110x
 ( x) 
1  e 5.6980.110x
^
Yardage
20
25
30
35
40
45
50
55
Probability
.9706
.9502
.9167
.8639
.7855
.6787
.5493
.4129
95% CI for : -0.110  1.96(0.011)  -0.110  0.022  (-0.132 , -0.088)
Odds Ratio: OR 
  ( x  1) 
1   ( x  1) 


odds( x  1)

 e
odds( x)
  ( x) 
1   ( x) 


^
Estimated odds ratio: e   e 0.110  0.896
Odds of successful fieldgoal change by about (0.896-1)100% = -10.4% for
each extra yard of the attempt.
95% CI for odds ratio: (e-0.132 , e-0.088)  (0.876 , 0.916)
Hosmer-Lemeshow Goodness-of-Fit Test:
1. Lump attempts into 10 bins of approximately equal number of attempts
2. Obtain the observed and predicted (expected) numbers of successes and
failures for each bin (20 total cells)
3. For each bin, compute the contribution to the chi-square statistic:
(observed - expected) 2
X 
expected
2
4. Sum the results from step 3 over all 20 cells
Hosme r and Leme show Test
St ep
1
Chi-square
5.974
df
8
Sig.
.650
Contingency Table for Hosmer and Lemeshow Test
Step
1
1
2
3
4
5
6
7
8
9
10
SUCCESS = 0
Observed
Expected
47
47.043
35
36.501
28
27.657
19
22.232
19
18.588
17
12.632
14
10.995
8
6.779
3
5.638
2
3.936
SUCCESS = 1
Observed
Expected
45
44.957
57
55.499
58
58.343
69
65.768
76
76.412
67
71.368
88
91.005
83
84.221
103
100.362
110
108.064
Total
92
92
86
88
95
84
102
91
106
112
Don’t reject the null hypothesis that the model fit is appropriate.
Computational Approach to Obtaining Logistic Regression Analysis
Data: ni observations at the ith of m distinct levels of the independent
variable(s), with yi successes.
 x1' 
 '
x
X  2
 
 ' 
 x m 

xi'  1 x1i
 x pi
0 
 
1
  
  
 
  p 

Note: p=1 in this case
With Likelihood and log-Likelihood Functions:

 exp( xi'  ) 
ni !


L(  | (n1 , y1 ),..., (nm , y m ))   
'

i 1  y i !( ni  y i )!  1  exp( x i  ) 
m
yi


1


'
 1  exp( x  ) 
i


n
n
n
i 1
i 1
i 1

ni  yi
log( L)   log( ni !)  log( y i !)  log(( ni  y i )!)    y i xi'    ni log 1  exp( xi'  )
The derivative of the log-likelihood wrt :
 log( L)
1
  y i x i   ni
xi exp( xi'  )   ( y i  ni  i ) xi  g (  )
'

1  exp( xi  )
The Hessian matrix:
1  exp( xi'  )  xi' exp( xi'  )  exp( xi'  ) xi' exp( xi'  )

2
  ni xi
2
 '
1  exp( x'  ) 
i
  ni xi xi' exp( xi'  )
Wii  ni i (1   i )
1
1  exp( x  ) 
'
i
2
  X 'WX  G (  )
Wij  0 i  j
Newton-Raphson-Algorithm:
^ NEW

^ OLD

1
  ^ OLD   ^ OLD 
 g  

 G 



 

 

Results from SAS PROC IML:
BETA_NEW
SEBETA
5.6978802 0.4510991
-0.10991 0.0105832
VBETA
LOGLIKE
0.2034904 -0.004683 -408.6017
-0.004683 0.000112
Note: There log-likelihood function evaluation does not match SPSS’ (it
does not evaluate the first term, which does not involve ). Any tests
concerning  will not be effected.
1
.8
y .6
.4
.2
20
30
40
50
60
70
x
For more detail, see Agresti (2002), Categorical Data Analysis. Chapter 4.
Sources: www.jt-sw.com, www.espn.com