Logistic Regression Using SPSS
We will use the same breast cancer dataset for this handout as we did for the handout on logistic regression
using SAS. The table below lists the variables in “A study of preventive lifestyles and women’s health,”
conducted by a group of students in the School of Public Health at the University of Michigan during
the 1997 winter term. There are 370 women in this study, aged 40 to 91 years.
Description of variables:

Variable Name   Description                                       Column Location
IDNUM           Identification number                             1-4
STOPMENS        1= Yes, 2= No, 9= Missing                         5
AGESTOP1        88=NA (haven't stopped), 99= Missing              6-7
NUMPREG1        88=NA (no births), 99= Missing                    8-9
AGEBIRTH        88=NA (no births), 99= Missing                    10-11
MAMFREQ4        1= Every 6 months                                 12
                2= Every year
                3= Every 2 years
                4= Every 5 years
                5= Never
                6= Other
                9= Missing
DOB             01/01/00 to 12/31/57                              13-20
                99/99/99= Missing
EDUC            1= No formal school                               21-22
                2= Grade school
                3= Some high school
                4= High school graduate/ Diploma equivalent
                5= Some college education/ Associate’s degree
                6= College graduate
                7= Some graduate school
                8= Graduate school or professional degree
                9= Other
                99= Missing
TOTINCOM        1= Less than $10,000                              23
                2= $10,000 to 24,999
                3= $25,000 to 39,999
                4= $40,000 to 54,999
                5= More than $55,000
                8= Don’t know
                9= Missing
SMOKER          1= Yes, 2= No, 9= Missing                         24
WEIGHT1         999= Missing                                      25-27
In SPSS, we use the SET EPOCH command to define the 100-year window that SPSS will use for a two-digit year.
We use SET epoch=1900 so that a date of birth of 12/21/05 will be read as Dec 21, 1905, rather than as Dec 21,
2005.
SET epoch=1900.
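The windowing idea can be illustrated outside SPSS. The following is a minimal Python sketch, not SPSS code; the function name expand_two_digit_year is hypothetical, and we assume only that the window covers the 100 years beginning at the epoch year.

```python
def expand_two_digit_year(yy, epoch=1900):
    """Map a two-digit year onto the 100-year window that starts at `epoch`."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    # Offset of yy from the first year of the window, wrapped into 0..99.
    offset = (yy - epoch % 100) % 100
    return epoch + offset


# With a window of 1900-1999, the year 05 is read as 1905.
print(expand_two_digit_year(5, epoch=1900))
# With a window of 1950-2049, the same two digits would be read as 2005.
print(expand_two_digit_year(5, epoch=1950))
```

Because every date of birth in this dataset falls between 1900 and 1957, a window starting at 1900 reads all of them into the intended century.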
The data list command reads in the raw data. We initially read in DOB as a character variable. We then set up
the missing value code for DOB to be 09/09/99. We compute a new variable called BIRTHDATE, which is the
numeric version of DOB, using the SPSS date and time wizard. We also use the date and time wizard in SPSS
to calculate the variables STUDYDATE and AGE.
SET epoch=1900.
data list
file "c:\documents and settings\kwelch\desktop\b510\brca.dat" records=1
/idnum 1-4 stopmens 5 agestop1 6-7 numpreg1 8-9 agebirth 10-11
mamfreq4 12 dob 13-20 (A) educ 21-22
totincom 23 smoker 24 weight1 25-27.
execute.
RECODE
dob ('09/09/99'=' ') .
EXECUTE .
* Date and Time Wizard: birthdate.
COMPUTE birthdate = number(dob, ADATE8).
VARIABLE LABEL birthdate.
VARIABLE LEVEL birthdate (SCALE).
FORMATS birthdate (ADATE10).
VARIABLE WIDTH birthdate(10).
EXECUTE.
missing values stopmens mamfreq4 smoker(9) agestop1 agebirth (88,99)
numpreg1 educ (99) totincom(8,9) weight1 (999).
RECODE
stopmens
(1=1) (2=0) INTO menopause .
EXECUTE .
COMPUTE yearbirth = XDATE.YEAR(birthdate).
FORMATS yearbirth (F8.0).
VARIABLE WIDTH yearbirth(8).
execute.
compute studymonth = 1.
format studymonth (f2.0).
compute studyyear = 1997.
format studyyear (f4.0).
compute studyday = 1.
format studyday (f1.0).
execute.
COMPUTE studydate = DATE.DMY(studyday, studymonth, studyyear).
FORMATS studydate (ADATE10).
EXECUTE.
COMPUTE age = DATEDIFF(studydate, birthdate, "years").
FORMATS age (F5.0).
EXECUTE.
RECODE
educ
(MISSING=SYSMIS) (1 thru 4=1) (5 thru 6=2) (7 thru 8=3) INTO edcat .
formats edcat (f2.0).
EXECUTE .
Recode educ (missing=sysmis) (6 thru 8 = 1) (else=0) into highed.
formats highed (f2.0).
EXECUTE .
RECODE age (missing=sysmis) (lowest thru 49=1) (50 thru 59=2)
(60 thru 69=3) (70 thru highest=4) into agecat.
formats agecat (f2.0).
EXECUTE.
Recode age (missing=sysmis) (50 thru highest = 1) (else=0) into over50.
formats over50 (f2.0).
EXECUTE .
Recode age (missing=sysmis) (50 thru highest = 1) (else=2) into highage.
formats highage(f2.0).
EXECUTE .
SAVE OUTFILE='C:\Documents and Settings\kwelch\Desktop\b510\brca.sav'
/COMPRESSED.
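As a check on the AGE calculation, the sketch below computes age in completed years as of the study date of January 1, 1997. This is Python rather than SPSS; the helper name completed_years is hypothetical, and we assume that the SPSS date difference in "years" counts whole years only.

```python
from datetime import date


def completed_years(start, end):
    """Number of whole years elapsed between two dates."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


study_date = date(1997, 1, 1)

# Earliest and latest dates of birth in the dataset.
print(completed_years(date(1905, 12, 21), study_date))
print(completed_years(date(1956, 8, 1), study_date))
```

The two birth dates used here are the earliest and latest in the data, and the resulting ages (91 and 40) agree with the age range of 40 to 91 years stated for this study.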
Descriptives and Frequencies
We first get descriptive statistics for all the numerical variables in the dataset. Notice that although there are 370
observations in the dataset, we have only 191 cases that are complete for all variables (Valid n listwise=191).
DESCRIPTIVES
VARIABLES=idnum stopmens agestop1 numpreg1 agebirth mamfreq4 educ totincom
smoker weight1 birthdate yearbirth edcat highed studymonth studyyear
studyday studydate age highage over50 menopause
/STATISTICS=MEAN STDDEV MIN MAX .
Descriptive Statistics

                     N     Minimum      Maximum      Mean         Std. Deviation
idnum                370   1008         2448         1761.69      412.729
stopmens             369   1            2            1.16         .367
agestop1             297   27           61           47.18        6.310
numpreg1             366   0            12           2.95         1.873
agebirth             324   9            39           23.98        4.829
mamfreq4             328   1            6            2.94         1.381
educ                 365   1            9            5.64         1.637
totincom             325   1            5            3.83         1.308
smoker               364   1            2            1.49         .500
weight1              360   86           295          148.35       31.109
birthdate            361   12/21/1905   8/01/1956    5/16/1938    96170:54:55
yearbirth            361   1905         1956         1937.86      10.984
edcat                364   1            3            2.01         .769
highed               364   0            1            .44          .497
studymonth           370   1            1            1.00         .000
studyyear            370   1997         1997         1997.00      .000
studyday             370   1            1            1.00         .000
studydate            370   1/01/1997    1/01/1997    1/01/1997    00:00:00
age                  361   40           91           58.14        10.990
highage              361   1            2            1.27         .447
over50               361   0            1            .73          .447
menopause            369   .00          1.00         .8401        .36700
Valid N (listwise)   191
Next, we examine oneway frequencies for selected variables.
FREQUENCIES
VARIABLES= birthdate stopmens menopause educ edcat age over50 highage
/ORDER= ANALYSIS .
birthdate

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     12/21/1905  1           .3        .3              .3
          9/11/1909   1           .3        .3              .6
          12/04/1909  1           .3        .3              .8
          7/15/1911   1           .3        .3              1.1
          ...
          2/24/1956   1           .3        .3              99.7
          8/01/1956   1           .3        .3              100.0
          Total       361         97.6      100.0
Missing   System      9           2.4
Total                 370         100.0
stopmens

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1           310         83.8      84.0            84.0
          2           59          15.9      16.0            100.0
          Total       369         99.7      100.0
Missing   9           1           .3
Total                 370         100.0
menopause

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00         59          15.9      16.0            16.0
          1.00        310         83.8      84.0            100.0
          Total       369         99.7      100.0
Missing   System      1           .3
Total                 370         100.0
educ

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1           1           .3        .3              .3
          2           4           1.1       1.1             1.4
          3           11          3.0       3.0             4.4
          4           89          24.1      24.4            28.8
          5           99          26.8      27.1            55.9
          6           50          13.5      13.7            69.6
          7           23          6.2       6.3             75.9
          8           87          23.5      23.8            99.7
          9           1           .3        .3              100.0
          Total       365         98.6      100.0
Missing   99          5           1.4
Total                 370         100.0
edcat

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1           105         28.4      28.8            28.8
          2           149         40.3      40.9            69.8
          3           110         29.7      30.2            100.0
          Total       364         98.4      100.0
Missing   System      6           1.6
Total                 370         100.0
age

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     40          2           .5        .6              .6
          41          5           1.4       1.4             1.9
          42          7           1.9       1.9             3.9
          43          11          3.0       3.0             6.9
          44          7           1.9       1.9             8.9
          ...
          85          1           .3        .3              99.2
          87          2           .5        .6              99.7
          91          1           .3        .3              100.0
          Total       361         97.6      100.0
Missing   System      9           2.4
Total                 370         100.0
over50

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0           99          26.8      27.4            27.4
          1           262         70.8      72.6            100.0
          Total       361         97.6      100.0
Missing   System      9           2.4
Total                 370         100.0
highage

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1           262         70.8      72.6            72.6
          2           99          26.8      27.4            100.0
          Total       361         97.6      100.0
Missing   System      9           2.4
Total                 370         100.0
Crosstabulation
Prior to fitting a logistic regression model, we check a crosstabulation to understand the relationship between
menopause and high age. In this 2 by 2 table both the predictor variable, HIGHAGE, and the outcome variable,
STOPMENS, are coded as 1 and 2. For HIGHAGE, the value 1 represents the high risk group (those whose age
is greater than or equal to 50 years), and for STOPMENS, the value 1 represents the outcome of interest (those
who are in menopause). Notice also that HIGHAGE is considered to be the risk factor, so it is listed first (the
row variable) in the TABLES subcommand, and STOPMENS is the outcome of interest, so it is listed second (the
column variable). We request the relative risk and the odds ratio, along with the chi-square test of independence,
in the Statistics window.
CROSSTABS
/TABLES=highage BY stopmens
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ RISK
/CELLS= COUNT EXPECTED ROW
/COUNT ROUND CELL .
Case Processing Summary

                                        Cases
                     Valid              Missing            Total
                     N      Percent     N      Percent     N      Percent
highage * stopmens   360    97.3%       10     2.7%        370    100.0%

highage * stopmens Crosstabulation

                                        stopmens
                                        1          2          Total
highage   1       Count                 251        10         261
                  Expected Count        218.2      42.8       261.0
                  % within highage      96.2%      3.8%       100.0%
          2       Count                 50         49         99
                  Expected Count        82.8       16.2       99.0
                  % within highage      50.5%      49.5%      100.0%
Total             Count                 301        59         360
                  Expected Count        301.0      59.0       360.0
                  % within highage      83.6%      16.4%      100.0%
Chi-Square Tests

                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              109.219a   1    .000
Continuity Correctionb          105.912    1    .000
Likelihood Ratio                99.081     1    .000
Fisher's Exact Test                                           .000         .000
Linear-by-Linear Association    108.916    1    .000
N of Valid Cases                360

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.23.
b. Computed only for a 2x2 table
The output below says "For cohort stopmens=1". This is what we want: the risk of menopause for those who
are high age (ROW1) divided by the risk of menopause for those who are not high age (ROW2). Notice that the
odds ratio (24.6) is not a good estimate of the risk ratio (1.90), because the outcome is not rare in this group of
older women.
Risk Estimate

                                              95% Confidence Interval
                                   Value      Lower      Upper
Odds Ratio for highage (1 / 2)     24.598     11.680     51.802
For cohort stopmens = 1            1.904      1.564      2.318
For cohort stopmens = 2            .077       .041       .147
N of Valid Cases                   360
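The odds ratio and risk ratio can be reproduced by hand from the four cell counts in the highage * stopmens crosstabulation. The following is a minimal Python sketch (not SPSS output); the confidence interval uses the usual large-sample formula based on the standard error of the log odds ratio, which we assume is the method behind the interval reported by SPSS.

```python
import math

# Cell counts from the highage * stopmens crosstabulation.
a, b = 251, 10   # highage = 1: stopmens = 1, stopmens = 2
c, d = 50, 49    # highage = 2: stopmens = 1, stopmens = 2

# Odds ratio: odds of menopause in the high-age group over odds in the younger group.
odds_ratio = (a * d) / (b * c)

# Risk ratio for stopmens = 1: proportion in menopause, row 1 over row 2.
risk_ratio = (a / (a + b)) / (c / (c + d))

# Large-sample 95% confidence interval for the odds ratio (computed on the log scale).
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"risk ratio = {risk_ratio:.3f}")
print(f"95% CI for odds ratio = ({ci_lower:.2f}, {ci_upper:.2f})")
```

The risk ratio compares 96.2% with 50.5%, which is why it is so much smaller than the odds ratio: with an outcome this common, the odds (251/10 and 50/49) behave very differently from the proportions.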
Logistic Regression Model with a dummy variable predictor
We now fit a logistic regression model, but using two different variables: OVER50 (coded as 0, 1) is used as the
predictor, and MENOPAUSE (also coded as 0, 1) is used as the outcome. Note that SPSS will model the probability
of MENOPAUSE=1 by default, so we do not need to use any special syntax for this, as we did in SAS.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD = ENTER over50
/PRINT = CI(95)
/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
The output below shows us that we have 360 observations in this analysis.
Case Processing Summary

Unweighted Casesa                            N      Percent
Selected Cases     Included in Analysis      360    97.3
                   Missing Cases             10     2.7
                   Total                     370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of
cases.
The dependent variable encoding is as we expect, so we proceed to look at the rest of the output.
Dependent Variable Encoding

Original Value   Internal Value
.00              0
1.00             1
Block 0: Beginning Block
This gives us information about the model before any predictors have been added.
We will not be using the classification table for this analysis. In general, it is helpful if you have a diagnostic
test or other method that you wish to use to check the proportion of cases that are correctly classified, if a given
cutpoint is used for the probability of an event; the default cutpoint is .5.
Classification Tablea,b

                                         Predicted
                                         menopause            Percentage
Observed                                 .00        1.00      Correct
Step 0   menopause            .00        0          59        .0
                              1.00       0          301       100.0
         Overall Percentage                                   83.6

a. Constant is included in the model.
b. The cut value is .500
At step 0, the only predictor in the equation is the constant.
Variables in the Equation

                     B        S.E.    Wald       df    Sig.    Exp(B)
Step 0   Constant    1.630    .142    130.998    1     .000    5.102
We now get a score test for the significance of the predictor, OVER50. It will be significant, based on this score
test.
Variables not in the Equation

                                          Score      df    Sig.
Step 0   Variables            over50      109.219    1     .000
         Overall Statistics               109.219    1     .000
Block 1: Method = Enter
Now we enter the first block of variables. Because we are entering only OVER50, it is the only variable shown
here.
The Omnibus tests of model coefficients are testing the overall model. Because we have only one predictor in
the model, we have a 1 df test, which is significant.
Omnibus Tests of Model Coefficients

                   Chi-square   df    Sig.
Step 1   Step      99.081       1     .000
         Block     99.081       1     .000
         Model     99.081       1     .000
The model summary table shows the -2 log likelihood for the model, and the Cox & Snell R-Square (called the
pseudo R-square in SAS) and the Nagelkerke R-square (called the maximum rescaled R-square in SAS). We
see that this model has explained about 41% of the variation in the outcome.
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      222.084a            .241                   .408

a. Estimation terminated at iteration number 6 because parameter
estimates changed by less than .001.
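Both R-square values can be recovered from the -2 log likelihood and the model chi-square. The following is a minimal Python sketch using the standard Cox & Snell and Nagelkerke formulas; we assume that the -2 log likelihood of the intercept-only model equals the model -2 log likelihood plus the model chi-square from the Omnibus Tests table.

```python
import math

n = 360                  # cases included in the analysis
neg2ll_model = 222.084   # -2 log likelihood of the fitted model
model_chi_square = 99.081

# -2 log likelihood of the intercept-only model (assumption stated above).
neg2ll_null = neg2ll_model + model_chi_square

# Cox & Snell R-square: 1 - (L0 / L1)^(2/n), written in terms of -2 log likelihoods.
cox_snell = 1 - math.exp(-model_chi_square / n)

# Largest value Cox & Snell can reach for these data: 1 - L0^(2/n).
cox_snell_max = 1 - math.exp(-neg2ll_null / n)

# Nagelkerke rescales Cox & Snell so that its maximum is 1.
nagelkerke = cox_snell / cox_snell_max

print(f"Cox & Snell R-square = {cox_snell:.3f}")
print(f"Nagelkerke R-square  = {nagelkerke:.3f}")
```

The rescaling explains why the Nagelkerke value is always at least as large as the Cox & Snell value for the same model.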
The value of the parameter estimate for OVER50 (3.2) tells us that the log-odds of being in menopause
are higher (because the estimate is positive) by 3.2 units for women who are over 50 compared to women who
are not. This result is significant, Wald chi-square (1 df) = 71.036, p < 0.001. The odds ratio (24.6) is easier to
interpret. It tells us that the odds of being in menopause are 24.6 times as high for a woman who is over 50 as
for a woman who is not. The 95% CI for the odds ratio (11.7 to 51.8) does not include 1, so we can be fairly
confident that there is a strong relationship between being over 50 and being in menopause.
Variables in the Equation

                                                                        95% C.I.for EXP(B)
                     B        S.E.    Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1a  over50      3.203    .380    71.036    1     .000    24.598    11.680    51.802
         Constant    .020     .201    .010      1     .920    1.020

a. Variable(s) entered on step 1: over50.
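To connect the parameter estimates to the crosstabulation, we can convert them back to predicted probabilities. The following is a minimal Python sketch; the helper name inv_logit is our own, and the coefficients are the rounded values printed by SPSS, so the results agree with the crosstabulation only up to rounding.

```python
import math


def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))


b0 = 0.020   # Constant
b1 = 3.203   # over50

# Predicted probability of menopause for each value of over50.
p_not_over50 = inv_logit(b0)        # over50 = 0
p_over50 = inv_logit(b0 + b1)       # over50 = 1

# Exponentiating the slope gives the odds ratio.
odds_ratio = math.exp(b1)

print(f"P(menopause | over50 = 0) = {p_not_over50:.3f}")
print(f"P(menopause | over50 = 1) = {p_over50:.3f}")
print(f"odds ratio = {odds_ratio:.1f}")
```

The two predicted probabilities match the row percents of women in menopause seen earlier for the younger and older groups (50.5% and 96.2%), as they must when the only predictor is a single indicator variable.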
Logistic Regression Model with a class variable as predictor
We now fit the same model, but using a class variable, HIGHAGE (coded as 1=Highage and 2=Not Highage),
as the predictor. We set up HIGHAGE as a categorical predictor, using the Indicator dummy variable coding.
We accept the default setup, so that the last (highest) category of HIGHAGE will be the reference. So, we will
be fitting a model in which we are comparing the odds of being in menopause for those women who are over 50
(HIGHAGE=1) to those who are not over 50 (HIGHAGE=2, the reference category).
Note that the results of this model fit are the same as in the previous model, but with some minor modifications
in the display.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER highage
/CONTRAST (highage)=Indicator
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
Case Processing Summary

Unweighted Casesa                            N      Percent
Selected Cases     Included in Analysis      360    97.3
                   Missing Cases             10     2.7
                   Total                     370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of
cases.

Dependent Variable Encoding

Original Value   Internal Value
.00              0
1.00             1
SPSS provides information on the coding of the categorical predictor (HIGHAGE). We can see that the 1
parameter that will be used for HIGHAGE has a value of 0 for HIGHAGE=2 (the younger group), which means
that this will be the reference category.
Categorical Variables Codings

                              Parameter coding
                  Frequency   (1)
highage   1       261         1.000
          2       99          .000
The output for the parameter estimate is slightly different than for the previous model. In this case, we see
highage(1), to emphasize that this is the first (and only) dummy variable for HIGHAGE. Refer to the table
showing the coding of the categorical variables to be sure of the interpretation of this parameter.
Variables in the Equation

                       B        S.E.    Wald      df    Sig.    Exp(B)
Step 1a  highage(1)    3.203    .380    71.036    1     .000    24.598
         Constant      .020     .201    .010      1     .920    1.020

a. Variable(s) entered on step 1: highage.
Logistic Regression Model with a class predictor with more than two categories
We now look at the relationship of education categories to menopause. Again, we begin by checking the crosstabulation between education and menopause, using the variable EDCAT as the "exposure" and STOPMENS as
the "outcome" or event. Because we are interested in the probability of STOPMENS = 1, for each level of
EDCAT, we really need only the row percents, so we request the row percents only. We see in the output that
the proportion of women in menopause decreases with increasing education level.
CROSSTABS
/TABLES=edcat BY stopmens
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT ROW
/COUNT ROUND CELL.
edcat * stopmens Crosstabulation

                                      stopmens
                                      1          2          Total
edcat   1       Count                 96         9          105
                % within edcat        91.4%      8.6%       100.0%
        2       Count                 125        23         148
                % within edcat        84.5%      15.5%      100.0%
        3       Count                 84         26         110
                % within edcat        76.4%      23.6%      100.0%
Total           Count                 305        58         363
                % within edcat        84.0%      16.0%      100.0%
Chi-Square Tests

                                Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square              9.117a    2    .010
Likelihood Ratio                9.337     2    .009
Linear-by-Linear Association    9.071     1    .003
N of Valid Cases                363

a. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 16.78.
We now fit a logistic regression model using EDCAT as a predictor. We include EDCAT as a categorical
predictor, with EDCAT=1 as the reference category.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER edcat
/CONTRAST (edcat)=Indicator(1)
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
We can look at the categorical variable information below to see that EDCAT=1 is the reference category,
because it has a value of 0 for both of the design (dummy) variables. The first dummy variable, EDCAT(1), will
be for EDCAT 2 vs. 1, and the second dummy variable, EDCAT(2), will be for EDCAT 3 vs. 1.
Case Processing Summary

Unweighted Casesa                            N      Percent
Selected Cases     Included in Analysis      363    98.1
                   Missing Cases             7      1.9
                   Total                     370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of
cases.

Dependent Variable Encoding

Original Value   Internal Value
.00              0
1.00             1

Categorical Variables Codings

                            Parameter coding
                Frequency   (1)       (2)
edcat   1       105         .000      .000
        2       148         1.000     .000
        3       110         .000      1.000
The table for Omnibus Tests of Model Coefficients provides an overall test for all parameters in the model.
Thus, we can see that there is a likelihood ratio chi-square test of whether there is any effect of EDCAT, Χ2
(2df) = 9.337, p=.009. In spite of the model being significant, the Nagelkerke R-Square is very small (.043).
Omnibus Tests of Model Coefficients

                   Chi-square   df    Sig.
Step 1   Step      9.337        2     .009
         Block     9.337        2     .009
         Model     9.337        2     .009

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      309.598a            .025                   .043

a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.
The parameter estimate for EDCAT(1) shows that the log-odds of menopause for someone with EDCAT=2 are
smaller than for someone with EDCAT=1, but this difference is not significant (p=0.105). The parameter
estimate for EDCAT(2) is also negative, indicating that someone with EDCAT=3 has a lower log-odds of
menopause than a person with EDCAT=1, and this difference is significant (p=0.004). The overall test for
EDCAT is a Wald chi-square, which is another test of the overall significance of EDCAT (chi-square (2 df) =
8.632, p=.013).
The odds ratio estimate for EDCAT(2) (Edcat 3 vs 1) is .303, indicating that the odds of being in menopause for
a person with EDCAT=3 are only 30% of the odds of being in menopause for a person with EDCAT=1.
Variables in the Equation

                                                                        95% C.I.for EXP(B)
                     B         S.E.    Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1a  edcat                         8.632     2     .013
         edcat(1)    -.674     .416    2.628     1     .105    .510      .225      1.151
         edcat(2)    -1.194    .415    8.299     1     .004    .303      .134      .683
         Constant    2.367     .349    46.107    1     .000    10.667

a. Variable(s) entered on step 1: edcat.
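Because EDCAT is the only predictor, each estimate in this table can be reproduced directly from the counts in the edcat * stopmens crosstabulation. The following is a minimal Python sketch of that arithmetic; it is an illustration only, not part of the SPSS analysis.

```python
import math

# Counts from the edcat * stopmens crosstabulation:
# (in menopause, not in menopause) for each education category.
counts = {1: (96, 9), 2: (125, 23), 3: (84, 26)}

# Odds of menopause within each category.
odds = {cat: yes / no for cat, (yes, no) in counts.items()}

# Odds ratios relative to the reference category, EDCAT = 1.
or_2_vs_1 = odds[2] / odds[1]
or_3_vs_1 = odds[3] / odds[1]

print(f"odds in reference category (Exp(B) for Constant) = {odds[1]:.3f}")
print(f"EDCAT 2 vs. 1: OR = {or_2_vs_1:.3f}, B = {math.log(or_2_vs_1):.3f}")
print(f"EDCAT 3 vs. 1: OR = {or_3_vs_1:.3f}, B = {math.log(or_3_vs_1):.3f}")
```

The constant is the log-odds of menopause in the reference category, and each dummy-variable estimate is the log of that category's odds ratio against the reference.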
Logistic Regression Model with a continuous predictor
We now look at a logistic regression model, but this time with a single continuous predictor (AGE). The
parameter estimate for AGE is positive (0.283) telling us that the log-odds of being in menopause increase by
.28 units for a woman who is one year older compared to her counterpart who is one year younger. The odds
ratio (1.33) tells us that the odds of being in menopause for a woman who is one year older are 1.33 times
greater than for a woman who is one year younger.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER age
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
Case Processing Summary

Unweighted Casesa                            N      Percent
Selected Cases     Included in Analysis      360    97.3
                   Missing Cases             10     2.7
                   Total                     370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of
cases.

Dependent Variable Encoding

Original Value   Internal Value
.00              0
1.00             1

Omnibus Tests of Model Coefficients

                   Chi-square   df    Sig.
Step 1   Step      124.146      1     .000
         Block     124.146      1     .000
         Model     124.146      1     .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      197.019a            .292                   .494

a. Estimation terminated at iteration number 7 because
parameter estimates changed by less than .001.

Variables in the Equation

                                                                         95% C.I.for EXP(B)
                     B          S.E.     Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1a  age         .283       .040     49.765    1     .000    1.327     1.227     1.436
         Constant    -12.868    1.936    44.174    1     .000    .000

a. Variable(s) entered on step 1: age.
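With a continuous predictor, it is often helpful to translate the estimates into predicted probabilities at a few ages. The following is a minimal Python sketch using the rounded estimates from the table above; the function name predicted_probability and the ages chosen are our own illustration.

```python
import math

b0 = -12.868   # Constant
b1 = 0.283     # age


def predicted_probability(age):
    """Predicted probability of menopause at a given age under the fitted model."""
    log_odds = b0 + b1 * age
    return 1 / (1 + math.exp(-log_odds))


for age in (40, 45, 50, 55, 60):
    print(f"age {age}: predicted probability = {predicted_probability(age):.3f}")

# Age at which the predicted probability is exactly 0.5 (log-odds of zero).
age_at_half = -b0 / b1

# Odds ratio for a 10-year difference in age: multiply the slope by 10 first.
odds_ratio_10_years = math.exp(10 * b1)

print(f"age at which probability = 0.5: {age_at_half:.1f}")
print(f"odds ratio for a 10-year difference: {odds_ratio_10_years:.1f}")
```

Note that the odds ratio for a difference of several years is obtained by exponentiating the multiplied slope, not by multiplying the one-year odds ratio by the number of years.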
Quasi-Complete Separation in a Logistic Regression Model
One fairly common occurrence in a logistic regression model is that the model fails to converge. This often
happens when you have a categorical predictor that is too perfect, that is, there may be a category with no
variability in the response (all subjects in one category of the predictor have the same response). This is called
quasi-complete separation. When this happens, SPSS will give a warning message in the output. These
warnings should be taken seriously, and the model should be refitted, perhaps by combining some categories of
the predictor.
Even if there is not quasi-complete separation, separation may be nearly complete, so the standard error for a
parameter estimate can become very large. It is good practice to examine the parameter estimates and their
standard errors carefully for any logistic regression output.
We now examine a situation where quasi-complete separation occurs, using the variable AGECAT as a
predictor in a logistic regression. First we check the crosstabulation between AGECAT and STOPMENS.
Notice that in the highest age category, all 71 women are in menopause (not surprisingly).
CROSSTABS
/TABLES=agecat BY stopmens
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT ROW
/COUNT ROUND CELL.
agecat * stopmens Crosstabulation

                                       stopmens
                                       1          2          Total
agecat   1       Count                 50         49         99
                 % within agecat       50.5%      49.5%      100.0%
         2       Count                 106        9          115
                 % within agecat       92.2%      7.8%       100.0%
         3       Count                 74         1          75
                 % within agecat       98.7%      1.3%       100.0%
         4       Count                 71         0          71
                 % within agecat       100.0%     .0%        100.0%
Total            Count                 301        59         360
                 % within agecat       83.6%      16.4%      100.0%

Chi-Square Tests

                                Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square              111.660a    3    .000
Likelihood Ratio                110.175     3    .000
Linear-by-Linear Association    78.698      1    .000
N of Valid Cases                360

a. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 11.64.
We now fit the corresponding logistic regression model, using AGECAT as a categorical predictor, with
AGECAT=1 as the reference category.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER agecat
/CONTRAST (agecat)=Indicator(1)
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
We check the Categorical Variables Codings to see that the design variables are set up correctly to have
AGECAT=1 as the reference category, and notice that the dummy variables for AGECAT will be labeled
AGECAT(1), AGECAT(2), and AGECAT(3). These will correspond to AGECAT=2, 3, and 4, respectively.
Categorical Variables Codings

                             Parameter coding
                 Frequency   (1)       (2)       (3)
agecat   1       99          .000      .000      .000
         2       115         1.000     .000      .000
         3       75          .000      1.000     .000
         4       71          .000      .000      1.000
The note in the output below alerts us to the problem of quasi-complete separation of data points. Also notice
that the estimate for AGECAT(3) has a standard error of 4770, compared to about 1 or less for the other two
dummy variables. We can also see that the estimate of the odds ratio for this dummy variable is basically
infinity (1.583 E 9, which is 1,583,000,000).
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      210.990a            .264                   .447

a. Estimation terminated at iteration number 20
because maximum iterations has been reached. Final
solution cannot be found.

Variables in the Equation

                                                                            95% C.I.for EXP(B)
                      B         S.E.        Wald      df    Sig.    Exp(B)     Lower     Upper
Step 1a  agecat                             50.075    3     .000
         agecat(1)    2.446     .401        37.172    1     .000    11.542     5.258     25.339
         agecat(2)    4.284     1.027       17.413    1     .000    72.520     9.696     542.384
         agecat(3)    21.183    4770.028    .000      1     .996    1.583E9    .000      .
         Constant     .020      .201        .010      1     .920    1.020

a. Variable(s) entered on step 1: agecat.
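The reason the estimate for AGECAT(3) cannot be computed is visible in the cell counts themselves. The following is a minimal Python sketch (an illustration, not SPSS output) that computes the odds ratio for each age category against AGECAT=1 and shows what happens when a category has no women who are not in menopause.

```python
import math

# Counts from the agecat * stopmens crosstabulation:
# (in menopause, not in menopause) for each age category.
counts = {1: (50, 49), 2: (106, 9), 3: (74, 1), 4: (71, 0)}


def odds(yes, no):
    """Odds of menopause; infinite when no one in the category is a non-event."""
    return math.inf if no == 0 else yes / no


reference_odds = odds(*counts[1])

odds_ratios = {cat: odds(yes, no) / reference_odds for cat, (yes, no) in counts.items()}

for cat in (2, 3, 4):
    print(f"AGECAT {cat} vs. 1: odds ratio = {odds_ratios[cat]:.3f}")

# Collapsing categories 3 and 4 leaves one non-event, so the odds ratio is finite again.
yes_34 = counts[3][0] + counts[4][0]
no_34 = counts[3][1] + counts[4][1]
collapsed_or = odds(yes_34, no_34) / reference_odds
print(f"AGECAT 3-4 combined vs. 1: odds ratio = {collapsed_or:.1f}")
```

The zero cell in category 4 makes the sample odds ratio infinite, which is why the iterative fitting procedure keeps increasing the estimate until it reaches the maximum number of iterations.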
Based on the information that we saw in the crosstabulation, we will create a new variable AGECAT3, with 3
age categories, collapsing category 3 and category 4.
RECODE agecat
(MISSING=SYSMIS)
(1=1) (2=2) (3 thru 4=3) INTO agecat3.
EXECUTE.
We now fit a new logistic regression, with AGECAT3 as a categorical predictor. Note that in the output for this
model we no longer have a problem with quasi-complete separation. However, we see a very wide confidence
interval for AGECAT3(2), owing to the fact that there was only one participant in this group with a value of 0 on
the dependent variable.
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER agecat3
/CONTRAST (agecat3)=Indicator(1)
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      212.329a            .261                   .442

a. Estimation terminated at iteration number 8 because
parameter estimates changed by less than .001.

Variables in the Equation

                                                                          95% C.I.for EXP(B)
                       B        S.E.     Wald      df    Sig.    Exp(B)     Lower     Upper
Step 1a  agecat3                         55.353    2     .000
         agecat3(1)    2.446    .401     37.172    1     .000    11.542     5.258     25.339
         agecat3(2)    4.957    1.023    23.458    1     .000    142.100    19.120    1056.078
         Constant      .020     .201     .010      1     .920    1.020

a. Variable(s) entered on step 1: agecat3.
Logistic Regression Model with Several Predictors
We now fit a logistic regression model with several predictors, both continuous and categorical. Note especially
the global test for the model, which has 6 degrees of freedom, due to the 6 parameters that are estimated for the
predictors in the model. There are two parameters for EDCAT and one each for AGE, SMOKER, TOTINCOM,
and NUMPREG1. The only predictor that is significant in this model is AGE (p<0.001).
LOGISTIC REGRESSION VARIABLES menopause
/METHOD=ENTER age edcat smoker totincom numpreg1
/CONTRAST (edcat)=Indicator(1)
/PRINT=CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).
Note that there are only 313 observations included in this model.
Case Processing Summary

Unweighted Casesa                            N      Percent
Selected Cases     Included in Analysis      313    84.6
                   Missing Cases             57     15.4
                   Total                     370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of
cases.

Categorical Variables Codings

                            Parameter coding
                Frequency   (1)       (2)
edcat   1       87          .000      .000
        2       128         1.000     .000
        3       98          .000      1.000
This table gives us an idea of the relative p-values for each of the potential predictors in the model. Note that we
see EDCAT with 2 df, which is an overall test for EDCAT, and we also see the two dummy variables for
EDCAT, EDCAT(1) and EDCAT(2).
Variables not in the Equation

                                          Score     df    Sig.
Step 0   Variables            age         65.729    1     .000
                              edcat       9.847     2     .007
                              edcat(1)    .001      1     .980
                              edcat(2)    6.815     1     .009
                              smoker      3.704     1     .054
                              totincom    8.501     1     .004
                              numpreg1    4.943     1     .026
         Overall Statistics               73.151    6     .000

Omnibus Tests of Model Coefficients

                   Chi-square   df    Sig.
Step 1   Step      110.366      6     .000
         Block     110.366      6     .000
         Model     110.366      6     .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      177.510a            .297                   .494

a. Estimation terminated at iteration number 7 because
parameter estimates changed by less than .001.
Again, in the output below, we see an overall test for EDCAT, with 2 df, showing that it is not significant in the
model. We also see the output for the two EDCAT dummy variables.
Variables in the Equation

                                                                         95% C.I.for EXP(B)
                     B          S.E.     Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1a  age         .280       .044     40.609    1     .000    1.323     1.214     1.442
         edcat                           2.378     2     .305
         edcat(1)    -.436      .552     .622      1     .430    .647      .219      1.910
         edcat(2)    -.840      .564     2.221     1     .136    .432      .143      1.303
         smoker      -.654      .384     2.909     1     .088    .520      .245      1.102
         totincom    -.093      .168     .303      1     .582    .911      .655      1.268
         numpreg1    .006       .131     .002      1     .961    1.006     .779      1.300
         Constant    -10.815    2.213    23.879    1     .000    .000

a. Variable(s) entered on step 1: age, edcat, smoker, totincom, numpreg1.
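The odds ratios and confidence limits in this table follow from the parameter estimates and their standard errors. The following is a minimal Python sketch, assuming the usual Wald interval, exp(B ± 1.96 × S.E.); the function name wald_interval is our own. Because the inputs are the rounded values printed by SPSS, the results may differ from the table in the third decimal place.

```python
import math

# (B, S.E.) for selected predictors, as printed in the Variables in the Equation table.
estimates = {
    "age": (0.280, 0.044),
    "smoker": (-0.654, 0.384),
    "totincom": (-0.093, 0.168),
}


def wald_interval(b, se, z=1.96):
    """Return the odds ratio and its Wald 95% confidence limits."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


results = {name: wald_interval(b, se) for name, (b, se) in estimates.items()}

for name, (odds_ratio, lower, upper) in results.items():
    print(f"{name:9s} OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that includes 1, as for SMOKER and TOTINCOM here, corresponds to a Wald test that is not significant at the 0.05 level.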