Logistic Regression Using SPSS

We will use the same breast cancer dataset for this handout as we did for the handout on logistic regression using SAS. Shown below is a table listing the variables in "A study of preventive lifestyles and women's health," conducted by a group of students in the School of Public Health at the University of Michigan during the 1997 winter term. There are 370 women in this study, aged 40 to 91 years.

Description of variables:

Variable Name   Description                                                Column
IDNUM           Identification number                                      1-4
STOPMENS        1=Yes, 2=No, 9=Missing                                     5
AGESTOP1        88=NA (haven't stopped), 99=Missing                        6-7
NUMPREG1        88=NA (no births), 99=Missing                              8-9
AGEBIRTH        88=NA (no births), 99=Missing                              10-11
MAMFREQ4        1=Every 6 months, 2=Every year, 3=Every 2 years,           12
                4=Every 5 years, 5=Never, 6=Other, 9=Missing
DOB             01/01/00 to 12/31/57, 99/99/99=Missing                     13-20
EDUC            1=No formal school, 2=Grade school, 3=Some high school,    21-22
                4=High school graduate/diploma equivalent,
                5=Some college education/Associate's degree,
                6=College graduate, 7=Some graduate school,
                8=Graduate school or professional degree,
                9=Other, 99=Missing
TOTINCOM        1=Less than $10,000, 2=$10,000 to 24,999,                  23
                3=$25,000 to 39,999, 4=$40,000 to 54,999,
                5=More than $55,000, 8=Don't know, 9=Missing
SMOKER          1=Yes, 2=No, 9=Missing                                     24
WEIGHT1         999=Missing                                                25-27

In SPSS, we use the SET EPOCH command to define the 100-year window SPSS will use for a two-digit year. We use SET epoch=1900 so that a date of birth of 12/21/05 will be read as Dec 21, 1905, rather than as Dec 21, 2005.

SET epoch=1900.

The DATA LIST command reads in the raw data. We initially read in DOB as a character variable. We then set the missing value code for DOB (09/09/99) to blank. We compute a new variable called BIRTHDATE, which is the numeric version of DOB, using the SPSS date and time wizard. We also use the date and time wizard to calculate the variables STUDYDATE and AGE.

SET epoch=1900.
data list file "c:\documents and settings\kwelch\desktop\b510\brca.dat" records=1
 /idnum 1-4 stopmens 5 agestop1 6-7 numpreg1 8-9 agebirth 10-11 mamfreq4 12
  dob 13-20 (A) educ 21-22 totincom 23 smoker 24 weight1 25-27.
execute.

RECODE dob ('09/09/99'=' ').
EXECUTE.

* Date and Time Wizard: birthdate.
COMPUTE birthdate = number(dob, ADATE8).
VARIABLE LABEL birthdate.
VARIABLE LEVEL birthdate (SCALE).
FORMATS birthdate (ADATE10).
VARIABLE WIDTH birthdate(10).
EXECUTE.

missing values stopmens mamfreq4 smoker (9) agestop1 agebirth (88,99)
 numpreg1 educ (99) totincom (8,9) weight1 (999).

RECODE stopmens (1=1) (2=0) INTO menopause.
EXECUTE.

COMPUTE yearbirth = XDATE.YEAR(birthdate).
FORMATS yearbirth (F8.0).
VARIABLE WIDTH yearbirth(8).
execute.

compute studymonth = 1.
format studymonth (f2.0).
compute studyyear = 1997.
format studyyear (f4.0).
compute studyday = 1.
format studyday (f1.0).
execute.

COMPUTE studydate = DATE.DMY(studyday, studymonth, studyyear).
FORMATS studydate (ADATE10).
EXECUTE.

COMPUTE age = DATEDIF(studydate, birthdate, "years").
FORMATS age (F5.0).
EXECUTE.

RECODE educ (MISSING=SYSMIS) (1 thru 4=1) (5 thru 6=2) (7 thru 8=3) INTO edcat.
formats edcat (f2.0).
EXECUTE.

Recode educ (missing=sysmis) (6 thru 8 = 1) (else=0) into highed.
formats highed (f2.0).
EXECUTE.

RECODE age (missing=sysmis) (lowest thru 49=1) (50 thru 59=2) (60 thru 69=3)
 (70 thru highest=4) into agecat.
formats agecat (f2.0).
EXECUTE.

Recode age (missing=sysmis) (50 thru highest = 1) (else=0) into over50.
formats over50 (f2.0).
EXECUTE.

Recode age (missing=sysmis) (50 thru highest = 1) (else=2) into highage.
formats highage (f2.0).
EXECUTE.
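Before saving the file, it can also be convenient to attach variable and value labels to the new derived variables so that later output is easier to read. The handout does not do this, and the label text below is only a suggested wording, not part of the original study codebook:

VARIABLE LABELS menopause 'In menopause (recoded from stopmens)'
 /edcat 'Education category'
 /over50 'Age 50 or older'
 /highage 'Age group (1=50 or older, 2=under 50)'.
VALUE LABELS menopause 0 'No' 1 'Yes'
 /over50 0 'Under 50' 1 '50 or older'
 /highage 1 '50 or older' 2 'Under 50'
 /edcat 1 'High school or less' 2 'Some college or college graduate' 3 'Graduate education'.
EXECUTE.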
SAVE OUTFILE='C:\Documents and Settings\kwelch\Desktop\b510\brca.sav'
 /COMPRESSED.

Descriptives and Frequencies

We first get descriptive statistics for all the numerical variables in the dataset. Notice that although there are 370 observations in the dataset, we have only 191 cases that are complete for all variables (Valid N listwise = 191).

DESCRIPTIVES VARIABLES=idnum stopmens agestop1 numpreg1 agebirth mamfreq4 educ totincom
  smoker weight1 birthdate yearbirth edcat highed studymonth studyyear studyday studydate
  age highage over50 menopause
 /STATISTICS=MEAN STDDEV MIN MAX .

Descriptive Statistics

                     N     Minimum      Maximum      Mean         Std. Deviation
idnum                370   1008         2448         1761.69      412.729
stopmens             369   1            2            1.16         .367
agestop1             297   27           61           47.18        6.310
numpreg1             366   0            12           2.95         1.873
agebirth             324   9            39           23.98        4.829
mamfreq4             328   1            6            2.94         1.381
educ                 365   1            9            5.64         1.637
totincom             325   1            5            3.83         1.308
smoker               364   1            2            1.49         .500
weight1              360   86           295          148.35       31.109
birthdate            361   12/21/1905   8/01/1956    5/16/1938    96170:54:55
yearbirth            361   1905         1956         1937.86      10.984
edcat                364   1            3            2.01         .769
highed               364   0            1            .44          .497
studymonth           370   1            1            1.00         .000
studyyear            370   1997         1997         1997.00      .000
studyday             370   1            1            1.00         .000
studydate            370   1/01/1997    1/01/1997    1/01/1997    00:00:00
age                  361   40           91           58.14        10.990
highage              361   1            2            1.27         .447
over50               361   0            1            .73          .447
menopause            369   .00          1.00         .8401        .36700
Valid N (listwise)   191

Next, we examine one-way frequencies for selected variables.

FREQUENCIES VARIABLES= birthdate stopmens menopause educ edcat age over50 highage
 /ORDER= ANALYSIS .

birthdate
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   12/21/1905        1          .3          .3                 .3
        9/11/1909         1          .3          .3                 .6
        12/04/1909        1          .3          .3                 .8
        7/15/1911         1          .3          .3                1.1
        ...
        2/24/1956         1          .3          .3               99.7
        8/01/1956         1          .3          .3              100.0
        Total           361        97.6       100.0
Missing System            9         2.4
Total                   370       100.0

stopmens
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1               310        83.8        84.0               84.0
        2                59        15.9        16.0              100.0
        Total           369        99.7       100.0
Missing 9                 1          .3
Total                   370       100.0

menopause
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00              59        15.9        16.0               16.0
        1.00            310        83.8        84.0              100.0
        Total           369        99.7       100.0
Missing System            1          .3
Total                   370       100.0

educ
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1                 1          .3          .3                 .3
        2                 4         1.1         1.1                1.4
        3                11         3.0         3.0                4.4
        4                89        24.1        24.4               28.8
        5                99        26.8        27.1               55.9
        6                50        13.5        13.7               69.6
        7                23         6.2         6.3               75.9
        8                87        23.5        23.8               99.7
        9                 1          .3          .3              100.0
        Total           365        98.6       100.0
Missing 99                5         1.4
Total                   370       100.0

edcat
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1               105        28.4        28.8               28.8
        2               149        40.3        40.9               69.8
        3               110        29.7        30.2              100.0
        Total           364        98.4       100.0
Missing System            6         1.6
Total                   370       100.0

age
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   40                2          .5          .6                 .6
        41                5         1.4         1.4                1.9
        42                7         1.9         1.9                3.9
        43               11         3.0         3.0                6.9
        44                7         1.9         1.9                8.9
        ...
        85                1          .3          .3               99.2
        87                2          .5          .6               99.7
        91                1          .3          .3              100.0
        Total           361        97.6       100.0
Missing System            9         2.4
Total                   370       100.0

over50
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   0                99        26.8        27.4               27.4
        1               262        70.8        72.6              100.0
        Total           361        97.6       100.0
Missing System            9         2.4
Total                   370       100.0

highage
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1               262        70.8        72.6               72.6
        2                99        26.8        27.4              100.0
        Total           361        97.6       100.0
Missing System            9         2.4
Total                   370       100.0

Crosstabulation

Prior to fitting a logistic regression model, we check a crosstabulation to understand the relationship between menopause and high age. In this 2 by 2 table, both the predictor variable, HIGHAGE, and the outcome variable, STOPMENS, are coded as 1 and 2.
For HIGHAGE, the value 1 represents the high-risk group (those whose age is greater than or equal to 50 years), and for STOPMENS, the value 1 represents the outcome of interest (those who are in menopause). Notice also that HIGHAGE is considered to be the risk factor, so it is listed first (as the row variable) on the TABLES subcommand, and STOPMENS is the outcome of interest, so it is listed second (as the column variable). We request the relative risk and the odds ratio, along with the chi-square test of independence, in the Statistics window.

CROSSTABS
 /TABLES=highage BY stopmens
 /FORMAT= AVALUE TABLES
 /STATISTIC=CHISQ RISK
 /CELLS= COUNT EXPECTED ROW
 /COUNT ROUND CELL .

Case Processing Summary
                                  Cases
                      Valid             Missing           Total
                      N      Percent    N     Percent     N      Percent
highage * stopmens    360    97.3%      10    2.7%        370    100.0%

highage * stopmens Crosstabulation
                                       stopmens
                                       1         2         Total
highage   1     Count                  251       10        261
                Expected Count         218.2     42.8      261.0
                % within highage       96.2%     3.8%      100.0%
          2     Count                  50        49        99
                Expected Count         82.8      16.2      99.0
                % within highage       50.5%     49.5%     100.0%
Total           Count                  301       59        360
                Expected Count         301.0     59.0      360.0
                % within highage       83.6%     16.4%     100.0%

Chi-Square Tests
                                 Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square               109.219a    1    .000
Continuity Correction(b)         105.912    1    .000
Likelihood Ratio                  99.081    1    .000
Fisher's Exact Test                                             .000         .000
Linear-by-Linear Association     108.916    1    .000
N of Valid Cases                     360

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.23.
b. Computed only for a 2x2 table.

The output below says "For cohort stopmens = 1". This is what we want: the risk of menopause for those who are high age (row 1) divided by the risk of menopause for those who are not high age (row 2). Notice that the odds ratio (24.6) is not a good estimate of the risk ratio (1.90), because the outcome is not rare in this group of older women. From the counts in the crosstabulation, the odds ratio is (251/10)/(50/49) = 24.6, while the risk ratio is (251/261)/(50/99) = 1.90.

Risk Estimate
                                     Value     95% Confidence Interval
                                               Lower        Upper
Odds Ratio for highage (1 / 2)       24.598    11.680       51.802
For cohort stopmens = 1               1.904     1.564        2.318
For cohort stopmens = 2                .077      .041         .147
N of Valid Cases                        360

Logistic Regression Model with a dummy variable predictor

We now fit a logistic regression model, but using two different variables: OVER50 (coded as 0, 1) is used as the predictor, and MENOPAUSE (also coded as 0, 1) is used as the outcome. Note that SPSS will model the probability that MENOPAUSE=1 by default, so we do not need to use any special syntax for this, as we did in SAS.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD = ENTER over50
 /PRINT = CI(95)
 /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

The output below shows us that we have 360 observations in this analysis.

Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases    Included in Analysis       360    97.3
                  Missing Cases              10     2.7
                  Total                      370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of cases.

The dependent variable encoding is as we expect, so we proceed to look at the rest of the output.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Block 0: Beginning Block

This gives us information about the model before any predictors have been added. We will not be using the classification table for this analysis. In general, the classification table is helpful if you have a diagnostic test or other method and wish to check the proportion of cases that are correctly classified when a given cutpoint is used for the predicted probability of an event; the default cutpoint is .5.
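If you did want the classification table to use a different cutoff for the predicted probability, you could change the value of CUT on the CRITERIA subcommand. The line below is only an illustration of the option; the handout keeps the default value of .5 throughout:

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD = ENTER over50
 /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.3) .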
Classification Table(a,b)
                                     Predicted
                                     menopause              Percentage
Observed                             .00        1.00        Correct
Step 0   menopause    .00            0          59          .0
                      1.00           0          301         100.0
         Overall Percentage                                  83.6

a. Constant is included in the model.
b. The cut value is .500

At step 0, the only predictor in the equation is the constant.

Variables in the Equation
                     B        S.E.    Wald       df    Sig.    Exp(B)
Step 0   Constant    1.630    .142    130.998    1     .000    5.102

We now get a score test for the significance of the predictor, OVER50. It will be significant, based on this score test.

Variables not in the Equation
                                Score      df    Sig.
Step 0   Variables   over50     109.219    1     .000
         Overall Statistics     109.219    1     .000

Block 1: Method = Enter

Now we enter the first block of variables; because we are entering only OVER50, it is the only variable shown here. The Omnibus Tests of Model Coefficients test the overall model. Because we have only one predictor in the model, we have a 1 df test, which is significant.

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      99.081        1     .000
         Block     99.081        1     .000
         Model     99.081        1     .000

The model summary table shows the -2 log likelihood for the model, the Cox & Snell R-Square (called the pseudo R-square in SAS), and the Nagelkerke R-square (called the maximum rescaled R-square in SAS). We see that this model has explained about 41% of the variation in the outcome.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       222.084(a)           .241                    .408

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

The value of the parameter estimate for OVER50 (3.2) tells us that the log-odds of being in menopause increase (because the estimate is positive) by 3.2 units for women who are over 50 compared to women who are not over 50. This result is significant, Wald chi-square (1 df) = 71.036, p < 0.001. The odds ratio (24.6) is easier to interpret. It tells us that the odds of being in menopause are 24.6 times higher for a woman who is over 50 than for someone who is not. We can see that the 95% CI for the odds ratio does not include 1, so we can be pretty confident that there is a strong relationship between being over 50 and being in menopause.

Variables in the Equation
                                                                          95% C.I. for EXP(B)
                       B        S.E.    Wald      df   Sig.   Exp(B)     Lower      Upper
Step 1(a)   over50     3.203    .380    71.036    1    .000   24.598     11.680     51.802
            Constant   .020     .201    .010      1    .920   1.020

a. Variable(s) entered on step 1: over50.
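As a check on the model (a hand calculation, not part of the SPSS output), we can convert the estimates back to predicted probabilities. For a woman who is over 50, the log-odds are .020 + 3.203 = 3.223, so the predicted probability of menopause is exp(3.223)/(1 + exp(3.223)) = .96, which matches the 96.2% observed in the HIGHAGE=1 row of the crosstabulation. For a woman under 50, the log-odds are just the constant, .020, giving exp(.020)/(1 + exp(.020)) = .51, which matches the 50.5% observed in the HIGHAGE=2 row.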
Logistic Regression Model with a class variable as predictor

We now fit the same model, but using a class variable, HIGHAGE (coded as 1=High age and 2=Not high age), as the predictor. We set up HIGHAGE as a categorical predictor, using the Indicator dummy variable coding. We accept the default setup, so that the last (highest) category of HIGHAGE will be the reference. So, we will be fitting a model in which we are comparing the odds of being in menopause for those women who are over 50 (HIGHAGE=1) to those who are not over 50 (HIGHAGE=2, the reference category). Note that the results of this model fit are the same as in the previous model, but with some minor modifications in the display.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER highage
 /CONTRAST (highage)=Indicator
 /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases    Included in Analysis       360    97.3
                  Missing Cases              10     2.7
                  Total                      370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

SPSS provides information on the coding of the categorical predictor (HIGHAGE). We can see that the one parameter that will be used for HIGHAGE has a value of 0 for HIGHAGE=2 (the younger group), which means that this will be the reference category.

Categorical Variables Codings
                      Frequency    Parameter coding
                                   (1)
highage    1          261          1.000
           2          99           .000

The output for the parameter estimate is slightly different than for the previous model. In this case, we see highage(1), to emphasize that this is the first (and only) dummy variable for HIGHAGE. Refer to the table showing the coding of the categorical variables to be sure of the interpretation of this parameter.

Variables in the Equation
                         B        S.E.    Wald      df    Sig.    Exp(B)
Step 1(a)   highage(1)   3.203    .380    71.036    1     .000    24.598
            Constant     .020     .201    .010      1     .920    1.020

a. Variable(s) entered on step 1: highage.

Logistic Regression Model with a class predictor with more than two categories

We now look at the relationship of education categories to menopause. Again, we begin by checking the crosstabulation between education and menopause, using the variable EDCAT as the "exposure" and STOPMENS as the "outcome" or event. Because we are interested in the probability of STOPMENS = 1 for each level of EDCAT, we really need only the row percents, so we request the row percents only. We see in the output that the proportion of women in menopause decreases with increasing education level.

CROSSTABS
 /TABLES=edcat BY stopmens
 /FORMAT=AVALUE TABLES
 /STATISTICS=CHISQ
 /CELLS=COUNT ROW
 /COUNT ROUND CELL.

edcat * stopmens Crosstabulation
                                     stopmens
                                     1         2         Total
edcat   1     Count                  96        9         105
              % within edcat         91.4%     8.6%      100.0%
        2     Count                  125       23        148
              % within edcat         84.5%     15.5%     100.0%
        3     Count                  84        26        110
              % within edcat         76.4%     23.6%     100.0%
Total         Count                  305       58        363
              % within edcat         84.0%     16.0%     100.0%

Chi-Square Tests
                                 Value     df    Asymp. Sig. (2-sided)
Pearson Chi-Square               9.117a    2     .010
Likelihood Ratio                 9.337     2     .009
Linear-by-Linear Association     9.071     1     .003
N of Valid Cases                 363

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.78.

We now fit a logistic regression model using EDCAT as a categorical predictor, with EDCAT=1 as the reference category.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER edcat
 /CONTRAST (edcat)=Indicator(1)
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

We can look at the categorical variable information below to see that EDCAT=1 is the reference category, because it has a value of 0 for both of the design (dummy) variables. The first dummy variable, EDCAT(1), will be for EDCAT 2 vs. 1, and the second dummy variable, EDCAT(2), will be for EDCAT 3 vs. 1.
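As an aside, it is the Indicator(1) specification that makes the first category the reference. If you instead wanted the last (highest) category of EDCAT to be the reference, as we accepted by default for HIGHAGE, you could omit the category number. This is only an alternative setup, not what we do in this handout:

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER edcat
 /CONTRAST (edcat)=Indicator
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).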
Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases    Included in Analysis       363    98.1
                  Missing Cases              7      1.9
                  Total                      370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Categorical Variables Codings
                    Frequency    Parameter coding
                                 (1)      (2)
edcat    1          105          .000     .000
         2          148          1.000    .000
         3          110          .000     1.000

The table for Omnibus Tests of Model Coefficients provides an overall test for all parameters in the model. Thus, we can see that there is a likelihood ratio chi-square test of whether there is any effect of EDCAT: chi-square (2 df) = 9.337, p = .009. In spite of the model being significant, the Nagelkerke R-Square is very small (.043).

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      9.337         2     .009
         Block     9.337         2     .009
         Model     9.337         2     .009

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       309.598(a)           .025                    .043

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

The parameter estimate for EDCAT(1) shows that the log-odds of menopause for someone with EDCAT=2 are smaller than for someone with EDCAT=1, but this difference is not significant (p=0.105). The parameter estimate for EDCAT(2) is negative, indicating that someone with EDCAT=3 has a lower log-odds of menopause than a person with EDCAT=1, and this difference is significant (p=0.004). The overall test for EDCAT is a Wald chi-square, which is another test of the overall significance of EDCAT (chi-square (2 df) = 8.632, p=.013). The odds ratio estimate for EDCAT(2) (EDCAT 3 vs. 1) is .303, indicating that the odds of being in menopause for a person with EDCAT=3 are only about 30% of the odds of being in menopause for a person with EDCAT=1.

Variables in the Equation
                                                                           95% C.I. for EXP(B)
                       B         S.E.    Wald      df   Sig.   Exp(B)     Lower     Upper
Step 1(a)   edcat                        8.632     2    .013
            edcat(1)   -.674     .416    2.628     1    .105   .510       .225      1.151
            edcat(2)   -1.194    .415    8.299     1    .004   .303       .134      .683
            Constant   2.367     .349    46.107    1    .000   10.667

a. Variable(s) entered on step 1: edcat.

Logistic Regression Model with a continuous predictor

We now look at a logistic regression model, but this time with a single continuous predictor (AGE). The parameter estimate for AGE is positive (0.283), telling us that the log-odds of being in menopause increase by .28 units for a woman who is one year older compared to her counterpart who is one year younger. The odds ratio (1.33) tells us that the odds of being in menopause for a woman who is one year older are 1.33 times greater than for a woman who is one year younger.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER age
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases    Included in Analysis       360    97.3
                  Missing Cases              10     2.7
                  Total                      370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      124.146       1     .000
         Block     124.146       1     .000
         Model     124.146       1     .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       197.019(a)           .292                    .494

a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Variables in the Equation
                                                                             95% C.I. for EXP(B)
                       B          S.E.     Wald      df   Sig.   Exp(B)     Lower     Upper
Step 1(a)   age        .283       .040     49.765    1    .000   1.327      1.227     1.436
            Constant   -12.868    1.936    44.174    1    .000   .000

a. Variable(s) entered on step 1: age.
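A hand calculation (not part of the SPSS output) can make the age effect more concrete: for a 10-year age difference, the odds of being in menopause are multiplied by exp(10 × .283) = exp(2.83), or about 17, rather than by 1.33.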
Quasi-Complete Separation in a Logistic Regression Model

One fairly common occurrence in a logistic regression model is that the model fails to converge. This often happens when you have a categorical predictor that is "too perfect," that is, when there is a category with no variability in the response (all subjects in one category of the predictor have the same response). This is called quasi-complete separation. When this happens, SPSS will give a warning message in the output. These warnings should be taken seriously, and the model should be refitted, perhaps after combining some categories of the predictor. Even when there is not quasi-complete separation, separation may be nearly complete, so the standard error for a parameter estimate can become very large. It is good practice to examine the parameter estimates and their standard errors carefully in any logistic regression output.

We now examine a situation where quasi-complete separation occurs, using the variable AGECAT as a predictor in a logistic regression. First we check the crosstabulation between AGECAT and STOPMENS. Notice that in the highest age category, all 71 women are in menopause (not surprisingly).

CROSSTABS
 /TABLES=agecat BY stopmens
 /FORMAT=AVALUE TABLES
 /STATISTICS=CHISQ
 /CELLS=COUNT ROW
 /COUNT ROUND CELL.

agecat * stopmens Crosstabulation
                                      stopmens
                                      1         2         Total
agecat   1     Count                  50        49        99
               % within agecat        50.5%     49.5%     100.0%
         2     Count                  106       9         115
               % within agecat        92.2%     7.8%      100.0%
         3     Count                  74        1         75
               % within agecat        98.7%     1.3%      100.0%
         4     Count                  71        0         71
               % within agecat        100.0%    .0%       100.0%
Total          Count                  301       59        360
               % within agecat        83.6%     16.4%     100.0%

Chi-Square Tests
                                 Value       df    Asymp. Sig. (2-sided)
Pearson Chi-Square               111.660a    3     .000
Likelihood Ratio                 110.175     3     .000
Linear-by-Linear Association     78.698      1     .000
N of Valid Cases                 360

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.64.

We now fit the corresponding logistic regression model, using AGECAT as a categorical predictor, with AGECAT=1 as the reference category.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER agecat
 /CONTRAST (agecat)=Indicator(1)
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

We check the Categorical Variables Codings to see that the design variables are set up correctly to have AGECAT=1 as the reference category, and notice that the dummy variables for AGECAT will be labeled AGECAT(1), AGECAT(2), and AGECAT(3). These will correspond to AGECAT=2, 3, and 4, respectively.

Categorical Variables Codings
                     Frequency    Parameter coding
                                  (1)      (2)      (3)
agecat   1           99           .000     .000     .000
         2           115          1.000    .000     .000
         3           75           .000     1.000    .000
         4           71           .000     .000     1.000

The note in the output below alerts us to the problem of quasi-complete separation of data points. Also notice that the estimate for AGECAT(3) has a standard error of 4770, compared to about 1 or less for the other two dummy variables. We can also see that the estimate of the odds ratio for this dummy variable is essentially infinite (1.583E9, or about 1,583,000,000).

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       210.990(a)           .264                    .447

a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Variables in the Equation
                                                                                 95% C.I. for EXP(B)
                        B         S.E.        Wald      df   Sig.   Exp(B)      Lower     Upper
Step 1(a)   agecat                            50.075    3    .000
            agecat(1)   2.446     .401        37.172    1    .000   11.542      5.258     25.339
            agecat(2)   4.284     1.027       17.413    1    .000   72.520      9.696     542.384
            agecat(3)   21.183    4770.028    .000      1    .996   1.583E9     .000      .
            Constant    .020      .201        .010      1    .920   1.020

a. Variable(s) entered on step 1: agecat.
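The crosstabulation explains what went wrong (a hand calculation, not part of the SPSS output). The observed odds of menopause are 50/49 = 1.02 for AGECAT=1, 106/9 = 11.8 for AGECAT=2, and 74/1 = 74 for AGECAT=3, so the sample odds ratios relative to AGECAT=1 are about 11.5 and 72.5, matching the estimates for AGECAT(1) and AGECAT(2) above. For AGECAT=4 the observed odds are 71/0, which are infinite, so no finite estimate for AGECAT(3) exists, and the iterations simply keep pushing that coefficient upward.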
Based on the information that we saw in the crosstabulation, we will create a new variable, AGECAT3, with 3 age categories, collapsing categories 3 and 4.

RECODE agecat (MISSING=SYSMIS) (1=1) (2=2) (3 thru 4=3) INTO agecat3.
EXECUTE.

We now fit a new logistic regression, with AGECAT3 as a categorical predictor. Note that in the output for this model we do not have a problem with quasi-complete separation; however, we see a very wide confidence interval for AGECAT3(2), owing to the fact that there was only one participant in this group with a value of 0 on the dependent variable.

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER agecat3
 /CONTRAST (agecat3)=Indicator(1)
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       212.329(a)           .261                    .442

a. Estimation terminated at iteration number 8 because parameter estimates changed by less than .001.

Variables in the Equation
                                                                                95% C.I. for EXP(B)
                         B        S.E.     Wald      df   Sig.   Exp(B)        Lower      Upper
Step 1(a)   agecat3                        55.353    2    .000
            agecat3(1)   2.446    .401     37.172    1    .000   11.542        5.258      25.339
            agecat3(2)   4.957    1.023    23.458    1    .000   142.100       19.120     1056.078
            Constant     .020     .201     .010      1    .920   1.020

a. Variable(s) entered on step 1: agecat3.

Logistic Regression Model with Several Predictors

We now fit a logistic regression model with several predictors, both continuous and categorical. Note especially the global test for the model, which has 6 degrees of freedom, due to the 6 parameters that are estimated for the predictors in the model: two parameters for EDCAT and one each for AGE, SMOKER, TOTINCOM, and NUMPREG1. The only predictor that is significant in this model is AGE (p < 0.001).

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER age edcat smoker totincom numpreg1
 /CONTRAST (edcat)=Indicator(1)
 /PRINT=CI(95)
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Note that there are only 313 observations included in this model.

Case Processing Summary
Unweighted Cases(a)                          N      Percent
Selected Cases    Included in Analysis       313    84.6
                  Missing Cases              57     15.4
                  Total                      370    100.0
Unselected Cases                             0      .0
Total                                        370    100.0

a. If weight is in effect, see classification table for the total number of cases.

Categorical Variables Codings
                    Frequency    Parameter coding
                                 (1)      (2)
edcat    1          87           .000     .000
         2          128          1.000    .000
         3          98           .000     1.000

This table gives us an idea of the relative p-values for each of the potential predictors in the model. Note that we see EDCAT with 2 df, which is an overall test for EDCAT, and we also see the two dummy variables for EDCAT, EDCAT(1) and EDCAT(2).

Variables not in the Equation
                                  Score     df    Sig.
Step 0   Variables   age          65.729    1     .000
                     edcat        9.847     2     .007
                     edcat(1)     .001      1     .980
                     edcat(2)     6.815     1     .009
                     smoker       3.704     1     .054
                     totincom     8.501     1     .004
                     numpreg1     4.943     1     .026
         Overall Statistics       73.151    6     .000

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      110.366       6     .000
         Block     110.366       6     .000
         Model     110.366       6     .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       177.510(a)           .297                    .494

a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Again, in the output below, we see an overall test for EDCAT, with 2 df, showing that it is not significant in the model. We also see the output for the two EDCAT dummy variables.
Variables in the Equation
                                                                               95% C.I. for EXP(B)
                       B          S.E.     Wald      df   Sig.   Exp(B)       Lower     Upper
Step 1(a)   age        .280       .044     40.609    1    .000   1.323        1.214     1.442
            edcat                          2.378     2    .305
            edcat(1)   -.436      .552     .622      1    .430   .647         .219      1.910
            edcat(2)   -.840      .564     2.221     1    .136   .432         .143      1.303
            smoker     -.654      .384     2.909     1    .088   .520         .245      1.102
            totincom   -.093      .168     .303      1    .582   .911         .655      1.268
            numpreg1   .006       .131     .002      1    .961   1.006        .779      1.300
            Constant   -10.815    2.213    23.879    1    .000   .000

a. Variable(s) entered on step 1: age, edcat, smoker, totincom, numpreg1.
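If you wanted to go further with this final model, SPSS can save the predicted probability of menopause for each woman and print the Hosmer-Lemeshow goodness-of-fit test. The subcommands below are only a sketch of how this could be requested; we have not run them for this handout, and PRE_1 is simply the default name SPSS would typically assign to the saved predicted probabilities:

LOGISTIC REGRESSION VARIABLES menopause
 /METHOD=ENTER age edcat smoker totincom numpreg1
 /CONTRAST (edcat)=Indicator(1)
 /SAVE=PRED
 /PRINT=CI(95) GOODFIT
 /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).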