Chan Biostatistics 203 - Survival analysis - additional detail
Robin Beaumont
Monday, 22 April 2013
[email protected]

Contents
Introduction
SPSS analysis
    Setting the Reference category
    Obtaining confidence intervals
    Adding an interaction term
    Creating the Log minus log graph to assess the proportionality assumption
R commander analysis
    Installing the R commander survival plugin
    Loading the R Commander Survival plugin
    Importing the SPSS file
    First check the data!
    The R Survival object
    Converting a Factor to a numeric variable
    Converting a numeric variable to a factor
    Merging factor levels
    Adding interaction terms

D:\web_sites_mine\HIcourseweb new\stats\basics\articles\survival_analysis1\Chan Biostatistics 203_additional_info.docx page 1

Introduction
Chan's excellent article Biostatistics 203: Survival analysis, available from www.sma.org.sg/smj/4506/4506bs1.pdf, provides several analyses in SPSS. In this document I include additional details concerning the breast cancer survival dataset in SPSS (page 254) described by him and also offer an equivalent R commander analysis.

The data set

Variable   Description                            Range / Levels                              Missing value code
id         Unique id for each subject             Range 1 to 1266                             No missing values
age        Age in years                           Range 22 to 88                              No missing values
pathsize   Pathology Tumor Size (cm)              Range 0.1 to 7.0                            99 = missing value
lnpos      Positive Axillary Lymph Nodes          Range 0 to 35                               No missing values
histgrad   Histologic Grade                       Range 1 to 3                                4 = missing
er         Estrogen Receptor Status               0 = negative; 1 = positive                  2 = missing
pr         Progesterone Receptor Status           0 = negative; 1 = positive                  2 = missing
status     Status                                 0 = censored; 1 = died (event)              9 = missing
pathscat   Pathological Tumor Size (Categories)   0 = 0 cm; 1 = < 2cm; 2 = 2 - 5 cm; 3 = > 5 cm   99 = missing
ln_yesno   Lymph Nodes?                           0 = no; 1 = yes                             No missing values
time       Time (months)                          Range 2.63 to 133.8                         No missing values

Note that the SPSS file has missing values defined for several variables. This information is used when importing the SPSS (*.sav) file into R.

SPSS analysis
We set up the data as shown opposite:
• Time variable = time
• Status variable = status, where 1 = those who suffered the event.
• 6 covariates.

Setting the Reference category
From the above analysis some of the parameter estimates are different from those reported in the Chan article. This is because, for certain variables, SPSS has selected a different level (reference category) as the baseline from the one Chan used. To rectify the situation select the following dialog box:

Categorical Variable Codings (a,c,d,e,f)

Variable       Level          Frequency   (1)   (2)
histgrad (b)   1=1            56          0     0
               2=2            352         1     0
               3=3            252         0     1
er (b)         0=Negative     262         0
               1=Positive     398         1
pr (b)         0=Negative     299         0
               1=Positive     361         1
pathscat (b)   1=<= 2 cm      457         0     0
               2=2-5 cm       196         1     0
               3=> 5 cm       7           0     1
ln_yesno (b)   0=No           485         0
               1=Yes          175         1

a. Category variable: histgrad (Histologic Grade)
b. Indicator Parameter Coding
c. Category variable: er (Estrogen Receptor Status)
d. Category variable: pr (Progesterone Receptor Status)
e. Category variable: pathscat (Pathological Tumor Size (Categories))
f. Category variable: ln_yesno (Lymph Nodes?)

In the above dialog box, for each variable in turn:
1. select reference category = first (most desirable)
2. then click the Change button
In the table opposite each reference category is highlighted in yellow.

Obtaining confidence intervals
To obtain confidence intervals for each parameter estimate set up the Cox regression dialog box as shown below.

Omnibus Tests of Model Coefficients (a)

-2 Log Likelihood: 436.371

                             Chi-square   df   Sig.
Overall (score)              31.355       8    .000
Change From Previous Step    25.156       8    .001
Change From Previous Block   25.156       8    .001

a. Beginning Block Number 1. Method = Enter

Variables in the Equation

              B       SE      Wald    df   Sig.   Exp(B)
ln_yesno      .724    .337    4.605   1    .032   2.063
pathscat                      6.005   2    .050
pathscat(1)   .638    .336    3.614   1    .057   1.893
pathscat(2)   1.484   .776    3.658   1    .056   4.412
pr            -.455   .422    1.159   1    .282   .635
er            -.022   .432    .003    1    .959   .978
histgrad                      .872    2    .647
histgrad(1)   .778    1.036   .564    1    .453   2.177
histgrad(2)   .942    1.056   .796    1    .372   2.564
age           -.021   .014    2.200   1    .138   .980

In the above table, setting the Critical Value (CV) as 0.05, I have highlighted in yellow the statistically significant p-values. When you have a parameter estimate for each level of a variable you can ignore the overall parameter estimate for the variable (greyed out in the above).
• Taking the critical value to equal 0.05, p-values below 0.05 are said to be statistically significant -> the parameter estimate (i.e. the hazard ratio = Exp(B)) is taken to be equal to the value in the Exp(B) column -> same as saying reject the null hypothesis.
• Taking the critical value to equal 0.05, p-values above 0.05 are said to be statistically insignificant -> the parameter estimate (i.e. the hazard ratio = Exp(B)) is assumed to be 1 -> same as saying accept (strictly, fail to reject) the null hypothesis.
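As a rough check on how Exp(B), its 95% confidence interval and the Wald p-value follow from B and SE, the arithmetic can be sketched in R. The two input values are the ln_yesno row of the Variables in the Equation table; the variable names are illustrative, and the interval assumes the usual normal (Wald) approximation rather than anything specific to SPSS.

```r
# Values for ln_yesno taken from the Variables in the Equation table
b  <- 0.724   # parameter estimate (B)
se <- 0.337   # standard error (SE)

hazard_ratio <- exp(b)                       # corresponds to the Exp(B) column
z            <- qnorm(0.975)                 # approx. 1.96 for a 95% interval
ci           <- exp(b + c(-1, 1) * z * se)   # lower and upper limits for Exp(B)
wald         <- (b / se)^2                   # Wald chi-square on 1 df
p_value      <- pchisq(wald, df = 1, lower.tail = FALSE)

round(c(HR = hazard_ratio, lower = ci[1], upper = ci[2],
        Wald = wald, p = p_value), 3)
```

The hazard ratio comes out at about 2.06, matching Exp(B). The Wald statistic can differ from the SPSS value in the later decimal places because B and SE are entered here already rounded. An interval for Exp(B) that excludes 1 goes with a p-value below 0.05.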

Adding an interaction term

Variables in the Equation

                                                                    95.0% CI for Exp(B)
                       B        SE      Wald    df   Sig.   Exp(B)   Lower    Upper
ln_yesno               .006     .505    .000    1    .990   1.006    .374     2.706
pathscat                                8.520   2    .014
pathscat(1)            -.179    .501    .128    1    .721   .836     .313     2.233
pathscat(2)            3.100    1.102   7.904   1    .005   22.189   2.557    192.566
pr                     -.516    .413    1.556   1    .212   .597     .266     1.342
er                     -.063    .424    .022    1    .881   .939     .409     2.156
histgrad                                1.165   2    .559
histgrad(1)            1.047    1.067   .962    1    .327   2.848    .352     23.068
histgrad(2)            1.161    1.081   1.153   1    .283   3.192    .384     26.563
age                    -.023    .014    2.845   1    .092   .977     .951     1.004
ln_yesno*pathscat                       8.564   2    .014
ln_yesno*pathscat(1)   1.670    .707    5.574   1    .018   5.312    1.328    21.248
ln_yesno*pathscat(2)   -1.847   1.547   1.425   1    .233   .158     .008     3.274

Setting the CV as 0.05, the statistically significant p-values are highlighted in yellow in the above table. When you have a parameter estimate for each level of a variable you can ignore the overall parameter estimate for the variable (greyed out in the above).

Creating the Log minus log graph to assess the proportionality assumption
To produce a Log minus log graph (to assess the proportionality assumption) set up the dialog boxes as shown below.

The lines do NOT cross, so the proportional hazards assumption is maintained and the results are therefore valid.

R commander analysis
This first involves installing and loading the R Commander survival plugin called RcmdPlugin.survival.

Installing the R commander survival plugin
Select the menu option shown below.
Select the nearest CRAN mirror to you and then select the RcmdPlugin.survival package.
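For readers who prefer typing commands, the menu-driven installation can be approximated from the R console. This is only a sketch: the variable names are illustrative, and the download is attempted only in an interactive session, where R can ask for a CRAN mirror.

```r
# Packages needed for the R Commander survival analysis
needed     <- c("Rcmdr", "RcmdPlugin.survival")
to_install <- setdiff(needed, rownames(installed.packages()))

# Only attempt the download interactively, where a CRAN mirror can be chosen
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```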
Once you have installed the R commander survival plugin you can open R Commander in the usual way by typing library(Rcmdr) in the RGui window.

Loading the R Commander Survival plugin
When in R Commander you need to load the previously installed survival plugin (RcmdPlugin.survival). This is achieved by selecting the R Commander menu option Tools -> Load Rcmdr plug-ins, then selecting RcmdPlugin.survival from the list and clicking the OK button.

Importing the SPSS file
This is achieved in the usual way in R Commander: Data -> Import data -> from SPSS data set. I have called the dataset mydata, and asked R Commander to convert value labels to factor levels.

First check the data!
This is easily achieved by summarizing each of the variables and, if nothing else, it quickly shows whether you have the correct variables along with the range of each. It also shows the type of each variable: the min (minimum), mean and max (maximum) values are shown for each numeric variable and, in contrast, the number in each level for what R believes to be a factor variable.
Things to note here:
• histgrad is classed as a numeric variable
• status is classed as a factor
• pathscat has four levels, of which the first level, 0 cm, has zero count, and it also has 86 missing values.
Other variables also have missing values, which will effectively reduce the working sample size.
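The same checks can be sketched at the R console. The commented-out import uses read.spss() from the foreign package with a hypothetical file name; the toy data frame below it uses made-up values and only illustrates how R reports numeric variables, factors, empty levels and missing values.

```r
# Rough equivalent of the menu import; "breast_cancer_survival.sav" is a hypothetical file name
# library(foreign)
# mydata <- read.spss("breast_cancer_survival.sav",
#                     use.value.labels = TRUE, to.data.frame = TRUE)

# Toy stand-in with made-up values, mimicking three of the variables
toy <- data.frame(
  histgrad = c(1, 2, 3, 2, NA),
  status   = factor(c("Censored", "Died", "Died", "Censored", "Censored")),
  pathscat = factor(c("< 2cm", "2 - 5 cm", NA, "> 5 cm", "< 2cm"),
                    levels = c("0 cm", "< 2cm", "2 - 5 cm", "> 5 cm"))
)

summary(toy)                           # numeric: min/mean/max; factor: count per level
sapply(toy, class)                     # the type R has assigned to each variable
table(toy$pathscat, useNA = "ifany")   # an unused level shows a count of zero
colSums(is.na(toy))                    # number of missing values per variable
```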
The missing values specified in the SPSS file (shown below) are automatically converted to NA values in R:

The R Survival object
In R, and R commander, to carry out a survival analysis it is necessary to have at least two variables, and these two special variables must be of a particular type:
• status variable - indicating whether the case is censored or experienced the event (in this situation died). Must be a numeric variable, not a factor, with the values 1 = event occurred; 0 = censored.
• time variable - this gives how long each subject was in the study, regardless of whether they were censored or not. This variable must also be numeric.
Looking at the above summary indicates that R thinks status is a factor with two levels, which is not what we require. Therefore we need to convert it, or create a new status variable which is numeric, before we can carry on with the analysis.

Converting a Factor to a numeric variable
At present R commander believes the status variable to be a factor, so we will create a new variable called status2 from the status variable, specifying the required censored/event values mentioned above. Notice how the recode values are expressed left to right, that is, the old value is on the left and the new value you want it to become is on the right. This is the opposite way round to how you normally write R expressions.
"Censored" = 0
"Died" = 1
The result is shown opposite.

Alternatively we could have imported the SPSS file and un-selected the 'convert variable names to lower case' option.
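The recode described above can also be sketched in base R. This is not the command R Commander generates; it is a minimal illustration on made-up values, with the Surv() call from the survival package showing how the numeric status and time variables are combined.

```r
library(survival)   # provides Surv(); shipped with standard R installations

# Toy data with made-up values; in the tutorial these are mydata$status and mydata$time
status <- factor(c("Censored", "Died", "Died", "Censored"))
time   <- c(120.5, 14.2, 60.0, 133.8)

# "Died" becomes 1 (event), everything else ("Censored") becomes 0
status2 <- ifelse(status == "Died", 1, 0)

is.numeric(status2)      # the new variable is numeric, not a factor
table(status, status2)   # cross-check the recode against the original labels

Surv(time, status2)      # censored times are printed with a trailing "+"
```

Keeping the original status variable and writing the result to status2 means the recode can always be checked against the labels it came from.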
Now to get a feel for the survival aspect of the data we will create a KM plot of the survival times for those with / without node involvement:

[Figure: KM plot of survival by ln_yesno (No / Yes)]

The next stage is to carry out a Cox regression analysis to see if the various covariates influence the survival rate. This is achieved in R Commander using the menu option Fit Models -> Cox regression model.

Notice two differences:
• histgrad only has one parameter estimate - R has failed to notice it is a factor with three levels; instead it thinks it is a continuous variable.
• pathscat, the factor, seems to have 4 levels (the table always omits the base level) and R fails to compute the parameter estimates for one category - looking back at the summary statistics at the start will give a hint why this is.
Let's correct both these things.

Converting a numeric variable to a factor
Select the following:

Merging factor levels
You can't do this by using the R commander menu options. Instead type the following commands:
    names(mydata)
    levels(mydata$pathscat)
    levels(mydata$pathscat)[c(1,2)] <- "<2 cm"
    levels(mydata$pathscat)
Re-running the model from the previous stage now gives identical results to the SPSS output reported in the Chan article (next page).
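Both corrections can be tried out on a small example before touching the real data. The sketch below uses a toy data frame with made-up values; histgradfactor is the name used for the converted variable in the model formula in the next section, and the level labels follow the data set description.

```r
# Toy data with made-up values standing in for mydata
toy <- data.frame(
  histgrad = c(1, 2, 3, 2, 1),
  pathscat = factor(c("< 2cm", "2 - 5 cm", "> 5 cm", "< 2cm", "2 - 5 cm"),
                    levels = c("0 cm", "< 2cm", "2 - 5 cm", "> 5 cm"))
)

# Numeric -> factor, so a model estimates one parameter per non-baseline level
toy$histgradfactor <- factor(toy$histgrad)
levels(toy$histgradfactor)

# Merge the empty "0 cm" level into the next one by giving both the same name
levels(toy$pathscat)[c(1, 2)] <- "<2 cm"
levels(toy$pathscat)
table(toy$pathscat)
```

Assigning the same label to two levels is what makes R merge them, so the factor ends up with three levels instead of four.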

Adding interaction terms
This just involves modifying the equation expression:
    breastcancer_pathnode <- coxph(Surv(time, status2) ~ age + pr + er + histgradfactor + pathscat * ln_yesno, method="efron", data=mydata)
This is identical to the SPSS output as reported by Chan.
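To see what the * operator does in a model formula, the sketch below fits the same kind of model on the lung data set that ships with the survival package. This is an assumption made purely so the example runs without the SPSS file; it is not the breast cancer data, and the covariates are chosen only for illustration.

```r
library(survival)

# lung: example data shipped with the survival package (not the breast cancer data)
fit_star <- coxph(Surv(time, status) ~ age + sex * ph.ecog, data = lung)
fit_long <- coxph(Surv(time, status) ~ age + sex + ph.ecog + sex:ph.ecog, data = lung)

names(coef(fit_star))                       # main effects plus the sex:ph.ecog term
all.equal(coef(fit_star), coef(fit_long))   # the two formulas describe the same model

# Hazard ratios with 95% confidence limits, comparable to Exp(B) and its CI in SPSS
summary(fit_star)$conf.int
```

Writing a * b is shorthand for a + b + a:b, which is why pathscat * ln_yesno in the model above produces both main effects and the interaction rows.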