additional information for Chan article

Chan Biostatistics 203 - Survival analysis - additional detail
Robin Beaumont Monday, 22 April 2013
[email protected]
Contents
Introduction
SPSS analysis
Setting the Reference category
Obtaining confidence intervals
Adding an interaction term
Creating the Log minus log graph to assess the proportionality assumption
R commander analysis
Installing the R commander survival plugin
Loading the R Commander Survival plugin
Importing the SPSS file
First check the data!
The R Survival object
Converting a Factor to a numeric variable
Converting a numeric variable to a factor
Merging factor levels
Adding interaction terms
D:\web_sites_mine\HIcourseweb new\stats\basics\articles\survival_analysis1\Chan Biostatistics 203_additional_info.docx
page 1
Introduction
Chan's excellent article, Biostatistics 203: Survival analysis, available from www.sma.org.sg/smj/4506/4506bs1.pdf,
provides several analyses in SPSS. In this document I include additional details concerning the breast cancer survival
dataset in SPSS (page 254) described by him and also offer an equivalent R Commander analysis.
The data set
Variable   Description                            Range / Levels                                    Missing value code
id         Unique id for each subject             Range 1 to 1266                                   No missing values
age        Age in years                           Range 22 to 88                                    No missing values
pathsize   Pathology Tumor Size (cm)              Range 0.1 to 7.0                                  99 = missing value
lnpos      Positive Axillary Lymph Nodes          Range 0 to 35                                     No missing values
histgrad   Histologic Grade                       Range 1 to 3                                      4 = missing
er         Estrogen Receptor Status               0 = negative, 1 = positive                        2 = missing
pr         Progesterone Receptor Status           0 = negative, 1 = positive                        2 = missing
status     Status                                 0 = censored, 1 = died (event)                    9 = missing
pathscat   Pathological Tumor Size (Categories)   0 = 0 cm, 1 = < 2 cm, 2 = 2 - 5 cm, 3 = > 5 cm    99 = missing
ln_yesno   Lymph Nodes?                           0 = no, 1 = yes                                   No missing values
time       Time (months)                          Range 2.63 to 133.8                               No missing values

Note that the SPSS file has missing values defined for several variables. This information is used when
importing the SPSS (*.sav) file into R.
SPSS analysis
We set up the data as shown opposite:
• Time variable = time
• Status variable = status, where 1 = those who suffered the event.
• 6 covariates.
Setting the Reference category
Some of the parameter estimates from the above analysis differ from those reported in the Chan article. This is
because SPSS has selected a different baseline level (reference category) for certain variables from the one
Chan used.
To rectify the situation select the following dialog box:
Categorical Variable Codings (a,c,d,e,f)
                             Frequency   (1)   (2)
histgrad (b)   1=1                  56     0     0
               2=2                 352     1     0
               3=3                 252     0     1
er (b)         0=Negative          262     0
               1=Positive          398     1
pr (b)         0=Negative          299     0
               1=Positive          361     1
pathscat (b)   1=<= 2 cm           457     0     0
               2=2-5 cm            196     1     0
               3=> 5 cm              7     0     1
ln_yesno (b)   0=No                485     0
               1=Yes               175     1
a. Category variable: histgrad (Histologic Grade)
b. Indicator Parameter Coding
c. Category variable: er (Estrogen Receptor Status)
d. Category variable: pr (Progesterone Receptor Status)
e. Category variable: pathscat (Pathological Tumor Size (Categories))
f. Category variable: ln_yesno (Lymph Nodes?)
In the above dialog box, for each variable in turn:
1. Select reference category = First (most desirable).
2. Then click the Change button.
In the table opposite each reference category is
highlighted in yellow.
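For readers who want to see the same indicator coding outside SPSS, the following is a minimal R sketch. It assumes a small made-up factor (not the real data) and R's default treatment contrasts; the object names are illustrative only.

```r
# Toy factor with the three histologic grades (hypothetical values, not the real data)
histgrad <- factor(c(1, 2, 3, 2, 3, 1), levels = c(1, 2, 3))

# Default treatment (indicator) coding: the first level is the reference,
# so its row contains only zeros
contrasts(histgrad)

# relevel() chooses a different reference category, here grade 3
histgrad_ref3 <- relevel(histgrad, ref = "3")
contrasts(histgrad_ref3)
levels(histgrad_ref3)
```

The first call should show the same 0/1 pattern as the (1) and (2) columns for histgrad in the SPSS table, with the all-zero row marking the reference category.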
Obtaining confidence intervals
To obtain confidence intervals for each parameter estimate, set up the Cox regression dialog box as shown below.
Omnibus Tests of Model Coefficients (a)
-2 Log        Overall (score)            Change From Previous Step   Change From Previous Block
Likelihood    Chi-square   df   Sig.     Chi-square   df   Sig.      Chi-square   df   Sig.
436.371       31.355       8    .000     25.156       8    .001      25.156       8    .001
a. Beginning Block Number 1. Method = Enter

Variables in the Equation
              B       SE      Wald    df   Sig.   Exp(B)
ln_yesno      .724    .337    4.605   1    .032   2.063
pathscat                      6.005   2    .050
pathscat(1)   .638    .336    3.614   1    .057   1.893
pathscat(2)   1.484   .776    3.658   1    .056   4.412
pr            -.455   .422    1.159   1    .282   .635
er            -.022   .432    .003    1    .959   .978
histgrad                      .872    2    .647
histgrad(1)   .778    1.036   .564    1    .453   2.177
histgrad(2)   .942    1.056   .796    1    .372   2.564
age           -.021   .014    2.200   1    .138   .980
In the above table, setting the critical value (CV) at 0.05, I have highlighted the statistically significant p-values in yellow.
When you have a parameter estimate for each level of a variable you can ignore the overall parameter estimate for the
variable (greyed out in the above).
• Taking the critical value to equal 0.05, p-values below 0.05 are said to be statistically significant ->
the parameter estimate (i.e. the hazard ratio = Exp(B)) is taken to be equal to the value in the Exp(B) column
cell -> same as saying reject the null hypothesis.
• Taking the critical value to equal 0.05, p-values above 0.05 are said to be statistically insignificant ->
the parameter estimate (i.e. the hazard ratio = Exp(B)) is assumed to be 1 -> same as saying accept (fail to
reject) the null hypothesis.
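As a worked illustration of how the columns relate to one another, the sketch below recomputes the hazard ratio, Wald statistic, p-value and a 95% confidence interval from the B and SE reported for ln_yesno (.724 and .337). It assumes the usual Wald (normal approximation) interval; the variable names are illustrative.

```r
b  <- 0.724   # B for ln_yesno, as reported in the table
se <- 0.337   # SE for ln_yesno, as reported in the table

hazard_ratio <- exp(b)                                  # Exp(B)
wald         <- (b / se)^2                              # Wald chi-square on 1 df
p_value      <- pchisq(wald, df = 1, lower.tail = FALSE)
ci           <- exp(b + c(-1, 1) * qnorm(0.975) * se)   # 95% CI for Exp(B)

round(c(HR = hazard_ratio, Wald = wald, p = p_value,
        lower = ci[1], upper = ci[2]), 3)
```

Because B and SE are themselves rounded to three decimals, the recomputed Wald value differs slightly from the one SPSS prints. The interval lies entirely above 1, which agrees with the p-value being below 0.05.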
Adding an interaction term
Variables in the Equation
                                                                    95.0% CI for Exp(B)
                       B        SE      Wald    df   Sig.   Exp(B)   Lower    Upper
ln_yesno               .006     .505    .000    1    .990   1.006    .374     2.706
pathscat                                8.520   2    .014
pathscat(1)            -.179    .501    .128    1    .721   .836     .313     2.233
pathscat(2)            3.100    1.102   7.904   1    .005   22.189   2.557    192.566
pr                     -.516    .413    1.556   1    .212   .597     .266     1.342
er                     -.063    .424    .022    1    .881   .939     .409     2.156
histgrad                                1.165   2    .559
histgrad(1)            1.047    1.067   .962    1    .327   2.848    .352     23.068
histgrad(2)            1.161    1.081   1.153   1    .283   3.192    .384     26.563
age                    -.023    .014    2.845   1    .092   .977     .951     1.004
ln_yesno*pathscat                       8.564   2    .014
ln_yesno*pathscat(1)   1.670    .707    5.574   1    .018   5.312    1.328    21.248
ln_yesno*pathscat(2)   -1.847   1.547   1.425   1    .233   .158     .008     3.274
Setting the CV at 0.05, the statistically significant p-values are highlighted in yellow in the above table.
When you have a parameter estimate for each level of a variable you can ignore the overall parameter estimate for the
variable (greyed out in the above).
Creating the Log minus log graph to assess the proportionality assumption
To produce a Log minus log graph (to assess the proportionality assumption), set up the dialog boxes as shown below.
Lines do NOT cross -> proportional hazards assumption maintained, therefore results valid.
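The same kind of graph can be drawn in R. The sketch below is only an illustration on simulated data (not the breast cancer file): it assumes the survival package is installed, and the data frame toy and its columns are made up for the example. The hazards are proportional by construction, so the two curves should stay roughly parallel.

```r
library(survival)

set.seed(1)
n <- 200
group <- factor(rep(c("No", "Yes"), each = n / 2))

# Exponential event times: the hazard in the "Yes" group is twice that in "No"
event_time  <- rexp(n, rate = ifelse(group == "Yes", 0.02, 0.01))
censor_time <- runif(n, min = 20, max = 130)

toy <- data.frame(
  time   = pmin(event_time, censor_time),
  status = as.numeric(event_time <= censor_time),  # 1 = event, 0 = censored
  group  = group
)

km <- survfit(Surv(time, status) ~ group, data = toy)

# fun = "cloglog" plots log(-log(S(t))) against time on a log scale
plot(km, fun = "cloglog", lty = 1:2,
     xlab = "Time (log scale)", ylab = "log(-log(survival))")
legend("topleft", legend = levels(toy$group), lty = 1:2)

# A formal test of the proportional hazards assumption
ph_test <- cox.zph(coxph(Surv(time, status) ~ group, data = toy))
ph_test
```

A large p-value from cox.zph gives no evidence against proportional hazards, which is the numerical counterpart of the lines not crossing.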
R commander analysis
This first involves installing and loading the R Commander survival plugin called RcmdPlugin.survival.
Installing the R commander survival plugin
Select the menu option shown below, select the CRAN mirror nearest to you, and then select the RcmdPlugin.survival
package.
Once you have installed the R Commander survival plugin you can open the R Commander interface in the usual
way by typing library(Rcmdr) in the RGui window.
Loading the R Commander Survival plugin
When in R Commander you need to load the previously installed survival plugin (RcmdPlugin.survival). This is
achieved by selecting the R Commander menu option Tools -> Load Rcmdr plug-in(s), then selecting
RcmdPlugin.survival from the list and clicking on the OK button.
Importing the SPSS file
This is achieved in the usual way in R Commander: Data -> Import data -> from SPSS data set.
I have called the dataset mydata, and asked R Commander to convert value labels to factor levels.
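For those who prefer the script window, a rough equivalent of the import step is sketched below. It assumes the foreign package (shipped with standard R installations) and a hypothetical file name; the exact command R Commander generates may differ between versions.

```r
# Hypothetical file name: substitute the path to your own copy of the SPSS file
spss_path <- "breast_cancer_survival.sav"

if (file.exists(spss_path) && requireNamespace("foreign", quietly = TRUE)) {
  mydata <- foreign::read.spss(
    spss_path,
    use.value.labels = TRUE,  # convert value labels to factor levels
    to.data.frame    = TRUE   # return a data frame rather than a list
  )
  names(mydata) <- tolower(names(mydata))  # lower-case variable names
  str(mydata)
} else {
  message("File not found: set spss_path to the location of the .sav file")
}
```

With to.data.frame = TRUE, read.spss by default also turns the user-defined SPSS missing value codes into NA.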
First check the data!
This is easily achieved by summarizing each of the variables and, if nothing else, quickly shows whether you have the
correct variables along with the range of each. It also shows the type of each variable: the min (minimum), mean and
max (maximum) values are shown for each numeric variable and, in contrast, the number in each level for what R
believes to be a factor variable.
Things to note here:
• histgrad is classed as a numeric variable.
• status is classed as a factor.
• pathscat has four levels, of which the first level, 0 cm, has zero count; it also has 86 missing values.
• Other variables also have missing values, which will effectively reduce the working sample size.
The missing values specified in the SPSS file (shown below) are automatically converted to NA values in R:
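The checks described in this section can be sketched on a tiny made-up data frame. The values below are invented purely to mimic the three situations noted (a numeric grade, a factor status, an empty factor level plus NA values); they are not taken from the real file.

```r
toy <- data.frame(
  histgrad = c(1, 2, 3, 2, NA, 3),
  status   = factor(c("Censored", "Died", "Censored", "Died", "Died", "Censored")),
  pathscat = factor(c("< 2cm", "2 - 5 cm", NA, "> 5 cm", "< 2cm", "2 - 5 cm"),
                    levels = c("0 cm", "< 2cm", "2 - 5 cm", "> 5 cm"))
)

summary(toy)                           # min/mean/max for numerics, counts for factors
sapply(toy, class)                     # how R has classed each variable
table(toy$pathscat, useNA = "ifany")   # shows the empty "0 cm" level and the NA count

# Rows with no missing values: the working sample size for a model using all columns
n_complete <- sum(complete.cases(toy))
n_complete
```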
The R Survival object
In R, and in R Commander, to carry out a survival analysis it is necessary to have at least two variables, and these
two special variables must be of a particular type:
• status variable - indicates whether the case is censored or experiences the event (in this situation died). Must
be a numeric variable, not a factor, with the values 1 = event occurred; 0 = censored.
• time variable - gives how long each subject was in the study, regardless of whether they were censored or not.
This variable must also be numeric.
The above summary indicates that R thinks status is a factor with two levels, which is not what we require. We
therefore need to convert status, or create a new status variable which is numeric, before we can carry on with the
analysis.
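A minimal sketch of such a survival object, assuming the survival package and a few made-up observations:

```r
library(survival)

time   <- c(5.2, 10.0, 12.7, 30.1)   # follow-up time (months), invented values
status <- c(1, 0, 1, 0)              # numeric: 1 = event occurred, 0 = censored

surv_obj <- Surv(time, status)
surv_obj            # censored times are printed with a trailing "+"
is.Surv(surv_obj)
```

Surv() pairs each time with its status; this pairing is what the model-fitting functions use as the response.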
Converting a Factor to a numeric variable
At present R Commander believes the status variable to be a factor, so we will create a new variable called status2
from the status variable, specifying the required censored/completed values mentioned above.
Notice how the recode values are expressed left to right, that is, the old value is on the left and the new value you
want it to be is on the right. This is the opposite way round to how you normally write R expressions.
"Censored" = 0
"Died" = 1
The result is shown opposite.
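A base R equivalent of this recode is sketched below on a made-up factor. It assumes the factor levels are labelled exactly "Censored" and "Died", as in the recode directives above; it is not the command R Commander itself generates.

```r
# Toy factor standing in for mydata$status (invented values)
status <- factor(c("Censored", "Died", "Died", "Censored", NA))

# 1 = event (died), 0 = censored; missing values stay missing
status2 <- ifelse(status == "Died", 1, 0)

status2
is.numeric(status2)
table(status, status2, useNA = "ifany")   # cross-check old against new coding
```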
Alternatively, we could have imported the SPSS file with the 'convert value labels to factor levels' option
un-selected, so that status would have been imported as a numeric variable.
Now to get a feel for the survival aspect of the data we will create a KM plot of the survival times for those with /
without node involvement:
[Figure: Kaplan-Meier survival curves by ln_yesno (No / Yes); vertical axis 0.0 to 1.0, horizontal axis 0 to 120]
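A KM plot of this kind can also be produced from the script window. The sketch below uses simulated data with the same variable names (time, status2, ln_yesno) so that it runs on its own; the survival pattern is invented and will not match the real curves.

```r
library(survival)

set.seed(42)
n <- 100
ln_yesno <- factor(rep(c("No", "Yes"), each = n / 2))

# Invented event and censoring times (months)
event_time  <- rexp(n, rate = ifelse(ln_yesno == "Yes", 0.02, 0.008))
censor_time <- runif(n, min = 20, max = 134)

toy <- data.frame(
  time     = pmin(event_time, censor_time),
  status2  = as.numeric(event_time <= censor_time),  # 1 = died, 0 = censored
  ln_yesno = ln_yesno
)

# One Kaplan-Meier curve per level of ln_yesno
km <- survfit(Surv(time, status2) ~ ln_yesno, data = toy)

plot(km, lty = 1:2, xlab = "Time (months)", ylab = "Survival probability")
legend("bottomleft", legend = levels(toy$ln_yesno), lty = 1:2, title = "ln_yesno")
```

With the real data, replace toy by mydata once status2 has been created.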
The next stage is to carry out a Cox regression analysis to see if the various covariates influence the survival rate.
This is achieved in R Commander using the menu option Fit Models-> Cox regression model
Notice two differences:
• The histgrad variable only has one parameter estimate - R has failed to notice that it is a factor with three levels;
instead it thinks it is a continuous variable.
• The factor pathscat seems to have 4 levels (the table always misses the base level out) and R fails to compute the
parameter estimates for one category - looking back at the summary statistics at the start will give a hint why this is.
Let's correct both these things.
Converting a numeric variable to a factor
Select the following:
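In script form the conversion amounts to a single as.factor() call. The sketch below uses simulated data to show the effect on a Cox model: one parameter estimate when histgrad is numeric, two when it is a factor. The name histgradfactor matches the one used in the model later in this document; everything else is invented for illustration.

```r
library(survival)

set.seed(7)
n <- 300
histgrad    <- sample(1:3, n, replace = TRUE)
event_time  <- rexp(n, rate = 0.01 * histgrad)
censor_time <- runif(n, min = 20, max = 134)

toy <- data.frame(
  time     = pmin(event_time, censor_time),
  status2  = as.numeric(event_time <= censor_time),
  histgrad = histgrad
)

# Convert the numeric grade to a factor with three levels
toy$histgradfactor <- as.factor(toy$histgrad)

fit_numeric <- coxph(Surv(time, status2) ~ histgrad, data = toy)
fit_factor  <- coxph(Surv(time, status2) ~ histgradfactor, data = toy)

names(coef(fit_numeric))  # a single slope for the numeric version
names(coef(fit_factor))   # one estimate per non-reference level
```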
Merging factor levels
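You can't do this by using the R Commander menu options, so type the following into the script window:
# list the variable names in the dataset
names(mydata)
# show the current levels of pathscat
levels(mydata$pathscat)
# give levels 1 and 2 the same label, which merges them into one level
levels(mydata$pathscat)[c(1,2)] <- "<2 cm"
# confirm that only three levels remain
levels(mydata$pathscat)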
Re-running the model from the previous stage now gives identical results to the SPSS output reported in the Chan
article (next page).
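To see what the merge does to the counts, here is the same operation on a small made-up factor. The level labels are assumptions modelled on the data set description, not copied from the imported file.

```r
# Toy factor with an empty first level, as in the imported pathscat (invented values)
pathscat <- factor(c("< 2cm", "2 - 5 cm", "> 5 cm", "< 2cm", "2 - 5 cm"),
                   levels = c("0 cm", "< 2cm", "2 - 5 cm", "> 5 cm"))
table(pathscat)                       # four levels, the first with zero count

# Assigning one label to levels 1 and 2 collapses them into a single level
levels(pathscat)[c(1, 2)] <- "<2 cm"
table(pathscat)                       # three levels remain
```

Removing the empty level is what allows the model to produce an estimate for every remaining category.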
Adding interaction terms
This just involves modifying the equation expression:
breastcancer_pathnode <- coxph(Surv(time,status2) ~ age + pr + er + histgradfactor + pathscat * ln_yesno,
method="efron", data=mydata)
This is identical to the SPSS output as reported by Chan.
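The effect of the * operator in the model formula can be checked without fitting anything. The short sketch below uses only base R's terms() on the right-hand side of the formula shown above; f and term_labels are illustrative names.

```r
# Right-hand side of the model formula used above
f <- ~ age + pr + er + histgradfactor + pathscat * ln_yesno

# terms() expands a * b into both main effects plus the a:b interaction
term_labels <- attr(terms(f), "term.labels")
term_labels
```

So pathscat * ln_yesno is shorthand for pathscat + ln_yesno + pathscat:ln_yesno, which is why the output contains both the main effects and the interaction rows.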