Henderson: Dagstat 2016
Exercise Sessions
Session 1: Introduction
We assume you are familiar with R. If not, this session would be well spent working through
an online tutorial: search for R and follow the link to the manuals to get started.
The practical sessions will use data that are available at
http://www.mas.ncl.ac.uk/~nrh3/dagstat
There are also some commands in a file someR. If you paste or source all of this into R it will
read in the data sets and make available some simple routines.
Make sure the library survival is loaded. If you are not familiar with R, use help(Surv) for
information on survival objects.
1. The first exercise is to work through the calculations for Nelson-Aalen without using (to
begin with) any built-in function.
The data below show the waiting times in months for patients scheduled to have hip
replacements at a certain hospital. As usual t is the time to event and δ is the censoring
indicator, with δ = 1 if the surgery took place and δ = 0 if the patient was still waiting.
t    1    3    4    5    6    8   10   13
δ    1    0    1    1    0    1    1    1
(a) Create variables time and cens (or any other names you prefer) for these data.
(b) Create a variable dN of the same length containing the counting process increments.
Use dN to form the counting process N. (Note: you don’t need to type in any more
data to form either dN or N).
(c) Create a variable Y containing the number of people at risk at each time.
(d) Create alphahat, the estimated hazards.
(e) Form the Nelson-Aalen estimator Ahat.
(f) Form the variance estimate VarAhat. (You might create an intermediate variable
first, or you can do it directly using the cumsum command).
(g) Use cbind to look at all your variables side by side.
(h) Plot Ahat against time. You might want to add a point (0,0) at the start (since
Â(0) = 0).
(i) If you wish, add approximate 95% pointwise confidence limits for Ahat.
(j) Calculate and plot the Kaplan-Meier survival estimator. You can get this using
KMS=cumprod(c(1,(1-dN/Y)))
plot(c(0,time),KMS,type="s")
(k) Superimpose a survival estimator based on the Nelson-Aalen cumulative hazard.
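One possible route through parts (a) to (k) is sketched below, as something to check your own attempt against. It assumes the data are entered in increasing time order with no ties (true here), so the risk set shrinks by one at each time. The names lower, upper and SNA are illustrative choices, not part of the exercise.

```r
# Hip replacement waiting times (months) and censoring indicators
time <- c(1, 3, 4, 5, 6, 8, 10, 13)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1)

# Counting process: increment 1 at an event, 0 at a censoring
dN <- cens
N  <- cumsum(dN)

# Number at risk just before each time (assumes sorted times, no ties)
Y <- rev(seq_along(time))

alphahat <- dN / Y              # estimated hazard increments
Ahat     <- cumsum(alphahat)    # Nelson-Aalen estimator
VarAhat  <- cumsum(dN / Y^2)    # variance estimate

print(cbind(time, cens, dN, N, Y, alphahat, Ahat, VarAhat))

# Approximate 95% pointwise limits for Ahat
lower <- Ahat - 1.96 * sqrt(VarAhat)
upper <- Ahat + 1.96 * sqrt(VarAhat)

plot(c(0, time), c(0, Ahat), type = "s", ylim = range(0, upper),
     xlab = "Months", ylab = "Cumulative hazard")
lines(c(0, time), c(0, lower), type = "s", lty = 2)
lines(c(0, time), c(0, upper), type = "s", lty = 2)

# Kaplan-Meier, with a survival estimate based on Nelson-Aalen superimposed
KMS <- cumprod(c(1, 1 - dN / Y))
SNA <- exp(-c(0, Ahat))
plot(c(0, time), KMS, type = "s", ylim = c(0, 1),
     xlab = "Months", ylab = "Survival")
lines(c(0, time), SNA, type = "s", lty = 2)
```

With only eight observations the two survival estimates differ visibly, most of all at the last time, where the risk set contains a single patient.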
2. Next we look at a more realistic sample size and make use of the functions in the survival
library. The Nelson-Aalen estimator can be created using
tem1=basehaz(coxph(Surv(time,cens)~1,data=leukaemia))
Ahat1=tem1$hazard
time1=tem1$time
The basehaz function returns the Breslow cumulative hazard estimator for a Cox model
(Session 2), which reduces to Nelson-Aalen when there are no covariates.
An alternative is:
tem2=survfit(Surv(time,cens)~1,data=leukaemia)
Ahat2=cumsum(tem2$n.event/tem2$n.risk)
time2=tem2$time
This has the advantage that you can calculate the confidence intervals, by first finding:
VarAhat2=cumsum(tem2$n.event/(tem2$n.risk)^2)
For the leukaemia data:
(a) Plot the Nelson-Aalen estimate of cumulative hazard Â(t).
(b) Plot the Kaplan-Meier survival estimate Ŝ(t), obtained using the survfit function.
(c) On the same plot, show the Kaplan-Meier Ŝ(t) and an alternative survival estimate
based on Â(t).
(d) If you wish, add confidence intervals obtained for the Kaplan-Meier directly and via
Â(t).
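The survfit route for parts (a) to (d) might look like the sketch below. It runs on simulated data standing in for the leukaemia set (an assumption, so that the code runs without the course files); the data frame sim and the names lowerA, upperA and S_na are illustrative.

```r
library(survival)

# Simulated stand-in for the leukaemia data (assumption, not the real data)
set.seed(1)
n <- 200
t_event <- rexp(n, rate = 0.5)
t_cens  <- rexp(n, rate = 0.2)
sim <- data.frame(time = pmin(t_event, t_cens),
                  cens = as.numeric(t_event <= t_cens))

fit     <- survfit(Surv(time, cens) ~ 1, data = sim)
Ahat    <- cumsum(fit$n.event / fit$n.risk)      # Nelson-Aalen
VarAhat <- cumsum(fit$n.event / fit$n.risk^2)    # its variance estimate

# Pointwise 95% limits for A(t), then mapped to the survival scale
lowerA <- Ahat - 1.96 * sqrt(VarAhat)
upperA <- Ahat + 1.96 * sqrt(VarAhat)
S_na   <- exp(-Ahat)

plot(fit, conf.int = TRUE, xlab = "Time", ylab = "Survival")  # Kaplan-Meier
lines(fit$time, S_na, type = "s", col = 2)
lines(fit$time, exp(-lowerA), type = "s", col = 2, lty = 2)
lines(fit$time, exp(-upperA), type = "s", col = 2, lty = 2)
```

To use the real data, replace sim with leukaemia once someR has been sourced.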
3. And now we look at more general event history data.
For the ships data:
(a) Plot the Nelson-Aalen estimate of cumulative hazard, using both methods above,
adapted to the (time1, time2, cens) form. For the second method you should
delete the first elements of the event, risk and time vectors, as R sets up an empty
risk set at time zero.
tem2=survfit(Surv(time1,time2,cens)~1,data=ships)
Ahat2=cumsum(tem2$n.event[-1]/tem2$n.risk[-1])
time2=tem2$time[-1]
Do the methods give (almost) the same results?
(b) In one plot, show the estimated cumulative hazards for each type of ship. Is there
any evidence that one type of ship is sold more often than the other types?
There are several ways to do this. One way is simply to create subsets of the data
made up of each ship type. For example
bulker=ships[ships$type==1,]
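The subset approach can be wrapped in a small helper, as in the sketch below. It uses simulated counting-process data in place of the ships data (an assumption), and the helper name na_by_type is hypothetical. Rather than always dropping the first element, it drops any time point whose risk set is empty, which should give the same result whether or not the installed version of survival adds an empty risk set at time zero.

```r
library(survival)

# Simulated stand-in for the ships data (assumption): ownership spells that
# start at age time1 and end at age time2
set.seed(2)
n   <- 300
sim <- data.frame(time1 = runif(n, 0, 100),
                  type  = sample(1:3, n, replace = TRUE))
dur <- rexp(n, rate = 0.02 * sim$type)   # hypothetical sale rates by type
cen <- runif(n, 20, 120)
sim$time2 <- sim$time1 + pmin(dur, cen)
sim$cens  <- as.numeric(dur <= cen)

# Nelson-Aalen estimate for one ship type
na_by_type <- function(k) {
  sub  <- sim[sim$type == k, ]
  fit  <- survfit(Surv(time1, time2, cens) ~ 1, data = sub)
  keep <- fit$n.risk > 0                 # drop any empty risk set
  data.frame(time = fit$time[keep],
             Ahat = cumsum(fit$n.event[keep] / fit$n.risk[keep]))
}
curves <- lapply(1:3, na_by_type)

ymax <- max(sapply(curves, function(d) max(d$Ahat)))
plot(curves[[1]]$time, curves[[1]]$Ahat, type = "s", col = 1,
     xlim = c(0, max(sim$time2)), ylim = c(0, ymax),
     xlab = "Age of ship (months)", ylab = "Cumulative hazard")
lines(curves[[2]]$time, curves[[2]]$Ahat, type = "s", col = 2)
lines(curves[[3]]$time, curves[[3]]$Ahat, type = "s", col = 3)
legend("topleft", legend = paste("type", 1:3), col = 1:3, lty = 1)
```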
4. Next we look at the effect on the Nelson-Aalen estimator (and CI) of censoring.
The routine simAhat simulates data from a Weibull distribution and plots the Nelson-Aalen estimator, with limits.
(a) Explore the effect of censoring. You can simulate uncensored data by
simAhat(rancens=FALSE)
You can have exponential random censoring with parameter cenrate using
simAhat(cenrate=1)
Changing cenrate changes the amount of censoring. Other parameters are sample
size n and Weibull shape parameter shape.
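If someR is not to hand, a rough stand-in for this kind of simulation is sketched below. The function sim_na_sketch is hypothetical and is not the author's simAhat; it only borrows the argument names described above and assumes a Weibull scale of 1, so that the true cumulative hazard is t^shape.

```r
# Hypothetical stand-in for simAhat (assumption: Weibull scale fixed at 1)
sim_na_sketch <- function(n = 100, shape = 1.5, rancens = TRUE, cenrate = 1) {
  t_event <- rweibull(n, shape = shape, scale = 1)
  t_cens  <- if (rancens) rexp(n, rate = cenrate) else rep(Inf, n)
  time <- pmin(t_event, t_cens)
  cens <- as.numeric(t_event <= t_cens)
  ord  <- order(time)
  time <- time[ord]
  cens <- cens[ord]
  Y    <- n:1                          # at risk just before each ordered time
  Ahat <- cumsum(cens / Y)
  se   <- sqrt(cumsum(cens / Y^2))
  data.frame(time, cens, Ahat,
             lower = Ahat - 1.96 * se,
             upper = Ahat + 1.96 * se,
             truth = time^shape)       # true cumulative hazard
}

set.seed(3)
d <- sim_na_sketch(cenrate = 1)
plot(d$time, d$Ahat, type = "s", ylim = range(0, d$upper),
     xlab = "Time", ylab = "Cumulative hazard")
lines(d$time, d$lower, type = "s", lty = 2)
lines(d$time, d$upper, type = "s", lty = 2)
lines(d$time, d$truth, col = 2)
```

Rerunning with larger cenrate shows the limits widening at later times as the risk set empties sooner.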
Now we have some theoretical exercises, for if you have time and inclination.
5. (i) Assume that the hazard function for an individual with covariates $x$ follows a proportional hazards model
$$\alpha(t \mid x) = e^{\beta^T x}\, \alpha_0(t)$$
and a sample of $n$ independent observations $(t_i, \delta_i, x_i)$ is available, where $\delta_i$ is the usual
failure ($\delta_i = 1$) or censoring ($\delta_i = 0$) indicator. Assume continuous survival time with no
ties.
(ii) Write down expressions for the full and partial likelihoods for the data.
(iii) Show that if $\beta$ is known the nonparametric maximum likelihood estimator of $\alpha_0$ at
an observation time $t_i$ is
$$\hat\alpha_i = \hat\alpha_0(t_i) = \frac{\delta_i}{\sum_{j \in R_i} e^{\beta^T x_j}}$$
where $R_i$ is the risk set at time $t_i$, i.e. all individuals known to have survived to at least $t_i$.
You may assume without formal proof that $\hat\alpha_0(t)$ is zero at non-observation times.
(iv) Hence show that the partial likelihood can be considered as a profile likelihood for $\beta$
after removing the effect of $\alpha_0(t)$.
6. Suppose that patients fall into either Group 0 or Group 1. Cumulative baseline hazard
is $A_0(t)$ in Group 0 and $A_1(t) = r A_0(t)$ in Group 1, where $r$ is a time-fixed relative risk.
Assume $r > 1$ so Group 1 is high risk.
Let $L$ and $H$ be survival times of patients randomly selected from Groups 0 and 1 respectively. Find $\Pr(H > L)$. Evaluate this at each of $r = 2, 3, 4, 5$. Comment.
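A Monte Carlo check can be useful for comparing against your closed-form answer. The sketch below assumes an exponential baseline with $A_0(t) = t$, chosen only for convenience; the function name pr_H_gt_L is illustrative.

```r
set.seed(4)

# Assumption for illustration: exponential baseline, A0(t) = t
pr_H_gt_L <- function(r, n = 1e5) {
  L <- rexp(n, rate = 1)   # Group 0: cumulative hazard A0(t) = t
  H <- rexp(n, rate = r)   # Group 1: cumulative hazard r * A0(t)
  mean(H > L)              # proportion of pairs in which H outlives L
}

est <- sapply(2:5, pr_H_gt_L)
names(est) <- paste0("r=", 2:5)
print(round(est, 3))
```

Repeating the simulation with a different baseline (for example Weibull times in both groups with a common shape) is a quick way to see whether the answer depends on $A_0$.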
7. Suppose $T$ is a survival time and there is a monotonic function $H(T)$ such that
$H(T) = -\beta^T x + \epsilon$. Show that if $\epsilon$ has extreme value distribution with cumulative distribution
function
$$F(\epsilon) = 1 - \exp\{-e^{\epsilon}\}$$
then $T$ follows a proportional hazards model.
Show also that if $\epsilon$ follows a standard logistic distribution
$$F(\epsilon) = \frac{e^{\epsilon}}{1 + e^{\epsilon}}$$
then $T$ follows a proportional odds model of the form
$$\frac{P(T > t)}{1 - P(T > t)} = h(t)\, e^{\beta^T x},$$
where $h(t)$ is a function of $H(t)$.
8. The proportional odds model assumes that the relationship between survival function
$S_1(t)$ in a treatment group and $S_0(t)$ in a control group is
$$\frac{S_1(t)}{1 - S_1(t)} = \frac{S_0(t)}{r \times \{1 - S_0(t)\}}, \qquad r > 0.$$
Show that the hazard ratio $\alpha_1(t)/\alpha_0(t)$ starts at $r$ at $t = 0$ but converges to one as $t$
increases indefinitely.
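Before attempting the proof, the behaviour of the hazard ratio can be inspected numerically. The sketch below assumes an exponential control group with rate 1 and takes r = 3, both arbitrary choices; hazards are obtained by numerical differentiation of $-\log S$.

```r
# Assumptions for illustration: S0(t) = exp(-t) and r = 3
r  <- 3
t  <- seq(0.001, 12, by = 0.001)
S0 <- exp(-t)

# Solve the proportional odds relation for S1(t)
S1 <- S0 / (S0 + r * (1 - S0))

# Hazard = derivative of the cumulative hazard -log S (finite differences)
haz   <- function(S) diff(-log(S)) / diff(t)
ratio <- haz(S1) / haz(S0)

plot(t[-1], ratio, type = "l", ylim = c(0, r),
     xlab = "t", ylab = "Hazard ratio")
abline(h = 1, lty = 2)
```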
Session 2: Regression Models, Frailty and Multivariate Survival
1. For the ships data, test whether there is a significant difference between lengths of ownership of the three different ship types a) ignoring the other covariates, and b) allowing
for vessel speed. Don’t forget to treat type as a factor variable.
2. Which covariates, if any, influence leukaemia survival? What is your interpretation of the
important effects?
3. Is there any evidence of frailty in the leukaemia data?
4. Consider the retinopathy data. Test for treatment effect a) ignoring other covariates
and any within-subject correlation, b) allowing for other covariates but ignoring within-subject correlation, and c) allowing for other covariates and using shared frailty to model
within-person correlation. Does it matter what frailty distribution you assume?
Note: you can include frailty by adding + frailty(id) to the list of covariates when
calling coxph. You can change the frailty distribution within the frailty option.
5. An alternative to frailty is to treat each person as a cluster, using +cluster(id) within
coxph. Does it make any difference?
6. The routine simfrail in someR simulates Weibull survival with a single N(0,1) covariate
x, mixed by gamma frailty with frailty variance frailvar. You can turn frailty off by
setting usefrail=FALSE and can control censoring (as in Session 1) using rancens and
cenrate. Other parameters are sample size n and regression parameter b.
Explore the effect of ignoring frailty in a Cox proportional hazards analysis. Does adding
+frailty(id) help?
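The effect of ignoring frailty can also be seen with a hand-rolled simulation, sketched below. The function sim_frail_sketch is a hypothetical stand-in, not the author's simfrail: it assumes a Weibull baseline $A_0(t) = t^{shape}$ and no censoring, and reuses the argument names n, b, frailvar and usefrail from the description above.

```r
library(survival)

# Hypothetical stand-in for simfrail (assumptions: A0(t) = t^shape, no censoring)
sim_frail_sketch <- function(n = 2000, b = 1, frailvar = 1, shape = 1.5,
                             usefrail = TRUE) {
  x <- rnorm(n)
  z <- if (usefrail) {
    rgamma(n, shape = 1 / frailvar, rate = 1 / frailvar)  # mean 1, variance frailvar
  } else {
    rep(1, n)
  }
  # Given (z, x), z * exp(b x) * A0(T) is unit exponential; invert for T
  time <- (rexp(n) / (z * exp(b * x)))^(1 / shape)
  data.frame(id = seq_len(n), time = time, cens = 1, x = x)
}

set.seed(5)
with_frail    <- sim_frail_sketch(usefrail = TRUE)
without_frail <- sim_frail_sketch(usefrail = FALSE)

# Cox fits that ignore any frailty
b_with    <- coef(coxph(Surv(time, cens) ~ x, data = with_frail))
b_without <- coef(coxph(Surv(time, cens) ~ x, data = without_frail))
print(c(frailty_ignored = unname(b_with), no_frailty = unname(b_without)))
```

The true coefficient is b = 1 in both data sets, so the gap between the two estimates reflects the frailty alone. Adding + frailty(id) to the first formula gives the fit the exercise asks about.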
7. If you have prior experience of survival analysis in R, you might explore how good the
various Cox model diagnostics (eg Schoenfeld residuals) are at detecting non-proportional
hazards induced by gamma frailty.
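8. A theoretical exercise, if time. Let $Z$ be gamma with mean 1 and variance $\xi$, so that
$Z \sim \Gamma(1/\xi, 1/\xi)$ with density
$$h(z) = \frac{z^{1/\xi - 1} e^{-z/\xi}}{\xi^{1/\xi}\, \Gamma(1/\xi)} \qquad (z > 0)$$
and assume $S(t \mid z, x) = \exp\{-z e^{\beta x} A_0(t)\}$.
(a) Show that
$$S(t \mid x) = \left( \frac{1}{1 + \xi e^{\beta x} A_0(t)} \right)^{1/\xi}, \qquad
\alpha(t \mid x) = \frac{e^{\beta x} \alpha_0(t)}{1 + \xi e^{\beta x} A_0(t)}.$$
(b) Show
$$[Z \mid T \ge t] \sim \Gamma\left(1/\xi,\; 1/\xi + e^{\beta x} A_0(t)\right)$$
$$[Z \mid T = t] \sim \Gamma\left(1/\xi + 1,\; 1/\xi + e^{\beta x} A_0(t)\right)$$
Hint: use Bayes' theorem and consider the kernel (i.e. the part that involves $z$) of the
distribution only.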
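The marginal survival formula in part (a) can be checked numerically before proving it. The sketch below integrates the conditional survival function against the gamma density; the argument a stands for $e^{\beta x} A_0(t)$, and the function names and the value $\xi = 0.5$ are illustrative choices.

```r
# Marginal survival by numerical integration over the frailty z
S_marginal_numeric <- function(a, xi) {
  integrate(function(z) exp(-z * a) * dgamma(z, shape = 1 / xi, rate = 1 / xi),
            lower = 0, upper = Inf)$value
}

# Closed form from part (a)
S_marginal_formula <- function(a, xi) (1 + xi * a)^(-1 / xi)

a   <- c(0.1, 0.5, 1, 2, 5)   # illustrative values of exp(beta x) * A0(t)
num <- sapply(a, S_marginal_numeric, xi = 0.5)
frm <- S_marginal_formula(a, xi = 0.5)
print(cbind(a, numeric = num, formula = frm))
```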
9. Another theoretical exercise. Suppose paired lifetimes $T_1$ and $T_2$ have shared gamma
frailty and joint survivor function
$$S(t_1, t_2) = \left( \frac{1}{1 + \xi\{A_1(t_1) + A_2(t_2)\}} \right)^{1/\xi}$$
Show that the conditional survivor function of $T_1$ given $T_2 > t_2$ is
$$S(t_1 \mid T_2 > t_2) = \left( \frac{1 + \xi A_2(t_2)}{1 + \xi\{A_1(t_1) + A_2(t_2)\}} \right)^{1/\xi}$$
and the conditional survivor function of $T_1$ given $T_2 = t_2$ is
$$S(t_1 \mid T_2 = t_2) = \left( \frac{1 + \xi A_2(t_2)}{1 + \xi\{A_1(t_1) + A_2(t_2)\}} \right)^{1 + 1/\xi}$$
Find the corresponding hazards $\alpha(t_1 \mid T_2 > t_2)$ and $\alpha(t_1 \mid T_2 = t_2)$ and show
$$\frac{\alpha(t_1 \mid T_2 = t_2)}{\alpha(t_1 \mid T_2 > t_2)} = 1 + \xi$$
10. And another theoretical exercise. Suppose $T_1$ and $T_2$ have joint survival function
$$S(t_1, t_2) = \exp\left[ \alpha \log\{S_1(t_1)\} + \beta \log\{S_2(t_2)\} + \gamma \log\{S_1(t_1)\} \log\{S_2(t_2)\} \right]$$
where $S_1(t_1)$ and $S_2(t_2)$ are univariate survival functions. Show that the conditional
survival function $P(T_1 > t_1 \mid T_2 > t_2)$ is of proportional hazards form, i.e. its cumulative hazard can be written as $H(t_1)\eta(t_2)$ for
some functions $H(.)$ and $\eta(.)$.
Session 3: Time-Variation and Dynamic Covariates
1. The subroutine plaallung() in someR fits an Aalen additive model to the lung cancer
data and plots the estimated cumulative coefficients. What is your interpretation of the
plots?
2. With these plots in mind, use the quick and dirty method to explore possible changes in
covariate effects when fitting a Cox model. (Warning: don’t forget to use copies of the
original data when introducing artificial censoring: don’t overwrite the original).
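The mechanics of artificial censoring at a candidate change-point might look like the sketch below. This is one reading of the quick and dirty method, not a statement of the course's exact recipe: fit one Cox model to a copy censored at tau, and another to the subjects still at risk after tau. The simulated data frame orig, the covariate x and the value tau = 8 are all hypothetical.

```r
library(survival)

# Simulated stand-in data (assumption), administratively censored at 30
set.seed(6)
n    <- 400
orig <- data.frame(x = rbinom(n, 1, 0.5))
t_ev <- rexp(n, rate = 0.1 * exp(0.7 * orig$x))
orig$time <- pmin(t_ev, 30)
orig$cens <- as.numeric(t_ev < 30)

tau <- 8   # hypothetical change-point

# Copy 1: artificially censor everyone still at risk at tau
early      <- orig
early$cens <- ifelse(early$time > tau, 0, early$cens)
early$time <- pmin(early$time, tau)

# Copy 2: only subjects still at risk after tau
late <- orig[orig$time > tau, ]

fit_early <- coxph(Surv(time, cens) ~ x, data = early)
fit_late  <- coxph(Surv(time, cens) ~ x, data = late)
print(rbind(early = coef(fit_early), late = coef(fit_late)))
```

The original data frame is left untouched, so different values of tau can be tried in turn.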
3. The subroutine plaalships() in someR fits an Aalen additive model to the ships data.
What is your interpretation of the plots?
4. With these plots in mind, use the quick and dirty method to explore possible changes in
covariate effects when fitting a Cox model. (Note: use time2 to decide whether an event
(a sale) is before or after a potential change-point tau.)
5. We can add number of previous owners as a dynamic covariate using
plaalships(useowner=TRUE). Does it make a difference?
6. Theoretical. Consider the Aalen additive model:
$$\alpha_i(t \mid F_{t-}) = \beta_0(t) + \beta_1(t) x_{i1}(t) + \beta_2(t) x_{i2}(t) + \cdots$$
$$\lambda_i(t \mid F_{t-}) = Y_i(t)\, \alpha_i(t \mid F_{t-}).$$
Let $X(t)$ be an $n \times (p + 1)$ matrix whose $i$th row is made up of the $p$ time-$t$ covariates
(plus an intercept) for subject $i$, multiplied by the at-risk indicator, i.e.
$$(Y_i(t),\; Y_i(t) x_{i1}(t),\; Y_i(t) x_{i2}(t),\; \ldots)$$
Let $dN(t)$ be an $n \times 1$ vector with element 1 in row $i$ if subject $i$ has an event observed at
$t$, zero otherwise. Then the cumulative regression coefficients $B(t)$ can be estimated by
$$\hat B(t) = \int_0^t J(u) \left\{ X^T(u) X(u) \right\}^{-1} X^T(u)\, dN(u)$$
where $J(u)$ is an indicator of the inverse existing.
Suppose there are no covariates so $\alpha_i(t \mid F_{t-}) = \beta_0(t)$. Show that in this case $\hat B(t)$ is the
Nelson-Aalen estimator.
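A numerical illustration of the no-covariate case, using the hip replacement data from Session 1, is sketched below. With an intercept only, $X(u)$ is the single column of at-risk indicators, and the least-squares increment is computed with solve at each observed time. The function name increment is illustrative.

```r
# Hip replacement data from Session 1
time <- c(1, 3, 4, 5, 6, 8, 10, 13)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1)
n    <- length(time)

# Least-squares increment {X'X}^{-1} X' dN at time u, intercept only
increment <- function(u) {
  X  <- matrix(as.numeric(time >= u), ncol = 1)  # at-risk indicators Y_i(u)
  dN <- as.numeric(time == u & cens == 1)        # event indicator vector at u
  drop(solve(t(X) %*% X, t(X) %*% dN))
}

Bhat <- cumsum(sapply(time, increment))

# Nelson-Aalen estimator computed directly (sorted times, no ties)
Ahat <- cumsum(cens / rev(seq_len(n)))

print(cbind(time, Bhat, Ahat))
```

The agreement of the two columns is a check, not a proof: the exercise asks why $X^T X$ and $X^T dN$ reduce to the size of the risk set and the number of events.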
Session 4: Competing Risks
1. For the kidney data, obtain the CIF for each of death or transplant failure for patients
having their first, second or third-or-later transplant.
Note. One way to do this is to create three new data sets, one for each of the categories
mentioned above. The CIF values are stored in a component prev (named pstate in more recent
versions of the survival package) when survfit is run with the type="mstate" option in Surv.
There is one column for each failure type.
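A sketch of the CIF calculation on simulated competing risks data follows. The data are an assumption standing in for one of the kidney subsets, and the event labels are illustrative. Because the name of the component holding the probabilities depends on the version of the survival package, the code looks for pstate first and falls back to prev. Recent versions may also include a column for the initial (event-free) state.

```r
library(survival)

# Simulated stand-in for a kidney subset (assumption)
set.seed(7)
n       <- 500
t_fail  <- rexp(n, rate = 0.10)
t_death <- rexp(n, rate = 0.05)
t_cens  <- runif(n, 5, 15)
time    <- pmin(t_fail, t_death, t_cens)
cens    <- ifelse(time == t_cens, 0, ifelse(time == t_fail, 1, 2))

sim <- data.frame(time  = time,
                  event = factor(cens, levels = 0:2,
                                 labels = c("censored", "failure", "death")))

# A factor status makes Surv treat the data as multi-state
cif <- survfit(Surv(time, event) ~ 1, data = sim)

# Probabilities over time: one column per state
p <- if (!is.null(cif$pstate)) cif$pstate else cif$prev
print(head(cbind(time = cif$time, p)))

plot(cif, col = 2:3, xlab = "Years", ylab = "Cumulative incidence")
```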
2. Which covariates are important for time-to-failure in the kidney data? Are these the same
for time-to-death? What are the main differences?
3. We now calculate the transition matrix P (t) for a patient with a specific set of covariate
values. The example assumes we are interested in rage, capd, drmm and bmm and we
assume we have two new patients, both with rage=60, capd=0, bmm=1 but one with
drmm=0 and the other with drmm=2. Of course you can try other covariates, and different
numbers of new patients.
First you need to fit a Cox model to each cause-specific hazard:
fit1=coxph(Surv(time, censfail)~rage+capd+drmm+bmm,data=kidney)
fit2=coxph(Surv(time, censdeath)~rage+capd+drmm+bmm,data=kidney)
Then you should create a dataframe with the covariate values for new patients. The
covariates have to be in the same order as in the fits above.
newcase=data.frame(rage=c(60,60),capd=c(0,0),drmm=c(0,2),bmm=c(1,1))
The routine stateprobs(fit1,fit2,newcase) will plot the state occupation probabilities against time, ie the probabilities that the patient will have a functioning kidney, or
will have had a failed transplant or will have died.
Use this routine to explore the effects of each of rage, capd, drmm and bmm individually,
keeping the other covariates fixed. Does it matter what values they are fixed at?
4. The subroutine simcrdat simulates competing risks data. There are two failure types
(cens1 and cens2) and the option of additional independent right censoring. Dependence is generated by assuming by default a positive stable shared frailty, with parameter
nu. There are two covariates, one standard Normal and one binary. For simplicity the
regression coefficients b are assumed to be the same for each failure type. All times are
truncated at a value tmax.
Use the routine to explore how the CIFs for each failure type are affected by the frailty
parameter nu, when there are no covariate effects. For example
dat1=simcrdat(nu=1)
cif1=survfit(Surv(time,cens,type="mstate")~1,data=dat1)
dat2=simcrdat(nu=0.2)
cif2=survfit(Surv(time,cens,type="mstate")~1,data=dat2)
plot(cif1,col=2:3)
lines(cif2,col=4:5)
Consider the effect of nu on the CIF for each level of the binary covariate, at b=(0,1).
For more general b, use a Cox model to estimate the covariate effects based on cause-specific hazard models.
5. The routine can also be used to explore multivariate survival models as it also saves the
latent event times for each cause, as time1 and time2 with censoring indicators cens11
and cens21. It uses shared or independent frailty terms, which are either positive stable
with parameter nu
dat1=simcrdat(b=c(0,0),usegamma=FALSE,nu=0.5)
or gamma with variance xi
dat2=simcrdat(b=c(0,0),usegamma=TRUE,xi=0.5)
Find the values of nu and xi which each lead to a variance of about 1 for log(time1) and
log(time2), when there is shared frailty, no covariate effects and no censoring. (You can turn
censoring off by setting tmax=Inf and userandom=FALSE). At these parameter values,
how is the correlation between the logs of time1 and time2 affected by choice of frailty
distribution?
Appendix: Data
Leukaemia
Results on 1043 patients diagnosed with acute myeloid leukaemia.
time: Time in years from diagnosis
cens: Indicator of death (1) or censoring (0)
age: Age in years
male: Female=0, male=1
wbc: White blood cell count (50 × 10^9/L), truncated at 500
dep: Deprivation score - a measure of poverty/affluence for the residential location (the
higher the score the more deprived the area)
Lung cancer
Data on 272 patients diagnosed with non-small-cell lung cancer.
time: Time in months from diagnosis
cens: Indicator of death (1) or censoring (0)
age: Age in years
female: Male=0, female=1
activity: Activity score on scale 0–4 (low good)
anorexia: 0=absent, 1=present
hoarseness: 0=absent, 1=present
metastases: 0=absent, 1=present
Kidney
Results on 3232 patients who received a kidney transplant.
time: Time in years from transplant to first of organ failure or death
cens: Indicator of censoring (0), failure (1) or death (2)
censfail: Indicator of failure (1) or other (0, death or censoring)
censdeath: Indicator of death (1) or other (0, failure or censoring)
rfemale: Indicator of female recipient
rage: Recipient age in years
capd: Indicator of continuous ambulatory peritoneal dialysis before transplant
wait: Waiting time (months) for transplant
txnum: Transplant number
livdon: Indicator of living (1) or cadaver (0) donor
transf: Indicator of pre-transplant blood transfusion
presens: Indicator of recipient sensitised pre-transplant
dfemale: Indicator of female donor
dage: Donor age in years
amm, bmm, drmm: HLA-A, B and DR antibody mismatches
cold,warm: Ischaemia times: time kidney was on ice (hours), or not (minutes), before
transplant
Retinopathy
Data on 197 patients with diabetic retinopathy, who had one eye randomized to laser
treatment and the other eye received no treatment. There are two rows per patient - one for
each eye.
id: 1:197
time: Time in months to blindness
cens: Indicator of blindness (1) or censoring (0)
tmt: Indicator of laser treatment
lefteye: Indicator of eye
agediag: Age in years at diagnosis of diabetes
juvenilediab: Indicator of juvenile diabetes
riskgp: A measure of risk (6:12)
These data are from
Huster WJ, Brookmeyer R, Self SG. Modelling paired survival data with covariates. Biometrics
1989; 45:145-156.
Ships
Data, in counting process form, on ownership lengths of 3908 ships. There is one row per
ownership.
id: 1:3908
time1: Age in months of ship at time ownership starts
time2: Age in months at time ownership ends
cens: Indicator of whether ship was sold (1) or still in use at time2 (0)
owner: Owner number
type: Bulker (1), container (2) or tanker (3)
dwt: Deadweight (weight of ship plus max load) in tonnes
speed: Speed in knots