Bone Marrow Transplantation (2001) 28, 1001–1011
2001 Nature Publishing Group All rights reserved 0268–3369/01 $15.00
www.nature.com/bmt
Mini-review
Statistical methods for the analysis and presentation of the results of
bone marrow transplants. Part 2: Regression modeling
JP Klein1,2, JD Rizzo2, M-J Zhang1,2 and N Keiding3
1
3
Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA; 2IBMTR/ABMTR, Milwaukee, WI, USA; and
Department of Biostatistics, University of Copenhagen, Denmark
Summary:
In this paper, we address methods of multivariate
regression. We discuss the value of regression compared
to matched pairs analysis, methods of coding variables,
basic concepts of the Cox model and interpretation of
results of the Cox model. We present methods of handling variables whose effect changes with time. We
present methods to check the assumptions of the Cox
regression. Finally, and perhaps most importantly, we
provide suggestions for presenting the results in clear
and thorough tables and graphs. Bone Marrow Transplantation (2001) 28, 1001–1011.
Keywords: Cox model; proportional hazards models;
fixed and time-dependent covariate
In the first paper in this series,1 we presented important
considerations in basic statistics and univariate analyses for
studies involving outcomes of bone marrow transplantation.
Regression or multivariate analysis is commonly used in
studying the outcomes of bone marrow transplantation.
Regression models are used in two distinct situations. First,
they are used in prognostic factor studies. These are studies
designed to determine patient, donor, disease and treatment
characteristics, which influence some outcome of the therapy. They typically use stepwise regression techniques to
build a model that can be used to predict the outcome of
a patient based on a series of factors measured at the time
of transplantation.
Second, regression models are frequently used to make
adjustments for imbalances in prognostic factors between
groups of patients and therefore are an excellent method to
make planned comparisons between treatments. Interaction
terms can be added to the models to allow for subgroup
analysis.
An alternative to use of regression models for making
treatment comparisons is the use of matched pairs analysis.
Here, each case given a particular treatment is matched on
some patient- or disease-specific characteristic such as age,
sex, primary disease state, etc to a control patient. CompariCorrespondence: Dr JP Klein, Division of Biostatistics, Medical College
of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI, USA
sons between the treated and control groups are made by
an appropriate matched pairs survival analysis technique
(see section 7.5 of Klein and Moeschberger2). In most clinical applications we prefer a regression-adjusted test of
treatments to a matched pairs analysis for several reasons.
First, the regression analysis allows us to use all the patient
information, including those patients for whom we cannot
find a match. Second, in the matched pairs analysis only
those pairs where the smaller of the pair’s on-study time
corresponds to a death are used in the analysis so the effective sample size is less than that of the corresponding
regression analysis. Third, the regression analysis allows us
to look for differential effects of treatments in some subgroups of patients. This is not possible with a matched pairs
analysis. Fourth, any factor for which the patients are
matched cannot be analyzed, nor can its effect on outcome
be quantified. Fifth, a frequent misconception is that
increasing the size of the matched cohort leads to greater
statistical power. In practice, however, little is gained
beyond a two or three to one matching.3 Finally, the
matched pairs analysis, which must involve prospective
matching, is administratively harder to implement.
Regression models can be applied to most outcomes of
a BMT. Most commonly regression models are used for
time to event data. Time to event data includes data where
we have an on-study time and an event indicator. Examples
are data on survival, relapse, death in remission, and disease-free survival. Here regression models typically focus
on modelling the rate at which these events occur over time.
The most common regression model is the Cox4 proportional hazards model, which models the timing of events
through the hazard rate (see the appendix for definitions of
statistical terms). Another possible model for this type of
data is the class of accelerated failure time models5 (AFT),
which models the relationship between risk factors and outcome through the logarithm of the event time. The AFT
models are parametric models that require a very specific
model for the baseline survival function and are rarely used
in BMT studies for this reason. Studies have shown that
when the incorrect model is assumed for the baseline survival rate the AFT may lead to misleading results.6 In the
following sections we will explain how to interpret the
results of a Cox regression model for BMT data. We will
illustrate how to check whether this model is an appropriate
one and suggest some ways the model can be improved.
Regression methods for BMT
JP Klein et al
We will point out common mistakes made in applying and
interpreting this model.
Coding risk factors
BMT studies involve a number of different types of covariates. These covariates or risk factors may be: (1) continuous
measures of patient status, such as age, pre-transplant white
blood cell count (WBC), time from diagnosis to transplant,
etc; (2) binary, such as sex; or (3) categorical such as race
(white, black, other) or stem cell source (marrow, blood,
both). Each of the regression models we consider has a
different interpretation of these three types of risk factor,
but some common principles do hold.
First, continuous covariates, such as age or WBC, are
characteristics which we can, in theory measure, with a
degree of accuracy. A common mistake is to code ordinal
categorical variables as continuous covariates. For example,
year of transplant or stage of disease are ordered categorical
variables which should not be treated as continuous covariates. When one incorrectly treats these ordered categorical
variables as continuous, the tests from the regression
models are valid tests of trend but the risk estimates are
meaningless.
Categorical variables require special care. For a risk factor with k categories we must select one category as a baseline and code the remaining (k − 1) categories as binary (0
or 1) indicators of being in each category. Selection of the
baseline category is arbitrary, but it is important to realize
that statistical packages will only routinely supply tests of
the effect of being in one of the other categories relative
to the chosen baseline category. A good analysis will report
an overall P value for a categorical risk factor from a socalled multiple degree of freedom test. This P value is a
test of the hypothesis that all the responses are equal against
the alternative that the response in at least one category is
different from the others. This test, which should always
be reported, is not affected by the choice of the baseline
category.
In BMT studies we can classify risk factors as fixed time
or time-dependent covariates. Fixed time risk factors are
those whose value is known and fixed at the start of the
study. Time-dependent covariates are those whose value
may change after the patients have entered the study.
Examples of time varying risk factors, when the starting
point is at transplant, are the most current neutrophil or
WBC count (both continuous), or time varying binary variable indicating whether acute GVHD has occurred by a
given time or whether the patient is still hospitalized at time
t. A very common mistake made when reporting results of
BMT studies is to treat time-dependent covariates as if they
were fixed covariates, for example, treating occurrence of
acute GVHD at some time as if it were known at the time
of transplant. This leads to very misleading results. We will
discuss the interpretation of time-dependent covariates
later.
Missing values complicate analysis of transplant studies.
Although every effort should be made to collect complete
information on each covariate, this is sometimes impossible. Several approaches to models with missing data are
Bone Marrow Transplantation
suggested in the literature. These include imputation of
missing values by using averages of covariate values for
similar patients,7 discarding any case with a missing value,
or coding ‘missing’ as a separate category for the risk factor. We prefer the latter two methods. When there are relatively few missing values we tend to discard the patients
with missing values from the data set (this is the so-called
‘complete case analysis’). When we do this we like to compare the outcome between those with and without missing
values to ensure we are not biasing the study by this
method. When a variable has a high frequency of missing
values (⬎10% of sample), we code an indicator variable
that this risk factor is missing and include this as one of
the categories when testing for this factor. This allows us
to keep as much data as possible for analysis and to test if
those with missing values are different from the average
patient with complete data.
The Cox proportional hazards model
The most commonly used regression model for lifetime
data is the Cox4 proportional hazards regression model.
This is a model for a patient’s hazard rate. The hazard rate
is the chance of the event of interest occurring in the next
instant of time for a patient yet to experience the event. If,
for example, the event is treatment failure (ie death or
relapse) then the hazard rate for treatment failure at time t,
h(t), is the chance of relapse or death tomorrow if the
patient is alive and disease-free today (t).
The Cox model assumes that the hazard rate for a given
patient can be factored into a baseline hazard rate (common
to all patients) and a parametric function of the covariates
which describes the patients characteristics. Thus we can
express the hazard rate for patient i as
hi(t) = h0(t)exp{Zi}.
Here h0(t) is a baseline hazard rate that we estimate nonparametrically; Zi, is the ith patient’s covariate and  is the
risk or regression coefficient. This is called a semiparametric model since we model the covariate effects with a
parametric model, but we model the baseline hazard rate,
h0(t), nonparametrically.
Figure 1 illustrates the relationship between hazard rates
2
With covariates
Hazard rate
1002
1
0
Baseline
0
1
2
Time post transplant
Figure 1 Relationship between hazard rates for the Cox model.
3
Regression methods for BMT
JP Klein et al
for a single binary covariate based on the Cox model. Here
the top curve representing a patient with the characteristic
is twice the bottom curve which is for a baseline patient.
The relative risk is eZ = 2. Both curves are hump-shaped
which is the typical shape of a hazard rate for treatment
failure after transplant. This represents an increasing risk
of death due to infection or GVHD early which tapers after
some period of time.
Since the Cox model is based on the hazard rate when
the event of interest is survival or treatment failure (see our
discussion of competing risks in part 1 of this series1), the
model suggests a relationship between the survival functions for patients with different risk factors. The survival
function for a patient with a covariate Z is equal to
So(t)exp{Z}. Here So(t) is the survival function for the baseline patient. Figure 2 shows the baseline and adjusted survival curves corresponding to the Cox models in Figure 1.
Interpretation of covariates
As stated above, in the Cox proportional hazards model the
covariates act as factors multiplying the hazard rate, which
is the probability of experiencing the event in question (eg
death or relapse) during the next little interval (eg
tomorrow). To illustrate the interpretation of covariates in
the model we are using data from an IBMTR study of alternative BMT donors for leukemia.8 The data considered consists of patients transplanted for chronic or acute leukemia
with either an HLA-identical sibling donor (n = 1224), a
one (n = 238) or two (n = 102) HLA antigen mismatched
related donor, an HLA matched unrelated donor (n = 380)
or a one-HLA antigen mismatched unrelated donor (n =
108).
Consider first the inference for a continuous covariate
AGE (age of patient at transplantation in years) for the
event of treatment failure (Table 1). The interpretation of
the estimate 0.003842 is best obtained through the risk ratio
obtained using the exponential function as exp{0.003842}
= 1.003849, which says that the hazard increases by a factor
of 1.0038 for each year that the age of the patient increases.
The standard error of 0.00237 is a bit high for the estimate
Survival function
1
Baseline
With covariates
0
0
1
2
3
Time post transplant
Figure 2 Relationship between survival functions for the Cox model.
Table 1
Variable
AGE
1003
A typical computer output
DF
Parameter
estimate
Standard
error
Risk ratio
P
1
0.003842
0.00237
1.003849
0.1055
to be considered statistically significant; for one parameter
significance at the 5% level corresponds to the estimate
being further away from zero than 1.96 standard errors.
This is also reflected in the quoted P value of 0.1055 or
10.55%. Comparing two patients with an age difference of
20 years yields an increase of exp{20 × (0.003842)} =
exp{0.07684} = 1.07987 = (1.003849)20 so that a patient
20 years older has a hazard about 1.08 times higher, ie 8%
higher, than the younger patient.
Another way of scoring the age effect would be to select
a threshold age (such as 40 years) such that AGE40 = 0 if
AGE ⭐ 40, AGE40 = 1 if AGE ⬎ 40. This dichotomization
of the age effect postulates that most of the effect of age
on relapse or death can be captured as a marked increase
in hazard at age 40. The estimate of AGE40 is 0.163338
with standard error 0.07070, P = 0.0209 and risk ratio
1.177. So according to this, patients older than 40 have a
17.7% higher hazard of death or relapse than patients under
age 40.
The choice of cutpoint between high and low risk (in the
above case 40 years of age) is sometimes difficult and there
is no hard and fast rule as to the choice. The best choice is
one based on the existing scientific evidence from previous
papers or based on clinical observation. Another possibility
is to use the median in the data as a cut point. One common
data-dependent approach when the cutpoint is not given at
the outset is to try several cutpoints in order to locate that
point with clearest significance. This however invalidates
the interpretation of the hypothesis tests using the standard
P values. Recent statistical research9–11 has shown that the
P values computed from this search procedure are too small
and lead to false significances in most cases. There are various12–14 (much more elaborate) statistical procedures that
correct these P values so that the overall inference level
is correct.
When studying a categorical variable with more than two
categories it is necessary to use several ‘dummy’ variables.
Consider for example, the patient’s disease at the time of
transplant, AML, ALL or CML. This factor can be coded
using
ALL
= 0 if disease = AML or CML
1 if disease = ALL
CML
= 0 if disease = AML or ALL
1 if disease = CML
The relative risk estimates of ALL will measure how much
the hazard of ALL is changed relative to the baseline,
which here is AML. It is best to choose a baseline category
that is useful for such comparisons, and at the same time
has sufficient patients (because if it contains few patients
the resulting estimates will all inherit the resulting
uncertainty). Note that most statistical software will choose
Bone Marrow Transplantation
Regression methods for BMT
JP Klein et al
1004
Table 2
Variable
ALL
CML
Parameter estimates for disease
DF
1
1
Table 4
Parameter estimates when there is an interaction between age
and disease
Parameter
estimate
Standard
error
Risk ratio
0.184353
−0.020281
0.08257
0.07056
1.202
0.980
P
Variable
0.0256
0.7738
the reference category unless specifically prompted, which
will not take the above considerations into account.
Table 2 shows the typical computer output for the disease
factors. The interpretation is that ALL patients have a hazard of relapse or death of 1.202 times that of AML patients,
and this is statistically significant with P = 0.0256. On the
other hand the slightly decreased (by a factor of 0.980)
hazard of CML patients relative to AML patients cannot
be considered statistically significant (P = 0.77). If the relative risk of CML in relation to ALL is considered, direct
calculation yields that this is 0.980/1.202 = 0.815. However, calculation of standard errors and P values requires
additional information and is most easily done in practice
by recoding so that ALL is the reference category and
running the program once more.
When two factors are included in a model a typical output is as in Table 3. Here the interpretation of the risk ratio
of 1.240 for ALL is that an ALL patient is failing treatment
at a rate 1.24 times greater than an AML patient of a similar
age. Note that the relative risk is now 1.240 as compared
to 1.202 in Table 2 reflecting that we have made an adjustment for age. The risk ratio of 1.240 is the same regardless
of the age of the patients being compared. This model is
said to have no interaction between age and disease.
In some cases we may be interested in determining if
there is an interaction between two factors in their effect
on outcome. To detect that, we include in the model terms
involving cross products of the indicators for each factor.
Table 4 illustrates this for the disease and age factors. Here
we see that there is in fact an interaction between age and
CML such that older CML patients have a different
relationship to older AML patients than younger CML
patients to younger AML patients. The relative risk of
1.037 for CML is the relative risk of treatment failure for
a CML patient as compared to an AML patient when both
are under 40 years of age (ie when AGE40 = 0). For
patients over age 40 the relative risk of a CML patient to
an AML patient is exp{0.03662 + (−0.32161)} = 0.752.
Similarly, ALL patients are failing at a rate 1.254 times
that of an AML patient under age 40 and at a rate
exp{0.22639 + 0.21595} = 1.556 faster than that of an
Table 3
Parameter estimates when there is no interaction between age
and disease
Variable
DF
Parameter
estimate
Standard
error
Risk ratio
P
ALL
CML
AGE40
1
1
1
0.21515
−0.05181
0.22499
0.08330
0.07140
0.07364
1.240
0.950
1.252
0.0098
0.4680
0.0022
Bone Marrow Transplantation
ALL
CML
AGE40
CML and
AGE ⬎40a
ALL and
AGE ⬎40
DF
Parameter
estimate
Standard
error
Risk ratio
P
1
1
1
1
0.22639
0.03662
0.39811
−0.32161
0.08999
0.08275
0.12881
0.16028
1.254
1.037
1.490
0.725
0.0119
0.6581
0.0020
0.0448
1
0.21595
0.25724
1.241
0.4012
a
Interaction term constructed as product of CML and AGE40 covariates.
AML patient over age 40. To test if these relative risks are
significant requires additional information not given in
Table 4. This information can be obtained by using an alternative coding of the variables. When this is done we find
that for patients over age 40 the risk of treatment failure
for a CML patient as compared to an AML patient of 0.752
is significantly different from 1 (P = 0.0379). This tells us
that older CML patients do better then older AML patients,
but for younger patients there is no difference in treatment
failure rates.
Time-varying effects and time-dependent covariates
Most of the variables discussed so far are known at the
time of transplantation (age, sex, WBC count, Karnofsky
score, basic disease); these are called fixed or time-fixed
covariates. In contrast to these stand time-dependent covariates, of which the most important in the present context is
whether or not various events have happened: engraftment,
platelet recovery, (first occurrence of) acute graft-versushost disease, similar for chronic GVHD. One of the
important features of the Cox regression model is that it
allows for inclusion of time-dependent covariates with little
extra technical effort. However there are various issues of
interpretation that require attention.
First, as already mentioned, time-dependent covariates
are often misinterpreted as time-fixed covariates. It is incorrect to compare patients with aGVHD to patients without,
as if it were known at transplantation whether or not the
patients will get aGVHD. To do so would bias the estimate
of the effect of aGVHD by assuming that every patient who
receives a transplant is at risk of developing aGVHD, while
in fact some patients die before they could experience
aGVHD.
Second, sometimes the effect of some fixed covariate (eg
donor age, sex) on outcome is mediated through timedependent covariates (eg GVHD). It may then confuse the
assessment of the time-fixed covariates if the time-dependent covariate is also included in the analysis. This is similar to the well-established practice in epidemiology of not
including intermediate variables in the analysis.
In our example we can examine the effect of acute and
chronic GVHD on treatment failure. Here we code two
time-dependent covariates which are the indicators that a
patient has developed either acute or chronic GVHD prior
Regression methods for BMT
JP Klein et al
Table 5
Effect of GVHD on outcome when coded as time-dependent covariates
Variable
DF
Parameter
estimate
Standard
error
Risk ratio
P
Acute
GVHD
Chronic
GVHD
1
0.68051
0.06340
1.975
⬍0.0001
1
−0.05885
0.08829
0.943
0.5050
difference by treatment group in the next 38 months (18–
56 months). Finally, comparing long-term survivors of the
two treatments (ie those surviving at least 56 months) the
chemotherapy patients are dying at a rate 1/0.156 = 6.41
times that of BMT patients. Which treatment is better? This
question can not be answered directly from the Cox model,
but rather from a plot of the differences in survival curves
discussed in Presentation of results.
1005
Checking the model
Table 6
Effect of GVHD on outcome when coded incorrectly as if
known at transplant
Variable
DF
Parameter
estimate
Acute
GVHD
Chronic
GVHD
1
0.729
1
−1.238
Standard
error
Risk ratio
P
0.0628
2.073
⬍0.0001
0.757
0.290
⬍0.0001
to time t. Notice that the GVHD covariate is initially zero
and changes to a value of one at the time GVHD is diagnosed for the patient. Fitting these models, we have Table
5. These results tell us that a patient’s risk of treatment
failure increases by 1.975 when they develop acute GVHD.
There is no effect on outcome due to chronic GVHD. Note
that if we incorrectly coded acute and chronic GVHD as if
the occurrence were known at the time of transplant (ie as
fixed covariates) then we have the results in Table 6 where
it appears that chronic GVHD has a highly significant benefit to the patient. This apparent benefit is entirely due to
the fact that patients need to live to at least 100 days or
more to develop chronic GVHD. When modelled correctly
this ‘benefit’ is not observed.
It is important to distinguish time-dependent covariates
from time-varying fixed effects. Times varying fixed effects
occur when the effects of a fixed covariate on outcome
change over time. To illustrate these types of effects consider the IBMTR study15 comparing the survival of 548
HLA-identical sibling transplant patients to 196 patients
treated for CML with hydroxyurea (n = 121) or interferon
(n = 75). In this study time is measured from the date of
diagnosis. The effect of therapy (BMT vs chemo) changed
over time. Table 7 shows the estimates of the effect of BMT
in three distinct time intervals. Here we see that in the first
18 months after diagnosis that BMT patients are dying at
a rate 5.8 times that of patients on chemotherapy. Among
those patients who survive at least 18 months there is little
Table 7
The results of the Cox model are only valid as long as the
assumptions of the model are correct. The most important
is the proportionality of hazards assumption that the effects
of the risk factors are constant over the follow-up time period. We need to check the proportionality assumption when
fitting a Cox model. Various approaches and methods for
checking the proportional hazard rate have been proposed
and studied (see in Klein and Moeschberger,2 chapter 11).
We consider two commonly used approaches. One method
is to add a time-dependent covariate. Suppose we need to
examine the proportionality assumption for a fixed covariate Z1. We add a time-dependent covariate Z2 = Z1 ⫻ log(t)
to the model. If the proportionality assumption holds for
Z1, then the regression coefficient for Z2 should be zero,
suggesting no change in Z1’s effect on outcome over time.
As an example, in our study of alternative donors consider the factor Karnofsky score at transplant and its effect
on treatment failure. Here, we dichotomize patients into
good (⭓90%) and poor Karnofsky status. Table 8 shows
the results of fitting the covariate Karn = {1 if good Karnofsky, 0 if poor} and the time-dependent covariate Karn ×
log(t). Here the risk ratio of 0.435 is quite meaningless.
Our artificial covariate Karn ⫻ log(t) is highly significant
suggesting the effect of Karnofsky score changes over time.
Another approach to checking the proportionality
assumption is a graphical method. To check the proportionality assumption for a discrete covariate Z1, after
adjusting for other covariates, we fit a Cox model to other
covariates stratified on Z1. When a Cox model is stratified
on a covariate then different baseline rates are used for each
level of the covariate. If the proportionality assumption
holds for Z1, the difference in the log cumulative hazard
function over any two levels of Z1 should be constant over
the follow-up time period. Alternatively, Andersen16 proposed plotting the cumulative baseline hazard functions for
each pair of levels against every other level of Z1. Andersen’s plots should be straight lines if the proportionality
assumption holds. When the proportionality assumption
does not hold, these plots also summarize how the hazard
rate changes over time.
Estimates of treatment effect
Variable
BMT and within 18 months of diagnosis
BMT and between 18 to 56 months of diagnosis
BMT and greater than 56 months of diagnosis
Parameter estimate
Standard error
Risk ratio
P
1.76654
−0.21794
−1.18592
0.36442
0.24598
0.37881
5.851
0.804
0.156
⬍0.0001
0.3756
⬍0.0001
Bone Marrow Transplantation
Regression methods for BMT
JP Klein et al
1006
Table 8
Testing for proportional hazards
Variable
Parameter estimate
Standard error
Risk ratio
P
−0.83281
0.20654
0.08717
0.05535
0.435
1.229
⬍0.0001
0.0002
Good Karnofsky
Good Karnofsky × log(time)
-0.25
Difference
-0.50
-0.75
Karnofsky score: 90100%<90%
-1.00
-1.25
-1.50
0
12
24
36
48
60
72
Months since transplant
Figure 3 Difference in log cumulative baseline hazards functions.
To illustrate these techniques we again consider the indicator of good Karnofsky score. We fit a stratified Cox
model with risk factor donor type stratified on Karnofsky
score. Figure 3 plots the difference in log cumulative baseline hazard (good–poor Karnofsky) which indicates that the
difference is not constant. Figure 4 is the Andersen plot of
cumulative baseline hazard function of good against poor
Karnofsky. If proportional hazards holds this plot should
be the 45° line. These plots confirm that the proportionality
assumption is not valid for the risk factor of Karnofsky
score.
In the situation where the proportionality assumption is
clearly untenable it may sometimes be possible to partition
the effects along the time axis as shown in earlier in the
context of comparison of BMT with chemotherapy for
CML. Within each time interval (in that example 0–18
Karnofsky score: 90100%
0.5
0.4
0.3
0.2
0.1
0.0
0.0
0.1
0.2
0.3
0.4
Karnofsky score: <90%
Figure 4 Andersen’s plot of cumulative hazard functions.
Bone Marrow Transplantation
0.5
months, 18–56 months, more than 56 months) the proportionality assumption still holds. A similar procedure
could be used for the Karnofsky score, one possibility being
to distinguish between an early and a late effect according
to some cutpoint t0. Often this cutpoint needs to be chosen
from the data. Applying this procedure to this data we find
a ‘best’ cutpoint of 2 months. For this model we see there
is a benefit of having good Karnofsky for the first 2 months
after transplant (RR = 0.382, P ⬍ 0.0001). In 2-month survivors, there is still a beneficial effect of good Karnofsky
at transplant (P ⬍ 0.0001), but this effect is now cut in
half with a relative risk of 0.644. The P values from this
procedure are suspect, however, since the procedure is ‘rigged’ to find significance by selecting the cut point that gives
us the biggest values of test statistics. Unless these P values
are very small, as in this example, we need to be extremely
careful in our interpretation of significance.
In some situations the non-proportionality regards a
covariate for which no direct effect measure is needed. It
is then often useful to relax the model assumption allowing
separate underlying hazards for each level of that covariate.
This is called the stratified Cox model, and often it is
acceptable to assume that the rest of the covariates act
multiplicatively on these hazards, with common value
regardless of the level of the critical covariate. The Karnofsky score may once again provide an example. Assume that
we allow separate baseline hazards for Karnofsky ⭓90 and
⬍90, then the effect of older age has a relative risk of 1.159
(P = 0.0378), while for an unstratified model the relative
risk is 1.170 (P = 0.0270).
Presentation of results
In the previous sections we have presented a number of
tables which are helpful in understanding how to interpret
the Cox regression analysis. In this section, we discuss our
views on how one should present these results for publication. In all but the simplest studies this presentation
should involve a table of risk ratios with the appropriate
confidence intervals and P values and a set of summary
figures. Depending on the results of the Cox analysis the
summary figures should reflect the average survival or disease-free survival for a patient with a particular risk factor,
the average being over all other risk factors in the model.
The curves are either the actual estimated survival curves
themselves, or, when the effect being studied has nonproportional hazards, a difference between survival curves.
One needs to be careful that the curves are appropriate for
the outcome measure (see our discussion of competing risks
in part 11). While the Cox model is valid for studying the
rate of occurrence of competing risks such as relapse or
Regression methods for BMT
JP Klein et al
Table 9
1007
Results of Cox modelling for the alternative donor study
Variable
One-antigen mismatched related
Two-antigen mismatched related
Matched unrelated
One-antigen mismatched unrelated
One-antigen mismatched related and advanced disease
Two-antigen mismatched related and advanced disease
Matched unrelated and advanced disease
One-antigen mismatched unrelated and advanced disease
Advanced disease
Age ⬎40
Parameter estimate
Standard error
Risk ratio
P
0.91184
1.28301
0.80989
1.18477
−0.71353
−0.80465
−0.4495
−0.79174
0.96943
0.26131
0.19593
0.25117
0.11793
0.19593
0.18281
0.29021
0.15638
0.24816
0.08772
0.07196
2.489
3.607
2.248
3.270
0.490
0.447
0.638
0.453
2.636
1.299
⬍0.0001
⬍0.0001
⬍0.0001
⬍0.0001
⬍0.0001
0.0056
0.0040
0.0014
⬍0.0001
0.0003
death in remission, the survival curves suggested in the
sequel apply only to events like survival or treatment
failure.
We first illustrate this on the data on unrelated donors.
Table 9 shows the output of the final Cox model on which
our presentation table and figures are based. This model
has an indicator for each type of donor with HLA-identical
siblings as the baseline, an indicator of transplantation with
advanced disease (⬎first chronic phase for CML; ⬎1st
remission for AML or ALL), and age ⬎40. There is also
an interaction term between donor type and disease stage.
All models are stratified on good or poor Karnofsky score.
While this table is very useful in understanding the basic
model, it is not the type of table we recommend for publication. This is found in Table 10. Here we present for each
factor a test of its significance. This tests the null hypotheses that all levels of the factor have the same effect on
survival against the hypothesis that one of the levels of the
factor is different from any other. Here the P value (a) is
from such a test and it is an overall P value of the hypothesis that at least one of the donor types has a different treatment failure rate than the others, here for early disease
patients. This P value, based on a so-called multiple degree
of freedom test, is compared to 0.05 to judge significance
or 0.01 to judge strong significance. Having determined that
the factor (here donor type) is significant we then look at
the pairwise comparisons of levels of the factor to determine which donor types are different from each other. We
are making 5 ⫻ 4/2 = 10 such pairwise comparisons, four
of which are listed in the body of the table and six in the
Table 10
Suggested publication form of a summary table for the Cox regression for the alternative
donor study
Factor
Donor type
Patients with early disease1
HLA identical sibling
One-antigen mismatched related (A)
Two-antigen mismatched related (B)
Matched unrelated (C)
One-antigen mismatched unrelated (D)
Relative risk
(95% confidence interval)
P
⬍0.0001a
2.49
3.61
2.25
3.27
1.00
(1.90–3.26)b
(2.00–5.90)
(1.78–2.83)
(2.23–4.80)
Patients with advanced disease2
HLA-identical sibling
One-antigen mismatched related (A)
Two-antigen mismatched related (B)
Matched unrelated (C)
One-antigen mismatched unrelated (D)
1.22
1.61
1.43
1.48
1.00
(0.96–1.54)
(1.21–2.15)
(1.17–1.75)
(1.10–2.00)
Age
⭐40
⬎40
1.00
1.30 (1.13–1.49)
0.0003
Disease status at transplant
Early disease
Advanced disease
1.00
2.64 (2.22–3.13)
⬍0.0001
⬍0.0001
⬍0.0001
⬍0.0001
⬍0.0001
0.0004
0.993
0.0011
0.0101
0.0101
Pairwise comparisons of donor types for patients with early disease: A–B, P = 0.1735; A–C, P = 0.5215; A–
D, 0.2213; B–C, P = 0.0725; B–D, P = 0.7482; C–D, P = 0.0760.
2
Pairwise comparisons of donor types for patients with advanced disease: A–B, P = 0.0980; A–C, P = 0.2249;
A–D, 0.2642; B–C, P = 0.4524; B–D, P = 0.6594; C–D, P = 0.8392.
a
Multiple degree of freedom test (see text).
b
95% confidence interval (see text).
1
Bone Marrow Transplantation
1.0
HLA-identical sibling
HLA-matched unrelated
One-mismatched unrelated
One-mismatched related
Two-mismatched related
0.8
0.6
0.4
0.2
0.0
0
12
24
36
48
60
Adjusted probability of disease-free survival
1008
Adjusted probability of disease-free survival
Regression methods for BMT
JP Klein et al
1.0
HLA-identical sibling
HLA-matched unrelated
One-mismatched unrelated
One-mismatched related
Two-mismatched related
0.8
0.6
0.4
0.2
0.0
0
12
24
36
48
60
Months since transplant
Months since transplant
Figure 5 Adjusted probability of disease-free survival for early disease.
Figure 6 Adjusted probability of disease-free survival for advanced disease.
footnote. To ensure that we are not drawing any false conclusions we should compare each of these comparisons to
0.05/10 = 0.005 for significance (or 0.01/10 = 0.001 for
strong significance). Here we see that all the alternative
donors differ from HLA-identical siblings for early disease
patients, but none of the alternative donors are different.
Note for advanced disease using the 0.005 adjusted level
we find that only two-antigen mismatched related donors
are different from HLA-identical sibling donors.
The table also provides for each variable in the column
the estimated relative risk as compared to the baseline category which is also given in the table and a 95% confidence
interval for the relative risk. Here, for example, the value
(b) tell us that, for early disease, a patient given a oneantigen mismatched related donor transplant fails the treatment at a rate 2.49 times that of an HLA-identical sibling
transplant. We are 95% confident that this estimated relative risk could be as low as 1.90 or as high as 3.26. These
values in many cases come directly from the regression
table (Table 9) with the confidence intervals being computed as exp{parameter estimate ⫾ 1.96 ⫻ standard error}.
Having presented the results of the study, accompanying
figures are needed to summarize the main results. Here,
since the relationship between donor types does not change
over time an adjusted survival curve for each of the main
effects is of interest. This adjusted survival curve is constructed by estimating a survival curve for each patient
based on the final Cox model and then averaging these
curves over patients with a similar donor type. The curve is
then representative of an average patient with the condition.
Figures 5 and 6 show these curves for different donor types
by diseases stage.
For our second example comparing BMT to chemotherapy patients in CML, the proportional hazards model
did not hold for the main effect of type of therapy. After
modelling outcome we found that different risks of death
for the two donor types were found prior to 1988 and after
1988 and that male patients with a large spleen size did
Table 11
Suggested publication form of a summary table for the Cox regression for BMT vs chemotherapy
study in CML
Factor
Bone Marrow Transplantation
Relative risk
(95% confidence interval)
P
Type of treatment
Diagnosed prior to 1988
Chemotherapy
BMT 0–18 months after diagnosis
BMT 18–56 months after diagnosis
BMT ⬎56 months after diagnosis
1.00
5.35 (2.72–10.52)
0.73 (0.47–1.13)
0.14 (0.07–0.29)
Diagnosed after 1988
Chemotherapy
BMT 0–18 months after diagnosis
BMT 18–56 months after diagnosis
BMT ⬎56 months after diagnosis
1.00
3.19 (1.70–5.98)
0.44 (0.29–0.65)
0.08 (0.04–0.18)
0.0003
⬍0.0001
⬍0.0001
Gender
Male
Female
1.00
0.65 (0.50–0.86)
0.0021
Spleen size
⬍10 cm
⭓10 cm
1.00
1.56 (1.17–2.07)
⬍0.0001
⬍0.0001
0.1583
⬍0.0001
⬍0.0001
0.0023
Regression methods for BMT
JP Klein et al
Difference of survival probabilities
0.75
0.50
0.25
0.0
-0.25
-0.50
-0.75
Differences of survival probabilities
95% simultaneous confidence band
0
12
24
36
48
60
84
72
96
and after 1988 for male patients with small spleen sizes. In
male patients with small spleen sizes diagnosed prior to
1988 we see that the there was a survival advantage up to
about 36 months (the upper confidence band is below the
0 line until this point), between about 36–90 months there
is no advantage to either treatment (the band includes the
zero line) and after 90 months the survival probabilities are
higher for a BMT patient. For more recently treated patients
it appears, except for a short window around 14–16 months,
the survival rates are similar for the two treatments and
after 60 months there is a clear advantage for BMT.
1009
Acknowledgements
Months since diagnosis
Figure 7 Difference of survival probabilities (BMT–chemotherapy) for
male patient with small spleen size and diagnosed prior to 1988.
poorly. Table 11 shows the summary table for presentation
of these data. Here we see that soon after diagnosis the
chemotherapy patients do better, but among long-term survivors the BMT patients do better.
While Table 10 tells the story in terms of the rate of
death, this table fails to answer the key question ‘for a given
patient with a certain set of covariates, at what times are
these two treatments different?’ This is particularly
important when one treatment has a higher early survival
but a lower long-term survival as we have here. We suggest
plotting the predicted survival functions of two treatments
for a given set value of risk factors, and comparing the
difference between the survival functions along with a confidence band for the difference. Visually examining these
plots and comparing the confidence bands with the zero
line shows how the difference between the two survival
functions changes over time. Also the time region where
two survival functions are different can be identified from
the plot, ie the time region where zero lies outside the 95%
confidence bands. Zhang and Klein17 provide detailed procedures to construct such confidence bands. Figures 7 and
8 shows these curves for patients diagnosed prior to 1988
Difference of survival probabilities
0.75
0.50
0.25
0.0
-0.25
-0.50
-0.75
Difference of survival probabilities
95% simultaneous confidence band
0
12
24
36
48
60
72
84
96
Months since diagnosis
Figure 8 Difference of survival probabilities (BMT–chemotherapy) for
male patient with small spleen size and diagnosed after 1988.
This research was supported by grant R01-CA54706–07 from the
National Cancer Institute.
References
1 Klein JP, Rizzo JD, Zhang M-J, Keiding N. Statistical
methods for the analysis and presentation of the results of
bone marrow transplants. Bone Marrow Transplant 2001; 28:
909–915.
2 Kelin JP, Moeschberger ML. Survival Analysis: Techniques
for Censored and Truncated Data. Springer-Verlag: New
York, 1997.
3 Dupont WD. Power calculations for matched case-control
studies. Biometrics 1988; 34: 1157–1168.
4 Cox DR. Regression models and life-tables (with discussion).
J Roy Stat Soc 1972; B34: 187–220.
5 Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med
1992; 11: 1871–1885.
6 Li Y, Klein JP, Moeschberger ML. Effects of model misspecification in estimating covariate effects in survival analysis for
small sample sizes. Comput Stat Data Analysis 1996; 22:
176–192.
7 Wei LJ, Greg CG, Tanner MA. Applications of multiple imputation to the analysis of censored regression data. Biometrics
1991; 47: 1297–1309.
8 Szydlo R, Goldman JM, Klein JP et al. Results of allogeneic
bone marrow transplants for leukemia using donors other than
HLA-identical siblings. J Clin Oncol 1997; 15: 1767–1777.
9 Wu JT. Methods for Dichotomising a Covariate in Censored
Data Regression. Ph.D. Thesis, The Medical College of Wisconsin, 2001.
10 Morgan TM, Elashoff RM. Effect of categorizing a continuous
covariate in comparison of survival times. J Am Stat Assoc
1986; 81: 917–921.
11 Altman DG, Sauerbrei W, Schumacher M. Dangers of using
‘optimal’ cutpoints in the evaluation of prognostic factors. J
Natl Cancer Inst 1994; 86: 829–835.
12 Contal C, O’Quigley J. An application of changepoint
methods in studying the effect of age on survival in breast
cancer. Comput Stat Data Analysis 1999; 30: 253–270.
13 Jespersen NCB. Dichotomizing a continuous covariate in the Cox
regression model. Stat Res Univ Copenh Res Rep 1986; 86/2.
14 Lausen B, Schumacher M. Evaluating the effect of optimized
cutoff values in the assessment of prognostic factors. Comput
Stat Data Analysis 1996; 21: 307–326.
15 Gale RP, Hehlmann R, Zhang MJ et al and the German CML
Study Group. Survival with bone marrow transplantation verBone Marrow Transplantation
Regression methods for BMT
JP Klein et al
1010
sus hydroxyurea or interferon for chronic myelogenous leukemia. Blood 1998; 91: 1810–1819.
16 Andersen PK. Testing goodness-of-fit of Cox’s regression and
life model. Biometrics 1982; 38: 67–77; erratum 40: 1217.
17 Zhang MJ, Klein JP. Confidence bands for the difference of
two survival curves under proportional hazards model. Lifetime Data Analysis 2001; 7: 243–254.
Appendix
Cox model A model which relates covariates or risk factors to outcome. The model, first proposed by Sir David
Cox,4 assumes that the hazard rate for an individual with
a set of covariates is the product of a baseline hazard rate
and a parametric function of the risk factors.
Glossary of statistical terms
Accelerated failure time model A class of regression
models for survival data. These models assume that the logarithm of time is a linear function of the covariates. They
also assume the effect of the covariates on survival is to
change the time scale.
Binary covariate A patient characteristic, which takes
only two values such as gender.
Bonferroni correction This correction is used when multiple hypothesis tests are made on the same set of data. This
correction provides insurance that the likelihood of falsely
rejecting any one of k hypotheses is at most some fixed
value, ␣, when each null hypothesis is true. The Bonferroni
correction is made by using an ␣/k level cut-off for each
of the k tests performed on the data. This correction makes
the experiment-wise error approximately equal to ␣. The
experiment-wise error rate is the probability of making an
incorrect decision for any of the hypotheses when all the
null hypotheses are true.
Categorical factor A factor which takes a finite number
of values. Examples are primary disease (CML, ALL,
AML), disease stage (early, intermediate and advanced).
Categorical factors are handled by designation of one level
as a baseline and representing the other levels by a series
of binary covariates.
Competing risks Two or more events where occurrence
of one of the events precludes occurrence of any other of
the events.
Complete case analysis An analysis based on only those
subjects who have complete data. Any patient with missing
data is removed from the study.
Confidence band (for the survival function) A 95% confidence band is an upper and lower curve with the property
that we are 95% confident that the true survival function is
in between these two curves for any time point. Confidence
bands tend to be wider than confidence intervals (see
below), since we are making a promise that the entire curve
is in the band not simply the value at one point in time.
Confidence interval A 95% confidence interval for the
survival time at some fixed point in time is an interval with
the property that we are 95% sure that the true value of the
survival function is in this interval. The inference is to the
survival function at a single point in time.
Covariates These are independent variable or risk factors
known at the time of transplant which may influence outcome.
Bone Marrow Transplantation
Disease-free survival The chance a patient is alive and
disease-free at a given point in time after transplant.
Factor Something which effects outcome. A factor, as
opposed to a covariate, is a set of variables that represent
a disease or patient characteristic. For example, donor type,
which has five categories in our example, is a factor that
is expressed as four binary covariates.
Fixed time covariates Covariates or risk factors whose
values are known at the start of the study and whose values
do not change during the course of the study. Examples are
pretransplant risk factors, gender, age at diagnosis, etc.
Hazard rate The rate at which an event is occurring. The
hazard rate of death is approximately the chance of dying
in the next moment in time among current survivors. The
hazard rate is the slope of the negative of the natural log
of the survival function.
Imputation A body of techniques for missing values
where the missing value is replaced (imputed) by an estimated value. The estimated values are based on the value
of this covariate in similar patients who have this covariate recorded.
Interaction One variable has a different effect on the outcome depending upon the value of another variable. That
is, there is an interaction between two variables if a change
in one variable produces a change in outcome at one level
of the second variable different from that produced at other
levels of the second variable. Testing of an interaction is
done by including both variables in the model, as well as
their cross-product (A × B) and testing the significance of
the cross-product term.
Matched pairs analysis Techniques where each patient on
one treatment arm (called a case) is matched to one or more
patients on a different treatment arm (controls). The matching is on the basis of commonly known important prognostic factors. Special statistical methods which account for
this matching are needed to analyze this type of data. This
design is most commonly used in epidemiology to determine if some factor is associated with disease by matching
patients with the disease (cases) with healthy patients
(controls) on all factors but the one of interest.
Multiple degree of freedom test A hypothesis test for the
comparison of three or more treatments or for factors with
three or more levels. A multiple degree of freedom test of
the hypothesis of no difference in donor types in our
Regression methods for BMT
JP Klein et al
example is a test of the hypothesis that all donor types have
the same outcome as opposed to the hypotheses that at least
one of the donor types has a different outcome from the
others.
Nonparametric procedure A statistical method which
makes no assumptions about a model for the data. Nonparametric tests are also known as distribution-free tests. Two
sample nonparametric tests tend to compare the complete
distribution of the data in each sample not some measure
of the distribution in each sample. In the complete data
setting these are typically tests based on ranks and
simple counts.
Outliers Values for a variable that deviate appreciably
from most other values of that variable in the data set.
P value The probability of seeing a more extreme result
than observed, if the model is correct. This also known as
the observed significance level.
Parametric procedures These are procedures which
assume that the data come from a family of statistical distributions indexed by some parameter. They require an
assumption of a model for the data with some unknown
parameter. Tests are about some summary population measure (a parameter) like the mean or variance. Typical
examples are the analysis of variance or the t-test for uncensored data which assumes that data is from a normal distribution. These tests can be misleading if the assumed model
is wrong, but they are usually more powerful when the
model is correct.
Proportional hazards model See Cox model.
Regression analysis Techniques which are used to study
prognostic factors or to adjust for treatment imbalances
when comparing treatments. These techniques involve a
model that relates covariates to outcome.
1011
Risk set The collection of all subjects who could experience an event at time t.
Semiparametric model A model or procedure where the
data are modelled by a combination of an arbitrary function
and a specific parametric model. The Cox model is a semiparametric model since we model the hazard rate for an
individual by a general baseline function (the nonparametric part) times a specific parametric function of the
covariates.
Standard deviation The square root of the variance. It is
a measure of dispersion.
Standard error The standard deviation of an estimator. It
is a measure of the precision of an estimator.
Survival function
The chance a patient is alive at time t.
Time-dependent covariates Prognostic or risk factors
whose value changes over time. Examples would be weekly
white blood counts, occurrence of GVHD, etc.
Time-varying fixed effects Artificial time-dependent
covariates which represent the effect in different time periods of a covariate whose value is unchanging over time.
These are typically adjustments for nonproportional
hazards.
Bone Marrow Transplantation
© Copyright 2026 Paperzz