Supplementary Document 1 - Clinical Cancer Research

Development of the DBCG-RT profile and weighted index
Description of the Lasso method
To identify genes whose transcripts interact with the effect of radiotherapy so as to modify the risk of
locoregional recurrence (LRR), we studied the interaction between gene expression and post-mastectomy
radiotherapy (PMRT versus no PMRT). Interaction effects are weaker than main effects and difficult to
detect, so a pre-selection step was necessary (1-3). In the first stage we tried to capture as many
candidate main effect and interaction effect genes as possible; in the second stage a careful selection
was then performed. In the pre-selection step, a Cox proportional hazards model with an L1 penalty, the
so-called Lasso method (4, 5), was fitted with all gene expressions as main effects in addition to all
their interactions with a binary radiotherapy variable. None of the clinical covariates were included at
this stage. More precisely, let t1, . . . , tn be the times to LRR or to censoring for the n = 195
individuals in the study and let d1, . . . , dn be the corresponding censoring indicators. The Cox model
describes the hazard of LRR at time t for patient i with gene expression values xi = (xi1, . . . , xip) as
$$h(t \mid x_{i1}, \ldots, x_{ip}) = h_0(t)\,\exp\Big(\sum_{j=1}^{p} x_{ij}\beta_j + T_i \sum_{g=1}^{p} x_{ig}\gamma_g\Big)$$
where h0(t) denotes the baseline hazard, Ti is the indicator of whether patient i received radiotherapy
(Ti = 1) or not (Ti = 0). The log partial likelihood is given by
$$l_p(\beta, \gamma) = \sum_{i=1}^{n} d_i \Big( x_i^t\beta + T_i x_i^t\gamma - \log \sum_{j:\, t_j \ge t_i} \exp\big(x_j^t\beta + T_j x_j^t\gamma\big) \Big)$$
where β and γ are the p-dimensional vectors of regression parameters. When applying the Lasso shrinkage,
the penalized estimates are found by maximizing
$$l_p(\beta, \gamma) - \lambda\Big(\sum_{j=1}^{p} |\beta_j| + \sum_{j=1}^{p} |\gamma_j|\Big)$$
or equivalently by maximizing the log partial likelihood
$$l_p(\beta, \gamma) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| + \sum_{j=1}^{p} |\gamma_j| \le s.$$
Here, λ and s denote the penalization parameters, which are in one-to-one correspondence although there is
no closed-form conversion formula. This step of the analysis was carried out using the glcoxph library in
R (6), which can handle a high-dimensional vector of predictors. All covariates were on the same scale. For
each fixed value of the penalization parameter on a grid, the Lasso procedure selected a number of main
effect genes and (PMRT × expression) interaction genes.
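The penalized criterion above can be made concrete with a small sketch. The following Python code is only an illustration of the formulas, not the glcoxph implementation: the function and argument names are hypothetical, tied event times receive no special treatment, and no optimizer is included.

```python
import math

def cox_log_partial_likelihood(times, events, x, treated, beta, gamma):
    """Cox log partial likelihood with main effects (beta) and
    treatment-by-gene interaction effects (gamma).

    times:   follow-up times t_i
    events:  d_i, 1 if LRR was observed and 0 if censored
    x:       list of expression vectors x_i
    treated: T_i, 1 for PMRT and 0 for no PMRT
    """
    n = len(times)
    # Linear predictor eta_i = x_i . beta + T_i * (x_i . gamma)
    eta = [
        sum(v * b for v, b in zip(x[i], beta))
        + treated[i] * sum(v * g for v, g in zip(x[i], gamma))
        for i in range(n)
    ]
    total = 0.0
    for i in range(n):
        if events[i] == 1:  # only observed recurrences contribute
            # Risk set: patients still under observation at time t_i
            risk = [eta[j] for j in range(n) if times[j] >= times[i]]
            total += eta[i] - math.log(sum(math.exp(e) for e in risk))
    return total

def lasso_objective(times, events, x, treated, beta, gamma, lam):
    """Log partial likelihood minus the L1 penalty on beta and gamma."""
    penalty = sum(abs(b) for b in beta) + sum(abs(g) for g in gamma)
    loglik = cox_log_partial_likelihood(times, events, x, treated, beta, gamma)
    return loglik - lam * penalty
```

With all coefficients set to zero, each observed recurrence contributes minus the logarithm of the size of its risk set, which gives a quick sanity check on toy data.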
Taking the union of all these gene lists, identified at the various levels of penalization, two sets were
constructed: J, the set of genes with a main effect on the risk of LRR, and I, the set of genes associated
with LRR through their interaction with PMRT. At the pre-selection stage, we varied the Lasso penalty
weight s on a grid of 400 values in [0.05, 20] (the grid [0.05, 5] was suggested in (6)).
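The construction of the sets J and I can be sketched as follows. This is a minimal illustration under stated assumptions: the grid is taken to be evenly spaced (the spacing is not specified in the text), and the fitted coefficients at each grid value are assumed to be available as dictionaries mapping gene identifiers to values. All names are hypothetical.

```python
def penalty_grid(lo=0.05, hi=20.0, size=400):
    """Evenly spaced grid of penalty weights s (even spacing is an assumption)."""
    step = (hi - lo) / (size - 1)
    return [lo + k * step for k in range(size)]

def union_of_selected(paths):
    """Collect the genes selected at any penalty level.

    paths: one (beta, gamma) pair per grid value, each a dict mapping a
           gene identifier to its fitted coefficient.
    Returns (J, I): genes with a non-zero main effect at some penalty
    level, and genes with a non-zero interaction effect at some level.
    """
    J, I = set(), set()
    for beta, gamma in paths:
        J |= {gene for gene, coef in beta.items() if coef != 0.0}
        I |= {gene for gene, coef in gamma.items() if coef != 0.0}
    return J, I
```

A gene may belong to both sets; the total number of pre-selected genes is then the size of the union of J and I.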
Next, in the second step, the pre-selected genes along with the clinical factors were regressed in a
multivariate Cox model. The clinical parameters included in the multivariate analysis were PMRT (yes, no),
estrogen-receptor status (positive, negative), the number of involved lymph nodes, categorized into four
classes (0, 1-3, ≥4 positive lymph nodes), tumor size, categorized into three classes (<20 mm, 20-50 mm,
>50 mm) and menopausal status (pre-menopausal, post-menopausal). The candidate genes of set J were
included as main effects and the interaction genes in set I were included as interaction effects. The
instantaneous risk of LRR at time t for a patient with gene expressions xi1, . . . , xip and clinical
covariates zi1, . . . , zi5 is now modeled as
$$h(t \mid x_i, z_i) = h_0(t)\,\exp\Big(\alpha T_i + \sum_{k=1}^{4} \theta_k z_{ki} + \sum_{j \in J} x_{ij}\beta_j + T_i \sum_{g \in I} x_{ig}\gamma_g\Big)$$
Here, α is the effect of PMRT, θk the effect of clinical covariate zk, and βj and γg are the main and
interaction effects of genes j and g, respectively. Again, Lasso shrinkage was applied to avoid
overfitting, and the optimal penalty weight λopt was identified via 5-fold cross validation (7, 8). The
final selection of genes associated with PMRT and influencing LRR-free survival was done at λopt. This
stage of the analysis was carried out using the glmpath library in R (9), which allows the inclusion of
covariates that are not penalized.
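The choice of the penalty weight by 5-fold cross validation can be outlined as below. This is a hedged sketch rather than the glmpath procedure: the scoring function `cv_score` is a hypothetical callable (higher is better) standing in for whatever cross-validated criterion is used, and the random fold assignment is an assumption.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def select_penalty(n, grid, cv_score, k=5, seed=0):
    """Return the penalty in `grid` with the best summed out-of-fold score.

    cv_score(train_idx, test_idx, lam) is a hypothetical callable that fits
    the model on train_idx at penalty lam and scores it on test_idx.
    """
    folds = k_fold_indices(n, k, seed)
    best_lam, best_total = None, float("-inf")
    for lam in grid:
        total = 0.0
        for f, test in enumerate(folds):
            # Training indices are all folds except the held-out one
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            total += cv_score(train, test, lam)
        if total > best_total:
            best_lam, best_total = lam, total
    return best_lam
```

With n = 195 patients, each of the five folds holds 39 patients.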
The standard errors of the significant interaction coefficients were estimated via the nonparametric 0.632
bootstrap (10): we sampled m = 124 ≈ 0.632n patients from the cohort of 195 patients without replacement;
then we estimated the coefficients in the Cox model of the second stage at the optimal cross-validated
penalty level λ = λopt. This bootstrap procedure was repeated 200 times and the standard errors were
estimated from the sampling distributions of the bootstrapped coefficients.
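The resampling scheme can be sketched in a few lines. The code below is an illustration only: `estimator` is a hypothetical callable that stands in for refitting the second-stage Cox model and returning one coefficient, and rounding 0.632n up to the next integer is an assumption that happens to reproduce m = 124 for n = 195.

```python
import math
import random
import statistics

def subsample_standard_error(data, estimator, fraction=0.632, repeats=200, seed=0):
    """Standard error of `estimator` over repeated subsamples drawn
    without replacement, each of size ceil(fraction * n)."""
    rng = random.Random(seed)
    m = math.ceil(fraction * len(data))  # 195 patients -> 124 (rounding rule assumed)
    estimates = []
    for _ in range(repeats):
        subsample = rng.sample(data, m)  # sampling without replacement
        estimates.append(estimator(subsample))
    # Spread of the resampled estimates is used as the standard error
    return statistics.stdev(estimates)
```

For example, `subsample_standard_error(list(range(195)), statistics.mean)` returns the spread of the subsample means of a toy data set.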
Finally, after a set I* of significant interaction genes was obtained from the double selection procedure,
a cross-validated score index (CVSI) was computed for each patient by leave-one-out cross validation (9).
This approach mimics external validation, as the coefficients used in the computation of the score index
for patient i are estimated excluding patient i. In practice, for i = 1, . . . , n, we (i) removed subject
i from the data; (ii) obtained Lasso estimates $\hat\gamma^{(-i)}$ of the interaction genes I* by applying
the Cox model of the final selection step to the reduced data (n−1 patients); (iii) repeated steps
(i)-(ii) over all n subjects; and (iv) computed the score for each patient using the cross-validated
estimates as
$$\mathrm{CVSI}_i = \sum_{g \in I^*} \hat\gamma_g^{(-i)} X_{ig}$$
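Steps (i)-(iv) amount to a standard leave-one-out loop, sketched here. The callable `fit_interactions` is hypothetical: it stands in for the penalized Cox fit of the final selection step and is assumed to return a dictionary of interaction coefficients for the genes in I*. The data layout is likewise an assumption.

```python
def cross_validated_scores(patients, fit_interactions):
    """Leave-one-out score index for every patient.

    patients: list of dicts, each with key "x" mapping gene -> expression.
    fit_interactions: hypothetical callable taking the training patients
        and returning {gene: estimated interaction coefficient}.
    """
    scores = []
    for i, patient in enumerate(patients):
        training = patients[:i] + patients[i + 1:]   # (i) drop patient i
        gamma = fit_interactions(training)           # (ii) refit without patient i
        # (iv) score patient i with coefficients that never saw patient i
        scores.append(sum(coef * patient["x"][gene] for gene, coef in gamma.items()))
    return scores                                    # (iii) loop covers all patients
```

Because each score uses coefficients estimated without the scored patient, the scores are not contaminated by that patient's own outcome.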
By dividing the n predictive scores at the third quartile, the patients were divided into two groups and their
LRR-free survival probabilities compared. The CVSI differentiates between the 75% of patients expected to
benefit most from PMRT (low CVSI) and the 25% who would benefit less or not at all (high CVSI). Figure 1
illustrates the different stages of the analysis.
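The grouping at the third quartile can be illustrated with the Python standard library. The quantile interpolation rule is not stated in the text, so the "inclusive" method used here is an assumption, as are the function name and the group labels.

```python
import statistics

def split_at_third_quartile(scores):
    """Label each score 'high' if it exceeds the third quartile, else 'low'."""
    # quantiles(..., n=4) returns the three quartile cut points
    q3 = statistics.quantiles(scores, n=4, method="inclusive")[2]
    labels = ["high" if s > q3 else "low" for s in scores]
    return q3, labels
```

On eight distinct scores this places the two largest (25%) in the high-CVSI group and the remaining six (75%) in the low-CVSI group.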
Figure 1. Flowchart of the analyses. Upper left panel: pre-selection of candidate genes; set I interaction genes and set
J main effect genes. Upper right panel: final selection of interaction genes and bottom panel: validation of the selected
interaction via cross validation, classification, and prediction.
The proportionality assumption for the significant genes and the clinical covariates was checked by visual
inspection of the re-scaled Schoenfeld residuals against the logarithm of time to loco-regional recurrence
(with a natural spline smoother) and by the goodness-of-fit test described in (11).
Selection of genes
At the pre-selection stage, we identified 206 genes: 178 as main effect genes and 44 as interaction effect
genes, with 16 genes in common. Goeman's test (12) for global predictive significance of these 206 selected
genes was highly significant (p-value = 0.00159), indicating a strong association of these genes with the
outcome. The p-value for the same test on all 17910 genes was 0.0396, and the average p-value over 1000
randomly selected sets of 206 genes was 0.06552 (library globaltest in R). In the subsequent selection
stage, 46 main effect genes and 7 interaction genes (denoted I7) were selected at the optimal level of
penalty λopt = 4.000265 found via 5-fold cross validation. The 7 interaction genes are reported in Table 1
along with the estimated interaction coefficients, the hazard ratios and the corresponding bootstrapped
standard errors. The estimated interaction coefficients are small except for RGS1 (0.2810, se 0.1323) and
DNALI1 (0.3763, se 0.1429). We evaluated whether the proportional hazards assumption held for the seven
genes and the selected clinical covariates. Only DNALI1 appeared to have a time-dependent effect,
increasing up to approximately three months and then decreasing. This was assessed visually from a plot of
the re-scaled Schoenfeld residuals against survival time. We did not perform any correction to introduce
time dependence in the covariate.
Table 1. The seven interaction genes I7. The first column gives the AB-specific ID numbers of the seven
genes and the second column the gene symbols, if available. The estimated interaction coefficients and the
hazard ratios, with bootstrapped standard errors in parentheses, are reported in the third and fourth
columns. A short description of the genes is given in the last column.
| AB-ID (probe ID) | Gene symbol | Interaction coefficient (se) | Hazard ratio (se) | Description |
|------------------|-------------|------------------------------|-------------------|-------------|
| hCG2042724       | HLA-DQA1    | 0.0699 (0.0440)              | 1.0724 (0.0412)   | major histocompatibility complex, class II, DQ alpha; chr. 6p21.3 |
| hCG1980528.1     | IGKC        | -0.0646 (0.0426)             | 0.9375 (0.0455)   | immunoglobulin kappa constant; chr. arm 2p12 |
| hCG39901.3       | RGS1        | 0.2810 (0.1323)              | 1.3244 (0.1115)   | regulator of G-protein signaling 1; chr. 1q31 |
| hCG41484.2       | ADH1B       | -0.0314 (0.0305)             | 0.9691 (0.0325)   | alcohol dehydrogenase IB (class I), beta polypeptide; chr. 4q21-q23 |
| hCG25678.2       | DNALI1      | 0.3763 (0.1429)              | 1.4568 (0.1130)   | dynein, axonemal, light intermediate polypeptide 1; chr. arm 1p35.1 |
| hCG2032658       | OR8G2       | -0.1266 (0.0636)             | 0.8811 (0.0699)   | olfactory receptor, family 8, subfamily G, member 2; chr. 11q24 |
| hCG2023290       |             | 0.0452 (0.0186)              | 1.0462 (0.0179)   | unknown; chr. 7 |
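In a Cox model the hazard ratio of a coefficient is its exponential, so the two numeric columns of Table 1 can be cross-checked against each other. The short sketch below copies the reported values from Table 1 and compares them; the tolerance is an assumption chosen to allow for rounding to four decimals.

```python
import math

# (interaction coefficient, reported hazard ratio), copied from Table 1
table_1 = {
    "HLA-DQA1": (0.0699, 1.0724),
    "IGKC": (-0.0646, 0.9375),
    "RGS1": (0.2810, 1.3244),
    "ADH1B": (-0.0314, 0.9691),
    "DNALI1": (0.3763, 1.4568),
    "OR8G2": (-0.1266, 0.8811),
    "hCG2023290": (0.0452, 1.0462),  # probe ID, no gene symbol available
}

def max_discrepancy(table):
    """Largest absolute difference between exp(coefficient) and the reported HR."""
    return max(abs(math.exp(coef) - hr) for coef, hr in table.values())

# Agreement within rounding error of four-decimal reporting (tolerance assumed)
consistent = max_discrepancy(table_1) < 5e-4
```

All seven rows agree to within rounding, which also confirms how the flattened coefficient and hazard-ratio entries pair up with the genes.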
Interaction effects of the seven genes
To assess the effect of each gene in interaction with PMRT, we computed the gene-PMRT hazard ratio (HR) as
a function of the gene's expression level. The computed HR indicates the conditional influence of each
gene, fixing the effects of the others. The HR can be interpreted as the change in the risk of LRR when a
patient with a given gene expression level is treated with PMRT. A value below one implies a reduction in
the risk of LRR, whereas a value above one indicates an increased risk. A value of one indicates neither
benefit nor worsening.
Figure 2. The individual interaction effect of each gene in I7. Plots show histograms of the expression
(left axis) and the gene-PMRT interaction hazard ratios (right axis) for the I7 genes. A red line at
HR = 1 is added as reference.
The left panels of Figure 2 show the hazard ratio (HR) for HLA-DQA1, RGS1, DNALI1 and hCG2023290. For most
of the patients in our cohort, namely those with negative gene expression values, the HR curves of
HLA-DQA1, DNALI1 and hCG2023290 are below one, indicating a benefit from PMRT. For instance, irradiating a
patient with very low expression of HLA-DQA1 (e.g. -8) would reduce the risk of LRR by 5%. However, for a
smaller proportion of patients with high gene expression levels (about 20% for HLA-DQA1 and hCG2023290 and
31% for DNALI1), the HR is above one (red lines in Figure 2), indicating an increased risk of LRR for such
patients. For RGS1, the HR is below one for all patients in the study; thus the marginal effect of this
gene is a reduction of the risk of LRR through its association with PMRT.
The three right-most panels of Figure 2 represent IGKC, ADH1B and OR8G2. They share the same pattern: the
HR is mostly above one, indicating a higher risk when these genes are expressed at low levels. However,
patients with high expression levels (approximately 45%, 25% and 27% for IGKC, ADH1B and OR8G2,
respectively) will experience improved LRR-free survival if given radiotherapy.
To evaluate the gene signature I7 as a whole, the cross-validated score index
$$\mathrm{CVSI}_i = \hat\alpha^{(-i)} + \sum_{g \in I_7} \hat\gamma_g^{(-i)} X_{ig}$$
was computed for each patient, following the leave-one-out cross validation scheme described above. All
CVSI were below zero. The estimated effects $\hat\alpha^{(-i)}$ of PMRT were negative for all samples,
indicating a reduction of the hazard of LRR with radiation.
Defining CVSI cut-off
To evaluate the CVSI cut-off value separating patients expected to benefit most from PMRT (low CVSI) from
patients who would benefit less (high CVSI), patients were ranked according to the CVSI. In Figure 3, the
odds of loco-regional recurrence were calculated in the PMRT and noPMRT groups, starting with the patient
with the highest CVSI and including patients stepwise in order of decreasing CVSI. The odds ratio was
calculated similarly. Patients with the highest CVSI have an odds ratio of about 1, whereas the odds ratio
for all patients is 0.20 (95% CI: 0.10-0.41). Figure 3 illustrates that the upper quartile of patients
ranked by CVSI coincides with the point where the odds ratio starts to drop. Thus, the upper quartile was
arbitrarily selected as the cut-off value for grouping patients into "low LRR risk" and "high LRR risk"
groups. The non-linearity between the odds ratio and the CVSI also explains why a binary score was chosen
instead of a continuous score.
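The stepwise odds-ratio calculation described in this section can be sketched as follows. The record layout and function name are hypothetical, and returning `None` while a denominator cell is still empty is an assumption about how undefined early values are handled; the smoothing used for the fitted lines is not reproduced.

```python
def cumulative_odds_ratios(records):
    """Odds ratio of recurrence (PMRT versus noPMRT) as patients are added
    one at a time in order of decreasing CVSI.

    records: list of (cvsi, treated, recurred) tuples with 0/1 indicators.
    Returns one value per added patient; None while the ratio is undefined.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    counts = {(t, e): 0 for t in (0, 1) for e in (0, 1)}
    ratios = []
    for _, treated, recurred in ranked:
        counts[(treated, recurred)] += 1
        a, b = counts[(1, 1)], counts[(1, 0)]  # PMRT: recurrence / no recurrence
        c, d = counts[(0, 1)], counts[(0, 0)]  # noPMRT: recurrence / no recurrence
        # (a / b) / (c / d) rewritten as a * d / (b * c)
        ratios.append((a * d) / (b * c) if b and c else None)
    return ratios
```

The last element of the returned list is the odds ratio over all patients, and earlier elements trace how the ratio changes as lower-CVSI patients are included.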
Figure 3. Odds for loco-regional recurrence in the PMRT and noPMRT groups, and odds ratio for
loco-regional recurrence. Patients are ranked by decreasing CVSI. The vertical line indicates the arbitrary
cut-point for patients in the upper quartile. Fitted lines were generated using local smoothing (running
median).
Figure 4 illustrates the risk of loco-regional recurrence in patients treated with or without PMRT, divided
into quartiles on CVSI. Risk of loco-regional recurrence is estimated using 1 minus Kaplan-Meier and
differences are tested with log-rank test. This shows that for patients treated without PMRT, the lower
three quartiles of patients have a high risk of loco-regional recurrence, whereas the upper quartile has a
low risk of loco-regional recurrence. In patients treated with PMRT there is no association between
outcome and index.
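The quantity plotted per quartile, one minus the Kaplan-Meier estimate, can be computed as in the sketch below. It is an illustration of the product-limit estimator only; the function name is hypothetical and the log-rank test is not included.

```python
def cumulative_incidence(times, events):
    """One minus the Kaplan-Meier survival estimate at each distinct event time.

    times:  follow-up times
    events: 1 if recurrence was observed at that time, 0 if censored
    Returns a list of (time, 1 - S(time)) pairs.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - failures / at_risk  # product-limit update
        curve.append((t, 1.0 - survival))
    return curve
```

Applying the function separately to each CVSI quartile within the PMRT and noPMRT groups would give the step curves compared in this paragraph.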
Figure 4. Loco-regional recurrence in patients treated without (A) or with PMRT (B), divided into quartiles
based on CVSI. Axes: loco-regional recurrence (%) against time after treatment (years); the panels report
log-rank p-values of p = 0.0001 and p = 0.70.
References
1. Verweij PJ and Van Houwelingen HC. Cross-validation in survival analysis. Stat. Med. 12 (24): 2305-2314, 1993.
2. Paul D, Bair E, Hastie T, and Tibshirani R. "Preconditioning" for feature selection and regression in high-dimensional problems. Ann. Stat. 36 (4): 1595-1618, 2008.
3. Fan J and Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. B 70 (5): 849-911, 2008.
4. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat. Med. 16 (4): 385-395, 1997.
5. Tibshirani R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58 (1): 267-288, 1996.
6. Sohn I, Kim J, Jung SH, and Park C. Gradient lasso for Cox proportional hazards model. Bioinformatics 25 (14): 1775-1781, 2009.
7. Van Houwelingen HC, Bruinsma T, Hart AA, Van't Veer LJ, and Wessels LF. Cross-validated Cox regression on microarray gene expression data. Stat. Med. 25 (18): 3201-3216, 2006.
8. Gui J and Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21 (13): 3001-3008, 2005.
9. Park MY and Hastie T. L1-regularization path algorithm for generalized linear models. J. R. Stat. Soc. B 69 (4): 659-677, 2007.
10. Efron B and Tibshirani R. An introduction to the bootstrap. Monographs on Statistics & Applied Probability 57. Boca Raton, Chapman & Hall/CRC, 1993.
11. Lin DY. Goodness-of-fit analysis for the Cox regression model based on a class of parameter estimators. J. Am. Stat. Assoc. 86 (415): 725-728, 1991.
12. Goeman JJ, van de Geer SA, de Kort F, and Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1): 93-99, 2004.