Supplementary Document 1

Development of the DBCG-RT profile and weighted index

Description of the Lasso method

To identify genes whose transcripts interact with the effect of radiotherapy and thereby modify the risk of locoregional recurrence (LRR), we studied the interaction between gene expression and post-mastectomy radiotherapy (PMRT) versus no PMRT. Interaction effects are weaker than main effects and difficult to detect, so a pre-selection step was necessary (1-3). In the first stage we try to capture as many candidate main effect and interaction effect genes as possible; in the second stage a careful selection is then performed.

In the pre-selection step, a Cox proportional hazards model with L1 penalty, the so-called Lasso method (4, 5), was fitted with all gene expressions as main effects in addition to all their interactions with a binary radiotherapy variable. None of the clinical covariates were included at this stage. More precisely, let $t_1, \dots, t_n$ be the times to LRR or to censoring for the $n = 195$ individuals in the study and let $d_1, \dots, d_n$ be the corresponding censoring indicators ($d_i = 1$ if LRR was observed, $d_i = 0$ if the observation was censored). The Cox model is described by the hazard of LRR at time $t$ for a patient with gene expression values $x_i = (x_{i1}, \dots, x_{ip})$ as

\[
h(t \mid x_{i1}, \dots, x_{ip}) = h_0(t) \exp\Big( \sum_{j=1}^{p} x_{ij}\beta_j + T_i \sum_{g=1}^{p} x_{ig}\gamma_g \Big)
\]

where $h_0(t)$ denotes the baseline hazard and $T_i$ is the indicator of whether patient $i$ received radiotherapy ($T_i = 1$) or not ($T_i = 0$). The log partial likelihood is given by

\[
l_p(\beta, \gamma) = \sum_{i=1}^{n} d_i \Big( x_i^{\top}\beta + T_i x_i^{\top}\gamma - \log \sum_{j:\, t_j \ge t_i} \exp\big( x_j^{\top}\beta + T_j x_j^{\top}\gamma \big) \Big)
\]

where $\beta$ and $\gamma$ are the $p$-dimensional vectors of regression parameters. When applying the Lasso shrinkage, the penalized estimates are found by maximizing

\[
l_p(\beta, \gamma) - \lambda \Big( \sum_{j=1}^{p} |\beta_j| + \sum_{j=1}^{p} |\gamma_j| \Big)
\]

or, equivalently, by maximizing the log partial likelihood

\[
l_p(\beta, \gamma) \quad \text{subject to} \quad \sum_{j=1}^{p} \big( |\beta_j| + |\gamma_j| \big) \le s .
\]

Here, $\lambda$ and $s$ denote the penalization parameters, which are in one-to-one correspondence although there is no closed-form conversion formula.
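As an illustration only, the penalized objective described above can be written in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the gradient Lasso implementation used in the analysis: the function names are hypothetical, ties in event times are handled by the plain risk-set sum (no tie correction), and the code only evaluates the objective for given coefficients rather than maximizing it.

```python
import numpy as np

def cox_log_partial_likelihood(times, events, X, T, beta, gamma):
    """Log partial likelihood of a Cox model with gene main effects (beta)
    and gene-by-treatment interactions (gamma). Hypothetical helper."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=float)   # 1 = LRR observed, 0 = censored
    X = np.asarray(X, dtype=float)             # n x p expression matrix
    T = np.asarray(T, dtype=float)             # 1 = PMRT, 0 = no PMRT
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)

    # Linear predictor eta_i = x_i' beta + T_i * x_i' gamma
    eta = X @ beta + T * (X @ gamma)

    # at_risk[i, j] is True when patient j is still at risk at time t_i
    at_risk = times[None, :] >= times[:, None]

    # log of the risk-set sum, shifted by max(eta) for numerical stability
    shift = eta.max()
    log_risk = shift + np.log((at_risk * np.exp(eta - shift)[None, :]).sum(axis=1))

    # Only patients with an observed event contribute
    return float(np.sum(events * (eta - log_risk)))

def lasso_objective(times, events, X, T, beta, gamma, lam):
    """Penalized criterion: log partial likelihood minus lam times the
    L1 norm of all main effect and interaction coefficients."""
    penalty = lam * (np.abs(beta).sum() + np.abs(gamma).sum())
    return cox_log_partial_likelihood(times, events, X, T, beta, gamma) - penalty
```

With all coefficients set to zero the log partial likelihood reduces to minus the sum, over observed events, of the log risk-set size, which gives a quick sanity check of the risk-set construction.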
This pre-selection step was carried out using the glcoxph library in R (6), which can handle a high-dimensional vector of predictors. All covariates were on the same scale. The Lasso procedure selected a number of main effect genes and (PMRT × expression) interaction genes for each fixed value of the penalization parameter on a grid. Taking the union of these gene lists over all levels of penalization, two sets were constructed: $J$, the set of genes with a main effect on the risk of LRR, and $I$, the set of genes associated with LRR through their interaction with PMRT. At the pre-selection stage, we varied the Lasso penalty weight $s$ on a grid of 400 values in [0.05, 20] (the grid [0.05, 5] was suggested in (6)).

In the second step, the pre-selected genes along with the clinical factors were regressed in a multivariate Cox model. The clinical parameters included in the multivariate analysis were PMRT (yes, no), estrogen-receptor status (positive, negative), the number of involved lymph nodes, categorized into four classes (0, 1-3, ≥4 positive lymph nodes), tumor size, categorized into three classes (< 20 mm, 20-50 mm, > 50 mm) and menopausal status (pre-menopausal, post-menopausal). The candidate genes of set $J$ were included as main effects and the interaction genes in set $I$ were included as interaction effects. The instantaneous risk of LRR at time $t$ for a patient with gene expressions $x_{i1}, \dots, x_{ip}$ and clinical covariates $z_{i1}, \dots, z_{i4}$ (the clinical factors other than PMRT, which enters through $T_i$) is now modeled as

\[
h(t \mid x_i, z_i) = h_0(t) \exp\Big( \alpha T_i + \sum_{k=1}^{4} \theta_k z_{ik} + \sum_{j \in J} x_{ij}\beta_j + T_i \sum_{g \in I} x_{ig}\gamma_g \Big)
\]

Here, $\alpha$ is the effect of PMRT, $\theta_k$ is the effect of clinical covariate $z_k$, and $\beta_j$ and $\gamma_g$ are the main and interaction effects of genes $j \in J$ and $g \in I$, respectively. Again, Lasso shrinkage was applied to avoid overfitting, and the optimal penalty weight $\lambda_{opt}$ was identified via 5-fold cross-validation (7, 8). The final selection of genes that interact with PMRT and influence LRR-free survival was made at $\lambda_{opt}$.
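The bookkeeping of the two stages can be sketched as follows. This is a hedged Python/NumPy illustration, not the R code used in the study: the helper names are hypothetical, the coefficient path is assumed to be available as an array with one row per grid value, the 400 grid values are assumed to be equally spaced (the spacing is not stated), and the PMRT indicator and clinical covariates are assumed to be the columns left unpenalized in the second stage.

```python
import numpy as np

# Assumption: 400 equally spaced penalty values on [0.05, 20]
s_grid = np.linspace(0.05, 20.0, 400)

def preselect_gene_sets(coef_path, p, tol=1e-8):
    """Union of selected genes over the penalty grid.

    coef_path has shape (n_grid, 2 * p): the first p columns hold the main
    effect coefficients, the last p columns the interaction coefficients.
    Returns index arrays J (main effect genes) and I (interaction genes).
    """
    coef_path = np.asarray(coef_path, dtype=float)
    # A gene is kept if its coefficient is non-zero at any grid value
    selected = (np.abs(coef_path) > tol).any(axis=0)
    J = np.flatnonzero(selected[:p])
    I = np.flatnonzero(selected[p:])
    return J, I

def second_stage_design(X, T, Z, J, I):
    """Design matrix [T, Z, X_J, T * X_I] and a mask of penalized columns."""
    X = np.asarray(X, dtype=float)   # n x p expression matrix
    T = np.asarray(T, dtype=float)   # PMRT indicator
    Z = np.asarray(Z, dtype=float)   # n x k clinical covariates
    design = np.column_stack([T, Z, X[:, J], T[:, None] * X[:, I]])
    # Assumption: PMRT and clinical columns are not penalized
    n_unpenalized = 1 + Z.shape[1]
    penalized = np.arange(design.shape[1]) >= n_unpenalized
    return design, penalized
```

The design matrix and mask could then be passed to any penalized Cox solver that supports unpenalized covariates.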
This stage of the analysis was carried out using the glmpath library in R (9), which allows the inclusion of covariates that are not penalized. The standard errors of the significant interaction coefficients were estimated via the nonparametric 0.632 bootstrap (10): we sampled m ≈ 0.632n = 124 patients from the cohort of 195 patients without replacement, then estimated the coefficients in the Cox model of the second stage at the optimal cross-validated penalty level $\lambda = \lambda_{opt}$. This bootstrap procedure was repeated 200 times and the standard errors were estimated from the sampling distributions of the bootstrapped coefficients.

Finally, after a set $I^*$ of significant interaction genes was obtained from the double selection procedure, a cross-validated score index was computed for each patient by leave-one-out cross-validation (9). This approach mimics external validation, as the coefficients used in the computation of the score index for patient $i$ are estimated excluding patient $i$. In practice, for $i = 1, \dots, n$, we:
(i) removed subject $i$ from the data;
(ii) obtained Lasso estimates $\hat{\gamma}^{(-i)}$ of the interaction genes $I^*$ by applying the Cox model of the final selection step to the reduced data (n − 1 patients);
(iii) repeated steps (i)-(ii) over all $n$ subjects;
(iv) computed the score for each patient using the cross-validated estimates as

\[
\mathrm{CVSI}_i = \sum_{g \in I^*} \hat{\gamma}_g^{(-i)} x_{ig}
\]

By splitting the $n$ predictive scores at the third quartile, the patients were divided into two groups and their LRR-free survival probabilities were compared. The CVSI differentiates between the 75% of patients expected to benefit most from PMRT (low CVSI) and the 25% who would benefit less or not at all (high CVSI). Figure 1 illustrates the different stages of the analysis.

Figure 1. Flowchart of the analyses. Upper left panel: pre-selection of candidate genes; set I interaction genes and set J main effect genes.
Upper right panel: final selection of interaction genes. Bottom panel: validation of the selected interaction genes via cross-validation, classification, and prediction.

The proportionality assumption for the significant genes and the clinical covariates was checked by visual inspection of the re-scaled Schoenfeld residuals against the log of the time to loco-regional recurrence (with a natural spline smoother) and by the goodness-of-fit test as in (11).

Selection of genes

At the pre-selection stage, we identified 206 genes: 178 as main effect genes and 44 as interaction effect genes, with 16 genes in common. Goeman's test (12) for global predictive significance of these 206 selected genes was highly significant (p-value = 0.00159), indicating a strong association of these genes with the outcome. The p-value for the same test on all 17910 genes was 0.0396, and the average p-value for 1000 randomly selected sets of 206 genes was 0.06552 (library globaltest in R).

In the subsequent selection stage, 46 main effect genes and 7 interaction genes (denoted I7) were selected at the optimal penalty level $\lambda_{opt}$ = 4.000265 found via 5-fold cross-validation. The 7 interaction genes are reported in Table 1 along with the estimated interaction coefficients, the hazard ratios and the corresponding bootstrapped standard errors. The estimated interaction coefficients are small except for RGS1 (0.2810, se 0.1323) and DNALI1 (0.3763, se 0.1429).

We evaluated whether the hazards were proportional for the seven genes and the selected clinical covariates. Only DNALI1 appeared to have a time-dependent effect, increasing up to approximately three months and then decreasing. This was assessed visually from a plot of the re-scaled Schoenfeld residuals against survival time. We did not perform any correction to introduce time dependence in the covariate.

Table 1. The seven interaction genes I7. In the first column the AB-specific ID numbers of the seven genes, in the second column the gene symbols, if available.
The estimated interaction coefficients and the hazard ratios, together with bootstrapped standard errors in parentheses, are reported in the third and fourth columns. A short description of the genes is given in the last column.

AB-ID (probe ID)   Gene symbol   Interaction coefficient (SE)   Hazard ratio (SE)   Description
hCG2042724         HLA-DQA1       0.0699 (0.0440)               1.0724 (0.0412)     major histocompatibility complex, class II, DQ alpha; chr. 6p21.3
hCG1980528.1       IGKC          -0.0646 (0.0426)               0.9375 (0.0455)     immunoglobulin kappa constant; chr. arm 2p12
hCG39901.3         RGS1           0.2810 (0.1323)               1.3244 (0.1115)     regulator of G-protein signaling 1; chr. 1q31
hCG41484.2         ADH1B         -0.0314 (0.0305)               0.9691 (0.0325)     alcohol dehydrogenase IB (class I), beta polypeptide; chr. 4q21-q23
hCG25678.2         DNALI1         0.3763 (0.1429)               1.4568 (0.1130)     dynein, axonemal, light intermediate polypeptide 1; chr. arm 1p35.1
hCG2032658         OR8G2         -0.1266 (0.0636)               0.8811 (0.0699)     olfactory receptor, family 8, subfamily G, member 2; chr. 11q24
hCG2023290         n/a            0.0452 (0.0186)               1.0462 (0.0179)     unknown; chr. 7

Interaction effects of the seven genes

To assess the effect of each gene in interaction with PMRT, we computed the gene-PMRT hazard ratio (HR). The computed HR indicates the conditional influence of each gene, with the effects of the others held fixed. The HR can be interpreted as the change in the risk of LRR when a patient with a given gene expression level is treated with PMRT. A value below one implies a reduction in the risk of LRR, whereas a value above one indicates an increased risk. A value of one indicates neither benefit nor worsening.

Figure 2. The individual interaction effect of each gene in I7. Plots show histograms of the expression (left axis) and the gene-PMRT interaction hazard ratios (right axis) for the I7 genes. A red line at HR = 1 is added as reference.

The left panels of Figure 2 show the hazard ratio (HR) for HLA-DQA1, RGS1, DNALI1 and hCG2023290. For most of the patients in our cohort, who have negative gene expression values, the HR curves of HLA-DQA1, DNALI1 and hCG2023290 are below one.
This indicates a benefit from PMRT. For instance, irradiating a patient with very low expression of HLA-DQA1 (e.g. -8) would reduce the risk of LRR by 5%. However, for a smaller proportion of patients with high gene expression levels (about 20% for HLA-DQA1 and hCG2023290 and 31% for DNALI1), the HR is above one (red lines in Figure 2), indicating an increased risk of LRR for such patients. For RGS1, the HR is below one for all patients in the study; thus the marginal effect of this gene is a reduction of the risk of LRR through its association with PMRT.

The three right-most panels of Figure 2 represent IGKC, ADH1B and OR8G2. They share the same pattern: the HR is mostly above one, indicating a higher risk when these genes are expressed at low levels. However, patients with high expression levels (approximately 45%, 25% and 27% for IGKC, ADH1B and OR8G2, respectively) will experience improved LRR-free survival if given radiotherapy.

To evaluate the gene signature I7 as a whole, the cross-validated score index

\[
\mathrm{CVSI}_i = \sum_{g \in I_7} \hat{\gamma}_g^{(-i)} x_{ig}
\]

was computed for each patient, following the leave-one-out cross-validation scheme described above. All $\mathrm{CVSI}_i$ were below zero. The estimated effects of PMRT were negative for all samples, indicating a reduction of the hazard of LRR with radiation.

Defining CVSI cut-off

To evaluate the CVSI cut-off value separating the patients expected to benefit most from PMRT (low CVSI) from the patients who would benefit less (high CVSI), patients were ranked according to the CVSI index. In Figure 3, the odds for loco-regional recurrence were calculated in the PMRT and noPMRT groups, starting with the patient with the highest CVSI index and including patients stepwise in order of decreasing CVSI index. The odds ratio was calculated similarly. Patients with the highest CVSI index have an odds ratio of about 1, whereas the odds ratio for all patients is 0.20 (95% CI: 0.10-0.41).
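The stepwise odds calculation described above can be sketched in Python with NumPy as follows. This is an illustrative sketch under assumptions, not the code behind Figure 3: the function names are hypothetical, the odds are left undefined (NaN) while a cell of the cumulative 2 × 2 table is still empty, no smoothing is applied, and a patient is placed in the upper quartile when the CVSI is strictly above the third quartile.

```python
import numpy as np

def cumulative_odds_ratio(cvsi, pmrt, lrr):
    """Rank patients by decreasing CVSI and, after adding each patient,
    compute the odds of LRR in each treatment group and the odds ratio."""
    cvsi = np.asarray(cvsi, dtype=float)
    pmrt = np.asarray(pmrt, dtype=bool)   # True = received PMRT
    lrr = np.asarray(lrr, dtype=bool)     # True = loco-regional recurrence

    # Highest CVSI first
    order = np.argsort(-cvsi, kind="stable")
    pmrt, lrr = pmrt[order], lrr[order]

    # Cumulative cells of the 2 x 2 table
    a = np.cumsum(pmrt & lrr)       # PMRT, recurrence
    b = np.cumsum(pmrt & ~lrr)      # PMRT, no recurrence
    c = np.cumsum(~pmrt & lrr)      # noPMRT, recurrence
    d = np.cumsum(~pmrt & ~lrr)     # noPMRT, no recurrence

    with np.errstate(divide="ignore", invalid="ignore"):
        odds_pmrt = np.where(b > 0, a / b, np.nan)
        odds_nopmrt = np.where(d > 0, c / d, np.nan)
        defined = (b > 0) & (c > 0) & (d > 0)
        odds_ratio = np.where(defined, odds_pmrt / odds_nopmrt, np.nan)
    return order, odds_pmrt, odds_nopmrt, odds_ratio

def upper_quartile_group(cvsi):
    """Boolean mask of patients whose CVSI lies above the third quartile."""
    cvsi = np.asarray(cvsi, dtype=float)
    return cvsi > np.quantile(cvsi, 0.75)
```

The last element of the returned odds ratio corresponds to the whole cohort, and the earlier elements trace how the ratio changes as patients with progressively lower CVSI are included.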
Figure 3 illustrates that the upper quartile of patients ranked by CVSI index coincides with the point where the odds ratio starts to drop. Thus, the upper quartile was arbitrarily selected as the cut-off value for grouping patients into "low LRR risk" and "high LRR risk" groups. The non-linearity between the odds ratio and the CVSI index also explains why a binary score was chosen instead of a continuous score.

Figure 3. Odds for loco-regional recurrence in the PMRT and noPMRT groups, and odds ratio for loco-regional recurrence. Patients are ranked by decreasing CVSI index. The vertical line indicates the arbitrary cutpoint for patients in the upper quartile. Fitted lines were generated using local smoothing (running median).

Figure 4 illustrates the risk of loco-regional recurrence in patients treated with or without PMRT, divided into quartiles of CVSI. The risk of loco-regional recurrence is estimated as 1 minus the Kaplan-Meier estimate, and differences are tested with the log-rank test. For patients treated without PMRT, the lower three quartiles have a high risk of loco-regional recurrence, whereas the upper quartile has a low risk of loco-regional recurrence. In patients treated with PMRT there is no association between outcome and index.

[Figure 4: panel A, noPMRT; panel B, PMRT. Y-axis: loco-regional recurrence (%), 0-100. X-axis: time after treatment (years), 0-20. One curve per CVSI quartile (1st to 4th). P-values shown in the panels: p = 0.0001 and p = 0.70.]

Figure 4. Loco-regional recurrence in patients treated without (A) or with PMRT (B), divided into quartiles based on CVSI.

References

1. Verweij PJ and Van Houwelingen HC. Cross-validation in survival analysis. Stat. Med. 12 (24): 2305-2314, 1993.
2. Paul D, Bair E, Hastie T, and Tibshirani R. "Preconditioning" for feature selection and regression in high-dimensional problems. Ann. Stat. 36 (4): 1595-1618, 2008.
3. Fan J and Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. B 70 (5): 849-911, 2008.
4. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat. Med. 16 (4): 385-395, 1997.
5. Tibshirani R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58 (1): 267-288, 1996.
6. Sohn I, Kim J, Jung SH, and Park C. Gradient lasso for Cox proportional hazards model. Bioinformatics 25 (14): 1775-1781, 2009.
7. Van Houwelingen HC, Bruinsma T, Hart AA, Van't Veer LJ, and Wessels LF. Cross-validated Cox regression on microarray gene expression data. Stat. Med. 25 (18): 3201-3216, 2006.
8. Gui J and Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21 (13): 3001-3008, 2005.
9. Park MY and Hastie T. L1-regularization path algorithm for generalized linear models. J. R. Stat. Soc. B 69 (4): 659-677, 2007.
10. Efron B and Tibshirani R. An introduction to the bootstrap. Monographs on Statistics & Applied Probability 57. Boca Raton, Chapman & Hall/CRC, 1993.
11. Lin DY. Goodness-of-fit analysis for the Cox regression model based on a class of parameter estimators. J. Am. Stat. Assoc. 86 (415): 725-728, 1991.
12. Goeman JJ, van de Geer SA, de Kort F, and Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1): 93-99, 2004.