Supplementary Methods

General information

All statistical analyses and data processing steps described below were performed using R (www.r-project.org/). Mutation status for samples in the included expression cohorts was identified from the original publications. Explicit mutation information was not available for a majority of cases and was consequently not included in any analyses.

Gene expression analyses

A total of 10 public gene expression cohorts were analyzed (Chitale et al. was divided into two cohorts because of different Affymetrix platforms). Affymetrix cohorts with available CEL files (Chitale U133A, Chitale U133 2plus, E-MTAB-923, GSE31210, GSE14814, and Shedden et al.) were all normalized using GCRMA. Normalized expression values for series GSE13213, GSE3141, GSE33072, and GSE42127 were obtained from Gene Expression Omnibus (1). Only tumor samples were selected. For analyses performed in a general setting, each cohort was mean-centered for each probe across all samples (adenocarcinomas, squamous cell carcinomas, or NSCLC depending on context); for consensus clustering, centering was instead performed within each mutation group. The exception was Shedden et al., which was row-centered for each of the four sites individually (similar to Bryant et al. (2)), merged, and lastly filtered for tumors overlapping with Chitale et al. For duplicated probes in non-Affymetrix cohorts, the probe with the highest standard deviation was kept. All matching of genes between different microarray platforms was performed using gene symbols or gene ids. For Affymetrix cohorts, original probe set identifiers were used whenever possible.

Calculation of expression metagenes and classification into adenocarcinoma molecular subtypes

Samples were classified according to the adenocarcinoma molecular subtypes based on Pearson correlation to the gene expression centroids reported by Wilkerson et al. (3). Each sample was assigned to the centroid with the highest correlation, provided that this correlation was >0; otherwise the sample was set as unclassified.
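The assignment rule described above (highest Pearson correlation to a centroid, required to be above zero) can be illustrated with a minimal Python sketch. This is not the code used in the study, which was performed in R. The function names, the subtype labels `subtype_A` and `subtype_B`, and all values are hypothetical, and treating an undefined correlation as 0 is an assumption made here for the toy example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant vector; treated as 0 (assumption)
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def classify_sample(sample, centroids):
    """Assign a sample to the centroid with the highest Pearson correlation > 0.

    sample:    dict mapping gene -> mean-centered expression value
    centroids: dict mapping subtype name -> dict of gene -> centroid value
    Returns the subtype name, or "unclassified" if no correlation exceeds 0.
    """
    best_name, best_r = "unclassified", 0.0
    for name, centroid in centroids.items():
        genes = sorted(set(sample) & set(centroid))  # genes present in both
        if len(genes) < 2:
            continue  # too few shared genes to compute a correlation
        r = pearson([sample[g] for g in genes], [centroid[g] for g in genes])
        if r > best_r:
            best_name, best_r = name, r
    return best_name

# Toy data (hypothetical values, not taken from any cohort)
centroids = {
    "subtype_A": {"g1": 1.0, "g2": -1.0, "g3": 0.5},
    "subtype_B": {"g1": -1.0, "g2": 1.0, "g3": -0.5},
}
sample_1 = {"g1": 0.8, "g2": -0.6, "g3": 0.1}
sample_2 = {"g1": 0.0, "g2": 0.0, "g3": 0.0}

print(classify_sample(sample_1, centroids))
print(classify_sample(sample_2, centroids))
```

In this toy example the first sample correlates positively with one centroid and is assigned to it, whereas the second sample has no positive correlation with either centroid and is left unclassified.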
A CIN70 metagene was calculated as the mean expression of genes matching the list of 70 genes reported by Carter et al. (4).

Identification of subgroups within mutation groups using consensus clustering, SAM analysis and centroid prediction

Consensus clustering was performed using the ConsensusClusterPlus R package (5) for each discovery cohort (Chitale U133A, Chitale U133 2plus, GSE31210, and E-MTAB-923) and mutation group individually. Prior to clustering of a cohort, expression data were mean-centered across samples belonging to the specific mutation group, and probe sets with a log2ratio standard deviation < 0.5 were removed. Settings for the consensus clustering were 1000 repetitions, pItem = 0.7, pFeature = 0.7, Pearson correlation, and complete linkage. Only the k=2 group solution was evaluated further because of the relatively small number of samples in several cohorts. Notably, in the larger cohorts, such as GSE31210, division into additional clusters did not show improved prognostic stratification.

To identify differentially expressed genes between the two consensus clusters, we performed SAM analysis using the siggenes package with 1000 permutations. Probe sets with a false discovery rate < 5% were considered significant and used to create gene expression centroids for the two clusters for this cohort and mutation group. Centroids derived from each cohort and mutation group were next applied to all remaining discovery cohorts (n=3) for that mutation type, using Pearson correlation and assignment to the centroid with the highest correlation >0. Importantly, this procedure means that a classifier is never applied to its own training set, only to independent test sets. Classification by each centroid on a test set was compared with the original, unrelated consensus clustering of that test set to investigate cross-cohort consistency.
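The train/test design described above can be sketched in Python as follows. This is an illustration only, not the study's code. The function names and sample labels are hypothetical, and the text does not state how agreement between the two classifications was scored, so the cross-tabulation below is simply one assumed way of inspecting it.

```python
from collections import Counter
from itertools import permutations

def train_test_pairs(cohorts):
    """All ordered (training, test) pairs in which the two cohorts differ."""
    return list(permutations(cohorts, 2))

def cross_tabulate(predicted, consensus):
    """Count samples for each (predicted label, consensus cluster) combination.

    predicted, consensus: dicts mapping sample id -> label for the same test set.
    """
    return Counter(
        (predicted[s], consensus[s]) for s in consensus if s in predicted
    )

# The four discovery cohorts named in the text
cohorts = ["Chitale U133A", "Chitale U133 2plus", "GSE31210", "E-MTAB-923"]
pairs = train_test_pairs(cohorts)  # a cohort is never paired with itself

# Toy labels for one test set (hypothetical, not real samples)
consensus = {"s1": "cluster1", "s2": "cluster1", "s3": "cluster2", "s4": "cluster2"}
predicted = {"s1": "centroidA", "s2": "centroidA", "s3": "centroidB", "s4": "unclassified"}
table = cross_tabulate(predicted, consensus)

for (pred_label, cons_label), n in sorted(table.items()):
    print(pred_label, cons_label, n)
```

Because the table is indexed by both label sets, it does not assume that cluster numbering matches between cohorts. Consistent grouping shows up as counts concentrated in one cell per row.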
Here, cross-consistency implies that, for a given test set, two independent classifications group tumors similarly: a) the classification by a centroid classifier derived from a different training cohort, and b) the original consensus clustering of the test set.

A single multicohort centroid classifier for EGFR-mutated tumors (EGFR-1/2) and a single multicohort centroid classifier for EGFRwt/KRASwt tumors (wt/wt-1/2) were created based on the intersection of probe sets between cohorts with cross-consistency between original consensus clusters and centroid prediction (3 cohorts for EGFR-mutated tumors, 4 for EGFRwt/KRASwt tumors). Multicohort centroid values for each intersected probe set were calculated as the mean of the individual data set centroid values, resembling a “mean of means”.

Independent validation cohorts were classified using the two multicohort centroids by assigning samples to the centroid with the highest Pearson correlation >0 for each classifier. Probes in independent cohorts were mean-centered across each mutation group or across the entire cohort prior to classification (depending on whether the analysis was performed in a mutation-specific or general setting, respectively). Matching of centroid genes was done by Affymetrix probe set id if possible, otherwise by gene symbol or gene id. If probe set ids were not available in the independent cohorts, genes with multiple probe sets in the multicohort centroids were averaged so that each gene was represented by a single row prior to classification. For genes with multiple probes in independent non-Affymetrix cohorts, the probe with the highest log2ratio standard deviation was chosen to represent the gene in correlation calculations.

The prognostic association of the multicohort signatures in squamous cell carcinoma samples from GSE3141 and GSE42127 was analyzed in the same way as for adenocarcinomas.
We used the pre-normalised data from GEO, which were mean-centered across all squamoid samples and finally classified using the two multicohort centroids in the same way as for adenocarcinomas.

Copy number analysis

Genomic profiles for 158 tumors with matching copy number data were obtained from Staaf et al. (6). Data for these tumors were normalized, partitioned and merged to a common probe set as described (6). Partitioned genomic segments with log2ratio > 0.12 were called as copy number gain, and segments with log2ratio < -0.12 were called as copy number loss. The fraction of the genome altered by copy number alterations was defined as the number of genomic probes called as loss or gain divided by the total number of genomic probes for each sample. Amplifications were identified as genomic regions with partitioned log2ratio > 0.8.

Supplemental Figure and Table Legends

Supplemental Figure 1 (S1). Identification, characteristics and validation of transcriptional subgroups in EGFR-mutated and EGFRwt/KRASwt tumors.

(A) Detection of robust subgroups in EGFR-mutated tumors (EGFR-1: red, EGFR-2: blue) based on consistency between consensus clusters and clusters predicted by cohort-specific centroids in GSE31210, Chitale U133 2plus and E-MTAB-923. Subpanels display cohorts with original consensus clusters together with predicted clusters based on classification by centroids derived from GSE31210 (top), Chitale U133 2plus (center), and E-MTAB-923 (bottom). Samples were classified to the centroid with the highest correlation >0. Classification of the cohort from which the centroids were derived is included for reference only. Rows in subpanels correspond to samples, and grey indicates unclassified samples. Cross-consistency is seen as similar grouping of samples by a centroid classifier applied to a test set and the original unrelated consensus clusters for that test set, e.g. GSE31210 centroids applied to E-MTAB-923.
(B) Detection of robust subgroups (wt/wt-1: red, wt/wt-2: blue) in EGFRwt/KRASwt tumors based on GSE31210, Chitale U133A, Chitale U133 2plus, and E-MTAB-923. Subpanels display analyses similar to those in A for the GSE31210 centroids (top), Chitale U133A centroids, Chitale U133 2plus centroids, and E-MTAB-923 centroids (bottom).
(C) Characteristics of the EGFR-1 and EGFR-2 groups in EGFR-mutated tumors for GSE31210, Chitale U133 2plus and E-MTAB-923, including adenocarcinoma molecular subtypes, CIN70 metagene expression and patient age.
(D) Characteristics of the wt/wt-1 and wt/wt-2 groups in EGFRwt/KRASwt tumors for GSE31210, Chitale U133 2plus, Chitale U133A and E-MTAB-923, including adenocarcinoma molecular subtypes, CIN70 metagene expression and patient smoking status.
(E) Characteristics for EGFR-mutated tumors, n=45, in GSE13213 classified by the multicohort EGFR-1/2 centroids.
(F) Characteristics for EGFRwt/KRASwt tumors, n=57, in GSE13213 classified by the multicohort wt/wt-1/2 centroids.

Supplemental Figure 2 (S2). Differences in patterns of copy number gain and loss between EGFR-1/2 and wt/wt-1/2 transcriptional subgroups.

(A) Frequency of copy number gain and loss for EGFR-1 (red, n=29) and EGFR-2 (blue, n=24) tumors. Regions highlighted with arrows display >25% frequency difference. Frequencies above zero indicate gains, below zero losses.
(B) Frequency of copy number gain and loss for wt/wt-1 (red, n=49) and wt/wt-2 (blue, n=56) tumors. Regions highlighted with arrows display >25% frequency difference. Frequencies above zero indicate gains, below zero losses.

Supplemental Figure 3 (S3). Expression pattern of genes in multicohort signatures across adenocarcinoma cohorts.

(A) EGFR-1/2 gene expression across 1177 adenocarcinomas classified by the signature (unclassified samples excluded).
For reference purposes, classification was also performed for samples in the four discovery cohorts from which the signature was derived (the signature was derived from EGFR-mutated tumors in the discovery cohorts). Heatmap colors range from blue (log2ratio ≤ -2) to yellow (log2ratio ≥ 2), with rows corresponding to unique genes in the signature (n=600) and columns to samples. White indicates missing expression values. Mutation status and cohort information are provided as annotation bars.
(B) wt/wt-1/2 gene expression across 1175 adenocarcinomas classified by the signature (unclassified samples excluded). For reference purposes, classification was also performed for samples in the four discovery cohorts from which the signature was derived (the signature was derived from EGFRwt/KRASwt tumors in the discovery cohorts). Heatmap colors range from blue (log2ratio ≤ -2) to yellow (log2ratio ≥ 2), with rows corresponding to unique genes in the signature (n=724) and columns to samples. White indicates missing expression values. Mutation status and cohort information are provided as annotation bars.

Supplemental Table 1. Characteristics of adenocarcinoma and NSCLC cohorts used in the current study.

Supplemental Table 2. Multicohort EGFR-1/2 and wt/wt-1/2 centroids.

References

1. Gene Expression Omnibus. Available from: http://www.ncbi.nlm.nih.gov/geo/
2. Bryant CM, Albertus DL, Kim S, Chen G, Brambilla C, Guedj M, et al. Clinically relevant characterization of lung adenocarcinoma subtypes based on cellular pathways: an international validation study. PLoS ONE 2010;5: e11712.
3. Wilkerson MD, Yin X, Walter V, Zhao N, Cabanski CR, Hayward MC, et al. Differential pathogenesis of lung adenocarcinoma subtypes involving sequence mutations, copy number, chromosomal instability, and methylation. PLoS ONE 2012;7: e36530.
4. Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nature Genetics 2006;38: 1043-8.
5. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 2010;26: 1572-3.
6. Staaf J, Isaksson S, Karlsson A, Jonsson M, Johansson L, Jonsson P, et al. Landscape of somatic allelic imbalances and copy number alterations in human lung carcinoma. International Journal of Cancer 2012;1: 2020-31.