Supplementary References, Methods and Figure Legend

Supplementary Methods
General information
All statistical analyses and data processing steps described below were performed
using R (www.r-project.org/). Mutation status for samples in the included expression
cohorts was identified from the original publications. Explicit mutation information was
not available for a majority of cases and was consequently not included in any analyses.
Gene expression analyses
A total of 10 public gene expression cohorts were analyzed (Chitale et al. was divided
into two cohorts due to different Affymetrix platforms). Affymetrix cohorts with
available CEL files (Chitale U133A, Chitale U133 2plus, E-MTAB-923, GSE31210,
GSE14814, Shedden et al.) were all normalized using GCRMA. Normalized
expression values for series GSE13213, GSE3141, GSE33072, and GSE42127 were
obtained from Gene Expression Omnibus (1). Only tumor samples were selected. All
cohorts were mean-centered for each probe across all samples (adenocarcinomas,
squamous cell carcinomas, or NSCLC depending on context) if analyses were
performed in a general setting, otherwise within each mutation group (for consensus
clustering). Exceptions included Shedden et al., which was row-centered for each of
the four sites individually (similar to Bryant et al. (2)), merged, and lastly filtered for
tumors overlapping with Chitale et al. For duplicated probes in non-Affymetrix
cohorts, the probe with the highest standard deviation was kept. All matching of genes
between different microarray platforms was performed using gene symbols or gene
ids. For Affymetrix cohorts, original probe set identifiers were used whenever
possible.
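
The per-probe mean-centering and the selection of the most variable duplicated probe can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used in the study; the function names, the dictionary layout (probe identifier mapped to a list of values, one per sample), and the probe-to-gene mapping are assumptions made for the example.

```python
from statistics import mean, stdev


def mean_center(expr):
    """Subtract each probe's mean across samples.

    expr: {probe_id: [expression value per sample]} (assumed layout).
    """
    centered = {}
    for probe, values in expr.items():
        m = mean(values)
        centered[probe] = [v - m for v in values]
    return centered


def keep_most_variable_probe(expr, probe_to_gene):
    """For genes with several probes, keep the probe with the highest SD.

    probe_to_gene: {probe_id: gene symbol} (hypothetical annotation).
    Returns {gene: [expression value per sample]}.
    """
    best = {}  # gene -> (standard deviation, probe_id)
    for probe, values in expr.items():
        gene = probe_to_gene[probe]
        sd = stdev(values)
        if gene not in best or sd > best[gene][0]:
            best[gene] = (sd, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}
```

Centering would be applied across all samples in the general setting, or separately to the samples of each mutation group when preparing data for consensus clustering.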
Calculation of expression metagenes and classification into adenocarcinoma
molecular subtypes
Samples were classified according to the adenocarcinoma molecular subtypes based on
Pearson correlation to gene expression centroids reported by Wilkerson et al. (3).
Samples were assigned to the gene expression centroid with the highest correlation
>0, otherwise set as unclassified. A CIN70 metagene was calculated as the mean
expression of genes matching the list of 70 genes reported by Carter et al. (4).
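
As an illustration of the two calculations above, the following Python sketch assigns a sample to the centroid with the highest Pearson correlation above zero and computes a metagene as the mean expression of the listed genes present in the sample. The function names and the dictionary-based data layout are assumptions made for the example; the centroid labels and values in any usage would come from the cited publications, not from this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify(sample, centroids):
    """Assign a sample to the centroid with the highest correlation > 0.

    sample: {gene: centered expression}; centroids: {label: {gene: value}}.
    Only genes shared by the sample and a centroid are correlated.
    """
    best_label, best_r = "unclassified", 0.0
    for label, centroid in centroids.items():
        genes = sorted(set(sample) & set(centroid))
        r = pearson([sample[g] for g in genes], [centroid[g] for g in genes])
        if r > best_r:
            best_label, best_r = label, r
    return best_label


def metagene(sample, gene_list):
    """Mean expression of the genes in gene_list that are present in the sample."""
    present = [sample[g] for g in gene_list if g in sample]
    return sum(present) / len(present)
```

Samples whose correlation to every centroid is zero or negative are returned as unclassified, matching the rule stated above.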
Identification of subgroups within mutation groups using Consensus clustering, SAM
analysis and centroid prediction
Consensus clustering was performed using the ConsensusClusterPlus R package (5)
for each discovery cohort (Chitale U133A, Chitale U133 2plus, GSE31210, and
E-MTAB-923) and mutation group individually. Prior to clustering of a cohort,
expression data was mean-centered across samples belonging to the specific mutation
group, and probe sets with log2ratio standard deviation < 0.5 were removed. Settings
for the consensus clustering were 1000 repetitions, pItem and pFeature = 0.7, Pearson
correlation, and complete linkage. Only the k=2 group solution was evaluated further
due to the relatively small number of samples in several cohorts. Notably, in the larger
cohorts, such as GSE31210, division into additional clusters did not show improved
prognostic stratification. To identify differentially expressed genes between the two
consensus clusters we performed SAM analysis using the siggenes package with 1000
permutations. Probe sets with a false discovery rate < 5% were considered significant
and used to create gene expression centroids for the two clusters for this cohort and
mutation group. Centroids derived from each cohort and mutation group were next
applied to all remaining discovery cohorts (n=3) for that mutation type, using Pearson
correlation and assignment to the centroid with the highest correlation >0.
Importantly, this procedure means that a classifier is never applied to its own training
set, only independent test sets. Classification by each centroid on a test set was
compared with the original, unrelated, consensus clustering of that test set to
investigate cross-cohort consistency. Here, cross-consistency implies that, for a given
test set, the two independent classifications, by a) a centroid classifier derived from a
different training cohort and b) the original consensus clustering for the test set,
group tumors similarly.
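
The centroid construction and the cross-cohort comparison can be illustrated with the Python sketch below. It is a simplified illustration under stated assumptions: centroids are taken as the per-probe mean within each consensus cluster (restricted in practice to the SAM-significant probe sets), cluster labels are coded 1 and 2, and agreement is scored as a simple fraction that allows for the two clusterings using opposite label names. The text describes cross-consistency qualitatively, so this score is a hypothetical stand-in and not the criterion used in the study.

```python
def build_centroids(expr, labels):
    """Per-probe mean expression within each cluster.

    expr: {probe_id: [value per sample]}; labels: cluster label per sample.
    Returns {label: {probe_id: mean value}}.
    """
    centroids = {}
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        centroids[label] = {
            probe: sum(values[i] for i in idx) / len(idx)
            for probe, values in expr.items()
        }
    return centroids


def cross_consistency(predicted, original):
    """Fraction of classified samples grouped alike by the two labelings.

    Assumes exactly two labels (1 and 2) in both lists. Because cluster
    names are arbitrary across cohorts, the better of the direct and the
    swapped label mapping is reported. Unclassified samples are skipped.
    """
    pairs = [(p, o) for p, o in zip(predicted, original) if p != "unclassified"]
    if not pairs:
        return 0.0
    same = sum(p == o for p, o in pairs) / len(pairs)
    swapped = sum(p != o for p, o in pairs) / len(pairs)
    return max(same, swapped)
```

A score near 1 indicates that the centroid classifier from a different training cohort and the test set's own consensus clustering split the tumors in the same way; a score near 0.5 indicates no correspondence.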
A single multicohort centroid classifier for EGFR-mutated (EGFR-1/2) and a
single multicohort classifier for EGFRwt/KRASwt tumors (wt/wt-1/2) were created
based on the intersection of probe sets between cohorts with cross-consistency
between original consensus clusters and centroid prediction (3 cohorts for
EGFR-mutated tumors, 4 for EGFRwt/KRASwt tumors). Multicohort centroid values for
each intersected probe set were calculated as the mean of individual data set centroid
values, resembling a “mean of means”. Independent validation cohorts were classified
using the two multicohort centroids by assigning samples to the centroid with highest
Pearson correlation >0 for each classifier. Probes in independent cohorts were
mean-centered across each mutation group or across the entire cohort prior to classification
(depending on if analysis was performed in a mutation specific or general setting,
respectively). Matching of centroid genes was done by Affymetrix probe set id if
possible, otherwise gene symbol / gene id. If probe set ids were not available in the
independent cohorts, genes with multiple probe sets in the multicohort centroids were
averaged to one gene – one row prior to classification. For genes with multiple probes
in independent non-Affymetrix cohorts, the probe with the highest log2ratio standard
deviation was chosen to represent the gene in correlation calculations. The prognostic
association of the multicohort signatures in squamous cell carcinoma samples from
GSE3141 and GSE42127 was assessed in the same way as for adenocarcinomas. We used
the pre-normalized data from GEO, which were mean-centered across all
squamoid samples and then classified using the two multicohort centroids.
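
The "mean of means" construction and the collapsing of multiple probe sets to one row per gene can be sketched in Python as follows. The function names, the dictionary layout, and the probe-to-gene mapping are assumptions made for illustration; the sketch handles one subgroup centroid at a time.

```python
def multicohort_centroid(cohort_centroids):
    """Average cohort-specific centroid values over the shared probe sets.

    cohort_centroids: list of {probe_id: centroid value}, one per cohort,
    all describing the same subgroup. Only probe sets present in every
    cohort (the intersection) are kept.
    """
    shared = set(cohort_centroids[0])
    for centroid in cohort_centroids[1:]:
        shared &= set(centroid)
    n = len(cohort_centroids)
    return {
        probe: sum(c[probe] for c in cohort_centroids) / n
        for probe in sorted(shared)
    }


def collapse_to_genes(centroid, probe_to_gene):
    """Average probe sets mapping to the same gene (one gene, one row).

    probe_to_gene: {probe_id: gene symbol} (hypothetical annotation).
    """
    by_gene = {}
    for probe, value in centroid.items():
        by_gene.setdefault(probe_to_gene[probe], []).append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in by_gene.items()}
```

The collapsing step would only be needed when the validation cohort lacks Affymetrix probe set identifiers, so that matching has to be done by gene symbol or gene id.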
Copy number analysis
Genomic profiles for 158 tumors with matching copy number data were obtained from
Staaf et al. (6). These profiles were normalized, partitioned, and merged to a common
probe set as described (6). Partitioned genomic segments with log2ratio > 0.12 were
called as copy number gain. Partitioned genomic segments with log2ratio < -0.12
were called as copy number loss. The fraction of the genome altered by copy number
alterations was defined as the sum of genomic probes called as loss or gain divided by
the total number of genomic probes for each sample. Amplifications were identified
as genomic regions with partitioned log2ratio > 0.8.
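
The calling thresholds and the fraction of the genome altered can be expressed as a short Python sketch. It assumes that each sample is represented by a list of partitioned (segmented) log2ratio values, one per genomic probe; the function and constant names are illustrative.

```python
GAIN_THRESHOLD = 0.12   # log2ratio above this is called gain
LOSS_THRESHOLD = -0.12  # log2ratio below this is called loss
AMP_THRESHOLD = 0.8     # log2ratio above this is called amplification


def call_probe(log2ratio):
    """Return 'gain', 'loss', or 'neutral' for one partitioned value."""
    if log2ratio > GAIN_THRESHOLD:
        return "gain"
    if log2ratio < LOSS_THRESHOLD:
        return "loss"
    return "neutral"


def fraction_genome_altered(probe_log2ratios):
    """Probes called gain or loss divided by the total number of probes."""
    altered = sum(1 for v in probe_log2ratios if call_probe(v) != "neutral")
    return altered / len(probe_log2ratios)


def is_amplified(log2ratio):
    """True if the partitioned value exceeds the amplification threshold."""
    return log2ratio > AMP_THRESHOLD
```

Values exactly equal to a threshold are treated as neutral, following the strict inequalities given above.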
Supplemental Figure and Table Legends
Supplemental Figure 1 (S1). Identification, characteristics and validation of
transcriptional subgroups in EGFR-mutated and EGFRwt/KRASwt tumors. (A)
Detection of robust subgroups in EGFR-mutated tumors (EGFR-1: red, EGFR-2:
blue) based on consistency between consensus clusters and predicted clusters by
cohort specific centroids in GSE31210, Chitale U133 2plus and E-MTAB-923.
Subpanels display cohorts with original consensus clusters together with predicted
clusters based on classification by centroids derived from GSE31210 (top), Chitale
U133 2plus (center), and E-MTAB-923 (bottom). Samples were classified to the
centroid with the highest correlation >0. Classification of the cohort from which the
centroids were derived is only included for reference. Rows in subpanels correspond to samples
and grey color indicates unclassified samples. Cross-consistency is seen as similar
grouping of samples by a centroid classifier applied to a test set and the original
unrelated consensus clusters for that test set, e.g. GSE31210 centroids applied to
E-MTAB-923. (B) Detection of robust subgroups (wt/wt-1: red and wt/wt-2: blue) in
EGFRwt/KRASwt tumors based on GSE31210, Chitale U133A, Chitale U133 2plus,
and E-MTAB-923. Subpanels display similar analyses as in A for the GSE31210
centroids (top), Chitale U133A centroids, Chitale U133 2plus centroids, and
E-MTAB-923 centroids (bottom). (C) Characteristics of the EGFR-1 and EGFR-2
groups in EGFR-mutated tumors for GSE31210, Chitale U133 2plus and
E-MTAB-923 including adenocarcinoma molecular subtypes, CIN70 metagene expression and
patient age. (D) Characteristics of the wt/wt-1 and wt/wt-2 groups in
EGFRwt/KRASwt tumors for GSE31210, Chitale U133 2plus, Chitale U133A and
E-MTAB-923 including adenocarcinoma molecular subtypes, CIN70 metagene
expression and patient smoking status. (E) Characteristics for EGFR-mutated tumors,
n=45, in GSE13213 classified by the multicohort EGFR-1/2 centroids. (F)
Characteristics for EGFRwt/KRASwt tumors, n=57, in GSE13213 classified by the
multicohort wt/wt-1/2 centroids.
Supplemental Figure 2 (S2). Differences in patterns of copy number gain and
loss between EGFR-1/2 and wt/wt-1/2 transcriptional subgroups.
(A) Frequency of copy number gain and loss for EGFR-1 (red, n=29) and EGFR-2
(blue, n=24) tumors. Regions highlighted with arrows display >25% frequency
difference. Frequencies above zero indicate gains, below zero losses. (B) Frequency
of copy number gain and loss for wt/wt-1 (red, n=49) and wt/wt-2 (blue, n=56)
tumors. Regions highlighted with arrows display >25% frequency difference.
Frequencies above zero indicate gains, below zero losses.
Supplemental Figure 3 (S3). Expression pattern of genes in multicohort
signatures across adenocarcinoma cohorts.
(A) EGFR-1/2 gene expression across 1177 adenocarcinomas classified by the
signature (unclassified samples excluded). For reference purposes, classification was
also performed for samples in the four discovery cohorts from which the signature
was derived (the signature was derived from EGFR-mutated tumors in the discovery
cohorts). Heatmap colors range from blue (log2ratio ≤ -2) to yellow (log2ratio ≥2),
with rows corresponding to unique genes in the signature (n=600) and columns to
samples. White color in the heatmap indicates missing expression values. Mutation
status and cohort information are provided as annotation bars. (B) wt/wt-1/2 gene
expression across 1175 adenocarcinomas classified by the signature (unclassified
samples excluded). For reference purposes, classification was also performed for
samples in the four discovery cohorts from which the signature was derived (the
signature was derived from EGFRwt/KRASwt tumors in the discovery cohorts).
Heatmap colors range from blue (log2ratio ≤ -2) to yellow (log2ratio ≥2), with rows
corresponding to unique genes in the signature (n=724) and columns to samples.
White indicates missing expression values. Mutation status and cohort information are
provided as annotation bars.
Supplemental Table 1. Characteristics of adenocarcinoma and NSCLC cohorts
used in the current study.
Supplemental Table 2. Multicohort EGFR-1/2 and wt/wt-1/2 centroids.
References
1. Gene Expression Omnibus. Available from: http://www.ncbi.nlm.nih.gov/geo/
2. Bryant CM, Albertus DL, Kim S, Chen G, Brambilla C, Guedj M, et al.
Clinically relevant characterization of lung adenocarcinoma subtypes based on
cellular pathways: an international validation study. PLoS ONE 2010;5: e11712.
3. Wilkerson MD, Yin X, Walter V, Zhao N, Cabanski CR, Hayward MC, et al.
Differential pathogenesis of lung adenocarcinoma subtypes involving sequence
mutations, copy number, chromosomal instability, and methylation. PLoS ONE
2012;7: e36530.
4. Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of
chromosomal instability inferred from gene expression profiles predicts clinical
outcome in multiple human cancers. Nature genetics 2006;38: 1043-8.
5. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool
with confidence assessments and item tracking. Bioinformatics 2010;26: 1572-3.
6. Staaf J, Isaksson S, Karlsson A, Jonsson M, Johansson L, Jonsson P, et al.
Landscape of somatic allelic imbalances and copy number alterations in human
lung carcinoma. International journal of cancer 2012;1: 2020-31.