Supplementary information

Genomic landscape established by allelic imbalance in the cancerization field of a
normal appearing airway
Yasminka A. Jakubek, Wenhua Lang, Selina Vattathil, Melinda Garcia, Li Xu, Lili Huang, Suk-Young Yoo, Li
Shen, Wei Lu, Chi-Wan Chow, Zachary Weber, Gareth Davies, Jing Huang, Carmen Behrens, Neda Kalhor,
Cesar Moran, Junya Fujimoto, Reza Mehran, Randa El-Zein, Stephen G. Swisher, Jing Wang, Jerry Fowler,
Avrum E. Spira, Erik A. Ehli, Ignacio I. Wistuba, Paul Scheet, Humam Kadara
Supplementary Methods
Genome-wide high-density array profiling. Genomic DNA samples (200 ng) were processed using the
Illumina Infinium HD assay protocol. DNA samples were denatured, isothermally amplified overnight (to
minimize amplification bias), and then fragmented. Fragmented DNA samples were hybridized to the
BeadChip arrays. BeadChip images were captured using an Illumina iScan system, and raw genotyping
data were generated using the Illumina GenomeStudio Genotyping (GT) Module software. As a quality
control measure, we determined the concordance between the germline genotype calls and the SNP genotype
calls for each set of patient samples. DNA from white blood cells or, when blood cells were unavailable,
from normal lung tissue was used as the designated normal sample for each individual, from which patient
germline genotypes were inferred. High concordance rates (> 98%) were observed between non-tumor and
germline genotypes, and concordance between NSCLC tumor/CNB and germline genotypes exceeded 88%. In
contrast, the concordance rates for all samples when compared with an incorrectly matched germline sample
were less than 70%.
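The concordance check described above can be illustrated with a minimal sketch. The function name, the genotype encoding ("AA", "AB", "BB") and the choice to exclude markers with a missing call in either sample are assumptions made for illustration; the text does not state how missing calls were handled.

```python
from typing import Optional, Sequence


def genotype_concordance(
    sample_calls: Sequence[Optional[str]],
    germline_calls: Sequence[Optional[str]],
) -> Optional[float]:
    """Fraction of markers with identical genotype calls.

    Markers with a missing call (None) in either sample are skipped,
    which is an assumption rather than a documented study rule.
    """
    if len(sample_calls) != len(germline_calls):
        raise ValueError("call vectors must cover the same markers")
    compared = 0
    matched = 0
    for sample, germline in zip(sample_calls, germline_calls):
        if sample is None or germline is None:
            continue
        compared += 1
        if sample == germline:
            matched += 1
    # None signals that no marker could be compared.
    return matched / compared if compared else None


# Toy example: four comparable markers, three of which agree.
tumor = ["AA", "AB", "BB", "AB", None]
germline = ["AA", "AB", "BB", "BB", "AB"]
print(genotype_concordance(tumor, germline))
```

Under this sketch, a correctly matched sample would be expected to score near the rates reported above, and a mismatched pairing well below them.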
Quality control of event boundaries identified by hapLOH. HapLOH was applied to the entire genome,
agnostic to chromosomal boundaries. Although individual AI events are naturally restricted to separate
chromosomes, running the algorithm in this way allows for an orthogonal confirmation of detected events,
i.e., when the boundary of a called AI event corresponds in genomic coordinates to the end of a chromosome
or its centromere. Indeed, most of the called events with sizes on the order of a small chromosome were
observed to end at chromosome boundaries, supporting their validity. However, in highly aberrant
genomes, AI events may reside on “adjacent” (by number) chromosomes, e.g., all of 1q and near the terminus
of 2p. In this boundary-agnostic mode, a false observation of a small event next to a large one on a different
chromosome can result from the algorithm slowly lowering the posterior probability of AI below 50% over
many markers (essentially “bleeding” into adjacent markers). This is especially likely in “low cellularity”
settings, since regions of AI will not look very different from normal regions. To separate this “bleeding”
phenomenon from true multiple (adjacent) events, hapLOH was run four times for each sample, each
time permuting the order of the chromosomes in the genome (sampling chromosomal orderings without
replacement). The results were then combined by averaging the posterior probabilities across the four
runs. This approach did not dramatically affect the number of AI events detected; however, it did
improve identification of event boundaries. AI was also analyzed on the X chromosome of female patients. To
do so, hapLOH was run twice per sample, placing the X chromosome markers once at the beginning
and once at the end of the genome, after which the results of both runs were combined by averaging their
posterior probabilities.
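The permute-and-average procedure can be sketched as follows. This is an illustration under stated assumptions, not the hapLOH implementation: `run_caller` is a hypothetical stand-in for any boundary-agnostic caller that takes the concatenated per-marker values and returns one posterior probability per marker, and the data layout (a dictionary of per-chromosome lists) is chosen for convenience.

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def average_over_orderings(
    values_by_chrom: Dict[str, Sequence[float]],
    run_caller: Callable[[List[float]], Sequence[float]],
    n_runs: int = 4,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Average per-marker posteriors over distinct chromosome orderings."""
    chroms = list(values_by_chrom)
    if n_runs > math.factorial(len(chroms)):
        raise ValueError("not enough distinct chromosome orderings")

    # Draw n_runs distinct orderings (sampling without replacement).
    rng = random.Random(seed)
    orderings: List[tuple] = []
    seen = set()
    while len(orderings) < n_runs:
        order = tuple(rng.sample(chroms, len(chroms)))
        if order not in seen:
            seen.add(order)
            orderings.append(order)

    totals = {c: [0.0] * len(values_by_chrom[c]) for c in chroms}
    for order in orderings:
        # Concatenate chromosomes in the permuted order and call once.
        flat = [v for c in order for v in values_by_chrom[c]]
        posteriors = run_caller(flat)
        if len(posteriors) != len(flat):
            raise ValueError("caller must return one posterior per marker")
        # Map the flat output back to each chromosome's own markers.
        pos = 0
        for c in order:
            n_markers = len(values_by_chrom[c])
            for i in range(n_markers):
                totals[c][i] += posteriors[pos + i]
            pos += n_markers

    return {c: [t / n_runs for t in totals[c]] for c in chroms}
```

Because each run places a different chromosome next to any given chromosome end, a spurious signal that "bleeds" across one junction contributes to only some of the runs and is diluted by the average, whereas a true event is supported in every run.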
Classification of events as gains, losses or copy-neutral LOH. AI events were classified as gain, loss or
cn-LOH using the average log R ratio (LRR) and BAF deviations of the markers within each event
(Supplementary Figure 4). LRR thresholds were set at +/- 0.05 for gains and losses, respectively. Events with
LRR between -0.1 and 0.1 and BAF deviations > 0.1 were classified as cn-LOH. Airway events (n = 179) whose
BAF and LRR deviations were too subtle for classification (BAF deviation < 0.1 and LRR between -0.1
and 0.1) were deemed “undeterminable”. To classify those undeterminable airway events that
had a positive BAF correlation with a paired tumor event, events in paired airway and tumor
samples were compared using a 50% reciprocal overlap rule. An undeterminable airway event was inferred
to be of the same type as the tumor event if it matched an event in the tumor/CNB. Where the airway
event matched events in multiple tumor/CNB samples from the same patient, a majority rule
among the tumor events was used to label the airway event. Using this approach, 61 subtle airway events
were classified. In addition, AI events were classified as focal or arm events, with an arm event spanning
90% or more of the markers on a chromosome arm (Supplementary Figure 6).
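The matching rules in this paragraph can be sketched in a few lines. Several details are assumptions made for illustration: events are represented as (start, end) coordinates on the same chromosome, ties in the majority vote return no label, and the prerequisite positive BAF correlation is taken as already checked. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, with end > start


def reciprocal_overlap(a: Interval, b: Interval, min_frac: float = 0.5) -> bool:
    """True if the shared span covers at least min_frac of BOTH events."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return False
    return overlap >= min_frac * (a[1] - a[0]) and overlap >= min_frac * (b[1] - b[0])


def majority_label(labels: Iterable[str]) -> Optional[str]:
    """Most frequent label, or None if empty or tied (tie rule is an assumption)."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def infer_airway_type(
    airway_event: Interval,
    tumor_events: Iterable[Tuple[Interval, str]],
) -> Optional[str]:
    """Label an undeterminable airway event from matching tumor events."""
    matches: List[str] = [
        label
        for interval, label in tumor_events
        if reciprocal_overlap(airway_event, interval)
    ]
    return majority_label(matches)


def is_arm_event(n_event_markers: int, n_arm_markers: int, min_frac: float = 0.9) -> bool:
    """Arm event: the event spans at least min_frac of the arm's markers."""
    return n_event_markers / n_arm_markers >= min_frac


tumor = [((0, 110), "loss"), ((10, 100), "loss"), ((0, 90), "gain")]
print(infer_airway_type((0, 100), tumor))
```

The reciprocal test is deliberately symmetric: a small airway event lying inside a much larger tumor event covers all of itself but only a small part of the tumor event, so it does not count as a match.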
Identification of anti-correlated BAF signals as indicators of secondary mutations in airways and
tumors. For individuals with multiple samples exhibiting AI at a given region, we attempted to infer whether
the imbalance of alleles occurred in the same or different directions among samples. An observation of the
same direction would be consistent with the same mutational event, whereas the opposite direction implies
an independent (or additional) mutation. Shifts in the BAFs at heterozygous markers were
modeled to determine whether two samples from the same patient showed the same pattern of AI (for
example, both exhibit the pattern shown in Supplementary Figure 5C) or an opposite pattern (for example,
the tumor exhibits the pattern shown in Supplementary Figure 5C and the airway shows the pattern in
Supplementary Figure 5D). A quantitative assessment of these deviations was performed for each airway AI
event by calculating the correlation between the heterozygous BAFs in the airway event and the BAFs for the
same set of markers in the tumor event. To remove possible artifacts from systematic probe biases,
tumor/CNB BAFs were divided into two groups, increased BAF (values above 0.5) and decreased BAF
(values below 0.5), after which the BAFs were smoothed to 0.9 and 0.1 for the increased and
decreased groups, respectively. Further, we evaluated only events in which the mean BAF deviation was at
least 0.05 from the expected value of 0.5. Correlations between observed BAF values in the airway samples
and the corresponding smoothed BAFs from the tumor samples were computed within a region of interest or
putative AI. These regions and events are reported in Supplementary Table 3.
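A minimal sketch of the direction test follows. It assumes a Pearson correlation (the text says only "correlation"), drops markers whose tumor BAF is exactly 0.5, and provides the mean-deviation filter as a separate helper because the text does not specify which sample it was applied to. The function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def smooth_tumor_baf(baf: float) -> Optional[float]:
    """Collapse a tumor BAF to 0.9 (above 0.5) or 0.1 (below 0.5)."""
    if baf > 0.5:
        return 0.9
    if baf < 0.5:
        return 0.1
    return None  # exactly 0.5 carries no direction; dropped by assumption


def mean_baf_deviation(bafs: Sequence[float]) -> float:
    """Mean absolute deviation of BAFs from the expected value of 0.5."""
    return sum(abs(b - 0.5) for b in bafs) / len(bafs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def airway_tumor_baf_correlation(
    airway_bafs: Sequence[float], tumor_bafs: Sequence[float]
) -> Optional[float]:
    """Correlate airway BAFs with smoothed tumor BAFs at the same markers."""
    airway: List[float] = []
    smoothed: List[float] = []
    for a, t in zip(airway_bafs, tumor_bafs):
        s = smooth_tumor_baf(t)
        if s is not None:
            airway.append(a)
            smoothed.append(s)
    return pearson(airway, smoothed)


tumor = [0.8, 0.2, 0.7, 0.3]
same_direction = [0.60, 0.40, 0.62, 0.38]
opposite_direction = [0.40, 0.60, 0.38, 0.62]
print(airway_tumor_baf_correlation(same_direction, tumor) > 0)
print(airway_tumor_baf_correlation(opposite_direction, tumor) < 0)
```

A positive value indicates that the same haplotype is in excess in both samples; a negative value indicates opposite haplotypes in excess, the signature of an independent event.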
The main manuscript reports the rate of whole-arm secondary independent mutations on
chromosome 9 as 0.8, with a standard error of 0.15. These values were derived by the following logic. The
second mutation can occur in one of two directions: the same as the first mutation or the opposite. If the
second mutation occurs in the same direction (same haplotype in relative excess) as the first, we will not be
able to detect it, because our test relies on the presence of a negative correlation (described above); such a
mutation is therefore latent (it occurs but we cannot observe it). Only those second mutations that happen in a
different direction from the first one can be observed. Assuming there are a total of n pairs of identically
positioned AI events between an airway sample and a tumor from the same patient, let x denote the number of
these in which we observed the second mutation in the opposite (BAF) direction to the first, and let y denote
the number of pairs in which a second mutation happens but in the same direction as the first mutation and is
thus not observed. We assume that the secondary mutation is an independent event. Then x and y follow
binomial distributions as follows:
x ~ Binom(n, px) and y ~ Binom(n, py),
where px and py are the rates of each type of mutation. We further assume the direction of the mutation is
random with equal probability, and thus px = py and the overall probability (or rate) of secondary mutation p is
px + py. Given observed data n and x, we can obtain a point estimate for p as twice the point estimate for px,
or 2(x/n). To obtain a standard error of this estimate, we use the square root of the variance of the estimator,
Var(p̂), which in turn is given by
\[
\mathrm{Var}(\hat{p}_x) + \mathrm{Var}(\hat{p}_y)
= \frac{\hat{p}_x(1-\hat{p}_x) + \hat{p}_y(1-\hat{p}_y)}{n}
\quad\text{or}\quad
\frac{2\,\hat{p}_x(1-\hat{p}_x)}{n}.
\]
On chromosome 9, 6 opposite-direction mutations were observed out of a possible 15, and thus our estimate
for p, the rate of secondary independent mutations, is 12/15 or 0.8, with a standard error of approximately 0.15.
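The estimate can be checked numerically with a short sketch that applies the point estimate 2(x/n) and the variance expression 2·p̂x(1 − p̂x)/n exactly as written above. The function name is hypothetical. Evaluated this way for x = 6 and n = 15, the formula gives a standard error of about 0.18, somewhat above the approximately 0.15 quoted; the text does not state how the quoted value was computed or rounded.

```python
import math
from typing import Tuple


def secondary_mutation_rate(x: int, n: int) -> Tuple[float, float]:
    """Point estimate and standard error for the secondary mutation rate p.

    x: pairs with an observed opposite-direction second mutation
    n: total pairs of identically positioned AI events
    """
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("require n > 0 and 0 <= x <= n")
    px_hat = x / n                              # observed (opposite-direction) rate
    p_hat = 2 * px_hat                          # p = px + py with px = py
    variance = 2 * px_hat * (1 - px_hat) / n    # Var(px_hat) + Var(py_hat)
    return p_hat, math.sqrt(variance)


p_hat, std_err = secondary_mutation_rate(6, 15)
print(round(p_hat, 2), round(std_err, 2))
```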
Supplementary Figure 1. HapLOH results for matched normal large airway and NSCLC tumor from
case 18
[Figure: panels A and B; y-axis, BAF; x-axis, markers by genomic position with chromosomes (CHR) labeled]
Supplementary Figure 1. B-allele frequencies (BAF) (y-axis) are plotted for each marker on the array.
Markers are ordered along the x-axis by their genomic position. Red lines indicate the hapLOH output, the
posterior probability that markers are in a region of allelic imbalance (AI). Blue dashed lines indicate
chromosome boundaries and blue dots indicate centromeres. Plots are shown for the normal-appearing large
airway brushing (A) and the NSCLC tumor (B) from case 18.
Supplementary Figure 2. Distribution of posterior probabilities for airway events called by hapLOH
[Figure: histogram; x-axis, Posterior Probability (0.5 to 1.0); y-axis, Number of Events (0 to 150)]
Supplementary Figure 2. Distribution of the posterior probabilities (x-axis) for the events identified by
hapLOH in the normal-appearing airway field of cancerization.
Supplementary Figure 3: Allelic imbalance events detected in normal-appearing airway brushings
[Figure: one panel per chromosome (CHR 1 to CHR 22 and CHR X), each showing rows labeled by airway sample
(e.g., 018_L1, 025_S2b, 026_S1, 039_S2) together with an annotation track]
Supplementary Figure 3.
Plots of somatic AI events detected in the airway samples from NSCLC patients. Events are shown as bars:
red (gain), blue (loss), green (cnLOH), and gray (undeterminable). Labels indicate the case number and
sample type.
Supplementary Figure 4. Event classification using B-allele frequencies and log R ratios
[Figure: panels A, B and C; x-axis, BAF deviation (log scale); y-axis, LRR deviation; legend: CNLOH, loss,
gain, undeterminable]
Supplementary Figure 4.
Plots of AI events with BAF deviation plotted on the x-axis and log R ratio (LRR) deviation plotted on
the y-axis. Panel A depicts the events in the normal-appearing airways and panel B displays
the same events but with the NSCLC tumor event type designation. Panel C depicts all
events identified in the tumors.
Supplementary Figure 5. Effect of allelic imbalance on B-allele frequencies
[Figure: panels A, B, C and D]
To assess AI, we first phased heterozygous genotypes in order to identify the two haplotypes, labeled as the
maternal and paternal haplotypes (A). Regions with no AI showed patterns that are consistent with an equal
proportion of maternal and paternal haplotypes (B). We detected AI in regions where BAF deviations indicated
an abundance of one of the parental haplotypes (C-D). In addition, we calculated BAF correlations between
airway and tumor samples in order to determine whether AI in the samples was the result of both samples
having the same haplotype or opposite haplotypes in excess. Positive correlations indicated AI occurring in
the same direction (both exhibiting pattern C or D). Negative correlations between paired airway and tumor
samples indicated that the samples exhibited opposite haplotypes in excess (for example, C in tumor and D in
airway), pointing to independent AI events.
Supplementary Figure 6: Distribution of focal and arm airway events
[Figure: histograms in panels A (All Events), B (In Tumor Events) and C (Not in Tumor Events); x-axis,
Event Markers / Markers in Chromosome Arm (0.0 to 1.0); y-axis, Number of Events]
Supplementary Figure 6.
Relative size distribution of normal-appearing airway events. Size is measured as the number of markers in
the airway event / the number of markers on the chromosome arm (x-axis). The dashed line at 0.9 indicates
the threshold used to differentiate between arm and focal events. Histograms for all events detected in the
airway samples and for airway events with a matching tumor event are displayed in panels A and B,
respectively. Panel C depicts the distribution of airway events that did not match a tumor event.