1 - bioRxiv

SUPPLEMENTAL METHODS

Filtering of sites from total read coverage for selection analysis
We first excluded sites whose total read coverage was not within two standard deviations of the
mean. The mean was calculated by first averaging the total reads across the three replicates for
each passage and then averaging across all passages. This was done separately for JK and HeLa
cells, yielding mean coverages of 37,670 and 67,720 reads, respectively. A total of 17,970 (JK)
and 17,924 (HeLa) sites passed this filter. Sites that did not pass the filter were excluded from
the selection analyses.
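
The coverage filter can be sketched as follows. This is a minimal illustration rather than the pipeline used for the analysis: the function names, the nested-list input layout, and the use of the population standard deviation across sites are all assumptions made for the example.

```python
from statistics import mean, pstdev

def mean_coverage_per_site(passages):
    """Average total reads over replicates within each passage, then over passages.

    `passages` is a hypothetical layout: a list of passages, each a list of
    replicates, each a list of total read counts (one entry per site).
    """
    n_sites = len(passages[0][0])
    per_passage = [
        [mean(rep[s] for rep in replicates) for s in range(n_sites)]
        for replicates in passages
    ]
    return [mean(p[s] for p in per_passage) for s in range(n_sites)]

def sites_passing_filter(coverage, n_sd=2.0):
    """Return indices of sites within n_sd standard deviations of the mean coverage.

    Assumption: the standard deviation is taken across sites (population form).
    """
    mu = mean(coverage)
    sd = pstdev(coverage)
    return [i for i, c in enumerate(coverage) if abs(c - mu) <= n_sd * sd]
```

Running the two functions separately for each cell line mirrors the description above, in which JK and HeLa cells each have their own mean coverage.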

Estimating sequencing error rates across the EBOV genome
A PhiX control was sequenced with our data, which we used to estimate the sequencing error rate
for the EBOV genome. For the PhiX data, the expectation is that all sequenced reads should have
the reference allele, and any reads that deviate from this expectation are the result of
sequencing error. We counted the number of non-reference reads of each type for each nucleotide,
then normalized these counts by the total number of reads observed for each nucleotide to obtain
bidirectional error rates (e.g., X>Y and Y>X, for all pairs of nucleotides X and Y). We
summarize the rate at which nucleotide $Y$ mutates to any other nucleotide as $R_{Y\cdot}$, and
the rate at which any other nucleotide mutates to $Y$ as $R_{\cdot Y}$.
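
A minimal sketch of how the two summary rates could be computed from PhiX read counts is given below. The `counts[ref][obs]` table, the function name, and the choice to pool all non-$Y$ reference reads in the denominator of the "to $Y$" rate are assumptions made for illustration; the text above does not specify the pooling.

```python
NUCLEOTIDES = "ACGT"

def summarize_error_rates(counts):
    """Summarize per-nucleotide error rates from a PhiX count table.

    counts[ref][obs] is the number of reads whose reference base is `ref`
    and whose called base is `obs` (hypothetical input layout).
    Returns, for each nucleotide Y, the rate at which Y is read as anything
    else ("R_from") and the rate at which other bases are read as Y ("R_to").
    """
    totals = {ref: sum(counts[ref].values()) for ref in NUCLEOTIDES}
    rates = {}
    for y in NUCLEOTIDES:
        others = [x for x in NUCLEOTIDES if x != y]
        lost = sum(counts[y][x] for x in others)      # Y read as non-Y
        gained = sum(counts[x][y] for x in others)    # non-Y read as Y
        other_total = sum(totals[x] for x in others)  # assumption: pooled denominator
        rates[y] = {"R_from": lost / totals[y], "R_to": gained / other_total}
    return rates
```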

Estimating EBOV population growth
The EBOV population size at each of the sequencing time points was estimated by RT-ddPCR (see
Methods). These figures gave us estimates of the EBOV census population size at the time of
sequencing, just before the population was bottlenecked by passaging. To estimate the severity
of these bottlenecks, we estimated the number of initial infections based on the assumption that
EBOV is just as likely to infect a JK cell as a Vero cell. First, we took the population sizes
from RT-ddPCR and scaled them by the fraction of the volume used to start the next passage
(1/40). Then we normalized that by the ratio of the PFUs calculated for P0, which was calculated
with Vero cells, to the P0 population size from RT-ddPCR (0.00243678). We then modeled an
exponential growth curve between each bottleneck and the following observed population size at
the time of sequencing (Fig. S2). The form of the exponential curve we used was
$$X_t = X_0 (1 + r)^t.$$
Here, $X_0$ is the population size directly after a given bottleneck, $X_t$ is the census
population size at the following sequencing time point, and $t$ is always 6, as there are about
6 generations between each bottleneck and the following sequencing time point. We then solve
for $r$ to obtain an exponential curve that connects the bottleneck population size to the
following sequencing population size over 6 generations. This gave us estimates of the
population size at each of the 42 generations over the course of the experiments. While these
estimates were rough, they provided the general shape of the EBOV demography over the course of
the evolution experiments, which we incorporated into our simulations.
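
The bottleneck and growth calculation can be sketched as below. Solving $X_t = X_0(1+r)^t$ for $r$ gives $r = (X_t/X_0)^{1/t} - 1$. The function names are hypothetical, and multiplying the census size by both the 1/40 volume fraction and the 0.00243678 PFU ratio is our reading of the two steps described above.

```python
def bottleneck_size(census_size, volume_fraction=1 / 40, pfu_ratio=0.00243678):
    """Estimated number of initial infections seeding the next passage.

    Assumption: the census size is multiplied by the passaged volume fraction
    and by the PFU-to-RT-ddPCR ratio measured at P0.
    """
    return census_size * volume_fraction * pfu_ratio

def growth_rate(x0, xt, t=6):
    """Per-generation growth rate r solving xt = x0 * (1 + r) ** t."""
    return (xt / x0) ** (1.0 / t) - 1.0

def trajectory(x0, xt, t=6):
    """Population size at generations 0..t along the fitted exponential curve."""
    r = growth_rate(x0, xt, t)
    return [x0 * (1.0 + r) ** g for g in range(t + 1)]
```

Concatenating the trajectories for the seven passage intervals would give one population size per generation across the 42 generations.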

Simulation-based neutrality test for changes in allele frequency
We used a simulation-based procedure to identify alleles in the EBOV genome that changed
frequency over passages more than expected under neutrality, given the dynamic viral population
size and estimated sequencing error rates. The neutral simulations had five parameters: the
overall population growth function, the number of generations, the starting allele frequency,
and the read depth for each site during the first and last passage.
A set of 100,000 neutral simulations was run for each allele in the data. For each site, we
modeled allele $A$ at time $t$ with frequency $p_t$ in a population of size $N_t$ (all the other
alleles at the site have combined frequency $q_t = 1 - p_t$). We let $P_t = N_t p_t$ be the
number of $A$ alleles at time $t$. The first task is to initialize a starting allele frequency,
$p_0$, for the simulation. There exists an observed starting frequency, $p_0'$, for each allele
that is confounded by error due to sampling the population and sequencing. Thus, we used a
Bayesian approach to derive a probability distribution for the true allele frequency, given the
observed allele frequency and sequencing depth. We modeled our prior expectation of the true
allele frequency using the beta distribution with shape parameters $\alpha$ and $\beta$,
$$\Pr(p_0 \mid \alpha, \beta) = \frac{p_0^{\alpha - 1} q_0^{\beta - 1}}{B(\alpha, \beta)},$$
in which $B(\alpha, \beta)$ is the beta function and serves as a normalizing constant. However,
because we have no prior expectation as to what the starting allele frequency is, other than
that it will be within [0,1], we employ an uninformative prior by setting the shape parameters
$\alpha$ and $\beta$ to 1. This causes the beta distribution to be uniform between 0 and 1. We
then modeled the probability of observing a given allele count, $P_0'$, given a true starting
allele frequency, $p_0$, and the sample size $N_0'$ using the binomial distribution,
$$\Pr(P_0' \mid N_0', p_0) = \binom{N_0'}{P_0'} p_0^{P_0'} q_0^{N_0' - P_0'},$$
in which $N_0'$ is equal to the read depth for the allele's position at time 0. We then used a
Bayesian framework to model the probability of the true allele frequency, given our prior and
observations,
$$\Pr(p_0 \mid P_0', N_0') = \frac{\Pr(P_0' \mid N_0', p_0)\,\Pr(p_0 \mid \alpha = 1, \beta = 1)}{\sum_{A=0}^{N_0} \Pr(P_0' \mid N_0', A/N_0)\,\Pr(A/N_0 \mid \alpha = 1, \beta = 1)},$$
which can be simplified to
$$\Pr(p_0 \mid P_0', N_0') = \frac{p_0^{P_0'} q_0^{N_0' - P_0'}}{B(P_0' + 1, N_0' - P_0' + 1)},$$
which is a beta distribution with shape parameters $\alpha = P_0' + 1$ and
$\beta = N_0' - P_0' + 1$. We then randomly sampled from this distribution to get a value for the
starting allele frequency $p_0$ for the simulation.
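
Drawing the starting frequency from this posterior can be sketched with the Python standard library, whose `random.betavariate(alpha, beta)` samples from a beta distribution. The function name and arguments below are hypothetical labels for the observed allele count $P_0'$ and read depth $N_0'$.

```python
import random

def sample_starting_frequency(allele_count, depth, rng=random):
    """Draw p0 from Beta(P0' + 1, N0' - P0' + 1).

    This is the posterior under a uniform Beta(1, 1) prior and a binomial
    likelihood; `rng` may be the `random` module or a `random.Random` instance.
    """
    return rng.betavariate(allele_count + 1, depth - allele_count + 1)
```

Because both shape parameters are at least 1, the draw is well defined even when the allele is unobserved (count 0) or fixed (count equal to depth) in the first passage.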
Once $p_0$ was set, we ran a Wright-Fisher simulation of neutral drift over 42 generations (1).
Under this framework, the probability of an allele count in the next generation ($P_{t+1}$)
follows the binomial distribution,
$$\Pr(P_{t+1}) = \binom{N_{t+1}}{P_{t+1}} p_t^{P_{t+1}} q_t^{Q_{t+1}},$$
in which $Q_{t+1} = N_{t+1} - P_{t+1}$. We then sampled a random value for $P_{t+1}$ from this
distribution, and repeated until $t = 42$ generations had been simulated to obtain a simulated
terminal frequency.
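
The drift step can be sketched as repeated binomial sampling with a population size that changes each generation. This is an illustrative toy, not the simulation code used for the analysis: `binomial_draw` is a hypothetical helper that sums Bernoulli trials and is only practical for small population sizes, and `sizes` stands in for the estimated demographic trajectory.

```python
import random

def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (small n only)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def wright_fisher(p0, sizes, rng):
    """Simulate neutral drift and return the terminal allele frequency.

    sizes[t] is the population size N_t for t = 0..T, so len(sizes) - 1
    generations are simulated. Each generation draws the next allele count
    from Binomial(N_{t+1}, p_t) and updates the frequency.
    """
    p = p0
    for n_next in sizes[1:]:
        count = binomial_draw(n_next, p, rng)
        p = count / n_next
    return p

# Example: 42 generations at a constant (hypothetical) size of 200.
final_p = wright_fisher(0.3, [200] * 43, random.Random(0))
```

For realistic population sizes a vectorized or approximate binomial sampler would be needed in place of the Bernoulli loop.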
We then used a binomial distribution to simulate subsampling of the population to mimic our
sequencing strategy. The size of this subsample was equal to the number of reads that mapped to
the allele's position in the genome (i.e., read depth). If $N_{42}$ is the final population
size, let the subsample size be $N_{42}'$. Similarly, if $P_{42}$ is the count of $A$ at the
final generation, then $P_{42}'$ is the count of $A$ resulting from subsampling. We again used
the binomial to model the probability of $P_{42}'$,
$$\Pr(P_{42}') = \binom{N_{42}'}{P_{42}'} p_{42}^{P_{42}'} q_{42}^{N_{42}' - P_{42}'},$$
and randomly sampled from this distribution to simulate a value for $P_{42}'$.
We used a two-step process to incorporate sequencing error into our simulations. First, we
accounted for the possibility that no sequencing error occurs, as we observe sites in our PhiX
data that have no error despite high read depth. We estimated the probability of no sequencing
error occurring as the proportion of PhiX sites for which there was no sequencing error. A given
simulation was then selected to have no error with this probability. If sequencing error was to
be incorporated into a simulation, this was accomplished by the following binomial sampling
approach. For allele $A$, sequencing errors spuriously decrease or increase the number of reads
with an $A$ allele at rates $R_{A\cdot}$ and $R_{\cdot A}$, respectively. The number of reads
that spuriously carry the $A$ allele (when they should not) is $X_{\cdot A}$, and the number of
reads that spuriously do not carry an $A$ allele (when they should) is $X_{A\cdot}$, whose
distributions are as follows:
$$\Pr(X_{\cdot A}) = \binom{N_{42}' - P_{42}'}{X_{\cdot A}} R_{\cdot A}^{X_{\cdot A}} (1 - R_{\cdot A})^{N_{42}' - P_{42}' - X_{\cdot A}}$$
and
$$\Pr(X_{A\cdot}) = \binom{P_{42}'}{X_{A\cdot}} R_{A\cdot}^{X_{A\cdot}} (1 - R_{A\cdot})^{P_{42}' - X_{A\cdot}}.$$
We randomly sampled from these two distributions to get discrete values for $X_{\cdot A}$ and
$X_{A\cdot}$. Our final count of allele $A$ is given by $P_{42}''$:
$$P_{42}'' = P_{42}' - X_{A\cdot} + X_{\cdot A}.$$
This process is then repeated 100,000 times to arrive at a simulated distribution for the
terminal allele count $P_{42}''$. This distribution represents the range of ending allele
frequencies that one could expect for a neutral allele, if this allele had the same starting
frequency, read depth, and nucleotide identity as a given observed allele in our data.
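
The two-step sequencing error model can be sketched as follows. The function and parameter names are hypothetical: `a_reads` and `depth` stand for the subsampled count and read depth at the final passage, `r_from_a` and `r_to_a` for the rates at which $A$ reads are lost and gained through error, and `p_no_error` for the proportion of PhiX sites without error. The Bernoulli-sum binomial helper is only suitable for small illustrative depths.

```python
import random

def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (small n only)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def add_sequencing_error(a_reads, depth, r_from_a, r_to_a, p_no_error, rng):
    """Return the allele count after applying the two-step error model.

    Step 1: with probability p_no_error, the site is left error-free.
    Step 2: otherwise, A reads are lost at rate r_from_a and non-A reads
    are spuriously converted to A at rate r_to_a.
    """
    if rng.random() < p_no_error:
        return a_reads
    lost = binomial_draw(a_reads, r_from_a, rng)          # A reads miscalled
    gained = binomial_draw(depth - a_reads, r_to_a, rng)  # non-A reads called A
    return a_reads - lost + gained
```

Repeating the full chain (starting frequency, drift, subsampling, error) many times and collecting the returned counts would give the simulated neutral distribution for one allele.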

Point estimate of selection coefficient
We used population genetics theory to estimate the selection coefficient ($s$) of an allele
given the starting frequency, ending frequency, and number of generations spanned (2). The
equation for $s$ is given by
$$s = \exp\!\left(\frac{\ln(p_{i+t}/q_{i+t}) - \ln(p_i/q_i)}{t}\right) - 1,$$
in which $i$ indexes the starting generation and $t$ is the number of generations that have
elapsed. Fig. 4B shows the estimated selection coefficients across all sites in EBOV for the two
cell lines. Fig. 4C shows the allele frequency trajectory of the most positively selected sites
in EBOV for both JK and HeLa cells. Table S1 shows several statistics for each of the most
positively selected sites.
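
A minimal sketch of the point estimate is shown below. It solves the odds recursion $p_{i+t}/q_{i+t} = (p_i/q_i)(1+s)^t$ from the derivation section for $s$, which is why 1 is subtracted after exponentiating. The function name is hypothetical, and frequencies of exactly 0 or 1 are not handled because their odds are undefined.

```python
import math

def selection_coefficient(p_start, p_end, generations):
    """Estimate s from the change in log-odds of the allele frequency.

    Uses s = exp((ln(p_end/q_end) - ln(p_start/q_start)) / t) - 1,
    obtained by solving (p_end/q_end) = (p_start/q_start) * (1 + s) ** t.
    Requires 0 < p_start < 1 and 0 < p_end < 1.
    """
    log_odds_start = math.log(p_start / (1.0 - p_start))
    log_odds_end = math.log(p_end / (1.0 - p_end))
    return math.exp((log_odds_end - log_odds_start) / generations) - 1.0
```

For example, an allele whose odds rise from 1/9 to 8/9 over 3 generations has doubled its odds each generation, which corresponds to $s = 1$.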

Derivation of selection coefficient equation
Assuming a large Wright-Fisher population, if an allele has a selection coefficient $s$, then
(by the definition of the selection coefficient) the allele is expected to increase in frequency
in the next generation by a factor of $1 + s$. The expected frequency is given by
$$p_{i+1} = \frac{p_i (1 + s)}{q_i + p_i (1 + s)},$$
where the denominator is the mean fitness in the population. The same expression holds for the
alternative allele $a$, but without the factor $1 + s$ in the numerator. The ratio of the allele
frequencies after one generation is given by
$$\frac{p_{i+1}}{q_{i+1}} = \frac{p_i (1 + s)}{q_i}.$$
By recursion, it can be shown that after $t$ generations the expected allele ratio will be
$$\frac{p_{i+t}}{q_{i+t}} = \frac{p_i}{q_i} (1 + s)^t,$$
which can be rearranged to solve for $s$, arriving at
$$s = \exp\!\left(\frac{\ln(p_{i+t}/q_{i+t}) - \ln(p_i/q_i)}{t}\right) - 1,$$
our estimator for the expected selection coefficient (2).
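
The recursion can be checked numerically: iterating the one-generation update with a known $s$ and then inverting the $t$-generation odds ratio should return the same $s$. The sketch below uses hypothetical function names and arbitrary example values; it illustrates the algebra only and is not part of the analysis.

```python
def next_frequency(p, s):
    """One generation of deterministic selection: p(1+s) / (q + p(1+s))."""
    q = 1.0 - p
    return p * (1.0 + s) / (q + p * (1.0 + s))

def recover_s(p0, s, t):
    """Iterate the recursion for t generations, then solve the odds ratio for s."""
    p = p0
    for _ in range(t):
        p = next_frequency(p, s)
    odds_ratio = (p / (1.0 - p)) / (p0 / (1.0 - p0))
    return odds_ratio ** (1.0 / t) - 1.0

# Example with arbitrary values: the recovered value matches s = 0.25
# up to floating-point error.
s_check = recover_s(0.01, 0.25, 42)
```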
SUPPLEMENTAL REFERENCES
1. Hartl DL, Clark AG. 2007. Principles of Population Genetics, Fourth Edition. Sinauer
   Associates, Sunderland, Massachusetts, USA.
2. Felsenstein J. 2016. Theoretical evolutionary genetics - a draft text.
   http://evolution.genetics.washington.edu/pgbook/pgbook.html.