SUPPLEMENTAL METHODS

Filtering of sites from total read coverage for selection analysis

We first excluded sites whose total read coverage was not within two standard deviations of the mean. The mean was calculated by first averaging the total reads across the three replicates for each passage and then averaging across all passages. This was done separately for JK and HeLa cells, yielding mean coverages of 37,670 and 67,720 reads, respectively. A total of 17,970 (JK) and 17,924 (HeLa) sites passed this filter. Sites that did not pass the filter were excluded from the selection analyses.

Estimating sequencing error rates across the EBOV genome

A PhiX control was sequenced along with our data, which we used to estimate the sequencing error rate for the EBOV genome. For the PhiX data, the expectation is that all sequenced reads should have the reference allele, and any reads that deviate from this expectation are the result of sequencing error. We counted the number of non-reference reads of each type for each nucleotide, then normalized these counts by the total number of reads observed for each nucleotide to obtain bidirectional error rates (e.g., X>Y and Y>X, for all pairs of nucleotides X and Y). We summarize the rate at which nucleotide $X$ mutates to any other nucleotide as $\epsilon_{X\to}$, and the rate at which any other nucleotide mutates to $X$ as $\epsilon_{\to X}$.

Estimating EBOV population growth

We estimated the EBOV population size at each of the sequencing time points by RT-ddPCR (see Methods). These figures gave us estimates of the EBOV census population size directly at the time of sequencing, and just before the population was bottlenecked due to passaging. To estimate the severity of these bottlenecks, we estimated the number of initial infections under the assumption that EBOV is just as likely to infect a JK cell as a Vero cell.
First, we took the population sizes from RT-ddPCR and reduced them to the fraction of the volume used to start the next passage (1/40). We then normalized the result by the ratio of the PFUs calculated for P0, which was calculated with Vero cells, to the P0 population size from RT-ddPCR (0.00243678). We then modeled an exponential growth curve between each bottleneck and the following observed population size at the time of sequencing (Fig. S2). The form of the exponential curve we used was

$$N_t = N_0 (1 + r)^t.$$

Here, $N_0$ is the population size directly after a given bottleneck, $N_t$ is the census population size at the following sequencing time point, and $t$ is always 6, as there are about 6 generations between each bottleneck and the following sequencing time point. We then solved for $r$ to obtain an exponential curve that connects the bottleneck population size to the following sequencing population size over 6 generations. This gave us estimates of the population size at each of the 42 generations over the course of the experiments. Although these estimates were quite rough, they provided the general shape of the EBOV demography over the course of the evolution experiments, which we were able to incorporate into our simulations.

Simulation-based neutrality test for changes in allele frequency

We used a simulation-based procedure to identify alleles in the EBOV genome that changed frequency over passages more than expected under neutrality, given the dynamic viral population size and estimated sequencing error rates. The neutral simulations had five parameters: the overall population growth function, the number of generations, the starting allele frequency, and the read depths for each site during the first and last passages. A set of 100,000 neutral simulations was run for each allele in the data.
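The population growth function that these simulations take as input can be sketched from the bottleneck and exponential-growth steps described under "Estimating EBOV population growth". The Python below is only an illustration under stated assumptions, not the authors' code: it assumes that "normalizing" means multiplying by the PFU-to-RT-ddPCR ratio, that sizes are reported at generations 1 through 6 after each bottleneck, and the function names and the census values in the usage lines are hypothetical.

```python
PASSAGE_FRACTION = 1 / 40        # fraction of volume used to start the next passage (from the text)
PFU_RATIO = 0.00243678           # P0 PFU : P0 RT-ddPCR population size (from the text)
GENERATIONS_PER_PASSAGE = 6      # approximate generations between bottleneck and sequencing


def growth_rate(n_start, n_end, t=GENERATIONS_PER_PASSAGE):
    """Solve N_t = N_0 * (1 + r)**t for r."""
    return (n_end / n_start) ** (1.0 / t) - 1.0


def population_trajectory(census_sizes, t=GENERATIONS_PER_PASSAGE):
    """Return per-generation population sizes across all passages.

    census_sizes: RT-ddPCR census sizes at consecutive sequencing time points.
    Assumption: the bottleneck size is census * PASSAGE_FRACTION * PFU_RATIO.
    """
    sizes = []
    for previous, following in zip(census_sizes[:-1], census_sizes[1:]):
        n0 = previous * PASSAGE_FRACTION * PFU_RATIO   # size right after the bottleneck
        r = growth_rate(n0, following, t)              # rate that reaches the next census size
        sizes.extend(n0 * (1.0 + r) ** g for g in range(1, t + 1))
    return sizes


# Hypothetical census sizes for three consecutive sequencing time points.
trajectory = population_trajectory([1e9, 2e9, 1.5e9])
print(len(trajectory))   # two passages of 6 generations each
```

With eight sequencing time points as input, the same function would return 42 per-generation sizes, matching the number of generations stated in the text.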
For each site, we modeled allele $A$ at time $t$ with frequency $p_t$ in a population of size $N_t$ (all the other alleles at the site have combined frequency $q_t = 1 - p_t$). We let $x_t = p_t N_t$ be the number of $A$ alleles at time $t$. The first task is to initialize a starting allele frequency, $p_0$, for the simulation. Each allele has an observed starting frequency, $p_0'$, that is confounded by error due to sampling the population and sequencing. Thus, we used a Bayesian approach to derive a probability distribution for the true allele frequency, given the observed allele frequency and sequencing depth. We modeled our prior expectation of the true allele frequency using the beta distribution with shape parameters $\alpha$ and $\beta$,

$$\Pr(p_0 \mid \alpha, \beta) = \frac{p_0^{\alpha-1}\, q_0^{\beta-1}}{B(\alpha, \beta)},$$

in which $B(\alpha, \beta)$ is the beta function and serves as a normalizing constant. However, because we have no prior expectation as to what the starting allele frequency is, other than that it will be within [0,1], we employed an uninformative prior by setting the shape parameters $\alpha$ and $\beta$ to 1. This causes the beta distribution to be uniform between 0 and 1. We then modeled the probability of observing a given allele count, $x_0'$, given a true starting allele frequency, $p_0$, and the sample size, $n_0'$, using the binomial distribution,

$$\Pr(x_0' \mid n_0', p_0) = \binom{n_0'}{x_0'}\, p_0^{x_0'}\, q_0^{\,n_0' - x_0'},$$

in which $n_0'$ is equal to the read depth for the allele's position at time 0. We then used a Bayesian framework to model the probability of the true allele frequency, given our prior and observations,

$$\Pr(p_0 \mid x_0', n_0') = \frac{\Pr(x_0' \mid n_0', p_0)\,\Pr(p_0 \mid \alpha = 1, \beta = 1)}{\sum_{A=0}^{N_0} \Pr(x_0' \mid n_0', A/N_0)\,\Pr(A/N_0 \mid \alpha = 1, \beta = 1)},$$

which can be simplified to

$$\Pr(p_0 \mid x_0', n_0') = \frac{p_0^{x_0'}\, q_0^{\,n_0' - x_0'}}{B(x_0' + 1,\; n_0' - x_0' + 1)},$$

which is a beta distribution with shape parameters $\alpha = x_0' + 1$ and $\beta = n_0' - x_0' + 1$.
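As a minimal sketch of how a starting frequency could be drawn from this posterior, the Python below samples from a beta distribution whose two shape parameters are the observed allele count plus one and the remaining read count plus one. The function name, the seed, and the example counts are hypothetical, and NumPy is an assumed tool here, not one named in the text.

```python
import numpy as np


def sample_starting_frequency(observed_count, read_depth, rng, size=None):
    """Draw starting allele frequencies from Beta(count + 1, depth - count + 1).

    This is the posterior for a binomial observation under a uniform prior.
    """
    if not 0 <= observed_count <= read_depth:
        raise ValueError("observed_count must lie between 0 and read_depth")
    return rng.beta(observed_count + 1, read_depth - observed_count + 1, size=size)


rng = np.random.default_rng(0)
# Hypothetical site: 50 reads carrying the allele out of 10,000 reads.
draws = sample_starting_frequency(50, 10_000, rng, size=100_000)
print(draws.mean())   # close to the posterior mean (50 + 1) / (10,000 + 2)
```

At high read depth the draws cluster tightly around the observed frequency; at low depth the spread widens, which is how uncertainty in the observed starting frequency carries into each simulation.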
We then randomly sampled from this distribution to get a value for the starting allele frequency $p_0$ for the simulation.

Once $p_0$ was set, we ran a Wright-Fisher simulation of neutral drift over 42 generations (1). Under this framework, the probability of an allele count in the next generation ($x_{t+1}$) follows the binomial distribution,

$$\Pr(x_{t+1}) = \binom{N_{t+1}}{x_{t+1}}\, p_t^{\,x_{t+1}}\, q_t^{\,N_{t+1} - x_{t+1}}.$$

We then sampled a random value for $x_{t+1}$ from this distribution, and repeated until $t = 42$ generations had been simulated to obtain a simulated terminal frequency.

We then used a binomial distribution to simulate sub-sampling of the population to mimic our sequencing strategy. The size of this subsample was equal to the number of reads that mapped to the allele's position in the genome (i.e., read depth). If $N_{42}$ is the final population size, let the subsample size be $n_{42}'$. Similarly, if $x_{42}$ is the count of $A$ at the final generation, then $x_{42}'$ is the count of $A$ resulting from subsampling. We again used the binomial to model the probability of $x_{42}'$,

$$\Pr(x_{42}') = \binom{n_{42}'}{x_{42}'}\, p_{42}^{\,x_{42}'}\, q_{42}^{\,n_{42}' - x_{42}'},$$

and randomly sampled from this distribution to simulate a value for $x_{42}'$.

We used a two-step process to incorporate sequencing error into our simulations. First, we accounted for the possibility that no sequencing error occurs, as we observed sites in our PhiX data that have no error despite high read depth. We estimated the probability of no sequencing error occurring as the proportion of PhiX sites for which there was no sequencing error. A given simulation was then selected to have no error with this probability. Otherwise, sequencing error was incorporated into the simulation by the following binomial sampling approach. For allele $A$, sequencing errors spuriously decrease or increase the number of reads with an $A$ allele at rates $\epsilon_{A\to}$ and $\epsilon_{\to A}$
(respectively). The number of reads that spuriously carry the $A$ allele (when they should not) is $e_{\to A}$ and the number of reads that spuriously do not carry an $A$ allele (when they should) is $e_{A\to}$, whose distributions are as follows:

$$\Pr(e_{\to A}) = \binom{n_{42}' - x_{42}'}{e_{\to A}}\, \epsilon_{\to A}^{\,e_{\to A}}\, (1 - \epsilon_{\to A})^{\,n_{42}' - x_{42}' - e_{\to A}}; \text{ and}$$

$$\Pr(e_{A\to}) = \binom{x_{42}'}{e_{A\to}}\, \epsilon_{A\to}^{\,e_{A\to}}\, (1 - \epsilon_{A\to})^{\,x_{42}' - e_{A\to}}.$$

We randomly sampled from these two distributions to get discrete values for $e_{\to A}$ and $e_{A\to}$. Our final count of allele $A$ is given by $x_{42}''$:

$$x_{42}'' = x_{42}' - e_{A\to} + e_{\to A}.$$

This process was then repeated 100,000 times to arrive at a simulated distribution for the terminal allele count $x_{42}''$. This distribution represents the range of ending allele frequencies that one could expect for a neutral allele, if this allele had the same starting frequency, read depth, and nucleotide identity as a given observed allele in our data.

Point estimate of selection coefficient

We used population genetics theory to estimate the selection coefficient ($s$) of an allele given the starting frequency, ending frequency, and number of generations spanned (2). The equation for $s$ is given by

$$s = e^{\frac{\ln(p_{i+t}/q_{i+t}) - \ln(p_i/q_i)}{t}} - 1,$$

in which $i$ indexes the starting generation and $t$ is the number of generations that have elapsed. Fig. 4B shows the estimated selection coefficients across all sites in EBOV for the two cell lines. Fig. 4C shows the allele frequency trajectory of the most positively selected sites in EBOV for both JK and HeLa cells. Table S1 shows several statistics for each of the most positively selected sites.

Derivation of selection coefficient equation

Assuming a large Wright-Fisher population, if an allele has a selection coefficient $s$, then (by the definition of the selection coefficient) the allele is expected to increase in frequency in the next generation by a factor of $1 + s$.
The expected frequency is given by

$$p_{i+1} = \frac{p_i (1+s)}{p_i (1+s) + q_i},$$

where the denominator is the mean fitness in the population. The same expression holds for the alternative allele frequency $q$, but without the factor $1 + s$ in the numerator. The ratio of the two allele frequencies after one generation is therefore

$$\frac{p_{i+1}}{q_{i+1}} = \frac{p_i (1+s)}{q_i}.$$

By recursion, it can be shown that after $t$ generations the expected allele ratio will be

$$\frac{p_{i+t}}{q_{i+t}} = (1+s)^t\, \frac{p_i}{q_i},$$

which can be rearranged to solve for $s$, arriving at

$$s = e^{\frac{\ln(p_{i+t}/q_{i+t}) - \ln(p_i/q_i)}{t}} - 1,$$

our estimator for the expected selection coefficient (2).

SUPPLEMENTAL REFERENCES

1. Hartl DL, Clark AG. 2007. Principles of Population Genetics, Fourth Edition. Sinauer Associates, Sunderland, Massachusetts, USA.
2. Felsenstein J. 2016. Theoretical evolutionary genetics - a draft text. http://evolution.genetics.washington.edu/pgbook/pgbook.html.
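The neutral simulation described under "Simulation-based neutrality test for changes in allele frequency" chains three binomial sampling steps: drift, read subsampling, and sequencing error. The Python below is a minimal sketch of one such simulation, not the authors' implementation. All function and argument names are hypothetical, NumPy is an assumed tool, and the constant population size, error rates, and replicate count in the usage lines are made-up values chosen only to keep the example small.

```python
import numpy as np


def simulate_terminal_count(p0, pop_sizes, final_depth, rate_a_lost,
                            rate_a_gained, prob_no_error, rng):
    """Return one simulated terminal read count of allele A under neutrality.

    p0             starting allele frequency (e.g., a draw from the posterior)
    pop_sizes      population size for each simulated generation
    final_depth    read depth at the site in the last passage
    rate_a_lost    per-read rate at which A is misread as another nucleotide
    rate_a_gained  per-read rate at which another nucleotide is misread as A
    prob_no_error  probability that a simulation has no sequencing error
    """
    p = p0
    # Neutral Wright-Fisher drift: one binomial draw per generation.
    for n_t in pop_sizes:
        n_t = int(round(n_t))
        p = rng.binomial(n_t, p) / n_t
    # Subsample the final population down to the observed read depth.
    reads_a = rng.binomial(final_depth, p)
    # Error step 1: with probability prob_no_error, leave the count untouched.
    if rng.random() < prob_no_error:
        return reads_a
    # Error step 2: binomial numbers of reads lost from and gained by allele A.
    lost = rng.binomial(reads_a, rate_a_lost)
    gained = rng.binomial(final_depth - reads_a, rate_a_gained)
    return reads_a - lost + gained


rng = np.random.default_rng(1)
pop_sizes = [10_000] * 42   # hypothetical constant demography over 42 generations
counts = np.array([
    simulate_terminal_count(0.05, pop_sizes, 20_000, 1e-3, 1e-3, 0.3, rng)
    for _ in range(2_000)   # the text uses 100,000 replicates per allele
])
print(counts.mean() / 20_000)   # average simulated terminal frequency
```

In practice the starting frequency would be redrawn from its posterior for each replicate and `pop_sizes` would come from the estimated growth curve. The resulting array of counts plays the role of the simulated neutral distribution against which an observed terminal count is compared.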