An experimentally determined evolutionary model

bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
An experimentally determined evolutionary model
dramatically improves phylogenetic fit
Jesse D. Bloom
Division of Basic Sciences and Computational Biology Program,
Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
E-mail: [email protected].
Submitted as an Article (Methods classification) to Mol Biol Evol
Abstract
All modern approaches to molecular phylogenetics require a quantitative model for how
genes evolve. Unfortunately, existing evolutionary models do not realistically represent the
site-heterogeneous selection that governs actual sequence change. Attempts to remedy this
problem have involved augmenting these models with a burgeoning number of free parameters. Here I demonstrate an alternative: experimental determination of a parameter-free
evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this
strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene
phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here
provide fundamentally new information that has the potential to transform the sensitivity
of phylogenetic and genetic analyses.
1
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Introduction
The phylogenetic analysis of gene sequences is one of the most important and widely used
computational techniques in all of biology. All modern phylogenetic algorithms require a quantitative evolutionary model that specifies the rate at which each site substitutes from one identity to another. These evolutionary models can be used to calculate the statistical likelihood
of the sequences given a particular phylogenetic tree (Felsenstein, 1973). Phylogenetic relationships are typically inferred by finding the tree that maximizes this likelihood (Felsenstein,
1981) or by combining the likelihood with a prior to compute posterior probabilities of possible
trees (Huelsenbeck et al., 2001).
Actual sequence evolution is governed by the rates at which mutations arise and the selection that subsequently acts upon them (Thorne et al., 2007; Halpern and Bruno, 1998). Unfortunately, neither of these aspects of the evolutionary process are traditionally known a priori. The
standard approach in molecular phylogenetics is therefore to assume that sites evolve independently and identically, and then construct an evolutionary model that contains free parameters
designed to represent features of mutation and selection (Goldman and Yang, 1994; Kosiol
et al., 2007; Yang, 1994; Yang et al., 2000) This approach suffers from two major problems.
First, although adding parameters enhances a model’s fit to data, the parameter values must be
estimated from the same sequences that are being analyzed phylogenetically – and so complex
models can overfit the data (Posada and Buckley, 2004). Second, even complex models do not
contain enough parameters to realistically represent selection, which is highly idiosyncratic to
specific sites within a protein. Attempts to predict site-specific selection from protein structure have had limited success (Rodrigue et al., 2009; Kleinman et al., 2010), probably because
even sophisticated computer programs cannot reliably predict the impact of mutations (Potapov
et al., 2009). Approximating selection via additional site-specific parameters can improve phylogenetic fit (Lartillot and Philippe, 2004; Le et al., 2008; Wu et al., 2013; Wang et al., 2008),
but further exacerbates the proliferation of free parameters that already plagues simpler models.
Here I demonstrate a radically different approach for constructing quantitative evolutionary models: direct experimental measurement. Specifically, using influenza nucleoprotein (NP)
as an example, I experimentally estimate mutation rates via limiting-dilution passage and experimentally estimate site-specific selection via deep mutational scanning (Fowler et al., 2010;
Araya and Fowler, 2011), a combination of high-throughput mutagenesis, functional selection,
and deep sequencing. I then show that these experimental measurements can be used to create
2
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
a parameter-free evolutionary model describes the NP gene phylogeny far better than existing
models with numerous free parameters. Finally, I discuss how the increasing availability of data
from high-throughput experimental strategies such as the one employed here has the potential
to transform analyses of genetic data by augmenting generic statistical models of evolution with
detailed molecular-level information.
Results
Components of an experimentally determined evolutionary model
A phylogenetic evolutionary model specifies the rate at which one genotype is replaced by
another. These rates of genotype substitution are determined by the underlying rates at which
new mutations arise and the subsequent selection that acts upon them (Thorne et al., 2007;
Halpern and Bruno, 1998). A standard assumption in molecular phylogenetics is that the rate of
genotype substitution can be decomposed into independent substitution rates at individual sites.
Here I make this assumption at the level of codon sites, and use Pr,xy to denote the rate that site
r substitutes from codon x to y given that the identity is already x. I further assume that it is
possible to decompose Pr,xy as
Pr,xy = Qxy × Fr,xy ,
(Equation 1)
where Qxy is the rate of mutation from x to y (assumed to be constant across sites) and Fr,xy
is the site-specific probability that a mutation from x to y will fix at site r if it arises. Both are
assumed to be constant over time.
An objection that can be raised to Equation 1 is that it does not allow interdependence of
sites (Pollock et al., 2012) – however, the motivating idea of the work here is that accurately
accounting for the site-specific fixation probabilities Fr,xy can lead to major improvements without including interdependence among sites. This idea is supported by experimental work for
the specific case of the NP protein: experiments have shown that evolutionary constraints on
NP mutations tend to be mediated by generic aspects of protein folding and stability rather then
specific interactions among sites (Gong et al., 2013), and that mutational effects on NP protein stability themselves are mostly conserved over evolutionary time (Ashenberg et al., 2013).
However, the ultimate justification for Equation 1 is provided by the results below, which show
that using this evolutionary model with experimentally determined parameters dramatically improves phylogenetic fit to real protein sequences.
3
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Given the evolutionary model described by Equation 1, the challenge is to experimentally
estimate the mutation rates Qxy and the fixation probabilities Fr,xy . In the following sections, I
describe these experiments.
Measurement of mutation rates
A general challenge in quantifying mutation rates is the difficulty of separating mutations from
the subsequent selection that acts upon them. To decouple mutation from selection, I utilized
a previously described method for growing influenza viruses that package GFP in the PB1
segment (Bloom et al., 2010). The GFP does not contribute to viral growth and so is not under
functional selection – therefore, substitutions in this gene accumulate at the mutation rate.
To drive the rapid accumulation of substitutions in the GFP gene, I performed limitingdilution mutation-accumulation experiments (Halligan and Keightley, 2009). Specifically, I
passaged 24 replicate populations of GFP-carrying influenza viruses by limiting dilution in
tissue culture, at each passage serially diluting the virus to the lowest concentration capable of
infecting target cells. Because each limiting dilution bottlenecks the population to one or a few
infectious virions, mutations fix rapidly. After 25 rounds of passage, the GFP gene was Sanger
sequenced for each replicate to identify 24 substitutions (Table 1, Table 2), for an overall rate
of 5.6 × 10−5 mutations per nucleotide per tissue-culture generation – a value similar to that
estimated previously by others using a somewhat different experimental approach (Parvin et al.,
1986). The rates of different types of mutations are in Table 3, and possess expected features
such as an elevation of transitions over transversions.
Because an observed mutation of A→G can arise from either this change on the sequenced
strand or a change of T→C on the complementary strand, then assuming that the same molecular mutation process affects both strands, there are only the six different mutation rates shown
in Table 3. Specifically, let Rm→n represent the rate at which nucleotide m mutates to n given
that the identity is already m, and let mc denote the complement of m (so for example Ac is
T). The assumption that the same molecular mutation process affects both strands means that
Rm→n = Rmc →nc . An additional empirical observation from Table 3 is that the mutation rates
for influenza are approximately symmetric, with the rate of each mutation approximately equal
to its reversal (Rm→n + Rmc →nc ≈ Rn→m + Rnc →mc ). Because it somewhat simplifies computational aspects of the subsequent phylogenetic analyses, I enforce this empirical observation of
approximately symmetric mutation rates to be exactly true by taking the rates of mutations and
4
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
their reversals to be the average of the two. With the further assumption that codon mutations
occur a single nucleotide at a time, the mutation rates Qxy from codon x to y are estimated from
the experimental data in Table 3 as
Qxy



0






4.7 × 10−5


= 1.84 × 10−5





6.0 × 10−6




3.8 × 10−6
if x and y differ by more than one nucleotide,
if x and y differ by A→G, T→C, G→A, or C→T,
if x and y differ by A→C, T→G, C→A, or G→T,
(Equation 2)
if x and y differ by A→T or T→A,
if x and y differ by G→C or C→G.
These mutation rates define the first term in the evolutionary model specified by Equation 1.
Deep mutational scanning to assess effects of mutations on NP
Estimation of the fixation probabilities Fr,xy in Equation 1 requires quantifying the effects of
all ≈ 104 possible amino-acid mutations to NP. Such large-scale assessments of mutational
effects are feasible with the advent of deep mutational scanning, a recently developed experimental strategy of high-throughput mutagenesis, selection, and deep sequencing (Fowler et al.,
2010; Araya and Fowler, 2011) that has now been applied to several genes (Fowler et al., 2010;
Melamed et al., 2013; Traxlmayr et al., 2012; Starita et al., 2013; Roscoe et al., 2013). Applying
this experimental strategy to NP requires creating large libraries of random gene mutants, using
these genes to generate pools of mutant influenza viruses which are then passaged at low multiplicity of infection to select for functional variants, and finally using Illumina sequencing to
assess the frequency of each mutation in the input mutant genes and the resulting viruses. Mutations that interfere with NP function or stability impair or ablate viral growth, since NP plays
an essential role in influenza genome packaging, replication, and transcription (Ye et al., 2006;
Portela and Digard, 2002). Such mutations will therefore be depleted in the mutant viruses
relative to the input mutant genes.
Most previous applications of deep mutational scanning have examined single-nucleotide
mutations to genes, since such mutations can easily be generated by error-prone PCR or other
nucleotide-level mutagenesis techniques. However, many amino-acid mutations are not accessible by single-nucleotide changes. I therefore used a PCR-based strategy to construct codonmutant libraries that contained multi-nucleotide (i.e. GGC→ACT) as well as single-nucleotide
5
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
(i.e. GGC→AGC) mutations. The use of codon-mutant libraries has an added benefit during
the subsequent analysis of the deep sequencing when trying to separate true mutations from
errors, since the majority (54 of 63) possible codon mutations involve multi-nucleotide changes
whereas sequencing and PCR errors generate almost exclusively single-nucleotide changes. I
used identical experimental procedures to construct two codon-mutant libraries of NP from the
wild type (WT) human H3N2 strain A/Aichi/2/1968, and two from a variant of this NP with
a single amino-acid substitution (N334H) that enhances protein stability (Gong et al., 2013;
Ashenberg et al., 2013). These codon-mutant libraries are termed WT-1, WT-2, N334H-1, and
N334H-2. Each of these four mutant libraries contained > 106 unique plasmid clones. Sanger
sequencing of 30 clones drawn roughly equally from the four libraries revealed that the number of codon mutations per clone followed a Poisson distribution with a mean of 2.7 (Figure
1). These codon mutations were distributed roughly uniformly along the gene sequence and
showed no obvious biases towards specific mutations (Figure 1). Most of the ≈ 104 unique
amino-acid mutations to NP therefore occur in numerous different clones in the four libraries,
both individually and in combination with other mutations.
To assess effects of the mutations on viral replication, the plasmid mutant libraries were
used to create pools of mutant influenza viruses by reverse genetics (Hoffmann et al., 2000).
The viruses were passaged twice in tissue culture at low multiplicity of infection to enforce
a linkage between genotype and phenotype. The NP gene was reverse-transcribed and PCRamplified from viral RNA after each passage, and similar PCR amplicons were generated from
the plasmid mutant libraries and a variety of controls designed to quantify errors associated with
sequencing, reverse transcription, and viral passage (Figure 2). The entire process outlined in
Figure 2 was performed in parallel but separately for each of the four mutant libraries (WT-1,
WT-2, N334H-1, and N334H-2) in what will be termed one experimental replicate. This entire
process of viral creation, passaging, and sequencing was then repeated independently for all
four libraries in a second experimental replicate. The two independent replicates will be termed
replicate A and replicate B.
The mutation frequencies in all sample were quantified by Illumina sequencing, using overlapping paired-end reads to reduce errors (Figure 3). Each sample produced ≈ 107 paired
reads that could be aligned to NP, providing an average of ≈ 5 × 105 calls per codon (Figure
4). Sequencing of unmutated NP plasmid revealed a low rate of errors, which were almost
exclusively single-nucleotide changes (Figure 5). As expected, the plasmid mutant libraries
contained a high frequency of single and multi-nucleotide codon mutations (Figure 5). Muta6
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
tion frequencies for unmutated RNA or viruses created from unmutated NP plasmid were only
slightly above the sequencing error rate (Figure 5), indicating that reverse-transcription and
viral replication introduced few mutations relative to the targeted mutagenesis in the plasmid
libraries. Mutation frequencies were reduced in the mutant viruses relative to the mutant plasmids used to create these viruses, particularly for nonsynonymous and stop-codon mutations
(Figure 5) – consistent with selection purging deleterious mutations. These results indicate that
the deep mutational scanning experiment successfully introduced many of the NP variants in the
plasmid mutant libraries into mutant viruses which were then subjected to purifying selection
against mutations that interfered with viral replication.
A key question is the extent to which the possible mutations were sampled in both the plasmid mutant libraries and the mutant viruses created from these plasmids. The deep mutational
scanning would not achieve its goal if only a small fraction of possible mutations are sampled
by the mutant plasmids or by the mutant viruses created from these plasmids (the latter might
be the case if there is a bottleneck during virus creation such that all viruses are generated from
only a few plasmids). Fortunately, Figure 6 shows that the sampling of mutations was quite
extensive in both the mutant plasmids and the mutant viruses. Specifically, Figure 6 suggests
that for each replicate, nearly all codon mutations were sampled numerous times in the plasmid
mutant libraries, and that over 75% of codon mutations were sampled by the mutant viruses.
Figure 6 also suggests that replicate A was technically superior to replicate B in the thoroughness with which mutations were sampled by the mutant viruses. Because most amino acids are
encoded by multiple codons, the fraction of amino-acid mutations sampled in each replicate
is even higher than the > 75% of sampled codon mutations. So while the experiments may
not have exhaustively examined every possible codon mutation, the thoroughness of sampling
is certainly sufficient to make the sort of statistical inferences about mutational effects that are
necessary to construct a quantitative evolutionary model.
Inference of site-specific amino-acid preferences
Qualitatively, it is obvious that changes in mutation frequencies between the plasmid mutant
libraries and the resulting mutant viruses reflect selection. But it is less obvious how to quantitatively analyze this information. Selection acts on the full genomes of all viruses in the
population. In contrast, the experiments only measure site-independent mutation frequencies
averaged over the population. Here I have chosen to analyze this data by assuming that each
7
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
site has an inherent preference for each possible amino acid. The motivation for envisioning
site-heterogenous but site-independent amino-acid preferences comes from experiments suggesting that the dominant constraint on mutations that fix during NP evolution relates to protein
stability (Gong et al., 2013), and that mutational effects on stability tend to be conserved in a
site-independent manner (Ashenberg et al., 2013). Because the experiments generally examine
each mutation in combination with several other mutations (the average clone has between 2 and
3 codon mutations; Figure 1), the site-specific amino-acid preferences are not simply selection
coefficients for specific mutations. Instead, they reflect the effect of each mutation averaged
over a set of genetic backgrounds.
P
Specifically, let πr,a denote the preference of site r for amino-acid a, with
πr,a = 1.
a
Figure 5 indicates that most observed mutations are the result of the desired codon mutagenesis,
but that there is also a low rate of apparent mutations arising from Illumina sequencing errors
and reverse transcription. The expected frequency fr,x of mutant codon x at site r in the mutant
viruses is related to the preference πr,A(x) for its encoded amino-acid A (x) by fr,x = r,x +ρr,x +
µ ×πr,A(x)
P r,x
where µr,x is the frequency that site r is mutagenized to codon x in the plasmid
µr,y ×π
y
r,A(y)
mutant library, r,x is the frequency the site is erroneously identified as x during sequencing,
ρr,x is the frequency the site is mutated to x during reverse transcription, y is summed over
all codons, and the probability that a site experiences multiple mutations or errors in the same
clone is taken to be negligibly small. The observed codon counts are multinomially distributed
around these expected frequencies, so by placing a symmetric Dirichlet-distribution prior over
πr,a and jointly estimating the error (r,x and ρr,x ) and mutation (µr,x ) rates from the appropriate
samples in Figure 2, it is possible to infer the posterior mean for all amino-acid preferences by
MCMC (see the Methods section).
A basic check on the consistency of the overall experimental and computational approach
is to compare the amino-acid preferences inferred from different replicates, or different viral
passages of the same replicate. Figure 7A,B shows that the preferences inferred from the first
and second viral passages within each replicate are extremely similar, indicating that most selection occurs during initial viral creation and passage and that technical variation (preparation
of samples, stochasticity in sequencing, etc) has little impact. A more crucial comparison is
between the preferences inferred from the two independent experimental replicates. This comparison (Figure 7C) shows that preferences from the independent replicates are substantially
but less perfectly correlated – probably the imperfect correlation is because the mutant viruses
created by reverse genetics independently in each replicate are different incomplete samples
8
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
of the many clones in the plasmid mutant libraries. Nonetheless, the substantial correlation
between replicates shows that the sampling is sufficient to clearly reveal inherent preferences
despite these experimental imperfections. Presumably better inferences can be made by aggregating data via averaging of the preferences from both replicates. Figure 7D shows such average
preferences from the first passage of both replicates. These preferences are consistent with existing knowledge about NP function and stability. For example, at the conserved residues in
NP’s RNA binding interface (Ye et al., 2006), the amino acids found in natural sequences tend
to be the ones with the highest preferences (Table 4). Similarly, for mutations that have been
experimentally characterized as having large effects on NP protein stability (Gong et al., 2013;
Ashenberg et al., 2013), the stabilizing amino acid has the higher preference (Table 5).
The experimentally determined evolutionary model
The final step is to use the amino-acid preferences to estimate the fixation probabilities Fr,xy ,
which can then be combined with the mutation rates to create a fully experimentally determined evolutionary model. Intuitively, it is obvious that the amino-acid preferences provide
information about the fixation probabilities. For instance, it seems reasonable to expect that
a mutation from x to y at site r will be more likely to fix (relatively larger value of Fr,xy ) if
amino-acid A (y) is preferred to A (x) at this this site (if πr,A(y) > πr,A(x) ) and less likely to fix
if πr,A(y) < πr,A(x) . However, the exact relationship between the amino-acid preferences and
the fixation probabilities is unclear. A rigorous derivation would require knowledge of unknown
and probably unmeasurable population-genetics parameters for both the deep mutational scanning experiment and the naturally evolving populations that gave rise to the sequences being
analyzed phylogenetically. Instead, I provide two heuristic relationships. Both relationships
satisfy detailed balance (reversibility), such that πr,A(x) × Fr,xy = πr,A(y) × Fr,yx , meaning that
Fr,xy defines a Markov process with πr,A(x) proportional to its stationary state when all aminoacid interchanges are equally probable. Other relationships between the amino-acid preferences
and the fixation probabilities are certainly possible, and so this paper provides the raw data to
enable others to explore such relationships ad naseum.
It is helpful to first consider what the amino-acid preferences values actually represent. Most
NP variants in the deep mutational scanning libraries contain multiple mutations, so the aminoacid preferences represent the mutational effects averaged over the nearby genetic neighborhood
of the parent protein. Therefore, one interpretation is that a preference is proportional to the
9
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
fraction of genetic backgrounds in which a mutation is tolerated, such that a mutation from x
to y is always tolerated if πr,A(y) > πr,A(x) , but is only sometimes tolerated if πr,A(x) < πr,A(y) .
In this interpretation, there should be strong selection during initial viral growth depending
on whether the mutation is tolerated in the particular genetic background in which it occurs,
and then there should be little further enrichment or depletion during subsequent viral passages
– loosely consistent with Figure 7A,B, which shows that the amino-acid preferences inferred
after two viral passages are very similar to those inferred after one passage. Note that this
interpretation can be related to the the selection-threshold evolutionary dynamics described in
(Bloom et al., 2007). An equation that describes this scenario is
Fr,xy

1
= πr,A(y)

πr,A(x)
if πr,A(y) ≥ πr,A(x)
otherwise.
(Equation 3)
This equation is equivalent to the Metropolis acceptance criterion (Metropolis et al., 1953).
An alternative interpretation is that πr,A(x) reflects the selection coefficient for the aminoacid A (x) at site r. In this case, if the πr,a values represent the expected amino-acid equilibrium frequencies in a hypothetical evolving population in which all amino-acid interchanges
are equally likely, and assuming (probably unrealistically) that this hypothetical population and
the actual population in which NP evolves are in the weak-mutation limit and have identical
constant effective population sizes, then Halpern and Bruno (1998) derive
Fr,xy



1 πr,A(y) = ln πr,A(x)


 1− πr,A(x)
πr,A(y)
if πr,A(x) = πr,A(y)
otherwise.
(Equation 4)
Given one of these definitions for the fixation probabilities and the mutation rates defined
by Equation 2, the experimentally determined evolutionary model is defined by Equation 1.
For the mutation rates and fixation probabilities used here, this evolutionary model defines a
stochastic process with a unique stationary state for each site r. These stationary states give the
expected amino-acid frequencies at evolutionary equilibrium. These evolutionary equilibrium
frequencies are shown in Figure 8, and are somewhat different than the amino-acid preferences
since they also depend on the structure of the genetic code (and the mutation rates when these are
non-symmetric). For example, if arginine and lysine have equal preferences at a site, arginine
10
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
will be more evolutionarily abundant since it has more codons.
Phylogenetic analyses
The experimentally determined evolutionary model can be used to compute phylogenetic likelihoods, thereby enabling its comparison to existing models. I used codonPhyML (Gil et al.,
2013) to infer maximum-likelihood trees (Figure 9) for NP sequences from human influenza using the Goldman-Yang (GY94) (Goldman and Yang, 1994) and the Kosiol et al. (KOSI07) (Kosiol et al., 2007) codon substitution models. In the simplest form, GY94 contains 11 free parameters (9 equilibrium frequencies plus transition-transversion and synonymous-nonsynonymous
ratios), while KOSI07 contains 62 parameters (60 frequencies plus transition-transversion and
synonymous-nonsynonymous ratios). More complex variants add parameters allowing variation in substitution rate (Yang, 1994) or synonymous-nonsynonymous ratio among sites or
lineages (Yang et al., 2000; Yang and Nielsen, 1998). For all these models, HYPHY (Pond
et al., 2005) was used to calculate the likelihood after optimizing branch lengths for the tree
topologies inferred by codonPhyML.
Comparison of these likelihoods strikingly validates the superiority of the experimentally
determined model (Table 6 and Table 7). Adding free parameters generally improves a model’s
fit to data, and this is true within GY94 and KOSI07. But the parameter-free experimentally
determined evolutionary model describes the sequence phylogeny with a likelihood far greater
than even the most highly parameterized GY94 and KOSI07 variants. Interpreting the aminoacid preferences as the fraction of genetic backgrounds that tolerate a mutation (Equation 3)
outperforms interpreting them as selection coefficients (Equation 4), although either interpretation yields evolutionary models for NP far superior to GY94 or KOSI07. Comparison using
Aikake information content (AIC) to penalize parameters (Posada and Buckley, 2004) even
more emphatically highlights the superiority of the experimentally determined models.
There is a clear correlation between the quality and volume of experimental data and the
phylogenetic fit: models from individual experimental replicates give lower likelihoods than
both replicates combined, and the technically superior replicate A (Figure 6) gives a better likelihood than replicate B (Table 6 and Table 7). The superiority of the experimentally determined
model is due to measurement of site-specific amino-acid preferences: randomizing preferences
among sites gives likelihoods far worse than any other model.
11
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Discussion
These results clearly establish that an experimentally determined evolutionary model is far superior to existing models for describing the phylogeny of NP gene sequences. The extent of
this superiority is striking – the parameter-free evolutionary model outperforms even the most
highly parameterized existing model by over 400 log-likelihood units before correcting for the
differences in the number of parameters (Table 6). The reason for this superiority is easy to
understand: proteins have strong and fairly conserved preferences for specific amino acids at
different sites (Ashenberg et al., 2013), but these site-specific preferences are ignored by most
existing phylogenetic models. Inspection of the overlaid bars in Figure 7D illustrates the inadequacy of trying to capture these preferences simply by classifying sites based on gross features
of protein structure (Thorne et al., 1996; Goldman et al., 1998) – the site-specific amino-acid
preferences are not simply related to secondary structure or solvent accessibility. The complexity of the preferences in Figure 7D also show the limitations of attempting to infer amino-acid
preference parameters for a limited number of site classes from sequence data (Lartillot and
Philippe, 2004; Le et al., 2008; Wu et al., 2013; Wang et al., 2008), as it is clear that each site
is unique. Direct experimental measurement therefore represents a highly attractive method for
determining the idiosyncratic constraints that affect the evolution of each site in a gene.
Another appealing aspect of an experimentally determined evolutionary model is interpretability. One of the more frustrating aspects of existing evolutionary models is the inability
to assign meaningful interpretations to many of their proliferating free parameters. For instance,
many models infer a shape parameter describing a gamma-distribution of rates – the biological
meaning of such a parameter at the molecular level is completely unclear. Similarly, the equilibrium frequency parameters used by most existing models reflect some unknown combination
of mutation and selection. On the other hand, all aspects of the parameter-free experimentally
determined evolutionary model described here can be related to direct measurements. So even
if such a model were eventually augmented with a few free parameters, this could be done in a
way that allowed these parameters to be directly interpreted in terms of the molecular processes
of biology and evolution.
The major drawback of the experimentally determined evolutionary model described here is
its lack of generality. While this model is clearly superior for influenza NP, it is entirely unsuitable for other genes. At first blush, it might seem that the arduous experiments described here
provide data that is unlikely to ever become available for most situations of interest. However, it
12
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
is worth remembering that today’s arduous experiment frequently becomes routine in just a few
years. For example, the very gene sequences that are the subjects of molecular phylogenetics
were once rare pieces of data – now such sequences are so abundant that they easily overwhelm
modern computers. The experimental ease of the deep mutational scanning approach used here
is on a comparable trajectory: similar approaches have already been applied to several proteins (Fowler et al., 2010; Melamed et al., 2013; Traxlmayr et al., 2012; Starita et al., 2013;
Roscoe et al., 2013), and there continue to be rapid improvements in techniques for mutagenesis (Firnberg and Ostermeier, 2012; Jain and Varadarajan, 2014) and sequencing (Hiatt et al.,
2010; Schmitt et al., 2012; Lou et al., 2013). It is therefore especially encouraging that the
phylogenetic fit of the NP evolutionary model improves with the quality and volume of experimental data from which it is derived (Table 6). The increasing availability and quality of similar
high-throughput data for a vast range of proteins has the potential to transform phylogenetic
analyses by greatly increasing the accuracy of evolutionary models, while at the same time replacing a plethora of free parameters with experimentally measured quantities that can be given
clear biological and evolutionary interpretations.
13
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Methods
Contents of methods
Availability of data and computer code . . . . . . . . . .
Experimental measurement of mutation rates . . . . . . .
Construction of NP codon-mutant libraries . . . . . . . .
Viral growth and passage . . . . . . . . . . . . . . . . . .
Sample preparation and Illumina sequencing . . . . . . .
Read alignment and quantification of mutation frequencies
Inference of the amino-acid preferences . . . . . . . . . .
Phylogenetic analyses . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
14
15
16
19
21
24
25
31
Availability of data and computer code
• The Illumina sequencing data are available at the SRA under accession SRP036064 at
http://www.ncbi.nlm.nih.gov/sra/?term=SRP036064.
• A detailed description of the computational process used to analyze the sequencing data
and infer the amino-acid preferences is at http://jbloom.github.io/mapmuts/
example_2013Analysis_Influenza_NP_Aichi68.html. The source code
itself is at https://github.com/jbloom/mapmuts. NOTE: the link to the description is accessible now; access to the source code itself will be made publicly available
upon publication of the paper. If you are a reviewer and would like to see the source code,
please contact the editor to relay the request for access to the author.
• A detailed description of the computational process used for the phylogenetic analyses is
available at http://jbloom.github.io/phyloExpCM/example_2013Analysis_
Influenza_NP_Human_1918_Descended.html. The source code itself is at
https://github.com/jbloom/phyloExpCM. NOTE: the link to the description
is accessible now; access to the source code itself will be made publicly available upon
publication of the paper. If you are a reviewer and would like to see the source code,
please contact the editor to relay the request for access to the author.
14
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Experimental measurement of mutation rates
To measure mutation rates, I generated GFP-carrying viruses with all genes derived from A/WSN/1933
(H1N1) by reverse genetics as described previously (Bloom et al., 2010). These viruses were
repeatedly passaged at limiting dilution in MDCK-SIAT1-CMV-PB1 cells (Bloom et al., 2010)
using what will be referred to as low serum media (Opti-MEM I with 0.5% heat-inactivated
fetal bovine serum, 0.3% BSA, 100 U/ml penicillin, 100 µg/ml streptomycin, and 100 µg/ml
calcium chloride) – a moderate serum concentration was retained and no trypsin was added
because viruses with the WSN HA and NA are trypsin independent (Goto and Kawaoka, 1998).
These passages were performed for 27 replicate populations. For each passage, a 100 µl volume containing the equivalent of 2 µl of virus collection was added to the first row of a 96-well
plate. The virus was then serially diluted 1:5 down the plate such that at the conclusion of the
dilutions, each well contained 80 µl of virus dilution, and the final row contained the equivalent
of (0.8 × 2 µl) /57 ≈ 2 × 10−5 µl of the original virus collection. MDCK-SIAT1-CMV-PB1
cells were then added to each well in a 50 µl volume containing 2.5 × 103 cells. The plates
were grown for approximately 80 hours, and wells were examined for cytopathic effect indicative of viral growth. Most commonly the last well with clear cytopathic effect was in row
E or F (corresponding to an infection volume of between (0.8 × 2 µl) /54 ≈ 3 × 10−3 and
(0.8 × 2 µl) /55 ≈ 5 × 10−4 µl, although this varied somewhat among the replicates and passages. The last well with cytopathic effect was collected and used as the parent population for
the next round of limiting-dilution passage.
After 25 limiting-dilution passages, 10 of the 27 viral populations no longer caused any
visible GFP expression in the cells in which they caused cytopathic effect, indicating fixation
of a mutation that ablated GFP fluorescence. The 17 remaining populations all caused fluorescence in infected cells, although in some cases the intensity was visibly reduced – these
populations therefore must have retained at least a partially functional GFP. Total RNA was
purified from each viral population, the PB1 segment was reverse-transcribed using the primers
CATGATCGTCTCGTATTAGTAGAAACAAGGCATTTTTTCATGAAGGACAAGC and CATGATCG
TCTCAGGGAGCGAAAGCAGGCAAACCATTTGATTGG, and the reverse-transcribed cDNA was
amplified by conventional PCR using the same primers. For 22 of the 27 replicate viral populations, this process amplified an insert with the length expected for the full GFP-carrying PB1
segment. For 2 of the replicates, this amplified inserts between 0.4 and 0.5 kb shorter than the
expected length, suggesting an internal deletion in part of the segment. For 3 replicates, this
failed to amplify any insert, suggesting total loss of the GFP-carrying PB1 segment, a very large
15
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
internal deletion, or rearrangement that rendered the reverse-transcription primers ineffective.
For the 24 replicates from which an insert could be amplified, the GFP coding region of the
PCR product was Sanger sequenced to determine the consensus sequence. The results are in
Table 1 and Table 2.
To estimate Rm→n , it is necessary to normalize by the nucleotide composition of the GFP
gene. The numbers of each nucleotide in this gene are: NA = 175, NT = 103, NC = 241, and
NG = 201. Given that the counts in Table 2 come after 25 passages of 24 replicates:
Rm→n = Rmc →nc =
1
Nm→n + Nmc →nc + C
×
24 × 25 × 2
Nm + Nm c
(Equation 5)
where C is a pseudocount that is added to the observed counts to avoid estimating rates of zero
and Nm→n is the number of observed mutations from m to n in Table 2. The values of Rm→n
estimated from Equation 5 give the probability that a nucleotide that is already m will mutate
to n in a single tissue-culture generation. Using a value of C = 1 (a pseudocount of one) gives
the rates in Table 3.
Construction of NP codon-mutant libraries
The goal was to construct a mutant library with an average of two to three random codon mutations per gene. Most techniques for creating mutant libraries of full-length genes, such as
error-prone PCR (Cirino et al., 2003) and chemical mutagenesis (Neylon, 2004), introduce mutations at the nucleotide level, meaning that codon substitutions involving multiple nucleotide
changes occur at a negligible rate. There are established methods for mutating specific sites
at the codon level with synthetic oligonucleotides that contain triplets of randomized NNN nucleotides (Georgescu et al., 2003). However, these oligo-based codon mutagenesis techniques
are typically only suitable for mutating a small number of sites. Recently, several groups have
developed strategies for introducing codon mutations along the lengths of entire genes (Firnberg and Ostermeier, 2012; Jain and Varadarajan, 2014, J. Kitzman and J. Shendure - personal
communication). Most of these strategies are designed to create exactly one codon mutation per
gene. For the purpose of my experiments, it was actually desirable to introduce a distribution
of around one to four codon mutations per gene to examine the average effects of mutations
in a variety of closely related genetic backgrounds. Therefore, I devised a codon-mutagenesis
protocol specifically for this purpose.
This technique involved iterative rounds of low-cycle PCR with pools of mutagenic syn16
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
thetic oligonucleotides that each contain a randomized NNN triplet at a specific codon site. Two
replicate libraries each of the wildtype and N334H variants of the Aichi/1968 NP were prepared
in full biological duplicate, beginning each with independent preps of the plasmid templates
pHWAichi68-NP and pHWAichi68-NP-N334H. The sequences of the NP genes in these plasmids are provided in (Gong et al., 2013). To avoid cross-contamination, all purification steps
used an independent gel for each sample, with the relevant equipment thoroughly washed to
remove residual DNA.
First, for each codon except for that encoding the initiating methionine in the 498-residue
NP gene, I designed an oligonucleotide that contained a randomized NNN nucleotide triplet preceded by the 16 nucleotides upstream of that codon in the NP gene and followed by the 16
nucleotides downstream of that codon in the NP gene. I ordered these oligonucleotides in 96well plate format from Integrated DNA Technologies, and combined them in equimolar quantities to create the forward-mutagenesis primer pool. I also designed and ordered the reverse
complement of each of these oligonucleotides, and combined them in equimolar quantities to
create the reverse-mutagenesis pool. The primers for the N334H variants differed only for those
that overlapped the N334H codon. I also designed end primers that annealed to the termini of
the NP sequence and contained sites appropriate for BsmBI cloning into the influenza reversegenetics plasmid pHW2000 (Hoffmann et al., 2000). These primers are 5’-BsmBI-Aichi68-NP
(catgatcgtctcagggagcaaaagcagggtagataatcactcacag) and 3’-BsmBI-Aichi68NP (catgatcgtctcgtattagtagaaacaagggtatttttcttta).
I set up PCR reactions that contained 1 µl of 10 ng/µl template pHWAichi68-NP plasmid (Gong et al., 2013), 25 µl of 2X KOD Hot Start Master Mix (product number 71842, EMD
Millipore), 1.5 µl each of 10 µM solutions of the end primers 5’-BsmBI-Aichi68-NP and 3’BsmBI-Aichi68-NP, and 21 µl of water. I used the following PCR program (referred to as the
amplicon PCR program in the remainder of this paper):
1.
2.
3.
4.
5.
6.
7.
95 ◦ C for 2 minutes
95◦ C for 20 seconds
70◦ C for 1 second
50◦ C for 30 seconds cooling to 50◦ C at 0.5◦ C per second.
70◦ C for 40 seconds
Repeat steps 2 through 5 for 24 additional cycles
Hold 4◦ C
The PCR products were purified over agarose gels using ZymoClean columns (product number
17
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
D4002, Zymo Research) and used as templates for the initial codon mutagenesis fragment PCR.
Two fragment PCR reactions were run for each template. The forward-fragment reactions
contained 15 µl of 2X KOD Hot Start Master Mix, 2 µl of the forward mutagenesis primer pool
at a total oligonucleotide concentration of 4.5 µM, 2 µl of 4.5 µM 3’-BsmBI-Aichi68-NP, 4 µl
of 3 ng/µl of the aforementioned gel-purified linear PCR product template, and 7 µl of water.
The reverse-fragment reactions were identical except that the reverse mutagenesis pool was
substituted for the forward mutagenesis pool, and that 5’-BsmBI-Aichi68-NP was substituted
for 3’-BsmBI-Aichi68-NP. The PCR program for these fragment reactions was identical to the
amplicon PCR program except that it utilized a total of 7 rather than 25 thermal cycles.
The products from the fragment PCR reactions were diluted 1:4 in water. These dilutions
were then used for the joining PCR reactions, which contained 15 µl of 2X KOD Hot Start
Master Mix, 4 µl of the 1:4 dilution of the forward-fragment reaction, 4 µl of the 1:4 dilution of
the reverse-fragment reaction, 2 µl of 4.5 µM 5’-BsmBI-Aichi68-NP, 2 µl of 4.5 µM 3’-BsmBIAichi68-NP, and 3 µl of water. The PCR program for these joining reactions was identical to
the amplicon PCR program except that it utilized a total of 20 rather than 25 thermal cycles.
The products from these joining PCRs were purified over agarose gels.
The purified products of the first joining PCR reactions were used as templates for a second round of fragment reactions followed by joining PCRs. These second-round products were
used as templates for a third round. The third-round products were purified over agarose gels, digested with BsmBI (product number R0580L, New England Biolabs), and ligated into a dephosphorylated (Antarctic Phosphatase, product number M0289L, New England Biolabs) BsmBI
digest of pHW2000 (Hoffmann et al., 2000) using T4 DNA ligase. The ligations were purified
using ZymoClean columns, electroporated into ElectroMAX DH10B T1 phage-resistant competent cells (product number 12033-015, Invitrogen) and plated on LB plates supplemented
with 100 µg/ml of ampicillin. These transformations yielded between 400,000 and 800,000
unique transformants per plate, as judged by plating a 1:4,000 dilution of the transformations
on a second set of plates. Transformation of a parallel no-insert control ligation yielded approximately 50-fold fewer colonies, indicating that self ligation of pHW2000 only accounts for a
small fraction of the transformants. For each library, I performed three transformations, grew
the plates overnight, and then scraped the colonies into liquid LB supplemented with ampicillin
and mini-prepped several hours later to yield the plasmid mutant libraries. These libraries each
contained in excess of 106 unique transformants, most of which will be unique codon mutants
of the NP gene.
18
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
I sequenced the NP gene for 30 individual clones drawn from the four mutant libraries. As
shown in Figure 1, the number of mutations per clone was approximately Poisson distributed
and the mutations occurred uniformly along the primary sequence. If all codon mutations are
made with equal probability, 9/63 of the mutations should be single-nucleotide changes, 27/63
should be two-nucleotide changes, and 27/63 should be three-nucleotide changes. This is approximately what was observed in the Sanger-sequenced clones. The nucleotide composition
of the mutated codons was roughly uniform, and there was no tendency for clustering of multiple mutations in primary sequence. The full results of the Sanger sequencing and analysis are
https://github.com/jbloom/SangerMutantLibraryAnalysis/tree/v0.1.
The results of this Sanger sequencing are compatible with the mutation frequencies obtained
from deep sequencing the mutDNA samples after subtracting off the sequencing error rate estimated from the DNA samples (Figure 5), especially considering that the statistics from the
Sanger sequencing are subject to sampling error due to the limited number of clones analyzed.
Viral growth and passage
As described in the main text, two independent replicates of viral growth and passage were
performed (replicate A and replicate B). The experimental procedures were mostly identical between replicates, but there were a few small differences. In the actual experimental chronology,
replicate B was performed first, and the modifications in replicate A were designed to improve
the sampling of the mutant plasmids by the created mutant viruses. These modifications may
be the reason why replicate A slightly outperforms replicate B by two objective measures: the
viruses more completely sample the codon mutations (Figure 6), and the evolutionary model
derived solely from replicate A gives a higher likelihood than the evolutionary model derived
solely from replicate B (Table 6 and Table 7).
For replicate B, I used reverse genetics to rescue viruses carrying the Aichi/1968 NP or one
of its derivatives, PB2 and PA from the A/Nanchang/933/1995 (H3N2), a PB1 gene segment
encoding GFP, and HA / NA / M / NS from A/WSN/1933 (H1N1) strain. With the exception of
the variants of NP used, these viruses are identical to those described in (Gong et al., 2013), and
were rescued by reverse genetics in 293T-CMV-Nan95-PB1 and MDCK-SIAT1-CMV-Nan95PB1 cells as described in that reference. The previous section describes four NP codon-mutant
libraries, two of the wildtype Aichi/1968 gene (WT-1 and WT-2) and two of the N334H variant (N334H-1 and N334H-2). I grew mutant viruses from all four mutant libraries, and four
19
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
paired unmutated viruses from independent preps of the parent plasmids. A major goal was to
maintain diversity during viral creation by reverse genetics – the experiment would obviously
be undermined if most of the rescued viruses derived from a small number of transfected plasmids. I therefore performed the reverse genetics in 15 cm tissue-culture dishes to maximize
the number of transfected cells. Specifically, 15 cm dishes were seeded with 107 293T-CMVNan95-PB1 cells in D10 media (DMEM with 10% heat-inactivated fetal bovine serum, 2 mM
L-glutamine, 100 U/ml penicillin, and 100 µg/ml streptomycin). At 20 hours post-seeding,
the dishes were transfected with 2.8 µg of each of the eight reverse-genetics plasmids. At 20
hours post-transfection, about 20% of the cells expressed GFP (indicating transcription by the
viral polymerase of the GFP encoded by pHH-PB1flank-eGFP), suggesting that many unique
cells were transfected. At 20 hours post-transfection, the media was changed to the low serum
media described above. At 78 hours post-transfection, the viral supernatants were collected,
clarified by centrifugation at 2000xg for 5 minutes, and stored at 4◦ C. The viruses were titered
by flow cytometry as described previously (Gong et al., 2013) – titers were between 1 × 103 and
4 × 103 infectious particles per µl. A control lacking the NP gene yielded no infectious virus as
expected.
The virus was then passaged in MDCK-SIAT1-CMV-Nan95-PB1 cells. These cells were
seeded into 15 cm dishes, and when they had reached a density of 107 per plate they were
infected with 106 infectious particles (MOI of 0.1) of the transfectant viruses in low serum
media. After 18 hours, 30-50% of the cells were green as judged by microscopy, indicating viral
spread. At 40 hours post-transfection, 100% of the cells were green and many showed clear
signs of cytopathic effect. At this time the viral supernatants were again collected, clarified,
and stored at 4◦ C. NP cDNA isolated from these viruses was the source the deep-sequencing
samples virus-p1 and mutvirus-p1 in Figure 2. The virus was then passaged a second time
exactly as before (again using an MOI of 0.1). NP cDNA from these twice-passaged viruses
constituted the source for the samples virus-p2 and mutvirus-p2 in Figure 2.
For replicate A, all viruses (both the four mutant viruses and the paired unmutated controls)
were re-grown independently from the same plasmids preps used for replicate B. The experimental process was identical to that used for replicate B except for the following:
• Standard influenza viruses (rather than the GFP-carrying variants) were used, so plasmid
pHW-Nan95-PB1 (Gong et al., 2013) was substituted for pHH-PB1flank-eGFP during reverse genetics, and 293T and MDCK-SIAT1 cells were substituted for the PB1-expressing
variants.
20
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
• Rather than creating the viruses by transfecting a single 15-cm dish, each of the eight
virus samples was created by transfecting two 12-well dishes, with the dishes seeded at
3 × 105 293T and 5 × 104 MDCK-SIAT1 cells prior to transfection. The passaging was
then done in four 10 cm dishes for each sample, with the dishes seeded at 4×106 MDCKSIAT1 cells 12-14 hours prior to infection. The passaging was still done at an MOI of
0.1. These modifications were designed to increase diversity in the viral population.
• The viruses were titered by TCID50 rather than flow cytometry.
Sample preparation and Illumina sequencing
For each sample, a PCR amplicon was created to serve as the template for Illumina sequencing.
The steps used to generate the PCR amplicon for each of the seven sample types (Figure 2) are
listed below. Once the PCR template was generated, for all samples the PCR amplicon was
created using the amplicon PCR program described above in 50 µl reactions consisting of 25 µl
of 2X KOD Hot Start Master Mix, 1.5 µl each of 10 µM of 5’-BsmBI-Aichi68-NP and 3’-BsmBIAichi68-NP, the indicated template, and ultrapure water. A small amount of each PCR reaction
was run on an analytical agarose gel to confirm the desired band. The remainder was then run
on its own agarose gel without any ladder (to avoid contamination) after carefully cleaning the
gel rig and all related equipment. The amplicons were excised from the gels, purified over
ZymoClean columns, and analyzed using a NanoDrop to ensure that the absorbance at 260 nm
was at least 1.8 times that at 230 nm and 280 nm. The templates were as follows:
• DNA: The templates for these amplicons were 10 ng of the unmutated independent plasmid preps used to create the codon mutant libraries.
• mutDNA: The templates for these amplicons were 10 ng of the plasmid mutant libraries.
• RNA: This amplicon quantifies the net error rate of transcription and reverse transcription.
Because the viral RNA is initially transcribed from the reverse-genetics plasmids by RNA
polymerase I but the bidirectional reverse-genetics plasmids direct transcription of RNA
by both RNA polymerases I and II (Hoffmann et al., 2000), the RNA templates for these
amplicons were transcribed from plasmids derived from pHH21 (Neumann et al., 1999),
which only directs transcription by RNA polymerase I. The unmutated WT and N334H
NP genes were cloned into this plasmid to create pHH-Aichi68-NP and pHH-Aichi68NP-N334H. Independent preparations of these plasmids were transfected into 293T cells,
21
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
transfecting 2 µg of plasmid into 5 × 105 cells in 6-well dishes. After 32 hours, total RNA
was isolated using Qiagen RNeasy columns and treated with the Ambion TURBO DNAfree kit (Applied Biosystems AM1907) to remove residual plasmid DNA. This RNA was
used as a template for reverse transcription with AccuScript (Agilent 200820) using the
primers 5’-BsmBI-Aichi68-NP and 3’-BsmBI-Aichi68-NP. The resulting cDNA was quantified by qPCR specific for NP (see below), which showed high levels of NP cDNA in the
reverse-transcription reactions but undetectable levels in control reactions lacking the reverse transcriptase, indicating that residual plasmid DNA had been successfully removed.
A volume of cDNA that contained at least 2 × 106 NP cDNA molecules (as quantified
by qPCR) was used as template for the amplicon PCR reaction. Control PCR reactions
using equivalent volumes of template from the no reverse-transcriptase control reactions
yielded no product.
• virus-p1: This amplicon was derived from virus created from the unmutated plasmid and
collected at the end of the first passage. Clarified virus supernatant was ultracentrifuged
at 64,000 x g for 1.5 hours at 4 ◦ C, and the supernatant was decanted. Total RNA was then
isolated from the viral pellet using a Qiagen RNeasy kit. This RNA was used as a template
for reverse transcription with AccuScript using the primers 5’-BsmBI-Aichi68-NP and 3’BsmBI-Aichi68-NP. The resulting cDNA was quantified by qPCR, which showed high
levels of NP cDNA in the reverse-transcription reactions but undetectable levels in control
reactions lacking the reverse transcriptase. A volume of cDNA that contained at least 107
NP cDNA molecules (as quantified by qPCR) was used as template for the amplicon
PCR reaction. Control PCR reactions using equivalent volumes of template from the no
reverse-transcriptase control reactions yielded no product.
• virus-p2, mutvirus-p1, mutvirus-p2: These amplicons were created as for the virus-p1
amplicons, but used the appropriate virus as the initial template as outlined in Figure 2.
An important note: it was found that the use of relatively new RNeasy kits with β-mercaptoethanol
(a reducing agent) freshly added per the manufacturer’s instructions was necessary to avoid what
appeared to be oxidative damage to purified RNA.
The overall experiment only makes sense if the sequenced NP genes derive from a large diversity of initial template molecules. Therefore, qPCR was used to quantify the molecules produced by reverse transcription to ensure that a sufficiently large number were used as PCR templates to create the amplicons. The qPCR primers were 5’-Aichi68-NP-for (gcaacagctggtctgactcaca)
22
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
and 3’-Aichi68-NP-rev (tccatgccggtgcgaacaag). The qPCR reactions were performed
using the SYBR Green PCR Master Mix (Applied Biosystems 4309155) following the manufacturer’s instructions. Linear NP PCR-ed from the pHWAichi68-NP plasmid was used as a
quantification standard – the use of a linear standard is important, since amplification efficiencies differ for linear and circular templates (Hou et al., 2010). The standard curves were linear
with respect to the amount of NP standard over the range from 102 to 109 NP molecules. These
standard curves were used to determine the absolute number of NP cDNA molecules after reverse transcription. Note that the use of only 25 thermal cycles in the amplicon PCR program
provides a second check that there are a substantial number of template molecules, as this moderate number of thermal cycles will not lead to sufficient product if there are only a few template
molecules.
In order to allow the Illumina sequencing inserts to be read in both directions by paired-end
50 nt reads (Figure 3), it was necessary to us an Illumina library-prep protocol that created NP
inserts that were roughly 50 nt in length. This was done via a modification of the Illumina Nextera protocol. First, concentrations of the PCR amplicons were determined using PicoGreen
(Invitrogen P7859). These amplicons were used as input to the Illumina Nextera DNA Sample
Preparation kit (Illumina FC-121-1031). The manufacturer’s protocol for the tagmentation step
was modified to use 5-fold less input DNA (10 ng rather than 50 ng) and two-fold more tagmentation enzyme (10 µl rather than 5 µl), and the incubation at 55 ◦ C was doubled from 5 minutes
to 10 minutes. Samples were barcoded using the Nextera Index Kit for 96-indices (Illumina FC121-1012). For index 1, the barcoding was: DNA with N701, RNA with N702, mutDNA with
N703, virus-p1 with N704, mutvirus-p1 with N705, virus-p2 with N706, and mutvirus-p2
with N707. After completion of the Nextera PCR, the samples were subjected to a ZymoClean
purification rather than the bead cleanup step specified in the Nextera protocol. The size distribution of these purified PCR products was analyzed using an Agilent 200 TapeStation Instrument. If the NP sequencing insert is exactly 50 nt in size, then the product of the Nextera PCR
should be 186 nt in length after accounting for the addition of the Nextera adaptors. The actual
size distribution was peaked close to this value. The ZymoClean-purified PCR products were
quantified using PicoGreen and combined in equal amounts into pools: a WT-1 pool of the seven
samples for that library, a WT-2 pool of the seven samples for that library, etc. These pools were
subjected to further size selection by running them on a 4% agarose gel versus a custom ladder
containing 171 and 196 nt bands created by PCR from a GFP template using the forward primer
gcacggggccgtcgccg and the reverse primers tggggcacaagctggagtacaac (for the
23
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
171 nt band) and gacttcaaggaggacggcaacatcc (for the 196 nt band). The gel slice for
the sample pools corresponding to sizes between 171 and 196 nt was excised, and purified using
a ZymoClean column. A separate clean gel was run for each pool to avoid cross-contamination.
Library QC and cluster optimization were performed using Agilent Technologies qPCR
NGS Library Quantification Kit (Agilent Technologies, Santa Clara, CA, USA). Libraries were
introduced onto the flow cell using an Illumina cBot (Illumina, Inc., San Diego, CA, USA) and a
TruSeq Rapid Duo cBot Sample Loading Kit. Cluster generation and deep sequencing was performed on an Illumina HiSeq 2500 using an Illumina TruSeq Rapid PE Cluster Kit and TruSeq
Rapid SBS Kit. A paired-end, 50 nt read-length (PE50) sequencing strategy was performed
in rapid run mode. Image analysis and base calling were performed using Illumina’s Real
Time Analysis v1.17.20.0 software, followed by demultiplexing of indexed reads and generation of FASTQ files, using Illumina’s CASAVA v1.8.2 software (http://www.illumina.
com/software.ilmn). These FASTQ files were uploaded to the SRA under accession
SRP036064 (see http://www.ncbi.nlm.nih.gov/sra/?term=SRP036064).
Read alignment and quantification of mutation frequencies
A custom Python software package, mapmuts, was created to quantify the frequencies of mutations from the Illumina sequencing. A detailed description of the software as utilized in the
present work is at http://jbloom.github.io/mapmuts/example_2013Analysis_
Influenza_NP_Aichi68.html. The source code is at https://github.com/jbloom/
mapmuts. Briefly:
1. Reads were discarded if either read in a pair failed the Illumina chastity filter, had a mean
Q-score less than 25, or had more than two ambiguous (N) nucleotides.
2. The remaining paired reads were aligned to each other, and retained only if they shared
at least 30 nt of overlap, disagreed at no more than one site, and matched the expected
terminal Illumina adaptors with no more than one mismatch.
3. The overlap of the paired reads was aligned to NP, disallowing alignments with gaps
or more than six nucleotide mismatches. A small fraction of alignments corresponded
exclusively to the noncoding termini of the viral RNA; the rest contained portions of the
NP coding sequence.
24
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
4. For every paired read that aligned with NP, the codon identity was called if both reads
concurred for all three nucleotides in the codon. If the reads disagreed or contained an
ambiguity in that codon, the identity was not called.
A summary of alignment statistics and read depth is in Figure 4. The mutation frequencies
are summarized in Figure 5, and the counts for mutations in Figure 6. Full data can be accessed
via http://jbloom.github.io/mapmuts/example_2013Analysis_Influenza_
NP_Aichi68.html.
Inference of the amino-acid preferences
The approach described here is based on the assumption that there is an inherent preference for
each amino acid at each site in the protein. This assumption is clearly not completely accurate,
as the effect of a mutation at one site can be influenced by the identities of other sites. However,
experimental work with NP (Gong et al., 2013) and other proteins (Serrano et al., 1993; Bloom
et al., 2005; Bershtein et al., 2006; Bloom et al., 2006) suggests that at an evolutionary level,
sites interact mostly through generic effects on stability and folding. Furthermore, the effects
of mutations on stability and folding tend to be conserved during evolution (Ashenberg et al.,
2013; Serrano et al., 1993). So one justification for assuming site-specific but site-independent
preferences is that selection on a mutation is mostly determined by whether the protein can
tolerate its effect on stability or folding, so stabilizing amino acids will be tolerated in most genetic backgrounds while destabilizing amino acids will only be tolerated in some backgrounds,
as has been described experimentally (Gong et al., 2013) and theoretically (Bloom et al., 2007).
A more pragmatic justification is that the work here builds off this assumption to create evolutionary models that are much better than existing alternatives.
Assume that the preferences are entirely at the amino-acid level and are indifferent to the
specific codon (the study of preferences for synonymous codons is an interesting area for future
work). Denote the preference of site r for amino-acid a as πr,a , where
X
πr,a = 1.
(Equation 6)
a
Define πr,a /πr,a0 as the expected ratio of amino-acid a to a0 after viral growth if both are initially
introduced into the mutant library at equal frequency. Mutations that enhance viral growth will
have larger values of πr,a , while mutations that hamper growth will have lower values of πr,a .
25
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
However, πr,a /πr,a0 cannot be simply interpreted as the fitness effect of mutating site r from a to
a0 : because most clones have multiple mutations, this ratio summarizes the effect of a mutation
in a variety of related genetic backgrounds. A mutation can therefore have a ratio greater than
one due to its inherent effect on viral growth or its effect on the tolerance for other mutations (or
both). This analysis does not separate these factors, but experimental work (Gong et al., 2013)
has shown that it is fairly common for one mutation to NP to alter the tolerance to a subsequent
one.
The most naive approach is to set πr,a proportional to the frequency of amino-acid a in
mutvirus-p1 divided by its frequency in mutDNA, and then apply the normalization in Equation 6. However, such an approach is problematic for several reasons. First, it fails to account
for errors (PCR, reverse-transcription) that inflate the observed frequencies of some mutations.
Second, estimating ratios by dividing finite counts is notoriously statistically biased (Pearson,
1910; Ogliore et al., 2011). For example, in the limiting case where a mutation is counted once
in mutvirus-p1 and not at all in mutDNA, the ratio is infinity – yet in practice such low counts
give little confidence that enough variants have been assayed to estimate the true effect of the
mutation.
To circumvent these problems, I used an approach that explicitly accounts for the sampling
statistics. The approach begins with prior estimates that the πr,a values are all equal, and that the
error and mutation rates for each site are equal to the library averages. Multinomial likelihood
functions give the probability of observing a set of counts given the πr,a values and the various
error and mutation rates. The posterior mean of the πr,a values is estimated by MCMC.
Use the counts in DNA to quantify errors due to PCR and sequencing. Use the counts in
RNA to quantify errors due to reverse transcription. Assume that transcription of the viral genes
from the reverse-genetics plasmids and subsequent replication of these genes by the influenza
polymerase introduces a negligible number of new mutations. The second of these assumptions
is supported by the fact that the mutation frequency in virus-p1 is close to that in RNA (Figure
5). The first of these assumptions is supported by the fact that stop codons are no more frequent
in RNA than in virus-p1 (Figure 5) – deleterious stop codons arising during transcription will
be purged during viral growth, while those arising from reverse-transcription and sequencing
errors will not.
At each site r, there are ncodon codons, indexed by i = 1, 2, . . . ncodon . Let wt (r) denote the
wildtype codon at site r. Let NrDNA be the total number of sequencing reads at site r in DNA,
P DNA
and let nDNA
nr,i = NrDNA .
r,i be the number of these reads that report codon i at site r, so that
i
26
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Similarly, let NrmutDNA , NrRNA , and Nrmutvirus be the total number of reads at site r and let nmutDNA
,
r,i
RNA
mutvirus
nr,i , and nr,i
be the total number of these reads that report codon i at site r in mutDNA,
RNA, and mutvirus-p1, respectively.
First consider the rate at which site r is erroneously read as some incorrect identity due
to PCR or sequencing errors. Such errors are the only source of non-wildtype reads in the
sequencing of DNA. For all i 6= wt (r), define r,i as the rate at which site r is erroneously read
P
as codon i in DNA. Define r,wt(r) = 1 −
r,i to be the rate at which site r is correctly
i6=wt(r)
DNA
where E denotes
read as its wildtype identity of wt (r) in DNA. Then r,i = E nDNA
r,i /Nr
−−
→
→
−
DNA
DNA
the expectation value. Define r = (r,1 , . . . , r,ncodon ) and nr = nr,1 , . . . , nDNA
r,ncodon as
−−
→
−
DNA
given →
r and NrDNA is
vectors of the r,i and nDNA
r,i values, so the likelihood of observing nr
−−→
−−→
−
−
DNA →
DNA
DNA →
,
;
N
=
Mult
n
,
|
N
Pr nDNA
r
r
r
r
r
r
(Equation 7)
where Mult denotes the multinomial distribution.
Next consider the rate at which site r is erroneously copied during reverse transcription.
−
These reverse-transcription errors combine with the PCR / sequencing errors defined by →
r to
create non-wildtype reads in RNA. For all i 6= wt (r), define ρr,i as the rate at which site r
P
is miscopied to i during reverse transcription. Define ρr,wt(r) = 1 −
ρr,i as the rate at
i6=wt(r)
which site r is correctly reverse transcribed. Ignore as negligibly rare the possibility that a site
is subject to both a reverse-transcription and sequencing / PCR error within the same clone (a
reasonable assumption as both r,i and ρr,i are very small for i 6= wt (r)). Then r,i + ρr,i −
RNA
/N
where δi,wt(r) is the Kronecker delta (equal to one if i = wt (r) and
δi,wt(r) = E nRNA
r
r,i
−−→
−
−
given →
ρr , →
r , and NrRNA is
zero otherwise). The likelihood of observing nRNA
r
−−→
−−→
→
−
−
→
−
−
→
−
RNA →
RNA
RNA →
Pr nRNA
|
N
,
ρ
,
=
Mult
n
;
N
,
+
ρ
−
δr .
r
r
r
r
r
r
r
r
(Equation 8)
→
−
where δr = δ1,wt(r) , . . . , δncodon ,wt(r) is a vector that is all zeros except for the element wt (r).
Next consider the rate at which site r is mutated to some other codon in the plasmid mutant
−
library. These mutations combine with the PCR / sequencing errors defined by →
r to create nonwildtype reads in mutDNA. For all i 6= wt (r), define µr,i as the rate at which site r is mutated
P
to codon i in the mutant library. Define µr,wt(r) = 1 −
µr,i as the rate at which site r is not
i6=wt(r)
mutated. Ignore as negligibly rare the possibility that a site is subject to both a mutation and a
27
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
sequencing / PCR error within the same clone. Then µr,i +r,i −δi,wt(r) = E nmutDNA
/NrmutDNA .
r,i
−−−−→
−
−
The likelihood of observing nmutDNA
given →
µr , →
r , and NrmutDNA is
r
−−−−→
−−−−→
→
−
−
→
−
−
→
−
mutDNA →
mutDNA
mutDNA →
,
µ
+
−
δr .
;
N
,
µ
,
=
Mult
n
|
N
Pr nmutDNA
r
r
r
r
r
r
r
r
(Equation 9)
Finally, consider the effect of the preferences of each site r for different amino acids, as denoted by the πr,a values. Selection due to these preferences is manifested in mutvirus. This selection acts on the mutations in the mutant library (µr,i ), although the actual counts in mutvirus
are also affected by the sequencing / PCR errors (r,i ) and reverse-transcription errors (ρr,i ).
Again ignore as negligibly rare the possibility that a site is subject to more than one of these
sources of mutation and error within a single clone. Let A (i) denote the amino acid encoded
→
−
−
by codon i. Let →
πr be the vector of πr,a values. Define the vector-valued function C as
→
− →
C (−
πr ) = πr,A(1) , . . . , πr,A(ncodon ) ,
(Equation 10)
−
so that this function returns a ncodon -element vector constructed from →
πr . Because the selection in mutvirus due to the preferences πr,A(i) occurs after the mutagenesis µr,i but before the
mutvirus
=
/N
reverse-transcription errors ρr,i and the sequencing / PCR errors r,i , then E nmutvirus
r
−1 r,i
−1
P
→
− − →
r,i + ρr,i + γr × πr,A(i) × µr,i − 2 × δi,wt(r) where γr =
πr,A(i) µr,i
= C (→
πr ) · −
µr
i
(where · denotes the dot product) is a normalization factor that accounts for the fact that changes
in the frequency of one variant due to selection will influence the observed frequency of other
−−−−→
−
−
−
−
given →
µr , →
r , →
ρr , →
πr , and Nrmutvirus is therefore
variants. The likelihood of observing nmutvirus
r
!
→
− →
−−−−→
−
→
−
−
−
−
−
→
→
−
C
(
π
)
◦
µ
r
r
−
−
−
−
−
−
Pr nmutvirus
|→
µr , →
r , →
ρr , →
πr , Nrmutvirus = Mult nmutvirus
; Nrmutvirus , →
r + →
ρr + →
− 2 δr .
r
r
− →
−
C (−
πr ) · →
µr
(Equation 11)
where ◦ is the Hademard (entry-wise) product.
−
−
−
−
Specify priors over →
r , →
ρr , →
µr , and →
πr in the form of Dirichlet distributions (denoted here by
−
Dir). For the priors over the mutation rates →
µr , I choose Dirichlet-distribution parameters such
that the mean of the prior expectation for the mutation rate at each site r and codon i is equal to
the average value for all sites, estimated as the frequency in mutDNA minus the frequency in
DNA (Figure 5), denoted by µ. So the prior is
−
−
→)
Pr (→
µr ) = Dir (→
µr ; ncodon · σµ · −
α−µ,r
28
(Equation 12)
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
→ is the n
where −
α−µ,r
codon -element vector with elements αµ,r,i = µ + δi,wt(r) (1 − ncodon µ) and σµ
is the scalar concentration parameter.
For the priors over r,i and ρr,i , the Dirichlet-distribution parameters again represent the
average value for all sites, but now also depend on the number of nucleotide changes in the
codon mutation since sequencing / PCR and reverse-transcription errors are far more likely to
lead to single-nucleotide codon changes than multiple-nucleotide codon changes (Figure 5).
Let M (wt (r) , i) be the number of nucleotide changes in the mutation from codon wt (r) to
codon i. For example, M (GCA, ACA) = 1 and M (GCA, ATA) = 2. Let 1 , 2 , and 3 be the
average error rates for one-, two-, and three-nucleotide codon mutations, respectively – these
are estimated as the frequencies in DNA. So the prior is
−
−
Pr (→
r ) = Dir (→
r ; ncodon · σ · −
α→
,r )
(Equation 13)
where −
α→
,r is the ncodon -element vector with elements α,r,i = M(wt(r),i) where 0 = 1 − 9 ×
1 − 27 × 2 − 27 × 3 , and where σ is the scalar concentration parameter.
Similarly, let ρ1 , ρ2 , and ρ3 be the average reverse-transcription error rates for one-, two-,
and three-nucleotide codon mutations, respectively – these are estimated as the frequencies in
RNA minus those in DNA. So the prior is
−
−
Pr (→
ρr ) = Dir (→
ρr ; ncodon · σρ · −
α−→
ρ,r )
(Equation 14)
where −
α−→
ρ,r is the ncodon -element vector with elements αρ,r,i = ρM(wt(r),i) where ρ0 = 1 − 9 ×
ρ1 − 27 × ρ2 − 27 × ρ3 , and where σρ is the scalar concentration parameter.
−
Specify a symmetric Dirichlet-distribution prior over →
πr (note that any other prior, such as
one that favored wildtype, would implicitly favor certain identities based empirically on the
wildtype sequence, and so would not be in the spirit of the parameter-free derivation of the πr,a
values employed here). Specifically, use a prior of
→
−
−
−
π r ; σπ · 1
Pr (→
πr ) = Dir →
(Equation 15)
→
−
where 1 is the naa -element vector that is all ones, and σπ is the scalar concentration parameter.
It isnnow possible to writ expressions for the likelihoods and posterior
probabilities. Let
−−
→ −−
−−→ −−
→ −−
−−→ DNA mutDNA RNA mutvirus o
DNA
mutDNA
RNA
mutvirus
Nr = nr , nr
, nr , nr
, Nr , Nr
, Nr , Nr
denote the full set of
counts for site r. The likelihood of Nr given values for the preferences and mutation / error
29
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
rates is
−−→
−−→
−
→
−
−
−
−
−
−
RNA →
RNA
DNA →
,
,
ρ
×
|
N
,
×
Pr
n
|
N
Pr (Nr | →
πr , →
r , →
ρr , →
µr ) = Pr nDNA
r
r
r
r
r
r
r
−−−−→
−
−
Pr nmutDNA
| NrmutDNA , →
r , →
µr ×
r
−−−−→
−
→
−
→
−
→
−
mutvirus →
mutvirus
, r , ρr , µr , πr
(Equation 16)
| Nr
Pr nr
where the likelihoods that compose Equation 16 are defined by Equations Equation 7, Equation
8, Equation 9, and Equation 11. The posterior probability of a specific value for the preferences
and mutation / error rates is
−
−
−
−
−
−
−
−
Pr (→
πr , →
r , →
ρr , →
µr | Nr ) = Cr × Pr (Nr | →
πr , →
r , →
ρr , →
µr ) ×
(Equation 17)
−
−
−
−
Pr (→
) × Pr (→
ρ ) × Pr (→
µ ) × Pr (→
π )(Equation 18)
r
r
r
r
where Cr is a normalization constant that does not need to be explicitly calculated in the MCMC
−
approach used here. The posterior over the preferences →
πr can be calculated by integrating over
Equation 17 to give
−
Pr (→
πr | Nr ) =
Z Z Z
−
−
−
−
−
−
−
Pr (→
πr , →
r , →
ρr , →
µr | Nr ) d→
r d→
ρr d→
µr ,
(Equation 19)
where the integration is performed by MCMC. The posterior is summarized by its mean,
−
h→
πr i =
Z
−
→
−
−
πr × Pr →
πr | Nrk | 1 ≤ k ≤ R d→
πr .
(Equation 20)
In practice, each replicate consists of four libraries (WT-1, WT-2, N334H-1, and N334H-2) –
the posterior mean preferences inferred for each library within a replicate are averaged to give
the estimated preferences for that replicate. The preferences within each replicate are highly
correlated regardless of whether mutvirus-p1 or mutvirus-p2 is used as the mutvirus data set
(Figure 7A,B). This correlation between passages is consistent with the interpretation of the
preferences as the fraction of genetic backgrounds that tolerate a mutation (if it was a selection
coefficient, there should be further enrichment upon further passage). The preferences averaged
over both replicates serve as the “best” estimate, and are displayed in Figure 7D. This figure
was created using the WebLogo 3 program (Schneider and Stephens, 1990; Crooks et al., 2004).
Figure 7D also shows relative solvent accessibility (RSA) and secondary structure for residues
30
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
present in chain C of NP crystal structure PDB 2IQH (Ye et al., 2006). The total accessible
surface area (ASA) and the secondary structure for each residue in this monomer alone was
calculated using DSSP (Kabsch and Sander, 1983; Joosten et al., 2011). The RSAs are the total
ASA divided by the maximum ASA defined in (Tien et al., 2013). The secondary structure
codes returned by DSSP were grouped into three classes: helix (DSSP codes G, H, or I), strand
(DSSP codes B or E), and loop (any other DSSP code).
A full description of the algorithm used to infer the preferences along with complete data can
be accessed at http://jbloom.github.io/mapmuts/example_2013Analysis_
Influenza_NP_Aichi68.html.
Phylogenetic analyses
A detailed description of the phylogenetic analyses is available at http://jbloom.github.
io/phyloExpCM/example_2013Analysis_Influenza_NP_Human_1918_Descended.
html. The source code is at https://github.com/jbloom/phyloExpCM.
A set of NP coding sequences was assembled for human influenza lineages descended from
a close relative the 1918 virus (H1N1 from 1918 to 1957, H2N2 from 1957 to 1968, H3N2
from 1968 to 2013, and seasonal H1N1 from 1977 to 2008). All full-length NP sequences from
the Influenza Virus Resource (Bao et al., 2008) were downloaded, and up to three unique sequences per year from each of the four lineages described above were retained. These sequences
were aligned using EMBOSS needle (Rice et al., 2000). Outlier sequences that correspond to
heavily lab-adapted strains, lab recombinants, mis-annotated sequences, or zoonotic transfers
(for example, a small number of human H3N2 strains are from zoonotic swine variant H3N2
rather than the main human H3N2 lineage) were removed. This was done by first removing known outliers in the influenza databases (Krasnitz et al., 2008), and then using an analysis with RAxML (Stamatakis, 2006) and Path-O-Gen (http://tree.bio.ed.ac.uk/
software/pathogen/) to remove remaining sequences that were extreme outliers from
the molecular clock. The final alignment after removing outliers consisted of 274 unique NP
sequences.
Maximum-likelihood phylogenetic trees were constructed using codonPhyML (Gil et al.,
2013). Two substitution models were used. The first was GY94 (Goldman and Yang, 1994)
using CF3x4 equilibrium frequencies (Pond et al., 2010), a single transition-transversion ratio
optimized by maximum likelihood, and a synonymous-nonsynonymous ratio drawn from four
31
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
discrete gamma-distributed categories with mean and shape parameter optimized by maximum
likelihood (Yang et al., 2000). The second was KOSI07 (Kosiol et al., 2007) using the F method
for the equilibrium frequencies, optimizing the relative transversion-transition ratio by maximum likelihood, and letting the relative synonymous-nonsynonymous ratio again be drawn
from four gamma-distributed categories with mean and shape parameter optimized by maximum likelihood. The trees produced by codonPhyML are unrooted. These trees were rooted
using Path-O-Gen (http://tree.bio.ed.ac.uk/software/pathogen/), and visualized with FigTree (http://tree.bio.ed.ac.uk/software/figtree/) to create
the images in Figure 9. The tree topologies are extremely similar for both models.
The evolutionary models were compared by using them to optimize the branch lengths of
the fixed tree topologies in Figure 9 so as to maximize the likelihood using HYPHY (Pond et al.,
2005) for sites 2 to 498 (site 1 was not included, since the N-terminal methionine is conserved
and was not mutated in the plasmid mutant libraries). The results are shown in Table 6 and Table 7. Regardless of which tree topology was used, the experimentally determined evolutionary
models outperformed all variants of GY94 and KOSI07. The experimentally determined evolutionary models performed best when using the preferences determined from the combined data
from both replicates and using Equation 3 to compute the fixation probabilities. Using the data
from just one replicate also outperforms GY94 and KOSI07, although the likelihoods are slightly
worse. In terms of the completeness with which mutations are sampled in the mutant viruses,
replicate A is superior to replicate B as discussed above – and the former replicate gives higher
likelihoods. If the fixation probabilities are instead determined using the method of Halpern
and Bruno (Halpern and Bruno, 1998) as in Equation 4, the experimentally determined models still outperform GY94 and KOSI07 – but the likelihoods are substantially worse. To check
that the experimentally determined models really do utilize the site-specific preferences information, the preferences were randomized among sites and likelihoods were computed. These
randomized models perform vastly worse than any of the alternatives.
The variants of GY94 and KOSI07 that were used for the HYPHY analyses are listed in the
tables of likelihoods. The parameters were counted as follows: the simplest variants contained
equilibrium frequency parameters that were empirically estimated from the sequences under
analysis: there are 9 such parameters for GY94 using CF3x4 (Goldman and Yang, 1994; Pond
et al., 2010) and 60 such parameters for KOSI07 using F (Kosiol et al., 2007). In addition, all
variants of the models contain a transition-transversion ratio optimized by likelihood, and at
least a single synonymous-nonsynonymous ratio optimized by likelihood. For more complex
32
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
variants, one or more of the following is also done:
• The overall substitution rate is drawn from a gamma distribution with four discrete categories to add one shape parameter (Yang, 1994).
• The nonsynonymous-synonymous ratio is drawn from a gamma distribution with four
discrete categories to add one shape parameter (Yang et al., 2000).
• A different nonsynonymous-synonymous ratio is estimated for each branch (Yang and
Nielsen, 1998) to add 547 parameters.
33
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
References
Araya CL, Fowler DM. 2011. Deep mutational scanning: assessing protein function on a massive scale. Trends in biotechnology. 29:435–442.
Ashenberg O, Gong LI, Bloom JD. 2013. Mutational effects on stability are largely conserved
during protein evolution. Proc. Natl. Acad. Sci. USA. 110:21071–21076.
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. 2008.
The Influenza Virus Resource at the National Center for Biotechnology Information. J. Virol.
82:596–601.
Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS. 2006. Robustness-epistasis link
shapes the fitness landscape of a randomly drifting protein. Nature. 444:929–932.
Bloom JD, Gong LI, Baltimore D. 2010. Permissive secondary mutations enable the evolution
of influenza oseltamivir resistance. Science. 328:1272–1275.
Bloom JD, Labthavikul ST, Otey CR, Arnold FH. 2006. Protein stability promotes evolvability.
Proc. Natl. Acad. Sci. USA. 103:5869–5874.
Bloom JD, Raval A, Wilke CO. 2007. Thermodynamics of neutral protein evolution. Genetics.
175:255–266.
Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. 2005. Thermodynamic
prediction of protein neutrality. Proc. Natl. Acad. Sci. USA. 102:606–611.
Cirino PC, Mayer KM, Umeno D. 2003. Directed evolution library creation: methods and
protocols, Humana Press, chapter Generating mutant libraries using error-prone PCR, pp.
3–9.
Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. Weblogo: a sequence logo generator.
Genome research. 14:1188–1190.
Felsenstein J. 1973. Maximum likelihood and minimum-step methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology. 22:240–249.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach.
J. Mol. Evol. 17:368–376.
34
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Firnberg E, Ostermeier M. 2012. PFunkel: efficient, expansive, user-defined mutagenesis. PLoS
One. 7:e52031.
Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, Fields S. 2010. Highresolution mapping of protein sequence-function relationships. Nat. Methods. 7:741–746.
Georgescu R, Bandara G, Sun L. 2003. Directed evolution library creation, Springer, chapter
Saturation mutagenesis, pp. 75–83.
Gil M, Zanetti MS, Zoller S, Anisimova M. 2013. Codonphyml: Fast maximum likelihood
phylogeny estimation under codon substitution models. Mol. Biol. Evol. 30:1270–1280.
Goldman N, Thorne JL, Jones DT. 1998. Assessing the impact of secondary structure and
solvent accessibility on protein evolution. Genetics. 149:445–458.
Goldman N, Yang Z. 1994. A codon-based model of nucleotide substitution probabilities for
protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.
Gong LI, Suchard MA, Bloom JD. 2013. Stability-mediated epistasis constrains the evolution
of an influenza protein. eLife. 2:e00631.
Goto H, Kawaoka Y. 1998. A novel mechanism for the acquisition of virulence by a human
influenza a virus. Proc. Natl. Acad. Sci. USA. 95:10224–10228.
Halligan DL, Keightley PD. 2009. Spontaneous mutation accumulation studies in evolutionary
genetics. Annual Review of Ecology, Evolution, and Systematics. 40:151–172.
Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling
site-specific residue frequencies. Mol. Biol. Evol. 15:910–917.
Hiatt JB, Patwardhan RP, Turner EH, Lee C, Shendure J. 2010. Parallel, tag-directed assembly
of locally derived short sequence reads. Nat. Methods. 7:119–122.
Hoffmann E, Neumann G, Kawaoka Y, Hobom G, Webster RG. 2000. A DNA transfection
system for generation of influenza A virus from eight plasmids. Proc. Natl. Acad. Sci. USA.
97:6108–6113.
Hou Y, Zhang H, Miranda L, Lin S. 2010. Serious overestimation in quantitative pcr by circular
(supercoiled) plasmid standard: microalgal pcna as the model gene. PLoS One. 5:e9545.
35
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. 2001. Bayesian inference of phylogeny
and its impact on evolutionary biology. Science. 294:2310–2314.
Jain PC, Varadarajan R. 2014. A rapid, efficient, and economical inverse polymerase chain
reaction-based method for generating a site saturation mutant library. Analytical Biochemistry. 449:90–98.
Joosten RP, te Beek TA, Krieger E, Hekkelman ML, Hooft RW, Schneider R, Sander C, Vriend
G. 2011. A series of pdb related databases for everyday needs. Nucleic Acids Res. 39:D411–
D419.
Kabsch W, Sander C. 1983. Dictionary of protein secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers. 22:2577–2637.
Kleinman CL, Rodrigue N, Lartillot N, Philippe H. 2010. Statistical potentials for improved
structurally constrained evolutionary models. Mol. Biol. Evol. 27:1546–1560.
Kosiol C, Holmes I, Goldman N. 2007. An empirical codon model for protein sequence evolution. Mol. Biol. Evol. 24:1464–1479.
Krasnitz M, Levine AJ, Rabadan R. 2008. Anomalies in the influenza virus genome database:
new biology or laboratory errors? J. Virology. 82:8947–8950.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the
amino-acid replacement process. Mol. Biol. Evol. 21:1095–1109.
Le SQ, Lartillot N, Gascuel O. 2008. Phylogenetic mixture models for proteins. Phil. Trans. R.
Soc. B. 363:3965–3976.
Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH, Sawyer SL. 2013. Highthroughput dna sequencing errors are reduced by orders of magnitude using circle sequencing. Proc. Natl. Acad. Sci. USA. 110:19872–19877.
Marsh GA, Rabadán R, Levine AJ, Palese P. 2008. Highly conserved regions of influenza a
virus polymerase gene segments are critical for efficient viral rna packaging. J. Virology.
82:2295–2304.
Melamed D, Young DL, Gamble CE, Miller CR, Fields S. 2013. Deep mutational scanning of an
rrm domain of the saccharomyces cerevisiae poly (a)-binding protein. RNA. 19:1537–1551.
36
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. 1953. Equation of state
calculations by fast computing machines. Journal of Chemical Physics. 21:1087–1092.
Neumann G, Watanabe T, Ito H, et al. (12 co-authors). 1999. Generation of influenza A viruses
entirely from cloned cDNAs. Proc. Natl. Acad. Sci. USA. 96:9345–9350.
Neylon C. 2004. Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution. Nucleic Acids
Research. 32:1448–1459.
Ogliore R, Huss G, Nagashima K. 2011. Ratio estimation in sims analysis. Nuclear Instruments
and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms.
269:1910–1918.
Parvin J, Moscona A, Pan W, Leider J, Palese P. 1986. Measurement of the mutation rates of
animal viruses: influenza a virus and poliovirus type 1. J. Virology. 59:377–383.
Pearson K. 1910. On the constants of index-distributions as deduced from the like constants
for the components of the ratio, with special reference to the opsonic index. Biometrika.
7:531–541.
Pollock DD, Thiltgen G, Goldstein RA. 2012. Amino acid coevolution induces an evolutionary
stokes shift. Proceedings of the National Academy of Sciences. 109:E1352–E1359.
Pond SK, Delport W, Muse SV, Scheffler K. 2010. Correcting the bias of empirical frequency
parameter estimators in codon models. PLoS One. 5:e11230.
Pond SL, Frost SD, Muse SV. 2005. Hyphy: hypothesis testing using phylogenies. Bioinformatics. 21:676–679.
Portela A, Digard P. 2002. The influenza virus nucleoprotein: a multifunctional rna-binding
protein pivotal to virus replication. Journal of General Virology. 83:723–734.
Posada D, Buckley TR. 2004. Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests.
Systematic Biology. 53:793–808.
37
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Potapov V, Cohen M, Schreiber G. 2009. Assessing computational methods for predicting
protein stability upon mutation: good on average but not in the details. Prot. Eng. Des. Sel.
22:553–560.
Rice P, Longden I, Bleasby A. 2000. Emboss: the european molecular biology open software
suite. Trends in Genetics. 16:276–277.
Rodrigue N, Kleinman CL, Philippe H, Lartillot N. 2009. Computational methods for evaluating
phylogenetic models of coding sequence evolution with dependence between codons. Mol.
Biol. Evol. 26:1663–1676.
Roscoe BP, Thayer KM, Zeldovich KB, Fushman D, Bolon DN. 2013. Analyses of the effects
of all ubiquitin point mutants on yeast growth rate. J. Mol. Biol. .
Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. 2012. Detection of ultra-rare
mutations by next-generation sequencing. Proc. Natl. Acad. Sci. USA. 109:14508–14513.
Schneider TD, Stephens RM. 1990. Sequence logos: a new way to display consensus sequences.
Nucleic Acids Res. 18:6097–6100.
Serrano L, Day AG, Fersht AR. 1993. Step-wise mutation of barnase to binase: a procedure for
engineering increased stability of proteins and an experimental analysis of the evolution of
protein stability. J. Mol. Biol. 233:305–312.
Stamatakis A. 2006. Raxml-vi-hpc: maximum likelihood-based phylogenetic analyses with
thousands of taxa and mixed models. Bioinformatics. 22:2688–2690.
Starita LM, Pruneda JN, Lo RS, Fowler DM, Kim HJ, Hiatt JB, Shendure J, Brzovic PS, Fields
S, Klevit RE. 2013. Activity-enhancing mutations in an e3 ubiquitin ligase identified by
high-throughput mutagenesis. Proc. Natl. Acad. Sci. USA. 110:E1263–E1272.
Thorne JL, Choi SC, Yu J, Higgs PG, Kishino H. 2007. Population genetics without intraspecific
data. Mol. Biol. Evol. 24:1667–1677.
Thorne JL, Goldman N, Jones DT. 1996. Combining protein evolution and secondary structure.
Mol. Biol. Evol. 13:666–673.
Tien M, Meyer AG, Spielman SJ, Wilke CO. 2013. Maximum allowed solvent accessibilites of
residues in proteins. PLoS One. 8:e80635.
38
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Traxlmayr MW, Hasenhindl C, Hackl M, Stadlmayr G, Rybka JD, Borth N, Grillari J, Rüker F,
Obinger C. 2012. Construction of a stability landscape of the ch3 domain of human igg1 by
combining directed evolution with high throughput sequencing. J. Mol. Biol. .
Wang HC, Li K, Susko E, Roger AJ. 2008. A class frequency mixture model that adjusts
for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC
evolutionary biology. 8:331.
Wu CH, Suchard MA, Drummond AJ. 2013. Bayesian selection of nucleotide substitution
models and their site assignments. Mol. Biol. Evol. 30:669–688.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from dna sequences with variable
rates over sites: approximate methods. J. Mol. Evol. 39:306–314.
Yang Z, Nielsen R. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of
mammals. J. Mol. Evol. 46:409–418.
Yang Z, Nielsen R, Goldman N, Pedersen AMK. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155:431–449.
Ye Q, Krug RM, Tao YJ. 2006. The mechanism by which influenza a virus nucleoprotein forms
oligomers and binds rna. Nature. 444:1078–1082.
39
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Acknowledgments
Thanks to D. Fowler, J. Kitzman, A. Adey, O. Ashenberg, and T. Bedford for helpful discussions. This work was supported by the National Institute of General Medical Sciences of the
National Institutes of Health (grant number R01 GM102198).
40
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
clone
Clone 1
Clone 2
Clone 3
Clone 4
Clone 5
Clone 6
Clone 7
Clone 8
Clone 9
Clone 10
Clone 11
Clone 12
Clone 13
Clone 14
Clone 15
Clone 16
Clone 17
Clone 18
Clone 19
Clone 20
Clone 21
Clone 22
Clone 23
Clone 24
mutations
G62T (G21V), T693C (synonymous), del153-522 (indel)
None
C29T (T10I)
None
None
G429A (synonymous), C447T (synonymous)
None
None
C646A (R216S)
G471T (K157N), G703A (D235N)
T111C (synonymous), T718G (*240E)
T25C (F9L), T26C (F9S)
C45T (synonymous), C549T (synonymous)
T319C (Y107H), C372T (synonymous), C539T (A180V)
A488C (K163T)
G274T (G92C)
None
None
None
G527A (S176N), A676G (T226A)
G4A (V2I)
T266C (M89T)
None
C30T (synonymous), del45-590 (indel)
Table 1: Mutations identified by sequencing the 720 nucleotide GFP gene packaged in the
PB1 segment after 25 limiting-dilution passages for 24 independent replicates. The numbering
is sequential beginning with the first nucleotide of the GFP start codon. For nonsynonymous
mutations, the induced amino-acid change is indicated in parentheses.
41
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
mutation type
total substitutions
transversions
transitions
nonsynonymous
synonymous
stop codons
indels
T→G
T→C
T→A
G→T
G→C
G→A
C→T
C→G
C→A
A→T
A→G
A→C
number of occurrences
24
6
18
15
8
1
2
1
6
0
3
0
4
7
0
1
0
1
1
Table 2: Counts for different types of mutations after the 25 limiting-dilution passages. The
numbers are calculated from Table 1. Given that GFP is 720 nucleotides long, the data suggest
a viral mutation rate of 5.6 × 10−5 mutations per nucleotide per tissue-culture generation.
42
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
mutation type
A→G, T→C (transition)
G→A, C→T (transition)
A→C, T→G (transversion)
C→A, G→T (transversion)
A→T, T→A (transversion)
G→C, C→G (transversion)
rate
2.4 × 10−5
2.3 × 10−5
9.0 × 10−6
9.4 × 10−6
3.0 × 10−6
1.9 × 10−6
Table 3: Influenza mutation rates. Numbers represent the probability a site that has the parent
identity will mutate to the specified nucleotide in a single tissue-culture generation, and are
calculated from Table 1 and Table 2 after adding one pseudocount to each mutation type. Mutations are in pairs because an observed change of A→G can derive either from this mutation
on the sequenced strand or a T→C on the complementary strand, and so the paired mutations
are indistinguishable assuming that the same mutational process applies to both strands of the
replicated nucleic acid molecule. The numbers are the estimated rates of each individual mutation, so for example the observed rate of change from A→G is 2 × (2.4 × 10−5 ) since this
change can arise from either of the two mutations A→G and T→C.
43
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
residue
150
frequencies in
natural
sequences
R (0.83), K
(0.17)
R (1.00)
152
156
174
175
195
199
R (1.00)
R (1.00)
R (1.00)
R (1.00)
R (1.00)
R (1.00)
213
214
R (1.00)
R (0.72), K
(0.28)
221
236
R (1.00)
R (0.94), K
(0.06)
R (1.00)
K (0.56), Q
(0.44)
R (1.00)
R (1.00)
Y (1.00)
65
355
357
361
391
148
experimentally measured
amino-acid preferences
R (0.40), K (0.10), N (0.06)
expected equilibrium
evolutionary frequencies
from experiments
R (0.58), S (0.07)
R (0.46), K (0.06), P (0.05), L
(0.05)
R (0.52), K (0.07), Q (0.07)
R (0.52), Q (0.06)
R (0.58), N (0.06), T (0.05)
R (0.46), K (0.16)
R (0.51)
R (0.44), M (0.08), Y (0.06), V
(0.05)
R (0.51), N (0.06)
K (0.24), H (0.09), R (0.09), Q
(0.08), M (0.06), N (0.06), A
(0.06), I (0.06)
R (0.46), E (0.07), K (0.07)
K (0.32), R (0.30)
R (0.63), L (0.07)
R (0.29), L (0.13), K (0.09)
K (0.38), E (0.09), N (0.07), Y
(0.05)
R (0.53), V (0.13)
R (0.59), K (0.09)
Y (0.54), I (0.06)
R (0.43), L (0.19)
K (0.31), R (0.09), E (0.08), N
(0.06)
R (0.68), V (0.11)
R (0.77)
Y (0.44), I (0.07), T (0.07), P
(0.06), S (0.06)
R (0.71)
R (0.69), S (0.06)
R (0.75)
R (0.66), K (0.08), S (0.05)
R (0.69)
R (0.64), V (0.05)
R (0.69)
R (0.19), K (0.17), A (0.09), H
(0.07), I (0.06), L (0.06), Q
(0.06)
R (0.66), L (0.05)
R (0.51), K (0.18)
Table 4: For residues involved in NP’s RNA-binding groove, the preferences and expected
evolutionary equilibrium frequencies from the experiments correlate well with the amino-acid
frequencies in naturally occurring sequences. Shown are the 17 residues in the NP RNA-binding
groove in (Ye et al., 2006). The second column gives the frequencies of amino acids in all 21,108
full-length NP sequences from influenza A (excluding bat lineages) in the Influenza Virus Resource as of January-31-2014. The third column gives the experimentally measured amino-acid
preferences (Figure 7D). The fourth column gives the expected evolutionary equilibrium frequency of the amino acids (Figure 8). Only residues with frequencies or preferences ≥ 0.05
are listed. In all cases, the most abundant amino acid in the natural sequences has the highest
expected evolutionary equilibrium frequency. In 15 of 17 cases, the most abundant amino acid
in the natural sequences has the highest experimentally measured preference – in the other two
cases, the most abundant amino acid in the natural sequences is among those with the highest
preference.
44
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
residue
stability
measurement
259
L259S is
destabilizing
(∆Tm = −3.9◦ C)
V280A is
destabilizing
(∆Tm = −3.5◦ C)
N334H is
stabilizing
(∆Tm = 4.5◦ C)
R384G is
destabilizing
(∆Tm = −4.8◦ C)
280
334
384
frequencies experimentally
in natural measured
sequences amino-acid
preferences
L (0.98), S L (0.23), S (0.04)
(0.02)
expected equilibrium
evolutionary
frequencies from
experiments
L (0.36), S (0.06)
V (0.89),
A (0.10)
V (0.19), A (0.02)
V (0.25), A (0.03)
H (0.93),
N (0.07)
H (0.28), N (0.12)
H (0.23), N (0.10)
R (0.80),
G (0.17)
R (0.22), G (0.04)
R (0.39), G (0.04)
Table 5: For residues where mutations have previously been characterized as having large
effects on the stability of the A/Aichi/2/1968 NP, the more stable amino acid has a higher
preference and is also more frequent in actual NP sequences. The second column gives the
experimentally measured change in melting temperature (∆Tm ) induced by the mutation to the
A/Aichi/2/1968 NP as measured in (Gong et al., 2013); these mutational effects on stability are
largely conserved in other NPs (Ashenberg et al., 2013). The third column gives the frequencies
of the amino acids in all 21,108 full-length NP sequences from influenza A (excluding bat
lineages) in the Influenza Virus Resource as of January-31-2014. The fourth column gives
the experimentally measured amino-acid preferences (Figure 7D). The fifth column gives the
expected evolutionary equilibrium frequency of the amino acids (Figure 8).
45
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
model
experimental, combined replicates
experimental, replicate A
experimental, replicate B
Halpern & Bruno, combined replicates
Halpern & Bruno, replicate A
Halpern & Bruno, replicate B
GY94, branch-specific ω, multiple rates
GY94, multiple ω, multiple rates
KOSI07, branch-specific ω, multiple rates
GY94, multiple ω, one rate
KOSI07, multiple ω, multiple rates
GY94, one ω, multiple rates
KOSI07, multiple ω, one rate
KOSI07, one ω, multiple rates
GY94, one ω, one rate
KOSI07, one ω, one rate
randomized experimental, combined replicates
randomized experimental, replicate A
randomized experimental, replicate B
randomized Halpern & Bruno, combined replicates
randomized Halpern & Bruno, replicate B
randomized Halpern & Bruno, replicate A
log likelihood
-12338.9
-12372.8
-12392.0
-12517.9
-12535.4
-12566.7
-12769.1
-12853.9
-12891.7
-12935.9
-12999.9
-13069.8
-13154.8
-13191.5
-13205.0
-13404.0
-14209.4
-14243.7
-14259.1
-14533.3
-14618.5
-14649.9
parameters
(optimized +
empirical)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
556 (547 + 9)
13 (4 + 9)
607 (547 + 60)
12 (3 + 9)
64 (4 + 60)
12 (3 + 9)
63 (3 + 60)
63 (3 + 60)
11 (2 + 9)
62 (2 + 60)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
AIC
24677.8
24745.7
24783.9
25035.7
25070.8
25133.3
26650.1
25733.8
26997.3
25895.8
26127.7
26163.5
26435.5
26508.9
26431.9
26932.0
28418.8
28487.4
28518.2
29066.5
29236.9
29299.9
Table 6: Likelihoods computed using various evolutionary models after optimizing the branch
lengths for the tree in Figure 9 inferred using the GY94 model. Experimentally determined models vastly outperform GY94 or KOSI07 despite having no free parameters. The experimentally
determined models give the best fit if the amino-acid preferences are interpreted as the fraction
of genetic backgrounds that tolerate a mutation (Equation 3) rather than as selection coefficients
using the method of Halpern and Bruno (1998) (Equation 4). Randomizing the experimentally
determined amino-acid preferences among sites makes the models far worse. All variants of
GY94 and KOSI07 contain empirical equilibrium frequencies plus a transition-transversion ratio and synonymous-nonsynonymous ratio (ω) optimized by likelihood. Some variants allow
multiple (four discrete gamma categories) rates or ω values, or a different ω for each branch.
46
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
model
experimental, combined replicates
experimental, replicate A
experimental, replicate B
Halpern & Bruno, combined replicates
Halpern & Bruno, replicate A
Halpern & Bruno, replicate B
GY94, branch-specific ω, multiple rates
GY94, multiple ω, multiple rates
KOSI07, branch-specific ω, multiple rates
GY94, multiple ω, one rate
KOSI07, multiple ω, multiple rates
GY94, one ω, multiple rates
KOSI07, multiple ω, one rate
KOSI07, one ω, multiple rates
GY94, one ω, one rate
KOSI07, one ω, one rate
randomized experimental, combined replicates
randomized experimental, replicate A
randomized experimental, replicate B
randomized Halpern & Bruno, combined replicates
randomized Halpern & Bruno, replicate B
randomized Halpern & Bruno, replicate A
log likelihood
-12334.6
-12368.5
-12387.7
-12513.0
-12530.3
-12562.0
-12769.0
-12850.8
-12889.6
-12932.4
-12995.2
-13069.1
-13148.2
-13188.7
-13204.7
-13401.0
-14205.2
-14239.3
-14255.3
-14528.4
-14613.6
-14645.0
parameters
(optimized +
empirical)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
556 (547 + 9)
13 (4 + 9)
607 (547 + 60)
12 (3 + 9)
64 (4 + 60)
12 (3 + 9)
63 (3 + 60)
63 (3 + 60)
11 (2 + 9)
62 (2 + 60)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
0 (0 + 0)
AIC
24669.2
24737.1
24775.4
25026.0
25060.7
25124.0
26650.0
25727.5
26993.3
25888.8
26118.4
26162.3
26422.5
26503.5
26431.4
26926.0
28410.5
28478.7
28510.6
29056.8
29227.1
29290.0
Table 7: Likelihoods for the various evolutionary models for the tree topology inferred with
codonPhyML using KOSI07. This table differs from Table 6 in that it optimizes the likelihoods
on the tree topology inferred with KOSI07 rather than GY94.
47
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
actual
8
6
4
2
0
0
2
codon composition
cumulative fraction
0.4
4
6
number of mutated codons
C
E
Poisson
parent codons
8
D
mutant codons
0.3
0.2
0.1
0.0
1.0
0.8
A
T
C
G
nucleotide
actual
1.0
0.8
number of mutations
B
cumulative fraction
number of clones
A
32
24
16
8
0
1
2
3
number of nucleotide changes in codon mutation
actual cumulative distribution
uniform distribution
0.6
0.4
0.2
0.0
100
200
300
codon number
400
expected
0.6
0.4
0.2
0.0
0
100
200
300
400
distance between pairs of mutations
Figure 1: The codon-mutant libraries as assessed by Sanger sequencing 30 individual clones.
(A) The clones have an average of 2.7 codon mutations and 0.1 indels per full-length NP coding
sequence, with the number of mutated codons per gene following an approximately a Poisson
distribution. (B) The number of nucleotide changes per codon mutation is roughly as expected if
each codon is randomly mutated to any of the other 63 codons, with a slight elevation in singlenucleotide mutations. (C) The mutant codons have a uniform base composition. (D) Mutations
occur uniformly along the primary sequence. (E) In clones with multiple mutations, there is
no tendency for mutations to cluster. Shown is the actual distribution of pairwise distances
between mutations in all multiply mutated clones compared to the distribution generated by
1,000 simulations where mutations are placed randomly along the primary sequence of each
multiple-mutant clone. The data and code for this figure are available at https://github.
com/jbloom/SangerMutantLibraryAnalysis/tree/v0.1.
48
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
,
virus-p1
virus-p2
RNA
on
pti , >
i
r
6
6
c CR g
RT, PCR,
RT, PCR,
ns
tra T, Pncin
sequencing
sequencing
Rue
seq
unmutated
virus passage
virus passage
- rescued virus
plasmid
1
2
viral rescue
viral
viral
codon
(transcrippassage
passage
mutagenesis
tion, viral
(viral
(viral
?
replication)
replication)
replication)
mutant
mutant
plasmid
rescued
viruses
viruses
mutant
mutant
passage 1
passage 2
library
viruses
DNA
6
PCR,
sequencing
PCR,
sequencing
RT, PCR,
sequencing
?
mutvirus-p1
?
mutDNA
RT, PCR,
sequencing
?
mutvirus-p2
Figure 2: Design of the deep mutational scanning experiment. The sequenced samples are
in yellow. Blue text indicates sources of mutation and selection; red text indicates sources of
errors. The comparison of interest is between the mutation frequencies in the mutDNA and
mutvirus samples, since changes in frequencies between these samples represent the action of
selection. However, since some of the experimental techniques have the potential to introduce
errors, the other samples are also sequenced to quantify these unintended sources of error. Each
of the two experimental replicates (replicate A and replicate B) involved independently repeating the entire viral rescue, viral passaging, and sequencing process for each of the four plasmid
mutant libraries (WT-1, WT-2, N334H-1, and N334H-2).
49
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
A
sequencing read 1
adaptor 1
NP fragment
adaptor 2
sequencing read 2
B
-
fraction of reads
0.060
0.045
0.030
0.015
0.000
30
35
40
45
50
55
insert length
60
65
Figure 3: Illumina sequencing accuracy was increased by using overlapping paired-end reads.
(A) The strategy is to shear the NP fragments to about 50 nucleotides in length, and then sequence with overlapping paired-end reads. This provides double coverage, and only codon
identities for which both reads agree are called. (B) The actual length distribution of alignable
paired reads that had at least 30 nucleotides of overlap (lengths between 30 and 70 nucleotides)
for a typical sample.
50
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
A
×107
aligned
outside gene
unaligned
unpaired
excess N
low Q
filtered
number of read pairs
2.5
2.0
1.5
1.0
0.5
B
WT-1 DNA
WT-1 RNA
WT-1 mutDNA
WT-1 virus-p1
WT-1 mutvirus-p1
WT-1 virus-p2
WT-1 mutvirus-p2
WT-2 DNA
WT-2 RNA
WT-2 mutDNA
WT-2 virus-p1
WT-2 mutvirus-p1
WT-2 virus-p2
WT-2 mutvirus-p2
N334H-1 DNA
N334H-1 RNA
N334H-1 mutDNA
N334H-1 virus-p1
N334H-1 mutvirus-p1
N334H-1 virus-p2
N334H-1 mutvirus-p2
N334H-2 DNA
N334H-2 RNA
N334H-2 mutDNA
N334H-2 virus-p1
N334H-2 mutvirus-p1
N334H-2 virus-p2
N334H-2 mutvirus-p2
0.0
aligned
outside gene
unaligned
unpaired
excess N
low Q
filtered
2.0
1.5
1.0
0.5
0.0
C
WT-1 DNA
WT-1 RNA
WT-1 mutDNA
WT-1 virus-p1
WT-1 mutvirus-p1
WT-1 virus-p2
WT-1 mutvirus-p2
WT-2 DNA
WT-2 RNA
WT-2 mutDNA
WT-2 virus-p1
WT-2 mutvirus-p1
WT-2 virus-p2
WT-2 mutvirus-p2
N334H-1 DNA
N334H-1 RNA
N334H-1 mutDNA
N334H-1 virus-p1
N334H-1 mutvirus-p1
N334H-1 virus-p2
N334H-1 mutvirus-p2
N334H-2 DNA
N334H-2 RNA
N334H-2 mutDNA
N334H-2 virus-p1
N334H-2 mutvirus-p1
N334H-2 virus-p2
N334H-2 mutvirus-p2
number of read pairs
×107
number of reads
8
×105
paired reads
single read
ambiguous
6
4
2
0
100
200
300
codon number
400
Figure 4: The number of alignable paired reads for each sample from (A) replicate A and
(B) replicate B. Most read pairs passed the various filters, could be overlapped, and could
have their overlap aligned to NP. (C) The read depth across the primary sequence for a typical sample. The read depth was not entirely uniform along the primary sequence, probably
due to biases in fragmentation locations. The computer code used to generate these figures
is available at http://jbloom.github.io/mapmuts/example_2013Analysis_
Influenza_NP_Aichi68.html.
51
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
A
8
×10−3
synonymous
1 nucleotide mutation
nonsynonymous
2 nucleotide mutations
stop codon
3 nucleotide mutations
fraction
6
4
B
×10−3
synonymous
1 nucleotide mutation
nonsynonymous
2 nucleotide mutations
Sanger
0
WT-1 DNA
WT-2 DNA
N334H-1 DNA
N334H-2 DNA
WT-1 RNA
WT-2 RNA
N334H-1 RNA
N334H-2 RNA
WT-1 mutDNA
WT-2 mutDNA
N334H-1 mutDNA
N334H-2 mutDNA
WT-1 virus-p1
WT-2 virus-p1
N334H-1 virus-p1
N334H-2 virus-p1
WT-1 mutvirus-p1
WT-2 mutvirus-p1
N334H-1 mutvirus-p1
N334H-2 mutvirus-p1
WT-1 virus-p2
WT-2 virus-p2
N334H-1 virus-p2
N334H-2 virus-p2
WT-1 mutvirus-p2
WT-2 mutvirus-p2
N334H-1 mutvirus-p2
N334H-2 mutvirus-p2
2
stop codon
3 nucleotide mutations
fraction
6
4
Sanger
0
WT-1 DNA
WT-2 DNA
N334H-1 DNA
N334H-2 DNA
WT-1 RNA
WT-2 RNA
N334H-1 RNA
N334H-2 RNA
WT-1 mutDNA
WT-2 mutDNA
N334H-1 mutDNA
N334H-2 mutDNA
WT-1 virus-p1
WT-2 virus-p1
N334H-1 virus-p1
N334H-2 virus-p1
WT-1 mutvirus-p1
WT-2 mutvirus-p1
N334H-1 mutvirus-p1
N334H-2 mutvirus-p1
WT-1 virus-p2
WT-2 virus-p2
N334H-1 virus-p2
N334H-2 virus-p2
WT-1 mutvirus-p2
WT-2 mutvirus-p2
N334H-1 mutvirus-p2
N334H-2 mutvirus-p2
2
Figure 5: Per-codon mutation frequencies for each of the four libraries (WT-1, WT-2,
N334H-1, N334H-2) in (A) replicate A or (B) replicate B. The seven samples for each library are named as in Figure 2. Errors due to Illumina sequencing (DNA sample), reversetranscription (RNA sample), and viral replication (virus-p1 and virus-p2 samples) are rare,
and are mostly single-nucleotide changes. The codon-mutant libraries (mutDNA) contain a
high frequency of single- and multi-nucleotide changes as expected from Sanger sequencing
(rightmost bars of this plot and Figure 1; note that Sanger sequencing is not subject to Illumina sequencing errors that affect all other samples in this plot). Mutations are reduced in
mutvirus samples relative to mutDNA plasmids used to create these mutant viruses, with most
of the reduction in stop-codon and nonsynonymous mutations – as expected if deleterious mutations are purged by purifying selection. Details of the analysis used to generate these figures
are available at http://jbloom.github.io/mapmuts/example_2013Analysis_
Influenza_NP_Aichi68.html.
52
Frac. ≥ this many counts
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
all
1.0
Multi-nucleotide codon mutations
synonymous
DNA
RNA
mutDNA
mutvirus-p1
mutvirus-p2
virus-p1
virus-p2
0.5
Frac. ≥ this many counts
0.0
0
15
30
Mutation counts
all
1.0
45
0
15
30
Mutation counts
45
Multi-nucleotide codon mutations
synonymous
DNA
RNA
mutDNA
mutvirus-p1
mutvirus-p2
virus-p1
virus-p2
0.5
0.0
0
15
30
Mutation counts
45
0
15
30
Mutation counts
45
Figure 6: The completeness with which all possible mutations were sampled in the mutant
plasmids and viruses, as assessed by the number of counts for each multi-nucleotide codon mutation in the combined four libraries of (A) replicate A or (B) replicate B. Restricting these plots
to codon mutations that involve multiple nucleotide changes avoids confounding effects from
sequencing errors, which typically generate single-nucleotide codon mutations. Very few multinucleotide codon mutations are observed more than once in the unmutagenized controls (DNA,
RNA, virus-p1, virus-p2). Nearly all multi-nucleotide codon mutations are observed many
times in the mutant plasmid libraries (mutDNA). A key question is how many of these plasmid mutations were actually incorporated into mutant viruses during creation of these viruses
from plasmids. About half the multi-nucleotide codon mutations are found at least five times in
the mutant viruses (mutvirus-p1, mutvirus-p2), indicating that at least half the possible mutations were incorporated into a virus. However, this is only a lower bound, since deleterious
mutations will be absent from the mutant viruses due to purifying selection. If the analysis is
restricted to synonymous multi-nucleotide codon mutations (which are less likely to be deleterious), then over 75% of the possible mutations were incorporated into a virus. This is still
only a lower bound, since even synonymous mutations are sometimes strongly deleterious to influenza (Marsh et al., 2008). The completeness with which amino-acid mutations are sampled
is higher than for codon mutations, since most amino acids are encoded by multiple codons.
Note that replicate A is technically superior to replicate B in terms of the completeness with
which the mutations are sampled by the mutant viruses. Details of the analysis used to generate these figures are available at http://jbloom.github.io/mapmuts/example_
2013Analysis_Influenza_NP_Aichi68.html.
53
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
B
0.5
0.0
0.0
R = 0.98
P < 10−10
0.5
0.0
0.0
0.5
1.0
replicate A p1
D
C
1.0
R = 0.99
P < 10−10
2
0
1.0
R = 0.79
P < 10−10
0.5
0.0
0.0
0.5
1.0
replicate B p1
amino-acid hydrophobicity
4
replicate B p1
1.0
replicate B p2
replicate A p2
A
0.5
1.0
replicate A p1
relative solvent accessibility (RSA)
-2
-4
0
0.5
secondary structure (SS)
1
strand
helix
loop
SS
RSA
A TKR
Y
SA
K
R
S
W
KK Q
ET T
S
PSR
M
PF
R
C
Q
HD
T
I
DEMQ
CS
H
LV
L
P
Q
T
W
G
Q QN
I
C
W
M
Q
L
QM
G
I
VH
M
C
K
T
A
A
F
M
DT
V
W
E
F
K
Y
FCE
S
F
Q
W
V
SG
M
CY
I
A T
DM
N
TRP
V
F
Q
L
N
L
K
H
T
T
Y
TC
K
I
G
S
L
K
V
V
Y
Y
C
V
W
R
Y
A
W
G
D
I
L
W
I
C
Q
P
S
L
F
S
L
K
T
Q
D
A
K
K
W
I
E
L
H
P
H
D
Y
V
W
Y
R
C
V
L
C
C
F
KN
H
H
D
R
H
W
FV
E
E
E
F
S
D
E
H
N
W
AD
E
F
Q
M
Q
C
RH
L
AH
T
W
Y
W
W
Y
H
YK
V
P W
A
I
D
S
W
F
EW
CS
F
K
M
F
E
S
Y
V
N
SG
QK
G
G
F
S
F
R
G
I
I
W
G
L
T
E
L
N
C
C
WK
R
V
N
RF
D
N
V
CM
I
Q
M
Q
T
K
M
M
S
S
F
D
E
Y
V
R
Y
M
G
R
H
Y
Y
I
H
C
P
E
G
G
W
Y
F
M
I
PTD
V
YF
R
M
WR
P
MC
F
PW
RK C
G
S
L
AH
M
A
K
R
L
Q
M
F
P
L
D
G
W
Q
M
W
N
AD
Q
L
W
E
I
P
W
W G
L
R
N
H
T
E
R
T
I
I
E
K
R
C
WET
VS
Y Y
M
D
S
W
P
P
M
S
L
K
I
Q
T
A
G
Q
C
YA
V T
P V
KL
PS
W
E
MQ
I
H
N
M
T
D
L
W S
E
G
A
VH
G
P
K
F
ETC
VM
N
D
W
W
V
Y
H
S
I
IQNS N E
EG
N
Q
W
V
Y
H
L
C
K
G
A
VG
F
H
E
F
Y
K
V
I
N
C
M
H
Q
M
L
I
W
L
L
W
L
P
R
Q
R
Q
T
Y
N
F
S
SQ
K
C
A
G
V
W
I
M
H
A
R
R
QMITELK
GIGR Y
T
P
C
P
WP
K
FG
R
E
H
DQH
M
WH
SE
P
H
FH
T
N
EDEP
SQR
I
F
CK
R
LC
C A GN
T
I
QD
LM
C
F
V
I
AEHM
Q Y
SCC
EPTD
DSDEG
Q Y AF
NG VL
I
R T A SQ
NT
RP
I
A
T
I
M
E
M
F
TN
H
A V
GTPD V M A VLGP
N
MQP
RN
F
P
M
SM
FT
F
AC
N
D
I
RR Q
PS
S
E WL
LRR
VRNV G
M
V
CT
I
Y
YD W M
I
R AC
I
S
QC
H
T
A
V AMK
R
A
C
I
N
H
W
E
S
C
H
SVGKMI
M
S
V
L ANT
C
W I
RL
NL
E
V
A
T
A
T
FSLF G
V QFGT
AD
S
G
KQ
R MD
LNSPP A Y Q G A A TK T
NGG
TLSLI
NK
GSQNIQENLISM
C
AH KH
R T APD Y YHL
C
VH
YDND
SQ
LL
W AH
TE
NR
V
MV
TM
N
K
AG
R A AF
AC
MKH
V
SC
GD
I
A
RR
P
T
YD
CT
Q
R
N
H
LN
TDN
M
NA
I
F
CY
F
I
F
E
L
V
AK
K T W Q
FM
A Y
RP
FC
N
S
W
I
I
QC
N
W TR
W SY
YN
F
H
P
T
CHG A
G
C
T
H
N
QR
EQI
M AH
GPD
H KLE
AK VN
D I M
QL
ML
A WC
WE
V T
NV S
V
CL
NI
V
N
IW
M
RQ
E METDGE
LFG
YS
YN
R
FL
SS A
M
VH
D
Q
F
D
V
Y
A
C
N
F
E
H
V
5
10
15
20
25
30
I
35
40
45
50
55
60
SS
RSA
RMVLSGSAFYDLRRK WRFYLEE QA
I
Q
T
N
F
CMH
KG
G
Q AS
NM
K KTGGPIY R D
Q
N NP
CT
A
N
D
R
C
A
N
F
NPG
H
V
VD
T
TDM
RW
QI IN
K
QH
I
C A CR
DR
IN
F
MF
A TFQ
A
SGL
HN
VDN
MKL TECH
I
SE
S W CT
C
TP
K
Y
SR
M
Y
P
I
N
Y QC
I
NA
ES
I
EKN
S
N
D
P
L
REL L KEEIV RIWR VANNL
A
TD
K M
C M
L R
C
WV
F
TL
KH
VPF
YN
VIF
NFL DPI
H
I
W
K
R
GLKN
K
N
VI
W
A VFQ
QD
Q
E
R
L
Q
Q
C
S
GFGV
K
M
LV
AT
YLIGSL
I
Q
C
W
K
D
Q
A
Y
CFC
N
Y Y CD
D
L
Q STPL
G
SES
H
H
LQN
HY
F
K
LE
G WK C YEHY M
N
QMN
P
NY
SFSMF
R
ES
T
DK
TE
I
DTCLGQEK A
R
AM
V WD
NS
C
MH
KER W Y M
R W MR G
LD
Y W
R
CS
MH
KK
A
C
F
FR TN
A
SCFFK
SV
I
PN
M
NA GT Y
A V VK V
A Q Q WL V MGS W
V MGM
I
I
LDKDS
HQF
V
K
RS
I
CT
SFHCMT TK
TG
HHHQ Q T
TN
V
T
S
M
R
I
C
G
FT
Y VC
P
K
D VD
VE
M
EN
K
TDSC
C
P
Y Y V GC
QF
ESH
P
SV A QG
VK C
HY S A
Y
SG
LV
Q AHA M
W
W ANCG WF
C
G
Q
V
Y
V
HC
F
C
M
R
E
L
S
N
C
P
L
EQ
R
I
W
V
F
L
M
F
R
I
E
SQ
R
AH
Y WDTM
D
H
PV
P
M
T V TM A WL AN
N
Q
F
N
G
T
H
N
SS
NV Q W W
D
LK S
GQ
S
F
I
A
E
S
C
C
F
HYRN
S
W QM
H
W
N
L
D
H
T Q TM
I
L
T CQ G
CW
E
Q
L AL
EA
E
L
F
H
V CG
HY G
RP A M
E
H
RH
KP
H
K
I
N
P W QR
VK G
I
H
L
S
M
GS
PQ
A VE Y
R
D
M
L
E
V
QM
VN
M
VP
C
P Y
FA
I
I
G
K
S
L TCG YL
A
HT T
KG
K YMW
GT
N
H
I
C
TD W
I
E
HAF
A VR
I
AF
M
H
W MT
I
W AEG
C
G
A
I
MW
M
E
S
I
E
A S
P A T
F
L
T
PS
W
RR
F
MW
A
F
H
H
E
KH
F
C
TN
L
AD
D
T
QH
I
E
N
T
P
I
W T
N
R
Q
W
Y
Q
H
G
C
Q
E
R Q
E V
C
L Y
S
L
MY
V S
L
G
G
H
M
W
F
C
K
MY
K V W
D
T YR
H
M
H
L
V
RR
E
H
W
C
M
I
C
D
E
I
G
Y
F
V
C
M
F
W
L
E
N
T
R
W
L
L
T
W
G
Q
H
N
A
W
I
M
KQ
T
A
D
I
H
GS
LV
A
T
D
H
N
C
K
Y
D
Q
D
H
E
A
F
N
PT C
Q
F
K
K
D
Q
M
R
E
G
M
K
R
W
K
P
F
A
P
P
P
P
V
AK
G
D
D
D
E
G
H
C
R
E
E
S
W
M
E
F
Y
V
A
W
P
D
E
W
I
F
H
L
T
I
Y
M
S
W
K
A
R
Y
E
A
I
A
A
P
S
G
K
N
D
Q
N
K
C
T
G
W
T
Y
T
S
N
I
C
Q
E
V
N
V
E
I
A
T
M
W
P
F
L
H
Y
P
Q
Y
L
G
Y
G
E
T
G
H
W
N
C
Q
G
K
S
N
E
H
Y
R
L
D
N
N
R
P
P
D
V
Y
R
Q
R
L
R
C
P
P
S
H
W
F
M
I
D
T
K
E
H
I
I
S
H
P
S
S
E
G
F
A
P
K
K
L
S
Q
Q
F
Y
D
N
W
S
K
I
N
Y
G
S
C
T
T
E
R
P
P
V
K
65
70
75
80
85
P
D
V
90
95
Q
G
100
P
R
105
110
115
V
120
P
125
SS
RSA
YQRTRAILVRIGMDPARIASLMQGSTLPRRSGAAG AVKG
LNDT
GITHI
MM WHSNI
ST
HATP
G
F
V
A
NKK
AIL
I
LE
SM
HS
Q
C
C
HSD V
V
FQFSCMH
A
C
AI
S
TM
S
L
K
N
DR
GM
K
G
Y
Q
I
PKK T
Q
S
I
T
I
W
Y TC
S
F
F
I
MHNPF
L
K
V
S
GM
T WESQ S
P
FPT ST
N
HC
L
TV
T V MQM
VH
L
QP
I
M
K
E
Q
DC
W W
K WE
HQ
I
GE
I
N
H
HV M
Y W
V
VLR
Y
A M AN
LL
C
G
PTK VN
I
Y G V QH
HT S
A
SLMDKD A
DV
TD
C
LM
A
F
LQ
CDK
A TK GG
VM
N
CA
L
N
R
E
V
K
I
F
E
G
A
Q
M
K
CA A
AE
V
LS
A
MGP
K
T VN
T
H
F
G
A
MC AQ
S
I
A
Q
M
M
FN
N
A
KG
Y
TK
V
Q
I
QH
P
GTLMV
A
IAS V G
EHS AC
QT
G
I
D
EMEFG Y THA
S
N
A
PQ
SD
I
ST S
WD
M
FD
M
HG AP
H
YG
I
VCW
GV
NT
V Y A
HME
I
W
VS
W Q
VE
T
I
I
W
G A SSCR
HC
ECTE
R
I
N
R
LC
T
H
QL
WN
PHCRN
R
G
W
I
Q
S
C
CE
I
S
W
P
I
L
F
M
S
V
P
Y
L
P
MY
Q
V Q G VE WD
DMM V YD
C
N
H
N
D
M
EC
H
P
QR
D
QCW
F
H
E
Q CQFSSQH
S
K A T
N
Q
N
G
QDC
VN
M V SQ
N
H
H
V
R A Q
E
E
R
I
C
D
K
C
I
DV G
F VD
DTN
KK
D
G
T
P
EY V W
M
C
C W
T A
WEC
F
L
N
F
YF
K VK YF
L
EV G
M
F
K Y
M
L
KC
C
R
S
Q
K C
F
L Y RN
F Y
T QN
M
L
N
DQH
K
W
F
R G
MQ G
QM AH
Y QKK W YE
L
G
L TK QD
E
K
E
CQ
Y
C
H
LS
C
F
WE
P
L
V
Q
C
L
R
VH
A AE
I
L
FV
I
L
N
LQ
Y
I
N
F
Q
T
HAP
P
Y
Q
P
R
N
CS
C
M
K
Y
R
N
E
KFT
L
PQK
D
I
L
C A
I
N
V
YR
I
M
W
W
F
E
R
Q
G
Y
V
GT Q S
F
E
M
H
W
L
P W
MW C
N
K
G
N
G
PCS
C
G
F
A
S
CT
MG
TS
W
N
K
P
E
P V
R
E
T
I
E
G
Q
S
S
H
E
W
G
Y
T
HA
T
W
R
S
G
P
YH
C
Y
C
F
N
K
Q
W
D
I
L
P
T
H
G
L
H
L
P
C
M
G
C
F
M
W
WN
F
R
F
G
K
T
L
G
R
Y
D
D
H
135
I
T
W
H
R
H
L
K
T
S
P
T
G
W
Y
P
V
F
R
F
F
E
R
A
150
L
E
A
N
P
D
E
I
Q
I
M
G
M
N
P
S
M
D
M
145
P
R
V
D
N
C
Y
E
H
140
Y
V
C
I
P
Y
E
T
N
T
C
M
K
E
V
C
D
KM
P
KH
T
I
T
I
L
W
F
L
Y
S
F
N
K
A
G
T
YE
P
C
Q
R
W
P
R
H
R
D
D
Y
G
W
F
I
D
E
P
V
130
Y
V
L
A
W
V
K
R
K
Y
C
H
I
N
L
D
I
Y
K
V
M
T
E
A
S
VH
R
D
H
R
V
P
S
G
N
K
S
Q
Q
K
D
V
Q
Y
N
Q
Y
Y
P
H
E
E
Y
Q
L
E
M
W
M
M
N
L
P
S
A
L
P
F
E
P
L
A
T
I
F
A
N
W
R
S
I
N
K
P
N
I
W
T
Y
Y
P
S
R
H
Y
L
D
Y
A
D
P
R
R
M
Q
P
C
RM
E
A
F
M
S
F
V
F
R
A
D
Q
F
WNV
A
K
L
A
D
I
T CT
R
V
Q
C
E
L
155
160
165
170
175
180
185
190
SS
RSA
IN RNFW GELMGRKTRTSAYEDR
MELIRMI
TKR GL
D R
S
QV
L YM
H
T
MT
I
DE
AS
QY
Q
C
S
H
V
R SKI
Q
A
T
Q
I
C
V
N
KGKFLQ
NIL
AKL
Q
GNAEI
M
CP
QK AM
TD
AC
VREGRN
A VD
A
SC
P
R LA MK P
S H
A
Q
T
KR
H
HN
A
T
Q VS
TQ
GL
E
M
H
S
G
Q
SL
N
YD
VFN
TT
ET V VN
RY
F
PT
NLNKNMK TMFLQK S
HGFN
G
I
I
N
HQ V
EM
K
AA
Y
L
T
I
I
M
F
C
A
C
TM
Y AHY
MC
W
N
A
Y
D
H
Q
CS
FC
H
I I
I
Q
H
V
Q
M
CV
R
A
M
I
L
T
K
F
E
S
W
Y
I
YL
KS
SW
W
Y
T
LG
Q
QG
LQ
I
SH
QLFH
M
L
FCG AE
MV Y
N
T
QC V
LF
AH
E
W
H
VM
GEDD
V
A
VG
F
W WM
A
W T W MA
V
LKF
SD
DY
M
EP A M
CD
HVNV
QFG
SCCF
KE
DMC
S
H
Q
D
PW
GS
SKE
I
S
G WP
GF
C
H
I
CT QH
ML
MSCQ Q S
W
E
CE
T WN
DW
F
P
GT T
E
D
K
I
M
F
P
I
A
FM
D
LMM
C
AK
I
E
W
H
T A
NW
PT
CS A W Q
H
N
D
I
Y
I
DM
V
Q Y VE
KN
EGTR G
D
MT
N
Q
E
D
S AM
K
LM
ER SG
V
P
WF
N
F
KK
F
K A
D
NA C
I
N
E
L
R A A Y TP
H
L
V W VM
S
L
WNYD
F
W
W
YL
HG
I
MQP W
R A
F
T
T V
A SQ
V
H
T V
AR Q
LQ
E
M
P
CG W
FS
M VD AN
Y AC
Y V
C
I
F
KM
L
Q
K V
WR
L
N
Y
MV
DY CY
E
R
TG
HVL
N
V W
G
N
E AH
W
T
K
A
C
NWL
H
HV
I
L
I
L
S
V
VE
CA
E
A
K A TC
Q
A Y
K W
Q
C
I
H
Y
E
F
V
N
I
NL
V
QM
WK
K
C
T
G
H
K
V
G
D
C
F
R
Y
Q
F
K
C
YP
S
S
E
V
I
G
F
Y
H
L
WP
I
P
V
G
W
R
H
C
S
D
G
P
WE
W
I
E
S
F
I
S
T
T
I
L
G
P
I
G
H
P
A
I
T
R
Q
E
195
200
205
210
D
V
P
C
T
AP
V
215
W
G
G
C
Q
K
E
K
S
V
K
Y
W
R
W
R
W
L
L
G
T
V
K
D
K G
K
L
A
G
S
K
LSQ
N
T
R
T
M
T
L
F
R
H
E
C
H
D
Y
A
N
V
W
F
S
R
F
I
M
E
G
S
H
R
F
K
D
E
N
V
P
H
F
L
M
S
P
C
N
D
R
C
R
Q
D
F
G
L
N
R
220
L
P
I
ET
Q
N
Y
E
C
D
S
P
R
T
Q
L
N
M
R
W
C
P
R
N
G
M
V
W
F Y
I
RR
Q
E
V
T
C
W
N
N
S
K
V
Q
E
P
E
P
P
Q
E
I
W
F
C
D
ANG
Q
F
F
F
K
S
S
Q
M
KRN
R
R
S
Q WE
S
C
Y
YP
S
E
C Y QH
H
N
YD
F W M VK
M
E W
AD
H
DT
W
F
I
D
M
Q
G
A
P
P
V
I
N
T
N
M
G
Q
M
YE
L
W
I
KD
Q
C
D
S
P
Y
L
I
H
K
G
C
K
K
E
C
A
R
Q
A
H
W
H
F
N
R
R
A
N
P
A
D
P
G
W
A
I
I
G
G
L
W
G
C
T
G
R
R
I
Y
F
A
L
T Y W
NWH
N
EC
E
KC
I
Y
G
M
G
C
H
S
M
P
D
M
N
P
E
M
V
S
K
G
S
C
D
N
E
M
E
C
Q
C
M
WH
C
L
M
P Y
G
KFQ
L YR
E
R
E
A
W
F
P
P
S
T
P
W
Q
Y
S
W
Q
I
N
Q
K
H
S
I
L
L
K
KD
T
N
H
D
M
Q
N
F
F
C
A
D
WF
A
T
S
P
V C
V
S
S
Q
DT
C
N
Y
D
R
225
Y
F
230
235
240
245
250
SS
RSA
D FLARSTL
E LL
K
K V AH SCLPACIVYG
IL S V
S
L
Q
A AFMI GT
N
A
I
G
A
VV
LVGI PFK
V RGHDFLHQGY S
GR
MMI
IK
GA
SV
P
QK A
N
YE
K
A YHWN I
PS
V
Q
AI
S AD A
SRLEGIE
V
ARI
Y SLI
L
L HR
N
RP
Q I
AMM
VRR CNK
TMHG
LD
S AIQMQL TSNCQC
IL
N
S Y Q CMF
C
GG
M
C
Q
K
Q
F
W Y
H
QW
NN
TP A
LE
SQ Q
R
W
FA
C
QD W
S
CG
G
GE
WTA
V CV
M
T A TF
SM
F
W GD
YP
M
CRN
AS
CCHM
MT C A T
W
A CT
HH
T
I
I
T VD
SLC V CTF
A VH
SW
E
HSS
I
FH
W SN
TT
RK
NNV G
N
Q
VCY
Q
LG WF
P
Y
M
I
E
CM
K
W Y S
H
FS
FW
P
LCMSC W
D
I
F VKN
TR
MY T
G
LFG
I
G
L
YN
W
H
N
FA S
T QK S
A
N
M
I
LH
GF
K CG
E
W
H
I
FS
LM Y SP
M
PN
GY V
EGGE
M
RE
S
I
L
LQ
CGS
Q QP
C
I
G
PC
L
N
RQ
T
G
C
CC
E
QT
D
N
I
LQKL
Y
NV C
R V
W VN
V
VHCH
AF
CG Y S W
G
RFD
G
RF WK
N
L
GA
Q
I
S YM
N
L
Q
K
N
K
Q
M
S
E
D
T
C
I
M
I
V
Q
N
C
KQ
K
GS
NM
MK
T Q GNW VK
M
Y V C V MDM V CTPT
T
I
A
D
M
C
R
E
G
P
I
W T
Y Q
I
N
K
T
D
C
CTM
C
P
SW
V
N
H
H
W
T
Q
P
E
SS
C
N
F
W
A
AR
A
AM
MK
T
A
T
WF
M
M
I
F
F
A
A
Y Q
H
P
T
K
S
H
L
M
R
F
R
P
P
L
Y
Q
F
Q
WN
A
I
E
Q
I
KH
Q
V
L
A
V
D
L
C
R
G
P
H
E
D
G
H
K
N
F
S
T
F
N
G
I
K
N
M
S
L
L
D
T
P
W
Y
E
P
R
255
260
265
270
A
C
K
E
D
G
R
E
T
W
R
L
K
R
F
M
D
T
H
V
K
I
A
V
I
W
Q
WN
F
E
R
V
F
P
E
G
P
R
Q
F
V
Y
M
NY R
P
M
L
D
C
R
V
I
Q
C
P
C
H
W
YH
V
Y
V
Y
V
Y
W
V
I
Y
M
C
Q
N
L
G
V
A
K
V
WP
W
L
W
I
P
C
V
Y
Y
R
S
Y
R
WH
I
L
Q
H
F
K
N
D
F
W
T
T
KE
K
E
K
F
E
V
I
Q
Q
E
RE
K
Y
P
D
F
Y
Q
F
H
P
AH
K
E
KG
DS
A
RF
C
N
N
Q
I
R Q Y MW
HYL
P
G
L
Q
H
P YE
H
AM
D
V
F
A VK M
F
W
R
P
S
I
D
V
KH
G
T
WN
N
E
D
S A
E
P
L
H
A
N
V
W
WD
F
V
M
E
L
E
T
E
R
P
C
S
P
T
KP
G
A
E
A
T
K
S
F
S
T
F
K
D
D
E
E
M
R
V
D
N
R
YH
Y
K
P
W
Y
K
W
K
R
N
K
N
QM
D
G
PM
G
PC
I
D
D AD A MTRR C
TDT A CG
DT S AN
A
HW
K
Q
DT
H
V
E
L
I
I
I
N
E
P
E
R
F
K
E
M
V
I
VE
G
M
A
H
T
F
C
N
E
KD
A
C
M
F
H
F
G
L
R V
K
Q
L
P
M
H
R
G
H
G
S
Q
S
A
N
I
R
L
N
Y
M
ES
M
E
M
L
E
S
S
QD
Y
N
N
EHL W MG
Q
G
S
F
F
V
D
R
K
TH
P
N
D
Y
N
D
L
W
C
P
D
Q
H
W
Q
E
W
F
Y
Q
C
W
L
I
S
F
I
PT
N
R
F
V
H
H
M
I
AHY
I
R
L
R
A
V
G
V
H
F
ET
M
H
P
L
T
Q
H
275
I
280
P
285
A
E
T
290
T
295
L
300
305
310
315
SS
RSA
R E P HKSQLVWMACHSLA EDLRIVLS IRG KV PRGKL TRG IA
PT
SN
N
K
V
H
C
G
H
A
V
R
F
Q
F
L
M
MI
N
H
TS
R
RL
I
I
N
Q
V
P
A
G
M
M
V
STG
GLL
FDN
T
LT C AC
M
A
L
C
DISG
SNAGR
G
VQ
IF
Q
S
NI NIESSTL
G
AL S
SM
GR
DP
T E
MR
H ME A CL
I
L N
AN M N
R
R
TFLMALKE V
HF
C
T
SCK
CL
T
A
VDL
SN
T
D
Q
E
H
VY
M
T
NK C
R
M
C
T
KF
A
G
P
W W
NC
A
I
TQ
EY
G
K
PL
R
W
Y
S W CK
Y
P
N
E
F
E
KE
FQ
DHC
MMT
C
AL
F
TT
V CK
QY
I
S
C
LM
E
Q
Q
V
G
DT YP
Y
T SSS
T YD
V
I
QM
VFHGKDD
SKE
T
H
P W MA V
SA
I
W
M
QH
NW
MTRH
EQ
GW QC
D
A Q SQN
FY V
Q
K AD
C
I
R W
N
N
A
C VE W MG V
CGC
P
CH
F
CT
Y
N
M
H
S A VD
SG
HA QN
W WQ
PCKF
S
I
H
PS
D
NG
VC
CM
TN
N
T W T VM
I
LQM
S
Y Q
D
I
K CT
PT G
R
LD TH
D
S A YFQ
G
F
LLE
L
NVH
I
V VN
VKDSS
T
P
WF
TP
G
I
N
Q
VDQ Q CE
H
Q
N
FH
EQ
I
H
M
FS A VDG W A
H
V S VM
MQEM
W W
MT
N
GY
N
H
H
P
S
K
M
E
HS
P
P
F
S WN
K S
F
N
P
Q
L
KL TM
CN
FC V M V
DA
I
M
S
Q W
AE A Q T
CG
P
C
T
A
KRK
E
Y
A
V
Y V
I
I
F
K
KM
VL
I
DA
R
HQ
A
H
K
N
T
S
K
PK
LY
I
N
I
NV
V
N
I
W
D
I
W
S
V
D
N
N
L
D
S AI
G
S
L
G
C
R
F
FLN
V
A
N
Q
A
M
A
G
RE
AN
Q Y
KE
L
R
MQNY
C
R
A
QG
K S
R
K
V Y
V
P
T
T
D
F
P
Q
V
Y
F
L W
P
E
G
DT
G
GT
YF
K Y C
I
Y
F YR
F
T W C
A W
YP
N
Y
F
M
W QN
H
FGQ S
W
S
Q
MV
C W
K W YN
Y
G
W V
FT Y
V C
V
C
M
A Y
S
M
L
H
L
CC
N
HW C
C
M
S
RG
W
M
P
A
DT
F
H
I
I
I
F
F
R
F
E
L
L
F
S
H
HA
L
H
R
D
F
D
P
N
M
K V TG
F
P
H
H
D
S
G
H
KK
F
T
L
A
F
F
M
WN
I
R
D
L
N
F
I
N
E
KL
Y
S
P
P
SG
K
R
V
K C
E
R
Y
L
FT
G
RC
Y
D
Q
G
P
F
L
P
EW Q
A
P VH
I
D
G
I
G
K
Y
G
V
H
Y
P
W
I
E
A
N
D
S
V
H
K
T
WF
W
H
PT
R
A Q
C
N
I
L
V
T
Q
R
C
W
Y
WM
I
I
AR
E
C
F
A
I
T
G
A
T
C
Y
N
V
N
Q
S
S
T
V
P
330
335
340
C
L
345
Q
R
H
K
G
Y
D
K
Y
W
R
C
K
WE
P
Y
A
H
I
F
M
W
R
P
Y
D
A
I
C
365
N
V
R
H
F
360
L
G
P
Y
Y
M
P
355
M
H
H
S
D
R
F
N
C
WD
R
L
W
V
350
V
N
Q
H
K
R
A
I
K
A
M
F
I
Y
E
H
D
Y
K
Q
M
R
W
L
P
Q
F
C
D
A
W
G
325
N
ST
C
K
P
K
320
Q
S
M
H
Q
W
A
D
L
E
H
Y
I
F
K
I
Q
R
R
P
E
C
Q
M
M
M
H
H
L
D
VD
D
Q
Y
H
M
E
D
V
A
L
K
I
G
Y
E
W
Y
R
I
P
P
G
L
M
V
W
G
Y
T
P
R
C
M
E
E
Q
G
C
M
W
Q
G
R
H
T
G
S
Y
E
I
E
E
D
F
R
A
A
A
Y
Q
H
I
P
F
L
L
H
I
T
T
D
S
G
T
L
E
L
T
M
I
G
N
P
L
I
K
370
375
SS
RSA
IRTRSGG
EQLTRSKRPY W A
T
Q
M
VG
FY
Y
A
K SKI
S
CA
A
M
L
T
K
C
Y
R
F
T
R
W
L
M
V
T
H
C
FQ Q
LV
I
A
G
GRT I RT
G M
A
SS Y
LE
F
H
Q
T
M
K
C
P
SYLIS
F
Q
Q
AF
GK
GN
A
K
NI
IL
CTQ
NVD
CC
MGN
ARK W
GL
LCP
HK Y T
V C VNH
F
R MP
IS
V
S
AT AF P
AT
T
CT
H
TMS
TN
L VH
VV
K
TR
K
W
YF
MN
T
N
L
Y
SH RLICGT
MPF
AR T LT A
V
L
L
R CKK
I
I
E
N
H
V
SV
I
T
V
W
I
S
V
N WHI
D
G
M
AHD
D
QS
R
D
VE
E
D
Q SP AFK
S
Q
S
I
H
Q
G
A
C
VN
K
H
P
A MCP
C
N
I
I
Q
I
N
T
V
N
D
N
E
P
C
V
YH
N
I
E
W
L
A
C
C
L
Q
VR
I
A
G
Q
T
F
Y
R
Y
F
Y
E
RM
E
H
N
F
Y
V
K
V
M
T
F
D
G
W
L
G
C
D
H
E
N
L
P
HW
I
M
R
D
W
M
C
G
H
Y
S
Q
Y
W
RM
R
P
Q C
D
I
W
C
R
M
I
D YM
W
M
G
H
S
L
Y
QC
P
T
I
P
P
I
F
Q
P
G
W
L
H
K
Y
R
T
E
P
F
Y
N
K
P
A
W
V
I
N
D A
W
F
R
M
E Y
D
R
E
M
V
K
D
G
Q
P
I
D
P
N
Q
E
P
S
L
Q
H
Y
L V
W
N
I
V
A
M
V
I
F
G
Y
G
S
M
WD
F
T
M
S
H
I
H
M
H
Y
P
T
I
S
W
P
SL
P
CC
T
H
V
KF
YLN
D
G
F
A
V
N
L
P
L
M
N
C
M
A
G
VC
T
Q
QL
S
C AM
E
CR
W
N
L
W
D
M
H
P
Q
L
D
S R
A A
VC
AGQISM
A F MVSA
K
S
T
N
E
S
K
P
L
SM
H
H
EY
N
QN
E
WN
F
Q Q QM
VC
NA
VH
P
MT
Q
MQ A A M
V
I
H
I
G
GS
I
G
N
GSS
T
L
MR
N
L W G
W
N
CA
C
GV A W
E Y YK T
KPT
H
N
MQL
V
M
F
S
S
N
I
K A AM
QC
T
EQN
K W Y
L Y
H
N
C
D WL
P
E
A
I
QK
Y TR
C
Q Y
L
I
DS
CV
R
C
P
I
T V Q
L Y V YC
W
I
Y Q S
Q
A
G
KM
E
I
E
L
V
R
H
E
G
A
E
N
Q
W
I
S
K AR T
C
CG
YH
Y
D
F
R
R
S
C
RD
H
M
Q
V
E
M
Q
M
T
W
Y
S
W
H
W
T
N
K
AP
P
I
Y
F
I
T
R
P
R
HTS
P
E
P
KF
W CQ A T
VH
Q
D
F
FQ WH
TR Q CCT
W S
S
Y
E
K
T TC
G
F
TC W
V
Y
C
EQN
K
N
F
F
N
F
D
GV
A A S
EV
M
F
R
S
A
R H NT SM
NT A
RHF
Q
AN
K AKK
A T AL VH
SMCS
S
CFC
SL W
HQ
H
QHHMHG
W
R
N
W
KE
E
GW C
N
V
N
I
QK
N
KH
N
I
S
G
C
AP
Y
H
L
G
R
KRR
N
A
H NKKR
A
Q
NMHS
V
WNCN
G
Y
D
KDH
PDL
D
G
K
V
MA C
C
K V T
F
E
CL
I
SR
T
N
H
DW
F
E
D
V
Q
E
P
P
Y
I
E
F
W
T
F
F
C
V
A
380
385
390
395
400
405
410
415
420
425
430
435
440
SS
RSA
EIID IMENTGKR
M
C
Q
I
CQR
M
WY
L
W VKKH
RGIVFE DERAA PIVPSI MSNV SYFFGDN E
R
SFR
GG
Y A
VE
GG
YRNNATK
CLR WLNWD
EF
Q
CR
PGL
N
T GNF
QL
Q Q Q CQ
T A THQDN
I
I
AG
H
Y GT A M
ST
V
V
VLPL
D VLS
P
V
Q
E
WS
M
W
S
AF
MT
H
A
SV
M
S
M
TM
Y
Y
H
H
A
W
R
V
G
V
A
A
M
M
W
V
I
Q
T
L
D
CS
D
F
P
E
P
I
S
Q
N
R
C
N
T
R
C
T
K
H
C
M
Q
F
T
M
N
R
D
T
G
Y
K
Y
L
C
S Y
Y
K
V
F R
F
TG
QM
G
H
I
D
V
E
A
C
F
H
Q
P
P
C
V
W
Q
D
P
A
C
D
I
W
W
Y
L
L
K
W
A
W
H
N
M
K
FH
N
L
L
TQ
M WL
W
C
S
F
H
N
Q
N
L
Q
R
S
T
G
A
D
V
V
A
F
L
E
R
C
E
A
G
D
Y
T
W
I
Q
I
S
G
L
H
L
N
R
S
D
Q
F
T
H
C
R
H
E
Q
K
D
N
E
I
F
V
V
L
F
QM
C
M
F
WK
F
M
S
R
C
H
G
KN
I
I
R
C
D
S
M
G
L
C
M
M
H
M
D
W
F
S
K
K
EY Y
N
A
S
V D
TND
NK
K
W GN
ES YI
GM
L
I
NQK
TR CR
SA A T
Y CK
W THCH
T VNM
H
S
H
G
H
V
Y
L
NIK
T
A
N
D
L VM
Y
I
I
I
V
L
VP
P
K
KM
I
AG
R
S
MW
Q
R
Y
C
M
H
R
D
MK
MG
E
E
H
M
D
S
N
Q
AP TN
Q
N
T GC
KD A
I
D
I
L
M
Q
S
HHF
RG
LA
EK
Y
YG
IWLH
AT
Y
TMC
T Y
PT
GT
I
R GQ
S
D
W
V
H
VD
Q
C
H
K
R
R
V
Y
K
A
SV
E
RM
VC
NT A
P
E
KY
S
S
H
T
A
N
Y
D
S
G
QF
C W MP
EF
A F
RD IY
P
W
Y
FM
G
E
W
D
L
L
W
F
N
W
Q
D
L
T
GPE
E
S
H
Q
T
E
FT
S
M
Q
M
S
WP
H
C
G
R
N
QN
NV
V W
V
A
Q
L
LFP
I
STR
D
P
P
R
H
L
K
T
D
E
T
S
H
MY
F
VD
E
C
G
E
W
F
A
D
I
G
Y T
PM
V VP
FG
M
L
T
AL
I
K
S
RR G
NW S
NA
AK
W C
E
H
I
C
Q
M
YH
K
C
MI
CP
F
MI
HR
Y
KE
I
K
R
TF
HL
K
R
LCP
KSA
N
LQEE
LW
H
N
KL AFSDE
QK
S
TA A
SC
REKH
R
D
PS
A CR QM
WP
L
QLT
R
I
H
T
FK
KN
G
L AH
SGG
M
H
DG
H
Y
G
N
I
H
Q
F
W
C
V
Y
L
Y
F
445
450
455
460
465
470
475
480
485
490
495
Figure 7: Amino-acid preferences. (A), (B) Preferences inferred from passage 1 and 2 are
similar within each replicate, indicating that most selection occurs during initial viral creation
and passage, and that technical variation is small. (C) Preferences from the two independent
replicates are also correlated, but less perfectly. The increased variation is presumably due
to stochasticity during the independent viral creation from plasmids for each replicate. (D)
Preferences for all sites in NP (the N-terminal Met was not mutagenized) inferred from passage
1 of the combined replicates. Letters heights are proportional to the preference for that amino
acid and are colored by hydrophobicity. Relative solvent accessibility and secondary structure
are overlaid for residues in crystal structure. Correlation plots show Pearson’s R and P -value.
Numerical data for (D) are in Supplementary file 1. The preferences are consistent with existing
knowledge about mutations to NP (Table 4, Table 5). The computer code used to generate
this figure is at http://jbloom.github.io/mapmuts/example_2013Analysis_
Influenza_NP_Aichi68.html.
54
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
amino-acid hydrophobicity
4
2
0
relative solvent accessibility (RSA)
-2
-4
0
0.5
secondary structure (SS)
1
strand
helix
loop
SS
RSA
A TKR
F
S
YIR
Y
A
FN
VT
AR
N
TP
C
HA
HL
CD
I
K
N
M
S
AR
W
LI
G
I
T
T
Q
T
SP
ST
V
C
L
D
Q
T
S
A
T
M
P
F
V
Y
PT
Y
V
V
H
T
C
ASSLT VNTGVTI
S
I
K
D
R
L
NP
RI
P
S
Q
M
G
D
AD
C V
P
T
E
D
I
G
Y
E
I
G
D
I
C
P
W
Y
F
K
K
G
M
Q
F
R
D
P
C
H
E
N
C
I
F
I
C
Y
Y
Q
F
H
H
W
N
H
C
Q
V
C
C
K
Q
E
D
D
M
H
P
N
V
F
S
YE
I
K
M
D
N
N
N
I
A
V
R Q
L
L
P
QC
N
N
C
A
G
Q
Y
M
F
N
Y
I
P
V
W
N
A
Q
A
C
I
T
N
F
H
H
D
H
ES
F
R
YD
E
H
F
L
I
G
I
E
W
T
D
F
C
P
T
L
V
K
F
G
I
Y
A
E
C
N
C
I
D
C
C
P
G
G
H
P
K
Q
G
Q
G
R
A
L V SN
SY T
Q TP
KS P
HR TS
QV
PSF
YR AP V GAD
Y
L
S
APSNA
V
F
N
N
I
P
L
L
T
F
S
F
C
G
I
F
L
P
H
D
V
G
Y
C
L
N
KR
H
P
N
F
V
A
Y
V
N
A
Q
D
SRR
V
R
S V GN
R
REP Y
CA
C
H
I
T
M
VK
Q
N
F
M
C
TL
T
F
IK M
S
AR
P
C
L WHL
GG
G
T
F
D
LT
G
V AH
I
TD
R VP
V
G
Y V
I
LDRRKN
GTPH
PFG
T ALS
I
RR
R VT
GR
Q
T CNT
D
S
AQ
S A ALKRD
VL AH
AN
I
P
RL
H
I
A
G A TD
RF
N
QL
RLL A T
LH
C
SSG
RK AN
R CG
CT
I
VK VL
Y
P
KHND
S V QDC
P
V
Q
I
P
G
H
H
C
H
G AL
R CM
E
KN
M
L
A
PQ QL
F
CP
L
I
RS
R
NV
K T
EC
TE
C V GS
PC
H
M
FQ
RM
MQ
P
I
Q
T
E
C
TP
R G YN
I
HS VL V
LS
E
EC
F
F
I
K
V
V
G
S
V
A
T
F
I
Q V
I
K
T
F
F
S
Y
Q
K AT T A
LE
QN
TP
ALGL EGRL
NL GSLS A
ELT
G
SVGRSI GSIGSR ARYSQ
I
T
T
I
G
N
W
E
P
SQ S
T
I
H
P
F
PT
W
H
Q
T
E AD
R
A
SA S G R
LL
LSH
S
R
QR V
NAI
NVLR
A
AIAK
V
G
TV
P
M
LG
QE GE
SR A
KRR
V SK
VT
E
S
A
E
CP
VR
L
L
A SH
CQ S V SQR
A V
N
A
F
N
E
TT
MY
TE
NH
Q
R
E
T
I
L
RC
Q
A
C
D
G
C
Q
LR
I
A
GN
Y
L
V
W
V V SIK
T
IR
L
N
I
M
R
SLAESKRDMETDGPGLV QS
L
G
LL
SS
M
F
Y
G
H
V
Q
P
W
M
F
F
K
F
Y
Q
5
10
15
20
25
30
35
40
45
50
55
60
SS
RSA
R VLSSA DLRR RYL
F
R
Y
T
AR
T
S
M
G
FA
S
N
I
CS
G
HG
SR
I
V
P
A
I
K
T
G
T
A
PTL A T
LR G
T
N
I
S
L YR L
RTGGPI
RADRVFLRELCV
L DAQ
PSS
TSR S
SS
AH
R
V
LLC A
T
I
Y
W GR
GHSL G
I
RTV
N
KEL VI
S
GQH
A
K
S
C
AS
RP
K
M
G AD
V
N
YK
F
LDG W A
N
Y Q
L
P
E
G
E
T
N
N
N
RR T
E
DQ V TCSCG
N
F
YP
TC Y
KCW C
DC
M
EA
E
K
W
K
L V
I
I
I
KN
I
I
Q YK
A
G
RM
R
R
KH
T
Q
FV
CT
Q
L
Q
K
L
V
CV
G
D
C
C
H
P
N
I
Q
N
F
Q
E
W
N
M
H
A
ET
K
Q
H
C
I
Y
R
C
Y
Y
L
L
V
F
S
Y
I
Q
V
N
R
YG
Q
P
Q
Q
E
C
V
T
P
C
M
G
Y
R
N
H
G
H
S
A
S
V
A
T
F
D
V
P
H
E
C
M
E
C
C
E
TM
N
Q
H
H
F
I
E
N
N
Y
F
D
M
Y
G
V
HY
C
K
H
M
W
G
H
F
W
Y
C
P
K
M
M
Y
P
K
H
M
D
F
I
H
A
Y
V
N
I
C
H
K
P
TP
N
Q
K
I
Q AD
C
I
P
R
N
P
S
I
Q
G
Q
F
F
R
G
A
P
Y
A
P
H
E
D
F
F
P
H
Q
Q
D
S
A Q
H
I
F
P
W
M
F
E
Y
D
E
E
M
75
Q
LRR
TV
C
N
K
T
PS V
G
P
A
K G
L
A
P
G
Y
C
K
D
P
P
D
70
C
G
S
LL
N
QI
TA
W
65
NLLL
V N V
L SSGA
G RT
SIG
S
W
D
NR
RS
G
K
E
T
E
M
M
Q
E
H
Q
A
I
W
D E
K
N
SL
FL
M
S
F YR Q
AD
T
G
DA
H
Q
C
I
N
C
N
P Y
N
A
Y
G
G
K
TG
I
D
N
E
C
H
P
D
P
LT K EIVRRI R A
L
I
P
I
P
L
Q
D
V
M
Y
Y
H
S
KR
A
H
W
T
I
H
F
D
S
R
A
E
P
N
F
M
T
I
H
V
G
H
M
A
Q
E
D
Y
V
K
C
F
H
K
K
H
Y
T
N
AVS
P
LP
S
I
T
T SR GLSN
A
LA
A
G
N
F
W
P
G
F
A
VT
P
L
RHAH
T
V
V
G
G
N
V
E
G
N
M
H
P
P
N
H
C
M
T
C
C
P
P
VL
RS
I
QT
L A GHA
R
LTL
SEN
LR SF
S A Q SR
Y
L
E
L
A
Q
C
K WF
T
KK
G
M
N
S V VLKPPL
SRI
R
G
T
SD
NP
R
T
VG
V
F
P
V
GSPL AL
W A
A
Q
A V
I
T
A V
CK C
F
C
V V
EGQ
H
T
LRF
I
L
I
T
P
G
SV T
I
I
NVR
S
I
T TD
H
CQ
FG
L
LCSLED
FCL
E
C Y SV VG
KN
D
NY T T
R VP
E
TL
S
H
TH
T QE
FA Q Y
I
GG
RK
E
V
TK GMGRR
YC
I
VKHAH
SV
F
V
I
R T
A
H
GQ
W
R
I
I
C
F
Q
P
R
YD
C
E
T
K
F
HS
N
S
I
I
PTE A
NP
NL
R
FTM
T
S
VI
K
CN
CL
S
R
V
A
I
V AD Y G
SV
LG
VF
DT CHAR
D
K
AG
T
EQQ AR
E
NP
T VF
SR
PN
S
L
SF
I
QI
VL
SA
PL
80
85
90
95
100
105
110
115
K
K
120
125
SS
RSA
AA
H
I SNLNDSTTYQRTRAILVRSGLDARIASLL GSTLPRRSGAAGAAVKGVGTL
M LH
TPG HI
LL W S
AI
S S V L
I
LT
GV
L
R
P
TR
R
V
I
S
S
G
GQ
R
T
Q
V
L
Y T
P
K
KG
N
P
Q
AK
R
RN
VD
T
P
Q
R
C
M
Q
V
V
E
TRL
I
R
G
T
P
D
Q
Q
C
K
M
Y
T
P
Y
C
Y
P
V
N
H
E
T
D
C
P
L
A
V
V
D
K
T
K
K
Q
E
I
YD
R
N
V
N
E
Y
K
C
D
H
Q
N
F
P
K
Y
H
Q
C
N
P
C
Y M
Y
V
M
C
I
T
R
P
N
I
YF
I
T
AR
Y C
F
CT
I
T
TR
M
V
P
F
C
S
Q
P
E
E
T
Q
Q
P
Y
Q
R
P
E
H
A
G
T
R
T
L
D
I
I
E
V
K
R
H
Y
V
M
GCSC
R
F
V
A
S
R
I
G
V
P
P
P
N
Q
H
M
K
W
Q
D
Q
A
S
IS
A
R
H
S
R
D
H
E
LR
Q
LE
P
TP
L
H
Y G
T
E
P
Q
Q
H
H
E
W
Q
G
I
LLG
F
RNA
L
N
L
D
C
H
K
I
V
N
F
F
V
D
D
F
R
A
Y
E
G
RF
C
E
V
G
F
T
FT
P
N
W
L
L
E
D
Y
Q
K
K C
N
H
K
HT S VH
G
I
TC
I
Q
C
H
WR
Q
Q
I
T
A
D
F
E
M
AP
K
I
G
L
S V
G
P
A
Q
V
N
A
S
R
Q
C
SNG A
L
HAP
D
CM
C
L
L
F
G
R
IS
RP V
TS
I
K
RL
D
MCN
T TKR G A T
K
G
H
C
A
L
E
K
Y
F
E
D
R
N
F
H
R
E
Q
T
C
L
S
S
A
A
E
S
C
PA
TS
G
Q
L
P
VN
L A Q V T AEH
S
I
R
GS
A
L
FV G
EV V
T S VL
I
SV V
A MQL
PC
H
I
I
T SGR
STRL
MH
C
PM
L
R
YK
LS
RS
G
L
G
S
I
A
Q
L
Q
P
R
Q
V
P
F
M
C
C
VI
R
Y
C
R
D
G
T
M
N
E
A VR
C
N
P
T
H
T
N
P
E
G
T
GS
D
S
E
I
G
I
E
I
S
S
G
N
T
R
R
R A G
LS
H
P
L
E A
VD A TH
C
H
P
K
Y
QMQ YL
A
F
A
V QPGG
E
S
YF
D
L
P
I
C
Q
D
T
G
E
K C
P
C
I
G
E
Q
F
I
N
A
V
P
A
D
N
L
VV
MS
P T
A
S AQ
V
T
S
R
S
Q
L
R A G
RK V
Q
L
H
K V GG
N
P
N
G
K
I
RC
H
A
S
E
SM
S IL
I
V
I
A
L
V
A
F
K
G
R
VC
Y
V
Q
S AG
T
I
TE
K
GA
L
F
I
A AS
T
SLL
I
TR
G
RT
T
Q
VH
I
FSK
VPS
MPP
GGP
R SM
Y
P
VM
GQ V
CST
L
NK S
F
VLL
I
P
A SL
L
AD AFC
VT
SK
I
FSH
CSS
V
V
PTHHL
L
VC
N
R
P
P
G
PA
TS
A
D
N
V
A
Q
N
N
Y
Y
I
F
E
F
C
Y
W
K
K
C
M
A
K
Y
G
D
C
N
M
D
H
D
D
130
135
W
140
145
150
155
160
165
170
175
180
185
190
SS
RSA
R IKRGIL DR F RG LGR TR A R
LELSIV
R
M Y QL
G
RN
TR
I
I
SS
M
TS
V
A
V
A
T
SR
GQ
Y
L
VQ
ST
R
I
TN
S
K
G
H
SDS
YR G A
V GMK
N
Q
LQ
L
G
LF
RA
TA
P
KS
FH
S
SP
V
A
TQ
H
NPH
V TRL TL V
H
QKK ALH
GR
I
LYG
I
P AES
ST
A AEM YR
N
SV
F
V GS
N
A
L
I
EQ C
N
VR
VF
Q VRESGRSPAGNLAEI
Q
RR
L VD
Q
L
GT
A S
A
SC
SM
SA
T
G I
K
A
LKGKL
LNL
R
S YE S
V
A AL
T LD G
C I
L AL K S
LST
VS
T
AT V T S
S
E
V
SLL
T
A
W
N
T
N
H
S
A
L
L
AK
L
SS
T
T
Q
VL
S
I
GA
ST VL
G
A
P
DP
C
V
M
TI
R
LF
LC
G
P
S
K
S
A
L
N
R
G
GTR C
AT
Q
P
HSL
VL
T
D
S
MGR
I
I
C
A
S
VS
Y
N
L
KP
HQ
SM
Q
A V
VS
F
CE
G
A T T QL
ST A A
AS
N
YE
L
L
R TK
RP
HA
AD
Q
S
A V
SHG
R A V
V
PY
I
I
N
F
YG
Y
GC
D
NVH
R T YRN
S
TQ
HA
Y T
FR ST
L VH
T
F
I
GSSH
V VL
A
T
R
N
C
FG
N
P V
P
C
S
KF
V VL
D
DA
I
N
R
N
I
P
AG
F
P VPR
N
LQ C
D
S
DG
SQ C
R G VRH
R CN
F
R AR W V
CT
P
L
F
P
G
KP
E
L
I
L TP
I
G
I
T
M
R
G
E
G
A C
A VE
AK
P
E
G
R
Q
I
PS
C
RC
I
TRE
D
E
R
P
A
Y V V
N
Q
P
R
I
D
T
M
T
A
V T GT S
P
G
L
T
K
T
A
F
K
L
C
A
W
Y
L
C
Q
N
K
Y
Q
Y
Q
E
H
P
P
N
K
N
M
G
F
E
M
P
K
D
L
N
R
I
W
V C
KE
E
Y
Q
Y
N
Q
P
C
C
T
I
G
I
C
T
FY
P
G
H
Y
Q
N
Q
W
D
N
R
K
E
F
P
K
Q
C
G
K
Y
A
C
F
E
Q
H
H
D
E
W
P
H
Q
F
T
Y
K
K
C
Y
H
F
D
H
W
E
H
V
D
H
V G
H
L
V
Y
P
S
G
A
G
C
M
I
K
I
G
L
V
T
T
E
F
W
D
G
N
W
W
H
T
R
C
C
R
K
G
E
V
N
E
L
F
I
D
E
M
L
V
QR
Q
AL T
M
N
C
M
Y
R
H
C
C
A
L
H
L
P
Q
P
C
C
Y
D
E
V
H
I
C
N
D
W
H
Q
H
K
I
H
I
D
Y
H
Y
H
T
P
Q
V
Q
F
E
D
K
Q
I
W
G
K
R
P
M
C
C
V
I
H
EQ
F
C
F
E
A
S
Q
I
Y
P
K
I
N
M
H
P
H
L
N
P
R
N
Y
Y
P
V
F
M
V
D
T
A
C
A
P
P
G
RE
P
N
G V
K
F
C
I
C
N
Y
Q
E
V
Q
T
E
I
S
L
W
I
D
F
FT
K
E
D
T
G
E
D
H
A
P
Q
G
C
I
I
Y
I
K
F
K
M
A
P
K
D
I
M
P
E
Q
K
S
G
R
H
N
D
M
K
D
T
H
M
I
G
R
G
N
A
P
C
C
Y
IV
S
M
T
R Q
GT
P
H
V
F
M
M
F
P
K
VE
I
F
D
N
C
P
K
A
P
F
I
V
P
S
L
K MQE
H
Q
Q
N
Q
K
M
W
195
200
205
210
215
220
225
230
235
240
245
250
SS
RSA
DLLFLARSTALLLRSSVGAHKSCLPACIVYG R VARSRGHPD
Q
K
V
I
L
VV
ASFP
TR
VS
T
K
Q
GYSLVGI P RLLLRARVYSLI
F
L
RP F
S VK
N I
R
AL
L
S
A
S
D
Q
I
P
H
T
R
T
T
ITC IS
A
H
Q
SLS
G
V GA A A
QC
ST V
R
I
V
ST
V
G
A Y
A
V
F
AL
LV A
Q
Q
L
R
G
L
I
K
Q
M
T
R
P
F
F
R
I
S
SG
M
VF
M
VP
TCT
V
D
Q
CYC
C
H
P
M
R
LG
L
GT G
A T
QG
L VG
LN
I
E
FC VRH
SA
CQ
Q Y
P
L
I
I
V TP
LH
R MC
P
DT Y
I
Y
PR A
A
VE
RH
RN
A TD Y Q G
CQ
SGC VNVE
NV YE
P Y MGC
G
I
P
P
D
L
H
G
K
P V
T
RRL TPSS
A
VC
P
V
RR
P W AD
V
C
F
PQR TE
G
CSM
W
S
G
I
I
G
N
N
T AN
QK C
RP
Q
D A
P
L
T YP
F
R QK G
C
I
M
A
R
E
H
V
L
M
C
H
C
S
H
I
V
C V
W
V
AC
P A
KN
H
F
F
G
I
Y KP
NV
H
E
M
P
H
H
C
W
Q
M
Q
C
I
E
I
E
Q
N
F
P
D
C
F
Y
VN
K
Y
K
H
I
K
P
H
C
C
S
W
D
H
C
G
E
K
A
E
K
Q
F
D
D
F
F V
L
N
Q
T
K
V
E
Y
H
M
K
Y
T
C
N
E
D
Q
H
L
AI
R
T
H
R
CGGT V
Q Q TMC
S
V
S
L
VS
NNPCCM
I
L
Q
I
K
M
A
L
RF
N
NA
P
KF
F
P
H
S
Y
Q A
Q
R
V
R
VL
KR
W
P
GT
E
H
N
S
G
P
P
W
I
G
Y
H
T
N
F
M
A
D
G
R
D
Q
N
Q
H
E
C
I
N
F
W
N
K
P
E
C
K
E
N
Y
Q
I
E
K
K
W
T
W
N
N
E
K
E
G
Q
I
M
D
K
M
E
D
M
N
I
W
H
W
I
T
L
F
R
C
P
G
D
Y
V
N
E
P
A
W
Q
A
H
K
R
G
N
I
A
F
P
T
I
C
F
I
V
I
M
E
G
S
R
Y
N
T
M
V
F
D
N
L
D
C
N
H
E
Y
E
L
D
S
M
E
F
H
F
K
Y
L
Q
C
C
I
T
A
L
A
Q
I
V
A
H
Q
A
E
I
M
C
G
TL
T
FQ C
GGS
T ST QNA G
D
A
TP
N
SR
SG AI
RT
T
S
GRP
AKR
PR
A
TPL
EE
IYI
GP
TS
R
T
R
ALK
G
Q
LS
RL
I
TMT
TRRN
CS Y
TL
GN
E
N
PPR
S
TG
R
I
N
V
HQPT
I
I
L
LL
A
VF
G
TK V GK
T
SG
AK
V
I
N
S
G
QS
A
HQ
GP
RE
GRN
N
LR
I
KS
SY
S
GL
L
A ALG
R
S
T
A
A
QPHL
SI
VG
IS
EA
VV
S VL V SSK S W
S
II V L
GT
A
S
V
LA
GP A
R
LSS
RL
G
TY V
TCC
I
A
T
A
L
S SV
A
G
E
S
FN
D
K
C
M
D
I
M
K
M
Q
E
T
Y
Y
255
260
265
270
275
280
285
290
295
300
305
310
315
SS
RSA
R
RG IAS E LDSILSSTL
V F
ERPAHKSLQLVWLACSGHCSLAFQEDSLRGI
SLIL GAKVPRPLRGKLRST
LL
PT
V
SR
L
K
Q TRP
S
T
RG
N
V MS
LI
TT
V AD
I
S
V
V
VI
S
A AG
TA
K
S
S V TR
SR S
GS
G
NN
P
I
V
V A
I
M
L TGT V
I
HA
LFG
C
LSRRR A
VH
A
V
R
D
SS
NTR
G
G
LNS
T
V
G
R
GR
A
TPLSR
R
E
RN
S
S
LLF
HFLS
V A
TG
T
S
SS
N
H
S
S
A
S
LL
P
I
SS
I
V
VTL
VC
QS
TL
LY
L
KS
G
S
V SA A
G
TC
R
A
A
VLLPK
T V
TK G
AK
H
PMRLR
V GR Q T
P
P
I
MS A VR M
TL
CRP
DQ Y
I
Y VT
I
P
L A CHANC
LT V A
G
P
SC
GV
I
ALR
C
A
G
KN
T V
LG
L
P
KP
PCTR
K V
AN
I
I
I
TR
AL A
I
LK
H
TP
I
YG
DSG
PD
P
F
E
I
FC
L
C A T V A C
L
PC
Q
R
Q
K
D
P
E
T ST
I
A
V A VS
F
I
N
I
QD YDQN
F
I
I
R
Q
C
Q
PC
F
V
E
R
P
L
K
V
T
R
G
F
A
G
F
W
A
Q
A
N
C YM
LP
K
Q
N
H
N
I
P
N
V
K
Q
P
E
N
E
Q
H
R CDGR SI
V
R
S
NMFPL
L
V
T
VR
D
L
Y
I
I
K
TC
M
M
D
K
C
P
I
P
G
QD
N
I
AF
W
Y
I
L
N
D
K
M
Y
G
P
C
V
C
N
H
C
M
F
G
Q
I
Y
K
A
G
D
C
H
H
P
A
E
H
E
M
G
E
A
K
K
C
Q
N
R
P
W
F
I
Q
E
C
H
YK
P
K
I
Y
M
P
F VN
R
AK
Q
R
G
N
F
W
Y
Y
Y
W
F
M
TG
T
A TQ
NV AC
N
SG
LGS A AK S
R
GERH
V
P
P
F VR V
N
RP V SM
LT
CQFST T G
AL
I
S
D
Q
I
P
F
F
M
M
A
GG
F
P
C
T
HV
FG
Q
Y
G
D
F
H
W
C
M
I
V V TNR G V
T
S
R
P
F
RR
E
RH
E
Q
H
K
K
F
Q
Y
N
F
R
N
H
W
YT
LLQ
T
N
I
I
T
H
I
T
F
K
A
ALE
TSL AE
N
SAI
S
H
C
D
D
D
Q
F
C
Y
P
G
H
C
G
F
I
M
H
K
L
S
L
A
A
N
E
C
N
P
Q
Y
P
M
S
H
E
H
C
P
V
H
C
D
R
H
I
P
P
V
E
K
A
F
G
Y
I
RN
CG
E
G
N
TG
E
C
G
N
H
C
HYR M
V
I
A
N
P
Q
Q
W C
T
K
V
R
K
W
I
Y
Q
T
F
N
G
C
R
E
N
Q
D
V
P
Q
T
F
W
E
S
T
E
Y
N
H
F
Q
Q
T
R
D
N
N
K
TR
I
G
D
G
K
A
E
H
Q
T
NV
HY
F
T
G
S
F
L
D
D
P
Y
A
H
I
C
N
RP
Y
I
G
NV
F
P
Y
Q
M
T
I
I
F
M
H
C
R
V
Q
C
A
Q
D
H
I
C
TV
PP
STR
AF
S
D
A
G
Y
H
A
P
V
P
Y
K
L
T
L
G
NHN
Q
LL
P
Y
K
M
G
M
C
320
325
330
335
340
345
350
355
360
365
370
375
SS
RSA
R
IRTRSGG R RRRR AGQISV PASLSVSR LP
E
L S
RPW
Q
T T Y A
A
W
D
Q
C
R
L
H
K VS
NMT
N
LV
LY
S
G
N
S
Y AG
C
H
C
S
L
S
D
R
T
A
S
R
E
T
I
A
A
I
TR
L
N
K
C
H
C
RL
AG
H
P
PQ
M
G
G VF
QE
H
M
L
LH
C A T AP
C A
I
C
R
Y
S
P
H
I
K QP
F
V
L
V
Q
V
G
Y
Y
G
R
I
R
R
W
I
A
Q
YR
C
M
I
H
P
F
CRL
R
C
H
Q
Q
D
G
M
M
KR
A
N
G
Q
T
SG
I
A
R W
I
H
P
C
I
Q
V
T
L
N
R
K
N
Q
T
T
N
K
D
V
M
N
E
C
G
C
Y
Q
R
C
Y
N
M
Y
Q
N
G
S
L
RHV
L
S
R
V
G
F
A
C
Q
H
P
K
Y
F
V
I
K
H
G
N
N
K
T
Y
L
N
H
A
C
P
E
V
R
I
E
M
I
L
I
E
E
AA
L
D
W
L
C
V
T
D
A
S
G
M
L
D
DHD
T
L
A
P
D
F
A
I
TH
A
M
N
S
E
H
N
G
QG
Q
Q
I
NV
S
C
R
L
T
H
F
G
N
Q
T
L
N
I
D
Y
R IR
SG STSG
IE
T
Q
VR
AN
AS
V T VPP
HCP
R VS
SN
I
S
PK YL
RP VP
K
I
A
LM
R
NQR G
TN
I
R
C
P
K
P
N
V
TC
Q
V
K
N
I
H
A
S
C
E
G
S
Y Q
N
H
P
N
T
Q
C
V
Q
G
G
F
E
QP
F
NY
C
M
Q
H
Q
DQ
G
P
L
A
G
LN
I
V
H
A
C
C
L
Q
E
N
A
YS
FP
FR
I
S
S
IS
T
CG
A
V
CRIT TKD
S
H
W
I
NT
K
L
A
V
M
N
R
Q
Y
P
C
Y
A
I
A TT FA
T
V
GA
LP
S
S
A
P
E
I
K
G
S
K
F
S
N
E
E
T
Y
S
T
A T
N
H
K
H
M
P
H
E
N
Y
K
V
K
W
C
S
C
W
V
L
F
D
L
P
P
T
K
E
V G
I
Q
W
L
T
G
Y
K
V
H
F
C
C
T
K
Q
G
T
K
C VR
F
LP
H
G
A
F
I
C
K
P
H
H
C
L
R LT
S
VH
LL
V VF
AV
TA
M
MQ S
Y
V
K
C
Q T S VKH
G
T TLH
NT
N
S
VRLM V
I
Q
S
H
P
P
T
AF
R
GLL
GT SL
SA R
VL
SL
VL
R
L
AS
S
N
E
VD
K
L
I
T
L
L
G
Y
C
C
F
SSS
V
SAN
ALK
E
T
V
N
K
F
Y
H
R
S
A
V
CCGQ
P
D
R
PS
N
P
A
E
R
G
C
GV
G
P
R
N
D
L AK
V
I
P
NSR
K
V
KT
TST
TT AS
CLC
ALF
H
A AR
HR
TC
SKPK A
L
CV
A
RH
G
T
S
G
PK
A S A VK
N
I
F
R
I
K
T
GH
V T
KF
K
TM
V
L
I
A
R
R
S
G
R
Q SR V
APDG
F
380
385
390
395
400
405
410
415
420
425
430
435
440
SS
RSA
R
F D R
Y
F
I
G
S
S
RA P
S
VFE LSDE TARSP V SI
S R NVR
EIIDIIETRGKRRL RLSFYRGRG RGI
LPG
T
R
P
A
L
L
S
G
N
L RLL L
V T ASR T TG
Q VLEFM
LY
VS
T
G
PA
S
N
RL
V GGQ
A
S
GL
LST A
S
S
V
V
S
K
FNC
G
T
I
Q
V
I
R QC V
RDCN
Q
SES
DS
PL
DRR AHSSPH
WE
I
MKR A V GQ
V
A AS
P
I
QP
T
C
TG
M
Y
GT
W
D
KN
H
M
V
PQR
K
E
F
C
E
P
P
PW Q
YD Y
HTR
AP
AH
H
L
L
A
R
G
H
C
G
FK
T
Q
V
C
S
P
I
S
E
H
H
G
K
T
Q
V
Y
V
R
NA
L
W
V
F
F
G
G
Q
L
C
LL
T
A
I
GG
T TH
Y
L
R
W
K
P
Y
Q
K
V
YL
A
P
S
M
A
E
Y
NG
A
R
P
Q
C
R
E
C
N
V
C
A
T
G
A
Y
T
K
G
V
K
I
P
W
C
S
I
Q
D
KD
L
WN
G
A
GV G
F
V
Q
M
A
P
Q
T
L
D
K
R
H
Y
CQ
N
P
N
R
I
H
S
I
P
C
M
M
T
Q
F
E
H
M
H
Q
C
A
H
D
G
VD
P
K
L
A
T
N
S
R
L
N
C
D
E
H
Y
C
N
RT
N
Y
D
I
W
G
T
RF
C
D
V
S
S
Q
T
N
F
H
V YS
ES
N
FV R
SA
V
N
E
RC
T
H
N
E
H
A
T
P
G
KR VL T
Y
GK
ITF
H
I
C A YL T S A
G
LLH
K
Y
S
S
I
I
DL TN
SRRLL
M
I
VT
LI
ES
LS Y GN
KRD
V
D
I
I
L
K
LS
T
P
Y
F AK S
RFK
GH
Q
RT
E
I
P
P
L
L
H
G
E
S
GC
Y
M
S
G
H
E
R
F
M
TT
A
N
E
N
W
L
Y
L
M
A
D
E
Y
PQ
R
V
SF V AP
G
QV
A
V
Y
AES
T TPR
N
TET T
F
D
N
MH
ET
PS A
C VK
G
I
RLM A
ST
M
I
V
Y
TK
S
Q
LCGT
P
D
D
F
C
N
E
K
V
R
P
S
S
N
KLH
T
F
VD
VHW
P
P
I
I
K
P
QKECC
I
AGA
KC
Q
V
M
FR
AL
Q CD
W
L
W
AA
L
W
QNQN
LG A GNA T T
K
NAH
E
I
VK
K
L
FG
N
T
K S
H
Y
L
IR
F
V
H
Y
K
445
450
455
460
465
470
475
480
485
490
495
Figure 8: The expected frequencies of the amino acids at evolutionary equilibrium using the
experimentally determined evolutionary model from passage 1 of the combined replicates and
Equation 3 for the fixation probabilities. Note that these expected frequencies are slightly different than the amino-acid preferences in Figure 7D due to the structure of the genetic code.
For instance, when arginine and lysine have equal preferences at a site, arginine will tend
to have a higher evolutionary equilibrium frequency because it is encoded by more codons.
The numerical data are in Supplementary file 2. The computer code used to generate this
plot is at http://jbloom.github.io/phyloExpCM/example_2013Analysis_
Influenza_NP_Human_1918_Descended.html.
55
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
A
A/Brevig Mission/1/1918 (H1N1)
A/Albany/1/1958 (H2N2)
A/Aichi/2/1968 (H3N2)
A/Lyon/815/1993 (H3N2)
B
0.1
0.1
Figure 9: Phylogenetic tree of NPs from human influenza descended from a close relative of
the 1918 virus. Black - H1N1 from 1918 lineage; green - seasonal H1N1; red - H2N2; blue H3N2. Maximum-likelihood trees constructed using codonPhyML (Gil et al., 2013) with (A)
the GY94 substitution model or (B) the KOSI07 substitution model. Up to three NP sequences
per year from each subtype were used to build the tree.
56
bioRxiv preprint first posted online Feb. 20, 2014; doi: http://dx.doi.org/10.1101/002899. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
Supplementary file 1: The file Preferences.xlsx is an Excel table of the inferred amino-acid
preferences at each site from the combined passage 1 data for the two replicates.
Supplementary file 2: The file EvolutionaryEquilibriumFreqs.xlsx is an Excel table of the
expected equilibrium frequencies of the amino acids during evolution.
57