Expression Divergence Is Correlated with Sequence Evolution but Not Positive Selection in Conifers

Kathryn A. Hodgins,*,†,1 Sam Yeaman,†,2,3,4 Kristin A. Nurkowski,1 Loren H. Rieseberg,2 and Sally N. Aitken3

1 School of Biological Sciences, Monash University, Melbourne, VIC, Australia
2 Department of Botany, University of British Columbia, Vancouver, BC, Canada
3 Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada
4 Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
† These authors contributed equally to this work.
*Corresponding author: E-mail: [email protected].
Associate editor: Stephen Wright

Abstract

The evolutionary and genomic determinants of sequence evolution in conifers are poorly understood, and previous studies have found only limited evidence for positive selection. Using RNAseq data, we compared gene expression profiles to patterns of divergence and polymorphism in 44 seedlings of lodgepole pine (Pinus contorta) and 39 seedlings of interior spruce (Picea glauca × engelmannii) to elucidate the evolutionary forces that shape their genomes and their plastic responses to abiotic stress. We found that rapidly diverging genes tend to have greater expression divergence, lower expression levels, reduced levels of synonymous site diversity, and longer proteins than slowly diverging genes. Similar patterns were identified for the untranslated regions, but with some exceptions. We found evidence that genes with low expression levels had a larger fraction of nearly neutral sites, suggesting a primary role for negative selection in determining the association between evolutionary rate and expression level. There was limited evidence for differences in the rate of positive selection among genes with divergent versus conserved expression profiles and some evidence supporting relaxed selection in genes diverging in expression between the species.
Finally, we identified a small number of genes that showed evidence of site-specific positive selection using divergence data alone. However, estimates of the proportion of sites fixed by positive selection (α) were in the range of other plant species with large effective population sizes, suggesting relatively high rates of adaptive divergence among conifers.

Key words: lodgepole pine, white spruce, Engelmann spruce, gene expression, RNAseq, positive selection, climate.

Introduction

Understanding why genes evolve at different rates is a major goal of the field of molecular evolution, and the evolutionary cause(s) of rate variation has been the subject of much debate. Evolutionary rate could largely be determined by the balance between the level of selective constraint on a protein (i.e., negative or purifying selection) and genetic drift (Kimura 1983; Ohta 2002). Alternatively, protein sequence evolution may depend on the rate at which new beneficial mutations arise and fix (Gillespie 1991). Evolutionary rates of coding sequences can be quantified by comparing the rate of substitutions at synonymous sites (dS), which are presumed neutral, to the rate of substitutions at nonsynonymous sites (dN, amino acid replacement), which may experience selection. Comparisons between these measures can provide compelling evidence of selection. Purifying selection will reduce the fixation rate at deleterious replacement sites, lowering dN/dS ratios, while positive selection should increase the fixation rate of beneficial sites, increasing dN/dS (Bielawski and Yang 2005). Several genomic parameters, such as the expression level, breadth, and divergence; the number of protein–protein interactions; gene network position; synonymous polymorphism; and gene essentiality can correlate with evolutionary rate (Duret and Mouchiroud 2000; Ingvarsson 2007; Ramsay et al. 2009; Slotte et al. 2011; Renaut et al.
2012), but the evolutionary mechanisms that produce many of these patterns remain obscure. For example, many studies have found that genes with high levels of expression evolve more slowly (for review see Rocha 2006), and while neutral theory predicts that this pattern is likely due to differences in the strength of negative selection, it is possible that adaptive divergence is also constrained in high expression genes. Previous studies examining the relationship between sequence and expression divergence have found contrasting results, with some identifying no correlation (e.g., yeast, Tirosh and Barkai 2008; sunflowers, Renaut et al. 2012; Moyers and Rieseberg 2013), prompting suggestions that sequence and expression evolution are fundamentally decoupled. However, others have identified positive correlations (e.g., Drosophila, Nuzhdin et al. 2004; mammals, Jordan et al. 2005; Warnefors and Kaessmann 2013). In some cases, such as Drosophila, positive selection has been implicated as an important contributor to the correlation between dN and expression divergence (Nuzhdin et al. 2004), while in others relaxed purifying selection and genetic drift have been proposed to be the primary driving force of weaker associations between sequence and expression divergence (e.g., Liao and Zhang 2006). Why these discrepancies among taxa exist is uncertain. They could be related to the efficacy of selection or to time since divergence, as the tempo of gene expression and sequence evolution may differ through time, or the genomic targets of selection may differ among taxa.

© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]. Mol. Biol. Evol. 33(6):1502–1516 doi:10.1093/molbev/msw032 Advance Access publication February 12, 2016
Investigation of the molecular evolution of conifers on a genome-wide basis has received relatively little attention, perhaps because it has been hampered by the large genome sizes of these species (De La Torre et al. 2014). Conifers have genomic characteristics that distinguish them from many other plant groups, including large genomes, long generation times, slow evolutionary rates, long introns, and limited evidence of paleopolyploidy (Nystedt 2013; but see Li et al. 2015). Previous comparisons of evolutionary rates between spruce-pine and poplar-Arabidopsis lineages have identified higher dN/dS ratios in the conifers (Buschiazzo et al. 2012). This is despite much lower substitution rates in conifers, perhaps related to longer generation times (or cell lineage division time), lower mutation rates, or the impact of large effective population sizes on weakly deleterious mutations (Buschiazzo et al. 2012). Nearly neutral theory predicts that genome-wide elevation in dN/dS should largely be a function of drift rather than greater rates of positive selection (Kimura 1983; Ohta 2002), and molecular evidence for adaptive divergence in conifers is somewhat limited (Eckert et al. 2013a; but see Eckert et al. 2013b). However, widespread conifers have life history traits that confer large effective population sizes and weak population structure (Neale and Kremer 2011), suggesting that adaptive evolution could contribute substantially to divergence among species. There is also considerable evidence for local adaptation at the phenotypic level in many forest trees, and a growing number of studies have focused on identifying the genomic basis of these adaptations (Howe et al. 2003; Savolainen et al. 2007; Aitken et al. 2008; Eckert and Shahi 2012; Parchman et al. 2012; De La Torre et al. 2013).

Here, we describe the results of a study of evolutionary rate in lodgepole pine (Pinus contorta) and interior spruce (natural hybrid Picea engelmannii × Picea glauca).
In the context of this study we refer to interior spruce as a single “species” as the seeds used in this study were produced by parents originating in a population within this ancient hybrid zone (De La Torre et al. 2014; Yeaman et al. 2014). Lodgepole pine and interior spruce are widespread conifers in Western North America. Both have substantial ecological and economic importance, with over 200 million trees planted each year in Western Canada. Our aim is to compare patterns of divergence and polymorphism in coding and untranslated regions of lodgepole pine and interior spruce to help elucidate the evolutionary forces that shape the genomes of conifers. To do this we used our reference transcriptomes for both species, as well as publicly available transcriptomes for four other species, to identify orthologs and substitution rates. We then aligned RNAseq reads for lodgepole pine and interior spruce to their respective reference transcriptomes to determine levels of polymorphism. Using these data, we were able to estimate the proportion of amino acid substitutions fixed by positive selection and also identify specific genes that showed evidence of positive selection. Results from these analyses allowed us to address the following specific questions:

(1) What are the genomic determinants of evolutionary rates in coding and noncoding gene regions in conifers? We tested whether expression level, expression divergence, protein length, specificity of expression across environments, and neutral polymorphism were associated with nonsynonymous and synonymous divergence between lodgepole pine and interior spruce.

(2) What are the relative contributions of positive and negative selection in driving differences in evolutionary rate for genes with different expression profiles?
We examined whether expression classes based on divergence versus conservation of gene expression between species, as well as average expression levels, influenced the distribution of fitness effects (DFE) of new mutations and the proportion of substitutions fixed by positive selection in lodgepole pine (α), using either interior spruce or loblolly pine (Pinus taeda) as an outgroup. This provided insight into the evolutionary forces that drive patterns of molecular evolution in conifers.

(3) What is the role of adaptation in driving divergence in conifers, and more specifically, what genes show evidence of high evolutionary rates and positive selection in this group? We used a maximum likelihood approach to identify orthologous genes showing evidence of site-specific positive selection in conifers. To do this we utilized the transcriptome references of the focal species (lodgepole pine and interior spruce), as well as loblolly pine, Sitka spruce (Picea sitchensis), Norway spruce (Picea abies), and Douglas-fir (Pseudotsuga menziesii).

Results

We found 13,809 one-to-one orthologs between lodgepole pine and interior spruce. On average the dN/dS ratio was 0.278 ± 0.002 (N = 5,195; dN = 0.0436 ± 0.0003; dS = 0.166 ± 0.0009). The substitution rate for the 5′-UTR was 0.130 ± 0.0015 (N = 4,566) and for the 3′-UTR was 0.090 ± 0.0019 (N = 4,719).

Gene Expression Patterns versus Evolutionary Rate in Lodgepole Pine and Interior Spruce

We predicted that evolutionary rate would be reduced in genes that were conserved in their pattern of gene expression between species, reflecting conserved gene function across evolutionary time, compared to those that had diverged in expression between the species. For orthologs identified between pine and spruce we considered four mutually exclusive classes of genes based on their expression pattern between species and among seven climate treatments as identified in Yeaman et al. (2014).
(1) Conserved expression genes (CEG) are those with significant differences in expression among treatments (expression plasticity), but the pattern of expression among the seven treatments was retained between the species. These genes were identified by a significant treatment effect, but no species by treatment interaction effect (in some cases these genes had a significant species effect, with differences in the average expression level overall).
(2) Diverged expression genes (DEG) are those with divergent patterns of differential expression between species. These genes were identified by a significant interaction between species and treatment.
(3) Species expression genes (SEG) had a significant difference in overall expression level between species, but no interaction or treatment effect.
(4) Nonsignificant genes (NON) were those for which none of the factors in the model explained a significant proportion of the variation in gene expression.

We found strong evidence that evolutionary rate varied depending on the pattern of gene expression among treatments and species (fig. 1 and table 1; supplementary tables S4 and S5, Supplementary Material online). These expression pattern classes of genes for the lodgepole pine-interior spruce comparison were significantly related to gene-specific estimates of evolutionary rate for pairwise comparisons of the two species for both untranslated regions and coding sequences. The fastest evolutionary rates were found for genes that diverged in overall expression (SEG) or diverged in their pattern of expression plasticity (DEG), relative to lower rates observed in genes with expression variation that was not associated with treatment or species (NON), or conserved patterns of expression plasticity (CEG). SEG had significantly higher rates of evolution than both NON and CEG in the 5′-UTR and dN/dS, while DEG had significantly higher rates of evolution than NON and CEG in the 3′-UTR and dN/dS. Differences between NON and CEG versus DEG or SEG were nonsignificant for the 5′-UTR (DEG) and 3′-UTR (SEG). There was a marginally significant difference in dS rates among expression classes (no pairwise differences after correcting for multiple tests). However, there was a stronger effect of expression class on dN, indicating that differences in the rates of dN/dS are being driven by the accumulation of amino acid changing mutations. On average, the more rapidly evolving classes (DEG and SEG) had a 6.4% higher rate of sequence evolution (as measured by dN/dS) than the more slowly evolving classes (CEG and NON).

FIG. 1. The impact of gene expression class on evolutionary rate between interior spruce and lodgepole pine for different gene regions. CEG, conserved expression genes (conserved plasticity in gene expression between species); DEG, diverged expression genes (diverged plasticity in gene expression between species); SEG, species expression genes (no plasticity in expression but divergence in average expression level between the species); NON, no differences in expression among treatments or species. (a) dN/dS (nonsynonymous substitution rate/synonymous substitution rate) (CEG: N = 1,608; DEG: N = 870; SEG: N = 1,220; NON: N = 1,470); (b) substitution rate for the 3′-UTR (CEG: N = 1,465; DEG: N = 775; SEG: N = 1,370; NON: N = 1,106); (c) substitution rate for the 5′-UTR (CEG: N = 1,423; DEG: N = 764; SEG: N = 1,316; NON: N = 1,063). Means and standard errors are shown. Different letters indicate significant differences.

Table 1. Results from ANOVA Tests of Differences in Evolutionary Rates between Genes of Different Expression Class (CEG, DEG, SEG, NON).

Region or Site Type | F Statistic | df | P
3′-UTR | 3.31 | 3, 4715 | <0.05
5′-UTR | 7.83 | 3, 4562 | <0.001
dN | 4.43 | 3, 5191 | <0.01
dS | 2.57 | 3, 5191 | 0.052
dN/dS | 5.81 | 3, 5191 | <0.001

NOTE: See figure 1 for a depiction of the patterns.

Multiple factors are known to be associated with sequence evolution in other species (Duret and Mouchiroud 2000; Ingvarsson 2007; Ramsay et al. 2009; Slotte et al. 2011; Renaut et al. 2012). We predicted that, in addition to diverged patterns of expression, rapidly evolving genes should have low levels of average expression. If rapidly evolving genes experienced recent sweeps, reduced levels of synonymous polymorphism are also predicted (Andolfatto 2007; Lohmueller et al. 2011). We first examined the relationship among these factors using pairwise correlations, followed by partial correlations to control for correlations among variables (fig. 3a; supplementary table S7, Supplementary Material online). dN/dS ratios were positively associated with protein length (ρ = 0.05, P < 0.01) but negatively correlated with average expression level (ρ = −0.05, P < 0.01), synonymous nucleotide diversity, πs (ρ = −0.14, P < 0.001), and expression divergence (ρ = −0.04, P < 0.05). However, the negative correlation between expression divergence and dN/dS was driven by a strong positive relationship between mean expression and expression divergence. Consequently, the correlation between dN/dS and expression divergence became positive once mean expression level was accounted for in the analysis (partial ρ = 0.09, P < 0.001).
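The sign change described above, where a pairwise correlation is negative but the partial correlation is positive once a third variable is controlled, can be illustrated with a minimal sketch. It assumes a single control variable and the standard first-order partial correlation formula applied to ranks; the analysis in the text controlled for several variables at once, and the function names and toy values below are hypothetical and not taken from the study.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for one variable z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): mean expression pushes expression divergence up
# and dN/dS down, while a shared component links the two directly.
mean_expression = [1, 2, 3, 4, 5, 6, 7, 8]
shared = [0.8, -0.8, 0.8, -0.8, 0.8, -0.8, 0.8, -0.8]
expression_divergence = [z + e for z, e in zip(mean_expression, shared)]
dnds = [-z + e for z, e in zip(mean_expression, shared)]

pairwise = spearman(expression_divergence, dnds)
partial = partial_spearman(expression_divergence, dnds, mean_expression)
print(f"pairwise rho = {pairwise:.3f}, partial rho = {partial:.3f}")
```

In this toy case the pairwise rank correlation is negative because both variables track mean expression in opposite directions, while the partial correlation is positive. This is the same qualitative pattern as reported for dN/dS and expression divergence, although the magnitudes here are arbitrary.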
Therefore, we conclude that rapidly diverging genes tend to have lower expression levels, reduced levels of synonymous site diversity, greater expression divergence, and longer proteins. We note that Snorm (diversity scaled by divergence; Lohmueller et al. 2011) and πs were highly correlated (ρ = 0.71, P < 0.001) and provided a similar negative correlation with dN (ρ = −0.11, P < 0.001), so we chose to conduct all of the analysis using πs rather than Snorm.

Using pairwise correlations, πs was positively correlated with average expression level (ρ = 0.08, P < 0.001) and expression divergence (ρ = 0.11, P < 0.001), and negatively correlated with protein length (ρ = −0.19, P < 0.001). Partial correlations followed a similar pattern. Genes with lower levels of synonymous nucleotide diversity had weaker expression, longer proteins, and reduced divergence in expression. Shorter proteins had weaker expression divergence (ρ = 0.14, P < 0.001), and tended to have lower expression levels when controlling for correlations among variables (ρ = 0.13, P < 0.001). Partial correlation estimates for dN and dS separately identified similar patterns for dN and dN/dS (supplementary fig. S1, Supplementary Material online). Partial correlations were positive between dS and expression divergence, as well as πs.

We also examined the relationship between evolutionary rate and treatment specificity of expression (supplementary fig. S2, Supplementary Material online) in a broader set of genes, as polymorphism data were not required for this analysis. We predicted that genes with context-dependent expression (i.e., greater treatment specificity) would experience higher evolutionary rates because the effect of mutations is limited to a specific environmental context (Snell-Rood et al. 2010).
We found a positive correlation between dN/dS (ρ = 0.07, P < 0.001) and treatment specificity, meaning genes with high dN/dS tended to be expressed in fewer treatments, but this correlation was not significant after correlations among the other variables were accounted for (P = 0.25).

As changes in the UTRs can have direct impacts on gene expression, we tested if divergence in expression was associated with evolutionary rate in these regions. Patterns in untranslated regions showed associations that paralleled those for dN/dS, with the exception of protein length and πs, which exhibited the opposite pattern (fig. 3b and c). Partial correlations were positive between 3′-UTR rate and dN (ρ = 0.15, P < 0.001), dS (ρ = 0.07, P < 0.001), expression divergence (ρ = 0.06, P < 0.001), and πs (ρ = 0.07, P < 0.01), but were negative for mean expression (ρ = −0.05, P < 0.01) and protein length (ρ = −0.08, P < 0.001). The 5′-UTR patterns of variation were similar to 3′-UTR, with positive correlations between 5′-UTR substitution rate and dN (ρ = 0.08, P < 0.001), as well as πs (ρ = 0.08, P < 0.001), but negative associations for mean expression (ρ = −0.06, P < 0.01) and protein length (ρ = −0.12, P < 0.001). However, dS and expression divergence were not correlated. Contrasting patterns were identified with treatment specificity as this variable was negatively correlated with 5′-UTR rate (ρ = −0.06, P < 0.001), but weakly positively correlated with 3′-UTR (ρ = 0.04, P < 0.01; supplementary fig. S2, Supplementary Material online).

We tested the hypothesis that the association between expression and sequence divergence is driven by relaxed selection. We predicted a negative correlation between the difference in branch-specific estimates of dN/dS and average expression difference (pine-spruce) if relaxed selection was driving this pattern. We identified 3,286 alignments with lodgepole pine, interior spruce and Douglas fir.
There was no association between the difference in lineage-specific estimates of dN/dS and expression level differences between pine and spruce (ρ = 0.007, P = 0.65).

Variation in Polymorphism and Divergence among Gene Regions and between Species

Interior spruce had higher levels of polymorphism at synonymous sites than lodgepole pine (πs: P < 0.001; pine mean ± SE = 0.0058 ± 0.00016; spruce mean ± SE = 0.0073 ± 0.00018) and a more skewed site frequency spectrum (SFS) at synonymous sites (Tajima’s D: P < 0.001; pine mean ± SE = 0.51 ± 0.034; spruce mean ± SE = 0.84 ± 0.028; supplementary table S3, Supplementary Material online). We found no differences in the SFS and amount of within-species polymorphism at synonymous sites for the different expression classes (P > 0.1) in either species. There was a positive correlation between average expression and πs (pine: ρ = 0.08, P < 0.001; spruce: ρ = 0.06, P < 0.001). We repeated this analysis for pine using down-sampled reads and found the same correlation (ρ = 0.09, P < 0.001), demonstrating that this pattern is likely not the result of biases in single nucleotide polymorphism (SNP) calling associated with expression differences. In addition, πs was highly correlated between down-sampled and non-down-sampled approaches (ρ = 0.92, P < 0.001). Moreover, the same positive relationship was detected between dS and expression levels; similar patterns between polymorphism and divergence are expected if synonymous sites are largely evolving neutrally, but not if biases in SNP calling are driving the association between expression and πs.

The highest levels of nucleotide polymorphism and divergence were at synonymous sites, followed by 5′-UTR and 3′-UTR sites, with the lowest levels of diversity and divergence in replacement sites for both lodgepole pine and interior spruce (fig. 2; supplementary table S2, Supplementary Material online).

FIG. 2. Polymorphism and divergence estimates of 57 genes containing information for both untranslated regions and coding sequences for lodgepole pine. (a) Nucleotide polymorphism and (b) divergence by region (synonymous, replacement, 3′-UTR, 5′-UTR). Interior spruce was used as the outgroup. Different letters indicate significant differences.

DFEs and Adaptive Divergence

We used the method of Eyre-Walker and Keightley (2009) under two different demographic scenarios to quantify the DFEs, the proportion of differences fixed by positive selection (α), and the rate of adaptive fixation (ωa) in lodgepole pine. The DFE showed similar patterns across the expression pattern categories with one exception: there was a slightly smaller proportion of sites in the effectively neutral category (NeS < 1) for the NON genes compared with the SEG genes (fig. 4a) in the two-epoch model (proportion NON = 0.17, SEG = 0.21, P < 0.01). The same pattern was identified in the down-sampled data set, but the difference between NON and SEG was no longer significant (supplementary fig. S4, Supplementary Material online). There was no difference in α or ωa among the categories (fig. 4b and c), nor any differences among the expression categories for the one-epoch model for α, ωa, or DFE (supplementary fig. S3, Supplementary Material online). A similar pattern was found using the approximation method, although α was even higher across all categories (method II, Eyre-Walker and Keightley 2009; supplementary table S5, Supplementary Material online). Over all genes our estimate of α was 0.16 (95% CI = 0.12–0.20) and our estimate of ωa was 0.034 (95% CI = 0.026–0.043) when applying the two-epoch model.

We compared the DFEs, α and ωa using the two-epoch model among expression levels.
We found a greater 1506 proportion of nearly neutral sites in the low expression category compared with the high category (proportion NeS < 1, high¼ 0.16, low¼ 0.23, P < 0.001) and a larger fraction of sites under strong purifying selection (proportion NeS > 100, high¼ 0.72, low¼ 0.62, P <0.001) in high expression genes compared with low expression genes. We found no evidence of differences in the proportion of sites fixed by positive selection between expression level categories (fig. 5a), nor the rate of positive selection between expression level categories (fig. 5b and c). Using the one-epoch model with loblolly pine as the outgroup, there was a greater proportion of highly deleterious sites in the high expression category compared with the low expression categories (proportion NeS > 100, high¼ 0.57, low¼ 0.17, P < 0.001), and the reverse pattern for the intermediate categories (proportion NeS 10–100, high¼ 0.21, low¼ 0.50, P < 0.001; NeS 1-10, high¼ 0.11, low¼ 0.22; supplementary fig. S5, Supplementary Material online). We also found that a and xa were greater for the low expression category compared with the high expression category. The mid expression levels were intermediate between the low and high categories in all cases. Similar patterns were found when interior spruce was used as an outgroup for this analysis (supplementary figs. S6 and S8, Supplementary Material online), although the absolute values of a and xa were higher when loblolly was the outgroup, perhaps because a more conserved set of orthologs were identified in comparisons between pine and spruce. Similarly, a higher dN/dS was identified on average for lodgepole pine when loblolly was used as the outgroup compared to interior spruce. The same patterns and similar absolute values of a were identified when down-sampled reads were used to calculate polymorphism (supplementary fig. S7, Supplementary Material online), demonstrating that biases and SNP calling were not driving these patterns. 
Again, a similar pattern was found using the approximation method, although α was even higher across all categories (method II, Eyre-Walker and Keightley 2009; supplementary table S5, Supplementary Material online).

Table 2. Genes with Evidence for Site-Specific Positive Selection Using PAML Models (M1a vs. M2a and M7 vs. M8) and the Corresponding Best BLASTX Hit to Arabidopsis.

Orthogroup: ortho_group8482 ortho_group3686 ortho_group10360 ortho_group13328 ortho_group8319 ortho_group3142 ortho_group10677 ortho_group18841 ortho_group4644 ortho_group7797 ortho_group14244 ortho_group6195 ortho_group6148 ortho_group7794 ortho_group13072 ortho_group7027 ortho_group7322 ortho_group5194 ortho_group12985 ortho_group5684 ortho_group2017 ortho_group12314 ortho_group7511 ortho_group8450 ortho_group8790

LRT M7:M8, LRT M1a:M2a: 43.63*** 28.08** 25.23** 40.49*** 28.00** 24.98** 25.61*** 24.35** 24.29** 22.85** 22.81** 21.43* 20.42* 19.71* 19.64* 19.39* 19.15* 18.51* 24.64** 24.55** 22.86* 22.82* 21.48* 22.21* 20.42* 19.60* 19.39* 19.58* 18.56* 21.39* 20.78* 20.32* 18.45* 18.38* 18.29* 18.18* 18.09* 17.67* 17.57*

Arabidopsis Top Hit: AT3G17730.1 AT3G22600.1 AT5G01310.1 AT2G42510.1 AT1G28090.1 AT4G32300.1 AT1G52320.2 AT4G14385.2 AT1G69170.1 AT5G04000.1 AT3G20560.1 AT5G01310.1 AT1G70630.1 AT4G29260.1 AT1G60730.2 AT3G14120.1 AT4G17330.1 AT1G26180.1 AT2G15440.1

Description: No hit; NAC domain transcription factor; Seed storage 2S albumin superfamily; APRATAXIN-like; No hit; Involved in spliceosome assembly; Polynucleotide adenylyltransferase family protein; S-domain-2 5; Involved in N-terminal protein myristoylation; Unknown function; Squamosa promoter-binding protein-like transcription factor; No hit; Unknown function; Thioredoxin (TRX) superfamily; APRATAXIN-like; Nucleotide-diphospho-sugar transferase; HAD superfamily, subfamily IIIB acid phosphatase; No hit; NAD(P)-linked oxidoreductase superfamily protein; Unknown function; G2484-1 protein of unknown function; Unknown function; No hit; Unknown function; No hit

NOTE: Two times the difference in the lnL from the models (likelihood ratio test (LRT), df = 2 in both cases) is shown along with the significance. *P < 0.05; **P < 0.01; ***P < 0.001.

Genes with Signatures of Positive Selection

To identify specific genes that showed evidence of positive selection in conifers, we examined 7,185 alignments that passed all of the filtering requirements. Only 17 orthogroups had dN/dS ratios that were greater than 1, suggesting substantial constraints on evolutionary rates. After correcting for multiple comparisons, PAML’s M1a:M2a models identified 15 significant orthogroups and the M7:M8 models identified 23. All of the orthogroups in M1a:M2a except one were significant in the M7:M8 comparison. Separate runs of the M8 model with different starting values returned similar results. Top BLAST hits from Arabidopsis thaliana for these 24 genes (Yeaman et al. 2014) are shown in table 2.

Discussion

Divergence and Conservation of Expression and Evolutionary Rate

To our knowledge, this is the first conifer study to find that evolutionary rate is associated with changes in expression plasticity between species (fig. 3). Genes that had conserved expression patterns across treatments between lodgepole pine and interior spruce, particularly those that varied significantly across treatments, had the lowest average level of coding sequence evolution (table 1 and fig. 1). In contrast, those genes showing divergence in expression plasticity, as well as those that had constitutive differences in expression between lodgepole pine and interior spruce, had significantly higher dN/dS. Our results show a distinct association between coding sequence evolution and divergence in expression patterns in comparisons between lodgepole pine and interior spruce, with a 6.4% increase in dN/dS for the two more highly divergent expression categories (DEG, SEG) (fig. 1).
Two nonmutually exclusive factors could be driving this pattern. The first is that an increase in the rate of fixation due to positive selection in genes diverging in expression could produce this relationship. For instance, a change in gene expression might lead to selection for corresponding changes in coding sequence to improve the function of the gene in its altered role. However, we found no significant differences in the proportion of substitutions resulting from positive selection or the rate of positive selection (α or ωa; fig. 4b and c; supplementary fig. S3b and c, Supplementary Material online) among our expression categories. Indeed, the trend was in the opposite direction, with CEG and NON genes having slightly higher estimates of α. As a second explanation, genes that are diverging in expression may experience reduced purifying selection in one or both of the species. For example, higher expression appears to be correlated with greater purifying selection, so a reduction in gene expression in one species relative to another may be accompanied by a relaxation of selection in that species.
FIG. 3. Pairwise (below diagonal) and partial (above diagonal) Spearman’s rank correlations between evolutionary rates and genomic variables. (a) dN/dS (N = 3,199), (b) 5′-UTR (N = 2,924), and (c) 3′-UTR (N = 3,028). Note that some contrasts that are reproduced among panels may differ slightly in their results, due to differences in the number of genes included, as a result of incomplete information from UTRs.
A slight but significant increase in the proportion of nearly neutral substitutions in genes with differences in expression among species (SEG) compared to those with constitutive expression across treatments and species (NON) provides some support for this hypothesis (fig. 4a). This pattern suggests that genes with conserved expression patterns experience stronger purifying selection than those that have shifted in some way between the species. However, we found no significant differences between CEG and DEG genes in α, ωa, or the DFE, and differences in branch-specific estimates of dN/dS were not correlated with expression differences between pine and spruce. Although differences in the DFE among expression classes provide some support for the hypothesis that genes that change expression over evolutionary time experience relaxed selection, we lacked an outgroup for the gene expression analysis with which to make inferences about which branch(es) the change in expression took place on and the direction of the shift. Such information is needed to assess whether positive or negative selection is associated with an increase or decrease in gene expression in each lineage, as both could be contributing to the pattern. Future studies that examine both gene expression and sequence evolution in a phylogenetic context may allow a greater understanding of the relative roles of positive and negative selection in driving coding sequence and expression divergence.

The UTRs of eukaryotic mRNAs play an important role in the posttranscriptional regulation of gene expression (Pesole et al. 2001; Liu et al. 2012). UTRs can influence gene expression through their impact on mRNA stability, transcription or translation efficiency, and mRNA localization (Narsai et al. 2007; Liu et al. 2012). The 3′-UTR showed patterns of polymorphism and divergence that were more similar to the replacement sites while the 5′-UTR showed patterns more similar to synonymous sites (fig. 2).
This suggests greater evolutionary constraints on the 3′-UTR, which tends to be longer and likely harbors more functional sites (Narsai et al. 2007; Liu et al. 2012). Because changes in these regions may have direct impacts on gene expression, we tested if divergence in expression was associated with evolutionary rate in these regions. Genes conserved in their plasticity of expression (CEG) showed the slowest evolutionary rates for both the 5′- and 3′-UTRs, perhaps because of the role of these regions in maintaining proper regulation in response to environmental change (fig. 1). Interestingly, the 3′-UTR displayed a significant positive association between evolutionary rate and expression divergence (fig. 3), but there was no significant association of the 5′-UTR rate with expression divergence. This makes sense as a primary function of the 3′-UTR is to regulate expression of mRNA, whereas the 5′-UTR has a major role in regulating translation (Mignone et al. 2002).

Gene Expression Level and Evolutionary Rate

We identified negative correlations between both dN/dS and dN and average expression level. Several studies have found that genes with high levels of expression are constrained in their evolutionary rate (for review see Rocha 2006). High expression genes may function in a wide array of biochemical environments and may be able to tolerate fewer mutations because they interact with a wide array of partners (Duret and Mouchiroud 2000; Subramanian and Kumar 2004). Alternatively, there may be selection to reduce the number of misfolded proteins. This would constrain dN when expression is high through selection for protein sequences that fold properly despite mistranslation (Drummond et al. 2006).

[Figure panels omitted: (a) proportion of sites by Nes category; (b) and (c) estimates by expression category (CEG, DEG, NON, SEG).]

FIG. 4. The DFEs (a), the proportion of sites fixed by positive selection (b), and the rate of fixation by positive selection (c) for four gene expression categories determined by the patterns of expression among climate treatments and between lodgepole pine and interior spruce. CEG: conserved plasticity in expression; DEG: diverged plasticity in expression; SEG: species diverged genes with no plasticity in expression; and NON: genes with no treatment or species differences in expression. Interior spruce was used as the outgroup and a two-epoch model was applied. 95% confidence intervals are shown from bootstrapping the data.

To determine the relative importance of positive and negative selection in driving the correlation between expression level and dN/dS, we compared patterns of divergence and polymorphism at synonymous and nonsynonymous sites. We found significantly more nearly neutral (Nes < 1) and fewer highly deleterious mutations (Nes > 100) in the low expression class compared with the high expression class when using the two-epoch model (fig. 5). However, neither the proportion of substitutions fixed by positive selection (α) nor the rate of positive selection (ωa) differed among expression level classes. This suggests that the differences in dN/dS are driven primarily by a relaxation of purifying selection in low expression genes compared to the high expression genes. Similarly, other studies that have taken this approach have found strong evidence for greater purifying selection in high expression genes (Paape et al. 2013; Williamson et al. 2014). In order to identify polymorphisms using RNAseq we required a relatively high level of expression across many individuals, suggesting the pattern would potentially be even stronger if SNPs could be identified in genes with even lower levels of expression.
However, this approach meant that we eliminated very weakly expressed genes from the analysis and ensured that pseudogenes did not create an apparent relationship between expression level and evolutionary rate. Pseudogenes are common in conifers (Nystedt et al. 2013) and could have a basal level of expression (Thibaud-Nissen et al. 2009).

Evolutionary Rates and Genomic Context

Context-specific gene expression can reduce pleiotropic constraints by limiting the effects of mutations to a specific context (e.g., tissues or environments). This can potentially facilitate sequence divergence (Pal et al. 2006; Snell-Rood et al. 2010). Genes with low tissue specificity have also been found to be more slowly evolving (e.g., Subramanian and Kumar 2004; Renaut et al. 2012; Paape et al. 2013), likely because these genes are involved in multiple biochemical pathways and experience multiple selective environments. Our present study was unable to examine tissue specificity as only one organ type (needles) was examined. Environment-specific gene expression may have similar impacts on evolutionary rate as it could restrict gene expression to a subset of individuals. The restricted number of individuals that experience this environment and thereby express these genes will mean that the effects of selection will be weakened (Snell-Rood et al. 2010). To address this hypothesis, we examined the association of dN/dS with treatment specificity. However, in coding regions we found limited evidence for this association once correlations among other variables were taken into account (supplementary fig. S2, Supplementary Material online). This could be because climate treatments reflect environments commonly experienced by most individuals during development due to temporal variation, or because genes that appear to have high treatment specificity in these seedlings are actually expressed in a number of environments or involved in multiple traits over the long lifetimes of these trees. Alternatively, our cutoff for determining if a gene was expressed or not in a treatment may not reflect the functionality of these weakly expressed genes. However, the untranslated regions were associated with treatment specificity but in opposing ways, with the positive association between substitution rate and treatment specificity predicted for the coding sequence found in the 3′-UTR, and with a stronger negative association in the 5′-UTR such that genes with faster evolving 5′-UTRs were expressed in a wider array of climate treatments.

[Figure panels omitted: (a) proportion of sites by Nes category; (b) and (c) estimates by expression level (low, mid low, mid high, high).]

FIG. 5. The DFEs (a), the proportion of sites fixed by positive selection (b), and the rate of fixation by positive selection (c) for four gene expression categories, equal in size, determined by average expression level in lodgepole pine. Loblolly pine was used as the outgroup and a two-epoch model was applied. 95% confidence intervals are shown from bootstrapping the data. Different letters indicate significant differences.

We found a strong positive correlation between protein length and evolutionary rate. Protein length has been found to be correlated with divergence in previous studies (but see Drummond et al. 2006; Alvarez-Ponce 2012), although the direction of the relationship is not consistent among species. Similar to our findings, some studies find strong positive correlations with dN/dS (Lemos et al. 2005; Ingvarsson 2007), but others report the reverse pattern (Liao and Zhang 2006; Larracuente et al. 2008; Yang and Gaut 2011; Sun and Choi 2015).
Positive correlations between protein length and divergence of the coding sequence could be due to greater selective interference among sites (i.e., the Hill–Robertson effect), which is expected in longer proteins, thereby reducing the efficiency of natural selection and potentially leading to greater divergence (Ingvarsson 2007). Long proteins also had reduced synonymous polymorphism, which would be expected under this scenario. However, such an explanation predicts that linked untranslated regions should follow a similar pattern. This is inconsistent with our results, perhaps implying a greater role of the untranslated regions in regulating transcription or translation of longer proteins in particular. Protein length could be correlated with other factors not accounted for in our analysis that may influence divergence in the coding sequence or UTRs, such as gene function, intron number or length, and codon bias (Bush et al. 2015; De La Torre et al. 2015). We also found associations between evolutionary rate and some gene ontology (GO) categories for both the UTR and coding regions, suggesting gene function can be an important driver of evolutionary change (supplementary fig. S9, Supplementary Material online).

Estimates of Adaptive Divergence in Conifers

We identified only a handful of genes that showed evidence of site-specific positive selection using divergence data with PAML. Although this method has limited power to detect selection when few taxa are included in the alignment (Bielawski and Yang 2005), our findings provide an initial glimpse at which genes are evolving rapidly during conifer evolution. Several putative transcription factors were identified (ortho_group3686, ortho_group13328, ortho_group13072), including one with a homolog involved in defense in A. thaliana (ortho_group14244). Ongoing work identifying loci important for local adaptation in these species will allow us to determine how often genes diverging within a species in response to local selective pressures are the same as those fixing rapidly among species in response to positive selection.

The low number of sites identified as experiencing positive selection using divergence data alone contrasts with the relatively high proportion of sites identified as adaptively diverging in lodgepole pine using polymorphism and divergence data (figs. 4 and 5). The results of the two-epoch model suggest that the proportion of sites evolving due to positive selection in conifers ranges from 0.13 to 0.52, depending on the expression class examined and the outgroup used. Our estimates are higher than previous estimates of α conducted in several conifer species (Eckert et al. 2013a). Using a similar approach, Eckert et al. (2013a) found that estimates of α were not significantly different from zero in all 11 lineages examined. They attribute these low values to strong biases in the genes chosen for the analysis. In a sample of a larger number of genes in loblolly pine, estimates of α similar to ours have been found (Eckert et al. 2013b). Many widespread conifers are thought to have relatively large effective population sizes (Neale and Savolainen 2004) and several lines of evidence, including decades of common garden experiments, demonstrate the adaptive capacity of conifers (Alberto et al. 2013). The proportion of sites fixed by positive selection is consistent with estimates in other plant species with large effective population sizes and low population structure (see Hough et al. 2013 for review) such as sunflowers (Renaut et al. 2012), Capsella grandiflora (Williamson et al. 2014), and Populus tremula (Ingvarsson 2010). However, estimates of α require many assumptions to be made and violation of these assumptions can dramatically impact the outcome and hinder comparisons among species (Gossmann et al. 2010; Hough et al. 2013).
For example, current methods to estimate α assume that positively selected sites are swept to fixation and do not contribute to polymorphism. Although we only used sites from a single population in this study, local adaptation or strong population structure will maintain variation within a species that is not accounted for by this approach. Estimates of α can be strongly biased downwards if a significant number of polymorphic sites are slightly deleterious, as such sites would rarely contribute to divergence. Therefore, we implemented DFE-α, which attempts to estimate the fraction of these deleterious alleles using the allele frequency spectrum (AFS; Eyre-Walker and Keightley 2009). This approach estimates demographic changes, which can also impact the frequency spectrum of polymorphisms. However, the demographic model used (one or two epochs) had a substantial impact on the outcome. The effects of genetic draft can distort the AFS and impact estimates of demographic parameters and strength of purifying selection (Messer and Petrov 2013). Differences in selection on linked functional sites could explain why slightly different demographic estimates were obtained for different classes of genes in our analysis. A single epoch model resulted in significantly greater rates of positive selection overall, shifts in the DFE, and significantly higher α and ωa estimates in low expression genes compared with the two-epoch model (supplementary figs. S3 and S5, Supplementary Material online). Simulations have shown that a two-epoch model can produce more accurate estimates of α even when population sizes are held constant due to the impact of genetic draft. The two-epoch model essentially accounts for the skew of the site frequency spectrum at synonymous sites introduced by background selection. Messer and Petrov (2013) found that the one-epoch model generally overestimated α, which is consistent with our findings.
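The downward bias from slightly deleterious polymorphisms described above can be illustrated with a minimal sketch. It uses the simple McDonald–Kreitman-style estimator, α = 1 − (Ds·Pn)/(Dn·Ps), which is an assumption chosen for illustration and is not the likelihood method implemented in DFE-α. The function name and all counts are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Simple McDonald-Kreitman-style estimate of alpha.

    dn, ds: nonsynonymous and synonymous divergence counts
    pn, ps: nonsynonymous and synonymous polymorphism counts
    Illustrative only; DFE-alpha instead fits the allele frequency spectrum.
    """
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical counts: 40 nonsynonymous polymorphisms, 10 of which are
# slightly deleterious and will rarely reach fixation.
alpha_all = mk_alpha(dn=100, ds=200, pn=40, ps=100)      # about 0.2
alpha_neutral = mk_alpha(dn=100, ds=200, pn=30, ps=100)  # about 0.4
print(alpha_all, alpha_neutral)
```

In this toy case, counting the slightly deleterious polymorphisms inflates Pn and halves the estimate of α, which is the kind of bias that frequency-spectrum methods are designed to correct.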
However, variation in demographic estimates for different gene categories and the nonindependence of functional and nonfunctional sites may have obscured differences in DFE or rates of positive selection.

We found a weak but significant negative correlation between dN/dS and synonymous site polymorphism. This pattern is driven by an underlying positive correlation with dS and a stronger negative correlation with dN. One explanation for this pattern is recurrent selection, whereby repeated sweeps at functional sites result in reduced variation at linked neutral sites (Maynard Smith and Haigh 1974). Similarly, interactions between positive and negative selection could contribute to this pattern, such that positive selection may result in divergence at weakly deleterious sites via local reductions in effective population size (Andolfatto 2007). Such sweeps would also be expected to reduce neutral diversity. Simulations have demonstrated that a negative correlation between nonsynonymous divergence and neutral polymorphism is unlikely to be generated by negative selection alone (Lohmueller et al. 2011). Along with intermediate estimates of α, the negative partial correlation between dN and synonymous nucleotide diversity suggests that adaptive divergence among conifer genomes may be more frequent than some previous estimates have indicated (Eckert et al. 2013a). Our capacity to detect rapidly evolving genes or sites is likely hampered somewhat by difficulties in properly aligning diverged regions and our focus on one-to-one orthologs, suggesting that the adaptive divergence identified here could be even more prevalent. Conifer genomes have notoriously low evolutionary rates relative to angiosperms and have several other unique genomic features (e.g., reduced whole-genome duplications, but see Li et al. 2015), perhaps suggesting conifer-specific mechanisms of genome evolution (Buschiazzo et al. 2012).
Determining whether this level of adaptive divergence is due to the demography and life history of our particular study species or a general feature of conifers will require further comparative studies.

Materials and Methods

Transcriptome Data

We previously developed reference transcriptomes for lodgepole pine and interior spruce (Yeaman et al. 2014). The reference transcriptomes contained a single (longest) isoform per transcript cluster (identified by the Trinity assembler; Grabherr et al. 2011) and had weakly expressed transcripts removed. We also obtained reference transcriptomes for loblolly pine, Sitka spruce, Norway spruce, and Douglas-fir from the TreeGenes database (http://dendrome.ucdavis.edu/treegenes/, last accessed February 20, 2016). For each of these transcriptomes, we removed redundant transcripts by clustering using Cd-Hit-Est (94% identity, word size = 8, and both strands were compared) (Li and Godzik 2006; Fu et al. 2012).

Ortholog Identification and Interspecific Alignments for Six Conifer Species

Using the six conifer transcriptome assemblies, we conducted an all-against-all TBLASTX. Using these results, we identified orthologs with OrthoMCL version 2.0.8 applying default parameters (Li et al. 2003). This program uses a heuristic BLAST-based approach to identify putative orthologous clusters of genes termed orthogroups. For comparisons between species we only used one-to-one orthologs. We identified the most likely open reading frames (ORFs) for all orthologs using Transdecoder (option --search_pfam; Haas et al. 2013). After ORFs were translated, we conducted a BLASTP to the TAIR 10 database. Only those ORFs with a Pfam hit or a hit to A. thaliana were retained for further analysis to ensure the correct identification of ORFs, particularly in fragmented transcripts. A relatively small fraction of putative ORFs were rejected at this stage (<18%).
Following this we extracted the predicted coding sequences and the 5′-UTR and 3′-UTR for each orthogroup and aligned the sequences using Prank (+F option; Löytynoja and Goldman 2008). This program, which takes evolutionary relationships into account when doing the alignments, has been shown to outperform other alignment programs (Löytynoja and Goldman 2008; Fletcher and Yang 2010; Markova-Raina and Petrov 2011). For the coding regions we used the codon model for the alignments.

RNAseq Data and Expression Analysis

We employed approximately 350 Gb of RNAseq data that we previously gathered to examine patterns of gene expression among seven climate treatments that differed in their light, temperature, and moisture regime (for methodological details see supplementary table S1, Supplementary Material online; Yeaman et al. 2014). Data were derived from needle samples of 44 lodgepole pine and 39 interior spruce seedlings obtained from the British Columbia Ministry of Forests, Lands, and Natural Resource Operations. Lodgepole pine seed was from seedlot 63,019, seed orchard 313 (Nelson Seed Planning Unit) containing 46 parental genotypes. Interior spruce seedlings were grown from seedlot 63,060 from seed orchard 305 (Nelson Seed Planning Unit) containing 70 parental genotypes. We used RSEM to estimate gene expression levels in each individual (Li and Dewey 2011), aligning each library back to our Trinity assemblies. For orthologous genes between lodgepole pine and interior spruce, we analyzed patterns of gene expression among treatments and species using the EdgeR software package (Robinson et al. 2010).
Evolutionary Rates for Pairwise Comparisons between Lodgepole Pine and Interior Spruce

For all orthogroups for which lodgepole pine and interior spruce had one-to-one orthologs, we conducted pairwise comparisons to determine the divergence at nonsynonymous and synonymous sites in the coding sequence, as well as the divergence in the untranslated regions using PAML version 4.5 (Yang 1997; Yang 2007). We ran BASEML for the UTRs (general time reversible model), and CODEML for the protein coding regions (runmode -2, F3X4 codon frequency). We removed columns with missing data or gaps and only retained sequences for which there were at least 60 bases for the untranslated regions and 60 codons for the coding regions. As low divergence leads to uncertain estimates, orthogroups where dS was below 0.01 were excluded. We removed alignments where the untranslated regions had substitution rates below this level and discarded orthogroups showing substitution rates greater than 2, indicating saturation of substitutions and potential alignment errors.

We used a number of filters to avoid alignment errors. This may bias alignments toward more slowly diverging genes, but we felt that this was preferable to introducing errors that can provide incorrect estimates of evolutionary rate. Prior to the analysis, we eliminated all alignments with average percent identity below 50%, and for the analysis of the coding regions, we removed genes without complete ORFs in both species (i.e., a start and a stop codon detected), and removed genes with large discrepancies in protein length (greater than a 10% difference). These features may indicate misalignments, frameshifts, premature stop codons, pseudogenes, or the erroneous identification of orthologs based on homologous conserved domains in otherwise separate genes.

SNP Identification and Polymorphism Estimates

We aligned RNAseq reads for lodgepole pine and interior spruce to the de novo assemblies for each species (Yeaman et al. 2014) with BWA-mem (Li and Durbin 2009; Li 2013) and GATK IndelRealigner (DePristo et al. 2011), and called SNPs and indels using Mpileup (Samtools/Bcftools v 0.1.19; Li et al. 2009; Li 2011). Details are in the supplementary methods, Supplementary Material online. Following filtering (supplementary methods, Supplementary Material online), we used a modified version of the Polymorphorama Perl script (Bachtrog and Andolfatto 2006; Andolfatto 2007; Haddrill et al. 2008) to generate nonsynonymous and synonymous AFS, as well as summary statistics for each gene. This script estimated the number of synonymous sites, nonsynonymous sites, average pairwise diversity at synonymous (πs) and nonsynonymous sites (πn), average pairwise divergence (Dxy), as well as counts of the number of polymorphisms (S), and the summary of the frequency distribution of polymorphism, Tajima's D (Tajima 1989). Average pairwise diversity and divergence estimates were either corrected for multiple hits using a Jukes–Cantor correction (Jukes and Cantor 1969) or, in the case of synonymous divergence, the Kimura (1980) two-parameter model.

Statistical Analysis

We determined if the rates of protein and UTR evolution differed depending on the pattern of gene expression among treatments or between lodgepole pine and interior spruce using a linear model in R (function lm). When gene expression class (CEG, DEG, SEG, and NON) explained a significant amount of the variation in evolutionary rate, we carried out multiple comparisons using Tukey's test. The data were either log or square root transformed to improve normality and homogeneity of variance. We determined if dN/dS ratio and substitution rate in the untranslated regions were correlated with average expression level, expression divergence between species (lodgepole pine vs. interior spruce), treatment specificity, or protein length using partial correlations (Spearman's) in R (pcor R).
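As a rough illustration of the partial Spearman correlations used above, the sketch below rank-transforms each variable and then applies the standard recursive formula for partial correlation. It is a minimal standard-library version written for this purpose and is not the R script used in the study; the function names and the toy data are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def _partial(x, y, covariates):
    """Partial correlation of x and y given covariates, by recursion."""
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = _partial(x, y, rest)
    r_xz = _partial(x, z, rest)
    r_yz = _partial(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y, controlling for each covariate."""
    return _partial(ranks(x), ranks(y), [ranks(z) for z in covariates])


# Hypothetical values for five genes.
dn_ds = [1, 2, 3, 4, 5]
expr_div = [2, 1, 4, 3, 5]
length = [1, 3, 2, 5, 4]
print(partial_spearman(dn_ds, expr_div, [length]))
```

The recursion evaluates many sub-correlations, so it is only practical for a handful of covariates, which matches the small number of genomic variables considered here.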
Average expression level was based on the transcript fraction measure from RSEM and is preferred over RPKM and FPKM measures because it is independent of the mean expressed transcript length, and comparable across samples and species (Li and Dewey 2011). Values were averaged within each treatment and species, and then averaged over all treatments. Expression divergence between species was assessed by determining the Euclidean distance between the average expression values for each treatment between species. Treatment specificity was determined by identifying treatments with average expression below 1.0 × 10^−6. The proportion of treatments with values below this threshold was then determined so that specificity would be 0 if expression occurred in all treatments and 1 if it was expressed in none of the treatments examined. We repeated the above analysis but replaced treatment specificity with πs, as we were unable to estimate this in genes where treatment specificity was low.

We tested the hypothesis that the correlation between expression and sequence divergence is driven by relaxed selection in the species with lower expression levels. We identified alignments with lodgepole pine, interior spruce, and Douglas-fir and used PAML to determine lineage-specific evolutionary rates. Because we did not have an outgroup for expression to determine changes in expression level along each branch, we examined correlations between the difference in dN/dS and the difference in average expression level between lodgepole pine and interior spruce.

For those orthologs in interior spruce and lodgepole pine with sufficient data, we compared levels of nucleotide diversity and Tajima's D between the species using a Wilcoxon rank test in R. Patterns of polymorphism and divergence were examined among gene expression classes and gene regions (untranslated regions vs. replacement and synonymous sites) using the nonparametric Kruskal–Wallis test.
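The expression divergence and treatment specificity measures defined earlier in this section can be sketched as follows. This is a minimal illustration, assuming that each gene is summarized by one mean transcript fraction per treatment in each species. The function names, the example values, and the reading of the cutoff as 1.0e-6 are assumptions made for the sketch.

```python
from math import sqrt

# Assumed transcript-fraction cutoff below which a gene is treated as
# not expressed in a treatment.
EXPRESSION_CUTOFF = 1.0e-6


def expression_divergence(pine_means, spruce_means):
    """Euclidean distance between per-treatment mean expression values."""
    if len(pine_means) != len(spruce_means):
        raise ValueError("both species need one value per treatment")
    return sqrt(sum((p - s) ** 2 for p, s in zip(pine_means, spruce_means)))


def treatment_specificity(treatment_means, cutoff=EXPRESSION_CUTOFF):
    """Proportion of treatments with mean expression below the cutoff.

    0 means expressed in every treatment; 1 means expressed in none.
    """
    below = sum(1 for m in treatment_means if m < cutoff)
    return below / len(treatment_means)


# Hypothetical per-treatment means for one gene in four treatments.
pine = [2.0e-6, 5.0e-7, 0.0, 3.0e-6]
spruce = [2.5e-6, 1.5e-6, 1.0e-6, 3.0e-6]
print(expression_divergence(pine, spruce))
print(treatment_specificity(pine))  # two of four treatments fall below the cutoff
```

Because the distance is computed on untransformed transcript fractions, highly expressed genes can contribute larger absolute distances, which is one reason for controlling for mean expression in the partial correlations.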
Because only a handful of orthogroups contained all three gene regions in interior spruce with sufficient polymorphism data for both UTRs and coding region, we only present this comparison for lodgepole pine, but the same pattern was observed in both species when comparisons of the coding region and each UTR were conducted (supplementary results, Supplementary Material online).

Estimates of the DFEs, α, and ωa

The McDonald–Kreitman test compares polymorphism and divergence between selectively and neutrally evolving sites to estimate the proportion of fixations driven by positive selection (α). However, the effects of slightly deleterious mutations can downwardly bias estimates of positive selection. Therefore, we implemented the approach of Eyre-Walker and Keightley (2009) to estimate α and the rate of positive selection (ωa) while taking into account segregating deleterious polymorphism by using the site frequency spectrum. The divergence values and the AFS were calculated for each gene using Polymorphorama for lodgepole pine with interior spruce as an outgroup. We chose this outgroup as we wanted to compare changes in gene expression between these species to the adaptive substitutions arising during divergence between them. Using synonymous sites as a neutral reference we estimated α, ωa, and the DFEs using DFE-α (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009) for each expression category. This was also repeated using method II from Eyre-Walker and Keightley (2009) implemented in DoFe 3.0. Divergence and polymorphism data were summed across all genes in the specific category. We repeated this analysis by down-sampling reads to ensure that patterns were not impacted by potential biases in SNP calling associated with expression level. We implemented a one- and two-epoch model (i.e., a single population size vs. a stepwise change in population size) and ran each model under a range of starting parameters (t2: 1,000, 100, 10; s: 0.1, 0.01, 0.001; beta: 2, 1, 0.5, 0.1). Changing the starting parameters had little impact on the final estimates (<1%) in all cases. However, the number of epochs in the model had a large impact on the absolute estimates of α and in some cases the relative differences among expression classes. We estimated 95% CI using 1,000 bootstraps by sampling genes with replacement from each expression category. We determined significance in the same manner as Williamson et al. (2014).

We also examined the impact of expression level on α, ωa, and the DFE to determine the relative importance of positive and negative selection in shaping the relationship between divergence and expression level. We examined expression level in lodgepole pine using loblolly pine as an outgroup. We repeated this analysis using interior spruce as the outgroup and by down-sampling reads to ensure that patterns were not impacted by the selected outgroup or potential biases in SNP calling associated with expression level. Polymorphism data from interior spruce were not used in this analysis, as the hybrid zone effects would not be accounted for in the demographic model available in DFE-α. Genes were divided into four equal categories based on average expression levels and DFE-α was run in the same manner as above.

Analysis of Site-Specific Positive Selection Using Divergence among Six Conifers

Only a small fraction of sites are likely targeted by positive selection over a brief window of evolutionary time (Golding and Dean 1998; Bielawski and Yang 2005). Positive selection is difficult to detect using pairwise comparisons, as this approach averages selective pressure over the entire evolutionary history separating the two lineages and over all codon sites in the sequences. Power is improved if selective pressure is allowed to vary over sites or branches.
However, the greater complexity of the model means that multiple sequences are needed (Yang 1998; Yang et al. 2000; Bielawski and Yang 2005). To accommodate this, we included all six conifer species for which transcriptomes are publicly available. As visual inspection of the alignments occasionally indicated potential paralogs, we used a tree-based approach to flag these problematic alignments (see supplementary methods, Supplementary Material online), which were ignored for all downstream analyses. We evaluated site-specific positive selection using PAML 4.5. We filtered all alignments according to the same parameters as above. Only orthogroups with at least three species in the trimmed and filtered alignments were used. For the analysis we constructed an unrooted tree using the known topology among the six species (Wang et al. 2000; Lockwood et al. 2013; supplementary fig. S10, Supplementary Material online). Specifically, we used the sites model in CODEML to estimate dN and dS at each codon averaged across all branches in the tree. We tested for sites evolving by positive selection (i.e., dN/dS, ω > 1) by comparing M1a (nearly neutral) against M2a (positive selection), and M7 (beta) against M8 (beta and ω) (Yang 1997; Yang 2014). Equilibrium codon frequencies for each alignment were estimated from the average nucleotide frequencies at the three codon positions (F3X4), and transition/transversion ratios were estimated by iteration of the data. Twice the difference in log-likelihood values for the M1a:M2a (2 df) and M7:M8 (2 df) comparisons was assessed for statistical significance using the χ2 distribution. Because the M8 model in particular is known to be influenced by starting parameters, we ran the program using multiple starting values of the parameter ω.
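The likelihood ratio tests described above can be sketched in a few lines. The example assumes the log-likelihoods of the null (M1a or M7) and alternative (M2a or M8) models have already been read from CODEML output; the function name and the numerical values are hypothetical. For 2 degrees of freedom the χ2 upper-tail probability has the closed form exp(−x/2), so no statistics library is needed.

```python
from math import exp


def lrt_2df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by 2 parameters.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its P value
    from the chi-square distribution with 2 degrees of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    # Optimisation noise can make the statistic slightly negative.
    stat = max(stat, 0.0)
    p_value = exp(-stat / 2.0)  # chi-square survival function for 2 df
    return stat, p_value


# Hypothetical log-likelihoods for one orthogroup (M7 vs. M8).
stat, p = lrt_2df(lnl_null=-1000.0, lnl_alt=-997.0)
print(stat, p)  # statistic 6.0, P just under 0.05
```

The resulting P values across orthogroups would then be passed to a multiple-testing correction before genes are reported as significant.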
We compared P values to critical values calculated based on α = 0.05 so that fewer than 5% of genes identified as significant were false positives (qvalue in R; Storey 2002).

Supplementary Material

Supplementary methods and results, figures S1–S10 and tables S1–S7 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The authors would like to thank the Aitken and Rieseberg lab groups for helpful suggestions as well as the editor and three anonymous reviewers for their insightful comments. The authors thank G. O’Neill of the BC Ministry of Forests, Lands and Natural Resources Operations for providing seedlings, and P. Smets, R. Belvas and C. Fitzpatrick for cultivating seedlings and maintaining treatments. This work is part of the AdapTree Project funded by the Genome Canada Large Scale Applied Research Project program, with co-funding from Genome BC, the BC Ministry of Forests, Lands and Natural Resources Operations, and the Forest Genetics Council of BC (co-Project leaders S.N. Aitken and A. Hamann).

References

Aitken SN, Yeaman S, Holliday JA, Wang T, Curtis-McLane S. 2008. Adaptation, migration or extirpation: climate change outcomes for tree populations. Evol Appl. 1:95–111.
Alberto FJ, Aitken SN, Alía R, González-Martínez SC, Hänninen H, Kremer A, Lefèvre F, Lenormand T, Yeaman S, Whetten R, et al. 2013. Potential for evolutionary responses to climate change – evidence from tree populations. Glob Chang Biol. 19:1645–1661.
Alvarez-Ponce D. 2012. The relationship between the hierarchical position of proteins in the human signal transduction network and their rate of evolution. BMC Evol Biol. 12:192.
Andolfatto P. 2007. Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome. Genome Res. 17:1755–1762.
Bachtrog D, Andolfatto P. 2006. Selection, recombination and demographic history in Drosophila miranda. Genetics 174:2045–2059.
Bielawski JP, Yang Z. 2005. Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen R, editor. Statistical methods in molecular evolution. New York: Springer Verlag. p. 103–124.
Buschiazzo E, Ritland C, Bohlmann J, Ritland K. 2012. Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms. BMC Evol Biol. 12:1–15.
Bush SJ, Kover PX, Urrutia AO. 2015. Lineage-specific sequence evolution and exon edge conservation partially explain the relationship between evolutionary rate and expression level in A. thaliana. Mol Ecol. 24:3093–3106.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43:491–498.
Drummond DA, Raval A, Wilke CO. 2006. A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. 23:327–337.
Duret L, Mouchiroud D. 2000. Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol. 17:68–74.
Eckert A, Shahi H. 2012. Spatially variable natural selection and the divergence between parapatric subspecies of lodgepole pine (Pinus contorta, Pinaceae). Am J Bot. 99:12–11.
Eckert AJ, Bower AD, Jermstad KD, Wegrzyn JL, Knaus BJ, Syring JV, Neale DB. 2013a. Multilocus analyses reveal little evidence for lineage-wide adaptive evolution within major clades of soft pines (Pinus subgenus Strobus). Mol Ecol. 22:5635–5650.
Eckert AJ, Wegrzyn JL, Liechty JD, Lee JM, Cumbie WP, Davis JM, Goldfarb B, Loopstra CA, Palle SR, Quesada T, et al. 2013b. The evolutionary genetics of the genes underlying phenotypic associations for loblolly pine (Pinus taeda, Pinaceae). Genetics 195:1353–1372.
Eyre-Walker A, Keightley PD. 2009. Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change. Mol Biol Evol. 26:2097–2108.
Fletcher W, Yang Z. 2010. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Mol Biol Evol. 27:2257–2267.
Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152.
Gillespie J. 1991. The causes of molecular evolution. Oxford: Oxford University Press.
Golding GB, Dean AM. 1998. The structural basis of molecular adaptation. Mol Biol Evol. 15:355–369.
Gossmann TI, Song B-H, Windsor AJ, Mitchell-Olds T, Dixon CJ, Kapralov MV, Filatov DA, Eyre-Walker A. 2010. Genome wide analyses reveal little evidence for adaptive evolution in many plant species. Mol Biol Evol. 27:1822–1832.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 29:644–652.
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, et al. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 8:1494–1512.
Haddrill PR, Bachtrog D, Andolfatto P. 2008. Positive and negative selection on noncoding DNA in Drosophila simulans. Mol Biol Evol. 25:1825–1834.
Hough J, Williamson RJ, Wright SI. 2013. Patterns of selection in plant genomes. Annu Rev Ecol Evol Syst. 44:31–49.
Howe GT, Aitken SN, Neale DB, Jermstad KD, Wheeler NC, Chen TH. 2003. From genotype to phenotype: unraveling the complexities of cold adaptation in forest trees. Can J Bot. 81:1247–1266.
Ingvarsson PK. 2007. Gene expression and protein length influence codon usage and rates of sequence evolution in Populus tremula. Mol Biol Evol. 24:836–844.
Ingvarsson PK. 2010. Natural selection on synonymous and nonsynonymous mutations shapes patterns of polymorphism in Populus tremula. Mol Biol Evol. 27:650–660.
Jordan IK, Mariño-Ramírez L, Koonin EV. 2005. Evolutionary significance of gene expression divergence. Gene 345:119–126.
Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In: Munro HN, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
Keightley PD, Eyre-Walker A. 2007. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177:2251–2261.
Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 16:111–120.
Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: Cambridge University Press.
De La Torre AR, Birol I, Bousquet J, Ingvarsson PK, Jansson S, Jones SJM, Keeling CI, MacKay J, Nilsson O, Ritland K, et al. 2014. Insights into conifer giga-genomes. Plant Physiol. 166:1724–1732.
De La Torre AR, Lin Y-C, Van de Peer Y, Ingvarsson PK. 2015. Genome-wide analysis reveals diverged patterns of codon bias, gene expression, and rates of sequence evolution in Picea gene families. Genome Biol Evol. 7:1002–1015.
De La Torre AR, Wang T, Jaquish B, Aitken SN. 2013. Adaptation and exogenous selection in a Picea glauca × Picea engelmannii hybrid zone: implications for forest management under climate change. New Phytol. 687–699.
Larracuente AM, Sackton TB, Greenberg AJ, Wong A, Singh ND, Sturgill D, Zhang Y, Oliver B, Clark AG. 2008. Evolution of protein-coding genes in Drosophila. Trends Genet. 24:114–123.
Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL. 2005. Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length, and number of protein-protein interactions. Mol Biol Evol. 22:1345–1354.
Li B, Dewey CN. 2011. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323.
Li H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993.
Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN].
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.
Li L, Stoeckert CJ, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13:2178–2189.
Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659.
Liao BY, Zhang J. 2006. Low rates of expression profile divergence in highly expressed genes and tissue-specific genes during mammalian evolution. Mol Biol Evol. 23:1119–1128.
Liu H, Yin J, Xiao M, Gao C, Mason AS, Zhao Z, Liu Y, Li J, Fu D. 2012. Characterization and evolution of 5′ and 3′ untranslated regions in eukaryotes. Gene 507:106–111.
Lockwood JD, Aleksic JM, Zou J, Wang J, Liu J, Renner SS. 2013. A new phylogeny for the genus Picea from plastid, mitochondrial, and nuclear sequences. Mol Phylogenet Evol. 69:717–727.
Lohmueller KE, Albrechtsen A, Li Y, Kim SY, Korneliussen T, Vinckenbosch N, Tian G, Huerta-Sanchez E, Feder AF, Grarup N, et al. 2011. Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome. PLoS Genet. 7:e1002326.
Löytynoja A, Goldman N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635.
Markova-Raina P, Petrov D. 2011. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res. 21:863–874.
Maynard Smith J, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genet Res. 23:23–35.
Messer PW, Petrov DA. 2013. Frequent adaptation and the McDonald-Kreitman test. Proc Natl Acad Sci U S A. 110:8615–8620.
Mignone F, Gissi C, Liuni S, Pesole G. 2002. Untranslated regions of mRNAs. Genome Biol. 3:REVIEWS0004.
Moyers BT, Rieseberg LH. 2013. Divergence in gene expression is uncoupled from divergence in coding sequence in a secondarily woody sunflower. Int J Plant Sci. 174:1079–1089.
Narsai R, Howell KA, Millar AH, O’Toole N, Small I, Whelan J. 2007. Genome-wide analysis of mRNA decay rates and their determinants in Arabidopsis thaliana. Plant Cell 19:3418–3436.
Neale DB, Kremer A. 2011. Forest tree genomics: growing resources and applications. Nat Rev Genet. 12:111–122.
Neale DB, Savolainen O. 2004. Association genetics of complex traits in conifers. Trends Plant Sci. 9:1360–1385.
Nuzhdin SV, Wayne ML, Harmon KL, McIntyre LM. 2004. Common pattern of evolution of gene expression level and protein sequence in Drosophila. Mol Biol Evol. 21:1308–1317.
Nystedt B. 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497:579–584.
Ohta T. 2002. Near-neutrality in evolution of genes and gene regulation. Proc Natl Acad Sci U S A. 99:16134–16137.
Paape T, Bataillon T, Zhou P, Kono TJY, Briskine R, Young ND, Tiffin P. 2013. Selection, genome-wide fitness effects and evolutionary rates in the model legume Medicago truncatula. Mol Ecol. 22:3525–3538.
Pal C, Papp B, Lercher MJ. 2006. An integrated view of protein evolution. Nat Rev Genet. 7:337–348.
Parchman TL, Gompert Z, Mudge J, Schilkey FD, Benkman CW, Buerkle CA. 2012. Genome-wide association genetics of an adaptive trait in lodgepole pine. Mol Ecol. 21:2991–3005.
Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S. 2001. Structural and functional features of eukaryotic mRNA untranslated regions. Gene 276:73–81.
Ramsay H, Rieseberg LH, Ritland K. 2009. The correlation of evolutionary rate with pathway position in plant terpenoid biosynthesis. Mol Biol Evol. 26:1045–1053.
Renaut S, Grassa C, Moyers B, Kane N, Rieseberg L. 2012. The population genomics of sunflowers and genomic determinants of protein evolution revealed by RNAseq. Biology 1:575–596.
Rocha EPC. 2006. The quest for the universals of protein evolution. Trends Genet. 22:412–416.
Savolainen O, Pyhäjärvi T, Knürr T. 2007. Gene flow and local adaptation in trees. Annu Rev Ecol Evol Syst. 38:595–619.
Slotte T, Bataillon T, Hansen TT, St Onge K, Wright SI, Schierup MH. 2011. Genomic determinants of protein evolution and polymorphism in Arabidopsis. Genome Biol Evol. 3:1210–1219.
Snell-Rood EC, Dyken JDV, Cruickshank T, Wade MJ, Moczek AP. 2010. Toward a population genetic framework of developmental evolution: the costs, limits, and consequences of phenotypic plasticity. BioEssays 32:71–81.
Storey J. 2002. A direct approach to false discovery rates. J R Stat Soc. 64:479–498.
Subramanian S, Kumar S. 2004. Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168:373–381.
Sun SS, Choi S. 2015. Lengths of coding and noncoding regions of a gene correlate with gene essentiality and rates of evolution. Genes Genomics 37:365–374.
Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 595:585–595.
Thibaud-Nissen F, Ouyang S, Buell CR. 2009. Identification and characterization of pseudogenes in the rice gene complement. BMC Genomics 13:1–13.
Tirosh I, Barkai N. 2008. Evolution of gene sequence and gene expression are not correlated in yeast. Trends Genet. 24:109–113.
Wang XQ, Tank DC, Sang T. 2000. Phylogeny and divergence times in Pinaceae: evidence from three genomes. Mol Biol Evol. 17:773–781.
Warnefors M, Kaessmann H. 2013. Evolution of the correlation between expression divergence and protein divergence in mammals. Genome Biol Evol. 5:1324–1335.
Williamson RJ, Josephs EB, Platts AE, Hazzouri KM, Haudry A, Blanchette M, Wright SI. 2014. Evidence for widespread positive and negative selection in coding and conserved noncoding regions of Capsella grandiflora. PLoS Genet. 10:e1004622.
Yang L, Gaut BS. 2011. Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Mol Biol Evol. 28:2359–2369.
Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556.
Yang Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol Biol Evol. 15:568–573.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24:1586–1591.
Yang Z. 2014. User guide PAML: phylogenetic analysis by maximum likelihood. Version 4.8a (August 2014).
Yang Z, Nielsen R, Goldman N, Pedersen AM. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.
Yeaman S, Hodgins K, Suren H, Nurkowski K, Rieseberg L, Holliday J, Aitken S. 2014. Conservation and divergence of gene expression plasticity following 140 million years of evolution in lodgepole pine (Pinus contorta) and interior spruce (Picea glauca, Picea engelmannii and their hybrids). New Phytol. 203:578–591.