The Plant Journal (2013) 73, 941–951
doi: 10.1111/tpj.12089

Gene family evolution in green plants with emphasis on the origination and evolution of Arabidopsis thaliana genes

Ya-Long Guo1,2,*
1 State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China, and
2 Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany

Received 22 May 2012; revised 1 November 2012; accepted 23 November 2012; published online 15 January 2013.
*For correspondence (e-mail [email protected] or [email protected]).

SUMMARY
Gene family size variation is an important mechanism that shapes the natural variation for adaptation in various species. Despite its importance, the pattern of gene family size variation in green plants is still not well understood. In particular, the evolutionary pattern of genes and gene families in the model plant Arabidopsis thaliana remains unknown in the context of green plants. In this study, eight representative genomes of green plants are sampled to study gene family evolution and to characterize the origination of A. thaliana genes. Four important insights gained are that: (i) the rate of gene gains and losses is about 0.001379 per gene every million years, similar to the rate in yeast, Drosophila, and mammals; (ii) some gene families evolved rapidly with extreme expansions or contractions, and 2745 gene families present in all the eight species represent the ‘core’ proteome of green plants; (iii) 70% of A. thaliana genes could be traced back to 450 million years ago; and (iv) intriguingly, A. thaliana genes with early origination are under stronger purifying selection and more conserved. In summary, the present study provides genome-wide insights into the evolutionary history and mechanisms of genes and gene families in green plants and especially in A. thaliana.
Keywords: Arabidopsis thaliana, evolutionary genomics, gene family, gene origination, green plants.

© 2012 The Author
The Plant Journal © 2012 Blackwell Publishing Ltd

INTRODUCTION
Most genes can be grouped into gene families based on sequence similarity. Gene family size is variable across different lineages and may have important functional outcomes related to adaptation or speciation (Lynch and Conery, 2000; Nei and Rooney, 2005). Various factors can alter the size of gene families (De Bodt et al., 2005; Maere et al., 2005; Hahn et al., 2007; Freeling, 2009; Proost et al., 2011; Tautz and Domazet-Loso, 2011; De Smet and Van De Peer, 2012). First, gene duplication or whole-genome duplication (polyploidy) increases gene family size. Second, gene deletion decreases gene family size. Third, the de novo creation of genes creates new gene families. Gene duplication is an ongoing contributor to genome evolution, and its rate is of the same order of magnitude as the mutation rate per nucleotide site (Lynch and Conery, 2000, 2003; Gu et al., 2002; McLysaght et al., 2002; Simillion et al., 2002; Doyle et al., 2008; Flagel and Wendel, 2009; Van De Peer et al., 2009; Van De Peer, 2011). While duplication and de novo creation of genes increase the gene number, gene deletion decreases the gene number, either by deletion of a single gene or by deletion of several genes within a segmental fragment.

Although variation in gene family size has been the subject of many studies, most analyses have focused on only one or a few gene families across a small number of plant genomes (Shiu et al., 2004; Chardon and Damerval, 2005; Kim et al., 2006; Leseberg et al., 2006; Li et al., 2006; Porter et al., 2009; Guo et al., 2011a), owing to the limited number of available plant genomes or the lack of robust pipelines.
Recently, two methods have been developed to address the evolutionary pattern of genes and gene families at the whole-genome level: the likelihood approach implemented in CAFE (Computational Analysis of gene Family Evolution) (De Bie et al., 2006) and the gene emergence approach (Domazet-Loso et al., 2007; Cai et al., 2009; Zhang et al., 2010a; Banks et al., 2011). The likelihood approach uses a stochastic birth and death process to model the evolution of gene family size over a phylogeny (Hahn et al., 2005, 2007; De Bie et al., 2006; Demuth et al., 2006). In contrast, the gene emergence approach maps genes to different phylogenetic ranks (internodes) according to their emergence, as determined by similarity searches in the representative genomes. This approach is called ‘phylostratigraphy’, and the phylogenetic ranks or internodes are termed ‘phylostrata’ (Domazet-Loso et al., 2007; Prachumwat and Li, 2008; Cai et al., 2009; Zhang et al., 2010b, 2011; Banks et al., 2011; David and Alm, 2011). Recently, this approach has also been used to study the evolution of protein domains and of transcription-associated proteins (Lang et al., 2010; Kersting et al., 2012; Zhang et al., 2012).

The comparison of gene families among closely related and representative species of green plants (Plantae or Viridiplantae), comprising land plants and green algae, allows a comprehensive study of the evolution of genes and gene families in green plants. Eight representative genomes of green plants were compared, including two Arabidopsis genomes (Figure 1). Here I have investigated several fundamental questions regarding the evolution of genes and gene families in green plants: (i) the distribution of the gene family size; (ii) the rate of gene gain and loss, and the rate of expansion or contraction per gene family on each lineage; and (iii) for A.
thaliana genes in different phylostrata, the variation in the fixation rate of genes or gene families, in functional annotation, in protein size, in expression level, and in selection strength. In summary, the analyses of gene families across the eight representative genomes of green plants provide in-depth insights into the evolutionary genomics of green plants.

RESULTS

Gene family size distribution

All proteins from the eight genomes are used for the following analyses, since none of them is homologous to transposable elements. Markov Cluster Algorithm (MCL) clustering of all these proteins resulted in 55 989 gene families, in which 13.6% of genes are singletons, 4.6% are grouped into two-gene families, and 81.7% are grouped into multigene families (Table 1; see Data S1 for details). The clustering of genes in each genome indicates that the number of gene families ranges from 8692 (S. moellendorffii) to 18 387 (O. sativa), and the number of genes within gene families varies from 15 161 (C. reinhardtii) to 45 372 (P. trichocarpa) (Figure 2a and Table 1). In addition, the proportion of singletons in each genome ranges from 5% (S. moellendorffii) to 59.9% (C. reinhardtii); genes in two-gene families from 7.9% (O. sativa) to 25.1% (S. moellendorffii); and genes in multigene families from 28.3% (C. reinhardtii) to 73.3% (P. trichocarpa). The maximum gene family size varies from 132 (C. reinhardtii) to 4322 (P. patens). Interestingly, for closely related species within the same genus or family, such as O. sativa and S. bicolor, or A. lyrata and A. thaliana, the ratios of genes within the different classes are similar. Most genes are grouped into multigene families in each genome except C. reinhardtii, in which singletons make up more than half of the total genes (59.9%). However, only 2.4% (217/9088) of singletons in C. reinhardtii can be annotated to Gene Ontology (GO) terms, which are related to cellular process, metabolic process, response to stimulus, developmental process, cellular component organization, and multicellular organismal process.

Figure 1. Phylogeny and divergence time (million years ago) of the eight green plants.

Table 1 Numbers of gene families (and genes) in green plants

Species                      Singletons(a)     Two-genes(b)    ≥3(c)              Total gene families(d)   Max_gf(e)
Chlamydomonas reinhardtii    9088 (9088)       890 (1780)      697 (4293)         10 675 (15 161)          132
Physcomitrella patens        11 513 (11 513)   2081 (4162)     2115 (20 241)      15 709 (35 916)          4322
Selaginella moellendorffii   1717 (1717)       4348 (8696)     2627 (24 262)      8692 (34 675)            1541
Oryza sativa                 13 597 (13 597)   2201 (4402)     2589 (37 633)      18 387 (55 632)          2208
Sorghum bicolor              7579 (7579)       1850 (3700)     2419 (24 308)      11 848 (35 587)          1258
Populus trichocarpa          6525 (6525)       2806 (5612)     3465 (33 235)      12 796 (45 372)          335
Arabidopsis lyrata           6551 (6551)       1964 (3928)     2592 (21 990)      11 107 (32 469)          226
Arabidopsis thaliana         5980 (5980)       1689 (3378)     2054 (17 481)      9723 (26 839)            204
All eight genomes            38 330 (38 330)   6548 (13 096)   11 111 (230 225)   55 989 (281 651)         4322

(a) The number of singletons.
(b) The number of gene families (genes) with two genes.
(c) The number of gene families (genes) containing at least three genes.
(d) The number of total gene families (genes).
(e) The maximum gene family size.

Orphan genes and loss of entire gene families in Arabidopsis

There are many orphan genes in each genome, that is, genes that do not have homologues in other genomes. The proportion of orphan genes differs among genomes: it is 61.5% in C. reinhardtii, while other species have a much lower proportion, especially the two dicot species A. thaliana (5.3%) and A. lyrata (12.3%) (Figure 2b and Table 2). The maximum orphan gene family size ranges from 17 (A. thaliana) to 4322 (P.
patens); the total number of orphan gene families varies from 1328 (A. thaliana) to 10 812 (O. sativa); and the number of genes within orphan gene families varies from 1430 (A. thaliana) to 19 351 (P. patens). Among the eight genomes, 16.8% (S. moellendorffii) to 89.1% (A. thaliana) of orphan genes are singletons; 4.8% (A. thaliana) to 31.3% (S. moellendorffii) are in two-gene families; and 6.2% (A. thaliana) to 51.9% (S. moellendorffii) are in multigene families.

To understand the functional characteristics of the orphan genes, the well studied model species A. thaliana was selected as a representative for GO functional annotation. Of the 1430 A. thaliana orphan genes, 1042 (72.9%) can be annotated to GO terms, though most (93.0%) are classified as unknown biological processes; the other terms are other cellular processes, other metabolic processes, protein metabolism, transcription, and response to stress.

Another interesting class comprises genes with homologues in one Arabidopsis species and in at least one of the other six species but not in the other Arabidopsis species; these genes could have been deleted from the latter Arabidopsis species. There are 98 gene families (174 genes; minimum 1, maximum 8, average 1.7755) that contain only A. lyrata genes, and 117 gene families (163 genes; minimum 1, maximum 7, average 1.3932) that contain only A. thaliana genes. Based on the genes that are unique to each Arabidopsis species but shared with other species, the gene extinction rate for the entirely lost gene families can be calculated. Assuming the divergence time of A. thaliana and A. lyrata to be 13 million years ago (Beilstein et al., 2010; Guo et al., 2011b) (Figure 1) and normalizing to the total gene number, the extinction rate in A. thaliana is 0.00050 (174/(13 × 27 025)) extinctions/million years/gene; in A. lyrata, it is 0.00038 (163/(13 × 32 670)) extinctions/million years/gene.

Figure 2. Proportions of the genes that are singletons, in two-gene families, or in multigene families. (a) Gene family sizes at whole-genome level. (b) Gene family sizes of orphan genes.

Table 2 Number of orphan gene families (and genes) in green plants

Species                      Singletons(a)   Two-genes(b)   ≥3(c)        Total gene families(d)   Max_gf(e)
Chlamydomonas reinhardtii    6619 (6619)     443 (886)      317 (1817)   7379 (9322)              68
Physcomitrella patens        9001 (9001)     700 (1400)     368 (8950)   10 069 (19 351)          4322
Selaginella moellendorffii   1356 (1356)     1263 (2526)    600 (4184)   3219 (8066)              129
Oryza sativa                 9534 (9534)     775 (1550)     503 (4961)   10 812 (16 045)          562
Sorghum bicolor              3600 (3600)     394 (788)      265 (3663)   4259 (8051)              1258
Populus trichocarpa          4667 (4667)     688 (1376)     343 (2055)   5698 (8098)              82
Arabidopsis lyrata           2279 (2279)     272 (544)      180 (1165)   2731 (3988)              99
Arabidopsis thaliana         1274 (1274)     34 (68)        20 (88)      1328 (1430)              17

(a) The number of singletons.
(b) The number of gene families (genes) with two genes.
(c) The number of gene families (genes) containing at least three genes.
(d) The number of total gene families (genes).
(e) The maximum gene family size.

Gene family expansion or contraction

The likelihood approach assumes that there is at least one ancestral gene in the most recent common ancestor (MRCA); thus, only those gene families that are shared between C. reinhardtii and at least one of the other seven species are included in the analysis. Of the 3296 families present in the MRCA of green plants, 551 gene families have been lost completely in at least one species. The 2745 gene families that are present across all eight species most probably represent the ‘core’ proteome of green plants. Assuming that each species has the same λ value, the average rate of gene turnover, that is, the rate at which gene families expand or contract over time, is estimated to be 0.001379 (gains and losses/gene/million years). The estimated rate of gene gains and losses implies that within a single genome like A. thaliana, there are approximately 37.267 new duplicates and 37.267 new losses fixed every million years (0.001379 gains and losses/gene/million years × 27 025 genes).

The average expansions and contractions differ among lineages (Figure 3). Along the phylogenetic ranks, most lineages expanded, including embryophyte, tracheophyte, angiosperm, and rosid. In contrast, the lineages of C. reinhardtii, grass, and Arabidopsis contracted, and the Arabidopsis genus is the most contracted lineage (average contraction size is 0.547).

The likelihood approach can identify individual gene families evolving at a rate of gain or loss significantly higher than the genome-wide average (Hahn et al., 2005). Of the 3296 gene families, 41 evolved non-randomly at P < 0.0001. Of the 41 rapidly evolving gene families, only 37 contain A. thaliana genes, and 34 of these can be annotated to GO terms. Rapidly evolving gene families are associated with various biological processes, including other metabolic processes, other cellular processes, response to stress, protein metabolism, other biological processes, and response to abiotic or biotic stimulus (Tables S1 and S2).

Nine of the 41 rapidly evolving gene families have significantly changed (P < 0.0001) on the terminal branch to A. thaliana. All nine gene families significantly contracted (P < 0.0001), and most of them are affiliated with nucleotide binding, protein binding, DNA or RNA binding, hydrolase activity, and structural molecule activity (Table S2). Of the nine gene families that changed significantly on the A. thaliana terminal branch, eight contain A. thaliana genes (43 genes in total), of which 34 have orthologues in A. lyrata. The average of dN/dS for these genes is 0.1505 (median 0.0466), lower than that of all proteins of A.
thaliana within the gene families (average 0.2799, median 0.1786) (Wilcoxon rank sum test, P = 4.525 × 10^-6); the average of dS is 0.1480 (minimum 0.0427, maximum 0.3557, median 0.1398), also a little lower than that of all proteins of A. thaliana within the gene families (average 0.2167, median 0.1448) (P = 0.3325). The results indicate that there is no correlation between gene copy number variation and sequence divergence.

Figure 3. Expansions and contractions of gene families in green plants. The numbers near the branches show the average expansion (mean number of genes gained or lost per family, where a negative expansion is a net contraction), and the values inside the brackets are the numbers of families with expansions and contractions, respectively.

Origination of Arabidopsis thaliana genes

The 26 839 A. thaliana genes within gene families are assigned to different phylogenetic ranks (phylostrata) (Figure 4a and Table 3). More than 70% of the genes are assigned to the Viridiplantae (40.6%) and Embryophyte (30.1%) internodes. In addition, the fixation rates of genes and of gene families for A. thaliana, that is, the rates at which genes (or gene families) appear on the different phylogenetic ranks, are calculated (Figure 4b and Table S3). The A. thaliana internode has the highest gene fixation rate (110.000 genes/MY), while the angiosperm internode has the lowest gene fixation rate (6.267 genes/MY).

Across the phylostrata, the proportion of A. thaliana genes annotated to GO terms is 86.7% (23 282/26 839) (Figure 4c and Table S4). Among internodes, the proportion ranges from 92.1% (rosid internode) to 57.2% (A. thaliana internode). For the unknown function process, along the phylostrata, Viridiplantae has the lowest proportion, but A. thaliana has the highest proportion.
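The phylostratum shares and the A. thaliana fixation rate quoted in this section can be reproduced from the gene counts in Table 3. The following is a minimal sketch, not the pipeline used in the study: the function names are hypothetical, and the only internode length used is the 13 million years assumed in the text for the A. thaliana and A. lyrata split (the other internode lengths are given in Table S9 and are not reproduced here).

```python
# Gene counts per phylostratum, copied from Table 3 of this paper.
PHYLOSTRATA = {
    "Viridiplantae": 10897,
    "Embryophyte": 8074,
    "Tracheophyte": 741,
    "Angiosperm": 1617,
    "Rosid": 1076,
    "Arabidopsis": 3004,
    "A. thaliana": 1430,
}


def stratum_shares(counts):
    """Return each stratum's share of all assigned genes, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def fixation_rate(genes, internode_length_my):
    """Genes fixed per million years along an internode of the given length."""
    return genes / internode_length_my


shares = stratum_shares(PHYLOSTRATA)
oldest_two = shares["Viridiplantae"] + shares["Embryophyte"]

# 13 My is the A. thaliana / A. lyrata divergence time assumed in the text.
at_rate = fixation_rate(PHYLOSTRATA["A. thaliana"], 13.0)

print(f"Two oldest strata: {oldest_two:.1f}% of genes")
print(f"A. thaliana internode: {at_rate:.3f} genes/MY")
```

Under these assumptions the two oldest strata together account for just over 70% of the assigned genes, in line with the statement above.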
For most other biological processes, Viridiplantae has the highest proportion and A. thaliana has nearly the lowest proportion (Figure S1 and Table S5). Within the A. thaliana internode, the most abundant terms are unknown biological processes, followed by other cellular processes, other metabolic processes, and transcription.

Along the phylostratigraphy from Viridiplantae to A. thaliana, the median protein size of the A. thaliana genes in different phylostrata decreased from 413 (Viridiplantae) to 77 (A. thaliana) (Figure 4d and Table S6). By comparison, the median protein size for all A. thaliana proteins within the gene families is 351 (minimum 29, maximum 5337, average 407.8). In addition, along the phylostratigraphy, the proportion of expressed genes decreased from 90.1% (Viridiplantae internode) to 37.1% (A. thaliana internode) (Figure 5a and Table S7), and the median expression level decreased from 6.985 (Viridiplantae) to 2.841 (A. thaliana) (Figure 5b and Table S7). In total, 79.2% of A. thaliana genes are expressed (21 248/26 839), and the median expression level is 5.988 (minimum 1.344, maximum 14.050, average 5.866).

To understand how divergence varies along the phylostratigraphy from Viridiplantae to A. thaliana, the divergence between A. thaliana genes and their orthologues in A. lyrata was calculated. The medians of dN/dS increased from 0.1389 (Viridiplantae) to 0.4556 (Arabidopsis), and the medians of dS increased from 0.1407 (Viridiplantae) to 0.1553 (Arabidopsis) (Figure 6 and Table S8). The median of dN/dS for all proteins of A. thaliana within the gene families is 0.1786, and the median of dS is 0.1448.

Table 3 Number of A. thaliana gene families (and genes) assigned to each phylostratum

Phylostratum internode   Genes (%)(a)      Gene families (min, max, avg)(b)
1 Viridiplantae          10 897 (40.601)   3007 (1, 193, 3.6239)
2 Embryophyte            8074 (30.083)     2216 (1, 204, 3.6435)
3 Tracheophyte           741 (2.761)       290 (1, 28, 2.5552)
4 Angiosperm             1617 (6.025)      732 (1, 25, 2.2090)
5 Rosid                  1076 (4.009)      484 (1, 106, 2.2231)
6 Arabidopsis            3004 (11.193)     1666 (1, 134, 1.8031)
7 A. thaliana            1430 (5.328)      1328 (1, 17, 1.0768)

(a) The total number of genes assigned to each phylostratum in the A. thaliana genome, with their percentage of the whole genome given in parentheses.
(b) The total number of gene families assigned to each phylostratum. Within parentheses, min indicates the minimum gene family size; max indicates the maximum gene family size; avg indicates the average gene family size.

Figure 4. Phylostratigraphic analysis of A. thaliana genes. (a) Gene appearance. (b) Gene fixation rate. (c) Gene ontology annotation. A dashed line indicates the fraction of A. thaliana genes within gene families that can be annotated to GO biological process terms. (d) Protein sizes. A dashed line indicates the median protein size of A. thaliana genes within gene families. Significances of the deviations from the median protein size of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel.

DISCUSSION

Family size distribution

The Markov Cluster (MCL) algorithm, a method for clustering proteins into groups based on similarity that has been rigorously tested and validated on a number of databases (Van Dongen, 2000; Enright et al., 2002), has been used for gene family analysis in many taxa, including human (Lander et al., 2001), vertebrates and invertebrates (Prachumwat and Li, 2008), and Drosophila (Drosophila 12 Genomes Consortium, 2007). In this study, using the MCL algorithm, most genes in each genome are grouped into multigene families except in C. reinhardtii, in which singletons make up more than half of the total genes (59.9%). The predominant number of singletons in C. reinhardtii suggests that the C.
reinhardtii genome has a lower gene duplicability than other genomes, possibly because of functional constraints, because some genes of C. reinhardtii are too young to have accumulated duplicates, as in the case of vertebrate-specific genes (Prachumwat and Li, 2008), or because of annotation artifacts in C. reinhardtii. However, only 2.4% of singletons in C. reinhardtii can be annotated to a GO term; thus, functional constraint is unlikely to be the explanation here. For the S. moellendorffii genome, all gene models, rather than the data set with only one of the two haplotypes, are used in order to represent the genome sufficiently; as a result, the number of genes in two-gene families is inflated and the number of singletons is correspondingly reduced for this genome.

Figure 5. Expression variation of A. thaliana genes among phylostrata. (a) Expressed genes. A dashed line indicates the fraction of expressed A. thaliana genes within gene families. (b) Expression level. A dashed line indicates the median expression level of A. thaliana genes within gene families. Significances of the deviations from the median expression level of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel.

Figure 6. Divergence variation of A. thaliana genes among phylostrata, estimated between A. thaliana and A. lyrata. (a) dN/dS. (b) dS. A dashed line indicates the median dN/dS or dS of A. thaliana genes within gene families, respectively. Significances of the deviations from the median dN/dS and dS of A. thaliana genes within gene families using Wilcoxon rank sum test are shown in the P-value panel, respectively.

In total, 2745 gene families are shared among all eight species, and these most probably represent the ‘core’ proteome of green plants.
This number is consistent with another recently published study (Van Bel et al., 2012), in which 2928 core gene families were found via a parsimony method from 25 genomes, although the two studies used different strategies to define the ‘core’ proteome. When the representative A. thaliana genes of each core gene family reported in the previous study (Van Bel et al., 2012) are matched to the gene families of the present study, 2207 (75%) of the 2928 gene families are also found here. Interestingly, two recent studies have indicated that there is also a relatively large and well conserved core set of protein domains across green plants (Kersting et al., 2012; Zhang et al., 2012). In addition, phylogenetic comparative analysis indicated that the expansion of transcription-associated proteins is positively correlated with the increase in morphological complexity (Lang et al., 2010).

Orphan genes are protein-coding genes that do not have recognizable homologues in other species. Various factors can give rise to orphan genes (Domazet-Loso and Tautz, 2003; Demuth et al., 2006; Hahn et al., 2007). Theoretically, these factors can be classified into six classes: (i) true loss of all functional genes; (ii) the de novo evolution of new genes; (iii) rapid protein evolution in previously existing genes so that they are no longer identified as being part of a pre-existing family; (iv) horizontal gene transfer; (v) annotation artifacts; and (vi) artifacts of the clustering process. Such genes might frequently be involved in specific ecological adaptations or be the raw fuel for micro-evolutionary divergence (Carroll et al., 2001; Domazet-Loso and Tautz, 2003; Yang et al., 2009). Not surprisingly, in C. reinhardtii and P. patens, more than half of the genes are orphan genes, given that some genes in C. reinhardtii and P. patens are probably not required for vascular plants.
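The orphan-gene proportions discussed here follow directly from Tables 1 and 2. As a quick check, the sketch below divides the number of genes in orphan families (Table 2) by the number of genes in all families of the same genome (Table 1); the variable and function names are hypothetical and introduced only for illustration.

```python
# (genes in orphan families, genes in all families) per genome,
# copied from Table 2 and Table 1 of this paper, respectively.
GENOMES = {
    "C. reinhardtii": (9322, 15161),
    "P. patens": (19351, 35916),
    "S. moellendorffii": (8066, 34675),
    "O. sativa": (16045, 55632),
    "S. bicolor": (8051, 35587),
    "P. trichocarpa": (8098, 45372),
    "A. lyrata": (3988, 32469),
    "A. thaliana": (1430, 26839),
}


def orphan_percent(orphan_genes, total_genes):
    """Percentage of a genome's genes that fall into orphan families."""
    return 100.0 * orphan_genes / total_genes


orphan_share = {
    species: orphan_percent(orphans, total)
    for species, (orphans, total) in GENOMES.items()
}

# Genomes in which orphan genes are the majority.
majority_orphan = sorted(sp for sp, pct in orphan_share.items() if pct > 50.0)

for species, pct in orphan_share.items():
    print(f"{species:20s} {pct:5.1f}%")
```

With these table values, only C. reinhardtii and P. patens exceed 50%, which matches the statement above.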
Recently, in Drosophila, it has been demonstrated that orphan genes can quickly become essential and play an essential role in development (Chen et al., 2010).

Similar gene gain/loss rate between green plants and animals

The likelihood method implemented in CAFE version 2.0 (De Bie et al., 2006) is a robust pipeline to study gene family evolution, and has been used widely in genome analyses, including those of mammalian genomes (Demuth et al., 2006), Drosophila genomes (Hahn et al., 2007), and the rhesus macaque genome (Rhesus Macaque Genome Sequencing and Analysis Consortium, 2007). Of the 3296 families used for the likelihood analysis, along the phylogenetic ranks, gene families expand to different extents in each lineage except for C. reinhardtii, grass, and Arabidopsis. Consistent with the contraction of A. thaliana at the whole-genome level, all nine rapidly evolving gene families contract on the terminal branch to A. thaliana. In Drosophila, it has been demonstrated that some gene families (defense response, proteolysis, and trypsin activity) are evolving rapidly both in copy number and at the sequence level (Drosophila 12 Genomes Consortium, 2007; Hahn et al., 2007). However, in Arabidopsis, there is no correlation between copy number variation and sequence divergence for the genes of rapidly evolving gene families, which suggests that the relationship between copy number variation and sequence divergence is much more complex.

In the present study, the rate of gene gains and losses is about 0.001379 (gains and losses/gene/million years). This rate is a little lower than some previous estimates. For instance, the average duplication rate of a eukaryotic gene was estimated to be on the order of 0.01/gene/million years, and 0.002 for A. thaliana (Lynch and Conery, 2000, 2003). However, this rate is similar to that in Drosophila (0.0012; Hahn et al., 2007) and in mammals (0.0016; Demuth et al., 2006).
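As a rough check on these figures, the per-gene rate can be scaled to a whole genome and compared with the other estimates quoted above. This is a minimal arithmetic sketch, not part of the CAFE analysis; the names are hypothetical, and the gene number is the 27 025 A. thaliana proteins used for normalization in the Results.

```python
# Rate estimated in this study (gains and losses / gene / million years).
TURNOVER_RATE = 0.001379

# Number of A. thaliana proteins (TAIR8) used for normalization in the text.
AT_GENES = 27025

# Other estimates quoted in the text, in the same per-gene, per-My units.
OTHER_ESTIMATES = {
    "Drosophila (Hahn et al., 2007)": 0.0012,
    "mammals (Demuth et al., 2006)": 0.0016,
    "A. thaliana duplication (Lynch and Conery)": 0.002,
}


def events_per_my(rate_per_gene, n_genes):
    """Expected number of fixed gains (or losses) per million years."""
    return rate_per_gene * n_genes


at_events = events_per_my(TURNOVER_RATE, AT_GENES)

# Ratio of this study's rate to each of the other estimates.
ratios = {name: TURNOVER_RATE / rate for name, rate in OTHER_ESTIMATES.items()}

print(f"A. thaliana: about {at_events:.3f} gains and losses per My")
for name, ratio in ratios.items():
    print(f"{name}: ratio {ratio:.2f}")
```

The ratios to the Drosophila and mammalian estimates stay within a factor of about 1.2, which is the sense in which the rates are described as similar.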
As this result is of the same order as rates estimated for other eukaryotes, and as eight representative genomes of green plants are used, the estimated rate should be reliable for A. thaliana relatives and for green plants as well.

One important factor that affects the estimation of the rate of gene gains or losses in certain lineages is polyploidy (whole-genome duplication). In particular, most green plants, including several angiosperm species studied here, are paleopolyploids, and whole-genome duplication is normally followed by gene loss and diploidization (Simillion et al., 2002; Blanc and Wolfe, 2004; Adams and Wendel, 2005; Doyle et al., 2008; Soltis et al., 2009; Koh et al., 2010; Jiao et al., 2011; Proost et al., 2011). Therefore, at the level of green plants, whole-genome duplication likely raises the rate of gene gains and losses. With the availability of more genomes from various plant lineages, it would be interesting to address how these paleopolyploidy events affect the rate of gene gains or losses in specific plant lineages that experienced whole-genome duplication.

Origin of Arabidopsis thaliana genes

The gene emergence approach has been used to study the evolutionary emergence of genes at the whole-genome level in individual species or lineages at different phylogenetic levels (Domazet-Loso et al., 2007; Domazet-Loso and Tautz, 2008, 2010a,b; Prachumwat and Li, 2008; Zhang et al., 2010a,b, 2011; Banks et al., 2011). The significant difference of gene emergence on the phylostratigraphic map supports the assumption that some of these genes have retained a signal of their evolutionary history. In this way, phylostratigraphy can provide a picture of evolutionary history using extant species. In this study, more than 70% of A. thaliana genes are traced back to 450 million years ago. Interestingly, the A.
thaliana internode also has the highest gene fixation rate, which might be due to a high gene origination rate, or to more complete annotation, since much more expression data can be used for annotation. Transposition in certain families is a possible cause of the high gene origination rate. A recent study indicates that specific gene families can transpose at specific points in evolutionary time, especially after whole-genome duplication events in the Brassicales (Woodhouse et al., 2011).

The fraction of expressed genes, the expression level, and the protein size decrease along the phylostratigraphy from Viridiplantae to A. thaliana. In yeast, there is an inverse relationship between protein size and protein expression level (Warringer and Blomberg, 2006). Conversely, in A. thaliana the oldest genes are much longer and also highly expressed. In addition, the comparison between A. thaliana and A. lyrata indicates that transposable elements and small RNAs contribute to gene expression divergence (Hollister et al., 2011). Most probably, different lineages or species have different regulatory mechanisms. For example, even compared with its sister species A. lyrata, A. thaliana has a much higher intron loss rate (Fawcett et al., 2012) and deletion rate, and also tends to delete longer fragments (Hu et al., 2011).

Interestingly, the Viridiplantae internode has the lowest dN/dS, while the Arabidopsis internode has the highest. This probably suggests that older genes have been under stronger purifying selection, since older genes are mostly fundamental genes that green plants need to survive. Accordingly, the older genes are more conserved, with a slow evolution rate (synonymous substitution rate). Evolution rate can be shaped by various factors (Alba and Castresana, 2005; Larracuente et al., 2008; Singh et al., 2009; Liao et al., 2010; Gaut et al., 2011; Yang and Gaut, 2011). Recent comparisons between the two available Arabidopsis genomes provide more robust results.
The rates of protein evolution are correlated mainly with expression level and specificity (Slotte et al., 2011); and rates at synonymous sites vary on a genomic scale and might be a function of recombination rate and of evolution rates at non-synonymous sites, strongly correlated with expression pattern and with whether or not a gene is retained after a whole-genome duplication (Yang and Gaut, 2011). In addition, 1939 annotated protein-coding genes with little evidence of expression in A. thaliana are shorter and evolved with a two-fold higher rate of non-synonymous substitutions (Yang et al., 2011). This finding is consistent with what is found here, namely that younger genes are shorter, have a higher non-synonymous substitution rate, and are lowly expressed. Furthermore, it would be interesting to study the dN/dS ratio in other species pairs. Similar comparisons have been made between A. thaliana and poplar and between two conifers (Picea sitchensis and Pinus taeda) (Buschiazzo et al., 2012); these indicate low rates of diversification in conifers but much higher dN/dS, possibly due to positive selection. It would also be interesting to carry out a lineage-specific study of the dN/dS ratio, which could be informative for understanding the mechanism of lineage-specific evolutionary events.

Although the genome sequences of many other plant species are now available, the eight genomes here are good representatives of green plants, with a clear phylogeny and estimated divergence times, and were assembled from longer, high-quality reads at high coverage, which is believed to be very important for addressing evolutionary questions at the whole-genome level (Demuth and Hahn, 2009; Schneider et al., 2009; Milinkovitch et al., 2010). Therefore, the eight representative green plant genomes are sufficient to address those interesting questions at the whole-genome level.

The comparison of the eight genomes revealed the evolutionary pattern of genes and gene families of green plants at an unprecedented resolution.
The rate of gene gains and losses is similar to that in animals, some gene families have evolved rapidly, and 2745 gene families form the ‘core’ proteome of green plants. Most genes of A. thaliana are ancient, and genes with early origination are under stronger purifying selection and are more conserved. Gene families of the Arabidopsis lineage have contracted considerably, especially in A. thaliana, which could be correlated with the shaping of differences in morphology, physiology, and metabolism among species.

EXPERIMENTAL PROCEDURES

Genome data sets

Protein sequences from eight genomes of green plants were analysed, including A. thaliana (TAIR8) (27 025 proteins) (The Arabidopsis Genome Initiative, 2000), A. lyrata (JGI-v1.0, FM6) (32 670 proteins) (Hu et al., 2011), Populus (Populus trichocarpa, JGI-v1.1) (45 555 proteins) (Tuskan et al., 2006), rice (Oryza sativa, v5.0) (56 278 proteins) (International Rice Genome Sequencing Project, 2005; Ouyang et al., 2007), Sorghum (Sorghum bicolor, JGI-v1.0) (35 899 proteins) (Paterson et al., 2009), Selaginella moellendorffii (JGI-v1.0) (34 697 proteins) (Banks et al., 2011), Physcomitrella patens ssp patens (JGI-v1.1) (35 938 proteins) (Rensing et al., 2007), and Chlamydomonas reinhardtii (JGI-v3.1) (15 256 proteins) (Merchant et al., 2007). The phylogeny and divergence times of the eight genomes have been well established (Kellogg, 2001; Wikström et al., 2001; Gaut, 2002; Wellman et al., 2003; Magallon and Sanderson, 2005; Zimmer et al., 2007; Beilstein et al., 2010; Lang et al., 2010; Guo et al., 2011b) (Figure 1; see Table S9 for the divergence time at each node). In addition, the sampled genomes are representative, flanking the most important events in the green plants leading to A. thaliana.
To remove possible transposable element sequences from the protein data sets, all proteins were searched against the repeat library database (RepeatMasker database version 20090604) using tblastn (E value < 1 × 10⁻⁵), requiring the hit length to be at least half of the query sequence. The putative transposable elements were removed from the data set before the following analyses.

Homology searches and gene family construction

The Markov Cluster Algorithm (mcl-06-058 package; http://micans.org/mcl/src/) was used to cluster the protein sequences of all eight genomes into gene families with default parameters (-I 2, -S 6). The clustering analysis involved two steps: (i) blastp was used to match each protein against all other proteins (E value < 1 × 10⁻⁵) to produce raw NCBI BLAST output; and (ii) MCL was used to build a Markov matrix from the parsed similarity data converted from the raw NCBI BLAST output, and then to generate the final gene families. The clustering results were imported into a MySQL database for the following analyses. Genes (gene families) that do not have recognizable homologues in other species are called orphan genes (gene families).

Gene family size analysis

Genes were classified into one of the following three groups by family size in the clustering, according to Prachumwat and Li (2008): (i) a singleton is a single-copy gene; (ii) a two-gene family has two copies; and (iii) a multigene family has ≥3 copies.

Likelihood analysis of gene gain and loss

CAFE version 2.0 (De Bie et al., 2006), a tool for the statistical analysis of the evolution of gene family size, was used to study the evolution of gene families.
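
As an illustration of the family size analysis described above, the following minimal Python sketch classifies a family by copy number and tabulates the number of members per species in each cluster. It is not the pipeline used in this study. It assumes that the clustering output holds one family per line with whitespace-separated members, and that each protein identifier carries a species prefix of the form `species|gene`. The prefix convention, the separator, and the function names `size_class` and `family_size_table` are hypothetical. Whether the copy number passed to `size_class` is counted within one species or across species is left to the caller.

```python
from collections import Counter


def size_class(n_copies):
    """Return the size class for a family with n_copies members."""
    if n_copies < 1:
        raise ValueError("a gene family must contain at least one gene")
    if n_copies == 1:
        return "singleton"
    if n_copies == 2:
        return "two-gene family"
    return "multigene family"  # three or more copies


def family_size_table(cluster_lines, sep="|"):
    """Count members per species for each cluster line.

    Assumption: one family per line, whitespace-separated members, and
    identifiers of the hypothetical form '<species><sep><gene>'.
    """
    table = []
    for line in cluster_lines:
        members = line.split()
        if not members:
            continue  # skip blank lines
        counts = Counter(member.split(sep, 1)[0] for member in members)
        table.append(dict(counts))
    return table


# Toy input with made-up identifiers: two families, three species prefixes.
clusters = ["ath|g1\taly|g2\tath|g3", "ptr|g4"]
sizes = family_size_table(clusters)
classes = [size_class(sum(family.values())) for family in sizes]
```

A table of this kind, with one row per family and one count per species, is the sort of family size data that a gain and loss analysis takes as input.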
With CAFE, one can estimate the gene gain and loss rate among these plant genomes; calculate the average expansion or contraction per gene family on each branch; and identify gene families that have larger-than-expected contractions or expansions. The following input files and parameters were used. First, the input tree file is a Newick description of a rooted, bifurcating phylogenetic tree, including branch lengths in units of time (Figure 1). Second, the data file consists of the sizes of each gene family across the eight species. Third, λ, the probability of both gene gains and losses per gene per unit time in the phylogeny (CAFE assumes that gene birth and death are equally probable), was calculated by CAFE given the gene families in the data file (the estimated λ is the globally most probable value across all families in the input file, or lineage-specific values of λ when the related t option is used). Here I calculated λ across all eight species. Fourth, the P-value threshold was set to 0.01 and the number of random samples was left at the default value (1000). Fifth, the default Viterbi method was employed to identify the branch with the most unlikely amount of change. For gene families evolving at rates of gains and losses significantly higher than the genome-wide average (P < 0.0001) (Hahn et al., 2005), the Viterbi method was used to identify the branch that is the most likely cause of deviation from the random model, significant at P < 0.005 (Hahn et al., 2007).

Phylostratigraphic analyses of Arabidopsis thaliana genes

Phylostratigraphic analysis was performed according to a previous study (Domazet-Loso et al., 2007). All genes of A. thaliana were assigned to seven phylogenetic ranks (phylostrata) of different ages according to their emergence in the reference genomes on the phylogeny, based on the similarity clustering via MCL. The seven gene groups were: (1) A. thaliana, genes that arose along the A.
thaliana lineage; (2) Arabidopsis, genes present in the MRCA of the Arabidopsis genus; (3) rosid, genes present in the MRCA of the Arabidopsis genus and P. trichocarpa; (4) angiosperm, genes present in the MRCA of rosids and grasses (S. bicolor and O. sativa); (5) Tracheophyte, genes present in the MRCA of angiosperms and S. moellendorffii; (6) Embryophyte, genes present in the MRCA of Tracheophyte and P. patens; and (7) Viridiplantae, genes present in the MRCA of Embryophyte and C. reinhardtii. Gene (or gene family) fixation rates were calculated according to the formula r = N/T, where r is the number of genes (or gene families) per million years, N is the number of genes (or gene families) in the phylostratum, and T is the duration of the interval (million years).

Divergence and polymorphism

The divergence of orthologues was estimated in several steps: (i) the orthologous gene pairs between A. thaliana and A. lyrata were extracted from a whole-genome alignment analysis (Hu et al., 2011); (ii) a codon-based alignment of orthologous coding sequences was performed with PAL2NAL (Suyama et al., 2006); (iii) dS, the number of synonymous substitutions per synonymous site, and the dN/dS ratio, the ratio of the number of non-synonymous substitutions per non-synonymous site to the number of synonymous substitutions per synonymous site, were calculated for orthologue pairs using the codeml program implemented in the phylogenetic analysis by maximum likelihood (PAML) package, version 3.15 (Yang, 1997).

GO annotation and gene expression

Gene ontology (GO) annotation was performed using the GO annotation tool at TAIR (October 2009) (http://www.arabidopsis.org/tools/bulk/go/index.jsp) for A. thaliana (Gene Ontology Consortium, 2004); for other species, Blast2GO was used for the GO annotation (Conesa et al., 2005). GO Biological Processes were used to estimate the abundance of GO terms.
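
Returning to the phylostratigraphic analysis and the fixation rate formula r = N/T described above, the following minimal Python sketch shows one way the assignment could be expressed. It rests on an assumption about how the MCL-based assignment works: that an A. thaliana gene is placed in the oldest phylostratum whose marker species shares its gene family, and that for the angiosperm stratum either grass species is sufficient. The function names and the numbers in the usage lines are illustrative only and are not values from this study; the actual interval durations are given in Table S9.

```python
# Phylostrata ordered from youngest to oldest. Each entry pairs a stratum
# name with the species whose presence in a family implies that stratum.
STRATA = [
    ("A. thaliana", {"A. thaliana"}),
    ("Arabidopsis", {"A. lyrata"}),
    ("rosid", {"P. trichocarpa"}),
    ("angiosperm", {"S. bicolor", "O. sativa"}),
    ("Tracheophyte", {"S. moellendorffii"}),
    ("Embryophyte", {"P. patens"}),
    ("Viridiplantae", {"C. reinhardtii"}),
]


def assign_phylostratum(species_in_family):
    """Return the oldest stratum supported by the species in the family."""
    present = set(species_in_family)
    if "A. thaliana" not in present:
        raise ValueError("family contains no A. thaliana gene")
    oldest = STRATA[0][0]
    for name, markers in STRATA:
        if present & markers:
            oldest = name  # later entries are older, so keep overwriting
    return oldest


def fixation_rate(n_genes, interval_myr):
    """Compute r = N / T, in genes (or gene families) per million years."""
    if interval_myr <= 0:
        raise ValueError("interval duration must be positive")
    return n_genes / interval_myr


# Illustrative inputs only (not data from the study).
stratum = assign_phylostratum({"A. thaliana", "A. lyrata", "P. patens"})
rate = fixation_rate(100, 50.0)
```

Under this rule, a family found in A. thaliana, A. lyrata, and P. patens is assigned to the Embryophyte stratum even though the intermediate species are absent, which treats those absences as later losses rather than as evidence of a younger origin.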
Gene expression data from the AtGenExpress Arabidopsis expression atlas (Schmid et al., 2005), generated on the widely available Affymetrix ATH1 array platform, were averaged across all tissues as an estimate of the expression level of A. thaliana genes.

ACKNOWLEDGEMENTS

I thank Matthew W. Hahn for suggestions on the CAFE analysis; Tomislav Domazet-Loso for suggestions on the phylostratigraphy analysis; Manyuan Long, Yong Zhang, Jie Guo, and anonymous reviewers for comments on this work; and Tina Hu for proofreading and improving the manuscript. I am especially grateful to Detlef Weigel and Song Ge for their long-term support and valuable comments on this work. This work was supported by the 100 Talents Program of the Chinese Academy of Sciences and the Max Planck Society.

SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article.
Data S1. The gene family clustering among all proteins of the eight genomes.
Figure S1. GO annotation of A. thaliana genes in each phylostratum.
Table S1. GO annotation of 41 rapidly evolved gene families.
Table S2. Functional annotation of 41 rapidly evolved gene families.
Table S3. Fixation rate of A. thaliana genes or gene families in each phylostratum (genes/duration of interval or gene families/duration of interval).
Table S4. The fraction of A. thaliana genes in each phylostratum that can be annotated to GO biological process.
Table S5. GO annotation of A. thaliana genes in each phylostratum in terms of proportion of major biological processes (%).
Table S6. Protein sizes of A. thaliana genes in each phylostratum.
Table S7. Expression of A. thaliana genes in each phylostratum.
Table S8. Divergences between A. thaliana genes of each phylostratum and their orthologous genes in A. lyrata.
Table S9. Divergence time of each node on the phylogeny of green plants.

REFERENCES

Adams, K.L. and Wendel, J.F. (2005) Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8, 135–141.
Albà, M.M. and Castresana, J. (2005) Inverse relationship between evolutionary rate and age of mammalian genes. Mol. Biol. Evol. 22, 598–606.
Banks, J.A., Nishiyama, T., Hasebe, M. et al. (2011) The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science, 332, 960–963.
Beilstein, M.A., Nagalingum, N.S., Clements, M.D., Manchester, S.R. and Mathews, S. (2010) Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc. Natl Acad. Sci. USA, 107, 18724–18728.
Blanc, G. and Wolfe, K.H. (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell, 16, 1667–1678.
Buschiazzo, E., Ritland, C., Bohlmann, J. and Ritland, K. (2012) Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms. BMC Evol. Biol. 12, 8.
Cai, J.J., Borenstein, E., Chen, R. and Petrov, D.A. (2009) Similarly strong purifying selection acts on human disease genes of all evolutionary ages. Genome Biol. Evol. 1, 131–144.
Carroll, S., Wetherbee, S. and Grenier, J. (2001) From DNA to Diversity. Oxford, UK: Blackwell Science.
Chardon, F. and Damerval, C. (2005) Phylogenomic analysis of the PEBP gene family in cereals. J. Mol. Evol. 61, 579–590.
Chen, S., Zhang, Y.E. and Long, M. (2010) New genes in Drosophila quickly become essential. Science, 330, 1682–1685.
Conesa, A., Gotz, S., Garcia-Gomez, J.M., Terol, J., Talon, M. and Robles, M. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676.
David, L.A. and Alm, E.J. (2011) Rapid evolutionary innovation during an Archaean genetic expansion. Nature, 469, 93–96.
De Bie, T., Cristianini, N., Demuth, J.P. and Hahn, M.W. (2006) CAFE: a computational tool for the study of gene family evolution. Bioinformatics, 22, 1269–1271.
De Bodt, S., Maere, S. and Van de Peer, Y. (2005) Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20, 591–597.
De Smet, R. and Van De Peer, Y. (2012) Redundancy and rewiring of genetic networks following genome-wide duplication events. Curr. Opin. Plant Biol. 15, 168–176.
Demuth, J.P. and Hahn, M.W. (2009) The life and death of gene families. BioEssays, 31, 29–39.
Demuth, J.P., De Bie, T., Stajich, J.E., Cristianini, N. and Hahn, M.W. (2006) The evolution of mammalian gene families. PLoS ONE, 1, e85.
Domazet-Loso, T. and Tautz, D. (2003) An evolutionary analysis of orphan genes in Drosophila. Genome Res. 13, 2213–2219.
Domazet-Loso, T. and Tautz, D. (2008) An ancient evolutionary origin of genes associated with human genetic diseases. Mol. Biol. Evol. 25, 2699–2707.
Domazet-Loso, T. and Tautz, D. (2010a) A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature, 468, 815–818.
Domazet-Loso, T. and Tautz, D. (2010b) Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol. 8, 66.
Domazet-Loso, T., Brajkovic, J. and Tautz, D. (2007) A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23, 533–539.
Doyle, J.J., Flagel, L.E., Paterson, A.H., Rapp, R.A., Soltis, D.E., Soltis, P.S. and Wendel, J.F. (2008) Evolutionary genetics of genome merger and doubling in plants. Annu. Rev. Genet. 42, 443–461.
Drosophila 12 Genomes Consortium. (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature, 450, 203–218.
Enright, A.J., Van Dongen, S. and Ouzounis, C.A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584.
Fawcett, J.A., Rouze, P. and Van de Peer, Y. (2012) Higher intron loss rate in Arabidopsis thaliana than A. lyrata is consistent with stronger selection for a smaller genome. Mol. Biol. Evol. 29, 849–859.
Flagel, L.E. and Wendel, J.F. (2009) Gene duplication and evolutionary novelty in plants. New Phytol. 183, 557–564.
Freeling, M. (2009) Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol. 60, 433–453.
Gaut, B. (2002) Evolutionary dynamics of grass genomes. New Phytol. 154, 15–28.
Gaut, B.S., Yang, L., Takuno, S. and Eguiarte, L.E. (2011) The patterns and causes of variation in plant nucleotide substitution rates. Annu. Rev. Ecol. Evol. Syst. 42, 245–266.
Gene Ontology Consortium. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261.
Gu, X., Wang, Y. and Gu, J. (2002) Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat. Genet. 31, 205–209.
Guo, Y.L., Fitz, J., Schneeberger, K., Ossowski, S., Cao, J. and Weigel, D. (2011a) Genome-wide comparison of nucleotide-binding site-leucine-rich repeat-encoding genes in Arabidopsis. Plant Physiol. 157, 757–769.
Guo, Y.L., Zhao, X., Lanz, C. and Weigel, D. (2011b) Evolution of the S-locus region in Arabidopsis thaliana relatives. Plant Physiol. 157, 937–946.
Hahn, M.W., De Bie, T., Stajich, J.E., Nguyen, C. and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 15, 1153–1160.
Hahn, M.W., Han, M.V. and Han, S.G. (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genet. 3, e197.
Hollister, J.D., Smith, L.M., Guo, Y.L., Ott, F., Weigel, D. and Gaut, B.S. (2011) Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proc. Natl Acad. Sci. USA, 108, 2322–2327.
Hu, T.T., Pattyn, P., Bakker, E.G. et al. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481.
International Rice Genome Sequencing Project. (2005) The map-based sequence of the rice genome. Nature, 436, 793–800.
Jiao, Y., Wickett, N.J., Ayyampalayam, S. et al. (2011) Ancestral polyploidy in seed plants and angiosperms. Nature, 473, 97–100.
Kellogg, E.A. (2001) Evolutionary history of the grasses. Plant Physiol. 125, 1198–1205.
Kersting, A.R., Bornberg-Bauer, E., Moore, A.D. and Grath, S. (2012) Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biol. Evol. 4, 316–329.
Kim, S., Soltis, P.S., Wall, K. and Soltis, D.E. (2006) Phylogeny and domain evolution in the APETALA2-like gene family. Mol. Biol. Evol. 23, 107–120.
Koh, J., Soltis, P.S. and Soltis, D.E. (2010) Homeolog loss and expression changes in natural populations of the recently and repeatedly formed allotetraploid Tragopogon mirus (Asteraceae). BMC Genomics, 11, 97.
Lander, E.S., Linton, L.M., Birren, B. et al. International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Lang, D., Weiche, B., Timmerhaus, G., Richardt, S., Riano-Pachon, D.M., Correa, L.G., Reski, R., Mueller-Roeber, B. and Rensing, S.A. (2010) Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion, and correlation with complexity. Genome Biol. Evol. 2, 488–503.
Larracuente, A.M., Sackton, T.B., Greenberg, A.J., Wong, A., Singh, N.D., Sturgill, D., Zhang, Y., Oliver, B. and Clark, A.G. (2008) Evolution of protein-coding genes in Drosophila. Trends Genet. 24, 114–123.
Leseberg, C.H., Li, A., Kang, H., Duvall, M. and Mao, L. (2006) Genome-wide analysis of the MADS-box gene family in Populus trichocarpa. Gene, 378, 84–94.
Li, X., Duan, X., Jiang, H. et al. (2006) Genome-wide analysis of basic/helix-loop-helix transcription factor family in rice and Arabidopsis. Plant Physiol. 141, 1167–1184.
Liao, B.Y., Weng, M.P. and Zhang, J. (2010) Impact of extracellularity on the evolutionary rate of mammalian proteins. Genome Biol. Evol. 2, 39–43.
Lynch, M. and Conery, J.S. (2000) The evolutionary fate and consequences of duplicate genes. Science, 290, 1151–1155.
Lynch, M. and Conery, J.S. (2003) The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics, 3, 35–44.
Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M. and Van de Peer, Y. (2005) Modeling gene and genome duplications in eukaryotes. Proc. Natl Acad. Sci. USA, 102, 5454–5459.
Magallon, S.A. and Sanderson, M.J. (2005) Angiosperm divergence times: the effect of genes, codon positions, and time constraints. Evolution, 59, 1653–1670.
McLysaght, A., Hokamp, K. and Wolfe, K.H. (2002) Extensive genomic duplication during early chordate evolution. Nat. Genet. 31, 200–204.
Merchant, S.S., Prochnik, S.E., Vallon, O. et al. (2007) The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science, 318, 245–250.
Milinkovitch, M.C., Helaers, R., Depiereux, E., Tzika, A.C. and Gabaldon, T. (2010) 2x genomes—depth does matter. Genome Biol. 11, R16.
Nei, M. and Rooney, A.P. (2005) Concerted and birth-and-death evolution of multigene families. Annu. Rev. Genet. 39, 121–152.
Ouyang, S., Zhu, W., Hamilton, J. et al. (2007) The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, D883–D887.
Paterson, A.H., Bowers, J.E., Bruggmann, R. et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature, 457, 551–556.
Porter, B.W., Paidi, M., Ming, R., Alam, M., Nishijima, W.T. and Zhu, Y.J. (2009) Genome-wide analysis of Carica papaya reveals a small NBS resistance gene family. Mol. Genet. Genomics, 281, 609–626.
Prachumwat, A. and Li, W.H. (2008) Gene number expansion and contraction in vertebrate genomes with respect to invertebrate genomes. Genome Res. 18, 221–232.
Proost, S., Pattyn, P., Gerats, T. and Van De Peer, Y. (2011) Journey through the past: 150 million years of plant genome evolution. Plant J. 66, 58–65.
Rensing, S.A., Lang, D., Zimmer, A.D. et al. (2007) The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science, 319, 64–69.
Rhesus Macaque Genome Sequencing and Analysis Consortium. (2007) Evolutionary and biomedical insights from the rhesus macaque genome. Science, 316, 222–234.
Schmid, M., Davison, T.S., Henz, S.R., Pape, U.J., Demar, M., Vingron, M., Scholkopf, B., Weigel, D. and Lohmann, J.U. (2005) A gene expression map of Arabidopsis thaliana development. Nat. Genet. 37, 501–506.
Schneider, A., Souvorov, A., Sabath, N., Landan, G., Gonnet, G.H. and Graur, D. (2009) Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol. Evol. 1, 114–118.
Shiu, S.H., Karlowski, W.M., Pan, R., Tzeng, Y.H., Mayer, K.F. and Li, W.H. (2004) Comparative analysis of the receptor-like kinase family in Arabidopsis and rice. Plant Cell, 16, 1220–1234.
Simillion, C., Vandepoele, K., Van Montagu, M.C., Zabeau, M. and Van de Peer, Y. (2002) The hidden duplication past of Arabidopsis thaliana. Proc. Natl Acad. Sci. USA, 99, 13627–13632.
Singh, N.D., Larracuente, A.M., Sackton, T.B. and Clark, A.G. (2009) Comparative genomics on the Drosophila phylogenetic tree. Annu. Rev. Ecol. Evol. Syst. 40, 459–480.
Slotte, T., Bataillon, T., Hansen, T.T., St Onge, K., Wright, S.I. and Schierup, M.H. (2011) Genomic determinants of protein evolution and polymorphism in Arabidopsis. Genome Biol. Evol. 3, 1210–1219.
Soltis, D.E., Albert, V.A., Leebens-Mack, J., Bell, C.D., Paterson, A.H., Zheng, C., Sankoff, D., Depamphilis, C.W., Wall, P.K. and Soltis, P.S. (2009) Polyploidy and angiosperm diversification. Am. J. Bot. 96, 336–348.
Suyama, M., Torrents, D. and Bork, P. (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609–W612.
Tautz, D. and Domazet-Loso, T. (2011) The evolutionary origin of orphan genes. Nat. Rev. Genet. 12, 692–702.
The Arabidopsis Genome Initiative. (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Tuskan, G.A., Difazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313, 1596–1604.
Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer, Y. and Vandepoele, K. (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600.
Van De Peer, Y. (2011) A mystery unveiled. Genome Biol. 12, 113.
Van De Peer, Y., Maere, S. and Meyer, A. (2009) The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732.
Van Dongen, S. (2000) Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands.
Warringer, J. and Blomberg, A. (2006) Evolutionary constraints on yeast protein size. BMC Evol. Biol. 6, 61.
Wellman, C.H., Osterloff, P.L. and Mohiuddin, U. (2003) Fragments of the earliest land plants. Nature, 425, 282–285.
Wikström, N., Savolainen, V. and Chase, M.W. (2001) Evolution of the angiosperms: calibrating the family tree. Proc. Biol. Sci. 268, 2211–2220.
Woodhouse, M.R., Tang, H. and Freeling, M. (2011) Different gene families in Arabidopsis thaliana transposed in different epochs and at different frequencies throughout the rosids. Plant Cell, 23, 4241–4253.
Yang, Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556.
Yang, L. and Gaut, B.S. (2011) Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Mol. Biol. Evol. 28, 2359–2369.
Yang, X., Jawdy, S., Tschaplinski, T.J. and Tuskan, G.A. (2009) Genome-wide identification of lineage-specific genes in Arabidopsis, Oryza and Populus. Genomics, 93, 473–480.
Yang, L., Takuno, S., Waters, E.R. and Gaut, B.S. (2011) Lowly expressed genes in Arabidopsis thaliana bear the signature of possible pseudogenization by promoter degradation. Mol. Biol. Evol. 28, 1193–1203.
Zhang, Y.E., Vibranovski, M.D., Krinsky, B.H. and Long, M. (2010a) Age-dependent chromosomal distribution of male-biased genes in Drosophila. Genome Res. 20, 1526–1533.
Zhang, Y.E., Vibranovski, M.D., Landback, P., Marais, G.A. and Long, M. (2010b) Chromosomal redistribution of male-biased genes in mammalian evolution with two bursts of gene gain on the X chromosome. PLoS Biol. 8, e1000494.
Zhang, Y.E., Landback, P., Vibranovski, M.D. and Long, M. (2011) Accelerated recruitment of new brain development genes into the human genome. PLoS Biol. 9, e1001179.
Zhang, X.C., Wang, Z., Zhang, X., Le, M.H., Sun, J., Xu, D., Cheng, J. and Stacey, G. (2012) Evolutionary dynamics of protein domain architecture in plants. BMC Evol. Biol. 12, 6.
Zimmer, A., Lang, D., Richardt, S., Frank, W., Reski, R. and Rensing, S.A. (2007) Dating the early evolution of plants: detection and molecular clock analyses of orthologs. Mol. Genet. Genomics, 278, 393–402.