Conservation and Functional Element Discovery in 20 Angiosperm Plant Genomes

Daniel Hupalo*,1 and Andrew D. Kern2
1 Department of Biological Sciences, Dartmouth College, Hanover, New Hampshire
2 Department of Genetics, Rutgers University
*Corresponding author: E-mail: [email protected].
Associate editor: Hideki Innan

Abstract

Here, we describe the construction of a phylogenetically deep, whole-genome alignment of 20 flowering plants, along with an analysis of plant genome conservation. Each included angiosperm genome was aligned to a reference genome, Arabidopsis thaliana, using the LASTZ/MULTIZ paradigm and tools from the University of California–Santa Cruz Genome Browser source code. In addition to the multiple alignment, we created a local genome browser displaying multiple tracks of newly generated genome annotation, as well as annotation sourced from published data of other research groups. An investigation into A. thaliana gene features present in the aligned A. lyrata genome revealed better conservation of start codons, stop codons, and splice sites within our alignments (51% of features from A. thaliana conserved without interruption in A. lyrata) when compared with previous publicly available plant pairwise alignments (34% of features conserved). The detailed view of conservation across angiosperms revealed not only high coding-sequence conservation but also a large amount of previously uncharacterized intergenic conservation. From this, we annotated the collection of conserved features, revealing dozens of putative noncoding RNAs, including some with recorded small RNA expression. Comparing conservation between kingdoms revealed a faster decay of vertebrate genome features when compared with angiosperm genomes. Finally, conserved sequences were searched for folding RNA features, including but not limited to noncoding RNA (ncRNA) genes. 
Among these, we highlight a double hairpin in the 5′-untranslated region (5′-UTR) of the PRIN2 gene and a putative ncRNA with homology targeting the LAF3 protein.

Key words: Arabidopsis, alignment, conservation, comparative genomics, ultraconserved elements, angiosperm, RNA folding.

© The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Mol. Biol. Evol. 30(7):1729–1744 doi:10.1093/molbev/mst082 Advance Access publication May 2, 2013

Introduction

Within the past decade, a flood of whole-genome data has enabled a comparative genomics approach to functional element discovery. The construction of phylogenetically deep, whole-genome multiple alignments in models such as humans (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011), Drosophila (Drosophila 12 Genomes Consortium et al. 2007), and yeast (Kellis et al. 2003) has allowed the research community to understand each genome in a comparative framework. These alignments have bridged annotation between similar species, and subsequent investigations in each individual organism have utilized these resources to discover a variety of functional genomic elements and genome characteristics (Pedersen et al. 2006; Stark et al. 2007; Friedman et al. 2009; Kim et al. 2009; Stojanovic 2009).

Comparative genomic methods that use sequence similarity, protein alignments, and whole-genome alignments between two and five species have been widely applied by plant scientists to rice and Arabidopsis. Initially, these investigations into angiosperms focused primarily on synteny relationships between species (Acarkan et al. 2000; Ku et al. 2000; Gebhardt et al. 2003; Tang, Bowers, et al. 2008; Tang, Wang, et al. 2008), but have subsequently expanded into observations of lineage-specific protein-coding genes (Campbell et al. 2007; Yang et al. 2009), RNA genes (Michaud et al. 
2011), miRNAs (Zhang et al. 2006; Lenz et al. 2011), and of particular note, conserved noncoding sequences (Kaplinsky 2002; Guo 2003; Inada et al. 2003; Thomas et al. 2007; Wang et al. 2009; Kritsas et al. 2012). As the availability of sequenced species increases, comparative genomics in plants may now be performed using the same powerful frameworks and methodologies that have been applied to other model systems.

The wealth of genetic resources available for work on Arabidopsis thaliana, combined with its compact genome, has made it the prime target for comparative genomics research within plants (Schmidt 2002). Currently, there exist dozens of sequenced angiosperm genomes, along with a large number of sequenced Arabidopsis genomes. This wealth of data, in conjunction with the detailed molecular biological characterization of plant genes available from The Arabidopsis Information Resource (TAIR) (Lamesch et al. 2011), has the potential to reveal a more complete set of functional elements in the A. thaliana genome through the use of sequence comparison. One major motivation for this research is the need to bridge biological knowledge gained from the study of Arabidopsis to agricultural plants (Morrell et al. 2011); comparative genomics can be a potent tool toward these ends.

For some time now, many pairwise and small-scale multiple plant genome alignments have been available, mainly based on the VISTA comparative genomics pipeline (Dubchak et al. 2000; Frazer et al. 2004). This system has utilized the LAGAN alignment tool (Brudno et al. 2003) to generate dozens of Arabidopsis-based pairwise alignments, as well as to create five-way multiple alignments in model organisms (Brudno et al. 2007). Yet, no attempt is known to have been made to create or analyze a deep merged data set that can assess general conservation across genera, in a treatment similar to that seen in all other kingdoms of life. 
To address this, we have used the University of California–Santa Cruz (UCSC) source tree (Kent et al. 2002) in combination with a LASTZ/MULTIZ paradigm (Blanchette et al. 2004; Harris 2007) to create a 20way plant alignment that reaches nearly to single-nucleotide resolution of conservation; we have provided that information in its entirety to the plant community via a plant genome browser available at genome.genetics.rutgers.edu.

A major goal for our research is to characterize patterns of global conservation within angiosperms and leverage conservation data for functional element discovery. A recent analysis of a 105 kb syntenic segment of sequence between five Solanaceae demonstrated that measuring the conservation of DNA in plants can be a potent method of investigation for coding and noncoding sequence (Wang et al. 2008). This look into the nightshade family, along with investigations in fruit flies (Drosophila 12 Genomes Consortium et al. 2007), humans (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011), and yeast (Kellis et al. 2003), has made clear the utility of a comparative genomics perspective on genome function. Identifying and combining conserved regions of the A. thaliana genome with known annotation from the plant community will help identify novel highly conserved features and provide insight into contrasting evolutionary histories among the kingdoms of life.

Results

Alignment of Angiosperms to an A. thaliana Reference Genome

We have assembled the largest comparative genomic data set in plants to date, using whole-genome sequence data spanning the breadth of flowering plants. Choice of species to include in the alignment was based on data availability and, in some cases, on simplicity of genome architecture. The wheat genome, for example, was excluded due to its size and complexity. The included species span all angiosperms, with representatives from four monocot Poaceae (Goff et al. 2002; Paterson et al. 2009; Schnable et al. 
2009; Vogel et al. 2010), as well as 16 eudicots including four Brassicales (Arabidopsis Genome Initiative 2000; Ming et al. 2008; Hu et al. 2011; Wang et al. 2011), one Malvale (Argout et al. 2011), two Malpighiales (Tuskan et al. 2006; Chan et al. 2010), four Fabales (Retzel et al. 2007; Sato et al. 2008; Kim et al. 2010; Schmutz et al. 2010), one Cucurbitale (Huang et al. 2009), two Rosales (Velasco et al. 2010; Shulaev et al. 2011), one Vitale (Velasco et al. 2007), and one Solanaceae (Xu et al. 2011). The common names and genome details can be reviewed in table 1, sorted by divergence from A. thaliana.

Table 1. Species Information and Alignment Coverage for Each Included Species in the 20way Comparison.

| Name | Common Name | Type | Nucleotides | Assembly Date | Total Align (%) | CDS Align (%) | Subs/Site(a) |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | Thale cress | Pseudochromosomes | 119 Mbp | February 2009 TAIRv9 | - | - | - |
| Arabidopsis lyrata | Lyrate rockcress | Pseudochromosomes | 206 Mbp | May 2011 | 77.82 | 98.16 | 0.09 |
| Brassica rapa | Chinese cabbage | Scaffold | 274 Mbp | August 2011 | 65.85 | 96.78 | 0.35 |
| Carica papaya Linnaeus | Papaya | Scaffold | 342 Mbp | December 2007 | 34.84 | 79.70 | 1.04 |
| Theobroma cacao | Cocoa | Scaffold | 290 Mbp | August 2010 | 36.59 | 83.28 | 1.11 |
| Vitis vinifera | Grape | Pseudochromosomes | 497 Mbp | March 2010 | 32.21 | 75.93 | 1.19 |
| Populus trichocarpa | Poplar | Scaffold | 417 Mbp | March 2011 | 35.41 | 81.40 | 1.21 |
| Malus domestica Borkh. | Apple | Scaffold | 881 Mbp | November 2009 | 35.07 | 80.89 | 1.24 |
| Ricinus communis | Castor bean | Scaffold | 350 Mbp | February 2009 | 34.31 | 80.78 | 1.25 |
| Fragaria vesca | Strawberry | Scaffold | 214 Mbp | June 2010 | 34.43 | 80.30 | 1.28 |
| Glycine max | Soybean | Pseudochromosomes | 973 Mbp | January 2010 | 36.77 | 78.67 | 1.33 |
| Glycine soja | Wild soybean | Sequence Reads | 973 Mbp | - | 36.06 | 78.94 | 1.33 |
| Lotus japonica | Birdsfoot trefoil | Pseudochromosomes | 301 Mbp | May 2008 | 25.92 | 62.84 | 1.43 |
| Cucumis sativus var. sativus L. | Cucumber | Scaffold | 203 Mbp | January 2010 | 32.54 | 76.79 | 1.47 |
| Medicago truncatula | Clover | Scaffold | 307 Mbp | August 2007 | 28.40 | 67.49 | 1.51 |
| Solanum tuberosum | Potato | Scaffold | 727 Mbp | July 2011 | 34.88 | 77.77 | 1.52 |
| Sorghum bicolor | Sorghum | Pseudochromosomes | 738 Mbp | January 2007 | 26.24 | 63.93 | 1.92 |
| Oryza sativa L. ssp. Japonica | Rice | Pseudochromosomes | 373 Mbp | January 2009 | 25.33 | 63.39 | 1.94 |
| Brachypodium distachyon | Purple false brome | Scaffold | 271 Mbp | December 2009 | 25.37 | 63.68 | 1.95 |
| Zea mays ssp. Mays | Corn | Pseudochromosomes | 2.06 Gbp | March 2010 | 25.96 | 62.93 | 1.96 |

(a) The “Substitutions per Site” column lists the divergence from A. thaliana based on the neutral tree of figure 1.

There is a diverse set of methods available for whole-genome alignment, including both open-source and commercial packages. Our goal was both to create a deep multiple alignment and to make that data set available for community use. The success in the alignment of vertebrate genomes, and their subsequent browsable alignment, demonstrated that both these goals can be achieved in an integrated open-source manner (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011). 
Following this example, we created a mirror of the UCSC genome browser (genome.genetics.rutgers.edu) and built within its framework databases for multiple plant species. Currently, A. thaliana is the model browser for eudicot species, using TAIR version 9 annotations; an additional browser with a TAIR version 8 assembly is provided for legacy support.

Each of the 20 genomes was aligned in a pairwise fashion using tuned parameters (see Materials and Methods), followed by chaining and conversion to pairwise alignment files. A phylogenetic tree covering all included species, without branch lengths, was drawn from an angiosperm supertree (Davies et al. 2004) and used to guide MULTIZ (Blanchette et al. 2004) in merging the pairwise alignments. Using the 20-way alignment, branch lengths for a neutral tree based on 4-fold degenerate sites were computed using the PHAST package (Hubisz et al. 2011) and are displayed in figure 1. Other analyses of eudicots and angiosperms have constructed phylogenies with substitutions per site similar to those seen in the neutral tree used in this investigation (Yang et al. 1999; Tang, Wang, et al. 2008).

Base pair coverage for the whole genome and for coding DNA sequence (CDS) regions is presented in table 1, sorted by divergence from A. thaliana based on a neutral phylogenetic tree. Genomes included in the alignment vary greatly in terms of genome architecture, sequence quality, size, and phylogenetic distance from the reference. The coverage shows generally similar patterns compared with numbers gathered from mammalian alignments (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011).

It is informative to compare alignment coverage at various evolutionary distances between vertebrate alignments (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011) and plant alignments. For instance, A. 
thaliana and Brassica rapa are roughly as divergent as humans and the galago Otolemur garnettii at 0.35 and 0.33 substitutions per 4D-site, respectively. At this level of divergence, our plant alignment shows a greater proportion of aligned bases (65.8% vs. 44.3% aligned, respectively). Coding region alignments in this comparison follow suit with 96% versus 80% aligned base pairs in plants versus animals. Looking at the most diverged species comparison in our analysis, A. thaliana to Zea mays (1.96 substitutions per site), we find that this is roughly equal to the amount of divergence between human and Xenopus tropicalis (1.97 substitutions per site). In this comparison, vertebrates lose a greater amount of overall alignment (26% aligned in plants vs. 8% in vertebrates); however, the vertebrate coding regions are more conserved (62% vs. 87% aligned). Despite the differences seen in one-to-one comparisons, we observe a shared pattern that as distance increases, coverage by whole-genome sequence drops precipitously, bottoming at roughly 35% across eudicots and 26% across monocots. Unsurprisingly, protein-coding sequence shows higher conservation, never dropping below 62%.

Coverage and Gene Feature Comparisons in the Arabidopsis Genus

The VISTA genome browser has made available for public use a number of precomputed whole-genome alignments of plant genomes (Frazer et al. 2004; Brudno et al. 2007).

[Figure 1 tree; tip labels as listed: Malus x domestica, Fragaria vesca, Cucumis sativus, Medicago truncatula, Lotus japonica, Glycine max, Glycine soja, Ricinus communis, Populus trichocarpa, Arabidopsis thaliana, Arabidopsis lyrata, Brassica rapa, Carica papaya, Theobroma cacao, Vitis vinifera, Solanum tuberosum, Sorghum bicolor, Zea mays, Oryza sativa, Brachypodium distachyon]

FIG. 1. A phylogenetic tree of the relationships between species included in the 20way angiosperm alignment and used to guide MULTIZ merging of pairwise alignments. 
The neutral tree is based on 4-fold degenerate sites sampled from each chromosome with branches proportional to the listed scale, with substitutions per site determined by the PhyloFit software. Average trees for conserved and nonconserved regions with branch lengths are available in supplementary figure S5, Supplementary Material online.

These alignments range from pairwise up to 4way and have been used by the scientific community for comparisons between angiosperm genomes (Swarbreck et al. 2008; Zeller et al. 2009). We used one of these alignments created by the VISTA pipeline comparing A. thaliana to A. lyrata as a benchmark for the quality of our pairwise alignments, which used the LASTZ/MULTIZ and axtChain methodology (Kent et al. 2003; Blanchette et al. 2004; Harris 2007). This VISTA A. thaliana vs. A. lyrata (Ath/Aly) alignment is available on the araTha8 genome browser, along with its corresponding conservation track and TAIR version 8 gene annotation.

To evaluate nucleotide coverage, several types of base alignment were measured, including the number of exact base pair matches, the number of mismatched nucleotides, the number of gaps, RepeatMasked regions, and regions where no relationship between the two genomes was assigned, which is equivalent to a gap (fig. 2A, supplementary table S1, Supplementary Material online). The number of exact matches and mismatches between the VISTA alignments and our alignment was close to identical, with each covering 65% and 66% of the A. thaliana genome, respectively. Large differences can be seen in the amount of masking applied to both the reference and query genome. Comparing the RepeatMasker track created during our masking of A. thaliana to the VISTA alignment track on the TAIR v8 genome browser, it is evident that, although the VISTA alignment employs some masking, it is limited, and repeat regions are often gapped. 
This results in a comparatively higher proportion of gapped sequences within the VISTA alignment. Both methods result in raw coverage that is within 10% of other pairwise alignments of Ath/Aly (Hu et al. 2011).

With coverage numbers comparable to previous alignments, we wanted to investigate how the constituent parts in A. thaliana are affected by the process of multiple alignment. Base-by-base coverage numbers may conceal errors in reading frame or poorly aligned functional sites such as splice sites. We used TAIR protein-coding gene annotation (Lamesch et al. 2011) and the cleanGenes program in the PHAST package (Hubisz et al. 2011) to locate and evaluate start codons, stop codons, and splice sites, and to identify frameshift/nonsense mutations. Annotated gene regions containing all the listed functional elements without interruptions between an A. thaliana and A. lyrata alignment were 16% greater (5,240) in our alignments, compared with the VISTA alignments (fig. 2B). Features listed as having no alignment have an excess of gaps obscuring any measurement of features. Gene regions with no alignment occur more than twice as often in the VISTA data set. In the subsequent observations, there were marginal increases in failed tests of gene features for our alignments, attributable to a greater proportion of features passing the initial “no alignment” test.

The cleanGenes software also examines full exons listed in the annotation for the A. thaliana genome. Using this function, we tabulated the number of exons with uninterrupted alignment in an Ath/Aly alignment. Over 4,000 more conserved exons with no gaps in alignment were identified in our alignments, compared with the VISTA alignments (supplementary table S1, Supplementary Material online).

[Figure 2, panels A and B. (A) Coverage charts. VISTA Ath/Aly: Exact 65%, Mismatch 7%, Gap 13%, Masked 1%, No Align 14%. 20way Ath/Aly: Exact 66%, Mismatch 8%, Gap 5%, Masked 14%, No Align 7%. (B) Bar chart of feature counts (axis 0 to 16000) for VISTA Ath/Aly and 20way Ath/Aly in the categories Passing All Tests, No Alignment, Start Codon Failed, Stop Codon Failed, 5' Splice Site Failed, 3' Splice Site Failed, Nonsense Mutation, and Frameshift Mutation.]

FIG. 2. Alignment coverage and quality comparison between our implementation of the LASTZ/MULTIZ paradigm and a publicly available alignment hosted by the VISTA genome browser using a LAGAN-based alignment. (A) Coverage statistics as tabulated by the mafCoverage utility for each methodology detailing exact nucleotide matches, alignment with mismatched nucleotides, gapped sequence, sequence intentionally removed due to repeats, and regions where no relationship between the two genomes was assigned (equivalent to a gap). (B) Results from the cleanGenes utility that takes a TAIR v8 annotation and measures whether a given alignment has conserved the gene feature and maintained its protein-coding ability. If the gene alignment between the two genomes is not cleanly conserved, the type of error is recorded.

To further address whether our methods are creating suitable alignments beyond pairwise comparisons to A. lyrata, we investigated gene conservation in two additional species present in the alignment. Alignments created using LASTZ/MULTIZ for Vitis vinifera and Glycine max were compared with existing VISTA pairwise alignments created using the LAGAN pipeline. Similar to the results observed when comparing A. thaliana with A. lyrata, our alignments outperformed existing alignments. As detailed in supplementary table S2, Supplementary Material online, LASTZ was able to cleanly align nearly twice (801/427) the number of features compared with LAGAN when comparing Arabidopsis to wine grape. The same result was found in the A. thaliana/G. max alignment, which found more than twice the number of gene features cleanly conserved when compared with LAGAN (837/336). 
This difference can be attributed to the preservation of start/stop codons and splice sites during the alignment process.

Base-by-Base Conservation and Discrete Conserved Elements among Angiosperms

To predict conserved regions in our multiple alignment, we used the phyloHMM method of Siepel et al. (2005), which was previously used to search for conserved elements within four different groups of organisms: vertebrates, insects, worms, and yeast. This phyloHMM scores general conservation across an alignment and also creates a smaller set of discrete elements, representing the most highly conserved blocks of sequence (mostCons). The normal composition of genome features in A. thaliana is illustrated in figure 3A and serves as a reference to which the predicted conserved regions can be compared. The composition of all scored conserved elements (fig. 3B) can be contrasted with the normal distribution, revealing an expansion in the proportion of protein-coding sequence and unannotated intergenic sequence. Annotations specific to regions not associated with protein-coding genes, such as noncoding RNAs and translational RNAs, represent only a fraction of the larger conserved region data set. This contrasts with the relatively higher proportion of conserved RNAs observed in previous analyses in organismal groups such as vertebrates (Siepel et al. 2005). Using these previous data, and reproducing the analysis for angiosperm genomes, we observe a greater amount of CDS conservation in angiosperms (42%) than seen in vertebrate (18%) and insect (26%) conservation but less than that seen in worms (55%) and yeast (86%). In general, when comparing the patterns of element conservation and diversity seen in angiosperm genomes to the same distributions previously mapped in reference vertebrate, yeast, insect, and worm genomes, we find that angiosperms most closely resemble the distribution of conserved elements seen in nematodes such as Caenorhabditis elegans. 
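The composition comparison between figure 3A and figure 3B amounts to intersecting conserved elements with annotation classes and reporting each class's share of the conserved bases. The following is a minimal sketch of that bookkeeping, assuming half-open (start, end) coordinates on a single chromosome and non-overlapping annotation intervals. The function name, the class labels, and the toy coordinates are hypothetical, and this is not the pipeline used to produce the figure.

```python
from collections import defaultdict

def conserved_composition(conserved, annotations):
    """Share of conserved bases per annotation class.

    conserved:   list of (start, end) half-open intervals
    annotations: list of (start, end, label) half-open, non-overlapping intervals
    Conserved bases outside every annotation are counted as 'unannotated'.
    """
    bases = defaultdict(int)
    total = 0
    for c_start, c_end in conserved:
        length = c_end - c_start
        total += length
        covered = 0
        for a_start, a_end, label in annotations:
            overlap = min(c_end, a_end) - max(c_start, a_start)
            if overlap > 0:
                bases[label] += overlap
                covered += overlap
        bases["unannotated"] += length - covered
    if total == 0:
        return {}
    return {label: n / total for label, n in bases.items() if n > 0}

# Toy data (hypothetical coordinates, for illustration only).
annotations = [(0, 100, "CDS"), (100, 150, "intron"), (300, 320, "tRNA")]
conserved = [(20, 80), (90, 120), (200, 240), (305, 315)]
composition = conserved_composition(conserved, annotations)
```

Applying the same function to the whole genome as a single "conserved" interval would give the background composition against which the conserved set is contrasted.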
PhastCons produces a set of discrete regions that are the most conserved within the alignment; these are graphed by annotation type in figure 3C. To further isolate regions with the deepest phylogenetic conservation, we selected the top 10% of this mostCons set, as defined by having a logarithm of the odds (LOD) score greater than 88. The majority of the mostCons set and the tail of its distribution are annotated as protein-coding sequence. Despite filtering for only the highest scoring regions in the mostCons set, elements mapping to intergenic regions are still represented. These intergenic regions do not include any known DNA-level annotation; this suggests that there is substantial undiscovered functionality present in A. thaliana and other plant genomes. Cis-regulatory elements are equally represented among the normal composition, conserved, and most-conserved regions. Although using short sequence motifs to identify regulatory elements may accrue false-positive regions that share sequence identity but are nonfunctional, the deep conservation of many of these sites demonstrates that most are likely functional in some way. In general, the mostCons data set serves as the starting point for further analysis and annotation of conserved regions within angiosperms.

One way of characterizing the conserved portion of the genome is to ask what functional annotations are enriched among identified conserved elements. Figure 3D provides such a view of the conserved portion of plant genomes. In particular, translational RNAs are the most enriched annotation among conserved regions, followed by protein-coding sequences. Following these two groups, we observe that RNAs that regulate transcription, such as miRNAs, are enriched among conserved sequences. We observed that this annotation set of miRNA and noncoding RNA (ncRNA) annotations was diverse in its alignment depth. 
It included RNAs present in many, if not all, of the 20 included species, and RNAs with alignment to only Brassicales. This wide variation in the depth of conservation of RNAs makes their mild enrichment unsurprising. In addition to the enrichment of regulatory RNAs, regions tentatively annotated as binding transcription factors are also enriched. As a control, transposable elements are drastically under-represented: being highly repetitive, they do not align well, nor should they be conserved between species in most cases. Comparing the enrichment of angiosperm annotations among conserved regions to vertebrate annotation enriched in conserved regions determined by the 46way vertebrate conservation track showed nearly the same ordering of enriched annotation types (supplementary fig. S1, Supplementary Material online). Vertebrate enrichment values trended higher in all categories compared with angiosperm enrichment.

Previous investigations into conservation between vertebrate species have looked into the alignability (i.e., percentage of bp with aligned sequence to a reference) of different components of the genome as a function of evolutionary divergence (Miller et al. 2007). To compare and contrast the animal results of Miller et al. (2007) with conservation within plant species, we recapitulated their analysis using the 46way alignment information (Fujita et al. 2011) and overlaid the trend lines on selected angiosperm results (fig. 3F). Comparing conservation of RefSeq CDS regions from vertebrates to conservation of TAIR CDS regions within angiosperms showed a faster decline of alignability in vertebrate species. Similarly, a faster decline in alignability was observed for vertebrate cis-regulatory sites than for angiosperm cis-regulatory sites, as seen in the trend line of figure 3F. This may be due to the ORegAnno annotation being biochemically validated, whereas our initial set of regulatory sites is bioinformatically predicted. 
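The decline of alignability with divergence can be illustrated with the plant coverage values listed in table 1. The sketch below fits a straight line to total and CDS coverage against substitutions per site. It assumes that ordinary least squares is an acceptable stand-in for a trend line; the fitting method behind figure 3F is not stated here, so that choice and the function name are assumptions, and no vertebrate values are included.

```python
# (substitutions per site, total aligned %, CDS aligned %) from table 1,
# excluding the A. thaliana reference itself.
TABLE1 = [
    (0.09, 77.82, 98.16), (0.35, 65.85, 96.78), (1.04, 34.84, 79.70),
    (1.11, 36.59, 83.28), (1.19, 32.21, 75.93), (1.21, 35.41, 81.40),
    (1.24, 35.07, 80.89), (1.25, 34.31, 80.78), (1.28, 34.43, 80.30),
    (1.33, 36.77, 78.67), (1.33, 36.06, 78.94), (1.43, 25.92, 62.84),
    (1.47, 32.54, 76.79), (1.51, 28.40, 67.49), (1.52, 34.88, 77.77),
    (1.92, 26.24, 63.93), (1.94, 25.33, 63.39), (1.95, 25.37, 63.68),
    (1.96, 25.96, 62.93),
]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

distance = [row[0] for row in TABLE1]
total_slope, total_intercept = fit_line(distance, [row[1] for row in TABLE1])
cds_slope, cds_intercept = fit_line(distance, [row[2] for row in TABLE1])
```

A more negative slope corresponds to a faster loss of alignability per unit of divergence, which is the quantity being compared between feature types and between kingdoms.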
[Figure 3, panels A to G. Recoverable panel values: (A) CDS 30%, transposable elements 29%, intergenic 21%, intron 16%, cis-regulatory 3%, other 1%. (B) CDS 42%, unannotated intergenic conservation 33%, intron 19%, cis-regulatory 3%, transposable elements 1%, other 1%. (C) Most conserved and top 10% most conserved. (D) Conserved Region Enrichment. (E) Conserved Secondary Structure Enrichment. (F) Alignability (0.2 to 1) against distance in substitutions per site (0.5 to 2) for lyrata, rapa, other eudicots, and monocots, with vertebrate trend lines for cis-regulatory sites and exons. (G) Element counts: conserved regions with homology to known proteins 1,787; transposable element homology 533; no homology 285; tRNA 100; tRNA + EvoFold 52; pseudogene homology 78; snRNA/snoRNA homology 77; ncRNA homology 40; EST 26; rRNA 17.]

FIG. 3. An analysis of conservation in the 20way plant alignment. (A) The composition of the Arabidopsis thaliana genome sourced from TAIR v9 and newly generated annotations. “Other” contains ncRNAs, miRNAs, tRNAs, rRNAs, small nuclear RNAs (snRNAs)/small nucleolar RNAs (snoRNAs), and pseudogenes. (B) PhastCons-predicted conserved elements sorted by annotation. (C) The discrete “mostCons” regions predicted by phastCons and the 10% highest scoring tail of the distribution of conservation scores for mostCons regions. Color and segment position correspond to annotation type described in (A) and (B). (D) Enrichment of conserved elements within different feature types. Significance was determined by a Fisher’s exact test with single stars denoting P < 0.05 and double stars P < 0.01. (E) Enrichment of EvoFold-predicted secondary structures in different types of genome features. (F) Alignability of A. thaliana genome features to corresponding features in plants at increasing phylogenetic distances. Also plotted are proportional trend lines of vertebrate alignability of cis-regulatory sites and exons drawn from 46 vertebrate species (Fujita et al. 2011). The distance is scaled according to substitutions per site drawn from a 4-fold degenerate neutral tree. (G) BLAST homology annotation of unannotated intergenic conserved elements from (B). Annotation categories are shown as proportional areas with subsets shown as Venn diagrams. Regions are labeled with their putative annotation type and total number of elements identified.

De novo Annotation of Unknown Conserved Elements

The conservation analysis from figure 3B revealed that there remains a large percentage of conserved intergenic DNA that is not associated with any documented annotation or function. To further investigate this set of regions, we applied a BLAST homology search to all existing plant databases to annotate new sequence (see Materials and Methods). BLAST annotation terms associated with each region of unannotated conservation were recorded and graphed as proportional area, so as to visualize the diversity of the previously unknown conservation (fig. 3G). Each circular area represents a group of conserved regions that do not overlap existing annotation in Arabidopsis but that share sequence identity to an annotation group in Arabidopsis or in any other plant genome. 
It is important to note that this is by no means an exact one-to-one annotation, as most regions show moderate sequence identity. However, it illustrates that single-genome computational predictions of functional elements have overlooked many biologically relevant sites within the Arabidopsis genome and provides inroads toward their further characterization.

Intersecting predicted folding RNAs (fRNAs) with conserved regions that show sequence homology to angiosperm tRNAs revealed that more than half of these regions also exhibit complex RNA folding patterns. This overlap of independent methods of identification gives a strong indication that these conserved regions are part of previously unannotated tRNAs in Arabidopsis. The remaining conserved regions that only have homology to tRNAs may be truncated and lack the complementary sequence needed to accurately predict a fold within that region.

Forty conserved regions were found that showed some sequence identity to, but not overlap with, currently annotated noncoding RNAs in plants; these regions were intersected with a track listing regions of small RNA expression, which revealed that 14 sequences in that set were transcribed. Of those 14 regions that expressed RNA, seven also have predicted folding structures associated with the conserved region. Although the first BLAST sequence identity term for these elements was similar to ncRNAs, many also have protein-coding homology as a secondary BLAST term, suggesting their potential targets for regulation.

Despite all attempts at classification, the function of 10% of the starting data set of unannotated conserved elements remains unknown. 
These elements of unusually high sequence conservation among species, labeled in figure 3G as “no homology,” cannot yet be fully characterized; however, similar to many of the other regions successfully identified, a subset shows small RNA expression or predicted folding, giving clues to a currently veiled function.

The most prominent set of newly annotated elements (fig. 3G) is conserved regions with homology to protein-coding sequence. Overlapping these regions of protein homology are subsets that have been intersected with different whole-genome annotation tracks. This is visualized as an internal Venn diagram of different types of feature characteristics, such as structure or expression. One possibility is that this large group of elements comprises regions that could code for proteins, either currently or ancestrally. To explore this, we evaluated the reading frames of each conserved region and identified that, for the length of the conservation, at least one-third (573) have one or more viable reading frames without stop codons. An additional measure of potential protein-coding ability was the proximity to known exons. About one-fifth of the regions with good open reading frames (174) were within 300 bp of a known exon, making them candidates for being involved as an alternative variant of a transcript or as an unknown exon of an annotated gene.

Although all these included regions have some homology to protein sequence, such homology is not always an indication that the conserved sequence contributes to an mRNA transcript. RNAs that regulate transcripts or target DNA require homology to that target DNA (e.g., miRNAs). As such, parts of these regions of homology could result from targeting protein-coding regions as part of a regulatory mechanism. To further differentiate this large group of 1,787 elements, other features were employed to identify additional characteristics of each conserved region. 
Secondary structure and small RNA expression were used to elucidate potential RNA genes within this set. This resulted in 53 elements at the intersection of secondary structure and small RNA expression, which made prime targets for further investigation. One intriguing region from this set had a top BLAST term, which listed an unknown protein, and a second BLAST term, which listed the protein LAF3 (AT3G55850) (fig. 5). The LAF3 protein participates in regulating phytochrome A signal transduction in the cytosol (Hare et al. 2003). The third highest scoring BLAST hit was the noncoding RNA AT1g70185, located 500 kb downstream of the unknown conserved region on chromosome 1, with a stretch of homology 80 bp long with 10 substitutions along that stretch. The biological function of this related ncRNA is unknown. Our predicted ncRNA, which was found through its pattern of conservation, shares homology with the sequence for the LAF3 protein, as well as with the TAIR-annotated ncRNA. The protein homology overlaps the expressed small RNAs mapped to the region. EvoFold-predicted secondary structure shows an unusually high level of conservation, with almost no substitutions in the fold found among angiosperms. In addition to this ncRNA, several other targets from this data set share similar patterns of high conservation, expression, and high-scoring secondary structure; these are annotated as part of a browser track on the A. thaliana genome.

RNA Secondary Structure Prediction across A. thaliana

Previous successes in whole-genome comparisons among species groups have opened a window onto using multiple alignments and phylogenetic trees to identify RNA genes (Pedersen et al. 2006; Stark et al. 2007). To identify possible RNA genes in our plant alignment, we used the phyloSCFG algorithm implemented in the EvoFold software package (Pedersen et al. 2006), in addition to the RNAalifold program (Bernhart et al. 2008).
These previous RNA structure analyses have highlighted the inherently high rates of false positives in folding prediction. Using these two independent prediction methods provided the opportunity to evaluate each fRNA from multiple perspectives; this helped to eliminate false positives that may have resulted from characteristics unique to a particular algorithm. An example of this approach can be seen in predictions such as those illustrated in figure 4E, where the two algorithms overlap in their annotation of a fold. The combined predictions of the two approaches identified 86,000 sites that could potentially fold. Short folds of less than 15 bp made up the majority of predictions, though longer folds were also found in large numbers (fig. 4A). To assess the accuracy in determining fRNA from highly conserved alignments, the set of TAIR annotations for transfer RNAs consisting of 689 sites was used as a positive control for fRNA prediction. Our fold classifications predict 97% (637) of these established fRNAs (fig. 4B). The remaining 3% of annotated tRNAs were not identified, due to poor alignment or low conservation. This suggests accurate prediction of known, conserved fRNAs, on par with previous investigations into fRNA genes in other organisms.

Secondary structure in RNAs can take many physical forms; we quantified this variation in shape by recording the type of matching seen in both long and short folds (fig. 4C). The hairpin type dominates among shorter folds. Long folds show much higher diversity in shape, including complex folds that have more than three hairpins in the folded structure. Both long and short regions show a greater proportion of folds comprising two hairpins in angiosperms than is observed among folds in the human genome, where double hairpins are rarer (Pedersen et al. 2006).
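As a rough illustration of how folds might be binned by hairpin count, the sketch below counts hairpin loops in a dot-bracket string. This is an assumption about how such a tally could be made, not the classification procedure used in the paper: the function names and category labels are hypothetical, and the example string is transcribed from the secondary-structure annotation row of figure 4F.

```python
import re

# Illustrative sketch only; names and category labels are not from the paper.
# A hairpin loop is an opening bracket, one or more unpaired bases, then a closing bracket.
_HAIRPIN_LOOP = re.compile(r"\(\.+\)")


def count_hairpins(dot_bracket):
    """Count hairpin loops in a dot-bracket secondary-structure string."""
    return len(_HAIRPIN_LOOP.findall(dot_bracket))


def classify_fold(dot_bracket):
    """Assign a coarse shape label based on the number of hairpin loops."""
    n = count_hairpins(dot_bracket)
    if n == 0:
        return "no hairpin"
    if n == 1:
        return "hairpin"
    if n == 2:
        return "double hairpin"
    if n == 3:
        return "three hairpins"
    return "complex"  # more than three hairpins


if __name__ == "__main__":
    # Dot-bracket annotation transcribed from figure 4F.
    prin2_utr = (
        "(((((((.....((((((...))))))...))).))))"
        "....((((.((((((.....)))))).))))"
    )
    print(count_hairpins(prin2_utr), classify_fold(prin2_utr))
```

Counting innermost loops ignores bulges and multibranch junctions, so it only approximates the shape categories shown in figure 4C.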
The types of annotation that overlap regions which fold are described in figure 4D for long and short folds. In vertebrates, nearly half of all known folds are intergenic, with the remainder being associated with introns and CDS. In contrast, angiosperms have few intergenic folds, with 70% or more occurring within coding sequence. This difference mirrors the differences seen in the type of conservation of all sequence between species. The data set used for both analyses, the “mostCons” (most-conserved regions identified by phastCons), impacts the distribution of folds among annotation types. As a result, we see that the mostCons composition in figure 3C is similar to the composition of folds detected in figure 4D.

As a vignette describing one of the types of folds detected in this analysis, we selected a previously undescribed conserved high-scoring double hairpin within the 5′-untranslated region of the plastid redox insensitive 2 (PRIN2) gene (fig. 4E). PRIN2 is a nuclear-encoded chloroplast-localized protein whose expression levels are altered by light (Kindgren et al. 2011). The PRIN2 protein was also found to interact with the plastid-encoded RNA polymerase, altering expression, and is therefore thought to be a nonessential regulator of plastid gene expression. The folded RNA structure is highly conserved among all flowering plants, with few mismatching base pairs (fig. 4F). The consensus fold shows two strongly conserved hairpins joined by a more variable region (fig. 4G). The gene shows two transcripts scored 3 and 4 stars by TAIR, differing only in the length of the 5′-untranslated region (5′-UTR): one with the predicted folds and one without. Interestingly, in the longer transcript, the hairpins directly overlap the ribosome initiation site that begins at the 5′ cap. Additionally, we detected two cis-regulatory motifs: one near, and one within, the UTR regions (as seen in fig. 4E).
The first cis-regulatory element, a “sequence overrepresented in light repressed promoters number 3” (SORLREP3) motif, has previously been found to occur near promoters whose transcript levels are reduced under a continuous red light stimulus (Hudson and Quail 2003). The second motif, found within the 5′-UTR of both transcripts, is an I-box and is known to exist in the promoter regions of light-regulated genes (Giuliano et al. 1988). How these two cis-regulatory elements contribute to expression levels of each of the alternative PRIN2 transcripts is unknown, but the observed regulatory pattern fits with previous knowledge about stimuli associated with PRIN2 expression. Their presence flanking the predicted folds may indicate that different expression patterns of the gene are possible, depending on transcription factor binding. Although these new predictions need further validation, they highlight the ability of these genome-wide data sets to add value to existing gene investigations.

Uninterrupted Conservation in Angiosperms

One peculiarity found in the genomes of mammals and insects is long stretches of uninterrupted conservation (Bejerano et al. 2004). These ultraconserved elements (UCEs) were originally located by a comparative genomic search between human, mouse, and rat, which showed evidence of deep phylogenetic conservation, as well as ongoing purifying selection in the human genome (Bejerano et al. 2004; Katzman et al. 2007; Chiang et al. 2008). UCEs can extend to lengths of more than 500 bp and can be best described as the extreme tail of the distribution of genome-wide conserved elements. There is a degree of controversy as to whether they exist within plant genomes, with some researchers reporting their discovery and others remaining skeptical (Zheng and Zhang 2008; Freeling and Subramaniam 2009).
FIG. 4. Predicted secondary structure based on a 20way angiosperm alignment using EvoFold and RNAalifold. (A) Fold lengths separated into short (<15 bp) or long folds. (B) Coverage of known tRNAs intersected with fold predictions. The remaining 3% (21) were not predicted, owing to low conservation or poor alignment. (C) Fold structure for both long and short fRNA sets. The number of hairpins was counted in a single fold, and classified based on the fold’s structure. (D) Type of overlapping annotation for both long and short data sets. (E) UCSC genome browser screenshot of predicted hairpins in the PRIN2 (AT1G10522) gene that is involved in plastid gene transcription, and is alternatively spliced. The hairpins were predicted by both EvoFold and RNAalifold, and overlap the ribosomal initiation site. Also pictured are cis-regulatory predictions, a TAIR gene track, and the phastCons conservation track. (F) 20way alignment of the region colored blue where there is a single substitution compatible with the annotated pair, green with a compatible double substitution, and red where there is a substitution not compatible with the annotated pair. (G) Consensus Vienna RNAfold predicted MFE structure of the highlighted 5′-UTR region.

[Figure 4 panel content recoverable from the extracted text: (A) long folds 37% (39,681), short folds 63% (54,981); (B) coverage of known tRNAs 97% (637); (D) annotation classes CDS, intergenic, intron, UTR, RNAs, other; (E) chr1:3,469,600-3,470,150 with TAIR9 protein-coding gene models AT1G10522.1 and AT1G10522.2, SORLREP3 and I-box motifs, and fold predictions vf_1_142172, ef_1_142172, and vf_1_142175; (F) secondary-structure annotation of the aligned region: (((((((.....((((((...))))))...))).))))....((((.((((((.....)))))).)))) ]

FIG. 5. A screenshot from the Arabidopsis thaliana genome browser displaying tracks overlaid on a putative noncoding RNA, detected due to its high conservation and expression. Tracks include conservation for each of the 20 included species, a track showing conserved regions BLAST-annotated as having protein homology, a track showing secondary structure computed with EvoFold and RNAalifold with dark green denoting fold predictions and light green nonfolding regions, and a track from the Arabidopsis Small RNA Project Database showing small RNA expression overlapping the conserved regions. Also shown is the consensus Vienna RNAfold predicted MFE structure of the putative ncRNA.

[Figure 5 panel content recoverable from the extracted text: chr1:26,891,600-26,891,650; tracks for protein homology, fold predictions vf_1_107594 and ef_1_107594, per-species alignment, conservation (0 to 1), and small RNAs.]

More recent research has used BLAST searches across multiple plant genomes, identifying regions that have been termed ultraconserved-like elements (ULEs) (Kritsas et al. 2012). ULEs have unusually high levels of conservation, and negative selection acting on their sequence, but lack the uninterrupted segments and extreme purifying selection that are found in mammals. To explore whether flowering plants contain even modestly extended stretches of the uninterrupted conservation seen in mammals and insects, we conducted a search using methods mirroring those used to detect these regions in mammals (Bejerano et al. 2004; Glazov et al. 2005). These approaches use whole-genome multiple alignment to detect blocks of conservation. Specifically, the algorithm identifies UCEs by starting with a conserved alignment column and stringing together subsequent preserved columns until this pattern breaks due to any kind of nucleotide change. Constraining this algorithm are two parameters: the number of genomes within the alignment and the minimum threshold for declaring a UCE. Using all 20 aligned species to search for regions of consistent alignment column conservation returns no regions when using a cutoff of 18 bp.
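The column-scanning search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program used in the study: the function name is hypothetical, the input is assumed to be one alignment block given as equal-length rows, and any gap, ambiguous base, or mismatch is treated as breaking a run.

```python
# Illustrative sketch only; the function name and input layout are assumptions.
def ultraconserved_runs(rows, min_len=18):
    """Find runs of identical, ungapped alignment columns.

    rows    -- equal-length aligned sequences, one per species
    min_len -- minimum run length (in columns) to report
    Returns half-open (start, end) column intervals within the block.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all alignment rows must have the same length")

    runs = []
    run_start = None
    for col in range(length):
        column = {row[col].upper() for row in rows}
        # Conserved means one shared base, and that base is A, C, G, or T.
        conserved = len(column) == 1 and column <= set("ACGT")
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    # Close a run that extends to the end of the block.
    if run_start is not None and length - run_start >= min_len:
        runs.append((run_start, length))
    return runs


if __name__ == "__main__":
    block = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
    print(ultraconserved_runs(block, min_len=4))
```

The two constraints named in the text map directly onto the inputs: the number of rows passed in (all 20 species, or a three-way subset) and `min_len` (18 bp here). Intervals are reported in alignment columns and would still need to be converted to reference coordinates.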
One possibility is that gaps in alignment could be due to alignment or assembly quality errors. To better account for this, we limited the search space by using three-way alignments to A. thaliana. The alignment of G. max and V. vinifera to A. thaliana returned the largest number of uninterrupted regions greater than 18 bp. Using TAIR annotation and BLAST homology, we annotated 1,600 uninterrupted regions detected in this 3way alignment and determined they all fall within known types of genome features (supplementary fig. S2, Supplementary Material online). Considering that novel metazoan-type uninterrupted conservation has not been found, it can be concluded that metazoan-like UCEs are not present in angiosperm genomes at the investigated phylogenetic depths. As suggested by Kritsas et al. (2012), plant genomes may contain features that serve a similar purpose but with altered or reduced conservation characteristics.

Discussion

Is the evolution of plant genomes distinct from that of animals? Here, we construct a phylogenetically deep alignment of angiosperm genomes, to ask how sequence conservation in angiosperms compares to groups of species in other kingdoms of life. Relating the entirety of currently sequenced genomes can reveal a more complete story on how similar plant genomes are and on what features they “value” as part of a shared evolutionary history. Additionally, conservation of sequence has been shown to quickly and clearly identify functional regions which might otherwise have been overlooked. As such, by analyzing genome conservation in flowering plants, we have been able to add new annotations based on patterns of conservation and identify novel features with secondary structure and potential target sequences.

Information content of multiple alignments increases as the number of species and the breadth of the phylogeny increase. Although this is true for the first few included species, there are diminishing returns as further species are added.
To better quantify this, some have looked into how many genomes are necessary to reach the nucleotide-level resolution of conservation in comparative studies (Cooper et al. 2003; Eddy 2005). Although these investigations focus on the number of mammalian genomes, they still provide potent rules of thumb for estimating how many genomes are needed for high resolution. Depending on the phylogenetic relationships, anywhere between 15 and 40 genomes may be necessary. Our choice of how many genomes to include was largely dictated by availability, as roughly 20 genomes were available to us for use. Although plant alignments could benefit from further inclusion of comparative data of close phylogenetic distance to A. thaliana, this phylogenetically broad 20way alignment is a large step toward nucleotide-level identification of conserved sequences in angiosperm genomes.

Aligning plant genomes inevitably leads to the question of how recent polyploid events impact the construction of data sets and their analysis. Arabidopsis thaliana is one of the smallest sequenced plant genomes; as a result, alignments generated with an A. thaliana reference will be equally minimal. In instances where the reference (Arabidopsis) only has a single copy, species aligned to this compact reference will exclude regions of less conserved paralogs within the query genome. This method has worked well for broad use in less dynamic whole-genome comparative data sets such as vertebrates. In light of the complex ploidy history of plant genomes, we want our inference to be conservative, relative to the influence of polyploidy; thus, we focus attention only on questions of genome conservation within the A. thaliana genome. A modified alignment process, using high-quality chromosome data to resolve genome duplication events before the LASTZ alignment step, could produce alignments without such a complex composite nature.
However, only a sparse number of genomes have such data available.

Our investigation into coding-sequence conservation demonstrates that we are generating quality whole-genome alignments, which preserve essential gene features in aligned regions. Our approach also implemented a new platform for visualizing and accessing alignment data for plants; such tools have long been available for other model species, but this is the first instance of a deep comparative browser for plants. The results from benchmarking the gene quality of the alignment (fig. 2) were surprising in that almost half of A. thaliana gene annotations contained a change that would disrupt function when aligned to A. lyrata. This is despite the two sharing the large majority of A. thaliana’s coding sequence, with A. lyrata having alignments for 98% of the coding sequence present in A. thaliana. This stands in stark contrast to start/stop codon conservation comparisons in vertebrate species, where even distant mammals such as platypus retain more than 60% of these essential sites (Miller et al. 2007). One explanation for this observation could be that the alignment process creates composite genes by aligning multiple copies of A. lyrata genes onto single copy genes in the A. thaliana reference. This hypothesis, however, does not fully explain the number of disruptions in aligned genes. The majority of genes between the two species occur colinearly, with a minority being duplicated (Hu et al. 2011). If alignment errors from paralogs were creating all these observed disruptions, then we would expect them to occur at a similar rate to the proportion of duplicated genes overall. More likely, there is a mix of causes, with only some being false positives due to poor alignment. Even taking these potential false positives into consideration, we see a trend that is markedly different from the gene feature conservation seen in vertebrate species.
The proportion of conserved genome features, such as coding sequence, introns, and UTRs, relative to the complete set of detected conserved elements within a reference genome, has been previously investigated for vertebrates, insects, worms, and yeasts (Siepel et al. 2005). Specifically, this comparison shows a trend relating an increase in the complexity of the conserved element set to an increase in overall organismal complexity. Reproducing this analysis (fig. 3B) for angiosperms revealed that the proportion of gene features among the complete set of conserved elements most closely parallels the proportions observed in worms such as C. elegans. Both nematodes and plants are known to exhibit a wide degree of phenotypic plasticity, making drastic alterations to body structure due to environmental stress (Sultan 2000; Sommer and Ogawa 2011). The observations that both angiosperms and nematodes share a common distribution of conserved elements, and that they both make use of a more flexible phenotypic landscape, may imply that this less diverse composition of conserved elements is necessary for environmentally induced large phenotypic changes. Moreover, it could be that the developmental plans of plants are relatively flexible in comparison to animals, and thus, this developmental lability is reflected in genomic architecture and evolution.

It is clear that angiosperm coding sequence and essential RNAs can be reliably aligned and identified, even across protracted phylogenetic timelines. However, these components represent only a fraction of genome features. The question of the phylogenetic distance at which we lose sequence identity for rapidly diverging features in plants can help make informed decisions in experimental design. Comparing alignability in vertebrates to angiosperm species reveals a faster decay in vertebrates compared with plant species.
Both between closely related species and in phylogenetic comparisons beyond the Brassicales, the alignability of coding sequence to the A. thaliana reference was greater than that of equally distant vertebrates to a human reference (fig. 3F). Similarly, the alignment coverage of reference genome coding sequence between equally distant vertebrate and plant species showed a trend of higher coverage in plant species that were recently and distantly diverged. The implication of this result is that although plant genomes can be highly variable intergenically, essential features such as coding sequences are highly conserved between species. It is more difficult to draw definitive conclusions about the alignability of cis-regulatory modules between kingdoms due to differing quality of annotations. However, we do observe a substantial difference in alignability when comparing the two kingdoms of life.

The preservation of conserved elements with no known annotation, even in the most stringently filtered sets, was surprising. This pattern illustrates that there are still segments of the A. thaliana genome that are conceivably functional but as yet uncharacterized. Our first-pass annotation of these conserved regions (fig. 3G) has shed light on the type of function associated with this DNA. Although we detected several types of features, the highlight is the identification of dozens of new potential RNA genes. These regions, found by partial BLAST homology to existing ncRNAs, or alternatively by finding protein homology with overlapping small RNA expression, may represent a previously unknown source of regulation in A. thaliana. However, the homology-based method used to annotate conserved regions is simple and broad and as a result cannot produce truly definitive annotations. These new annotation groups, however, are small enough for future manual refinement.
Ultimately, function can only truly be assigned as a result of validation through benchwork that verifies RNA expression and effects on phenotype. The vignettes of novel folding regions highlighted here demonstrate the ability to quickly identify novel genome features and can serve as a guide for bench scientists to probe deeper into their gene families of choice with the help of our genome browser. Novel putative ncRNAs, such as that described in figure 5, are promising but require further investigation to confirm the paradigm that conservation implies function.

Beyond identifying new annotations via conservation, we have added depth to existing annotation by layering it with RNA folding and cis-regulatory information. In doing so, we have characterized on a genome-wide scale how RNA genes fold in flowering plants. Our leveraging of two independent algorithms, to precompute RNA folds and display them genome-wide, gives researchers an instant second opinion on a methodology that is often subject to high false discovery rates and intense computational time (Gorodkin et al. 2010). This folding information, combined with TAIR gene tracks, conservation, and cis-regulatory motifs, was able to identify a potential fold that may well control regulation of a gene (fig. 4E).

This study articulated a few avenues for investigation into a comparative alignment. Most of the analyses presented can be viewed as tracks on the Arabidopsis genome browser; many can be reconstituted using the genome browser and table browser web tools. Plant comparative genomics has unique challenges due to the architecture of genomes in this kingdom of life. Continued accrual of sequence information and annotation will help empower further analysis of these complex organisms.
At both the gene level and the genome level, this integration of plant DNA information will help inform decisions and formulate targets for investigation to gain further insight about plant evolution.

Materials and Methods

Pairwise Alignment

In this analysis, we included only those angiosperm genomes that have been published on previously, following the guidelines of the Ft. Lauderdale agreement on rapid data release. The 20 genomes that were included in this analysis and their version numbers are listed in table 1, sorted by coverage. Total align refers to the number of nucleotides aligned to A. thaliana determined by the mafCoverage software. CDS align refers to the overlap of alignment with existing A. thaliana CDS annotation as determined by intersections using the featureBits software; both programs are part of the UCSC source tree (Kent et al. 2002).

A pairwise alignment pipeline was used to generate whole-genome alignments against a version 9 A. thaliana reference genome sequence assembled by TAIR (Swarbreck et al. 2008). All sequences were obtained as scaffolds or pseudochromosomes from the web repositories of the respective sequencing groups with the exception of the G. soja genome. Sequence data for G. soja were obtained from the Sequence Read Archive and mapped to G. max using Maq (Li et al. 2008), following the methods of the sequencing group (Kim et al. 2010). Masking was employed to remove lineage-specific repetitive regions, resulting in improved BLAST results; this was accomplished using the RepeatMasker (Smit and Hubley 2004) software suite. Each query genome was split into regions of 1 million base pairs or less, whereas the reference genome, A. thaliana, was split into its seven pseudochromosomes. The alignment then proceeded using the LASTZ program (Harris 2007), a local alignment algorithm optimized for whole-genome alignment, which locally compared the A. thaliana reference genome sequence against all sequences in each query genome.
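The query-splitting step described above amounts to tiling each scaffold or pseudochromosome with windows of at most 1 Mb. The sketch below shows one way to compute such windows; the function name is hypothetical, and non-overlapping, half-open, zero-based windows are an assumption, since the text does not say whether adjacent pieces overlapped.

```python
# Illustrative sketch only; non-overlapping windows are an assumption.
def chunk_intervals(seq_len, chunk_size=1_000_000):
    """Tile a sequence of length seq_len with half-open windows of at most chunk_size bases."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, seq_len))
        for start in range(0, seq_len, chunk_size)
    ]


if __name__ == "__main__":
    # A hypothetical 2.5 Mb scaffold yields two full windows and one partial window.
    for start, end in chunk_intervals(2_500_000):
        print(start, end, end - start)
```

Each window can then be aligned as an independent job, which is what makes the cluster parallelization straightforward.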
This process was parallelized across a computer cluster to efficiently generate alignments from large data sets. LASTZ output relating query to reference was then linked into longer chains of contiguous alignment using axtChain (Kent et al. 2003). The alignment chains were sorted using chainNet, which retains only the single best-aligned chain and maximizes coverage across the reference genome. The nets were then converted to multiple alignment files. The resulting pairwise alignments of each query genome to the A. thaliana reference were joined using MULTIZ (Blanchette et al. 2004) and guided by the tree topology in figure 1. Postprocessing of the alignments included inserting annotations for alignment breaks and gaps using the mafAddIrows tool, and identifying regions removed by RepeatMasker.

Evaluating Alignment Quality and Refining Parameters

To determine whether the alignment process is producing reliable sequence relationships, alignments were evaluated based on base coverage and on annotation-specific quality. To measure the number of raw bases aligned, the number of exact base matches, the number of base mismatches, and the coverage, the mafCoverage program (part of the UCSC source tree) was used. Starting with default parameters for all programs, each step in the alignment process was tuned to maximize coverage and minimize mismatch. The LASTZ alignment algorithm proved to be robust; even without any tuned parameters, the software produced pairwise alignments between A. thaliana and A. lyrata with coverage only 5% less than pairwise alignments made by the A. lyrata genome sequencing project. The final LASTZ parameters were as follows for all alignments: inner = 2,000, xdrop = 9,400, gappedthresh = 3,000, hspthresh = 2,200. Using these parameters, for example, coverage was increased by 4% between Ath/Aly compared with the default parameter baseline.

Evaluating Gene Quality

To judge the conservation of known A.
thaliana gene features in pairwise aligned genomes beyond simple coverage numbers, the program cleanGenes (part of the PHAST package) was used to evaluate feature conservation. Using version 9 genome annotation from TAIR, gene-feature coordinates were extracted and located in pairwise alignments, and conservation was assessed. The types of features evaluated included start sites, stop sites, splice sites, frameshifts, and nonsense mutations; these were searched for “cleanly” conserved exons without gaps or mutations. Features were tallied as passing or failing after evaluating each feature’s conservation in a pairwise alignment. To compare our alignment pipeline to previously available whole-genome plant alignments created for the VISTA genome browser, TAIR version 8 A. thaliana sequence was aligned to A. lyrata, so that we could benchmark our alignments against previously released publicly available alignments. Additional more recent alignments of A. thaliana/V. vinifera and A. thaliana/G. max, which utilized a TAIR10 reference, were used for additional comparisons. VISTA alignments were incompatible with genome browser tools; thus, the mfa alignments were converted into MAF block format using a custom Python script. The resulting MAF format alignments for Ath/Aly were uploaded to a TAIR version 8 genome browser as browser MAF tracks. This allowed mafCoverage, as well as cleanGenes results from VISTA alignments, to be directly compared with alignments from the LASTZ/MULTIZ pipeline.

Scoring Conservation

To compute conservation tracks for the multiple alignment, phyloFit (Siepel and Haussler 2004) (a component of the PHAST package) was used to fit a phylogenetic model to 4-fold-degenerate sites found on each chromosome of the MULTIZ alignment as an initial starting sample (as described in the PHAST documentation).
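For readers unfamiliar with the term, a 4-fold-degenerate site is a third codon position at which any of the four bases encodes the same amino acid. The sketch below locates such sites in a single in-frame coding sequence. It is only an illustration: it is not how PHAST extracts these sites from an alignment, the function name is hypothetical, and it assumes the standard nuclear genetic code, in which eight two-base codon prefixes are 4-fold degenerate.

```python
# Illustrative sketch only; assumes the standard genetic code and an in-frame CDS.
# Codon prefixes whose third position is 4-fold degenerate:
# Ala GCN, Arg CGN, Gly GGN, Leu CTN, Pro CCN, Ser TCN, Thr ACN, Val GTN.
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CT", "CC", "TC", "AC", "GT"}


def fourfold_sites(cds):
    """Return zero-based positions of 4-fold-degenerate third codon positions in cds."""
    cds = cds.upper()
    sites = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        # Skip codons with ambiguous or gap characters in the third position.
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            sites.append(i + 2)
    return sites


if __name__ == "__main__":
    # ATG | GCT | CTG | TAA : the third bases of GCT and CTG qualify.
    print(fourfold_sites("ATGGCTCTGTAA"))
```

Because substitutions at these positions do not change the protein, they are commonly treated as a proxy for neutral evolution when fitting the initial tree.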
The resulting phylogenetic model was used in conjunction with the phastCons (Siepel et al. 2005) tool to create conserved and nonconserved phylogenetic trees. The phastCons program requires several iterations to refine parameters that predict conservation as part of its phylo-HMM. Starting with parameters for expected coverage and expected length gathered from a previous conservation analysis focusing on Solanaceae (Wang et al. 2008), the phastCons run was tuned to fit predefined criteria. Similar to previous studies analyzing conservation, our criteria were 60% coverage of the annotated coding regions by predicted conserved elements, as well as a phylogenetic information threshold score close to 10 bits measured by the consEntropy software. The resulting parameters that fit our criteria were an expected coverage of 0.2 and an expected length of 80. Wig format data files were used to create a conservation track on the A. thaliana genome browser, which visualizes conservation scores as a continuous variable. Resulting conserved region lengths are graphed in supplementary figure S3, Supplementary Material online.

Conserved regions were classified using A. thaliana annotation tracks based on TAIR version 9 GFF files. Intersections and enrichment values of the annotation tracks versus the conserved region track were obtained using the featureBits command line tool, part of the UCSC source tree. Significance was determined by Fisher’s exact test, using values gathered by featureBits, to determine whether certain groups were over-represented versus the normal composition. Normal composition of the A. thaliana genome was determined using the same methodology. Vertebrate conserved element enrichment was determined using featureBits and the phastConsElements46way track with annotation drawn from UCSC (Fujita et al. 2011).

To evaluate and compare the conservation of specific genome features between species, the tool maf_interval_alignability.py was employed.
This tool, part of the bx-python package and utilized in Miller et al. (2007), scores alignments to annotated features by measuring presence or absence of aligned sequence. Specifically, for each interval the program tabulates the number of bases covered by a query species and the number of bases that have missing alignment information. The alignability value is the number of bases with alignment divided by the total number of positions with and without alignment. The graph in figure 3F displays mean values of alignability for annotation groups in select species. The columns of mean alignability values for each species are then scaled based on phylogenetic distance, as determined by substitutions per site drawn from a 4-fold degenerate neutral tree. Trend lines for vertebrate data were recapitulated using the latest alignment information drawn from the phastCons46way alignment (Fujita et al. 2011) to confirm the previously observed pattern. Annotation for cis-regulatory sites in the 46way alignment was drawn from the ORegAnno track annotation (Griffith et al. 2008).

Building the Browser

To visualize alignments and make use of the collection of browser genomics tools, a mirror of the UCSC genome browser (Kent et al. 2002) was installed at local facilities and remains available at genome.genetics.rutgers.edu. The focus of this browser is to host comparative genomics data for Drosophila and plant species. We selected Oryza sativa and A. thaliana as reference genome browsers for monocots and eudicots, respectively, due to their extensive annotation and high-quality pseudochromosomes. Development has focused on the A. thaliana browser tracks as a prototype for a plant comparative genomics browser. These tracks include a bed file-based display of regions identified by RepeatMasker as repetitive sequence and gene tracks based on known genome annotations.
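A repeat track of this kind can be derived from RepeatMasker output with a small conversion step. The Python sketch below is one hedged way to do it, not the script used for the browser: it assumes the usual whitespace-separated RepeatMasker .out layout (score, three percentage columns, query sequence, begin, end, bases left, strand, repeat name, class/family), and the function name `repeatmasker_out_to_bed` and the choice of the name field are illustrative.

```python
def repeatmasker_out_to_bed(lines):
    """Convert RepeatMasker .out lines into BED6 records (tab-separated strings).

    Assumed column layout (0-based field index): 0 = SW score, 4 = query
    sequence, 5 = begin, 6 = end, 8 = strand ('+' or 'C'), 9 = repeat name,
    10 = repeat class/family.
    """
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines; data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based inclusive; BED is 0-based half-open
        end = int(fields[6])
        strand = "-" if fields[8] == "C" else "+"
        name = f"{fields[9]}|{fields[10]}"
        score = min(int(fields[0]), 1000)  # BED scores range from 0 to 1000
        records.append("\t".join([chrom, str(start), str(end), name, str(score), strand]))
    return records
```

The only coordinate change is the start position, which moves from 1-based to 0-based; the end position is unchanged because BED intervals are half-open.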
Specifically, the foundation of the browser is gff3 format annotation created by TAIR, filtered for a single coverage of genes across each genome, and then converted to gene prediction (genePred) format and uploaded to the browser MySQL database. An alternative gene prediction track, created using Gnomon gene prediction software as part of a recent TAIR release (Lamesch et al. 2011), was also included as part of the browser. Cis-regulatory elements were predicted based on regular expressions of A. thaliana transcription factor binding sites as listed in the AGRIS cis-regulatory database (Yilmaz et al. 2011). To create a browser-compatible track of elements, putative binding sites were called using grep in TAIR version 9 chromosome sequences and then mapped using BLAT (Kent 2002). The resulting coordinates were formatted into an extended bed genome-browser track, labeling the type of motif and its coordinates.

Identifying Uninterrupted Conservation

To locate regions within the A. thaliana genome that are also found in all other sequenced and aligned angiosperm genomes in an uninterrupted block, a Python program based on mafUltras was written. The mafUltras software was used to identify ultraconserved elements in vertebrates and has been adapted for use here with the 20way alignment. Unlike a phastCons conservation analysis, this search method is dependent on a user-defined threshold; specifically, the threshold is the minimum length of uninterrupted alignment columns. When searching the human genome, this threshold was defined as 100 bp. To maximize inclusion of highly conserved elements and account for the overall shorter length of plant conserved elements, this threshold was set to 18 bp for the search in the 20way alignment. This was chosen because, in general, angiosperm-conserved regions are substantially shorter than their mammalian counterparts (supplementary fig.
S4, Supplementary Material online) and because 18 bp is the shortest length expected for a noncoding RNA. We expect that lowering the threshold from the 100 bp used for mammals to 18 bp makes the search inclusive rather than exclusive. Detected elements were sorted according to their overlap with known A. thaliana annotation from TAIR. Regions that did not map to known annotation were annotated de novo based on BLAST homology.

Computing Secondary Structure

Prediction of the folding structure of RNA molecules can be informed by conservation between related genomes. To identify secondary structure conservation, EvoFold was used to predict folding given a MAF block and a phylogenetic tree. As with similar EvoFold studies (Pedersen et al. 2006; Stark et al. 2007), conserved regions predicted by phastCons were first joined to any neighboring conserved region at a distance no greater than 30 bp. These extended regions were subsequently split into lengths no greater than 750 bp. MAF blocks were extracted from the 20way alignment using the mafFrag utility, part of the UCSC software package, and postprocessed to be compatible with EvoFold alignment format. The 20 species Newick tree used for all EvoFold runs was sourced from the phastCons conservation run. EvoFold predictions were distributed across a compute cluster using default values in the control file provided in the EvoFold source code. Folds with scores below an LOD of 100 and folds overlapping repetitive elements were filtered out. Result files were formatted into a BED6 structure, adhering to the file format used for previously implemented UCSC genome browser EvoFold tracks, and uploaded to a MySQL database for use as a browser track. Resulting fold lengths are graphed in supplementary figure S3, Supplementary Material online.

A similar approach was taken to predict secondary structure in RNAs using the RNAalifold algorithm, part of the Vienna RNA package (Hofacker et al. 2002; Bernhart et al. 2008).
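The region preprocessing shared by both folding predictors (joining conserved elements separated by no more than 30 bp, then splitting the joined regions into pieces of at most 750 bp) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: coordinates are taken to be 0-based half-open intervals on a single chromosome, the gap is measured from the end of one region to the start of the next, and long regions are cut into consecutive fixed-size windows, a detail the text does not specify. The function name `merge_and_split` is hypothetical.

```python
def merge_and_split(regions, max_gap=30, max_len=750):
    """Join nearby regions, then cut the results into bounded-length pieces.

    regions: iterable of (start, end) tuples on one chromosome,
             0-based half-open (an assumption of this sketch).
    max_gap: regions separated by at most this many bp are joined.
    max_len: maximum length of any returned piece.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    pieces = []
    for start, end in merged:
        # Consecutive windows of max_len; the last window may be shorter.
        for s in range(start, end, max_len):
            pieces.append((s, min(s + max_len, end)))
    return pieces
```

With the default thresholds, regions (0, 100) and (120, 200) are joined because they are 20 bp apart, whereas a region starting 100 bp further downstream stays separate.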
The RNAalifold analysis used the same data set described earlier, which contains conserved elements identified by phastCons and processed for length and format. Results were filtered using the same thresholds as above and postprocessed to create a genome browser track. Secondary structure enrichment was found by intersecting the EvoFold annotation track with other annotation tracks using the featureBits command line tool. A final browser track was created containing the composite scores of both independent prediction methods. This data set was used for the results in figure 4. The control data set used to verify the accuracy of predictions was sourced from TAIR annotation of tRNAs, which produced 658 annotations. Predicted fRNAs overlap 637 of the 658 annotations (96.8%). The enrichment of predicted fRNAs in the set of existing annotation for the A. thaliana genome can be seen in figure 3E. As would be expected, translation-related RNAs (including tRNAs, rRNAs, snRNAs, and snoRNAs) are significantly enriched for folding regions, with more than four times the enrichment (17.52x) of the next nearest category, regulatory RNAs (4.08x).

Annotating Conserved Noncoding Regions

To characterize unannotated conserved regions scored by phastCons as most conserved within flowering plants, we relied on BLAST-based homology searches with default search parameters. Annotation focused on the top 10% of the distribution of most-conserved elements, so as to limit a considerably large data set to only the most highly conserved regions. A first-pass search for homology was performed using the BLAST algorithm to scan TAIR version 10 genome-wide annotation. BLAST results from this first-pass search were parsed using a custom script, which extracted the top-scoring search term for any result with an e-value of 0.1 or less. Regions with no homology within known A.
thaliana annotation were then searched for homology to any known plant annotation contained in the Plant Genome Database (Duvick et al. 2007), using an e-value cutoff of 0.1 or less. In each case, the top BLAST search term was used as its tentative annotation. Further annotation was achieved by intersecting bed files containing coordinates of conserved regions annotated by BLAST homology with secondary structure browser tracks, proximity to exons, and existing small RNA expression databases sourced from the ASRP (Backman et al. 2008). Exon proximity was evaluated by searching for coordinates within 164 bp of an annotated exon, 164 bp being the average intron length in A. thaliana.

Programming and Data

All programs were written in Python and C. All custom software used in the development and analyses is available upon request. All data sets of conserved elements and annotations have been made available as files and tracks on the A. thaliana genome browser (araTha9) located at genome.genetics.rutgers.edu.

Supplementary Material

Supplementary tables S1 and S2 and figures S1–S5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

This work was supported by two grants to A.D.K., NSF MCB-1052148 and DOE/USDA 124336, as well as the Human Genetics Institute of New Jersey.

References

Acarkan A, Rossberg M, Koch M, Schmidt R. 2000. Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23:55–62. Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. Argout X, Salse J, Aury J-M, et al. (61 co-authors). 2011. The genome of Theobroma cacao. Nat Genet. 43:101–108. Backman TWH, Sullivan CM, Cumbie JS, Miller ZA, Chapman EJ, Fahlgren N, Givan SA, Carrington JC, Kasschau KD. 2008. Update of ASRP: the Arabidopsis small RNA project database. Nucleic Acids Res. 36:D982–D985. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. 2004. Ultraconserved elements in the human genome. Science 304:1321–1325. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. 2008. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9:474. Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14:708–715. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative Sequencing Program, Green ED, Sidow A, Batzoglou S. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13:721–731. Brudno M, Poliakov A, Minovitsky S, Ratnere I, Dubchak I. 2007. Multiple whole genome alignments and novel biomedical applications at the VISTA portal. Nucleic Acids Res. 35:W669–W674. Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR. 2007. Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol. 145:1311–1322. Chan AP, Crabtree J, Zhao Q, et al. (18 co-authors). 2010. Draft genome sequence of the oilseed species Ricinus communis. Nat Biotechnol. 28:951–956. Chiang CWK, Derti A, Schwartz D, Chou MF, Hirschhorn JN, Wu C-T. 2008. Ultraconserved elements: analyses of dosage sensitivity, motifs and boundaries. Genetics 180:2277–2293. Cooper GM, Brudno M, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. 2003. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13:813–820. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V. 2004. Darwin’s abominable mystery: insights from a supertree of the angiosperms. Proc Natl Acad Sci U S A. 101:1904–1909.
Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, et al. (418 co-authors). 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450:203–218. Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10:1304–1306. Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, Lushbough C, Brendel V. 2007. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 36:D959–D965. Eddy SR. 2005. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3:e10. Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. 2004. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32:W273–W279. Freeling M, Subramaniam S. 2009. Conserved noncoding sequences (CNSs) in higher plants. Curr Opin Plant Biol. 12:126–132. Friedman RC, Farh KK-H, Burge CB, Bartel DP. 2009. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19:92–105. Fujita PA, Rhead B, Zweig AS, et al. (27 co-authors). 2011. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 39:D876–D882. Gebhardt C, Walkemeier B, Henselewski H, Barakat A, Delseny M, Stuber K. 2003. Comparative mapping between potato (Solanum tuberosum) and Arabidopsis thaliana reveals structurally conserved domains and ancient duplications in the potato genome. Plant J. 34:529–541. Giuliano G, Pichersky E, Malik VS, Timko MP, Scolnik PA, Cashmore AR. 1988. An evolutionarily conserved protein binding sequence upstream of a plant light-regulated gene. Proc Natl Acad Sci U S A. 85:7089–7093. Glazov EA, Pheasant M, McGraw EA, Bejerano G, Mattick JS. 2005. Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Res. 15:800–808. Goff SA, Ricke D, Lan T-H, et al. (55 co-authors). 2002.
A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100. Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL. 2010. De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol. 28:9–19. Griffith O, Montgomery SB, Bernier B, et al. (27 co-authors). 2008. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36:D107–D113. Guo H. 2003. Conserved Noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell 15:1143–1158. Hare PD, Moller SG, Huang L-F, Chua N-H. 2003. LAF3, a novel factor required for normal phytochrome A signaling. Plant Physiol. 133: 1592–1604. Harris RS. 2007. Improved pairwise alignment of genomic DNA. [PhD thesis]. [University Park (PA)]: The Pennsylvania State University. Hofacker IL, Fekete M, Stadler PF. 2002. Secondary structure prediction for aligned RNA sequences. J Mol Biol. 319:1059–1066. Hu TT, Pattyn P, Bakker EG, et al. (30 co-authors). 2011. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet. 43:476–481. Huang S, Li R, Zhang Z, et al. (96 co-authors). 2009. The genome of the cucumber, Cucumis sativus. L. Nat Genet. 41:1275–1281. Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform. 12:41–51. Hudson ME, Quail PH. 2003. Identification of promoter motifs involved in the network of phytochrome A-regulated gene expression by combined analysis of genomic sequence and microarray data. Plant Physiol. 133:1605–1616. Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M. 2003. Conserved noncoding sequences in the grasses. Genome Res. 13: 2030–2041. Kaplinsky NJ. 2002. Utility and distribution of conserved noncoding sequences in the grasses. Proc Natl Acad Sci U S A. 99:6147–6151. 
Katzman S, Kern AD, Bejerano G, Fewell G, Fulton L, Wilson RK, Salama SR, Haussler D. 2007. Human genome ultraconserved elements are ultraselected. Science 317:915. Kellis M, Patterson N, Endrizzi M, Birren B. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254. Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12:656–664. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 100:11484–11489. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12:996–1006. Kim J, He X, Sinha S. 2009. Evolution of regulatory sequences in 12 Drosophila species. PLoS Genet. 5:e1000330. Kim MY, Lee S, Van K, et al. (29 co-authors). 2010. Whole-genome sequencing and intensive analysis of the undomesticated soybean (Glycine soja Sieb. and Zucc.) genome. Proc Natl Acad Sci U S A. 107:22032–22037. Kindgren P, Kremnev D, Blanco NE, de Dios Barajas López J, Fernández AP, Tellgren-Roth C, Small I, Strand A. 2011. The plastid redox insensitive 2 mutant of Arabidopsis is impaired in PEP activity and high light-dependent plastid redox signalling to the nucleus. Plant J. 70:279–291. Kritsas K, Wuest SE, Hupalo D, Kern AD, Wicker T, Grossniklaus U. 2012. Computational analysis and characterization of UCE-like elements (ULEs) in plant genomes. Genome Res. 22:2455–2466. Ku HM, Vision T, Liu J, Tanksley SD. 2000. Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc Natl Acad Sci U S A. 97:9121–9126. Lamesch P, Berardini T, Li D. 2011. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 21:1–9. Lenz D, May P, Walther D. 2011.
Comparative analysis of miRNAs and their targets across four plant species. BMC Res Notes. 4:483. Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18: 1851–1858. Michaud M, Cognat V, Duchêne A-M, Maréchal-Drouard L. 2011. A global picture of tRNA genes in plant genomes. Plant J. 66:80–93. Miller W, Rosenbloom K, Hardison RC, et al. (26 co-authors). 2007. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17:1797–1808. Ming R, Hou S, Feng Y, et al. (85 co-authors). 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452:991–996. Morrell PL, Buckler ES, Ross-Ibarra J. 2011. Crop genomics: advances and applications. Nat Rev Genet. 13:85–96. Paterson AH, Bowers JE, Bruggmann R, et al. (45 co-authors). 2009. The Sorghum bicolor genome and the diversification of grasses. Nature 457:551–556. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2:e33. Retzel EF, Johnson JE, Crow JA, Lamblin AF, Paule CE. 2007. Legume resources: MtDB and Medicago.Org. Methods Mol Biol. 406:261–274. Rhead B, Karolchik D, Kuhn RM, et al. (20 co-authors). 2010. The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 38: D613–D619. Sato S, Nakamura Y, Kaneko T, et al. (29 co-authors). 2008. Genome structure of the legume, Lotus japonicus. DNA Res. 15:227–239. Schmutz J, Cannon SB, Schlueter J, et al. (45 co-authors). 2010. Genome sequence of the palaeopolyploid soybean. Nature 463:178–183. Schmidt R. 2002. Plant genome evolution: lessons from comparative genomics at the DNA level. Plant Mol Biol. 48:21–37. Schnable PS, Ware D, Fulton RS, et al. (157 co-authors). 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326: 1112–1115. 
Shulaev V, Sargent DJ, Crowhurst RN, et al. (71 co-authors). 2011. The genome of woodland strawberry (Fragaria vesca). Nat Genet. 43:109–116. Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors). 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034–1050. Siepel A, Haussler D. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 21:468–488. Smit A, Hubley R. 2004. RepeatMasker Open-3.0. 1996–2010 [Internet]. Institute for Systems Biology. Available from: http://www.repeatmasker.org Sommer RJ, Ogawa A. 2011. Hormone signaling and phenotypic plasticity in nematode development and evolution. Curr Opin Biol. 21:R758–66. Stark A, Lin MF, Kheradpour P, et al. (44 co-authors). 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450:219–232. Stojanovic N. 2009. A study of the distribution of phylogenetically conserved blocks within clusters of mammalian homeobox genes. Genet Mol Biol. 32:666–673. Sultan SE. 2000. Phenotypic plasticity for plant development, function and life history. Trends Plant Sci. 5:537–542. Swarbreck D, Wilks C, Lamesch P, et al. (16 co-authors). 2008. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36:D1009–D1014. Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008. Synteny and collinearity in plant genomes. Science 320:486–488. Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH. 2008. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18:1944–1954. Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M. 2007. Arabidopsis intragenomic conserved noncoding sequence. Proc Natl Acad Sci U S A. 104:3348–3353. Tuskan GA, Difazio S, Jansson S, et al. (110 co-authors). 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604.
Velasco R, Zharkikh A, Affourtit J, et al. (86 co-authors). 2010. The genome of the domesticated apple (Malus domestica Borkh.). Nat Genet. 42:833–839. Velasco R, Zharkikh A, Troggio M, et al. (57 co-authors). 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One 2:e1326. Vogel J, Garvin D, Mockler T, Schmutz J. 2010. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463: 763–768. Wang X, Haberer G, Mayer KF. 2009. Discovery of cis-elements between sorghum and rice using co-expression and evolutionary conservation. BMC Genomics 10:284. Wang XX, Wang HHH, Wang JJJ, et al. (110 co-authors). 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat Genet. 43:1035–1039. Wang Y, Diehl A, Wu F, Vrebalov J, Giovannoni J, Siepel A, Tanksley SD. 2008. Sequencing and comparative analysis of a conserved syntenic segment in the Solanaceae. Genetics 180:391–408. Xu X, Pan S, Cheng S, et al. (98 co-authors). 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195. Yang X, Jawdy S, Tschaplinski T. 2009. Genome-wide identification of lineage-specific genes in Arabidopsis, Oryza and Populus. Genomics 93:473–480. Yang Y-W, Lai K-N, Tai P-Y, Li W-H. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol. 48:597–604. Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, Grotewold E. 2011. AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res. 39:D1118–D1122. Zeller G, Henz SR, Widmer CK, Sachsenberg T, Rätsch G, Weigel D, Laubinger S. 2009. Stress-induced changes in the Arabidopsis thaliana transcriptome analyzed using whole-genome tiling arrays. Plant J. 58:1068–1082. Zhang B, Pan X, Cannon C, Cobb G. 2006. Conservation and divergence of plant microRNA genes. Plant J. 46:243–259. Zheng W-X, Zhang C-T. 2008. 
Ultraconserved elements between the genomes of the plants Arabidopsis thaliana and rice. J Biomol Struct Dyn. 26:1–8.