Conservation and Functional Element Discovery in
20 Angiosperm Plant Genomes
Daniel Hupalo*,1 and Andrew D. Kern2
1 Department of Biological Sciences, Dartmouth College, Hanover, New Hampshire
2 Department of Genetics, Rutgers University
*Corresponding author: E-mail: [email protected].
Associate editor: Hideki Innan
Abstract
Here, we describe the construction of a phylogenetically deep, whole-genome alignment of 20 flowering plants, along
with an analysis of plant genome conservation. Each included angiosperm genome was aligned to a reference genome,
Arabidopsis thaliana, using the LASTZ/MULTIZ paradigm and tools from the University of California–Santa Cruz Genome
Browser source code. In addition to the multiple alignment, we created a local genome browser displaying multiple tracks
of newly generated genome annotation, as well as annotation sourced from published data of other research groups.
An investigation into A. thaliana gene features present in the aligned A. lyrata genome revealed better conservation of
start codons, stop codons, and splice sites within our alignments (51% of features from A. thaliana conserved without
interruption in A. lyrata) when compared with previous publicly available plant pairwise alignments (34% of features
conserved). The detailed view of conservation across angiosperms revealed not only high coding-sequence conservation
but also a large set of previously uncharacterized intergenic conservation. From this, we annotated the collection of
conserved features, revealing dozens of putative noncoding RNAs, including some with recorded small RNA expression.
Comparing conservation between kingdoms revealed a faster decay of vertebrate genome features when compared with
angiosperm genomes. Finally, conserved sequences were searched for folding RNA features, including but not limited
to noncoding RNA (ncRNA) genes. Among these, we highlight a double hairpin in the 5′-untranslated region (5′-UTR) of
the PRIN2 gene and a putative ncRNA with homology targeting the LAF3 protein.
Key words: Arabidopsis, alignment, conservation, comparative genomics, ultraconserved elements, angiosperm,
RNA folding.
Introduction
© The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 30(7):1729–1744 doi:10.1093/molbev/mst082 Advance Access publication May 2, 2013
Within the past decade, a flood of whole-genome data has
enabled a comparative genomics approach to functional element discovery. The construction of phylogenetically deep,
whole-genome multiple alignments in models such as
humans (Miller et al. 2007; Rhead et al. 2010; Fujita et al.
2011), Drosophila (Drosophila 12 Genomes Consortium
et al. 2007), and yeast (Kellis et al. 2003) has allowed the
research community to understand each genome in a comparative framework. These alignments have bridged annotation between similar species, and subsequent investigations
in each individual organism have utilized these resources
to discover a variety of functional genomic elements and
genome characteristics (Pedersen et al. 2006; Stark et al.
2007; Friedman et al. 2009; Kim et al. 2009; Stojanovic 2009).
Comparative genomic methods that use sequence similarity, protein alignments, and whole-genome alignments
between two and five species have been widely applied by
plant scientists to rice and Arabidopsis. Initially, these investigations into angiosperms focused primarily on synteny
relationships between species (Acarkan et al. 2000; Ku et al.
2000; Gebhardt et al. 2003; Tang, Bowers, et al. 2008; Tang,
Wang, et al. 2008), but have subsequently expanded into
observations of lineage-specific protein-coding genes
(Campbell et al. 2007; Yang et al. 2009), RNA genes
(Michaud et al. 2011), miRNAs (Zhang et al. 2006; Lenz
et al. 2011), and of particular note, conserved noncoding sequences (Kaplinsky 2002; Guo 2003; Inada et al. 2003; Thomas
et al. 2007; Wang et al. 2009; Kritsas et al. 2012). As the availability of sequenced species increases, comparative genomics
in plants may now be performed using the same powerful
frameworks and methodologies that have been applied to
other model systems.
The wealth of genetic resources available for work on
Arabidopsis thaliana, combined with its compact genome,
has made it the prime target for comparative genomics
research within plants (Schmidt 2002). Currently, there exist
dozens of sequenced angiosperm genomes, along with a
large number of sequenced Arabidopsis genomes. This
wealth of data, in conjunction with the detailed molecular
biological characterization of plant genes available from
The Arabidopsis Information Resource (TAIR) (Lamesch
et al. 2011), has the potential to reveal a more complete
set of functional elements in the A. thaliana genome through
the use of sequence comparison. A major motivation for this research is the need to transfer biological knowledge gained from the study of Arabidopsis to agricultural plants
(Morrell et al. 2011); comparative genomics can be a potent
tool toward these ends.
For some time now, many pairwise and small-scale multiple plant genome alignments have been available, mainly
based on the VISTA comparative genomics pipeline
(Dubchak et al. 2000; Frazer et al. 2004). This system has
utilized the LAGAN alignment tool (Brudno et al. 2003) to
generate dozens of Arabidopsis-based pairwise alignments,
as well as create five-way multiple alignments in model
organisms (Brudno et al. 2007). Yet, to our knowledge, no attempt
has been made to create or analyze a deep, merged
data set that can assess general conservation across genera,
comparable to the treatment seen in all other kingdoms of life.
To address this, we have used the University of California–
Santa Cruz (UCSC) source tree (Kent et al. 2002) in combination with a LASTZ/MULTIZ paradigm (Blanchette et al.
2004; Harris 2007) to create a 20way plant alignment that
reaches nearly to single-nucleotide resolution of conservation;
we have provided that information in its entirety to the
plant community via a plant genome browser available at
genome.genetics.rutgers.edu.
A major goal for our research is to characterize patterns
of global conservation within angiosperms and leverage
conservation data for functional element discovery.
A recent analysis of a 105 kb syntenic segment of sequence
between five Solanaceae demonstrated that measuring
the conservation of DNA in plants can be a potent method
of investigation for coding and noncoding sequence (Wang
et al. 2008). This look into the nightshade family, along
with investigations in fruit flies (Drosophila 12 Genomes
Consortium et al. 2007), humans (Miller et al. 2007; Rhead
et al. 2010; Fujita et al. 2011), and yeast (Kellis et al. 2003), has
made clear the utility of a comparative genomics perspective
on genome function. Identifying and combining conserved
regions of the A. thaliana genome with known annotation
from the plant community will help identify novel highly
conserved features and provide insight into contrasting
evolutionary histories among the kingdoms of life.
Results
Alignment of Angiosperms to an A. thaliana
Reference Genome
We have assembled the largest comparative genomic data set
in plants to date, using whole-genome sequence data spanning the breadth of flowering plants. The choice of species to
include in the alignment was based on data availability
and, in some cases, on simplicity of genome architecture.
The wheat genome, for example, was excluded due to its
size and complexity. The included species span all angiosperms, with representatives from four monocot Poaceae
(Goff et al. 2002; Paterson et al. 2009; Schnable et al. 2009;
Vogel et al. 2010), as well as 16 eudicots including four
Brassicales (Arabidopsis Genome Initiative 2000; Ming et al.
2008; Hu et al. 2011; Wang et al. 2011), one Malvale (Argout
et al. 2011), two Malpighiales (Tuskan et al. 2006; Chan et al.
2010), four Fabales (Retzel et al. 2007; Sato et al. 2008; Kim et al.
2010; Schmutz et al. 2010), one Cucurbitale (Huang et al. 2009),
two Rosales (Velasco et al. 2010; Shulaev et al. 2011), one Vitale
(Velasco et al. 2007), and one Solanaceae (Xu et al. 2011).
The common names and genome details can be reviewed in
table 1, sorted by their alignment coverage of A. thaliana.
Table 1. Species Information and Alignment Coverage for Each Included Species in the 20way Comparison.

Name                             Common Name         Type               Nucleotides  Assembly Date         Total Align (%)  CDS Align (%)  Subs/Site(a)
Arabidopsis thaliana             Thale cress         Pseudochromosomes  119 Mbp      February 2009 TAIRv9  —                —              —
Arabidopsis lyrata               Lyrate rockcress    Pseudochromosomes  206 Mbp      May 2011              77.82            98.16          0.09
Brassica rapa                    Chinese cabbage     Scaffold           274 Mbp      August 2011           65.85            96.78          0.35
Carica papaya Linnaeus           Papaya              Scaffold           342 Mbp      December 2007         34.84            79.70          1.04
Theobroma cacao                  Cocoa               Scaffold           290 Mbp      August 2010           36.59            83.28          1.11
Vitis vinifera                   Grape               Pseudochromosomes  497 Mbp      March 2010            32.21            75.93          1.19
Populus trichocarpa              Poplar              Scaffold           417 Mbp      March 2011            35.41            81.40          1.21
Malus domestica Borkh.           Apple               Scaffold           881 Mbp      November 2009         35.07            80.89          1.24
Ricinus communis                 Castor bean         Scaffold           350 Mbp      February 2009         34.31            80.78          1.25
Fragaria vesca                   Strawberry          Scaffold           214 Mbp      June 2010             34.43            80.30          1.28
Glycine max                      Soybean             Pseudochromosomes  973 Mbp      January 2010          36.77            78.67          1.33
Glycine soja                     Wild soybean        Sequence Reads     973 Mbp      —                     36.06            78.94          1.33
Lotus japonica                   Birdsfoot trefoil   Pseudochromosomes  301 Mbp      May 2008              25.92            62.84          1.43
Cucumis sativus var. sativus L.  Cucumber            Scaffold           203 Mbp      January 2010          32.54            76.79          1.47
Medicago truncatula              Clover              Scaffold           307 Mbp      August 2007           28.40            67.49          1.51
Solanum tuberosum                Potato              Scaffold           727 Mbp      July 2011             34.88            77.77          1.52
Sorghum bicolor                  Sorghum             Pseudochromosomes  738 Mbp      January 2007          26.24            63.93          1.92
Oryza sativa L. ssp. Japonica    Rice                Pseudochromosomes  373 Mbp      January 2009          25.33            63.39          1.94
Brachypodium distachyon          Purple false brome  Scaffold           271 Mbp      December 2009         25.37            63.68          1.95
Zea mays ssp. Mays               Corn                Pseudochromosomes  2.06 Gbp     March 2010            25.96            62.93          1.96

(a) The “Substitutions per Site” column lists the divergence from A. thaliana based on the neutral tree of figure 1.
There is a diverse set of methods available for whole-genome alignment, including both open-source and commercial packages. Our goal was both to create a deep multiple
alignment and to make that data set available for community
use. The success in the alignment of vertebrate genomes, and
their subsequent browsable alignment, demonstrated that
both these goals can be achieved in an integrated open-source manner (Miller et al. 2007; Rhead et al. 2010; Fujita
et al. 2011). Following this example, we created a mirror of the
UCSC genome browser (genome.genetics.rutgers.edu) and
built within its framework databases for multiple plant species. Currently, A. thaliana is the model browser for eudicot
species, using TAIR version 9 annotations; an additional
browser with a TAIR version 8 assembly is provided for legacy support.
Each of the 20 genomes was aligned in a pairwise fashion
using tuned parameters (see Materials and Methods), followed by chaining and conversion to pairwise alignment
files. A phylogenetic tree covering all included species, without branch lengths, was drawn from an angiosperm supertree
(Davies et al. 2004) and used to guide the merging of
pairwise alignments by MULTIZ (Blanchette et al. 2004).
Using the 20-way alignment, branch lengths for a neutral
tree based on 4-fold degenerate sites were computed using
the PHAST package (Hubisz et al. 2011) and are displayed in
figure 1. Other analyses of eudicots and angiosperms have
constructed phylogenies with similar substitutions per site as
those seen in the neutral tree used in this investigation (Yang
et al. 1999; Tang, Wang, et al. 2008).
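
To make the notion of a 4-fold degenerate site concrete, the following Python sketch flags third codon positions at which any of the four nucleotides encodes the same amino acid under the standard genetic code. It is only an illustration: the function name and the example sequence are hypothetical, and the branch lengths reported here were computed with the PHAST package, not with this code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def fourfold_degenerate_positions(cds):
    """Return 0-based indices of third codon positions that are 4-fold degenerate.

    cds is an in-frame coding sequence (hypothetical input); codons that
    contain gaps or ambiguous bases are skipped.
    """
    positions = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = cds[i:i + 3].upper()
        if codon not in CODON_TABLE:
            continue
        prefix = codon[:2]
        # 4-fold degenerate: every third base gives the same amino acid.
        if len({CODON_TABLE[prefix + b] for b in BASES}) == 1:
            positions.append(i + 2)
    return positions


print(fourfold_degenerate_positions("ATGGCTCTGTTTTAA"))
```

In this example only the third positions of GCT (alanine) and CTG (leucine) qualify; columns of the multiple alignment at such positions are the kind of input a neutral model would be fitted to.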
Base pair coverage for the whole genome and for coding
DNA sequence (CDS) regions is presented in table 1 sorted
by divergence from A. thaliana based on a neutral phylogenetic tree. Genomes included in the alignment vary greatly
in terms of genome architecture, sequence quality, size, and
phylogenetic distance from the reference. The coverage
shows generally similar patterns compared with numbers
gathered from mammalian alignments (Miller et al. 2007;
Rhead et al. 2010; Fujita et al. 2011). It is informative to compare alignment coverage at various evolutionary distances
between vertebrate alignments (Miller et al. 2007; Rhead
et al. 2010; Fujita et al. 2011) and plant alignments. For instance, A. thaliana and Brassica rapa are roughly as divergent
as humans and the galago Otolemur garnettii at 0.35 and 0.33
substitutions per 4D-site, respectively. At this level of divergence, our plant alignment shows a greater proportion of
aligned bases (65.8% vs. 44.3% aligned, respectively). Coding
region alignments in this comparison follow suit with 96%
versus 80% aligned base pairs in plants versus animals.
Looking at the most diverged species comparison in our analysis, A. thaliana to Zea mays (1.96 substitutions per site), we
find that its divergence is roughly equal to that
between human and Xenopus tropicalis (1.97 substitutions
per site). In this comparison, vertebrates lose a greater
amount of overall alignment (8% aligned in vertebrates vs. 26% in plants); however,
the vertebrate coding regions are more conserved (87% vs. 62% aligned).
Despite the differences seen in one-to-one comparisons, we
observe a shared pattern that as distance increases, coverage
by whole-genome sequence drops precipitously, bottoming
at roughly 35% across eudicots and 26% across monocots.
Unsurprisingly, protein-coding sequence shows higher
conservation, never dropping below 62%.
Coverage and Gene Feature Comparisons in the
Arabidopsis Genus
The VISTA genome browser has made available for public
use a number of precomputed whole-genome alignments
of plant genomes (Frazer et al. 2004; Brudno et al. 2007).
These alignments range from pairwise up to 4way and
have been used by the scientific community for comparisons
between angiosperm genomes (Swarbreck et al. 2008; Zeller
et al. 2009). We used one of these alignments created by the
VISTA pipeline comparing A. thaliana to A. lyrata as a benchmark for the quality of our pairwise alignments, which
used the LASTZ/MULTIZ and axtChain methodology
(Kent et al. 2003; Blanchette et al. 2004; Harris 2007).
This VISTA A. thaliana vs. A. lyrata (Ath/Aly) alignment
is available on the araTha8 genome browser, along with its
corresponding conservation track and TAIR version 8 gene
annotation.

FIG. 1. A phylogenetic tree of the relationships between species included in the 20way angiosperm alignment and used to guide MULTIZ merging of
pairwise alignments. The neutral tree is based on 4-fold degenerate sites sampled from each chromosome with branches proportional to the listed scale,
with substitutions per site determined by the PhyloFit software. Average trees for conserved and nonconserved regions with branch lengths are available
in supplementary figure S5, Supplementary Material online.
[Tree image not reproduced. Scale bar: 0.1 substitutions per site; clades labeled Eudicot and Monocot. Tip labels in plotted order: Malus x domestica, Fragaria vesca, Cucumis sativus, Medicago truncatula, Lotus japonica, Glycine max, Glycine soja, Ricinus communis, Populus trichocarpa, Arabidopsis thaliana, Arabidopsis lyrata, Brassica rapa, Carica papaya, Theobroma cacao, Vitis vinifera, Solanum tuberosum, Sorghum bicolor, Zea mays, Oryza sativa, Brachypodium distachyon.]

To evaluate nucleotide coverage, several types of base
alignment were measured, including the number of exact
base pair matches, the number of mismatched nucleotides,
the number of gaps, RepeatMasked regions, and regions
where no relationship between the two genomes was assigned, which is equivalent to a gap (fig. 2A, supplementary
table S1, Supplementary Material online). The number of
exact matches and mismatches between the VISTA alignments and our alignment was close to identical, with
each covering 65% and 66% of the A. thaliana genome,
respectively. Large differences can be seen in the amount of
masking applied to both the reference and query genome.
Comparing the RepeatMasker track created during our masking of A. thaliana to the VISTA alignment track on the TAIR
v8 genome browser, it is evident that, although the VISTA
alignment employs some masking, it is limited, and repeat
regions are often gapped. This results in a comparatively
higher proportion of gapped sequences within the VISTA
alignment. Both methods result in raw coverage, which is
within 10% of other pairwise alignments of Ath/Aly (Hu
et al. 2011).
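
The coverage categories described above can be illustrated with a small Python sketch that tallies, over the reference bases of one pairwise alignment block, the fraction of exact matches, mismatches, gaps, and masked positions. This is a simplified stand-in rather than the utility used in the analysis: the function name is hypothetical, soft-masking is assumed to be encoded as lowercase reference bases, and the "no alignment" category (reference sequence outside any block) is not modeled.

```python
from collections import Counter

CATEGORIES = ("exact", "mismatch", "gap", "masked")


def coverage_fractions(ref, qry):
    """Classify each reference base of an aligned pair of sequences.

    ref, qry: equal-length aligned strings; '-' marks a gap and lowercase
    reference bases are treated as repeat-masked (an assumption of this sketch).
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for r, q in zip(ref, qry):
        if r == "-":
            continue  # insertion in the query: no reference base to classify
        if r.islower():
            counts["masked"] += 1
        elif q == "-":
            counts["gap"] += 1
        elif r.upper() == q.upper():
            counts["exact"] += 1
        else:
            counts["mismatch"] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in CATEGORIES}


print(coverage_fractions("ACGTacGT-A", "ACCTac--TA"))
```

Summing such per-block tallies across the reference genome would give whole-genome fractions of the kind compared between the two alignment methods.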
With coverage numbers comparable to previous alignments, we wanted to investigate how the constituent parts
in A. thaliana are affected by the process of multiple alignment. Base-by-base coverage numbers may conceal errors
in reading frame or poorly aligned functional sites such as
splice sites. We used TAIR protein coding gene annotation
(Lamesch et al. 2011) and the cleanGenes program in the
PHAST package (Hubisz et al. 2011) to locate and evaluate
start codons, stop codons, and splice sites, and to identify
frameshift/nonsense mutations. Annotated gene regions
containing all the listed functional elements without interruptions between an A. thaliana and A. lyrata alignment were
16% greater (5,240) in our alignments, compared with the
VISTA alignments (fig. 2B). Features listed as having no alignment have an excess of gaps obscuring any measurement
of features. Gene regions with no alignment occur more
than twice as often in the VISTA data set. In the subsequent
observations, there were marginal increases in failed tests of
gene features for our alignments, attributable to a greater
proportion of features passing the initial “no alignment”
test. The cleanGenes software also examines full exons
listed in the annotation for the A. thaliana genome.
Using this function, we tabulated the number of exons with
uninterrupted alignment in an Ath/Aly alignment. Over
4,000 more conserved exons with no gaps in alignment
were identified in our alignments, compared with the
VISTA alignments (supplementary table S1, Supplementary
Material online).
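
The sequence of gene feature tests described above can be sketched in Python as follows. This is a deliberately simplified illustration and not the cleanGenes implementation: the function name, the input format (query bases aligned to the reference CDS columns, with '-' for gaps), and the restriction to canonical GT/AG splice sites are all assumptions made for the example. Insertions in the query relative to the reference are not visible in this representation and are therefore not checked.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_gene(query_cds, introns=()):
    """Return 'pass' or the name of the first failed test.

    query_cds: query bases aligned to the reference CDS columns,
               with '-' for alignment gaps (hypothetical input format).
    introns:   (donor, acceptor) dinucleotides observed in the query at
               each reference intron boundary.
    """
    seq = query_cds.upper()
    if not seq.replace("-", ""):
        return "no_alignment"          # nothing but gaps
    if seq[:3] != "ATG":
        return "start_codon"
    if seq[-3:] not in STOP_CODONS:
        return "stop_codon"
    for donor, acceptor in introns:
        if donor.upper() != "GT":      # canonical 5' splice site only
            return "splice_5"
        if acceptor.upper() != "AG":   # canonical 3' splice site only
            return "splice_3"
    # A gap run whose length is not a multiple of 3 shifts the reading frame.
    if any(len(run) % 3 for run in re.findall(r"-+", seq)):
        return "frameshift"
    bases = seq.replace("-", "")
    # An in-frame stop codon before the final codon is a nonsense mutation.
    for i in range(0, len(bases) - 3, 3):
        if bases[i:i + 3] in STOP_CODONS:
            return "nonsense"
    return "pass"


print(classify_gene("ATGGCTTAA", introns=[("GT", "AG")]))
```

Testing "no alignment" first mirrors the observation above that an alignment which passes more genes through this initial test will also expose more of them to the later, feature-specific tests.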
To further address whether our methods are creating suitable alignments beyond pairwise comparisons to A. lyrata, we
investigated gene conservation in two additional species present in the alignment. Alignments created using LASTZ/
MULTIZ for Vitis vinifera and Glycine max were compared
with existing VISTA pairwise alignments created using the
LAGAN pipeline. Similar to the results observed when comparing A. thaliana with A. lyrata, our alignments outperformed existing alignments. As detailed in supplementary
table S2, Supplementary Material online, LASTZ was able
to cleanly align nearly twice (801/427) the number of features compared with LAGAN when comparing Arabidopsis
to wine grape. The same result was found in the A. thaliana/
G. max alignment, which found more than twice the number
of gene features cleanly conserved when compared with
LAGAN (837/336). This difference can be attributed to the
preservation of start/stop codons and splice sites during the
alignment process.

FIG. 2. Alignment coverage and quality comparison between our implementation of the LASTZ/MULTIZ paradigm and a publicly available alignment
hosted by the VISTA genome browser using a LAGAN-based alignment. (A) Coverage statistics as tabulated by the mafCoverage utility for each
methodology detailing exact nucleotide matches, alignment with mismatched nucleotides, gapped sequence, sequence intentionally removed due
to repeats, and regions where no relationship between the two genomes was assigned (equivalent to a gap). (B) Results from the cleanGenes utility that
takes a TAIR v8 annotation and measures whether a given alignment has conserved the gene feature and maintained its protein coding ability. If the
gene alignment between the two genomes is not cleanly conserved, the type of error is recorded.
[Figure image not reproduced. Panel A pie charts: VISTA Ath/Aly: exact 65%, mismatch 7%, gap 13%, masked 1%, no alignment 14%; 20way Ath/Aly: exact 66%, mismatch 8%, gap 5%, masked 14%, no alignment 7%. Panel B bar chart (axis ticks 0, 8000, 16000 features), categories: passing all tests, no alignment, failed start codon, failed stop codon, failed 5' splice site, failed 3' splice site, nonsense mutation, frameshift mutation.]

Base-by-Base Conservation and Discrete Conserved
Elements among Angiosperms
To predict conserved regions in our multiple alignment,
we used the phyloHMM method of Siepel et al. (2005),
which was originally applied to search for conserved elements within four different
groups of organisms: vertebrates, insects, worms, and
yeast. This phyloHMM scores general conservation across
an alignment and also creates a smaller set of discrete
elements, representing the most highly conserved blocks of
sequence (mostCons).
The normal composition of genome features in A. thaliana is illustrated in figure 3A and serves as a reference to
which the predicted conserved regions can be compared.
The composition of all scored conserved elements (fig. 3B)
can be contrasted with this normal composition, revealing an
expansion in the proportion of protein-coding sequence and
unannotated intergenic sequence. Annotations specific to
regions not associated with protein-coding genes, such as
noncoding RNAs and translational RNAs, only represent a
fraction of the larger conserved region data set. This contrasts with the relatively higher proportion of conserved
RNAs observed in previous analyses in organismal groups
such as vertebrates (Siepel et al. 2005). Using this previous
data, and reproducing the analysis for angiosperm genomes,
we observe a greater amount of CDS conservation in angiosperms (42%) than seen in vertebrate (18%) and insect
(26%) conservation but less than that seen in worms
(55%) and yeast (86%). In general, when comparing the patterns of element conservation and diversity seen in angiosperm genomes to the same distributions previously
mapped in reference vertebrate, yeast, insect, and worm
genomes, we find that angiosperms most closely resemble
the distribution of conserved elements seen in nematodes
such as Caenorhabditis elegans.
PhastCons produces a set of discrete regions that are the
most conserved within the alignment; this set is graphed by
annotation type in figure 3C. To further isolate regions
with the deepest phylogenetic conservation, we selected
the top 10% of this mostCons set, as defined by having a
logarithm of the odds (LOD) score greater than 88. The
majority of the mostCons set and the tail of its distribution
are annotated as protein-coding sequence. Despite filtering
for only the highest scoring regions in the mostCons set,
elements mapping to intergenic regions are still represented.
These intergenic regions do not include any known DNA-level annotation; this suggests that there is substantial undiscovered functionality present in A. thaliana and other plant
genomes. Cis-regulatory elements are equally represented
among the normal composition, conserved, and most-conserved regions. Although using short sequence motifs to
identify regulatory elements may accrue false-positive regions that share sequence identity but are nonfunctional,
the deep conservation of many of these sites demonstrates
that most are likely functional in some way. In general, the
mostCons data set serves as the starting point for further
analysis and annotation of conserved regions within
angiosperms.
One way of characterizing the conserved portion of the
genome is to ask what functional annotations are enriched
among identified conserved elements. Figure 3D provides
such a view of the conserved portion of plant genomes.
In particular, translational RNAs are the most enriched annotation among conserved regions, followed by protein-coding
sequences. Following these two groups, we observe that
RNAs that regulate transcription, such as miRNAs, are enriched among conserved sequences. We observed that this
annotation set of miRNA and noncoding RNA (ncRNA) annotations was diverse in its alignment depth. It included
RNAs present in many, if not all the 20 included species,
and RNAs with alignment to only Brassicales. This wide variation in the depth of conservation of RNAs makes their mild
enrichment unsurprising. In addition to the enrichment of
regulatory RNAs, regions tentatively annotated as binding
transcription factors are also enriched. As a control, transposable elements
are drastically under-represented; being highly repetitive,
they do not align well, nor should they be conserved in
most cases between species. Comparing the enrichment of
angiosperm annotations among conserved regions to vertebrate annotation enriched in conserved regions determined
by the 46way vertebrate conservation track showed nearly
the same ordering of enriched annotation types (supplementary fig. S1, Supplementary Material online). Vertebrate enrichment values trended higher in all categories compared
with angiosperm enrichment.
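
As an illustration of how an enrichment value and its significance might be computed for one annotation type, the Python sketch below builds a 2 x 2 table (conserved or not, annotated or not) and evaluates a one-sided Fisher's exact test from the hypergeometric distribution. The function names and the counts in the example are hypothetical, and the exact units (bases versus elements) and test sidedness used for figure 3D are not specified here, so this should be read as one plausible formulation.

```python
from math import comb


def fold_enrichment(a, b, c, d):
    """Fraction annotated among conserved units divided by the fraction overall.

    a: conserved and annotated      b: conserved, not annotated
    c: not conserved, annotated     d: neither
    """
    n = a + b + c + d
    return (a / (a + b)) / ((a + c) / n)


def fisher_exact_greater(a, b, c, d):
    """One-sided P value: probability of observing at least `a` conserved,
    annotated units given the table margins (hypergeometric upper tail)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


# Hypothetical counts, for illustration only.
print(fold_enrichment(3, 1, 1, 3), fisher_exact_greater(3, 1, 1, 3))
```

The upper tail tests for over-representation; an under-represented category such as transposable elements would instead call for the lower tail or a two-sided test.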
Previous investigations into conservation between vertebrate species have looked into the alignability (i.e., percentage
of bp with aligned sequence to a reference) of different
components of the genome as a function of evolutionary
divergence (Miller et al. 2007). To compare and contrast
the animal results of Miller et al. (2007) with conservation
within plant species, we recapitulated their analysis using
the 46way alignment information (Fujita et al. 2011) and
overlaid the trend lines on selected angiosperm results
(fig. 3F). Comparing conservation of RefSeq CDS regions
from vertebrates to conservation of TAIR CDS regions
within angiosperms showed a faster decline of alignability
in vertebrate species. Similarly, a faster decline in vertebrate
alignability was observed when comparing angiosperms
to vertebrate cis-regulatory sites as seen in the trend line
of figure 3F. This may be due to ORegAnno annotation
being biochemically validated compared with our initial set
of regulatory sites that are bioinformatically predicted.
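
Alignability, as defined above, is the percentage of base pairs in a feature class that have aligned sequence to the reference. A minimal Python sketch of that calculation, assuming features and aligned blocks are given as half-open (start, end) intervals on a single reference chromosome, is shown below; the function names and coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def alignability(features, aligned_blocks):
    """Fraction of feature base pairs covered by at least one aligned block."""
    feats = merge_intervals(features)
    blocks = merge_intervals(aligned_blocks)
    total = sum(end - start for start, end in feats)
    covered = 0
    j = 0
    for start, end in feats:
        # Skip blocks that end before this feature begins.
        while j < len(blocks) and blocks[j][1] <= start:
            j += 1
        k = j
        # Accumulate overlap with every block that starts inside the feature.
        while k < len(blocks) and blocks[k][0] < end:
            covered += min(end, blocks[k][1]) - max(start, blocks[k][0])
            k += 1
    return covered / total if total else 0.0


# Two 10 bp features, one aligned block spanning positions 5 to 25.
print(alignability([(0, 10), (20, 30)], [(5, 25)]))
```

Repeating this calculation for each query species and plotting the result against its distance from the reference in substitutions per site would give curves of the kind compared between angiosperms and vertebrates.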
FIG. 3. An analysis of conservation in the 20way plant alignment. (A) The composition of the Arabidopsis thaliana genome sourced from TAIR v9 and newly generated annotations. “Other” contains
ncRNAs, miRNAs, tRNAs, rRNAs, small nuclear RNAs (snRNAs)/small nucleolar RNAs (snoRNAs), and pseudogenes. (B) PhastCons-predicted conserved elements sorted by annotation. (C) The discrete “mostCons”
regions predicted by phastCons and the 10% highest scoring tail of the distribution of conservation scores for mostCons regions. Color and segment position correspond to annotation type described in (A) and (B).
(D) Enrichment of conserved elements within different feature types. Significance was determined by a Fisher’s exact test with single stars denoting P < 0.05 and double stars P < 0.01. (E) Enrichment of EvoFold-predicted secondary structures in different types of genome features. (F) Alignability of A. thaliana genome features to corresponding features in plants at increasing phylogenetic distances. Also plotted are
proportional trend lines of vertebrate alignability of cis-regulatory sites and exons drawn from 46 vertebrate species (Fujita et al. 2011). The distance is scaled according to substitutions per site drawn from a 4-fold
degenerate neutral tree. (G) BLAST homology annotation of unannotated intergenic conserved elements from (B). Annotation categories are shown as proportional areas with subsets shown as Venn diagrams.
Regions are labeled with their putative annotation type and total number of elements identified.
[Figure image not reproduced. Values legible in the panels: (A) CDS 30%, transposable elements 29%, intergenic 21%, intron 16%, cis-regulatory 3%, other 1%; (B) CDS 42%, unannotated intergenic conservation 33%, intron 19%, cis-regulatory 3%, transposable elements 1%, other 1%; (F) x-axis: distance (substitutions per site), y-axis: alignability; (G) conserved regions with homology to known proteins 1,787, transposable element homology 533, no homology 285, tRNA 100, pseudogene homology 78, snRNA/snoRNA homology 77, ncRNA homology 40, EST 26, rRNA 17.]

De novo Annotation of Unknown Conserved
Elements
The conservation analysis from figure 3B revealed that there
remains a large percentage of conserved intergenic DNA
that is not associated with any documented annotation or
function. To further investigate this set of regions, we applied
a BLAST homology search to all existing plant databases
to annotate new sequence (see Materials and Methods).
BLAST annotation terms associated with each region of unannotated conservation were recorded and graphed as proportional area, so as to visualize the diversity of the previously
unknown conservation (fig. 3G). Each circular area represents
a group of conserved regions that do not overlap existing
annotation in Arabidopsis but that share sequence identity
to an annotation group in Arabidopsis or in any other
plant genome. It is important to note that this is by no
means an exact one-to-one annotation, as most regions
show moderate sequence identity. However, it illustrates
that single-genome computational predictions of functional
elements have overlooked many biologically relevant sites
within the Arabidopsis genome and provides inroads toward
their further characterization.
Intersecting predicted folding RNAs (fRNAs) with conserved regions that show sequence
homology to angiosperm tRNAs revealed that more than half of
these regions also exhibit complex RNA
folding patterns. This overlap of independent methods of
identification gives a strong indication that these conserved
regions are part of previously unannotated tRNAs in
Arabidopsis. The remaining conserved regions that only
have homology to tRNAs may be truncated and lack the
complementary sequence to accurately predict a fold
within that region. Forty conserved regions were found that
showed some sequence identity to, but not overlap with,
currently annotated noncoding RNAs in plants; these regions
were intersected with a track listing regions of small RNA
expression, which revealed that 14 sequences in that set
were transcribed. Of those 14 regions that expressed RNA,
seven also have the predicted folding structures associated
with the conserved region. Although the first BLAST sequence identity term for these elements was similar to
ncRNAs, many also have protein-coding homology as a secondary BLAST term, suggesting their potential targets for
regulation. Despite all attempts at classification, the function
of 10% of the starting data set of unannotated conserved
elements remains unknown. These elements of unusually
high sequence conservation among species, labeled in figure
3G as “no homology,” cannot yet be fully characterized; however, similar to many of the other regions successfully identified, a subset shows small RNA expression or predicted
folding, giving clues to a currently veiled function.
The most prominent set of newly annotated elements
(fig. 3G) consists of conserved regions with homology to protein-coding sequence. Overlapping these regions of protein
homology are subsets that have been intersected with different whole-genome annotation tracks. This is visualized as an
internal Venn diagram of different types of feature characteristics, such as structure or expression. One possibility is that
this large group of elements comprises regions that could
code for proteins, either currently or ancestrally. To explore
this, we evaluated the reading frames of each conserved
region and identified that, for the length of the conservation,
roughly one-third (573) have one or more viable reading
frames without stop codons. An additional measure of potential protein-coding ability was evaluating the proximity to
known exons. About one-fifth of the regions with good open
reading frames (174) were within 300 bp of a known exon,
making them candidates for being involved as an alternative
variant of a transcript or as an unknown exon of an annotated
gene. Although all these included regions have some homology to protein sequence, such homology is not always an
indication that the conserved sequence contributes to an
mRNA transcript. RNAs that regulate transcripts or target
DNA require homology to that target DNA (e.g., miRNAs).
As such, parts of these regions of homology could result from
targeting protein-coding regions as part of a regulatory
mechanism.
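The reading-frame screen described above can be illustrated with a minimal sketch. It assumes that a "viable" frame is simply one of the six frames that contains no stop codon across the full length of the conserved region; the function names are hypothetical and this is not the script used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def open_frames(seq: str) -> list[tuple[str, int]]:
    """Return (strand, offset) for every frame that has no stop codon.

    A frame must contain at least one complete codon to be reported.
    """
    frames = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            if codons and not any(c in STOP_CODONS for c in codons):
                frames.append((strand, offset))
    return frames
```

A region would count toward the set with viable reading frames whenever `open_frames` returns a non-empty list.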
To further differentiate this large group of 1,787 elements,
other features were employed to identify additional characteristics of each conserved region. Secondary structure and
small RNA expression were used to elucidate potential RNA
genes within this set. This resulted in 53 elements at the
intersection of secondary structure and small RNA expression,
which made prime targets for further investigation. One
intriguing region from this set had a top BLAST hit
to an unknown protein and a second BLAST
hit to the protein LAF3 (AT3G55850) (fig. 5).
The LAF3 protein participates in regulating phytochrome A
signal transduction in the cytosol (Hare et al. 2003). The third
highest scoring BLAST hit was the noncoding RNA
AT1G70185, located 500 kb downstream of the unknown
conserved region on chromosome 1, with a stretch of homology 80 bp long with 10 substitutions along that stretch. The
biological function of this related ncRNA is unknown. Our
predicted ncRNA, which was found through its pattern of
conservation, shares homology with the sequence for the
LAF3 protein, as well as with the TAIR-annotated ncRNA.
The protein homology overlaps the expressed small RNAs
mapped to the region. EvoFold-predicted secondary structure
shows an unusually high level of conservation, with almost no
substitutions in the fold found among angiosperms. In addition to this ncRNA, several other targets from this data
set share similar patterns of high conservation, expression,
and high-scoring secondary structure; these are annotated
as part of a browser track on the A. thaliana genome.
RNA Secondary Structure Prediction across A. thaliana
Previous successes in whole-genome comparisons among
species groups have opened a window onto using multiple
alignments and phylogenetic trees to identify RNA genes
(Pedersen et al. 2006; Stark et al. 2007). To identify possible
RNA genes in our plant alignment, we used the phyloSCFG
algorithm implemented in the EvoFold software package
(Pedersen et al. 2006), in addition to the RNAalifold program
(Bernhart et al. 2008). These previous RNA structure analyses
have highlighted the inherent high rates of false positives in
folding prediction. Using these two independent prediction
methods provided the opportunity to evaluate each fRNA
from multiple perspectives; this helped to eliminate false
positives that may have resulted from characteristics that
are unique to a particular algorithm. An example of this
approach can be seen in predictions such as those illustrated
in figure 4E, where the two algorithms overlap in their
annotation of a fold.
The combined predictions of the two approaches identified 86,000 sites that could potentially fold. Short folds of less
than 15 bp were found to be the majority of predictions,
though longer folds were also found in large numbers
(fig. 4A). To assess the accuracy in determining fRNA from
highly conserved alignments, the set of TAIR annotations for
transfer RNAs consisting of 689 sites was used as a positive
control for fRNA prediction. Our fold classifications predict
97% (637) of these established fRNAs (fig. 4B). The remaining 3% of annotated tRNAs were not identified, due to
poor alignment or low conservation. This suggests accurate
prediction of known, conserved fRNAs, on par with previous
investigations into fRNA genes in other organisms.
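The positive-control calculation above amounts to asking what fraction of known tRNA intervals is overlapped by at least one predicted fold. A minimal sketch follows, assuming half-open (chromosome, start, end) intervals and treating any overlap as a hit; the function names and the overlap rule are illustrative assumptions, not the procedure used in the study.

```python
from collections import defaultdict


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def fraction_recovered(known, predicted) -> float:
    """Fraction of known intervals overlapped by one or more predictions.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in predicted:
        by_chrom[chrom].append((start, end))
    hits = sum(
        any(overlaps(start, end, p_start, p_end)
            for p_start, p_end in by_chrom[chrom])
        for chrom, start, end in known
    )
    return hits / len(known) if known else 0.0
```

The scan is quadratic per chromosome, which is adequate for a few hundred control sites; a sorted or indexed lookup would be preferable for larger sets.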
Secondary structure in RNAs can take many physical
forms; we quantified this variation in shape by recording
the type of matching seen in both long and short folds
(fig. 4C). The hairpin type dominates among shorter folds.
Long folds show much higher diversity in shape, including
complex folds that have more than three hairpins in the
folded structure. Both long and short regions show a greater
proportion of folds comprising two hairpins in angiosperms,
compared with the distribution observed among folds in the
human genome, in which double hairpins are rarer
(Pedersen et al. 2006). The types of
annotation that overlap regions which fold are described
in figure 4D for long and short folds. In vertebrates, nearly
half of all known folds are intergenic, with the remainder
being associated with introns and CDS. In contrast, angiosperms have few intergenic folds, with 70% or more occurring
within coding sequence. This difference mirrors the differences seen in the type of conservation of all sequence
between species. The data set used for both analyses,
the “mostCons” (most-conserved regions identified by
phastCons), impacts the distribution of folds among annotation types. As a result, we see that the mostCons composition in figure 3C is similar to the composition of folds
detected in figure 4D.
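Counting hairpins in a predicted fold, as in the shape classification above, can be sketched from dot-bracket notation. The sketch assumes that each hairpin loop is an innermost base pair enclosing only unpaired positions; the category labels and the cutoff for "complex" (more than three hairpins) are taken loosely from the text, and the function names are hypothetical.

```python
import re

# An innermost pair "(" ... ")" enclosing only unpaired positions (dots).
_HAIRPIN_LOOP = re.compile(r"\(\.*\)")


def count_hairpins(structure: str) -> int:
    """Count hairpin loops in a dot-bracket secondary structure string."""
    return len(_HAIRPIN_LOOP.findall(structure))


def classify_fold(structure: str) -> str:
    """Assign an illustrative shape label based on the hairpin count."""
    n = count_hairpins(structure)
    if n == 0:
        return "unstructured"
    if n == 1:
        return "hairpin"
    if n == 2:
        return "double hairpin"
    if n == 3:
        return "triple hairpin"
    return "complex"
```

For example, `classify_fold("((..))..((...))")` yields the double-hairpin label, since the string contains two innermost loops.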
As a vignette describing one of the types of folds detected
in this analysis, we selected a previously undescribed conserved high-scoring double hairpin within the 5′-untranslated
region of the plastid redox insensitive 2 (PRIN2) gene (fig. 4E).
PRIN2 is a nuclear-encoded chloroplast-localized protein
whose expression levels are altered by light (Kindgren et al.
2011). The PRIN2 protein was also found to interact with the
plastid-encoded RNA polymerase, altering expression, and
therefore is thought to be a nonessential regulator of plastid
gene expression. The folded RNA structure is highly conserved among all flowering plants, with few mismatching
base pairs (fig. 4F). The consensus fold shows two strongly
conserved hairpins joined by a more variable region (fig. 4G).
The gene shows two transcripts scored 3 and 4 stars by TAIR,
differing only in the length of the 5′-untranslated region
(5′-UTR): one with the predicted folds and one without.
Interestingly, in the longer transcript, the hairpins directly
overlap the ribosome initiation site that begins at the 5′ cap. Additionally, we detected two cis-regulatory motifs:
one near, and one within, the UTR regions (as seen in fig.
4E). The first cis-regulatory element, a “sequence overrepresented in light repressed promoters number 3” (SORLREP3)
motif has previously been found to occur near promoters
whose transcript levels are reduced under a continuous red
light stimulus (Hudson and Quail 2003). The second motif,
found within the 5′-UTR of both transcripts, is an I-box and is
known to exist in the promoter regions of light-regulated
genes (Giuliano et al. 1988). How these two cis-regulatory
elements contribute to expression levels of each of the alternative PRIN2 transcripts is unknown, but the observed regulatory
pattern fits with previous knowledge about stimuli associated
with PRIN2 expression. Their presence flanking the predicted
folds may indicate that different expression patterns of the
gene are possible, depending on transcription factor binding.
Although these new predictions need further validation, they
highlight the ability of these genome-wide data sets to add
value to existing gene investigations.
Uninterrupted Conservation in Angiosperms
One peculiarity found in the genomes of mammals and insects is long stretches of uninterrupted conservation
(Bejerano et al. 2004). These ultraconserved elements
(UCEs) were originally located by a comparative genomic
search between human, mouse, and rat, which showed evidence of deep phylogenetic conservation, as well as ongoing
purifying selection in the human genome (Bejerano et al.
2004; Katzman et al. 2007; Chiang et al. 2008). UCEs can
extend to lengths of more than 500 bp and can be best described as the extreme tail of the distribution of genome-wide
conserved elements. There is a degree of controversy as
to whether they exist within plant genomes, with some
researchers reporting their discovery and others remaining
skeptical (Zheng and Zhang 2008; Freeling and
Subramaniam 2009). More recent research has used BLAST
searches across multiple plant genomes, identifying regions
that have been termed ultraconserved-like elements (ULEs)
(Kritsas et al. 2012). ULEs have unusually high levels of conservation, and negative selection acting on their sequence,
but lack the uninterrupted segments and extreme purifying
selection that are found in mammals.
FIG. 4. Predicted secondary structure based on a 20way angiosperm alignment using EvoFold and RNAalifold. (A) Fold lengths separated into short (<15 bp; 63%, 54,981 folds) or long (37%, 39,681 folds) folds. (B) Coverage of known tRNAs intersected
with fold predictions. The remaining 3% (21) were missed because of low conservation or poor alignment. (C) Fold structure for both long and short fRNA sets. The number of hairpins was counted in a single fold, and
classified based on the fold’s structure. (D) Type of overlapping annotation for both long and short data sets. (E) UCSC genome browser screenshot of predicted hairpins in the PRIN2 (AT1G10522) gene that is
involved in plastid gene transcription, and is alternatively spliced. The hairpins were predicted by both EvoFold and RNAalifold, and overlap the ribosomal initiation site. Also pictured are cis-regulatory predictions, a
TAIR gene track, and the phastCons conservation track. (F) 20way alignment of the region colored blue where there is a single substitution compatible with the annotated pair, green with a compatible double
substitution, and red where there is a substitution not compatible with the annotated pair. (G) Consensus Vienna RNAfold predicted MFE structure of the highlighted 5′-UTR region.
FIG. 5. A screenshot from the Arabidopsis thaliana genome browser displaying tracks overlaid on a putative noncoding RNA, detected due to its high
conservation and expression. Tracks include conservation for each of the 20 included species, a track showing conserved regions BLAST-annotated as
having protein homology, a track showing secondary structure computed with EvoFold and RNAalifold with dark green denoting fold predictions and
light green nonfolding regions, and a track from the Arabidopsis Small RNA Project Database showing small RNA expression overlapping the conserved
regions. Also shown is the consensus Vienna RNAfold predicted MFE structure of the putative ncRNA.
To explore whether flowering plants contain even modestly extended stretches of the uninterrupted conservation
seen in mammals and insects, we conducted a search using
methods mirroring those used to detect these regions in
mammals (Bejerano et al. 2004; Glazov et al. 2005). These
approaches use whole-genome multiple alignment to
detect blocks of conservation. Specifically, the algorithm
identifies UCEs by starting with a conserved alignment column and stringing together subsequent preserved columns
until this pattern breaks due to any kind of nucleotide change.
Constraining this algorithm are two parameters: the number
of genomes within the alignment and the minimum length threshold for declaring a UCE. Using all 20 aligned species to search
for regions of consistent alignment column conservation returns no regions when using a cutoff of 18 bp. One possibility is that gaps in alignment could be due to alignment or
assembly quality errors. To better account for this, we limited
the search space by using three-way alignments to A. thaliana.
The alignment of G. max and V. vinifera to A. thaliana
returned the largest number of uninterrupted regions greater
than 18 bp. Using TAIR annotation and BLAST homology,
we annotated 1,600 uninterrupted regions detected in
this three-way alignment and determined they all fall within
known types of genome features (supplementary fig. S2,
Supplementary Material online). Considering that novel
metazoan-type uninterrupted conservation has not been
found, it can be concluded that metazoan-like UCEs are
not present in angiosperm genomes at the investigated phylogenetic depths. As suggested by Kritsas et al. (2012), plant genomes
may contain features that serve a similar purpose but
with altered or reduced conservation characteristics.
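The column-stringing search described above can be sketched as a single pass over an alignment block. The sketch assumes the block is given as equal-length rows with "-" for gaps, that a column is conserved only when every row carries the same nucleotide, and that results are reported in alignment-column coordinates; the function name and the default of 18 columns (the cutoff quoted in the text) are illustrative rather than the implementation used in the study.

```python
def uninterrupted_runs(rows, min_len=18):
    """Return half-open (start, end) column ranges of perfect conservation.

    rows: equal-length aligned sequences, reference first, '-' for gaps.
    A run ends at the first column with any substitution or gap.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all alignment rows must have the same length")

    runs = []
    run_start = None
    # Iterate one column past the end so a trailing run is flushed.
    for col in range(length + 1):
        conserved = (
            col < length
            and rows[0][col].upper() in "ACGT"
            and all(row[col].upper() == rows[0][col].upper() for row in rows)
        )
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs
```

Passing only three rows (for example A. thaliana, G. max, and V. vinifera) corresponds to the restricted search space described above, while passing all rows corresponds to the full 20-species search.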
Discussion
Is the evolution of plant genomes distinct from that of animals? Here, we construct a phylogenetically deep alignment
of angiosperm genomes, to ask how sequence conservation in
angiosperms compares to groups of species in other kingdoms of life. Relating the entirety of currently sequenced genomes can reveal a more complete story on how similar plant
genomes are and on what features they “value” as part of a
shared evolutionary history. Additionally, conservation of sequence has been shown to quickly and clearly identify functional regions which might otherwise have been overlooked.
As such, by analyzing genome conservation in flowering
plants, we have been able to add new annotations based
on patterns of conservation and identify novel features with
secondary structure and potential target sequences.
Information content of multiple alignments increases as
the number of species and the breadth of the phylogeny
increase. Although this is true for the first few included species, there are diminishing returns as further species are
added. To better quantify this, some have looked into how
many genomes are necessary to reach the nucleotide-level
resolution of conservation in comparative studies (Cooper
et al. 2003; Eddy 2005). Although these investigations focus
on the number of mammalian genomes, they still provide
potent rules of thumb for estimating how many genomes are
needed for high resolution. Depending on the phylogenetic
relationships, anywhere between 15 and 40 genomes may be
necessary. Our choice of how many genomes to include was
largely dictated by availability, as roughly 20 genomes were
available to us for use. Although plant alignments could benefit from further inclusion of comparative data of close phylogenetic distance to A. thaliana, this phylogenetically broad
20way alignment is a large step toward nucleotide-level identification of conserved sequences in angiosperm genomes.
Aligning plant genomes inevitably leads to the question of
how recent polyploid events impact the construction of data
sets and their analysis. Arabidopsis thaliana is one of the
smallest sequenced plant genomes; as a result, alignments
generated with an A. thaliana reference will be equally minimal. In instances where the reference (Arabidopsis) only has a
single copy, species aligned to this compact reference will
exclude regions of less conserved paralogs within the query
genome. This method has worked well for broad use in less
dynamic whole-genome comparative data sets such as vertebrates. In light of the complex ploidy history of plant genomes, we want our inference to be conservative, relative
to the influence of polyploidy; thus, we focus attention only
on questions of genome conservation within the A. thaliana
genome. A modified alignment process, using high-quality
chromosome data to resolve genome duplication events before the LASTZ alignment step, could produce alignments
without such a complex composite nature. However,
few genomes have such data available.
Our investigation into coding-sequence conservation
demonstrates that we are generating quality whole-genome
alignments, which preserve essential gene features in aligned
regions. Our approach also implemented a new platform for
visualizing and accessing alignment data for plants; such tools
have long been available for other model species, but this is
the first instance of a deep comparative browser for plants. The
results from benchmarking the gene quality of the alignment
(fig. 2) were surprising in that almost half of A. thaliana gene
annotations contained a change that would disrupt function
when aligned to A. lyrata. This is despite the two sharing the
large majority of A. thaliana’s coding sequence, with A. lyrata
having alignments for 98% of the coding sequence present in
A. thaliana. This stands in stark contrast to start/stop codon
conservation comparisons in vertebrate species, where even
distant mammals such as platypus retain more than 60% of
these essential sites (Miller et al. 2007). One explanation for
this observation could be that the alignment process creates
composite genes by aligning multiple copies of A. lyrata genes
onto single copy genes in the A. thaliana reference. This hypothesis, however, does not fully explain the number of disruptions in aligned genes. The majority of genes between the
two species occur colinearly, with a minority being duplicated
(Hu et al. 2011). If alignment errors from paralogs were creating all these observed disruptions, then we would expect
them to occur at a similar rate to the proportion of duplicated genes overall. More likely, there is a mix of causes, with
only some being false positives due to poor alignment. Even
taking these potential false positives into consideration, we
see a trend that is markedly different from the gene feature
conservation seen in vertebrate species.
The proportion of conserved genome features, such as
coding sequence, introns, and UTRs, relative to the complete
set of detected conserved elements within a reference
genome, has been previously investigated for vertebrates, insects, worms, and yeasts (Siepel et al. 2005). Specifically, this
comparison shows a trend relating an increase in the complexity of the conserved element set to an increase in overall
organismal complexity. Reproducing this analysis (fig. 3B) for
angiosperms revealed that the proportion of gene features
among the complete set of conserved elements most closely
parallels the proportions observed in worms such as C. elegans. Both nematodes and plants are known to exhibit a wide
degree of phenotypic plasticity, making drastic alterations to
body structure due to environmental stress (Sultan 2000;
Sommer and Ogawa 2011). The observations that both angiosperms and nematodes share a common distribution of
conserved elements, and that they both make use of a more
flexible phenotypic landscape, may imply that this less diverse
composition of conserved elements is necessary for environmentally induced large phenotypic changes. Moreover, it
could be that the developmental plans of plants are relatively
flexible in comparison to animals, and thus, this developmental lability is reflected in genomic architecture and evolution.
It is clear that angiosperm coding sequence and essential
RNAs can be reliably aligned and identified, even across protracted phylogenetic timelines. However, these components
represent only a fraction of genome features. Knowing
the phylogenetic distance at which sequence identity is lost
for rapidly diverging features in plants can help inform experimental design. Comparing alignability between vertebrates and angiosperms reveals a faster
decay in vertebrates. Both between closely related species and in phylogenetic comparisons
beyond the Brassicales, the alignability of coding sequence to
the A. thaliana reference was greater than that of equally distant
vertebrates to a human reference (fig. 3F). Similarly, the alignment coverage of reference genome coding sequence between equally distant vertebrate and plant species showed
a trend of higher coverage in plant species that were recently
and distantly diverged. The implication of this result is that
although plant genomes can be highly variable intergenically,
essential features such as coding sequences are highly conserved between species. It is more difficult to draw definitive
conclusions about the alignability of cis-regulatory modules
between kingdoms due to differing quality of annotations.
However, we do observe a substantial difference in alignability
when comparing the two kingdoms of life.
The preservation of conserved elements with no known
annotation, even in the most stringently filtered sets, was
surprising. This pattern illustrates that there are still segments
of the A. thaliana genome, which are conceivably functional,
but which are as yet uncharacterized. Our first-pass annotation of these conserved regions (fig. 3G) has shed light on the
type of function associated with this DNA. Although we detected several types of features, the highlight is the identification of dozens of new potential RNA genes. These regions,
found by partial BLAST homology to existing ncRNAs, or
alternatively by finding protein homology with overlapping
small RNA expression, may represent a previously unknown
source of regulation in A. thaliana. However, the homology-based method used to annotate conserved regions is simple
and broad and as a result cannot produce truly definitive
annotations. These new annotation groups, however, are
small enough for future manual refinement. Ultimately, function can only truly be assigned as a result of validation
through benchwork that verifies RNA expression and effects
on phenotype. The vignettes of novel folding regions highlighted here demonstrate the ability to quickly identify novel
genome features and can serve as a guide for bench scientists
to probe deeper into their gene families of choice with the
help of our genome browser. Novel putative ncRNAs, such as
that described in figure 5, are promising but require further
investigation to confirm the paradigm that conservation
implies function.
Beyond identifying new annotations via conservation, we
have added depth to existing annotation by layering it with
RNA folding and cis-regulatory information. In doing so, we
have characterized on a genome-wide scale how RNA genes
fold in flowering plants. Our leveraging of two independent
algorithms, to precompute RNA folds and display them
genome-wide, gives researchers an instant second opinion
on a methodology, which is often subject to high false discovery rates and intense computational time (Gorodkin et al.
2010). This folding information, combined with TAIR gene
tracks, conservation, and cis-regulatory motifs, was able to
identify a potential fold, which may well control regulation
of a gene (fig. 4E).
This study articulated a few avenues for investigation into a
comparative alignment. Most of the analyses presented can
be viewed as tracks on the Arabidopsis genome browser;
many can be reconstituted using the genome browser and
table browser web tools. Plant comparative genomics has
unique challenges due to the architecture of genomes in
this kingdom of life. Continued accrual of sequence information and annotation will help empower further analysis of
these complex organisms. At both the gene level and the
genome level, this integration of plant DNA information
will help inform decisions and formulate targets for investigation to gain further insight about plant evolution.
Materials and Methods
Pairwise Alignment
In this analysis, we included only those angiosperm genomes
that have been previously published, following the guidelines of the Ft. Lauderdale agreement on rapid data release.
The 20 genomes that were included in this analysis and their
version numbers are listed in table 1, sorted by coverage. Total
align refers to the number of nucleotides aligned to A. thaliana determined by the mafCoverage software. CDS align
refers to the overlap of alignment with existing A. thaliana
CDS annotation as determined by intersections using the
featureBits software; both programs are part of the UCSC
source tree (Kent et al. 2002).
A pairwise alignment pipeline was used to generate whole-genome alignments against a version 9 A. thaliana reference
genome sequence assembled by TAIR (Swarbreck et al. 2008).
All sequences were obtained as scaffolds or pseudochromosomes from the web repositories of the respective sequencing
groups with the exception of the G. soja genome. Sequence
data for G. soja were obtained from the Sequence Read
Archive and mapped to G. max using Maq (Li et al. 2008),
following the methods of the sequencing group (Kim et al.
2010). Lineage-specific repetitive regions were masked using the RepeatMasker (Smit and Hubley
2004) software suite, which improved BLAST results. Each query genome was split into regions of 1 million base pairs or less, whereas the reference
genome, A. thaliana, was split into its seven pseudochromosomes. The alignment then proceeded using the LASTZ
program (Harris 2007), a local alignment algorithm optimized
for whole-genome alignment, which locally compared the
A. thaliana reference genome sequence against all sequences
in each query genome. This process was parallelized across a
computer cluster to efficiently generate alignments from large
data sets. LASTZ output relating query to reference was then
linked into longer chains of contiguous alignment using
axtChain (Kent et al. 2003). The alignment chains were
sorted using chainNet, which retains only the single best-aligned chain and maximizes coverage across the reference
genome. The nets were then converted to multiple alignment format (MAF) files. The resulting pairwise alignments of each query
genome to the A. thaliana reference were joined using
MULTIZ (Blanchette et al. 2004) and guided by the tree topology in figure 1. Postprocessing of the alignments included
inserting annotations for alignment breaks and gaps using the
mafAddIrows tool, and identifying regions removed by
RepeatMasker.
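The first step of the pipeline above, splitting each query genome into regions of 1 million base pairs or less, can be sketched as follows. The chunk naming scheme (`name:start-end`, zero-based and half-open) is a hypothetical convention chosen for illustration; the study does not state how chunks were labeled.

```python
def split_sequence(name, seq, chunk_size=1_000_000):
    """Yield (chunk_name, chunk_seq) pieces no longer than chunk_size.

    Chunk names record zero-based, half-open coordinates on the parent
    sequence so that alignments can later be lifted back to it.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(seq), chunk_size):
        end = min(start + chunk_size, len(seq))
        yield f"{name}:{start}-{end}", seq[start:end]
```

Each chunk can then be aligned independently, which is what makes the alignment step straightforward to distribute across a cluster.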
Evaluating Alignment Quality and Refining Parameters
To determine whether the alignment process is producing
reliable sequence relationships, alignments were evaluated
based on base coverage and on annotation-specific quality.
To measure the number of raw bases aligned, the number of
exact base matches, the number of base mismatches, and the
coverage, the mafCoverage program (part of the UCSC source
tree) was used. Starting with default parameters for all programs, each step in the alignment process was tuned to maximize coverage and minimize mismatch. The LASTZ
alignment algorithm proved to be robust; even without any
tuned parameters, the software produced pairwise alignments between A. thaliana and A. lyrata with coverage
only 5% less than pairwise alignments made by the
A. lyrata genome sequencing project. The final LASTZ parameters were as follows for all alignments: inner = 2,000,
xdrop = 9,400, gappedthresh = 3,000, hspthresh = 2,200.
Using these parameters, for example, coverage was increased
by 4% between Ath/Aly compared with the default parameter baseline.
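The quantities tracked during tuning (aligned bases, matches, mismatches, and coverage) can be illustrated on a pairwise alignment block. This is a simplified stand-in written for illustration, not the mafCoverage program; it assumes gapped rows of equal length and defines coverage as aligned reference bases divided by the reference length.

```python
from typing import NamedTuple


class AlignmentStats(NamedTuple):
    aligned: int      # columns where both rows carry a base
    matches: int      # aligned columns with identical bases
    mismatches: int   # aligned columns with different bases


def pairwise_stats(ref_row: str, query_row: str) -> AlignmentStats:
    """Tally aligned, matching, and mismatching columns of one block."""
    if len(ref_row) != len(query_row):
        raise ValueError("alignment rows must have the same length")
    aligned = matches = 0
    for r, q in zip(ref_row.upper(), query_row.upper()):
        if r == "-" or q == "-":
            continue
        aligned += 1
        if r == q:
            matches += 1
    return AlignmentStats(aligned, matches, aligned - matches)


def coverage(blocks, ref_length: int) -> float:
    """Fraction of the reference covered, given (ref_row, query_row) blocks."""
    total = sum(pairwise_stats(r, q).aligned for r, q in blocks)
    return total / ref_length
```

Comparing these tallies between a default run and a tuned run gives the kind of coverage difference reported above for Ath/Aly.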
Evaluating Gene Quality
To judge the conservation of known A. thaliana gene features
in pairwise aligned genomes beyond simple coverage numbers, the program cleanGenes (part of the PHAST package)
was used to evaluate feature conservation. Using version 9
genome annotation from TAIR, gene-feature coordinates
were extracted and located in pairwise alignments, and conservation was assessed. The features evaluated included start sites, stop sites, and splice sites, as well as frameshifts and
nonsense mutations; exons were scored as “cleanly” conserved only if they aligned without gaps or disrupting mutations. Features were tallied
as passing or failing after evaluating each feature’s conservation
in a pairwise alignment.
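The pass/fail logic for two of the feature types, start and stop codons, can be sketched on a single pairwise alignment block. This is an illustration of the idea rather than the cleanGenes program: it assumes forward-strand features, zero-based reference coordinates relative to the block, and hypothetical function names. A feature fails if the query carries a gap, an insertion, or a substitution that destroys the codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def query_at(ref_row: str, query_row: str, ref_start: int, ref_end: int) -> str:
    """Return the query text aligned to reference positions [ref_start, ref_end).

    Columns where the reference has a gap (query insertions) are included,
    so an insertion inside the feature lengthens the returned string.
    """
    columns = [col for col, base in enumerate(ref_row) if base != "-"]
    first, last = columns[ref_start], columns[ref_end - 1]
    return query_row[first:last + 1].upper()


def start_codon_conserved(ref_row: str, query_row: str, ref_start: int) -> bool:
    """True if the query has an intact ATG at the reference start codon."""
    return query_at(ref_row, query_row, ref_start, ref_start + 3) == "ATG"


def stop_codon_conserved(ref_row: str, query_row: str, ref_start: int) -> bool:
    """True if the query has any intact stop codon at the reference stop."""
    return query_at(ref_row, query_row, ref_start, ref_start + 3) in STOP_CODONS
```

Note that a stop codon is counted as conserved when the query carries a different but still valid stop, which matches the notion of a change that does not disrupt function.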
To compare our alignment pipeline to previously available
whole-genome plant alignments created for the VISTA
genome browser, the TAIR version 8 A. thaliana sequence was aligned to A. lyrata, so that we could
benchmark our alignments against the previously released public alignments. Additional more recent alignments
of A. thaliana/V. vinifera and A. thaliana/G. max, which utilized a TAIR10 reference, were used for additional comparisons. VISTA alignments were incompatible with genome
browser tools; thus, the mfa alignments were converted
into MAF block format using a custom Python script. The
resulting MAF format alignments for Ath/Aly were uploaded
to a TAIR version 8 genome browser as browser MAF tracks.
This allowed mafCoverage, as well as cleanGenes results from
VISTA alignments, to be directly compared with alignments
from the LASTZ/MULTIZ pipeline.
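The conversion script itself is not reproduced here. As a rough sketch of what such a conversion involves, the function below writes one pairwise alignment as a MAF block, assuming the aligned rows and their source coordinates have already been parsed from the multi-FASTA input. The sequence names and sizes in the example are invented.

```python
def pairwise_to_maf_block(ref, qry, score=0):
    """Format one pairwise alignment as a MAF block.

    ref and qry are tuples of (src, start, src_size, strand, aligned_text),
    where start is zero-based and aligned_text may contain '-' for gaps.
    """
    lines = [f"a score={score}"]
    for src, start, src_size, strand, text in (ref, qry):
        # The MAF size field counts aligned bases only, excluding gaps.
        size = len(text) - text.count("-")
        lines.append(f"s {src} {start} {size} {strand} {src_size} {text}")
    # Blocks are separated by a blank line.
    return "\n".join(lines) + "\n\n"

# Invented names and sizes, for illustration only.
block = pairwise_to_maf_block(
    ("araTha.chr1", 100, 1000, "+", "ACGT-ACGT"),
    ("araLyr.scaffold_1", 250, 2000, "+", "ACGTTAC-T"),
)
maf_text = "##maf version=1\n" + block
```

Note that for minus-strand rows the MAF start coordinate is measured on the reverse-complemented source sequence, which a full converter would need to handle.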
Scoring Conservation
To compute conservation tracks for the multiple alignment,
phyloFit (Siepel and Haussler 2004), a component of the
PHAST package, was used to fit a phylogenetic model to
4-fold-degenerate sites found on each chromosome of the
MULTIZ alignment as an initial starting sample (as described
in the PHAST documentation). The resulting phylogenetic model was used in conjunction with the phastCons
(Siepel et al. 2005) tool to create conserved and nonconserved phylogenetic trees. The phastCons program requires
several iterations to refine parameters that predict conservation as part of its phylo-HMM. Starting with parameters
for expected coverage and expected length gathered from a
previous conservation analysis focusing on Solanaceae
(Wang et al. 2008), the phastCons run was tuned to fit
predefined criteria. As in previous studies analyzing conservation, our criteria were 60% coverage of the annotated
coding regions by predicted conserved elements and a
phylogenetic information threshold score close to 10 bits,
as measured by the consEntropy software. The parameters that satisfied these criteria were an expected coverage of
0.2 and an expected length of 80. Wig format data files were
used to create a conservation track on the A. thaliana
genome browser, which visualizes conservation scores as
a continuous variable. Resulting conserved region lengths
are graphed in supplementary figure S3, Supplementary
Material online.
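The coverage criterion used during tuning can be checked with a simple interval computation. The sketch below is an illustrative stand-in rather than the tool used in this study; it assumes half-open, BED-style coordinates on a single chromosome and returns the fraction of coding bases overlapped by conserved elements.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_fraction(features, elements):
    """Fraction of feature bases that fall inside any element."""
    feats, elems = merge(features), merge(elements)
    total = sum(end - start for start, end in feats)
    overlap = 0
    i = j = 0
    # Sweep both sorted, non-overlapping lists in parallel.
    while i < len(feats) and j < len(elems):
        lo = max(feats[i][0], elems[j][0])
        hi = min(feats[i][1], elems[j][1])
        if lo < hi:
            overlap += hi - lo
        # Advance whichever interval ends first.
        if feats[i][1] < elems[j][1]:
            i += 1
        else:
            j += 1
    return overlap / total if total else 0.0

# Toy coordinates: one 100-bp coding region, two overlapping conserved elements.
fraction = covered_fraction([(0, 100)], [(10, 40), (30, 70)])
```

With these toy coordinates the two elements merge into one 60-bp element, so `fraction` is 0.6, matching the 60% target described above.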
Conserved regions were classified using A. thaliana annotation tracks based on TAIR version 9 GFF files. Intersections
and enrichment values of the annotation tracks versus the
conserved region track were computed using the featureBits
command line tool, part of the UCSC source tree. Significance
was determined by Fisher’s exact test, using values gathered
by featureBits, to determine whether certain groups were
over-represented relative to the normal composition. Normal
composition of the A. thaliana genome was determined
using the same methodology. Vertebrate conserved element
enrichment was determined using featureBits and the
phastConsElements46way track with annotation drawn
from UCSC (Fujita et al. 2011).
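To make the enrichment calculation concrete, the following sketch computes a fold enrichment and a one-sided Fisher's exact test from a 2x2 table of counts. It is an illustration using only the standard library, with invented counts; for the base-pair counts reported by featureBits, an optimized routine such as `scipy.stats.fisher_exact` would be the practical choice.

```python
from math import comb

def fold_enrichment(overlap, conserved_total, annot_total, genome_total):
    """Observed share of conserved bases in a class over its genome-wide share."""
    observed = overlap / conserved_total
    expected = annot_total / genome_total
    return observed / expected

def fisher_exact_greater(a, b, c, d):
    """One-sided P(X >= a) for the 2x2 table [[a, b], [c, d]].

    Under the null hypothesis, the top-left cell follows a hypergeometric
    distribution with both margins fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Invented counts, for illustration only.
ratio = fold_enrichment(overlap=40, conserved_total=100, annot_total=200, genome_total=1000)
p_value = fisher_exact_greater(3, 1, 1, 3)
```

In this toy case the class makes up 40% of conserved bases but only 20% of the genome, giving a twofold enrichment.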
To evaluate and compare the conservation of
specific genome features between species, the tool maf_interval_alignability.py was employed. This tool, part of the bx-python package and utilized in Miller et al. (2007), scores
alignments to annotated features by measuring presence or
absence of aligned sequence. Specifically, the program proceeds by tabulating the number of bases covered by a query
species, compared with the number of bases within an interval that have missing alignment information. The alignability
value is the number of bases with alignment divided by the
sum of the number of positions with and without alignment.
The graph in figure 3F displays mean values of alignability for
annotation groups in select species. The columns of mean
alignability values for each species are then scaled based on
phylogenetic distance, as determined by substitutions per site
drawn from a 4-fold degenerate neutral tree. Trend lines for
vertebrate data were recapitulated using the latest alignment
information drawn from the phastCons46way alignment
(Fujita et al. 2011) to confirm the previously observed pattern.
Annotation for cis-regulatory sites in the 46way alignment
was drawn from the ORegAnno track annotation (Griffith
et al. 2008).
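The alignability ratio described above reduces to a short calculation. The sketch below is not the bx-python script; it assumes that the aligned blocks for one query species are already available as half-open intervals in reference coordinates.

```python
def alignability(interval, aligned_blocks):
    """Bases with alignment divided by bases with plus bases without alignment."""
    start, end = interval
    covered = set()
    for block_start, block_end in aligned_blocks:
        # Clip each block to the feature; an empty range adds nothing.
        covered.update(range(max(start, block_start), min(end, block_end)))
    aligned = len(covered)
    missing = (end - start) - aligned
    return aligned / (aligned + missing) if end > start else 0.0

def mean_alignability(intervals, aligned_blocks):
    """Mean alignability over a group of annotated features."""
    scores = [alignability(iv, aligned_blocks) for iv in intervals]
    return sum(scores) / len(scores) if scores else 0.0

# Toy coordinates: a 100-bp feature with 60 aligned bases.
score = alignability((100, 200), [(90, 150), (140, 160), (300, 400)])
```

Collecting positions in a set keeps overlapping blocks from being counted twice, at the cost of memory on very long features.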
Building the Browser
To visualize alignments, and make use of the collection of
browser genomics tools, a mirror of the UCSC genome browser (Kent et al. 2002) was installed at local facilities and remains available at genome.genetics.rutgers.edu. The focus of
this browser is to host comparative genomics data for
Drosophila and plant species. We selected Oryza sativa and
A. thaliana as reference genome browsers for monocots
and eudicots, respectively, due to their extensive annotation
and high-quality pseudochromosomes. Development has
focused on the A. thaliana browser tracks as a prototype
for a plant comparative genomics browser. These tracks
include a bed file-based display of regions identified by
RepeatMasker as repetitive sequence and gene tracks based
on known genome annotations. Specifically, the foundation
of the browser is gff3 format annotation created by TAIR,
filtered for a single coverage of genes across each genome,
and then converted to gene prediction (genePred) format
and uploaded to the browser MySQL database. An alternative
gene prediction track, created using Gnomon gene prediction
software as part of a recent TAIR release (Lamesch et al. 2011),
was also included as part of the browser.
Cis-regulatory elements were predicted based on regular
expressions of A. thaliana transcription factor binding sites as
listed in the AGRIS cis-regulatory database (Yilmaz et al. 2011).
To create a browser-compatible track of elements, putative
binding sites were called using grep against TAIR version 9 chromosomes and then mapped using BLAT (Kent
2002). The resulting coordinates were formatted into an extended bed genome-browser track, labeling the type of motif
and its coordinates.
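A regular-expression search of this kind can be sketched in a few lines of Python. The example below is a stand-in for the grep and BLAT steps rather than a reproduction of them: it scans only the forward strand, and the G-box pattern and sequence are used purely for illustration.

```python
import re

def motif_hits(chrom, sequence, name, pattern):
    """Yield BED-style (chrom, start, end, name) tuples for every match.

    The lookahead wrapper lets overlapping occurrences be reported, which a
    plain pattern search would skip.
    """
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    for match in regex.finditer(sequence):
        start = match.start(1)
        yield (chrom, start, start + len(match.group(1)), name)

# Illustrative motif and toy sequence.
hits = list(motif_hits("chr1", "TTCACGTGAACACGTGCC", "G-box", "CACGTG"))
bed_lines = ["\t".join(map(str, hit)) for hit in hits]
```

Coordinates are zero-based and half-open, as in the BED format, so the tuples can be written out directly as an extended BED track.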
Identifying Uninterrupted Conservation
To locate regions within the A. thaliana genome that are also
found in all other sequenced and aligned angiosperm genomes in an uninterrupted block, a Python program based
on mafUltras was written. The mafUltras software was originally used to identify
ultraconserved elements in vertebrates and has been adapted
for use here with the 20way alignment. Unlike a phastCons
conservation analysis, this search method is dependent on a
user-defined threshold; specifically, the threshold is the minimum length of uninterrupted alignment columns. When
searching the human genome, this threshold was defined as
100 bp. To maximize inclusion of highly conserved elements
and account for the overall shorter length of plant conserved
elements, this threshold was set to 18 bp for the search in
the 20way alignment. This was chosen because in general
angiosperm-conserved regions are substantially shorter
than their mammalian counterparts (supplementary fig. S4,
Supplementary Material online) and because 18 bp is the shortest
length expected for a noncoding RNA. We expect that
lowering the threshold from the 100 bp used for mammals
to 18 bp errs on the side of inclusion rather than exclusion. Any detected elements were sorted according to overlap
of known A. thaliana annotation from TAIR. Regions that did
not map to known annotation were de novo annotated
based on BLAST homology.
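The search for uninterrupted blocks can be illustrated with a column scan over one alignment block. This sketch is not the program used here; it assumes that "uninterrupted" means every species carries the same base as the reference with no gaps, and it reports half-open intervals in reference coordinates relative to the block start.

```python
def uninterrupted_runs(rows, min_len=18):
    """Find runs of identical, gap-free columns of at least min_len bases.

    rows holds equal-length aligned sequences with the reference first.
    """
    runs = []
    ref_pos = 0          # number of reference bases consumed so far
    run_start = None     # reference position where the current run began
    for column in zip(*rows):
        ref_base = column[0]
        identical = ref_base != "-" and all(b.upper() == ref_base.upper() for b in column)
        if identical:
            if run_start is None:
                run_start = ref_pos
        else:
            # A mismatch or a gap in any species interrupts the run.
            if run_start is not None and ref_pos - run_start >= min_len:
                runs.append((run_start, ref_pos))
            run_start = None
        if ref_base != "-":
            ref_pos += 1
    if run_start is not None and ref_pos - run_start >= min_len:
        runs.append((run_start, ref_pos))
    return runs

# Toy three-species block with one mismatch at reference position 3.
found = uninterrupted_runs(["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC"], min_len=4)
```

Lowering `min_len` makes the search more inclusive, which is the trade-off discussed above for the 18 bp threshold.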
Computing Secondary Structure
Computing folding structure of RNA molecules can be
informed by conservation between related genomes. To identify secondary structure conservation, EvoFold was used to predict folding given a MAF block and a
phylogenetic tree. As with similar EvoFold studies (Pedersen
et al. 2006; Stark et al. 2007), conserved regions predicted by
phastCons were first joined to any neighboring conserved
region at a distance no greater than 30 bp. These extended
regions were subsequently split into lengths no greater than
750 bp. MAF blocks were extracted from the 20way alignment using the MafFrag utility, part of the UCSC software
package, and postprocessed to be compatible with EvoFold
alignment format. The 20 species newick tree used for all
EvoFold runs was sourced from the phastCons conservation
run. EvoFold predictions were distributed across a compute
cluster using default values in the control file provided in the
EvoFold source code. Folds with scores below an LOD of 100
and folds with overlap of repetitive elements were filtered
out. Result files were formatted into a BED6 structure, adhering to the file format used for previously implemented UCSC
genome browser EvoFold tracks, and uploaded to a MySQL
database for use as a browser track. Resulting fold lengths are
graphed in supplementary figure S3, Supplementary Material
online.
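The joining and splitting of conserved regions before folding can be sketched as below. The 30 bp and 750 bp values come from the text; cutting long regions into consecutive fixed-size windows is an assumption of this example, since the exact splitting rule is not stated.

```python
def join_and_split(regions, max_gap=30, max_len=750):
    """Join regions separated by at most max_gap, then cap each at max_len.

    regions are half-open (start, end) intervals on one chromosome.
    """
    joined = []
    for start, end in sorted(regions):
        if joined and start - joined[-1][1] <= max_gap:
            joined[-1][1] = max(joined[-1][1], end)
        else:
            joined.append([start, end])
    pieces = []
    for start, end in joined:
        # Assumed rule: consecutive windows of max_len, last one shorter.
        for piece_start in range(start, end, max_len):
            pieces.append((piece_start, min(piece_start + max_len, end)))
    return pieces

# Toy coordinates: the first two regions are 20 bp apart and are joined.
segments = join_and_split([(0, 100), (120, 200), (300, 400)])
long_region = join_and_split([(0, 1600)])
```

In the toy input the third region lies 100 bp from its neighbor and therefore stays separate, while the 1,600-bp region is cut into three pieces.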
A similar approach was taken to predict secondary structure in RNAs using the RNAalifold algorithm, part of the
Vienna RNA package (Hofacker et al. 2002; Bernhart et al.
2008). This method uses the same data set described earlier,
which contains conserved elements identified by phastCons
and processed for length and format. Results were filtered
using the same thresholds as above and postprocessed to
create a genome browser track. Secondary structure enrichment was found by intersecting the EvoFold annotation
track with other annotation tracks using the featureBits
command line tool. A final browser track was created
containing the composite scores of both independent prediction methods. This data set was used for the results in
figure 4.
The control data set used to verify the accuracy of predictions was sourced from TAIR annotation of tRNAs, which
produced 658 annotations. Predicted fRNAs overlap 637
of the 658 annotations. The enrichment of predicted fRNAs
in the set of existing annotation for the A. thaliana genome can be seen in figure 3E. As would be expected, translation-related RNAs (including tRNAs, rRNAs, snRNAs, and
snoRNAs) are significantly enriched for folding regions, with
more than four times the enrichment (17.52x) of the next nearest
category, regulatory RNAs (4.08x).
Annotating Conserved Noncoding Regions
To characterize unannotated conserved regions scored by
phastCons as most conserved within flowering plants, we
relied on BLAST-based homology searches with default
search parameters. Annotation focused on the top 10% of the distribution of
most-conserved elements, so as to limit a considerably large data set to only the most
highly conserved regions. A first-pass search for homology
was performed using the BLAST algorithm to scan TAIR version 10 genome-wide annotation. BLAST results from this first-pass
search were parsed using a custom script, which
extracted the top-scoring search term for any result with an e-value of 0.1 or less. Regions with no homology within
known A. thaliana annotation were then searched for homology to any known plant annotation contained in the Plant
Genome Database, using an e-value cutoff of 0.1 or less
(Duvick et al. 2007). In each case, the top BLAST search
term was used as its tentative annotation. Further annotation
was achieved by intersecting bed files containing coordinates
of conserved regions annotated by BLAST homology with
secondary structure browser tracks, proximity to exons, and
existing small RNA expression databases sourced from the
ASRP (Backman et al. 2008). Exon proximity was
determined by searching for coordinates that were within
164 bp of an annotated exon, the average intron length in
A. thaliana.
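The parsing step that keeps one top-scoring hit per conserved region can be sketched as follows. The custom script is not shown in this article, so the example assumes the standard 12-column tabular BLAST output, in which columns 11 and 12 hold the e-value and bit score; the identifiers in the sample are invented.

```python
import csv
import io

def top_hits(tabular_text, max_evalue=0.1):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Hits with an e-value above max_evalue are ignored.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment headers
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Invented sample records, for illustration only.
sample = (
    "cons1\tgeneA\t98.0\t50\t1\t0\t1\t50\t100\t149\t1e-20\t90.0\n"
    "cons1\tgeneB\t90.0\t40\t4\t0\t1\t40\t300\t339\t1e-05\t45.0\n"
    "cons2\tgeneC\t80.0\t20\t4\t0\t1\t20\t500\t519\t2.5\t18.0\n"
)
annotations = top_hits(sample)
```

Regions absent from the returned dictionary, such as `cons2` in this sample, would be the ones carried forward to the second search against other plant annotation.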
Programming and Data
All programs were written in the Python and C programming languages. All custom software used in the development
and analyses is available upon request. All data sets of conserved elements and annotations have been made available as
files and tracks on the A. thaliana genome browser (araTha9)
located at genome.genetics.rutgers.edu.
Supplementary Material
Supplementary tables S1 and S2 and figures S1–S5 are
available at Molecular Biology and Evolution online (http://
www.mbe.oxfordjournals.org/).
Acknowledgments
This work was supported by two grants to A.D.K., NSF
MCB-1052148 and DOE/USDA 124336, as well as the
Human Genetics Institute of New Jersey.
References
Acarkan A, Rossberg M, Koch M, Schmidt R. 2000. Comparative genome
analysis reveals extensive conservation of genome organisation for
Arabidopsis thaliana and Capsella rubella. Plant J. 23:55–62.
Arabidopsis Genome Initiative. 2000. Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature 408:
796–815.
Argout X, Salse J, Aury J-M, et al. (61 co-authors). 2011. The genome of
Theobroma cacao. Nat Genet. 43:101–108.
Backman TWH, Sullivan CM, Cumbie JS, Miller ZA, Chapman EJ,
Fahlgren N, Givan SA, Carrington JC, Kasschau KD. 2008. Update
of ASRP: the Arabidopsis small RNA project database. Nucleic Acids
Res. 36:D982–D985.
Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS,
Haussler D. 2004. Ultraconserved elements in the human genome.
Science 304:1321–1325.
Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. 2008. RNAalifold:
improved consensus structure prediction for RNA alignments.
BMC Bioinformatics 9:474.
Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). 2004. Aligning
multiple genomic sequences with the threaded blockset aligner.
Genome Res. 14:708–715.
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative
Sequencing Program, Green ED, Sidow A, Batzoglou S. 2003. LAGAN
and Multi-LAGAN: efficient tools for large-scale multiple alignment
of genomic DNA. Genome Res. 13:721–731.
Brudno M, Poliakov A, Minovitsky S, Ratnere I, Dubchak I. 2007. Multiple
whole genome alignments and novel biomedical applications at the
VISTA portal. Nucleic Acids Res. 35:W669–W674.
Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ,
Hamilton JP, Buell CR. 2007. Identification and characterization
of lineage-specific genes within the Poaceae. Plant Physiol. 145:
1311–1322.
Chan AP, Crabtree J, Zhao Q, et al. (18 co-authors). 2010. Draft genome
sequence of the oilseed species Ricinus communis. Nat Biotechnol.
28:951–956.
Chiang CWK, Derti A, Schwartz D, Chou MF, Hirschhorn JN, Wu C-T.
2008. Ultraconserved elements: analyses of dosage sensitivity, motifs
and boundaries. Genetics 180:2277–2293.
Cooper GM, Brudno M, NISC Comparative Sequencing Program., Green
ED, Batzoglou S, Sidow A. 2003. Quantitative estimates of sequence
divergence for comparative analyses of mammalian genomes.
Genome Res. 13:813–820.
Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V.
2004. Darwin’s abominable mystery: insights from a supertree of
the angiosperms. Proc Natl Acad Sci U S A. 101:1904–1909.
Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, et al.
(418 co-authors). 2007. Evolution of genes and genomes on the
Drosophila phylogeny. Nature 450:203–218.
Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM,
Frazer KA. 2000. Active conservation of noncoding sequences
revealed by three-way species comparisons. Genome Res. 10:
1304–1306.
Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ,
Lushbough C, Brendel V. 2007. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 36:D959–D965.
Eddy SR. 2005. A model of the statistical power of comparative genome
sequence analysis. PLoS Biol. 3:e10.
Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. 2004. VISTA:
computational tools for comparative genomics. Nucleic Acids Res.
32:W273–W279.
Freeling M, Subramaniam S. 2009. Conserved noncoding sequences
(CNSs) in higher plants. Curr Opin Plant Biol. 12:126–132.
Friedman RC, Farh KK-H, Burge CB, Bartel DP. 2009. Most mammalian
mRNAs are conserved targets of microRNAs. Genome Res. 19:
92–105.
Fujita PA, Rhead B, Zweig AS, et al. (27 co-authors). 2011. The UCSC
Genome Browser database: update 2011. Nucleic Acids Res. 39:
D876–D882.
Gebhardt C, Walkemeier B, Henselewski H, Barakat A, Delseny M, Stuber
K. 2003. Comparative mapping between potato (Solanum tuberosum) and Arabidopsis thaliana reveals structurally conserved
domains and ancient duplications in the potato genome. Plant J.
34:529–541.
Giuliano G, Pichersky E, Malik VS, Timko MP, Scolnik PA, Cashmore AR.
1988. An evolutionarily conserved protein binding sequence
upstream of a plant light-regulated gene. Proc Natl Acad Sci
U S A. 85:7089–7093.
Glazov EA, Pheasant M, McGraw EA, Bejerano G, Mattick JS. 2005.
Ultraconserved elements in insect genomes: a highly conserved
intronic sequence implicated in the control of homothorax
mRNA splicing. Genome Res. 15:800–808.
Goff SA, Ricke D, Lan T-H, et al. (55 co-authors). 2002. A draft sequence
of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100.
Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL.
2010. De novo prediction of structured RNAs from genomic
sequences. Trends Biotechnol. 28:9–19.
Griffith O, Montgomery SB, Bernier B, et al. (27 co-authors). 2008.
ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36:D107–D113.
Guo H. 2003. Conserved Noncoding sequences among cultivated cereal
genomes identify candidate regulatory sequence elements and
patterns of promoter evolution. Plant Cell 15:1143–1158.
Hare PD, Moller SG, Huang L-F, Chua N-H. 2003. LAF3, a novel factor
required for normal phytochrome A signaling. Plant Physiol. 133:
1592–1604.
Harris RS. 2007. Improved pairwise alignment of genomic DNA.
[PhD thesis]. [University Park (PA)]: The Pennsylvania State
University.
Hofacker IL, Fekete M, Stadler PF. 2002. Secondary structure prediction
for aligned RNA sequences. J Mol Biol. 319:1059–1066.
Hu TT, Pattyn P, Bakker EG, et al. (30 co-authors). 2011. The Arabidopsis
lyrata genome sequence and the basis of rapid genome size change.
Nat Genet. 43:476–481.
Huang S, Li R, Zhang Z, et al. (96 co-authors). 2009. The genome of the
cucumber, Cucumis sativus L. Nat Genet. 41:1275–1281.
Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic
analysis with space/time models. Brief Bioinform. 12:41–51.
Hudson ME, Quail PH. 2003. Identification of promoter motifs involved
in the network of phytochrome A-regulated gene expression by
combined analysis of genomic sequence and microarray data.
Plant Physiol. 133:1605–1616.
Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M. 2003.
Conserved noncoding sequences in the grasses. Genome Res. 13:
2030–2041.
Kaplinsky NJ. 2002. Utility and distribution of conserved noncoding
sequences in the grasses. Proc Natl Acad Sci U S A. 99:6147–6151.
Katzman S, Kern AD, Bejerano G, Fewell G, Fulton L, Wilson RK, Salama
SR, Haussler D. 2007. Human genome ultraconserved elements are
ultraselected. Science 317:915.
Kellis M, Patterson N, Endrizzi M, Birren B. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements.
Nature 423:241–254.
Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12:
656–664.
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s
cauldron: duplication, deletion, and rearrangement in the mouse
and human genomes. Proc Natl Acad Sci U S A. 100:11484–11489.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM,
Haussler D. 2002. The human genome browser at UCSC. Genome
Res. 12:996–1006.
Kim J, He X, Sinha S. 2009. Evolution of regulatory sequences in
12 Drosophila species. PLoS Genet. 5:e1000330.
Kim MY, Lee S, Van K, et al. (29 co-authors). 2010. Whole-genome
sequencing and intensive analysis of the undomesticated soybean
(Glycine soja Sieb. and Zucc.) genome. Proc Natl Acad Sci U S A. 107:
22032–22037.
Kindgren P, Kremnev D, Blanco NE, de Dios Barajas López J, Fernández
AP, Tellgren-Roth C, Small I, Strand A. 2011. The plastid redox
insensitive 2 mutant of Arabidopsis is impaired in PEP activity and
high light-dependent plastid redox signalling to the nucleus. Plant J.
70:279–291.
Kritsas K, Wuest SE, Hupalo D, Kern AD, Wicker T, Grossniklaus U. 2012.
Computational analysis and characterization of UCE-like elements
(ULEs) in plant genomes. Genome Res. 22:2455–2466.
Ku HM, Vision T, Liu J, Tanksley SD. 2000. Comparing sequenced
segments of the tomato and Arabidopsis genomes: large-scale
duplication followed by selective gene loss creates a network of
synteny. Proc Natl Acad Sci U S A. 97:9121–9126.
Lamesch P, Berardini T, Li D. 2011. The Arabidopsis Information
Resource (TAIR): improved gene annotation and new tools.
Nucleic Acids Res. 21:1–9.
Lenz D, May P, Walther D. 2011. Comparative analysis of miRNAs and
their targets across four plant species. BMC Res Notes. 4:483.
Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and
calling variants using mapping quality scores. Genome Res. 18:
1851–1858.
Michaud M, Cognat V, Duchêne A-M, Maréchal-Drouard L. 2011.
A global picture of tRNA genes in plant genomes. Plant J. 66:80–93.
Miller W, Rosenbloom K, Hardison RC, et al. (26 co-authors). 2007.
28-way vertebrate alignment and conservation track in the UCSC
Genome Browser. Genome Res. 17:1797–1808.
Ming R, Hou S, Feng Y, et al. (85 co-authors). 2008. The draft genome of
the transgenic tropical fruit tree papaya (Carica papaya Linnaeus).
Nature 452:991–996.
Morrell PL, Buckler ES, Ross-Ibarra J. 2011. Crop genomics: advances and
applications. Nat Rev Genet. 13:85–96.
Paterson AH, Bowers JE, Bruggmann R, et al. (45 co-authors). 2009. The
Sorghum bicolor genome and the diversification of grasses. Nature
457:551–556.
Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K,
Lander ES, Kent J, Miller W, Haussler D. 2006. Identification and
classification of conserved RNA secondary structures in the
human genome. PLoS Comput Biol. 2:e33.
Retzel EF, Johnson JE, Crow JA, Lamblin AF, Paule CE. 2007.
Legume resources: MtDB and Medicago.Org. Methods Mol Biol.
406:261–274.
Rhead B, Karolchik D, Kuhn RM, et al. (20 co-authors). 2010. The UCSC
Genome Browser database: update 2010. Nucleic Acids Res. 38:
D613–D619.
Sato S, Nakamura Y, Kaneko T, et al. (29 co-authors). 2008. Genome
structure of the legume, Lotus japonicus. DNA Res. 15:227–239.
Schmutz J, Cannon SB, Schlueter J, et al. (45 co-authors). 2010. Genome
sequence of the palaeopolyploid soybean. Nature 463:178–183.
Schmidt R. 2002. Plant genome evolution: lessons from comparative
genomics at the DNA level. Plant Mol Biol. 48:21–37.
Schnable PS, Ware D, Fulton RS, et al. (157 co-authors). 2009. The B73
maize genome: complexity, diversity, and dynamics. Science 326:
1112–1115.
Shulaev V, Sargent DJ, Crowhurst RN, et al. (71 co-authors). 2011. The
genome of woodland strawberry (Fragaria vesca). Nat Genet. 43:
109–116.
Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors). 2005.
Evolutionarily conserved elements in vertebrate, insect, worm, and
yeast genomes. Genome Res. 15:1034–1050.
Siepel A, Haussler D. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol
Evol. 21:468–488.
Smit A, Hubley R. 2004. RepeatMasker Open-3.0. 1996–2010 [Internet].
Institute for Systems Biology. Available from: http://www.repeatmasker.org
Sommer RJ, Ogawa A. 2011. Hormone signaling and phenotypic
plasticity in nematode development and evolution. Curr Biol.
21:R758–R766.
Stark A, Lin MF, Kheradpour P, et al. (44 co-authors). 2007. Discovery of
functional elements in 12 Drosophila genomes using evolutionary
signatures. Nature 450:219–232.
Stojanovic N. 2009. A study of the distribution of phylogenetically
conserved blocks within clusters of mammalian homeobox genes.
Genet Mol Biol. 32:666–673.
Sultan SE. 2000. Phenotypic plasticity for plant development, function
and life history. Trends Plant Sci. 5:537–542.
Swarbreck D, Wilks C, Lamesch P, et al. (16 co-authors). 2008.
The Arabidopsis Information Resource (TAIR): gene structure and
function annotation. Nucleic Acids Res. 36:D1009–D1014.
Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008.
Synteny and collinearity in plant genomes. Science 320:486–488.
Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH. 2008.
Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18:1944–1954.
Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M. 2007. Arabidopsis
intragenomic conserved noncoding sequence. Proc Natl Acad Sci
U S A. 104:3348–3353.
Tuskan GA, Difazio S, Jansson S, et al. (110 co-authors). 2006. The
genome of black cottonwood, Populus trichocarpa (Torr. & Gray).
Science 313:1596–1604.
Velasco R, Zharkikh A, Affourtit J, et al. (86 co-authors). 2010. The
genome of the domesticated apple (Malus domestica Borkh.).
Nat Genet. 42:833–839.
Velasco R, Zharkikh A, Troggio M, et al. (57 co-authors). 2007. A high
quality draft consensus sequence of the genome of a heterozygous
grapevine variety. PLoS One 2:e1326.
Vogel J, Garvin D, Mockler T, Schmutz J. 2010. Genome sequencing and
analysis of the model grass Brachypodium distachyon. Nature 463:
763–768.
Wang X, Haberer G, Mayer KF. 2009. Discovery of cis-elements between
sorghum and rice using co-expression and evolutionary conservation. BMC Genomics 10:284.
Wang X, Wang H, Wang J, et al. (110 co-authors). 2011. The
genome of the mesopolyploid crop species Brassica rapa. Nat
Genet. 43:1035–1039.
Wang Y, Diehl A, Wu F, Vrebalov J, Giovannoni J, Siepel A, Tanksley SD.
2008. Sequencing and comparative analysis of a conserved syntenic
segment in the Solanaceae. Genetics 180:391–408.
Xu X, Pan S, Cheng S, et al. (98 co-authors). 2011. Genome sequence
and analysis of the tuber crop potato. Nature 475:189–195.
Yang X, Jawdy S, Tschaplinski T. 2009. Genome-wide identification of
lineage-specific genes in Arabidopsis, Oryza and Populus. Genomics
93:473–480.
Yang Y-W, Lai K-N, Tai P-Y, Li W-H. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of
divergence between Brassica and other angiosperm lineages. J Mol
Evol. 48:597–604.
Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, Grotewold E.
2011. AGRIS: the Arabidopsis Gene Regulatory Information Server,
an update. Nucleic Acids Res. 39:D1118–D1122.
Zeller G, Henz SR, Widmer CK, Sachsenberg T, Rätsch G, Weigel D,
Laubinger S. 2009. Stress-induced changes in the Arabidopsis thaliana transcriptome analyzed using whole-genome tiling arrays.
Plant J. 58:1068–1082.
Zhang B, Pan X, Cannon C, Cobb G. 2006. Conservation and divergence
of plant microRNA genes. Plant J. 46:243–259.
Zheng W-X, Zhang C-T. 2008. Ultraconserved elements between the
genomes of the plants Arabidopsis thaliana and rice. J Biomol Struct
Dyn. 26:1–8.