Evolution of the Australian Lungfish (Neoceratodus forsteri)
Genome: A Major Role for CR1 and L2 LINE Elements
Cushla J. Metcalfe,‡,1 Jonathan Filée,1 Isabelle Germon,1 Jean Joss,2 and Didier Casane*,1
1 Laboratoire Evolution, Génomes et Spéciation, Centre National de la Recherche Scientifique, Gif-sur-Yvette, and Université Paris Diderot, Paris, France
2 Department of Biological Sciences, Macquarie University, New South Wales, Australia
‡Present address: Universidade de São Paulo, Instituto de Biociências, Cidade Universitária, São Paulo, Brazil
*Corresponding author: E-mail: [email protected].
Associate editor: Todd Oakley
Haploid genomes greater than 25,000 Mb are rare; within the animals, only the lungfish and some of the salamanders and
crustaceans are known to have genomes this large. There are very few data on the structure of genomes this size. It is known,
however, that for animal genomes up to 3,000 Mb, there is in general a good correlation between genome size and the percent of
the genome composed of repetitive sequence and that this repetitive component is highly dynamic. In this study, we sampled
the Australian lungfish genome using three mini-genomic libraries and found that with very little sequence, the results converged
on an estimate of 40% of the genome being composed of recognizable transposable elements (TEs), chiefly from the CR1 and L2
long interspersed nuclear element clades. We further characterized the CR1 and L2 elements in the lungfish genome and show
that although most CR1 elements probably represent recent amplifications, the L2 elements are more diverse and are more likely
the result of a series of amplifications. We suggest that our sampling method has probably underestimated the recognizable TE
content. However, on the basis of the most likely sources of error, we suggest that this very large genome is not largely composed
of recently amplified, undetected TEs but may instead include a large component of older degenerate TEs. Based on these
estimates, and on Thomson’s (Thomson K. 1972. An attempt to reconstruct evolutionary changes in the cellular DNA content of
lungfish. J Exp Zool. 180:363–372) inference that in the lineage leading to the extant Australian lungfish, there was a massive
increase in genome size between 350 and 200 mya, after which the size of the genome changed little, we speculate that the very
large Australian lungfish genome may be the result of a massive amplification of TEs followed by a long period with a very low
rate of sequence removal and some ongoing TE activity.
Key words: lungfish, genome size, transposable elements, CR1-like element.
Introduction
The lungfishes are likely the closest extant relatives to the
tetrapods (Brinkmann et al. 2004). They are, therefore, important to our understanding of the evolutionary transition from
water to air. The lungfishes are of additional interest because
all extant lungfish have very large genomes: the smallest of
them is that of Neoceratodus forsteri (the Australian lungfish),
with a haploid genome of 50,000 Mb, and the largest is the
enormous 130,000 Mb genome of Protopterus aethiopicus
(the Marbled lungfish). Within the metazoans examined,
only some of the salamanders and amphipods have comparably sized genomes (>25,000 Mb) (Gregory 2010). Genome
size is linked with features at the organismal level, such as cell
size and cell division, metabolic rate, and developmental rate
(Gregory 2005). Of particular interest is the link between developmental complexity and genome size, best described in
the salamanders, where normal metamorphosis is associated
with the smallest genomes, whereas obligate neotenic development is associated with the largest genomes (Gregory
2005). It has been hypothesized that extant lungfish are actually obligate neotenics (Joss 2006); this hypothesis is based on deficiencies in
the thyroid axis and is entirely consistent with the very large
genome sizes of extant lungfish.
The lack of correlation between genome size and the
number of genes it contains or the complexity of the organism in which it is found is known as the "C-value paradox"
(Thomas 1971). This "paradox" can be answered at the most
basic level by the observation that, above a genome size of
100 Mb, the larger the genome, the greater the contribution of transposable elements (TEs) relative to other
sources of length variation (Kidwell 2002; Lynch and Conery
2003). For example, TEs make up 45% of the human genome
(3,000 Mb) but only 2.7% of the pufferfish genome (330 Mb)
(Hua-Van et al. 2011). In larger genomes, therefore, cellular
genes are a tiny fraction of the genome, whereas TEs form the
bulk. The variation in genome size is not a one-way street of
increasing genome size by TE amplification but more likely an
interplay between the rate of removal and addition of DNA.
For example, in insects, the small Drosophila melanogaster
genome (175 Mb) removes DNA 40 times faster than the
11 times larger genomes of the Laupala crickets (Petrov
et al. 2000), whereas in plants, the size and frequency of deletions is greater in Arabidopsis (125 Mb) than in tobacco
(5,100 Mb). TEs can impact the genome in many diverse ways,
not only by affecting genome size but also by, for example,
providing a template for recombination, by inserting within
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: [email protected]
Mol. Biol. Evol. 29(11):3529–3539 doi:10.1093/molbev/mss159 Advance Access publication June 24, 2012
Downloaded from http://mbe.oxfordjournals.org/ at INIST-CNRS on November 6, 2012
Metcalfe et al. . doi:10.1093/molbev/mss159
tested whether such a small amount of sequence could give
a broad estimate of the TE composition of a genome by
an in silico simulation of random sequencing in the well-characterized human genome. We compare our results for
the Australian lungfish with those from the salamanders,
which have comparable genome sizes (Sun et al. 2011).
Materials and Methods
Unless otherwise noted, all kits were used according to the
manufacturer’s instructions.
Sample Collection and DNA Extraction
Animals were held at Macquarie University (Australia) under
license no. 2009/039. Blood was taken using the protocol
under license no. 2006/020, and DNA was extracted using
the DNeasy Tissue Kit (Qiagen). Tissue samples were taken
using the protocol under license no. 2006/020 and DNA extracted using either a standard phenol chloroform extraction
method (Sambrook et al. 1989) or the High Pure PCR
Template Preparation Kit (Roche).
Construction and Sequencing of
Mini-Genomic Libraries
The composition of the genome was estimated by random
sequencing of three mini-genomic libraries. For all three libraries, 6 µg of gDNA was digested. The first library was constructed by full digest with PvuII and EcoRV (New England
BioLabs) of gDNA extracted from tissue. The digested
DNA was migrated on a gel and a smear between 10 and
1 kb excised and purified using the NucleoTraPCR kit
(Macherey-Nagel). The other two libraries were constructed
using partial restriction digests following the protocol from
the Vosshall Laboratory (Rockefeller University) website,
http://vosshall.rockefeller.edu/protocols.php. The first partial
digest library was created using DNA extracted from blood
and using AluI (New England Biolabs). The second partial
digest library was constructed using DNA extracted from
tissue and MboI (New England Biolabs). For both enzymes,
a 1:20 dilution resulting in a smear between 10 and 1 kb was
purified using the High Pure PCR Template Preparation Kit
(Roche), precipitated using standard methods, and resuspended in 20 µl 10 mM Tris. The 5′ overhang of the MboI
digest was blunted using T4 DNA polymerase (Expand
Cloning kit from Roche).
Adenine overhangs were created for all three digests using
Go Taq Flexi DNA Polymerase (Promega) according to the
Promega Subcloning Notebook protocol, purified using the
High Pure PCR Template Preparation Kit (Roche) and precipitated using standard methods (Sambrook et al. 1989);
200–400 ng was used in a ligation reaction with pGEM-T
Easy Vector (Promega). The insert length of a random
sample was estimated by PCR, and inserts greater than 1 kb
were sequenced (National Center for Biotechnology
Information [NCBI] accession numbers JJ725300–JJ725335
for sequences from the AluI library, JJ725336–JJ725427 for
sequences from the PvuII/EcoRV library, and HR872187–
HR872226 for the sequences from the MboI library). A total
or near coding regions thereby disrupting gene expression or
modifying the transcription of neighboring genes, or by providing sequence that is co-opted by the genome for essential
functions (reviewed in Venner et al. 2009). An understanding
of the dynamics between the host genome and TEs is, therefore, crucial to our understanding of genome structure and
function.
The timing of the expansion of the Australian lungfish
genome may shed light on the dynamics of this genome:
was it a single rapid expansion, and was it recent or very
ancient, or was there a series of expansions and contractions?
Based on the correlation between genome size and cell size in
extant organisms (Gregory 2001), it is possible to estimate the
genome size of extinct organisms using fossil cell size as proxies (Morris and Harper 1988; Organ et al. 2007). Thomson
(1972) used the comprehensive lungfish fossil record to estimate the evolutionary history of lungfish genome sizes and
inferred that genome size remained small and constant until
350 mya, rapidly increased between 350 and 200 mya, and
then changed little in the lineage leading to the extant
Australian lungfish. On the basis of these results of
Thomson (1972) and their estimation that 0.05% of the
Australian lungfish genome is recognizable TEs, Sirijovski
et al. (2005) proposed that this genome is "a cemetery of
TEs" resulting from an ancient burst of transposition followed
by a long period of very low activity, resulting in massive
amounts of unique sequences. On the assumption that any
major bands from a restriction enzyme digest would be a
major component of the genome, Sirijovski et al. (2005) sequenced the only two bands resulting from multiple restriction digests of Australian lungfish genomic DNA (gDNA). The
chief component of both bands was fragments of a TE, a CR1
element, the copy number of which they estimated using
quantitative polymerase chain reaction (qPCR). If the lungfish
genome is not chiefly composed of TEs, could the large
genome be the result of a polyploidy event? The lungfish
karyotype (Rock et al. 1996) and a phylogeny of several Hox
genes (Longhurst and Joss 1999) suggest that the genome has
not undergone a recent polyploidy event. In addition, polyploidy does not necessarily result in large genomes, due to the
phenomenon of genome downsizing (Ozkan et al. 2003;
Leitch and Bennett 2004).
The data of Sirijovski et al. (2005), however, are indirect
evidence of how the Australian lungfish genome is organized.
In addition, a qPCR, because it uses specific primers, may only
identify a small proportion of the number of elements present. We, therefore, decided to revisit the question of the
composition of the Australian lungfish genome. As a first
estimate to determine what this genome is chiefly composed
of, we randomly sequenced three mini-genomic libraries.
With a very small amount of sequence, the results from all
three libraries converged, i.e., a large percentage, between 36.7
and 42.3%, of the sequence data was recognizable TEs, between 49.8 and 69.1% of this from the closely related long
interspersed nuclear element (LINE) CR1 and L2 clades. We
further characterized these elements to obtain an almost
full-length representative copy of each and confirm that
the random sequences are CR1 and L2 LINE elements. We
Evolution of the Size of the Australian Lungfish Genome . doi:10.1093/molbev/mss159
of 131.6 kb was sequenced. Sequences were examined using
Chromatogram Explorer (Heracle Software) and trimmed to
remove low-quality ends.
MEGA5 (Tamura et al. 2011) for two regions: 1) the contig
created by BioEdit and the genome-walking sequences at
each step and 2) the sequences in the overlapping regions
between each step.
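The p distance used for this check is simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch of that calculation, not the MEGA5 implementation: the function names are hypothetical, and it assumes that columns containing a gap or N in either sequence are skipped pair by pair.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns with a gap ('-') or 'N' in either sequence
    are ignored (pairwise deletion).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_p_distance(seqs):
    """Mean p distance over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (invented sequences, for illustration only)
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
print(round(mean_p_distance(alignment), 3))
```

A low mean p distance between a contig and the fragments it was built from would be consistent with the contig representing a set of closely related elements.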
In Silico Random Sequencing of Human Genome
Sequence of the complete human genome (version GRCh37)
was downloaded from the UCSC Genome Browser website
(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/).
A Perl script was written to extract 150 sequences of 900 bp
at random. We also checked that the AluI, EcoRV, MboI, and
PvuII restriction sites are frequent and quite evenly distributed throughout the human genome (data not shown).
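The original Perl script is not reproduced in the article. As a rough illustration of the idea, the sketch below (in Python rather than Perl) draws 150 windows of 900 bp uniformly over all valid start positions. The function name, the dictionary of sequence lengths, and the fixed seed are assumptions made for the example, and the sketch does not filter out assembly gaps.

```python
import random


def sample_fragments(chrom_lengths, n=150, length=900, seed=None):
    """Return n (chrom, start, end) windows, 0-based and half-open.

    Each chromosome is weighted by its number of valid start
    positions, so every possible window is equally likely.
    """
    rng = random.Random(seed)
    names = [c for c, size in chrom_lengths.items() if size >= length]
    weights = [chrom_lengths[c] - length + 1 for c in names]
    picks = []
    for _ in range(n):
        chrom = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randrange(chrom_lengths[chrom] - length + 1)
        picks.append((chrom, start, start + length))
    return picks


# Toy lengths (hypothetical, not real chromosome sizes)
toy_lengths = {"chrA": 5000, "chrB": 1200, "chrC": 500}
windows = sample_fragments(toy_lengths, seed=1)
print(len(windows), sum(end - start for _, start, end in windows))
```

With the default settings the sampled windows total 150 × 900 = 135,000 bp, so a sample of this design amounts to 135 kb of sequence.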
Sequence Analysis
PCR Genome Walking
Genome walking was performed using the GenomeWalker
Universal Kit (Clontech). Specific primers (supplementary
table S1, Supplementary Material online) were designed
using Primer3Plus (Untergasser et al. 2007); 10 µl PCR reactions were performed using Advantage Genomic LA
Polymerase Mix (Clontech) according to Clontech's instructions. PCR reactions were visualized using gel electrophoresis, and single bands were excised and purified using the
NucleoSpin Extract II kit (Macherey-Nagel). Purified DNA
was then cloned into pGEM-T Easy Vector (Promega). Eight
clones were screened by PCR, and two to four clones, except for
one step for the L2 sequence, were sequenced at each step
(NCBI accession numbers JF501664–JF501688 and JN935285–
JN935284). For each set of genome-walked sequences and for
the genome-walked fragments published by Sirijovski et al.
(2005), a single contig was created using the CAP contig assembly program in BioEdit (Hall 1999) with a minimum base
overlap of 20 and a minimum percent match of 85. These
were named including the appropriate prefix from Wicker
et al. (2007) RIJ_NfCR1_PE112GW, RIJ_NfCR1_PE114GW,
and RIJ_NfL2_PE19GW, and for the genome-walked fragments published by Sirijovski et al. (2005), RIJ_NfCR1_GW.
To check that the contig was representative of closely related
CR1-like elements, the mean p distance was estimated using
Degenerate primers were designed against amino acid motifs
to amplify the endonuclease and reverse transcriptase domains for 20 CR1 (RIJ_NfCR1_F1 and F5 clones) and 9 L2
sequences (RIJ_NfL2_49A and L2_59A clones). Primer sequences are shown in supplementary table S1, Supplementary Material online. The fragments were amplified using
DNA extracted from blood using the Advantage Genomic
LA Polymerase Mix (Clontech). Ten RIJ_NfCR1_F1 and 10
RIJ_NfCR1_F5 and nine RIJ-NfL2_49A/59A clones were fully
sequenced using internal primers and aligned with BioEdit
(Hall 1999). NCBI accession numbers are JF501689–
JF501708 and JN935276–JN935284.
Phylogenetic Analysis of CR1 and L2 Sequences
Ninety-four percent of the top hits of the mini-genomic libraries CR1 and L2 sequences were to sequences from other
chordates, therefore the phylogenetic analysis was done with
just chordate sequences. All CR1 and L2 nucleotide sequences
plus those from the other clades within the Jockey group
(Kapitonov et al. 2009) were retrieved from Repbase (Jurka
et al. 2005). A single lungfish PCR-amplified CR1 and L2 sequence was also used as a separate tblastx query on the NCBI
website. Incomplete sequences, that is, those that did not
include sequence from domain II in the endonuclease
domain to domain 8 in the reverse transcriptase domain
were removed. For the remaining CR1 sequences only, sequences were grouped together according to which organism
they came from and aligned using ClustalW within BioEdit
(Hall 1999). For sequences with a pairwise percent identity >
95% at the nucleotide level, redundant sequences were removed, so that only one remained. For organisms where there
were more than five sequences, a neighbor-joining (NJ) phylogeny was inferred, and three to five sequences representing
their diversity were selected (supplementary fig. S1,
Supplementary Material online). All sequences retrieved
from the NCBI website and Repbase (Jurka et al. 2005)
were submitted to RepeatMasker (Smit et al. 2010).
Sequences were renamed according to the type of element
as identified by RepeatMasker, a two-letter identifier for the
species, and then the original sequence name or identifier
from NCBI. The original CR1-like element from the
Australian lungfish (Sirijovski et al. 2005) was named
"NfCR1." The sequence used here is a contig using the genome-walked fragments and was renamed "CR1-NfCR1_GW"
to distinguish it from the original element described.
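The removal of redundant sequences above the 95% identity threshold can be pictured as a simple greedy filter. The sketch below is an assumption about how such a filter might work, not the procedure actually used: the function names are hypothetical, identity is computed over aligned columns where neither sequence has a gap, and the first sequence encountered in each redundant group is the one retained.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identical positions over aligned columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def remove_redundant(aligned, threshold=95.0):
    """Keep a sequence only if it is not more than `threshold` percent
    identical to any sequence already kept (greedy, order dependent)."""
    kept = {}
    for name, seq in aligned.items():
        if all(percent_identity(seq, k) <= threshold for k in kept.values()):
            kept[name] = seq
    return kept


# Invented sequences for illustration only
ref = "ACGT" * 10
toy = {
    "a": ref,
    "b": "T" + ref[1:],      # 97.5% identical to a, dropped
    "c": "TTTT" + ref[4:],   # 92.5% identical to a, kept
}
print(list(remove_redundant(toy)))
```

Because the filter is greedy, the result can depend on input order; a clustering step would be needed if that matters.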
The lungfish CR1-like genome-walked contigs (RIJ_NfCR1_
PE112GW, RIJ_NfCR1_PE114GW and RIJ_NfCR1_GW, and
RIJ_NfL2_PE19GW), PCR-amplified elements and the collection of elements from Repbase and NCBI were conceptually
translated, aligned with MUSCLE (Edgar 2004), and manually
The 150 sequences of 900 bp from the human genome were
analyzed only using RepeatMasker (Smit et al. 2010) against
the human library. Lungfish sequences were submitted to
CENSOR on the Repbase website (Kohany et al. 2006) using
the forced translate and report simple repeats option.
Sequences masked by CENSOR were screened for de novo
repeats using RepeatModeler (Smit and Hubley 2010).
CENSOR hits were reclassified by submitting them to
RepeatMasker and totals calculated using an Excel spreadsheet. This was done because although a reclassification of
some LINEs in the Repbase database has been published
(Kapitonov et al. 2009), the actual database does not reflect
it. CENSOR masked lungfish sequences were submitted to
self-Blastn and a Blastx against nonredundant protein sequences (cutoff < e−15) on the NCBI website (Sayers et al.
2011) in an attempt to find further repetitive sequences and
"orphan" coding sequences, respectively.
Amplification of Endonuclease and Reverse
Transcriptase Domains
adjusted by eye using BioEdit (Hall 1999). All sequence apart
from the endonuclease domains I–IX and the reverse transcriptase domains 1–8 were removed and the remaining sequence concatenated (supplementary fig. S2, Supplementary
Material online). Using MEGA5 (Tamura et al. 2011), we estimated the optimal model of amino acid substitution. NJ and
maximum likelihood (ML) trees were inferred using a JonesTaylor-Thornton (JTT) substitution matrix (Jones et al. 1992)
with a gamma distribution of substitution rate variation
across sites (optimal model: JTT + Γ with α = 2.8). In both
cases, the robustness of the nodes was estimated by 1,000
bootstrap replicates.
Phylogenetic Analysis of Random-Sequenced
Fragments
Results
Estimated Composition of the Australian Lungfish
(Neoceratodus forsteri) Genome
Analysis of each of the three mini-genomic libraries resulted in
a similar estimated composition of the Australian lungfish
genome (table 1 and supplementary table S2, Supplementary
Material online). Approximately 40% of the genome was estimated to be composed of repetitive sequences. Of these, the
largest component was non–long terminal repeat (LTR) retrotransposons, 77% of the identified repeats. Within the
non-LTR retrotransposons, the most highly represented superfamily was Jockey (the CR1, L2, and Rex-Babar clades). The
CR1 and L2 clades alone were estimated to make up 22% of
the genome, that is, 56% of the repetitive component (table 1 and
supplementary table S2, Supplementary Material online).
Within the LTR retrotransposons, the Dictyostelium intermediate repeat sequence (DIRS) superfamily is the most
highly represented (11% of repeats). DNA transposons are
the smallest TE component (3% of repeats), whereas
simple repeats were estimated to be 0.8% of the genome or
2% of repeats identified. No coding sequences, apart from
those from TEs, were identified using Blastx (NCBI), leaving
60% of the sequences unidentifiable.
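The percentages quoted in this paragraph follow directly from the lengths reported in table 1. The short sketch below reproduces the arithmetic; the lengths are copied from table 1, while the variable and function names are only illustrative.

```python
# Lengths in bp, taken from table 1
lengths_bp = {"CR1": 20035, "L2": 9430, "DIRS": 5836}
total_repeats_bp = 52865
total_sequence_bp = 131622


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


cr1_l2 = lengths_bp["CR1"] + lengths_bp["L2"]
print(pct(cr1_l2, total_sequence_bp))            # CR1 + L2 share of sampled sequence
print(pct(cr1_l2, total_repeats_bp))             # CR1 + L2 share of identified repeats
print(pct(lengths_bp["DIRS"], total_repeats_bp)) # DIRS share of identified repeats
print(pct(total_repeats_bp, total_sequence_bp))  # repeats as a share of sampled sequence
```

These values round to the 22%, 56%, 11%, and 40% figures given in the text.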
Repeat class                     Length (bp)   Fraction of repeats   Fraction of sampled   Fraction of sampled
                                               identified (%)        sequence (%)          sequences, min–max (%)
Satellite
  Low complexity                         403                   0.8                   0.3   0.0–0.5
  Simple repeat                          642                   1.2                   0.5   0.0–0.7
  Total satellite                      1,045                   2.0                   0.8   0.0–1.2
Transposable element
 DNA transposon
  hAT                                    155                   0.3                   0.1   0.0–0.5
  Academ                               1,274                   2.4                   1.0   0.0–2.8
  Harbinger                              295                   0.6                   0.2   0.0–0.3
  Total DNA transposon                 1,724                   3.3                   1.3   0.5–3.0
 LTR retrotransposon
  ERV                                    245                   0.5                   0.2   0.0–0.3
  DIRS                                 5,836                  11.0                   4.4   4.3–4.8
  Gypsy                                1,390                   2.6                   1.1   0.0–2.8
  Ngaro                                2,086                   3.9                   1.6   0.0–3.0
  Total LTR retrotransposon            9,557                  18.1                   7.3   6.3–7.6
 Non-LTR retrotransposon
  HER1                                    57                   0.1                   0.0   0.0–0.1
  CR1                                 20,035                  37.9                  15.2   13.5–18.8
  L2                                   9,430                  17.8                   7.2   6.6–8.9
  Tx1                                  2,963                   5.6                   2.3   1.7–2.5
  RTE-X                                2,179                   4.1                   1.7   0.0–2.6
  Rex-babar                               89                   0.2                   0.1   0.0–0.1
  L1                                   4,379                   8.3                   3.3   0.4–4.4
  Penelope                               673                   1.3                   0.5   0.0–1.0
  SINE                                   734                   1.4                   0.6   0.0–1.0
  Total non-LTR retrotransposon       40,539                  76.7                  30.8   27.4–34.2
Total TEs                             51,820                  98.0                  39.4   36.7–42.3
Total repeats                         52,865                 100.0                  40.2   36.7–43.0
Total sequence                       131,622
A total of 167 genomic fragments (131,622 bp) were sequenced. Min–Max: the
lowest and highest percentage found in the three genomic libraries.
Characterization of CR1 and L2 Elements by PCR
Genome Walking
Fourteen CR1 and L2 fragments were genome walked from
the largest mini-genomic (PvuII/EcoRV) library, 8 of the 30 CR1
sequences and 6 of the 14 L2 sequences. We chose to genome
walk not only the top-scoring elements but also some
lower-scoring elements, in case the lungfish genome had unusual
CR1 and L2 elements. After one to two genome walking steps,
most of these elements were found to be nested or incomplete. For three sequences, two CR1 sequences and one L2
sequence, genome walking resulted in sequence that covered
the open reading frame (ORF) 2 and the 3′-untranslated
region (UTR) (fig. 1). The zinc finger/leucine zipper and an
esterase domain (Kapitonov and Jurka 2003) in ORF 1 (ORF1)
were also retrieved for the CR1 elements, but we were unable
to find the 5′-end. The inverted repeat and 8-bp tandem
The random-sequenced fragments of CR1 elements were
aligned against the three genome-walked contigs, RIJ_
NfCR1_PE112GW, RIJ_NfCR1_PE114GW, and RIJ_NfCR1_
GW. The random-sequenced fragments of L2 elements
were aligned against the genome-walked contig RIJ_NfL2_
PE19GW, L2-Ambystoma [email protected]:43792-48294, and Branchiostoma floridae@Crack-17_BF sequences.
With the optimal model defined earlier, for each randomly
sequenced fragment, an NJ and an ML tree (100 bootstrap replicates) were inferred with the sequence and the corresponding
reference sequences using the "complete deletion" option in
MEGA5 (Tamura et al. 2011). Each phylogenetic analysis has
thus been performed using only the region in the reference
sequences that is homologous to the random sequence. The
trees were rooted at the midpoint. We counted how many
random sequences were inside, and outside, the clade of the
three reference sequences (supplementary fig. S3,
Supplementary Material online).
Table 1. Repetitive elements identified in the lungfish genome.
[Figure 1 graphic. (A) Element structure with 1 kb scale bar: 5′ UTR; ORF1 with zf/lz and esterase domains; ORF2 with endonuclease and reverse transcriptase domains; 3′ UTR. (B) Genome-walking contigs: i RIJ_NfCR1_PE114GW (clone 114 from PvuII/EcoRV library), ii RIJ_NfCR1_PE112GW (clone 112 from PvuII/EcoRV library), iii RIJ_NfCR1_GW (BglIII and EcoRI fragments), iv RIJ_NfL2_PE19GW (clone 19 from PvuII/EcoRV library); percent identity labels along the contigs range from 78% to 100%.]
FIG. 1. Representation of the structure of CR1 and L2 elements found in the Australian lungfish genome and diagrammatic summary of genome
walking. (A) Structure of CR1 and L2 elements. Zf/lz = zinc finger/leucine zipper domain. Domains are indicated by gray shading. Sequence obtained by
genome walking is outlined in black. We were unable to obtain the 5′-UTR for any elements, shown as a gray box with no outline. The region used for
phylogenetic analysis is shown by dashed lines. Arrows indicate the position of PCR primers used to obtain this region. (B) Contigs of elements genome
walked by us (i, ii, and iv) and Sirijovski et al. (2005) (iii). Double lines indicate sequence obtained by random sequencing (this article) or restriction
digest (Sirijovski et al. 2005). Single lines indicate sequence obtained by genome walking. Percentages above the line indicate mean percent identity at
the overlaps. Percentages below the line indicate mean percent identity between the contig created by the CAP contig assembly program in
BioEdit (Hall 1999) and the genome-walked fragments.
repeats (Haas et al. 2001) were identified within the 3′-UTR of
the CR1 elements by aligning with 3′-UTR regions from the
Anolis, painted turtle, and zebrafish (data kindly given by
Andrew Shedlock) (data not shown). One of the CR1 elements, RIJ_NfCR1_PE114GW, had an additional 0.6 kb stretch
of unidentifiable sequence between the zinc finger/leucine
zipper and the esterase domain in ORF1. For the L2 sequence,
a (TAAA)5 repeat was identified at the end of the 3′-UTR.
Phylogenetic Analysis of CR1 and L2 PCR Genome
Walked and PCR-Amplified Sequences
Phylogenetic relationships inferred using the ML and NJ
methods based on the endonuclease and reverse transcriptase domains (supplementary fig. S2,
Supplementary Material online) resulted in similar topologies
(fig. 2). The CR1 sequences, as defined by RepeatMasker (Smit
et al. 2010), formed a monophyletic group, with two main
well-supported clades, the first with all the lungfish sequences
and the second with most of the sequences from the birds,
the turtles, and the lizard. Within the lungfish clade, the genome-walked contigs RIJ_NfCR1_PE114GW and
RIJ_NfCR1_GW (Sirijovski et al. 2005) grouped with one set
of PCR-amplified elements ("F5" elements), whereas
RIJ_NfCR1_PE112GW grouped with the other set of PCR-amplified elements ("F1" elements).
The L2 sequences, on the other hand, did not form a
monophyletic group (fig. 2). However, sequences fell into
FIG. 2. (A) Neighbor-joining and (B) maximum likelihood phylogenies of Jockey elements. Both phylogenies are based on a concatenation of the
endonuclease domains I–IX and the reverse transcriptase domains 1–8 (720 amino acids) and inferred using the JTT substitution matrix and a gamma
distribution of the substitution rate across sites (JTT + Γ with α = 2.8). Robustness of the nodes was estimated by 1,000 bootstrap replications.
Bootstrap values less than 70 are not shown. Colored shading indicates clades shared by both inferred phylogenies. The dashed line indicates the
boundary between the CR1 and the L2 sequences. All nonlungfish sequences names are prefixed with the lineage name according to RepeatMasker
(Smit et al. 2010) and an abbreviated species name. Abbreviations for species are as follows: Ac, Anolis carolinensis; Aj, Anguilla japonica; Am,
Ambystoma mexicanum; Bf, Branchiostoma floridae; Bt, Bos taurus; Cf, Canis familiaris; Cm, Callorhinchus milii; Cp, Chrysemys picta; Dr, Danio rerio;
Ga, Gallus gallus; Hf, Heterodontus francisci; Hs, Homo sapiens; Lm, Latimeria menadoensis; Md, Mus musculus domesticus; Oa, Ornithorhynchus anatinus;
Ol, Oryzias latipes; Pf, Passeriformes; Pm, Petromyzon marinus; Ps, Platemys spixii; Ss, Sus scrofa; Tf, Takifugu rubripes; Xm, Xiphophorus maculatus; Xt,
Xenopus tropicalis.
the same six clades in both phylogenies, except for the
L2-Bf@Crack-16_BF sequence. The first and second clades
contained all sequences from the lungfish and the Mexican
axolotl, respectively, the third most of the "Crack" sequences
from the lancelet, and the fourth lancelet "Crack" and Danio
rerio "CR1" sequences. The last two clades included a mixture
of fish and shark sequences; the coelacanth sequence fell into
one of these last two clades, whereas sequences from the
platypus, eel, and shark fell into the other. The topology of the relationships between the L2 clades was not consistent between the two phylogenies, and most branches
were poorly supported.
Phylogenetic Analysis of Random-Sequenced
Fragments
In Silico Simulation of Random Sequencing of Human
Genome
The 150 sequences of 900 bp were classified as approximately
21% LINEs, chiefly L1, 9.6% short interspersed nuclear
elements (SINEs), chiefly Alu, and 3.7% DNA transposons
(supplementary table S3, Supplementary Material online).
Discussion
In most eukaryotes, the protein-coding non-TE sequences
represent a small fraction of the genome, which is the most
stable component of the genome, whereas the bulk of the
genome is composed of repetitive DNA, often TEs, and is
considered highly dynamic (Pritham 2009). In an examination
of the lungfish genome, Sirijovski et al. (2005) used restriction
digests of whole gDNA to isolate the single largest identifiable
component, which was fragments of a CR1 element, and
using qPCR determined that it constituted only 0.05% of
the genome. Sirijovski et al. (2005) suggest that most of the
genome is degenerate TEs, which implies that the enormous
TE component of the genome is no longer dynamic but instead has become fossilized.
We reapproached the question of genome composition in
the Australian lungfish. We faced a number of challenges. Not
only is the Australian lungfish genome very large but extant
lungfish are also evolutionarily distant from other extant
organisms, last sharing a common ancestor with the tetrapods and coelacanths approximately 400 mya (Blair and
Hedges 2005). There is no database of repetitive elements
from lungfish or a closely related organism. We constructed
three mini-genomic libraries, one a full-digest and the other
two partial digest libraries. Clones were randomly sequenced
and analyzed. We used CENSOR on the Repbase website
(Kohany et al. 2006) with the forced translate option because
this method is more likely to identify related sequences when
no repeat database for the organism in question is available.
Results from the three libraries quickly converged, using only a
small amount of sequence, on an estimate of 40% repetitive
sequences, of which about half were CR1 and closely related
L2 sequences (table 1 and supplementary table S2, Supplementary Material Online). We were unable to use
RepeatModeler (Smit and Hubley 2010) to further identify
repeats, because of the small number and length of the sequences. No further highly repetitive sequences were identified using a self-Blastn of masked sequences nor were any
cellular genes or "orphan" TE coding sequences identified by a
Blastx of masked sequences (Sayers et al. 2011). Given the size
of the genome and our sample size, we would not have expected to identify cellular genes. The discrepancy in the percentage of the genome estimated to be CR1 elements by us
and Sirijovski et al. (2005) is because a PCR-based method
using specific primers may underestimate CR1 copy number.
However, their restriction digest photograph does broadly
support our results: on the agarose gel is 6 µg of gDNA;
20% of 6 µg is 1.2 µg, which would be clearly visible, whereas
0.05% is 3 ng, which would not.
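The gel argument is a one-line mass calculation. The sketch below spells it out, assuming the loaded mass is in micrograms, which is what the 3 ng figure implies; the function name is hypothetical.

```python
def band_mass_ng(loaded_ug: float, genome_fraction: float) -> float:
    """Expected mass (ng) of a genome component in a lane, given the
    DNA loaded (µg) and the component's fraction of the genome."""
    return loaded_ug * 1000.0 * genome_fraction


loaded_ug = 6.0  # gDNA loaded on the gel, as stated in the text

# Roughly 20% of the genome (approximate CR1 + L2 share from this study)
print(round(band_mass_ng(loaded_ug, 0.20), 1))    # in ng, i.e. 1.2 µg

# 0.05% of the genome (qPCR estimate of Sirijovski et al. 2005)
print(round(band_mass_ng(loaded_ug, 0.0005), 1))  # in ng
```

The two estimates differ by a factor of 400 in expected band mass, which is why a visible band favors the larger estimate.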
We tested whether such little sequence could give a broad
estimate of the repetitive content of a genome by doing an in
silico simulation of our random sequencing in the well-characterized human genome. To analyze this sequence, we used
RepeatMasker (Smit et al. 2010) with the human library because the repetitive component of the human genome is well
annotated. The results (supplementary table S3, Supplementary Material Online) were similar to published results for the
whole genome. For example, in Lander et al. (2001), classes of
interspersed repeats are reported as 21% LINEs, 13% SINEs,
and 3% DNA transposons, compared with our estimate of
21% LINEs, 10% SINEs, and 4% DNA transposons based on
135 kb of random sequence. We could not simulate our identification methods because there is no database of repetitive
elements from lungfish, or from a closely related organism,
but the simulation does show that a broad estimate can be obtained using
a sampling approach with very little sequence.
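The logic of such a sampling test can be sketched on a toy genome. This is an illustration only and not the simulation reported here: the genome is synthetic, the 45% repeat content, the 2 Mb genome length, the 900 bp fragment length, and the function names are all assumptions, and no RepeatMasker annotation is involved. Only the total of 135 kb of sampled sequence is taken from the text.

```python
import random


def make_repeat_mask(genome_len, repeat_fraction, rng, mean_repeat_len=1500):
    """Build a toy genome as a bytearray: 1 = repeat base, 0 = unique base.

    Repeat and gap lengths are drawn from exponential distributions whose
    means are chosen so that repeats cover about `repeat_fraction` of it.
    """
    mask = bytearray(genome_len)
    mean_gap_len = mean_repeat_len * (1 - repeat_fraction) / repeat_fraction
    pos = 0
    while pos < genome_len:
        pos += int(rng.expovariate(1 / mean_gap_len)) + 1  # unique stretch
        if pos >= genome_len:
            break
        end = min(pos + int(rng.expovariate(1 / mean_repeat_len)) + 1, genome_len)
        mask[pos:end] = b"\x01" * (end - pos)  # repeat stretch
        pos = end
    return mask


def sample_repeat_fraction(mask, n_fragments, frag_len, rng):
    """Estimate repeat content from randomly placed fragments."""
    repeat_bases = 0
    for _ in range(n_fragments):
        start = rng.randrange(len(mask) - frag_len + 1)
        repeat_bases += sum(mask[start:start + frag_len])
    return repeat_bases / (n_fragments * frag_len)


rng = random.Random(2012)  # fixed seed so the run is repeatable
mask = make_repeat_mask(2_000_000, 0.45, rng)  # hypothetical 2 Mb genome
truth = sum(mask) / len(mask)

# 150 fragments of 900 bp give 135 kb of sampled sequence.
estimate = sample_repeat_fraction(mask, n_fragments=150, frag_len=900, rng=rng)

print(f"true repeat fraction: {truth:.3f}")
print(f"estimate from {150 * 900 / 1000:.0f} kb: {estimate:.3f}")
```

With 150 fragments the standard error of the estimate is at most a few percentage points, which is in keeping with a broad estimate being obtainable from very little sequence.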
Approximately 30% of the Australian lungfish random sequences were LINE (non-LTR) elements (table 1). These were
reclassified into lineages using RepeatMasker (Smit et al.
2010); almost all of them were identified as CR1 and the closely
related L2 elements. We further characterized these elements
by PCR genome walking to obtain representative elements.
LINEs are frequently 5′ truncated (Luan et al. 1993; Wicker
et al. 2005), and we were unable to get full-length copies for
either type of element. In the case of CR1 elements, we
obtained two sequences that lacked only the 5′-UTR, and
for the L2 elements, we obtained one sequence which
Downloaded from http://mbe.oxfordjournals.org/ at INIST-CNRS on November 6, 2012
The random sequences from the mini-genomic libraries were
short (990 bp) and did not all cover the same region, so
they could not be used in a single phylogenetic analysis.
Forty-three random sequences were analyzed. The CR1 elements were analyzed against the three CR1 genome-walked
contigs, whereas the L2 elements were analyzed against the L2
genome-walked contig and two sequences representing the
most closely related clades to the PCR-amplified L2 elements,
[email protected]:43792-48294, and L2-Bf@Crack-17_BF.
Seventy-five percent of the CR1 random sequences (21/28)
were found within the clade of genome-walked contigs
(fig. 3), suggesting that the elements we genome walked are
representative of the majority of CR1 elements found in the
Australian lungfish genome, whereas only 30% of the L2 elements were more closely related to the L2 genome-walked
contig than to the other two elements (fig. 3).
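Because the CR1 proportion rests on only 28 fragments, it is worth seeing how wide the sampling uncertainty around 21/28 might be. The sketch below computes a Wilson score interval. This interval is not part of the reported analysis, and it assumes that the 28 fragments behave as independent draws from the CR1 copies in the genome.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 21 of the 28 CR1 random sequences fell within the genome-walked clade.
low, high = wilson_interval(21, 28)
print(f"21/28 = {21 / 28:.0%}; approximate 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 57% to 87%, so the conclusion that the genome-walked contigs represent the majority of CR1 elements holds across the whole range.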
Metcalfe et al. . doi:10.1093/molbev/mss159
[Figure 3. Panels A and B: phylogenetic trees. Tip labels: RIJ_NfCR1_GW, RIJ_NfCR1_PE114GW, RIJ_NfCR1_PE112GW, RIJ_NfL2_PE19GW, L2-Branchiostoma@Crack-17_BF, and [email protected]:43792-48294.]
FIG. 3. Maximum likelihood and neighbor-joining phylogenies of elements from random sequencing of mini-genomic libraries against genome-walked
elements. Most of the phylogenies had identical topologies with both methods. RIJ_NfCR1_PE112GW, RIJ_NfCR1_PE114GW, and RIJ_NfL2_PE19GW
are from this article, and RIJ_NfCR1_GW is a contig generated from the fragments published by Sirijovski et al. (2005). Dashed lines indicate the position
of randomly sequenced fragments. The number in the circle is the group number for random sequences at that position. The number to the right of the
number in the circle is the number of randomly sequenced fragments in that group: those not in brackets are inferred from maximum likelihood
phylogenies and those in brackets from neighbor-joining phylogenies.
sequence, but we were unable to classify 60% of the lungfish
sequences. However, TE diversity is highly variable from one
species to another (Hua-Van et al. 2011), so it is difficult to
predict what the profile of the "missing" 60% of the sequences
may be. There are several possibilities, some that we can rule
out, but many that we cannot. There are two reasons why it is
not likely to be a single conserved highly repetitive component. First, from multiple restriction digests, Sirijovski et al.
(2005) identified just two bands, both of which contained
chiefly CR1 fragments. Second, a self-Blastn of the masked
random sequences did not identify any highly repetitive sequences. It is, therefore, more likely that we have underestimated the percentage of the genome composed of TEs and that at
least part of the "missing" 60% of the genome is composed of
diverse TE components. These could include a previously
unidentified type of TE. A Blastx of masked sequences was
used to search for "orphan" TE coding regions, such as the
conserved reverse transcriptase region (Xiong and Eickbush
1990). None were identified. However, if the sequence was
not from a highly conserved region, such as a gag domain or a
currently unknown domain, it may not have been retrieved.
Some repetitive elements, particularly those with no known
coding region, are difficult to identify by homology-based
methods if a database of closely related sequences is not
available. For example, CACTA elements in wheat can be
composed entirely of various repeats flanked by short terminal repeats and be over 5 kb (Wicker et al. 2003). Similarly,
SINEs and miniature inverted repeat transposable elements
can be difficult to identify. Some regions within TEs also
would not be detected, such as large long terminal repeats,
or spacer regions, leading to an underestimate of the space
that the TE occupies. Finally, TE sequences that are highly
divergent, in particular older, degenerate copies, would not
have been identified.
Using next-generation sequencing, Sun et al. (2011) examined the genomes of six salamanders with haploid genome
sizes ranging from 15,000 to 47,000 Mb, the larger genomes
being almost as large as the Australian lungfish genome. They
identified 25–47% of the sequence as TEs, the most abundant
elements being those from the LTR Gypsy superfamily, in contrast to both the lungfish and other examined animal
lacked the ORF1. A phylogenetic analysis of PCR-amplified
elements and genome-walked elements confirmed that they
are CR1 and L2 elements (fig. 2). Phylogenetic analyses
showed that although the CR1 genome-walked elements are
representative of the CR1 sequences in the lungfish genome,
the L2 sequences are more diverse (fig. 3). Our results suggest,
therefore, that the lungfish genome is 22% CR1 and L2 elements. The majority of identified CR1 elements formed a
monophyletic group and may be the result of an amplification
burst specific to the lungfish lineage. Most of the L2 elements,
on the other hand, do not form a monophyletic lineage and
are probably the result of a series of bursts (fig. 3).
LINEs (non-LTR elements) are the predominant TE in most
of the animal genomes examined (Wicker et al. 2007). The
percentage (approximately 30%) of lungfish sequence composed of LINE
elements, despite the much larger size of the genome, is comparable with that found in other vertebrates, for example,
humans (21%) (Lander et al. 2001), monodelphis (29%)
(Gentles et al. 2007), and the platypus (21%) (Warren et al.
2008). LINE clades, however, show varying success within vertebrate lineages. In the monotreme (platypus), L2 elements
predominate (Warren et al. 2008), whereas in marsupials
(monodelphis) and in eutherians, it is L1 elements that have
been more successful (Li et al. 2001; Gibbs et al. 2004; Gentles
et al. 2007). In nonavian reptiles and in the chicken, CR1 elements dominate; they make up 81% of repeats in the
alligator and 71% in the anole (Shedlock et al. 2007). On the
basis of previous data and large-scale sequencing of genomic
clones of a turtle, alligator, and lizard, Shedlock et al. (2007)
proposed that the repetitive component of the amniote ancestor genome was CR1 dominated. The lungfish and coelacanths are considered to be the extant species most closely
related to tetrapods (Zardoya and Meyer 1996). If the unidentified component of the lungfish genome comprises diverse
elements, as our data suggest, and CR1 elements are the single largest component, then CR1 elements may also have
predominated in the sarcopterygian ancestor genome.
An extension of the correlation between genome size and
the percentage of the genome composed of TEs (Kidwell
2002; Lynch and Conery 2003) would predict that this
genome would be almost entirely composed of repetitive
Evolution of the Size of the Australian Lungfish Genome . doi:10.1093/molbev/mss159
[Figure 4. Panels A and B: scatter plots of % sequence identified as TEs (y axis) against haploid genome size in Gb (x axis). Legend: plants, animals, salamanders, Australian lungfish. Numbered points: 1, 2, 3.]
FIG. 4. The relationship between the percentage of sequence identified as transposable elements and genome size. Data are from Hua-Van et al. (2011),
Sun et al. (2011), de Koning et al. (2011), and this study. (A) All genomes are shown. (B) Genomes < 3.5 Gb are shown (shaded in gray in A). 1, Homo
sapiens (Hua-Van et al. 2011); 2, Homo sapiens (de Koning et al. 2011); and 3, Zea mays (Hua-Van et al. 2011). Other taxa shown are Arabidopsis
thaliana, Oryza sativa, Vitis vinifera, Sorghum bicolor, Caenorhabditis elegans, Drosophila ananassae, Drosophila melanogaster, Fugu rubripes,
Branchiostoma floridae, Gallus gallus, Mus musculus, Desmognathus ochrophaeus, Eurycea tynerensis, Batrachoseps nigriventris, Aneides flavipunctatus,
Bolitoglossa occidentalis, Bolitoglossa rostrata, and Neoceratodus forsteri.
genomes, in which non-LTR retrotransposons are more prevalent (Sun et al. 2011). We have plotted genome size versus
percent TEs identified for plant and animal genomes, including our data for the Australian lungfish (fig. 4). There is a well-described correlation between genome size and percent TEs
identified in plant and animal genomes (Kidwell 2002), and
this correlation seems to hit a "ceiling" of 50% TEs in the animal
genomes examined greater than 3,500 Mb (fig. 4A). The
height of this ceiling is almost certainly an artifact resulting
from an underestimation of percent TEs in the salamander
and Australian lungfish genomes. However, the types of TEs or
TE regions that are most difficult to identify, such as SINEs or
noncoding spaces, are unlikely, based on estimates of TE composition in other well-characterized genomes, to comprise the
remaining approximately 50% of the genome.
A recent publication (de Koning et al. 2011) examining TE
content in the human genome may shed light on the "missing" 50% of the large metazoan genomes examined by Sun
et al. (2011) and us. Both the human and the maize genomes
are approximately 3,000 Mb, but the maize genome is reported as 85% TEs
and the human genome as 45% (Hua-Van et al. 2011). de
Koning et al. (2011), using a novel method, estimated that the
human genome repetitive fraction is at least 66–69%, chiefly
TEs (fig. 4B), which is much closer to the estimate for the maize
genome. de Koning et al. (2011) attribute the differences in
estimated TE content in the human genome to the ability of
their approach to better detect short sequences and sequences from older and more diverse TE families. This finding
supports the indication that the underestimation of TE content in the salamander and lungfish genomes is due to the
presence of undetected older and diverse TEs.
In conclusion, there are several lines of evidence which can
shed light on the evolution of this very large genome: 1)
Thomson’s (1972) inference that in the lineage leading to
the extant Australian lungfish, there was a massive increase
in genome size between 350 and 200 mya, after which the size
of the genome changed little, 2) our estimate of 40% of the
Australian lungfish genome being recognizable TEs, and 3)
identification of previously unidentified older diverse TE families by de Koning et al. (2011) in the human genome. We
make two propositions. First, a
portion of this very large genome, 40%, is the result of recent amplifications, largely of CR1 and L2 elements. Second, the high percentage of the genome that we were unable to identify is likely to
be chiefly older, degenerate TEs resulting from ancient amplification bursts. We, therefore, speculate that the very large
lungfish genomes may be the result of a massive amplification
of TEs followed by a long period with a very low rate of
sequence removal and some ongoing TE activity.
Future work on a very much larger scale using full sequence
of bacterial artificial chromosomes to build a database of
lungfish-specific repeats and to examine the structure of a
sample of the genome, combined with next-generation sequencing and improved TE identification algorithms and
pipelines should allow us to develop an improved picture
of this extraordinary genome.
Supplementary Material
Supplementary figures S1–S3 and tables S1–S3 are available at
Molecular Biology and Evolution online (http://www.mbe
.oxfordjournals.org/).
Acknowledgments
The authors thank David Ogereau, Jean-Luc Da Lage, and
Gaëlle Claisse for technical help. They also thank Aurelie
Hua-Van for helpful comments and two anonymous reviewers for their comments. They also thank the staff at the
Fauna Park, Macquarie University, Sydney, Australia for care
of the lungfish. A very special thanks to Marie-Louise Cariou
for help with funding. This work was supported by Centre
National de la Recherche Scientifique under the program
"Action Thématique Incitative sur Programme" awarded to
Didier Casane from 2006 to 2009.
References
Blair JE, Hedges SB. 2005. Molecular phylogeny and divergence times of deuterostome animals. Mol Biol Evol. 22:2275–2284.
Brinkmann H, Venkatesh B, Brenner S, Meyer A. 2004. Nuclear protein-coding genes support lungfish and not the coelacanth as the closest living relatives of land vertebrates. Proc Natl Acad Sci U S A. 101:4900–4905.
de Koning AP, Gu W, Castoe TA, Batzer MA, Pollock DD. 2011. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7(12): e1002384.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797.
Gentles AJ, Wakefield MJ, Kohany O, Gu W, Batzer MA, Pollock DD, Jurka J. 2007. Evolutionary dynamics of transposable elements in the short-tailed opossum Monodelphis domestica. Genome Res. 17:992–1004.
Gibbs RA, Weinstock GM, Metzker ML, et al. (230 co-authors). 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493–521.
Gregory TR. 2001. Coincidence, coevolution, or causation? DNA content, cell size, and the C-value enigma. Biol Rev Camb Philos Soc. 76:65–101.
Gregory TR. 2005. Genome size evolution in animals. In: Gregory RT, editor. The evolution of the genome. San Diego (CA): Elsevier. p. 3–87.
Gregory TR. 2010. Animal genome size database. Available from: http://www.genomesize.com.
Haas NB, Grabowski JM, North J, Moran JV, Kazazian HH, Burch JB. 2001. Subfamilies of CR1 non-LTR retrotransposons have different 5’UTR sequences but are otherwise conserved. Gene 265:175–183.
Hall TA. 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 41:95–98.
Hua-Van A, Le Rouzic A, Boutin TS, Filée J, Capy P. 2011. The struggle for life of the genome’s selfish architects. Biol Direct. 6:19.
Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 8:275–282.
Joss JM. 2006. Lungfish evolution and development. Gen Comp Endocrinol. 148:285–289.
Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O. 2005. Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 110:462–467.
Kapitonov VV, Jurka J. 2003. The esterase and PHD domains in CR1-Like Non-LTR retrotransposons. Mol Biol Evol. 20:38–46.
Kapitonov VV, Tempel S, Jurka J. 2009. Simple and fast classification of non-LTR retrotransposons based on phylogeny of their RT domain protein sequences. Gene 448:207–213.
Kidwell MG. 2002. Transposable elements and the evolution of genome size in eukaryotes. Genetica 115:49–63.
Kohany O, Gentles AJ, Hankus L, Jurka J. 2006. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and censor. BMC Bioinformatics. 7:474.
Lander ES, Linton LM, Birren B, et al. (2753 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Leitch I, Bennett M. 2004. Genome downsizing in polyploid plants. Biol J Linn Soc Lond. 82:651–663.
Li WH, Gu Z, Wang H, Nekrutenko A. 2001. Evolutionary analyses of the human genome. Nature 409:847–849.
Longhurst TJ, Joss JM. 1999. Homeobox genes in the Australian lungfish, Neoceratodus forsteri. J Exp Zool B Mol Dev Evol. 285:140–145.
Luan DD, Korman MH, Jakubczak JL, Eickbush TH. 1993. Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell 72:595–605.
Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404.
Morris SC, Harper E. 1988. Genome size in conodonts (chordata): inferred variations during 270 million years. Science 241:1230–1232.
Organ CL, Shedlock AM, Meade A, Pagel M, Edwards SV. 2007. Origin of avian genome size and structure in non-avian dinosaurs. Nature 446:180–184.
Ozkan H, Tuna M, Aramuganathan K. 2003. Nonadditive changes in genome size during allopolyploidization in the wheat (Aegilops-Triticum) group. J Hered. 94:260–264.
Petrov D, Sangster TA, Johnston JS, Hartl DL, Shaw KL. 2000. Evidence for DNA loss as a determinant of genome size. Science 287:1060–1062.
Pritham EJ. 2009. Transposable elements and factors influencing their success in eukaryotes. J Hered. 100:648–655.
Rock J, Eldridge M, Champion A, Johnston P, Joss J. 1996. Karyotype and nuclear DNA content of the Australian lungfish, Neoceratodus forsteri (Ceratodidae: Dipnoi). Cytogenet Cell Genet. 73:187–189.
Sambrook J, Fritsch EF, Maniatis T. 1989. Molecular cloning: a laboratory manual. In: Irwin N, Ford N, Nolan C, Ferguson M, editors. Molecular cloning: a laboratory manual, 2nd ed. New York: Cold Spring Harbor Laboratory Press.
Sayers EW, Barrett T, Benson DA, et al. (42 co-authors). 2011. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 39:D38–D51.
Shedlock AM, Botka CW, Zhao S, Shetty J, Zhang T, Liu JS, Deschavanne PJ, Edwards SV. 2007. Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci U S A. 104:2767–2772.
Sirijovski N, Woolnough C, Rock J, Joss JM. 2005. NfCR1, the first non-LTR retrotransposon characterized in the Australian lungfish genome, Neoceratodus forsteri, shows similarities to CR1-like elements. J Exp Zool B Mol Dev Evol. 304:40–49.
Smit A, Hubley R. 2010. RepeatModeler Open-1.0. Available from: http://www.repeatmasker.org.
Smit A, Hubley R, Green P. 2010. RepeatMasker Open-3.0. Available from: http://www.repeatmasker.org.
Sun C, Shepard DB, Chong RA, Lopez-Arriaza J, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL. 2011. LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol. 4:168–183.
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. 2011.
MEGA5: molecular evolutionary genetics analysis using maximum
likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 28:2731–2739.
Thomas CA. 1971. The genetic organization of chromosomes. Annu Rev
Genet. 5:237–256.
Thomson K. 1972. An attempt to reconstruct evolutionary changes in
the cellular DNA content of lungfish. J Exp Zool. 180:363–372.
Untergasser A, Nijveen H, Rao X, Bisseling T, Geurts R, Leunissen JA.
2007. Primer3Plus, an enhanced web interface to Primer3. Nucleic
Acids Res. 35:W71–W74.
Venner S, Feschotte C, Biémont C. 2009. Dynamics of transposable
elements: towards a community ecology of the genome. Trends
Genet. 25:317–323.
Warren WC, Hillier LW, Marshall Graves JA, et al. (102 co-authors). 2008.
Genome analysis of the platypus reveals unique signatures of evolution. Nature 453:175–183.
Wicker T, Guyot R, Yahiaoui N, Keller B. 2003. CACTA transposons in
Triticeae. A diverse family of high-copy repetitive elements. Plant
Physiol. 132:52–63.
Wicker T, Robertson JS, Schulze SR, et al. (11 co-authors). 2005. The
repetitive landscape of the chicken genome. Genome Res. 15:126–136.
Wicker T, Sabot F, Hua-Van A, et al. (13 co-authors). 2007. A unified
classification system for eukaryotic transposable elements. Nat Rev
Genet. 8:973–982.
Xiong Y, Eickbush TH. 1990. Origin and evolution of retroelements
based upon their reverse transcriptase sequences. EMBO J. 9:3353–3362.
Zardoya R, Meyer A. 1996. Evolutionary relationships of the coelacanth,
lungfishes, and tetrapods based on the 28S ribosomal RNA gene.
Proc Natl Acad Sci U S A. 93:5449–5454.