J Mol Evol (2003) 57:S50–S59
DOI: 10.1007/s00239-003-0007-2
Detection of New Transposable Element Families in Drosophila melanogaster and
Anopheles gambiae Genomes
Hadi Quesneville, Danielle Nouaud, Dominique Anxolabéhère
Laboratoire Dynamique du Génome et Evolution, Institut Jacques Monod, 2, Place Jussieu, 75251 Paris Cedex 05, France
Received: 26 July 2002 / Accepted: 13 September 2002
Abstract. The techniques that are usually used to
detect transposable elements (TEs) in nucleic acid
sequences rely on sequence similarity with previously
characterized elements. However, these methods are
likely to miss many elements in various organisms.
We tested two strategies for the detection of unknown elements. The first, which we call ‘‘TBLASTX
strategy,’’ searches for TE sequences by comparing
the six-frame translations of the nucleic acid sequences of known TEs with the genomic sequence
of interest. The second, ‘‘repeat-based strategy,’’
searches genomic sequences for long repeats and
clusters them in groups of similar sequences. TE
copies from a given family are expected to cluster
together. We tested the Drosophila melanogaster genomic sequence and the recently sequenced Anopheles
gambiae genome in which most TEs remain unknown. We showed that the ‘‘TBLASTX strategy’’ is
very efficient as it detected at least 332 new TE families in D. melanogaster and 400 in A. gambiae. This
was unexpected in Drosophila as TEs of this organism
have been extensively studied. The ‘‘repeat-based
strategy’’ appeared to be very inefficient because of
two problems: (i) TE copies are heavily deleted and
few copies share homologous regions, and (ii) segmental duplications are frequent and it is not easy to
distinguish them from TE copies.
Correspondence to: H. Quesneville
Key words: Transposable elements — Segmental
duplications — Annotations — Bioinformatics —
Genomics — Drosophila melanogaster — Anopheles
gambiae
Introduction
Transposable elements (TEs) are mobile DNA sequences that are repeated throughout genomes. All
the copies of a TE belong to what is called a TE
family and are classified according to the mechanism
by which they move from one genomic site to another. Class I TEs (or retrotransposons) transpose via
a ‘‘copy and paste’’ mechanism using an RNA intermediate. They are subdivided into two subclasses
according to whether they contain long terminal repeats
(LTRs) at their extremities: LTR retrotransposons
and non-LTR retrotransposons (i.e., LINEs and
SINEs). Class II TEs, known as DNA transposons or
DNA TEs, transpose using a DNA intermediate,
generally via a ‘‘cut and paste’’ mechanism or, like
Helitrons, via a rolling-circle mechanism (Kapitonov and
Jurka 2001). TEs have been found in nearly all of the
genomes in which they have been sought. They seem
to be ubiquitous and represent a quantitatively important component of genomes (44.4% of the human
genome; International Human Genome Sequencing
Consortium 2001). Several cellular functions have
been found to be closely related to TEs (Smit 1999).
There is no doubt that the genomic DNA we observe
today evolved with the close participation of TEs.
Many TEs can be considered to be parasites, but they
have probably all kept their ‘‘genome building’’
properties. They appear to be crucial actors in genome evolution.
The techniques usually used to detect TEs in nucleic acid sequences rely on nucleotide sequence similarity with previously characterized elements. Using
sequence alignment programs, these methods annotate the parts of sequences that are very similar to an
element from a reference set of ‘‘known elements’’
(e.g., RepeatMasker; http://repeatmasker.genome.washington.edu/). However, when studying sequences
from an organism in which TEs have not been
described in detail, these methods are likely to miss
many ‘‘unknown’’ elements.
A better approach consists of searching for sequence similarities with known TE proteins. This
approach can detect unknown TEs that are distantly
related to known TEs. The TBLASTN or BLASTX
programs (Altschul et al. 1990, 1997) can be efficiently used for this task. Some amino acid domains
are quite well conserved in some TE families such as
the reverse-transcriptase (RT) domain of LINEs and
LTR retrotransposons. Profile-based methods such
as those implemented in the HMMER (Eddy 1998) and
RPS-BLAST (Altschul et al. 1997) packages are more
sensitive than classical alignment programs. They use
multiple alignments of well-characterized amino acid
domains to detect amino acid signatures. They then
efficiently identify elements by searching for them in
genomic sequences (Berezikov et al. 2000). Obviously, all these approaches increase the range of TE
families that can be detected by similarity, but unfortunately they can still miss some TE families. One
of the reasons for this is that the protein sequences
and the conserved domains of many TEs are not
known, hence the search cannot be exhaustive. Another problem is that TE copies are often deleted and
do not retain any coding capabilities or any conserved protein domains.
We tested two strategies to try to overcome these
difficulties with the genomic sequences of Drosophila
melanogaster and Anopheles gambiae. D. melanogaster
is an organism in which TEs are well known and have
been extensively studied. Its sequence was mainly
chosen to validate our strategies. A. gambiae was only
sequenced very recently (http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map_search?chr=agambiae.inf).
Most of its transposable elements remain unknown.
The methods proposed in this paper were designed to
analyze this type of genome. The first strategy involves the use of TBLASTX (Altschul et al. 1990,
1997) to compare amino acid sequences derived from
nucleic acid sequences. Thus, the protein sequences
are not required, meaning that deleted TEs and those
with unknown protein sequences can be tested. We
expected that this method would extend the range of
TE families that can be detected on the basis of similarity, by increasing the number of known TE sequences tested. This strategy will be referred to as the
‘‘TBLASTX strategy.’’ The second strategy relies on
the fact that TEs are repeated and dispersed in
genomic sequences. Thus, repeated sequences are
sought in the nucleic acid sequences and similar sequences are clustered together in groups. TE copies
from a given family are expected to cluster in a group.
This approach will be referred to as the ‘‘repeat-based
strategy.’’
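The six-frame translation on which the TBLASTX strategy relies can be illustrated by the short Python sketch below. It is an illustration only and not part of BLAST or BLASTER: the function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are rendered as X.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated peptides."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
print(frames["+1"], frames["-1"])
```

In a TBLASTX comparison both the query and the subject are translated in this way, so that each query frame is compared with each subject frame at the amino acid level.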
Our results show that the TBLASTX strategy is
very efficient. At least 332 new TE families were detected in D. melanogaster and 400 in A. gambiae. This
was unexpected for Drosophila as TEs of this organism have been studied in detail. Conversely, the
repeat-based strategy appeared to be very inefficient
because (i) the TE copies are heavily deleted and only
a few copies share homologous regions, and (ii) segmental duplications are frequent and it is not easy to
distinguish them from TE copies.
Materials and Methods
Sequence Data
The D. melanogaster genomic sequence used was the chromosome
arms release 2, which was downloaded from the Berkeley Drosophila Genome Project (BDGP) web site at http://www.fruitfly.org/sequence/dlMfasta.shtml#2rel2/na_arms.dros.RELEASE2.
The A. gambiae genomic sequence used was the first version of the
Whole Genome Shotgun project assembly of A. gambiae. We retrieved 8987 scaffolds at GenBank (http://www.ncbi.nlm.nih.gov/),
accession numbers AAAB01000001 to AAAB01008987.
The sequences of the transposable elements were obtained from
the Repbase Update database release 7.2 (Kapitonov and Jurka
1998–2002; Jurka 2000), which contains all known repeated sequences including TEs (downloaded at http://www.girinst.org).
The sequences of D. melanogaster TEs were downloaded from
the BDGP web site http://www.fruitfly.org/sequence/sequence_db/na_te.dros.embl (release 4.94). These sequences will be referred to
as the BDGP TE compilation. They represent full length TEs that
are generally functional.
BLASTER
We wrote the BLASTER program in C++. This program can be
used to annotate TEs in genomic sequences. It can compare two
sets of sequences: a query databank against a subject databank.
For each sequence in the query databank, BLASTER launches one
of the BLAST programs (BLASTN, TBLASTN, BLASTX,
TBLASTX, BLASTP, MEGABLAST) (Altschul et al. 1990, 1997)
to search the subject databank. Each BLAST search is launched in
parallel on a computer cluster. The BLAST results (HSPs) are then
postprocessed. Long insertions (or deletions) in one of two homologous sequences result in two HSPs, instead of one with a long
gap. Sparse dynamic programming is used to connect such HSPs
(HSP alignments) (Gusfield 1997; Chao et al. 1995). BLASTER is
not limited by the length of sequences. It cuts long sequences before
launching BLAST and re-assembles the results afterwards. Hence,
it can work on whole genomes, in particular to compare a genome
with itself to detect repeats. The results of BLASTER can then be
treated by the MATCHER and GROUPER programs described
below (Fig. 1).
Fig. 1. Data flow through BLASTER,
MATCHER, and GROUPER.
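The connection of HSPs separated by a long insertion or deletion can be illustrated with a simplified chaining routine. The Python sketch below is not the BLASTER code: it uses a plain quadratic dynamic program rather than sparse dynamic programming, applies no gap penalty, and handles only HSPs on the same strand. The HSP tuple layout and the function name are assumptions made for the example.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    score: float


def best_chain(hsps):
    """Return (total score, HSPs) of the best colinear chain of HSPs.

    Two HSPs can be chained when the second one starts after the first one
    ends on both the query and the subject.
    """
    ordered = sorted(hsps, key=lambda h: (h.q_start, h.s_start))
    if not ordered:
        return 0.0, []
    best = [h.score for h in ordered]   # best chain score ending at each HSP
    prev = [None] * len(ordered)        # back-pointers for the traceback
    for i, cur in enumerate(ordered):
        for j in range(i):
            p = ordered[j]
            if p.q_end < cur.q_start and p.s_end < cur.s_start:
                if best[j] + cur.score > best[i]:
                    best[i] = best[j] + cur.score
                    prev[i] = j
    i = max(range(len(ordered)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return total, chain[::-1]


hsps = [HSP(1, 100, 1, 100, 90), HSP(150, 250, 400, 500, 80),
        HSP(120, 140, 50, 70, 10)]
print(best_chain(hsps))
```

In this toy input the first two HSPs are colinear and are joined across a 300-bp gap on the subject, whereas the third HSP conflicts with the first on the subject and is left out of the chain.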
GROUPER
GROUPER is a C++ program that we developed to treat the
BLASTER results. It uses HSP alignments to gather similar sequences into groups by single-link clustering. An alignment belongs to a group if one of the two aligned sequences already belongs
to this group over 95% of its length. Groups that share sequence
locations are regrouped into what we call a cluster. As a result of
these procedures, each group contains sequences that are homogeneous in length. A given region may belong to several groups, but
all of these groups belong to the same cluster.
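The grouping and clustering rules can be sketched in Python as follows. This is a simplified illustration, not the GROUPER implementation: the 95% criterion is interpreted here as a reciprocal coverage between an aligned range and an existing group member (an assumption that keeps groups homogeneous in length), groups are never merged once created, and all function names and coordinates are hypothetical.

```python
def coverage(a, b):
    """Fraction of range a covered by range b; ranges are (seq, start, end)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1]) + 1
    return max(0, shared) / (a[2] - a[1] + 1)


def same_member(a, b, min_cov):
    """True when a and b cover each other over at least min_cov of their length."""
    return coverage(a, b) >= min_cov and coverage(b, a) >= min_cov


def build_groups(alignments, min_cov=0.95):
    """Single-link grouping: an alignment joins the first group in which one
    of its two ranges matches an existing member."""
    groups = []
    for pair in alignments:
        home = next((g for g in groups
                     if any(same_member(r, m, min_cov) for r in pair for m in g)),
                    None)
        if home is None:
            home = []
            groups.append(home)
        for r in pair:
            if r not in home:
                home.append(r)
    return groups


def overlaps(a, b):
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def cluster_groups(groups):
    """Merge groups that share any sequence location; returns lists of group indices."""
    clusters = []
    for i, group in enumerate(groups):
        linked = [c for c in clusters
                  if any(overlaps(r, m) for j in c for m in groups[j] for r in group)]
        merged = [i]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


alignments = [
    (("2L", 100, 1100), ("3R", 5000, 6000)),
    (("2L", 110, 1090), ("X", 200, 1180)),
    (("2L", 100, 300), ("X", 9000, 9200)),
]
groups = build_groups(alignments)
print([len(g) for g in groups], cluster_groups(groups))
```

The short fragment 2L:100-300 lies inside a member of the first group but does not cover it over 95% of its length, so it founds a second group; because the two groups share a location on 2L, they fall into the same cluster.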
MATCHER
MATCHER is another C++ program that we developed to treat
BLASTER results and to map the matches (HSP alignments) of the
subject sequences on the queries. It filters the results as follows:
when two or more matches with different subject sequences overlap
on the query, the match corresponding to the best alignment score
is kept, and the others are discarded.
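This filtering rule can be illustrated by a greedy selection in Python. The sketch is not the MATCHER code; the tuple layout (query start, query end, subject name, score), the greedy order, and the subject names are assumptions chosen for the example.

```python
def filter_matches(matches):
    """Keep the best-scoring match wherever different subjects overlap on the query.

    matches: iterable of (q_start, q_end, subject, score) for a single query.
    Matches are visited by decreasing score; a match is discarded when it
    overlaps an already kept match that belongs to a different subject.
    """
    kept = []
    for m in sorted(matches, key=lambda m: m[3], reverse=True):
        clash = any(m[2] != k[2] and m[0] <= k[1] and k[0] <= m[1] for k in kept)
        if not clash:
            kept.append(m)
    return sorted(kept)


matches = [
    (100, 500, "TE_A", 800),
    (300, 700, "TE_B", 400),   # overlaps the better TE_A match: discarded
    (900, 1200, "TE_B", 300),
    (450, 480, "TE_A", 50),    # overlaps only a match of the same subject: kept
]
print(filter_matches(matches))
```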
Fig. 2. Copy length distributions expressed as the percentage of
the length of the full length sequence for known Drosophila TEs.
Results
Lengths of Deleted Known TE Copies
Nucleotide sequences that matched previously characterized TEs were sought. Sequences were detected
with BLASTER using BLASTN repeatedly and
MATCHER. Hits were retained only if the e-value
was less than 10⁻¹⁰. The BDGP TE compilation was
used for D. melanogaster and the entire Repbase Update
(RU) database for A. gambiae. It is noteworthy that
in the BDGP D. melanogaster genomic sequence, a
repeat sequence or a transposable element sequence
may correspond to a consensus sequence due to the
assembly algorithm (Myers et al. 2000). To our
knowledge the same procedure is used for the A.
gambiae genome, meaning that it also suffers from the
same limitation. Release 3 of the BDGP D. melanogaster
genomic sequence will correct this problem by
resequencing each repeat individually. Consequently
our description of the TE distributions in the two
sequences gives a biased image of the real TE distribution. But here we used the two genomic sequences
as they are to evaluate the two strategies and to
highlight some of the potential problems.
Figure 2 shows the distribution of copy lengths
expressed as a percentage of the length of the full
length sequence of known Drosophila TEs.
Some of the matches obtained for A. gambiae
suggest that foreign sequences are present in the genome (Table 1). In fact, this just means that the most
similar sequence has been described in another species. Very often this is due to the absence of such
sequences for Anopheles in the RU database (e.g.,
5S_DM).
Table 1. Repbase Update matches (Kapitonov and Jurka 1998–2002) with A. gambiae genomic sequence, obtained with BLASTER using
BLASTN; only hits with an e-value <10⁻¹⁰ are reported

Name(a)      Type(b)                  Species                  Length(c)  #(d)  Min.(e)  Q0.25(f)  Q0.50(g)  Q0.75(h)  Max(i)  Mean identity(j)
AGRP1        Repetitive sequence      Anopheles gambiae        871        829   0.05     0.15      0.34      0.74      4.03    92.41
T1_AG        Non-LTR retrotransposon  Anopheles gambiae        4634       359   0.01     0.06      0.14      0.31      2.68    94.52
IKIRARA1     DNA transposon           Anopheles gambiae        610        234   0.08     0.28      0.43      0.78      4.43    95.85
PEGASUS      DNA transposon           Anopheles gambiae        534        85    0.11     0.42      0.86      1         1.79    94.61
RT1          Non-LTR retrotransposon  Anopheles gambiae        8037       43    0.01     0.01      0.06      0.16      1.16    92.76
AGM1         LTR retrotransposon      Anopheles gambiae        5983       28    0.01     0.01      0.01      0.09      2.66    96.48
TRNA_VAL     Transfer RNA             Homo sapiens             76         26    0.96     0.96      0.96      0.96      0.96    89.20
5S_DM        5S RNA gene sequence     Drosophila melanogaster  135        24    0.73     0.86      0.86      0.86      0.86    85.43
TRNA_GLY     Transfer RNA             Homo sapiens             74         18    0.95     0.96      0.96      0.96      0.96    94.37
ZEBEDEE      LTR retrotransposon      Aedes aegypti            3256       13    0.02     0.05      0.05      0.13      0.36    83.57
R2B_DM       Non-LTR retrotransposon  Drosophila mercatorum    3528       13    0.02     0.02      0.02      0.02      0.02    96.16
TRNA_ASN     Transfer RNA             Homo sapiens             77         12    0.79     0.79      0.79      0.79      0.79    96.72
U2           Small nuclear RNA        Homo sapiens             896        9     0.08     0.08      0.08      0.08      0.08    91.07
TTO1_NT      LTR retrotransposon      Nicotiana tabacum        5300       7     0.01     0.01      0.01      0.01      0.01    89.39
HMSBEAGLE_I  LTR retrotransposon      Drosophila melanogaster  6529       6     0.01     0.01      0.01      0.01      0.01    86.52
U5B1         Small nuclear RNA        Homo sapiens             116        6     0.34     0.35      0.35      0.35      0.35    100.00
MUSID5       Non-LTR retrotransposon  Rodents                  83         4     0.87     0.9       0.92      0.92      0.92    86.96
U6           Small nuclear RNA        Homo sapiens             107        3     0.62     0.62      0.88      0.94      0.94    95.44
QUASIMODO_I  LTR retrotransposon      Drosophila melanogaster  6060       2     0.01     0.01      0.01      0.01      0.01    90.51
L23          Ribosomal protein        Homo sapiens             1310       1     0.12     0.12      0.12      0.12      0.12    81.37
U4B          Small nuclear RNA        Homo sapiens             145        1     0.6      0.6       0.6       0.6       0.6     87.36

(a) Entry name in Repbase Update.
(b) Type of repeat.
(c) Length of the reference sequence in Repbase Update.
(d) Number of copies found.
(e,f,g,h,i) Lengths of the copies, expressed as a proportion of the Repbase sequence length, for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively.
(j) Mean identity with the Repbase sequence.
For all TE families, only a very small number of
copies appeared to be complete and the vast majority
appeared to be deleted. When the copies were aligned with the complete element (the TE in RU or the BDGP TE compilation), the mean over families of the median alignment length,
expressed as a percentage of the complete element, was 12% for D. melanogaster and 24% for A. gambiae.
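The summary statistic used here, a median per family followed by a mean over families, can be written out as a short Python sketch. The input layout, the function names, and the toy numbers are assumptions made for illustration; they are not data from this study.

```python
import statistics


def length_fractions(copy_lengths, reference_length):
    """Copy lengths expressed as fractions of the full-length reference element."""
    return sorted(length / reference_length for length in copy_lengths)


def mean_of_medians(families):
    """families: dict mapping a family name to (reference_length, copy_lengths).

    The median relative copy length is computed for each family, and the
    medians are then averaged over families.
    """
    medians = [statistics.median(length_fractions(copies, ref))
               for ref, copies in families.values()]
    return statistics.mean(medians)


# Toy input: two hypothetical families with invented copy lengths (bp).
families = {"famA": (1000, [100, 200, 300]), "famB": (500, [50, 250])}
print(mean_of_medians(families))
```

Averaging medians gives each family the same weight regardless of its copy number, which differs from taking a single median over all copies pooled together.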
To illustrate this feature more precisely, we chose
two non-LTR retroelements, two LTR retroelements
and two DNA elements from D. melanogaster. All of
the genomic copies of these elements were aligned
with the reference copy present in the BDGP TE
compilation. The deletion percentage was calculated
for each nucleotide position (Fig. 3).
Deletions were found throughout the sequence of the retroelements.
In the DNA elements, by contrast, deletions were more frequent in the middle of the sequence. This was
expected, as the transposition mechanism of DNA elements
involves a gap repair process that is known to
generate internal deletions in these elements (Quesneville and Anxolabéhère 2001; Brunet et al. 2002).
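The per-position deletion percentage and its smoothing can be illustrated as follows. The Python sketch assumes that each genomic copy has already been aligned to the reference and reduced to the list of reference intervals it covers; this input format, the function names, and the handling of the window at the sequence ends are assumptions, and the global alignment step itself is not reproduced.

```python
def deletion_profile(ref_length, copies):
    """Percentage of copies lacking each reference position.

    copies: non-empty list of copies, each a list of (start, end) blocks
    aligned to the reference (1-based, inclusive coordinates).
    """
    covered = [0] * ref_length
    for blocks in copies:
        seen = set()
        for start, end in blocks:
            seen.update(range(start - 1, end))
        for pos in seen:
            covered[pos] += 1
    n = len(copies)
    return [100.0 * (n - c) / n for c in covered]


def moving_average(values, window=10):
    """Mobile mean over `window` positions, truncated at both sequence ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i - half + window]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Three hypothetical copies of a 10-bp element: one complete copy, one with
# an internal deletion, and one truncated after position 5.
copies = [[(1, 10)], [(1, 3), (8, 10)], [(1, 5)]]
profile = deletion_profile(10, copies)
print([round(p, 1) for p in profile])
```

With this toy input the profile peaks in the middle of the element, the pattern described above for DNA elements.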
The TBLASTX Strategy
Amino acid sequences that matched RU sequences
were sought by use of BLASTER with the TBLASTX
and MATCHER tools. Hits were retained only if the
e-value was less than 10⁻¹⁰. In D. melanogaster,
matches were removed if they overlapped with previously identified TEs.
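The two filters applied here, an e-value threshold and the removal of matches that overlap previously identified TEs, can be sketched in Python. The example assumes hits stored in the 12-column tab-delimited BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score) with the genomic sequence as the subject; this input format, the function names, and the sequence names are assumptions and do not describe the BLASTER output.

```python
def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output (assumed input format)."""
    fields = line.rstrip("\n").split("\t")
    # Minus-strand hits list the subject coordinates in decreasing order.
    start, end = sorted((int(fields[8]), int(fields[9])))
    return {"query": fields[0], "subject": fields[1],
            "start": start, "end": end, "evalue": float(fields[10])}


def new_te_hits(lines, known, max_evalue=1e-10):
    """Keep hits below the e-value threshold that do not overlap known TE copies.

    known: dict mapping a sequence name to a list of (start, end) annotations.
    """
    kept = []
    for line in lines:
        hit = parse_hit(line)
        if hit["evalue"] >= max_evalue:
            continue
        annotations = known.get(hit["subject"], [])
        if any(hit["start"] <= e and s <= hit["end"] for s, e in annotations):
            continue
        kept.append(hit)
    return kept


lines = [
    "TE_A\t2L\t54.8\t120\t50\t2\t10\t369\t1000\t1359\t1e-40\t150",
    "TE_B\t2L\t49.5\t90\t40\t1\t5\t274\t5270\t5001\t3e-25\t110",
    "TE_C\t3R\t51.2\t60\t25\t0\t1\t180\t700\t879\t2e-05\t45",
]
known = {"2L": [(900, 1100)]}
print([h["query"] for h in new_te_hits(lines, known)])
```

Only the second hit survives in this toy input: the first overlaps an annotated TE and the third fails the e-value threshold.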
In the D. melanogaster genome, 332 RU entries
were found. Some D. melanogaster-specific TEs were
still found in the genome even after hits that matched
these entries by BLASTN had been removed. This
suggests that several related Drosophila TE families
coexist in the genome: the well-known sequences and
more divergent sequences that are not detected at the
DNA level. Surprisingly, 222 new TEs were found in
this species. Table 2 lists the most common of these matches. This shows that even in one of the
most studied organisms, many families had been
missed and remained to be described.
Four hundred and five RU entries were found in
the A. gambiae genome. Table 3 gives some of the
most common matches. Seventy-one of them shared
similarities with known repeats from D. melanogaster. As the closest repeat families to these sequences
have been described in Drosophila, they are closely
related to Drosophila repeat families.
As previously mentioned, several sequences that
matched with one reference TE on the amino acid
level may in fact correspond to several different, but
related, families. Thus, we can only roughly estimate
the minimum number of TE families present in a
genome. We estimated that there are about 350 TE
families in D. melanogaster and 400 in A. gambiae.
Fig. 3. Distributions of the percentage of deletions along the
sequences of two non-LTR retroelements, two LTR retroelements,
and two DNA elements from D. melanogaster. The percentage was
calculated for each nucleotide position by aligning (global optimal
alignment) each copy found in the genome with the full-length
element found in the BDGP TE compilation. Global optimal
alignment was performed in a C++ program developed from the
algorithm described by Myers and Miller (1988). The solid line
represents a smoothing curve obtained by the mobile mean method
with window size of 10 bp. The title of each graph gives the name of
the element, the type, and the copy number. Mean identity and
mean gap length percentage are given.
Table 2. Repbase Update matches (Kapitonov and Jurka 1998–2002) with the D. melanogaster genomic sequence obtained with
BLASTER using TBLASTX; only the 30 most common hits corresponding to new D. melanogaster TEs, with an e-value <10⁻¹⁰, are reported

Name(a)      Type(b)                  Species                                      Length(c)  #(d)  Min.(e)  Q0.25(f)  Q0.50(g)  Q0.75(h)  Max(i)  Mean identity(j)
OSVALDO_I    LTR retrotransposon      Drosophila buzzatii                          6653       135   0.005    0.016     0.037     0.064     0.419   54.783
YOYOI        LTR retrotransposon      Ceratitis capitata                           7065       74    0.006    0.015     0.028     0.07      0.491   49.536
TOM_I        LTR retrotransposon      Drosophila ananassae                         6112       68    0.005    0.014     0.021     0.054     0.216   51.18
ULYSS        LTR retrotransposon      Drosophila virilis                           10653      59    0.004    0.01      0.014     0.031     0.132   50.027
TV1I         LTR retrotransposon      Drosophila virilis                           5963       57    0.007    0.018     0.037     0.07      0.19    50.871
P126         Repetitive sequence      Paramecium tetraurelia                       1840       49    0.029    0.093     0.219     0.365     0.666   36.822
NINJA_I      LTR retrotransposon      Drosophila simulans                          6011       37    0.008    0.02      0.047     0.084     0.414   55.998
AGM1         LTR retrotransposon      Anopheles gambiae                            5983       29    0.003    0.011     0.031     0.054     0.181   46.349
GYPSY_DS     LTR retrotransposon      Drosophila subobscura                        7522       25    0.006    0.011     0.016     0.034     0.133   45.795
BARI1        DNA transposon           Drosophila erecta                            1750       23    0.033    0.074     0.13      0.259     0.71    66.531
MAG          LTR retrotransposon      Bombyx mori                                  4564       18    0.018    0.023     0.042     0.068     0.135   40.659
MINOS        DNA transposon           Drosophila hydei                             1775       14    0.022    0.044     0.151     0.239     0.632   59.436
BMC1         Non-LTR retrotransposon  Bombyx mori                                  5091       14    0.005    0.012     0.023     0.032     0.08    46.748
CYCLO        Retro-pseudogene         Homo sapiens                                 753        13    0.084    0.369     0.382     0.637     0.817   60.498
TED          LTR retrotransposon      Autographa californica nucleopolyhedrovirus  6964       12    0.008    0.016     0.031     0.079     0.395   48.061
PRIMA4_I     LTR retrotransposon      Homo sapiens                                 8270       12    0.005    0.007     0.008     0.009     0.011   53.022
RT1          Non-LTR retrotransposon  Anopheles gambiae                            8037       10    0.005    0.007     0.012     0.016     0.019   43.887
TRAM_I       LTR retrotransposon      Drosophila miranda                           2708       10    0.017    0.053     0.11      0.127     0.161   46.179
RIRE3_I      LTR retrotransposon      Oryza sativa                                 5775       10    0.009    0.015     0.018     0.02      0.03    44.306
GYPSYDR1     LTR retrotransposon      Danio rerio                                  4463       9     0.01     0.012     0.025     0.04      0.095   44.821
LYDIA_I      LTR retrotransposon      Lymantria dispar                             6054       9     0.008    0.011     0.012     0.069     0.086   44.058
LDT1         Non-LTR retrotransposon  Lymantria dispar                             5677       9     0.01     0.017     0.02      0.025     0.047   38.044
GRANDE1_ZD   LTR retrotransposon      Zea diploperennis                            13769      8     0.005    0.008     0.009     0.017     0.02    38.828
DEL_LH       LTR retrotransposon      Lilium henryi                                9345       8     0.005    0.007     0.008     0.014     0.014   44.186
SUSHII       LTR retrotransposon      Fugu rubripes                                4475       7     0.013    0.015     0.021     0.029     0.038   44.334
RONIN1_I     LTR retrotransposon      Fugu rubripes                                4487       7     0.009    0.014     0.016     0.024     0.054   50.352
RIRE8B_I     LTR retrotransposon      Oryza sativa                                 5981       7     0.007    0.012     0.014     0.016     0.116   48.92
LINE1_BM     Non-LTR retrotransposon  Bombyx mori                                  5158       7     0.011    0.016     0.017     0.062     0.073   42.508
ALU          Non-LTR retrotransposon  Homo sapiens                                 290        6     0.093    0.3       0.41      0.81      0.945   68.357
AluSp        Non-LTR retrotransposon  Homo sapiens                                 284        6     0.898    0.901     0.933     0.94      0.975   75.1

(a) Entry name in Repbase Update.
(b) Type of repeat.
(c) Length of the reference sequence in Repbase Update.
(d) Number of copies found.
(e,f,g,h,i) Lengths of the copies, expressed as a proportion of the Repbase sequence length, for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively.
(j) Mean identity with the Repbase sequence.
The Repeat-Based Strategy
Repeats Statistics
We used the all-versus-all BLASTN comparisons
(with low complexity filter) of BLASTER to detect
repeated sequences (e-value filter = 10⁻³⁰⁰). These
sequences were then clustered using GROUPER.
However, Anopheles is an outbred organism and a
number of individuals from the PEST strain were sequenced. Thus, we expected to find polymorphic regions disrupting the assembly. These regions were
expected either to be present in several copies in the
assembly and/or not to have enough coverage to be
assembled at all. Consequently, the small contigs in
the release are likely to correspond to these polymorphic
sequences. To test for possible bias in repeats
due to these polymorphic regions, a repeat search
limited to the long scaffolds >100 kb was carried out
(representing 87% of the genomic sequence) (Table 4).
About one-third of the Anopheles repeated sequences were due to these small contigs and thus to
polymorphism. Even when only the long scaffolds
were used, our genome comparison reveals that
Anopheles contains a higher percentage of repeat sequences than Drosophila. A previous study suggested
that D. melanogaster carries an unusually small
number of repeats (Achaz et al. 2001), in comparison
to other phylogenetically very distant eukaryotes
such as Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, and Homo sapiens. Here
we confirm this particularity with a closely related
species, A. gambiae.
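The repeated fraction reported in Table 4 is the share of the sequence that is present in at least two copies. A minimal Python sketch of such a computation is given below, assuming that the repeated regions are available as 1-based inclusive intervals per sequence; the data layout and function names are hypothetical, and overlapping intervals are merged so that no base is counted twice.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeated_fraction(repeats_by_sequence, sequence_lengths):
    """Fraction of the total sequence length covered by repeated regions.

    repeats_by_sequence: dict name -> list of (start, end), 1-based inclusive.
    sequence_lengths: dict name -> sequence length in bp.
    """
    repeated = sum(end - start + 1
                   for intervals in repeats_by_sequence.values()
                   for start, end in merge_intervals(intervals))
    return repeated / sum(sequence_lengths.values())


# Toy input with invented coordinates on two hypothetical scaffolds.
repeats = {"s1": [(1, 100), (50, 150), (301, 400)], "s2": [(11, 60)]}
lengths = {"s1": 1000, "s2": 1000}
print(repeated_fraction(repeats, lengths))
```

Restricting both dictionaries to scaffolds longer than 100 kb before the call would give the corresponding figure for the long scaffolds only.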
Table 3. Repbase Update matches (Kapitonov and Jurka 1998–2002) with A. gambiae obtained with BLASTER using TBLASTX; only
the 30 most common hits with an e-value <10⁻¹⁰ are reported

Name(a)        Type(b)                  Species                  Length(c)  #(d)  Min.(e)  Q0.25(f)  Q0.50(g)  Q0.75(h)  Max(i)  Mean identity(j)
T1_AG          Non-LTR retrotransposon  Anopheles gambiae        4634       3679  0.01     0.05      0.09      0.19      1.72    48.19
AGRP1          Repetitive sequence      Anopheles gambiae        871        1038  0.02     0.18      0.35      0.59      2.08    76.57
NINJA_I        LTR retrotransposon      Drosophila simulans      6011       846   <0.01    0.05      0.09      0.3       1.05    42.06
MAG            Retrotransposon          Bombyx mori              4564       715   <0.01    0.06      0.15      0.29      1.19    39.53
AGM1           LTR retrotransposon      Anopheles gambiae        5983       569   <0.01    0.04      0.15      0.47      1.33    46.57
DMCR1A         Non-LTR retrotransposon  Drosophila melanogaster  4470       492   0.01     0.04      0.08      0.14      0.66    40.14
BOVB_VA        Non-LTR retrotransposon  Vipera ammodytes         4606       384   0.01     0.05      0.08      0.11      0.57    34.94
RT1            Non-LTR retrotransposon  Anopheles gambiae        8037       337   <0.01    0.02      0.08      0.16      1       52.02
IKIRARA1       DNA transposon           Anopheles gambiae        610        233   0.05     0.31      0.45      0.75      2.4     86.19
BOVB           Non-LTR retrotransposon  Bos taurus               3302       231   0.01     0.04      0.12      0.17      0.35    34.14
BLASTOPIA_LTR  LTR retrotransposon      Drosophila melanogaster  4481       231   0.01     0.04      0.07      0.17      0.8     38.62
TC1_DM         DNA transposon           Drosophila melanogaster  1666       226   0.02     0.16      0.43      0.59      0.8     41.89
AACOPIA1_I     LTR retrotransposon      Aedes aegypti            4110       215   0.01     0.08      0.17      0.36      1.28    46.01
IVK_DM         Non-LTR retrotransposon  Drosophila melanogaster  5402       205   <0.01    0.02      0.06      0.1       0.88    36.87
BEL_I          LTR retrotransposon      Drosophila melanogaster  5404       159   0.01     0.02      0.05      0.13      0.91    41.16
EXPANDER2      Non-LTR retrotransposon  Fugu rubripes            3369       157   0.01     0.06      0.14      0.14      0.49    33.81
I_DM           Non-LTR retrotransposon  Drosophila melanogaster  6231       149   0.01     0.02      0.04      0.08      0.26    37.99
LDT1           Non-LTR retrotransposon  Lymantria dispar         5677       144   <0.01    0.04      0.09      0.15      0.37    38.68
ROO_I          LTR retrotransposon      Drosophila melanogaster  8256       139   0.01     0.02      0.03      0.07      0.58    40.04
TABOR_I        LTR retrotransposon      Drosophila melanogaster  6336       127   0.01     0.02      0.05      0.14      0.57    45.75
INVADER2_I     LTR retrotransposon      Drosophila melanogaster  4592       126   0.01     0.03      0.06      0.18      0.87    41.24
COPIA_DM       LTR retrotransposon      Drosophila melanogaster  5143       122   0.01     0.05      0.12      0.3       0.76    41.73
MDG3_DM        LTR retrotransposon      Drosophila melanogaster  4986       117   0.01     0.03      0.12      0.35      0.78    43.96
YOYOI          LTR retrotransposon      Ceratitis capitata       7065       115   <0.01    0.02      0.04      0.09      0.57    39.90
STALKER2_I     LTR retrotransposon      Drosophila melanogaster  7402       114   <0.01    0.01      0.02      0.05      0.56    41.33
DIVER_I        LTR retrotransposon      Drosophila melanogaster  5643       112   <0.01    0.03      0.04      0.11      0.9     41.31
PEGASUS        DNA transposon           Anopheles gambiae        534        102   0.08     0.45      0.83      0.99      1.5     81.50
HMSBEAGLE_I    LTR retrotransposon      Drosophila melanogaster  6529       100   0.01     0.01      0.02      0.09      0.61    42.11
DMRT1C         Non-LTR retrotransposon  Drosophila melanogaster  5443       88    0.01     0.03      0.08      0.18      0.49    40.70
EXPANDER       Non-LTR retrotransposon  Fugu rubripes            3362       85    0.02     0.04      0.07      0.12      0.49    36.02

(a) Entry name in Repbase Update.
(b) Type of repeat.
(c) Length of the reference sequence in Repbase Update.
(d) Number of copies found.
(e,f,g,h,i) Lengths of the copies, expressed as a proportion of the Repbase sequence length, for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively.
(j) Mean identity with the Repbase sequence.
Repeats and TEs
To evaluate how well TEs were detected depending
on whether they are highly dispersed and on the
number of repeats, repeats that correspond to one of
the TEs detected by TBLASTX were identified according to their genomic locations.
A cross location analysis for D. melanogaster (and
A. gambiae) showed that 47% (48%) of the groups
contained several TEs. A scatter plot representing
each group found in D. melanogaster as a function of
the number of TE families detected and its size (the
number of repeated sequences) is presented in Fig. 4.
It suggests that many repeated sequences are not TE
copies and are duplicated by means other than
transposition. These repeated sequences are called
segmental duplications when their size exceeds 1 kb
(Bailey et al. 2001; International Human Genome
Sequencing Consortium 2001).
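The cross-location analysis can be illustrated by intersecting the genomic ranges of each group with the TE annotations. The Python sketch below uses a naive all-against-all overlap test; the data layout, the function names, and the family labels are assumptions made for the example, and the coordinates are invented.

```python
def families_in_group(group, te_annotations):
    """TE families whose annotated copies overlap at least one range of the group.

    group: list of (seq, start, end).
    te_annotations: list of (seq, start, end, family).
    """
    return {family
            for seq, start, end in group
            for t_seq, t_start, t_end, family in te_annotations
            if seq == t_seq and start <= t_end and t_start <= end}


def fraction_multi_family(groups, te_annotations):
    """Fraction of groups that contain copies of two or more TE families."""
    counts = [len(families_in_group(g, te_annotations)) for g in groups]
    return sum(c >= 2 for c in counts) / len(groups)


groups = [
    [("2L", 100, 900), ("3R", 50, 850)],
    [("X", 10, 400), ("X", 5000, 5390)],
]
te_annotations = [
    ("2L", 150, 300, "famA"),
    ("3R", 600, 700, "famB"),
    ("X", 20, 380, "famA"),
]
print(fraction_multi_family(groups, te_annotations))
```

Plotting, for each group, the number of families returned by families_in_group against the number of ranges in the group gives the kind of scatter plot shown in Fig. 4.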
Based on the sequence information for the repeated
sequences, it should be possible to reconstruct a consensus sequence from the deleted copies for each TE
family. By construction, the sequences within a cluster
share similar regions enabling them to be assembled.
Sixty-eight percent of the TEs detected in D. melanogaster (77% for A. gambiae) are spread over several
clusters. Thus, the consensus sequence of 68% of the
TEs should be spread over several non-overlapping
fragments in D. melanogaster. This result highlights the
fragmented nature of TE copies in the genome.
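The proportion of TE families spread over several clusters can be obtained by recording, for each family, the set of clusters that its copies overlap. The following Python sketch shows one way to do this; the input layout, the function names, and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def clusters_per_family(cluster_regions, te_annotations):
    """Map each TE family to the set of clusters overlapped by its copies.

    cluster_regions: dict cluster id -> list of (seq, start, end).
    te_annotations: list of (seq, start, end, family).
    """
    hits = defaultdict(set)
    for cluster_id, regions in cluster_regions.items():
        for seq, start, end in regions:
            for t_seq, t_start, t_end, family in te_annotations:
                if seq == t_seq and start <= t_end and t_start <= end:
                    hits[family].add(cluster_id)
    return dict(hits)


def fraction_split(hits):
    """Fraction of detected families that are spread over more than one cluster."""
    return sum(len(clusters) > 1 for clusters in hits.values()) / len(hits)


cluster_regions = {1: [("2L", 100, 400)], 2: [("2L", 5000, 5600)], 3: [("3R", 10, 90)]}
te_annotations = [
    ("2L", 150, 300, "famA"),
    ("2L", 5100, 5500, "famA"),
    ("3R", 20, 80, "famB"),
]
hits = clusters_per_family(cluster_regions, te_annotations)
print(hits, fraction_split(hits))
```

A family found in several clusters has copies that share no region long enough to link them, so each cluster can at best yield one partial fragment of its consensus.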
Discussion
Detection of New TEs
Our strategy for the detection of TEs based on
TBLASTX was very successful. This success can in part
be explained by the fact that even though it only seems
to compare nucleic acid sequences, it is their amino acid
sequences that are really compared, which means that
even weak similarities can be detected. This success can
also be explained by the TBLASTX strategy, which
makes it possible to use an extensive nucleic acid database (RU) for amino acid comparisons. Indeed, the
proteins corresponding to TEs from many TE families
are not well characterized. Moreover, some TE sequences are only partially known and only deleted elements are described. Consequently, no extensive
protein database that can be used in a BLASTX search
exists. Our study increases the number of reference TEs
(RU entries) tested, and thus increases the number of
related sequences effectively detected.

Table 4. Repeat statistics

                             D. melanogaster   A. gambiae, long scaffolds (>100 kb)   A. gambiae, all scaffolds
Genome size (Mb)             180               –                                      290
Genomic sequence size (Mb)   124               243                                    278
Repeated fraction(a)         3.7%              22%                                    32%
Number of groups(b)          2312              16594                                  36036
Number of clusters(b)        434               2238                                   5166

(a) Fraction of the genome that is present in at least two copies.
(b) Number of groups and clusters respectively obtained by BLASTER and GROUPER, when a genome was compared with itself by
BLASTN. All HSPs with an e-value >10⁻³⁰⁰ were removed, and a coverage constraint of 95% was applied for constructing groups.

Fig. 4. Co-occurrence of segmental duplications and TE families. Each group found in D. melanogaster is represented as a function of the
number of TE families detected in the group and the number of repeated sequences belonging to the group. Arrows indicate groups that
contain more than 100 sequences and more than two TE families.
There are two problems associated with the strategy that detects unknown TEs by relying on their repeated
and dispersed nature: TE copies are often largely deleted and segmental duplications appear to be frequent. The best identification that can be expected is a
consensus sequence based on a cluster of sequences.
Our results showed that 68% of the TEs in D. melanogaster are scattered over several clusters. Thus, in general, only
several partial consensus sequences can be expected.
This problem may be limited to compact
genomes with high rates of DNA loss such as Drosophila (and probably Anopheles). In large genomes
with low rates of DNA loss such as the human genome, TE copies are expected to be less deleted. TEs
would be scattered over fewer clusters, and the consensus sequences reconstructed from them would be less partial. In addition, it is not easy to distinguish between
segmental duplications and TE copies. Only a small
number of copies of segmental duplications should be
present compared to the number of repeats of a TE
family. However, this was not the case (Fig. 4). Indeed, Fig. 4 shows several groups that include several TE
families and contain more than 100 sequences. The
difficulty is increased further by the fact that segmental duplications often contain several TE copies.
In conclusion, this strategy appears to be inefficient
for the detection of unknown TEs in most cases.
Deletions
Our results show that TE copies can carry deletions
over most of their sequence. In D. melanogaster, 50%
of the copies were less than 12% of the length of the
full-length element (24% for A. gambiae). The remaining
regions shared a high identity with the full-length
element. Deletions were found throughout the sequence. TE sequence deletions affect both TE types
(class I and class II) identically, suggesting that they
are independent of the transposition mechanism. A
previous study showed such deletion profiles in the
Helena class I element (Petrov and Hartl 1998). The
authors proposed that the mechanism responsible for
these deletions is not restricted to TEs, but is more
general and is involved in the control of genome size.
The same mechanism is probably responsible for the
pattern observed here. This suggests that TEs disappear from a genome by successive deletions rather
than due to point mutations.
It has been proposed (Charlesworth et al. 1989)
that unequal recombination events occurring between
homologous copies inserted at different loci may
control TE dynamics. As the resulting deletions or
duplications are counter-selected, this selective pressure is expected to be proportional to copy number
and recombination rate. These unequal exchanges
may counter-balance the increase in copy number by
transposition. However, the deletion pattern observed here may reduce the effect of this phenomenon. Indeed, a striking consequence is that different
copies of the same TE family have few regions in
common. Consequently, the effective number of
copies that share enough homologous sequences to
promote unequal recombination is lower than the
copy number, which reduces the effect of unequal
recombination in the control of TE dynamics.
[...] Drosophila and Anopheles was due to segmental duplications (as 50% of the groups contain several TE
families). Hence, they represent 1.8% and 11% of the
whole sequence, respectively. A previous study estimated that the proportion of segmental duplication
in Drosophila melanogaster, Caenorhabditis elegans,
and Homo sapiens was about 1.2%, 4.25%, and 3.25%
of the genome, respectively, after removing TE sequences (International Human Genome Sequencing
Consortium 2001). The discrepancy between the two
results for Drosophila may be due to the presence of
TEs and the fact that we also counted short sequences
of <1 kb. However, despite these differences, the estimated percentages of segmental duplication in D.
melanogaster in the two studies (1.8% and 1.2%)
suggest that the bias is weak, and consequently that
our estimation for Anopheles is reliable. Comparisons
with other species highlight the high proportion of
segmental duplication in A. gambiae.
TE sequences were included in our segmental duplication analysis and could be seen to co-occur with
segmental duplication. About 50% of our groups
contained more than one TE in the repeated sequence, suggesting that most of the segmental duplications contain several TEs. It is striking to
observe such a good correlation between them. This
suggests that TEs could be involved in some of the
mechanisms at the origin of segmental duplications.
Several mechanisms could be proposed. They may (i)
act as dispersed homologous sequences, favoring
unequal recombination and thus promoting duplications; (ii) double stranded DNA breaks caused by
their endonuclease activities, activate host DNA gap
repair systems that promote recombination and gene
conversions; or (iii) they may act as reverse transcriptases, reverse transcribing diverse mRNAs and
integrating them at random sites.
Acknowledgments. We would like to thank the International
Anopheles Sequencing Consortium for providing the Anopheles
sequences, Paul Brey and Charles Roth for helpful discussions, and
two anonymous reviewers for their comments. This work was
supported by the ‘‘Centre National de Recherche Scientifique’’
(CNRS), the Universities P. and M. Curie and D. Diderot (Institut
Jacques Monod, UMR 7592, Dynamique du Génome et Evolution) and by the ‘‘programme Bio-Informatique’’ (CNRS).
References
Segmental Duplications
A repeated sequence containing several TE families
can be considered as a segmental duplication. Obviously sequences with one or zero TE family can also
be segmental duplications, but we are only certain
when there are several TE families. We observed a
surprisingly large number of segmental duplications.
At least 50% of the genome repeated fraction in both
Achaz G, Netter P, Coissac E (2001) Study of intrachromosomal
duplications among the eukaryote genomes. Mol Biol Evol
18:2280–2288
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990)
Basic local alignment search tool. J Mol Biol 215:403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller
W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs. Nucleic Acids
Res 25:3389–3402
Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE (2001)
Segmental duplications: organization and impact within the
current human genome project assembly. Genome Res
11:1005–1017
Berezikov E, Bucheton A, Busseau I (2000) A search for reverse
transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol 1:research0011.1–0011.15
Brunet F, Giraud T, Godin F, Capy P (2002) Do deletions of
Mos1-like elements occur randomly in the Drosophilidae family? J Mol Evol 54:227–234
Chao KM, Zhang J, Ostell J, Miller W (1995) A local alignment
tool for very long DNA sequences. Comput Appl Biosci
11:147–153
Charlesworth B, Langley CH (1989) The population genetics of
Drosophila transposable elements. Annu Rev Genet 23:251–287
Eddy SR (1998) Profile hidden Markov models. Bioinformatics
14:755–763
Gusfield D (1997) Algorithms on strings, trees, and sequences:
Computer science and computational biology. Cambridge
University Press, Cambridge, pp 325–329
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature
409:860–921
Jurka J (2000) Repbase update: A database and an electronic
journal of repetitive elements. Trends Genet 16:418–420
Kapitonov VV, Jurka J (1998–2002) Repbase update (www.girinst.org/Repbase_Update)
Kapitonov VV, Jurka J (2001) Rolling circle transposons in eukaryotes. Proc Natl Acad Sci USA 98:8714–8719
Myers EW, Miller W (1988) Optimal alignments in linear space.
Comput Appl Biosci 4:11–17
Myers EW, Sutton GG, Delcher AL, et al. (2000) A whole-genome
assembly of Drosophila. Science 287:2196–2204
Petrov DA, Hartl DL (1998) High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups. Mol
Biol Evol 15:293–302
Quesneville H, Anxolabéhère D (2001) Genetic algorithm based
model of evolutionary dynamics of class II transposable elements. J Theor Biol 213:21–30
Smit AF (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet
Dev 9:657–663