J Mol Evol (2003) 57:S50–S59
DOI: 10.1007/s00239-003-0007-2

Detection of New Transposable Element Families in Drosophila melanogaster and Anopheles gambiae Genomes

Hadi Quesneville, Danielle Nouaud, Dominique Anxolabéhère

Laboratoire Dynamique du Génome et Evolution, Institut Jacques Monod, 2, Place Jussieu, 75251 Paris Cedex 05, France

Received: 26 July 2002 / Accepted: 13 September 2002

Abstract. The techniques that are usually used to detect transposable elements (TEs) in nucleic acid sequences rely on sequence similarity with previously characterized elements. However, these methods are likely to miss many elements in various organisms. We tested two strategies for the detection of unknown elements. The first, which we call the "TBLASTX strategy," searches for TE sequences by comparing the six-frame translations of the nucleic acid sequences of known TEs with the genomic sequence of interest. The second, the "repeat-based strategy," searches genomic sequences for long repeats and clusters them in groups of similar sequences. TE copies from a given family are expected to cluster together. We tested the Drosophila melanogaster genomic sequence and the recently sequenced Anopheles gambiae genome, in which most TEs remain unknown. We showed that the "TBLASTX strategy" is very efficient, as it detected at least 332 new TE families in D. melanogaster and 400 in A. gambiae. This was unexpected in Drosophila, as the TEs of this organism have been extensively studied. The "repeat-based strategy" appeared to be very inefficient because of two problems: (i) TE copies are heavily deleted and few copies share homologous regions, and (ii) segmental duplications are frequent and it is not easy to distinguish them from TE copies.

Correspondence to: H. Quesneville; email: [email protected]

Key words: Transposable elements; Segmental duplications; Annotations; Bioinformatics; Genomics; Drosophila melanogaster; Anopheles gambiae

Introduction

Transposable elements (TEs) are mobile DNA sequences that are repeated throughout genomes. All the copies of a TE belong to what is called a TE family, and families are classified according to the mechanism by which they move from one genomic site to another. Class I TEs (or retrotransposons) transpose via a "copy and paste" mechanism using an RNA intermediate. They are subdivided into two subclasses according to whether they contain long terminal repeats (LTRs) at their extremities: LTR retrotransposons and non-LTR retrotransposons (i.e., LINEs and SINEs). Class II TEs, known as DNA transposons or DNA TEs, transpose using a DNA intermediate, generally via a "cut and paste" mechanism or, as in Helitrons, a rolling-circle mechanism (Kapitonov and Jurka 2001).

TEs have been found in nearly all of the genomes in which they have been sought. They seem to be ubiquitous and represent a quantitatively important component of genomes (44.4% of the human genome; International Human Genome Sequencing Consortium 2001). Several cellular functions have been found to be closely related to TEs (Smit 1999). There is no doubt that the genomic DNA we observe today evolved with the close participation of TEs. Many TEs can be considered to be parasites, but they have probably all kept their "genome building" properties. They appear to be crucial actors in genome evolution.

The techniques usually used to detect TEs in nucleic acid sequences rely on nucleic sequence similarity with previously characterized elements. Using sequence alignment programs, these methods annotate the parts of sequences that are very similar to an element from a reference set of "known elements" (e.g., RepeatMasker; http://repeatmasker.genome.washington.edu/).
However, when studying sequences from an organism in which TEs have not been described in detail, these methods are likely to miss many "unknown" elements.

A better approach consists of searching for sequence similarities with known TE proteins. This approach can detect unknown TEs that are distantly related to known TEs. The TBLASTN or BLASTX programs (Altschul et al. 1990, 1997) can be used efficiently for this task. Some amino acid domains are quite well conserved in some TE families, such as the reverse-transcriptase (RT) domain of LINEs and LTR retrotransposons. Profile-based methods such as those implemented in the HMMER (Eddy 1998) and RPS-BLAST (Altschul et al. 1997) packages are more sensitive than classical alignment programs. They use multiple alignments of well-characterized amino acid domains to detect amino acid signatures. They then efficiently identify elements by searching for these signatures in genomic sequences (Berezikov et al. 2000).

All these approaches increase the range of TE families that can be detected by similarity, but they can still miss some TE families. One reason is that the protein sequences and the conserved domains of many TEs are not known, hence the search cannot be exhaustive. Another problem is that TE copies are often deleted and do not retain any coding capabilities or any conserved protein domains.

We tested two strategies to try to overcome these difficulties with the genomic sequences of Drosophila melanogaster and Anopheles gambiae. D. melanogaster is an organism in which TEs are well known and have been extensively studied. Its sequence was mainly chosen to validate our strategies. A. gambiae was only sequenced very recently (http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map_search?chr=agambiae.inf). Most of its transposable elements remain unknown. The methods proposed in this paper were designed to analyze this type of genome.

The first strategy involves the use of TBLASTX (Altschul et al. 1990, 1997) to compare amino acid sequences derived from nucleic acid sequences. Thus, the protein sequences are not required, meaning that deleted TEs and those with unknown protein sequences can be tested. We expected that this method would extend the range of TE families that can be detected on the basis of similarity, by increasing the number of known TE sequences tested. This strategy will be referred to as the "TBLASTX strategy."

The second strategy relies on the fact that TEs are repeated and dispersed in genomic sequences. Thus, repeated sequences are sought in the nucleic acid sequences and similar sequences are clustered together in groups. TE copies from a given family are expected to cluster in a group. This approach will be referred to as the "repeat-based strategy."

Our results show that the TBLASTX strategy is very efficient. At least 332 new TE families were detected in D. melanogaster and 400 in A. gambiae. This was unexpected for Drosophila, as the TEs of this organism have been studied in detail. Conversely, the repeat-based strategy appeared to be very inefficient because (i) the TE copies are heavily deleted and only a few copies share homologous regions, and (ii) segmental duplications are frequent and it is not easy to distinguish them from TE copies.

Materials and Methods

Sequence Data

The D. melanogaster genomic sequence used was the chromosome arms release 2, which was downloaded from the Berkeley Drosophila Genome Project (BDGP) web site at http://www.fruitfly.org/sequence/dlMfasta.shtml#2rel2/na_arms.dros.RELEASE2. The A. gambiae genomic sequence used was the first version of the Whole Genome Shotgun project assembly of A. gambiae. We retrieved 8987 scaffolds from GenBank (http://www.ncbi.nlm.nih.gov/), accession numbers AAAB01000001 to AAAB01008987.
The sequences of the transposable elements were obtained from the Repbase Update database release 7.2 (Kapitonov and Jurka 1998–2002; Jurka 2000), which contains all known repeated sequences including TEs (downloaded at http://www.girinst.org). The sequences of D. melanogaster TEs were downloaded from the BDGP web site http://www.fruitfly.org/sequence/sequence_db/na_te.dros.embl (release 4.94). These sequences will be referred to as the BDGP TE compilation. They represent full-length TEs that are generally functional.

BLASTER

We wrote the BLASTER program in C++. This program can be used to annotate TEs in genomic sequences. It compares two sets of sequences: a query databank against a subject databank. For each sequence in the query databank, BLASTER launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX, BLASTP, MEGABLAST) (Altschul et al. 1990, 1997) to search the subject databank. The BLAST searches are launched in parallel on a computer cluster. The BLAST results (HSPs) are then postprocessed. A long insertion (or deletion) in one of two homologous sequences results in two HSPs instead of one HSP with a long gap. Sparse dynamic programming is used to connect such HSPs into HSP alignments (Gusfield 1997; Chao et al. 1995). BLASTER is not limited by the length of sequences. It cuts long sequences before launching BLAST and reassembles the results afterwards. Hence, it can work on whole genomes, in particular to compare a genome with itself to detect repeats. The results of BLASTER can then be treated by the MATCHER and GROUPER programs described below (Fig. 1).

Fig. 1. Data flow through BLASTER, MATCHER, and GROUPER.

GROUPER

GROUPER is a C++ program that we developed to treat the BLASTER results. It uses HSP alignments to gather similar sequences into groups by single-link clustering. An alignment belongs to a group if one of the two aligned sequences already belongs to this group over 95% of its length.
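The membership rule just stated can be illustrated with a minimal Python sketch. It is not the GROUPER implementation (which is written in C++ and not shown here). The names `Range`, `covered`, `same_copy`, and `group_alignments` are hypothetical, and the sketch assumes that the 95% coverage must hold in both directions between the new range and a range already in the group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    seq: str    # name of the chromosome arm or scaffold
    start: int  # 0-based, inclusive
    end: int    # exclusive; assumed to be greater than start


def covered(a, b):
    """Fraction of range a that is overlapped by range b."""
    if a.seq != b.seq:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    return max(0, overlap) / (a.end - a.start)


def same_copy(a, b, min_cov):
    # Assumption: coverage is required in both directions.
    return covered(a, b) >= min_cov and covered(b, a) >= min_cov


def group_alignments(alignments, min_cov=0.95):
    """Single-link grouping of (query Range, subject Range) alignments."""
    groups = []  # each group is a list of Range objects
    for query, subject in alignments:
        # Groups that already hold one of the two aligned ranges.
        hits = [g for g in groups
                if any(same_copy(r, m, min_cov)
                       for r in (query, subject) for m in g)]
        # Single link: the alignment joins (and fuses) every group it touches.
        merged = [query, subject]
        for g in hits:
            merged.extend(g)
        groups = [g for g in groups if not any(g is h for h in hits)]
        groups.append(merged)
    return groups


alignments = [
    (Range("2L", 0, 1000), Range("3R", 5000, 6000)),
    (Range("2L", 10, 1000), Range("X", 200, 1190)),  # same 2L copy, joins group 1
    (Range("2L", 0, 400), Range("4", 0, 400)),       # short fragment, new group
]
groups = group_alignments(alignments)
```

Under these assumptions the 400-bp fragment starts its own group, because the member it overlaps is much longer than the fragment itself.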
Groups that share sequence locations are regrouped into what we call a cluster. As a result of these procedures, each group contains sequences that are homogeneous in length. A given region may belong to several groups, but all of these groups belong to the same cluster.

MATCHER

MATCHER is another C++ program that we developed to treat BLASTER results and to map the matches (HSP alignments) of the subject sequences on the queries. It filters the results as follows: when two or more matches with different subject sequences overlap on the query, the match with the best alignment score is kept and the others are discarded.

Fig. 2. Copy length distributions expressed as the percentage of the length of the full-length sequence for known Drosophila TEs.

Results

Lengths of Deleted Known TE Copies

Nucleotide sequences that matched previously characterized TEs were sought. Sequences were detected by repeatedly running BLASTER with BLASTN, followed by MATCHER. Hits were retained only if the e-value was less than 10^-10. The BDGP TE compilation was used for D. melanogaster and the whole Repbase Update (RU) database for A. gambiae. It is noteworthy that in the BDGP D. melanogaster genomic sequence, a repeat sequence or a transposable element sequence may correspond to a consensus sequence because of the assembly algorithm (Myers et al. 2000). To our knowledge, the same procedure was used for the A. gambiae genome, meaning that it suffers from the same limitation. Release 3 of the BDGP D. melanogaster genomic sequence will correct this problem by resequencing each repeat individually. Consequently, our description of the TE distributions in the two sequences gives a biased image of the real TE distribution. Here, however, we used the two genomic sequences as they are to evaluate the two strategies and to highlight some of the potential problems.

Figure 2 shows the distribution of copy lengths expressed as a percentage of the length of the full-length sequence of known Drosophila TEs.
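The MATCHER filtering rule and the copy-length summary used in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C++ code. The names `Match`, `matcher_filter`, and `length_summary` are hypothetical. All matches are assumed to lie on a single genomic sequence, overlaps are resolved greedily from the best score downwards, and the copy length is taken as the span of the match on the genome.

```python
from statistics import quantiles
from typing import NamedTuple


class Match(NamedTuple):
    start: int        # start on the genomic sequence, 0-based inclusive
    end: int          # end on the genomic sequence, exclusive
    subject: str      # name of the reference TE
    subject_len: int  # length of the full-length reference TE
    score: float
    evalue: float


def matcher_filter(matches, max_evalue=1e-10):
    """Keep the best-scoring match where different subjects overlap."""
    kept = []
    candidates = [m for m in matches if m.evalue < max_evalue]
    for m in sorted(candidates, key=lambda m: m.score, reverse=True):
        clash = any(k.subject != m.subject
                    and m.start < k.end and k.start < m.end
                    for k in kept)
        if not clash:
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def length_summary(matches):
    """Per reference TE: (copies, min, Q0.25, Q0.50, Q0.75, max) of the
    copy length expressed as a fraction of the reference length."""
    fractions = {}
    for m in matches:
        fractions.setdefault(m.subject, []).append(
            (m.end - m.start) / m.subject_len)
    summary = {}
    for subject, values in fractions.items():
        values.sort()
        if len(values) == 1:
            q25 = q50 = q75 = values[0]
        else:
            q25, q50, q75 = quantiles(values, n=4, method="inclusive")
        summary[subject] = (len(values), values[0], q25, q50, q75, values[-1])
    return summary


matches = [
    Match(0, 500, "A", 1000, 900.0, 1e-50),
    Match(400, 700, "B", 600, 300.0, 1e-20),   # overlaps the better "A" match
    Match(800, 900, "A", 1000, 150.0, 1e-12),
    Match(1000, 1100, "C", 200, 50.0, 1e-3),   # fails the e-value threshold
]
kept = matcher_filter(matches)
summary = length_summary(kept)
```

The six values returned for each reference element correspond to the copy number, minimum, quartile, and maximum columns reported in the tables of matches.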
Some of the matches obtained for A. gambiae suggest that foreign sequences are present in the genome (Table 1). In fact, this just means that the most similar sequence has been described in another species. Very often this is due to the absence of such sequences for Anopheles in the RU database (e.g., 5S_DM).

Table 1. Repbase Update matches (Kapitonov and Jurka 1998–2002) with the A. gambiae genomic sequence, obtained with BLASTER using BLASTN; only hits with an e-value <10^-10 are reported

Name (a) | Type (b) | Species | Length (c) | # (d) | Min. (e) | Q0.25 (f) | Q0.50 (g) | Q0.75 (h) | Max (i) | Mean identity (j)
AGRP1 | Repetitive sequence | Anopheles gambiae | 871 | 829 | 0.05 | 0.15 | 0.34 | 0.74 | 4.03 | 92.41
T1_AG | Non-LTR retrotransposon | Anopheles gambiae | 4634 | 359 | 0.01 | 0.06 | 0.14 | 0.31 | 2.68 | 94.52
IKIRARA1 | DNA transposon | Anopheles gambiae | 610 | 234 | 0.08 | 0.28 | 0.43 | 0.78 | 4.43 | 95.85
PEGASUS | DNA transposon | Anopheles gambiae | 534 | 85 | 0.11 | 0.42 | 0.86 | 1 | 1.79 | 94.61
RT1 | Non-LTR retrotransposon | Anopheles gambiae | 8037 | 43 | 0.01 | 0.01 | 0.06 | 0.16 | 1.16 | 92.76
AGM1 | LTR retrotransposon | Anopheles gambiae | 5983 | 28 | 0.01 | 0.01 | 0.01 | 0.09 | 2.66 | 96.48
TRNA_VAL | Transfer RNA | Homo sapiens | 76 | 26 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 89.20
5S_DM | 5S RNA gene sequence | Drosophila melanogaster | 135 | 24 | 0.73 | 0.86 | 0.86 | 0.86 | 0.86 | 85.43
TRNA_GLY | Transfer RNA | Homo sapiens | 74 | 18 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 94.37
ZEBEDEE | LTR retrotransposon | Aedes aegypti | 3256 | 13 | 0.02 | 0.05 | 0.05 | 0.13 | 0.36 | 83.57
R2B_DM | Non-LTR retrotransposon | Drosophila mercatorum | 3528 | 13 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 96.16
TRNA_ASN | Transfer RNA | Homo sapiens | 77 | 12 | 0.79 | 0.79 | 0.79 | 0.79 | 0.79 | 96.72
U2 | Small nuclear RNA | Homo sapiens | 896 | 9 | 0.08 | 0.08 | 0.08 | 0.08 | 0.08 | 91.07
TTO1_NT | LTR retrotransposon | Nicotiana tabacum | 5300 | 7 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 89.39
HMSBEAGLE_I | LTR retrotransposon | Drosophila melanogaster | 6529 | 6 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 86.52
U5B1 | Small nuclear RNA | Homo sapiens | 116 | 6 | 0.34 | 0.35 | 0.35 | 0.35 | 0.35 | 100.00
MUSID5 | Non-LTR retrotransposon | Rodents | 83 | 4 | 0.87 | 0.9 | 0.92 | 0.92 | 0.92 | 86.96
U6 | Small nuclear RNA | Homo sapiens | 107 | 3 | 0.62 | 0.62 | 0.88 | 0.94 | 0.94 | 95.44
QUASIMODO_I | LTR retrotransposon | Drosophila melanogaster | 6060 | 2 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 90.51
L23 | Ribosomal protein | Homo sapiens | 1310 | 1 | 0.12 | 0.12 | 0.12 | 0.12 | 0.12 | 81.37
U4B | Small nuclear RNA | Homo sapiens | 145 | 1 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 87.36

(a) Entry name in Repbase Update. (b) Type of repeat. (c) Length of the reference sequence in Repbase Update. (d) Number of copies found. (e, f, g, h, i) Lengths of the copies expressed as a fraction of the Repbase sequence length for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively. (j) Mean identity (%) with the Repbase sequence.

For all TE families, only a very small number of copies appeared to be complete and the vast majority appeared to be deleted. When copies were aligned with the complete element (the TE in RU or in the BDGP TE compilation), the median alignment length, averaged over families, was 12% of the complete element for D. melanogaster and 24% for A. gambiae. To illustrate this feature more precisely, we chose two non-LTR retroelements, two LTR retroelements, and two DNA elements from D. melanogaster. All of the genomic copies of these elements were aligned with the reference copy present in the BDGP TE compilation. The deletion percentage was calculated for each nucleotide position (Fig. 3). Deletions were found throughout the sequence, except for DNA elements, in which deletions were more frequent in the middle of the element. This was expected for the DNA elements, as their transposition mechanism involves a gap repair process that is known to generate internal deletions in these elements (Quesneville and Anxolabéhère 2001; Brunet et al. 2002).

The TBLASTX Strategy

Amino acid sequences that matched RU sequences were sought by use of BLASTER with TBLASTX, followed by MATCHER. Hits were retained only if the e-value was less than 10^-10. In D. melanogaster, matches were removed if they overlapped with previously identified TEs.

In the D. melanogaster genome, 332 RU entries were found. Some D.
melanogaster-specific TEs were still found in the genome even after hits that matched these entries by BLASTN had been removed. This suggests that several related Drosophila TE families coexist in the genome: the well-known sequences and more divergent sequences that are not detected at the DNA level. Surprisingly, 222 new TEs were found in this species. Table 2 lists the most common of these matches. This shows that even in one of the most studied organisms, many families had been missed and remained to be described.

Four hundred and five RU entries were found in the A. gambiae genome. Table 3 lists some of the most common matches. Seventy-one of them shared similarities with known repeats from D. melanogaster. As the closest repeat families to these sequences have been described in Drosophila, they are closely related to Drosophila repeat families.

As previously mentioned, several sequences that match one reference TE at the amino acid level may in fact correspond to several different, but related, families. Thus, we can only roughly estimate the minimum number of TE families present in a genome. We estimated that there are about 350 TE families in D. melanogaster and 400 in A. gambiae.

Fig. 3. Distributions of the percentage of deletions along the sequences of two non-LTR retroelements, two LTR retroelements, and two DNA elements from D. melanogaster. The percentage was calculated for each nucleotide position by aligning (global optimal alignment) each copy found in the genome with the full-length element found in the BDGP TE compilation. Global optimal alignment was performed with a C++ program developed from the algorithm described by Myers and Miller (1988). The solid line represents a smoothed curve obtained by the moving-average method with a window size of 10 bp. The title of each graph gives the name of the element, the type, and the copy number. Mean identity and mean gap length percentage are given.

Table 2. Repbase Update matches (Kapitonov and Jurka 1998–2002) with the D. melanogaster genomic sequence, obtained with BLASTER using TBLASTX; only the 30 most common hits corresponding to new D. melanogaster TEs, with an e-value <10^-10, are reported

Name (a) | Type (b) | Species | Length (c) | # (d) | Min. (e) | Q0.25 (f) | Q0.50 (g) | Q0.75 (h) | Max (i) | Mean identity (j)
OSVALDO_I | LTR retrotransposon | Drosophila buzzatii | 6653 | 135 | 0.005 | 0.016 | 0.037 | 0.064 | 0.419 | 54.783
YOYOI | LTR retrotransposon | Ceratitis capitata | 7065 | 74 | 0.006 | 0.015 | 0.028 | 0.07 | 0.491 | 49.536
TOM_I | LTR retrotransposon | Drosophila ananassae | 6112 | 68 | 0.005 | 0.014 | 0.021 | 0.054 | 0.216 | 51.18
ULYSS | LTR retrotransposon | Drosophila virilis | 10653 | 59 | 0.004 | 0.01 | 0.014 | 0.031 | 0.132 | 50.027
TV1I | LTR retrotransposon | Drosophila virilis | 5963 | 57 | 0.007 | 0.018 | 0.037 | 0.07 | 0.19 | 50.871
P126 | Repetitive sequence | Paramecium tetraurelia | 1840 | 49 | 0.029 | 0.093 | 0.219 | 0.365 | 0.666 | 36.822
NINJA_I | LTR retrotransposon | Drosophila simulans | 6011 | 37 | 0.008 | 0.02 | 0.047 | 0.084 | 0.414 | 55.998
AGM1 | LTR retrotransposon | Anopheles gambiae | 5983 | 29 | 0.003 | 0.011 | 0.031 | 0.054 | 0.181 | 46.349
GYPSY_DS | LTR retrotransposon | Drosophila subobscura | 7522 | 25 | 0.006 | 0.011 | 0.016 | 0.034 | 0.133 | 45.795
BARI1 | DNA transposon | Drosophila erecta | 1750 | 23 | 0.033 | 0.074 | 0.13 | 0.259 | 0.71 | 66.531
MAG | LTR retrotransposon | Bombyx mori | 4564 | 18 | 0.018 | 0.023 | 0.042 | 0.068 | 0.135 | 40.659
MINOS | DNA transposon | Drosophila hydei | 1775 | 14 | 0.022 | 0.044 | 0.151 | 0.239 | 0.632 | 59.436
BMC1 | Non-LTR retrotransposon | Bombyx mori | 5091 | 14 | 0.005 | 0.012 | 0.023 | 0.032 | 0.08 | 46.748
CYCLO | Retro-pseudogene | Homo sapiens | 753 | 13 | 0.084 | 0.369 | 0.382 | 0.637 | 0.817 | 60.498
TED | LTR retrotransposon | Autographa californica nucleopolyhedrovirus | 6964 | 12 | 0.008 | 0.016 | 0.031 | 0.079 | 0.395 | 48.061
PRIMA4_I | LTR retrotransposon | Homo sapiens | 8270 | 12 | 0.005 | 0.007 | 0.008 | 0.009 | 0.011 | 53.022
RT1 | Non-LTR retrotransposon | Anopheles gambiae | 8037 | 10 | 0.005 | 0.007 | 0.012 | 0.016 | 0.019 | 43.887
TRAM_I | LTR retrotransposon | Drosophila miranda | 2708 | 10 | 0.017 | 0.053 | 0.11 | 0.127 | 0.161 | 46.179
RIRE3_I | LTR retrotransposon | Oryza sativa | 5775 | 10 | 0.009 | 0.015 | 0.018 | 0.02 | 0.03 | 44.306
GYPSYDR1 | LTR retrotransposon | Danio rerio | 4463 | 9 | 0.01 | 0.012 | 0.025 | 0.04 | 0.095 | 44.821
LYDIA_I | LTR retrotransposon | Lymantria dispar | 6054 | 9 | 0.008 | 0.011 | 0.012 | 0.069 | 0.086 | 44.058
LDT1 | Non-LTR retrotransposon | Lymantria dispar | 5677 | 9 | 0.01 | 0.017 | 0.02 | 0.025 | 0.047 | 38.044
GRANDE1_ZD | LTR retrotransposon | Zea diploperennis | 13769 | 8 | 0.005 | 0.008 | 0.009 | 0.017 | 0.02 | 38.828
DEL_LH | LTR retrotransposon | Lilium henryi | 9345 | 8 | 0.005 | 0.007 | 0.008 | 0.014 | 0.014 | 44.186
SUSHII | LTR retrotransposon | Fugu rubripes | 4475 | 7 | 0.013 | 0.015 | 0.021 | 0.029 | 0.038 | 44.334
RONIN1_I | LTR retrotransposon | Fugu rubripes | 4487 | 7 | 0.009 | 0.014 | 0.016 | 0.024 | 0.054 | 50.352
RIRE8B_I | LTR retrotransposon | Oryza sativa | 5981 | 7 | 0.007 | 0.012 | 0.014 | 0.016 | 0.116 | 48.92
LINE1_BM | Non-LTR retrotransposon | Bombyx mori | 5158 | 7 | 0.011 | 0.016 | 0.017 | 0.062 | 0.073 | 42.508
ALU | Non-LTR retrotransposon | Homo sapiens | 290 | 6 | 0.093 | 0.3 | 0.41 | 0.81 | 0.945 | 68.357
AluSp | Non-LTR retrotransposon | Homo sapiens | 284 | 6 | 0.898 | 0.901 | 0.933 | 0.94 | 0.975 | 75.1

(a) Entry name in Repbase Update. (b) Type of repeat. (c) Length of the reference sequence in Repbase Update. (d) Number of copies found. (e, f, g, h, i) Lengths of the copies expressed as a fraction of the Repbase sequence length for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively. (j) Mean identity (%) with the Repbase sequence.

The Repeat-Based Strategy

Repeats Statistics

We used the all-versus-all BLASTN comparisons (with low complexity filter) of BLASTER to detect repeated sequences (e-value filter = 10^-300). These sequences were then clustered using GROUPER. However, Anopheles is an outbred organism and a number of individuals from the PEST strain were sequenced. Thus, we expected to find polymorphic regions disrupting the assembly. These regions were expected to be present in several copies in the assembly and/or not to have enough coverage to be assembled at all. Consequently, the small contigs in the release are likely to correspond to these polymorphic sequences.
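The repeat search described at the start of this subsection feeds the "repeated fraction" statistic of Table 4 (the fraction of the genome present in at least two copies). The Python sketch below shows one way such a fraction could be computed from self-comparison HSPs. It is an illustration only. The function name `repeated_fraction`, the tuple layout of an HSP, and the half-open coordinates are assumptions, and BLASTER's handling of long sequences and HSP chaining is not reproduced.

```python
def repeated_fraction(hsps, seq_lengths, max_evalue=1e-300):
    """Fraction of the sequence covered by non-trivial self-comparison HSPs.

    hsps: iterable of (q_seq, q_start, q_end, s_seq, s_start, s_end, evalue)
          with 0-based, half-open coordinates (assumed layout).
    seq_lengths: dict mapping sequence name to its length in bp.
    """
    covered = {}
    for q_seq, q_start, q_end, s_seq, s_start, s_end, evalue in hsps:
        if evalue > max_evalue:
            continue  # HSPs with an e-value above the threshold are removed
        if (q_seq, q_start, q_end) == (s_seq, s_start, s_end):
            continue  # a region matching itself is not a repeat
        covered.setdefault(q_seq, []).append((q_start, q_end))
        covered.setdefault(s_seq, []).append((s_start, s_end))

    repeated_bp = 0
    for intervals in covered.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:            # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                           # gap: close the current block
                repeated_bp += cur_end - cur_start
                cur_start, cur_end = start, end
        repeated_bp += cur_end - cur_start
    return repeated_bp / sum(seq_lengths.values())


seq_lengths = {"s1": 1000, "s2": 500}
hsps = [
    ("s1", 0, 100, "s1", 0, 100, 0.0),        # trivial self-match, ignored
    ("s1", 0, 100, "s2", 200, 300, 0.0),
    ("s1", 50, 150, "s1", 600, 700, 1e-310),
    ("s1", 800, 900, "s2", 0, 100, 1e-50),    # e-value too high, ignored
]
fraction = repeated_fraction(hsps, seq_lengths)
```

Merging the intervals before summing avoids counting a region twice when it takes part in several HSPs.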
To test for possible bias in repeats due to these polymorphic regions, a repeat search limited to the long scaffolds (>100 kb, representing 87% of the genomic sequence) was carried out (Table 4). About one-third of the Anopheles repeated sequences were due to the small contigs and thus to polymorphism.

Even when only the long scaffolds were used, our genome comparison reveals that Anopheles contains a higher percentage of repeat sequences than Drosophila. A previous study suggested that D. melanogaster carries an unusually small number of repeats (Achaz et al. 2001) in comparison to other, phylogenetically very distant eukaryotes such as Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, and Homo sapiens. Here we confirm this particularity with a closely related species, A. gambiae.

Table 3. Repbase Update matches (Kapitonov and Jurka 1998–2002) with A. gambiae, obtained with BLASTER using TBLASTX; only the 30 most common hits with an e-value <10^-10 are reported

Name (a) | Type (b) | Species | Length (c) | # (d) | Min. (e) | Q0.25 (f) | Q0.50 (g) | Q0.75 (h) | Max (i) | Mean identity (j)
T1_AG | Non-LTR retrotransposon | Anopheles gambiae | 4634 | 3679 | 0.01 | 0.05 | 0.09 | 0.19 | 1.72 | 48.19
AGRP1 | Repetitive sequence | Anopheles gambiae | 871 | 1038 | 0.02 | 0.18 | 0.35 | 0.59 | 2.08 | 76.57
NINJA_I | LTR retrotransposon | Drosophila simulans | 6011 | 846 | <0.01 | 0.05 | 0.09 | 0.3 | 1.05 | 42.06
MAG | Retrotransposon | Bombyx mori | 4564 | 715 | <0.01 | 0.06 | 0.15 | 0.29 | 1.19 | 39.53
AGM1 | LTR retrotransposon | Anopheles gambiae | 5983 | 569 | <0.01 | 0.04 | 0.15 | 0.47 | 1.33 | 46.57
DMCR1A | Non-LTR retrotransposon | Drosophila melanogaster | 4470 | 492 | 0.01 | 0.04 | 0.08 | 0.14 | 0.66 | 40.14
BOVB_VA | Non-LTR retrotransposon | Vipera ammodytes | 4606 | 384 | 0.01 | 0.05 | 0.08 | 0.11 | 0.57 | 34.94
RT1 | Non-LTR retrotransposon | Anopheles gambiae | 8037 | 337 | <0.01 | 0.02 | 0.08 | 0.16 | 1 | 52.02
IKIRARA1 | DNA transposon | Anopheles gambiae | 610 | 233 | 0.05 | 0.31 | 0.45 | 0.75 | 2.4 | 86.19
BOVB | Non-LTR retrotransposon | Bos taurus | 3302 | 231 | 0.01 | 0.04 | 0.12 | 0.17 | 0.35 | 34.14
BLASTOPIA_LTR | LTR retrotransposon | Drosophila melanogaster | 4481 | 231 | 0.01 | 0.04 | 0.07 | 0.17 | 0.8 | 38.62
TC1_DM | DNA transposon | Drosophila melanogaster | 1666 | 226 | 0.02 | 0.16 | 0.43 | 0.59 | 0.8 | 41.89
AACOPIA1_I | LTR retrotransposon | Aedes aegypti | 4110 | 215 | 0.01 | 0.08 | 0.17 | 0.36 | 1.28 | 46.01
IVK_DM | Non-LTR retrotransposon | Drosophila melanogaster | 5402 | 205 | <0.01 | 0.02 | 0.06 | 0.1 | 0.88 | 36.87
BEL_I | LTR retrotransposon | Drosophila melanogaster | 5404 | 159 | 0.01 | 0.02 | 0.05 | 0.13 | 0.91 | 41.16
EXPANDER2 | Non-LTR retrotransposon | Fugu rubripes | 3369 | 157 | 0.01 | 0.06 | 0.14 | 0.14 | 0.49 | 33.81
I_DM | Non-LTR retrotransposon | Drosophila melanogaster | 6231 | 149 | 0.01 | 0.02 | 0.04 | 0.08 | 0.26 | 37.99
LDT1 | Non-LTR retrotransposon | Lymantria dispar | 5677 | 144 | <0.01 | 0.04 | 0.09 | 0.15 | 0.37 | 38.68
ROO_I | LTR retrotransposon | Drosophila melanogaster | 8256 | 139 | 0.01 | 0.02 | 0.03 | 0.07 | 0.58 | 40.04
TABOR_I | LTR retrotransposon | Drosophila melanogaster | 6336 | 127 | 0.01 | 0.02 | 0.05 | 0.14 | 0.57 | 45.75
INVADER2_I | LTR retrotransposon | Drosophila melanogaster | 4592 | 126 | 0.01 | 0.03 | 0.06 | 0.18 | 0.87 | 41.24
COPIA_DM | LTR retrotransposon | Drosophila melanogaster | 5143 | 122 | 0.01 | 0.05 | 0.12 | 0.3 | 0.76 | 41.73
MDG3_DM | LTR retrotransposon | Drosophila melanogaster | 4986 | 117 | 0.01 | 0.03 | 0.12 | 0.35 | 0.78 | 43.96
YOYOI | LTR retrotransposon | Ceratitis capitata | 7065 | 115 | <0.01 | 0.02 | 0.04 | 0.09 | 0.57 | 39.90
STALKER2_I | LTR retrotransposon | Drosophila melanogaster | 7402 | 114 | <0.01 | 0.01 | 0.02 | 0.05 | 0.56 | 41.33
DIVER_I | LTR retrotransposon | Drosophila melanogaster | 5643 | 112 | <0.01 | 0.03 | 0.04 | 0.11 | 0.9 | 41.31
PEGASUS | DNA transposon | Anopheles gambiae | 534 | 102 | 0.08 | 0.45 | 0.83 | 0.99 | 1.5 | 81.50
HMSBEAGLE_I | LTR retrotransposon | Drosophila melanogaster | 6529 | 100 | 0.01 | 0.01 | 0.02 | 0.09 | 0.61 | 42.11
DMRT1C | Non-LTR retrotransposon | Drosophila melanogaster | 5443 | 88 | 0.01 | 0.03 | 0.08 | 0.18 | 0.49 | 40.70
EXPANDER | Non-LTR retrotransposon | Fugu rubripes | 3362 | 85 | 0.02 | 0.04 | 0.07 | 0.12 | 0.49 | 36.02

(a) Entry name in Repbase Update. (b) Type of repeat. (c) Length of the reference sequence in Repbase Update. (d) Number of copies found. (e, f, g, h, i) Lengths of the copies expressed as a fraction of the Repbase sequence length for the minimum, the 25th, 50th, and 75th percentiles, and the maximum, respectively. (j) Mean identity (%) with the Repbase sequence.

Repeats and TEs

To evaluate how well TEs were detected depending on how dispersed they are and on the number of repeats, the repeats that correspond to one of the TEs detected by TBLASTX were identified according to their genomic locations. A cross-location analysis for D. melanogaster (and A. gambiae) showed that 47% (48%) of the groups contained several TEs. A scatter plot representing each group found in D. melanogaster as a function of the number of TE families detected and of its size (the number of repeated sequences) is presented in Fig. 4. It suggests that many repeated sequences are not TE copies and are duplicated by means other than transposition. These repeated sequences are called segmental duplications when their size exceeds 1 kb (Bailey et al. 2001; International Human Genome Sequencing Consortium 2001).

Based on the sequence information for the repeated sequences, it should be possible to reconstruct a consensus sequence from the deleted copies for each TE family. By construction, the sequences within a cluster share similar regions, enabling them to be assembled. Sixty-eight percent of the TEs detected in D. melanogaster (77% for A. gambiae) are spread over several clusters. Thus, the consensus sequence of 68% of the TEs would be spread over several nonoverlapping fragments in D. melanogaster. This result highlights the fragmented nature of TE copies in the genome.

Table 4. Repeat statistics

Statistic | D. melanogaster | A. gambiae, long scaffolds (>100 kb) | A. gambiae, all scaffolds
Genome size (Mb) | 180 | – | 290
Genomic sequence size (Mb) | 124 | 243 | 278
Repeated fraction (a) | 3.7% | 22% | 32%
Number of groups (b) | 2312 | 16594 | 36036
Number of clusters (b) | 434 | 2238 | 5166

(a) Fraction of the genome that is present in at least two copies. (b) Numbers of groups and clusters obtained by BLASTER and GROUPER when a genome was compared with itself by BLASTN. All HSPs with an e-value >10^-300 were removed, and a coverage constraint of 95% was applied for constructing groups.

Fig. 4. Co-occurrence of segmental duplications and TE families. Each group found in D. melanogaster is represented as a function of the number of TE families detected in the group and the number of repeated sequences belonging to the group. Arrows indicate groups that contain more than 100 sequences and more than two TE families.

Discussion

Detection of New TEs

Our strategy for the detection of TEs based on TBLASTX was very successful. This success can in part be explained by the fact that even though it only seems to compare nucleic acid sequences, it is their amino acid sequences that are really compared, which means that even weak similarities can be detected. This success can also be explained by the fact that the TBLASTX strategy makes it possible to use an extensive nucleic acid database (RU) for amino acid comparisons. Indeed, the proteins corresponding to TEs from many TE families are not well characterized. Moreover, some TE sequences are only partially known and only deleted elements are described. Consequently, no extensive protein database that can be used in a BLASTX search exists. Our approach increases the number of reference TEs (RU entries) tested, and thus increases the number of related sequences effectively detected.

There are two problems associated with the strategy that detects unknown TEs by relying on their repeated and dispersed nature: TE copies are often largely deleted and segmental duplications appear to be frequent. The best identification that can be expected is a consensus sequence based on a cluster of sequences. Our results showed that 68% of the TEs in D. melanogaster are scattered in several clusters. Thus, only several partial consensus sequences could be expected in general.
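The two percentages discussed above (the share of groups that contain several TE families and the share of TE families scattered over several clusters) come from crossing genomic locations. A minimal Python sketch of such a cross-location count is given below. The function name `cross_location` and the tuple layouts are hypothetical, a TE copy is assigned to a repeat as soon as the two intervals overlap, and the second ratio is computed only over families that overlap at least one repeat. These are assumptions, since the exact criteria are not specified here.

```python
from collections import defaultdict


def cross_location(repeats, te_copies):
    """Cross repeats with TE copies by genomic location.

    repeats:   list of (seq, start, end, group_id, cluster_id)
    te_copies: list of (seq, start, end, family)
    Coordinates are 0-based and half-open (assumed layout).
    Returns (fraction of groups holding several TE families,
             fraction of TE families spread over several clusters).
    """
    families_in_group = defaultdict(set)
    clusters_of_family = defaultdict(set)
    for r_seq, r_start, r_end, group, cluster in repeats:
        families_in_group[group]  # register the group even without any TE
        for t_seq, t_start, t_end, family in te_copies:
            if r_seq == t_seq and r_start < t_end and t_start < r_end:
                families_in_group[group].add(family)
                clusters_of_family[family].add(cluster)

    multi_te_groups = sum(len(f) > 1 for f in families_in_group.values())
    split_families = sum(len(c) > 1 for c in clusters_of_family.values())
    return (multi_te_groups / len(families_in_group),
            split_families / len(clusters_of_family))


repeats = [
    ("2L", 0, 100, "g1", "c1"),
    ("3R", 0, 100, "g1", "c1"),
    ("2L", 500, 600, "g2", "c2"),
    ("X", 0, 50, "g3", "c3"),
]
te_copies = [
    ("2L", 10, 90, "roo"),
    ("2L", 520, 580, "roo"),
    ("3R", 20, 80, "copia"),
]
group_fraction, family_fraction = cross_location(repeats, te_copies)
```

The nested loop is quadratic and is only meant to make the counting explicit. On whole-genome data an interval index would be needed, and both inputs are assumed to be non-empty.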
This problem may be limited to compact genomes with high rates of DNA loss such as Drosophila (and probably Anopheles). In large genomes with low rates of DNA loss such as the human genome, TE copies are expected to be less deleted. TEs would be scattered in less clusters: consensus sequences reconstructed from them would be less partial. But, it is not easy to distinguish between segmental duplications and TE copies. Only a small S58 number of copies of segmental duplications should be present compared to the number of repeats of a TE family. However, this was not the case (Fig. 4). Indeed, Fig. 4 shows several groups including several TE families, containing more than 100 sequences. The difficulty is increased further by the fact that segmental duplications often contain several TE copies. In conclusion, this strategy appears to be inefficient for the detection of unknown TEs in most of the cases. Deletions Our results show that TE copies can carry deletions over most of their sequence. In D. melanogaster, 50% of the copies were less than 12% the length of the full length element (24% for A. gambiae). The remaining regions shared a high identity with the full-length element. Deletions were found throughout the sequence. TE sequence deletions affect both TE types (class I and class II) identically, suggesting that they are independent of the transposition mechanism. A previous study showed such deletion profiles in the Helena class I element (Petrov and Hartl 1998). The authors proposed that the mechanism responsible for these deletions is not restricted to TEs, but is more general and is involved in the control of genome size. The same mechanism is probably responsible for the pattern observed here. This suggests that TEs disappear from a genome by successive deletions rather than due to point mutations. It has been proposed (Charlesworth et al. 
1989) that unequal recombination events occurring between homologous copies inserted at different loci may control TE dynamics. As the resulting deletions or duplications are counter-selected, this selective pressure is expected to be proportional to copy number and recombination rate. These unequal exchanges may counterbalance the increase in copy number by transposition. However, the deletion pattern observed here may reduce the effect of this phenomenon. Indeed, a striking consequence is that different copies of the same TE family have few regions in common. Consequently, the effective number of copies that share enough homologous sequence to promote unequal recombination is lower than the copy number, which reduces the effect of unequal recombination in the control of TE dynamics.

Segmental Duplications

A repeated sequence containing several TE families can be considered a segmental duplication. Obviously, sequences with one or zero TE families can also be segmental duplications, but we are only certain when there are several TE families. We observed a surprisingly large number of segmental duplications. At least 50% of the repeated fraction of the genome in both Drosophila and Anopheles was due to segmental duplications (as 50% of the groups contain several TE families). Hence, they represent 1.8% and 11% of the whole sequence, respectively. A previous study estimated that the proportion of segmental duplication in Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens was about 1.2%, 4.25%, and 3.25% of the genome, respectively, after removing TE sequences (International Human Genome Sequencing Consortium 2001). The discrepancy between the two results for Drosophila may be due to the presence of TEs and the fact that we also counted short sequences of <1 kb. Nevertheless, the two estimates for D. melanogaster (1.8% and 1.2%) are close, which suggests that the bias is weak and, consequently, that our estimate for Anopheles is reliable. Comparisons with other species highlight the high proportion of segmental duplication in A. gambiae.

TE sequences were included in our segmental duplication analysis and could be seen to co-occur with segmental duplications. About 50% of our groups contained more than one TE in the repeated sequence, suggesting that most of the segmental duplications contain several TEs. It is striking to observe such a good correlation between them. This suggests that TEs could be involved in some of the mechanisms at the origin of segmental duplications. Several mechanisms could be proposed. TEs may (i) act as dispersed homologous sequences, favoring unequal recombination and thus promoting duplications; (ii) cause double-stranded DNA breaks through their endonuclease activities, which activate host DNA gap repair systems that promote recombination and gene conversion; or (iii) supply reverse transcriptases that reverse transcribe diverse mRNAs and integrate them at random sites.

Acknowledgments. We would like to thank the International Anopheles Sequencing Consortium for providing the Anopheles sequences, Paul Brey and Charles Roth for helpful discussions, and two anonymous reviewers for their comments. This work was supported by the “Centre National de Recherche Scientifique” (CNRS), the Universities P. and M. Curie and D. Diderot (Institut Jacques Monod, UMR 7592, Dynamique du Génome et Evolution) and by the “programme Bio-Informatique” (CNRS).

References

Achaz G, Netter P, Coissac E (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol Biol Evol 18:2280–2288
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool.
J Mol Biol 215:403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11:1005–1017
Berezikov E, Bucheton A, Busseau I (2000) A search for reverse transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol 1:research0011.1–0011.15
Brunet F, Giraud T, Godin F, Capy P (2002) Do deletions of Mos1-like elements occur randomly in the Drosophilidae family? J Mol Evol 54:227–234
Chao KM, Zhang J, Ostell J, Miller W (1995) A local alignment tool for very long DNA sequences. Comput Appl Biosci 11:147–153
Charlesworth B, Langley CH (1989) The population genetics of Drosophila transposable elements. Annu Rev Genet 23:251–287
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763
Gusfield D (1997) Algorithms on strings, trees, and sequences. Computer sciences and computational biology. Cambridge University Press, Cambridge, pp 325–329
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
Jurka J (2000) Repbase update: A database and an electronic journal of repetitive elements. Trends Genet 16:418–420
Kapitonov VV, Jurka J (1998–2002) Repbase update (www.girinst.org/Repbase_Update)
Kapitonov VV, Jurka J (2001) Rolling circle transposons in eukaryotes. PNAS 98:8714–8719
Myers EW, Miller W (1988) Optimal alignments in linear space. Comput Appl Biosci 4:11–17
Myers EW, Sutton GG, Delcher AL, et al. (2000) A whole-genome assembly of Drosophila. Science 287:2196–2204
Petrov DA, Hartl DL (1998) High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups.
Mol Biol Evol 15:293–302
Quesneville H, Anxolabéhère D (2001) Genetic algorithm based model of evolutionary dynamics of class II transposable elements. J Theor Biol 213:21–30
Smit AF (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev 9:657–663