BIOINFORMATICS ORIGINAL PAPER
Vol. 22 no. 6 2006, pages 676–684
doi:10.1093/bioinformatics/btk032

Genome analysis

Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression

Valentina Boeva1,*, Mireille Regnier2, Dmitri Papatsenko3 and Vsevolod Makeev4,5

1Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia, 2INRIA Rocquencourt, France, 3University of California, Berkeley, USA, 4State Research Center GosNIIGenetika, Moscow, Russia and 5Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia

Received on October 14, 2005; revised on December 22, 2005; accepted on December 28, 2005
Advance Access publication January 10, 2006
Associate Editor: Steven L. Salzberg

ABSTRACT
Motivation: Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and classification.
Results: We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods that are multiples of six are overrepresented in all non-coding segments. Periods equal to 5–7 and 11–14 were characteristic of the enhancer regions and other non-coding regions close to genes.
Availability: TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/
Contacts: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

*To whom correspondence should be addressed.

1 INTRODUCTION
Eukaryotic genomes contain many types of repetitive sequences, such as long repeats, satellite DNA and many other yet unclassified sequences of various lengths and levels of repetitiveness (Singer and Berg, 1991). So far, the efforts of researchers have been predominantly focused on nearly perfect repeats such as microsatellites and others (Li et al., 2002). Analysis of more divergent (fuzzy) tandem repeats was complicated by problems related to their discrimination from background and by the insufficient annotation level of genomes. In this study we focus on fuzzy tandems containing n occurrences (n > 2) of a mismatched word with a period of T bases (T = 3–24) without insertions or deletions. Tandem repeats are usually classified into microsatellites (1–6 bp), minisatellites (6–24 bp, and in some cases longer) (Vergnaud and Denoeud, 2000) and ‘classical’ satellites. The length scale of fuzzy repeats considered here corresponds to the micro- and minisatellite repeat classes. However, we do not consider periods with T = 1 or 2, as they correspond to poly-A or TATA-like sequences, a different biological object explored elsewhere (Katti et al., 2001; Schug et al., 1998; Subramanian et al., 2003).

Fuzzy tandem repeats (FTRs) have been found in regulatory regions of eukaryotic genes (Shi et al., 2000); such tandems sometimes form cooperative arrays of binding sites and interact with transcription factors (Gao and Finkelshtein, 1998; Ott and Hansen, 1996; Meloni et al., 1998; Ramchandran et al., 2000).
However, it is still unclear (1) how to define and extract fuzzy tandems, (2) whether functionally different sequences are enriched in tandems of a specific structure and (3) what biological function (if any) fuzzy tandems perform in the genome. If the genome distribution of FTRs is uneven, their exploration should help to locate structural/functional sequence categories and to understand the underlying mechanisms of their function.

The degree of FTR propagation varies from one genome to the other and from one functional sequence category to the other; existing algorithms (Benson, 1999; Kolpakov et al., 2003) return up to 10–15% of the Drosophila melanogaster and >10% (Benson, 1999) of the human genome as tandem repeats of various structure. Accumulation of tandems in genomes is a result of errors during replication and some rearrangement events (Dover, 1982; Singer and Berg, 1991; Ellegren, 2004). From that perspective, much of repetitive genomic DNA might be considered as non-informative; however, there are cases where the presence of tandems is tightly linked to a biological function (Nakamura et al., 1998). For instance, long tandem repeats constitute a large portion of heterochromatin satellite DNA and are involved in centromere formation and function (Martienssen, 2003); sometimes the presence of long tandems even serves as a signal of extra centromere formation (Singer and Berg, 1991).

Much less is known about the role of shorter repetitive sequences, especially highly mismatched fuzzy tandems (FTRs), which are quite abundant in exons, introns and transcription regulatory sequences (Nakamura et al., 1998). In exons, FTRs may reflect sequence periodicities existing in the protein sequence or even structural features, such as hydrophobic helices (Katti et al., 2000; Li et al., 2004); it is unclear if these tandems have any function at the DNA level. In complex eukaryotic regulatory regions, such as enhancers and silencers, FTRs appear to be linked with some types of binding sites for transcription factors (Antoniewski et al., 1996; Ott and Hansen, 1996; Ramchandran et al., 2000). One of the attractive models suggests that an FTR with a unit consensus similar to a binding site modulates the exact response to regulator concentration (Carroll et al., 2001; Davidson et al., 2000).

Repeats of various types may also be important for regulation that controls spatial packaging/dynamics of eukaryotic DNA. Thus, 8–16 bp repeats separated by a distance <200 bases may characterize Scaffold Attached Regions (Boulikas, 1995), and periodic signals appear to play a role in nucleosome positioning (Ioshikhes et al., 1999). Periodic signals are present in prokaryotic and eukaryotic promoters, where they correspond to arrays of sites for DNA-binding proteins (Kutuzova et al., 1999; Kravatskaia et al., 2002; Makeev et al., 2003).

Tandem structure may sometimes be important for genome functioning: many human diseases are known to be caused by an increase in the number of copies, etc. (Verkerk et al., 1991; Huntington’s Disease Collaborative Research Group, 1993; Fu et al., 1992; Thibodeau et al., 1993; Wooster et al., 1994; Villafranca et al., 2001; Niv et al., 2005). From a practical point of view, variations in tandem structure serve in many important applications, such as linkage analysis and DNA fingerprinting (Edwards et al., 1992; Weber and May, 1989).

Here we conducted a functional analysis of tandems in Drosophila at a genome-wide level by (1) introducing probabilistic models for tandems with a high degree of fuzziness and (2) finding tandem structures specific to certain functional sequence categories.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 METHODS
2.1 Algorithm
Fuzzy tandems differ in the number of mismatched letters, the period T and the number of repeated units n, also called the exponent. For instance, in the tandem ATcgc | ATggc | ATtcc | ATcgg only two positions are identical in all units; this level of divergence makes it difficult to detect such tandems with the existing tools (Benson, 1999; Kolpakov et al., 2003).

Typically, the problem of finding periodic signals in biological sequences is solved with the help of autocorrelation analysis (Makeev and Tumanyan, 1996; Chaley et al., 1999; Chechetkin and Lobzin, 1998) and/or periodic alignments (e.g. Benson, 1999). However, such algebraic methods per se usually cannot select the best repeat from overlapping repeats with different periods. In addition, in the case of fuzzy repeats the probability of the tandem repeat appearing by chance cannot be neglected. In this work, we amend a detection/scoring algorithm with statistical criteria for tandem discrimination. At the first step, candidate repeats are found using local autocorrelation analysis. At the second step, the candidate repeats are filtered based on their statistical weights. The filtering allows one to obtain a set of non-overlapping tandems for any sequence. Non-overlapping tandems identified in a sequence comprise a coverage map that can be easily compared with genome annotations and sequence feature maps.

Until now, there have been no algorithms providing solutions to all the problems addressed in this study. Some algorithms (Los Alamos National Laboratory, http://biosphere.lanl.gov/tandyman/cgi-bin/tandyman.cgi) detect only perfect tandems, others return only tandems with predefined parameters (Castelo et al., 2002; Sagot and Myers, 1998) or cannot resolve the problem of overlapping periods (Benson, 1999; Kolpakov et al., 2003).
We included an option in our software that allows one to calculate the statistical significance of repeats found by other repeat finders, particularly TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003).

Fig. 1. Identification of candidate repeats. The i-th element of the output array (w in the text) contains the number of mismatches between three sequence positions: i, i + T, i − T. The i-th element of the local sum array (A in the text) contains the sum of T sequential elements of the output array starting from position i. Small values of the local sum indicate tandem positions (see the text for the details).

Identification of candidate repeats. In the first step of the algorithm we search for candidate repeats for each period T from the range of interest. The algorithm compares a seed word of length (period) T in sequence position i with words of the same length in positions i − T and i + T. For each letter of the seed word, the number of mismatches w, found from both comparisons, is then recorded at the corresponding sequence position of an output data array. If all three compared symbols are identical the score equals 0; if only two symbols are identical, the score equals 1; and if all three symbols are different the score equals 2. An example of the output array obtained after the described local autocorrelation procedure is shown in Figure 1. In the second pass, the algorithm identifies putative tandem repeats by finding minima of the local sum A of scores w,

$$A_T[i] = \sum_{k=i}^{i+T-1} w_T[k]. \tag{1}$$

All positions with the local sum below threshold K are included into candidate tandem repeats for the selected period. Greater values of K correspond to tandems with a higher degree of fuzziness. This procedure is repeated for each period T from the input range. For each T, tandems are extracted for different K, which runs from zero to (T − C), where C is a user-defined parameter, ‘the significance level’, literally a number of maximal mismatches allowed.
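The candidate-identification step can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the TandemSWAN C++ code: the function names are hypothetical, positions that lack a neighbour at i − T or i + T are simply left undefined, and ‘below threshold K’ is read here as ‘less than or equal to K’ so that K = 0 selects perfect repeats.

```python
def mismatch_scores(seq, T):
    """Score w[i] for each position that has neighbours at i - T and i + T.

    0: all three symbols identical, 1: exactly two identical, 2: all different.
    Positions without both neighbours are left as None (an assumption).
    """
    w = [None] * len(seq)
    for i in range(T, len(seq) - T):
        distinct = len({seq[i - T], seq[i], seq[i + T]})
        w[i] = distinct - 1
    return w


def local_sums(w, T):
    """A[i] = w[i] + ... + w[i + T - 1], as in Equation (1); None where undefined."""
    A = [None] * len(w)
    for i in range(len(w) - T + 1):
        window = w[i:i + T]
        if None not in window:
            A[i] = sum(window)
    return A


def candidate_positions(seq, T, K):
    """Start positions whose local sum does not exceed the threshold K."""
    A = local_sums(mismatch_scores(seq, T), T)
    return [i for i, a in enumerate(A) if a is not None and a <= K]


if __name__ == "__main__":
    # The fuzzy tandem ATcgc|ATggc|ATtcc|ATcgg used as an example in Section 2.1.
    fuzzy = "ATCGCATGGCATTCCATCGG"
    A = local_sums(mismatch_scores(fuzzy, 5), 5)
    print(A[5], A[10])  # local sums for the two inner units: 3 4
    print(candidate_positions("ATCATCATCATC", 3, 0))  # [3, 4, 5, 6]
```

For the fuzzy example the two inner units get local sums of 3 and 4, so they are reported only when K is at least 3 or 4, which matches the remark that larger K admits fuzzier tandems.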
Filtration of candidate tandem repeats. The extraction step may return a collection of tandems that differ in phase, fuzziness and the number of repeated units for the same DNA segment. However, genome-wide analysis (i.e. map feature comparison) requires a non-overlapping set of tandems. Therefore we filter extracted overlapping tandems (including those with multiple periods, like 3 and 6) and select the most statistically significant one.

We propose two statistical models for calculation of FTR P-values, ‘the MaSk’ and ‘the MotiF’. The corresponding P-values are denoted here as P_S-value and P_F-value; their calculation is based on the ‘MaSk’ and ‘MotiF’ probabilities, Pr_S and Pr_F.

‘MaSk’ characterizes combinatorial properties of tandem repeats such as the minimal number of identical symbols in the corresponding repeat position. For instance, the ‘MaSk’ for the tandem TTC | TCC | TGG is (3,[3,1,2]), which means that the repeat has an exponent equal to 3, and at the first position all three letters are identical, at the second position it can be any letter and at the third position at least two letters must be identical. In all cases the MaSk is considered regardless of the particular letters in the sequence. The probability to obtain the ‘MaSk’ at a random position, called the ‘MaSk’ probability, is equal to:

$$\Pr\nolimits_S(n,[k_1,k_2,\ldots,k_T]) = \prod_{i=1}^{T} \; \sum_{\substack{n_A+n_C+n_G+n_T=n,\\ \exists\, n_a \ge k_i}} \binom{n}{n_A\; n_C\; n_G\; n_T}\, p_A^{n_A}\, p_C^{n_C}\, p_G^{n_G}\, p_T^{n_T}. \tag{2}$$

Here, the summation is taken over all possible letter combinations that comply with the specific mask. In this formula n is the exponent of the candidate repeat, T is the period, and k_i is the maximal number of identical symbols in position i of the repeat, 1 ≤ k_i ≤ n. The parameters of the Bernoulli model, the symbol frequencies p_A, . . . , p_T, are evaluated from the entire sequence. For instance, the MaSk probability calculated for the tandem TTC | TCC | TGG assuming p_A = p_C = p_G = p_T = 0.25 is equal to Pr_S(3, [3,1,2]) = 0.04.
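The worked example above can be checked numerically. The following Python sketch evaluates the ‘MaSk’ probability by brute-force enumeration of letter counts at each repeat position; it is an illustration only, with hypothetical function names, and is not the routine used in TandemSWAN.

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def mask_of(units):
    """Return (n, [k_1, ..., k_T]) for a list of equal-length repeat units."""
    ks = [max(Counter(column).values()) for column in zip(*units)]
    return len(units), ks


def position_probability(n, k, freqs):
    """P(some letter occurs at least k times among n independent letters)."""
    letters = list(freqs)
    total = 0.0
    for counts in product(range(n + 1), repeat=len(letters)):
        if sum(counts) != n or max(counts) < k:
            continue
        coeff = factorial(n)  # multinomial coefficient n! / (n_A! n_C! n_G! n_T!)
        for c in counts:
            coeff //= factorial(c)
        total += coeff * prod(freqs[a] ** c for a, c in zip(letters, counts))
    return total


def mask_probability(n, ks, freqs):
    """Product over repeat positions of the per-position probabilities."""
    return prod(position_probability(n, k, freqs) for k in ks)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    n, ks = mask_of(["TTC", "TCC", "TGG"])
    print(n, ks)                            # 3 [3, 1, 2]
    print(mask_probability(n, ks, uniform))  # 0.0390625
```

With uniform frequencies the three positions contribute 1/16, 1 and 5/8, whose product 0.0390625 rounds to the value 0.04 quoted in the text.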
The corresponding P_S-value (the probability to obtain a tandem satisfying the ‘MaSk’ in a random text of length N) is equal to

$$P_S\text{-value} = 1 - \left[1 - \Pr\nolimits_S(n,[k_1,k_2,\ldots,k_T])\right]^{N-Tn+1}, \tag{3}$$

where N is taken equal to the length of the entire sequence or the scanning window length.

The ‘MotiF’ model is based on the concept of a motif. A motif H represents the set of all words which comply with the IUPAC consensus [http://bioinformatics.org/sms/iupac.html] of the observed FTR. For example, the consensus for ATC | ATG | TTG is ‘WTS’ and the motif is {ATC, ATG, TTC, TTG}. Such a motif representation has advantages and drawbacks; we discussed some of these issues in Kotelnikova et al. (2005). The probability to find a motif H in the sequence is simply the probability to find any word belonging to it:

$$\Pr\nolimits_F(H) = \sum_{v \in H} P(v). \tag{4}$$

Then the MotiF P-value, P_F-value, is the probability to find at least n consecutive occurrences of motif H in a sequence of length N, given that H has already been found once in the sequence:

$$P_F\text{-value} = \frac{1 - \left[1 - \Pr\nolimits_F^{\,n}(H)\,\bigl(1 - \Pr\nolimits_F(H)\bigr)\right]^{N-Tn+1}}{1 - \left[1 - \Pr\nolimits_F(H)\right]^{N-T+1}}. \tag{5}$$

P-values calculated using either model allow for unambiguous discrimination between, for instance, a longer, highly mismatched tandem and a shorter one containing fewer but better matching units. As we pointed out earlier, weighting also helps to eliminate overlapping tandems.

2.2 Implementation
The FTR extraction algorithm is implemented as a C++ package, TandemSWAN, available for online data processing and for download from the following URL: http://bioinform.genetika.ru/projects/swan/www. TandemSWAN accepts input sequences in most available file formats; user-defined parameters include minimal and maximal period lengths, ‘significance level’, ‘the MaSk’ or ‘the MotiF’ statistical mode and ‘the penalty factor’ for sub-periods (see online help for the details on parameter settings). Memory requirements and running time depend on repeat abundance in the query sequence and on parameter values; e.g. the running time for the 22 MB Drosophila chrX amounted to 2.5 h. TandemSWAN includes utilities for calculation of P-values for tandem repeats obtained by related programs, MREPS (Kolpakov et al., 2003) and TRF (Benson, 1999).

2.3 Coverage of random sequence with FTRs identified by TandemSWAN agrees with theoretical prediction
Genome-wide exploration requires convenient FTR maps, which we refer to here as ‘coverage maps’: the fraction of the sequence dataset positions (e.g. percentage of total exon length) covered by FTRs with a specific structure. In the case of sequences obtained under the Bernoulli model, this fraction can be evaluated analytically for each particular FTR period. To test the performance of our algorithm, we compared this analytical value with the results of FTR identification in simulated random sequences. Indeed, for each period T and ‘significance level’ C the probability Q to find a candidate tandem repeat in the first step of the algorithm starting from some position i ≥ T (Fig. 1) is written as

$$Q = P\!\left(\sum_{k=i}^{i+T-1} w_T[k] \le T - C\right). \tag{6}$$

Here w_T[k], the elements of the array w_T (see definitions in ‘Identification of candidate repeats’ and Fig. 1), are considered as random variables whose expectation E and variance V depend on the letter frequencies, evaluated here from the D.melanogaster genome. According to the central limit approximation, Equation (6) can be written as follows:

$$Q \approx \Phi\!\left(\frac{T(1-E) - C + 1}{\sqrt{VT}}\right), \tag{7}$$

where Φ(x) is the standard normal cumulative distribution. To obtain the fraction of the random sequence covered by tandem repeats found at the second step of the algorithm, statistical weighting, one should take into account possible overlaps of candidate tandem repeats in neighboring positions. Finally, the coverage can be approximated as 3TQN/(5T − 2), where the T-dependent factor at Q reflects tandem repeat overlaps.

We generated several 1 Mb sequences with uniform letter frequencies and with letter frequencies from the genome of D.melanogaster, identified FTRs with different parameter settings and calculated coverage maps for periods in the range 3–15 bases. Comparison between the observed and the calculated coverage of FTRs (Fig. 2) demonstrates that the devised formula [Equation (7)] accurately describes the distribution of the majority of FTRs present in the random sequence. The agreement between theoretical and observed coverage values holds for the range of periods explored in this study and depends only moderately on letter frequencies.

3 RESULTS AND DISCUSSION
FTR density is uneven across the genome of Drosophila. The distribution of functional and other sequence features is known to be unequal among eukaryotic chromosomes and between different chromosome loci. Therefore, we decided to explore the distribution of FTRs, as detected by our algorithm, across a sample eukaryotic genome, D.melanogaster (Celniker et al., 2002). We focused our attention on the genome of Drosophila because of its outstanding level of annotation, the availability of many related genomes and its relatively small size (120 MB). We performed FTR extraction with the default parameter setting (see TandemSWAN help) using the ‘MaSk’ model for statistical weighting. In this test we explored map coverage, calculated for non-overlapping 16 KB windows, without discriminating tandem motifs and periods. We have found that FTR density (map coverage) along all chromosomes is highly inhomogeneous (Fig. 3a), and the central arm regions have higher FTR density than the distal regions (Fig. 3b). In addition, the average FTR density on the X-chromosome is substantially higher than on autosomes (see ‘Relative FTR density across the genome’ below).
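The window-based map coverage used in this test can be sketched as follows. The Python code below is an illustrative assumption rather than the procedure actually used: it takes non-overlapping tandems as half-open (start, end) intervals, assumes a window of 16 000 bp for ‘16 KB’, and uses hypothetical names throughout.

```python
def window_coverage(intervals, seq_len, window=16_000):
    """Fraction of each non-overlapping window covered by the given intervals.

    intervals: non-overlapping half-open (start, end) pairs in sequence
    coordinates (an assumed representation of the filtered tandem set).
    """
    n_windows = (seq_len + window - 1) // window
    covered = [0] * n_windows
    for start, end in intervals:
        if end <= start:
            continue  # skip empty intervals
        for wi in range(start // window, (end - 1) // window + 1):
            w_start = wi * window
            w_end = min(w_start + window, seq_len)
            covered[wi] += max(0, min(end, w_end) - max(start, w_start))
    # The last window may be shorter than the others.
    return [
        c / (min((i + 1) * window, seq_len) - i * window)
        for i, c in enumerate(covered)
    ]


if __name__ == "__main__":
    tandems = [(0, 4000), (15000, 17000)]  # toy intervals, not real data
    print(window_coverage(tandems, seq_len=32000))  # [0.3125, 0.0625]
```

A tandem that straddles a window boundary contributes to both windows in proportion to its overlap, so the per-window fractions sum to the total covered length.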
In order to assess possible roles of FTRs, we correlated FTR distribution in the Drosophila genome with positions of coding sequences and some other sequence features, such as local AT/GC composition. We have found that gene-rich segments contain fewer FTRs than intergenic regions, while AT-rich segments are enriched in FTRs (Supplementary Fig. 1). The reasons for fluctuation of FTR density in the genome can be different; therefore we explored FTR structure in functionally distinct sequence datasets.

Fig. 2. Comparison of map coverage values obtained by TandemSWAN with theoretical expectation. The expected coverage was calculated according to Equation (7) in the text. Series shown: TandemSWAN and theoretical expectation, each at significance levels 1, 2 and 3. A 1 Mb long Bernoulli random sequence with average genomic nucleotide frequencies was simulated.

Fig. 3. FTR occurrence in different segments of the D.melanogaster genome. FTR density is higher on the X-chromosome (a) and in the centers of chromosome arms (b).

FTRs with 3k periods prevail in coding regions. We investigated whether FTRs of specific periods prevail in certain functional categories. We assembled several sequence datasets containing all exons, 3′-untranslated regions (3′-UTRs), 5′-UTRs, intergenic regions, intergenic heterochromatin (from the Drosophila Heterochromatin Project, http://www.dhgp.org/; http://flybase.net/annot/dmel_het_release3.2b2.txt; ftp://flybase.net/genomes/Drosophila_melanogaster/current_hetchr/fasta/) and a dataset containing 124 transcription regulatory regions, i.e. enhancers (or cis-regulatory modules) (https://webfiles.berkeley.edu/dap5/public_html/index.html).
Corresponding datasets were also constructed for the related species D.pseudoobscura (Richards et al., 2005). In order to explore FTRs within a functionally related group of genes we also generated the corresponding datasets for a selection of 16 developmental gene loci from D.melanogaster and D.pseudoobscura.

To investigate the prevalence of specific tandems in all these datasets we compared total sequence coverage by FTRs with different periods (Fig. 4). As expected, the most striking signals were detected in datasets containing exons (Fig. 4a). We found that in the coding regions FTRs with periods divisible by 3 prevail, whereas tandems with periods not equal to 3k are suppressed (below random expectation). We also found that 3k periods in coding regions of the X-chromosome have a greater coverage than 3k-periodic FTRs found in exons of autosomes. This suggests that FTR density even within similar functional units may be linked with the physical map, i.e. a particular place in the genome. Surprisingly, we also detected a high presence of periods that are multiples of 6 (but not the other 3k periods) in the non-coding sequences. Apparently, there is a 6k background in the genome, not related to periodicities caused by codon triplets [see ‘Possible source of 3k (6k) background’ below].

FTR periods specific to sequence categories other than exons. FTRs with periods 6 and 12 were found to be highly abundant throughout all analyzed datasets, including transcription regulatory regions, intergenic spacers, UTRs and even intergenic heterochromatin (Fig. 4b–d). At the same time, non-coding regions were also found to be enriched in FTRs with periods other than 3k. Outside exons we observed a 2- to 3-fold FTR excess over random expectation, which supports the ‘non-random’ origin of FTRs and the non-random character of genomic sequences in general. In order to detect differences in the FTR structure among datasets representing non-coding regions, we compared the prevalence of all periods, i.e.
FTR profiles, as shown in Figure 4 (Table 1 and Supplementary Table 1). The correlation analysis has shown that, according to prevailing FTR periods, all 22 datasets can be subdivided into at least three groups of similarity, one corresponding to coding regions, another to heterochromatin and the last one corresponding to intergenic regions, spacers and others. Comparison of absolute levels of FTR presence in different datasets has shown that intergenic heterochromatin, in general, contains fewer FTRs than euchromatin (Fig. 4b). Moreover, heterochromatic regions displayed some excess of FTRs with periods equal to 3k. In general, comparison of different sequence categories demonstrated that FTRs with all explored periods are overabundant in the genome (with the exception of exons) and that FTRs with 6k periods for some reason strongly prevail, even in non-coding DNA.

FTRs in enhancers are similar, but not identical, to those in intergenic regions. Repetitive sequences in transcription regulatory regions are of special interest. While periodic signals present in exons (3k-periodic FTRs) can be explained by the genetic triplet code (and by periodicities in protein sequences), in regulatory regions FTRs may well represent a background. To investigate this problem, we removed the 6k background by normalizing FTR coverage values in functional datasets to the coverage in genome fragments without any functional annotation (non-functional). We focused on 124 annotated (experimentally validated) enhancer regions from D.melanogaster (https://webfiles.berkeley.edu/dap5/public_html/data_06/124_Dmel_Enc.fa). The vast majority of these sequences are involved in regulation of developmental genes. However, this group is neither functionally nor structurally homogeneous. The enhancers have different lengths (0.3–3 kb) and regulate genes transcribed at different developmental stages.
To achieve better representation we considered the entire dataset (124 sequences, 181 690 bp) and two subsamples, so-called ‘AP’ (anterior–posterior, 72 sequences, 117 377 bp) and ‘DV’ (dorsoventral, 136 sequences, 114 354 bp) enhancers. Along with enhancers we also considered a separate dataset combining ‘spacers’ between enhancers and a dataset combining coding regions from the same genome locations. The corresponding datasets were also constructed for D.pseudoobscura ‘AP’ enhancers (sequences are available from the website).

Analysis of the normalized FTR distributions in enhancer datasets and in spacers (Fig. 4e) has shown some degree of enrichment in FTRs with periods 7 and 8 in all datasets representing loci of developmental genes. No major differences were found in FTR distribution between the enhancers and their flanking regions or ‘spacers’. However, the overall FTR distribution was not identical to that found in the other non-functional, intergenic regions of the genome.

Along with the assessment of general FTR distributions in enhancers, we also investigated a possible relation between FTR motifs and the binding motifs for transcription factors present in the enhancers. In some single cases (even-skipped stripe 2 enhancer) we observed some similarity, but on the larger scale the correlation was not found to be significant.

Fig. 4. Fraction covered by FTRs with different periods calculated for DNA with different function from D.melanogaster and D.pseudoobscura. FTRs with periods that are multiples of 3 are overrepresented in exons, and FTRs with periods 6 and 12 in non-translated DNA. Note the comparative deficiency of period 9 FTRs in intergenic DNA, especially in D.pseudoobscura. FTRs were identified by SWAN with significance level C = 2 without filtration by statistical significance. For comparison, in all cases curve ‘–·–’ shows results for a 1 MB simulated Bernoulli random sequence with average genomic nucleotide frequencies.
(a) D.melanogaster exons: autosomes (26 719 758 bp); X-chromosome (5 479 105 bp).
(b) D.melanogaster euchromatin intergenic and heterochromatin DNA: autosome intergenic (47 527 681 bp); X-chromosome intergenic (11 598 950 bp); heterochromatin (7 089 934 bp).
(c) D.melanogaster UTRs: autosome 5′-UTRs (3 331 080 bp); X-chromosome 5′-UTRs (686 503 bp); autosome 3′-UTRs (5 549 366 bp); X-chromosome 3′-UTRs (1 254 227 bp).
(d) D.melanogaster regulatory regions compared with autosome intergenic DNA: dorsal and twist enhancers (114 354 bp); 124 enhancers (181 690 bp); autosome intergenic DNA (47 527 681 bp), the same as in panel (b).
(e) D.melanogaster regulatory regions normalized to autosome intergenic DNA: 124 enhancers (181 690 bp); AP enhancers (115 599 bp); AP spacers (349 634 bp).
(f) D.pseudoobscura intergenic DNA and CDS: autosome intergenic (49 347 738 bp); X-chromosome intergenic (30 400 371 bp); autosome exons (14 069 722 bp); X-chromosome exons (5 417 123 bp).
(g) D.pseudoobscura and D.melanogaster CDS: D.pseudoobscura autosomes (14 069 722 bp); D.melanogaster autosomes (26 719 758 bp).
(h) D.pseudoobscura and D.melanogaster intergenic DNA: D.pseudoobscura autosome (49 347 738 bp); D.melanogaster autosome (47 527 681 bp).
(i) D.pseudoobscura and D.melanogaster AP enhancers: D.pseudoobscura (60 085 bp); D.melanogaster (115 599 bp).

It appears that FTRs in enhancers differ from FTRs in the rest of the genome in their period and, perhaps, in motif composition; however, the insufficient number of annotated enhancer regions in the genome and the presence of the 6k background complicate the analysis.

Possible source of 3k (6k) background.
Our results show that in Drosophila tandems with periods 6 and 12 are found in the entire genome, whereas other 3k periods are restricted to protein coding sequences. The abundance of 6k periods might be explained either through the mechanisms of DNA replication and rearrangement, or by possible structural DNA features. Existing experimental data suggest that certain 3-periodic synthetic DNA sequences have substantially different helix stability, probably owing to mismatched alignments at equilibrium temperatures during melting (Delcourt and Blake, 1991). So, the quasi-repeated structures may be important for functional stability/flexibility of the DNA molecule.

Periods of 3k abundant in coding sequences are apparently related to periodicity in protein sequences and the triplet nature of the genetic code. Repetitive structures in the protein sequences, such as 3₁₀ helices or hydrophobic alpha helices, may also cause periodicity at the level of DNA (Katti et al., 2000).

The ‘coincidence’ that 6k periods also fall into the 3k periods may even have deeper roots. Apparently, nucleic acids and their replication appeared in a certain form before the triplet genetic code (Lifson, 1997); so it may be that a 3–6k ‘matrix’ was present in DNA long before the genetic code itself, and itself served as a source for the formation of what we know today as the triplet genetic code.

Be the origin of 3k/6k periodic FTRs ‘mechanical’ (DNA stability/replication) or ‘historical’ (3k matrix), it is unlikely that this nonspecifically distributed signal is connected with fine mechanisms of genome functioning, such as regulation of gene expression. However, this does not exclude the possibility that FTRs with other periods or FTRs containing some specific motifs are involved in some regulatory functions. Moreover, if the quasi-periodic structure of the native DNA sequence is indeed important, it would pose an additional constraint on all motifs in the sequence, including regulatory signals.
Role of FTRs with periods other than 3k. Analysis of FTRs with periods different from 3k has shown that periods 7 and 8 are more abundant in enhancers and in the surrounding regions without functional annotation (Fig. 4e). Currently, it is not clear what function these FTRs may perform in enhancers and whether their presence is related to any function at all. For instance, we have found no correlation between FTRs and recognition motifs for transcription factors present in the same enhancers. However, we considered only a limited number of binding motifs (11), most of which are far from being perfect. Improved enhancer annotations, a larger number of considered motifs and better recognition of the functional motif matches may shed more light on possible roles of FTRs in the regulation of transcription.

As has been suggested earlier (Papatsenko et al., 2002), some FTRs may play a role as cassettes, containing synergistically acting tandems of binding sites and responding to certain threshold levels of transcription factors. However, it is also possible that the presence of specific repetitive sequences provides a certain spatial geometry to an enhancer, required for correct assembly of regulatory protein complexes. Apparently, some FTRs may be involved in the maintenance of chromatin structure and/or spatial DNA geometry even in a broader context.

The role of tandem repeats in regulatory regions was discussed recently by Sinha and Siggia (2005). The authors assessed high quality tandem repeats found in enhancers of D.melanogaster and D.pseudoobscura using TRF and MREPS and demonstrated their low conservation. They concluded that the tandem repeats carry a limited functional load. This agrees with our first finding that the majority of FTRs found in enhancers have the same predominant periods as non-annotated intergenic DNA. On the other hand, some FTRs found in enhancer regions have specific properties and may be involved in regulatory function.

Relative FTR density across the genome.
While the qualitative composition of FTRs is surprisingly similar across the genome, their density may vary substantially from one genomic location to another and from one genome to another. This may or may not be connected with the local gene density or even the presence of ‘gene deserts’ (Ovcharenko et al., 2005). We have found that the genome of D.pseudoobscura has a greater DNA fraction covered by FTRs in all functional categories than the genome of D.melanogaster (Fig. 4). In both genomes the X-chromosome has a higher FTR density than other chromosome arms, and distal locations of the chromosome arms have lower FTR densities than more central loci (Fig. 3). It is noteworthy that the Drosophila X-chromosome has a greater number of short perfect repeats (Katti et al., 2001) and probably a greater number of recent gene duplications as compared with autosomes (Thornton and Long, 2002). This difference between sex chromosomes and the rest of the genome was also reported in human, where LINE1 repeat elements cover one-third of the X-chromosome (Ross et al., 2005). However, in the Caenorhabditis elegans genome the reported repeat density in the X-chromosome is lower (Achaz et al., 2001). The difference in FTR fraction between the genomes of D.melanogaster and D.pseudoobscura is probably related to the higher compactness of the D.melanogaster genome. Finally, we have found that FTRs are relatively less abundant in intergenic heterochromatin. Perfect tandems with a period of 5 found in pericentromeric heterochromatic regions (Sun et al., 2003) actually cover a surprisingly low fraction of the total heterochromatic DNA.
Problems of FTR exploration and focus of the TandemSWAN algorithm. Eukaryotic genomes are overwhelmed by repetitive sequences that probably carry no biological function and arise, for instance, simply from peculiarities of DNA replication (Ellegren, 2004).
This non-functional noise needs to be filtered, which can be done in part by exploring repeat parameters in different sequence categories and/or by normalizing to the noise in the non-functional regions. However, even the background signals may serve to maintain overall DNA survivability and might contain signals required, for instance, for chromatin packaging. Here we explored fuzzy tandems since they are largely out of the focus of the regular tandem-finding programs (Castelo et al., 2002; Sagot and Myers, 1998; Benson, 1999; Kolpakov et al., 2003). Specifics of our algorithm can be illustrated by comparing TandemSWAN with the two most popular repeat finders, TRF (Benson, 1999) and MREPS (Kolpakov et al., 2003) (Fig. 5 and Supplementary Table 2). In the comparison with TRF and MREPS we tried to obtain sequence sets that were as similar as possible to those obtained with TandemSWAN with the parameters characteristic of our study. In fact, both TRF and MREPS are usually used to search for more precise repeats than those in Figure 5, and at least TRF was operating near the limit of repeat fuzziness allowed by its internet-based version. All three tools perform similarly in the case of perfect repeats. However, the results of TandemSWAN and TRF/MREPS become quite different in the case of FTR extraction.

Table 1. Correlation between FTR profiles for D.melanogaster datasets

                 1     2     3     4     5     6     7     8
1  3′-UTR      1.00  0.91  0.42  0.78  0.91  0.96  0.92  0.51
2  5′-UTR      0.91  1.00  0.67  0.89  0.87  0.95  0.91  0.62
3  CDS         0.42  0.67  1.00  0.76  0.29  0.42  0.37  0.32
4  Hetero      0.78  0.89  0.76  1.00  0.69  0.81  0.71  0.66
5  124 ENH     0.91  0.87  0.29  0.69  1.00  0.96  0.93  0.58
6  Inter       0.96  0.95  0.42  0.81  0.96  1.00  0.97  0.67
7  Spacer      0.92  0.91  0.37  0.71  0.93  0.97  1.00  0.57
8  Random      0.51  0.62  0.32  0.66  0.58  0.67  0.57  1.00

(1) 3′-UTR, autosomes; (2) 5′-UTR, autosomes; (3) CDS, autosomes; (4) intergenic heterochromatin; (5) 124 enhancers; (6) intergenic euchromatin; (7) spacers in AP loci; (8) random sequence. Color-code: (white) r = 0–0.6; (light grey) 0.6–0.9; (medium grey) 0.9–0.95; (dark grey) 0.95–1.

Fig. 5. Similarity between tandem sets identified by different repeat finders. Fractions of a 50 kb fragment of D.melanogaster chr2L sequence covered with tandem repeats identified by different algorithms with the following parameters: TandemSWAN, minimal period 3, maximal period 15, significance level 2, filtration with (a) PS-probability < 10⁻⁵ and (b) PS-probability < 10⁻³, ‘MaSk’ statistical mode; TRF (Benson, 1999), minimal period 3, maximal period 15, match 2, mismatch 2, indel 15, pmatch 80, pindel 0, minscore 20; MREPS (Kolpakov et al., 2003), err 15, maxperiod 15, minperiod 3.

4 CONCLUSIONS
In this work we have formulated a working definition of FTRs based on their statistical properties, developed an extraction algorithm and compared properties of tandems found in sequences carrying different functions. We developed statistics that allow calculation of P-values for FTR occurrence in Bernoulli-type random sequences, which can be useful for other algorithms. This statistical approach is implemented in the TandemSWAN program, which aims to identify FTRs with a broad spectrum of listed parameters. Using this approach, we identified short FTRs with periods of 3–24 bases in the D.melanogaster and D.pseudoobscura genomes and compared FTR structure and occurrence in coding and non-coding regions, heterochromatic regions and regulatory (enhancer) sequences. We have found that different types of tandems are abundant in different functional sequence categories, with each category having its own pattern of preferred period lengths. Tandems with period 3 and its multiples were found to be characteristic of coding regions. FTRs with 6k periods are characteristic of all non-coding DNA.
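The category-specific period preferences summarized here, and the correlations between FTR profiles reported in Table 1, can be illustrated with a minimal Python sketch. It is not part of TandemSWAN. We assume, for illustration only, that an FTR profile is the normalized distribution of identified repeats over periods 3–24 and that r is the Pearson coefficient; the function names and the example period lists are hypothetical.

```python
from math import sqrt

def period_profile(periods, min_period=3, max_period=24):
    """Normalized histogram of repeat periods (assumed definition of a profile)."""
    counts = [0.0] * (max_period - min_period + 1)
    for t in periods:
        if min_period <= t <= max_period:
            counts[t - min_period] += 1.0
    total = sum(counts)
    # An empty input yields an all-zero profile.
    return [c / total for c in counts] if total else counts

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant profile: correlation is undefined")
    return cov / sqrt(vx * vy)

# Hypothetical period lists for two sequence categories (not real data).
coding_like = [3, 3, 6, 9, 12, 3, 6]
noncoding_like = [6, 7, 12, 5, 6, 11, 14]

r = pearson(period_profile(coding_like), period_profile(noncoding_like))
print(f"r = {r:.2f}")
```

Applying such a comparison to every pair of sequence categories would give a symmetric matrix with ones on the diagonal, which is the layout of Table 1.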
FTRs with periods equal to 5, 6, 7 and 11, 12, 14 were enriched in loci of developmental genes and developmental enhancers. On average, the regulatory modules have no fewer FTRs than nearby spacers; furthermore, FTRs with periods 7 and 8 are found more often in Drosophila cis-regulatory modules than in other non-coding DNA. Obviously, both the evolution and the DNA structure of regulatory modules are subject to many additional parameters, such as DNA melting and adsorption of protein regulatory factors. Thus, it is possible that FTRs found in cis-regulatory modules have some particular sequence structure facilitating their function, and they should be studied in greater detail. To understand the role of the omnipresent 6 bp-related repeats, it is first necessary to test whether they are present in different bacterial, animal and plant taxa, and not only in Drosophila species. This work is currently in progress.

ACKNOWLEDGEMENTS
The authors are thankful to M. Borodovsky, A. P. Lifanov, N. A. Oparina, N. G. Esipova, M. Lassig, A. V. Favorov, V. E. Ramensky, M. G. Gelfand and A. A. Mironov for valuable discussion. They also thank R. Zinzen for careful reading of the manuscript and suggested changes. This study has been supported by the French Program EcoNet-08159PG, INTAS grant 04-83-3994, Russian State Contract No 02.434.11008, RFBR grant 04-04-49601, Fogarty RO3 TW005899-01A1 program, Russian Academy of Sciences Presidium Program in Molecular and Cellular Biology, project #10 and Ludwig Institute of Cancer Research Grant CRDF GAP RBO-1268.
Conflict of Interest: none declared.

REFERENCES
Antoniewski,C. et al. (1996) Direct repeats bind the EcR/USP receptor and mediate ecdysteroid responses in Drosophila melanogaster. Mol. Cell. Biol., 16, 2977–2986.
Achaz,G. et al. (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol. Biol. Evol., 18, 2280–2288.
Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.
Benson,G. and Waterman,M. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828–4836.
Boulikas,T. (1995) Chromatin domains and prediction of MAR sequences. Int. Rev. Cytol., 162A, 279–388.
Carroll,S.B., Grenier,J.K. and Weatherbee,S.D. (2001) From DNA to Diversity. Molecular Genetics and the Evolution of Animal Design. Blackwell Science, Malden, MA, ISBN 0-632-04511-6.
Castelo,A.T. et al. (2002) TROLL - tandem repeat occurrence locator. Bioinformatics, 18, 634–636.
Celniker,S. et al. (2002) Finishing a whole genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3, RESEARCH0079.
Chaley,M.B. et al. (1999) Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples. DNA Res., 6, 153–163.
Chechetkin,V.R. and Lobzin,V.V. (1998) Nucleosome units and hidden periodicities in DNA sequences. J. Biomol. Struct. Dyn., 15, 937–947.
Delcourt,S.G. and Blake,R.D. (1991) Stacking energies in DNA. J. Biol. Chem., 266, 15160–15169.
Davidson,H. et al. (2000) Genomic sequence analysis of Fugu rubripes CFTR and flanking genes in a 60 kb region conserving synteny with 800 kb of human chromosome 7. Genome Res., 10, 1194–1203.
Dover,G.A. (1982) Molecular drive, a cohesive model of species evolution. Nature, 299, 111–117.
D.melanogaster heterochromatin genome data from Drosophila Heterochromatin Genome Project.
Edwards,A. et al. (1992) Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12, 241–253.
Ellegren,H. (2004) Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet., 5, 435–445.
Fu,Y.-H. et al. (1992) An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science, 255, 1256–1258.
Gao,Q. and Finkelstein,R. (1998) Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen. Development, 125, 4185–4193.
Huntington’s Disease Collaborative Research Group (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell, 72, 971–983.
Ioshikhes,I. et al. (1999) Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl Acad. Sci. USA, 96, 2891–2895.
Karlin,S. et al. (1988) Efficient algorithms for molecular sequence analysis. Proc. Natl Acad. Sci. USA, 85, 841–845.
Katti,M.V. et al. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol., 18, 1161–1167.
Katti,M.V. et al. (2000) Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci., 9, 1203–1209.
Kolpakov,R. et al. (2003) mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res., 31, 3672–3678.
Kotelnikova,E.A. et al. (2005) Evolution of transcription factor DNA binding sites. Gene, 347, 255–263.
Kravatskaia,G.I. et al. (2002) Similarities in periodical structures in the position of nucleotides in regions of initiation of replication of bacterial genomes. Biofizika, 47, 595–599.
Kutuzova,G.I. et al. (1999) Periodicity in contacts of RNA-polymerase with promoters. Biofizika, 44, 216–223.
Landau,G.M. et al. (2001) An algorithm for approximate tandem repeats. J. Comput. Biol., 8, 1–18.
Li,L. et al. (2004) Pseudo-periodic partitions of biological sequences. Bioinformatics, 20, 295–306.
Li,Y.C. et al. (2002) Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol., 11, 2453–2465.
Lifson,S. (1997) On the crucial stages in the origin of animate matter. J. Mol. Evol., 44, 1–8.
Makeev,V.Ju. and Tumanyan,V.G. (1996) Search of periodicities in primary structure of biopolymers: a general Fourier approach. Comput. Appl. Biosci., 12, 49–54.
Makeev,V.J. et al. (2003) Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res., 31, 6016–6026.
Martienssen,R.A. (2003) Maintenance of heterochromatin by RNA interference of tandem repeats. Nat. Genet., 35, 213–214.
Meloni,R. et al. (1998) A tetranucleotide polymorphic microsatellite, located in the first intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in vitro. Hum. Mol. Genet., 7, 423–428.
Nakamura,Y. et al. (1998) VNTR (variable number of tandem repeat) sequences as transcriptional, translational, or functional regulators. J. Hum. Genet., 43, 149–152.
Niv,E. et al. (2005) Microsatellite instability in patients with chronic B-cell lymphocytic leukaemia. Br. J. Cancer, 92, 1517–1523.
Ovcharenko,I. et al. (2005) Evolution and functional classification of vertebrate gene deserts. Genome Res., 15, 137–145.
Ott,R.W. and Hansen,L.K. (1996) Repeated sequences from the Arabidopsis thaliana genome function as enhancers in transgenic tobacco. Mol. Gen. Genet., 252, 563–571.
Papatsenko,D.A. et al. (2002) Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res., 12, 470–481.
Ramchandran,R. et al. (2000) A (GATA)(7) motif located in the 5′ boundary area of the human beta-globin locus control region exhibits silencer activity in erythroid cells. Am. J. Hematol., 65, 14–24.
Richards,S. et al. (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res., 15, 1–18.
Ross,M.T. et al. (2005) The DNA sequence of the human X chromosome. Nature, 434, 325–337.
Sagot,M. and Myers,E. (1998) Identifying satellites in nucleic acid sequences. In Istrail,S., Pevzner,T. and Waterman,M. (eds), Proceedings of the Second Annual International Conference on Computational Molecular Biology. ACM Press, NY, pp. 234–242.
Schug,M.D. et al. (1998) The distribution and frequency of microsatellite loci in Drosophila melanogaster. Mol. Ecol., 7, 57–70.
Shi,X.M. et al. (2000) Tandem repeat of C/EBP binding sites mediates PPARgamma2 gene transcription in glucocorticoid-induced adipocyte differentiation. J. Cell Biochem., 76, 518–527.
Singer,M. and Berg,T. (1991) Genes and Genomes. University Science Books, Mill Valley, California.
Sinha,S. and Siggia,E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. Mol. Biol. Evol., 22, 874–885.
Subramanian,S. et al. (2003) Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol., 4, R13.
Sun,X. et al. (2003) Sequence analysis of a functional Drosophila centromere. Genome Res., 13, 182–194.
Thibodeau,S.N. et al. (1993) Microsatellite instability in cancer of the proximal colon. Science, 260, 816–819.
Thornton,K. and Long,M. (2002) Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol. Biol. Evol., 19, 918–925.
Verkerk,A. et al. (1991) Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell, 65, 905–914.
Vergnaud,G. and Denoeud,F. (2000) Minisatellites: mutability and genome architecture. Genome Res., 10, 899–907.
Villafranca,E. et al. (2001) Polymorphisms of the repeated sequences in the enhancer region of the thymidylate synthase gene promoter may predict downstaging after preoperative chemoradiation in rectal cancer. J. Clin. Oncol., 19, 1779–1786.
Weber,J.L. and May,P.E. (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44, 388–396.
Wooster,R. et al. (1994) Instability of short tandem repeats (microsatellites) in human cancers. Nat. Genet., 6, 152–156.