Short fuzzy tandem repeats in genomic sequences, identification

BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 6 2006, pages 676–684
doi:10.1093/bioinformatics/btk032
Genome analysis
Short fuzzy tandem repeats in genomic sequences, identification,
and possible role in regulation of gene expression
Valentina Boeva1,*, Mireille Regnier2, Dmitri Papatsenko3 and Vsevolod Makeev4,5
1Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia,
2INRIA Rocquencourt, France, 3University of California, Berkeley, USA, 4State Research Center
GosNIIGenetika, Moscow, Russia and 5Engelhardt Institute of Molecular Biology,
Russian Academy of Sciences, Moscow, Russia
Received on October 14, 2005; revised on December 22, 2005; accepted on December 28, 2005
Advance Access publication January 10, 2006
Associate Editor: Steven L. Salzberg
ABSTRACT
Motivation: Genomic sequences are highly redundant and contain
many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of
particular interest. They are found in regulatory regions of eukaryotic
genes and are reported to interact with transcription factors. However,
accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and
classification.
Results: We have obtained formulas for P-values of FTR occurrence
and developed an FTR identification algorithm implemented in
TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp)
in coding and non-coding regions including UTRs, heterochromatic,
intergenic and enhancer sequences of Drosophila melanogaster and
Drosophila pseudoobscura. Tandems with period three and its multiples
were found in coding segments, whereas FTRs with periods that are multiples
of six are overrepresented in all non-coding segments. Periods equal to
5–7 and 11–14 were characteristic of the enhancer regions and other
non-coding regions close to genes.
Availability: TandemSWAN web page, stand-alone version and
documentation can be found at http://bioinform.genetika.ru/projects/
swan/www/
Contacts: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Eukaryotic genomes contain many types of repetitive sequences,
such as long repeats, satellite DNA and many other as yet unclassified
sequences of various lengths and levels of repetitiveness (Singer and
Berg, 1991). So far, the efforts of researchers have been predominantly focused on nearly perfect repeats such as microsatellites and
others (Li et al., 2002). Analysis of more divergent (fuzzy) tandem
repeats was complicated by problems related to their discrimination
from background and insufficient annotation level of genomes.
In this study we focus on fuzzy tandems containing n occurrences
(n > 2) of a mismatched word with a period of T bases (T = 3–24)
without insertions or deletions. Tandem repeats are usually
classified into microsatellites (1–6 bp), minisatellites (6–24 bp,
and in some cases longer) (Vergnaud and Denoeud, 2000) and
‘classical’ satellites. The length scale of fuzzy repeats considered
here corresponds to the micro- and minisatellite repeat classes.
However, we do not consider periods with T = 1 or 2, as they
correspond to poly-A or TATA-like sequences, a different biological
object explored elsewhere (Katti et al., 2001; Schug et al., 1998;
Subramanian et al., 2003).
*To whom correspondence should be addressed.
Fuzzy tandem repeats (FTRs) have been found in regulatory
regions of eukaryotic genes (Shi et al., 2000); such tandems sometimes form cooperative arrays of binding sites and interact with
transcription factors (Gao and Finkelshtein, 1998; Ott and
Hansen, 1996; Meloni et al., 1998; Ramchandran et al., 2000).
However, it is still unclear (1) how to define and extract fuzzy
tandems, (2) whether functionally different sequences are enriched
in tandems of a specific structure and (3) what biological function
(if any) fuzzy tandems perform in the genome. If the genome distribution of FTRs is uneven, their exploration should help to
locate structural/functional sequence categories and to understand
the underlying mechanisms of their function.
The degree of FTR propagation varies from one genome to
another and from one functional sequence category to another; existing algorithms (Benson, 1999; Kolpakov et al., 2003) return up to
10–15% of the Drosophila melanogaster genome and >10% (Benson, 1999)
of the human genome as tandem repeats of various structures.
Accumulation of tandems in genomes is a result of errors during
replication and some rearrangement events (Dover, 1982; Singer
and Berg, 1991; Ellegren, 2004). From that perspective, much of the
repetitive genomic DNA might be considered as non-informative;
however, there are cases where the presence of tandems is tightly linked
to a biological function (Nakamura et al., 1998).
For instance, long tandem repeats constitute a large portion of
heterochromatin satellite DNA and are involved in centromere
formation and function (Martienssen, 2003); sometimes the presence
of long tandems even serves as a signal of extra centromere formation (Singer and Berg, 1991). Much less is known about the role of
shorter repetitive sequences, especially highly mismatched fuzzy
tandems (FTRs), which are quite abundant in exons, introns and transcription
regulatory sequences (Nakamura et al., 1998). In exons, FTRs may
reflect sequence periodicities existing in the protein sequence or even
structural features, such as hydrophobic helices (Katti et al., 2000;
Li et al., 2004); it is unclear if these tandems have any function at
the DNA level. In complex eukaryotic regulatory regions, such as
enhancers and silencers, FTRs appear to be linked with some types
of binding sites for transcription factors (Antoniewski et al., 1996;
Ott and Hansen, 1996; Ramchandran et al., 2000). One of the
attractive models suggests that an FTR with a unit consensus similar
to a binding site modulates the exact response to regulator concentration (Carroll et al., 2001; Davidson et al., 2000).
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Repeats of various types may also be important for regulation
that controls spatial packaging/dynamics of eukaryotic DNA. Thus,
8–16 bp repeats separated by a distance of <200 bases may characterize
Scaffold Attached Regions (Boulikas, 1995), and periodic signals
appear to play a role in nucleosome positioning (Ioshikhes et al.,
1999). Periodic signals are present in prokaryotic and eukaryotic
promoters, where they correspond to arrays of sites for DNA-binding proteins (Kutuzova et al., 1999; Kravatskaia et al., 2002;
Makeev et al., 2003).
Tandem structure may sometimes be important for genome
functioning: many human diseases are known to be caused by
an increase in the number of repeat copies (Verkerk et al., 1991;
Huntington’s Disease Collaborative Research Group, 1993; Fu
et al., 1992; Thibodeau et al., 1993; Wooster et al., 1994;
Villafranca et al., 2001; Niv et al., 2005). From a practical point
of view, variation in tandem structure serves in many important
applications, such as linkage analysis and DNA fingerprinting
(Edwards et al., 1992; Weber and May, 1989).
Here we conducted a functional analysis of tandems in Drosophila at a genome-wide level by (1) introducing probabilistic models for tandems with a high degree of fuzziness and (2) finding
tandem structures specific to certain functional sequence categories.
2 METHODS
2.1 Algorithm
Fuzzy tandems differ in the number of mismatched letters, the period T and the
number of repeated units n, also called the exponent. For instance, in the
tandem ATcgc | ATggc | ATtcc | ATcgg only two positions are identical in
all units; this level of divergence makes it difficult to detect such tandems
with the existing tools (Benson, 1999; Kolpakov et al., 2003).
Typically, periodic signals in biological sequences are found
with the help of autocorrelation analysis (Makeev and Tumanyan, 1996;
Chaley et al., 1999; Chechetkin and Lobzin, 1998) and/or periodic alignments (e.g. Benson, 1999). However, such algebraic methods per se usually
cannot select the best repeat from overlapping repeats with different periods.
In addition, in the case of fuzzy repeats the probability that a tandem repeat
appears by chance cannot be neglected. In this work, we amend a detection/
scoring algorithm with statistical criteria for tandem discrimination. At the
first step, candidate repeats are found using local autocorrelation analysis. At
the second step, the candidate repeats are filtered based on their statistical
weights. The filtering allows one to obtain a set of non-overlapping tandems
for any sequence. Non-overlapping tandems identified in a sequence comprise a coverage map that can be easily compared with genome annotations
and sequence feature maps.
Until now, there have been no algorithms providing solutions to all the problems addressed in this study. Some algorithms (Los Alamos National Laboratory, http://biosphere.lanl.gov/tandyman/cgi-bin/tandyman.cgi) detect only
perfect tandems, others return only tandems with predefined parameters
(Castelo et al., 2002; Sagot and Myers, 1998) or cannot resolve the problem
of overlapping periods (Benson, 1999; Kolpakov et al., 2003). We included
an option in our software that allows one to calculate the statistical significance of repeats found by other repeat finders, particularly TRF (Benson,
1999) and MREPS (Kolpakov et al., 2003).
Fig. 1. Identification of candidate repeats. The i-th element of the output
array (w in the text) contains the number of mismatches between three
sequence positions: i, i + T, i − T. The i-th element of the local sum array
(A in the text) contains the sum of T sequential elements of the output array
starting from position i. Small values of the local sum indicate tandem
positions (see the text for the details).
Identification of candidate repeats. In the first step of the algorithm we
search for candidate repeats for each period T from the range of interest.
The algorithm compares a seed word of length (period) T in sequence
position i with words of the same length in positions i − T and i + T. For each
letter of the seed word, the number of mismatches w, found from both
comparisons, is then recorded at the corresponding sequence position of
an output data array. If the symbols in all three words are identical the
score equals zero; if only two symbols are identical, the score equals 1; and if
all three symbols are different the score equals 2. An example of the output
array obtained after the described local autocorrelation procedure is shown in
Figure 1. The algorithm identifies putative tandem repeats by finding minima of the local sum A of scores w in the second pass,
\[ A_T[i] = \sum_{k=i}^{i+T-1} w_T[k]. \tag{1} \]
All positions with the local sum below threshold K are included into candidate tandem repeats for the selected period. Greater values of K correspond
to tandems with a higher degree of fuzziness. This procedure is repeated for
each period T from the input range. For each T, tandems are extracted for
different K, which runs from zero to (T − C), where C is a user-defined
parameter, ‘the significance level’, literally a number of maximal mismatches allowed.
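The extraction step can be illustrated with a minimal Python sketch. It is not the TandemSWAN implementation: the function names are hypothetical, the score follows the rule stated in the text (0, 1 or 2 depending on how many of the three symbols at i − T, i and i + T agree), and 'below threshold K' is read here as 'less than or equal to K', which is an assumption.

```python
def mismatch_scores(seq, T):
    """Score w_T[i] for each position that has neighbours at i - T and i + T.

    0: all three symbols agree; 1: exactly two agree; 2: all three differ.
    Positions lacking one of the neighbours are marked None.
    """
    w = [None] * len(seq)
    for i in range(T, len(seq) - T):
        distinct = len({seq[i - T], seq[i], seq[i + T]})
        w[i] = distinct - 1
    return w


def local_sums(w, T):
    """A_T[i] = w_T[i] + ... + w_T[i + T - 1], as in Equation (1)."""
    A = [None] * len(w)
    for i in range(len(w) - T + 1):
        window = w[i:i + T]
        if None not in window:
            A[i] = sum(window)
    return A


def candidate_positions(seq, T, K):
    """Positions whose local sum does not exceed K (assumed reading of 'below')."""
    A = local_sums(mismatch_scores(seq, T), T)
    return [i for i, a in enumerate(A) if a is not None and a <= K]
```

For the perfect tandem ATCATCATCATC with T = 3 and K = 0, every position that has a full window of scores is returned; for the fuzzy example ATcgc | ATggc | ATtcc | ATcgg with T = 5, the local sum at the start of the second unit is 3.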
Filtration of candidate tandem repeats. The extraction step may return a
collection of tandems that differ in phase, fuzziness and the number of
repeated units for the same DNA segment. However, genome-wide analysis
(i.e. map feature comparison) requires a non-overlapping set of tandems.
Therefore we filter extracted overlapping tandems (including those with
multiple periods, like 3 and 6) and select the most statistically significant
one. We propose two statistical models for calculation of FTR P-values, ‘the
MaSk’ and ‘the MotiF’. Corresponding P-values are denoted here as PS-value and PF-value; their calculation is based on ‘MaSk’ and ‘MotiF’ probabilities, PrS and PrF. ‘MaSk’ characterizes combinatorial properties of
tandem repeats such as the minimal number of identical symbols in the corresponding repeat position. For instance, the ‘MaSk’ for the tandem
TTC | TCC | TGG is (3,[3,1,2]), which means that the repeat has an exponent
equal to 3, and at the first position all three letters are identical, at the second
position it can be any letter and at the third position at least two letters must
be identical. In all cases the MaSk is considered regardless of the particular
letters in the sequence. The probability to obtain the ‘MaSk’ at a random
position, called the ‘MaSk’ probability, is equal to:
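\[ \mathrm{Pr}_S(n, [k_1, k_2, \ldots, k_T]) = \prod_{i=1}^{T} \; \sum_{\substack{n_A + n_C + n_G + n_T = n, \\ \exists\, n_a \ge k_i}} \binom{n}{n_A \; n_C \; n_G \; n_T} \, p_A^{n_A} p_C^{n_C} p_G^{n_G} p_T^{n_T}. \tag{2} \]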
Here, the summation is taken over all possible letter combinations that comply with the specific mask. In this formula n is the exponent of the candidate
repeat, T is the period, ki is the maximal number of identical symbols in
position i, 1 ≤ ki ≤ n, of the repeat. The parameters of the Bernoulli model,
symbol frequencies pA, . . . , pT, are evaluated from the entire sequence. For
instance, the MaSk probability calculated for the tandem TTC | TCC | TGG assuming pA = pC = pG = pT = 0.25 is equal to PrS(3, [3,1,2]) = 0.04.
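The worked example can be checked numerically with a short Python sketch of Equation (2). The helper names (mask_of, column_probability, mask_probability) are hypothetical and the brute-force enumeration over letter counts is only an illustration, not the way TandemSWAN evaluates the sum.

```python
from itertools import product
from math import factorial, prod


def mask_of(units):
    """Return (n, [k_1, ..., k_T]): for each column, the count of its most frequent letter."""
    n = len(units)
    ks = [max(col.count(c) for c in set(col)) for col in zip(*units)]
    return n, ks


def column_probability(n, k, freqs):
    """Probability that at least one letter occurs >= k times among n independent draws."""
    total = 0.0
    for counts in product(range(n + 1), repeat=len(freqs)):
        if sum(counts) != n or max(counts) < k:
            continue
        coeff = factorial(n)
        for c in counts:
            coeff //= factorial(c)  # multinomial coefficient n! / (n_A! n_C! n_G! n_T!)
        total += coeff * prod(p ** c for p, c in zip(freqs, counts))
    return total


def mask_probability(n, ks, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Equation (2): product over the T columns of the per-column probabilities."""
    return prod(column_probability(n, k, freqs) for k in ks)
```

With uniform letter frequencies, mask_of(["TTC", "TCC", "TGG"]) gives (3, [3, 1, 2]) and mask_probability(3, [3, 1, 2]) evaluates to 0.0625 × 1 × 0.625 = 0.0390625, which rounds to the 0.04 quoted in the text.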
The corresponding PS-value (the probability to obtain a tandem satisfying
the ‘MaSk’ in a random text of length N) is equal to
\[ P_S\text{-value} = 1 - \left[ 1 - \mathrm{Pr}_S(n, T, k_1, k_2, \ldots, k_T) \right]^{N - Tn + 1}, \tag{3} \]
where N is taken equal to the length of the entire sequence or the scanning
window length.
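Equation (3) translates directly into code. The sketch below is illustrative only: the function name ps_value is hypothetical, and the use of log1p/expm1 is a numerical-stability choice made here, not something stated in the paper.

```python
from math import expm1, log1p


def ps_value(pr_s, n, T, N):
    """Equation (3): 1 - (1 - Pr_S)^(N - T*n + 1) for a text of length N."""
    positions = N - T * n + 1  # number of places where a tandem of length T*n can start
    if positions <= 0 or pr_s <= 0.0:
        return 0.0
    if pr_s >= 1.0:
        return 1.0
    # 1 - (1 - p)^m computed as -expm1(m * log1p(-p)) to stay accurate for tiny p
    return -expm1(positions * log1p(-pr_s))
```

For the mask (3, [3,1,2]) with PrS = 0.0390625, a text of length 9 has a single possible start position, so the PS-value equals PrS itself; in a text of length 1000 the PS-value is practically 1, which shows why such a short, fuzzy tandem is not significant on its own.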
The ‘MotiF’ model is based on the concept of a motif. A motif H
represents the set of all words that comply with the IUPAC consensus [http://bioinformatics.org/sms/iupac.html] of the observed FTR. For example, the
consensus for ATC | ATG | TTG is ‘WTS’ and the motif is
{ATC,ATG,TTC,TTG}. Such a motif representation has advantages and
drawbacks; we discussed some of these issues in Kotelnikova et al.
(2005). The probability to find a motif H in the sequence is simply the
probability to find any word belonging to it:
\[ \mathrm{Pr}_F(H) = \sum_{v \in H} P(v). \tag{4} \]
Then the MotiF P-value, PF-value, is the probability to find at least n consecutive occurrences of motif H in a sequence of length N, given that H has
been already found once in the sequence:
\[ P_F\text{-value} = \frac{1 - \left[ 1 - \mathrm{Pr}_F^{\,n}(H) \left( 1 - \mathrm{Pr}_F(H) \right) \right]^{N - Tn + 1}}{1 - \left[ 1 - \mathrm{Pr}_F(H) \right]^{N - T + 1}}. \tag{5} \]
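The MotiF quantities in Equations (4) and (5) can be sketched in Python as follows. All function names are hypothetical; word probabilities P(v) are taken under the Bernoulli model used elsewhere in the text (product of letter frequencies), which is an assumption about how P(v) is evaluated.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(letters): code for code, letters in IUPAC.items()}


def consensus(units):
    """IUPAC consensus of equally long repeat units, column by column."""
    return "".join(CODE_FOR[frozenset(col)] for col in zip(*units))


def motif_words(cons):
    """All words that comply with the consensus (the motif H)."""
    return ["".join(w) for w in product(*(IUPAC[c] for c in cons))]


def motif_probability(cons, freqs):
    """Equation (4): sum of P(v) over the words v of the motif."""
    return sum(prod(freqs[b] for b in word) for word in motif_words(cons))


def pf_value(pr_f, n, T, N):
    """Equation (5); pr_f must be positive so that the denominator is non-zero."""
    numerator = 1.0 - (1.0 - pr_f ** n * (1.0 - pr_f)) ** (N - T * n + 1)
    denominator = 1.0 - (1.0 - pr_f) ** (N - T + 1)
    return numerator / denominator
```

For the example in the text, consensus(["ATC", "ATG", "TTG"]) returns "WTS", motif_words("WTS") lists the four words ATC, ATG, TTC and TTG, and with uniform letter frequencies the motif probability is 4/64 = 0.0625.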
P-values calculated using either model allow for unambiguous discrimination between, for instance, a longer, highly mismatched tandem and a shorter
one containing fewer, but better matching, units. As we pointed out earlier,
weighting also helps to eliminate overlapping tandems.
2.2 Implementation
The FTR extraction algorithm is implemented as a C++ package, TandemSWAN, available for online data processing and for download from the
following URL: http://bioinform.genetika.ru/projects/swan/www. TandemSWAN accepts input sequences in most available file formats; user-defined
parameters include minimal and maximal period lengths, ‘significance
level’, ‘the MaSk’ or ‘the MotiF’ statistical mode and ‘the penalty factor’
for sub-periods (see online help for the details on parameter settings). Memory requirements and running time depend on repeat abundance in the query
sequence and on parameter values; e.g. the running time for the 22 MB Drosophila
chrX amounted to 2.5 h. TandemSWAN includes utilities for calculation
of P-values for tandem repeats obtained by related programs, MREPS
(Kolpakov et al., 2003) and TRF (Benson, 1999).
2.3 Coverage of random sequence with FTRs identified by TandemSWAN agrees with theoretical prediction
Genome-wide exploration requires convenient FTR maps, which we refer
to here as ‘coverage maps’, i.e. the fraction of the sequence dataset positions (e.g.
percentage of total exon length) covered by FTRs with a specific structure. In
the case of sequences obtained under the Bernoulli model, this fraction can be
evaluated analytically for each particular FTR period. To test the performance of our algorithm, we compared this analytical value with the results of
FTR identification in simulated random sequences. Indeed, for each period T
and ‘significance level’ C the probability Q to find a candidate tandem repeat
in the first step of the algorithm starting from some position i ≥ T (Fig. 1) is
written as
\[ Q = P\left( \sum_{k=i}^{i+T-1} w_T[k] \le T - C \right). \tag{6} \]
Here wT[k], the elements of the array wT (see definitions in ‘Identification of
candidate repeats’ and Fig. 1), are considered as random variables whose
expectation E and variance V depend on the letter frequencies evaluated
from the D.melanogaster genome. According to the central limit approximation,
Equation (6) can be written as follows:
\[ Q \approx \Phi\left( \frac{T(1 - E) - C + 1}{\sqrt{VT}} \right), \tag{7} \]
where Φ(x) is the standard normal cumulative distribution function. To obtain the
fraction of the random sequence covered by tandem repeats found at the
second step of the algorithm, statistical weighting, one should take into
account possible overlaps of candidate tandem repeats in neighboring positions. Finally, the coverage can be approximated as 3TQN/(5T − 2), where
the T-dependent factor at Q reflects tandem repeat overlaps.
We generated several 1 Mb sequences with uniform letter frequencies and
with letter frequencies from the genome of D.melanogaster, identified FTRs
with different parameter settings and calculated coverage maps for periods in
the range 3–15 bases. Comparison between the observed and the calculated
coverage of FTRs (Fig. 2) demonstrates that the devised formula [Equation
(7)] accurately describes the distribution of the majority of FTRs present in the
random sequence. The agreement between theoretical and observed coverage values holds for the range of periods explored in this study and only
moderately depends on letter frequencies.
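The normal approximation for Q and the resulting coverage estimate can be sketched as follows. Several points are assumptions made for this illustration: the moments E and V are derived here from the three-symbol scoring rule (0, 1 or 2) under the Bernoulli model, the scores are treated as independent, the argument of Φ is read as (T(1 − E) − C + 1)/√(VT), and the coverage expression is read as 3TQN/(5T − 2). The function names are hypothetical.

```python
from math import erf, sqrt


def score_moments(freqs):
    """Mean E and variance V of the score w for three independent letters."""
    ps = list(freqs.values())
    p_same = sum(p ** 3 for p in ps)                # all three identical, w = 0
    p_two = 3 * sum(p ** 2 * (1 - p) for p in ps)   # exactly two identical, w = 1
    p_diff = 1.0 - p_same - p_two                   # all different, w = 2
    E = p_two + 2 * p_diff
    V = p_two + 4 * p_diff - E ** 2
    return E, V


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def candidate_probability(T, C, freqs):
    """Normal approximation of Q for period T and significance level C."""
    E, V = score_moments(freqs)
    return normal_cdf((T * (1 - E) - C + 1) / sqrt(V * T))


def expected_coverage(T, C, N, freqs):
    """Approximate number of covered positions, 3*T*Q*N / (5*T - 2)."""
    return 3 * T * candidate_probability(T, C, freqs) * N / (5 * T - 2)
```

With uniform letter frequencies this gives E = 21/16 and V = 87/256; Q then falls as the significance level C increases, as expected for a stricter threshold.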
3 RESULTS AND DISCUSSION
FTR density is uneven across the genome of Drosophila. Distribution of functional and other sequence features is known to be
unequal among eukaryotic chromosomes and between different
chromosome loci. Therefore, we decided to explore distribution
of FTRs, as detected by our algorithm across a sample eukaryotic
genome, D.melanogaster (Celniker et al., 2002). We focused our
attention on the genome of Drosophila because of its outstanding
level of annotation, availability of many related genomes and its
relatively small size (120 MB). We performed FTR extraction
with default parameter settings (see TandemSWAN help) using the
‘MaSk’ model for statistical weighting. In this test we explored map
coverage, calculated for non-overlapping 16 KB windows, without
discriminating tandem motifs and periods.
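Windowed map coverage of this kind can be computed along the following lines. This is a sketch, not the authors' code: the function name is hypothetical, tandems are assumed to be given as non-overlapping half-open intervals [start, end) within the sequence, and the default window of 16 000 positions is an assumed reading of '16 KB'.

```python
def window_coverage(intervals, seq_len, window=16_000):
    """Fraction of each non-overlapping window covered by [start, end) intervals."""
    n_windows = -(-seq_len // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in intervals:
        pos = start
        while pos < end:
            w = pos // window
            stop = min(end, (w + 1) * window)  # split intervals at window borders
            covered[w] += stop - pos
            pos = stop
    # the last window may be shorter than the others
    sizes = [min(window, seq_len - w * window) for w in range(n_windows)]
    return [c / s for c, s in zip(covered, sizes)]
```

For example, a single tandem spanning positions 5 to 14 in a sequence of length 30, scanned with windows of 10, covers half of the first window, half of the second and none of the third.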
We have found that FTR density (map coverage) along all chromosomes is highly inhomogeneous (Fig. 3a), and the central arm
regions have higher FTR density than the distal regions (Fig. 3b). In
addition, the average FTR density on X-chromosome is substantially higher than on autosomes (See ‘Relative FTR density across
genome’ below).
In order to assess possible roles of FTRs, we correlated FTR
distribution in the Drosophila genome with positions of coding
sequences and some other sequence features, such as local AT/
GC composition. We have found that gene-rich segments contain
fewer FTRs than intergenic regions, while AT-rich segments are
enriched in FTRs (Supplementary Fig. 1). The reasons for the fluctuation of FTR density in the genome can be different; therefore we
explored FTR structure in functionally distinct sequence datasets.
Fig. 2. Comparison of map coverage values obtained by TandemSWAN with theoretical expectation. The expected coverage was calculated according to Equation (7) in the
text. –*–, TandemSWAN, significance level 1; ––, theoretical, significance level 1; –*–, TandemSWAN, significance level 2; –D–theoretical, significance level
2; –·– TandemSWAN, significance level 3; ––, theoretical, significance level 3. A 1 Mb long Bernoulli random sequence with average genomic nucleotide
frequencies was simulated.
Fig. 3. FTR occurrence in different segments of the D.melanogaster genome. FTR density is higher on the X-chromosome (a) and in the centers of
chromosome arms (b).
FTRs with 3k periods prevail in coding regions. We investigated whether
FTRs of specific periods prevail in certain functional categories. We assembled several sequence datasets containing all exons,
3′-untranslated regions (3′-UTRs), 5′-UTRs, intergenic regions, intergenic heterochromatin (from the Drosophila Heterochromatin
Project, http://www.dhgp.org/; http://flybase.net/annot/dmel_het_release3.2b2.txt;
ftp://flybase.net/genomes/Drosophila_melanogaster/current_hetchr/fasta/) and a dataset containing 124
transcription regulatory regions, i.e. enhancers (or cis-regulatory
modules) (https://webfiles.berkeley.edu/dap5/public_html/index.html). Corresponding datasets were also constructed for the related
species D.pseudoobscura (Richards et al., 2005). In order to explore
FTRs within a functionally related group of genes we also generated
the corresponding datasets for a selection of 16 developmental gene
loci from D.melanogaster and D.pseudoobscura. To investigate the
prevalence of specific tandems in all these datasets we compared
total sequence coverage by FTRs with different periods (Fig. 4).
As expected, the most striking signals were detected in datasets
containing exons (Fig. 4a). We found that in the coding regions
FTRs with periods divisible by 3 prevail, whereas tandems
with periods not equal to 3k are suppressed (below random expectation). We also found that 3k periods in coding regions of the X-chromosome have a greater coverage than 3k-periodic FTRs found
in exons of autosomes. This suggests that FTR density even within
similar functional units may be linked with the physical map, i.e. a
particular place in the genome.
Surprisingly, we also detected a high presence of periods that are multiples
of 6 (but not the other 3k periods) in the non-coding sequences.
Apparently, there is a 6k background in the genome, not related to
periodicities caused by codon triplets [see ‘Possible source of 3k
(6k) background’ below].
FTR periods specific to sequence categories other than exons.
FTRs with periods 6 and 12 were found to be highly abundant
throughout all analyzed datasets, including transcription regulatory
regions, intergenic spacers, UTRs and even intergenic heterochromatin (Fig. 4b–d). At the same time, non-coding regions were also
found to be enriched in FTRs with periods other than 3k. Outside
exons we observed a 2- to 3-fold FTR excess over random expectation, which supports the ‘non-random’ origin of FTRs and the non-random character of genomic sequences in general.
In order to detect differences in the FTR structure among datasets
representing non-coding regions, we compared prevalence of all
periods, i.e. FTR profiles, as shown in Figure 4 (Table 1 and Supplementary Table 1). The correlation analysis has shown that
according to prevailing FTR periods, all 22 datasets can be subdivided into at least three groups of similarity, one corresponding to
coding regions, another to heterochromatin and the last one corresponding to intergenic regions, spacers and others.
Comparison of absolute levels of FTR presence in different datasets has shown that intergenic heterochromatin, in general, contains
fewer FTRs than euchromatin (Fig. 4b). Moreover, heterochromatic
regions displayed some excess of FTRs with periods equal to 3k.
In general, comparison of different sequence categories demonstrated that FTRs with all explored periods are overabundant in the
genome (with the exception of exons) and that FTRs with 6k periods for
some reason strongly prevail, even in non-coding DNA.
FTRs in enhancers are similar, but not identical, to those in
intergenic regions. Repetitive sequences in transcription regulatory
regions are of special interest. While periodic signals present
in exons (3k-periodic FTRs) can be explained by the genetic
triplet code (and by periodicities in protein sequences), in
regulatory regions, FTRs may well represent a background. To
investigate this problem, we removed the 6k background by normalizing FTR coverage values in functional datasets to the coverage
in genome fragments without any functional annotation (non-functional).
We focused on 124 annotated (experimentally validated) enhancer regions from D.melanogaster (https://webfiles.berkeley.edu/dap5/public_html/data_06/124_Dmel_Enc.fa). The vast majority
of these sequences are involved in regulation of developmental
genes. However, this group is neither functionally nor structurally
homogeneous. The enhancers have different lengths (0.3–3 kb) and
regulate genes transcribed at different developmental stages. To
achieve better representation we considered the entire dataset
(124 sequences, 181 690 bp) and two subsamples, so-called ‘AP’
(anterior–posterior, 72 sequences, 117 377 bp) and ‘DV’ (dorsoventral, 136 sequences, 114 354 bp) enhancers. Along with enhancers we also considered a separate dataset combining ‘spacers’
between enhancers and a dataset combining coding regions from
the same genome locations. The corresponding datasets were also
constructed for D.pseudoobscura ‘AP’ enhancers (sequences are
available from the website).
Analysis of the normalized FTR distributions in enhancer datasets
and in spacers (Fig. 4e) has shown some degree of enrichment in
FTRs with periods 7 and 8 in all datasets representing loci of
developmental genes. No major differences were found in FTR
distribution between the enhancers and their flanking regions or
‘spacers’. However, the overall FTR distribution was not identical
to that found in the other non-functional, intergenic regions of the
genome.
Along with the assessment of general FTR distributions in enhancers, we also investigated a possible relation between FTR motifs
and the binding motifs for transcription factors present in the enhancers. In some single cases (even-skipped stripe 2 enhancer) we
observed some similarity, but on the larger scale the correlation
was not found to be significant.
Fig. 4. Fraction covered by FTRs with different periods calculated for DNA with different functions from D.melanogaster and D.pseudoobscura. FTRs with
periods that are multiples of 3 are overrepresented in exons and FTRs with periods 6 and 12 in non-translated DNA. Note the comparative deficiency of period 9 FTRs in
intergenic DNA, especially in D.pseudoobscura. FTRs were identified by SWAN with significance level C = 2 without filtration by statistical significance.
For comparison, in all cases curve ‘–·–’ shows results for a 1 MB simulated Bernoulli random sequence with average genomic nucleotide frequencies.
(a) D.melanogaster exons: ‘––’, autosomes (26 719 758 bp); ‘–D–’, X-chromosome, (5 479 105 bp). (b) D.melanogaster euchromatin intergenic and heterochromatin DNA: ‘––’, autosome intergenic (47 527 681 bp); ‘–D–’, X-chromosome intergenic (11 598 950 bp); ‘–&–’, heterochromatin (7 089 934 bp).
(c) D.melanogaster UTRs: ‘–*–’, autosome 5′-UTRs (3 331 080 bp); ‘–*–’, X-chromosome 5′-UTRs (686 503 bp); ‘––’, autosome 3′-UTRs (5 549 366
bp); ‘–·–’, X-chromosome 3′-UTRs (1 254 227 bp); (d) D.melanogaster regulatory regions compared with autosome intergenic DNA: ‘––’ dorsal and twist
enhancers (114 354 bp); ‘–*–’, 124 enhancers (181 690 bp); ‘–D–’, autosome intergenic DNA (47 527 681 bp), the same as in panel (b); (e) D.melanogaster
regulatory regions normalized for its autosome intergenic DNA: ‘––’, 124 enhancers (181 690 bp); ‘–D–’, AP enhancers, (115 599 bp); ‘–*–’, AP spacers
(349 634 bp); (f) D.pseudoobscura intergenic DNA and CDS: ‘–&–’, autosome intergenic (49 347 738 bp); ‘–*–’, X-chromosome intergenic (30 400 371 bp);
‘––’, autosome exons (14 069 722 bp); ‘–D–’, X-chromosome exons (5 417 123 bp); (g) D.pseudoobscura and D.melanogaster CDS: ‘–&–’, D.pseudoobscura
autosomes (14 069 722 bp); ‘––’, D.melanogaster autosomes (26 719 758 bp); (h) D.pseudoobscura and D.melanogaster intergenic DNA: ‘–&–’,
D.pseudoobscura autosome (49 347 738 bp); ‘––’, D.melanogaster autosome (47 527 681 bp); (i) D.pseudoobscura and D.melanogaster AP enhancers:
‘–&–’, D.pseudoobscura (60 085 bp); ‘––’, D.melanogaster (115 599 bp).
It appears that FTRs in enhancers differ from FTRs in the
rest of the genome in their period and, perhaps, in motif composition; however, the insufficient number of annotated enhancer regions in the
genome and the presence of the 6k background complicate the analysis.
Possible source of 3k (6k) background. Our results show that in
Drosophila tandems with periods 6 and 12 are found in the entire
genome, whereas other 3k periods are restricted to protein coding
sequences.
The abundance of 6k periods might be explained, either
through the mechanisms of DNA replication and rearrangement,
or from possible structural DNA features. Existing experimental
data suggest that certain 3-periodic synthetic DNA sequences
have substantially different helix stability probably owing to mismatched alignments at equilibrium temperatures during melting
(Delcourt and Blake, 1991). So, the quasi-repeated structures
may be important for functional stability/flexibility of DNA
molecule.
Periods of 3k abundant in coding sequences are apparently related
to periodicity in protein sequences and the triplet nature of the genetic code.
Repetitive structures in the protein sequences, such as 3–10 helices or
hydrophobic alpha helices, may also cause periodicity at the level of
DNA (Katti et al., 2000).
The ‘coincidence’ that 6k periods also fall into the 3k periods
may have even deeper roots. Apparently, nucleic acids and their
replication appeared in a certain form before the triplet genetic
code (Lifson, 1997); so it may be that a 3–6k ‘matrix’ was present in
DNA long before the genetic code itself, and by itself served as
a source for the formation of what we know today as the triplet genetic
code.
Whether the origin of 3k/6k periodic FTRs is ‘mechanical’ (DNA stability/replication) or ‘historical’ (3k matrix), it is unlikely that this non-specifically distributed signal is connected with fine mechanisms
of genome functioning, such as regulation of gene expression.
However, this does not exclude the possibility that FTRs with
other periods or FTRs containing some specific motifs are involved
in some regulatory functions. Moreover, if the quasi-periodic
structure of the native DNA sequence is indeed important, it would
pose an additional constraint on all motifs in the sequence, including regulatory signals.
Role of FTRs with periods other than 3k. Analysis of FTRs with
periods different from 3k has shown that periods 7 and 8 are more
abundant in enhancers and in the surrounding regions without functional annotation
(Fig. 4e). Currently, it is not clear what function these FTRs
may perform in enhancers and whether their presence is related to
any function at all. For instance, we have found no correlation
between FTRs and recognition motifs for transcription factors present in the same enhancers. However, we considered only a limited
number of binding motifs (11), most of which are far from being
perfect. Improved enhancer annotations, a larger number of considered
motifs and better recognition of the functional motif matches
may shed more light on possible roles of FTRs in the regulation
of transcription.
As has been suggested earlier (Papatsenko et al., 2002), some
FTRs may play a role as cassettes, containing synergistically acting
tandems of binding sites and responding to certain threshold levels
of transcription factors. However, it is also possible that the presence of
specific repetitive sequences provides a certain spatial geometry to an
enhancer, required for correct assembly of regulatory protein complexes. Apparently, some FTRs may be involved in the maintenance
of chromatin structure and/or spatial DNA geometry even in a
broader context.
The role of tandem repeats in regulatory regions was discussed
recently by Sinha and Siggia (2005). The authors assessed high quality
tandem repeats found in enhancers of D.melanogaster and D.pseudoobscura using TRF and MREPS and demonstrated their low
conservation. They concluded that the tandem repeats carry a limited functional load. This agrees with our first finding that the
majority of FTRs found in enhancers have the same predominant
periods as non-annotated intergenic DNA. On the other hand, some
FTRs found in enhancer regions have specific properties and may be
involved in regulatory function.
Relative FTR density across the genome. While the qualitative composition of FTRs is surprisingly similar across the genome, their
density may vary substantially from one genomic location to
another and from one genome to another. This may or may not be
connected with the local gene density or even the presence of ‘gene
deserts’ (Ovcharenko et al., 2005).
We have found that the genome of D.pseudoobscura has a greater
DNA fraction covered by FTRs in all functional categories than
the genome of D.melanogaster (Fig. 4). In both genomes the X-chromosome has a higher FTR density than other chromosome
arms, and finally, distal locations of the chromosome arms have
lower FTR densities than more central loci (Fig. 3).
It is noteworthy that the Drosophila X-chromosome has a greater
number of short perfect repeats (Katti et al., 2001) and probably a
greater number of recent gene duplications as compared with
autosomes (Thornton and Long, 2002). This difference between
sex-related chromosomes and the rest of the genome was also reported
in humans, where LINE1 repeat elements cover one-third of the
human X-chromosome (Ross et al., 2005). However, in the case
of the Caenorhabditis elegans genome the reported repeat density in
the X-chromosome is lower (Achaz et al., 2001).
The difference in FTR fraction between the genomes of
D.melanogaster and D.pseudoobscura is probably related to a
higher compactness of the D.melanogaster genome. Finally, we
have found that FTRs are relatively less abundant in intergenic
heterochromatin. Perfect tandems with periodicity of 5 found in
pericentromeric heterochromatic regions (Sun et al., 2003) actually
cover a surprisingly low fraction of the total heterochromatic DNA.
Problems of FTR exploration and focus of the TandemSWAN
algorithm. Eukaryotic genomes are overwhelmed by repetitive
sequences that probably carry no biological function and are caused,
for instance, simply by peculiarities of DNA replication (Ellegren,
2004). This non-functional noise needs to be filtered, which in part
can be done by exploring repeat parameters in different sequence
categories and/or by normalizing to the noise in the non-functional
regions. However, even the background signals may serve to maintain overall DNA survivability and might contain signals
required, for instance, for chromatin packaging.
Here we explored fuzzy tandems since they are largely out of the
focus of the regular tandem-finding programs (Castelo et al., 2002;
Sagot and Myers, 1998; Benson, 1999; Kolpakov et al., 2003).
Specifics of our algorithm can be illustrated by comparing TandemSWAN with the two most popular repeat finders, TRF (Benson,
1999) and MREPS (Kolpakov et al., 2003) (Fig. 5 and Supplementary Table 2). In the comparison with TRF and MREPS, we tried to
obtain sequence sets that were as similar as possible to those
obtained with TandemSWAN with parameters characteristic of our study.
Table 1. Correlation between FTR profiles for D.melanogaster datasets

Dataset          1      2      3      4      5      6      7      8
1  3′-UTR      1.00   0.91   0.42   0.78   0.91   0.96   0.92   0.51
2  5′-UTR      0.91   1.00   0.67   0.89   0.87   0.95   0.91   0.62
3  CDS         0.42   0.67   1.00   0.76   0.29   0.42   0.37   0.32
4  Hetero      0.78   0.89   0.76   1.00   0.69   0.81   0.71   0.66
5  124 ENH     0.91   0.87   0.29   0.69   1.00   0.96   0.93   0.58
6  Inter       0.96   0.95   0.42   0.81   0.96   1.00   0.97   0.67
7  Spacer      0.92   0.91   0.37   0.71   0.93   0.97   1.00   0.57
8  Random      0.51   0.62   0.32   0.66   0.58   0.67   0.57   1.00

(1) 3′-UTR, autosomes; (2) 5′-UTR, autosomes; (3) CDS, autosomes; (4) intergenic heterochromatin; (5) 124 enhancers; (6) intergenic euchromatin; (7) spacers in AP loci; (8) random
sequence. Color-code: (white) r = 0–0.6; (light grey) 0.6–0.9; (medium grey) 0.9–0.95; (dark grey) 0.95–1.
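The pairwise comparison summarized in Table 1 can be sketched as follows. This is a minimal illustration only: it assumes that an FTR profile is a vector of FTR abundances indexed by period length and that r is the Pearson correlation coefficient; neither detail is specified in this section. The function names and the toy profiles are hypothetical and are not taken from the TandemSWAN software or from the data of this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Return a nested dict with r for every pair of named profiles."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names}
            for a in names}


# Hypothetical toy profiles (invented numbers, NOT data from the paper):
# relative FTR abundance for periods 3, 4, 5, 6, 7, 8.
toy_profiles = {
    "CDS": [0.45, 0.05, 0.05, 0.30, 0.05, 0.10],
    "Inter": [0.10, 0.10, 0.15, 0.35, 0.15, 0.15],
    "ENH": [0.08, 0.10, 0.15, 0.32, 0.18, 0.17],
}

if __name__ == "__main__":
    matrix = correlation_matrix(toy_profiles)
    for name, row in matrix.items():
        print(name, " ".join(f"{row[other]:.2f}" for other in matrix))
```

The resulting matrix is symmetric with a unit diagonal, as in Table 1, so only one triangle carries information.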
Fig. 5. Similarity between tandem sets identified by different repeat finders.
Fractions of a 50 kb fragment of D.melanogaster chr2L sequence covered
with tandem repeats identified by different algorithms with the following parameters: TandemSWAN, minimal period 3, maximal period 15, significance
level 2, filtration with (a) PS-probability < 10⁻⁵ and (b) PS-probability < 10⁻³,
‘MaSk’ statistical mode; TRF (Benson, 1999), minimal period 3, maximal
period 15, match 2, mismatch 2, indel 15, pmatch 80, pindel 0, minscore 20;
MREPS (Kolpakov et al., 2003), err 15, maxperiod 15, minperiod 3.
Actually, both TRF and MREPS are usually used to search for more
precise repeats than those in Figure 5, and at least TRF was operating
near the limit of repeat fuzziness allowed by its internet-based
version. All three tools perform similarly in the case of perfect
repeats. However, the results of TandemSWAN and TRF/MREPS
become quite different in the case of FTR extraction.
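A comparison of this kind, in which the fraction of a fragment covered by repeats is reported for each tool, can be sketched as below. The sketch assumes that each tool's output has already been converted into half-open coordinate intervals [start, end); the parsing step, the function names and the example coordinates are hypothetical and do not reproduce the output format of TandemSWAN, TRF or MREPS.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def coverage_fraction(intervals, seq_len):
    """Fraction of a sequence of length seq_len covered by the intervals."""
    return covered_length(intervals) / seq_len


def shared_length(set_a, set_b):
    """Number of bases covered by both interval sets."""
    a, b = merge_intervals(set_a), merge_intervals(set_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    # Hypothetical repeat coordinates on a 100 bp toy sequence.
    tool_a = [(0, 10), (5, 20), (30, 40)]
    tool_b = [(15, 35)]
    print(coverage_fraction(tool_a, 100))  # fraction covered by tool A
    print(coverage_fraction(tool_b, 100))  # fraction covered by tool B
    print(shared_length(tool_a, tool_b))   # bases reported by both tools
```

Merging intervals before summing avoids counting a base twice when one tool reports overlapping repeats with different periods.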
4 CONCLUSIONS
In this work we have formulated a working definition of FTRs
based on their statistical properties, developed an extraction
algorithm and compared properties of tandems found in sequences
carrying different functions. We developed statistics that allow
calculation of P-values for FTR occurrence in Bernoulli-type
random sequences, which can be useful for other algorithms.
This statistical approach is implemented in the TandemSWAN
program, which aims to identify FTRs with a broad spectrum of listed
parameters.
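The idea of a P-value under a Bernoulli-type model can be illustrated with a simple simulation. The sketch below is not the analytical formula derived in this work and is not the TandemSWAN statistic: it substitutes a Monte Carlo estimate, and the periodicity score used (the number of positions i at which the letter equals the letter at position i + period) is an illustrative assumption. All function names and parameter values are hypothetical.

```python
import random


def lag_matches(seq, period):
    """Count positions i with seq[i] == seq[i + period] (illustrative score)."""
    return sum(1 for i in range(len(seq) - period) if seq[i] == seq[i + period])


def empirical_p_value(seq, period, base_probs, n_sim=999, seed=0):
    """Monte Carlo P-value of the lag-match score under a Bernoulli model.

    Random sequences of the same length are drawn with independent letters
    distributed according to base_probs. The estimate uses the usual add-one
    correction, so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = lag_matches(seq, period)
    bases = list(base_probs)
    weights = [base_probs[b] for b in bases]
    hits = 0
    for _ in range(n_sim):
        simulated = rng.choices(bases, weights=weights, k=len(seq))
        if lag_matches(simulated, period) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    perfect_repeat = "ACG" * 20  # toy perfect tandem with period 3
    print(empirical_p_value(perfect_repeat, 3, uniform))
    print(empirical_p_value(perfect_repeat, 2, uniform))
```

A simulation of this kind cannot resolve P-values below 1/(n_sim + 1), which is one reason analytical formulas are preferable for genome-scale scans.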
Using this approach, we identified short FTRs with periods 3–24
bases in the D.melanogaster and D.pseudoobscura genomes and
compared FTR structure and occurrence in coding and non-coding
regions, heterochromatic regions and regulatory (enhancer)
sequences. We have found that different types of tandems are
abundant in different functional sequence categories, with each
category having its own pattern of preferred period lengths. Tandems with period 3 and its multiples were found to be characteristic of coding regions. FTRs with 6k periods are characteristic
of all non-coding DNA. FTRs with periods equal to 5, 6, 7 and 11,
12, 14 were enriched in loci of developmental genes and developmental enhancers. On average, the regulatory modules have no fewer
FTRs than nearby spacers; furthermore, FTRs with periods 7 and 8
are found more often in Drosophila cis-regulatory modules than in
other non-coding DNA. Obviously, both the evolution and the DNA
structure of regulatory modules are subject to many additional parameters, such as DNA melting and adsorption of protein regulatory
factors. Thus, it is possible that FTRs found in cis-regulatory modules have some particular sequence structure facilitating their function and should be studied in greater detail. To understand the role of
the 6 bp-related omnipresent repeats, it is necessary first to test whether
they are present in different bacterial, animal and plant taxa, and not
only in Drosophila species. This work is currently in progress.
ACKNOWLEDGEMENTS
The authors are thankful to M. Borodovsky, A. P. Lifanov,
N. A. Oparina, N. G. Esipova, M. Lassig, A. V. Favorov,
V. E. Ramensky, M. G. Gelfand and A. A. Mironov for valuable
discussion. They also thank R. Zinzen for careful manuscript reading
and suggested changes. This study has been supported by the French
Program EcoNet-08159PG, INTAS grant 04-83-3994, Russian State
Contract No 02.434.11008, RFBR grant 04-04-49601, Fogerthy
RO3 TW005899-01A1 program, Russian Academy of Science
Presidium Program in Molecular and Cellular Biology, project
#10 and Ludwig Institute of Cancer Research Grant CRDF GAP
RBO-1268.
Conflict of Interest: none declared.
REFERENCES
Antoniewski,C. et al. (1996) Direct repeats bind the EcR/USP receptor and mediate
ecdysteroid responses in Drosophila melanogaster. Mol. Cell. Biol., 16,
2977–2986.
Achaz,G. et al. (2001) Study of intrachromosomal duplications among the eukaryote
genomes. Mol. Biol. Evol., 18, 2280–2288.
Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences.
Nucleic Acids Res., 27, 573–580.
Benson,G. and Waterman,M. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828–4836.
Boulikas,T. (1995) Chromatin domains and prediction of MAR sequences. Int. Rev.
Cytol., 162A, 279–388.
Carroll,S.B., Grenier,J.K. and Weatherbee,S.D. (2001) From DNA to Diversity.
Molecular Genetics and the Evolution of Animal Design. Blackwell Science,
Malden, MA, ISBN 0-632-04511-6.
Castelo,A.T. et al. (2002) TROLL—tandem repeat occurrence locator. Bioinformatics,
18, 634–636.
Celniker,S. et al. (2002) Finishing a whole genome shotgun: release 3 of the
Drosophila melanogaster euchromatic genome sequence. Genome Biol., 3,
RESEARCH0079.
Chaley,M.B. et al. (1999) Method revealing latent periodicity of the nucleotide
sequences modified for a case of small samples. DNA Res., 6, 153–163.
Chechetkin,V.R. and Lobzin,V.V. (1998) Nucleosome units and hidden periodicities in
DNA sequences. J. Biomol. Struct. Dyn., 15, 937–947.
Delcourt,S.G. and Blake,R.D. (1991) Stacking energies in DNA. J. Biol. Chem., 266,
15160–15169.
Davidson,H. et al. (2000) Genomic sequence analysis of Fugu rubripes CFTR and
flanking genes in a 60 kb region conserving synteny with 800 kb of human chromosome 7. Genome Res., 10, 1194–1203.
Dover,G.A. (1982) Molecular drive, a cohesive model of species evolution. Nature,
299, 111–117.
D.melanogaster heterochromatin genome data from Drosophila Heterochromatin
Genome Project.
Edwards,A. et al. (1992) Genetic variation at five trimeric and tetrameric tandem repeat
loci in four human population groups. Genomics, 12, 241–253.
Ellegren,H. (2004) Microsatellites: simple sequences with complex evolution. Nat.
Rev. Genet., 5, 435–445.
Fu,Y.-H. et al. (1992) An unstable triplet repeat in a gene related to myotonic muscular
dystrophy. Science, 255, 1256–1258.
Gao,Q. and Finkelstein,R. (1998) Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen. Development,
125, 4185–4193.
Huntington’s Disease Collaborative Research Group (1993) A novel gene containing a
trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell, 72, 971–983.
Ioshikhes,I. et al. (1999) Periodical distribution of transcription factor sites in promoter
regions and connection with chromatin structure. Proc. Natl Acad. Sci. USA, 96,
2891–2895.
Karlin,S. et al. (1988) Efficient algorithms for molecular sequence analysis. Proc. Natl
Acad. Sci. USA, 85, 841–845.
Katti,M.V. et al. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol., 18, 1161–1167.
Katti,M.V. et al. (2000) Amino acid repeat patterns in protein sequences: their diversity
and structural-functional implications. Protein Sci., 9, 1203–1209.
Kolpakov,R. et al. (2003) mreps: Efficient and flexible detection of tandem repeats in
DNA. Nucleic Acids Res., 31, 3672–3678.
Kotelnikova,E.A. et al. (2005) Evolution of transcription factor DNA binding sites.
Gene, 347, 255–263.
Kravatskaia,G.I. et al. (2002) Similarities in periodical structures in the position of
nucleotides in regions of initiation of replication of bacterial genomes. Biofizika,
47, 595–599.
Kutuzova,G.I. et al. (1999) Periodicity in contacts of RNA-polymerase with promoters.
Biofizika, 44, 216–223.
Landau,G.M. et al. (2001) An algorithm for approximate tandem repeats. J. Comput.
Biol., 8, 1–18.
Li,L. et al. (2004) Pseudo-periodic partitions of biological sequences. Bioinformatics,
20, 295–306.
Li,Y.C. et al. (2002) Microsatellites: genomic distribution, putative functions and
mutational mechanisms: a review. Mol. Ecol., 11, 2453–2465.
Lifson,S. (1997) On the crucial stages in the origin of animate matter. J. Mol. Evol., 44,
1–8.
Makeev,V.Ju. and Tumanyan,V.G. (1996) Search of periodicities in primary
structure of biopolymers: a general Fourier approach. Comput. Appl. Biosci.,
12, 49–54.
Makeev,V.J. et al. (2003) Distance preferences in the arrangement of binding motifs
and hierarchical levels in organization of transcription regulatory information.
Nucleic Acids Res., 31, 6016–6026.
Martienssen,R.A. (2003) Maintenance of heterochromatin by RNA interference of
tandem repeats. Nat. Genet., 35, 213–214.
Meloni,R. et al. (1998) A tetranucleotide polymorphic microsatellite, located in the first
intron of the tyrosine hydroxylase gene, acts as a transcription regulatory element in
vitro. Hum. Mol. Genet., 7, 423–428.
Nakamura,Y. et al. (1998) VNTR (variable number of tandem repeat) sequences
as transcriptional, translational, or functional regulators. J. Hum. Genet., 43,
149–152.
Niv,E. et al. (2005) Microsatellite instability in patients with chronic B-cell lymphocytic leukaemia. Br. J. Cancer., 92, 1517–1523.
Ovcharenko,I. et al. (2005) Evolution and functional classification of vertebrate gene
deserts. Genome Res., 15, 137–145.
Ott,R.W. and Hansen,L.K. (1996) Repeated sequences from the Arabidopsis thaliana
genome function as enhancers in transgenic tobacco. Mol. Gen. Genet., 252,
563–571.
Papatsenko,D.A. et al. (2002) Extraction of functional binding sites from unique
regulatory regions: the Drosophila early developmental enhancers. Genome
Res., 12, 470–481.
Ramchandran,R. et al. (2000) A (GATA)(7) motif located in the 5′ boundary area of the
human beta-globin locus control region exhibits silencer activity in erythroid cells.
Am. J. Hematol., 65, 14–24.
Richards,S. et al. (2005) Comparative genome sequencing of Drosophila
pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res.,
15, 1–18.
Ross,M.T. et al. (2005) The DNA sequence of the human X chromosome. Nature, 434,
325–337.
Sagot,M. and Myers,E. (1998) Identifying satellites in nucleic acid sequences.
In Istrail,S., Pevzner,T. and Waterman,M. (eds), Proceedings of the Second Annual
International Conference on Computational Molecular Biology. ACM Press, NY,
pp. 234–242.
Schug,M.D. et al. (1998) The distribution and frequency of microsatellite loci in
Drosophila melanogaster. Mol. Ecol., 7, 57–70.
Shi,X.M. et al. (2000) Tandem repeat of C/EBP binding sites mediates PPARgamma2
gene transcription in glucocorticoid-induced adipocyte differentiation. J. Cell
Biochem., 76, 518–527.
Singer,M. and Berg,T. (1991) Genes and Genomes. University Science Books, Mill
Valley, California.
Sinha,S. and Siggia,E.D. (2005) Sequence turnover and tandem repeats in cis-regulatory modules in drosophila. Mol. Biol. Evol., 22, 874–885.
Subramanian,S. et al. (2003) Genome-wide analysis of microsatellite repeats in
humans: their abundance and density in specific genomic regions. Genome
Biol., 4, R13.
Sun,X. et al. (2003) Sequence analysis of a functional Drosophila centromere. Genome
Res., 13, 182–194.
Thibodeau,S.N. et al. (1993) Microsatellite instability in cancer of the proximal colon.
Science, 260, 816–819.
Thornton,K. and Long,M. (2002) Rapid divergence of gene duplicates on the
Drosophila melanogaster X chromosome. Mol. Biol. Evol., 19, 918–925.
Verkerk,A. et al. (1991) Identification of a gene (FMR-1) containing a CGG repeat
coincident with a breakpoint cluster region exhibiting length variation in fragile X
syndrome. Cell, 65, 905–914.
Vergnaud,G. and Denoeud,F. (2000) Minisatellites: mutability and genome architecture. Genome Res., 10, 899–907.
Villafranca,E. et al. (2001) Polymorphisms of the repeated sequences in the enhancer
region of the thymidylate synthase gene promoter may predict downstaging after
preoperative chemoradiation in rectal cancer. J. Clin. Oncol., 19, 1779–1786.
Weber,J.L. and May,P.E. (1989) Abundant class of human DNA polymorphisms which
can be typed using the polymerase chain reaction. Am. J. Hum. Genet., 44,
388–396.
Wooster,R. et al. (1994) Instability of short tandem repeats (microsatellites) in human
cancers. Nat. Genet., 6, 152–156.