Sequence analysis by additive scales: DNA structure for sequences

Vol. 16 no. 10 2000
Pages 865–889
BIOINFORMATICS
Sequence analysis by additive scales: DNA
structure for sequences and repeats of all lengths
Pierre Baldi 1,2,∗ and Pierre-François Baisnée 1
1 Department of Information and Computer Science and 2 Department of Biological
Chemistry, College of Medicine, University of California, Irvine, CA 92697-3425, USA
Received on April 24, 2000; accepted on May 25, 2000
Abstract
Motivation: DNA structure plays an important role in
a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various
aspects of DNA structure including base stacking energy,
propeller twist angle, protein deformability, bendability,
and position preference. Yet, a general framework for the
computational analysis and prediction of DNA structure
is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences
with extremal properties; (2) quantitative evaluation of
sequences with respect to a given genomic background;
(3) automatic extraction of extremal sequences and
profiles from genomic databases; (4) distribution and
asymptotic behavior as the length N of the sequences
increases; and (5) complete analysis of correlations
between scales.
Results: We develop a general framework for sequence
analysis based on additive scales, structural or other,
that addresses all these issues. We show how to construct
extremal sequences and calibrate scores for automatic
genomic and database extraction. We show that distributions rapidly converge to normality as N increases.
Pairwise correlations between scales depend both on
background distribution and sequence length and rapidly
converge to an analytically predictable asymptotic value.
For di- and tri-nucleotide scales, normal behavior and
asymptotic correlation values are attained over a characteristic window length of about 10–15 bp. With a uniform
background distribution, pairwise correlations between
empirically-derived scales remain relatively small and
roughly constant at all lengths, except for propeller twist
and protein deformability which are positively correlated.
There is a positive (resp. negative) correlation between
dinucleotide base stacking (resp. propeller twist and
protein deformability) and AT-content that increases in
magnitude with length. The framework is applied to the
analysis of various DNA tandem repeats. We derive exact
expressions for counting the number of repeat unit classes
at all lengths. Tandem repeats are likely to result from
a variety of different mechanisms, a fraction of which
is likely to depend on profiles characterized by extreme
structural features.
Contact: [email protected]; [email protected]
∗ To whom all correspondence should be addressed.
© Oxford University Press 2000
Introduction
Evidence is mounting that DNA structural properties
beyond the double helical pattern play an important
role in a number of fundamental biological processes,
both under healthy and pathological conditions. This
is not too surprising if one realizes that meters of
DNA must be compacted into a nucleus that is only
a few microns in diameter while, at the same time,
preserving the ability to turn thousands of genes on
and off in a precisely orchestrated fashion. The three-dimensional structure of DNA, as well as its organization
into chromatin fibers, seems to be essential to its functions
and has been implicated in diverse phenomena ranging
from protein binding sites, to gene regulation, to triplet
repeat expansion diseases. The goal of this work is to
develop computational methods for the structural analysis
of DNA sequences. While DNA structure is our primary
motivation and area of application, the framework we
develop is completely general and applies to sequences
over any alphabet, including codon, RNA, and protein
alphabets, whenever local additive scales, as defined
below, are available.
DNA structure
DNA structure has been found to depend on the exact
sequence of nucleotides, an effect that seems to be
caused largely by interactions between neighboring base
pairs (Ornstein et al., 1978; Satchwell et al., 1986;
Breslauer et al., 1986; Calladine et al., 1988; Goodsell
and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995;
Hassan and Calladine, 1996; Hunter, 1996; Ponomarenko
et al., 1999; Fye and Benham, 1999). This means that
different sequences can have different intrinsic structures,
or different propensities for forming particular structures.
Periodic repetitions of bent DNA in phase with the
helical pitch, for instance, will cause DNA to assume a
macroscopically curved structure. Flexible or intrinsically
curved DNA is energetically more favorable to wrap
around histones than rigid and unbent DNA, and this has
been shown to influence nucleosome positioning (Drew
and Travers, 1985; Satchwell et al., 1986; Simpson, 1991;
Lu et al., 1994; Wolffe and Drew, 1995; Baldi et al.,
1996; Zhu and Thiele, 1996; Liu and Stein, 1997). In
addition, the chromatin complex structure of DNA and
the positioning of nucleosomes along the genome have
been found to play an important (generally inhibitory)
role in the regulation of gene transcription (Pazin and
Kadonaga, 1997; Tsukiyama and Wu, 1997; Werner and
Burley, 1997; Pedersen et al., 1998). Sequence-dependent
DNA structure is often important for DNA binding
proteins, such as TBP (TATA-binding-protein) (Parvin et
al., 1995; Starr et al., 1995; Grove et al., 1996) and gene
regulation (Sheridan et al., 1998). While the number of
resolved structures of DNA–protein complexes continues
to grow in the PDB database, the field of computational
DNA structural analysis is clearly far behind its protein
cousin and completely lacks any degree of systemicity.
Most likely, most DNA structural signals remain to be
uncovered.
DNA structural scales
Based on many different empirical measurements or theoretical approaches, several models have been constructed
that relate the nucleotide sequence to DNA flexibility
and curvature (Ornstein et al., 1978; Satchwell et al.,
1986; Goodsell and Dickerson, 1994; Sinden, 1994;
Brukner et al., 1995; Hassan and Calladine, 1996; Hunter,
1996; Baldi et al., 1998; Ponomarenko et al., 1999).
These models are typically in the form of dinucleotide
or trinucleotide scales that assign a particular value to
each di- or tri-nucleotide and its reverse complement. A
non-exhaustive list of such scales includes:
(1) The dinucleotide base stacking energy (BS) scale
(Ornstein et al., 1978) expressed in kilocalories per
mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.
(2) The dinucleotide propeller twist angle (PT) scale
(Hassan and Calladine, 1996) measured in degrees.
This scale is based on X-ray crystallography of
DNA oligomers. Dinucleotides with a large negative
propeller-twist angle tend to be more rigid than
dinucleotides with low negative propeller-twist
angle.
(3) The dinucleotide protein deformability (PD) scale
(Olson et al., 1998) derived from empirical energy
functions extracted from the fluctuations and correlations of structural parameters in DNA–protein
crystal complexes. Dinucleotides with large PD values tend to be more flexible.
(4) The trinucleotide bendability (B) model (Brukner
et al., 1995) based on DNase I cutting frequencies.
The enzyme DNase I preferentially binds (to the minor
groove) and cuts DNA that is bent, or bendable,
towards the major groove (Lahm and Suck, 1991;
Suck, 1994). Thus DNase I cutting frequencies
on naked DNA can be interpreted as a quantitative measure of major groove compressibility or
anisotropic bendability. These frequencies allow
for the derivation of bendability parameters for
the 32 complementary trinucleotide pairs. Large B
values correspond to flexibility.
(5) The trinucleotide position preference (PP) scale
derived from experimental investigations of the
positioning of DNA in nucleosomes. It has been
found that certain trinucleotides have a strong
preference for being positioned in phase with the
helical repeat. Depending on the exact rotational
position, such triplets will have minor grooves
facing either towards or away from the nucleosome
core (Satchwell et al., 1986). Based on the premise
that flexible sequences can occupy any rotational
position on nucleosomal DNA, these preference
values can be used as a triplet scale that measures
DNA flexibility. Hence, in this model, all triplets
with close to zero preference are assumed to be
flexible, while triplets with preference for facing
either in or out are taken to be more rigid and have
larger PP values. Note that we do not use this scale
as a measure of how well different triplets form
nucleosomal DNA. Instead, the absolute value, or
unsigned nucleosome positioning preference, is
used here, as in Pedersen et al. (1998), as a measure
of DNA flexibility.
For completeness, all these scales are displayed in
the appendix.
In previous studies, we found these models useful (Baldi
et al., 1996; Pedersen et al., 2000; Baldi et al., 1999),
in particular for the detection of putative new structural
signatures associated with an increase of bendability in
downstream regions of RNA polymerase II promoters.
A similar approach (Liao et al., 2000) was used to
analyze the structure of insertion sites for P transposable
elements in Drosophila melanogaster and suggest that
the corresponding transposition mechanism recognizes a
structural signature rather than a specific sequence motif.
With the exception of BS, all the models were determined by purely experimental observations of sequence-structure correlations. Additional scales capturing DNA
properties related directly or indirectly to structure, such
as enthalpy, or melting temperature, have also been
proposed (Breslauer et al., 1986; Ponomarenko et al.,
1999). The primary focus of this work is not on assessing
the merits and pitfalls of each model, but rather on the
development of general methods for the systematic
application of any scale to any sequence of any length, up
to entire genomes, under the assumption that the scale can
be used additively within a sliding window. In general,
this assumption will provide a reasonable approximation,
at least up to a certain length to be determined experimentally. In particular, we are interested in the development
of methods for the automatic recognition of structural
motifs associated with extremal features, such as extreme
stiffness or bendability. The calibration of corresponding
thresholds is expected to be useful in database searches
and is conceptually similar, for instance, to the calibration
of thresholds for detecting sequence homology. More generally, however, database searches may also be conducted
on the basis of structural signatures or profiles that need
not be extremal and could be obtained from reasonable
training sets. Certain protein binding sites, for instance,
are highly degenerate at the DNA sequence level, with
low sequence homology, while exhibiting at the same
time a high degree of DNA structural similarity. Similarly,
periodic flexible triplets in phase with the double helical
pitch are necessary to ensure long range curvature, for
instance in nucleosome regions.
Although several scales may agree on some structural
features, the fact remains that they may also display divergent interpretations of some sequence elements. While no
final consensus regarding these models exists, it is likely
that each one provides a slightly different and partially
complementary view of DNA structure. Thus a second
goal of this work is the comparison of the models in the
limited sense of estimating the statistical correlation between different scales. In Baldi et al. (1998) it was shown
that by and large many of the commonly used scales exhibit low correlations measured at the level of single di- or tri-nucleotides. Empirical measurements of correlations
between the scales over longer lengths in Escherichia
coli have recently revealed different unexplained patterns
(Pedersen et al., 2000). Here we provide a complete explanation of these phenomena and show how correlations vary
with background distribution and with window length. Finally, while the methods introduced can be applied to any
DNA sequence, we focus here on a particularly important
class of DNA sequences, namely DNA tandem repeats,
where the general framework is further specialized.
DNA repeats
Genomes, especially eukaryotic genomes, are replete with
DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997;
Jurka, 1998). Well over 30% of the human genome
has been estimated to comprise repetitive DNA of some
sort (Benson and Waterman, 1994), the exact function of
which is often unknown. Such DNA arises through many
different evolutionary and genetic mechanisms. Over
950 different classes (Jurka, 1998) of repeats have been
catalogued. Two major groups of repeats exist: interspersed
repeats, and tandem repeats. While the methods to be
developed can be applied to both groups, our analysis
will focus on tandem repeats, consisting of two or more
contiguous copies of a particular pattern of nucleotides.
Tandem repeats may cover up to 10% of the human
genome.
Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern
and the number of more or less exact contiguous copies.
Repeats are often polymorphic and therefore play a major
role in linkage studies and DNA fingerprinting. In many
cases, the genetic origin, the structure, and the function of
these repetitive regions are poorly understood. There exist
a few examples, however, where the repeats are known to
play a biological role in both healthy and pathological conditions. Certain tandem repeats, for instance, have been
associated with protein binding sites or interactions with
transcription factors. An important advance in epigenetics
research has been the realization that interactions between
repeated DNA sequences can trigger the formation and the
transmission of inactive genetic states and DNA modifications (Wolffe and Matzke, 1999). In several of these cases,
the particular DNA-helical structural features of the repeat
sequences seem to play an essential role.
Interest in tandem repeats has been heightened over the
last few years by the discovery that several important
degenerative disorders including Huntington’s disease,
myotonic dystrophy, fragile X syndrome, and several
forms of ataxia, result from the abnormal expansion
of particular DNA triplets (The Huntington’s Disease
Collaborative Research Group, 1993; Ashley and Warren,
1995; Ross, 1995; Gusella and MacDonald, 1996; Hardy
and Gwinn-Hardy, 1998; Rubinsztein and Hayden, 1998;
Baldi et al., 1999). The exact mechanism by which a
triplet repeat mutation causes disease varies as indicated
by the fact that currently known repeat expansions are
found in 5′ UTRs, in 3′ UTRs, in introns, and within
coding sequences of various affected genes (Ashley and
Warren, 1995; Gusella and MacDonald, 1996; Rubinsztein
and Amos, 1998; Rubinsztein and Hayden, 1998). For
instance, fragile X mental retardation is associated with
an expanded CGG repeat in the 5′ UTR of the FMR1 gene
(Nelson, 1995; Eichler and Nelson, 1998). The 64 possible
triplets can be clustered into 12 equivalence classes when
shift and reverse complement operations are considered
(see below). Currently only three repeat classes CAG,
CGG, and GAA, out of the possible twelve, are associated
with triplet repeat disorders.
There is evidence that unusual structural features of
the repeats play a role in their expansion (Wells, 1996;
Pearson and Sinden, 1998a,b; Moore et al., 1999). In
Baldi et al. (1999), the structural scales above were used
to show that the triplet classes involved in the diseases
have extreme structural characteristics of very high or
very low flexibility. Methods to quantify the degree of
extremality relative to other sequences, however, were not
developed. Furthermore, other triplet or non-triplet repeats
may play a role in diseases as well as other biological
processes. Therefore the techniques need to be improved
and extended to all classes of repeats.
Hence, given the importance of repeating patterns and
the exponential growth of sequence databases, our goal is
also to develop new tools for the computational analysis
of the structural properties of arbitrary repeats and begin
to apply such techniques in a systematic and quantifiable
way. Various algorithms for searching tandem repeats
have been developed (Milosavljevic and Jurka, 1993;
Benson and Waterman, 1994; Benson, 1999; Blanchard
et al., 2000). The techniques presented here can also be
viewed as complementing such algorithms by introducing
a structural perspective.
Organization
The remainder of the paper is organized as follows. In
the next section we develop a general framework for
the analysis of the score of a sequence (repetitive or
not) under any additive scale. We determine the number
of different sequence equivalence classes under circular
permutation and reverse complement operations. We show
how to determine and visualize maximal and minimal
patterns and study the statistical properties of the scales,
including intra-scale (mean and variance) and inter-scale
(correlations) statistics for sequences of various lengths, as
well as asymptotic normality. This framework is essential
in order to compare the behavior of various scales, to
locate a given sequence with respect to a comparable
population, and to automatically set thresholds in database
searches. We then apply the general framework to the
five structural models described above and various tandem
repeats.
Methods and theory
General framework
The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The
scale is a function that assigns a value to any $S$-tuple of
the alphabet, for instance in the form of a table with $A^S$
entries. In the Results section, we deal exclusively with the
nucleotide DNA alphabet ($A = 4$) and with DNA scales,
such as dinucleotide with $S = 2$ (e.g. propeller twist)
or trinucleotide with $S = 3$ (e.g. bendability) structural
scales. The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with
hydrophobicity scales). Given a primary sequence $s = X_1 X_2 \ldots X_N$ of length $N \geq S$ over $\mathcal{A}$, we assume that the
scale $\mathcal{S}$ is approximately additive in the sense that the corresponding global property of the sequence $s$ can be estimated by 'sliding' the scale along the sequence in the form
$$\mathcal{S}(s) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \ldots X_{i+S-1}). \tag{1}$$
In practical applications, such quantity can also be averaged over a window of length W to get, for instance, a
more homogeneous per base-pair value (W ≈ N ). This
averaging process does not concern us at this stage since
it merely amounts to using a different scale, with a larger
size. The form given in equation (1) corresponds to a free
boundary condition. The ideas to be developed can be
applied to other boundary conditions, including periodic
boundary conditions, where the sequence is wrapped
around, as described below. With the proper modifications, the theory applies immediately to the case where the
scales are shifted by more than one position at each step.
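As an illustration of equation (1), the sliding sum with a free boundary can be sketched as follows. The function name `additive_score` and the table `TOY_SCALE` are assumptions made for this example; in particular, `TOY_SCALE` is a hypothetical dinucleotide scale (the A/T count of each dinucleotide), not one of the published structural scales.

```python
from itertools import product

# Hypothetical dinucleotide scale (NOT a published scale):
# each dinucleotide is scored by the number of A/T letters it contains.
TOY_SCALE = {a + b: (a in "AT") + (b in "AT") for a, b in product("ACGT", repeat=2)}


def additive_score(seq, scale, size):
    """Equation (1): sum the scale over every window of length `size` (free boundary)."""
    if len(seq) < size:
        raise ValueError("the sequence must be at least as long as the scale size")
    return sum(scale[seq[i:i + size]] for i in range(len(seq) - size + 1))


# The 8 dinucleotide windows of CAGCAGCAG are CA, AG, GC, CA, AG, GC, CA, AG.
print(additive_score("CAGCAGCAG", TOY_SCALE, 2))  # prints 6
```

Any table with one entry per $S$-tuple can be substituted for the toy scale; the function only assumes that every window of the sequence is a key of the table.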
Consider now a repeat sequence $r$ consisting of a unit
pattern or period $p = (X_1 \ldots X_P)$ of length $P$, and
repetition number $R > 1$, so that $r = (X_1 \ldots X_P)^R$ with
$N = PR \geq S$. Notice that the period is not uniquely
determined since, for instance, $XXXX$ can be viewed
as $(X)^4$, or as $(XX)^2$. In addition, we will assume that
$P + S - 1 \leq N$, or equivalently that $S \leq (R - 1)P + 1$,
so that the scale $\mathcal{S}$ is applied starting at least once from
each letter in the repetitive unit, without exceeding the
repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the
form:
$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \epsilon \tag{2}$$
where $\mathcal{S}(p)$ is the contribution of the periodic unit
$$\mathcal{S}(p) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots + \mathcal{S}(X_P X_1 \ldots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \ldots X_{i+S-1}) \quad [\bmod\ P]. \tag{3}$$
The number $l$ of times the periodic unit is covered by $\mathcal{S}$
and its shifted version is given by:
$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S - 1}{P} \right\rceil. \tag{4}$$
Finally, if $lP + S - 1 = RP$ then the boundary tail $\epsilon$ is
equal to 0. Otherwise
$$\epsilon = \mathcal{S}(X_{lP+1} \ldots X_{lP+S}) + \cdots + \mathcal{S}(X_{RP-S+1} \ldots X_{RP}) \tag{5}$$
where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$
and so forth. The sum in equation (5) has at most $P - 1$
terms. In practice, at least in the case of DNA, only short
scales are currently available and therefore in most cases,
$S \leq P + 1$. In this case, equation (2) simplifies to:
$$\mathcal{S}(r) = (R - 1)\mathcal{S}(p) + \epsilon. \tag{6}$$
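The decomposition of a repeat score into complete passes over the unit plus a boundary tail, equations (2) to (5), can be checked numerically with a short sketch. The names `repeat_score_parts` and `TOY_SCALE` are assumptions of this example, and the toy scale is again hypothetical (A/T count per dinucleotide) rather than a published one.

```python
from itertools import product

# Hypothetical dinucleotide scale (not a published one): A/T count of each dinucleotide.
TOY_SCALE = {a + b: (a in "AT") + (b in "AT") for a, b in product("ACGT", repeat=2)}


def additive_score(seq, scale, size):
    """Free-boundary score of equation (1)."""
    return sum(scale[seq[i:i + size]] for i in range(len(seq) - size + 1))


def repeat_score_parts(unit, repeats, scale, size):
    """Return (l, S(p), tail) with S(r) = l * S(p) + tail, as in equations (2)-(5)."""
    period = len(unit)
    n = period * repeats
    if period + size - 1 > n:
        raise ValueError("the decomposition assumes P + S - 1 <= N")
    # Equation (3): score of the unit with wrap-around (periodic) boundary.
    wrapped = unit * (size // period + 2)
    unit_score = sum(scale[wrapped[i:i + size]] for i in range(period))
    # Equation (4): number of complete passes over the unit.
    passes = (n - size + 1) // period
    # Equation (5): leftover windows next to the right boundary.
    full = unit * repeats
    tail = sum(scale[full[i:i + size]] for i in range(passes * period, n - size + 1))
    return passes, unit_score, tail


passes, unit_score, tail = repeat_score_parts("CAG", 3, TOY_SCALE, 2)
# Equation (2) must agree with the direct sliding sum of equation (1).
assert passes * unit_score + tail == additive_score("CAG" * 3, TOY_SCALE, 2)
```

For (CAG)^3 with a dinucleotide scale, the number of passes is R - 1 = 2, as in the simplified form of equation (6).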
Equivalence classes
In the special case of repetitive sequences, we also need
to be able to count the number of different repeats with
respect to a given scale. It is often the case that the scale
S is characterized by some kind of invariance with respect
to the sequences of length S of A. In the case of DNA, the
structural scales we have are invariant with respect to the
reverse complement. When looking at repeat sequences,
this determines how many different repeat patterns of
length P need to be considered.
A triplet repeat, for instance, can be described in
terms of different unit trinucleotides depending on what
strand and triplet frame is chosen. Thus, the repeat
CAGCAGCAG . . . can be said to be a repeat of the triplet
CAG, and also of its reverse complement CTG. Ignoring
repeat boundaries, however, the sequence can also be
described as a repeat of the shifted triplet pairs AGC/GCT
and GCA/TGC. In this way, the 64 different trinucleotides
can be divided into 12 possible repeat classes. Of these
12 classes, only 10 are proper triplet repeat classes in
the sense that they do not result from a repeat pattern of
shorter length. The two classes associated with shorter
patterns are obviously the triplet pairs AAA/TTT and
CCC/GGG which are more precisely described as mononucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse
complement operation can be defined by introducing a
one-to-one function $X \to \bar{X}$ from the alphabet to itself,
satisfying $\bar{\bar{X}} = X$, so that the reverse complement of
$X_1 \ldots X_N$ is defined to be $\bar{X}_N \ldots \bar{X}_1$.]
In the case of a DNA repeat with unit repeat length
P, the number of classes and the number of elements
in each equivalence class is dictated by the action of
the group of transformations associated with the circular
permutations and the reverse complement operations on
the set of all possible strings of length P. AAA. . . /TTT. . .
and CCC. . . /GGG. . . always give rise to two separate
classes with two elements each. In general, a typical
class will contain 2P elements associated with the P
permutations and the P reverse complements. Classes
containing fewer elements, however, can arise for instance
as a result of sub-periodicity effects when P is not prime,
and of identical reverse complement effects. For instance,
when P = 4, the class of ATAT contains only two
elements since it is identical to its reverse complement and
can be shifted circularly only once before returning to the
original pattern. The number of classes can be counted
using standard group theory arguments detailed in the
appendix. These arguments are not restricted to circular
permutation and reverse complement operations, but apply
to any group of transformations over any sequences.
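The equivalence class of a repeat unit under circular permutation and reverse complement can be enumerated directly for DNA. This is a minimal sketch; the function names `reverse_complement` and `repeat_class` are assumptions of the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the letters A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def repeat_class(unit):
    """All units equivalent to `unit` under circular shifts and reverse complement."""
    shifts = {unit[i:] + unit[:i] for i in range(len(unit))}
    return shifts | {reverse_complement(s) for s in shifts}


print(sorted(repeat_class("CAG")))   # the six members of the CAG class
print(sorted(repeat_class("ATAT")))  # a smaller class: ATAT and TATA only
```

The smallest element of the set (for example `min(repeat_class(unit))`) can serve as a canonical label when grouping repeat units by class.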
The number of classes, when only circular permutations
without reverse complement are taken into account, is
given by
$$\frac{1}{P} \sum_{d|P} \phi\!\left(\frac{P}{d}\right) A^d = \frac{1}{P} \sum_{1 \leq k \leq P} A^{(P,k)} \tag{7}$$
where $(P, k)$ is the greatest common divisor (gcd) of $P$
and $k$, and $\phi(n)$ is the Euler function counting the number
of positive integers not exceeding $n$ which are prime to $n$, i.e. without
common divisors with $n$ other than 1. If $p_1, \ldots, p_k$ is the list of
distinct prime factors of $n$, then the Euler function can be
expressed as:
$$\phi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right). \tag{8}$$
When both circular permutations and reverse complement are taken into account, the number of classes for odd
$P$ is given by
$$\frac{1}{2P} \sum_{d|P} \phi\!\left(\frac{P}{d}\right) A^d = \frac{1}{2P} \sum_{1 \leq k \leq P} A^{(P,k)}. \tag{9}$$
When $P$ is even, the corresponding number of classes is
$$\frac{1}{2P} \left[ \sum_{d|P} \phi\!\left(\frac{P}{d}\right) A^d + \frac{P}{2} A^{P/2} \right] \tag{10}$$
or, equivalently,
$$\frac{1}{2P} \left[ \sum_{1 \leq k \leq P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right]. \tag{11}$$
In particular, when $P$ is an odd prime, the number of different
classes under periodic and reverse complement equivalence is
$$\frac{1}{2P} \left[ (P - 1)A + A^P \right]. \tag{12}$$
The number of classes which are new at a given length
P, i.e. that do not result from the repetition of a shorter
pattern of length dividing P, can easily be obtained by
subtracting the corresponding counts for each divisor of P.
When P is prime, all classes are new except for the classes
resulting from mono-letter repeats. Table 1 in the Results
section exemplifies the application of equations (9)–(12).
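The counting formulas (9) and (11) can be evaluated and cross-checked against a brute-force enumeration for short DNA units. This is a sketch under the assumption, true for DNA, that no letter is its own complement; the function names `count_classes` and `count_classes_brute_force` are introduced here for illustration only.

```python
from itertools import product
from math import gcd

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_classes(period, alphabet_size=4):
    """Number of repeat classes under circular shift and reverse complement.

    Uses equation (9) for odd periods and equation (11) for even periods.
    Assumes that no letter of the alphabet is its own complement.
    """
    rotations = sum(alphabet_size ** gcd(period, k) for k in range(1, period + 1))
    if period % 2 == 1:
        return rotations // (2 * period)
    return (rotations + (period // 2) * alphabet_size ** (period // 2)) // (2 * period)


def count_classes_brute_force(period):
    """Enumerate all DNA units of the given length and count distinct classes."""
    labels = set()
    for letters in product("ACGT", repeat=period):
        unit = "".join(letters)
        shifts = {unit[i:] + unit[:i] for i in range(period)}
        members = shifts | {s.translate(COMPLEMENT)[::-1] for s in shifts}
        labels.add(min(members))  # canonical label of the class
    return len(labels)


for p in range(1, 7):
    print(p, count_classes(p), count_classes_brute_force(p))
```

For P = 3 both counts give the 12 triplet classes discussed above. The brute-force version grows as 4^P and is only meant as a check for small periods.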
Extremal sequences and automata
We are interested in the construction and recognition of
sequences s that are extremal for S , i.e. such that S (s) is
very large or very small relative to the other sequences
of length N . For this, we attach to each scale a prefix
automata, or prefix graph. The prefix automata can be
described by a directed graph containing A S−1 nodes,
each labeled by a string of length S − 1 over A of the
form X 1 . . . X S−1 (see Figure 1 for an example). Each
node has A directed outgoing connections. X 1 . . . X S−1
is connected to X 2 . . . X S−1 Y , for each letter Y in A,
hence the notion of prefix. The weight (or length) of
the corresponding transition is provided by the entry
associated with X 1 X 2 . . . X S−1 Y in the structural table.
The A nodes labeled (X ) S−1 = X X X . . . X (monorepeats) are the only ones to have a self-connection.
Any sequence s of length N , is trivially associated
with the path: X 1 . . . X S−1 → X 2 . . . X S → · · · →
X N −S+2 . . . X N . The value of S (s) is found just by adding
the weights of the corresponding connections.
As a result, sequences associated with maximal or
minimal values of S (s) correspond to paths in the
prefix graph, with maximal or minimal total weight or
length. These can easily be found by standard dynamic
programming techniques which can also be extended to
finding, for instance, the k longest or shortest paths. A
repeat pattern of length P is a directed cycle in the prefix
automata graph. Notice that any path of length greater than
A S−1 must intersect itself at least once. Thus any cycle
of length strictly greater than A S−1 must be composed
of non-intersecting cycles of length at most A S−1 . For
instance, with a dinucleotide scale, any repeat unit of
length greater than four must contain at least two cycles of
length at most four. Therefore in the study of repeats, we
need only to study the properties of all non-intersecting
directed cycles of length up to A S−1 together with all
possible ways of joining them.
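The dynamic programming search over the prefix graph can be sketched as below. This is one possible implementation, not necessarily the one used by the authors: the function `extremal_sequence` and the table `TOY_SCALE` are assumptions of the example, and the toy scale holds arbitrary hypothetical values rather than a published structural scale.

```python
from itertools import product

# Hypothetical dinucleotide scale with arbitrary integer values (not a published scale).
TOY_SCALE = {
    a + b: (3 * i + 5 * j) % 7
    for i, a in enumerate("ACGT")
    for j, b in enumerate("ACGT")
}


def additive_score(seq, scale, size):
    """Free-boundary score of equation (1)."""
    return sum(scale[seq[i:i + size]] for i in range(len(seq) - size + 1))


def extremal_sequence(scale, size, length, alphabet="ACGT", maximize=True):
    """Best score and one best sequence of the given length, by dynamic programming.

    Nodes of the prefix graph are strings of length size - 1; following the edge
    labeled by `letter` from `node` adds the weight scale[node + letter].
    """
    if length < size:
        raise ValueError("length must be at least the scale size")
    sign = 1 if maximize else -1
    # best[node] = (signed score, sequence) of the best path ending at `node`.
    best = {"".join(t): (0, "".join(t)) for t in product(alphabet, repeat=size - 1)}
    for _ in range(length - size + 1):
        nxt = {}
        for node, (total, seq) in best.items():
            for letter in alphabet:
                target = (node + letter)[1:]
                cand = total + sign * scale[node + letter]
                if target not in nxt or cand > nxt[target][0]:
                    nxt[target] = (cand, seq + letter)
        best = nxt
    total, seq = max(best.values(), key=lambda item: item[0])
    return sign * total, seq


score, seq = extremal_sequence(TOY_SCALE, 2, 8)
assert additive_score(seq, TOY_SCALE, 2) == score
```

Storing the full sequence at every node keeps the sketch short; a back-pointer table would use less memory for long sequences. Ties between equally good paths are broken arbitrarily.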
In addition to dynamic programming techniques, it is
also useful to tabulate the weights of all possible short
cycles for at least two reasons. First, because longer
patterns are built from shorter cycles. Second, at least in
the case of DNA, many important existing repeats, such as
triplet repeats, are based on a short repeating pattern.
While the prefix graph is useful for constructing extremal sequences and recognizing them as long as A, S
and N are small, it is also necessary to develop more
general techniques by which we can rapidly assess, for
any sequence s, the magnitude of S (s) with respect to all
the other comparable sequences. This is best achieved by
viewing the sequences in a probabilistic context.
Probabilistic modeling
Consider now that sequences are being generated by a
random process. In order to fix the ideas, we take for
simplicity a Markov model of order 0, i.e. we assume
that sequences are generated by $N$ tosses of the same die
with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The
same analysis, however, can easily be extended to other
probabilistic models such as higher-order Markov models
where distributions are defined, for instance, on pairs or
triplets of letters.
From equation (1), $\mathcal{S}(s)$ is now a random variable which
is the sum of $N - S + 1$ random variables: $\mathcal{S}(s) = Y_1 + \cdots + Y_{N-S+1}$. By construction, all the variables
$Y_i = \mathcal{S}(X_i \ldots X_{i+S-1})$ have the same distribution, but
they are not independent. Rather they satisfy a form
of local dependence, called 'm-dependence' in statistics.
More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent if and
only if $j - i \geq S$. Using the linearity of the expectation,
we have:
$$E(\mathcal{S}(s)) = (N - S + 1)E(Y_i) \approx N\alpha_S \tag{13}$$
with
$$E(Y_i) = \sum_{X_1 \ldots X_S} \mathcal{S}(X_1 \ldots X_S)\, p(X_1) \cdots p(X_S) = \alpha_S \tag{14}$$
the sum being over all $A^S$ $S$-tuples of the alphabet.
To situate an individual sequence with respect to the
entire population, we need to calculate the variance.
The variance also can be calculated explicitly by taking
advantage of the local dependence of the variables $Y_i$. We
have
$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\mathrm{Var}(Y_i) + 2 \sum_{0 < j-i < S} \mathrm{Cov}(Y_i, Y_j) \tag{15}$$
with the covariances $\mathrm{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \geq S$, $Y_i$ and $Y_j$ are independent
and the corresponding covariance is 0. Thus, for any given
scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$
and the $S$ relevant short-range covariances
$$C_k = C_k(\mathcal{S}) = \mathrm{Cov}(Y_i, Y_{i+k}) \tag{16}$$
for $0 \leq k < S$ ($C_0 = \mathrm{Var}(Y_i)$). Alternatively, by factoring
out the variance of $Y_i$, equation (15) can also be expressed
in terms of the correlations
$$\mathrm{Var}(\mathcal{S}(s)) = \mathrm{Var}(Y_i) \left[ N - S + 1 + 2 \sum_{0 < j-i < S} \mathrm{Cor}(Y_i, Y_j) \right]. \tag{17}$$
To obtain the exact variance at each length $N$, it is then
only a matter of counting how many times each type of
covariance is present in the sequences and adjusting for any
boundary effects as needed.
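The tabulation of the mean and of the short-range covariances, followed by the counting step for the exact variance, can be sketched for an order-0 Markov background as follows. The function names, the background distribution `PROBS`, and the table `TOY_SCALE` are all assumptions of this example; the toy scale is hypothetical and not a published structural scale.

```python
from itertools import product
from math import prod

# Hypothetical inputs for illustration only.
PROBS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}          # order-0 background
TOY_SCALE = {
    a + b: (3 * i + 5 * j) % 7
    for i, a in enumerate("ACGT")
    for j, b in enumerate("ACGT")
}


def word_prob(word, probs):
    """Probability of a word under independent letters."""
    return prod(probs[x] for x in word)


def scale_mean(scale, size, probs):
    """Equation (14): expectation of a single window value Y_i."""
    return sum(
        scale["".join(w)] * word_prob(w, probs) for w in product(probs, repeat=size)
    )


def scale_covariances(scale, size, probs):
    """Equation (16): the list [C_0, ..., C_{S-1}] of covariances Cov(Y_i, Y_{i+k})."""
    alpha = scale_mean(scale, size, probs)
    covs = []
    for k in range(size):
        joint = 0.0
        # Two windows at distance k together span size + k letters.
        for w in product(probs, repeat=size + k):
            word = "".join(w)
            joint += scale[word[:size]] * scale[word[k:k + size]] * word_prob(word, probs)
        covs.append(joint - alpha ** 2)
    return covs


def score_variance(n, scale, size, probs):
    """Exact variance of the score for length n, counting each covariance type."""
    covs = scale_covariances(scale, size, probs)
    m = n - size + 1  # number of windows
    # A lag k occurs (m - k) times; lags are limited by both the scale size and m.
    return m * covs[0] + 2 * sum((m - k) * covs[k] for k in range(1, min(size, m)))


print(scale_mean(TOY_SCALE, 2, PROBS), score_variance(10, TOY_SCALE, 2, PROBS))
```

The enumeration over all tuples of length up to 2S - 1 is practical for di- and trinucleotide scales; for a higher-order Markov background the word probability would have to be replaced accordingly.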
If N 2S − 1, then
\[
\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2\sum_{k=1}^{S-1} (N - S - k + 1)\,C_k . \tag{18}
\]
If $S \le N < 2S - 1$, then
\[
\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2\sum_{k=1}^{N-S} (N - S - k + 1)\,C_k . \tag{19}
\]
It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$
\[
\mathrm{Var}(\mathcal{S}(s)) \approx N\Big[C_0 + 2\sum_{k=1}^{S-1} C_k\Big] = N\sum_{k=-S+1}^{S-1} C_k = N\beta_{\mathcal{S}} . \tag{20}
\]
In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.
In the case of repetitive sequences, it is also useful to calculate the expectation of $\mathcal{S}(p) = Y_1 + \cdots + Y_P$, and its variance with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1 \ldots Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(\mathcal{S}(p)) = \alpha_{\mathcal{S}} P$ and $\mathrm{Var}(\mathcal{S}(p)) = \beta_{\mathcal{S}} P$. Clearly, for any $P$, $E(\mathcal{S}(p)) = P\,E(Y_i)$ so
\[
\alpha_{\mathcal{S}} = E(Y_i). \tag{21}
\]
If $P \ge 2S - 1$,
\[
\beta_{\mathcal{S}} = C_0 + 2\sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k . \tag{22}
\]
When $S \le P < 2S - 1$, all variables along the circle are dependent and therefore $\beta_{\mathcal{S}} = \beta_{\mathcal{S}}(P)$ is given by
\[
\beta_{\mathcal{S}}(P) = C_0 + 2\sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \tag{23}
\]
when $P = 2n + 1$, and
\[
\beta_{\mathcal{S}}(P) = C_0 + C_n + 2\sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \tag{24}
\]
when $P = 2n$. Periodic boundary conditions must be used in the computation of the covariances $C_k$ whenever necessary ($|k| > P - S$).
For a periodic sequence $r$, where the period $P$ as well as $S$ are small relative to the length $N = RP$ we can use:
\[
E(\mathcal{S}(r)) \approx R\,E(\mathcal{S}(p)) \tag{25}
\]
and the approximation
\[
\mathrm{Var}(\mathcal{S}(r)) \approx R\,\mathrm{Var}(\mathcal{S}(p)). \tag{26}
\]
For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, multiple of $P$, so that $S \le P'$.
Central limit theorem. $\mathcal{S}(s)$ consists of a sum of identical but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \cdots + Y_K$ of $K$ m-dependent random variables $Y_i$ still approaches a normal distribution. This can be shown using the theorem in Baldi and Rinott (1989) which provides also a bound on the rate of convergence. Here we use the improved bound found in Rinott and Dembo (1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E\big(\sum_{i=1}^{K} |Y_i - E(Y_i)|\big)/K = \mu$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,
\[
\left| P\!\left( \frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \le u \right) - \Phi(u) \right| \le \frac{7K\mu\,(2S-1)^2 B^2}{[\mathrm{Var}(Z)]^{3/2}} \tag{27}
\]
where $\Phi(u)$ is the normalized Gaussian distribution. The factor $(2S-1)$ represents the size of the clusters associated with m-dependence. For a fixed scale, such size is constant but the theorem remains true if $S$ grows slowly with $N$. Thus equation (27) can readily be applied to $\mathcal{S}(s)$ or $\mathcal{S}(r)$ with $K = N - S + 1$ or $K = RP$. From equation (20), the variance of the sequences being considered is linear in their length: $\mathrm{Var}(\mathcal{S}(s)) \approx \beta N$, where $\beta$ depends only on the scale $\mathcal{S}$. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$
\[
\left| P\!\left( \frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \le u \right) - \Phi(u) \right| \le \frac{C}{\sqrt{N}} \tag{28}
\]
with $C \approx 7\mu(2S-1)^2 B^2 \beta^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry–Esseen theorems, Feller, 1971).
Normalized distances and extremal sequences. The value of $\mathcal{S}(s)$ or $\mathcal{S}(r)$ of any sequence or repeat of length $N$ can be compared to the average value of a background population by computing a normalized Z-score of the form:
\[
Z(s) = \frac{\mathcal{S}(s) - \alpha_{\mathcal{S}} N}{\sqrt{\beta_{\mathcal{S}} N}} . \tag{29}
\]
A repeat r with period unit length P and repetition R (N = RP) can be compared to a background population of repeats, or a background population of generic sequences. In the latter case, we have $\mathcal{S}(r) \approx \mathcal{S}(p)R$ or $\mathcal{S}(r) \approx \alpha' N$, where $\alpha' = \mathcal{S}(p)/P$ is the scale value per position of the repeat unit. Therefore the Z-score
\[
Z(r) = \frac{(\alpha' - \alpha)\sqrt{N}}{\sqrt{\beta}} \tag{30}
\]
grows with $\sqrt{N}$ and is larger than the Z-score Z(p) computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length.
The Z -scores can be used to assess how extreme a
sequence is and to search databases for subsequences with
extremal features. As in the case of alignments, this can
also be done using extreme value distributions (Durbin et
al., 1998). Note also that one can search a database using
a structural profile rather than extreme values. The degree
of similarity between two profiles can be measured, for
instance, using the standard mean square error.
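As an illustration of equations (18)–(20) and (29), the following minimal Python sketch computes the exact and asymptotic variance of a scale sum and the resulting Z-score. The function names and the covariance values are hypothetical, chosen only for the example; they are not taken from any of the scales discussed here.

```python
import math

def variance_of_sum(n, s, cov):
    """Var(S(s)) for a sequence of length n and a scale of length s.

    cov[k] is C_k = Cov(Y_i, Y_{i+k}) for k = 0 .. s-1.
    Covers both equation (18) (n >= 2s-1) and equation (19) (s <= n < 2s-1):
    lags that do not fit into the sequence are dropped.
    """
    if n < s:
        raise ValueError("sequence shorter than the scale window")
    windows = n - s + 1                  # number of terms Y_i
    max_lag = min(s - 1, windows - 1)    # S-1 in (18), N-S in (19)
    total = windows * cov[0]
    for k in range(1, max_lag + 1):
        total += 2 * (windows - k) * cov[k]
    return total

def beta(cov):
    """Asymptotic variance per position, equation (20)."""
    return cov[0] + 2 * sum(cov[1:])

def z_score(value, n, alpha, beta_value):
    """Normalized distance of equation (29)."""
    return (value - alpha * n) / math.sqrt(beta_value * n)

# Hypothetical dinucleotide-like coefficients (S = 2), for illustration only.
cov = [2.0, 0.5]
print(variance_of_sum(10, 2, cov), 10 * beta(cov))   # exact vs. asymptotic variance
print(z_score(33.0, 10, 3.0, beta(cov)))
```

The exact variance differs from the asymptotic value N times beta only through boundary terms, so the difference stays bounded while both grow linearly with N.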
Correlations between scales. It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales $\mathcal{S}_1$ and $\mathcal{S}_2$ of length $S_1$ and $S_2$. Without any loss of generality assume that $S_1 \le S_2$. For sequences s of length N, we are interested in measuring the correlation between the random variables $\mathcal{S}_1(s) = Y_1 + \cdots + Y_{N-S_1+1}$, with $Y_i = \mathcal{S}_1(X_i \ldots X_{i+S_1-1})$, and $\mathcal{S}_2(s) = Z_1 + \cdots + Z_{N-S_2+1}$, with $Z_i = \mathcal{S}_2(X_i \ldots X_{i+S_2-1})$. We have:
\[
\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = \frac{\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))}{\sqrt{\mathrm{Var}(\mathcal{S}_1(s))}\,\sqrt{\mathrm{Var}(\mathcal{S}_2(s))}} . \tag{31}
\]
Again only terms of the form $\mathrm{Cov}(Y_i, Z_j)$, where the distance between i and j is small, are non-zero. More precisely, non-zero terms can arise only if $0 \le j - i \le S_1 - 1$ or $0 \le i - j \le S_2 - 1$. It is sufficient to tabulate the finite set of $S_1 + S_2 - 1$ covariances $\mathrm{Cov}(Y_i, Z_{i+k})$
\[
C_k = C_k(\mathcal{S}_1, \mathcal{S}_2) = E[(Y_i - E(Y_i))(Z_{i+k} - E(Z_{i+k}))] \tag{32}
\]
with $S_1 \le S_2$ and $-S_2 + 1 \le k \le S_1 - 1$. These covariances can be used to compute correlations at all lengths by writing
\[
\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = (N - S_2 + 1)\,\mathrm{Cov}(Y_i, Z_i) + \sum_{i \ne j} \mathrm{Cov}(Y_i, Z_j), \tag{33}
\]
where the sum runs over ordered pairs $(i, j)$. For large N it is clear that, except for small boundary effects, each type of covariance occurs approximately N times in the formula above. Therefore for large N, $\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ behaves approximately as
\[
N\Big[C_0 + \sum_{k=-S_2+1,\,k \ne 0}^{S_1-1} C_k\Big] = N \sum_{k=-S_2+1}^{S_1-1} C_k . \tag{34}
\]
We have seen in equation (20) that the variance of each scale is also asymptotically linear in the length N. Thus, as N increases, the correlation $\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ rapidly converges to a constant given by:
\[
\frac{\sum_{k=-S_2+1}^{S_1-1} C_k(\mathcal{S}_1, \mathcal{S}_2)}{\Big[\sum_{k=-S_1+1}^{S_1-1} C_k(\mathcal{S}_1)\; \sum_{k=-S_2+1}^{S_2-1} C_k(\mathcal{S}_2)\Big]^{1/2}} . \tag{35}
\]
In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement
operation, it is worth noticing that with a uniform distribution on the alphabet ( p A = pC = pG = pT = 0.25),
the correlations are symmetric. That is, for any
0 < k < S1 we have Ck (S1 , S2 ) = C−k (S1 , S2 ).
This results immediately from the fact that the sum
of the terms S2 (X 1 . . . X S2 ) × S1 (X 1 . . . X S1 ) and
S2 ( X̄ S2 . . . X̄ 1 ) × S1 ( X̄ S2 . . . X̄ S2 −S1 +1 ) is equal to the
sum of the terms S2 (X 1 . . . X S2 ) × S1 (X S2 −S1 +1 . . . X S2 )
and S2 ( X̄ S2 . . . X̄ 1 ) × S1 ( X̄ S1 . . . X̄ 1 ), and similarly
for other degrees of overlaps. The terms in the sums
can be identically paired using the fact that S1 and S2
are assumed to be reverse-complement invariant. The
result is not true if the scales, or the distribution, are not
reverse-complement invariant.
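The covariances of equation (32) and the asymptotic correlation of equation (35) can be obtained by exhaustive enumeration when the letters are independent and uniformly distributed. The sketch below is one possible way to do this, not the authors' program; the function names are assumptions, and the dinucleotide scale `TOY` (number of A/T letters in the dinucleotide) is a made-up example, paired with AT-content viewed as a mononucleotide scale.

```python
from itertools import product
import math

ALPHABET = "ACGT"

def cross_cov(scale1, scale2, k):
    """C_k(S1, S2) = Cov(Y_i, Z_{i+k}) under a uniform i.i.d. letter distribution."""
    s1 = len(next(iter(scale1)))
    s2 = len(next(iter(scale2)))
    # Place the two windows relative to each other so that both offsets are >= 0.
    off1, off2 = (0, k) if k >= 0 else (-k, 0)
    length = max(off1 + s1, off2 + s2)
    ey = ez = eyz = 0.0
    count = 0
    for letters in product(ALPHABET, repeat=length):
        w = "".join(letters)
        y = scale1[w[off1:off1 + s1]]
        z = scale2[w[off2:off2 + s2]]
        ey += y
        ez += z
        eyz += y * z
        count += 1
    return eyz / count - (ey / count) * (ez / count)

def covariance_sum(scale1, scale2):
    """Sum of C_k over all overlapping offsets, as in equations (34) and (35)."""
    s1 = len(next(iter(scale1)))
    s2 = len(next(iter(scale2)))
    return sum(cross_cov(scale1, scale2, k) for k in range(-s2 + 1, s1))

def asymptotic_correlation(scale1, scale2):
    """Equation (35)."""
    num = covariance_sum(scale1, scale2)
    den = math.sqrt(covariance_sum(scale1, scale1) * covariance_sum(scale2, scale2))
    return num / den

AT = {"A": 1, "C": 0, "G": 0, "T": 1}                       # AT-content, mononucleotide
TOY = {a + b: AT[a] + AT[b] for a in ALPHABET for b in ALPHABET}  # hypothetical scale

single_position = cross_cov(AT, TOY, 0) / math.sqrt(
    cross_cov(AT, AT, 0) * cross_cov(TOY, TOY, 0))
print(single_position, asymptotic_correlation(AT, TOY))
```

For this toy pair the correlation at a single position is moderate while the asymptotic correlation is essentially 1, which illustrates how summing over overlapping windows can raise the correlation between two scales.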
Results
DNA repeat equivalence classes
We wrote a program that cycles through all possible DNA
sequences of length P counting and listing all the classes
that are equivalent under circular permutation and reverse
complement operations. Because of this equivalence, in
the case of scales that are reverse-complement invariant,
it is sufficient to study the repeats of one representative
member of each class. We ran the program up to length
P = 12. The results, shown in Table 1, are in complete
agreement with equations (9)–(12).
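The enumeration described above can be sketched in a few lines of Python. This is an illustrative reimplementation under our own naming (`equivalence_class`, `repeat_classes`), not the original program; it is only practical for small P because it visits all 4^P sequences.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def equivalence_class(unit):
    """All circular permutations of the unit and of its reverse complement."""
    members = set()
    for w in (unit, reverse_complement(unit)):
        for i in range(len(w)):
            members.add(w[i:] + w[:i])
    return tuple(sorted(members))

def repeat_classes(p):
    """Sorted list of the equivalence classes of repeat units of length p."""
    classes = {equivalence_class("".join(t)) for t in product("ACGT", repeat=p)}
    return sorted(classes)

for p in range(1, 7):
    print(p, len(repeat_classes(p)))

# Distribution of class sizes for tetranucleotides.
print(Counter(len(c) for c in repeat_classes(4)))
```

Each class is represented by the sorted tuple of its members, so the first element of every tuple is the alphabetically first representative used in the tables.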
In Tables 2, 3 and 4 we list alphabetically all the
members of each equivalence class for sequences of
length 2–4. When P = 4, for instance, one finds
39 classes: 27 classes with 8 elements, 8 classes with
4 elements, and 4 classes with 2 elements. Only 33 classes
are new, in the sense that 6 classes are derived from
patterns already encountered at P = 1 and 2. Likewise,
when P > 2 is a prime number, the total number of classes is given by:
\[
\frac{4^P - 4}{2P} + 2 \tag{36}
\]
with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain 2P members.
In the appendix, we provide tables in alphabetical order that make it possible to invert Tables 3 and 4, i.e. to find the class associated with any given P-tuple (P = 3, 4).

[Figure 1: directed graph on the four nodes A, C, G and T; each edge is labeled with the propeller twist value of the corresponding dinucleotide.]
Fig. 1. Dinucleotide prefix automaton for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C → A → G → C in the graph and has a total propeller twist value of −9.45 − 14.00 − 11.08 = −34.53. The corresponding reverse complement cycle is given by C → T → G → C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG.

Table 1. Number of repeat unit equivalence classes. New or proper classes are classes that do not contain a shorter periodic pattern

Sequence length    Classes (total)    Classes (new)
 1                         2                  2
 2                         6                  4
 3                        12                 10
 4                        39                 33
 5                       104                102
 6                       366                350
 7                     1 172              1 170
 8                     4 179              4 140
 9                    14 572             14 560
10                    52 740             52 632
11                   190 652            190 650
12                   700 274            699 875
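The prime-length formula (36) is easy to check against the totals listed in Table 1. The snippet below is a small sanity check; the function name is ours.

```python
# Totals for prime lengths, as listed in Table 1.
TABLE_1_TOTALS = {3: 12, 5: 104, 7: 1172, 11: 190652}

def classes_for_prime_length(p):
    """Equation (36): (4**p - 4) / (2p) proper classes, plus the poly-A and poly-C classes."""
    return (4 ** p - 4) // (2 * p) + 2

for p, listed in TABLE_1_TOTALS.items():
    print(p, classes_for_prime_length(p), listed)
```

The integer division is exact for odd primes, since each of the 4^P − 4 non-homopolymeric sequences belongs to a class with exactly 2P members.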
Table 2. Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes

Class
number   List of members (alphabetical order)
1        AA TT
2        AC CA GT TG
3        AG CT GA TC
4        AT TA
5        CC GG
6        CG GC
Table 3. Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class
number   List of members (alphabetical order)
 1       AAA TTT
 2       AAC ACA CAA GTT TGT TTG
 3       AAG AGA CTT GAA TCT TTC
 4       AAT ATA ATT TAA TAT TTA
 5       ACC CAC CCA GGT GTG TGG
 6       ACG CGA CGT GAC GTC TCG
 7       ACT AGT CTA GTA TAC TAG
 8       AGC CAG CTG GCA GCT TGC
 9       AGG CCT CTC GAG GGA TCC
10       ATC ATG CAT GAT TCA TGA
11       CCC GGG
12       CCG CGC CGG GCC GCG GGC
Analysis of DNA repeats by dinucleotide scales
In the case of dinucleotide scales, the prefix automaton
contains four nodes (Figure 1). Each DNA sequence is
associated with a path through the corresponding graph,
and exact repeats are associated with cycles. All paths,
including cycles, of length greater than four are composite
in the sense that they contain a cycle of length 4 or less.
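The value of a repeat is the sum of the scale over one turn of its cycle in the prefix automaton. A minimal sketch follows; the function name is ours, and the three dinucleotide values are the ones quoted in the Figure 1 caption for the CAG cycle, read in the order C → A, A → G, G → C (which edge carries which value is our reading of the caption).

```python
def repeat_unit_value(unit, scale, k):
    """Sum a k-mer scale over the P cyclic k-mers of a repeat unit of length P."""
    p = len(unit)
    # Enough copies of the unit to read k letters starting from any position.
    extended = unit * ((k - 1) // p + 2)
    return sum(scale[extended[i:i + k]] for i in range(p))

# Propeller twist values quoted in the Figure 1 caption (partial scale).
PROPELLER_TWIST = {"CA": -9.45, "AG": -14.00, "GC": -11.08}

print(round(repeat_unit_value("CAG", PROPELLER_TWIST, 2), 2))
print(round(repeat_unit_value("AGC", PROPELLER_TWIST, 2), 2))  # same cycle, rotated
```

Because the sum runs over a closed cycle, every circular permutation of the unit gives the same value, which is why one representative per class is enough.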
In Table 5, we list the dinucleotide scale values
S (X 1 X 2 ) + S (X 2 X 1 ) for the six equivalence classes
associated with all 16 possible dinucleotide repeats of the
form (X 1 X 2 ) R . For each scale, we list classes (represented
by their first alphabetical member) and the corresponding
scale value, in decreasing value order. The highest level
of base stacking energy is achieved by the AT repeat class
(−10.39) and the lowest by the CG repeat class (−24.28).
The ranking of all possible dinucleotide repeats induced
by the propeller twist and the protein deformability scales
are identical with the exception of an inversion between
the CC (−16.22 and 12.2) and CG (−21.11 and 16.1)
classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (−37.32 and 5.8) followed by the proper dinucleotide repeat class AG (−27.48 and 6.6).

Table 4. Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class
number   List of members (alphabetical order)
 1       AAAA TTTT
 2       AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
 3       AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
 4       AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
 5       AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
 6       AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
 7       AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
 8       AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
 9       AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10       AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11       AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12       AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13       AATT ATTA TAAT TTAA
14       ACAC CACA GTGT TGTG
15       ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16       ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17       ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18       ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19       ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20       ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21       ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22       ACGT CGTA GTAC TACG
23       ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24       ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25       AGAG CTCT GAGA TCTC
26       AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27       AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28       AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29       AGCT CTAG GCTA TAGC
30       AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31       AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32       ATAT TATA
33       ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34       ATCG CGAT GATC TCGA
35       ATGC CATG GCAT TGCA
36       CCCC GGGG
37       CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38       CCGG CGGC GCCG GGCC
39       CGCG GCGC
In Table 6, we list the dinucleotide scale values for the
12 equivalence classes associated with all possible triplet
repeats of the form (X Y Z ) R . In this special case, we find
the results of Baldi et al. (1999). The high and low ends of
the base stacking energy scale are occupied by the triplet
classes AAT (−15.76) and CCG (−32.54) respectively.
We find again a high degree of correlation between the
propeller twist and protein deformability scales. If we
exclude the classes AAA/TTT (−55.98) and CCC/GGG
(−24.33), which are not proper triplet repeat classes, then
the maximum and the minimum of the propeller twist
spectrum are respectively occupied by the classes CCG
(−29.22) and AAG (−46.14). A similar ranking with
the same extremal triplets is observed with the protein
deformability scale: CCG (22.2) occupies the high end,
whereas AAA (8.7) and AAG (9.5) occupy the low end of
the spectrum.
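One way to quantify how similar the two rankings are is a rank correlation. The sketch below computes Spearman's coefficient (Pearson correlation of average ranks) on the propeller twist and protein deformability values of the 12 triplet classes in Table 6. The choice of Spearman's coefficient and the helper names are ours; this statistic is not part of the original analysis.

```python
import math

# (propeller twist, protein deformability) per triplet class, from Table 6.
TABLE_6 = {
    "AAA": (-55.98, 8.7),  "AAC": (-41.21, 15.0), "AAG": (-46.14, 9.5),
    "AAT": (-45.52, 10.8), "ACC": (-30.66, 18.2), "ACG": (-36.61, 18.9),
    "ACT": (-38.95, 10.7), "AGC": (-34.53, 15.9), "AGG": (-35.59, 12.7),
    "ATC": (-37.94, 15.9), "CCC": (-24.33, 18.3), "CCG": (-29.22, 22.2),
}

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

pt = [v[0] for v in TABLE_6.values()]
pd = [v[1] for v in TABLE_6.values()]
rho = spearman(pt, pd)
print(round(rho, 3))
```

A coefficient well above zero is consistent with the similar rankings described in the text, although with only 12 classes it should be read as a rough indication.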
When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck and Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr and Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with fragile X syndrome (FRAXA) (Nelson, 1995; Gusella and MacDonald, 1996; Eichler and Nelson, 1998; Skinner et al., 1998; Gecz and Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.

Table 5. Dinucleotide structural scale values for repeat unit p = X1 X2 with P = 2. S(p) = S(X1 X2) + S(X2 X1)

Base stacking    Propeller twist    Protein deformability
AT (−10.39)      CC (−16.22)        CG (16.1)
AA (−10.74)      CG (−21.11)        CC (12.2)
CC (−16.52)      AC (−22.55)        AC (12.1)
AG (−16.59)      AT (−26.86)        AT (7.9)
AC (−17.08)      AG (−27.48)        AG (6.6)
CG (−24.28)      AA (−37.32)        AA (5.8)

Table 6. Dinucleotide structural scale values for repeat unit p = X1 X2 X3 with P = 3. S(p) = S(X1 X2) + S(X2 X3) + S(X3 X1). Repeat classes associated with triplet repeat expansion diseases are marked with *

Base stacking     Propeller twist    Protein deformability
AAT (−15.76)      CCC (−24.33)       CCG* (22.2)
AAA (−16.11)      CCG* (−29.22)      ACG (18.9)
ACT (−21.11)      ACC (−30.66)       CCC (18.3)
AAG* (−21.96)     AGC* (−34.53)      ACC (18.2)
AAC (−22.45)      AGG (−35.59)       AGC* (15.9)
ATC (−22.95)      ACG (−36.61)       ATC (15.9)
CCC (−24.78)      ATC (−37.94)       AAC (15.0)
AGG (−24.85)      ACT (−38.95)       AGG (12.7)
ACC (−25.34)      AAC (−41.21)       AAT (10.8)
AGC* (−27.94)     AAT (−45.52)       ACT (10.7)
ACG (−30.01)      AAG* (−46.14)      AAG* (9.5)
CCG* (−32.54)     AAA (−55.98)       AAA (8.7)
In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form (X1 X2 X3 X4)^R. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (−20.78) and the proper tetranucleotide repeat AAAT (−21.13). The minimum corresponds to CGCG (−48.56) followed by ACGC (−41.36). We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (−32.44) and CCCG (−37.33) while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (−74.64 and 11.6) and AAAG (−64.80 and 12.4).

Table 7. Dinucleotide structural scale values for repeat unit p = X1 X2 X3 X4 with P = 4. S(p) = S(X1 X2) + S(X2 X3) + S(X3 X4) + S(X4 X1)

Base stacking     Propeller twist    Protein deformability
ATAT (−20.78)     CCCC (−32.44)      CGCG (32.2)
AAAT (−21.13)     CCCG (−37.33)      CCCG (28.3)
AATT (−21.13)     CCGG (−37.33)      CCGG (28.3)
AAAA (−21.48)     ACCC (−38.77)      ACGC (28.2)
AACT (−26.48)     CGCG (−42.22)      ATGC (25.2)
AAGT (−26.48)     AGCC (−42.64)      ACCG (25.0)
AGAT (−26.98)     AGGC (−42.64)      ACGG (25.0)
AAAG (−27.33)     ACGC (−43.66)      CCCC (24.4)
ACAT (−27.47)     AGGG (−43.70)      ACCC (24.3)
AAAC (−27.82)     ACCG (−44.72)      ACAC (24.2)
AATC (−28.32)     ACGG (−44.72)      ACGT (23.0)
AATG (−28.32)     ATGC (−44.99)      AGCG (22.7)
ACCT (−29.37)     ACAC (−45.10)      ATCG (22.7)
AAGG (−30.22)     ATCC (−46.05)      AGCC (22.0)
AACC (−30.71)     ACCT (−47.06)      AGGC (22.0)
ATCC (−31.21)     ACGT (−48.08)      ATCC (22.0)
AGCT (−31.97)     AGCG (−48.59)      AACG (21.8)
CCCC (−33.04)     AACC (−49.32)      AACC (21.1)
AGGG (−33.11)     ACAT (−49.41)      ACAT (20.0)
AGAG (−33.18)     ACAG (−50.03)      AAGC (18.8)
AAGC (−33.31)     ACTC (−50.03)      AATC (18.8)
ACCC (−33.60)     ACTG (−50.03)      AATG (18.8)
ACAG (−33.67)     AGCT (−50.93)      AGGG (18.8)
ACTC (−33.67)     ATCG (−52.00)      ACAG (18.7)
ACTG (−33.67)     AAGC (−53.19)      ACTC (18.7)
ACAC (−34.16)     ATAT (−53.72)      ACTG (18.7)
ATGC (−34.30)     AAGG (−54.25)      AAAC (17.9)
ACGT (−34.53)     AGAT (−54.34)      ACCT (16.8)
AACG (−35.38)     AGAG (−54.96)      ATAT (15.8)
ATCG (−35.88)     AACG (−55.27)      AAGG (15.6)
AGCC (−36.20)     AATC (−56.60)      AGAT (14.5)
AGGC (−36.20)     AATG (−56.60)      AGCT (14.5)
ACCG (−38.27)     AACT (−57.61)      AAAT (13.7)
ACGG (−38.27)     AAGT (−57.61)      AATT (13.7)
CCCG (−40.80)     AAAC (−59.87)      AACT (13.6)
CCGG (−40.80)     AAAT (−64.18)      AAGT (13.6)
AGCG (−40.87)     AATT (−64.18)      AGAG (13.2)
ACGC (−41.36)     AAAG (−64.80)      AAAG (12.4)
CGCG (−48.56)     AAAA (−74.64)      AAAA (11.6)
All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT . . . when P is even, and by the class AATATATAT . . . when P is odd. The lowest level is achieved by the class CGCGCG . . . when P is even, and CCGCGCG . . . when P is odd. For protein deformability, the maximal level is achieved by the class CGCGCG . . . when P is even, and by CCGCGCG . . . when P is odd. The lowest level is associated with poly-A (i.e. (A)^P). Poly-C and poly-A give also the absolute highest and lowest propeller twist angles at all lengths.

Table 8. Trinucleotide structural scale values for repeat unit p = X1 X2 with P = 2. S(p) = S(X1 X2 X1) + S(X2 X1 X2)

Bendability      Position preference
AT (0.364)       AA (72)
AG (0.058)       CG (50)
AC (0.034)       AT (26)
CC (−0.024)      CC (26)
CG (−0.154)      AC (23)
AA (−0.548)      AG (17)

[Figure 2: the 16 dinucleotide nodes (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) arranged on a circle, with the three edges of the CAG cycle labeled 0.175, 0.017 and 0.076.]
Fig. 2. Trinucleotide prefix automaton for the bendability scale. Circle is used for ease of display but does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.
Analysis of DNA repeats by trinucleotide scales
In the case of trinucleotide scales, the prefix automaton
contains 16 nodes (Figure 2), each one labeled with
a different dinucleotide. All paths, including cycles, of
length greater than 16 are composite, i.e. contain at least
one cycle of length 16 or less.
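The structure of the prefix automaton can be made concrete with a short sketch: nodes are the words of length S − 1 and every word of length S is an edge from its prefix to its suffix. The function name and the dictionary layout below are our own choices for illustration.

```python
from itertools import product

def prefix_automaton(k, alphabet="ACGT"):
    """Nodes are (k-1)-mers; each k-mer w is an edge from w[:-1] to w[1:]."""
    nodes = ["".join(t) for t in product(alphabet, repeat=k - 1)]
    edges = {}
    for t in product(alphabet, repeat=k):
        w = "".join(t)
        edges[w] = (w[:-1], w[1:])
    return nodes, edges

nodes, edges = prefix_automaton(3)
print(len(nodes), len(edges))
# The CAG repeat follows the cycle CA -> AG -> GC -> CA.
print(edges["CAG"], edges["AGC"], edges["GCA"])
```

A scale then simply attaches one weight to each edge, and the value of a repeat is the total weight of its cycle.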
The trinucleotide scale values for all repeats with
periodic unit length P = 2 are given in Table 8. The
highest level of bendability is achieved by AT (0.364) and
the lowest by AA (−0.548) and CG (−0.154). The highest
level of position preference is achieved by AA (72) and
CG (50), and the lowest by AG (17).
The trinucleotide scale values for all repeats with
periodic unit length P = 3 are given in Table 9 (see also
Baldi et al., 1999). The highest level of bendability is
achieved by the class AGC (0.268) and the lowest by AAA
(−0.822) and ACC (−0.238). In fact only two classes of
repeats (AGC and ATC) have positive bendability and
are well separated from the rest. The highest level of
position preference is achieved by the class AAA (108)
followed by CCG (72), and the lowest by AGG and
AAC (21). The class AGC, which contains the CAG
repeat responsible for the majority of the known triplet
repeat expansion diseases, has the highest bendability. It
is the only repeat class for which all three shifted triplets
have a high individual bendability. Moreover, this class
has a relatively low position preference value, another
sign of flexibility. Therefore one can hypothesize that
long CAG repeats correspond to stretches of DNA that
are highly flexible in all positions. Consistent with their
high flexibility, CAG/CTG repeats have been found to
have the highest affinity for histones among all possible
triplet repeats (Wang and Griffith, 1994, 1995; Godde
and Wolffe, 1996). Other DNA sequences can adopt long
range curvature only if they contain highly flexible triplets
in phase with the helical pitch (roughly every 10.5 bp).
The flexibility of extended CAG repeats has been verified
experimentally (Chastain and Sinden, 1998). The CCG
class, which contains the disease-related triplets CGG
and GCC, is found at the high (rigid) end of the position
preference scale (72), exceeded only by poly-A. This class
is also stiff according to the bendability scale (−0.106).
This is consistent with the fact that CGG/CCG repeats
seem completely unable to form nucleosomes (Wang et
al., 1996; Godde et al., 1996). The AAG class, which
contains the disease related triplet GAA, occupies the
lower (flexible) end of the position preference scale (27).
It is the second lowest considering that the last two classes
have the same value (21). We also note that AAA/TTT is
by far the stiffest of all possible repeats according to
both scales. Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are bad candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate the access of transcription factors bound nearby (Iyer and Struhl, 1995; Zhu and Thiele, 1996). Interestingly, the sequence of the IT15 gene involved in Huntington's disease has a repeat containing 18 adenine nucleotides at its 3′ end.
Whereas the class CCG is extremely rigid according to the trinucleotide scales, it is extremely flexible according to the dinucleotide scales. Similarly, the predicted flexibility of the AAG class according to the position preference scale is in contradiction with the results obtained using all other di- or trinucleotide scales. Such discrepancies can result from imperfections of the scales, or from the fact that each scale captures a different facet of DNA structure. Dinucleotide and trinucleotide scales are in good agreement for CAG repeats and homopolymeric poly-A tracts.

Table 9. Trinucleotide structural scale values for repeat unit p = X1 X2 X3 with P = 3. S(p) = S(X1 X2 X3) + S(X2 X3 X1) + S(X3 X1 X2). Repeat classes associated with triplet repeat expansion diseases are marked with *

Bendability       Position preference
AGC* (0.268)      AAA (108)
ATC (0.218)       CCG* (72)
AGG (−0.013)      AAT (63)
AAT (−0.030)      ACG (47)
CCC (−0.036)      AGC* (40)
ACG (−0.049)      CCC (39)
ACT (−0.068)      ACT (35)
AAG* (−0.091)     ACC (33)
CCG* (−0.106)     ATC (33)
AAC (−0.196)      AAG* (27)
ACC (−0.238)      AAC (21)
AAA (−0.822)      AGG (21)

The trinucleotide scale values for all repeats with periodic unit length P = 4 are given in Table 10. The highest level of bendability is achieved by the class ATAT (0.728), which is rather a dinucleotide repeat, followed by the proper tetranucleotide repeat ATGC (0.420). On the opposite end of the scale, we find AAAA (−1.096) and AAAC (−0.470). For the position preference scale, AAAA (144), AATT (100), CGCG (100), AAAT (99) are on the higher end, AAAT being the first proper tetranucleotide repeat, while ACGG (23) occupies the lower end of the spectrum.

Table 10. Trinucleotide structural scale values for repeat unit p = X1 X2 X3 X4 with P = 4. S(p) = S(X1 X2 X3) + S(X2 X3 X4) + S(X3 X4 X1) + S(X4 X1 X2)

Bendability       Position preference
ATAT (0.728)      AAAA (144)
ATGC (0.420)      AATT (100)
ACAT (0.335)      CGCG (100)
AGGC (0.301)      AAAT (99)
AGCT (0.214)      CCGG (94)
AGAT (0.189)      AGCG (89)
ACAG (0.183)      AGCT (86)
ACTG (0.173)      CCCG (85)
AGAG (0.116)      AGCC (80)
ACTC (0.082)      ATCG (76)
ACAC (0.068)      AATG (68)
AGCC (0.053)      AGGC (68)
AAGC (0.027)      AAAG (63)
ACCT (0.026)      ACGC (63)
AATG (0.011)      ATGC (62)
ACGC (0.006)      AAAC (57)
ACGT (−0.016)     AACG (57)
AGGG (−0.025)     AACT (55)
AGCG (−0.032)     AATC (54)
CCCC (−0.048)     AAGC (53)
CCGG (−0.058)     ATAT (52)
CCCG (−0.118)     CCCC (52)
AAGG (−0.162)     ACCG (49)
ACGG (−0.169)     AGAT (47)
AAGT (−0.171)     ACAC (46)
AATC (−0.181)     ACCC (46)
ACCG (−0.184)     ACTC (44)
ATCC (−0.209)     AAGT (43)
ATCG (−0.226)     ACAT (43)
AACT (−0.230)     ACCT (40)
ACCC (−0.250)     ATCC (38)
AACG (−0.278)     AGAG (34)
AAAT (−0.304)     AGGG (34)
CGCG (−0.308)     AACC (31)
AAAG (−0.365)     AAGG (31)
AATT (−0.424)     ACTG (29)
AACC (−0.468)     ACGT (28)
AAAC (−0.470)     ACAG (25)
AAAA (−1.096)     ACGG (23)

In order to find the most extreme repeats for a given scale at a given repeat unit length, one would have to explore scale values for repeat units up to length P = 16 (see Section Extremal sequences and automata). Because of particular values of the scale, in some cases the results tabulated above for values of P up to 4 only are sufficient. For instance, the most bendable repeat with P = 2n is always ATATAT . . . , while the least bendable is poly-A. Similarly, the highest value of the position preference scale is always occupied by poly-A. Extremal results can also be derived by dynamic programming. In many cases, however, a sequence of interest may have a very high or low score according to a given scale, without being the most extreme. The probabilistic theory provides the means to quantify directly how extreme any given sequence is with respect to a given family or background.
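The dynamic programming idea can be illustrated on the simplest case: finding a highest-scoring plain (non-circular) sequence of length N for a dinucleotide scale. The sketch below is a generic Viterbi-style recursion written for this illustration; the scale `TOY_SCALE` is made up and the function names are ours. Handling repeats would additionally require closing the cycle, which is not done here.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical dinucleotide scale, made up for illustration only.
TOY_SCALE = {
    a + b: float((3 * i + 5 * j + i * j) % 7)
    for i, a in enumerate(ALPHABET)
    for j, b in enumerate(ALPHABET)
}

def total(scale, seq):
    """Additive scale value of a plain sequence."""
    return sum(scale[seq[i:i + 2]] for i in range(len(seq) - 1))

def best_sequence(scale, n):
    """(score, sequence) maximizing the scale over all sequences of length n >= 1."""
    # best[x] holds the best score and sequence among those ending in letter x.
    best = {x: (0.0, x) for x in ALPHABET}
    for _ in range(n - 1):
        best = {
            y: max((best[x][0] + scale[x + y], best[x][1] + y) for x in ALPHABET)
            for y in ALPHABET
        }
    return max(best.values())

def brute_force(scale, n):
    """Reference maximum obtained by trying all 4**n sequences."""
    return max(total(scale, "".join(t)) for t in product(ALPHABET, repeat=n))

score, seq = best_sequence(TOY_SCALE, 6)
print(score, seq, brute_force(TOY_SCALE, 6))
```

The recursion costs on the order of 16 N operations instead of 4^N, and replacing `max` by `min` gives the lowest-scoring sequence.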
Probabilistic analysis of DNA scales
For simplicity, we first assume a uniform distribution
p A = pC = pG = pT . In specific applications,
other distributions can be used, such as the background
distribution of a given genome or a given class of
DNA sequences. We can then use equations (21)–(24)
to calculate the expectation and variance of S ( p) across
all possible repeat unit patterns p and all scales S . In
particular, E(S ( p)) = αS P and Var(S ( p)) = βS P.
In Table 11 we list the relevant coefficients for the
dinucleotide scales.
[Figure 3 panel annotations, as extracted: P = 5: µ = 0.0923, σ = 0.0771; P = 10: µ = 0.185, σ = 0.154; P = 15: µ = 0.277, σ = 0.231.]

Table 11. Basic intra-scale coefficients for dinucleotide scales with repeat unit length P ≥ (2S − 1) = 3

                         BS        PT        PD
αS = E(Yi)             −8.08    −12.59      4.96
C0 = Var(Yi)            6.62      9.68      9.62
C1 = Cov(Yi, Yi+1)      0.31      2.26     −2.19
βS = Var(S(p))          7.23     14.20      5.23
In Table 12 we list the relevant coefficients for the
trinucleotide scales.
Table 12. Basic intra-scale coefficients for trinucleotide scales with repeat unit length P ≥ (2S − 1) = 5

                         B          PP
αS = E(Yi)             −0.018     13.78
C0 = Var(Yi)            0.015    103.108
C1 = Cov(Yi, Yi+1)      0.0015    18.214
C2 = Cov(Yi, Yi+2)     −0.001     −2.558
βS = Var(S(p))          0.015    134.42
To double-check the mathematical formulas, all the
constants above were also obtained independently by
exhaustive sampling.
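A check of this kind can be sketched as follows for a dinucleotide scale under a uniform distribution: the coefficients α, C0, C1 and β = C0 + 2 C1 are computed from their definitions and compared with the mean and variance of S(p) obtained by enumerating all 4^P repeat units. The scale `TOY_SCALE` is hypothetical and the function names are ours; this is not the authors' program.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical dinucleotide scale, made up for illustration only.
TOY_SCALE = {
    a + b: float((3 * i + 5 * j + i * j) % 7)
    for i, a in enumerate(ALPHABET)
    for j, b in enumerate(ALPHABET)
}

def coefficients(scale):
    """alpha, C0, C1 and beta = C0 + 2*C1 under a uniform letter distribution."""
    pairs = [scale[a + b] for a in ALPHABET for b in ALPHABET]
    alpha = sum(pairs) / len(pairs)
    c0 = sum((y - alpha) ** 2 for y in pairs) / len(pairs)
    # C1 couples two dinucleotides that overlap on one letter.
    triples = list(product(ALPHABET, repeat=3))
    c1 = sum(
        (scale[a + b] - alpha) * (scale[b + c] - alpha) for a, b, c in triples
    ) / len(triples)
    return alpha, c0, c1, c0 + 2 * c1

def exhaustive_moments(scale, p):
    """Mean and variance of S(p) over all 4**p repeat units, read circularly."""
    values = [
        sum(scale[unit[i] + unit[(i + 1) % p]] for i in range(p))
        for unit in product(ALPHABET, repeat=p)
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

alpha, c0, c1, beta = coefficients(TOY_SCALE)
for p in (3, 4, 5):
    mean, var = exhaustive_moments(TOY_SCALE, p)
    print(p, abs(mean - alpha * p) < 1e-9, abs(var - beta * p) < 1e-9)
```

For a dinucleotide scale the agreement holds for every P ≥ 3, in line with the condition P ≥ 2S − 1 of equation (22).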
Convergence to normality. Using sampling methods we
also studied the convergence of S (s) and S ( p) to a normal
distribution as the length N or P of the sequences is
increased, as predicted by our central limit theorem. In
practice, the convergence rate is very fast. As an example,
a histogram of bendability values for repeat units of
length 5, 10 and 15 is given in Figure 3. Similar results
are observed with plain sequences.
Examples of Z-scores for disease triplets. We have seen that the triplets involved in expansion diseases often tend to have extremal structural properties. This was assessed by computing the scores S(p). We can now also compute Z-scores using equations (29) and (30), as in Table 13. When repeat length is taken into consideration, disease-causing repeats appear to be even more extreme because of the √N factor in equation (30). For example, a CGG repeat of length N = 3000 (R = 1000) observed in FRAXA patients is more than 48 standard deviations away from the mean position preference value of random uniform sequences of the same length.

Fig. 3. Histogram of bendability values S(p) for all possible repeat units of length P = 5, 10, 15. Vertical dashed lines represent standard deviation units.

Table 13. Z-scores for the repeats involved in Huntington's disease (HD) and fragile X syndrome (FRAXA) for the bendability (B) and position preference (PP) scales. Z(p) is the Z-score of the repeat unit against the background of all possible repeat units of same length P = 3. Z(r) is the Z-score of a long repeat containing R repeat units against the background of all possible sequences of same length N = RP, including non repetitive sequences. Values of R are chosen at the characteristic low end and high end of each disease

Disease   Triplet   Scale   Z(p)   R (low end)   Z(r)    R (high end)   Z(r)
HD        CAG       B       1.60   36             9.00   121            16.52
FRAXA     CGG       PP      1.56   200           21.48   1000           48.23
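The Z(r) entries for the CGG repeat can be reproduced approximately from equation (30), using the position preference value of the CCG class in Table 9 (72) and the position preference coefficients of Table 12 (α = 13.78, β = 134.42). The sketch below assumes that α′ in equation (30) is the repeat unit value divided by P; the function name is ours, and small differences from Table 13 are expected because the tabulated coefficients are rounded.

```python
import math

def repeat_z_score(unit_value, p, r, alpha, beta):
    """Equation (30) for a repeat of R units of length P, with N = R * P."""
    n = p * r
    alpha_repeat = unit_value / p        # scale value per position of the repeat
    return (alpha_repeat - alpha) * math.sqrt(n) / math.sqrt(beta)

# Position preference: CCG class value from Table 9, coefficients from Table 12.
for r in (200, 1000):
    print(r, round(repeat_z_score(72, 3, r, 13.78, 134.42), 2))
```

Multiplying R by 5 multiplies the Z-score by √5, which matches the ratio of the two Z(r) values listed for FRAXA in Table 13.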
Correlations between scales at short lengths. We can use
equations (31)–(35) to study the correlations between the
scales at short lengths and asymptotically. Clearly we can
also consider AT-content as a scale. It can be viewed, for
instance, as a mononucleotide scale with value 1 for A
and T and 0 for C and G. This scale is trivially reverse-complement invariant and perfectly additive. We include
it to see whether it correlates strongly with any of the
structural scales, especially asymptotically.
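Written out as code, the AT-content scale and its two properties look as follows. This is a minimal sketch with names of our own choosing.

```python
AT_SCALE = {"A": 1, "C": 0, "G": 0, "T": 1}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def at_content(seq):
    """AT-content as a mononucleotide additive scale (count of A and T)."""
    return sum(AT_SCALE[x] for x in seq)

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

seq = "CAGCAGTTA"
# Reverse-complement invariance: A and T are exchanged, C and G are exchanged.
print(at_content(seq), at_content(reverse_complement(seq)))
# Perfect additivity: with windows of length 1 there is no overlap between terms.
print(at_content(seq[:4]) + at_content(seq[4:]))
```

Because the windows have length 1, consecutive terms are independent under an i.i.d. background, so only C0 is non-zero for this scale.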
Correlations at a given position are given in Table 14. Consistent with Baldi et al. (1998) and the results above, the correlations between the structural scales are very low with the exception of PT and PD (0.668). BS and PT have also non-trivial opposite correlations with respect to AT-content (0.48 and −0.54). Here correlations between a dinucleotide scale $\mathcal{S}_1$ and a trinucleotide scale $\mathcal{S}_2$ are computed using sums of the form $\sum_{X_1X_2X_3} \mathcal{S}_1(X_1X_2)\,\mathcal{S}_2(X_1X_2X_3)$. Because the third nucleotide does not appear in the dinucleotide scale, in Baldi et al. (1998) correlations between dinucleotide and trinucleotide scales were also computed using, for the dinucleotide scale, the sum $\mathcal{S}_1(X_1X_2) + \mathcal{S}_1(X_2X_3)$. When considering neighboring dinucleotides, the correlation between BS and PT, for instance, increases its magnitude from −0.294 to −0.550. This effect must be caused by correlations that are present in runs of overlapping dinucleotides, but not in the single dinucleotides. Such a phenomenon may arise if the physical reality behind both scales is that the structure actually depends on more than a dinucleotide step, and this is very likely to be the case.

Table 14. Correlations between the scales at a given position (C0(S1, S2)). AT: AT-content; BS: base stacking; PT: propeller twist; PD: protein deformability; B: bendability; PP: position preference

        AT      BS       PT       PD       B         PP
AT      1       0.478    −0.539   −0.294   −0.0098   0.040
BS              1        −0.294   0.043    −0.018    −0.089
PT                       1        0.668    0.249     −0.123
PD                                1        0.141     −0.036
B                                          1         −0.080
PP                                                   1
Correlations between scales: asymptotic values. If the
same correlations are computed by shuffling the 4.6 Mbp
of the E.coli genome randomly over a length of 31 bp,
one obtains the numbers given in Table 15 (Pedersen et
al., 2000). The correlation between BS and PT is higher
(−0.744) and so is the correlation of AT-content with BS
and PT (0.899 and −0.882). Incidentally, when measured
on the actual E.coli genome the correlations are even
higher. For instance, the correlation between BS and PT
becomes −0.825.
These results are easily explained by the theory developed here. The asymptotic correlations between the scales computed using equation (35) are displayed in Table 16. Because the E.coli genome has a nucleotide distribution close to uniform, the results are indeed remarkably similar to Table 15, and would be identical up to sampling fluctuations if in Table 16 we had used the precise distribution for E.coli (A = 0.2462, C = 0.2542, G = 0.2537, T = 0.2459), instead of a uniform distribution.
It is essential to notice that the asymptotic values do not require very long sequences but are approximately correct already at a length scale of 15–20 base pairs or so (Figure 4). Asymptotically, and with a uniform distribution, all the dinucleotide scales have strong positive or negative correlations with each other and with AT-content. Notice that this is not the case for the trinucleotide scales. It is also not necessarily the case if the correlations are measured with respect to other nucleotide distributions.

Table 15. Correlations between the scales measured over 31 bp random segments from E.coli

        AT      BS       PT       PD       B        PP
AT      1       0.899    −0.882   −0.777   −0.153   0.023
BS              1        −0.744   −0.805   −0.181   −0.029
PT                       1        0.801    0.370    −0.154
PD                                1        0.108    0.062
B                                          1        −0.206
PP                                                  1

Table 16. Asymptotic correlations between the scales using a uniform distribution

        AT      BS       PT       PD       B        PP
AT      1       0.914    −0.891   −0.798   −0.167   0.046
BS              1        −0.757   −0.843   −0.193   −0.003
PT                       1        0.810    0.387    −0.175
PD                                1        0.113    0.051
B                                          1        −0.225
PP                                                  1
Figure 5 shows how the asymptotic correlation of
bendability with AT content varies with the underlying
distribution. Similar surfaces for all other scales are given
in Figure 8 in the appendix. Notice that in general
A/(A + T ) ≈ 0.5 in eukaryotic genomic DNA as
soon as sufficiently long stretches of DNA are taken
into consideration (Chargaff’s second parity rule) (Prabhu,
1993; Bell and Forsdyke, 1999). This is not necessarily
the case with, for instance, relatively short stretches of
DNA, synthetic DNA, or with bacterial DNA that contains
a strong composition skew associated with independently
replicated regions.
Analysis of a set of expandable repeats in primate
genomes
Because triplets involved in disease expansions seem
to have extremal properties which may be related to
the expansion mechanism, it is worth testing whether
this is a fairly general feature of units associated with
tandem repeats. Here we consider the large set of repeat
unit classes derived in Jurka and Pethiyagoda (1995)
corresponding to frequently encountered tandem repeats
of multiple lengths (i.e. that are polymorphic) in primate
genomes. The data contains 501 unique classes of repeat
units ranging in length from P = 1 to 6, classified
into three categories: expandable, weakly expandable,
and non-expandable. These categories were derived from
simple statistical criteria calculated over a subset of
the GenBank database. The first category, in particular,
contains 67 classes of patterns that were found to occur
in tandem repeats of total length N ≥ 12 with R ≥ 3
in at least two different length sizes. The second category
includes 71 pattern classes that are found to occur in
tandem repeats over 12 nucleotides long in only one length
size. The last category contains 363 pattern classes that
were found not to expand beyond 12 nucleotides. For
each pattern class, simple indicators are provided, such as
average length of repeats, relative abundance of class, and
expandability.
In Figure 6 we display the Z-scores for all the
repeat units in the first category, as a function of repeat
unit length (P), computed with respect to repeat units
of the same length using equation (29). The distributions
of repeat classes with respect to each structural scale
are approximately symmetric and normal at all lengths,
showing no clear-cut bias towards extremal values of any
scale. When taking into account the relative abundance
of the classes, as quantified in Jurka and Pethiyagoda
(1995) by using the number of nucleotides occurring
in corresponding repetitive sequences, we nevertheless
observe that the most abundant class (poly-A, 33%)
corresponds to the stiffest repeat at all lengths, for all
scales except BS (Figure 7). It is reasonable to assume
that the structural properties of poly-A are related to
its abundance. The next most frequent repeats, however
(AC = 17.94%, AG = 5.38%, and AAAT = 3.39%),
do not show a clear pattern of extremal values in the
five scales considered. Likewise, we do not find any
obvious correlation between scale values and relative
abundance or expandability indicators provided in Jurka
and Pethiyagoda (1995).
Fig. 4. Rapid convergence of correlations between pairs of scales
to the theoretically predicted asymptotic values as a function of
string length with uniform nucleotide distribution and free boundary
conditions. Curves start at N = 2 for pairs of dinucleotide scales,
and N = 3 for all other pairs. Correlations are calculated exactly
up to N = 12, and using a random uniform sample of 70 000 000
points for N = 17, 22, 27, 32. Horizontal axis: N; vertical axis:
correlation. Curves shown: PT/PD, PT/B, PD/B, PD/PP, BS/PP,
PT/PP, BS/B, B/PP, BS/PT and BS/PD.
Fig. 5. Surface representing the asymptotic correlation between the
bendability scale and the AT content scale for different distribution
values of AT-content and A/T proportions. Horizontal axes: AT
content and A/(A+T) ratio; vertical axis: correlation AT/B.
Discussion
A general framework for sequence analysis in the presence of one or more additive scales has been developed.
The framework solves a number of open issues including:
(1) construction of sequences with extremal properties;
(2) quantitative evaluation of sequences with respect to
a given genomic background; (3) automatic extraction of
extremal sequences and profiles from genomic databases;
(4) rapid convergence to normal distributions when N increases; and (5) complete analysis of correlations between
scales and their rapid convergence towards a fixed asymptotic value. The framework has been applied to DNA sequences and structural scales.
The fundamental requirement for the application of the
framework is the additivity of the scale. This is likely to
be a reasonable approximation for many scales, at least
over relatively short distances. The precise nature of such
distances, however, is an important open question that
ultimately will have to be addressed experimentally. The
structural scales used here should be regarded only as a
first order approximation. The twist angle between bases,
for instance, is likely to depend on more than just the two
neighboring bases (Dickerson, 1992). A better estimate
could be derived using the tetranucleotide consisting of
the two bases before and after the twist angle. Unfortunately, the structure of all possible 256 tetranucleotides
is not known and represents a considerable experimental
challenge. But the methods we have developed are
independent of any particular scale, approximation, or
oligonucleotide length. They are readily applicable to new
scales, tetranucleotide and other, as well as to completely
different scales defined over other alphabets (codons,
RNA, proteins, etc.). Furthermore, the methods are also
applicable in conjunction with computationally-derived
scales that are parameterized and fitted to the data using neural network representations and other statistical
machine learning techniques.
Fig. 6. Distribution of expandable repeat classes over scale values
(one panel per scale: BS, PT, PD, B, PP).
Horizontal axis represents repeat unit length (P). Vertical axis
represents scale value distances from expectation for all possible
repeat units, normalized by the corresponding standard deviation
[see equation (29)]. Dots represent a set of 67 expandable repeat
classes. The number of classes at each length 1–6 is as follows:
2, 4, 9, 18, 18, and 16. Circles represent the most extreme repeat
classes, when considering the full population of repeat patterns at a
given length. Note that at each length P, only proper P-tuple repeats
are represented in the set found in Jurka and Pethiyagoda (1995),
excluding repeats with shorter periodicity. For instance poly-A is
only represented once as a single letter repeat. It is worth noticing
that most extreme positions are actually occupied at almost all
lengths by mono-, di-, or tri-nucleotide repeats. The set of expandable
repeats therefore actually covers the full range of each scale when
taking into account patterns made up of shorter repeated units. Only
5 circles out of 64 remain unmatched by an expandable repeat class.
Fig. 7. Relative abundance of expandable repeat unit classes in
Jurka and Pethiyagoda (1995) as a function of repeat unit length.
The individual contribution of repeat classes totaling more than 1%
relative abundance are shown. Horizontal axis: repeat unit length (P);
vertical axis: relative abundance (%).
The theory presented resolves the issue of correlation
between scales. For each pair of scales there is not a
single correlation number because correlation depends
both on background distribution and window length.
Given a fixed background distribution, the correlations
rapidly converge to a fixed asymptotic value that can be
predicted mathematically. This value is attained here over
a characteristic window length of about 10–15 bp, the
same over which normality is achieved, and corresponding
to a few times the size of the longer of the two
scales being compared. This is the range over which
local statistical fluctuations are stabilized, but it does not
correspond necessarily to the range over which the scales
are additive.
With a uniform background distribution, for instance,
the correlation between propeller twist and base stacking
varies monotonically from −0.294 to −0.757, as window
size is increased from 2 to 10 or so. Thus the increase in
window length significantly changes the measured correlation from slightly negative to substantially negative. An
even more striking example is provided by the correlation
between base stacking and protein deformability, which
varies from 0.043 to −0.843, under the same conditions.
Empirical determination of the proper window length
for additivity and for measuring correlations may not
be easy. It must be noted, however, that these large
variations are observed only with the theoretically-derived
base stacking energy scale (as well as the AT-content
scale). In general, all other empirically-derived scales
exhibit pairwise correlations that do not vary dramatically
with window length and are relatively small (Figure 4),
except for propeller twist and protein deformability. Thus
for the empirically-derived scales the local behavior and
the aggregate behavior over 10–15 bp is quite similar
in terms of correlations, so that the precise selection of
a window length may not be a serious obstacle in this
case.
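The dependence of correlations on window length discussed above can be reproduced for short windows by exhaustive enumeration. The following sketch assumes a uniform background and free boundary conditions, and scores each window by summing the dinucleotide values of Table 17 over its overlapping steps; the function names are illustrative, and the loop stops at N = 7 only to keep the run time small.

```python
from itertools import product
from math import sqrt

# Dinucleotide values copied from Table 17.
BS = {"AA/TT": -5.37, "AC/GT": -10.51, "AG/CT": -6.78, "AT": -6.57,
      "CA/TG": -6.57, "CC/GG": -8.26, "CG": -9.69, "GA/TC": -9.81,
      "GC": -14.59, "TA": -3.82}
PT = {"AA/TT": -18.66, "AC/GT": -13.10, "AG/CT": -14.00, "AT": -15.01,
      "CA/TG": -9.45, "CC/GG": -8.11, "CG": -10.03, "GA/TC": -13.48,
      "GC": -11.08, "TA": -11.85}
PD = {"AA/TT": 2.9, "AC/GT": 2.3, "AG/CT": 2.1, "AT": 1.6,
      "CA/TG": 9.8, "CC/GG": 6.1, "CG": 12.1, "GA/TC": 4.5,
      "GC": 4.0, "TA": 6.3}


def expand(table):
    """Turn 'AC/GT'-style keys into one entry per dinucleotide."""
    return {d: v for key, v in table.items() for d in key.split("/")}


def exact_correlation(scale1, scale2, n):
    """Exact Pearson correlation over all 4**n windows, uniform background."""
    sx = sy = sxx = syy = sxy = 0.0
    total = 0
    for letters in product("ACGT", repeat=n):
        seq = "".join(letters)
        x = sum(scale1[seq[i:i + 2]] for i in range(n - 1))
        y = sum(scale2[seq[i:i + 2]] for i in range(n - 1))
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        total += 1
    mx, my = sx / total, sy / total
    cov = sxy / total - mx * my
    vx = sxx / total - mx * mx
    vy = syy / total - my * my
    return cov / sqrt(vx * vy)


bs, pt, pdef = expand(BS), expand(PT), expand(PD)
for n in range(2, 8):
    print(n,
          round(exact_correlation(bs, pt, n), 3),
          round(exact_correlation(bs, pdef, n), 3))
```

At N = 2 the two columns correspond to the single-dinucleotide correlations quoted in the text for BS/PT and BS/PD; larger N shows how quickly they drift towards the asymptotic regime.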
Propeller twist and protein deformability are highly,
but not perfectly, correlated, over both very short and
intermediate distances. These scales were derived by
crystallography of naked DNA and DNA in complex with
protein, respectively. This suggests that DNA structures
observed in protein–DNA complexes may to some extent
be determined at the DNA-sequence level, or at least that
the structure of DNA in the complex has to be consistent
with the inherent structural features of the naked DNA.
In general, when substantial positive or negative correlation between two scales is observed, two different sets
of conclusions can be drawn. First, from a practical standpoint, it may be simpler and faster in database searches to
use only one of the two scales, since the results provided
by the second one are redundant. Second, high correlation between two completely different experimental approaches attempting to quantify the connection between
DNA sequence and structure can be taken as a sign that
the approaches are measuring the same underlying reality. Thus correlation analysis, in addition to indicating which scale to
use in practice, may tell us something about the interpretation and validity of the scales.
It may at first seem strange that correlations depend
also on background distribution, since structure is a
deterministic function of DNA sequence. In this sense,
a uniform distribution may be the least biased. On the
other hand, if correlations are measured over a large
number of sequences extracted from genomic data, it
is clear that the sequence composition influences the
correlation. Similarly, if the scales are used to pull out
signals against a genomic background, it is important to
take the statistical composition into consideration. In this
respect, it is worth noting that large scale genomic DNA is
characterized by strand invariant compositions ($p_A = p_T$
and $p_C = p_G$) where correlations between empirically-derived scales tend to be smaller in absolute value. The
framework, however, applies as well to compositions that
are not strand invariant.
We have modeled the background distribution using
single nucleotide probabilities but it should be clear that
the same framework can accommodate more complex
probabilistic models, such as Markov models of order
k. In fact, it is interesting to note that with a higher
order Markov model, some of the correlations between
the scales measured in E.coli would be slightly higher.
It is fair to suspect, however, that the structural models
currently available are somewhat noisy and therefore only
marginal gains are to be expected at best from the use of
more refined probabilistic models.
Taken together, all these results indicate that with the
exception of propeller twist and protein deformability,
the empirical scales we have used are by and large
uncorrelated. DNA 3D structure is a complex phenomenon
that cannot be captured locally by a single number, but
rather corresponds to a vector of properties. It is therefore
likely that the scales we have used represent different
attempts at capturing various aspects of DNA structure
from different perspectives. This view is consistent with
the simultaneous presence of predominantly low and
occasionally high correlations between pairs of scales.
In particular, this provides a possible partial explanation
for the differences in interpretation the scales provide for
some of the three extremal triplet classes involved in triplet
repeat expansion diseases.
The CAG-class of repeats is consistently found to
be flexible according to all the models used here and
in agreement with experimental evidence (Chastain and
Sinden, 1998). This class is special among triplet repeats,
being responsible for a large fraction of the currently
known triplet repeat diseases (10 of the 13 mentioned
in Baldi et al. (1999)). Furthermore, in a model study
in E.coli, the CAG triplet repeat was found to be the
predominant genetic expansion product. In this study,
the CAG-class was expanded at least nine times more
frequently than any other triplet (Ohshima et al., 1996).
The CGG-class of repeats, on the other hand, seems
to be very rigid, except for the propeller twist (and thus
protein deformability) scale. Better structural models may
be needed to shed light on such a discrepancy. However, it
is important to remember that the models used in this work
are based on mutually different and also rather indirect
investigations of DNA structure. Any single scale is likely
to capture correctly only some structural features of some
sequence elements. For instance, the enzyme DNase I used
to produce the bendability scale preferentially binds and
cuts sites where DNA is bent or bendable away from
the minor groove. This means that a high DNase I value
can be caused by either a very flexible piece of DNA
(isotropically flexible, or anisotropically flexible in the
right direction), or alternatively by a piece of DNA that
is stiff but curved with a compressed major groove.
The framework derived here can be used to study
how extreme tandem repeats and other sequences are. In
several cases, we find that commonly encountered repeat
units have extremal structural properties. This is the case
for the most common repeat poly-A or poly-T but also for
the repeats involved in triplet repeat expansion diseases. It
is essential to notice that these extremal properties pertain
to the repeat unit class, e.g. the triplet and its shifted
versions, rather than the repeat unit alone. A triplet that
is not extremal for a given scale, may become extremal
once its two shifted versions are considered. For example,
AGC has relatively low bendability when taken alone, but
corresponds to the most bendable class when GCA and
CAG are taken into account. When the large repetition
numbers associated with repeat diseases are also taken
into consideration, the extremality of the corresponding
DNA sequences with respect to the general background
is even more striking. Incidentally, the triplet repeat
class is under-represented amongst primate DNA repeat
expansions (Figure 7) suggesting that special expansion
mechanisms may be at work for P = 1 and P = 3, at
least in a fraction of the cases.
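The point that extremality belongs to the repeat class rather than to a single triplet can be checked directly from the bendability values of Table 18. The sketch below is an illustration, not the authors' code: it groups triplets into classes by cyclic shift and reverse complement and ranks the classes by their mean bendability. The helper names are hypothetical.

```python
# Bendability (B) values copied from Table 18, one key per strand-symmetric pair.
B = {"AAA/TTT": -0.274, "AAC/GTT": -0.205, "AAG/CTT": -0.081, "AAT/ATT": -0.280,
     "ACA/TGT": -0.006, "ACC/GGT": -0.032, "ACG/CGT": -0.033, "ACT/AGT": -0.183,
     "AGA/TCT": 0.027, "AGC/GCT": 0.017, "AGG/CCT": -0.057, "ATA/TAT": 0.182,
     "ATC/GAT": -0.110, "ATG/CAT": 0.134, "CAA/TTG": 0.015, "CAC/GTG": 0.040,
     "CAG/CTG": 0.175, "CCA/TGG": -0.246, "CCC/GGG": -0.012, "CCG/CGG": -0.136,
     "CGA/TCG": -0.003, "CGC/GCG": -0.077, "CTA/TAG": 0.090, "CTC/GAG": 0.031,
     "GAA/TTC": -0.037, "GAC/GTC": -0.013, "GCA/TGC": 0.076, "GCC/GGC": 0.107,
     "GGA/TCC": 0.013, "GTA/TAC": 0.025, "TAA/TTA": 0.068, "TCA/TGA": 0.194}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(unit):
    """Reverse complement of a DNA word."""
    return unit.translate(COMPLEMENT)[::-1]


def repeat_class(unit):
    """All cyclic shifts of the unit and of its reverse complement."""
    members = set()
    for word in (unit, revcomp(unit)):
        for k in range(len(word)):
            members.add(word[k:] + word[:k])
    return frozenset(members)


# One bendability value per triplet (64 entries).
bend = {t: v for key, v in B.items() for t in key.split("/")}
classes = {repeat_class(t) for t in bend}


def class_mean(cls):
    """Mean bendability over the members of a repeat class."""
    return sum(bend[t] for t in cls) / len(cls)


ranking = sorted(classes, key=class_mean, reverse=True)
for cls in ranking[:3]:
    print(sorted(cls), round(class_mean(cls), 4))
print("AGC alone:", bend["AGC"])
```

Averaging over a class is used here because a long tandem repeat visits every shifted version of its unit equally often, so the per-unit contribution to an additive score is the class mean.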
How the expansion of disease causing triplets occurs,
as well as several related puzzling questions such as why
expansion frequency depends on repeat length, remain
poorly understood although it is widely assumed that
unusual structural characteristics of the repeats may play
a role. Several models have been proposed involving
base-slippage and alternative DNA structures during
DNA replication, recombination, and repair (Wells,
1996; Pearson and Sinden, 1998a,b; Moore et al., 1999).
Growing evidence indicates that the formation of hairpins
in Okazaki fragments (during replication of the lagging
strand) is probably involved in the expansion process
(Chen et al., 1995; Gacy et al., 1995; Wells, 1996;
Mariappan et al., 1998; Miret et al., 1998; Pearson and
Sinden, 1998b). But many questions remain open to
interpretation.
Our analysis of a large set of tandem repeats from
primate genomes reveals that many repeat units do not
have salient structural extreme properties according to the
models used here. The results suggest that tandem repeats
are likely to belong to different classes and result from a
variety of different mechanisms not all of which involve
extremal structural sequences. This is not to say that
structural signatures, rather than extreme patterns, may not
be involved in other cases as suggested, for instance, in
Liao et al. (2000). Further evidence for such a possibility
is provided by the fact that among the most frequently
expanded repeat classes with 3 < P < 7, substrings of
mono repeats (particularly poly-A which is stiff) seem to
be present almost always (Figure 7).
Although some of the equations derived are for exact
repeats, it should be clear that the framework applies
immediately to situations where the repeat is not perfect,
either because of small variations in the sequence or
because the repetition number R contains a fractional
component. Cases are described in the literature (Orr et
al., 1993, for instance), where the G of a few isolated CAG
triplets within a long CAG repeat region are replaced by
a T. Of interest, the CAT triplet belongs to the second
highest bendability class, and therefore the flexibility
properties of such stretches are likely to be preserved.
More generally, the scales could perhaps form the basis of
new alignment penalties in cases where structure, rather
than sequence, is preserved.
The methods developed here can be applied more systematically to other repeats including telomeric repeats,
non-triplet disease-causing repeats, as well as to database
searches for new putative disease-causing repeat classes
(Kleiderlein et al., 1998; Baldi et al., 1999). For instance,
a repeated twelve-mer upstream of the EPM1 gene displays intergenerational instability and has been associated
with myoclonus epilepsy (Lalioti et al., 1997, 1998). A
similarly unstable, AT-rich, 42 bp-repeat is involved in the
fragile site FRA10B (Hewett et al., 1998). In addition, a
particular form of muscular dystrophy (FSHD) seems to
be caused by DNA contraction, rather than expansion. In
FSHD, the repeat units are surprisingly long (3.3 kb), and
located at the tip of the long arm of chromosome IV. The
units are repeated 30–40 times in normal individuals, and
reduced down to eight repeats in affected individuals (van
Deutekom et al., 1993; Winokur et al., 1994; Hewitt et al.,
1994). Statistical analysis of long repeat units may benefit
even more from the techniques developed here.
Finally, these techniques can be applied to repeats
associated with interspersed elements as well, and, more
broadly, to the analysis of structural signals, across
entire genomes (Pedersen et al., 2000), or associated
with specific regions such as regulatory regions, protein
binding sites, SARs (scaffold attachment regions), or
polytene bands in Drosophila. The methods are also being
applied to phylogenetic questions and to whether DNA
structure may have had any influence on the origin of
the genetic code. While all these problems would benefit
from improved structural models, the methods are now in
place to work in conjunction with any new scale that may
become available in the near future with progress in DNA
experimental techniques.
Acknowledgements
The work of PB was initially supported by an NIH SBIR
grant to Net-ID, Inc., and currently by a Laurel Wilkening
Faculty Innovation award at UCI. We would like to thank
Anders Gorm Pedersen and David Ussery for comments
on the manuscript.
Appendix
Dinucleotide and trinucleotide scales
The three dinucleotide scales (Table 17) and two trinucleotide scales (Table 18) used in the text.
Table 17. Dinucleotide scales

Pair       BS        PT        PD
AA/TT     −5.37    −18.66     2.9
AC/GT    −10.51    −13.10     2.3
AG/CT     −6.78    −14.00     2.1
AT        −6.57    −15.01     1.6
CA/TG     −6.57     −9.45     9.8
CC/GG     −8.26     −8.11     6.1
CG        −9.69    −10.03    12.1
GA/TC     −9.81    −13.48     4.5
GC       −14.59    −11.08     4.0
TA        −3.82    −11.85     6.3

Table 18. Trinucleotide scales

Triplet      B        PP
AAA/TTT   −0.274     36
AAC/GTT   −0.205      6
AAG/CTT   −0.081      6
AAT/ATT   −0.280     30
ACA/TGT   −0.006      6
ACC/GGT   −0.032      8
ACG/CGT   −0.033      8
ACT/AGT   −0.183     11
AGA/TCT    0.027      9
AGC/GCT    0.017     25
AGG/CCT   −0.057      8
ATA/TAT    0.182     13
ATC/GAT   −0.110      7
ATG/CAT    0.134     18
CAA/TTG    0.015      9
CAC/GTG    0.040     17
CAG/CTG    0.175      2
CCA/TGG   −0.246      8
CCC/GGG   −0.012     13
CCG/CGG   −0.136      2
CGA/TCG   −0.003     31
CGC/GCG   −0.077     25
CTA/TAG    0.090     18
CTC/GAG    0.031      8
GAA/TTC   −0.037     12
GAC/GTC   −0.013      8
GCA/TGC    0.076     13
GCC/GGC    0.107     45
GGA/TCC    0.013      5
GTA/TAC    0.025      6
TAA/TTA    0.068     20
TCA/TGA    0.194      8

Table 19. Repeat class for each trinucleotide (N = 3). Classes are numbered
in alphabetical order of their first member

Triplet Class   Triplet Class   Triplet Class   Triplet Class
AAA      1      AAC      2      AAG      3      AAT      4
ACA      2      ACC      5      ACG      6      ACT      7
AGA      3      AGC      8      AGG      9      AGT      7
ATA      4      ATC     10      ATG     10      ATT      4
CAA      2      CAC      5      CAG      8      CAT     10
CCA      5      CCC     11      CCG     12      CCT      9
CGA      6      CGC     12      CGG     12      CGT      6
CTA      7      CTC      9      CTG      8      CTT      3
GAA      3      GAC      6      GAG      9      GAT     10
GCA      8      GCC     12      GCG     12      GCT      8
GGA      9      GGC     12      GGG     11      GGT      5
GTA      7      GTC      6      GTG      5      GTT      2
TAA      4      TAC      7      TAG      7      TAT      4
TCA     10      TCC      9      TCG      6      TCT      3
TGA     10      TGC      8      TGG      5      TGT      2
TTA      4      TTC      3      TTG      2      TTT      1

Enumeration of equivalence classes
Polya’s enumeration theory solves a number of combinatorial questions related to the action of groups on sets. The
relevant set here is the set $B$ of all words of length $P$ over
the alphabet, which has $A^P$ elements, where $A$ is the alphabet size. The group $G$ of interest
is a subgroup of the group of all permutations of $B$. If
we consider circular permutations on one strand only, it is
the circular group with $P$ elements, generated by the single right shift operator $\rho$. If we consider also the reverse
complement, then the group $G$ is easily described as the
set of all permutations of the form $\rho^k \gamma^l$, where $\gamma$ denotes
the reverse complement permutation, with $0 \le k \le P-1$,
and $l = 0$ or 1. Notice that this group is not commutative.
However, for $l = 1$, $\rho^k \gamma^l = \gamma^l \rho^{P-k}$.
In both cases, $G$ acts on $B$ and an equivalence relation
$\rho_1 \equiv \rho_2$ between elements of $B$ is defined if and only if there exists a $g \in G$
such that $\rho_2 = g\rho_1$. The total number of classes or orbits
is given by the Burnside lemma and equal to:
$$\frac{1}{|G|} \sum_{g \in G} |B_g| \qquad (37)$$
where $B_g = \{x \in B \mid g(x) = x\}$ is the set of all the
elements of $B$ that are fixed by $g$.
We thus need to study $B_{\rho^k}$ and $B_{\rho^k \gamma}$. A case by case
inspection easily shows that, with $(k, P)$ denoting the greatest common divisor of $k$ and $P$:
• If $(k, P) = 1$, then $|B_{\rho^k}| = A$ since only sequences of
one repeated letter are stable.
• If $k|P$, then $|B_{\rho^k}| = A^k$.
• If $(k, P) = l$, then $|B_{\rho^k}| = A^l$.
When $G$ is the group of cyclic permutations, $|G| = P$.
Putting these results together gives the right hand side of
equation (7), which is equivalent to the left-hand side after
some algebra.
When we take into account the reverse complement, we
have $|G| = 2P$. A case by case inspection again shows
that:
• If $P$ is odd, then $B_{\rho^k \gamma}$ is empty.
• If $P$ is even, then $B_{\rho^k \gamma}$, when it is not empty, has $P/2$ degrees of freedom
and therefore $|B_{\rho^k \gamma}| = A^{P/2}$.
This yields immediately the formula in equations (9)
and (10). If needed, it is also straightforward to count the
number of elements inside each type of equivalence class.
Repeat classes for N = 3 and N = 4
Repeat classes for each trinucleotide (Table 19) and each
tetranucleotide (Table 20).

Table 20. Repeat class for each tetranucleotide (N = 4). Classes are numbered in alphabetical order of their first member

Quadruplet Class   Quadruplet Class   Quadruplet Class   Quadruplet Class
AAAA        1      AAAC        2      AAAG        3      AAAT        4
AACA        2      AACC        5      AACG        6      AACT        7
AAGA        3      AAGC        8      AAGG        9      AAGT       10
AATA        4      AATC       11      AATG       12      AATT       13
ACAA        2      ACAC       14      ACAG       15      ACAT       16
ACCA        5      ACCC       17      ACCG       18      ACCT       19
ACGA        6      ACGC       20      ACGG       21      ACGT       22
ACTA        7      ACTC       23      ACTG       24      ACTT       10
AGAA        3      AGAC       15      AGAG       25      AGAT       26
AGCA        8      AGCC       27      AGCG       28      AGCT       29
AGGA        9      AGGC       30      AGGG       31      AGGT       19
AGTA       10      AGTC       24      AGTG       23      AGTT        7
ATAA        4      ATAC       16      ATAG       26      ATAT       32
ATCA       11      ATCC       33      ATCG       34      ATCT       26
ATGA       12      ATGC       35      ATGG       33      ATGT       16
ATTA       13      ATTC       12      ATTG       11      ATTT        4
CAAA        2      CAAC        5      CAAG        8      CAAT       11
CACA       14      CACC       17      CACG       20      CACT       23
CAGA       15      CAGC       27      CAGG       30      CAGT       24
CATA       16      CATC       33      CATG       35      CATT       12
CCAA        5      CCAC       17      CCAG       27      CCAT       33
CCCA       17      CCCC       36      CCCG       37      CCCT       31
CCGA       18      CCGC       37      CCGG       38      CCGT       21
CCTA       19      CCTC       31      CCTG       30      CCTT        9
CGAA        6      CGAC       18      CGAG       28      CGAT       34
CGCA       20      CGCC       37      CGCG       39      CGCT       28
CGGA       21      CGGC       38      CGGG       37      CGGT       18
CGTA       22      CGTC       21      CGTG       20      CGTT        6
CTAA        7      CTAC       19      CTAG       29      CTAT       26
CTCA       23      CTCC       31      CTCG       28      CTCT       25
CTGA       24      CTGC       30      CTGG       27      CTGT       15
CTTA       10      CTTC        9      CTTG        8      CTTT        3
GAAA        3      GAAC        6      GAAG        9      GAAT       12
GACA       15      GACC       18      GACG       21      GACT       24
GAGA       25      GAGC       28      GAGG       31      GAGT       23
GATA       26      GATC       34      GATG       33      GATT       11
GCAA        8      GCAC       20      GCAG       30      GCAT       35
GCCA       27      GCCC       37      GCCG       38      GCCT       30
GCGA       28      GCGC       39      GCGG       37      GCGT       20
GCTA       29      GCTC       28      GCTG       27      GCTT        8
GGAA        9      GGAC       21      GGAG       31      GGAT       33
GGCA       30      GGCC       38      GGCG       37      GGCT       27
GGGA       31      GGGC       37      GGGG       36      GGGT       17
GGTA       19      GGTC       18      GGTG       17      GGTT        5
GTAA       10      GTAC       22      GTAG       19      GTAT       16
GTCA       24      GTCC       21      GTCG       18      GTCT       15
GTGA       23      GTGC       20      GTGG       17      GTGT       14
GTTA        7      GTTC        6      GTTG        5      GTTT        2
TAAA        4      TAAC        7      TAAG       10      TAAT       13
TACA       16      TACC       19      TACG       22      TACT       10
TAGA       26      TAGC       29      TAGG       19      TAGT        7
TATA       32      TATC       26      TATG       16      TATT        4
TCAA       11      TCAC       23      TCAG       24      TCAT       12
TCCA       33      TCCC       31      TCCG       21      TCCT        9
TCGA       34      TCGC       28      TCGG       18      TCGT        6
TCTA       26      TCTC       25      TCTG       15      TCTT        3
TGAA       12      TGAC       24      TGAG       23      TGAT       11
TGCA       35      TGCC       30      TGCG       20      TGCT        8
TGGA       33      TGGC       27      TGGG       17      TGGT        5
TGTA       16      TGTC       15      TGTG       14      TGTT        2
TTAA       13      TTAC       10      TTAG        7      TTAT        4
TTCA       12      TTCC        9      TTCG        6      TTCT        3
TTGA       11      TTGC        8      TTGG        5      TTGT        2
TTTA        4      TTTC        3      TTTG        2      TTTT        1
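The number of repeat classes for N = 3 and N = 4 can be cross-checked against the Burnside lemma by brute force. The sketch below is only an illustrative check, with hypothetical function names: it enumerates all words over the DNA alphabet, applies every group element $\rho^k \gamma^l$ explicitly, and counts orbits both directly and through the average number of fixed points.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def shift(word, k):
    """Right cyclic shift of a word by k positions (the operator rho^k)."""
    k %= len(word)
    return word[-k:] + word[:-k] if k else word


def revcomp(word):
    """Reverse complement of a word (the operator gamma)."""
    return word.translate(COMPLEMENT)[::-1]


def group_images(word, with_revcomp=True):
    """Image of a word under each group element, one list entry per element."""
    length = len(word)
    images = [shift(word, k) for k in range(length)]
    if with_revcomp:
        images += [shift(revcomp(word), k) for k in range(length)]
    return images


def all_words(length):
    return ["".join(w) for w in product("ACGT", repeat=length)]


def count_orbits(length, with_revcomp=True):
    """Direct count: label each orbit by its alphabetically first member."""
    return len({min(group_images(w, with_revcomp)) for w in all_words(length)})


def burnside(length, with_revcomp=True):
    """Burnside count: total number of (g, x) pairs with g(x) = x, over |G|."""
    group_size = 2 * length if with_revcomp else length
    fixed = sum(1 for w in all_words(length)
                for image in group_images(w, with_revcomp) if image == w)
    return fixed // group_size


for p in (3, 4):
    print(p, count_orbits(p), burnside(p), count_orbits(p, with_revcomp=False))
```

The direct count and the Burnside count should agree with the highest class numbers appearing in Tables 19 and 20; the last column gives the corresponding count when only cyclic shifts on one strand are considered.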
Correlations as a function of background distribution
Examples of surfaces of correlations between the AT-content scale and the other scales as a function of
background distribution are given in Figure 8. Similar
surfaces can be obtained for any pair of scales.
Fig. 8. Surface representing the asymptotic correlation between AT-content and all the other scales for different background distributions. Panels: AT/BS, AT/PT, AT/PD and AT/PP; horizontal axes: AT content and A/(A+T) ratio; vertical axis: correlation.
References
Ashley,C.T. and Warren,S.T. (1995) Trinucleotide repeat expansion
and human disease. Annu. Rev. Genet., 29, 703–728.
Baldi,P. and Rinott,Y. (1989) On normal approximations of distributions in terms of dependency graphs. Ann. Prob., 17, 1646–1650.
Baldi,P., Brunak,S., Chauvin,Y. and Krogh,A. (1996) Naturally
occurring nucleosome positioning signals in human exons and
introns. J. Mol. Biol., 263, 503–510.
Baldi,P., Chauvin,Y., Pedersen,A.G. and Brunak,S. (1998) Computational applications of DNA structural scales. In Proceedings of
the 1998 Conference on Intelligent Systems for Molecular Biology (ISMB98) The AAAI Press, Menlo Park, CA, pp. 35–42.
Baldi,P., Brunak,S., Chauvin,Y. and Pedersen,A.G. (1999) Structural basis for triplet repeat disorders: a computational analysis.
Bioinformatics, 15, 918–929.
Bell,S.J. and Forsdyke,D.R. (1999) Accounting units in DNA. J.
Theor. Biol., 197, 51–61.
Benson,G. (1999) Tandem repeats finder: a program to analyze
DNA sequences. Nucleic Acids Res., 27, 573–580.
Benson,G. and Waterman,M.S. (1994) A method for fast database
search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828–
4836.
Blanchard,M.K., Chiapello,H. and Coward,E. (2000) Detecting
localized repeats in genomic sequences: a new strategy and
its applications to Bacillus subtilis and Arabidopsis thaliana
sequences. Comput. Chem., 24, 57–70.
Breslauer,K.J., Frank,R., Blocker,H. and Marky,L.A. (1986) Predicting DNA duplex stability from the base sequence. Proc. Natl.
Acad. Sci. USA, 83, 3746–3750.
Brukner,I., Sánchez,R., Suck,D. and Pongor,S. (1995) Sequence-dependent
bending propensity of DNA as revealed by DNase I:
parameters for trinucleotides. EMBO J., 14, 1812–1818.
Calladine,C.R., Drew,H.R. and McCall,M.J. (1988) The intrinsic
structure of DNA in solution. J. Mol. Biol., 201, 127–137.
Campuzano,V., Montermini,L., Molto,M.D., Pianese,L., Cossee,M.,
Cavalcanti,F., Montos,E., Rodius,F., Duclos,F., Monticelli,A.,
Zara,F., Canizares,J., Koutnikova,H., Bidichandani,S.I.,
Gellera,C., Brice,A., Trouillaas,P., Michele,G.D., Filla,A.,
Frutos,R.D., Palau,F., Patel,P.I., Donato,S.D., Mandel,J.L.,
Cocozza,S., Koenig,M. and Pandolfo,M. (1996) Friedreich’s
ataxia: autosomal recessive disease caused by an intronic GAA
triplet repeat expansion. Science, 271, 1423–1427.
Chastain,P.D. and Sinden,R.R. (1998) CTG repeats associated with
human genetic disease are inherently flexible. J. Mol. Biol., 275,
405–411.
Chen,X., Mariappan,S.V.S., Catasti,P., Ratliff,R., Moyzis,R.K.,
Ali,L., Smith,S.S., Bradbury,E.M. and Gupta,G. (1995) Hairpins
are formed by the single DNA strands of the fragile X triplet
repeats: structure and biological implications. Proc. Natl. Acad.
Sci. USA, 92, 5199–5203.
van Deutekom,J.T., Wijmenga,C., van Tienhoven,E.A.E.,
Gruter,A.M., Hewitt,J.E., Padberg,G.W., van Ommen,G.J.B.,
Hofker,M.H. and Frants,R.R. (1993) FSHD associated DNA
rearrangements are due to deletions of integral copies of a 3.3 kb
tandemly repeated unit. Hum. Mol. Genet., 2, 2037–2042.
Dickerson,R.E. (1992) DNA structure from A to Z. Meth. Enzymol.,
211, 67–111.
Drew,H.R. and Travers,A.A. (1985) DNA bending and its relation
to nucleosome positioning. J. Mol. Biol., 186, 773–790.
Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological
sequence analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
Eichler,E.E. and Nelson,D.L. (1998) The FRAXA fragile site and
fragile X syndrome. In Rubinsztein,D.C. and Hayden,M.R. (eds),
Analysis of Triplet Repeat Disorders BIOS, Oxford, pp. 13–50.
Feller,W. (1971) An Introduction to Probability Theory and its
Applications. vol. 2, 2nd edn, Wiley, New York.
Fye,R.M. and Benham,C.J. (1999) Exact method for numerically
analyzing a model of local denaturation in superhelically stressed
DNA. Phys. Rev. E, 59, 3408–3426.
Gacy,A.M., Goellner,G., Juranic,N., Macura,S. and McMurray,C.T.
(1995) Trinucleotide repeats that expand in human disease
form hairpin structures in vitro. Cell, 81, 533–540.
Gecz,J. and Mulley,J.C. (1999) Characterisation and expression of
a large, 13.7 kb fmr2 isoform. Eur. J. Hum. Genet., 7, 157–162.
Godde,J.S. and Wolffe,A.P. (1996) Nucleosome assembly on CTG
triplet repeats. J. Biol. Chem., 271, 15222–15229.
Godde,J.S., Kass,S.U., Hirst,M.C. and Wolffe,A.P. (1996) Nucleosome assembly on methylated CGG triplet repeats in the fragile X mental retardation gene 1 promoter. J. Biol. Chem., 271,
24325–24328.
Goodsell,D.S. and Dickerson,R.E. (1994) Bending and curvature
calculations in B-DNA. Nucleic Acids Res., 22, 5497–5503.
Grove,A., Galeone,A., Mayol,L. and Geiduschek,E.P. (1996) Localized DNA flexibility contributes to target site selection by DNA-bending proteins. J. Mol. Biol., 260, 120–125.
Gusella,J.F. and MacDonald,M.E. (1996) Trinucleotide instability:
a repeating theme in human inherited disorders. Annu. Rev. Med.,
47, 201–209.
Hardy,J. and Gwinn-Hardy,K. (1998) Genetic classification of
primary neurodegenerative disease. Science, 282, 1075–1079.
Hassan,M.A.E. and Calladine,C.R. (1996) Propeller-twisting of
base-pairs and the conformational mobility of dinucleotide steps
in DNA. J. Mol. Biol., 259, 95–103.
Hewett,D.R., Handt,O., Hobson,L., Mangelsdorf,M., Eyre,H.J.,
Baker,E., Sutherland,G.R., Schuffenhauer,S., Mao,J.I. and
Richards,R.I. (1998) FRA10B structure reveals common elements in repeat expansion and chromosomal fragile site genesis.
Mol. Cell, 1, 773–781.
Hewitt,J.E., Lyle,R., Clark,L.N., Valleley,E.M., Wright,T.J., Wijmenga,C., van Deutekom,J.C., Francis,F., Sharpe,P.T. and
Hofker,M.H. et al. (1994) Analysis of the tandem repeat locus
D4Z4 associated with facioscapulohumeral muscular dystrophy.
Hum. Mol. Genet., 3, 1287–1295.
Hunter,C.A. (1996) Sequence-dependent DNA structure. Bioessays,
18, 157–162.
Iyer,V. and Struhl,K. (1995) Poly (dA:dT), a ubiquitous promoter
element that stimulates transcription via its intrinsic DNA
structure. EMBO J., 14, 2570–2579.
Jeffreys,A.J. (1997) Spontaneous and induced minisatellite instability in the human genome. Clin. Sci., 93, 383–390.
Junck,L. and Fink,J.K. (1996) Machado-Joseph disease and SCA3:
the genotype meets the phenotypes. Neurology, 46, 4–8.
Jurka,J. (1998) Repeats in genomic DNA: mining and meaning.
Curr. Opin. Struct. Biol., 8, 333–337.
Jurka,J. and Pethiyagoda,C. (1995) Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol.,
40, 120–126.
Jurka,J., Walichiewicz,J. and Milosavljevic,A. (1992) Prototypic
sequences for human repetitive DNA. J. Mol. Evol., 35, 286–291.
Kleiderlein,J.J., Nisson,P.E., Jessee,J., Li,W., Becker,K.G.,
Derby,M.L., Ross,C.A. and Margolis,R.L. (1998) CCG repeats
in cDNAs from human brain. Hum. Genet., 103, 666–673.
Koenig,M. (1998) Friedreich’s ataxia. In Rubinsztein,D.C. and
Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders BIOS,
Oxford, pp. 219–238.
Lahm,A. and Suck,D. (1991) DNase I-induced DNA conformation:
2 Å structure of a DNase I-octamer complex. J. Mol. Biol., 222,
645–667.
Lalioti,M.D., Scott,H.S., Buresi,C., Rossier,C., Bottani,A., Morris,M.A., Malafosse,A. and Antonarakis,S.E. (1997) Dodecamer
repeat expansion in cystatin B gene in progressive myoclonus
epilepsy. Nature, 386, 847–851.
Lalioti,M.D., Scott,H.S., Genton,P., Grid,D., Ouazzani,R.,
M’Rabet,A., Ibrahim,S., Gouider,R., Dravet,C., Chkili,T.,
Bottani,A., Buresi,C., Malafosse,A. and Antonarakis,S.E.
(1998) A PCR amplification method reveals instability of the dodecamer repeat in progressive myoclonus epilepsy (EPM1)
and no correlation between the size of the repeat and age at
onset. Am. J. Hum. Genet., 62, 842–847.
Lee,C.C. (1998) Spinocerebellar ataxia type 6 (SCA6). In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat
Disorders BIOS, Oxford, pp. 145–154.
Liao,G., Rehm,E.J. and Rubin,G.M. (2000) Insertion site preferences of the P transposable element in Drosophila melanogaster.
Proc. Natl. Acad. Sci. USA, 97, 3347–3351.
Liu,K. and Stein,A. (1997) DNA sequence encodes information for
nucleosome array formation. J. Mol. Biol., 270, 559–573.
Lu,Q., Wallrath,L.L. and Elgin,S.C.R. (1994) Nucleosome positioning and gene regulation. J. Cell. Biochem., 55, 83–92.
Mariappan,S.V.S., Silks III,L.A., Bradbury,E.M. and Gupta,G.
(1998) Fragile X DNA triplet repeats, (GCC)n, form hairpins with single hydrogen-bonded cytosine·cytosine mispairs at
the CpG sites: isotope-edited nuclear magnetic resonance spectroscopy on (GCC)n with selective 15N4-labeled cytosine bases.
J. Mol. Biol., 283, 111–120.
Milosavljevic,A. and Jurka,J. (1993) Discovering simple DNA
sequences by the algorithmic significance method. CABIOS, 9,
407–411.
Miret,J.J., Pessoa-Brandao,L. and Lahue,R.S. (1998) Orientation-dependent and sequence-specific expansions of CTG/CAG trinucleotide repeats in Saccharomyces cerevisiae. Proc. Natl. Acad.
Sci. USA, 95, 12438–12443.
Moore,H., Greenwell,P.W., Liu,C.P., Arnheim,N. and Petes,T.D.
(1999) Triplet repeats form secondary structures that escape DNA repair in yeast. Proc. Natl. Acad. Sci. USA, 96, 1504–
1509.
Nelson,D.L. (1995) The fragile X syndrome. Semin. Cell Biol., 6,
5–11.
Nelson,H.C.M., Finch,J.T., Luisi,B.F. and Klug,A. (1987) The
structure of an oligo(dA)-oligo(dT) tract and its biological
implications. Nature, 330, 221–226.
Ohshima,K., Kang,S. and Wells,R.D. (1996) CTG triplet repeats
from human hereditary diseases are dominant genetic expansion
products in Escherichia coli. J. Biol. Chem., 271, 1853–1856.
Olson,W.K., Gorin,A.A., Lu,X., Hock,L.M. and Zhurkin,V.B.
(1998) DNA sequence-dependent deformability deduced
from protein–DNA crystal complexes. Proc. Natl. Acad. Sci.
USA, 95, 11 163–11 168.
Ornstein,R.L., Rein,R., Breen,D.L. and MacElroy,R.D. (1978) An
optimised potential function for the calculation of nucleic acid
interaction energies. I. Base stacking. Biopolymers, 17, 2341–
2360.
Orr,H.T. and Zoghbi,H.Y. (1998) Polyglutamine tract vs. protein
context in SCA1 pathogenesis. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders BIOS, Oxford, pp. 105–118.
Orr,H.T., Chung,M., Banfi,S., Kwiatkowski,T.J.,Jr., Servadio,A., Beaudet,A.L.,
McCall,A.E., Duvick,L.A., Ranum,L.P.W. and Zoghbi,H.Y.
(1993) Expansion of an unstable trinucleotide CAG repeat
in spinocerebellar ataxia type 1. Nature Genet., 4, 221–226.
Parvin,J.D., McCormick,R.J., Sharp,P.A. and Fisher,D.E. (1995)
Pre-bending of a promoter sequence enhances affinity for the
TATA-binding factor. Nature, 373, 724–727.
Paulson,H.L. (1998) Spinocerebellar ataxia type 3/Machado-Joseph
disease. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis
of Triplet Repeat Disorders BIOS, Oxford, pp. 129–144.
Paulson,H.L., Perez,M.K., Trottier,Y., Trojanowski,J.Q., Subramony,S.H., Das,S.S., Vig,P., Mandel,J.L., Fischbeck,K.H. and
Pittman,R.N. (1997) Intranuclear inclusions of expanded polyglutamine protein in spinocerebellar ataxia type 3. Neuron, 19,
333–344.
Pazin,M.J. and Kadonaga,J.T. (1997) SWI2/SNF2 and related proteins: ATP-driven motors that disrupt protein–DNA interactions?
Cell, 88, 737–740.
Pearson,C.E. and Sinden,R.R. (1998a) Slipped strand DNA, dynamic mutations and human disease. In Wells,R.D. and Warren,S.T. (eds), Genetic Instabilities and Hereditary Neurological
Diseases Academic Press, New York, pp. 585–621.
Pearson,C.E. and Sinden,R.R. (1998b) Trinucleotide repeat DNA
structures: dynamic mutations from dynamic DNA. Curr. Opin.
Struct. Biol., 8, 321–330.
Pedersen,A.G., Baldi,P., Brunak,S. and Chauvin,Y. (1998) DNA
structure in human RNA polymerase II promoters. J. Mol. Biol.,
281, 663–673.
Pedersen,A.G., Jensen,L.J., Brunak,S., Staerfeldt,H.H. and
Ussery,D.W. (2000) A DNA structural atlas for Escherichia coli.
J. Mol. Biol., 299, 907–930.
Ponomarenko,M.P., Ponomarenko,J.V., Frolov,A.S.,
Podkolodny,N.L., Savinkova,L.K., Kolchanov,N.A. and Overton,G.C. (1999) Identification of sequence-dependent DNA
sites interacting with proteins. Bioinformatics, 15, 687.
Prabhu,V.V. (1993) Symmetry observations in long nucleotide
sequences. Nucleic Acids Res., 21, 2797–2800.
Pulst,S.-M. (1998) Spinocerebellar ataxia type 2. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat
Disorders BIOS, Oxford, pp. 119–128.
Rinott,Y. and Dembo,A. (1996) Some examples of normal approximations by Stein’s method. In Aldous,D. and Pemantle,R. (eds),
Random Discrete Structures Springer, New York, pp. 25–44.
Ross,C.A. (1995) When more is less: pathogenesis of glutamine
repeat neurodegenerative diseases. Neuron, 15, 493–496.
Rubinsztein,D.C. and Amos,B. (1998) Trinucleotide repeat mutation processes. In Rubinsztein,D.C. and Hayden,M.R. (eds),
Analysis of Triplet Repeat Disorders BIOS, Oxford, pp. 257–268.
Rubinsztein,D.C. and Hayden,M.R. (1998) Introduction. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat
Disorders BIOS, Oxford, pp. 1–12.
Satchwell,S.C., Drew,H.R. and Travers,A.A. (1986) Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191,
659–675.
Sheridan,S.D., Benham,C.J. and Hatfield,G.W. (1998) Activation
of gene expression by a novel DNA structural transmission
mechanism that requires supercoiling-induced DNA duplex
destabilization in an upstream activating sequence. J. Biol. Chem.
Simpson,R.T. (1991) Nucleosome positioning: occurrence, mechanisms, and functional consequences. Prog. Nucleic Acids Res.
Mol. Biol., 40, 143–184.
Sinden,R.R. (1994) DNA Structure and Function. Academic Press,
San Diego, CA.
Skinner,J.A., Foss,G.S., Miller,W.J. and Davies,K.E. (1998) Molecular studies of the fragile sites FRAXE and FRAXF. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat
Disorders BIOS, Oxford, pp. 51–60.
Starr,D.B., Hoopes,B.C. and Hawley,D.K. (1995) DNA bending is
an important component of site-specific recognition by the TATA
binding protein. J. Mol. Biol., 250, 434–446.
Stevanin,G., David,G., Abbas,N., Dürr,A., Holmberg,M., Duyckaerts,C., Giunti,P., Cancel,G., Ruberg,M., Mandel,J.-L. and
Brice,A. (1998) Spinocerebellar ataxia type 7 (SCA7). In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat
Disorders BIOS, Oxford, pp. 155–168.
Suck,D. (1994) DNA recognition by DNase I. J. Mol. Recognition,
7, 65–70.
The Huntington’s Disease Collaborative Research Group (1993) A
novel gene containing a trinucleotide repeat that is expanded and
unstable on Huntington’s disease chromosomes. Cell, 72, 971–
983.
Tsukiyama,T. and Wu,C. (1997) Chromatin remodeling and transcription. Curr. Opin. Gen. Dev., 7, 182–191.
Wang,Y.-H. and Griffith,J.D. (1994) Preferential nucleosome assembly at DNA triplet repeats from the myotonic dystrophy gene.
Science, 265, 669–671.
Wang,Y.-H. and Griffith,J.D. (1995) Expanded CTG triplet blocks
from the myotonic dystrophy gene create the strongest known
natural nucleosome positioning elements. Genomics, 25, 570–
573.
Wang,Y.-H., Gellibolian,R., Shimizu,M., Wells,R.D. and Griffith,J.D. (1996) Long CCG triplet repeat blocks exclude
nucleosomes—a possible mechanism for the nature of fragile
sites in chromosomes. J. Mol. Biol., 263, 511–516.
Wells,R.D. (1996) Molecular basis of triplet repeat diseases. J. Biol.
Chem., 271, 2875–2878.
Werner,M.H. and Burley,S.K. (1997) Architectural transcription
factors: proteins that remodel DNA. Cell, 88, 733–736.
Winokur,S.T., Bengtsson,U., Feddersen,J., Mathews,K.D., Weiffenbach,B., Bailey,H., Markovich,R.P., Murray,J.C., Wasmuth,J.J.,
Altherr,M.R. and Schutte,B.C. (1994) The DNA rearrangement
associated with facioscapulohumeral muscular dystrophy involves a heterochromatin-associated repetitive element: implications for a role of chromatin structure in the pathogenesis of the
disease. Chromosome Res., 2, 225–234.
Wolffe,A.P. and Drew,H.R. (1995) DNA structure: implications
for chromatin structure and function. In Elgin,S.C.R. (ed.),
Chromatin Structure and Gene Expression IRL Press, Oxford,
pp. 27–48.
Wolffe,A.P. and Matzke,M.A. (1999) Epigenetics: regulation
through repression. Science, 286, 481–486.
Zhu,Z. and Thiele,D.J. (1996) A specialized nucleosome modulates
transcription factor access to a C.glabrata metal responsive
promoter. Cell, 87, 459–470.