BIOINFORMATICS Vol. 16 no. 10 2000, Pages 865–889

Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths

Pierre Baldi 1,2,∗ and Pierre-François Baisnée 1

1 Department of Information and Computer Science and 2 Department of Biological Chemistry, College of Medicine, University of California, Irvine, CA 92697-3425, USA

Received on April 24, 2000; accepted on May 25, 2000

Abstract

Motivation: DNA structure plays an important role in a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various aspects of DNA structure including base stacking energy, propeller twist angle, protein deformability, bendability, and position preference. Yet, a general framework for the computational analysis and prediction of DNA structure is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic databases; (4) distribution and asymptotic behavior as the length N of the sequences increases; and (5) complete analysis of correlations between scales.

Results: We develop a general framework for sequence analysis based on additive scales, structural or other, that addresses all these issues. We show how to construct extremal sequences and calibrate scores for automatic genomic and database extraction. We show that distributions rapidly converge to normality as N increases. Pairwise correlations between scales depend both on background distribution and sequence length and rapidly converge to an analytically predictable asymptotic value. For di- and tri-nucleotide scales, normal behavior and asymptotic correlation values are attained over a characteristic window length of about 10–15 bp.
With a uniform background distribution, pairwise correlations between empirically-derived scales remain relatively small and roughly constant at all lengths, except for propeller twist and protein deformability which are positively correlated. There is a positive (resp. negative) correlation between dinucleotide base stacking (resp. propeller twist and protein deformability) and AT-content that increases in magnitude with length. The framework is applied to the analysis of various DNA tandem repeats. We derive exact expressions for counting the number of repeat unit classes at all lengths. Tandem repeats are likely to result from a variety of different mechanisms, a fraction of which is likely to depend on profiles characterized by extreme structural features.

Contact: [email protected]; [email protected]

∗ To whom all correspondence should be addressed. © Oxford University Press 2000

Introduction

Evidence is mounting that DNA structural properties beyond the double helical pattern play an important role in a number of fundamental biological processes, both under healthy and pathological conditions. This is not too surprising if one realizes that meters of DNA must be compacted into a nucleus that is only a few microns in diameter while, at the same time, preserving the ability to turn thousands of genes on and off in a precisely orchestrated fashion. The three-dimensional structure of DNA, as well as its organization into chromatin fibers, seems to be essential to its functions and has been implicated in diverse phenomena ranging from protein binding sites, to gene regulation, to triplet repeat expansion diseases. The goal of this work is to develop computational methods for the structural analysis of DNA sequences.
While DNA structure is our primary motivation and area of application, the framework we develop is completely general and applies to sequences over any alphabet, including codon, RNA, and protein alphabets, whenever local additive scales, as defined below, are available.

DNA structure

DNA structure has been found to depend on the exact sequence of nucleotides, an effect that seems to be caused largely by interactions between neighboring base pairs (Ornstein et al., 1978; Satchwell et al., 1986; Breslauer et al., 1986; Calladine et al., 1988; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Ponomarenko et al., 1999; Fye and Benham, 1999). This means that different sequences can have different intrinsic structures, or different propensities for forming particular structures. Periodic repetitions of bent DNA in phase with the helical pitch, for instance, will cause DNA to assume a macroscopically curved structure. Flexible or intrinsically curved DNA is energetically more favorable to wrap around histones than rigid and unbent DNA, and this has been shown to influence nucleosome positioning (Drew and Travers, 1985; Satchwell et al., 1986; Simpson, 1991; Lu et al., 1994; Wolffe and Drew, 1995; Baldi et al., 1996; Zhu and Thiele, 1996; Liu and Stein, 1997). In addition, the chromatin complex structure of DNA and the positioning of nucleosomes along the genome have been found to play an important (generally inhibitory) role in the regulation of gene transcription (Pazin and Kadonaga, 1997; Tsukiyama and Wu, 1997; Werner and Burley, 1997; Pedersen et al., 1998). Sequence-dependent DNA structure is often important for DNA binding proteins, such as TBP (TATA-binding protein) (Parvin et al., 1995; Starr et al., 1995; Grove et al., 1996) and gene regulation (Sheridan et al., 1998).
While the number of resolved structures of DNA–protein complexes continues to grow in the PDB database, the field of computational DNA structural analysis is clearly far behind its protein cousin and lacks a systematic framework. Most likely, most DNA structural signals remain to be uncovered.

DNA structural scales

Based on many different empirical measurements or theoretical approaches, several models have been constructed that relate the nucleotide sequence to DNA flexibility and curvature (Ornstein et al., 1978; Satchwell et al., 1986; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Baldi et al., 1998; Ponomarenko et al., 1999). These models are typically in the form of dinucleotide or trinucleotide scales that assign a particular value to each di- or tri-nucleotide and its reverse complement. A non-exhaustive list of such scales includes:

(1) The dinucleotide base stacking energy (BS) scale (Ornstein et al., 1978) expressed in kilocalories per mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.

(2) The dinucleotide propeller twist angle (PT) scale (Hassan and Calladine, 1996) measured in degrees. This scale is based on X-ray crystallography of DNA oligomers. Dinucleotides with a large negative propeller-twist angle tend to be more rigid than dinucleotides with low negative propeller-twist angle.

(3) The dinucleotide protein deformability (PD) scale (Olson et al., 1998) derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters in DNA–protein crystal complexes. Dinucleotides with large PD values tend to be more flexible.

(4) The trinucleotide bendability (B) model (Brukner et al., 1995) based on DNase I cutting frequencies. The enzyme DNase I preferably binds (to the minor groove) and cuts DNA that is bent, or bendable, towards the major groove (Lahm and Suck, 1991; Suck, 1994).
Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or anisotropic bendability. These frequencies allow for the derivation of bendability parameters for the 32 complementary trinucleotide pairs. Large B values correspond to flexibility.

(5) The trinucleotide position preference (PP) scale derived from experimental investigations of the positioning of DNA in nucleosomes. It has been found that certain trinucleotides have a strong preference for being positioned in phase with the helical repeat. Depending on the exact rotational position, such triplets will have minor grooves facing either towards or away from the nucleosome core (Satchwell et al., 1986). Based on the premise that flexible sequences can occupy any rotational position on nucleosomal DNA, these preference values can be used as a triplet scale that measures DNA flexibility. Hence, in this model, all triplets with close to zero preference are assumed to be flexible, while triplets with preference for facing either in or out are taken to be more rigid and have larger PP values. Note that we do not use this scale as a measure of how well different triplets form nucleosomal DNA. Instead, the absolute value, or unsigned nucleosome positioning preference, is used here, as in Pedersen et al. (1998), as a measure of DNA flexibility.

For completeness, all these scales are displayed in the appendix. In previous studies, we found these models useful (Baldi et al., 1996; Pedersen et al., 2000; Baldi et al., 1999), in particular for the detection of putative new structural signatures associated with an increase of bendability in downstream regions of RNA polymerase II promoters.
A similar approach (Liao et al., 2000) was used to analyze the structure of insertion sites for P transposable elements in Drosophila melanogaster and suggest that the corresponding transposition mechanism recognizes a structural signature rather than a specific sequence motif. With the exception of BS, all the models were determined by purely experimental observations of sequence–structure correlations. Additional scales capturing DNA properties related directly or indirectly to structure, such as enthalpy or melting temperature, have also been proposed (Breslauer et al., 1986; Ponomarenko et al., 1999).

The primary focus of this work is not on assessing the merits and pitfalls of each model, but rather on the development of general methods for the systematic application of any scale to any sequence of any length, up to entire genomes, under the assumption that the scale can be used additively within a sliding window. In general, this assumption will provide a reasonable approximation, at least up to a certain length to be determined experimentally. In particular, we are interested in the development of methods for the automatic recognition of structural motifs associated with extremal features, such as extreme stiffness or bendability. The calibration of corresponding thresholds is expected to be useful in database searches and is conceptually similar, for instance, to the calibration of thresholds for detecting sequence homology. More generally, however, database searches may also be conducted on the basis of structural signatures or profiles that need not be extremal and could be obtained from reasonable training sets. Certain protein binding sites, for instance, are highly degenerate at the DNA sequence level, with low sequence homology, while exhibiting at the same time a high degree of DNA structural similarity.
Similarly, periodic flexible triplets in phase with the double helical pitch are necessary to ensure long range curvature, for instance in nucleosome regions.

Although several scales may agree on some structural features, the fact remains that they may also display divergent interpretations of some sequence elements. While no final consensus regarding these models exists, it is likely that each one provides a slightly different and partially complementary view of DNA structure. Thus a second goal of this work is the comparison of the models in the limited sense of estimating the statistical correlation between different scales. In Baldi et al. (1998) it was shown that by and large many of the commonly used scales exhibit low correlations measured at the level of single di- or tri-nucleotides. Empirical measurements of correlations between the scales over longer lengths in Escherichia coli have recently revealed different unexplained patterns (Pedersen et al., 2000). Here we provide a complete explanation of these phenomena and show how correlations vary with background distribution and with window length. Finally, while the methods introduced can be applied to any DNA sequence, we focus here on a particularly important class of DNA sequences, namely DNA tandem repeats, where the general framework is further specialized.

DNA repeats

Genomes, especially eukaryotic genomes, are replete with DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997; Jurka, 1998). Well over 30% of the human genome has been estimated to comprise repetitive DNA of some sort (Benson and Waterman, 1994), the exact function of which is often unknown. Such DNA arises through many different evolutionary and genetic mechanisms. Over 950 different classes of repeats have been catalogued (Jurka, 1998). Two major groups of repeats exist: interspersed repeats and tandem repeats.
While the methods to be developed can be applied to both groups, our analysis will focus on tandem repeats, consisting of two or more contiguous copies of a particular pattern of nucleotides. Tandem repeats may cover up to 10% of the human genome. Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern and the number of more or less exact contiguous copies. Repeats are often polymorphic and therefore play a major role in linkage studies and DNA fingerprinting. In many cases, the genetic origin, the structure, and the function of these repetitive regions are poorly understood. There exist a few examples, however, where the repeats are known to play a biological role in both healthy and pathological conditions. Certain tandem repeats, for instance, have been associated with protein binding sites or interactions with transcription factors. An important advance in epigenetics research has been the realization that interactions between repeated DNA sequences can trigger the formation and the transmission of inactive genetic states and DNA modifications (Wolffe and Matzke, 1999). In several of these cases, the particular DNA-helical structural features of the repeat sequences seem to play an essential role.

Interest in tandem repeats has been heightened over the last few years by the discovery that several important degenerative disorders including Huntington's disease, myotonic dystrophy, fragile X syndrome, and several forms of ataxia, result from the abnormal expansion of particular DNA triplets (The Huntington's Disease Collaborative Research Group, 1993; Ashley and Warren, 1995; Ross, 1995; Gusella and MacDonald, 1996; Hardy and Gwinn-Hardy, 1998; Rubinsztein and Hayden, 1998; Baldi et al., 1999).
The exact mechanism by which a triplet repeat mutation causes disease varies, as indicated by the fact that currently known repeat expansions are found in 5′ UTRs, in 3′ UTRs, in introns, and within coding sequences of various affected genes (Ashley and Warren, 1995; Gusella and MacDonald, 1996; Rubinsztein and Amos, 1998; Rubinsztein and Hayden, 1998). For instance, fragile X mental retardation is associated with an expanded CGG repeat in the 5′ UTR of the FMR1 gene (Nelson, 1995; Eichler and Nelson, 1998). The 64 possible triplets can be clustered into 12 equivalence classes when shift and reverse complement operations are considered (see below). Currently only three repeat classes, CAG, CGG, and GAA, out of the possible twelve, are associated with triplet repeat disorders. There is evidence that unusual structural features of the repeats play a role in their expansion (Wells, 1996; Pearson and Sinden, 1998a,b; Moore et al., 1999). In Baldi et al. (1999), the structural scales above were used to show that the triplet classes involved in the diseases have extreme structural characteristics of very high or very low flexibility. Methods to quantify the degree of extremality relative to other sequences, however, were not developed. Furthermore, other triplet or non-triplet repeats may play a role in diseases as well as other biological processes. Therefore the techniques need to be improved and extended to all classes of repeats.

Hence, given the importance of repeating patterns and the exponential growth of sequence databases, our goal is also to develop new tools for the computational analysis of the structural properties of arbitrary repeats and begin to apply such techniques in a systematic and quantifiable way. Various algorithms for searching tandem repeats have been developed (Milosavljevic and Jurka, 1993; Benson and Waterman, 1994; Benson, 1999; Blanchard et al., 2000).
The techniques presented here can also be viewed as complementing such algorithms by introducing a structural perspective.

Organization

The remainder of the paper is organized as follows. In the next section we develop a general framework for the analysis of the score of a sequence (repetitive or not) under any additive scale. We determine the number of different sequence equivalence classes under circular permutation and reverse complement operations. We show how to determine and visualize maximal and minimal patterns and study the statistical properties of the scales, including intra-scale (mean and variance) and inter-scale (correlations) statistics for sequences of various lengths, as well as asymptotic normality. This framework is essential in order to compare the behavior of various scales, to locate a given sequence with respect to a comparable population, and to automatically set thresholds in database searches. We then apply the general framework to the five structural models described above and various tandem repeats.

Methods and theory

General framework

The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The scale is a function that assigns a value to any $S$-tuple of the alphabet, for instance in the form of a table with $A^S$ entries. In the Results section, we deal exclusively with the nucleotide DNA alphabet ($A = 4$) and with DNA scales, such as dinucleotide with $S = 2$ (e.g. propeller twist) or trinucleotide with $S = 3$ (e.g. bendability) structural scales. The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with hydrophobicity scales). Given a primary sequence $s = X_1 X_2 \ldots X_N$ of length $N \geq S$ over $\mathcal{A}$, we assume that the scale $\mathcal{S}$ is approximately additive in the sense that the corresponding global property of the sequence $s$ can be estimated by 'sliding' the scale along the sequence in the form

$$\mathcal{S}(s) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \ldots X_{i+S-1}). \tag{1}$$

In practical applications, such a quantity can also be averaged over a window of length $W$ to get, for instance, a more homogeneous per base-pair value ($W \approx N$). This averaging process does not concern us at this stage since it merely amounts to using a different scale, with a larger size. The form given in equation (1) corresponds to a free boundary condition. The ideas to be developed can be applied to other boundary conditions, including periodic boundary conditions, where the sequence is wrapped around, as described below. With the proper modifications, the theory applies immediately to the case where the scales are shifted by more than one position at each step.

Consider now a repeat sequence $r$ consisting of a unit pattern or period $p = (X_1 \ldots X_P)$ of length $P$, and repetition number $R > 1$, so that $r = (X_1 \ldots X_P)^R$ with $N = PR \geq S$. Notice that the period is not uniquely determined since, for instance, $XXXX$ can be viewed as $(X)^4$, or as $(XX)^2$. In addition, we will assume that $P + S - 1 \leq N$, or equivalently that $S \leq (R-1)P + 1$, so that the scale $\mathcal{S}$ is applied starting at least once from each letter in the repetitive unit, without exceeding the repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the form:

$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \epsilon \tag{2}$$

where $\mathcal{S}(p)$ is the contribution of the periodic unit

$$\mathcal{S}(p) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots + \mathcal{S}(X_P X_1 \ldots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \ldots X_{i+S-1}) \quad [\mathrm{mod}\ P]. \tag{3}$$

The number $l$ of times the periodic unit is covered by $\mathcal{S}$ and its shifted versions is given by:

$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S-1}{P} \right\rceil. \tag{4}$$

Finally, if $lP + S - 1 = RP$ then the boundary tail $\epsilon$ is equal to 0. Otherwise

$$\epsilon = \mathcal{S}(X_{lP+1} \ldots X_{lP+S}) + \cdots + \mathcal{S}(X_{RP-S+1} \ldots X_{RP}) \tag{5}$$

where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$ and so forth. The sum in equation (5) has at most $P - 1$ terms.
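The decomposition in equations (1)–(5) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names and the `toy_scale` table are hypothetical, and the scale values are arbitrary integers chosen only so that the arithmetic is exact.

```python
from itertools import product

ALPHABET = "ACGT"

def toy_scale(S):
    """Arbitrary illustrative values for every S-tuple (NOT a published scale)."""
    words = ("".join(t) for t in product(ALPHABET, repeat=S))
    return {w: (7 * i) % 11 for i, w in enumerate(words)}

def additive_score(seq, scale, S):
    """Equation (1): slide the scale along seq with free boundaries."""
    return sum(scale[seq[i:i + S]] for i in range(len(seq) - S + 1))

def unit_score(p, scale, S):
    """Equation (3): score of the repeat unit, indices taken modulo P."""
    P = len(p)
    return sum(scale["".join(p[(i + j) % P] for j in range(S))]
               for i in range(P))

def repeat_score(p, R, scale, S):
    """Equations (2), (4), (5): l * S(p) plus the boundary tail."""
    P = len(p)
    assert P + S - 1 <= P * R, "requires P + S - 1 <= N"
    l = (P * R - S + 1) // P          # equation (4)
    r = p * R
    # Windows starting at 0-based positions l*P, ..., R*P - S form the tail.
    tail = sum(scale[r[i:i + S]] for i in range(l * P, P * R - S + 1))
    return l * unit_score(p, scale, S) + tail
```

For any unit `p` and repetition number `R` satisfying the stated assumption, `repeat_score(p, R, scale, S)` should agree with `additive_score(p * R, scale, S)`.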
In practice, at least in the case of DNA, only short scales are currently available and therefore in most cases, $S \leq P + 1$. In this case, equation (2) simplifies to:

$$\mathcal{S}(r) = (R - 1)\,\mathcal{S}(p) + \epsilon. \tag{6}$$

Equivalence classes

In the special case of repetitive sequences, we also need to be able to count the number of different repeats with respect to a given scale. It is often the case that the scale $\mathcal{S}$ is characterized by some kind of invariance with respect to the sequences of length $S$ over $\mathcal{A}$. In the case of DNA, the structural scales we have are invariant with respect to the reverse complement. When looking at repeat sequences, this determines how many different repeat patterns of length $P$ need to be considered. A triplet repeat, for instance, can be described in terms of different unit trinucleotides depending on what strand and triplet frame is chosen. Thus, the repeat CAGCAGCAG... can be said to be a repeat of the triplet CAG, and also of its reverse complement CTG. Ignoring repeat boundaries, however, the sequence can also be described as a repeat of the shifted triplet pairs AGC/GCT and GCA/TGC. In this way, the 64 different trinucleotides can be divided into 12 possible repeat classes. Of these 12 classes, only 10 are proper triplet repeat classes in the sense that they do not result from a repeat pattern of shorter length. The two classes associated with shorter patterns are obviously the triplet pairs AAA/TTT and CCC/GGG which are more precisely described as mononucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse complement operation can be defined by introducing a one-to-one function $X \to \bar{X}$ from the alphabet to itself, satisfying $\bar{\bar{X}} = X$, so that the reverse complement of $X_1 \ldots X_N$ is defined to be $\bar{X}_N \ldots \bar{X}_1$.]
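The class structure described above can be reproduced by brute-force enumeration. The sketch below is illustrative only: all function names are hypothetical, the alphabet is assumed to be DNA (where no letter is its own complement), and `class_count_closed_form` is our reading of the closed-form counts stated in equations (9)–(11), included purely as a cross-check.

```python
from itertools import product
from math import gcd

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(s):
    return "".join(COMPLEMENT[x] for x in reversed(s))

def canonical(unit):
    """Smallest string among all circular shifts of the unit
    and of its reverse complement (a class representative)."""
    candidates = []
    for s in (unit, reverse_complement(unit)):
        candidates += [s[i:] + s[:i] for i in range(len(s))]
    return min(candidates)

def repeat_classes(P):
    """Set of class representatives for repeat units of length P."""
    return {canonical("".join(t)) for t in product("ACGT", repeat=P)}

def is_proper(unit):
    """True if the unit is not a repetition of a shorter pattern."""
    P = len(unit)
    return not any(P % d == 0 and unit == unit[:d] * (P // d)
                   for d in range(1, P))

def class_count_closed_form(P, A=4):
    """Closed-form class count (odd and even P), assuming A = 4 DNA letters."""
    total = sum(A ** gcd(P, k) for k in range(1, P + 1))
    if P % 2 == 0:
        total += (P // 2) * A ** (P // 2)
    return total // (2 * P)
```

For `P = 3` the enumeration yields the 12 classes mentioned in the text, of which 10 are proper; the enumeration cost grows as $4^P$, so the closed form is preferable for longer units.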
In the case of a DNA repeat with unit repeat length $P$, the number of classes and the number of elements in each equivalence class are dictated by the action of the group of transformations associated with the circular permutations and the reverse complement operations on the set of all possible strings of length $P$. AAA.../TTT... and CCC.../GGG... always give rise to two separate classes with two elements each. In general, a typical class will contain $2P$ elements associated with the $P$ circular permutations and their $P$ reverse complements. Classes containing fewer elements, however, can arise for instance as a result of sub-periodicity effects when $P$ is not prime, and of identical reverse complement effects. For instance, when $P = 4$, the class of ATAT contains only two elements since it is identical to its reverse complement and can be shifted circularly only once before returning to the original pattern.

The number of classes can be counted using standard group theory arguments detailed in the appendix. These arguments are not restricted to circular permutation and reverse complement operations, but apply to any group of transformations over any sequences. The number of classes, when only circular permutations without reverse complement are taken into account, is given by

$$\frac{1}{P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{P} \sum_{1 \leq k \leq P} A^{(P,k)} \tag{7}$$

where $(P, k)$ is the greatest common divisor (gcd) of $P$ and $k$, and $\phi(n)$ is the Euler function counting the number of positive integers not exceeding $n$ that are prime to $n$, i.e. without common divisors with $n$ other than 1. If $p_1, \ldots, p_k$ is the list of distinct prime factors of $n$, then the Euler function can be expressed as:

$$\phi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right). \tag{8}$$

When both circular permutations and reverse complement are taken into account, the number of classes for odd $P$ is given by

$$\frac{1}{2P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{2P} \sum_{1 \leq k \leq P} A^{(P,k)}. \tag{9}$$

When $P$ is even, the corresponding number of classes is

$$\frac{1}{2P} \left[ \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} + \frac{P}{2} A^{P/2} \right] \tag{10}$$

or, equivalently,

$$\frac{1}{2P} \left[ \sum_{1 \leq k \leq P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right]. \tag{11}$$

In particular, when $P$ is an odd prime, the number of different classes under periodic and reverse complement equivalence is

$$\frac{1}{2P} \left[ (P - 1)A + A^{P} \right]. \tag{12}$$

The number of classes which are new at a given length $P$, i.e. that do not result from the repetition of a shorter pattern of length dividing $P$, can easily be obtained by subtracting the corresponding counts for each divisor of $P$. When $P$ is prime, all classes are new except for the classes resulting from mono-letter repeats. Table 1 in the Results section exemplifies the application of equations (9)–(12).

Extremal sequences and automata

We are interested in the construction and recognition of sequences $s$ that are extremal for $\mathcal{S}$, i.e. such that $\mathcal{S}(s)$ is very large or very small relative to the other sequences of length $N$. For this, we attach to each scale a prefix automaton, or prefix graph. The prefix automaton can be described by a directed graph containing $A^{S-1}$ nodes, each labeled by a string of length $S - 1$ over $\mathcal{A}$ of the form $X_1 \ldots X_{S-1}$ (see Figure 1 for an example). Each node has $A$ directed outgoing connections: $X_1 \ldots X_{S-1}$ is connected to $X_2 \ldots X_{S-1} Y$, for each letter $Y$ in $\mathcal{A}$, hence the notion of prefix. The weight (or length) of the corresponding transition is provided by the entry associated with $X_1 X_2 \ldots X_{S-1} Y$ in the structural table. The $A$ nodes labeled $(X)^{S-1} = XX \ldots X$ (monorepeats) are the only ones to have a self-connection. Any sequence $s$ of length $N$ is trivially associated with the path $X_1 \ldots X_{S-1} \to X_2 \ldots X_S \to \cdots \to X_{N-S+2} \ldots X_N$. The value of $\mathcal{S}(s)$ is found just by adding the weights of the corresponding connections. As a result, sequences associated with maximal or minimal values of $\mathcal{S}(s)$ correspond to paths in the prefix graph with maximal or minimal total weight or length.
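The path formulation above lends itself to a simple dynamic program over the prefix graph. The sketch below is one possible realization under stated assumptions, not the authors' code: the function names are hypothetical, `toy_scale` holds arbitrary made-up values, and whole sequences are stored in each state for brevity rather than efficiency.

```python
from itertools import product

ALPHABET = "ACGT"

def toy_scale(S):
    """Arbitrary illustrative values for every S-tuple (NOT a published scale)."""
    words = ("".join(t) for t in product(ALPHABET, repeat=S))
    return {w: (7 * i) % 11 for i, w in enumerate(words)}

def additive_score(seq, scale, S):
    """Sum of the scale over all windows of length S (free boundaries)."""
    return sum(scale[seq[i:i + S]] for i in range(len(seq) - S + 1))

def extremal_sequence(scale, S, N, maximize=True):
    """Return (score, sequence) of an extremal sequence of length N >= S.

    Each state is a node of the prefix graph (the last S-1 letters);
    each step follows one weighted edge, i.e. appends one letter.
    """
    better = (lambda a, b: a > b) if maximize else (lambda a, b: a < b)
    best = {"".join(t): (0, "".join(t))
            for t in product(ALPHABET, repeat=S - 1)}
    for _ in range(N - S + 1):              # one edge per window
        nxt = {}
        for state, (score, seq) in best.items():
            for y in ALPHABET:
                new_state = (state + y)[1:]
                cand = score + scale[state + y]
                if new_state not in nxt or better(cand, nxt[new_state][0]):
                    nxt[new_state] = (cand, seq + y)
        best = nxt
    return max(best.values()) if maximize else min(best.values())
```

A call such as `extremal_sequence(toy_scale(2), 2, 12)` returns one maximal sequence; ties are broken arbitrarily, so other sequences may share the same score.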
Such extremal paths can easily be found by standard dynamic programming techniques, which can also be extended to finding, for instance, the $k$ longest or shortest paths.

A repeat pattern of length $P$ is a directed cycle in the prefix graph. Notice that any path of length greater than $A^{S-1}$ must intersect itself at least once. Thus any cycle of length strictly greater than $A^{S-1}$ must be composed of non-intersecting cycles of length at most $A^{S-1}$. For instance, with a dinucleotide scale, any repeat unit of length greater than four must contain at least two cycles of length at most four. Therefore in the study of repeats, we need only study the properties of all non-intersecting directed cycles of length up to $A^{S-1}$ together with all possible ways of joining them. In addition to dynamic programming techniques, it is also useful to tabulate the weights of all possible short cycles for at least two reasons. First, longer patterns are built from shorter cycles. Second, at least in the case of DNA, many important existing repeats, such as triplet repeats, are based on a short repeating pattern.

While the prefix graph is useful for constructing extremal sequences and recognizing them as long as $A$, $S$ and $N$ are small, it is also necessary to develop more general techniques by which we can rapidly assess, for any sequence $s$, the magnitude of $\mathcal{S}(s)$ with respect to all the other comparable sequences. This is best achieved by viewing the sequences in a probabilistic context.

Probabilistic modeling

Consider now that sequences are being generated by a random process. To fix ideas, we take for simplicity a Markov model of order 0, i.e. we assume that sequences are generated by $N$ tosses of the same die with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The same analysis, however, can easily be extended to other probabilistic models such as higher-order Markov models where distributions are defined, for instance, on pairs or triplets of letters.
From equation (1), $\mathcal{S}(s)$ is now a random variable which is the sum of $N-S+1$ random variables: $\mathcal{S}(s) = Y_1 + \cdots + Y_{N-S+1}$. By construction, all the variables $Y_i = \mathcal{S}(X_i \ldots X_{i+S-1})$ have the same distribution, but they are not independent. Rather they satisfy a form of local dependence, called 'm-dependence' in statistics. More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent if and only if $j - i \geq S$. Using the linearity of the expectation, we have:

$$E(\mathcal{S}(s)) = (N-S+1)\,E(Y_i) \approx N\alpha_{\mathcal{S}} \tag{13}$$

with

$$E(Y_i) = \sum_{X_1 \ldots X_S} \mathcal{S}(X_1 \ldots X_S)\, p(X_1) \cdots p(X_S) = \alpha_{\mathcal{S}} \tag{14}$$

the sum being over all $|A|^S$ $S$-tuples of the alphabet. To situate an individual sequence with respect to the entire population, we need to calculate the variance. The variance also can be calculated explicitly by taking advantage of the local dependence of the variables $Y_i$. We have

$$\mathrm{Var}(\mathcal{S}(s)) = (N-S+1)\,\mathrm{Var}(Y_i) + 2\sum_{0<j-i<S} \mathrm{Cov}(Y_i, Y_j) \tag{15}$$

with the covariances $\mathrm{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \geq S$, $Y_i$ and $Y_j$ are independent and the corresponding covariance is 0. Thus, for any given scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$ and the $S$ relevant short-range covariances

$$C_k = C_k(\mathcal{S}) = \mathrm{Cov}(Y_i, Y_{i+k}) \tag{16}$$

for $0 \leq k < S$ ($C_0 = \mathrm{Var}(Y_i)$). Alternatively, by factoring out the variance of $Y_i$, equation (15) can also be expressed in terms of the correlations

$$\mathrm{Var}(\mathcal{S}(s)) = \mathrm{Var}(Y_i)\Big[N - S + 1 + 2\sum_{0<j-i<S} \mathrm{Cor}(Y_i, Y_j)\Big]. \tag{17}$$

To obtain the exact variance at each length $N$, it is then only a matter of counting how many times each type of covariance is present in the sequences and adjusting for any boundary effects as needed. If $N \geq 2S-1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N-S+1)\,C_0 + 2\sum_{k=1}^{S-1} (N-S-k+1)\,C_k. \tag{18}$$

If $S \leq N < 2S-1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N-S+1)\,C_0 + 2\sum_{k=1}^{N-S} (N-S-k+1)\,C_k. \tag{19}$$

It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$

$$\mathrm{Var}(\mathcal{S}(s)) \approx N\Big[C_0 + 2\sum_{k=1}^{S-1} C_k\Big] = N\sum_{k=-S+1}^{S-1} C_k = N\beta_{\mathcal{S}}. \tag{20}$$

In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.

In the case of repetitive sequences, it is also useful to calculate the expectation of $\mathcal{S}(p) = Y_1 + \cdots + Y_P$, and its variance with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1 \ldots Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(\mathcal{S}(p)) = \alpha_{\mathcal{S}} P$ and $\mathrm{Var}(\mathcal{S}(p)) = \beta_{\mathcal{S}} P$. Clearly, for any $P$, $E(\mathcal{S}(p)) = P\,E(Y_i)$ so

$$\alpha_{\mathcal{S}} = E(Y_i). \tag{21}$$

If $P \geq 2S-1$,

$$\beta_{\mathcal{S}} = C_0 + 2\sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k. \tag{22}$$

When $S \leq P < 2S-1$, all variables along the circle are dependent and therefore $\beta_{\mathcal{S}} = \beta_{\mathcal{S}}(P)$ is given by

$$\beta_{\mathcal{S}}(P) = C_0 + 2\sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \tag{23}$$

when $P = 2n+1$, and

$$\beta_{\mathcal{S}}(P) = C_0 + C_n + 2\sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \tag{24}$$

when $P = 2n$. Periodic boundary conditions must be used in the computation of the covariances $C_k$ whenever necessary ($|k| > P - S$). For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, multiple of $P$, so that $S \leq P'$.

For a periodic sequence $r$, where the period $P$ as well as $S$ are small relative to the length $N = RP$, we can use:

$$E(\mathcal{S}(r)) \approx R\,E(\mathcal{S}(p)) \tag{25}$$

and the approximation

$$\mathrm{Var}(\mathcal{S}(r)) \approx R\,\mathrm{Var}(\mathcal{S}(p)). \tag{26}$$

Central limit theorem. $\mathcal{S}(s)$ consists of a sum of identical but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \cdots + Y_K$ of $K$ m-dependent random variables $Y_i$ still approaches a normal distribution. This can be shown using the theorem in Baldi and Rinott (1989) which provides also a bound on the rate of convergence. Here we use the improved bound found in Rinott and Dembo (1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E\big(\sum_{i=1}^{K} |Y_i - E(Y_i)|\big)/K = \mu$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,

$$\left| P\!\left(\frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \leq u\right) - \Phi(u) \right| \leq 7\,\frac{K\mu}{[\mathrm{Var}(Z)]^{3/2}}\,(2S-1)^2 B^2 \tag{27}$$

where $\Phi(u)$ is the normalized Gaussian distribution. The factor $(2S-1)$ represents the size of the clusters associated with m-dependence. For a fixed scale, such size is constant but the theorem remains true if $S$ grows slowly with $N$. Thus equation (27) can readily be applied to $\mathcal{S}(s)$ or $\mathcal{S}(r)$ with $K = N-S+1$ or $K = RP$. From equation (20), the variance of the sequences being considered is linear in their length: $\mathrm{Var}(\mathcal{S}(s)) \approx \beta N$, where $\beta$ depends only on the scale $\mathcal{S}$. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$

$$\left| P\!\left(\frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \leq u\right) - \Phi(u) \right| \leq \frac{C}{\sqrt{N}} \tag{28}$$

with $C \approx 7\mu(2S-1)^2 B^2 \beta^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry–Esseen theorems, Feller, 1971).

Normalized distances and extremal sequences. The value of $\mathcal{S}(s)$ or $\mathcal{S}(r)$ of any sequence or repeat of length $N$ can be compared to the average value of a background population by computing a normalized Z-score of the form:

$$Z(s) = \frac{\mathcal{S}(s) - \alpha_{\mathcal{S}} N}{\sqrt{\beta_{\mathcal{S}}}\,\sqrt{N}}. \tag{29}$$

A repeat $r$ with period unit length $P$ and repetition $R$ ($N = RP$) can be compared to a background population of repeats, or a background population of generic sequences. In the latter case, we have $\mathcal{S}(r) \approx \mathcal{S}(p)R$ or $\mathcal{S}(r) \approx \alpha' N$, with $\alpha' = \mathcal{S}(p)/P$. Therefore the Z-score

$$Z(r) = \frac{(\alpha' - \alpha)\sqrt{N}}{\sqrt{\beta}} \tag{30}$$

grows with $\sqrt{N}$ and is larger than the Z-score $Z(p)$ computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length.
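As a hedged illustration of equations (14)–(20), (29) and (30), the following minimal sketch computes the coefficients of a dinucleotide scale under a uniform i.i.d. background and checks the exact variance formula against brute-force enumeration. The propeller twist values are those read from Figure 1 and Tables 5–6 of this paper; the function names are hypothetical, and the constants passed to `z_repeat` in the last line (72, 13.78, 134.42) are the position preference entries of Tables 9 and 12.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"

# Propeller twist per dinucleotide step, as read from Figure 1 and
# Tables 5-6 of this paper (the scale is reverse-complement invariant).
PT = {
    "AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01,
    "CA": -9.45,  "CC": -8.11,  "CG": -10.03, "CT": -14.00,
    "GA": -13.48, "GC": -11.08, "GG": -8.11,  "GT": -13.10,
    "TA": -11.85, "TC": -13.48, "TG": -9.45,  "TT": -18.66,
}

def score(seq, scale, S=2):
    """Additive score of a linear sequence: sum over all overlapping S-tuples."""
    return sum(scale[seq[i:i + S]] for i in range(len(seq) - S + 1))

def coefficients(scale, S=2):
    """Return (alpha, [C_0, ..., C_{S-1}], beta) for a uniform i.i.d. background."""
    tuples = ["".join(t) for t in product(ALPHABET, repeat=S)]
    alpha = sum(scale[t] for t in tuples) / len(tuples)
    C = []
    for k in range(S):
        # A window of length S + k contains both Y_i and Y_{i+k}.
        windows = ["".join(w) for w in product(ALPHABET, repeat=S + k)]
        mean_prod = sum(scale[w[:S]] * scale[w[k:k + S]] for w in windows) / len(windows)
        C.append(mean_prod - alpha ** 2)
    beta = C[0] + 2 * sum(C[1:])          # equation (20)
    return alpha, C, beta

def exact_variance(N, C, S=2):
    """Equations (18)-(19): exact variance of the score for linear sequences of length N."""
    K = N - S + 1                          # number of windows Y_i
    max_lag = min(S - 1, K - 1)
    return K * C[0] + 2 * sum((K - k) * C[k] for k in range(1, max_lag + 1))

def brute_force_variance(N, scale, S=2):
    """Population variance of the score over all 4^N sequences (small N only)."""
    values = [score("".join(s), scale, S) for s in product(ALPHABET, repeat=N)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def z_score(seq, scale, alpha, beta, S=2):
    """Equation (29), using the large-N approximations alpha*N and beta*N."""
    N = len(seq)
    return (score(seq, scale, S) - alpha * N) / sqrt(beta * N)

def z_repeat(unit_score, P, N, alpha, beta):
    """Equation (30): long repeat of total length N against generic sequences."""
    alpha_rep = unit_score / P             # alpha' = S(p) / P
    return (alpha_rep - alpha) * sqrt(N) / sqrt(beta)

alpha, C, beta = coefficients(PT)
print(alpha, C, beta)
print(exact_variance(5, C), brute_force_variance(5, PT))
print(z_score("A" * 30, PT, alpha, beta))
print(z_repeat(72, 3, 3000, 13.78, 134.42))
```

Under these assumptions the computed coefficients agree, to rounding, with the propeller twist column of Table 11, and the last line gives a value of about 48.3 for a CGG repeat with R = 1000, close to the corresponding entry of Table 13; small differences are expected because the tabulated constants are rounded.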
The Z-scores can be used to assess how extreme a sequence is and to search databases for subsequences with extremal features. As in the case of alignments, this can also be done using extreme value distributions (Durbin et al., 1998). Note also that one can search a database using a structural profile rather than extreme values. The degree of similarity between two profiles can be measured, for instance, using the standard mean square error.

Correlations between scales. It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales $\mathcal{S}_1$ and $\mathcal{S}_2$ of length $S_1$ and $S_2$. Without any loss of generality assume that $S_1 \leq S_2$. For sequences $s$ of length $N$, we are interested in measuring the correlation between the random variables $\mathcal{S}_1(s) = Y_1 + \cdots + Y_{N-S_1+1}$, with $Y_i = \mathcal{S}_1(X_i \ldots X_{i+S_1-1})$, and $\mathcal{S}_2(s) = Z_1 + \cdots + Z_{N-S_2+1}$, with $Z_i = \mathcal{S}_2(X_i \ldots X_{i+S_2-1})$. We have:

$$\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = \frac{\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))}{\sqrt{\mathrm{Var}(\mathcal{S}_1(s))}\,\sqrt{\mathrm{Var}(\mathcal{S}_2(s))}}. \tag{31}$$

Again only terms of the form $\mathrm{Cov}(Y_i, Z_j)$, where the distance between $i$ and $j$ is small, are non-zero. More precisely, non-zero terms can arise only if $0 \leq j - i \leq S_1 - 1$ or $0 \leq i - j \leq S_2 - 1$. It is sufficient to tabulate the finite set of $S_1 + S_2 - 1$ covariances

$$C_k = C_k(\mathcal{S}_1, \mathcal{S}_2) = \mathrm{Cov}(Y_i, Z_{i+k}) = E[(Y_i - E(Y_i))(Z_{i+k} - E(Z_{i+k}))] \tag{32}$$

with $S_1 \leq S_2$ and $-S_2 + 1 \leq k \leq S_1 - 1$. These covariances can be used to compute correlations at all lengths by writing

$$\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = (N - S_2 + 1)\,\mathrm{Cov}(Y_i, Z_i) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Z_j). \tag{33}$$

For large $N$ it is clear that, except for small boundary effects, each type of covariance occurs approximately $N$ times in the formula above. Therefore for large $N$, $\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ behaves approximately as

$$N\Big[C_0 + \sum_{k=-S_2+1,\,k \neq 0}^{S_1-1} C_k\Big] = N \sum_{k=-S_2+1}^{S_1-1} C_k. \tag{34}$$

We have seen in equation (20) that the variance of each scale is also asymptotically linear in the length $N$.
Thus, as $N$ increases, the correlation $\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ rapidly converges to a constant given by:

$$\frac{\sum_{k=-S_2+1}^{S_1-1} C_k(\mathcal{S}_1, \mathcal{S}_2)}{\Big[\sum_{k=-S_1+1}^{S_1-1} C_k(\mathcal{S}_1)\; \sum_{k=-S_2+1}^{S_2-1} C_k(\mathcal{S}_2)\Big]^{1/2}}. \tag{35}$$

In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement operation, it is worth noticing that with a uniform distribution on the alphabet ($p_A = p_C = p_G = p_T = 0.25$), the correlations are symmetric. That is, for any $0 < k < S_1$ we have $C_k(\mathcal{S}_1, \mathcal{S}_2) = C_{-k}(\mathcal{S}_1, \mathcal{S}_2)$. This results immediately from the fact that the sum of the terms $\mathcal{S}_2(X_1 \ldots X_{S_2}) \times \mathcal{S}_1(X_1 \ldots X_{S_1})$ and $\mathcal{S}_2(\bar{X}_{S_2} \ldots \bar{X}_1) \times \mathcal{S}_1(\bar{X}_{S_2} \ldots \bar{X}_{S_2-S_1+1})$ is equal to the sum of the terms $\mathcal{S}_2(X_1 \ldots X_{S_2}) \times \mathcal{S}_1(X_{S_2-S_1+1} \ldots X_{S_2})$ and $\mathcal{S}_2(\bar{X}_{S_2} \ldots \bar{X}_1) \times \mathcal{S}_1(\bar{X}_{S_1} \ldots \bar{X}_1)$, and similarly for other degrees of overlaps. The terms in the sums can be identically paired using the fact that $\mathcal{S}_1$ and $\mathcal{S}_2$ are assumed to be reverse-complement invariant. The result is not true if the scales, or the distribution, are not reverse-complement invariant.

Results

DNA repeat equivalence classes

We wrote a program that cycles through all possible DNA sequences of length $P$, counting and listing all the classes that are equivalent under circular permutation and reverse complement operations. Because of this equivalence, in the case of scales that are reverse-complement invariant, it is sufficient to study the repeats of one representative member of each class. We ran the program up to length $P = 12$. The results, shown in Table 1, are in complete agreement with equations (9)–(12). In Tables 2, 3 and 4 we list alphabetically all the members of each equivalence class for sequences of length 2–4. When $P = 4$, for instance, one finds 39 classes: 27 classes with 8 elements, 8 classes with 4 elements, and 4 classes with 2 elements (27 × 8 + 8 × 4 + 4 × 2 = 256). Only 33 classes are new, in the sense that 6 classes are derived from patterns already encountered at $P = 1$ and 2.
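The enumeration described above can be sketched as follows. This is a minimal brute-force illustration, not the authors' program: each P-tuple is mapped to the alphabetically smallest member of its class (over all rotations of the sequence and of its reverse complement), and the distinct representatives are counted. The function names are hypothetical, and the approach is only practical for small P since it visits all 4^P sequences.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq):
    """Alphabetically smallest member of the class of seq under
    circular permutation and reverse complement."""
    candidates = []
    for s in (seq, reverse_complement(seq)):
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)

def repeat_classes(P):
    """Map each class representative to the number of P-tuples in its class."""
    return Counter(canonical("".join(t)) for t in product("ACGT", repeat=P))

def prime_class_count(P):
    """Closed form for the number of classes when P > 2 is prime."""
    return (4 ** P - 4) // (2 * P) + 2

for P in range(1, 7):
    classes = repeat_classes(P)
    size_histogram = sorted(Counter(classes.values()).items())
    print(P, len(classes), size_histogram)
```

For P up to 6 the class counts can be compared directly with the "Classes (total)" column of Table 1, and `canonical("CAG")` returns the representative `AGC` used in the tables of this section.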
Likewise, when $P > 2$ is a prime number, the total number of classes is given by:

$$\frac{4^P - 4}{2P} + 2 \tag{36}$$

with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain $2P$ members. In the appendix, we provide tables in alphabetical order that allow one to invert Tables 3 and 4, i.e. to find the class associated with any given P-tuple ($P = 3, 4$).

Table 1. Number of repeat unit equivalence classes. New or proper classes are classes that do not contain a shorter periodic pattern

Sequence length   Classes (total)   Classes (new)
 1                      2                 2
 2                      6                 4
 3                     12                10
 4                     39                33
 5                    104               102
 6                    366               350
 7                  1 172             1 170
 8                  4 179             4 140
 9                 14 572            14 560
10                 52 740            52 632
11                190 652           190 650
12                700 274           699 875

[Figure 1: directed graph with four nodes A, C, G, T; each edge is labeled with the propeller twist value of the corresponding dinucleotide. Edge labels shown: −18.66, −15.01, −14.00, −13.48, −13.10, −11.85, −11.08, −10.03, −9.45, −8.11.]

Fig. 1. Dinucleotide prefix automata for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C → A → G → C in the graph and has a total propeller twist value of −9.45 − 14.00 − 11.08 = −34.53. The corresponding reverse complement cycle is given by C → T → G → C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG.

Table 2. Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes

Class number   List of members (alphabetical order)
1              AA TT
2              AC CA GT TG
3              AG CT GA TC
4              AT TA
5              CC GG
6              CG GC

Table 3. Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class number   List of members (alphabetical order)
 1             AAA TTT
 2             AAC ACA CAA GTT TGT TTG
 3             AAG AGA CTT GAA TCT TTC
 4             AAT ATA ATT TAA TAT TTA
 5             ACC CAC CCA GGT GTG TGG
 6             ACG CGA CGT GAC GTC TCG
 7             ACT AGT CTA GTA TAC TAG
 8             AGC CAG CTG GCA GCT TGC
 9             AGG CCT CTC GAG GGA TCC
10             ATC ATG CAT GAT TCA TGA
11             CCC GGG
12             CCG CGC CGG GCC GCG GGC

Table 4. Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class number   List of members (alphabetical order)
 1             AAAA TTTT
 2             AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
 3             AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
 4             AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
 5             AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
 6             AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
 7             AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
 8             AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
 9             AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10             AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11             AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12             AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13             AATT ATTA TAAT TTAA
14             ACAC CACA GTGT TGTG
15             ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16             ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17             ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18             ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19             ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20             ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21             ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22             ACGT CGTA GTAC TACG
23             ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24             ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25             AGAG CTCT GAGA TCTC
26             AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27             AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28             AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29             AGCT CTAG GCTA TAGC
30             AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31             AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32             ATAT TATA
33             ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34             ATCG CGAT GATC TCGA
35             ATGC CATG GCAT TGCA
36             CCCC GGGG
37             CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38             CCGG CGGC GCCG GGCC
39             CGCG GCGC

Analysis of DNA repeats by dinucleotide scales

In the case of dinucleotide scales, the prefix automaton contains four nodes (Figure 1). Each DNA sequence is associated with a path through the corresponding graph, and exact repeats are associated with cycles. All paths, including cycles, of length greater than four are composite in the sense that they contain a cycle of length 4 or less. In Table 5, we list the dinucleotide scale values $\mathcal{S}(X_1X_2) + \mathcal{S}(X_2X_1)$ for the six equivalence classes associated with all 16 possible dinucleotide repeats of the form $(X_1X_2)^R$. For each scale, we list classes (represented by their first alphabetical member) and the corresponding scale value, in decreasing value order. The highest level of base stacking energy is achieved by the AT repeat class (−10.39) and the lowest by the CG repeat class (−24.28). The rankings of all possible dinucleotide repeats induced by the propeller twist and the protein deformability scales are identical with the exception of an inversion between the CC (−16.22 and 12.2) and CG (−21.11 and 16.1) classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (−37.32 and 5.8) followed by the proper dinucleotide repeat class AG (−27.48 and 6.6). In Table 6, we list the dinucleotide scale values for the 12 equivalence classes associated with all possible triplet repeats of the form $(XYZ)^R$. In this special case, we find the results of Baldi et al. (1999).
The high and low ends of the base stacking energy scale are occupied by the triplet classes AAT (−15.76) and CCG (−32.54) respectively. We find again a high degree of correlation between the propeller twist and protein deformability scales. If we exclude the classes AAA/TTT (−55.98) and CCC/GGG (−24.33), which are not proper triplet repeat classes, then the maximum and the minimum of the propeller twist spectrum are respectively occupied by the classes CCG (−29.22) and AAG (−46.14). A similar ranking with the same extremal triplets is observed with the protein deformability scale: CCG (22.2) occupies the high end, whereas AAA (8.7) and AAG (9.5) occupy the low end of the spectrum. When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck and Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr and Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with fragile X syndrome (FRAXA) (Nelson, 1995; Gusella and MacDonald, 1996; Eichler and Nelson, 1998; Skinner et al., 1998; Gecz and Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.

Table 5. Dinucleotide structural scale values for repeat unit $p = X_1X_2$ with $P = 2$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2) + \mathcal{S}(X_2X_1)$

Base stacking    Propeller twist   Protein deformability
AT (−10.39)      CC (−16.22)       CG (16.1)
AA (−10.74)      CG (−21.11)       CC (12.2)
CC (−16.52)      AC (−22.55)       AC (12.1)
AG (−16.59)      AT (−26.86)       AT (7.9)
AC (−17.08)      AG (−27.48)       AG (6.6)
CG (−24.28)      AA (−37.32)       AA (5.8)

Table 6. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3$ with $P = 3$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2) + \mathcal{S}(X_2X_3) + \mathcal{S}(X_3X_1)$. Repeat classes associated with triplet repeat expansion diseases (AAG, AGC, CCG) are marked with *

Base stacking     Propeller twist    Protein deformability
AAT (−15.76)      CCC (−24.33)       CCG* (22.2)
AAA (−16.11)      CCG* (−29.22)      ACG (18.9)
ACT (−21.11)      ACC (−30.66)       CCC (18.3)
AAG* (−21.96)     AGC* (−34.53)      ACC (18.2)
AAC (−22.45)      AGG (−35.59)       AGC* (15.9)
ATC (−22.95)      ACG (−36.61)       ATC (15.9)
CCC (−24.78)      ATC (−37.94)       AAC (15.0)
AGG (−24.85)      ACT (−38.95)       AGG (12.7)
ACC (−25.34)      AAC (−41.21)       AAT (10.8)
AGC* (−27.94)     AAT (−45.52)       ACT (10.7)
ACG (−30.01)      AAG* (−46.14)      AAG* (9.5)
CCG* (−32.54)     AAA (−55.98)       AAA (8.7)

In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form $(X_1X_2X_3X_4)^R$. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (−20.78) and the proper tetranucleotide repeat AAAT (−21.13). The minimum corresponds to CGCG (−48.56) followed by ACGC (−41.36). We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (−32.44) and CCCG (−37.33) while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (−74.64 and 11.6) and AAAG (−64.80 and 12.4).

Table 7. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3X_4$ with $P = 4$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2) + \mathcal{S}(X_2X_3) + \mathcal{S}(X_3X_4) + \mathcal{S}(X_4X_1)$

Base stacking     Propeller twist   Protein deformability
ATAT (−20.78)     CCCC (−32.44)     CGCG (32.2)
AAAT (−21.13)     CCCG (−37.33)     CCCG (28.3)
AATT (−21.13)     CCGG (−37.33)     CCGG (28.3)
AAAA (−21.48)     ACCC (−38.77)     ACGC (28.2)
AACT (−26.48)     CGCG (−42.22)     ATGC (25.2)
AAGT (−26.48)     AGCC (−42.64)     ACCG (25.0)
AGAT (−26.98)     AGGC (−42.64)     ACGG (25.0)
AAAG (−27.33)     ACGC (−43.66)     CCCC (24.4)
ACAT (−27.47)     AGGG (−43.70)     ACCC (24.3)
AAAC (−27.82)     ACCG (−44.72)     ACAC (24.2)
AATC (−28.32)     ACGG (−44.72)     ACGT (23.0)
AATG (−28.32)     ATGC (−44.99)     AGCG (22.7)
ACCT (−29.37)     ACAC (−45.10)     ATCG (22.7)
AAGG (−30.22)     ATCC (−46.05)     AGCC (22.0)
AACC (−30.71)     ACCT (−47.06)     AGGC (22.0)
ATCC (−31.21)     ACGT (−48.08)     ATCC (22.0)
AGCT (−31.97)     AGCG (−48.59)     AACG (21.8)
CCCC (−33.04)     AACC (−49.32)     AACC (21.1)
AGGG (−33.11)     ACAT (−49.41)     ACAT (20.0)
AGAG (−33.18)     ACAG (−50.03)     AAGC (18.8)
AAGC (−33.31)     ACTC (−50.03)     AATC (18.8)
ACCC (−33.60)     ACTG (−50.03)     AATG (18.8)
ACAG (−33.67)     AGCT (−50.93)     AGGG (18.8)
ACTC (−33.67)     ATCG (−52.00)     ACAG (18.7)
ACTG (−33.67)     AAGC (−53.19)     ACTC (18.7)
ACAC (−34.16)     ATAT (−53.72)     ACTG (18.7)
ATGC (−34.30)     AAGG (−54.25)     AAAC (17.9)
ACGT (−34.53)     AGAT (−54.34)     ACCT (16.8)
AACG (−35.38)     AGAG (−54.96)     ATAT (15.8)
ATCG (−35.88)     AACG (−55.27)     AAGG (15.6)
AGCC (−36.20)     AATC (−56.60)     AGAT (14.5)
AGGC (−36.20)     AATG (−56.60)     AGCT (14.5)
ACCG (−38.27)     AACT (−57.61)     AAAT (13.7)
ACGG (−38.27)     AAGT (−57.61)     AATT (13.7)
CCCG (−40.80)     AAAC (−59.87)     AACT (13.6)
CCGG (−40.80)     AAAT (−64.18)     AAGT (13.6)
AGCG (−40.87)     AATT (−64.18)     AGAG (13.2)
ACGC (−41.36)     AAAG (−64.80)     AAAG (12.4)
CGCG (−48.56)     AAAA (−74.64)     AAAA (11.6)
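The cyclic sums used for the repeat-unit tables of this section can be illustrated with a short sketch. It assumes the propeller twist values read from Figure 1 and Tables 5–6 of this paper; the helper name `repeat_unit_score` is hypothetical. The unit is read on a circle, so a unit of length P contributes exactly P overlapping windows.

```python
# Propeller twist per dinucleotide step, as read from Figure 1 and
# Tables 5-6 of this paper.
PT = {
    "AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01,
    "CA": -9.45,  "CC": -8.11,  "CG": -10.03, "CT": -14.00,
    "GA": -13.48, "GC": -11.08, "GG": -8.11,  "GT": -13.10,
    "TA": -11.85, "TC": -13.48, "TG": -9.45,  "TT": -18.66,
}

def repeat_unit_score(unit, scale, S=2):
    """Score of a repeat unit with periodic boundary conditions:
    sum of the scale over the P windows of length S read on a circle."""
    P = len(unit)
    wrapped = (unit * S)[:P + S - 1]   # append the first S-1 letters cyclically
    return sum(scale[wrapped[i:i + S]] for i in range(P))

# One representative per trinucleotide class (first column of Table 3).
TRIPLET_CLASSES = ["AAA", "AAC", "AAG", "AAT", "ACC", "ACG",
                   "ACT", "AGC", "AGG", "ATC", "CCC", "CCG"]

ranking = sorted(TRIPLET_CLASSES,
                 key=lambda unit: repeat_unit_score(unit, PT),
                 reverse=True)
for unit in ranking:
    print(unit, round(repeat_unit_score(unit, PT), 2))
```

Because the scale is reverse-complement invariant and the sum is cyclic, every member of a class gives the same value, so `repeat_unit_score("CAG", PT)` and `repeat_unit_score("CTG", PT)` agree with the AGC entry of Table 6.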
All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT . . . when P is even, and by the class AATATATAT . . . when P is odd. The lowest level is achieved by the class CGCGCG . . . when P is even, and CCGCGCG . . . when P is odd. For protein deformability, the maximal level is achieved by the class CGCGCG . . . when P is even, and by CCGCGCG . . . when P is odd. The lowest level is associated with poly-A (i.e. $(A)^P$). Poly-C and poly-A also give the absolute highest and lowest propeller twist angles at all lengths.

Analysis of DNA repeats by trinucleotide scales

In the case of trinucleotide scales, the prefix automaton contains 16 nodes (Figure 2), each one labeled with a different dinucleotide. All paths, including cycles, of length greater than 16 are composite, i.e. contain at least one cycle of length 16 or less. The trinucleotide scale values for all repeats with periodic unit length P = 2 are given in Table 8.

[Figure 2: the 16 dinucleotide nodes (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) arranged on a circle; only the edges of the CAG cycle are drawn, with labels 0.175, 0.017 and 0.076.]

Fig. 2. Trinucleotide prefix automata for the bendability scale. Circle is used for ease of display but does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.

Table 8. Trinucleotide structural scale values for repeat unit $p = X_1X_2$ with $P = 2$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2X_1) + \mathcal{S}(X_2X_1X_2)$

Bendability     Position preference
AT (0.364)      AA (72)
AG (0.058)      CG (50)
AC (0.034)      AT (26)
CC (−0.024)     CC (26)
CG (−0.154)     AC (23)
AA (−0.548)     AG (17)
The highest level of bendability is achieved by AT (0.364) and the lowest by AA (−0.548) and CG (−0.154). The highest level of position preference is achieved by AA (72) and CG (50), and the lowest by AG (17). The trinucleotide scale values for all repeats with periodic unit length P = 3 are given in Table 9 (see also Baldi et al., 1999). The highest level of bendability is achieved by the class AGC (0.268) and the lowest by AAA (−0.822) and ACC (−0.238). In fact only two classes of repeats (AGC and ATC) have positive bendability and are well separated from the rest. The highest level of position preference is achieved by the class AAA (108) followed by CCG (72), and the lowest by AGG and AAC (21). The class AGC, which contains the CAG repeat responsible for the majority of the known triplet repeat expansion diseases, has the highest bendability. It is the only repeat class for which all three shifted triplets have a high individual bendability. Moreover, this class has a relatively low position preference value, another sign of flexibility. Therefore one can hypothesize that long CAG repeats correspond to stretches of DNA that are highly flexible in all positions. Consistent with their high flexibility, CAG/CTG repeats have been found to have the highest affinity for histones among all possible triplet repeats (Wang and Griffith, 1994, 1995; Godde and Wolffe, 1996). Other DNA sequences can adopt long range curvature only if they contain highly flexible triplets in phase with the helical pitch (roughly every 10.5 bp). The flexibility of extended CAG repeats has been verified experimentally (Chastain and Sinden, 1998). The CCG class, which contains the disease-related triplets CGG and GCC, is found at the high (rigid) end of the position preference scale (72), exceeded only by poly-A. This class is also stiff according to the bendability scale (−0.106).
This is consistent with the fact that CGG/CCG repeats seem completely unable to form nucleosomes (Wang et al., 1996; Godde et al., 1996). The AAG class, which contains the disease related triplet GAA, occupies the lower (flexible) end of the position preference scale (27). It is the second lowest considering that the last two classes have the same value (21). We also note that AAA/TTT is by far the stiffest of all possible repeats according to both scales. Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are bad candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate the access of transcription factors bound nearby (Iyer and Struhl, 1995; Zhu and Thiele, 1996). Interestingly, the sequence of the IT15 gene involved in Huntington's disease has a repeat containing 18 adenine nucleotides at its 3′ end.

Table 9. Trinucleotide structural scale values for repeat unit $p = X_1X_2X_3$ with $P = 3$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2X_3) + \mathcal{S}(X_2X_3X_1) + \mathcal{S}(X_3X_1X_2)$. Repeat classes associated with triplet repeat expansion diseases (AAG, AGC, CCG) are marked with *

Bendability       Position preference
AGC* (0.268)      AAA (108)
ATC (0.218)       CCG* (72)
AGG (−0.013)      AAT (63)
AAT (−0.030)      ACG (47)
CCC (−0.036)      AGC* (40)
ACG (−0.049)      CCC (39)
ACT (−0.068)      ACT (35)
AAG* (−0.091)     ACC (33)
CCG* (−0.106)     ATC (33)
AAC (−0.196)      AAG* (27)
ACC (−0.238)      AAC (21)
AAA (−0.822)      AGG (21)

Whereas the class CCG is extremely rigid according to the trinucleotide scales, it is extremely flexible according to the dinucleotide scales. Similarly, the predicted flexibility of the AAG class according to the position preference scale is in contradiction with the results obtained using all other di- or trinucleotide scales. Such discrepancies can result from imperfections of the scales, or from the fact that each scale captures a different facet of DNA structure. Dinucleotide and trinucleotide scales are in good agreement for CAG repeats and homopolymeric poly-A tracts.

The trinucleotide scale values for all repeats with periodic unit length P = 4 are given in Table 10. The highest level of bendability is achieved by the class ATAT (0.728), which is rather a dinucleotide repeat, followed by the proper tetranucleotide repeat ATGC (0.420). On the opposite end of the scale, we find AAAA (−1.096) and AAAC (−0.470). For the position preference scale, AAAA (144), AATT (100), CGCG (100), AAAT (99) are on the higher end, AAAT being the first proper tetranucleotide repeat, while ACGG (23) occupies the lower end of the spectrum.

Table 10. Trinucleotide structural scale values for repeat unit $p = X_1X_2X_3X_4$ with $P = 4$. $\mathcal{S}(p) = \mathcal{S}(X_1X_2X_3) + \mathcal{S}(X_2X_3X_4) + \mathcal{S}(X_3X_4X_1) + \mathcal{S}(X_4X_1X_2)$

Bendability       Position preference
ATAT (0.728)      AAAA (144)
ATGC (0.420)      AATT (100)
ACAT (0.335)      CGCG (100)
AGGC (0.301)      AAAT (99)
AGCT (0.214)      CCGG (94)
AGAT (0.189)      AGCG (89)
ACAG (0.183)      AGCT (86)
ACTG (0.173)      CCCG (85)
AGAG (0.116)      AGCC (80)
ACTC (0.082)      ATCG (76)
ACAC (0.068)      AATG (68)
AGCC (0.053)      AGGC (68)
AAGC (0.027)      AAAG (63)
ACCT (0.026)      ACGC (63)
AATG (0.011)      ATGC (62)
ACGC (0.006)      AAAC (57)
ACGT (−0.016)     AACG (57)
AGGG (−0.025)     AACT (55)
AGCG (−0.032)     AATC (54)
CCCC (−0.048)     AAGC (53)
CCGG (−0.058)     ATAT (52)
CCCG (−0.118)     CCCC (52)
AAGG (−0.162)     ACCG (49)
ACGG (−0.169)     AGAT (47)
AAGT (−0.171)     ACAC (46)
AATC (−0.181)     ACCC (46)
ACCG (−0.184)     ACTC (44)
ATCC (−0.209)     AAGT (43)
ATCG (−0.226)     ACAT (43)
AACT (−0.230)     ACCT (40)
ACCC (−0.250)     ATCC (38)
AACG (−0.278)     AGAG (34)
AAAT (−0.304)     AGGG (34)
CGCG (−0.308)     AACC (31)
AAAG (−0.365)     AAGG (31)
AATT (−0.424)     ACTG (29)
AACC (−0.468)     ACGT (28)
AAAC (−0.470)     ACAG (25)
AAAA (−1.096)     ACGG (23)

In order to find the most extreme repeats for a given scale at a given repeat unit length, one would have to explore scale values for repeat units up to length P = 16 (see Section Extremal sequences and automata). Because of particular values of the scale, in some cases the results tabulated above for values of P up to 4 only are sufficient. For instance, the most bendable repeat with P = 2n is always ATATAT . . . , while the least bendable is poly-A. Similarly, the highest value of the position preference scale is always occupied by poly-A. Extremal results can also be derived by dynamic programming. In many cases, however, a sequence of interest may have a very high or low score according to a given scale, without being the most extreme. The probabilistic theory provides the means to quantify directly how extreme any given sequence is with respect to a given family or background.

Probabilistic analysis of DNA scales

For simplicity, we first assume a uniform distribution $p_A = p_C = p_G = p_T$. In specific applications, other distributions can be used, such as the background distribution of a given genome or a given class of DNA sequences. We can then use equations (21)–(24) to calculate the expectation and variance of $\mathcal{S}(p)$ across all possible repeat unit patterns $p$ and all scales $\mathcal{S}$. In particular, $E(\mathcal{S}(p)) = \alpha_{\mathcal{S}} P$ and $\mathrm{Var}(\mathcal{S}(p)) = \beta_{\mathcal{S}} P$. In Table 11 we list the relevant coefficients for the dinucleotide scales.
Table 11. Basic intra-scale coefficients for dinucleotide scales with repeat unit length $P \geq (2S-1) = 3$

                                                BS       PT        PD
$\alpha_{\mathcal{S}} = E(Y_i)$                 −8.08    −12.59    4.96
$C_0 = \mathrm{Var}(Y_i)$                       6.62     9.68      9.62
$C_1 = \mathrm{Cov}(Y_i, Y_{i+1})$              0.31     2.26      −2.19
$\beta_{\mathcal{S}} = \mathrm{Var}(\mathcal{S}(p))/P$   7.23     14.20     5.23

In Table 12 we list the relevant coefficients for the trinucleotide scales.

Table 12. Basic intra-scale coefficients for trinucleotide scales with repeat unit length $P \geq (2S-1) = 5$

                                                B         PP
$\alpha_{\mathcal{S}} = E(Y_i)$                 −0.018    13.78
$C_0 = \mathrm{Var}(Y_i)$                       0.015     103.108
$C_1 = \mathrm{Cov}(Y_i, Y_{i+1})$              0.0015    18.214
$C_2 = \mathrm{Cov}(Y_i, Y_{i+2})$              −0.001    −2.558
$\beta_{\mathcal{S}} = \mathrm{Var}(\mathcal{S}(p))/P$   0.015     134.42

To double-check the mathematical formula, all the constants above were also obtained independently by exhaustive sampling.

Convergence to normality. Using sampling methods we also studied the convergence of $\mathcal{S}(s)$ and $\mathcal{S}(p)$ to a normal distribution as the length $N$ or $P$ of the sequences is increased, as predicted by our central limit theorem. In practice, the convergence rate is very fast. As an example, a histogram of bendability values for repeat units of length 5, 10 and 15 is given in Figure 3. Similar results are observed with plain sequences.

[Figure 3: three histogram panels, one per repeat unit length. Panel annotations as extracted: P = 5: µ = 0.0923, σ = 0.0771; P = 10: µ = 0.185, σ = 0.154; P = 15: µ = 0.277, σ = 0.231.]

Fig. 3. Histogram of bendability values $\mathcal{S}(p)$ for all possible repeat units of length P = 5, 10, 15. Vertical dashed lines represent standard deviation units.

Examples of Z-scores for disease triplets. We have seen that the triplets involved in expansion diseases often tend to have extremal structural properties. This was assessed by computing the scores $\mathcal{S}(p)$. We can now also compute Z-scores using equations (29) and (30), as in Table 13. When repeat length is taken into consideration, disease causing repeats appear to be even more extreme because of the $\sqrt{N}$ factor in equation (30). For example, a CGG repeat of length N = 3000 (R = 1000) observed in FRAXA patients is more than 48 standard deviations away from the mean position preference value of random uniform sequences of the same length.

Table 13. Z-scores for the repeats involved in Huntington's disease (HD) and fragile X syndrome (FRAXA) for the bendability (B) and position preference (PP) scales. Z(p) is the Z-score of the repeat unit against the background of all possible repeat units of same length P = 3. Z(r) is the Z-score of a long repeat containing R repeat units against the background of all possible sequences of same length N = RP, including non repetitive sequences. Values of R are chosen at the characteristic low end and high end of each disease

Disease   Triplet   Scale   Z(p)    R (low end)   Z(r)     R (high end)   Z(r)
HD        CAG       B       1.60    36            9.00     121            16.52
FRAXA     CGG       PP      1.56    200           21.48    1000           48.23

Correlations between scales at short lengths. We can use equations (31)–(35) to study the correlations between the scales at short lengths and asymptotically. Clearly we can also consider AT-content as a scale. It can be viewed, for instance, as a mononucleotide scale with value 1 for A and T and 0 for C and G. This scale is trivially reverse-complement invariant and perfectly additive. We include it to see whether it correlates strongly with any of the structural scales, especially asymptotically. Correlations at a given position are given in Table 14. Consistent with Baldi et al. (1998) and the results above, the correlations between the structural scales are very low with the exception of PT and PD (0.668). BS and PT also have non-trivial opposite correlations with respect to AT-content (0.48 and −0.54).

Table 14. Correlations between the scales at a given position ($C_0(\mathcal{S}_1, \mathcal{S}_2)$). AT-content (AT), base stacking (BS), propeller twist (PT), protein deformability (PD), bendability (B), position preference (PP)

      AT         BS        PT        PD        B         PP
AT    1
BS    0.478      1
PT    −0.539     −0.294    1
PD    −0.294     0.043     0.668     1
B     −0.0098    −0.018    0.249     0.141     1
PP    0.040      −0.089    −0.123    −0.036    −0.080    1
Here correlations between a dinucleotide scale $\mathcal{S}_1$ and a trinucleotide scale $\mathcal{S}_2$ are computed using sums of the form $\sum_{X_1X_2X_3} \mathcal{S}_1(X_1X_2)\,\mathcal{S}_2(X_1X_2X_3)$. Because the third nucleotide does not appear in the dinucleotide scale, in Baldi et al. (1998) correlations between dinucleotide and trinucleotide scales were also computed using, for the dinucleotide scale, the sum $\mathcal{S}_1(X_1X_2) + \mathcal{S}_1(X_2X_3)$. When considering neighboring dinucleotides, the correlation between BS and PT, for instance, increases its magnitude from −0.294 to −0.550. This effect must be caused by correlations that are present in runs of overlapping dinucleotides, but not in the single dinucleotides. Such a phenomenon may arise if the physical reality behind both scales is that the structure actually depends on more than a dinucleotide step, and this is very likely to be the case.

Correlations between scales: asymptotic values. If the same correlations are computed by shuffling the 4.6 Mbp of the E. coli genome randomly over a length of 31 bp, one obtains the numbers given in Table 15 (Pedersen et al., 2000). The correlation between BS and PT is higher (−0.744) and so is the correlation of AT-content with BS and PT (0.899 and −0.882). Incidentally, when measured on the actual E. coli genome the correlations are even higher. For instance, the correlation between BS and PT becomes −0.825. These results are easily explained by the theory developed here. The asymptotic correlations between the scales computed using equation (35) are displayed in Table 16. Because the E. coli genome has a nucleotide distribution close to uniform, the results are indeed remarkably similar to Table 15, and would be identical up to sampling fluctuations if in Table 16 we had used the precise distribution for E. coli (A = 0.2462, C = 0.2542, G = 0.2537, T = 0.2459), instead of a uniform distribution.
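As a hedged worked example of equations (32) and (35), the sketch below computes the correlation between AT-content (a mononucleotide scale) and propeller twist under a uniform background, both at a given position and asymptotically. The propeller twist values are those read from Figure 1 and Tables 5–6 of this paper; the function names are hypothetical, and the exhaustive enumeration over windows is only meant for short scales such as these.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"

# AT-content as a mononucleotide scale: 1 for A and T, 0 for C and G.
AT = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 1.0}

# Propeller twist per dinucleotide step, as read from Figure 1 and
# Tables 5-6 of this paper.
PT = {
    "AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01,
    "CA": -9.45,  "CC": -8.11,  "CG": -10.03, "CT": -14.00,
    "GA": -13.48, "GC": -11.08, "GG": -8.11,  "GT": -13.10,
    "TA": -11.85, "TC": -13.48, "TG": -9.45,  "TT": -18.66,
}

def mean(scale, S):
    """Expected value of the scale over uniform S-tuples."""
    return sum(scale["".join(t)] for t in product(ALPHABET, repeat=S)) / 4 ** S

def cross_cov(s1, S1, s2, S2, k):
    """C_k(S1, S2) = Cov(Y_i, Z_{i+k}) for a uniform i.i.d. background,
    by enumerating the smallest window containing both tuples."""
    start, end = min(0, k), max(S1, k + S2)
    y0, z0 = -start, k - start          # offsets of Y_i and Z_{i+k} in the window
    length = end - start
    total = 0.0
    for letters in product(ALPHABET, repeat=length):
        w = "".join(letters)
        total += s1[w[y0:y0 + S1]] * s2[w[z0:z0 + S2]]
    return total / 4 ** length - mean(s1, S1) * mean(s2, S2)

def beta(scale, S):
    """Asymptotic variance per position: sum of C_k for |k| < S."""
    return sum(cross_cov(scale, S, scale, S, k) for k in range(-S + 1, S))

def position_correlation(s1, S1, s2, S2):
    """Correlation of the two scales evaluated at the same start position."""
    c0 = cross_cov(s1, S1, s2, S2, 0)
    return c0 / sqrt(cross_cov(s1, S1, s1, S1, 0) * cross_cov(s2, S2, s2, S2, 0))

def asymptotic_correlation(s1, S1, s2, S2):
    """Equation (35), with S1 <= S2."""
    num = sum(cross_cov(s1, S1, s2, S2, k) for k in range(-S2 + 1, S1))
    return num / sqrt(beta(s1, S1) * beta(s2, S2))

print(position_correlation(AT, 1, PT, 2))
print(asymptotic_correlation(AT, 1, PT, 2))
```

Under these assumptions the two printed values agree, to three decimals, with the AT/PT entries of Table 14 and Table 16 respectively, which illustrates how the magnitude of the correlation grows from a single position to the asymptotic regime.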
Table 15. Correlations between the scales measured over 31 bp random segments from E. coli

      AT        BS        PT        PD       B         PP
AT    1
BS    0.899     1
PT    −0.882    −0.744    1
PD    −0.777    −0.805    0.801     1
B     −0.153    −0.181    0.370     0.108    1
PP    0.023     −0.029    −0.154    0.062    −0.206    1

Table 16. Asymptotic correlations between the scales using a uniform distribution

      AT        BS        PT        PD       B         PP
AT    1
BS    0.914     1
PT    −0.891    −0.757    1
PD    −0.798    −0.843    0.810     1
B     −0.167    −0.193    0.387     0.113    1
PP    0.046     −0.003    −0.175    0.051    −0.225    1

It is essential to notice that the asymptotic values do not require very long sequences but are approximately correct already at a length scale of 15–20 base pairs or so (Figure 4). Asymptotically, and with a uniform distribution, all the dinucleotide scales have strong positive or negative correlations with each other and with AT-content. Notice that this is not the case for the trinucleotide scales. It is also not necessarily the case if the correlations are measured with respect to other nucleotide distributions. Figure 5 shows how the asymptotic correlation of bendability with AT content varies with the underlying distribution. Similar surfaces for all other scales are given in Figure 8 in the appendix. Notice that in general A/(A + T) ≈ 0.5 in eukaryotic genomic DNA as soon as sufficiently long stretches of DNA are taken into consideration (Chargaff's second parity rule) (Prabhu, 1993; Bell and Forsdyke, 1999). This is not necessarily the case with, for instance, relatively short stretches of DNA, synthetic DNA, or with bacterial DNA that contains a strong composition skew associated with independently replicated regions.

Analysis of a set of expandable repeats in primate genomes

Because triplets involved in disease expansions seem to have extremal properties which may be related to the expansion mechanism, it is worth testing whether this is a fairly general feature of units associated with tandem repeats.
Here we consider the large set of repeat unit classes derived in Jurka and Pethiyagoda (1995) corresponding to frequently encountered tandem repeats of multiple lengths (i.e. that are polymorphic) in primate genomes. The data contains 501 unique classes of repeat units ranging in length from P = 1 to 6, classified into three categories: expandable, weakly expandable, and non-expandable. These categories were derived from simple statistical criteria calculated over a subset of the GenBank database. The first category, in particular, contains 67 classes of patterns that were found to occur in tandem repeats of total length N ≥ 12 with R ≥ 3 in at least two different length sizes. The second category includes 71 pattern classes that are found to occur in tandem repeats over 12 nucleotides long in only one length size. The last category contains 363 pattern classes that were found not to expand beyond 12 nucleotides. For each pattern class, simple indicators are provided, such as average length of repeats, relative abundance of class, and expandability. In Figure 6 we display the Z-scores for all the repeat units in the first category, as a function of repeat unit length (P), computed with respect to repeat units of the same length using equation (29). The distributions of repeat classes with respect to each structural scale are approximately symmetric and normal at all lengths, showing no clear-cut bias towards extremal values of any scale. When taking into account the relative abundance of the classes, as quantified in Jurka and Pethiyagoda (1995) by using the number of nucleotides occurring in corresponding repetitive sequences, we nevertheless observe that the most abundant class (poly-A, 33%) corresponds to the stiffest repeat at all lengths, for all scales except BS (Figure 7). It is reasonable to assume that the structural properties of poly-A are related to its abundance. The next most frequent repeats, however (AC = 17.94%, AG = 5.38%, and AAAT = 3.39%), do not show a clear pattern of extremal values in the five scales considered. Likewise, we do not find any obvious correlation between scale values and the relative abundance or expandability indicators provided in Jurka and Pethiyagoda (1995).

[Figure 4: correlation (vertical axis, −1 to 1) versus string length N (horizontal axis, 0 to 35) for the scale pairs PT/PD, PT/B, PD/B, PD/PP, BS/PP, PT/PP, BS/B, B/PP, BS/PT, and BS/PD.]
Fig. 4. Rapid convergence of correlations between pairs of scales to the theoretically predicted asymptotic values as a function of string length with uniform nucleotide distribution and free boundary conditions. Curves start at N = 2 for pairs of dinucleotide scales, and N = 3 for all other pairs. Correlations are calculated exactly up to N = 12, and using a random uniform sample of 70 000 000 points for N = 17, 22, 27, 32.

[Figure 5: surface plot of the AT/B correlation over AT content and A/(A+T) ratio.]
Fig. 5. Surface representing the asymptotic correlation between the bendability scale and the AT content scale for different distribution values of AT-content and A/T proportions.

Discussion
A general framework for sequence analysis in the presence of one or more additive scales has been developed. The framework solves a number of open issues including: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic databases; (4) rapid convergence to normal distributions when N increases; and (5) complete analysis of correlations between scales and their rapid convergence towards a fixed asymptotic value.
[Figure 6: five panels (BS, PT, PD, B, PP); horizontal axis, repeat unit length P (1 to 6); vertical axis, Z-score.]
Fig. 6. Distribution of expandable repeat classes over scale values. Horizontal axis represents repeat unit length (P). Vertical axis represents scale value distances from expectation for all possible repeat units, normalized by the corresponding standard deviation [see equation (29)]. Dots represent a set of 67 expandable repeat classes. The number of classes at each length 1–6 is as follows: 2, 4, 9, 18, 18, and 16. Circles represent the most extreme repeat classes, when considering the full population of repeat patterns at a given length. Note that at each length P, only proper P-tuple repeats are represented in the set found in Jurka and Pethiyagoda (1995), excluding repeats with shorter periodicity. For instance, poly-A is only represented once, as a single letter repeat. It is worth noticing that the most extreme positions are actually occupied at almost all lengths by mono-, di-, or tri-nucleotide repeats. The set of expandable repeats therefore actually covers the full range of each scale when taking into account patterns made up of shorter repeated units. Only 5 circles out of 64 remain unmatched by an expandable repeat class.

[Figure 7 (caption below): vertical axis, relative abundance (%); labeled classes include A, AC, AG, AT, AAAC, AAGG, AGAT, and AAAAG.]

The framework has been applied to DNA sequences and structural scales. The fundamental requirement for the application of the framework is the additivity of the scale. This is likely to be a reasonable approximation for many scales, at least over relatively short distances. The precise nature of such distances, however, is an important open question that ultimately will have to be addressed experimentally. The structural scales used here should be regarded only as a first order approximation. The twist angle between bases, for instance, is likely to depend on more than just the two neighboring bases (Dickerson, 1992).
A better estimate could be derived using the tetranucleotide consisting of the two bases before and after the twist angle. Unfortunately, the structure of all possible 256 tetranucleotides is not known and represents a considerable experimental challenge. But the methods we have developed are independent of any particular scale, approximation, or oligonucleotide length. They are readily applicable to new scales, tetranucleotide and other, as well as to completely different scales defined over other alphabets (codons, RNA, proteins, etc.). Furthermore, the methods are also applicable in conjunction with computationally-derived scales that are parameterized and fitted to the data using neural network representations and other statistical machine learning techniques.

[Figure 7: horizontal axis, repeat unit length (P) from 1 to 6; labeled classes include AGC, AGG, CCG, AAT, AAAG, AAAT, AGCCC, AAAAT, AAAAC, and AACCCT.]
Fig. 7. Relative abundance of expandable repeat unit classes in Jurka and Pethiyagoda (1995) as a function of repeat unit length. The individual contributions of repeat classes totaling more than 1% relative abundance are shown.

The theory presented resolves the issue of correlation between scales. For each pair of scales there is not a single correlation number because correlation depends both on background distribution and window length. Given a fixed background distribution, the correlations rapidly converge to a fixed asymptotic value that can be predicted mathematically. This value is attained here over a characteristic window length of about 10–15 bp, the same over which normality is achieved, and corresponding to a few times the size of the longer of the two scales being compared. This is the range over which local statistical fluctuations are stabilized, but it does not correspond necessarily to the range over which the scales are additive.
With a uniform background distribution, for instance, the correlation between propeller twist and base stacking varies monotonically from −0.294 to −0.757, as window size is increased from 2 to 10 or so. Thus the increase in window length significantly changes the measured correlation from slightly negative to substantially negative. An even more striking example is provided by the correlation between base stacking and protein deformability, which varies from 0.043 to −0.843, under the same conditions. Empirical determination of the proper window length for additivity and for measuring correlations may not be easy. It must be noted, however, that these large variations are observed only with the theoretically-derived base stacking energy scale (as well as the AT-content scale). In general, all other empirically-derived scales exhibit pairwise correlations that do not vary dramatically with window length and are relatively small (Figure 4), except for propeller twist and protein deformability. Thus for the empirically-derived scales the local behavior and the aggregate behavior over 10–15 bp are quite similar in terms of correlations, so that the precise selection of a window length may not be a serious obstacle in this case. Propeller twist and protein deformability are highly, but not perfectly, correlated, over both very short and intermediate distances. These scales were derived by crystallography of naked DNA and DNA in complex with protein, respectively. This suggests that DNA structures observed in protein–DNA complexes may to some extent be determined at the DNA-sequence level, or at least that the structure of DNA in the complex has to be consistent with the inherent structural features of the naked DNA. In general, when substantial positive or negative correlation between two scales is observed, two different sets of conclusions can be drawn.
First, from a practical standpoint, it may be simpler and faster in database searches to use only one of the two scales, since the results provided by the second one are redundant. Second, high correlation between two completely different experimental approaches attempting to quantify the connection between DNA sequence and structure can be taken as a sign that the approaches are measuring the same underlying reality. Thus correlation analysis, in addition to indicating which scale to use in practice, may tell us something about the interpretation and validity of the scales. It may at first seem strange that correlations depend also on background distribution, since structure is a deterministic function of DNA sequence. In this sense, a uniform distribution may be the least biased. On the other hand, if correlations are measured over a large number of sequences extracted from genomic data, it is clear that the sequence composition influences the correlation. Similarly, if the scales are used to pull out signals against a genomic background, it is important to take the statistical composition into consideration. In this respect, it is worth noting that large scale genomic DNA is characterized by strand invariant compositions ($p_A = p_T$ and $p_C = p_G$) where correlations between empirically-derived scales tend to be smaller in absolute value. The framework, however, applies as well to compositions that are not strand invariant. We have modeled the background distribution using single nucleotide probabilities but it should be clear that the same framework can accommodate more complex probabilistic models, such as Markov models of order k. In fact, it is interesting to note that with a higher order Markov model, some of the correlations between the scales measured in E.coli would be slightly higher.
It is fair to suspect, however, that the structural models currently available are somewhat noisy and therefore only marginal gains are to be expected at best from the use of more refined probabilistic models. Taken together, all these results indicate that with the exception of propeller twist and protein deformability, the empirical scales we have used are by and large uncorrelated. DNA 3D structure is a complex phenomenon that cannot be captured locally by a single number, but rather corresponds to a vector of properties. It is therefore likely that the scales we have used represent different attempts at capturing various aspects of DNA structure from different perspectives. This view is consistent with the simultaneous presence of predominantly low and occasionally high correlations between pairs of scales. In particular, this provides a possible partial explanation for the differences in interpretation the scales provide for some of the three extremal triplet classes involved in triplet repeat expansion diseases. The CAG-class of repeats is consistently found to be flexible according to all the models used here and in agreement with experimental evidence (Chastain and Sinden, 1998). This class is special among triplet repeats, being responsible for a large fraction of the currently known triplet repeat diseases (10 of the 13 mentioned in Baldi et al. (1999)). Furthermore, in a model study in E.coli, the CAG triplet repeat was found to be the predominant genetic expansion product. In this study, the CAG-class was expanded at least nine times more frequently than any other triplet (Ohshima et al., 1996). The CGG-class of repeats, on the other hand, seems to be very rigid, except for the propeller twist (and thus protein deformability) scale. Better structural models may be needed to shed light on such a discrepancy.
However, it is important to remember that the models used in this work are based on mutually different and also rather indirect investigations of DNA structure. Any single scale is likely to capture correctly only some structural features of some sequence elements. For instance, the enzyme DNase I used to produce the bendability scale preferentially binds and cuts sites where DNA is bent or bendable away from the minor groove. This means that a high DNase I value can be caused by either a very flexible piece of DNA (isotropically flexible, or anisotropically flexible in the right direction), or alternatively by a piece of DNA that is stiff but curved with a compressed major groove. The framework derived here can be used to study how extreme tandem repeats and other sequences are. In several cases, we find that commonly encountered repeat units have extremal structural properties. This is the case for the most common repeat, poly-A or poly-T, but also for the repeats involved in triplet repeat expansion diseases. It is essential to notice that these extremal properties pertain to the repeat unit class, e.g. the triplet and its shifted versions, rather than the repeat unit alone. A triplet that is not extremal for a given scale may become extremal once its two shifted versions are considered. For example, AGC has relatively low bendability when taken alone, but corresponds to the most bendable class when GCA and CAG are taken into account. When the large repetition numbers associated with repeat diseases are also taken into consideration, the extremality of the corresponding DNA sequences with respect to the general background is even more striking. Incidentally, the triplet repeat class is under-represented amongst primate DNA repeat expansions (Figure 7), suggesting that special expansion mechanisms may be at work for P = 1 and P = 3, at least in a fraction of the cases.
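To make the notion of a class-level property concrete, the sketch below groups all 64 triplets into repeat classes (cyclic shifts plus reverse complements) and ranks the classes by mean bendability, using the B values of Table 18. This is an illustrative Python sketch with hypothetical helper names. Averaging over class members is one simple choice of class-level score and is an assumption here, not necessarily the normalization used in the paper.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Bendability (B) from Table 18, keyed by triplet/reverse complement.
B_TABLE = """
AAA/TTT -0.274  AAC/GTT -0.205  AAG/CTT -0.081  AAT/ATT -0.280
ACA/TGT -0.006  ACC/GGT -0.032  ACG/CGT -0.033  ACT/AGT -0.183
AGA/TCT  0.027  AGC/GCT  0.017  AGG/CCT -0.057  ATA/TAT  0.182
ATC/GAT -0.110  ATG/CAT  0.134  CAA/TTG  0.015  CAC/GTG  0.040
CAG/CTG  0.175  CCA/TGG -0.246  CCC/GGG -0.012  CCG/CGG -0.136
CGA/TCG -0.003  CGC/GCG -0.077  CTA/TAG  0.090  CTC/GAG  0.031
GAA/TTC -0.037  GAC/GTC -0.013  GCA/TGC  0.076  GCC/GGC  0.107
GGA/TCC  0.013  GTA/TAC  0.025  TAA/TTA  0.068  TCA/TGA  0.194
"""

def load_scale(text):
    """Map each of the 64 triplets to its scale value."""
    tokens = text.split()
    scale = {}
    for key, value in zip(tokens[::2], tokens[1::2]):
        for triplet in key.split("/"):
            scale[triplet] = float(value)
    return scale

def repeat_class(unit):
    """All cyclic shifts of the unit together with their reverse complements."""
    shifts = {unit[i:] + unit[:i] for i in range(len(unit))}
    return frozenset(shifts | {s.translate(COMPLEMENT)[::-1] for s in shifts})

def class_mean(members, scale):
    """Mean scale value over the members of a class (an assumed class score)."""
    return sum(scale[m] for m in members) / len(members)

B = load_scale(B_TABLE)
classes = {repeat_class("".join(t)) for t in product("ACGT", repeat=3)}
ranked = sorted(classes, key=lambda c: class_mean(c, B), reverse=True)
for members in ranked:
    print(f"{class_mean(members, B):+.4f}", " ".join(sorted(members)))
```

With this choice the class containing AGC, GCA, and CAG comes out as the most bendable, the class containing CAT comes second, and the poly-A class is the least bendable.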
How the expansion of disease causing triplets occurs, as well as several related puzzling questions such as why expansion frequency depends on repeat length, remain poorly understood although it is widely assumed that unusual structural characteristics of the repeats may play a role. Several models have been proposed involving base-slippage and alternative DNA structures during DNA replication, recombination, and repair (Wells, 1996; Pearson and Sinden, 1998a,b; Moore et al., 1999). Growing evidence indicates that the formation of hairpins in Okazaki fragments (during replication of the lagging strand) is probably involved in the expansion process (Chen et al., 1995; Gacy et al., 1995; Wells, 1996; Mariappan et al., 1998; Miret et al., 1998; Pearson and Sinden, 1998b). But many questions remain open to interpretation. Our analysis of a large set of tandem repeats from primate genomes reveals that many repeat units do not have salient structural extreme properties according to the models used here. The results suggest that tandem repeats are likely to belong to different classes and result from a variety of different mechanisms not all of which involve extremal structural sequences. This is not to say that structural signatures, rather than extreme patterns, may not be involved in other cases as suggested, for instance, in Liao et al. (2000). Further evidence for such a possibility is provided by the fact that among the most frequently expanded repeat classes with 3 < P < 7, substrings of mono repeats (particularly poly-A which is stiff) seem to be present almost always (Figure 7). Although some of the equations derived are for exact repeats, it should be clear that the framework applies immediately to situations where the repeat is not perfect, either because of small variations in the sequence or because the repetition number R contains a fractional component. 
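Because the scales are additive, scoring an imperfect repeat requires nothing beyond summing the scale over the overlapping words of the actual sequence. The following is a minimal Python sketch, assuming the bendability values of Table 18 and a free-boundary sum over overlapping trinucleotides; the function and variable names are hypothetical, and the interrupted sequence is a made-up example of a CAG tract in which one G is replaced by a T.

```python
# Bendability (B) from Table 18, keyed by triplet/reverse complement.
B_TABLE = """
AAA/TTT -0.274  AAC/GTT -0.205  AAG/CTT -0.081  AAT/ATT -0.280
ACA/TGT -0.006  ACC/GGT -0.032  ACG/CGT -0.033  ACT/AGT -0.183
AGA/TCT  0.027  AGC/GCT  0.017  AGG/CCT -0.057  ATA/TAT  0.182
ATC/GAT -0.110  ATG/CAT  0.134  CAA/TTG  0.015  CAC/GTG  0.040
CAG/CTG  0.175  CCA/TGG -0.246  CCC/GGG -0.012  CCG/CGG -0.136
CGA/TCG -0.003  CGC/GCG -0.077  CTA/TAG  0.090  CTC/GAG  0.031
GAA/TTC -0.037  GAC/GTC -0.013  GCA/TGC  0.076  GCC/GGC  0.107
GGA/TCC  0.013  GTA/TAC  0.025  TAA/TTA  0.068  TCA/TGA  0.194
"""

def load_scale(text):
    """Map each of the 64 triplets to its scale value."""
    tokens = text.split()
    scale = {}
    for key, value in zip(tokens[::2], tokens[1::2]):
        for triplet in key.split("/"):
            scale[triplet] = float(value)
    return scale

def additive_score(seq, scale, k=3):
    """Sum of the scale over all overlapping k-mers (free boundary conditions)."""
    return sum(scale[seq[i:i + k]] for i in range(len(seq) - k + 1))

B = load_scale(B_TABLE)
pure = "CAG" * 10                             # perfect (CAG)10 tract
interrupted = "CAG" * 5 + "CAT" + "CAG" * 4   # one G replaced by T

for name, seq in (("pure", pure), ("interrupted", interrupted)):
    total = additive_score(seq, B)
    print(f"{name:12s} total = {total:.3f}  mean = {total / (len(seq) - 2):.4f}")
```

In this example the single substitution changes three overlapping trinucleotides (CAG, AGC, GCA become CAT, ATC, TCA) and lowers the total bendability only slightly, from about 2.59 to about 2.54.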
Cases are described in the literature (Orr et al., 1993, for instance), where the G of a few isolated CAG triplets within a long CAG repeat region are replaced by a T. Of interest, the CAT triplet belongs to the second highest bendability class, and therefore the flexibility properties of such stretches are likely to be preserved. More generally, the scales could perhaps form the basis of new alignment penalties in cases where structure, rather than sequence, is preserved. The methods developed here can be applied more systematically to other repeats including telomeric repeats, non-triplet disease-causing repeats, as well as to database searches for new putative disease-causing repeat classes (Kleiderlein et al., 1998; Baldi et al., 1999). For instance, a repeated twelve-mer upstream of the EPM1 gene displays intergenerational instability and has been associated with myoclonus epilepsy (Lalioti et al., 1997, 1998). A similarly unstable, AT-rich, 42 bp-repeat is involved in the fragile site FRA10B (Hewett et al., 1998). In addition, a particular form of muscular dystrophy (FSHD) seems to be caused by DNA contraction, rather than expansion. In FSHD, the repeat units are surprisingly long (3.3 kb), and located at the tip of the long arm of chromosome IV. The units are repeated 30–40 times in normal individuals, and reduced down to eight repeats in affected individuals (van Deutekom et al., 1993; Winokur et al., 1994; Hewitt et al., 1994). Statistical analysis of long repeat units may benefit even more from the techniques developed here. Finally, these techniques can be applied to repeats associated with interspersed elements as well, and, more broadly, to the analysis of structural signals, across entire genomes (Pedersen et al., 2000), or associated with specific regions such as regulatory regions, protein binding sites, SARs (scaffold attachment regions), or polytene bands in Drosophila. 
The methods are also being applied to phylogenetic questions and to whether DNA structure may have had any influence on the origin of the genetic code. While all these problems would benefit from improved structural models, the methods are now in place to work in conjunction with any new scale that may become available in the near future with progress in DNA experimental techniques.

Acknowledgements
The work of PB was initially supported by an NIH SBIR grant to Net-ID, Inc., and currently by a Laurel Wilkening Faculty Innovation award at UCI. We would like to thank Anders Gorm Pedersen and David Ussery for comments on the manuscript.

Appendix
Dinucleotide and trinucleotide scales
The three dinucleotide scales (Table 17) and two trinucleotide scales (Table 18) used in the text.

Table 17. Dinucleotide scales
Pair      BS       PT      PD
AA/TT    −5.37   −18.66    2.9
AC/GT   −10.51   −13.10    2.3
AG/CT    −6.78   −14.00    2.1
AT       −6.57   −15.01    1.6
CA/TG    −6.57    −9.45    9.8
CC/GG    −8.26    −8.11    6.1
CG       −9.69   −10.03   12.1
GA/TC    −9.81   −13.48    4.5
GC      −14.59   −11.08    4.0
TA       −3.82   −11.85    6.3

Table 18. Trinucleotide scales
Triplet     B        PP
AAA/TTT   −0.274     36
AAC/GTT   −0.205      6
AAG/CTT   −0.081      6
AAT/ATT   −0.280     30
ACA/TGT   −0.006      6
ACC/GGT   −0.032      8
ACG/CGT   −0.033      8
ACT/AGT   −0.183     11
AGA/TCT    0.027      9
AGC/GCT    0.017     25
AGG/CCT   −0.057      8
ATA/TAT    0.182     13
ATC/GAT   −0.110      7
ATG/CAT    0.134     18
CAA/TTG    0.015      9
CAC/GTG    0.040     17
CAG/CTG    0.175      2
CCA/TGG   −0.246      8
CCC/GGG   −0.012     13
CCG/CGG   −0.136      2
CGA/TCG   −0.003     31
CGC/GCG   −0.077     25
CTA/TAG    0.090     18
CTC/GAG    0.031      8
GAA/TTC   −0.037     12
GAC/GTC   −0.013      8
GCA/TGC    0.076     13
GCC/GGC    0.107     45
GGA/TCC    0.013      5
GTA/TAC    0.025      6
TAA/TTA    0.068     20
TCA/TGA    0.194      8

Enumeration of equivalence classes
Polya’s enumeration theory solves a number of combinatorial questions related to the action of groups on sets. The relevant set here is the set $B$ of all words of length $P$ over the alphabet $\mathcal{A}$, with $A^P$ elements, where $A$ denotes the alphabet size. The group $G$ of interest is a subgroup of the group of all permutations of $B$. If we consider circular permutations on one strand only, it is the circular group with $P$ elements, generated by the single right shift operator $\rho$. If we consider also the reverse complement, then the group $G$ is easily described as the set of all permutations of the form $\rho^k \gamma^l$, where $\gamma$ denotes the reverse complement permutation, with $0 \le k \le P-1$, and $l = 0$ or $1$. Notice that this group is not commutative. However $\rho^k \gamma^l = \gamma^l \rho^{P-k}$. In both cases, $G$ acts on $B$ and an equivalence relation $\rho_1 \equiv \rho_2$ is defined if and only if there exists a $g \in G$ such that $\rho_2 = g\rho_1$. The total number of classes or orbits is given by the Burnside lemma and equal to:

$$\frac{1}{|G|} \sum_{g \in G} |B_g| \qquad (37)$$

where $B_g = \{x \in B \mid g(x) = x\}$ is the set of all the elements of $B$ that are fixed by $g$. We thus need to study $B_{\rho^k}$ and $B_{\rho^k \gamma}$. A case by case inspection easily shows that:
• If $(k, P) = 1$, then $|B_{\rho^k}| = A$ since only sequences of one repeated letter are stable.
• If $k \mid P$, then $|B_{\rho^k}| = A^k$.
• If $(k, P) = l$, then $|B_{\rho^k}| = A^l$.
When $G$ is the group of cyclic permutations, $|G| = P$. Putting these results together gives the right hand side of equation (7), which is equivalent to the left-hand side after some algebra.
When we take into account the reverse complement, we have $|G| = 2P$. A case by case inspection again shows that:
• If $P$ is odd, then $B_{\rho^k \gamma}$ is empty.
• If $P$ is even, then $B_{\rho^k \gamma}$ has $P/2$ degrees of freedom and therefore $|B_{\rho^k \gamma}| = A^{P/2}$.
This yields immediately the formula in equations (9) and (10). If needed, it is also straightforward to count the number of elements inside each type of equivalence class.

Repeat classes for N = 3 and N = 4
Repeat classes for each trinucleotide (Table 19) and each tetranucleotide (Table 20).

Table 19. Repeat class for each trinucleotide (N = 3). Classes are numbered in alphabetical order of their first member. Each entry gives a triplet followed by its class number.
AAA  1   AAC  2   AAG  3   AAT  4   ACA  2   ACC  5   ACG  6   ACT  7
AGA  3   AGC  8   AGG  9   AGT  7   ATA  4   ATC 10   ATG 10   ATT  4
CAA  2   CAC  5   CAG  8   CAT 10   CCA  5   CCC 11   CCG 12   CCT  9
CGA  6   CGC 12   CGG 12   CGT  6   CTA  7   CTC  9   CTG  8   CTT  3
GAA  3   GAC  6   GAG  9   GAT 10   GCA  8   GCC 12   GCG 12   GCT  8
GGA  9   GGC 12   GGG 11   GGT  5   GTA  7   GTC  6   GTG  5   GTT  2
TAA  4   TAC  7   TAG  7   TAT  4   TCA 10   TCC  9   TCG  6   TCT  3
TGA 10   TGC  8   TGG  5   TGT  2   TTA  4   TTC  3   TTG  2   TTT  1

Table 20. Repeat class for each tetranucleotide (N = 4). Classes are numbered in alphabetical order of their first member. Each entry gives a quadruplet followed by its class number.
AAAA  1   AAAC  2   AAAG  3   AAAT  4   AACA  2   AACC  5   AACG  6   AACT  7
AAGA  3   AAGC  8   AAGG  9   AAGT 10   AATA  4   AATC 11   AATG 12   AATT 13
ACAA  2   ACAC 14   ACAG 15   ACAT 16   ACCA  5   ACCC 17   ACCG 18   ACCT 19
ACGA  6   ACGC 20   ACGG 21   ACGT 22   ACTA  7   ACTC 23   ACTG 24   ACTT 10
AGAA  3   AGAC 15   AGAG 25   AGAT 26   AGCA  8   AGCC 27   AGCG 28   AGCT 29
AGGA  9   AGGC 30   AGGG 31   AGGT 19   AGTA 10   AGTC 24   AGTG 23   AGTT  7
ATAA  4   ATAC 16   ATAG 26   ATAT 32   ATCA 11   ATCC 33   ATCG 34   ATCT 26
ATGA 12   ATGC 35   ATGG 33   ATGT 16   ATTA 13   ATTC 12   ATTG 11   ATTT  4
CAAA  2   CAAC  5   CAAG  8   CAAT 11   CACA 14   CACC 17   CACG 20   CACT 23
CAGA 15   CAGC 27   CAGG 30   CAGT 24   CATA 16   CATC 33   CATG 35   CATT 12
CCAA  5   CCAC 17   CCAG 27   CCAT 33   CCCA 17   CCCC 36   CCCG 37   CCCT 31
CCGA 18   CCGC 37   CCGG 38   CCGT 21   CCTA 19   CCTC 31   CCTG 30   CCTT  9
CGAA  6   CGAC 18   CGAG 28   CGAT 34   CGCA 20   CGCC 37   CGCG 39   CGCT 28
CGGA 21   CGGC 38   CGGG 37   CGGT 18   CGTA 22   CGTC 21   CGTG 20   CGTT  6
CTAA  7   CTAC 19   CTAG 29   CTAT 26   CTCA 23   CTCC 31   CTCG 28   CTCT 25
CTGA 24   CTGC 30   CTGG 27   CTGT 15   CTTA 10   CTTC  9   CTTG  8   CTTT  3
GAAA  3   GAAC  6   GAAG  9   GAAT 12   GACA 15   GACC 18   GACG 21   GACT 24
GAGA 25   GAGC 28   GAGG 31   GAGT 23   GATA 26   GATC 34   GATG 33   GATT 11
GCAA  8   GCAC 20   GCAG 30   GCAT 35   GCCA 27   GCCC 37   GCCG 38   GCCT 30
GCGA 28   GCGC 39   GCGG 37   GCGT 20   GCTA 29   GCTC 28   GCTG 27   GCTT  8
GGAA  9   GGAC 21   GGAG 31   GGAT 33   GGCA 30   GGCC 38   GGCG 37   GGCT 27
GGGA 31   GGGC 37   GGGG 36   GGGT 17   GGTA 19   GGTC 18   GGTG 17   GGTT  5
GTAA 10   GTAC 22   GTAG 19   GTAT 16   GTCA 24   GTCC 21   GTCG 18   GTCT 15
GTGA 23   GTGC 20   GTGG 17   GTGT 14   GTTA  7   GTTC  6   GTTG  5   GTTT  2
TAAA  4   TAAC  7   TAAG 10   TAAT 13   TACA 16   TACC 19   TACG 22   TACT 10
TAGA 26   TAGC 29   TAGG 19   TAGT  7   TATA 32   TATC 26   TATG 16   TATT  4
TCAA 11   TCAC 23   TCAG 24   TCAT 12   TCCA 33   TCCC 31   TCCG 21   TCCT  9
TCGA 34   TCGC 28   TCGG 18   TCGT  6   TCTA 26   TCTC 25   TCTG 15   TCTT  3
TGAA 12   TGAC 24   TGAG 23   TGAT 11   TGCA 35   TGCC 30   TGCG 20   TGCT  8
TGGA 33   TGGC 27   TGGG 17   TGGT  5   TGTA 16   TGTC 15   TGTG 14   TGTT  2
TTAA 13   TTAC 10   TTAG  7   TTAT  4   TTCA 12   TTCC  9   TTCG  6   TTCT  3
TTGA 11   TTGC  8   TTGG  5   TTGT  2   TTTA  4   TTTC  3   TTTG  2   TTTT  1

Correlations as a function of background distribution
Examples of surfaces of correlations between the AT-content scale and the other scales as a function of background distribution are given in Figure 8. Similar curves can be obtained for any pair of scales.

[Figure 8: four surface panels (Correlation AT/BS, Correlation AT/PT, Correlation AT/PD, Correlation AT/PP), each plotted over AT content and A/(A+T) ratio, with correlation ranging from −1 to 1.]
Fig. 8. Surface representing the asymptotic correlation between AT-content and all the other scales for different background distributions.

References
Ashley,C.T. and Warren,S.T.
(1995) Trinucleotide repeat expansion and human disease. Annu. Rev. Genet., 29, 703–728. Baldi,P. and Rinott,Y. (1989) On normal approximations of distributions in terms of dependency graphs. Ann. Prob., 17, 1646–1650. Baldi,P., Brunak,S., Chauvin,Y. and Krogh,A. (1996) Naturally occurring nucleosome positioning signals in human exons and introns. J. Mol. Biol., 263, 503–510. Baldi,P., Chauvin,Y., Pedersen,A.G. and Brunak,S. (1998) Computational applications of DNA structural scales. In Proceedings of the 1998 Conference on Intelligent Systems for Molecular Biology (ISMB98) The AAAI Press, Menlo Park, CA, pp. 35–42. Baldi,P., Brunak,S., Chauvin,Y. and Pedersen,A.G. (1999) Structural basis for triplet repeat disorders: a computational analysis. Bioinformatics, 15, 918–929. Bell,S.J. and Forsdyke,D.R. (1999) Accounting units in DNA. J. Theor. Biol., 197, 51–61. Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580. Benson,G. and Waterman,M.S. (1994) A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828–4836. Blanchard,M.K., Chiapello,H. and Coward,E. (2000) Detecting localized repeats in genomic sequences: a new strategy and its applications to Bacillus subtilis and Arabidopsis thaliana sequences. Comput. Chem., 24, 57–70. Breslauer,K.J., Frank,R., Blocker,H. and Marky,L.A. (1986) Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA, 83, 3746–3750. Brukner,I., Sánchez,R., Suck,D. and Pongor,S. (1995) Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J., 14, 1812–1818. Calladine,C.R., Drew,H.R. and McCall,M.J. (1988) The intrinsic structure of DNA in solution. J. Mol. Biol., 201, 127–137.
Campuzano,V., Montermini,L., Molto,M.D., Pianese,L., Cossee,M., Cavalcanti,F., Montos,E., Rodius,F., Duclos,F., Monticelli,A., Zara,F., Canizares,J., Koutnikova,H., Bidichandani,S.I., Gellera,C., Brice,A., Trouillaas,P., Michele,G.D., Filla,A., Frutos,R.D., Palau,F., Patel,P.I., Donato,S.D., Mandel,J.L., Cocozza,S., Koenig,M. and Pandolfo,M. (1996) Friedreich’s ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. Science, 271, 1423–1427. Chastain,P.D. and Sinden,R.R. (1998) CTG repeats associated with human genetic disease are inherently flexible. J. Mol. Biol., 275, 405–411. Chen,X., Mariappan,S.V.S., Catasti,P., Ratliff,R., Moyzis,R.K., Ali,L., Smith,S.S., Bradbury,E.M. and Gupta,G. (1995) Hairpins are formed by the single DNA strands of the fragile X triplet repeats: structure and biological implications. Proc. Natl. Acad. Sci. USA, 92, 5199–5203. van Deutekom,J.T., Wijmenga,C., van Tienhoven,E.A.E., Gruter,A.M., Hewitt,J.E., Padberg,G.W., van Ommen,G.J.B., Hofker,M.H. and Frants,R.R. (1993) FSHD associated DNA rearrangements are due to deletions of integral copies of a 3.3 kb tandemly repeated unit. Hum. Mol. Genet., 2, 2037–2042. Dickerson,R.E. (1992) DNA structure from A to Z. Meth. Enzymol., 211, 67–111. Drew,H.R. and Travers,A.A. (1985) DNA bending and its relation to nucleosome positioning. J. Mol. Biol., 186, 773–790. Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological sequence analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. Eichler,E.E. and Nelson,D.L. (1998) The FRAXA fragile site and fragile X syndrome. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders BIOS, Oxford, pp. 13–50. Feller,W. (1971) An Introduction to Probability Theory and its Applications. vol. 2, 2nd edn, Wiley, New York. Fye,R.M. and Benham,C.J.
(1999) Exact method for numerically analyzing a model of local denaturation in superhelically stressed DNA. Phys. Rev. E, 59, 3408–3426. Gacy,A.M., Goellner,G., Juranic,N., Macura,S. and McMurray,C.T. (1995) Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell, 81, 533–540. Gecz,J. and Mulley,J.C. (1999) Characterisation and expression of a large, 13.7 kb fmr2 isoform. Eur. J. Hum. Genet., 7, 157–162. Godde,J.S. and Wolffe,A.P. (1996) Nucleosome assembly on CTG triplet repeats. J. Biol. Chem., 271, 15222–15229. Godde,J.S., Kass,S.U., Hirst,M.C. and Wolffe,A.P. (1996) Nucleosome assembly on methylated CGG triplet repeats in the fragile X mental retardation gene 1 promoter. J. Biol. Chem., 271, 24325–24328. Goodsell,D.S. and Dickerson,R.E. (1994) Bending and curvature calculations in B-DNA. Nucleic Acids Res., 22, 5497–5503. Grove,A., Galeone,A., Mayol,L. and Geiduschek,E.P. (1996) Localized DNA flexibility contributes to target site selection by DNA-bending proteins. J. Mol. Biol., 260, 120–125. Gusella,J.F. and MacDonald,M.E. (1996) Trinucleotide instability: a repeating theme in human inherited disorders. Annu. Rev. Med., 47, 201–209. Hardy,J. and Gwinn-Hardy,K. (1998) Genetic classification of primary neurodegenerative disease. Science, 282, 1075–1079. Hassan,M.A.E. and Calladine,C.R. (1996) Propeller-twisting of base-pairs and the conformational mobility of dinucleotide steps in DNA. J. Mol. Biol., 259, 95–103. Hewett,D.R., Handt,O., Hobson,L., Mangelsdorf,M., Eyre,H.J., Baker,E., Sutherland,G.R., Schuffenhauer,S., Mao,J.I. and Richards,R.I. (1998) FRA10B structure reveals common elements in repeat expansion and chromosomal fragile site genesis. Mol. Cell, 1, 773–781. Hewitt,J.E., Lyle,R., Clark,L.N., Valleley,E.M., Wright,T.J., Wijmenga,C., van Deutekom,J.C., Francis,F., Sharpe,P.T. and Hofker,M.H. et al. (1994) Analysis of the tandem repeat locus D4Z4 associated with facioscapulohumeral muscular dystrophy. Hum. Mol.
Genet., 3, 1287–1295. Hunter,C.A. (1996) Sequence-dependent DNA structure. Bioessays, 18, 157–162. Iyer,V. and Struhl,K. (1995) Poly (dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J., 14, 2570–2579. Jeffreys,A.J. (1997) Spontaneous and induced minisatellite instability in the human genome. Clin. Sci., 93, 383–390. Junck,L. and Fink,J.K. (1996) Machado-Joseph disease and SCA3: the genotype meets the phenotypes. Neurology, 46, 4–8. Jurka,J. (1998) Repeats in genomic DNA: mining and meaning. Curr. Opin. Struct. Biol., 8, 333–337. Jurka,J. and Pethiyagoda,C. (1995) Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol., 40, 120–126. Jurka,J., Walichiewicz,J. and Milosavljevic,A. (1992) Prototypic sequences for human repetitive DNA. J. Mol. Evol., 35, 286–291. Kleiderlein,J.J., Nisson,P.E., Jessee,J., Li,W., Becker,K.G., Derby,M.L., Ross,C.A. and Margolis,R.L. (1998) CCG repeats in cDNAs from human brain. Hum. Genet., 103, 666–673. Koenig,M. (1998) Friedreich’s ataxia. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders BIOS, Oxford, pp. 219–238. Lahm,A. and Suck,D. (1991) DNase I-induced DNA conformation: 2 Å structure of a DNase I-octamer complex. J. Mol. Biol., 222, 645–667. Lalioti,M.D., Scott,H.S., Buresi,C., Rossier,C., Bottani,A., Morris,M.A., Malafosse,A. and Antonarakis,S.E. (1997) Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature, 386, 847–851. Lalioti,M.D., Scott,H.S., Genton,P., Grid,D., Ouazzani,R., M’Rabet,A., Ibrahim,S., Gouider,R., Dravet,C., Chkili,T., Bottani,A., Buresi,C., Malafosse,A. and Antonarakis,S.E. (1998) A PCR amplification method reveals instability of the dodecamer repeat in progressive myoclonus epilepsy (EPM1) and no correlation between the size of the repeat and age at onset. Am. J. Hum. Genet., 62, 842–847. Lee,C.C. (1998) Spinocerebellar ataxia type 6 (SCA6). In Rubinsztein,D.C. 
and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 145–154. Liao,G., Rehm,E.J. and Rubin,G.M. (2000) Insertion site preferences of the P transposable element in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA, 97, 3347–3351. Liu,K. and Stein,A. (1997) DNA sequence encodes information for nucleosome array formation. J. Mol. Biol., 270, 559–573. Lu,Q., Wallrath,L.L. and Elgin,S.C.R. (1994) Nucleosome positioning and gene regulation. J. Cell. Biochem., 55, 83–92. Mariappan,S.V.S., Silks III,L.A., Bradbury,E.M. and Gupta,G. (1998) Fragile X DNA triplet repeats, (GCC)n, form hairpins with single hydrogen-bonded cytosine·cytosine mispairs at the CpG sites: isotope-edited nuclear magnetic resonance spectroscopy on (GCC)n with selective 15N4-labeled cytosine bases. J. Mol. Biol., 283, 111–120. Milosavljevic,A. and Jurka,J. (1993) Discovering simple DNA sequences by the algorithmic significance method. CABIOS, 9, 407–411. Miret,J.J., Pessoa-Brandao,L. and Lahue,R.S. (1998) Orientation-dependent and sequence-specific expansions of CTG/CAG trinucleotide repeats in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA, 95, 12438–12443. Moore,H., Greenwell,P.W., Liu,C.P., Arnheim,N. and Petes,T.D. (1999) Triplet repeats form secondary structures that escape DNA repair in yeast. Proc. Natl. Acad. Sci. USA, 96, 1504–1509. Nelson,D.L. (1995) The fragile X syndrome. Semin. Cell Biol., 6, 5–11. Nelson,H.C.M., Finch,J.T., Luisi,B.F. and Klug,A. (1987) The structure of an oligo(dA)-oligo(dT) tract and its biological implications. Nature, 330, 221–226. Ohshima,K., Kang,S. and Wells,R.D. (1996) CTG triplet repeats from human hereditary diseases are dominant genetic expansion products in Escherichia coli. J. Biol. Chem., 271, 1853–1856. Olson,W.K., Gorin,A.A., Lu,X., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl. Acad. Sci. USA, 95, 11163–11168.
Ornstein,R.L., Rein,R., Breen,D.L. and MacElroy,R.D. (1978) An optimised potential function for the calculation of nucleic acid interaction energies. I. Base stacking. Biopolymers, 17, 2341–2360. Orr,H.T. and Zoghbi,H.Y. (1998) Polyglutamine tract vs. protein context in SCA1 pathogenesis. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 105–118. Orr,H.T., Chung,M., Banfi,S., Kwiatkowski,T.J.,Jr., Servadio,A., Beaudet,A.L., McCall,A.E., Duvick,L.A., Ranum,L.P.W. and Zoghbi,H.Y. (1993) Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1. Nature Genet., 4, 221–226. Parvin,J.D., McCormick,R.J., Sharp,P.A. and Fisher,D.E. (1995) Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature, 373, 724–727. Paulson,H.L. (1998) Spinocerebellar ataxia type 3/Machado-Joseph disease. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 129–144. Paulson,H.L., Perez,M.K., Trottier,Y., Trojanowski,J.Q., Subramony,S.H., Das,S.S., Vig,P., Mandel,J.L., Fischbeck,K.H. and Pittman,R.N. (1997) Intranuclear inclusions of expanded polyglutamine protein in spinocerebellar ataxia type 3. Neuron, 19, 333–344. Pazin,M.J. and Kadonaga,J.T. (1997) SWI2/SNF2 and related proteins: ATP-driven motors that disrupt protein–DNA interactions? Cell, 88, 737–740. Pearson,C.E. and Sinden,R.R. (1998a) Slipped strand DNA, dynamic mutations and human disease. In Wells,R.D. and Warren,S.T. (eds), Genetic Instabilities and Hereditary Neurological Diseases. Academic Press, New York, pp. 585–621. Pearson,C.E. and Sinden,R.R. (1998b) Trinucleotide repeat DNA structures: dynamic mutations from dynamic DNA. Curr. Opin. Struct. Biol., 8, 321–330. Pedersen,A.G., Baldi,P., Brunak,S. and Chauvin,Y. (1998) DNA structure in human RNA polymerase II promoters. J. Mol. Biol., 281, 663–673. Pedersen,A.G., Jensen,L.J., Brunak,S., Staerfeldt,H.H. and Ussery,D.W.
(2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol., 299, 907–930. Ponomarenko,M.P., Ponomarenko,J.V., Frolov,A.S., Podkolodny,N.L., Savinkova,L.K., Kolchanov,N.A. and Overton,G.C. (1999) Identification of sequence-dependent DNA sites interacting with proteins. Bioinformatics, 15, 687. Prabhu,V.V. (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res., 21, 2797–2800. Pulst,S.-M. (1998) Spinocerebellar ataxia type 2. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 119–128. Rinott,Y. and Dembo,A. (1996) Some examples of normal approximations by Stein’s method. In Aldous,D. and Pemantle,R. (eds), Random Discrete Structures. Springer, New York, pp. 25–44. Ross,C.A. (1995) When more is less: pathogenesis of glutamine repeat neurodegenerative diseases. Neuron, 15, 493–496. Rubinsztein,D.C. and Amos,B. (1998) Trinucleotide repeat mutation processes. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 257–268. Rubinsztein,D.C. and Hayden,M.R. (1998) Introduction. In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 1–12. Satchwell,S.C., Drew,H.R. and Travers,A.A. (1986) Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191, 659–675. Sheridan,S.D., Benham,C.J. and Hatfield,G.W. (1998) Activation of gene expression by a novel DNA structural transmission mechanism that requires supercoiling-induced DNA duplex destabilization in an upstream activating sequence. J. Biol. Chem. Simpson,R.T. (1991) Nucleosome positioning: occurrence, mechanisms, and functional consequences. Prog. Nucleic Acids Res. Mol. Biol., 40, 143–184. Sinden,R.R. (1994) DNA Structure and Function. Academic Press, San Diego, CA. Skinner,J.A., Foss,G.S., Miller,W.J. and Davies,K.E. (1998) Molecular studies of the fragile sites FRAXE and FRAXF. In Rubinsztein,D.C. and Hayden,M.R.
(eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 51–60. Starr,D.B., Hoopes,B.C. and Hawley,D.K. (1995) DNA bending is an important component of site-specific recognition by the TATA binding protein. J. Mol. Biol., 250, 434–446. Stevanin,G., David,G., Abbas,N., Dürr,A., Holmberg,M., Duyckaerts,C., Giunti,P., Cancel,G., Ruberg,M., Mandel,J.-L. and Brice,A. (1998) Spinocerebellar ataxia type 7 (SCA7). In Rubinsztein,D.C. and Hayden,M.R. (eds), Analysis of Triplet Repeat Disorders. BIOS, Oxford, pp. 155–168. Suck,D. (1994) DNA recognition by DNase I. J. Mol. Recognition, 7, 65–70. The Huntington’s Disease Collaborative Research Group (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell, 72, 971–983. Tsukiyama,T. and Wu,C. (1997) Chromatin remodeling and transcription. Curr. Opin. Gen. Dev., 7, 182–191. Wang,Y.-H. and Griffith,J.D. (1994) Preferential nucleosome assembly at DNA triplet repeats from the myotonic dystrophy gene. Science, 265, 669–671. Wang,Y.-H. and Griffith,J.D. (1995) Expanded CTG triplet blocks from the myotonic dystrophy gene create the strongest known natural nucleosome positioning elements. Genomics, 25, 570–573. Wang,Y.-H., Gellibolian,R., Shimizu,M., Wells,R.D. and Griffith,J.D. (1996) Long CCG triplet repeat blocks exclude nucleosomes: a possible mechanism for the nature of fragile sites in chromosomes. J. Mol. Biol., 263, 511–516. Wells,R.D. (1996) Molecular basis of triplet repeat diseases. J. Biol. Chem., 271, 2875–2878. Werner,M.H. and Burley,S.K. (1997) Architectural transcription factors: proteins that remodel DNA. Cell, 88, 733–736. Winokur,S.T., Bengtsson,U., Feddersen,J., Mathews,K.D., Weiffenbach,B., Bailey,H., Markovich,R.P., Murray,J.C., Wasmuth,J.J., Altherr,M.R. and Schutte,B.C.
(1994) The DNA rearrangement associated with facioscapulohumeral muscular dystrophy involves a heterochromatin-associated repetitive element: implications for a role of chromatin structure in the pathogenesis of the disease. Chromosome Res., 2, 225–234. Wolffe,A.P. and Drew,H.R. (1995) DNA structure: implications for chromatin structure and function. In Elgin,S.C.R. (ed.), Chromatin Structure and Gene Expression. IRL Press, Oxford, pp. 27–48. Wolffe,A.P. and Matzke,M.A. (1999) Epigenetics: regulation through repression. Science, 286, 481–486. Zhu,Z. and Thiele,D.J. (1996) A specialized nucleosome modulates transcription factor access to a C. glabrata metal responsive promoter. Cell, 87, 459–470.