Alu and LINE1 Distributions in the Human Chromosomes: Evidence of Global Genomic Organization Expressed in the Form of Power Laws

Diamantis Sellis,* Astero Provata, and Yannis Almirantis*

*National Center for Scientific Research “Demokritos,” Institute of Biology, Athens, Greece; and National Center for Scientific Research “Demokritos,” Institute of Physical Chemistry, Athens, Greece

The spatial distribution and clustering of repetitive elements, as well as their colocalization with other genomic components, have been extensively studied in recent years. Here we investigate the large-scale features of Alu and LINE1 spatial arrangement in the human genome by studying the size distribution of interrepeat distances. In most cases, we have found power-law size distributions extending over several orders of magnitude. We have also studied the correlations of the extent of the power law (linear region in double-logarithmic scale) and of the corresponding exponent (slope) with other genomic properties. A model has been formulated to explain the formation of the observed power laws. According to the model, 2 kinds of events occur repeatedly in evolutionary time: random insertion of several types of intruding sequences and occasional loss of repeats belonging to the initial population due to “elimination” events. This simple mechanism is shown to reproduce the observed power-law size distributions and is compatible with our present knowledge of the dynamics of repeat proliferation in the genome.

Introduction

About 45% of the human genome consists of transposable elements (TEs) (Deininger and Batzer 2002; Makalowski 2003), most of which are retroelements. The majority of these are members of the Alu and LINE1 families. Alu belongs to the “short interspersed elements” (SINEs) and is present in the human genome in ~1,100,000 copies, covering ~10% of its total length. The typical Alu sequence is ~300 nt long and is GC and CpG rich.
LINE1 belongs to the “long interspersed elements” (LINEs). The number of LINE1 copies in the human genome is ~700,000. Intact LINE1s have a length of ~6 × 10³ nt, but most copies are considerably truncated. LINE1 repeats are AT rich and are present in all studied mammalian genomes (Smit et al. 1995). Their evolutionary history is long, dating to the beginnings of eukaryotic existence (Ostertag and Kazazian 2001). LINE1 uses a retrotransposase encoded in its own sequence, whereas Alu propagates by means of the retrotransposition machinery of active LINE1 elements (Dewannieux et al. 2003).

Both repeat families are divided into subfamilies. Elements belonging to each subfamily share common ancestry and high similarity in some characteristic (diagnostic) positions. Alu proliferation has a shorter history than that of LINE1s and probably began in a common ancestor of primates and rodents (Ullu and Tschudi 1984). About 112 MYA, transposons named FLA evolved from 7SL RNA, and ~81 MYA, an FLA dimerization led to the formation of 2 subfamilies (Jo and Jb) of the J family of Alu elements. The next main Alu subfamilies, S, Sx, and Y, arose 48, 37, and 19 MYA, respectively (Kapitonov and Jurka 1996; for nomenclature conventions, see also Batzer et al. 1999). Even younger subfamilies (Ya5, age: 4 Myr; Yb8, age: 3 Myr) arose after hominization, and the human genome is polymorphic for a considerable number of their members.

Key words: Alu, LINE1, power laws, repeat distributions.
E-mail: [email protected].
Mol. Biol. Evol. 24(11):2385–2399. 2007. doi:10.1093/molbev/msm181. Advance Access publication August 29, 2007. © The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.

The distribution of most classes of repeat elements in the genome usually deviates from randomness. LINE1s are found with a higher probability in the AT-rich genomic compartments. Alus have a clear preference for the GC-rich genomic compartments, and this tendency is most pronounced in the older subfamilies (on genomic patchiness with respect to GC content, see Bernardi 2000a, 2000b). However, the very young subfamilies do not follow this pattern; they share the preference of LINE1 elements for the AT-rich regions. This peculiarity in the genomic distribution of Alu elements is not yet completely understood (Brookfield 2001; Pavlicek et al. 2001; Deininger and Batzer 2002; Medstrand et al. 2002; Jurka et al. 2004; Belle et al. 2005; Hackenberg et al. 2005).

TEs were initially considered to be selfish entities propagating in the host genome as “junk DNA,” as long as this proliferation is tolerated without causing severe damage. It is now increasingly accepted that the evolution of TEs interacts in a complex way with other aspects of overall genomic dynamics. It has been reported that 5% of all alternatively spliced internal exons include parts of Alu insertions (Sorek et al. 2002). However, this view is questioned by other authors (Pavlicek, Clay, and Bernardi 2002). The impact of Alus and of other TEs on the expression pattern of existing genes is not doubted (for Alus, see e.g., Deininger and Batzer 1999; Dagan et al. 2004; for LINE1, see e.g., Ostertag and Kazazian 2001). It is very probable that the proliferation of Alu, LINE1, and other TE families has provided a variety of advantages to the host genome over long evolutionary time (not necessarily reflected in positive selection of newly transposed copies).

In the present work, we study the large-scale pattern of distribution for several repeat families in the human genome. This is done by examining the size distribution of distances between consecutive repeats belonging to the same family.
Furthermore, in an attempt to explain the findings concerning repeat distribution at the chromosomal level, we introduce a minimal model (named the “insertion–elimination model”) based on well-established events of genomic dynamics. These are i) eliminations of repeats of a specific repeat family and ii) insertions of sequences of various origins. Both are well-known molecular events occurring regularly over long evolutionary time. It is shown that the proposed model reproduces the general distribution pattern that prevails in human chromosomes. Also, all the examined relations between genomic quantities and quantities characterizing the distributions of interrepeat distances are compatible with the proposed model.

The article is organized as follows: In Methods, we present technical aspects of the subsequent analysis. At the beginning of Results and Discussion, the statistical concepts used in the sequel are briefly reviewed. Then, the proposed insertion–elimination model is introduced in order to compare its features with the properties of repeat distributions in the human genome. In the next subsection, evidence is presented for the regular occurrence of repeat elimination events during genomic evolution (which is a prerequisite of the proposed model). Next, the properties of the distribution of Alu and LINE1 repeat elements are systematically presented and the validity of the insertion–elimination model is assessed. In a final section, some conclusions and perspectives of this work are drawn.

Methods

The sequences of the assembled human chromosomes, build 35.1, were downloaded from National Center for Biotechnology Information genomic biology (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/). The existence of gaps in the assembled human chromosomes always poses a problem when measuring distances between any type of localized features (here repeats) at whole chromosome scale.
We have chosen to remove gaps longer than 50,000 nt. This strategy was adopted after some initial tests because these gaps could affect the shape of the interrepeat size distributions, whose linear part (in double-log scale) starts around this order of magnitude. The shorter gaps (which are more common) were retained in order not to disturb the chromosomal architecture. Moreover, their length is not expected to affect considerably the linear part of the studied distributions. This was verified in trials where all gaps were removed and the figures were left practically unchanged. Notice that the chromosomal coordinates included in tables 3 and 6 refer to human chromosomes after the elimination of gaps longer than 50,000 nt.

We have used RepeatMasker (Smit et al. 1996–2004; www.repeatmasker.org), version 3.1.2, combined with libraries (release 20051025) derived from RepBase (Jurka 2000; www.girinst.org/) and WU-Blast v.2.0_10/05/2005 (Gish 2003; http://blast.wustl.edu). The data for several repeat populations in human chromosomes were extracted after suitable parsing of the standard RepeatMasker output. Throughout this work, we present the size distributions of spacers separating the repeats of a given class in the form of cumulative distributions, for reasons described in Results and Discussion. The analysis of chromosomal regions has been done using the same values of “maximum divergence” and “minimum length” as in the whole chromosome analysis (see supplementary material, Supplementary Material online).

Cumulative distribution plots of interrepeat distances, as well as a simple linear regression analysis in log–log scale, were made with Grace-5.1.14. The values of extent (E), slope (μ), r², and slope standard deviation for all the genomic size distribution computations performed are presented in the supplementary material (extended table 1, Supplementary Material online). The same values for the simulation examples are given in figure 10.
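The gap-handling rule described above can be illustrated with a short sketch. It is not the authors' code (their parsing was done in PERL and FORTRAN); the function name `spacers_after_gap_removal`, the half-open `(start, end)` interval convention, and the choice to measure a spacer from the end of one repeat to the start of the next are all assumptions made for illustration.

```python
def spacers_after_gap_removal(repeats, gaps, max_gap=50_000):
    """Spacer lengths between consecutive repeats once long gaps are cut out.

    repeats, gaps: lists of half-open (start, end) intervals on one chromosome
    (hypothetical input format). Gaps longer than max_gap are removed from the
    coordinate system; shorter gaps are kept, as described in the text.
    Repeats are assumed not to overlap each other or the removed gaps.
    """
    long_gaps = sorted(g for g in gaps if g[1] - g[0] > max_gap)

    def shift(pos):
        # Subtract the amount of removed gap sequence lying before pos.
        return pos - sum(min(pos, end) - start
                         for start, end in long_gaps if start < pos)

    ordered = sorted(repeats)
    return [shift(next_start) - shift(prev_end)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


# Toy input: one 98,000-nt gap (removed) and one 100-nt gap (retained).
example = spacers_after_gap_removal(
    repeats=[(0, 300), (1_000, 1_300), (200_000, 200_300)],
    gaps=[(2_000, 100_000), (150_000, 150_100)],
)
```

In the toy input the second spacer shrinks by exactly the length of the removed gap, while the short gap still contributes to the measured distance.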
Scatterplots depicting the correlation between E or μ and GC or SI content (SI being the quantity of “subsequently inserted” genomic material after the proliferation period of a repeat class or subfamily) were made using STATISTICA version 6.0. The names of chromosomes are shown in the plots, whereas the values of the associated squared correlation coefficient (r²) and probability (P) are given in the corresponding tables.

There are some inherent difficulties in isolating chromosomal regions suitable for the comparison between high and low GC% or SI% sequence properties (figs. 7 and 9 and tables 3 and 6). This is the reason why we did not include examples of all TE classes. However, even in cases not included because of low repeat population numbers, which thus form only rudimentary distributions, the results were similar to the presented ones. One difficulty is due to the sequence length necessary for this kind of analysis. Moreover, in the case of the GC-dependence study, if the GC% contrast between pairs of regions is high (a necessary prerequisite for obtaining unambiguous results), the density of the studied repeat populations is always very low in one of the two examined regions (in the GC rich for LINE1s and in the AT rich for Alus). Several difficulties are also inherent in the search for power laws in low/high-SI regions. Some examples are a) SI(L1P) is uniformly low, depending only on AluY insertions; b) the main components of the SI material for AluSx populations (SI[AluSx]) are L1P populations, resulting in the same conflict met in the search for large GC%-poor or GC%-rich regions. Additionally, the SI% value is relatively homogeneous along each chromosome (with the exception of chromosome X and partially of 19), thus making large low- or high-SI regions very rare.

Examples of simulations using the insertion–elimination model are given in figure 10.
In these cases, 20,000 delimiters (simulating a repeat population) are randomly distributed in an artificial chromosome 2,000,000 nt long. After consecutive rounds of delimiter eliminations and influx of external sequence segments (here 100 nt long), a relatively small fraction of the initial number of delimiters is left. Their distances form well-shaped power-law distributions. In case (a), 40 influx (insertion) events follow each delimiter elimination (spacer merging) event until a population of 500 delimiters is left. In case (b), the insertion–elimination ratio equals 20 and the remaining delimiters are 1,000. Both cases, as well as results of other simulations not presented here, demonstrate power-law formation for a long (transient) time interval before the asymptotic limit is reached.

In the presented simulations, the insertion of a repeat element close to another one of the same family in inverse orientation and the subsequent elimination of both (including the spacer between them) are modeled as a single repeat elimination occurring randomly. It is well documented that the scarcity of nearby inverted repeats is most pronounced at small distances, which makes it clear that practically all the lost interrepeat spacers are below the threshold influencing our analysis (Lobachev et al. 2000; Stenger et al. 2001).
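The simulation procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' FORTRAN program: the function name, the treatment of the two chromosome ends as fixed boundaries, and the scaled-down default parameters are assumptions. Each insertion lands in a spacer chosen with probability proportional to its current length, as stated for the model in Results and Discussion.

```python
import random


def simulate_insertion_elimination(length=200_000, n_start=2_000, n_final=50,
                                   insertions_per_elimination=40,
                                   segment=100, seed=0):
    """Return the spacer lengths left after repeated elimination/insertion rounds.

    Defaults are a tenfold scaled-down version of case (a) in the text
    (2,000,000 nt, 20,000 delimiters, 500 survivors), chosen only to keep
    this unoptimized pure-Python sketch fast.
    """
    rng = random.Random(seed)
    # Place zero-length delimiters at distinct random positions.
    positions = sorted(rng.sample(range(1, length), n_start))
    bounds = [0] + positions + [length]
    spacers = [b - a for a, b in zip(bounds, bounds[1:])]  # n_start + 1 spacers

    while len(spacers) - 1 > n_final:
        # Elimination: delimiter i separates spacers i and i + 1; removing it
        # merges the two adjacent spacers.
        i = rng.randrange(len(spacers) - 1)
        spacers[i] += spacers.pop(i + 1)
        # Insertions: each external segment enters a spacer chosen with
        # probability proportional to that spacer's current length.
        for _ in range(insertions_per_elimination):
            j = rng.choices(range(len(spacers)), weights=spacers)[0]
            spacers[j] += segment
    return spacers


small_run = simulate_insertion_elimination(
    length=20_000, n_start=200, n_final=20,
    insertions_per_elimination=5, segment=100, seed=1)
```

The weighted choice is recomputed for every insertion, so the cost grows with the number of spacers; a run at the full parameter values of the text would need a faster sampling structure.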
A concrete example is provided in CSAC (2005), where for ~600 deletions of Alus in the chimp genome 2 Mnt were lost.

Table 1
Quantitative Information (by Regression Analysis) about Power Laws Observed in All Human Chromosomes for the Considered Repeat Categories

Chromosome   AluJb         AluJo         AluSx         L1M           L1P
Number       E     μ       E     μ       E     μ       E     μ       E     μ
1            1.40  1.22    1.06  1.31    1.03  1.80    1.27  1.93    0.93  1.66
2            1.37  0.92    1.28  1.17    1.15  0.89    -     -       0.80  1.69
3            1.16  0.95    1.04  0.84    1.40  0.51    -     -       1.35  1.94
4            1.49  0.89    1.37  0.82    1.24  0.64    -     -       1.02  2.58
5            1.25  0.97    1.15  0.89    1.05  0.67    -     -       1.03  1.87
6            1.15  0.82    1.38  1.04    1.38  0.58    -     -       1.22  2.17
7            1.30  1.07    1.93  0.67    1.27  0.76    1.13  2.36    1.83  1.60
8            1.25  0.52    1.25  0.66    1.46  0.50    -     -       0.70  2.25
9            1.15  1.34    1.75  1.03    1.27  1.15    1.39  1.88    1.48  1.59
10           1.16  1.70    1.40  1.52    0.92  1.21    0.80  1.85    1.25  1.76
11           1.50  1.06    1.50  1.06    1.37  0.89    1.40  2.40    1.60  1.72
12           1.36  1.18    1.83  1.05    1.38  0.99    0.92  2.38    1.15  1.84
13           1.28  0.71    1.70  1.06    0.68  1.50    0.70  3.22    -     -
14           1.87  1.14    1.62  1.21    0.92  1.41    -     -       1.25  1.17
15           1.03  1.78    1.16  1.79    0.92  1.93    -     -       0.92  2.09
16           1.50  1.44    1.73  1.70    1.38  1.22    -     -       1.14  0.95
17           1.60  1.55    1.50  1.76    1.72  2.05    1.14  0.91    1.60  0.76
18           1.40  1.17    1.62  0.75    0.70  1.46    -     -       -     -
19           1.86  1.77    2.05  1.56    2.06  1.67    1.50  1.08    1.70  0.84
20           1.40  1.03    1.20  1.02    1.27  1.16    0.80  1.49    1.50  1.13
21           1.28  1.03    1.72  0.81    1.36  1.34    -     -       0.80  1.17
22           1.18  1.79    1.52  1.62    1.14  1.78    1.39  0.62    1.70  0.38
X            1.25  0.81    1.84  0.71    1.26  0.75    1.70  1.96    1.15  2.36
Y            1.27  0.72    2.51  0.50    3.06  0.60    2.08  0.74    1.03  1.27

NOTE: E is the extent and μ the slope of the linear region. Blank cells (shown as -) mean that no power law (linearity in double-logarithmic scale) is found. Chromosome length, GC%, and regression coefficients (r²) are included in the supplementary material (Supplementary Material online), as well as the complete set of the corresponding plots. Notice that all r² values are higher than 0.97 while their mean is higher than 0.99.
So each elimination of a pair of inverted repeats removes ~3,500 nt on average, assuming the losses are evenly distributed. Simple inspection of figures 1 and 3 and the related supplementary material (Supplementary Material online) reveals that in almost all cases of reported power-law–like distributions, the linear part of the distribution lies clearly in a range of lengths above 10,000 nt, well above 3,500 nt. Evidence presented in CSAC (2005) for a power law in the length distribution of the eliminated genomic segments (see also CSAC [2005], p. 74) makes the number of deleted spacers longer than 10,000 nt statistically insignificant (~6 out of 612).

Scripts in PERL and programs in FORTRAN used for parsing RepeatMasker output files and producing the presented results, as well as STATISTICA and EXCEL spreadsheets with repeat population data in human chromosomes, are available upon request.

Results and Discussion

Preliminaries

The distributional features of TEs have been the subject of intense research in recent years. In most cases, substantially nonrandom distributions have been observed. Major factors related to the uneven distribution of repeats are the local GC abundance, the underlying distribution of genes, mutual avoidance of different repeat populations, and intrafamily repeat coclustering. Here we perform a systematic investigation of the size distribution of the spacers (distances) separating consecutive repeats of the major Alu and LINE1 classes in the human genome.

When clustering appears only at short length scales, clusters with sizes ranging around a mean value are formed.
In such cases, for large values of length S, these short-range distributions have an exponentially decaying tail of the form:

$$p(S) \sim e^{-aS}, \quad 0 < a.$$

When the clustering is scale free, long-range correlations extend in a self-similar way over several length scales (and ideally over the whole examined genomic length) and the corresponding spacer size distributions follow “power laws,” which correspond to linear graphs in a double-logarithmic scale:

$$p(S) \sim S^{-\zeta} = S^{-1-\mu}, \quad 0 < \mu.$$

High values of the extent of linearity in log–log plots (E) mean that self-similarity spans many orders of magnitude, whereas low absolute values of the slope (μ) are connected to an abundance of relatively long spacers at every length scale. More interesting are power laws with $0 < \mu \le 2$, which implies that the distribution has infinite mean square. This means that for large data sets, the standard deviation is unbounded (i.e., the larger the data set, the higher the computed standard deviation). Several power laws found in the real world have values of μ around 2 (Newman 2005). For higher values of μ, the distribution becomes progressively similar to a short-ranged one, that is, randomness tends to prevail at the large length scales (large values of S).

Power laws in size distributions are in general associated with the geometrical features of self-similarity and fractality (Mandelbrot 1982; Feder 1988), that is, the distribution (in our case) of a given repeat family in the genome looks similar when examined at several length scales.

In the subsequent analysis, we use the “cumulative size distribution” defined as follows:

$$P(S) = \int_{S}^{\infty} p(r)\,dr,$$

where p(r) is the original spacer size distribution. The cumulative form of the distribution has in general better statistical properties as it forms smoother “tails,” less affected by fluctuations.
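For a finite set of measured spacers, the empirical counterpart of this integral is the number N(S) of spacers of length at least S, which is the quantity plotted in the figures. The following minimal sketch computes it; the function name and the list-of-pairs return format are illustrative assumptions, not part of the original analysis pipeline.

```python
from bisect import bisect_left


def cumulative_counts(spacers):
    """Return [(S, N(S)), ...] where N(S) counts the spacers with length >= S.

    One point is produced per distinct spacer length, in increasing order of S.
    """
    ordered = sorted(spacers)
    total = len(ordered)
    # bisect_left gives the number of spacers strictly shorter than s.
    return [(s, total - bisect_left(ordered, s)) for s in sorted(set(ordered))]


points = cumulative_counts([5, 1, 3, 3])
```

Plotting the logarithm of N(S) against the logarithm of S for such points gives the double-logarithmic graphs on which linearity is assessed.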
For reviews on power-law size distributions, their properties, and alternative forms, see, for example, Adamic and Huberman (2002); Li (2002); and Newman (2005). The cumulative form of a power-law size distribution is again a power law characterized by an exponent (slope) equal to that of the original distribution minus 1: if $p(r) \sim r^{-1-\mu}$, then

$$P(S) \sim \int_{S}^{\infty} r^{-1-\mu}\,dr \sim S^{-\mu}.$$

In the sequel, plots depicting cumulative size distributions of interrepeat distances are presented. The logarithms of interrepeat distances (S) are shown on the horizontal axis, whereas the vertical axis shows the logarithms of the number N(S) of all the spacers longer than or equal to S. For genomic and for model-generated size distributions, circles and rhombi are used, respectively. In addition to the genomic size distributions, the figures also include a bundle of 10 size distributions where repeats are positioned randomly (continuous lines). The number of the randomly positioned repeats is taken to be equal to the number of repeats of the associated genomic sequence, and the length of the (computer generated) artificial sequence is equal to the length of the considered genomic region. The inclusion of these random data sets in the figures, on the same scale as the genomic data sets, gives a pictorial measure of the divergence in shape and form between the observed genomic distribution patterns and those expected on the grounds of purely random repeat positioning.

Power laws have already been reported in several aspects of genome structure, as in the finding of long-range correlations in noncoding sequences (Li and Kaneko 1992; Peng et al. 1992; Voss 1992), the size distribution of purine and pyrimidine islands in the noncoding (Provata and Almirantis 1997), the size distribution of noncoding segments between exons in higher eukaryotes (Almirantis and Provata 1999), the detection of “Zipf-like laws” (found principally in the noncoding) in rank diagrams when the n nucleotides (n = 6, 7, . . .)
are considered as “words” of the genomic text (Mantegna et al. 1994), through wavelet analysis of genomic sequences in Arneodo et al. (1995) and Audit et al. (2001), etc. More recently, the existence of a genome-wide power-law size distribution for several genomic compartments on the basis of their GC content (putative isochores) has been reported in Cohen et al. (2005).

The “Insertion–Elimination Model”

There is a wide corpus of literature aiming to explain the appearance of power-law size distributions in several phenomena, ranging from physics to biology, social sciences, and linguistics (see e.g., Mandelbrot 1982; Feder 1988). Various types of aggregative growth have been proven, both theoretically and through simulations, to be at the basis of power-law distributions (Jullien and Botet 1987; Vicsek 1989). Genomic dynamics includes several types of events that may be modeled as “aggregative” (i.e., a chromosome incorporates new material and becomes larger), like transport and integration of genomic material from different organisms (leading occasionally to “gene transfer”), insertions of sequences of viral origin, retropositions, incorporations of other nonretroelement sequences, etc. Aggregative phenomena (being extremely abundant during the evolutionary history of genomes) seem to form the most solid basis for understanding the appearance of long-range features and power-law size distributions.

For 1-dimensional models, it may be shown analytically (Takayasu et al. 1991) that the combination of i) aggregation (fusion together) events of neighboring “particles,” and ii) events of intrusion of new particles from outside that eventually aggregate with internal particles may lead asymptotically to the appearance of power laws in the particles’ size distributions. The simultaneous action of both types of events is necessary for the emergence of power-law size distributions.
The model is solvable, and the exponent ζ of the power law may be determined analytically to be $\zeta = \mu + 1 = 4/3$, provided that an external particle, which intrudes from outside into the 1-dimensional system, “fuses” with a particle inside the system with a constant probability (i.e., independently of the size of the internal particle), whereas there is also the possibility that the inserted particle may stand alone. This means that insertion events may either enlarge already existing particles or give rise to new particles in the system, thus compensating for the constant decrease of the number of internal particles due to fusions of nearby particles (Takayasu et al. 1991).

The structure and dynamics of the genome pose serious restrictions to aggregative models that try to explain the generation of power laws in the size distribution of interrepeat distances. Let us consider the “spacers” between repeats of a given family as particles characterized by their length and constrained to the 1-dimensional topology induced by the structure of the DNA molecule. Any elimination of one of the repeats of the considered population leads to the aggregation (merging) of 2 adjacent spacers. Insertion of sequence segments, either genuinely of external (e.g., viral) origin or processed copies of genomic origin (which are reincorporated nearly randomly, like subsequent repeat families, processed pseudogenes, etc.), may be seen as influx of external particles aggregating with one of the particles (spacers) already belonging to the 1-dimensional system (the “thread” of the DNA molecule).

Here, 2 main differences from the exactly solvable model of Takayasu et al. (1991) may be recognized as follows:

A.
In our model, the probability of an insertion into a given spacer is obviously proportional to the length of that spacer: this feature causes the procedures of 1) aggregation (adjacent particles) and 2) insertion (i.e., aggregation of intruding sequences with existing spacers) to depend on each other (to be coupled). This coupling poses severe mathematical difficulties in the search for an analytical solution, which are not met in mechanisms with no such coupling. However, the fact that long spacers grow longer more easily (if the probability of absorption of externally inserted material increases with size) is obviously in accordance with the emergence of a power law. This may be understood because the tail of the distribution, formed by the longer spacers, “fattens” more drastically in the case of coupled procedures.

B. A more fundamental divergence of our problem from solvable aggregation models is the following: The evolution of any given TE family that has ceased to proliferate in the genome and is exposed to a nonzero rate of eliminations in evolutionary time always results asymptotically in the trivial situation where the whole DNA molecule becomes a unique spacer after the disappearance of the last member of this TE family. As a consequence, the size distribution of spacers between consecutive repeats of a given family in a chromosome cannot be studied by means of asymptotic solutions of any (solvable or not) aggregative model because we have to take into account repeat eliminations. The power laws found so far have to be treated as “transient in the long evolutionary time” because, inevitably, the only asymptotic solution is the trivial one. In such cases, in order to test the explanatory power of a model, we use simulations for finite time intervals and compare snapshots of the produced spacer distributions with the ones found in the study of real chromosomes.
Simulating the insertion–elimination model, we have produced power-law size distributions of spacers between “delimiters” (which correspond to repeats of a given family); see figure 10. The detailed procedure is described in Methods. The extent and the slope of the linear part of the distributions in double-log scale are within the range met in genomic distributions (see figs. 1 and 3 and supplementary material, Supplementary Material online).

In different cases of natural phenomena that follow similar mathematical descriptions, the value of the exponent (slope) of a power law is the same and is often found to be in very good accordance with theoretical predictions. In some of these cases, especially in material sciences, geography, astrophysics, etc., the underlying dynamics usually works uniformly over a considerable range of lengths, thus generating self-similarity and a unique exponent value for all these orders of magnitude. In biology too, when a fractal geometry of the phenotype is functionally advantageous, the range of the produced power laws is often expected to cover sizes between tissues and cells, as occurs in the blood vessel network in a tissue, the bronchial architecture of the lung, etc. (Mandelbrot 1982; Feder 1988).

On the other hand, genome dynamics includes a variety of different, simultaneously working procedures. Each genomic dynamical procedure, if envisaged to act independently, could provide a simple and well-defined power-law size distribution. These power laws could have different exponents and would reach stability on different timescales. Moreover, the probability of occurrence of several types of aggregative events is not homogeneous within the genome (for viral insertions, see Rynditch et al. 1998; for repeat insertions, see the literature cited herein, etc.). Obviously, in real genomic evolution, we always observe the combined action of all these procedures.
Additionally, thresholds, or rather filters of tolerance, are imposed on these phenomena (due to their impact on the phenotype) when genomic architecture and functionality are disturbed beyond some extent. As a general consequence, the range and the perfection of the observed power laws are expected to be conditioned by all the above considerations. More specifically, absence of universality in the values of the determined exponents, cutoffs in the linearity found in double-logarithmic scale, and juxtapositions of more than one linear region with different exponents for different ranges of length scales (see e.g., figs. 1d and 3d) may be attributed to one or more of the above limitations, justified by the complexity of genomic evolution. Notice that all these features can be found in the figures presented herein.

The value of idealized models like the one proposed here (see also Li 1992; Li and Kaneko 1992; Buldyrev et al. 1993; Nikolaou and Almirantis 2005), which may be solved or simulated producing features qualitatively similar to the observed genomic regularities, lies in their use as roadmaps to form plausible scenarios for the causal explanation of genomic structure and function.

In the following, we shall investigate the possibility that the described simple insertion–elimination mechanism may explain our findings about the distributional features of Alu and LINE1 repeats. The insertion of external segments into the genome is well established and thus will not be discussed further, whereas evidence (based upon the related literature) for the occasional loss of repeats of a given population, allowing adjacent spacers to merge, is presented in the next subsection.

Regular Occurrence of Repeat Elimination Events during Genomic Evolution

Eliminations of initially retroposed elements, either by gradual corruption or by several types of recombination events, are extensively discussed in the literature.
A considerable part of that discussion is devoted to the possibility that preferential elimination of Alu elements from the AT-rich regions may have resulted in the observed shift of the old Alu distribution, as briefly discussed in the Introduction. For the purposes of our search, no such massive occurrence of elimination events is necessary. As simulations verify, a moderate elimination activity (events of “type i”), alongside the insertion of “next generations” of repeat families (events of “type ii”) in the genome, would be sufficient to produce power-law type distributions of interrepeat distances.

As discussed by several authors (see e.g., Pavlicek et al. 2001; Deininger and Batzer 2002; Hackenberg et al. 2005), long-term negative selection may have acted on Alus in the AT-rich regions. One reason for the selective pressure for their elimination may be the severe alteration of the local composition of AT-rich regions when Alus are inserted, whereas the same may apply in the case of LINE1s inserted in GC-rich regions. This is probably the reason why severely truncated LINE1s survive relatively longer in the genome in comparison with intact ones.

Webster et al. (2003) and Belle et al. (2005) have found, comparing human and chimpanzee genomic sequences, that the rate of decomposition due to single nucleotide substitutions or indel events does not depend on the GC content of the surrounding region. Thus, they concluded that the scarcity of Alus in AT-rich regions could not be explained in terms of composition divergence between repeats and surrounding sequence. However, these results are derived only from the recent evolutionary past after the human–chimpanzee divergence. We know that during the time of rapid Alu propagation in the primate genome, the frequency of insertion events was about 100 times the present frequency (Shen et al. 1991).
Possibly, for sufficiently high values of Alu abundance, unequal homologous Alu–Alu recombination events lead to deletions, reducing the Alu number in AT-rich compartments. Moreover, when a tolerance threshold in the Alu abundance is crossed, the chromatin structure may be severely altered, resulting in counterselection of further Alu accumulation in these regions (Deininger and Batzer 2002). Evidence for the need for compositional similarity between the intruding sequence and the position of insertion in retroviral integration is given by Rynditch et al. (1998). As has been pointed out (Filipski et al. 1989; see also Gu et al. 2000), Alu sequences can undergo compositional matching to their surrounding region.

The literature cited above seems to indicate that simple degradation and indel events cannot make repeats disappear, at least in relatively short evolutionary time intervals. On the other hand, a selective pressure may favor the elimination of Alu and other TEs, mainly by recombination events, when the genomic architecture and function are disturbed. Moreover, neutral (not selection driven) eliminations may also occur.

One way of systematic repeat deletion based on recombination is found to be the elimination of nearby inverted repeats. As extensively studied in the case of Alu elements, pairs of closely spaced Alus of inverse orientation are considerably underrepresented in the human genome (Stenger et al. 2001). The probability of elimination of these pairs of nearby inverted repeats depends inversely on their distance and on their percentage of similarity. However, very close similarity is not required: pairs of inverse Alus only 86% similar can efficiently stimulate recombination (Lobachev et al. 2000). So, the insertion of a repeat close enough to an older one of the same subfamily (but not necessarily a transpositionally active one) may trigger their mutual elimination.
In this case, the spacer between the repeats is removed while the 2 surrounding spacers “merge.” More recent evidence on the occurrence of this type of repeat elimination event is provided by the comparison of the initial sequence of the chimpanzee genome with the human genome (CSAC 2005). It was found that in the short evolutionary time since the divergence of the human and chimpanzee genomes, several hundred eliminations of adjacent pairs of Alu, LINE1, and retroviral repeats (present in their common ancestor) have occurred in both genomes. The same study specifies that such recombination events have occurred even when the divergence between adjacent Alus was >25%. Thus, recombination-driven elimination may occur even between members of different Alu subfamilies.

Power Laws in Alu and LINE1 Distributions: Properties and Features

The principal finding of this study about genomic organization is the existence of power laws in the size distributions of the distances between consecutive repeats of most Alu and LINE1 repeat classes in human chromosomes. In this subsection, these power-law distributions are systematically studied, relations between genome properties and distribution parameters are put forward, and the validity of the proposed insertion–elimination model is assessed on the basis of these relations.

The Role of Divergence for Alus and of Length for LINE1s

Let us first present the main results of our work for Alu elements. The Alu classes considered are the 4 high-population subfamilies Jo, Jb, Sx, and Y, which are the only groups with sufficient numbers of elements in all chromosomes for a reliable statistical analysis. Preliminary work with the data provided by RepeatMasker (see Methods) has shown that the most pronounced power laws are observed when only a fraction of the Alu repeats of a given subfamily in each chromosome is considered.
These repeats are always those that do not diverge from the subfamily consensus beyond a certain limit. This divergence is measured as defined in RepeatMasker (Smit et al. 1996–2004; www.repeatmasker.org). The limit taken in each case is denoted by “maximum divergence” in the presented figures. We consider that the optimum for a power law is reached when the extent of the linear region of the curve in double-logarithmic scale is maximized, provided that linear regression gives a sufficiently high value of r² (in all accepted cases r² > 0.97). In all chromosomes, the 3 older subfamilies (Jo, Jb, and Sx) present power-law behavior. The extent (E) of these power laws ranges up to 3 orders of magnitude, whereas the exponent μ is less than 2 in all Alu distributions with one exception (where μ = 2.05). Some examples are included in figure 1, and the full set of our results is given in table 1. An extended version of table 1 is provided in the supplementary material (Supplementary Material online). Figure 1d is representative of a few cases of graphs showing 2 extended linear regions with different slopes. Interestingly, for the (much younger) AluY subfamily, a power-law size distribution of interrepeat distances is formed in no more than 3 cases. Nevertheless, in most chromosomes, we observe clear deviations from the random surrogate data sets, with tails in the spacers’ distribution longer than those observed in the surrogate sequences (see supplementary material, Supplementary Material online).

FIG. 1.—Characteristic cases of power laws in Alu spacers’ size distributions in whole human chromosomes. Case (d) is representative of the coexistence of 2 different exponents in different length scales. Quantitative information is presented in table 1. Randomly generated surrogate data over 10 runs are also shown by solid lines.
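The acceptance criterion described above (the longest linear stretch in double-logarithmic scale, subject to r² > 0.97) can be made concrete with a short sketch. The Python below is an illustration only, not the procedure actually used in this study: the function names, the log-spaced sampling of the cumulative distribution (10 points per decade), and the minimum of 5 points per fit are all assumptions introduced here. It returns the extent E in orders of magnitude, the exponent μ as minus the fitted slope, and r².

```python
import bisect
import math

def loglog_cumulative(spacers, bins_per_decade=10):
    """Sample N(>=S), the number of spacers at least S long, on a log-spaced grid."""
    ordered = sorted(s for s in spacers if s > 0)
    if not ordered:
        return []
    x0, x_max = math.log10(ordered[0]), math.log10(ordered[-1])
    step = 1.0 / bins_per_decade
    points, k = [], 0
    while x0 + k * step <= x_max:
        xk = x0 + k * step
        count = len(ordered) - bisect.bisect_left(ordered, 10 ** xk)
        if count > 0:
            points.append((xk, math.log10(count)))
        k += 1
    return points

def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, sxy * sxy / (sxx * syy)

def power_law_extent(spacers, r2_min=0.97, min_points=5):
    """Longest log-log linear window with r^2 >= r2_min: (E, mu, r2), or None."""
    points = loglog_cumulative(spacers)
    n = len(points)
    for length in range(n, min_points - 1, -1):  # try the longest windows first
        best = None
        for start in range(n - length + 1):
            window = points[start:start + length]
            slope, r2 = fit_line([p[0] for p in window], [p[1] for p in window])
            if slope < 0 and r2 >= r2_min and (best is None or r2 > best[2]):
                best = (window[-1][0] - window[0][0], -slope, r2)
        if best is not None:
            return best
    return None

# Usage with synthetic spacers that follow an exact power law (mu = 1); not real data.
demo = [10.0 * ((i + 0.5) / 2000) ** -1.0 for i in range(2000)]
print(power_law_extent(demo))
```

Scanning window lengths from longest to shortest mirrors the stated preference for the maximal linear region; a real analysis would also need a rule for curves with 2 linear regions, as in figure 1d.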
Size distributions of interrepeat distances of AluSx and AluJb for low and high divergence from the consensus sequences are depicted in figure 2. These are representative of all the examined distance distributions of Alus. The subset of repeats with the lower divergence from the consensus always presents a more extended power law. For comparison, the plots of the corresponding interrepeat distances of the full AluSx and AluJb populations, regardless of divergence values, are also included in the supplementary material (Supplementary Material online).

FIG. 2.—Two examples of the effect of divergence from the Alu subfamily consensus on the extent E of the power law. (Surrogate data as in fig. 1.)

The size distributions of the distances between LINE1 elements are qualitatively similar to those observed for Alu repeats. LINE1 elements can be divided into ~50 subfamilies. A major division of the LINE1 repeats is into 2 groups: the mammalian-wide L1M, which may be found in most mammalian species, and the much younger primate-wide L1P, which have proliferated only after the separation of the primate lineage (Smit et al. 1995). In order to have TE populations of a suitable size for our study, we have adopted the division of LINE1 elements into these 2 major classes. Some typical distance size distributions are depicted in figure 3, and the full set of our results may be found in table 1. The mean extent of the power law is somewhat lower in LINE1s than in Alus, while μ is higher than 2 in 9 out of 35 cases. Again, in some cases we find graphs with 2 linear regions with different slopes (μ values); see figure 3d.

FIG. 3.—Characteristic cases of power laws in LINE1 spacers’ size distributions in whole human chromosomes. Case (d) is representative of the coexistence of 2 different exponents in different length scales. Quantitative information is presented in table 1. (Surrogate data as in fig. 1.)
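The surrogate curves referred to in the figure captions come from randomly generated data. One simple way to build such a null set is sketched below, under assumptions that are introduced here rather than taken from Methods: repeats are treated as points, positions are drawn uniformly and independently along a sequence of the given length, and the names `surrogate_spacers` and `surrogate_runs` are hypothetical. For such random placement the spacer sizes are approximately exponentially distributed, so the cumulative curve bends downward in double-logarithmic scale instead of forming a straight line.

```python
import random

def surrogate_spacers(n_repeats, sequence_length, rng):
    """Distances between consecutive repeats placed uniformly at random.

    Assumption: repeats are points (their own length is ignored).
    """
    positions = sorted(rng.randrange(sequence_length) for _ in range(n_repeats))
    return [right - left for left, right in zip(positions, positions[1:])]

def surrogate_runs(n_repeats, sequence_length, runs=10, seed=0):
    """Independent surrogate data sets (10 by default, as in the figure captions)."""
    rng = random.Random(seed)
    return [surrogate_spacers(n_repeats, sequence_length, rng) for _ in range(runs)]

# Usage with made-up sizes: 1,000 repeats on a 1 Mb sequence.
runs = surrogate_runs(1000, 1_000_000)
mean_spacer = sum(runs[0]) / len(runs[0])
print(len(runs), round(mean_spacer))
```

Using the same number of repeats and the same sequence length as the real chromosome keeps the mean spacer comparable, so any difference in the tail reflects the arrangement of repeats rather than their density.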
Each of the 2 groups of repeats (L1M and L1P) hosts LINE1 elements of several subfamilies, with their divergences measured by comparison with the appropriate consensus sequence. Thus, in order to fine-tune our search, we have checked the extent and quality of the observed power laws with respect to the length of the included repeats. As mentioned in the Introduction, LINE1 repeats are often severely truncated. The much older L1Ms are more truncated than primate-specific LINE1s. Typical examples of comparison between high- and low-length LINE1 subsets are given in figure 4a and b. In all examined cases, the longer (more intact and less truncated) elements form better power laws. Plots of the interrepeat distances of the L1 populations corresponding to figure 4a and b, regardless of degree of truncation, are included in the supplementary material (Supplementary Material online). The strong dependence of the observed power-law extent E on the degree of similarity with the consensus sequence (better results are obtained for lower divergence values) corroborates the proposed model: as expected and verified by the cited literature, high divergence from the subfamily consensus limits the possibility, or reduces the rate, of repeat eliminations due to recombination events, thus weakening the overall ability of the insertion–elimination mechanism to generate power laws. Trying to improve the rudimentary linearity in AluY double-log plots by taking the low-divergence fraction of the whole repeat population has limited or no effect on the extent of the observed linearity for most chromosomes (see supplementary material, Supplementary Material online). This feature agrees with the explanation proposed by the model for the role of divergence: the generally low divergence found in AluY populations is not expected to considerably hinder recombination events. In the case of groups of LINE1 elements, the thresholds taken in the length play the same role (see fig.
4), as severely truncated copies are less prone to recombination.

FIG. 4.—Two examples of the effect of the length of truncated LINE1 repeats on the extent E of the power law. (Surrogate data as in fig. 1.)

Power-Law Slope (μ) and Extent (E) Dependence on the GC Content

In figure 5a and b, scatterplots for AluJb and L1P are presented. They illustrate the correlation between the slope μ of the linear region in the cumulative size distribution graphs of interrepeat distances and the GC content of each chromosome. Quantitative information for all the examined repeat populations is given in table 2. For the full set of plots, see the supplementary material (Supplementary Material online). In all cases, a strong correlation between μ and GC content was found, which is positive for Alu and negative for LINE1. No such clear correlation has been found between E and GC content for whole chromosomes. In most cases, correlation is weak or absent (see table 2, fig. 6a), whereas a clear positive correlation is found only for L1P (fig. 6b). Taking into account the isochore structure of the human genome (Pavlicek, Paces, et al. 2002), we have examined (in each chromosome) pairs of regions, one with high and one with low GC content. Here a coherent picture emerges. The extent of linearity (power-law behavior, measured by E) clearly depends on the GC content: inter-Alu spacers’ size distributions present high E values in the GC-poor regions and low E values (or even absence of power law) in the GC-rich regions. For LINE1s, the situation is completely inverted. Examples are presented in figure 7a and b, and a quantitative account for 2 pairs of chromosomal regions is given in table 3. Notice that the results of the analysis for chromosomal regions fully agree with the type of dependence between μ and GC content found for complete chromosomes.
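The sign pattern just described can be checked directly on numbers reported in this paper. The sketch below fits an ordinary least-squares line of μ against GC% for the 4 chromosomal regions of table 3 (chromosomes 1 and 9, GC-poor and GC-rich). It is an illustration with only 4 points per class, not the whole-chromosome regression of table 2; the helper name `linear_fit` is introduced here, and no P value is computed.

```python
def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy * sxy / (sxx * syy)

# (GC%, mu) values read from table 3, in the order:
# chromosome 1 GC-poor, chromosome 1 GC-rich, chromosome 9 GC-poor, chromosome 9 GC-rich.
gc = [37.4, 48.2, 37.7, 49.3]
mu_by_class = {
    "AluJb": [0.80, 1.77, 1.04, 2.19],
    "L1P": [2.11, 1.60, 1.97, 0.78],
}
for name, mu in mu_by_class.items():
    slope, intercept, r2 = linear_fit(gc, mu)
    print(f"{name}: slope = {slope:+.3f} per GC%, r2 = {r2:.2f}")
```

With these inputs the fitted slope is positive for AluJb and negative for L1P, matching the direction reported for whole chromosomes.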
The correlation between GC content and μ value (positive for Alu and negative for LINE1) may be easily explained on the grounds of the relative preference of these 2 repeat classes for GC- and AT-rich regions, in combination with the patchy chromosome structure with respect to GC/AT composition. GC-rich chromosomes have a significant contribution of regions with high GC content, the H isochores, in their structure (i.e., large regions with reduced LINE1 populations) and are consequently characterized by an abundance of large spacers. Assuming a linear size distribution of spacers between LINE1 elements in double-log scale, a “fatter” tail is formed. Such a distribution is associated with lower absolute slope values. The situation is inverted for Alu size distributions.

More important for the assessment of the proposed mechanism is the finding connecting the extent E of the power law with the GC content when examining GC-rich and GC-poor chromosomal regions for several Alu and LINE1 repeat classes (see table 3 and fig. 7). For both repeat families, a coherent picture emerges: the extent of the power law is higher in the regions where the elimination rate for each repeat category is known to be higher. For Alus, this is the GC-poor subgenome, whereas for LINE1, this is the GC-rich one, for reasons presented previously and extensively discussed in the cited literature. Obviously, this property is in accordance with the proposed mechanism.

The clear positive correlation, found in the set of whole chromosomes, between the GC content and the extent of the power law in L1P repeats corroborates the conclusions derived from the study of chromosomal regions with high and low GC content (see fig. 6b). However, the positive (even if weak) correlation between the GC content and the extent of the power law in AluJb remains puzzling, given that, for the reasons explained above, this correlation is expected to be negative. In the remaining Alu cases, the correlations are positive but rather insignificant (see table 2). A possible explanation may be related to the compositional asymmetry of the human genome with respect to GC and AT content (high contribution of large AT-rich genomic regions). For this reason, GC-rich chromosomes develop Alu power laws that include a wider range of relatively short spacers (distances), while they continue to host sufficiently large AT-rich (i.e., scarce in Alu) regions. Therein, the elimination procedure still allows the elimination–insertion mechanism to work, generating the large spacers necessary for an extended tail of the distribution. Thus, GC richness in the framework of the human genome may be the cause of an extension of the power-law range.

FIG. 5.—GC%–μ correlation plots (AluJb and L1P) for the complete set of human chromosomes. Quantitative information for all correlation plots of this kind is presented in table 2.

FIG. 6.—GC%–E correlation plots (AluJb and L1P) for the complete set of human chromosomes. Quantitative information for all correlation plots of this kind is presented in table 2.

FIG. 7.—Size distributions of the spacers between AluSx repeats in a GC-poor and a GC-rich region of chromosome 9 (see table 3). (Surrogate data as in fig. 1.)

Table 2
Correlation between (i) GC Content and Slope (μ) and (ii) GC Content and the Extent (E) of the Power Law for Several Repeat Classes in All Human Chromosomes

Repeat Class          Regression Line Slope    r²        P Value
GC%–μ correlations
  AluJo                0.1008                  0.5136    0.00008
  AluJb (fig. 5a)      0.1029                  0.5870    0.00001
  AluSx                0.1104                  0.3884    0.0011
  L1M                 −0.1715                  0.4874    0.008
  L1P (fig. 5b)       −0.1713                  0.6704    0.000003
GC%–E correlations
  AluJo                0.0137                  0.0112    0.6219
  AluJb (fig. 6a)      0.0278                  0.1294    0.0842
  AluSx                0.0378                  0.0452    0.3187
  L1M                  0.0057                  0.0020    0.8843
  L1P (fig. 6b)        0.0668                  0.3094    0.0072

NOTE.—For all plots, see supplementary material (Supplementary Material online).

Table 3
Quantitative Information (by Regression Analysis) about Power Laws Observed in Chromosomal Regions of High- and Low-GC Content

Repeat Class        GC Content    E       μ       r²
Chromosome 1^a
  AluJb             Poor          1.60    0.80    0.9961
                    Rich          1.25    1.77    0.9967
  AluJo             Poor          1.14    0.71    0.9910
                    Rich          1.04    1.75    0.9880
  L1M               Poor          0.80    1.68    0.9951
                    Rich          1.35    1.08    0.9917
  L1P               Poor          0.56    2.11    0.9933
                    Rich          1.09    1.60    0.9537
Chromosome 9^b
  AluJb             Poor          1.14    1.04    0.9877
                    Rich          0.80    2.19    0.9710
  AluJo             Poor          1.48    0.68    0.9753
                    Rich          0.70    2.49    0.9869
  AluSx (fig. 7)    Poor          1.25    0.72    0.9898
                    Rich          0.44    3.46    0.9815
  L1M               Poor          0.80    2.14    0.9926
                    Rich          1.25    1.14    0.9617
  L1P               Poor          0.96    1.97    0.9894
                    Rich          1.13    0.78    0.9861

NOTE.—For all plots, see supplementary material (Supplementary Material online).
a GC-poor region: 68–105 Mb (GC% = 37.4), GC-rich region: 0–37 Mb (GC% = 48.2).
b GC-poor region: 2–32 Mb (GC% = 37.7), GC-rich region: 100–117.7 Mb (GC% = 49.3).

Power-Law Extent (E) Dependence on the “Subsequently Inserted” Amount of Repeats

Next, we examine the existence of correlation between the extent (E) of the power law in the spacers’ size distributions and the quantity of repeats inserted “after” the peak of proliferation activity of each repeat group in all chromosomes. We denote the “subsequently inserted” (SI) genomic material by SI followed, in parentheses, by the name of the repeat class considered. SI is measured in terms of sequence length, expressed as a percentage. We extracted the information about the chronology of the consecutive proliferation bursts of the repeat groups in the human genome from the related literature (Kapitonov and Jurka 1996; Dagan et al. 2004; Hackenberg et al. 2005). The ages of the examined repeat families generate the following order: L1M > AluJo > AluJb > AluSx > L1P > AluY (see table 4).
This order does not exclude some retroposition activity of one group while the next has already started to proliferate. It seems that AluSx in particular continued to retropose during much of the period of L1P proliferation activity. However, the peak of AluSx proliferation preceded that of L1P (Ohshima et al. 2003; Hackenberg et al. 2005). Notice that SI, as defined here, does not include the totality of the inserted material, such as simple repeats, minor Alu subfamilies, and other retroelements. The quantitative information from this study is given in table 5, and correlation diagrams for AluJo and L1M are depicted in figure 8a and b. In all cases, a positive correlation is found between the extent of the power-law behavior and the amount of repeats inserted into the chromosome after the peak of active proliferation of the examined repeat population. We note that in the case of AluJb, the values of P and r² indicate only a marginal correlation. In order to further corroborate and extend the results obtained for whole chromosomes, we attempted a genome-wide search for pairs of sequence regions in each chromosome with a significant difference in the mean amount of sequence inserted after the active proliferation period of a given repeat family. Then we compared the extent of the power law in these 2 regions (the same methodology as the one followed for high and low GC%). For a typical example, see figure 9b and c, where AluJb spacers’ distributions for a pair of SI(AluJb)-low/SI(AluJb)-high regions of chromosome 19 are compared, and figure 9a, where the SI(AluJb) profile is given for the whole chromosome 19. See table 6 for the results of all the examined cases.
In regions with high contrast in the SI material for a given repeat class, the differences in the extent of the corresponding power law are impressive. In most cases, one may observe a 2- to 3-fold increase of the E value in the SI-rich region. These results are in accordance with the findings derived from the study of entire chromosomes. See the supplementary material (Supplementary Material online) for the full set of plots and Methods for the difficulties in locating many chromosomal regions with SI contrast suitable for our study. The (always) positive E–SI correlation reflects the need, on the basis of the proposed mechanism, for a sufficient amount of inserted material in order for a power-law distribution to be formed (see fig. 10). This result is corroborated by the clearly contrasting behavior of pairs of chromosomal regions within the same chromosome that differ significantly in their relative amount of SI sequences (see table 6 and, for examples, fig. 9).

Table 4
Detailed List of the Repeat Classes Included in the SI Genomic Fraction Associated to the Repeat Families Studied Herein

Repeat Class    Subsequently Inserted Repeats (SI) of a Repeat Class
L1M             SI(L1M) = L1P + Alu
AluJo           SI(AluJo) = L1P + Alu − AluJo
AluJb           SI(AluJb) = L1P + Alu − AluJo − AluJb
AluSx           SI(AluSx) = L1P + AluY
L1P             SI(L1P) = AluY

NOTE.—These quantities are computed using data from the standard output of RepeatMasker. Here “Alu” denotes the total sequence length, expressed as a percentage, of Alu repeats in a given chromosome or genomic region.

Correlation of the Extent of the Observed Power Laws with Repeats’ Ages

The scarcity of power-law behavior in the case of the “young” AluY elements is compatible with the proposed mechanism, taking into account that no considerable amount of more recent insertions has occurred. The mean extent of the observed power law for the older Alu subfamilies increases with age (see table 1).
Age is a measure of both the amount of more recently inserted material and the total number of elimination events for each subfamily. Mean values of E are 1.55 (AluJo) > 1.35 (AluJb) > 1.31 (AluSx). The same is marginally true for LINE1s: 1.25 (L1M) > 1.23 (L1P). However, the lack of a power law for L1M in several chromosomes does not allow direct comparison between the 2 LINE1 classes. This lack (as well as the lack of power-law behavior in the distribution of L1P in 2 chromosomes) and the overall lower “quality” of power laws in LINE1s versus Alus (lower E and higher μ values) do not have an unambiguous explanation. Probably the truncation of LINEs, which is very strong in the case of the older mammalian-wide class, reduced their recombination activity, making elimination events so rare that in some cases no clear power law is observed. The picture is probably further blurred by other processes of genomic rearrangement (acting alongside the elimination–insertion mechanism), which eventually deformed and degraded the initially formed power-law distributions.

Table 5
Correlation between SI Content and the Extent (E) of the Power Law for Several Repeat Classes

Correlation               Regression Line Slope    r²        P Value
SI(AluJo)%–E(AluJo)^a     0.0698                   0.3554    0.0021
SI(AluJb)%–E(AluJb)       0.0242                   0.1058    0.1210
SI(AluSx)%–E(AluSx)       0.0844                   0.2215    0.0203
SI(L1M)%–E(L1M)^b         0.0663                   0.4418    0.0132
SI(L1P)%–E(L1P)           0.3360                   0.2632    0.0146

NOTE.—For all plots, see supplementary material (Supplementary Material online).
a See also figure 8a.
b See also figure 8b.

FIG. 8.—SI%–E correlation plots (AluJo and L1M) for human chromosomes where a power law is observed. SI denotes the sequence length due to subsequent insertions of Alu or LINE1 repeats after the peak of the proliferation period of the considered repeat class. Quantitative information for all correlation plots of this kind is presented in table 5.
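The elimination–insertion mechanism discussed in this section can be explored with a toy simulation. The sketch below is not the model of Methods; it is a deliberately reduced version built on assumptions stated here: the chromosome is represented only by its list of spacer sizes, an elimination (“type i”) removes one randomly chosen repeat so that its 2 flanking spacers merge, an insertion (“type ii”) lands at a uniformly chosen genomic position and therefore lengthens a spacer with probability proportional to its size, and the length of the eliminated repeat itself is ignored. The function name and all parameter values are hypothetical.

```python
import random

def insertion_elimination(spacers, steps, insert_len=300, p_eliminate=0.5, seed=0):
    """Evolve a list of spacer sizes; returns (spacers, n_eliminations, n_insertions)."""
    rng = random.Random(seed)
    spacers = list(spacers)
    n_elim = n_ins = 0
    for _ in range(steps):
        if len(spacers) > 2 and rng.random() < p_eliminate:
            # Type (i): one repeat is lost and its two flanking spacers merge.
            i = rng.randrange(len(spacers) - 1)
            spacers[i] += spacers.pop(i + 1)
            n_elim += 1
        else:
            # Type (ii): a later-generation element inserts at a uniformly chosen
            # position, so long spacers are hit more often than short ones.
            i = rng.choices(range(len(spacers)), weights=spacers)[0]
            spacers[i] += insert_len
            n_ins += 1
    return spacers, n_elim, n_ins

# Usage: start from randomly placed repeats (roughly exponential spacers, mean about 1 kb).
rng = random.Random(1)
start = [1 + int(rng.expovariate(1 / 1000)) for _ in range(2000)]
final, n_elim, n_ins = insertion_elimination(start, steps=1500)
print(len(final), n_elim, n_ins, max(final))
```

The resulting spacer list can then be examined in double-logarithmic scale in the same way as the genomic data. Varying `p_eliminate` and `steps` gives a rough feel for how much elimination and subsequent insertion are needed before the tail of the distribution lengthens.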
Power-Law Extent and Elimination Rate

Jurka et al. (2004) have observed a rapid decrease of recent Alu repeats in the Y chromosome as a function of time. This result is derived by comparing the populations of very recent and less recent AluY subfamilies under the assumption of a constant AluY insertion rate during the relatively recent evolutionary past. These authors concluded that chromosome Y presents higher repeat elimination rates than X and the autosomes (see also Lahn et al. 2001). On the other hand, chromosome Y presents overall the highest extent of power laws: in 3 out of 5 examined Alu and L1 repeat classes, the highest value of E has been observed in the Y chromosome, and it also has the highest average E value (see table 1). The combination of the result of Jurka et al. (2004) with our findings corroborates the proposed insertion–elimination mechanism for the generation of power laws in the size distributions of interrepeat distances.

FIG. 9.—(a) Plots representing along the whole chromosome 19 (x axis): i) the AluJb coverage of the sequence (right y axis, broken line) and ii) the SI(AluJb) coverage of the sequence (left y axis, continuous line) (both quantities are expressed as percentages). Boxes mark the low- and high-SI regions whose AluJb spacers’ size distributions are depicted in figures (b) and (c), respectively. (Surrogate data as in fig. 1.)

Table 6
Quantitative Information (by Regression Analysis) about Power Laws Observed in Chromosomal Regions of High- and Low-SI% Content

Repeat Class       SI% Content    E       μ       r²
Chromosome 19^a
  AluJo            Low: 16.37     0.68    2.41    0.9941
                   High: 31.22    1.03    1.97    0.9921
Chromosome 19^b
  AluJb^c          Low: 14.96     0.46    2.17    0.9968
                   High: 27.21    1.93    1.15    0.9905
  L1M              Low: 14.96     0.56    2.3     0.9906
                   High: 32.63    1.17    1.51    0.9801
Chromosome X^d
  AluJo            Low: 15.04     1.12    0.51    0.9863
                   High: 34.54    2.28    0.62    0.9801
  AluJb            Low: 12.53     0.8     0.63    0.9964
                   High: 33.15    1.6     0.43    0.9790
  AluSx            Low: 7.28      1.02    0.68    0.9858
                   High: 29.43    1.97    0.64    0.9946
Chromosome 19^e
  L1M              Low: 18.46     0.8     1.93    0.9901
                   High: 33.08    1.49    1.78    0.9851

NOTE.—For all plots, see supplementary material (Supplementary Material online).
a Low-SI region: 4.3–32.7 Mb, high-SI region: 7.7–15 Mb.
b Low-SI region: 4.3–32.7 Mb, high-SI region: 13.5–24.3 Mb.
c See also figure 9.
d Low-SI region: 3.5–32 Mb, high-SI region: 54–79 Mb.
e Low-SI region: 115–150 Mb, high-SI region: 50–80 Mb.

Conclusions and Perspectives

As we presented in detail in the previous section, the interrepeat distances’ size distributions for Alus and LINE1s in the human genome follow, in most cases, power laws, which in some cases reach an extent of 3 orders of magnitude. The proposed insertion–elimination model, based on simple and well-known molecular events, may explain this finding and agrees with the genomic and distributional features studied so far. These molecular events, occurring mainly in noncoding regions, are essentially neutral. As already mentioned, they are controlled by thresholds imposed by the condition that the viability of the organism not be affected. They are tolerated rather than selected, belonging to the zone of genomic dynamics that Holmquist (1989) called “molecular ecology of the noncoding DNA.” Thus, no direct biological significance may be assigned to the power-law size distributions engendered by these dynamics. These size distributions are associated with self-similarity and fractality (Mandelbrot 1982; Feder 1988) of the genome, which may be indirectly related to its 3-dimensional structure, its aptitude to absorb externally intruding sequences, and the use of its extended noncoding parts as a potential source of biological information in evolutionary time. In a preliminary examination of the mouse genome, the generality of some of the results derived from the human genome study was verified.
Four SINE families (B1_Mus1, B3, RSINE1, and B3A) and one LINE group (L1M, the mammalian-wide L1) have been selected on the basis of their copy number and age. As expected, the younger B1_Mus1 (found only in the genus Mus) gives the poorest result: practically no power law is observed. For the older, rodent-wide B3, RSINE1, and B3A families, clear evidence of a power law is found when considering the range of low divergences from each family consensus. In accordance with the prediction of the insertion–elimination model, when the upper range of divergence values is considered, the same repeat families give poor or no evidence of power-law occurrence. In the case of L1M, as in the human genome, the grouping of several repeat families together makes length the appropriate parameter for assessing the influence of elimination propensity on the extent of the power law. Again, in accordance with the proposed model, the less truncated repeat collection gives the more pronounced power law. For the corresponding figures, see the supplementary material (Supplementary Material online). A systematic study of the distributional features of repeat populations in a collection of genomes of representative organisms has been undertaken and will be presented in future work.

FIG. 10.—Two typical examples of (transient) power-law size distributions generated by the insertion–elimination model. For details, see Methods. The slope standard deviation is 0.071 for (a) and 0.116 for (b).

Supplementary Material

Supplementary material and extended table 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We would like to thank the RepeatMasker and RepBase teams for allowing us to install and use the necessary programs and databases for the study of repeats in human chromosomes. We are grateful to Dr L. Peristeras and Dr Th.
Georgomanolis for their helpful assistance in the installation and configuration of Linux and RepeatMasker and to Mrs N. Chousou-Polydouri for her valuable suggestions during the final preparation of the manuscript. We would also like to thank the 2 anonymous referees whose comments have considerably improved the present work. We thank the National Center for Scientific Research “Demokritos” for financial support.

Literature Cited

Adamic LA, Huberman BA. 2002. Zipf’s law and the internet. Glottometrics. 3:143–150.
Almirantis Y, Provata A. 1999. Long- and short-range correlations in genome organization. J Stat Phys. 97:233–262.
Arneodo A, Bacry E, Graves PV, Muzy JF. 1995. Characterizing long-range correlations in DNA-sequences from wavelet analysis. Phys Rev Lett. 74:3293–3296.
Audit B, Thermes C, Vaillant C, D’aubenton-Carafa Y, Muzy JF, Arneodo A. 2001. Long-range correlations in genomic DNA: a signature of the nucleosomal structure. Phys Rev Lett. 86:2471–2474.
Batzer MA, Deininger PL, Hellmann-Blumberg U, Jurka J, Labuda D, Rubin CM, Schmid CW, Zietkiewicz E, Zuckerkandl E. 1999. Standardized nomenclature for Alu repeats. J Mol Evol. 42:3–6.
Belle EMS, Webster MT, Eyre-Walker A. 2005. Why are young and old repetitive elements distributed differently in the human genome? J Mol Evol. 60:290–296.
Bernardi G. 2000a. The compositional evolution of vertebrate genomes. Gene. 251:31–43.
Bernardi G. 2000b. Isochores and the evolutionary genomics of vertebrates. Gene. 241:3–17.
Brookfield JFY. 2001. Selection on Alu sequences? Curr Biol. 11:R900–R901.
Buldyrev SV, Goldberger AL, Havlin S, Peng CK, Stanley HE, Stanley MHR, Simons M. 1993. Fractal landscapes and molecular evolution: modeling the myosin heavy-chain gene family. Biophys J. 65:2673–2679.
[CSAC] The Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 437:69–87.
Cohen N, Dagan T, Stone L, Graur D. 2005. GC composition of the human genome: in search of isochores. Mol Biol Evol. 22:1260–1272.
Dagan T, Sorek R, Sharon E, Ast G, Graur D. 2004. AluGene: a database of Alu elements incorporated within protein-coding genes. Nucleic Acids Res. 32:D489–D492.
Deininger PL, Batzer MA. 1999. Alu repeats and human disease. Mol Genet Metab. 67:183–193.
Deininger PL, Batzer MA. 2002. Mammalian retroelements. Genome Res. 12:1455–1465.
Dewannieux M, Esnault C, Heidmann T. 2003. LINE-mediated retrotransposition of marked Alu sequences. Nat Genet. 35:41–48.
Feder J. 1988. Fractals. New York: Plenum Press.
Filipski J, Salinas J, Rodier F. 1989. Chromosome localization-dependent compositional bias of point mutations in Alu repetitive sequences. J Mol Biol. 206:563–566.
Gish W. 2003. WU-BLAST 2.0. [Internet]. [cited 2007 November]; Available from http://blast.wustl.edu.
Gu Z, Wang H, Nekrutenko A, Li W-H. 2000. Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence. Gene. 259:81–88.
Hackenberg M, Bernaola-Galvan P, Carpena P, Oliver JL. 2005. The biased distribution of Alus in human isochores might be driven by recombination. J Mol Evol. 60:365–377.
Holmquist G. 1989. Evolution of chromosomal bands: molecular ecology of noncoding DNA. J Mol Evol. 28:469–486.
Jullien R, Botet R. 1987. Aggregation and fractal aggregates. Singapore: World Scientific.
Jurka J. 2000. Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 16:418–420.
Jurka J, Kohany O, Pavlicek A, Kapitonov VV, Jurka MV. 2004. Duplication, co-clustering, and selection of human Alu retrotransposons. Proc Natl Acad Sci USA. 101:1268–1272.
Kapitonov V, Jurka J. 1996. The age of Alu subfamilies. J Mol Evol. 42:59–65.
Lahn BT, Pearson NM, Jegalian K. 2001. The human Y chromosome, in the light of evolution. Nat Rev Genet. 2:207–216.
Li W. 1992. Generating nontrivial long-range correlations and 1/f spectra by replication and mutation. Int J Bifurcat Chaos. 2:137–154.
Li W. 2002. Zipf’s law everywhere. Glottometrics. 5:14–21.
Li W, Kaneko K. 1992. Long-range correlation and partial 1/f^α spectrum in a noncoding DNA-sequence. Europhys Lett. 17:655–660.
Lobachev KS, Stenger JE, Kozyreva OG, Jurka J, Gordenin DA, Resnick MA. 2000. Inverted Alu repeats unstable in yeast are excluded from the human genome. EMBO J. 19:3822–3830.
Makalowski W. 2003. Not junk after all. Science. 300:1246–1247.
Mandelbrot BB. 1982. The fractal geometry of nature. San Francisco, CA: W.H. Freeman.
Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S, Peng CK, Simons M, Stanley HE. 1994. Linguistic features of noncoding DNA-sequences. Phys Rev Lett. 73:3169–3172.
Medstrand P, van de Lagemaat LN, Mager DL. 2002. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 12:1483–1495.
Newman MEJ. 2005. Power laws, Pareto distributions and Zipf’s law. Contemp Phys. 46:323–351.
Nikolaou C, Almirantis Y. 2005. “Word” preference in the genomic text and genome evolution: different modes of n-tuplet usage in coding and noncoding sequences. J Mol Evol. 61:23–35.
Ohshima K, Hattori M, Yada T, Gojobori T, Sakaki Y, Okada N. 2003. Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol. 4:Art No. R74.
Ostertag EM, Kazazian HH. 2001. Biology of mammalian L1 retrotransposons. Annu Rev Genet. 35:501–538.
Pavlicek A, Clay O, Bernardi G. 2002. Transposable elements encoding functional proteins: pitfalls in unprocessed genomic data? FEBS Lett. 523:252–253.
Pavlicek A, Jabbari K, Paces J, Paces V, Hejnar J, Bernardi G. 2001. Similar integration but different stability of Alus and LINEs in the human genome. Gene. 276:39–45.
Pavlicek A, Paces J, Clay O, Bernardi G. 2002. A compact view of isochores in the draft human genome sequence. FEBS Lett. 511:165–169.
Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Sciortino F, Simons M, Stanley HE. 1992. Long-range correlations in nucleotide-sequences. Nature. 356:168–170.
Provata A, Almirantis Y. 1997. Scaling properties of coding and non-coding DNA sequences. Physica A. 247:482–496.
Rynditch AV, Zoubak S, Tsyba L, Tryapitsina-Guley N, Bernardi G. 1998. The regional integration of retroviral sequences into the mosaic genomes of mammals. Gene. 222:1–16.
Shen MR, Batzer MA, Deininger PL. 1991. Evolution of the master Alu gene(s). J Mol Evol. 33:311–320.
Smit AFA, Hubley R, Green P. 1996–2004. RepeatMasker Open-3.0. [Internet]. [cited 2007 November]; Available from: http://www.repeatmasker.org.
Smit AFA, Toth G, Riggs AD, Jurka J. 1995. Ancestral, mammalian-wide subfamilies of line-1 repetitive sequences. J Mol Biol. 246:401–417.
Sorek R, Ast G, Graur D. 2002. Alu-containing exons are alternatively spliced. Genome Res. 12:1060–1067.
Stenger JE, Lobachev KS, Gordenin D, Darden TA, Jurka J, Resnick MA. 2001. Biased distribution of inverted and direct Alus in the human genome: implications for insertion, exclusion, and genome stability. Genome Res. 11:12–27.
Sverdlov ED. 2000. Retroviruses and primate evolution. Bioessays. 22:161–171.
Takayasu H, Takayasu M, Provata A, Huber G. 1991. Statistical properties of aggregation with injection. J Stat Phys. 65:725–745.
Ullu E, Tschudi C. 1984. Alu sequences are processed 7SL RNA genes. Nature. 312:171–172.
Vicsek T. 1989. Fractal growth phenomena. Singapore: World Scientific.
Voss RF. 1992. Evolution of long-range fractal correlations and 1/f noise in DNA-base sequences. Phys Rev Lett. 68:3805–3808.
Webster MT, Smith NGC, Ellegren H. 2003. Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol Biol Evol. 20:278–286.

Aoife McLysaght, Associate Editor
Accepted August 9, 2007