Alu and LINE1 Distributions in the Human Chromosomes: Evidence of
Global Genomic Organization Expressed in the Form of Power Laws
Diamantis Sellis,* Astero Provata, and Yannis Almirantis*
*National Center for Scientific Research "Demokritos," Institute of Biology, Athens, Greece; and National Center for
Scientific Research "Demokritos," Institute of Physical Chemistry, Athens, Greece
The spatial distribution and clustering of repetitive elements, as well as their colocalization with other genomic
components, have been studied extensively in recent years. Here we investigate the large-scale features of Alu and LINE1
spatial arrangement in the human genome by studying the size distribution of interrepeat distances. In most cases, we
find power-law size distributions extending over several orders of magnitude. We have also studied the correlations of the
extent of the power law (linear region in double-logarithmic scale) and of the corresponding exponent (slope) with other
genomic properties. A model is formulated to explain the formation of the observed power laws. According to the
model, 2 kinds of events occur repeatedly in evolutionary time: random insertion of several types of intruding sequences
and occasional loss of repeats belonging to the initial population due to "elimination" events. This simple mechanism is
shown to reproduce the observed power-law size distributions and is compatible with present knowledge of the
dynamics of repeat proliferation in the genome.
Introduction
About 45% of the human genome consists of transposable elements (TEs) (Deininger and Batzer 2002;
Makalowski 2003), the majority of which are retroelements.
Most of them are members of the Alu and LINE1 families.
Alu belongs to the "short interspersed elements" (SINEs)
and is present in the human genome in
~1,100,000 copies, covering ~10% of its total length.
The typical Alu sequence is ~300 nt long and is GC and
CpG rich. LINE1 belongs to the "long interspersed elements" (LINEs). The number of LINE1 copies in the human genome is ~700,000. Intact LINE1s have a length of
~6 × 10^3 nt, but most copies are considerably truncated.
LINE1 repeats are AT rich and are present in all studied
mammalian genomes (Smit et al. 1995). Their evolutionary
history is long, dating to the beginnings of eukaryotic existence (Ostertag and Kazazian 2001). LINE1 uses a retrotransposase encoded in its sequence, whereas Alu
propagates by means of the retrotransposition machinery
of active LINE1 elements (Dewannieux et al. 2003). Both
repeat families are divided into subfamilies. Elements belonging to each subfamily share high similarity in some
characteristic (diagnostic) positions and common ancestry.
Alu proliferation has a shorter history than that of LINE1s
and probably began in a common ancestor of primates and
rodents (Ullu and Tschudi 1984). About 112 MYA, transposons named FLA evolved from 7SL RNA, and ~81
MYA, an FLA dimerization led to the formation of 2 subfamilies (Jo and Jb) of the J family of Alu elements. The next
main Alu subfamilies, S, Sx, and Y, arose 48, 37, and
19 MYA, respectively (Kapitonov and Jurka 1996; for nomenclature conventions, see also Batzer et al. 1999). Even
younger subfamilies (Ya5, age: 4 Myr; Yb8, age: 3 Myr)
arose after hominization, and the human genome is
polymorphic for a considerable number of their members.
The distribution of most classes of repeat elements in
the genome usually deviates from randomness. LINE1s are
found with a higher probability in the AT-rich genomic
compartments. Alus have a clear preference for the GC-rich
genomic compartments, and this tendency is most pronounced in the older subfamilies (on genomic patchiness
with respect to GC content, see Bernardi 2000a, 2000b).
However, the very young subfamilies do not follow this
pattern and share the preference of LINE1 elements for
AT-rich regions. This peculiarity in the genomic distribution of Alu elements is not yet completely understood
(Brookfield 2001; Pavlicek et al. 2001; Deininger and Batzer
2002; Medstrand et al. 2002; Jurka et al. 2004; Belle et al.
2005; Hackenberg et al. 2005).
Key words: Alu, LINE1, power laws, repeat distributions.
Mol. Biol. Evol. 24(11):2385–2399. 2007
doi:10.1093/molbev/msm181
Advance Access publication August 29, 2007
© The Author 2007. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
TEs were initially considered selfish entities propagating in the host genome as "junk DNA," as long as this
proliferation is tolerated without causing severe damage.
It is now increasingly accepted that the evolution
of TEs interacts in a complex way with other aspects of
overall genomic dynamics. It has been reported that 5% of all
alternatively spliced internal exons include parts of Alu insertions (Sorek et al. 2002). However, this view is questioned
by other authors (Pavlicek, Clay, and Bernardi 2002). The
impact of Alus and of other TEs on
the expression pattern of existing genes is not in doubt
(for Alus, see e.g., Deininger and Batzer 1999; Dagan
et al. 2004; for LINE1, see e.g., Ostertag and Kazazian
2001). It is very probable that the proliferation of Alu,
LINE1, and other TE families has provided a variety of advantages to the host genome over long evolutionary time
(not necessarily reflected in positive selection of newly
transposed copies).
In the present work, we study the large-scale pattern of
distribution for several repeat families in the human genome. This is done by examining the size distribution of distances between consecutive repeats belonging to the same
family. Furthermore, to explain the findings
concerning repeat distribution at the chromosomal level, we introduce a minimal model (named the "insertion–elimination
model") based on well-established events of genomic dynamics. These are i) eliminations of repeats of a specific
repeat family and ii) insertions of sequences of various origins. Both are well-known molecular events occurring regularly over long evolutionary time. It is shown that the
proposed model reproduces the general distribution pattern
that prevails in human chromosomes. Also, all the examined relations between genomic quantities and quantities
characterizing the interrepeat distances’ distributions are
compatible with the proposed model.
The article is organized as follows: In Methods, we
present technical aspects of the subsequent analysis. At
the beginning of Results and Discussion, the statistical
concepts used in the sequel are briefly reviewed. Then, the proposed insertion–elimination model is
introduced in order to compare its features with the properties of repeat distributions in the human genome. In the
next subsection, evidence is presented for the regular occurrence of repeat elimination events during genomic evolution (which is a prerequisite of the proposed model).
Next, the properties of the distributions of Alu and LINE1 repeat elements are systematically presented and the validity of the insertion–elimination model is assessed. In
a final section, some conclusions and perspectives of this
work are presented.
Methods
The sequences of the assembled human chromosomes
(build 35.1) were downloaded from the National Center for Biotechnology Information genomic biology
(ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/).
Gaps in the assembled human chromosomes
always pose a problem when measuring distances between
any type of localized feature (here, repeats) at whole-chromosome scale. We chose to remove gaps longer than
50,000 nt. This strategy was adopted after some initial
tests because these gaps could affect the shape of the interrepeat size distributions, whose linear part (in double-log
scale) starts around this order of magnitude. The shorter
gaps (which are more common) were retained in order
not to disturb the chromosomal architecture. Moreover,
their length is not expected to affect considerably the linear
part of the studied distributions. This was verified in trials
where all gaps were removed and the figures were left practically unchanged. Note that chromosomal coordinates included in tables 3 and 6 refer to human chromosomes after
the removal of gaps longer than 50,000 nt.
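The gap-removal step can be illustrated with a minimal sketch. It assumes that assembly gaps appear in the sequence as runs of the letter N and that "longer than 50,000 nt" means strictly longer; the function name `remove_long_gaps` and its parameter are hypothetical and are not taken from the scripts used in this work.

```python
import re

def remove_long_gaps(sequence, max_kept_gap=50_000):
    """Delete every run of N/n longer than max_kept_gap; shorter runs are kept.

    Assumption: gaps are encoded as runs of 'N' (upper or lower case).
    """
    pieces = []
    last = 0
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() > max_kept_gap:
            pieces.append(sequence[last:match.start()])  # keep text before the gap
            last = match.end()                            # skip the long gap itself
    pieces.append(sequence[last:])
    return "".join(pieces)
```

Coordinates measured on the returned string are then coordinates "after gap removal," in the sense used for tables 3 and 6.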
We have used RepeatMasker (Smit et al. 1996–2004;
www.repeatmasker.org), version 3.1.2, combined with libraries (release 20051025) derived from RepBase (Jurka
2000; www.girinst.org/) and WU-Blast v.2.0_10/05/2005
(Gish 2003; http://blast.wustl.edu). The data for several repeat populations in human chromosomes were extracted after a suitable parsing of the standard RepeatMasker output.
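As an illustration of such parsing, the sketch below reads rows laid out like the standard whitespace-delimited RepeatMasker `.out` table (score, divergence, deletions, insertions, query name, query begin, query end, "(left)", strand, repeat name, class/family, ...). That column order is an assumption about the input, the sample rows are invented, and the spacer definition (next start minus previous end minus 1) is one possible convention; the PERL and FORTRAN programs actually used may differ.

```python
def parse_repeatmasker_out(lines, name_prefix):
    """Collect (sequence, begin, end, strand, repeat_name) for matching repeats."""
    hits = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        repeat_name = fields[9]
        if repeat_name.startswith(name_prefix):
            hits.append((fields[4], int(fields[5]), int(fields[6]), fields[8], repeat_name))
    return hits

def spacer_lengths(hits):
    """Distances between consecutive repeats (assumed sorted, same sequence)."""
    return [nxt[1] - prev[2] - 1 for prev, nxt in zip(hits, hits[1:])]

# Invented sample rows, for illustration only.
SAMPLE = [
    "   SW  perc perc perc  query  position in query  matching repeat",
    "score  div. del. ins.  sequence begin end (left)  repeat class/family",
    "",
    " 2250 10.2 0.3 0.0 chr21 1000 1300 (5000) + AluSx SINE/Alu 1 301 (11) 1",
    "  900 20.1 1.0 2.0 chr21 2500 3100 (3200) C L1PA4 LINE/L1 (10) 6100 5500 2",
    " 2100 12.0 0.0 0.3 chr21 4000 4290 (2010) + AluSx SINE/Alu 1 290 (22) 3",
]
alu_sx = parse_repeatmasker_out(SAMPLE, "AluSx")
alu_sx_spacers = spacer_lengths(alu_sx)
```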
Throughout this work, we present the size distributions of spacers separating the repeats of a given class in
the form of cumulative distributions, for reasons described
in Results and Discussion. The analysis of chromosomal
regions was done using the same values of "maximum
divergence" and "minimum length" as in the whole-chromosome analysis (see supplementary material, Supplementary Material online). Cumulative distribution plots of
interrepeat distances, as well as a simple linear regression
analysis in log–log scale, were made with Grace-5.1.14.
The values of extent (E), slope (μ), r², and slope standard
deviation for all the genomic size distribution
computations performed are presented in the supplementary material (extended
table 1, Supplementary Material online). The same
values for the simulation examples are given in figure 10.
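For readers who wish to reproduce this kind of fit without Grace, the following is a minimal sketch of a least-squares regression on the cumulative distribution in log–log scale. The fitting window (`s_min`, `s_max`) must be chosen by the user, and reading the extent E as the width of that window in decades is an assumption made here for illustration; the function names are hypothetical.

```python
import math

def cumulative_counts(spacers):
    """Return sorted (S, N(S)) pairs, where N(S) counts spacers of length >= S."""
    ordered = sorted(spacers)
    n = len(ordered)
    pairs = []
    for i, s in enumerate(ordered):
        if i == 0 or s != ordered[i - 1]:  # first occurrence of each distinct length
            pairs.append((s, n - i))
    return pairs

def fit_loglog(pairs, s_min, s_max):
    """Least-squares line through (log10 S, log10 N) for s_min <= S <= s_max.

    Returns (mu, extent, r2): mu is minus the fitted slope, extent is the width
    of the fitted points in decades (assumed reading of E), and r2 is the
    squared correlation coefficient.
    """
    pts = [(math.log10(s), math.log10(n)) for s, n in pairs if s_min <= s <= s_max]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    extent = max(x for x, _ in pts) - min(x for x, _ in pts)
    return -slope, extent, r2
```

A typical call would be `fit_loglog(cumulative_counts(spacers), 1e4, 1e6)`, with the window bounds read off the plot.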
Scatterplots depicting the correlation between E or μ
and GC or SI content (SI: quantity of "subsequently inserted" genomic
material after the proliferation period of a repeat class or
subfamily) were made using STATISTICA version
6.0. The names of chromosomes are shown in the plots,
whereas the values of the associated squared correlation
coefficient (r²) and probability (P) are given in the corresponding tables.
There are some inherent difficulties in isolating chromosomal regions suitable for comparing sequences of high
and low GC% or SI% (figs. 7 and 9 and
tables 3 and 6). This is why we did not include
examples of all TE classes. However, even in cases not included because low repeat numbers yield
only rudimentary distributions, the results were similar
to those presented. One difficulty is the
sequence length necessary for this kind of analysis. Moreover, in the
GC-dependence study, if the GC% contrast between pairs of regions is
high (a prerequisite for unambiguous results), the density of the studied repeat populations is always very low in one of the two
examined regions (in the GC rich for LINE1s and in the AT
rich for Alus). Several difficulties are also inherent in the
search for power laws in low/high-SI regions, for example: a) SI(L1P) is uniformly low, depending only on
AluY insertions. b) The main components of the SI material for AluSx populations (SI[AluSx]) are L1P populations, resulting in the same conflict met in the search for
large GC%-poor or GC%-rich regions. Additionally, the SI%
value is relatively homogeneous along each chromosome
(with the exception of chromosome X and, partially, chromosome 19),
so large low- or high-SI
regions are very rare.
Examples of simulations using the insertion–elimination
model are given in figure 10. In these cases, 20,000 delimiters
(simulating a repeat population) are randomly distributed in an artificial
chromosome 2,000,000 nt long. After
consecutive rounds of delimiter eliminations and influx
of external sequence segments (here 100 nt long), a relatively small fraction of the initial number of delimiters is
left. Their distances form well-shaped power-law distributions. In case (a), 40 influx (insertion) events follow each
delimiter elimination (spacer merging) event until a population of 500 delimiters is left. In case (b), the insertion–
elimination ratio equals 20 and the remaining delimiters
are 1,000. Both cases, as well as results of other simulations
not presented here, demonstrate power-law formation for
a long (transient) time interval before the asymptotic limit
is reached.
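The procedure just described can be sketched as follows. This is an illustrative reimplementation, not the FORTRAN code used for figure 10. Delimiters are treated as zero-width points, each insertion lands on a uniformly random nucleotide (so a spacer is hit with probability proportional to its length), and a Fenwick tree plus a union-find "next survivor" pointer are implementation choices made here only to keep the run time reasonable. All names are hypothetical; the default parameters correspond to case (a).

```python
import random

class Fenwick:
    """Binary indexed tree over non-negative integer weights (0-based indices)."""
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values):
            self.add(i, v)

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def locate(self, r):
        """Smallest index whose prefix sum exceeds r (requires 0 <= r < total)."""
        pos, step = 0, 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= r:
                pos = nxt
                r -= self.tree[nxt]
            step >>= 1
        return pos

def simulate(chrom_len=2_000_000, n_delimiters=20_000, target=500,
             insertions_per_elimination=40, segment_len=100, seed=0):
    """Return (spacers between surviving delimiters, final chromosome length)."""
    rng = random.Random(seed)
    cuts = sorted(rng.sample(range(1, chrom_len), n_delimiters))
    bounds = [0] + cuts + [chrom_len]
    # Slot i (i < n) is the gap ending at delimiter i; slot n is the right end gap.
    length = [b - a for a, b in zip(bounds, bounds[1:])]
    tree = Fenwick(length)
    parent = list(range(n_delimiters + 1))  # union-find: next surviving slot to the right
    alive = list(range(n_delimiters))
    total = chrom_len

    def next_slot(i):
        root = i
        while parent[root] != root:
            root = parent[root]
        while parent[i] != root:            # path compression
            parent[i], i = root, parent[i]
        return root

    while len(alive) > target:
        # Elimination: remove a random delimiter, merging its two adjacent spacers.
        k = rng.randrange(len(alive))
        d = alive[k]
        alive[k] = alive[-1]
        alive.pop()
        parent[d] = d + 1
        right = next_slot(d + 1)
        tree.add(right, length[d])
        tree.add(d, -length[d])
        length[right] += length[d]
        length[d] = 0
        # Insertions: each segment lands on a uniformly random nucleotide.
        for _ in range(insertions_per_elimination):
            slot = tree.locate(rng.randrange(total))
            tree.add(slot, segment_len)
            length[slot] += segment_len
            total += segment_len

    survivors = sorted(alive)
    # Skip the first survivor's slot: it is the left end gap, not an interrepeat spacer.
    return [length[d] for d in survivors[1:]], total
```

Calling `simulate()` with the defaults runs case (a); case (b) would correspond to `simulate(target=1000, insertions_per_elimination=20)`. The returned spacers can then be plotted as a cumulative distribution in log–log scale.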
In the presented simulations, the insertion of a repeat
element close to another one of the same family in inverse
orientation, and the subsequent elimination of both (including the spacer between them), are modeled as a single repeat
elimination occurring randomly. It is well documented that
the scarcity of nearby repeats is most pronounced at
small distances, so the lost interrepeat spacers are practically all below the threshold that would influence our analysis (Lobachev et al. 2000; Stenger et al. 2001).
A concrete example is provided in CSAC (2005), where for
~600 deletions of Alus in the chimp genome 2 Mnt were
lost. So in each elimination of a pair of inverted repeats
~3,500 nt are lost, supposing equidistribution. Simple
inspection of figures 1 and 3 and related supplementary
material (Supplementary Material online) reveals that in
almost all cases of reported power-law–like distributions,
the linear part of the distribution lies clearly in a range
of lengths above 10,000 nt, well above 3,500 nt. Evidence
presented in CSAC (2005) for a power law in the
distribution of the lengths of the eliminated genomic segments
(see also CSAC [2005], p. 74) makes the number of deleted
spacers longer than 10,000 nt statistically insignificant (~6
out of 612).

Table 1
Quantitative Information (by Regression Analysis) about Power Laws Observed in All Human Chromosomes for the
Considered Repeat Categories

Chromosome   AluJb         AluJo         AluSx         L1M           L1P
Number       E      μ      E      μ      E      μ      E      μ      E      μ
1            1.40   1.22   1.06   1.31   1.03   1.80   1.27   1.93   0.93   1.66
2            1.37   0.92   1.28   1.17   1.15   0.89   -      -      0.80   1.69
3            1.16   0.95   1.04   0.84   1.40   0.51   -      -      1.35   1.94
4            1.49   0.89   1.37   0.82   1.24   0.64   -      -      1.02   2.58
5            1.25   0.97   1.15   0.89   1.05   0.67   -      -      1.03   1.87
6            1.15   0.82   1.38   1.04   1.38   0.58   -      -      1.22   2.17
7            1.30   1.07   1.93   0.67   1.27   0.76   1.13   2.36   1.83   1.60
8            1.25   0.52   1.25   0.66   1.46   0.50   -      -      0.70   2.25
9            1.15   1.34   1.75   1.03   1.27   1.15   1.39   1.88   1.48   1.59
10           1.16   1.70   1.40   1.52   0.92   1.21   0.80   1.85   1.25   1.76
11           1.50   1.06   1.50   1.06   1.37   0.89   1.40   2.40   1.60   1.72
12           1.36   1.18   1.83   1.05   1.38   0.99   0.92   2.38   1.15   1.84
13           1.28   0.71   1.70   1.06   0.68   1.50   0.70   3.22   -      -
14           1.87   1.14   1.62   1.21   0.92   1.41   -      -      1.25   1.17
15           1.03   1.78   1.16   1.79   0.92   1.93   -      -      0.92   2.09
16           1.50   1.44   1.73   1.70   1.38   1.22   -      -      1.14   0.95
17           1.60   1.55   1.50   1.76   1.72   2.05   1.14   0.91   1.60   0.76
18           1.40   1.17   1.62   0.75   0.70   1.46   -      -      -      -
19           1.86   1.77   2.05   1.56   2.06   1.67   1.50   1.08   1.70   0.84
20           1.40   1.03   1.20   1.02   1.27   1.16   0.80   1.49   1.50   1.13
21           1.28   1.03   1.72   0.81   1.36   1.34   -      -      0.80   1.17
22           1.18   1.79   1.52   1.62   1.14   1.78   1.39   0.62   1.70   0.38
X            1.25   0.81   1.84   0.71   1.26   0.75   1.70   1.96   1.15   2.36
Y            1.27   0.72   2.51   0.50   3.06   0.60   2.08   0.74   1.03   1.27

NOTE. Blank cells (marked -) mean that no power law (linearity in double-logarithmic scale) is found. Chromosome length, GC%, and regression coefficients (r²) are included in
the supplementary material (Supplementary Material online), as well as the complete set of the corresponding plots. Notice that all r² values are higher than 0.97 while their
mean is higher than 0.99.
Scripts in PERL and programs in FORTRAN used for
parsing RepeatMasker output files and producing the presented results, as well as STATISTICA and EXCEL
spreadsheets with repeat population data in human chromosomes, are available upon request.
Results and Discussion
Preliminaries
The distributional features of TEs have been the subject of intense research in recent years. In most cases,
substantially nonrandom distributions have been observed.
Major factors related to the uneven distribution of repeats
are the local GC abundance, the underlying distribution of genes, mutual avoidance of different repeat populations,
and intrafamily repeat coclustering. Here we perform a systematic
investigation of the size distribution of the spacers
(distances) separating consecutive repeats of the major Alu
and LINE1 classes in the human genome. When clustering
appears only at short length scales, clusters with sizes ranging around a mean value are formed. In such cases, for large
values of length S, these short-range distributions have an
exponentially decaying tail of the form:
$$P(S) \sim e^{-aS}, \qquad 0 < a.$$
When the clustering is scale free, long-range correlations extend in a self-similar way over several length scales
(and ideally over the whole examined genomic length) and
the corresponding spacers' size distributions follow "power
laws," which correspond to linear graphs in a double-logarithmic scale:
$$P(S) \sim S^{-\zeta} = S^{-1-\mu}, \qquad 0 < \mu.$$
High values of the extent of linearity in log–log plots
(E) mean that self-similarity spans many orders of magnitude, whereas low absolute values of the slope (μ) are connected to an abundance of relatively long spacers at every
length scale.
More interesting are power laws with 0 < μ ≤ 2, which
implies that the distribution has an infinite mean square. This
means that for large data sets the standard deviation is unbounded (i.e., the larger the data set, the higher
the computed standard deviation). Several power laws
found in the real world have values of μ around 2 (Newman
2005). For higher values of μ, the distribution becomes
progressively similar to a short-ranged one, that is, randomness tends to prevail at the large length scales (large values
of S).
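The statement about the infinite mean square can be checked with a short calculation. As a sketch, assume a pure power-law density p(S) = C S^{-1-μ} above a lower cutoff S_min (both C and S_min are introduced here only for illustration):

```latex
\langle S^{2} \rangle
  = \int_{S_{\min}}^{\infty} S^{2}\, p(S)\, dS
  = C \int_{S_{\min}}^{\infty} S^{1-\mu}\, dS
  = \begin{cases}
      \infty, & 0 < \mu \le 2, \\[4pt]
      \dfrac{C\, S_{\min}^{\,2-\mu}}{\mu - 2}, & \mu > 2 .
    \end{cases}
```

For μ = 2 the integrand is 1/S and the divergence is logarithmic; for μ < 2 it is algebraic. In both cases the sample standard deviation keeps growing as more spacers are included.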
Power laws in size distributions are in general associated with the geometrical features of self-similarity and
fractality (Mandelbrot 1982; Feder 1988), that is, the distribution (in our case) of a given repeat family in the genome looks similar when examined at several length scales.
In the subsequent analysis, we use the ‘‘cumulative
size distribution’’ defined as follows:
$$P(S) = \int_{S}^{\infty} p(r)\, dr,$$
where p(r) is the original spacers’ size distribution. The cumulative form of the distribution has in general better statistical properties as it forms smoother ‘‘tails,’’ less affected
by fluctuations. For reviews on power-law size distributions, their properties, and alternative forms, see, for example, Adamic and Huberman (2002); Li (2002); and
Newman (2005).
The cumulative form of a power-law size distribution
is again a power law characterized by an exponent (slope)
equal to that of the original distribution minus 1: if
$p(r) \sim r^{-1-\mu}$, then
$$P(S) \sim \int_{S}^{\infty} r^{-1-\mu}\, dr \sim S^{-\mu}.$$
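Written out step by step, with a normalization constant C introduced here only for illustration, the integration reads:

```latex
P(S) = \int_{S}^{\infty} p(r)\, dr
     = C \int_{S}^{\infty} r^{-1-\mu}\, dr
     = C \left[ -\frac{r^{-\mu}}{\mu} \right]_{S}^{\infty}
     = \frac{C}{\mu}\, S^{-\mu},
\qquad \mu > 0 ,
\qquad\text{so that}\qquad
\log P(S) = \log\frac{C}{\mu} - \mu \log S .
```

The slope read from a log–log plot of the cumulative distribution is therefore -μ, whereas the slope of the original distribution would be -(1 + μ) = -ζ.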
In the sequel, plots depicting cumulative size distributions of interrepeat distances are presented. The logarithms
of interrepeat distances (S) are shown on the horizontal axis,
whereas the vertical axis shows the logarithms of the number
N(S) of all the spacers longer than or equal to S. Circles are used for genomic
size distributions and diamonds for model-generated ones.
In addition to the genomic size distributions, the figures also include a bundle of 10 size distributions in which repeats are positioned randomly (continuous
lines). The number of the randomly positioned repeats is
taken to be equal to the number of repeats of the associated
genomic sequence, and the length of the (computer-generated) artificial sequence is equal to the length of the considered genomic region. The inclusion of these random data
sets in the figures, on the same scale as the genomic
data sets, shows pictorially how far the observed genomic distribution
patterns diverge in
shape and form from those expected on the grounds of purely
random repeat positioning.
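A minimal sketch of how such a bundle of random reference distributions could be generated is given below. Repeats are treated as points placed uniformly at random without overlap (their own length is ignored, which is a simplifying assumption), and the function names are hypothetical.

```python
import random

def random_spacers(n_repeats, region_len, rng):
    """Spacers between n_repeats point-like repeats placed uniformly at random."""
    positions = sorted(rng.sample(range(region_len), n_repeats))
    return [b - a for a, b in zip(positions, positions[1:])]

def count_at_least(spacers, s):
    """N(S): number of spacers longer than or equal to s."""
    return sum(1 for x in spacers if x >= s)

def surrogate_bundle(n_repeats, region_len, n_sets=10, seed=0):
    """Independent random data sets matching the repeat number and region length."""
    rng = random.Random(seed)
    return [random_spacers(n_repeats, region_len, rng) for _ in range(n_sets)]
```

Plotting `count_at_least` against S in log–log scale for each surrogate set gives the continuous reference lines; for random positioning, N(S) is expected to fall off roughly exponentially rather than as a power law.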
Power laws have already been reported in several aspects of genome structure, such as long-range
correlations in noncoding sequences (Li and Kaneko
1992; Peng et al. 1992; Voss 1992), the size distribution of
purine and pyrimidine islands in noncoding sequences (Provata
and Almirantis 1997), the size distribution of noncoding
segments between exons in higher eukaryotes (Almirantis
and Provata 1999), the detection of "Zipf-like laws" (found
principally in noncoding sequences) in rank diagrams when strings of n
nucleotides (n = 6, 7, ...) are considered as "words" of the
genomic text (Mantegna et al. 1994), the wavelet analysis of genomic sequences in Arneodo et al. (1995) and
Audit et al. (2001), etc. More recently, the existence of a
genome-wide power-law size distribution for several genomic compartments defined on the basis of their GC content (putative isochores) has been reported in Cohen et al. (2005).
The ‘‘Insertion–Elimination Model’’
There is a wide corpus of literature aiming to explain
the appearance of power-law size distributions in several
phenomena, ranging from physics to biology, social sciences, and linguistics (see e.g., Mandelbrot 1982; Feder
1988). Various types of aggregative growth have been
proven, both theoretically and through simulations, to be
at the basis of power-law distributions (Jullien and Botet
1987; Vicsek 1989). Genomic dynamics includes several
types of events, which may be modeled as ‘‘aggregative’’
(i.e., a chromosome incorporates new material and becomes
larger), like transport and integration of genomic material
from different organisms (leading occasionally to ‘‘gene
transfer’’), insertions of sequences of viral origin, retropositions, incorporations of other nonretroelement sequences,
etc. Aggregative phenomena (being extremely abundant
during the evolutionary history of genomes) seem to form
the most solid basis for the understanding of the appearance
of long-range features and power-law size distributions.
For 1-dimensional models, it may be shown analytically (Takayasu et al. 1991) that the combination of i) aggregation (fusion) events of neighboring
"particles" and ii) events of intrusion of new particles from
outside, which eventually aggregate with internal particles,
may lead asymptotically to the appearance of power laws
in the particles' size distributions. The simultaneous action
of both types of events is necessary for the emergence of
power-law size distributions. The model is solvable, and
the exponent ζ of the power law may be determined analytically to be ζ = μ + 1 = 4/3, provided that an
external particle, which intrudes from outside into the
1-dimensional system, "fuses" with a particle inside the
system with a constant probability (i.e., independently of
the size of the internal particle), whereas there is also the
possibility that the inserted particle may stand alone. This
means that insertion events may either enlarge already existing particles or give rise to new particles in the system,
thus compensating for the constant decrease of the number
of internal particles due to fusions of nearby particles
(Takayasu et al. 1991).
The structure and dynamics of the genome pose serious restrictions on aggregative models that try to explain
the generation of power laws in the size distribution of
interrepeat distances. Let us consider the "spacers" between repeats of a given family as particles characterized by their
length and constrained to the 1-dimensional topology induced by the structure of the DNA molecule. Any elimination of one of the repeats of the considered population leads
to the aggregation (merging) of 2 adjacent spacers. Insertion
of sequence segments, either genuinely of external (e.g., viral) origin or processed copies of genomic origin (which are
reincorporated nearly randomly, like subsequent repeat families, processed pseudogenes, etc.), may be seen as an influx of
external particles aggregating with one of the particles
(spacers) already belonging to the 1-dimensional system
(the "thread" of the DNA molecule).
Here, 2 main differences from the exactly solvable
model of Takayasu et al. (1991) may be recognized as
follows:
A. In our model, the probability of an insertion into
a given spacer is obviously proportional to the length
of that spacer. This feature complicates the analytical
treatment of the problem because it couples
the procedures of 1) aggregation (of adjacent particles)
and 2) insertion (i.e., aggregation of intruding sequences with existing spacers).
This coupling poses severe mathematical
difficulties in the search for an analytical solution,
which are not met in mechanisms without such
coupling. However, the fact that long spacers grow
longer more easily (if the probability of absorption of
externally inserted material increases with size) is
obviously in accordance with the emergence of a power
law: the tail of the
distribution, formed by the longer spacers, "fattens"
more drastically in the case of coupled procedures.
B. A more fundamental divergence of our problem from
solvable aggregation models is the following: The
evolution of any given TE family that has ceased to
proliferate in the genome and is exposed to a nonzero
rate of eliminations in evolutionary time always
results asymptotically in the trivial situation where the
whole DNA molecule becomes a single spacer after
the disappearance of the last member of this TE family.
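The difference described in point A can be made concrete with a short sketch contrasting the two selection rules. The function names are hypothetical; the first rule picks the spacer hit by an insertion at a uniformly random nucleotide (probability proportional to length, as in our model), the second picks any spacer with equal probability (size-independent, as assumed in the solvable model).

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pick_spacer_by_length(lengths, rng):
    """Index of the spacer containing a uniformly random nucleotide."""
    cumulative = list(accumulate(lengths))
    r = rng.randrange(cumulative[-1])      # random nucleotide in the whole sequence
    return bisect_right(cumulative, r)     # first spacer whose cumulative end exceeds r

def pick_spacer_uniformly(lengths, rng):
    """Index of a spacer chosen independently of its length."""
    return rng.randrange(len(lengths))
```

Under the first rule a spacer of 999 nt is hit about 999 times as often as a spacer of 1 nt, which is the "long spacers grow longer more easily" effect; under the second rule both are hit equally often.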
As a consequence, the size distribution of spacers between consecutive repeats of a given family in a chromosome cannot be studied by means of asymptotic solutions of
any (solvable or not) aggregative model because we have to
take into account repeat eliminations. The power laws
found so far have to be treated as ‘‘transient in the long evolutionary time’’ because, inevitably, the only asymptotic solution is the trivial one. In such cases, in order to test the
explanatory power of a model, we use simulations for finite
time intervals and comparisons of snapshots of the produced spacers’ distributions with the ones found in the
study of real chromosomes.
Simulating the insertion–elimination model, we have
produced power-law size distributions of spacers between
‘‘delimiters’’ (which correspond to repeats of a given family), see figure 10. The detailed procedure is described in
Methods. The extent and the slope of the linear part of
the distributions in double-log scale are within the range
met in genomic distributions (see figs. 1 and 3 and supplementary material, Supplementary Material online).
In different natural phenomena that follow
similar mathematical descriptions, the value of the exponent (slope) of a power law is the same and is often found
to be in very good accordance with theoretical predictions.
In some of these cases, especially in material sciences, geography, astrophysics, etc., the underlying dynamics usually
works uniformly over a considerable range of lengths,
thus generating self-similarity and a unique exponent value
for all these orders of magnitude. In biology too, when
a fractal geometry of the phenotype is functionally advantageous, the range of the produced power laws is often expected to cover sizes between tissues and cells, as occurs in
the blood vessel network in a tissue, the bronchial architecture of the lung, etc. (Mandelbrot 1982; Feder 1988). On the
other hand, genome dynamics includes a variety of different, simultaneously working procedures. Each of these genomic
dynamical procedures, if envisaged to act independently, could produce a simple and
well-defined power-law size distribution.
These power laws could have different
exponents and would reach stability on different timescales. Moreover, the probability of occurrence of several
types of aggregative events is not homogeneous within the
genome (for viral insertions, see Rynditch et al. 1998; for
repeat insertions, see the literature cited herein, etc.). Obviously, in real genomic evolution, we always observe the
combined action of all these procedures. Additionally,
thresholds, or rather filters of tolerance, are imposed on
these phenomena (due to their impact on the phenotype)
when genomic architecture and functionality are disturbed
beyond some extent. As a general consequence, the range
and the perfection of the observed power laws are expected to be conditioned by all the above considerations.
More specifically, absence of universality in the values of
the determined exponents, cutoffs in the linearity found in
double-logarithmic scale, and juxtapositions of more than
one linear region with different exponents for different
ranges of length scales (see e.g., figs. 1d and 3d) may be attributed to one or more of the above limitations, justified by
the complexity of genomic evolution. Notice that all these
features can be found in the figures presented herein.
The value of idealized models like the one proposed
here (see also Li 1992; Li and Kaneko 1992; Buldyrev et al.
1993; Nikolaou and Almirantis 2005), which may be solved
or simulated producing features qualitatively similar to the
observed genomic regularities, lies in their use as roadmaps
to form plausible scenarios for the causal explanation of the
genomic structure and function.
In the following, we shall investigate the possibility
that the described simple insertion–elimination mechanism
may explain our findings about the distributional features of
Alu and LINE1 repeats. The insertion of external segments
into the genome is well established, thus it will not be further discussed, whereas evidence (based upon the related
literature) about the occasional loss of repeats of a given
population, allowing adjacent spacers to merge, is presented in the next subsection.
Regular Occurrence of Repeat Elimination Events during
Genomic Evolution
Eliminations of initially retroposed elements, either by
gradual corruption or by several types of recombination
events, are extensively discussed in the literature. A considerable part of that discussion is devoted to the possibility
that preferential elimination of Alu elements from the
AT-rich regions may have resulted in the observed shift
of the old Alu distribution, as briefly discussed in the Introduction. For the purposes of our study, no such massive
occurrence of elimination events is necessary. As simulations verify, a moderate elimination activity (events of
"type i"), along with the insertion of "next generations" of repeat families (events of "type ii") in the genome,
would be sufficient to produce power-law-type distributions
of interrepeat distances.
As discussed by several authors (see e.g., Pavlicek
et al. 2001; Deininger and Batzer 2002; Hackenberg
et al. 2005), long-term negative selection may have acted
on Alus in the AT-rich regions. One reason for the selective pressure for their elimination may be the severe alteration of the local composition of AT-rich regions when Alus
are inserted, whereas the same may apply in the case of
LINE1s inserted in GC-rich regions. This is probably
why severely truncated LINE1s survive relatively
longer in the genome than intact ones.
Webster et al. (2003) and Belle et al. (2005) have
found, comparing human and chimpanzee genomic sequences, that the rate of decomposition due to single nucleotide
substitutions or indel events does not depend on the GC
content of the surrounding region. Thus, they concluded
that the scarceness of Alus in AT-rich regions could not
be explained in terms of composition divergence between
repeats and surrounding sequence. However, these results
are derived only from the recent evolutionary past after
the human–chimpanzee divergence. We know that during
the time of rapid Alu propagation in the primate genome, the
frequency of insertion events was about 100 times the present frequency (Shen et al. 1991). Possibly, for sufficiently
high values of Alu abundance, unequal homologous Alu–
Alu recombination events lead to deletions, reducing the
Alu number in AT-rich compartments. Moreover, when
a tolerance threshold in the Alu abundance is crossed,
the chromatin structure may be severely altered, resulting
in counterselection of further Alu accumulation in these
regions (Deininger and Batzer 2002). Evidence for the need
of constitutional similarity between the intruding sequence
and the position of insertion in retroviral integration is given
by Rynditch et al. (1998). As has been pointed out (Filipski
et al. 1989; see also Gu et al. 2000), Alu sequences can undergo compositional matching to their surrounding region.
The literature cited above seems to indicate that simple degradation and indel events cannot make repeats disappear,
at least in relatively short evolutionary time intervals. On
the other hand, a selective pressure may favor the elimination of Alu and other TEs, mainly by recombination events,
when the genomic architecture and function are disturbed.
Moreover, neutral (not selection driven) eliminations may
also occur.
One systematic route of recombination-based repeat deletion is the elimination of closely spaced inverted repeats. As extensively studied in the case
of Alu elements, pairs of closely spaced Alus of inverse orientation are considerably underrepresented in
the human genome (Stenger et al. 2001). The probability
of elimination of such a pair decreases with the distance between its members and increases with their percentage of
similarity. However, very close similarity is not required:
pairs of inverse Alus only 86% similar can efficiently stimulate recombination (Lobachev et al. 2000). So, the insertion of a repeat close enough to an older one of the same
subfamily (but not necessarily a transpositionally active
one) may trigger their mutual elimination. In this case,
the spacer between the repeats is removed while the 2 surrounding spacers ‘‘merge.’’ More recent evidence on the
occurrence of this type of repeat elimination events is provided by the comparison of the initial sequence of the chimpanzee genome with the human genome (CSAC 2005). It
was found that in the short evolutionary time after the divergence of human and chimpanzee genomes, several hundred eliminations of adjacent pairs of Alu, LINE1, and
retroviral repeats (present in their common ancestor) have
occurred in both genomes. In the same study, it is specified
that such recombination events have occurred even when
the divergence between adjacent Alus was >25%. Thus,
recombination-driven elimination may occur even between
members of different Alu subfamilies.
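The bookkeeping behind this kind of event can be sketched in a few lines. The example below is illustrative only: the tuple layout, the function names, and the gap threshold are assumptions, and sequence similarity, which also affects the elimination probability, is not modeled.

```python
def close_inverted_pairs(repeats, max_gap):
    """Adjacent repeat pairs of opposite orientation separated by <= max_gap.

    `repeats` is a list of (start, end, strand) tuples with strand '+' or '-'
    (a hypothetical layout, not a RepeatMasker format).
    """
    ordered = sorted(repeats)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right[0] - left[1]
        if left[2] != right[2] and 0 <= gap <= max_gap:
            pairs.append((left, right, gap))
    return pairs

def merge_after_pair_loss(spacers, i):
    """Spacer list after repeats i and i+1 are lost together.

    spacers[k] is the distance between repeat k and repeat k+1. Spacer i
    disappears with the pair; its two neighbors merge into one spacer.
    """
    if not 1 <= i <= len(spacers) - 2:
        raise ValueError("the pair needs a spacer on both sides")
    merged = spacers[i - 1] + spacers[i + 1]
    return spacers[:i - 1] + [merged] + spacers[i + 2:]

repeats = [(0, 300, "+"), (350, 650, "-"), (5000, 5300, "-"), (5400, 5700, "-")]
print(close_inverted_pairs(repeats, max_gap=100))
print(merge_after_pair_loss([10, 5, 20, 7], 1))
```

Each such event removes one short spacer and creates one longer spacer, which is how eliminations feed the long-distance tail of the spacer size distribution.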
Power Laws in Alu and LINE1 Distributions: Properties
and Features
The principal finding of
this study regarding genomic organization is the existence of power laws in the size distributions of the distances between consecutive repeats of
most Alu and LINE1 repeat classes in human chromosomes. In this subsection, these power-law distributions
are systematically studied, relations between genome
properties and distribution parameters are put forward, and
the validity of the proposed insertion–elimination model is
assessed on the basis of these relations.
The Role of Divergence for Alus and of Length for
LINE1s
Let us first present the main results of our work for Alu
elements. The considered classes of Alu elements are the 4
high-population subfamilies of Alus: Jo, Jb, Sx, and Y,
which are the only groups with sufficient numbers of
Alu elements in all chromosomes for a reliable statistical
analysis. Preliminary work with the data provided by the
application of RepeatMasker (see Methods) has shown that
the most pronounced power laws are observed when only
a fraction of Alu repeats of a given subfamily in each chromosome is considered. These repeats are always those that
do not diverge from the subfamily consensus beyond a certain limit. This divergence is measured as defined in
RepeatMasker (Smit et al. 1996–2004; www.repeatmasker.org).
The limit taken in each case is denoted by maximum
divergence in the presented figures. We consider that the
optimum for a power law is reached when the extent of
the linear region of the curve in double-logarithmic scale
is maximized, provided a sufficiently high value of r² when
applying linear regression (in all accepted cases r² > 0.97).
In all chromosomes, the 3 older families (Jo, Jb, and Sx)
present power-law behavior. The extent (E) of these power
laws ranges up to 3 orders of magnitude, whereas μ is less
than 2 in all Alu distributions with one exception (where
μ = 2.05). Some examples are included in figure 1, and
the full set of our results is given in table 1. In the supplementary material (Supplementary Material online), an extended version of table 1 is provided. Figure 1d is
representative of a few cases of graphs showing 2 extended
linear regions with different slopes. Interestingly, considering the (much younger) AluY subfamily, a power-law size
distribution of interrepeat distances is formed in no more
than 3 cases. Nevertheless, in most chromosomes, we observe clear deviations from the random surrogate data sets,
with tails in the spacers’ distribution longer than those observed in the surrogate sequences (see supplementary material, Supplementary Material online).
FIG. 1.—Characteristic cases of power laws in Alu spacers’ size distributions in whole human chromosomes. Case (d) is representative of the
coexistence of 2 different exponents in different length scales. Quantitative information is presented in table 1. Randomly generated surrogate data over
10 runs are also shown by solid lines.
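The acceptance criterion described above (widest linear region in double-logarithmic scale, subject to r² > 0.97) can be sketched as follows. This is an illustrative reading of the criterion, not the procedure of Methods: the logarithmic grid, the minimum number of points, and all function names are assumptions, and E is taken here as the width of the linear region in decades.

```python
import math
from bisect import bisect_left

def log_cumulative(spacers, points_per_decade=10):
    """log10(S) and log10 N(>=S) on a logarithmic grid of sizes S."""
    data = sorted(s for s in spacers if s > 0)
    lo, hi = math.log10(data[0]), math.log10(data[-1])
    k = max(1, int((hi - lo) * points_per_decade))
    xs, ys = [], []
    for j in range(k + 1):
        x = lo + (hi - lo) * j / k
        count = len(data) - bisect_left(data, 10 ** x)  # spacers >= 10**x
        if count > 0:
            xs.append(x)
            ys.append(math.log10(count))
    return xs, ys

def fit_line(xs, ys):
    """Least-squares slope and r^2; (0, 0) for degenerate input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, sxy * sxy / (sxx * syy)

def widest_linear_region(xs, ys, r2_min=0.97, min_points=5):
    """(E, mu, r2) of the widest window with r^2 > r2_min, or None."""
    best = None
    for i in range(len(xs)):
        for j in range(i + min_points, len(xs) + 1):
            slope, r2 = fit_line(xs[i:j], ys[i:j])
            width = xs[j - 1] - xs[i]
            if r2 > r2_min and (best is None or width > best[0]):
                best = (width, -slope, r2)
    return best

# Synthetic spacers following an exact power law with exponent 1.
spacers = [100 * ((i + 0.5) / 10000) ** -1.0 for i in range(10000)]
print(widest_linear_region(*log_cumulative(spacers)))
```

The exhaustive window search is cubic in the number of grid points, which is acceptable for the few dozen points of a typical double-logarithmic plot.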
Interrepeat distances’ size distributions of AluSx and
AluJb for low and high divergence from the consensus sequences are depicted in figure 2. These are representative of
all the examined distances’ distributions of Alus. The subset of repeats with the lower divergence from the consensus
always presents a more extended power law. In the supplementary material (Supplementary Material online), the
plots of the corresponding interrepeat distances of the full
AluSx and AluJb populations, regardless of divergence values, are also included for comparison.
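The selection of a low-divergence subset and the construction of surrogate data can be sketched as below. The record layout is a simplified, hypothetical stand-in for RepeatMasker output, and the surrogate (the same number of repeats at uniformly random positions) is an assumption about how such data may be generated rather than a description of the procedure in Methods.

```python
import random

def interrepeat_spacers(records, family, max_divergence):
    """Distances between consecutive retained repeats on one chromosome.

    records: iterable of (start, end, family, divergence_percent) tuples
    (hypothetical layout). Only repeats of `family` whose divergence does
    not exceed `max_divergence` are kept; the spacers run between the
    retained copies, across any excluded ones.
    """
    kept = sorted((start, end) for start, end, fam, div in records
                  if fam == family and div <= max_divergence)
    return [nxt[0] - cur[1] for cur, nxt in zip(kept, kept[1:])
            if nxt[0] > cur[1]]

def surrogate_spacers(n_repeats, chrom_length, seed=0):
    """Spacers between n_repeats points placed uniformly at random."""
    rng = random.Random(seed)
    positions = sorted(rng.randrange(chrom_length) for _ in range(n_repeats))
    return [b - a for a, b in zip(positions, positions[1:]) if b > a]

records = [
    (100, 400, "AluSx", 5.0),
    (1000, 1300, "AluSx", 12.0),
    (2000, 2300, "AluSx", 4.0),
    (2500, 2800, "AluJb", 3.0),
]
print(interrepeat_spacers(records, "AluSx", max_divergence=10.0))
print(interrepeat_spacers(records, "AluSx", max_divergence=15.0))
```

Lowering the divergence limit removes copies from the subset, so the remaining spacers become fewer and longer; the surrogate with the same number of repeats gives the baseline against which a long tail is judged.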
Size distributions of the distances between LINE1 elements are qualitatively similar to those observed for Alu repeats. LINE1 elements can be divided into ~50 subfamilies.
A major division of the LINE1 repeats is in 2 groups: the
mammalian-wide L1M, which may be found in most mammalian species, and the much younger primate-wide L1P,
which have proliferated only after the separation of the primate lineage (Smit et al. 1995). In order to have TE populations of a suitable size for our study, we have adopted the
division of LINE1 elements into these 2 major classes.
Some typical distance size distributions are depicted in figure 3, and the full set of our results may be found in table 1.
The mean extent of the power law is somewhat lower in
LINE1s than in Alus while μ is higher than 2 in 9 out of
35 cases. Again, in some cases, we meet graphs with 2 linear regions with different slopes (μ values), see figure 3d.
FIG. 2.—Two examples of the effect of divergence from the Alu subfamily consensus on the extent E of the power law. (Surrogate data as
in fig. 1.)
FIG. 3.—Characteristic cases of power laws in LINE1 spacers’ size distributions in whole human chromosomes. Case (d) is representative of the
coexistence of 2 different exponents in different length scales. Quantitative information is presented in table 1. (Surrogate data as in fig. 1.)
Each of the 2 groups of repeats (L1M and L1P) is hosting LINE1 elements of several subfamilies with their divergences measured by comparison with the appropriate
consensus sequence. Thus, in order to fine-tune our search,
we have checked the extent and quality of the observed
power laws with respect to the length of the included repeats. As we have mentioned in the Introduction, LINE1
repeats are often severely truncated. The much older
L1Ms are more truncated than primate-specific LINE1s.
Typical examples of comparison between high- and low-length LINE1 subsets are given in figure 4a and b. In all
examined cases, the longer (more intact and less truncated)
elements form better power laws. Plots of the interrepeat distances
for the full L1 populations corresponding to figure 4a and b, regardless of degree of truncation, are included in the supplementary material (Supplementary Material online).
The strong dependence of the observed power-law
extent E on the degree of similarity with the consensus
sequence (better results are acquired for lower divergence
values) corroborates the proposed model: as expected
and verified by the cited literature, high divergence from
the subfamily consensuses limits the possibility or reduces
the rate of repeat eliminations due to recombination events,
thus relaxing the overall ability of the insertion–elimination
mechanism to generate power laws. Trying to improve the
rudimentary linearity in AluY double-log plots by taking
the low divergence fraction of the whole repeat population
has limited or no effect on the extent of the observed linearity for most chromosomes (see supplementary material,
Supplementary Material online). This feature complies with
the proposed model explanation for the role of divergence:
the generally low divergence found in AluY populations is
not expected to considerably hinder recombination events.
In the case of groups of LINE1 elements, the length thresholds play the same role (see fig. 4), as
severely truncated copies are less prone to recombination.
Power-Law Slope (μ) and Extent (E) Dependence on the
GC Content
In figure 5a and b, scatterplots for AluJb and L1P are
presented. They illustrate the correlation between the slope
μ of the linear region in the interrepeat distance cumulative
size distribution graphs and the GC content of each chromosome. Quantitative information for all the examined repeat populations is given in table 2. For the full set of plots,
see the supplementary material (Supplementary Material
online). In all cases, strong correlation between μ and
GC content was found, which for Alu is positive and for
LINE1 negative.
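A correlation of this kind is summarized by a regression-line slope, an r² value, and a P value. The sketch below computes the first two directly and estimates a P value with a permutation test, which is a swapped-in technique chosen here to stay within the standard library; it is not claimed to be the test used for table 2. The numbers in the usage example are hypothetical and are not data from this study.

```python
import random

def regression_stats(xs, ys):
    """Least-squares slope and r^2 of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy * sxy / (sxx * syy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Fraction of random pairings whose r^2 reaches the observed one."""
    rng = random.Random(seed)
    r2_obs = regression_stats(xs, ys)[1]
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if regression_stats(xs, shuffled)[1] >= r2_obs - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-chromosome values (GC%, slope mu), for illustration only.
gc = [38.0, 39.5, 41.0, 42.5, 44.0, 45.5, 47.0, 48.5]
mu = [0.95, 1.10, 1.05, 1.30, 1.35, 1.60, 1.55, 1.80]
print(regression_stats(gc, mu), permutation_p_value(gc, mu))
```

The sign of the regression slope distinguishes the positive Alu trend from the negative LINE1 trend, while r² and P indicate how much weight the trend can bear.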
FIG. 4.—Two examples of the effect of the length of truncated LINE1 repeats on the extent E of the power law. (Surrogate data as in fig. 1.).
No such clear correlation has been found between E
and GC content for whole chromosomes. In most cases,
correlation is weak or absent (see table 2, fig. 6a), whereas
a clear positive correlation is found only for L1P (fig. 6b).
Taking into account the isochore structure of the human
genome (Pavlicek, Paces, et al. 2002), we have examined
(in each chromosome) pairs of regions, one with high- and
one with low-GC content. Here a coherent picture emerges.
The extent of linearity (power-law behavior, measured by
E) clearly depends on the GC content: inter-Alu spacers’
size distributions present high E values in the GC-poor regions and low E values (or even absence of power law) in
the GC-rich regions. For LINE1s, the situation is completely inverted. Examples are presented in figure 7a and
b, whereas a quantitative account for 2 pairs of chromosomal regions is given in table 3. Notice that the results
of the analysis for chromosomal regions fully agree
with the type of dependence between μ and GC content
found for complete chromosomes.
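Locating GC-poor and GC-rich regions starts from a GC profile along the sequence. The following is a minimal sketch with hypothetical function names and a toy window size; the regions analyzed here span tens of megabases, and the way they were chosen is not reproduced by this example.

```python
def gc_percent(seq):
    """GC content in percent over unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

def gc_profile(seq, window):
    """(start, GC%) for consecutive non-overlapping windows."""
    return [(i, gc_percent(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, window)]

def extreme_windows(profile):
    """Start positions of the GC-poorest and GC-richest windows."""
    valid = [(gc, start) for start, gc in profile if gc is not None]
    return min(valid)[1], max(valid)[1]

profile = gc_profile("AAAAGGGGATGC", window=4)
print(profile, extreme_windows(profile))
```

Ambiguous bases such as N are excluded from the denominator so that unsequenced stretches do not bias the profile.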
The correlation between GC content and μ value
(positive for Alu and negative for LINE1) may be easily
explained on the grounds of the relative preference of these 2 repeat classes for GC- and AT-rich regions (Alu for GC-rich and LINE1 for AT-rich ones), in combination with the patchy chromosome structure with respect to
GC/AT constitution. GC-rich chromosomes have a significant contribution of regions with high-GC content, the H
isochores, in their structure (i.e., large regions with reduced
LINE1 populations) and are consequently characterized by
an abundance of large spacers. Assuming a linear size distribution of spacers between LINE1 elements in double-log
scale, a ‘‘fatter’’ tail is formed. Such a distribution is
associated with lower absolute slope values. The situation is
inverted for Alu size distributions.
Table 2
Correlation between i) GC Content and Slope (μ) and ii) GC
Content and the Extent (E) of the Power Law for Several
Repeat Classes in All Human Chromosomes
Repeat Class          Regression Line Slope    r²        P Value
GC%–μ correlations
AluJo                 0.1008                   0.5136    0.00008
AluJb (fig. 5a)       0.1029                   0.5870    0.00001
AluSx                 0.1104                   0.3884    0.0011
L1M                   0.1715                   0.4874    0.008
L1P (fig. 5b)         0.1713                   0.6704    0.000003
GC%–E correlations
AluJo                 0.0137                   0.0112    0.6219
AluJb (fig. 6a)       0.0278                   0.1294    0.0842
AluSx                 0.0378                   0.0452    0.3187
L1M                   0.0057                   0.0020    0.8843
L1P (fig. 6b)         0.0668                   0.3094    0.0072
NOTE.—For all plots, see supplementary material (Supplementary Material
online).
FIG. 5.—GC%–μ correlation plots (AluJb and L1P) for the complete
set of human chromosomes. Quantitative information for all correlation
plots of this kind is presented in table 2.
More important for the assessment of the proposed
mechanism is the finding connecting the extent E of the
power law with the GC content when examining GC-rich
and GC-poor chromosomal regions for several Alu and
LINE1 repeat classes (see table 3 and fig. 7). For both repeat
families, a coherent picture emerges: the extent of the power
law is higher in the regions where the elimination rate for
each repeat category is known to be higher. For Alus, this is
the GC-poor subgenome, whereas for LINE1, this is the
GC-rich one, for reasons presented previously and extensively discussed in the cited literature. Obviously, this property is in accordance with the proposed mechanism.
The clear positive correlation, found in the set of
whole chromosomes, between the GC content and the extent of the power law in L1P repeats corroborates the conclusions derived from the study of chromosomal regions
with high- and low-GC content (see fig. 6b). However,
the positive (even if weak) correlation between the GC content and the extent of the power law in AluJb remains puzzling, given that, for the reasons explained above, this
correlation is expected to be negative. In the remaining
Alu cases, there are, rather insignificant, positive correlations (see table 2). A possible explanation may be related
to the compositional asymmetry of the human genome with
respect to GC and AT content (high contribution of large
AT-rich genomic regions). For this reason, GC-rich chromosomes develop Alu power laws, which include a wider
range of relatively short spacers (distances) while they continue to host sufficiently large AT-rich (i.e., scarce in Alu)
regions. Therein, the elimination procedure still allows the
elimination–insertion mechanism to work, generating the
large spacers necessary for an extended tail of the distribution. Thus, GC richness in the framework of human genome
may be the cause of an extension of the power-law range.
FIG. 6.—GC%–E correlation plots (AluJb and L1P) for the complete
set of human chromosomes. Quantitative information for all correlation
plots of this kind is presented in table 2.
Power-Law Extent (E) Dependence on the
‘‘Subsequently Inserted’’ Amount of Repeats
Next, we examine the existence of correlation between
the extent (E) of the power law in the spacers’ size distributions and the quantity of repeats inserted ‘‘after’’ the peak
of proliferation activity of each repeat group in all chromosomes. We denote this ‘‘subsequently inserted’’ (SI) genomic material by SI followed by the name of the repeat
class we consider in parentheses. SI is measured in terms of sequence
length, expressed as a percentage. We extracted the information
about the chronology of the consecutive proliferation bursts
of the repeat groups in the human genome from the related
literature (Kapitonov and Jurka 1996; Dagan et al. 2004;
Hackenberg et al. 2005). The ages of the examined repeat
families generate the following order: L1M > AluJo >
AluJb > AluSx > L1P > AluY (see table 4). This order
does not exclude some retroposing activity of a group
while the next one has started to proliferate. It seems that
especially AluSx has continued to retropose during much of
the time of the LINE1P proliferation activity. However, the
peak of the AluSx proliferation has preceded that of L1P
(Ohshima et al. 2003; Hackenberg et al. 2005). Notice that
SI, as defined here, does not include the totality of the inserted material, such as simple repeats, minor Alu subfamilies, and other retroelements. In table 5, the quantitative
information from this study is given, whereas in figure
8a and b, correlation diagrams for AluJo and L1M are depicted. In all cases, positive correlation is found between the
extent of the power-law behavior and the amount of repeats
inserted into the chromosome after the peak of the active
proliferation of the examined repeat population. We note that
in the case of AluJb, the values of P and r² indicate a marginally low correlation.
FIG. 7.—Size distributions of the spacers between AluSx repeats in a GC-poor and a GC-rich region of chromosome 9 (see table 3). (Surrogate data
as in fig. 1.)
Table 3
Quantitative Information (by Regression Analysis) about
Power Laws Observed in Chromosomal Regions of High- and Low-GC Content
Repeat Class        GC Content    E       μ       r²
Chromosome 1 (a)
AluJb               Poor          1.60    0.80    0.9961
                    Rich          1.25    1.77    0.9967
AluJo               Poor          1.14    0.71    0.9910
                    Rich          1.04    1.75    0.9880
L1M                 Poor          0.80    1.68    0.9951
                    Rich          1.35    1.08    0.9917
L1P                 Poor          0.56    2.11    0.9933
                    Rich          1.09    1.60    0.9537
Chromosome 9 (b)
AluJb               Poor          1.14    1.04    0.9877
                    Rich          0.80    2.19    0.9710
AluJo               Poor          1.48    0.68    0.9753
                    Rich          0.70    2.49    0.9869
AluSx (fig. 7)      Poor          1.25    0.72    0.9898
                    Rich          0.44    3.46    0.9815
L1M                 Poor          0.80    2.14    0.9926
                    Rich          1.25    1.14    0.9617
L1P                 Poor          0.96    1.97    0.9894
                    Rich          1.13    0.78    0.9861
NOTE.—For all plots, see supplementary material (Supplementary Material
online).
(a) GC-poor region: 68–105 Mb (GC% = 37.4), GC-rich region: 0–37 Mb
(GC% = 48.2).
(b) GC-poor region: 2–32 Mb (GC% = 37.7), GC-rich region: 100–117.7 Mb
(GC% = 49.3).
In order to further corroborate and extend the results
acquired for whole chromosomes, we attempted a genomewide search for pairs of sequence regions in each chromosome with a significant difference in the mean amount of
inserted sequence length after the active proliferation period
of a given repeat family. Then we compared the extent of
the power law in these 2 regions (the same methodology as
that followed for high and low GC%). For a typical example, see figure 9b and c, where AluJb spacers’ distributions for a pair of SI(AluJb)-low/SI(AluJb)-high regions of
chromosome 19 are compared, and figure 9a, where the
SI(AluJb) profile is given for the whole chromosome 19.
See table 6 for the results of all the examined cases. In regions with high contrast in the SI material for a given repeat
class, the differences in the extent of the corresponding
power law are impressive. In most cases, one may observe a
2- to 3-fold increase of the E value in the SI-rich region.
These results are in accordance with the findings derived
from the study of entire chromosomes. See the supplementary material (Supplementary Material online) for the
full set of plots and Methods for the difficulties in locating many chromosomal regions with SI contrast suitable
for our study.
Table 4
Detailed List of the Repeat Classes Included in the SI
Genomic Fraction Associated to the Repeat Families Studied
Herein
Repeat Class    Subsequently Inserted Repeats (SI) of a Repeat Class
L1M             SI(L1M) = L1P + Alu
AluJo           SI(AluJo) = L1P + Alu − AluJo
AluJb           SI(AluJb) = L1P + Alu − AluJo − AluJb
AluSx           SI(AluSx) = L1P + AluY
L1P             SI(L1P) = AluY
NOTE.—These quantities are computed using data from the standard output of
RepeatMasker. Here with ‘‘Alu,’’ we denote the total sequence length, expressed
as a percentage, of Alu repeats in a given chromosome or genomic region.
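The SI quantities of table 4 are simple combinations of per-class coverage values. The sketch below reads the table entries as L1P + Alu, L1P + Alu − AluJo, L1P + Alu − AluJo − AluJb, L1P + AluY, and AluY; the subtractions are an assumption about the table's notation, and the function name, dictionary keys, and coverage values are hypothetical.

```python
def subsequently_inserted(cov):
    """SI (percent of sequence length) for each repeat class.

    cov maps 'L1P', 'Alu', 'AluJo', 'AluJb', and 'AluY' to the percentage
    of the chromosome or region covered by that class ('Alu' = all Alus).
    """
    return {
        "L1M": cov["L1P"] + cov["Alu"],
        "AluJo": cov["L1P"] + cov["Alu"] - cov["AluJo"],
        "AluJb": cov["L1P"] + cov["Alu"] - cov["AluJo"] - cov["AluJb"],
        "AluSx": cov["L1P"] + cov["AluY"],
        "L1P": cov["AluY"],
    }

# Hypothetical coverage values, for illustration only.
coverage = {"L1P": 5.0, "Alu": 10.0, "AluJo": 1.0, "AluJb": 2.0, "AluY": 1.5}
print(subsequently_inserted(coverage))
```

The values decrease along the age order of the classes, since a younger class has fewer later insertions to count.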
The (always) positive E–SI correlation reflects the
need, on the basis of the proposed mechanism, of a sufficient
amount of inserted material in order for a power-law distribution to be formed (see fig. 10). This result is corroborated
by the clearly contrasting behavior of pairs of chromosomal
regions within the same chromosome, which significantly
differ in their relative amount of SI sequences (see table 6
and for examples fig. 9).
Correlation of the Extent of the Observed Power Laws
with Repeats’ Ages
The scarceness of power-law behavior in the case of
the ‘‘young’’ AluY elements is compatible with the proposed mechanism, when taking into account that no considerable amount of more recent insertions has occurred.
The mean extent of the observed power law for the older
Alu subfamilies increases with age (see table 1). Age is
a measure for both the amount of more recently inserted
material and the total number of elimination events for
each subfamily. Mean values of E are 1.55(AluJo) >
1.35(AluJb) > 1.31(AluSx). The same is marginally true
for LINE1s: 1.25(L1M) > 1.23(L1P). However, the lack
of power law for L1M in several chromosomes does not
allow direct comparison between the 2 LINE1 classes.
This lack (as well as the lack of power-law behavior in
the distribution of L1P in 2 chromosomes) and the overall
lower ‘‘quality’’ of power laws in LINE1s versus Alus
(lower E and higher μ values) do not have an unambiguous
explanation. Probably the truncation of LINEs, which is
very strong in the case of the older mammalian class, reduced their recombination activity, relaxing the occurrence of elimination events so much that in some cases
no clear power law is observed. The picture is probably
further blurred due to the action of other processes of
genomic rearrangement (alongside the elimination–
insertion mechanism), which eventually deformed and
deteriorated the initially formed power-law distributions.
Table 5
Correlation between SI Content and the Extent (E) of the
Power Law for Several Repeat Classes
Correlation                 Regression Line Slope    r²        P Value
SI(AluJo)%–E(AluJo) (a)     0.0698                   0.3554    0.0021
SI(AluJb)%–E(AluJb)         0.0242                   0.1058    0.1210
SI(AluSx)%–E(AluSx)         0.0844                   0.2215    0.0203
SI(L1M)%–E(L1M) (b)         0.0663                   0.4418    0.0132
SI(L1P)%–E(L1P)             0.3360                   0.2632    0.0146
NOTE.—For all plots, see supplementary material (Supplementary Material
online).
(a) See also figure 8a.
(b) See also figure 8b.
FIG. 8.—SI%–E correlation plots (AluJo and L1M) for human
chromosomes where a power law is observed. With SI is denoted the
sequence length due to subsequent insertions of Alu or LINE1 repeats
after the peak of the proliferation period of the considered repeat class.
Quantitative information for all correlation plots of this kind is presented
in table 5.
Power-Law Extent and Elimination Rate
Jurka et al. (2004) have observed a rapid decrease of
recent Alu repeats in Y chromosome as a function of time.
FIG. 9.—(a) Plots representing along the whole chromosome 19 (x
axis): i) the AluJb coverage of the sequence (right y axis, broken line) and
ii) the SI(AluJb) coverage of the sequence (left y axis, continuous line)
(both quantities are expressed percent). Boxes mark the low- and high-SI
regions whose AluJb spacers’ size distributions are depicted in figures (b)
and (c), respectively. (Surrogate data as in fig. 1.)
This result is derived comparing the populations of very
recent and less recent AluY subfamilies under the
assumption of a constant AluY insertion rate during the relatively recent evolutionary past. These authors concluded
that chromosome Y presents higher repeat elimination rates
than X and autosomes (see also Lahn et al. 2001). On the
other hand, chromosome Y presents overall the highest
extent of power laws: in 3 out of 5 examined Alu and
L1 repeat classes, the highest value of E has been observed
in Y chromosome while it also has the highest average E
value (see table 1). The combination of the result of
Jurka et al. (2004) with our findings corroborates the
proposed insertion–elimination mechanism for the generation of power laws in the interrepeat distances’ size
distributions.
Table 6
Quantitative Information (by Regression Analysis) about
Power Laws Observed in Chromosomal Regions of High- and Low-SI% Content
Repeat Class        SI% Content    E       μ       r²
Chromosome 19 (a)
AluJo               Low: 16.37     0.68    2.41    0.9941
                    High: 31.22    1.03    1.97    0.9921
Chromosome 19 (b)
AluJb (c)           Low: 14.96     0.46    2.17    0.9968
                    High: 27.21    1.93    1.15    0.9905
L1M                 Low: 14.96     0.56    2.3     0.9906
                    High: 32.63    1.17    1.51    0.9801
Chromosome X (d)
AluJo               Low: 15.04     1.12    0.51    0.9863
                    High: 34.54    2.28    0.62    0.9801
AluJb               Low: 12.53     0.8     0.63    0.9964
                    High: 33.15    1.6     0.43    0.9790
AluSx               Low: 7.28      1.02    0.68    0.9858
                    High: 29.43    1.97    0.64    0.9946
Chromosome 19 (e)
L1M                 Low: 18.46     0.8     1.93    0.9901
                    High: 33.08    1.49    1.78    0.9851
NOTE.—For all plots, see supplementary material (Supplementary Material
online).
(a) Low-SI region: 4.3–32.7 Mb, high-SI region: 7.7–15 Mb.
(b) Low-SI region: 4.3–32.7 Mb, high-SI region: 13.5–24.3 Mb.
(c) See also figure 9.
(d) Low-SI region: 3.5–32 Mb, high-SI region: 54–79 Mb.
(e) Low-SI region: 115–150 Mb, high-SI region: 50–80 Mb.
Conclusions and Perspectives
As we presented in detail in the previous section, the
interrepeat distances’ size distributions for Alus and
LINE1s in the human genome follow, in most cases, power
laws, which in some cases reach an extent of 3 orders of
magnitude. The proposed insertion–elimination model,
based on simple and well-known molecular events, may explain this finding and complies with the genomic and distributional features studied so far. These molecular events,
occurring mainly in noncoding regions, are essentially neutral. As already mentioned, they are controlled by thresholds imposed by the condition not to affect the viability
of the organism. They are tolerated rather than selected,
belonging to the zone of genomic dynamics termed by
Holmquist (1989) the ‘‘molecular ecology of the noncoding
DNA.’’ Thus, no direct biological significance may be assigned to the power-law size distributions engendered by
this dynamics. These size distributions are associated with
self-similarity and fractality (Mandelbrot 1982; Feder
1988) of the genome, which may be indirectly related to
its 3-dimensional structure, aptitude to absorb externally
intruding sequences, and the use of its extended noncoding
parts as a potential source of biological information in
evolutionary time.
In a preliminary examination of the mouse genome,
the generality of some of the results derived from the human
genome study was verified. Four SINE families (B1_Mus1,
B3, RSINE1, and B3A) and one LINE group (L1M, the
mammalian-wide L1) have been selected on the basis of
their copy number and age. As expected, in the case of
the younger B1_Mus1 (found only in the Mus genus),
the poorest result is met: practically no power law is observed. For the older, rodent-wide B3, RSINE1, and
B3A families, clear evidence of power law is found when
considering the range of low divergences from each family
consensus. In accordance with the prediction of the insertion–
elimination model, when the upper range of divergence values is considered, the same repeat families give poor or no
evidence of power-law occurrence. In the case of L1M, as
in the human genome, the grouping of several repeat families together makes length the appropriate parameter in order to assess the influence of elimination propensity on the
extent of the power law. Again, in accordance with the proposed model, the less truncated repeat collection gives the
more pronounced power law. For the corresponding figures, see the supplementary material (Supplementary
Material online).
A systematic study of the distributional features of repeat
populations in a collection of genomes of
representative organisms is under way and will be presented
in a future work.
FIG. 10.—Two typical examples of (transient) power-law size distributions generated by the insertion–elimination model. For details, see Methods.
The slope standard deviation is 0.071 for (a) and 0.116 for (b).
Supplementary Material
Supplementary material and extended table 1 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
We would like to thank the RepeatMasker and RepBase teams for allowing us to install and use the necessary
programs and databases for the study of repeats in human
chromosomes. We are grateful to Dr L. Peristeras and Dr
Th. Georgomanolis for their helpful assistance in the installation and configuration of Linux and RepeatMasker and to
Mrs N. Chousou-Polydouri for her valuable suggestions
during the final preparation of the manuscript. We would
also like to thank the 2 anonymous referees whose comments have considerably improved the present work. We
thank National Center for Scientific Research ‘‘Demokritos’’
for financial support.
Literature Cited
Adamic LA, Huberman BA. 2002. Zipf’s law and the internet.
Glottometrics. 3:143–150.
Almirantis Y, Provata A. 1999. Long- and short-range
correlations in genome organization. J Stat Phys. 97:233–262.
Arneodo A, Bacry E, Graves PV, Muzy JF. 1995. Characterizing
long-range correlations in DNA-sequences from wavelet
analysis. Phys Rev Lett. 74:3293–3296.
Audit B, Thermes C, Vaillant C, d’Aubenton-Carafa Y, Muzy JF,
Arneodo A. 2001. Long-range correlations in genomic DNA:
a signature of the nucleosomal structure. Phys Rev Lett.
86:2471–2474.
Batzer MA, Deininger PL, Hellmann-Blumberg U, Jurka J,
Labuda D, Rubin CM, Schmid CW, Zietkiewicz E,
Zuckerkandl E. 1999. Standardized nomenclature for Alu
repeats. J Mol Evol. 42:3–6.
Belle EMS, Webster MT, Eyre-Walker A. 2005. Why are young
and old repetitive elements distributed differently in the
human genome? J Mol Evol. 60:290–296.
Bernardi G. 2000a. The compositional evolution of vertebrate
genomes. Gene. 251:31–43.
Bernardi G. 2000b. Isochores and the evolutionary genomics of
vertebrates. Gene. 241:3–17.
Brookfield JFY. 2001. Selection on Alu sequences? Curr Biol.
11:R900–R901.
Buldyrev SV, Goldberger AL, Havlin S, Peng CK, Stanley HE,
Stanley MHR, Simons M. 1993. Fractal landscapes and
molecular evolution: modeling the myosin heavy-chain gene
family. Biophys J. 65:2673–2679.
[CSAC] The Chimpanzee Sequencing and Analysis Consortium.
2005. Initial sequence of the chimpanzee genome and
comparison with the human genome. Nature. 437:69–87.
Cohen N, Dagan T, Stone L, Graur D. 2005. GC composition of
the human genome: in search of isochores. Mol Biol Evol.
22:1260–1272.
Dagan T, Sorek R, Sharon E, Ast G, Graur D. 2004. AluGene:
a database of Alu elements incorporated within protein-coding
genes. Nucleic Acids Res. 32:D489–D492 Sp. Iss. SI.
Deininger PL, Batzer MA. 1999. Alu repeats and human disease.
Mol Genet Metab. 67:183–193.
Deininger PL, Batzer MA. 2002. Mammalian retroelements.
Genome Res. 12:1455–1465.
Dewannieux M, Esnault C, Heidmann T. 2003. LINE-mediated retrotransposition of marked Alu sequences. Nat Genet. 35:41–48.
Feder J. 1988. Fractals. New York: Plenum Press.
Filipski J, Salinas J, Rodier F. 1989. Chromosome localization-dependent compositional bias of point mutations in Alu
repetitive sequences. J Mol Biol. 206:563–566.
Gish W. 2003. WU-BLAST 2.0. [Internet]. [cited 2007 November]; Available from http://blast.wustl.edu.
Gu Z, Wang H, Nekrutenko A, Li W-H. 2000. Densities, length
proportions, and other distributional features of repetitive
sequences in the human genome estimated from 430
megabases of genomic sequence. Gene. 259:81–88.
Hackenberg M, Bernaola-Galvan P, Carpena P, Oliver JL. 2005.
The biased distribution of Alus in human isochores might be
driven by recombination. J Mol Evol. 60:365–377.
Holmquist G. 1989. Evolution of chromosomal bands: molecular
ecology of noncoding DNA. J Mol Evol. 28:469–486.
Jullien R, Botet R. 1987. Aggregation and fractal aggregates.
Singapore: World Scientific.
Jurka J. 2000. Repbase update: a database and an electronic
journal of repetitive elements. Trends Genet. 16:418–420.
Jurka J, Kohany O, Pavlicek A, Kapitonov VV, Jurka MV. 2004.
Duplication, co-clustering, and selection of human Alu
retrotransposons. Proc Natl Acad Sci USA. 101:1268–1272.
Kapitonov V, Jurka J. 1996. The age of Alu subfamilies. J Mol
Evol. 42:59–65.
Lahn BT, Pearson NM, Jegalian K. 2001. The human Y chromosome, in the light of evolution. Nat Rev Genet. 2:207–216.
Li W. 1992. Generating nontrivial long-range correlations and 1/f
spectra by replication and mutation. Int J Bifurcat Chaos.
2:137–154.
Li W. 2002. Zipf’s law everywhere. Glottometrics. 5:14–21.
Li W, Kaneko K. 1992. Long-range correlation and partial 1/f^α spectrum in a noncoding DNA-sequence. Europhys
Lett. 17:655–660.
Lobachev KS, Stenger JE, Kozyreva OG, Jurka J, Gordenin DA,
Resnick MA. 2000. Inverted Alu repeats unstable in yeast are
excluded from the human genome. EMBO J. 19:3822–3830.
Makalowski W. 2003. Not junk after all. Science. 300:1246–1247.
Mandelbrot BB. 1982. The fractal geometry of nature. San
Francisco, CA: W.H. Freeman.
Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S,
Peng CK, Simons M, Stanley HE. 1994. Linguistic features
of noncoding DNA-sequences. Phys Rev Lett. 73:3169–3172.
Medstrand P, van de Lagemaat LN, Mager DL. 2002. Retroelement distributions in the human genome: variations associated
with age and proximity to genes. Genome Res. 12:1483–1495.
Newman MEJ. 2005. Power laws, Pareto distributions and Zipf’s
law. Contemp Phys. 46:323–351.
Nikolaou C, Almirantis Y. 2005. ‘‘Word’’ preference in the genomic
text and genome evolution: different modes of n-tuplet usage in
coding and noncoding sequences. J Mol Evol. 61:23–35.
Ohshima K, Hattori M, Yada T, Gojobori T, Sakaki Y, Okada N.
2003. Whole-genome screening indicates a possible burst of
formation of processed pseudogenes and Alu repeats by particular
L1 subfamilies in ancestral primates. Genome Biol. 4:R74.
Ostertag EM, Kazazian HH. 2001. Biology of mammalian L1
retrotransposons. Annu Rev Genet. 35:501–538.
Pavlicek A, Clay O, Bernardi G. 2002. Transposable elements
encoding functional proteins: pitfalls in unprocessed genomic
data? FEBS Lett. 523:252–253.
Pavlicek A, Jabbari K, Paces J, Paces V, Hejnar J, Bernardi G.
2001. Similar integration but different stability of Alus and
LINEs in the human genome. Gene. 276:39–45.
Pavlicek A, Paces J, Clay O, Bernardi G. 2002. A compact view
of isochores in the draft human genome sequence. FEBS Lett.
511:165–169.
Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Sciortino F,
Simons M, Stanley HE. 1992. Long-range correlations in
nucleotide-sequences. Nature. 356:168–170.
Provata A, Almirantis Y. 1997. Scaling properties of coding and
non-coding DNA sequences. Physica A. 247:482–496.
Rynditch AV, Zoubak S, Tsyba L, Tryapitsina-Guley N, Bernardi G.
1998. The regional integration of retroviral sequences into the
mosaic genomes of mammals. Gene. 222:1–16.
Shen MR, Batzer MA, Deininger PL. 1991. Evolution of the
master Alu gene(s). J Mol Evol. 33:311–320.
Smit AFA, Hubley R, Green P. 1996–2004. RepeatMasker Open-3.0. [Internet]. [cited 2007 November]; Available from: http://www.repeatmasker.org.
Smit AFA, Toth G, Riggs AD, Jurka J. 1995. Ancestral,
mammalian-wide subfamilies of line-1 repetitive sequences. J
Mol Biol. 246:401–417.
Sorek R, Ast G, Graur D. 2002. Alu-containing exons are
alternatively spliced. Genome Res. 12:1060–1067.
Stenger JE, Lobachev KS, Gordenin D, Darden TA, Jurka J,
Resnick MA. 2001. Biased distribution of inverted and direct
Alus in the human genome: implications for insertion,
exclusion, and genome stability. Genome Res. 11:12–27.
Sverdlov ED. 2000. Retroviruses and primate evolution.
Bioessays. 22:161–171.
Takayasu H, Takayasu M, Provata A, Huber G. 1991. Statistical
properties of aggregation with injection. J Stat Phys.
65:725–745.
Ullu E, Tschudi C. 1984. Alu sequences are processed 7SL RNA
genes. Nature. 312:171–172.
Vicsek T. 1989. Fractal growth phenomena. Singapore: World
Scientific.
Voss RF. 1992. Evolution of long-range fractal correlations and 1/f
noise in DNA-base sequences. Phys Rev Lett. 68:3805–3808.
Webster MT, Smith NGC, Ellegren H. 2003. Compositional
evolution of noncoding DNA in the human and chimpanzee
genomes. Mol Biol Evol. 20:278–286.
Aoife McLysaght, Associate Editor
Accepted August 9, 2007