Genome Informatics 16(1): 13–21 (2005)
Relationship between Segmental Duplications and
Repeat Sequences in Human Chromosome 7
Hiroo Murakami
Sachiyo Aburatani
Katsuhisa Horimoto
Laboratory of Biostatistics, Institute of Medical Science, University of Tokyo,
Shirokane-dai 4-6-1, Minato-ku, Tokyo 108-8639, Japan
Abstract
Various types of repeat sequences are abundant in genomic sequences, and they are associated
with biological phenomena at distinct levels. In particular, comparative analyses of
whole-genome-sized sequence data have revealed that repeat sequences cause segmental duplications,
which are a type of chromosomal structural arrangement. In this study, we analyzed the
relationships between segmental duplications and repeat sequences in human chromosome 7. For this
purpose, three methods for detecting repeat sequences were applied to the genomic sequence of
human chromosome 7: RepeatMasker for the dispersed repeats, TRF for the tandem repeats, and
STEPSTONE for the inter-spread repeats. By plotting the detected repeat sequences against their
locations on the chromosome, all three types of repeats were found to be concentrated around the
regions of segmental duplications, as a macroscopic feature of their distributions. Furthermore, the
latter two types of repeat sequences were classified in terms of their periods, and the distribution
bias of the detected repeat sequences was statistically tested between the segmental duplication
regions and the other regions. As a result, the period distributions of the two types of repeats were
biased at a significance level of less than 5% by the χ² test, and the repeats with long periods, about
130 bp and more than 400 bp, were attributed to the bias at the 5% significance level by the
normalized residual test. The mechanism of segmental duplications is discussed based on the present
results.
Keywords: chromosome rearrangement, segmental duplication, dispersed repeat, tandem repeat,
inter-spread repeat
1 Introduction
It is well known that mammalian genomic DNA sequences are filled with large low-copy duplicated
sequences. A segmental duplication is a nearly identical copy of a genomic DNA segment, typically
ranging in size from 1 kb to 200 kb [1]. Segmental duplication is one of the common types of
chromosomal structural arrangements, and it may contribute to gene and genome evolution [2, 5].
Furthermore, segmental duplication plays an important role in the phenotypes of gene expression and
genetic disease [1].
Recently, comparative analyses of whole-genome-sized sequence data have revealed that the repeat
sequences cause segmental duplications [1, 4]. A model of the molecular mechanism involved in
segmental duplication was proposed, based on the recombination properties of Alu, a type of dispersed
repeat, in addition to the physical properties of the genomic sequence [10]. However, it is still unclear
what type of repeat sequence is actually involved in segmental duplication [9]. Thus, a comprehensive
analysis of the relationship between repeat sequences and segmental duplications is needed.
In this study, we analyzed the relationships between segmental duplications and repeat sequences
in human chromosome 7, which has the most segmental duplications among the human chromosomes [6].
For this purpose, the repeat sequences were detected by three tools, and the distributions
of the detected repeat sequences were compared with the regions of the chromosome in which the
segmental duplications were observed. In particular, we focused on the period of the repeat sequence
to reveal the features of the repeat sequences that are related to the segmental duplications.
2 Materials and Method
2.1 Data
In this study, we analyzed the human chromosome 7 genomic sequence (153,794,793 bp, GenBank
accession: BL000002, version 020621) [6]. The regions of segmental duplications in chromosome 7 were
defined as the sequences with more than 90% similarity over more than 1 kb in length, and the locations
of the segmental duplication regions were identified from the supplemental figure [6, 12]; the figure
was converted to a high-resolution bitmap, in which one dot corresponds to 29,289 bp in the genomic
sequence. The segmental duplications were classified into two types: intra-chromosomal segmental
duplications, which are duplications within the same chromosome, and inter-chromosomal segmental
duplications, which are duplications between different chromosomes [6].
The number of intra-chromosomal segmental duplications was 47, with an average length of 7.47
dots, and that of inter-chromosomal segmental duplications was 20, with an average length of 7.35
dots. In this study, repeats are defined as related to the segmental duplications if they are located
within 1 dot of the ends of the segmental duplication regions.
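The dot-based bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function names are hypothetical, and reading "within 1 dot" as the region extended by one dot on each side (interior included) is an assumption. Only the dot size and the chromosome length are taken from the text.

```python
import math

DOT_SIZE_BP = 29_289          # bp per dot, from the text
CHR7_LENGTH_BP = 153_794_793  # length of the chromosome 7 sequence, from the text


def position_to_dot(position_bp: int) -> int:
    """Map a 0-based genomic position to a 0-based dot index."""
    return position_bp // DOT_SIZE_BP


def is_related(repeat_dot: int, sd_start_dot: int, sd_end_dot: int,
               margin: int = 1) -> bool:
    """Return True if a repeat lies in a segmental duplication region
    (inclusive dot range) or within `margin` dots of either end.
    This interpretation of the criterion is an assumption."""
    return sd_start_dot - margin <= repeat_dot <= sd_end_dot + margin


# Number of dots needed to cover the whole chromosome.
total_dots = math.ceil(CHR7_LENGTH_BP / DOT_SIZE_BP)
print(total_dots)  # 5251
```

The resulting 5,251 dots agree with the total used in the density calculations of Section 3.1.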
2.2 Methods for Detecting Repeat Sequences
Three methods were used to detect the repeat sequences: RepeatMasker [11], Tandem Repeats Finder
(TRF, released in 2002) [3] and STEPSTONE (version 1.07) [7]. RepeatMasker detects dispersed
repeats by using similarity-search programs with a database of consensus patterns, RepBase. TRF
is one of the commonly used programs for detecting tandem repeats. STEPSTONE detects inter-spread
repeats, which are periodic repeat sequences composed of a repeated consensus sequence and a
non-consensus spacer region. Thus, three types of repeat sequences, dispersed repeats, tandem
repeats, and inter-spread repeats, were investigated in the present study.
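To make the notion of an inter-spread repeat concrete, the toy example below builds a short sequence in which a fixed consensus alternates with varying spacers of equal length. The sequences are invented for illustration, and the code only shows the definition given above; it says nothing about how STEPSTONE itself detects such repeats.

```python
def build_inter_spread_repeat(consensus: str, spacers: list[str]) -> str:
    """Concatenate consensus + spacer units into one periodic sequence."""
    return "".join(consensus + spacer for spacer in spacers)


consensus = "ACGTAC"                 # repeated, conserved part (toy data)
spacers = ["TTGA", "GCCA", "ATAT"]   # non-conserved spacers of equal length
sequence = build_inter_spread_repeat(consensus, spacers)

# The period is the distance between consecutive consensus copies.
period = len(consensus) + len(spacers[0])

# Start positions of every consensus copy in the toy sequence.
starts = [i for i in range(len(sequence)) if sequence.startswith(consensus, i)]
print(period, starts)
```

A tandem repeat corresponds to the special case in which the spacer is empty or itself conserved, so that the whole unit repeats.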
3 Results
First, we explored the macroscopic relationship between the segmental duplications and the repeat
sequences, by plotting the locations and the periods of the repeats against the locations of the
segmental duplications over the entire region of human chromosome 7. Then, the characteristics of the
repeat sequences related to the segmental duplications were examined by statistical tests, in terms
of the period of the repeat sequence.
3.1 Densities of Repeat Sequences Detected by RepeatMasker, TRF and STEPSTONE
To investigate the relationship between the segmental duplications and the repeat sequences, first, we
plotted the locations of the repeat sequences detected by the three methods over the entire chromosome, with the segmental duplication regions. To view the macroscopic features of the distribution of
the detected repeat sequences, the density of the repeat sequences within a range of the chromosome
was calculated.
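One way to compute such a per-dot density, together with the normalization by the average and the standard deviation mentioned in the caption of Figure 1, is sketched below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, pstdev

DOT_SIZE_BP = 29_289  # bp per dot, from Section 2.1


def dot_densities(repeat_positions, n_dots):
    """Count the repeats whose start position falls in each dot."""
    counts = Counter(pos // DOT_SIZE_BP for pos in repeat_positions)
    return [counts.get(dot, 0) for dot in range(n_dots)]


def normalize(densities):
    """Convert per-dot counts to z-scores (zero mean, unit deviation)."""
    mu = mean(densities)
    sigma = pstdev(densities)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0] * len(densities)
    return [(d - mu) / sigma for d in densities]


# Toy positions only; real input would be the detected repeat coordinates.
toy_counts = dot_densities([0, 10, 29_289, 100_000], n_dots=4)
print(toy_counts, normalize(toy_counts))
```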
The densities of the three types of repeat sequences are schematically shown in Figure 1. The
total number of dispersed repeats was 264,407 (upper plot in Figure 1). The number of dispersed
repeats located around the intra-chromosomal segmental duplication regions in the plot was 19,645,
and that around the inter-chromosomal duplication regions was 5,435. In the figure, the density
of dispersed repeats is relatively higher near and within the regions of intra- and inter-segmental
duplications than in the other regions. Indeed, in the intra-segmental regions, the density of dispersed
repeats per dot is 55.95 (= 19,645/(47 × 7.47)), while the density in the remaining regions is 49.95
(= (264,407 − 19,645)/(5,251 − 47 × 7.47)), where 5,251 is the total number of dots covering the
chromosome. As an exception, the density is low in the large region of inter-segmental duplication
in the middle of the chromosome. This exception is reflected in the density over the inter-segmental
regions, 36.97 (= 5,435/(7.35 × 20)), which is lower than that in the remaining regions, 50.74
(= (264,407 − 5,435)/(5,251 − 7.35 × 20)).
To identify the relationship between the segmental duplications and the tandem and inter-spread
repeat sequences, we plotted the densities of the repeat sequences detected by TRF and STEPSTONE
(middle and lower plots in Figure 1), after removing the dispersed repeats detected by RepeatMasker.
The total number of repeat sequences detected by TRF was 6,749; 599 repeat sequences were located
around the intra-segmental duplication regions, and 275 repeat sequences were located around the
inter-segmental duplication regions. The total number of repeats detected by STEPSTONE was
37,190; 2,517 repeat sequences were located around the intra-segmental duplication regions, and 972
repeat sequences were located around the inter-segmental duplication regions. Since the locations
of the segmental duplications were estimated from the supplemental figure [6, 12] in the present
study, the densities of the two types of repeats in the regions remaining after removing the
dispersed repeats detected by RepeatMasker could not be estimated precisely. By a rough calculation, the
densities of repeats detected by TRF are 1.71 (= 599/(47 × 7.47)) and 1.87 (= 275/(7.35 × 20))
in the intra- and inter-segmental duplication regions, while those in the corresponding remaining
regions are 1.26 (= (6,749 − 599)/(5,251 − 47 × 7.47)) and 1.27 (= (6,749 − 275)/(5,251 − 7.35 × 20)).
As for the repeats detected by STEPSTONE, the densities in the segmental duplication regions are
almost equal to those in the remaining regions: 7.17 (= 2,517/(47 × 7.47)) and 6.61 (= 972/(7.35 × 20))
in the intra- and inter-segmental duplication regions, and 7.08 (= (37,190 − 2,517)/(5,251 − 47 × 7.47))
and 7.10 (= (37,190 − 972)/(5,251 − 7.35 × 20)) in the remaining regions. As a result, the density
of repeat sequences detected by TRF is slightly higher in the segmental duplication regions, while
that of the repeats detected by STEPSTONE is nearly equal to that in the remaining regions (slightly
higher only in the intra-segmental duplication regions).
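The rough density calculations in this subsection can be reproduced with a few lines of code. The sketch below uses only the counts quoted in the text; the function name and the data layout are illustrative choices, not part of the original analysis.

```python
TOTAL_DOTS = 5_251  # total number of dots covering chromosome 7


def densities(total, in_region, n_regions, mean_len_dots):
    """Return (density inside the regions, density in the remaining dots),
    both in repeats per dot."""
    region_dots = n_regions * mean_len_dots
    inside = in_region / region_dots
    outside = (total - in_region) / (TOTAL_DOTS - region_dots)
    return inside, outside


# (total repeats, repeats around intra regions, repeats around inter regions)
counts = {
    "RepeatMasker": (264_407, 19_645, 5_435),
    "TRF": (6_749, 599, 275),
    "STEPSTONE": (37_190, 2_517, 972),
}

for tool, (total, intra, inter) in counts.items():
    intra_in, intra_out = densities(total, intra, 47, 7.47)
    inter_in, inter_out = densities(total, inter, 20, 7.35)
    print(f"{tool}: intra {intra_in:.2f} vs {intra_out:.2f}, "
          f"inter {inter_in:.2f} vs {inter_out:.2f}")
```

Because the region sizes are averages read from a bitmap, these values are approximate in the same sense as the figures quoted above.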
The fluctuation of the densities is large in the repeats detected by RepeatMasker and STEPSTONE,
while it is small in those detected by TRF. Although the degree of the density fluctuation is different
between the densities detected by TRF and by STEPSTONE, the macroscopic form of the density
over the entire chromosome is similar between the repeats detected by TRF and by STEPSTONE.
In addition, the densities at both ends of the chromosome are high due to the telomere region, where
repeats were frequently found in a previous study [3].
3.2 Periods of Repeat Sequences Detected by TRF and STEPSTONE
To inspect the relationship between segmental duplications and repeat sequences from a viewpoint
other than the repeat density, the period of each detected repeat sequence was plotted along the
chromosome, as shown in Figure 2. Note that the periods in the figure are shown for the repeats
remaining after removing the dispersed repeats detected by RepeatMasker, because the periods of
dispersed repeats detected by RepeatMasker are well characterized in the database [11]. This plot
is useful for intuitively investigating the relationship between the periods and the segmental
duplications over the entire chromosome.
The repeats detected by STEPSTONE show more diverse ranges of periods, especially long
periods, than those detected by TRF. One of the remarkable features of the periods in the segmental
duplications is that relatively long periods frequently emerge in the period distributions from
both methods. In particular, the long periods in the segmental duplications are found not in the
middle of the segmental duplication regions but at their boundaries. For example, in the region at
about 55 Mbp (a large red band in the figure), long periods are concentrated around the boundaries of
the segmental duplication regions. In the following subsection, we statistically test these intuitive
features from Figure 2.
Figure 1: Correspondence between segmental duplication regions and detected repeat sequences. The
horizontal axis is the location of the genomic sequence, and the vertical axis is the density of the
repeat sequences detected by RepeatMasker (upper), TRF (middle) and STEPSTONE (lower), respectively.
The density is normalized, with the average and the standard deviation of the numbers of repeats
detected within 1 dot. The vertical lines through the graph indicate the locations of segmental
duplication regions: intra-chromosomal duplication region (blue) and inter-chromosomal duplication
region (red).
Figure 2: Correspondence between the periods of the repeat sequences detected by TRF (upper) and
STEPSTONE (lower) and the locations of segmental duplication regions in human chromosome 7. In the
plots, the horizontal axis is the location of the genomic sequence, and the vertical axis is the
period. The period values range from 1 bp (bottom) to 500 bp (top). The periods with more than 500 bp
are included in the bar with the 500 bp period. The vertical lines through the graph indicate the
locations of the segmental duplication regions: intra-chromosomal duplication region (blue) and
inter-chromosomal duplication region (red).
3.3 Statistical Tests for the Periods of Repeats in Segmental Duplication Regions
To further analyze the relationship between the repeat sequence periods and the segmental
duplications, we listed the periods of the repeat sequences detected by TRF and STEPSTONE in Table 1.
All repeats found in the entire region of the chromosome were classified in terms of the period, and
then the repeats found in the intra- and inter-segmental duplication regions were selected
from them.
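The classification by period can be illustrated with the short sketch below, which assigns each period to one of the ranges used in Table 1 and counts the repeats per range. The function names and the toy input are hypothetical; only the range boundaries come from the table.

```python
from collections import Counter

# Period ranges (in bp) as listed in Table 1; longer periods go to ">=401".
BIN_EDGES = [(1, 20), (21, 40), (41, 60), (61, 80), (81, 100),
             (101, 120), (121, 140), (141, 160), (161, 180), (181, 200),
             (201, 300), (301, 400)]


def period_bin(period: int) -> str:
    """Return the label of the Table 1 range that contains `period`."""
    if period < 1:
        raise ValueError("period must be a positive number of bp")
    for low, high in BIN_EDGES:
        if low <= period <= high:
            return f"{low}-{high}"
    return ">=401"


def tabulate(periods):
    """Count the repeats falling in each period range."""
    return Counter(period_bin(p) for p in periods)


# Toy periods only; real input would be the periods reported by the tools.
print(tabulate([5, 18, 130, 135, 900]))
```

Running the same tabulation on the subset of repeats related to the intra- or inter-segmental duplication regions would give the corresponding 'intra' and 'inter' columns.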
Table 1 reveals that the numbers of repeat sequences detected by STEPSTONE are larger than
those found by TRF in almost all period ranges. This is partly because STEPSTONE detects both tandem
and inter-spread repeats, and partly because STEPSTONE requires only local sequence similarity within
the repeat sequence, rather than similarity over the entire repeat, as required by TRF. In addition,
STEPSTONE detects repeat sequences with longer periods than those detected by TRF. Although
the sequence similarity of the repeats detected by STEPSTONE is lower than that of the repeats
detected by TRF, STEPSTONE detects a wide variety of repeats, especially with respect to the period.
Although the number of repeats in the intra-segmental duplication regions is larger than that in the
inter-segmental duplication regions, the rate at which the number of repeats decreases with
increasing period is similar between the intra- and inter-segmental duplication regions, for both
TRF and STEPSTONE.
Table 1: Periods of the repeat sequences detected by TRF and STEPSTONE.
Period (bp)   TRF                      STEPSTONE
              all     intra   inter    all     intra   inter
1-20          3181    139     50       10168   446     144
21-40         2013    100     34       10275   420     154
41-60         772     47      23       6052    283     97
61-80         336     19      16       4883    197     80
81-100        215     11      7        4102    212     73
101-120       85      0       0        967     39      8
121-140       43      6       4        229     20      8
141-160       30      3       2        101     6       3
161-180       31      2       1        72      5       1
181-200       8       3       3        39      2       2
201-300       21      2       2        85      3       3
301-400       2       0       0        53      4       2
>= 401        12      1       1        167     14      5
The total numbers of detected repeat sequences within each period range are listed in the ‘all’ column,
and the numbers of repeats around the intra- and inter-segmental duplications are listed in the ‘intra’
and ‘inter’ columns, respectively.
Apart from the number of repeats, the relative ratios in each period range were investigated by a
χ² test for four pairs of the repeat distributions (Table 2): the total number of repeats versus
the number of repeats detected in both the intra- and inter-segmental duplication regions, the
total number of repeats versus the number detected in the intra-segmental duplication regions, the
total number of repeats versus the number detected in the inter-segmental duplication regions, and
the number of repeats detected in the intra-segmental duplication regions versus that in the
inter-segmental duplication regions. Thus, four tests were performed for each of the TRF and
STEPSTONE detections. The tests revealed that the distributions of the periods in the segmental
duplication regions are highly biased relative to the distribution of the periods in the entire
region, while the difference of the periods between the intra- and inter-segmental duplication
regions is not significant at the 5% level. These results indicate the
existence of repeat sequences with periods specific to the segmental duplications.
Table 2: Distribution bias of the periods in segmental duplication regions.
              all vs. (intra + inter)   all vs. intra   all vs. inter   intra vs. inter
TRF           p < 0.001                 p < 0.005       p < 0.001       NS
STEPSTONE     p < 0.005                 p < 0.01        p < 0.05        NS
NS indicates not significant (p ≥ 0.05).
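As an illustration of this kind of comparison, the sketch below computes a χ² goodness-of-fit statistic for the TRF 'intra' counts of Table 1 against the proportions of the 'all' column. The exact test design used for Table 2 is not specified in the text, so this formulation, the function name, and the handling of sparse period ranges are all assumptions; the code does not attempt to reproduce the p-values in Table 2.

```python
# TRF columns of Table 1, in the order of the 13 period ranges.
TRF_ALL = [3181, 2013, 772, 336, 215, 85, 43, 30, 31, 8, 21, 2, 12]
TRF_INTRA = [139, 100, 47, 19, 11, 0, 6, 3, 2, 3, 2, 0, 1]


def chi_square_statistic(observed, reference):
    """Goodness-of-fit statistic of `observed` against the proportions of
    `reference`. Returns (statistic, degrees of freedom)."""
    if len(observed) != len(reference) or min(reference) <= 0:
        raise ValueError("need matching bins with positive reference counts")
    n_obs, n_ref = sum(observed), sum(reference)
    stat = 0.0
    for obs, ref in zip(observed, reference):
        expected = n_obs * ref / n_ref
        stat += (obs - expected) ** 2 / expected
    return stat, len(observed) - 1


stat, dof = chi_square_statistic(TRF_INTRA, TRF_ALL)

# Critical value of the chi-square distribution for 12 degrees of freedom
# at the 5% level, taken from standard tables.
CRITICAL_5PCT_DOF12 = 21.026
print(f"chi2 = {stat:.2f}, dof = {dof}, biased at 5%: {stat > CRITICAL_5PCT_DOF12}")
```

Several period ranges have very small expected counts, so in practice sparse ranges might be merged before such a test is applied.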
The periods specific to the segmental duplications, which account for the bias of the period
distributions, were identified by the normalized residual analysis (Table 3); the residual test was
performed for the three period distributions (all, intra, and inter) of the repeats detected by TRF
and by STEPSTONE. The tests for the TRF and STEPSTONE distributions shared many biased periods at
the 5% significance level. Indeed, the periods of 121 to 140 bp, of 181 to 200 bp, and of more than 400
bp are frequently found in the table. As expected from the overall features of Figure 2, the relatively
longer periods are attributable to the period distribution bias. In particular, the periods of 121 to
140 bp and of more than 400 bp are estimated as biased periods at the 5% significance level in three
of the four cases. Below, we further investigate the features of the repeats with these two long biased
periods.
Table 3: Characteristic periods in the biased distribution.
              all vs. intra             all vs. inter
TRF           181-200, ≥401             61-80, 121-140, 181-200, ≥401
STEPSTONE     81-100, 121-140, ≥401     121-140
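A residual analysis of this kind can be sketched as follows. The text does not state which form of the normalized residual was used, so the adjusted residual below, (O − E)/sqrt(E(1 − p)), and the two-sided 5% threshold of 1.96 are assumptions, and the function names are hypothetical.

```python
import math


def adjusted_residuals(observed, reference):
    """Adjusted residual of each bin of `observed` against the proportions
    of `reference`; values are approximately standard normal under no bias."""
    n_obs, n_ref = sum(observed), sum(reference)
    residuals = []
    for obs, ref in zip(observed, reference):
        if ref <= 0 or ref >= n_ref:
            raise ValueError("each reference bin must hold part of the total")
        p = ref / n_ref
        expected = n_obs * p
        residuals.append((obs - expected) / math.sqrt(expected * (1.0 - p)))
    return residuals


def biased_bins(labels, observed, reference, z_crit=1.96):
    """Labels of the bins whose adjusted residual exceeds `z_crit` in size."""
    scores = adjusted_residuals(observed, reference)
    return [label for label, z in zip(labels, scores) if abs(z) > z_crit]


# Toy example with two bins: the reference predicts an even split.
print(biased_bins(["short", "long"], [10, 30], [1, 1]))
```

Applied to one of the 'intra' or 'inter' columns of Table 1 with the matching 'all' column as reference, such a function would flag the period ranges that deviate most from the chromosome-wide proportions.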
Another remarkable feature of the repeats with the biased periods is that the G+C content is
highly biased in most of the repeats. Indeed, a bias at the 1% significance level is found in 20
of 28 repeats with 121 to 140 bp periods, and in 13 of 19 repeats with more than 400 bp periods.
Interestingly, high G+C content regions favor the formation of Z-form DNA and are involved in the
release of negative supercoiling [8]. Although the relationship between the release of negative
supercoiling and segmental duplication is still unclear, the observed G+C content bias might be related
to the frequent occurrence of the repeats in the segmental duplication regions.
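For reference, the G+C content of a repeat sequence can be computed as in the minimal sketch below. The function name is hypothetical, ambiguous bases such as N are ignored by assumption, and the statistical test used to judge whether a given G+C content is biased is not described in the text and is therefore not shown.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / n_acgt


print(gc_content("GGCCATGC"))  # 0.75
```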
4 Discussion
We have presented analyses of the relationship between segmental duplications and repeat sequences
in human chromosome 7 by the graphical plots and the statistical analyses. The plots revealed that
the repeats, especially those with long periods, are frequently found around the segmental duplication regions. The statistical analyses showed that some periods are biased between the segmental
duplication regions and the other regions, and not between the intra- and inter-segmental duplication
regions.
In a previous study [6], two characteristic features were described for the two types of segmental
duplications: the intra-segmental duplications of chromosome 7 are larger and share higher sequence
similarity than the inter-segmental duplications. In addition, the distributions of their locations differ
from each other. The inter-segmental duplications are frequently observed in the peri-centromeric
and sub-telomeric regions [2, 6], while the intra-segmental duplications are found throughout the
entire chromosome. Interestingly, little bias of periods exists between the two types of segmental
duplications. Thus, the mechanism for the occurrence of the duplication might be similar between the
intra- and inter-segmental duplications.
In this study, we detected 28 repeat sequences with 121-140 bp periods, and 19 repeat sequences
with more than 400 bp periods, as the repeats with biased periods by a statistical test. Interestingly,
although the G+C content is frequently biased at a significant level, little sequence similarity is
observed among the repeat sequences by visual inspection. Thus, the biased repeat sequences related
to the segmental duplications may be characterized by physical properties, such as G+C content, rather
than by sequence similarity.
Acknowledgments
One of the authors (K. H.) was partly supported by a Grant-in-Aid for Scientific Research on Priority
Areas “Genome Information Science” (grant 17017015) from the Ministry of Education, Culture,
Sports, Science and Technology of Japan.
References
[1] Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers,
E.W., Li, P.W., and Eichler, E.E., Recent segmental duplications in the human genome, Science,
297:1003–1007, 2002.
[2] Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E., Segmental duplications:
Organization and impact within the current human genome project assembly, Genome Res.,
11:1005–1017, 2001.
[3] Benson, G., Tandem repeats finder: A program to analyze DNA sequences, Nucleic Acids Res.,
27(2):573–580, 1999.
[4] Cheung, J., Estivill, X., Khaja, R., MacDonald, J.R., Lau, K., Tsui, L.-C., and Scherer, S.W.,
Genome-wide detection of segmental duplications and potential assembly errors in the human
genome sequence, Genome Biol., 4:R25, 2003.
[5] Eichler, E.E. and Sankoff, D., Structural dynamics of eukaryotic chromosome evolution, Science,
301:793–797, 2003.
[6] Hillier, L.W., Fulton, R.S., Fulton, L.A., Graves, T.A., Pepin, K.H., Wagner-McPherson, C.,
Layman, D., Maas, J., Jaeger, S., Walker, R., Wylie, K., Sekhon, M., Becker, M.C., O’Laughlin,
M.D., Schaller, M.E., et al., The DNA sequence of human chromosome 7, Nature, 424:157–164,
2003.
[7] Murakami, H., Sugaya, N., Sato, M., Imaizumi, A., Aburatani, S., and Horimoto, K., Detection
of inter-spread repeat sequence in genomic DNA sequence, Genome Informatics, 15(1):170–179,
2004.
[8] Rich, A. and Zhang, S., Z-DNA: The long road to biological function, Nat. Rev. Genet., 4:566–572,
2003.
[9] Zhang, L., Lu, H.H.S., Chung, W., Yang, J., and Li, W.H., Patterns of segmental duplication in
the human genome, Mol. Biol. Evol., 22(1):135–141, 2005.
[10] Zhou, Y. and Mishra, B., Quantifying the mechanisms for segmental duplications in mammalian
genomes by statistical analysis and modeling, Proc. Natl. Acad. Sci. USA, 102(11):4051–4056,
2005.
[11] http://ftp.genome.washington.edu/RM/RepeatMasker.html
[12] http://www.nature.com/nature/journal/v424/n6945/suppinfo/nature01782.html