Genome Informatics 16(1): 13–21 (2005)

Relationship between Segmental Duplications and Repeat Sequences in Human Chromosome 7

Hiroo Murakami, Sachiyo Aburatani, Katsuhisa Horimoto

Laboratory of Biostatistics, Institute of Medical Science, University of Tokyo, Shirokane-dai 4-6-1, Minato-ku, Tokyo 108-8639, Japan

Abstract

Various types of repeat sequences are abundant in genomic sequences, and they are associated with biological phenomena at distinct levels. In particular, comparative analyses of whole-genome-sized sequence data have revealed that repeat sequences cause segmental duplications, which are a type of chromosomal structural rearrangement. In this study, we analyzed the relationships between segmental duplications and repeat sequences in human chromosome 7. For this purpose, three methods for detecting repeat sequences were applied to the genomic sequence of human chromosome 7: RepeatMasker for the dispersed repeats, TRF for the tandem repeats, and STEPSTONE for the inter-spread repeats. When the detected repeat sequences were plotted against their locations on the chromosome, all three types of repeats were found to be concentrated around the regions of segmental duplications, as a macroscopic feature of their distributions. Furthermore, the latter two types of repeat sequences were classified in terms of their periods, and the distribution bias of the detected repeat sequences was statistically tested between the segmental duplication regions and the other regions. As a result, the period distributions of the two repeat types were biased at a significance level below 5% by the χ² test, and the normalized residual test attributed the bias to repeats with long periods, about 130 bp and more than 400 bp, at the 5% significance level. The mechanism of segmental duplications is discussed based on the present results.
Keywords: chromosome rearrangement, segmental duplication, dispersed repeat, tandem repeat, inter-spread repeat

1 Introduction

It is well known that mammalian genomic DNA sequences are filled with large low-copy duplicated sequences. A segmental duplication is a nearly identical copy of a genomic DNA segment, typically ranging in size from 1 kb to 200 kb [1]. Segmental duplication is one of the common types of chromosomal structural rearrangements, and it may contribute to gene and genome evolution [2, 5]. Furthermore, segmental duplication plays an important role in the phenotypes of gene expression and genetic disease [1].

Recently, comparative analyses of whole-genome-sized sequence data have revealed that repeat sequences cause segmental duplications [1, 4]. A model of the molecular mechanism involved in segmental duplication was proposed, based on the recombination properties of Alu, a type of dispersed repeat, in addition to the physical properties of the genomic sequence [10]. However, it is still unclear what type of repeat sequence is actually involved in segmental duplication [9]. Thus, a comprehensive analysis of the relationship between repeat sequences and segmental duplications is needed.

In this study, we analyzed the relationships between segmental duplications and repeat sequences in human chromosome 7, which has the most segmental duplications among the human chromosomes [6]. For this purpose, the repeat sequences were detected by three tools, and the distributions of the detected repeat sequences were compared with the regions of the chromosome in which the segmental duplications were observed. In particular, we focused on the period of the repeat sequence to reveal the features of the repeat sequences that are related to the segmental duplications.

2 Materials and Method

2.1 Data

In this study, we analyzed the human chromosome 7 genomic sequence (153794793 bp, GenBank accession: BL000002, version 020621) [6].
The regions of segmental duplications in chromosome 7 were defined as the sequences with more than 90% similarity over more than 1 kb in length, and the locations of the segmental duplication regions were identified from the supplemental figure [6, 12]; the figure was converted to a high-resolution bitmap, in which one dot corresponds to 29,289 bp in the genomic sequence. The segmental duplications were classified into two types: one is the intra-chromosomal segmental duplication, which is a duplication within the same chromosome, and the other is the inter-chromosomal segmental duplication, which is a duplication between different chromosomes [6]. The number of intra-chromosomal segmental duplications was 47, with an average length of 7.47 dots, and that of inter-chromosomal segmental duplications was 20, with an average length of 7.35 dots. In this study, we define repeats as related to a segmental duplication if they are located within 1 dot of the ends of the segmental duplication region.

2.2 Methods for Detecting Repeat Sequences

Three methods were used to detect the repeat sequences: RepeatMasker [11], Tandem Repeats Finder (TRF, released in 2002) [3] and STEPSTONE (version 1.07) [7]. RepeatMasker detects dispersed repeats by using similarity-search programs with a consensus pattern database, RepBase. TRF is one of the commonly used programs for detecting tandem repeats. STEPSTONE detects inter-spread repeats, which are periodic repeat sequences composed of a repeated consensus sequence and a non-consensus spacer region. Thus, three types of repeat sequences, dispersed repeats, tandem repeats, and inter-spread repeats, were investigated in the present study.
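The dot-based coordinates of Section 2.1 can be illustrated with a minimal Python sketch. Only the dot size (29,289 bp) and the chromosome length come from the text; the function names, the 0-based coordinate convention, and the reading of "within 1 dot" as a region extended by one dot at each end are assumptions made for illustration, not the authors' implementation.

```python
DOT_BP = 29_289              # bp per bitmap dot (stated in Section 2.1)
CHR7_LENGTH_BP = 153_794_793  # chromosome 7 sequence length (stated in Section 2.1)


def bp_to_dot(position_bp: int) -> int:
    """Map a 0-based bp coordinate to a 0-based dot index (assumed convention)."""
    return position_bp // DOT_BP


def is_related(repeat_bp: int, start_dot: int, end_dot: int, margin: int = 1) -> bool:
    """Hypothetical test: the repeat lies in the region extended by `margin` dots.

    `start_dot` and `end_dot` are the inclusive dot indices of a segmental
    duplication region read from the bitmap. Extending the region by one dot
    at each end is an assumed interpretation of "within 1 dot".
    """
    dot = bp_to_dot(repeat_bp)
    return start_dot - margin <= dot <= end_dot + margin


# Number of dots needed to cover the whole chromosome (ceiling division).
chromosome_dots = -(-CHR7_LENGTH_BP // DOT_BP)
print(chromosome_dots)
```

Under these assumptions the chromosome spans 5,251 dots, which agrees with the total used in the density calculations of Section 3.1.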
3 Results

First, we explored the macroscopic relationship between the segmental duplications and the repeat sequences, by plotting the locations and the periods of the repeats against the locations of the segmental duplications over the entire region of human chromosome 7. Then, the characteristics of the repeat sequences related to the segmental duplications were revealed by statistical tests, in terms of the period of the repeat sequence.

3.1 Densities of Repeat Sequences Detected by RepeatMasker, TRF and STEPSTONE

To investigate the relationship between the segmental duplications and the repeat sequences, we first plotted the locations of the repeat sequences detected by the three methods over the entire chromosome, together with the segmental duplication regions. To view the macroscopic features of the distribution of the detected repeat sequences, the density of the repeat sequences within a range of the chromosome was calculated. The densities of the three types of repeat sequences are schematically shown in Figure 1.

The total number of dispersed repeats was 264,407 (upper plot in Figure 1). The number of dispersed repeats located around the intra-chromosomal segmental duplication regions in the plot was 19,645, and that around the inter-chromosomal duplication regions was 5,435. In the figure, the density of dispersed repeats is relatively higher near and within the regions of intra- and inter-segmental duplications than in the other regions. Indeed, in the intra-segmental regions, the density of dispersed repeats per dot is 55.95 (= 19,645/(47 × 7.47)), while the density in the remaining regions is 49.95 (= (264,407 − 19,645)/(5,251 − 47 × 7.47)). As an exception, the density is low in the large region of inter-segmental duplication in the middle of the chromosome.
Indeed, the density in the inter-segmental regions, 36.97 (= 5,435/(7.35 × 20)), is lower than that in the remaining regions, 50.74 (= (264,407 − 5,435)/(5,251 − 7.35 × 20)).

To identify the relationship between the segmental duplications and the tandem and inter-spread repeat sequences, we plotted the densities of the repeat sequences detected by TRF and STEPSTONE (middle and lower plots in Figure 1), after removing the dispersed repeats detected by RepeatMasker. The total number of repeat sequences detected by TRF was 6,749; 599 repeat sequences were located around the intra-segmental duplication regions, and 275 repeat sequences were located around the inter-segmental duplication regions. The total number of repeats detected by STEPSTONE was 37,190; 2,517 repeat sequences were located around the intra-segmental duplication regions, and 972 repeat sequences were located around the inter-segmental duplication regions.

Because the locations of the segmental duplications were estimated from the supplemental figure [6, 12] in the present study, the densities of these two types of repeats could not be estimated precisely in the regions remaining after removal of the dispersed repeats detected by RepeatMasker. By a rough calculation, the densities of repeats detected by TRF are 1.71 (= 599/(47 × 7.47)) and 1.87 (= 275/(7.35 × 20)) in the intra- and inter-segmental duplication regions, while those in the remaining regions are 1.26 (= (6,749 − 599)/(5,251 − 47 × 7.47)) and 1.27 (= (6,749 − 275)/(5,251 − 7.35 × 20)). As for the repeats detected by STEPSTONE, the densities in the segmental duplication regions are almost equal to those in the remaining regions: 7.17 (= 2,517/(47 × 7.47)) and 6.61 (= 972/(7.35 × 20)) in the intra- and inter-segmental duplication regions, and 7.08 (= (37,190 − 2,517)/(5,251 − 47 × 7.47)) and 7.10 (= (37,190 − 972)/(5,251 − 7.35 × 20)) in the remaining regions.
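The per-dot densities quoted above all follow one arithmetic pattern: repeats in the duplication regions divided by the total region length in dots, and the remaining repeats divided by the remaining dots. The following Python sketch recomputes them from the counts given in the text; the dictionary layout and the function name are illustrative choices only.

```python
TOTAL_DOTS = 5_251  # dots covering chromosome 7

# (number of regions, average region length in dots), from Section 2.1
REGIONS = {"intra": (47, 7.47), "inter": (20, 7.35)}

# Repeat counts quoted in Section 3.1
COUNTS = {
    "RepeatMasker": {"all": 264_407, "intra": 19_645, "inter": 5_435},
    "TRF": {"all": 6_749, "intra": 599, "inter": 275},
    "STEPSTONE": {"all": 37_190, "intra": 2_517, "inter": 972},
}


def densities(tool: str, kind: str) -> tuple[float, float]:
    """Return (density inside the regions, density in the remaining dots)."""
    n_regions, mean_length = REGIONS[kind]
    region_dots = n_regions * mean_length
    inside = COUNTS[tool][kind] / region_dots
    outside = (COUNTS[tool]["all"] - COUNTS[tool][kind]) / (TOTAL_DOTS - region_dots)
    return round(inside, 2), round(outside, 2)


for tool in COUNTS:
    for kind in REGIONS:
        inside, outside = densities(tool, kind)
        print(f"{tool:12s} {kind}: inside {inside:.2f}, remaining {outside:.2f}")
```

The rounded values agree with the figures quoted in the text, for example 55.95 versus 49.95 for dispersed repeats in the intra-segmental regions.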
As a result, the density of repeat sequences detected by TRF and STEPSTONE is slightly higher in the segmental duplication regions. The fluctuation of the densities is large for the repeats detected by RepeatMasker and STEPSTONE, while it is small for those detected by TRF. Although the degree of the density fluctuation differs between TRF and STEPSTONE, the macroscopic form of the density over the entire chromosome is similar for the repeats detected by the two methods. In addition, the densities at both ends of the chromosome are high due to the telomere regions, where repeats were frequently found in a previous study [3].

3.2 Periods of Repeat Sequences Detected by TRF and STEPSTONE

To inspect the relationship between segmental duplications and repeat sequences beyond the repeat density, the period of each detected repeat sequence was plotted along the chromosome, as shown in Figure 2. Note that the periods in the figure are those of the repeats remaining after removal of the dispersed repeats detected by RepeatMasker, because the periods of the dispersed repeats are already well characterized in the database [11]. This plot is useful for intuitively investigating the relationship between the periods and the segmental duplications over the entire chromosome.

The repeats detected by STEPSTONE show a more diverse range of periods, especially long periods, than those detected by TRF. One remarkable feature of the periods in the segmental duplications is that relatively long periods frequently emerge in the period distributions of both methods. In particular, the long periods in the segmental duplications are found not in the middle of the segmental duplication regions but at their boundaries.
For example, around the 55 Mbp region (a large red band in the figure), long periods are concentrated around the boundaries of the segmental duplication regions. In the following subsection, we statistically test these intuitive features of Figure 2.

Figure 1: Correspondence between segmental duplication regions and detected repeat sequences. The horizontal axis is the location in the genomic sequence, and the vertical axis is the density of the repeat sequences detected by RepeatMasker (upper), TRF (middle) and STEPSTONE (lower), respectively. The density is normalized with the average and the standard deviation of the numbers of repeats detected within 1 dot. The vertical lines through the graph indicate the locations of segmental duplication regions: intra-chromosomal duplication region (blue) and inter-chromosomal duplication region (red).

Figure 2: Correspondence between the periods of the repeat sequences detected by TRF (upper) and STEPSTONE (lower) and the locations of segmental duplication regions in human chromosome 7. In the plots, the horizontal axis is the location in the genomic sequence, and the vertical axis is the period. The period values range from 1 bp (bottom) to 500 bp (top). The periods of more than 500 bp are included in the bar with the 500 bp period. The vertical lines through the graph indicate the locations of the segmental duplication regions: intra-chromosomal duplication region (blue) and inter-chromosomal duplication region (red).

3.3 Statistical Tests for the Periods of Repeats in Segmental Duplication Regions

To further analyze the relationship between the repeat sequence periods and the segmental duplications, we listed the periods of the repeat sequences detected by TRF and STEPSTONE in Table 1.
All repeats found in the entire region of the chromosome were classified in terms of the period, and then the repeats found in the intra- and inter-segmental duplication regions were selected from them. Table 1 reveals that the numbers of repeat sequences detected by STEPSTONE are larger than those found by TRF in almost all period ranges. This is partly because STEPSTONE detects both tandem and inter-spread repeats, and partly because STEPSTONE requires only local sequence similarity within the repeat sequence, rather than similarity over the entire repeat as required by TRF. In addition, STEPSTONE detects repeat sequences with longer periods than those detected by TRF. Although the sequence similarity of the repeats detected by STEPSTONE is lower than that of the repeats detected by TRF, STEPSTONE detects a wide variety of repeats, especially with respect to the period. Although the number of repeats in the intra-segmental duplication regions is larger than that in the inter-segmental duplication regions, the rate at which the repeat numbers decrease with increasing period is similar between the intra- and inter-segmental duplication regions, for both TRF and STEPSTONE.

Table 1: Periods of the repeat sequences detected by TRF and STEPSTONE.

Period (bp)   TRF                      STEPSTONE
              all     intra   inter    all      intra   inter
1-20          3181    139     50       10168    446     144
21-40         2013    100     34       10275    420     154
41-60         772     47      23       6052     283     97
61-80         336     19      16       4883     197     80
81-100        215     11      7        4102     212     73
101-120       85      0       0        967      39      8
121-140       43      6       4        229      20      8
141-160       30      3       2        101      6       3
161-180       31      2       1        72       5       1
181-200       8       3       3        39       2       2
201-300       21      2       2        85       3       3
301-400       2       0       0        53       4       2
>= 401        12      1       1        167      14      5

The total numbers of detected repeat sequences within each period range are listed in the 'all' columns, and the numbers of repeats around the intra- and inter-segmental duplications are listed in the 'intra' and 'inter' columns, respectively.
Apart from the number of repeats, the relative ratios in each period range were investigated by a χ² test for four pairs of the repeat distributions (Table 2): the total number of repeats versus the number of repeats detected in both the intra- and inter-segmental duplication regions, the total number of repeats versus the number of repeats detected in the intra-segmental duplication regions, the total number of repeats versus the number of repeats detected in the inter-segmental duplication regions, and the number of repeats detected in the intra-segmental duplication regions versus that in the inter-segmental duplication regions. Thus, four tests were performed for each of the two detection methods, TRF and STEPSTONE. The tests revealed that the distributions of the periods in the segmental duplication regions are highly biased relative to the distribution of the periods in the entire region, while the difference in the periods between the intra- and inter-segmental duplication regions is not significant at the 5% level. These results indicate the existence of repeat sequences with periods specific to the segmental duplications.

Table 2: Distribution bias of the periods in segmental duplication regions.

             all vs. (intra + inter)   all vs. intra   all vs. inter   intra vs. inter
TRF          p < 0.001                 p < 0.005       p < 0.001       NS
STEPSTONE    p < 0.005                 p < 0.01        p < 0.05        NS

NS indicates not significant (p ≥ 0.05).

The periods specific to the segmental duplications, which account for the bias of the period distributions, were identified by the normalized residual analysis (Table 3); each residual test was performed on the three period distributions (all, intra, and inter) for TRF and for STEPSTONE. The tests for the TRF and STEPSTONE distributions shared many biased periods at the 5% significance level.
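As an illustration of the kind of test described above, the following Python sketch compares the STEPSTONE 'intra' counts of Table 1 against the 'all' distribution. The exact formulation used in this study is not specified in the text, so this is a sketch under stated assumptions: the whole-chromosome proportions are taken as the expected distribution (a goodness-of-fit form), and the residual is the standardized residual (O − E)/sqrt(E(1 − p)). The function name and these choices are hypothetical, and the output is not claimed to reproduce Tables 2 and 3.

```python
import math

# Period bins and counts transcribed from Table 1 (STEPSTONE columns).
BINS = ["1-20", "21-40", "41-60", "61-80", "81-100", "101-120", "121-140",
        "141-160", "161-180", "181-200", "201-300", "301-400", ">=401"]
STEPSTONE_ALL = [10168, 10275, 6052, 4883, 4102, 967, 229, 101, 72, 39, 85, 53, 167]
STEPSTONE_INTRA = [446, 420, 283, 197, 212, 39, 20, 6, 5, 2, 3, 4, 14]


def chi_square_gof(reference, observed):
    """Return (chi-square statistic, degrees of freedom, standardized residuals).

    Assumes every reference count is positive and that both lists use the
    same bins in the same order.
    """
    ref_total = sum(reference)
    obs_total = sum(observed)
    statistic = 0.0
    residuals = []
    for ref, obs in zip(reference, observed):
        p = ref / ref_total            # expected proportion of this bin
        expected = obs_total * p       # expected count in the observed sample
        statistic += (obs - expected) ** 2 / expected
        residuals.append((obs - expected) / math.sqrt(expected * (1.0 - p)))
    return statistic, len(reference) - 1, residuals


statistic, df, residuals = chi_square_gof(STEPSTONE_ALL, STEPSTONE_INTRA)
print(f"chi-square = {statistic:.2f} with {df} degrees of freedom")
for label, r in zip(BINS, residuals):
    flag = "*" if abs(r) > 1.96 else ""  # two-sided 5% threshold of the normal distribution
    print(f"{label:>8}  {r:+.2f} {flag}")
```

A p-value for the statistic could then be obtained from a chi-square distribution function, for example `scipy.stats.chi2.sf(statistic, df)` if SciPy is available. Bins with very small expected counts, such as the long-period bins for TRF, would normally need pooling or an exact test; how such bins were handled in this study is not stated.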
Indeed, the periods of 121 to 140 bp, of 181 to 200 bp, and of more than 400 bp are frequently found in the table. As expected from the overall features of Figure 2, the relatively longer periods account for the period distribution bias. In particular, the periods of 121 to 140 bp and of more than 400 bp are identified as biased periods at the 5% significance level in three of the four cases. Below, we further examine the features of the repeats with these two long biased periods.

Table 3: Characteristic periods in the biased distribution.

             all vs. intra              all vs. inter
TRF          181-200, ≥401              61-80, 121-140, 181-200, ≥401
STEPSTONE    81-100, 121-140, ≥401      121-140

Another remarkable feature of the repeats with the biased periods is that the G+C content is highly biased in most of the repeats. Indeed, a bias at the 1% significance level is found in 20 of the 28 repeats with 121 to 140 bp periods, and in 13 of the 19 repeats with periods of more than 400 bp. Interestingly, regions of high G+C content favor the formation of Z-form DNA and are involved in the release of negative supercoiling [8]. Although the relationship between the release of negative supercoiling and segmental duplication is still unclear, the observed G+C content bias might be related to the frequent occurrence of the repeats in the segmental duplication regions.

4 Discussion

We have presented analyses of the relationship between segmental duplications and repeat sequences in human chromosome 7 by graphical plots and statistical analyses. The plots revealed that the repeats, especially those with long periods, are frequently found around the segmental duplication regions. The statistical analyses showed that some periods are biased between the segmental duplication regions and the other regions, but not between the intra- and inter-segmental duplication regions.
In a previous study [6], two characteristic features were described for the two types of segmental duplications: the intra-segmental duplications of chromosome 7 are larger, and share higher sequence similarity, than the inter-segmental duplications. In addition, the distributions of their locations differ from each other. The inter-segmental duplications are frequently observed in the peri-centromeric and sub-telomeric regions [2, 6], while the intra-segmental duplications are found throughout the entire chromosome. Interestingly, little bias of periods exists between the two types of segmental duplications. Thus, the mechanism for the occurrence of the duplication might be similar between the intra- and inter-segmental duplications.

In this study, we detected 28 repeat sequences with 121 to 140 bp periods, and 19 repeat sequences with periods of more than 400 bp, as the biased periods by a statistical test. Interestingly, although the G+C content is frequently biased at a significant level, little sequence similarity is observed among the repeat sequences by visual inspection. Thus, the biased repeat sequences related to the segmental duplications may be characterized by physical properties, such as G+C content, rather than by sequence similarity.

Acknowledgments

One of the authors (K. H.) was partly supported by a Grant-in-Aid for Scientific Research on Priority Areas "Genome Information Science" (grant 17017015) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

[1] Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E., Recent segmental duplications in the human genome, Science, 297:1003–1007, 2002.
[2] Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E., Segmental duplications: Organization and impact within the current human genome project assembly, Genome Res., 11:1005–1017, 2001.
[3] Benson, G., Tandem repeats finder: A program to analyze DNA sequences, Nucleic Acids Res., 27(2):573–580, 1999.
[4] Cheung, J., Estivill, X., Khaja, R., MacDonald, J.R., Lau, K., Tsui, L.-C., and Scherer, S.W., Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence, Genome Biol., 4:R25, 2003.
[5] Eichler, E.E. and Sankoff, D., Structural dynamics of eukaryotic chromosome evolution, Science, 301:793–797, 2003.
[6] Hillier, L.W., Fulton, R.S., Fulton, L.A., Graves, T.A., Pepin, K.H., Wagner-McPherson, C., Layman, D., Maas, J., Jaeger, S., Walker, R., Wylie, K., Sekhon, M., Becker, M.C., O’Laughlin, M.D., Schaller, M.E., et al., The DNA sequence of human chromosome 7, Nature, 424:157–164, 2003.
[7] Murakami, H., Sugaya, N., Sato, M., Imaizumi, A., Aburatani, S., and Horimoto, K., Detection of inter-spread repeat sequence in genomic DNA sequence, Genome Informatics, 15(1):170–179, 2004.
[8] Rich, A. and Zhang, S., Z-DNA: The long road to biological function, Nat. Rev. Genet., 4:566–572, 2003.
[9] Zhang, L., Lu, H.H.S., Chung, W., Yang, J., and Li, W.H., Patterns of segmental duplication in the human genome, Mol. Biol. Evol., 22(1):135–141, 2005.
[10] Zhou, Y. and Mishra, B., Quantifying the mechanisms for segmental duplications in mammalian genomes by statistical analysis and modeling, Proc. Natl. Acad. Sci. USA, 102(11):4051–4056, 2005.
[11] http://ftp.genome.washington.edu/RM/RepeatMasker.html
[12] http://www.nature.com/nature/journal/v424/n6945/suppinfo/nature01782.html