Nucleotide sequence definition of a major human repeated DNA, the

volume 10 Number 10 1982
Nucleic Acids Research
Nucleotide sequence definition of a major human repeated DNA, the Hind HI 1.9 kb family
Laura Manuelidis
Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06510, USA
Received 3 February 1982; Revised and Accepted 30 April 1982
ABSTRACT
A human Hind III 1.9 kb repeated DNA fragment was isolated and cloned in
pBR322. A cloned member that hybridized predominantly to the 1.9 kb Hind III
band in a digest of whole human DNA was chosen for sequencing. It is an
1894 bp fragment that shows no significant internal repeats. Few pCG residues are observed in the sequence and there are numerous stop codons. Detailed sequence comparisons confirm this is a novel class of repeats that is
not related to previously characterized human satellite DNAs or Alu sequences.
At least a portion of the sequence described is conserved in
evolution.
INTRODUCTION
A 1.9 kb band is visible in Hind III digests of human male and female
DNA and by densitometry, accounts for >_ 0-25% of the genome; by hybridization experiments its sequence has been shown to be distinct from previously identified human repeated DNAs including the simple DNA satellites,
the Eco RI 340 bp dimer repeats and the Alu sequence family (1). The complete sequence of a cloned member of this 1.9 kb family is here reported. An
accompanying paper describes minor sequence variations in the whole genomic
array that are detectable in hybridization studies using this cloned DNA as
a probe (2). Such studies clearly link the sequence described here to several primate Kpn I bands observed by other investigators (3, 4 ) . The 1.9 kb
sequence also is likely to be related to repeated sequences found in cloning
studies of the 3' side of the Bglobin gene (5) and to those identified in X
chromosome cell hybrid experiments (6).
The 1.9 kb sequence shows no internal repeats of statistical significance, and in accord with previous hybridization studies, it is not homologous to any of the previously described human repeated DNA families.
Re-
lated Kpn I repeats have not yet been sequenced and thus cannot be compared
directly for homologies.
At least part of the 1.9 kb
© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K.
0305-1048/82/1010-32118 2.00/0
sequence is conserved
3211
Nucleic Acids Research
in evolution (1, 2, 3 ) . The "function" of this sequence is unknown.
MATERIALS AND METHODS
The Hind III band visualized in a digest of whole human DNA was excised
and eluted from a preparative 0.8% agarose gel, cloned in A Charon 27 and
subcloned in pBR322 (2). One recombinant clone that in Southern blot experiments hybridized predominantly to the 1.9 kb Hind III band in digests of
human nuclear DNA (1) was chosen for further sequence studies. Plasmid DNA
isolated from E_^ coli as described
ethidlum bromide centrifugation.
(7) was purified by equilibrium CsCl-
The DNA insert was mapped using restric-
tion enzymes and standard agarose gel electrophoresls (8, 9 ) .
Appropriate
restriction fragments were 5' or 3' end labelled as described in detail (9,
10, 11) using a series of 20%, 8% and 4% gels.
analyzed
at
least
twice
Most of the sequence was
(Fig. 1) and complementary
strand
analysis
or
sequence analysis in a second experiment showed unambiguous confirming data.
Restriction enzyme map sites consistent with fragments obtained prior to
sequencing were confirmed in the sequence.
The
sequence
was
analyzed extensively
using
the sequence
analysis
system of the NIH MOLGEN SUMEX-AIM computer facility at Stanford University.
The homology, symmetry and dyad-symmetry searches in this system are based
on an updated version of the Korn-Queen algorithm; details of the parameters
used and their statistical significance were generously provided and can be
obtained from the MOLGEN group at Stanford (J. Clayton, P. Friedland, L.
Kedes, and D. Brutlag).
The system has numerous sequence files available
for comparative searches.
RESULTS AND DISCUSSION
Mapping and Sequencing Strategy for the Hind III 1.9 kb Insert
The enzymes used for sequencing of end labelled insert fragments are
depicted in Fig. 1.
Bgl II/Hind III digests were used to sequence positions
1-700 since fragments of 400 bp and 1500 bp were well resolved from 4362 bp
plasmid DNA.
positions
Similarly, Xba I/Hind III digests yielded the sequence from
1200-1900.
Rsa I/Eco RII digests were found most useful for
sequencing central fragments of the Insert.
Complementary strand analyses
were also done using Rsa I/Bgl II digests or Rsa I/Sau IIIA (positions 14241667) digests.
Hinf I, Hinc II, Kpn I and Taq I cleavage studies on whole
plasmid or end labelled fragments were also used to determine the orientation of the insert in pBR322 and to confirm restriction sites obtained in
3212
Nucleic Acids Research
7 8
v Hind III
0 Eco R II
T Bgl II
9 10 II 12 13 14 15 16 17 IS 19 P
t\ Rsa I
S Kpn |
t Xba I
Figure 1. Diagrammatic summary of restriction gel fragments analyzed with
major enzymes of use in DNA sequencing.
3' represents 3' end labelled
fragments used for sequencing, all other fragments were 5' labelled. Arrows
show direction of sequencing from each labelled end. Double line represents
identically labelled fragments sequenced twice. Most of the sequence analysis contains complementary strand confirmation. Ep represents the end of
the vector arm to position 29 (the Hind III insert site) and p represents
plasmid from position 29 onwards; insert scale is 1900 bp.
the sequence.
Restriction Sites of the Hind III 1.9 kb Insert
A complete restriction map is shown with the obtained DNA sequence of the
1.9kb insert (Fig. 2 ) . The sequence determinations confirmed our initial
restriction site studies indicating no Pst I, Ava I, Bgl I, Bam HI, Xho I,
Ava II, Bst E II or Eco RI sites in the cloned insert.
The Rsa I, Eco RII,
Bgl II and Xba I sites were confirmed by sequencing.
Restriction studies
using Hinf I and Hinc II digestion on the cloned plasmid DNA, and Alu I, Hae
III and Taq I on end labelled fragments, were compatible with the sequence
obtained.
Studies with whole genomic DNA have also confirmed the absence of
Pst I, Bam HI, and Bst EII sites within the majority of uncloned members of
this family.
were
observed
Hind III, Xba I, Hae III, Rsa I, Eco RII and Hinc II sites
at
the same
positions
in the genomic
family with minor
variations; a single base change in one of the Eco Rl* sites (at position
1642 or 1652) is likely to be represented as Eco RI sites in a fair proportion of uncloned members of this family (2).
Nucleotide Composition
There were 42.6% Adenine (A), 21.9% Cytosine (C), 16.5% Guanine (G),
and 19.1% Thymlne (T) residues (reading 5' to 3').
Taken together this
yielded a base composition of 38.4% GC and 61.7% AT for the double stranded
molecule.
Whole human DNA bands at a density of 1.697 in neutral CsCl (un-
published data) which is equivalent to average GC content of 39-40%.
this DNA is somewhat AT rich.
Thus
We have never been able to purify or signifi-
cantly resolve fragments containing the Hind III 1.9 kb band in high mole-
3213
Ttblllll
Alul
UiiDIII
1
Ddol EcoiU*
1
liphl
1
EcoRII
I
1
1
Mall
Tiql
I
HboII
Sf.NI
I
Ilinl'I
II
DJel
II
Lcol'l
Ttblllll
11
Hull
Dd«l
1
Alol
EcoPl
Ddol
Usll
11 1
•30
860
870
880
890
900
910
CTTOGAACTAAAGCrCTCCrCAGCAAATGTAAAAGAATAGAAATTATAACAAACTGTCCCTCAGACCACA
11
Alul
Odol
I
LcoUII
710
720
730
740
750
760
770
ATTAACAAGGATACCCAGGAATTUAACTCAGCTCTGCACCAAGCAGAQCr/JlAliACATCTACACAACTCr
1
Udcl
Hu.KI
EcoUI*
111
U|iAI
570
580
590
600
610
620
630
TATATOCACCCAATACAGGAGCACCCAGATllCATAAACCAAGTCCrGAGTGACCTACAAAGAGACTTAGA
Upol
1
430
440
450
460
470
480
4'J<)
AAAACAAAAAAAGGTACGGCTTGCAATCCTAG7XrCTGATAAAACAGACrrTAAACCAACAAAGATCAAA
EcolU
1
290
300
310
320
330
340
3S0
GCTCCAATTAJUUUlACACGGACrCGCAAATTGGATAAAAACTCAAUACCCATCAGTCTCCTGTATTCAGG
EcoPl
1
ISO
160
170
180
190
200
210
AAAATCATGCCAAATTCTAAAGACCATCGAGGCTAGGAAUAAACTGaTCAACTAACCAGCAAAATAACC
1
11
10
20
30
40
50
60
70
AACCnCATACUTGAAOOAGAAATAAAATACrTCACACACAACCAAATGCTGACAGATnTGTCACCACC
lUelll
1 11
Upall
Kill
Kpnl
UaiCI
1 1
llbvl
Fnu4III
Fuu4HI
Dpal
DpnI
HboII
II
Alul
llbotl
Upnl
1
licoRI*
920
930
940
950
960
970
980
GTUCAATCAAACTAGAACTCAGGGTTAAGAAACTCACTCAAAACCACTCAACTACATGCAAACTAAACAA
I
1
K«»I
TM
790
BOO
«10
820
830
U40
CCAGCCCAAATCAACAGAAAUTACAj\Ai\.VAri'CA(iCACCACACCACACCTATTCCAAAACri;ACCACATA
1
1
UioDll
64U
650
660
670
bilO
69U
700
CTCCCACACAATAATAATOGUAGACrrCAACACCCTACTCiTCAACAnAUAAAGATCAACAAGACAGAAA
1
Upol
II
Haolll
Hiol
500
51u
520
530
540
550
560
AGAGACAAAGAAGGCCATTACATAATGGTAAAAOGATCAAlTCAACAAliAAGAUCTAACTATCCTAAATA
XlioII
11 1
1
Ac y1
360
370
360
390
400
410
420
AMCCCATCrCATCTCCAGAGACACACATAGGCTCAAAATAAAAflGACGTCAAGATCTTCCAACAAATGO
1
1
Alul
220
230
240
250
260
270
280
AGCTAACATCATAATGATAGGATCAAACTCACACATAACAATATTAACTTTACATGTAAATGGACTAAAT
ll»cl
StuI
II
Alul
llgiAJ
SicI
11
BO
90
100
110
120
130
MO
AGGCCTGCCCTAAAAGAGCTCCTGAAGCAAGCACTAAACATGCAAAGUAACAACCGGTACCAGCCGCTGC
1
1
Dpnl
1
1
Alul
1
ninFI
F.cotll*
1
P.coPl
Alul
1
H,,I
JpnI
,I,,CI
1
Hnll
ninFI
IlcollI'
1
1
1
1
.'Inll
r.coR|.
1
1
I'bvl
Fnu41ll
Fxcil'15
Kcol'.I*
1
MboII
1
1
HinFI
MboII
ninPt
1
Xbil
EcoKI*
1 1 1
MboII
Mnll
Tiql
CCORI*
1 1 1
Hnll
1
1
F.coRI*
Moll
Rill
1 1
1
EcoRII
1
StnNI
F.colll"
1
EcoPl
1
It is
1760
1770
1780
1790
1800
1810
1B20
GCCAGCATCATCATCCTGATACCAAAGCCrGGCAGAnACACAACAAAAAAAGAGAATnTAOACCMTAT
11
IlinFI
BcoRI*
1 1 1
licoI'.II
Ilinl'l
Avail
Alul
1620
1630
1640
1650
1660
1670
1680
TACCAACCAAAAAGAGTCCAGGACCAGATGGAITCACAGCCAAATTCTACCAGAGGTACAAGGTDGAACT
Alul
IlinlHII
II
1
1480
1490
1500
1510
1S20
1530
1540
CTATGCAAATAAACTAGAAAATCTAfiAAGAAATGGATAAATTCCTCnACACATACACCCTCCCAAGACTA
Dpnl
1
1340
1350
1360
1370
1380
1390
1400
AAAATTGATAGATCACTAGCAAGACTAATAAAGAAr.AAAACAGAGAAGACTCTGATAOACTCAATAAAAA
Figure 2. Complete nucleotide sequence and restriction enzyme positions of the Hind III 1.9 kb insert.
shown as oriented in pBR322 (see Fig. 1 ) . Eco RI* = PPATQQ.Dpn I = Mbol.
Sf«NI
1
Alul
1 2 0 0
1210
1220
1230
1240
1230
1260
AAAAGCAAGAGCAAATACATTCAAAAGCTAGCAGAAGGCAAnAAATAATTAAAATCAGAGCAOAACrGAA
1
1
Unll EcoRI*
1
Alul
1060
1070
1080
1090
1100
1110
1120
GAOAACAAAOACACAACATACCAOTATCTCTOOOACATTCATAOCTGTGTGTACACOCAAATn'ATAeC*
Mnll
"«HI
»«1
Ml
1830
1840
1850
1860
1870
1880
1810
CCTTOATnAACATTnATCCAAAAATCCTrAATAAAATAnT.nCAAACiCAAATCCAfilWnCACATCAAAAA
1
Mull
11
1690
1700
1710
1720
1730
1740
1750
GGTACCATTCCrrCTGAAACTATTCCAATCAATAGAAAAAGAGGGAATCCTCCCTAACTCATTTTATClAG
HboII
1
1550
1560
1570
1580
1590
1600
1610
AACAAGGAAGAAOCTOAATCTCTGAACAGACCAJITAACAGGCTCTGAAATTATCGCAACAATCAATACCT
Moll
Tthlllll
HphI
Dpnl
II
II
1410
1420
1430
1440
14S0
1460
1470
ATGATAAAGOAGATATCACCACCGATCCCACAGAAATACAAACTACCATCAGAGAATACTACAAACACCT
HlnFI
Alul
[icoRII
EcoRI*
1
1270
1280
1290
1300
1310
1320
1330
COAAATACAGACACAAAAAACCCTTCAAAAAATTAATCAATCCACGAGCTGOII'IrnHAAACCATCAAC
till
1
1130
1140
1150
1160
1110
1180
1150
CTTAATGTCCACCAGTCAAAGCTOaAAAfiTTTCAAMTTCACACCCTAACATCACAATTCAAAnCACTAC
990
1000
1010
1020
1030
1040
10S0
CCTGCTCCTOAATGACTACTGGGTCCaTAACOAMTGAAGGCAGAAATAAAGATGTTCTTTGAAACCAGT
3
>
o
a.
<•>•
Nucleic Acids Research
cular weight human DNA using CsCl-Hoechst or Ag-CsSO4 gradients (12) possibly because the repeat is integrated with other sequences in the genome.
Analysis of the dinucleotide sequence distribution was generally in
keeping with the percent expected in a random sequence, with one striking
exception.
The pCG ratio (which is the same in both strands) was quite low
and showed only 9 pCG pairs or 0.5%, as compared to an expected percent of
3.6% in total DNA.
(position 127).
One of these pCG sites is at the Msp I (Hpa II) site
In preliminary studies using the restriction enzymes Msp I
and Hpa II to differentiate methylated from non-methylated cytosines, there
appears to be a small degree of methylation at this site in whole human DNA
(data not shown).
Only one other pCG site of the 9 in the total sequence
has a methyl C residue that potentially can be studied with a restriction
enzyme; at position 134 a simple inversion of CG could yield a Hha I site.
Generally, the degree of potential methylation sites in the 1.9 insert ,is
considerably lower than that seen in mouse satellite DNA (11, 13). Although
pCG methylation has been considered an important feature in repeated DNAs
and the inactivation of coding sequences (14, 15, 16), the low percent of
pCG residues here suggest C methylation may not be an important modifying
feature in the 1.9 kb sequence.
The low representation of pCG sites also
could coincide with special protein binding affinities of this sequence,
especially if it is adjacent to active gene coding regions.
Lack of Internal Repeats
Since the 1.9 kb Hind III sequence is repeated, one might suspect that
the sequence is built upon a simpler sequence, as for example that seen in
the 340 bp Eco RI human dimer, or the 232 bp mouse satellite sequences (9,
11, 13). We therefore searched the sequence for internal homologies that
were statistically significant
(see Methods).
The best homology was the
match between 175-212 and 1545-1580 in the 1.9 kb insert (30 of 40 bases
matched).
The expected value (E) for this match as compared to a random DNA
is 0.118, and showed a one base loop out (insertion-deletion) at 6 positions.
A statistically significant homology would have zero base loop outs,
and an expected value E of 0.05 or less.
We further reviewed
all the
matches of 75% homology and found no repeated sequence that was obvious in
these matches; there was also no consistent spacing pattern discernable in
the homologies obtained.
As a control of the computer program we compared
its expected value for the two repeats of the human Eco RI 340 bp dimer,
which using a different analysis were considered to be highly significant
statistically
3216
(9).
The computer also read this as a highly significant
Nucleic Acids Research
repeat with an expect value E = 0.000.
Regions of Symmetry and Dyad Symmetry in the 1.9 kb Fragment
Several symmetries were noted in the sequence, for example at position
430.
However, computer analysis of all symmetries showed there were none of
statistical significance.
Furthermore, there were no dyad symmetries where
the total free energy of base pairing was -10 K cal; -15 K cal or less has
been considered to be the limit for the stable formation of RNA hairpins
under physiological conditions (17).
Translation Analysis of the 1.9 kb Sequence
The translation products in all three reading frames were examined.
There were numerous stop codons.
The longest reading frame (in any frame,
without stop codons) was 251 amino acids long, starting at position 1139.
More stop codons were observed in a search of the inverse complement of the
sequence (the largest translation array was 82 amino acids in length).
More
detailed translation analyses (and homology and dyad symmetry analyses) are
available on request.
Comparison with other Sequences
The Hind III 1.9 kb sequence was compared with the human 340 bp Eco RI
dimer, in both orientations, using the sequence homology program. No statistically significant homologies were obtained. It is curious that 5 of the 12
matches (41.7%) with >^ 75% homology to the 1.9 kb insert oriented as shown
in Fig. 2, matched a region of the dimer sequence (ATAGAAACTAGACAGAATAAT) at
positions 683, 1192, 1338, 1477 and 1485 in the 1.9 kb sequence.
It is not
known if this sequence region may be of importance in cross-over events or
evolution of repeated DNAs.
ly related
No evidence that the 1.9 kb fragment was close-
to the Eco RI dimer family was obtained
from these sequence
analyses, and this is in accord with our hybridization studies showing no
homology between these two repeated DNAs (1).
These results contrast to
hybridization studies of related "alpha" satellites by other investigators
(3).
The best homology between these two different repeats within the same
species was 76.9% (20 of 26 bases) with an unimpressive E value of 0.995
(1.9 kb positions 683-708 with Eco RI dimer positions 327-351).
Analysis of
the inverse complement for homologies also showed no significant similarities between these two repeats.
Analysis of dyad symmetries between the
human Hind III 1.9 kb fragment and the Eco RI 340 bp dimer did however show
7 matches where the pairing yielded < -15.0 K cal, possibly suggesting that
under less stringent hybridization conditions, some positive hybridization
between the two different sequences might be obtained.
Alternatively, high
3217
Nucleic Acids Research
molecular weight "alphold" DNA, isolated as a satellite from cesium gradients, could also contain other repeated DNA sequences in addition to the 340
dimer family that contributed to the positive hybridization results obtained
by others (3).
In summary, the comparative data here does not lend any
positive evidence to support the notion that the shorter Eco RI 340 bp human
dimer sequence is related to the longer 1.9 kb human repeat.
We previously found that the 1.9 kb clone hybridized to an "interspersed" mouse repeated DNA band (1,18).
this
sequence
satellite.
bore
any
resemblance
Again no statistically
to
Thus it was of interest to find if
the
common
significant
repeated
mouse
DNA
sequence homologies were
noted between the human 1.9 kb repeat and mouse satellite, and this was in
accord with previous hybridization studies (1).
The best homology between
these two sequences matched positions 126-15 7 of the mouse sequence with
14 72-1504 of the 1.9 kb sequence (75%, 2 7 of 36 bp, E = 0.154) as shown:
* * * * *
* * * *
1.9 kb 1472: TATG CAA ATAAACTAGAAAATC TAGAAGAAATG 1504
Mouse satellite 126: TATGGCAAGAAAA CT GAAAATCATGGAA AA TG 157
This homology between two widely separated species was statistically better
than any homology between the human Eco RI 340 bp dimer and the human 1.9 kb
repeat.
"Alu family" sequences, including a polymerase III in vitro trans-
cription unit (19) were also compared since they are "interspersed" in the
human genome.
Alu sequences are notably more GC rich than the 1.9 kb se-
quence depicted here; Alu sequences showed less significant homologies with
the 1.9 kb repeat than mouse satellite DNA (sequences compared as shown in
Fig. 2 or as inverse complement).
Tandemly repeated human simple satellite
sequences (12, 20) did not show any obvious relation to the 1.9 kb repeat.
In summary, the 1.9 kb repeated sequence is here defined as a novel
human repeated DNA.
It lacks any internal repeated sequence subsets, and
thus cannot be classified as a tandem repeat based on its internal sequence.
In the following paper we demonstrate that larger fragments containing this
sequence do not show strict tandem repetition of the 1.9 kb sequence, i.e.
the end of the molecule is not followed directly by the beginning of the sequence or vice versa.
The "function" of this conserved repeat is unknown.
It is of interest that high resolution chromosome hybridization studies directly show at least a proportion of this sequence is localized on the
chromosome arms in a discrete or banded configuration (1 and in preparation).
It is possible that this sequence, or clusters of this sequence, may
have a role in the definition of discrete chromosomal and nuclear domains
that
3218
are
distinct
from
those
occupied
by
the
predominant
centromeric
Nucleic Acids Research
repeats.
ACKNOWLEDGEMENTS
J. Steitz generously supplied very high specific activity Y 32P ATP
used in these studies. This work was supported by NIH Grant CA15O44.
REFERENCES
1. M a n u e l i d i s , L. ( 1 9 8 2 ) . I n : Genome E v o l u t i o n and P h e n o t y p i o c
V a r i a t i o n G.A. Dover and R.B. F l a v e l l , e d s . p p . 263-285 Academic P r e s s ,
New York
2. M a n u e l i d i s , L. and B i r o , A. ( 1 9 8 2 ) . N u c l e i c Acids Res ( i n p r e s s )
3 . Maio, J . J . , Brown, F . L . , and Musich, P . R . ( 1 9 8 1 ) . Chromosoma
( B e r l . ) J33_, 103-125
4 . Singer, M.F. (1981). I n t e r n a t i o n a l Rev. Cytology ( i n press)
5. Kaufman, R.E., Kretschmer, P . J . , Adams, .W., Coon, H.C., Anderson,
W.F. and Nienhuis, A.W. (1980). Proc. N a t l . Acad. Sci. (U.S.A.) !]_,
4229-4233
6. Schmeckpeper, B . J . , Willard, H.F. and Smith, K.D. (1981). Nucleic
Acids Res. \ 1853-1872
7. Clewell, D.B. and Helinski, D.R. (1970). Biochem. 9_, 4428-4440
8. Manuelidis, L. (1976). Nucleic Acids Res. _3» 3063-3076
9. Wu J . C . and Manuelidis, L. (1980). J . Mol. Biol. _142_> 363-386
10. Maxam, A. and G i l b e r t , W. (1979). Methods Enzymol. ^ 5 , 504-561
11. Manuelidis, L. (1981). FEBS L e t t . \V^, 25-28
12. Manuelidis, L. (1978). Chromosoma ( B e r l . ) ^ 6 , 1-21
13. Horz, W. and Altenburger, W. (1981). Nucleic Acids Res. _9» 683-695
14.
Spencer, J . H . , Harbers, K., and Duhamel-Maestracci, N. (1981).
Advances in Enzyme Regulation J_9_, 453-470
15. Razin, A. and Riggs, A.D. (198^1. Science 2H), 604-610
16. Tantravahi, U. , Guntaka, R. , Erlanger, B. and M i l l e r , O.J. (1981).
Proc. Natl. Acad. Sci. U.S.A. _78» 489-493
17. Tinoco, I . , Borer, P.N., Dengler, B . , Levine, M.D., Uhlenbec, O.C.,
Crothers, D.M. and Gralla, J . (1973). Nature New Biol. 246_, 40-41
18. Manuelidis, L. (1980). Nucleic Acids Res. 8^, 3247-3258
19. Pan, J . , Elder, J . T . , Duncan, C.H., and Weissman, S.M. (1981).
Nucleic Acids Res. 9_, 1151-1170
20. Frommer, M., Prosser, J . , Tkachuk, D., Reisner, A.H. and Vincent,
P.C. (1982). Nucleic Acids Res. 10, 547-563
3219
Nucleic Acids Research