Efficient algorithms for molecular sequence analysis

Proc. Natl. Acad. Sci. USA
Vol. 85, pp. 841-845, February 1988
Evolution
Efficient algorithms for molecular sequence analysis
(repeats/dyad symmetries/alignment maps/algorithm)
SAMUEL KARLIN, MACDONALD MORRIS, GHASSAN GHANDOUR, AND MING-YING LEUNG
Department of Mathematics, Stanford University, Stanford, CA 94305
Contributed by Samuel Karlin, September 22, 1987
ABSTRACT Efficient (linear time) algorithms are described for identifying global molecular sequence features allowing for errors including repeats, matches between sequences, dyad symmetry pairings, and other sequence patterns. A multiple sequence alignment algorithm is also described. Specific applications are given to hepatitis B viruses and the J5-C (J, joining; C, constant) region of the immunoglobulin κ gene.
We illustrate the alignment algorithm on the rabbit, human, and mouse immunoglobulin κ gene J5-C (J, joining; C, constant) genomic region (see Table 1 and Fig. 1) and the human, ground squirrel, and duck hepatitis B virus genomes (see Table 3 and Fig. 2) with some interpretations.
With nucleic acid and protein sequence data accumulating at
an accelerated rate and with the advent of the human
genome sequencing project, the need for efficient molecular
sequence software is paramount. A number of molecular sequence analysis packages are available, including those of the University of Wisconsin Genetics Computer Group (UWGCG) and the programs on the Bionet Resource. These variously use optimizations, dot plot and profile displays, and empirical matching constructions (1-5).
This paper describes several algorithms with expected run
time approximately linear (proportional to sequence length).
One algorithm determines within and between multiple sequences all statistically significant long repeated words
(oligonucleotides, peptides) allowing for errors. Another
algorithm aligns multiple sequences. These algorithms can
identify global sequence features other than repeats, such as
dyad symmetries and masked patterns.
Central to these algorithms is a linked list array, which
connects each occurrence of a k-word (a sequence of k
consecutive letters) to the next occurrence of the same word
within the sequence. Repeats without errors (exact repeats)
can be located very quickly by using an algorithm that
constructs linked list arrays for longer words from the initial
linked list. Long repeats are printed subject to rules that
suppress redundant information. A more general algorithm,
which allows for errors, determines groupings of moderate
length exact repeats separated by short "error blocks" of
mismatched or inserted/deleted letters. Explicitly, a repeat segment is an aggregate of exact repeats each of length ≥K with successive exact repeats separated by at most e letters.
This characterization of errors avoids the optimization problems inherent in trying to identify single matches and the
exact location of insertions/deletions.
The multiple sequence alignment algorithm determines the
degree and extent of similarity between sequences in the
vicinity of long matching segments by identifying nearby
short matches that are closely aligned.
Printing criteria for these algorithms are based on statistical methods for distinguishing nonrandom sequence relationships from chance configurations. Significance criteria for a
variety of sequence statistics, as assessed by empirical
permutation procedures and comparisons with theoretical
random models, may be found in ref. 6.
A. The Linked List
The basic strategy of all these algorithms is to represent the repetition of k-words within a sequence by a linked list (called L). Consider a sequence of N letters from an m-letter alphabet (e.g., m = 4 for nucleotides; m = 20 for amino acids; for other useful molecular sequence alphabets, see ref. 7). There is a k-word beginning at each of the first N − k + 1 positions. The linked list for repetition of k-words is an array whose ith element, L(i), is the position in the sequence of the next occurrence of the k-word at i. If i is the last occurrence of the word in the sequence then L(i) is 0.
The sequence of 13 nucleotides AAGCTTAGCTAGC would have the 3-word linked list (0, 7, 8, 0, 0, 10, 11, 0, 0, 0, 0). The linked list allows us to find all subsequent repeats of any k-word. The 3-word AGC occurs at position 2, at L(2) = 7, and at L(7) = 11. Position 11 is the last occurrence since L(11) = 0. Call the sequence of linked locations beginning with i (i, L(i), L(L(i)), ...) the k-word pointer chain beginning with i.
Numerical Representation of k-Words. In an m-letter alphabet, there are m^k possible k-words, each of which can be assigned a different number between 1 and m^k. Let σ represent a unique correspondence of the letters to the numbers from 0 to m − 1. Then for a given k-word a_1, a_2, ..., a_k the corresponding unique number is $\omega = 1 + \sum_{i=1}^{k} \sigma(a_i)\, m^{i-1}$. Given ω, σ, and k one can easily recover a_1, a_2, ..., a_k.
Constructing the Linked List for k-Words. Three arrays are required: the linked list array L, an array H (length N − k + 1), which stores the numerical representations of k-words in the sequence, and a "bucket array," B, of length m^k, whose purpose is to hold at register ω, 1 ≤ ω ≤ m^k, the position of the most recent occurrence of the word with the numerical representation ω. B and L should be initialized to zeros. Reading the sequence from the file, perform the following operations for each position j. (i) Calculate ω, the numerical representation of the k-word at j, and store ω in register j of H; (ii) determine the position (call it i) in the ω register of B and, if i ≠ 0, set L(i) = j; (iii) set B(ω) = j.
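The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`kword_codes`, `build_linked_list`, `pointer_chain`) are hypothetical, arrays carry an unused index 0 so that positions stay 1-based as in the text, and each word code is recomputed from scratch rather than updated incrementally.

```python
def kword_codes(seq, k, alphabet="ACGT"):
    """Array H: H[j] is the code (1..m**k) of the k-word at 1-based position j."""
    m = len(alphabet)
    sigma = {letter: value for value, letter in enumerate(alphabet)}
    n_words = len(seq) - k + 1
    H = [0] * (n_words + 1)  # index 0 is unused
    for j in range(1, n_words + 1):
        word = seq[j - 1 : j - 1 + k]
        # omega = 1 + sum of sigma(a_i) * m**(i-1); enumerate starts at 0
        H[j] = 1 + sum(sigma[a] * m**i for i, a in enumerate(word))
    return H


def build_linked_list(seq, k, alphabet="ACGT"):
    """Return (H, L) where L[i] is the next position of the k-word at i, or 0."""
    m = len(alphabet)
    H = kword_codes(seq, k, alphabet)
    n_words = len(seq) - k + 1
    L = [0] * (n_words + 1)
    B = [0] * (m**k + 1)  # bucket array: most recent position of each code
    for j in range(1, n_words + 1):
        omega = H[j]
        i = B[omega]
        if i != 0:
            L[i] = j  # link the previous occurrence to this one
        B[omega] = j
    return H, L


def pointer_chain(L, i):
    """Follow the links i, L(i), L(L(i)), ... until a 0 is reached."""
    chain = []
    while i != 0:
        chain.append(i)
        i = L[i]
    return chain
```

Applied to the 13-nucleotide example AAGCTTAGCTAGC with k = 3, `build_linked_list` returns the linked list quoted in the text, and `pointer_chain(L, 2)` lists the three occurrences of AGC.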
B. An Exact Repeat Algorithm
This approximately linear time algorithm finds all long
repeated words within or between sequences.
We begin by constructing L, the linked list for μ-words, as in section A. As a compromise between storage and efficiency considerations, we generally choose μ such that m^μ ≤ N < m^(μ+1), where N is the composite sequence length.
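As a small illustration of this rule of thumb, the sketch below picks the largest word length whose word count m^μ does not exceed the composite length N. The function name `initial_word_length` is an assumption made for this example; integer arithmetic is used to avoid rounding issues with logarithms.

```python
def initial_word_length(total_length, alphabet_size):
    """Largest mu with alphabet_size**mu <= total_length < alphabet_size**(mu + 1)."""
    if total_length < alphabet_size:
        raise ValueError("sequence is shorter than the alphabet size")
    mu = 1
    while alphabet_size ** (mu + 1) <= total_length:
        mu += 1
    return mu
```

For a 5243-bp DNA sequence (the length quoted for simian virus 40 in the Run Time paragraph) this gives μ = 6, since 4^6 = 4096 ≤ 5243 < 4^7 = 16384.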
Intermediate-Linked Lists. We can construct the linked list for words of length ν = μ + λ, which we will call L*, directly from L provided that 1 ≤ λ ≤ μ. It is convenient to use two additional arrays, S and S* (lengths of N/4 and N/2 should suffice if μ is chosen as above). The ν-word at i is repeated at some position h if and only if h is in the μ-word pointer chain beginning with i, and h + λ is in the μ-word pointer chain beginning with i + λ. For each sequence position i search the μ-word pointer chains beginning with i and i + λ in L. When the first such h is found, set L*(i) = h and store i in the next available register of S*. If no such location is found, continue to the next position (i + 1).
Linked lists for even longer words can be constructed by repeating the same procedure, switching the roles of L and L* and of S and S*, and examining only those positions listed in the array S*. Iterate this process as many times as necessary to reach the length at which we wish to begin printing (e.g., if μ = 5, we could reach a desired printing level of 17 with one intermediate list for 10-words).
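The step from a μ-word list to a (μ + λ)-word list can be sketched as follows. This is a simplified illustration, not the authors' implementation: the helper `linked_list` uses a dictionary in place of the bucket array, `extend_linked_list` collects one pointer chain into a set instead of walking the two chains in step, and both names are hypothetical.

```python
def linked_list(seq, k):
    """1-based linked list for k-words (index 0 unused); dictionary-based helper."""
    n_words = len(seq) - k + 1
    L = [0] * (n_words + 1)
    last_seen = {}
    for j in range(1, n_words + 1):
        word = seq[j - 1 : j - 1 + k]
        if word in last_seen:
            L[last_seen[word]] = j
        last_seen[word] = j
    return L


def extend_linked_list(L, lam, n_long):
    """Build L* for (mu + lam)-words from the mu-word list L.

    n_long is the number of (mu + lam)-words, i.e. N - (mu + lam) + 1.
    Returns (L_star, S_star), where S_star lists positions with a later repeat.
    """
    L_star = [0] * (n_long + 1)
    S_star = []
    for i in range(1, n_long + 1):
        # later occurrences of the mu-word that starts at i + lam
        tail = set()
        p = L[i + lam]
        while p != 0:
            tail.add(p)
            p = L[p]
        # first h in the chain from i whose shifted position h + lam is in tail
        h = L[i]
        while h != 0:
            if h + lam in tail:
                L_star[i] = h
                S_star.append(i)
                break
            h = L[h]
    return L_star, S_star
```

On the example sequence AAGCTTAGCTAGC, extending the 2-word list by λ = 1 reproduces the 3-word list built directly, which is a convenient sanity check for the condition 1 ≤ λ ≤ μ.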
Printing. Construct linked lists for each length ν ≥ t at which we would like output (to determine the exact length of each repeat longer than t, construct linked lists for ν = t, t + 1, t + 2, ...). The decision of which ν-word repeats to print is made immediately following construction of the (ν + λ)-word linked list by comparing L (the linked list for ν-words) with L* (the linked list for (ν + λ)-words).
The locations at which a word is repeated are contained in
the pointer chain beginning with its first occurrence. If all
these occurrences are flanked to the right or left by a
common letter, then the repeat is longer than v, and we do
not print at this level. Otherwise v is the maximal length
(even though some subset of the occurrences may have a
longer common length), and we print the word and all
locations at which it occurs.
For each position i listed in the array S, do the following: If L(i) is negative proceed to the next position; this repeat has been considered previously. Count the length (call it ζ) of the ν-word pointer chain in L beginning with i, and change the value of L to its negative at all locations in the chain except for i. If ζ is greater than some minimum dependent on ν (significance criterion) and the (ν + λ)-word pointer chains in L* beginning with i and i − λ are both shorter than ζ (maximal length criteria), print the repeated word (obtained from its numerical representation in the array H) and its locations (the ν-word pointer chain in L beginning with i).
Multiple Sequences. The problem of finding matches between multiple sequences can be reduced to a problem of finding repeats within a single concatenated sequence. The boundary between two consecutive sequences is marked by inserting a unique letter, not in the m-letter alphabet, to ensure that no words extend across sequence boundaries. The criteria for significance can be changed from a threshold number of repeats to a threshold number of sequences in which there is at least one repeat.
Run Time. When implemented on an IBM 3090 mainframe, the execution time for determining all exact repeats of length ≥6 base pairs (bp) in simian virus 40 (5243 bp) was 0.98 sec; for mouse mitochondrial genome (16,295 bp), 2.31 sec; for λ phage (48,502 bp), 4.11 sec; and for repeats of length ≥12 in Epstein-Barr virus (172,282 bp), 30.39 sec.

FIG. 1. Alignment map of the J5-C genomic region of rabbit, human, and mouse immunoglobulin κ gene. The components of each alignment group are listed in Table 1. Solid blocks in alignment group IV detail the matching clusters in the group.

C. The Repeat Algorithm Allowing for Errors
For a given sequence this algorithm identifies all repeat segments allowing for certain types of mismatch and insertion/deletion errors. A repeat segment is an aggregate of shared exact repeats each of length ≥K separated by error blocks, each at most e letters. An error block may contain matching nucleotides provided they are internal. For example, with K = 5 and e = 3 the oligonucleotides AGAGT(CAG)GTAGA(C)GGATA and AGAGT(TAT)GTAGAGGATA would qualify as a repeat segment, whereas ATCGG(TCTT)GAGGCT and ATCGG(AG)GAGGCT would not.
Generally we let e = 1 for proteins, e = 3 for DNA, and choose K such that m^K ≤ n̄ < m^(K+1), where n̄ is the average length of the sequences considered (m is the alphabet size). Thus, for a corresponding random sequence each K-word would be expected to occur at most m times per sequence.
The number of repetitions of the repeat segments to be found must be specified in advance and is called the multiplicity, r. Repeat segments are formed by aggregating r-fold exact repeats. Within a single sequence we generally use r = 2 to find pairwise repeat segments, but higher-order repeats with errors are also of interest.
The Algorithm. The array H (the numerical representations) and the K-word linked list, L, are constructed as described in section A. Each r-fold exact repeat of length ≥K is then tested to see whether there is an extending repeat within e letters. Exact repeats that cannot be extended are printed provided they are significantly long. Repeats that can be extended are stored with their extensions in a "doublet-linked list" array of dimensions 2r + 2 by N/12. A subsequent pass through the doublet-linked list identifies and prints significant repeat segments composed of various numbers of exact repeats. A detailed description follows for the case r = 2 (each exact repeat is a pair of positions for a repeated K-word). Changes required for r > 2 should be clear.
I. Constructing the Doublet-Linked List. Consider each exact repeat (i, j) in order.
1. If H(i − 1) = H(j − 1), this repeat is embedded in a previously considered repeat, and need not be considered.
2. Determine the maximal length (call it h) of the repeat by comparing H registers (i + 1) with (j + 1), (i + 2) with (j + 2), and so forth until we find a mismatch (between registers i + h − K + 1 and j + h − K + 1).
3. Verify whether there is an extending repeat of length K or greater within e letters. This can be done by using the linked list for K-words to check whether any of the addresses j + h, ..., j + h + e are in the K-word pointer chains beginning with i + h, i + h + 1, ..., or i + h + e.
Case (i): No extending repeat is found: print the single exact repeat if its maximal length meets the printing criteria for no errors.
Case (ii): One or more extending repeats are found: for each extending repeat store the following information in the next available column of the doublet array: positions of the first exact repeat (i, j) and its maximal length h; positions of the extending repeat and its maximal length (determined as in step 2).
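Steps 1 and 2 can be illustrated with a short sketch. The names `word_registers`, `is_embedded`, and `maximal_length` are hypothetical, and for brevity the H registers here hold the K-words themselves rather than their numerical codes; comparing registers for equality works the same way in either case.

```python
def word_registers(seq, K):
    """H[j] holds the K-word at 1-based position j (index 0 unused)."""
    return [None] + [seq[j : j + K] for j in range(len(seq) - K + 1)]


def is_embedded(H, i, j):
    """Step 1: (i, j) lies inside an earlier repeat if the preceding K-words match."""
    return i > 1 and j > 1 and H[i - 1] == H[j - 1]


def maximal_length(H, K, i, j):
    """Step 2: compare registers i + t and j + t until they differ.

    The first mismatch occurs at offset t = h - K + 1, so h = K + t - 1.
    """
    last = len(H) - 1
    t = 1
    while i + t <= last and j + t <= last and H[i + t] == H[j + t]:
        t += 1
    return K + t - 1
```

With the example sequence AAGCTTAGCTAGC and K = 3, the repeat (2, 7) of AGC extends to the 4-letter word AGCT, while the repeat (3, 8) of GCT is recognized as embedded in it.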
II. Printing. Consider the columns of the doublet-linked
list in order.
1. If the first position of the first exact repeat is negative,
move on to the next column; this doublet has already been
considered.
2. Consider whether the doublet can be extended with another error block by searching forward in the doublet-linked list for a doublet whose first exact repeat matches the second exact repeat in the current column (treat any negative entries as though they were positive). If such a doublet
is found, replace the first position of the first exact repeat by
its negative (to indicate it has been considered) and search
for an extension of this new doublet. Continue this process
until no further extension can be found. Print the repeat
segment if it meets significance criteria. It is possible,
although highly unlikely, for a repeat to have several distinct
extending repeats following an error block. When this occurs
each possibility must be considered.
Multiple Sequences. Multiple sequences are treated by
concatenation as in the previous algorithm. To find segments
common to all the sequences (matching segments) we generally set r equal to the number of sequences and consider
only those exact repeats, called matches, that have exactly
one location on each sequence. When comparing more than
five sequences using this algorithm, it is suggested that they
be divided into groups of between three and five.
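A minimal sketch of the concatenation bookkeeping is given below. It assumes a single separator character `#` that does not occur in any sequence (so any K-word containing it would have to be skipped by the caller), and the names `concatenate`, `locate`, and `is_match` are hypothetical helpers, not part of the published algorithm.

```python
import bisect


def concatenate(sequences, separator="#"):
    """Join sequences with a separator letter; return the text and 1-based starts."""
    for s in sequences:
        if separator in s:
            raise ValueError("separator must not occur in any sequence")
    starts = []
    position = 1
    for s in sequences:
        starts.append(position)
        position += len(s) + 1  # +1 for the separator letter
    return separator.join(sequences), starts


def locate(position, starts):
    """Map a 1-based position in the joined text to (sequence index, local position)."""
    index = bisect.bisect_right(starts, position) - 1
    return index, position - starts[index] + 1


def is_match(positions, starts):
    """True when the occurrences fall on every sequence exactly once."""
    hit = sorted(locate(p, starts)[0] for p in positions)
    return hit == list(range(len(starts)))
```

For example, joining ACGT, GGA, and TTAC gives the text `ACGT#GGA#TTAC` with starts 1, 6, and 10; a set of occurrences at positions 2, 7, and 11 has one location on each sequence and so counts as a match in the sense used above.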
Word Relations. This and the preceding algorithm (see
section B), which have been described as they would apply
to the location of exact repeats and repeats with errors,
apply equally well to many other word relations. For example, dyad symmetry (inverted repeat) pairings in DNA or
RNA sequences (e.g., AGCCG ... CGGCT) can be located
by searching for repeats between the sequence and another
copy, which has been reversed and complemented. Each
dyad pairing will appear as two repeats. The same technique
can be applied to any word relation consisting of a letter
relabeling relation and/or sequence reversal.
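The reverse-and-complement idea can be sketched directly. This illustration uses a dictionary of k-words instead of the linked list, assumes a plain ACGT alphabet, and reports each pairing once even though it arises twice; `reverse_complement` and `dyad_pairings` are hypothetical names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def dyad_pairings(seq, k):
    """Sorted 1-based pairs (i, j), i <= j: the k-word at i pairs with the k-word at j."""
    rc = reverse_complement(seq)
    n = len(seq)
    where = {}
    for p in range(n - k + 1):
        where.setdefault(rc[p : p + k], []).append(p)
    pairs = set()
    for i in range(n - k + 1):
        for p in where.get(seq[i : i + k], []):
            # rc[p:p+k] is the reverse complement of seq[n-p-k : n-p]
            j = n - p - k
            pairs.add((min(i, j) + 1, max(i, j) + 1))
    return sorted(pairs)
```

On the sequence AGCCGTTTCGGCT, which contains the example pairing AGCCG ... CGGCT, the function reports the single pairing between positions 1 and 9.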
D. The Alignment Algorithm
Two matches are said to be in close alignment if the distances separating them on the different sequences do not vary too widely. The difference between the greatest and the least of these separating distances is called the alignment error. The slant of a match is the difference between its position in the second sequence and its position in the first sequence. For two matches to be in close alignment (alignment error less than δ), the difference between their slants must be less than δ.
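These definitions translate into a few lines of Python. In this sketch a match is represented as a tuple holding its position on each sequence; that representation and the function names are assumptions made for illustration.

```python
def slant(match):
    """Position in the second sequence minus position in the first."""
    return match[1] - match[0]


def alignment_error(match_a, match_b):
    """Spread of the distances separating two matches on the different sequences."""
    distances = [b - a for a, b in zip(match_a, match_b)]
    return max(distances) - min(distances)


def closely_aligned(match_a, match_b, delta):
    """Close alignment as defined above: alignment error less than delta."""
    return alignment_error(match_a, match_b) < delta
```

For instance, the matches (1118, 885, 840) and (1130, 897, 852) are separated by 12 letters on every sequence and have alignment error 0, whereas (1118, 885, 840) and (1252, 1012, 969) are separated by 134, 127, and 129 letters, giving an alignment error of 7.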
Table 1. Alignment groups of the J5-C genomic region of human, mouse, and rabbit immunoglobulin κ gene

Mouse            Human            Rabbit           Length
Group I
38 5' to J5      37 5' to J5      38 5' to J5      7(0)
6 3' in J5       6 3' in J5       6 3' in J5       9*(0)
8 3' to J5       8 3' to J5       6 3' to J5       5(0)
(illegible)      (illegible)      18 5' in J5      30*(2)
Two sequences only:
111 3' to J5; 115 3' to J5                         16‡(0)
Group II
1118 3' to Ji    885 3' to J5     840 3' to J5     8(0)
1130 3' to Ji    897 3' to J5     852 3' to J5     6(0)
11% 3' to Ji     959 3' to J5     915 3' to J5     15†(1)
1252 3' to Ji    1012 3' to J5    969 3' to J5     5(0)
Two sequences only:
878 3' to J5; 833 3' to J5                         25‡(0)
1091 3' to Ji; 859 3' to J5                        17‡(1)
1187 3' to Ji; 906 3' to J5                        24‡(2)
1001 3' to J5; 958 3' to J5                        20‡(1)
Group III
685 5' to C      752 5' to C      722 5' to C      5(0)
675 5' to C      742 5' to C      712 5' to C      5(0)
661 5' to C      727 5' to C      697 5' to C      7(0)
639 5' to C      705 5' to C      675 5' to C      7(0)
624 5' to C      690 5' to C      660 5' to C      14†(1)
600 5' to C      666 5' to C      636 5' to C      11(1)
Two sequences only:
624 5' to C; 660 5' to C                           52‡(5)
Group IV
17 5' to C       17 5' to C       17 5' to C       6(0)
9 5' to C        6 5' to C        9 5' to C        8(0)
9 5' in C        12 5' in C       9 5' in C        6(0)
24 5' in C       27 5' in C       24 5' in C       9*(0)
56 5' in C       59 5' in C       56 5' in C       5(0)
73 5' in C       76 5' in C       73 5' in C       5(0)
190 5' in C      190 5' in C      190 5' in C      7(0)
202 5' in C      202 5' in C      202 5' in C      8(0)
296 5' in C      290 5' in C      296 5' in C      10†(0)
316 5' in C      310 5' in C      316 5' in C      8(0)
Two sequences only:
24 5' in C; 24 5' in C                             36‡(4)
172 5' in C; 172 5' in C                           37‡(2)
Group V
114 3' to C      86 3' to C       100 3' to C      6(0)
146 3' to C      120 3' to C      136 3' to C      11†(0)
186 3' to C      159 3' to C      177 3' to C      19†(1)
Two sequences only:
118 3' to C; 134 3' to C                           15‡(0)
135 3' to C; 128 3' to C                           19‡(1)

See text for interpretations. The notation 11(1) signifies a matching length of 11 bp with 1 error block. Minimum component match length for matching segments is ≥K (K = 5 bp); error blocks are ≤e (e = 3 bp); maximum alignment error is ≤δ (δ = 6 bp). Alignment blocks are ≥5 bp. In locating matching segments, the following notation is used: "x 5'(3') in gene" indicates that x nucleotides occur in the gene before (after) the 5' nucleotide of the given word; "x 5'(3') to gene" indicates that the 5' nucleotide of the given word occurs x nucleotides upstream (downstream) from the given gene.
*Exceeding the expected length but not statistically significant.
†Statistically significant; P < 0.01.
‡Statistically significant in two of the three sequences; P < 0.01.

FIG. 2. Alignment map of human, ground squirrel, and duck hepatitis B viruses. The components of alignment groups are listed in Tables 2 and 3. Asterisk and bold outline indicate an alignment group for all three viruses.

Significant matching segments are obtained as described in section C and those that are closely aligned are formed into groups. The algorithm then identifies adjuncts to the long segments: matches of some moderate minimum length η, which are closely aligned with and near the long segment. The ensemble of closely aligned long segments and their shorter proximal adjuncts, called an alignment group, indicates a region of substantial similarity among the sequences free of major insertions and deletions.
Explicitly, an exact match of length k is an adjunct to a long matching segment if they are separated by no more than L intervening letters on any of the sequences, and the alignment error is no more than δ (we often use δ = 3 or 6 for DNA; δ = 1 or 2 for peptides). L is determined by the formula $L\,[(\delta + 1)^s - \delta^s] = -\ln(0.9)/[(1 - \lambda)\lambda^k]$, where s is the number of sequences and λ is the probability of a local letter match across all sequences based on the letter frequencies of the individual sequences. Using this formula, the probability of finding an adjunct of length k within L letters of a matching segment on a corresponding random sequence is at most 0.1. Because L decreases as the adjunct length k decreases, shorter adjuncts can provide information only in the immediate vicinity of the long segment. For DNA we usually choose the smallest η such that L ≥ 20 for k = η, with a minimum of η = 4; for peptides we usually pick η such that L ≥ 5 with a minimum of η = 2.
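The choice of η can be explored numerically. The sketch below assumes the window formula reads L[(δ + 1)^s − δ^s] = −ln(0.9)/[(1 − λ)λ^k]; the signs in that reading are an interpretation of the printed formula, chosen so that L is positive and grows with k. The function names and the parameter `p` (the 0.1 probability bound) are hypothetical.

```python
import math


def adjunct_window(k, s, delta, lam, p=0.1):
    """Window L for adjuncts of length k under the assumed formula.

    s: number of sequences; delta: maximum alignment error;
    lam: probability of a letter match across all sequences.
    """
    offsets = (delta + 1) ** s - delta**s
    return -math.log(1 - p) / (offsets * (1 - lam) * lam**k)


def smallest_adjunct_length(s, delta, lam, min_window, floor_length):
    """Smallest k >= floor_length whose window reaches min_window."""
    k = floor_length
    while adjunct_window(k, s, delta, lam) < min_window:
        k += 1
    return k
```

As an illustration only, three DNA sequences with equal base frequencies give λ = 4 × (1/4)^3 = 1/16; with δ = 6 the window for k = 4 is then about 58 letters, so a requirement of L ≥ 20 is already met at the minimum length η = 4.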
The basic strategy of the algorithm is to consider each exact match of length k and to test whether it is an adjunct to any of the long segments. A technique by which each exact match is compared only with those long segments having similar slants reduces this task and allows the vast majority of matches to be discarded without explicit comparisons to any of the long segments.
1. Compile a list of matching segments (as in section C) that are at least the expected size of the longest repeat with the same number of errors in a corresponding random sequence (6). Form groups of those segments that are closely aligned with alignment error ≤2δ (two segments with alignment error >2δ can be in the same group provided that they are both closely aligned with a third segment).
2. Divide the range of possible slants into numbered intervals of width 4δ (if space is a limitation, much wider intervals can be used). The slant s is assigned to interval number 1 + ⌊s/4δ⌋. Use an array (or several arrays) to list for each interval the long segments with nearby slants (either within the interval or within δ of its upper or lower bounds).
3. Construct the linked list for η-words as described in section A and use it to find all exact matches of length η or greater. For each match, determine the slant and the appropriate interval and then check the array to see if there are any long segments associated with this interval. Provided that the match is not embedded in a longer exact match considered previously (see step I.1, section C), determine its maximal length (see step I.2, section C), and then test whether the match is an adjunct to any of the long segments listed. For each segment to which it is an adjunct, store identifying information (starting positions and length of the adjunct, the alignment error, and the group of the long segment) in an array for later printing.
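The slant-interval lookup of step 2 can be sketched as follows. A dictionary stands in for the array of intervals, intervals are taken to have width exactly 4δ, and the names `interval_number`, `index_long_segments`, and `candidate_segments` are assumptions for this example. Because an interval is wider than 2δ, probing the slants s − δ, s, and s + δ finds every interval that a long segment lies in or within δ of.

```python
from collections import defaultdict


def interval_number(s, delta):
    """Interval of width 4*delta that contains slant s (floor division)."""
    return 1 + s // (4 * delta)


def index_long_segments(segment_slants, delta):
    """Map each interval number to the long segments with nearby slants."""
    table = defaultdict(list)
    for segment_id, s in segment_slants.items():
        for probe in (s - delta, s, s + delta):
            number = interval_number(probe, delta)
            if segment_id not in table[number]:
                table[number].append(segment_id)
    return table


def candidate_segments(match_slant, table, delta):
    """Long segments worth testing against a match with the given slant."""
    return table.get(interval_number(match_slant, delta), [])
```

With δ = 3, a long segment of slant 10 is registered in intervals 1 and 2, so a short match of slant 14 is compared with it, while a match of slant 30 falls in an interval with no long segments and is discarded without further comparison.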
Since there are likely to be many fewer long segments than
slant intervals, it is possible to discard most matches on the
basis of their slants after determining only the positions on
Table 2. Alignment groups of human, ground squirrel, and duck hepatitis viruses

Duck             Human            Ground squirrel  Length
Group I
445 3' in S      616 3' in S      622 3' in S      8(0)
408 3' in S      579 3' in S      585 3' in S      21†(2)
312 3' in S      483 3' in S      489 3' in S      5(0)
265 3' in S      436 3' in S      442 3' in S      9*(0)
245 3' in S      416 3' in S      422 3' in S      7(0)
Two sequences only:
408 3' in S; 579 3' in S                           26‡(2)
447 3' in S; 624 3' in S                           13‡(0)
265 3' in S; 442 3' in S                           26‡(1)
Group II
178 3' in S      193 3' in S      193 3' in S      5(0)
163 3' in S      178 3' in S      178 3' in S      11*(1)
100 3' in S      111 3' in S      111 3' in S      6(0)
86 3' in S       98 3' in S       98 3' in S       12†(0)
24 3' to S       12 3' to S       12 3' to S       5(0)
207 3' to S      192 3' to S      192 3' to S      8(0)
279 3' to S      264 3' to S      264 3' to S      5(0)
Two sequences only:
166 3' in S; 181 3' in S                           16‡(1)
100 3' in S; 111 3' in S                           24‡(2)

See legend to Table 1 for notations. Oligonucleotides are located relative to the genes core (C) and surface (S). The alignment map is graphically presented in Fig. 2.
*Exceeding the expected length but not statistically significant.
†Statistically significant; P < 0.01.
‡Statistically significant identity segments between duck and either human or ground squirrel; the human and ground squirrel identities are detailed in Table 3.
Table 3. Alignment groups of human and ground squirrel
hepatitis viruses
Human (squirrel)
Length
Human (squirrel)
Length
Group IV
Group I
9(0)
299(299) 3' in S
7(7) 5' in C
16*(1)
6(0)
258(258) 3' in S
34(36) 5' in C
50t(2)
11(0)
246(246) 3' in S
103(106) 5' in C
14t(0)
228(228) 3' in S
151(154) 5' in C
5t(0)
21t(2)
196(196) 3' in S
49t(2)
157(160) 5' in C
6(0)
5(0)
135(135) 3' in S
5t(0)
193(196) 5' in C
33t(2)
122(122) 3' in S
24t(2)
223(226) 5' in C
9(0)
79(79) 3' in S
350(350) 5' in C
7(0)
8(0)
60(60) 3' in S
391(391) 5' in C
8(0)
6(0)
14(14) 3' in S
5t(0)
406(406) 5' in C
7(7) 3' in S
6(0)
412(412) 5' in C
9(0)
5t(0)
12(12) 3' to S
427(427) 5' in C
6(0)
6t(0)
24(24) 3' to S
441(441) 5' in C
6(0)
454(454) 5' in C
5t(0)
45(45) 3' to S
5t(0)
5t(0)
81(81) 3' to S
6(0)
460(460) 5' in C
10(0)
124(124) 3' to S
469(469) 5' in C
5(0)
479(479) 5' in C
5t(0)
154(154) 3' to S
18t(1)
19t(1)
192(192) 3' to S
15*(1)
506(506) 5' in C
7(0)
223(223) 3' to S
Group II
5(0)
255(255) 3' to S
64t(6)
90(90) 3' in C
6(0)
263(263) 3' to S
16(16) 3' to C
7(0)
44(44) 3' to C
9(0)
315(315) 3' to S
6t(0)
6t(0)
65(65) 3' to C
5(0)
327(327) 3' to S
36t(3)
340(340) 3' to S
77(77) 3' to C
6t(0)
92(92) 3' to C
5t(0)
429(429) 3' to S
5t(0)
441(441) 3' to S
6(0)
160(160) 3' to C
54(0)
5t(o)
7(0)
451(451) 3' to S
167(167) 3' to C
8(0)
463(463) 3' to S
206(206) 3' to C
5(0)
5t(0)
480(480) 3' to S
12*(0)
214(214) 3' to C
11(0)
498(498) 3' to S
5t(0)
254(254) 3' to C
5(0)
513(513) 3' to S
296(296) 3' to C
6(0)
5t(o)
34t(2)
534(543) 3' to S
327(327) 3' to C
18t(0)
23t(1)
576(576) 3' to S
370(370) 3' to C
Group III
6(0)
618(618) 3' to S
8(0)
630(630) 3' to S
6(0)
745(739) 3' in S
734(728) 3' in S
5t(0)
654(654) 3' to S
6(0)
14t(0)
700(694) 3' in S
668(668) 3' to S
5A(0)
676(670) 3' in S
7(0)
713(713) 3' to S
5(0)
16*(1)
737(737) 3' to S
665(659) 3' in S
5A(0)
6(0)
743(743) 3' to S
636(630) 3' in S
7(0)
66t(7)
622(616) 3' in S
9(0)
756(756) 3' to S
16*(1)
17t(1)
775(781) 3' to S
532(526) 3' in S
511(505) 3' in S
Group V
5(0)
19t(1)
495(489) 3' in S
9(0)
83(83) 5' to C
29t(1)
27t(1)
47(47) 5' to C
463(457) 3' in S
34t(3)
422(416) 3' in S
377(371) 3' in S
5(0)
5t(0)
360(354) 3' in S
337(331) 3' in S
6(0)
See legend to Table 1 for notations. Oligonucleotide locations are given for human and (ground squirrel). For this alignment map, δ = 3 bp. The alignment map is graphically presented in Fig. 2.
*Exceeding the expected length but not statistically significant.
†Statistically significant; P < 0.01.
‡Does not meet adjunct criterion strictly but is included because it was in a region of perfect alignment.
the first two sequences. Alignment groups anchored by only
one long segment that is less than the length required for
significance are discarded if there are less than two adjuncts.
This algorithm can be used as the basis for sequence homology scores, which in particular might be useful for constructing phylogenies. It can also be used to extend regions of dyad symmetry within DNA or RNA sequences by applying it to a sequence along with a reversed and complemented copy.
E. Examples
Fig. 1 and Table 1 present the alignment map and coordinates of the J5-C genomic region of the rabbit, human, and mouse immunoglobulin κ gene. The aligned regions extend from the start of the J5 gene segment (13 codons) to ≈300 bp 3' to the C domain (106 codons in human and mouse; 104
codons in rabbit). The J5-C introns consist of 2515, 2750, and
2951 bp in rabbit, human, and mouse, respectively (including
570 unsequenced nucleotides in the human; see Fig. 1).
Alignment group I (≈90 bp) emphasizes the identity of the J5 gene segment and proximal flanks, including the consensus 5' nonamer and the splice junction extending ≈12 bp into the intron. Alignment group II (≈140 bp) contains more repeats among the three sequences than any other region in the J5-C intron and has been proposed as a major control element (8). Alignment group III (≈100 bp) embraces the established enhancer element of the immunoglobulin κ gene (9). Alignment group IV encompasses the whole of the C domain. Alignment group V (≈95 bp) putatively contains a
transcription-termination sequence. The high-conservation
character of the three noncoding alignment groups and the J5
and C domains suggests that they are important functional
elements of the region. For further discussion on these
conserved sequence elements, see ref. 10.
Fig. 2 and Table 2 present the alignment of the human,
ground squirrel, and duck hepatitis B viruses (3182, 3311,
and 3021 bp, respectively). Alignment group I (≈200 bp) is
centered in the surface and polymerase genes. Alignment
group II covers the 3' end of the surface gene and extends
downstream (this is also part of the polymerase gene). A
deletion of ≈160 bp in the duck genome (relative to human
and ground squirrel) is evident.
Fig. 2 and Table 3 present the alignment between the
human and ground squirrel hepatitis B virus genomes.
Groups I and IV emphasize the strong correspondence
between the core gene and its 5' flank. These two groups
combine to a single group in the genome's natural circular
form. Alignment group II corresponds to the 5' end of the
polymerase gene. Groups I and II are offset by an
insertion/deletion of ≈15 bp. Groups III and IV enlarge on
the three-species alignment groups described in Table 2,
extending them to the end of the polymerase gene. It is
interesting that the core region, which is highly conserved
between human and ground squirrel, does not appear in the
three-species comparison where the highly conserved regions are all in the polymerase gene (cf. ref. 11).
This work was supported in part by National Institutes of Health
Grant NIH 2R01 GM10452-23 and National Science Foundation
Grant MCS 82-15131.
1. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48,
443-453.
2. Sellers, P. H. (1974) SIAM J. Appl. Math. 26, 787-793.
3. Wilbur, J. & Lipman, D. J. (1983) Proc. Natl. Acad. Sci. USA
80, 726-730.
4. Nucleic Acids Res. (1983-1987) January issues.
5. Kyte, J. & Doolittle, R. F. (1982) J. Mol. Biol. 157, 105-132.
6. Karlin, S. & Ost, F. (1987) Adv. Appl. Probab. 19, 293-351.
7. Karlin, S. & Ghandour, G. (1985) EMBO J. 4, 1217-1223.
8. Karlin, S. & Ghandour, G. (1985) Mol. Biol. Evol. 2, 35-52.
9. Queen, C. & Baltimore, D. (1983) Cell 33, 741-748.
10. Karlin, S., Ghandour, G. & Foulser, D. E. (1985) Mol. Biol.
Evol. 1, 357-370.
11. Seeger, C., Ganem, D. & Varmus, H. E. (1984) J. Virol. 51,
367-375.
