Multiple sequence alignment by consensus

Volume 14 Number 22 1986
Nucleic A c i d s Research
Multiple sequence alignment by consensus
Michael S.Waterman
I
Departments of Mathematics and Molecular Biology, University of Southern California, Los Angeles,
CA 90089-1113, USA
Received 16 September 1986; Revised and Accepted 20 October 1986
ABSTRACT
An algorithm for multiple sequenoe alignment is given that
matohes words of length and degree of mismatoh ohosen by the
user. The alignment maximizes an alignment sooring funotion.
The method is based on a novel extension of our oonsensus
sequenoe methods. The algorithm works for both DNA and protein
sequenoes, and from earlier work on oonsensus sequenoes, it is
possible to estimate statistioal signifloanoe.
INTRODUCTION
Sequenoe alignment is an important problem motivated by
moleoular biology and many oomputer algorithms have been devised
to aooomplish alignments. Host of the results have been for two
sequenoes. The analytioal work began with the paper of
Needleman and Wunsoh(l) who solved the problem of maximum
similarity alignment for two sequenoes. Later Sellers(2) gave a
related dynamlo programming algorithm for minimum dlstanoe
alignment of two sequenoes. These algorithms were extended to
oover multiple insertions and deletions by Waterman et al.(3).
In Waterman (1984)(4) the subjeot of sequenoe comparisons is
reviewed. It is olear that, while the two sequenoe oase has
been adequately solved for many oases, the situation for
alignment of more than two sequenoes is quite different. Next
we review the history of approaches to multiple sequenoe
alignment. Then we give a new algorithm for multiple sequenoe
alignment that is based on a novel extension of our oonsensus
sequenoe methods.
Sankoff(S) gave the first treatment of whioh we are aware
that considered multiple sequenoe alignment. Els method
requires a tree relating the sequenoes and employs dynamio
© IR L Pnss Limited, Oxford, England.
9095
Nucleic Acids Research
programming as well as parsimony. For oomparing R sequenoes of
length H, the method takes time proportional to z V
(O(2V))
and storage O(Ii ) . Only small sequenoes are praotioal even when
R-3. Sankoff et al.(6) apply the algorithm, modified somewhat,
to 6S sequenoes. In Waterman et al.(3) a similar algorithm is
presented whioh does not assume a tree. The time and spaoe
requirements are similar.
More reoently. Waterman and Perlwitz(7), apply the geometry
of geodesios to sequenoe alignment. The idea of sequenoe is
extended so that a "nuoleotide" is a mixture of A.C.G.T and $
(deletions). The basio algorithm is based on dynamlo
programming, and the sequenoes are aligned and merged two at a
time. When the relationships between the sequenoes are
understood, as when a oorreot evolutionary tree is available,
the method works very well. Still it is a palrwlse algorithm
and the final alignment is dependent on the order in whioh the
sequenoes are prooessed.
Computer soienoe has studied what they oall string matohlng
problems, beginning with Wagner and Pisoher (they prefer
"string" to sequenoe)(8). In Itoga(1981)(9) the string merging
problem is studied, and Hsu and Du(1984)(10) oonslder the
longest oommon subsequenoe of a set of strings. Hone of these
authors seem aware of the earlier results described above.
Sobel and Hartinez(ll) approaoh sequenoe matohlng as a
regions problem, where their algorithm is based on looating all
exaot repeats of patterns whioh ooour in the sequenoe set. The
method of finding repeats takes O(HlogH) operations(12). The
best set of regions making up the alignment is found by longest
path methods from oomputer soienoe. This is perhaps the only
praotioal method available for more than three sequenoes until
the present paper.
Some Interesting results have been obtained for protein
sequenoes. There it is thought that gaps should reoeive a
oonstant penalty, regardless of length. Fredman(13) Improved
the algorithms to O(H 3 ) for aligning 3 sequenoes of length n
with the oonstant gap penalty, whereas the dlreot extension of
Waterman et al. is O(H S ). Hurata et al.(14) give a similar
9096
Nucleic Acids Research
improvement. Johnson and Doolittle(lS) give a method, not
guaranteed to be globally optimal, whioh is based on the
progressive evaluation of seleoted segments from eaoh sequenoe.
There are two reoent methods designed to align multiple
sequences that require more disoussion. That of Sobel and
Martinez(11) is based on aligning segments common to the
sequences; it is frequently of interest to align patterns with
some degree of mismatoh, insertion, and/or deletion. There may
be no segment oommon to all the sequenoes. The method of
Johnson and Doolittle(lB) is based on aligning segments found
within a window of width W and for R sequenoes of length N has
running time proportional to O(R(N-W)W^ ) . For small R this is
a praotloal method, but is not praotioal for a larger number of
sequenoes. Next we present our method to overoome some of these
difficulties.
DESCRIPTION OF THE METHOD
We begin with a set of R sequenoes of length N
a
l , 1*1,2
a
l,H
a
a
a
2,N
2,l 2.2
These sequenoes oan be taken to be initially aligned on some
biologically determined feature. The alignment is, of oourse,
unknown exoept approximately. It is the purpose of this paper
to give an algorithm aligning the sequenoes by matohlng or
aligning on words of a given size. The usual methods of
sequence analysis align on single letters, that is words of
length 1.
A oonoept basio to our algorithm is that of oonsensus word.
The definition has been given in earlier work (Waterman et al.
(1984X16) and Galas et al. (1988)(17)) and will be briefly given
here. First, take a fixed word slze(length) k and a word w of
length k. There are 4 suoh words in DNA and 2CT in proteins.
Next, define the window width W. This parameter gives the width
of sequenoe in whioh a word oan be found and thus defines the
9097
Nucleic Acids Research
amount of shifting allowed In matohlng oonsensus words.
The
sequenoes starting at column J+l with window width W appear as
a
l,J+l a l.J+2
a
2,J+l a 2,J+2
a
R,j+i a R,j+2
a
a
l,J+W
2,j+W
^.J+w
First, we searoh the first sequenoe of the window for matohes
to our word w. An exaot matoh to w Is oalled a d-0 neighbor
while a 1-letter mlsmatoh from w Is oalled a d-1 neighbor, and
so on. It Is possible to lnolude Insertions and deletions In
this list of neighbors. We may deoide, e.g., to limit the
amount of mlsmatoh to d-0,1,2 anil not find w in a portion of
sequenoe unless it is within this neighborhood. Let q^ d equal
the number of lines that the best ooourrenoe of w Is as a d-th
neighbor. Eaoh of these ooourrenoes reoeives weight \^. The
soore of word w In this window is
s
(w) J+l,J+W
t X q.
d d w.d
*
A best word is word w
s
satisfying
(w )
J+l,j+W
max
W
s (w).
j+l,J+W
The idea of the algorithm is to align on oonsensus words,
attempting to maximize the sum of the soores of the words.
Before the praotloal algorithms are presented, a more general
oonoept of alignment on words is presented.
How we define a partial order on words. The words w^^ and w g
satisfy Wj < w 2 if the ooourrenoes of w. in sequenoe 1 are to
the left of the ooourrenoes of w_ in sequenoe 1 (and do not
lnterseot) for 1-1 to R. It is not neoessary for w^ or w 2 to
have ooourrenoes in all sequenoes. Implloit In the definition
is & window width W and neighborhood speoifloatlon. The goal of
an optimal alignment Is to find words w. whioh satisfy
(It Is frequently desirable to require s(Vj)2o for all 1, where
o is some crutoff value.) It is not possible to aooomplish this
9098
Nucleic Acids Research
goal in reasonable time, but it is possible to oome quite olose.
We now define two praotioal algorithms.
Next w. w. means that consensus words v, and w o oan be found
in non-overlapping windows, eaoh word satisfying as usual the
window width and neighborhood constraints. The modified
optimization problem is to satisfy
T - max{Zlils(wi) : wjwj...}.
There is a straightforward reoursion to find T. Let T ± be the
maximum sum for the sequenoes from base 1 to base i:
a
a
l,l ""' a l,l
2,l • • a 2,l
t,l " - • *R,i
Then T. satisfies
T
i " »«<TJ
+ s
j+l.l : *-™ * 1
a°d T_ w - T _ w + 1 - T Q - Tx - T k _ x - 0. Also S ^ - 0
if y-x+l<k. This algorithm runs in time approximately
proportional to NWTIB where B - neighborhood size. Here the
faotor of WRB aooounts for the oonsensus word algorithm with a
window width W. (This is an overestimate sinoe the aotual
windows vary from k to V In width.)
If muoh shifting is neoessary to matoh the sequenoes, T is an
underestimate and misses some of the relevant matohing. To
overoome this problem, the definition of T is modified to
S
l - ^ { S j + s J + l i : 1-W+1 * i S i-k}.
where B , + , , is the largest soorlng oonsensus word in the
window from J+l to 1 suoh that all ooourrenoes of the oonsensus
word are to the right of the oonsensus words for S.. This
algorithm is not guaranteed to be equal to the global maximum,
but it is muoh more useful than T.
KTCAMPT.K
The algorithm is now illustrated with an alignment of 34 5S
sequenoes from B. ooli and related organisms that were obtained
9099
tggc..ecatageg..gtgg..aaac.accc
ccat.tccg.aaca..cagaagttaagc
gegecgat....tagt.. .tggg.
.gaga..agga
Figure 1.
mismatoh.
An alignment of 34 5S rRNA sequenoes by algorithm S
with k-4, W-6, and up to 1
TCCC tggcGCCAGTagcgCGgtggTCCC
acctGACC ccat gecg aact cagaagtgaaacGCCGTAgcgccgatGG
;agtG
tgggGTCTCCCCATGC gagaGTagggAACTGCCAGGCATTGTC tggcGGcoatag.cgCAgtga.TCCC
acctGATC ccat gecg aact cagaagtgaaacGTTGTAgcgccgatGA
:ggtG
tgggGTCTCCCCATGT gagaGTagggAACTGCCAGGCAT
TGCT tggcGAccatagcgTTatggACCC
acctGAT ccctTgccg aact cagtagtgaaacGTAATAgcgccgatGG
:agtC
tgvgGTCTCCCCATGT gagaGTaggaCATCGCCAGGCAT
TGCT tggcGAcCatagcgATttggACCC
acctGACTTCCATTCCGAACTCAGAAGTGAAACGAATTAGCGCCGA
:ggtAGTGtgggGCTTCCCCAT
gtgaG agtaGCACATCGCCAGGCTT
TTCT tgacGAccatagagCAttgg aacc acctGATC ccat cccg aact cagtagtgaaacGATGCAtcgccgatGG
:agtG
tgggGTTTCCCCATGT gagaGTaggtCATCCTCAAGATT
TGCT tgacGAtcatagagC gttg gaacCacctGATC ccat cccg aact cagcagtgaaacGACGCAtcgccgatGG- agtG
tgggGTTTCCCCATGT aagaGTaggtCATCGTCAGGCGC
TGCT tgacGAtcatagagC gttg gaacCacctGAT ccct tcccGaact cagaagtgaaacGACGCAtcgccgatGG
agtG
tgggGTCTCCCCATGT gagaGTaggtCATCGTCAAGCTC
GTC
actGCGTCAAAAGACGTGG
gagaGTaggtCACCGCCAGACC
tggtGGccaaagcaCGAGCA aaac acccGATC ccat cccg aactCGGC cgttaagtGCCGTAgcgccaatGG
:ggtAC
tgcgTCTTAAGGCGTGGgagaGTaggtCGCCGCCAGGCCT
TGGCCTGCTCGTCATTGCGGGCTCGAA
acacCCGATCCCA tcccGAACTCGGCCGTGAAAGAGCCCTGCGCCAA
;ggtAC tgggACCGCAGGGTCCTGGGAG agtaGGTCGGTGCGGGGGAT
AATCCCCCGTGCCCTTAGCGGCGTGG
aaccACCCGTTCCCATTCCGAACACGGAAGTGAAACGCGCCAGCGCCGA
AATCCCCCGTGCCCATAGCGGCGTCG
:ggtAC tgggCGGGCGACCGCCTGGGAG agtaGGTCGGTGCGGGGGAT
aaccACCCGTTCCCATTCCGAACACGGAAGTGAAACGCGCCAGCGCCGA
TCC
tggtGTctatggcgGTatgg aacc actcTGACCccat cccg aact cagtTGTGaaacATACCTgcggCAACGA
LagcTCCCGGGTASCCGGTCGCTAAAATAGCTCGACGCCAGGTC
TTCC tggtGTCTCTagcgCTttgg aaccACTTCCATTCCA
gggaAAATAGCTCGATGCCAGGATT
tcccGAACrCGATTGTGAAACTTTCCTGCGGCTAAGATACTTGCTGGGTTGCTGGCT
TT
tggtGGcgatagcgAAgaggTCacacCCGTTC
ccat accg aaca cggaagttaagcTCTTCAgcg.ccgfitGG
•gtCG ggggTTTCCCCCTGT gagaGT«ggaCGCCGCCAAGC
TT
tggtGGcgatagcgAAgaggTCacacCCGTTC
ccat gecg aaca cggaagttaagcTCTTCAgcgccgatGG
agtTG ggggCTTCCCCCTGT gagaGTaggaCGCCGCCAAGC
CCTAGTGA
caatagcgGAgagg aaac acccGTTC ccat cccg aaca cggaagttaagcTCTCCAgcgccgatGG
agtTGGGGCCAGCCCCCCTGC
aagaGTaggtCGTTGCTAGGC
TC
tggtGActatagcgGAgggg caac acccGTAC ccat cccg aaca cggaCGTGaagaCCTCCAgcgccgagAA tactGGG agggCAGCCTCCTGG gaaaGTaggtCGTTGCCAGGC
TGTTGTGATGATCGCATT
gaggTCacacCTGTTC
ccat accg aaca cagaagttaagcTCAATAgcgccgaaAG tagtTGGAGGATCTCTTCCTGC
gaggATaggaCGTCGCAATGC
TG
tggtGGcgatagccTGAAGG atac acctGTTC ccat gecg aaca cagaagttaagcTTCAGCacgccgatAG tagtTG ggggATCGCCCCCTGC gaggATaggaCGTTGCCACGC
TG
tggtGGcgatagcgAGAAGG atac acctGTAA ccat gecg aaca cagaagttaagcTTCTTAgcgccgatTG tagtGA ggggGTTGCCCCTTGT gagaGTaggaCGTCGCCACGC
gagaGTaggaCGTCGCCACGC
B. br«vl« TC
tggtGATGATggcgGAgggg acac acccGTTC ccat accg aacaCGGC cgttaagcCCTCCAgcgccaatGG tactTGCTCCGCAGGGAGCCGG
ctatGACTTAgagg taac actcCTTC ccat tccg aacaGGcaggTT aagcTCTAATgtgctgatGG tactGC agggGAAGCCCTGTGG aagaGTaggtCGACGCTGGGT
C. paitau TCCAGTGT
tggtGCTTATGGCATAgtgg acac acccGTTC ccat cccg aaca cggaagttaagcACTATTacgccgacGA tagcCGCAAGGTGAAAATAGGGCAGTGCC
aggt
Splropla TC
Hycop. ca TTGG tggtATAGCATAGAGGTC
tgtgAGAAAATAGCAAGCTGCCAGTT
acacCTGTTCCCATGCCGAA
cacaGAAGTTAAGCTCTATTACGGTGA agatAT tactCA
Kycop. my TTGG tggtATAGCATAGAGCTC
tactTA
tgtgAGAAGATAGCAAGCTGCCAGTT
acacCTGTTCCCATGCCGAA
caCaGAAGTTAAGCTCTATTACGGTGAAAATAT
Klcrococ GTTA cggcGGctatagcgT ggggG aaac gcccGG
gggaG agtaGGTCGCCGCCGTGA
ccgtATATCGaacc cggaagctaagcCCCATAgcgccgatGGTTACTGTAACCGGGAGGTTGT
Str«ptom GTTTCGGTGG tcatagcgT gaggG aaac gcccGGTT acat tccg aacc cggaagctaagcCTTACAgcgccgatGG tactGC agggGGGACCCTGT
gggaG agtaGGACGCCGCCGAACT
whtat. ffltAAACCGGGCACTACGGTGAGACGTGAAAacacCCGATCCCATTOX»CCTCGATATATATGTGGAATCGTCTTtX^XCATATGTACTGAAATTGTTC
gggaGACATGGTCAAAGCCCGGAAA
Hbb. «ml2 TAGGTTTGGCCGTCATAGCG atggGGTATCACCTGGTCTCGTTTCGATCCCA<aAGTTAAGTCTTT
tcgcGTTTTGTTTGTGTACTATGGGTTCCGGTCTAT gggaATTTCATTTAGCTGCGAGCTTTTT
Hip. hung TCAA tagcGGecacagcaG gtgtGTCAC acccGTTC ccat tccg aaca cggaagttaagaCACCTCACGTggatGACGGTAC
tgagGTACGCGAGTCCTCCGGAAATCATCCTCGCTGCTATTGTT
Hb. aalln TTA
aggcGGceatagcgGTggggTTAC
tcccGTAC ccat cccg aaca cggaagataagcCCGCCTgcgtTCCGGT cagtAC
tggaGTGCGCGAGCCTCTGGGAAATCCGGTTCGCCGCCTACT
He. norrh TTA
aggcGGccacagcgGCggggCGAC
tcccGTAC ccat cccg aacaCGGC agataagcCCGCCAgcgtTCCAGC gagtAC
tggaGTGTGCGAACCTCTCGGAAAACTGGTTCGCCGCCTCCC
T. acldop GGCAACGG
tcatagcaGCAGGG aaac accaGATC ccat tccg aactCGAC ggttaagcCTGCTGCGTATTGCGTTGtactGTATGCCGCGAGGGTAC
gggaAGCGCAATATGCTGTTACCACT
8. acldoc GCCCACCCGGTCAC agtgA gcggG caac acccGGAC teat ttcg aacc cggaagttaagcCGCTCACGTTAGTGGGGCCG
tggaTACCGTGAGGATCCGCAGCCCCACTAAGCTGGGATGGGTTTT
comansui
E. Coli
P. Vulgar
Photobac
Banlckla
p flour
Axo. vino
P. a«rugi
Paracocc
R. rubrum
Th. aquat
Th. th«rm
Anacystl
Proohlor
B. »ubtil
B. lichen
B. ftttaro
B. acldoc
Lac. vlrl
Lac. br«v
Stropt. f
Nucleic Acids Research
from the oolleotion of Olsen and Paoe. The? oan be found in
GenBani or the review paper of Erdmann et al.(18).
An alignment of these sequenoes by algorithm S with k-4 and
W-6, allowing up to 1 mlsmatoh is given in Fig. 1. The
parameter X^ - (k-d)/k, where d equals the number of mismatohes.
Algorithm T whioh does not allow overlapping windows has soore
304.75, while algorithm 8 has soore 364.00. Some soores for
various oholoes of the parameters are given next:
W - window width
6
6
8
6
8
8
8
10
10
10
It — word size
4
4
4
4
6
6
6
6
6
6
• mismatohfifl
0
1
0
1
0
1
2
0
1
2
g
117.00
364.00
196.00
487.2B
23.00
124.00
269.17
82.00
213.67
329.00
DISCUSSION
The existing multiple sequenoe alignment programs based on
our algorithm are written for DKA sequenoes, but they oan easily
be extended to proteins. (We plan to do this.) The largest
protein word possible is k-3, sinoe 20-8,000. while words with
k-8 or 9 is possible with DNA. Whenever the amino aoid alphabet
is deoreased from the usual 20-letter alphabet, k oan be
inoreased.
Another feature we have not lnoluded in our program is a oost
for unmatohed letters (deletions/insertions). This oould easily
be done as Martinez does, for example, but we are not yet
persuaded of the neoessity here.
Statistloal signifioanoe is always an issue in sequenoe
alignment. Frequently the sequenoes are randomly permuted, and
the optimal alignment soores obtained from the random sequenoes.
Statistical signifioanoe is estimated from these soores. Here
we have a more dlreot approaoh. It is possible to oaloulate the
9101
Nucleic Acids Research
probability that & word ooours In the speolfled neighborhood and
In the window of one sequenoe. Then the binomial distribution
or the theory of large deviations (for large R)(16) oan be used
to oaloulate the probability of ever seeing a matched word in L
out of the R sequenoes. By requiring this probability to be
small (e.g., .01), we oan be assured that every matohed word in
an alignment is significant at that level of significance.
The implementation for DNA sequenoes is written in the C
language and is available from the author. Please inquire for
details.
ACKNOWI1KIX1MKNT.fi
The 8S sequenoes were reoelved from Gary Olsen and Norm Paoe.
Assistance from Mark Eggert and Felloltas Smith is greatly
appreciated.
This wort was supported by grants from the System Development
Foundation and the National Instutes of Health(GM 36230).
REFERENCES
1. Needleman, S.B. and Wunsoh, C D .
(1970). ,T. Mol. Biol. 48,
444.
2. Sellers, P. (1974). SIAM J. Appl. Math. 26, 787.
3. Waterman, M.S., Smith, T.F., and Beyer, W.A. (1976). Adv.
Maib^. 20, 367.
4. Waterman, M.S. (1984). Bull. Math. Biol. 46, 4, 473-500.
5. Sankoff, D. (1975). SIAM J. Appl. Math. 78, 38-42.
6. Sankoff, D., Cedergren, R.J., and Lapalme, G. (1978). iL.
Mol. Evol. 7, 133-149.
7. Waterman, M.S. and Perlwitz, M. (19S4). Bull. Math. Biol.
46, 4, 867-577.
8. Wagner and Flsoher. (1976). J. of the A B H O O . for Computing
MA.<VM T^f<7;y 23, 50—57.
9. Itoga, 8.Y. (1981). E H 21, 20-30.
10.Hsu, W.J. and Du, M.W. (1984). B H 24. 48-89.
ll.Sobel, E. and Martinez, H. (1986). Nuol. Aolds Res. 14, 1,
363-374.
12.Martinez. H. (1983). Nuol. Aolds Res. 11, 4629-4634.
13.Fredman. M.L. (1984). Bull. Math. Biol. 46, 4, 883-566.
14.Murata, M., Rlohardson, J. , and Sussman, L. (1988). ElQQ^.
Natl. Aoad. Sol. PSA 82, 3073-3077.
18.Johnson, M.S. and Doollttle, R. (in press). J. Mol. Evol.
16.Waterman, M.S.. Arratla, R., and Galas, D. (1984). Bull.
Math. Biol. 46, 818-627.
17.Galas, D.J., Eggert, M., and Waterman, M.S. (1988). J. Mol.
BloJ^. 186, 117-128.
18.Erdmann, V.A., Wolters, J., Huysmans, E. , A.TK1 R. de Waohter.
(1985). Nuol. Aolds Res.. 13. rlO5.
9102

Download Report

Multiple sequence alignment by consensus

Paperzz.com

Your Paperzz