A Hidden Markov Model for
Progressive Multiple
Alignment
Ari Löytynoja and Michel C.
Milinkovitch
Appeared in Bioinformatics, Vol. 19, No. 12, 2003
Presented by Sowmya Venkateswaran
April 20, 2006
Outline
Motivations
Drawbacks of existing methods
System and Methods
Substitution Model
Hidden Markov Model
Pairwise Alignment using Viterbi Algorithm
Posterior Probability
Multiple Alignment
Results
Discussion
Motivation
Progressive alignment techniques are widely used for multiple sequence alignment.
Multiple alignments are used to deduce phylogenies and to identify protein families.
Probabilistic methods can be used to estimate the reliability of global/local alignments.
Drawbacks of existing Systems
Iterative application of global/local pairwise
sequence alignment algorithms does not
guarantee a globally optimum alignment.
The best-scoring alignment may not correspond to the true alignment; hence the reliability of a score/alignment needs to be inferred.
System and Methods
The idea is to perform progressive alignment along a guide tree within a probabilistic framework, defining a vector of character probabilities at each site.
The guide tree is constructed by neighbor-joining clustering of a distance matrix; it can also be imported from ClustalW.
At each internal node, a probabilistic alignment is performed. Pointers from parent sites to child sites are stored, as well as a vector of probabilities over the different character states ('A/C/T/G/-' for nucleotides, or the 20 amino acids plus the gap).
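As a rough illustration of the per-site data kept at an internal node (a sketch only; the names and layout are assumptions, not ProAlign's actual data structures):

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Site:
    # probability of each character state at this site,
    # e.g. {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.05, '-': 0.05}
    probs: Dict[str, float]
    # index of the matched site in the left/right child sequence,
    # or None where that child carries a gap
    left_child: Optional[int]
    right_child: Optional[int]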
Substitution Model
Consider two sequences x_{1…n} and y_{1…m} whose alignment we want to find, and whose parent in the guide tree is z_{1…l}.
p_a(x_i) is the probability that site x_i contains character a.
At a terminal node, p_a(x_i) = 1 if character a is observed at that site, and 0 otherwise.
At internal nodes, different characters have different probabilities that sum to 1.
If the observed character is ambiguous, the probability is shared among the possible characters.
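For example (an illustration, not a case from the paper), the nucleotide ambiguity code R (A or G) would presumably be encoded as p_A(x_i) = p_G(x_i) = 0.5, with 0 for the other states.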
Emission Probabilities
p_{x_i,y_j} represents the probability that x_i and y_j are aligned:
p_{x_i,y_j} = p_{z_k}(x_i, y_j) = \sum_a p_{z_k=a}(x_i, y_j)
p_{z_k=a}(x_i, y_j) = q_a ( \sum_b s_{ab} p_b(x_i) ) ( \sum_b s_{ab} p_b(y_j) )
q_a is the background probability of character a.
s_{ab}, the probability of aligning characters a and b, is calculated with the Jukes-Cantor model:
s_{ab} = 1/n + ((n-1)/n) e^{-(n/(n-1)) v}   when a = b
s_{ab} = 1/n - (1/n) e^{-(n/(n-1)) v}        when a ≠ b
n is the size of the alphabet, and v is the NJ-estimated branch length.
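A minimal sketch of this substitution model in Python (alphabet size n and branch length v are the inputs; illustrative only, not ProAlign's source):

import math

def jukes_cantor(n, v):
    # Returns (s_same, s_diff): s_ab for a == b and for a != b.
    decay = math.exp(-(n / (n - 1.0)) * v)
    s_same = 1.0 / n + (n - 1.0) / n * decay
    s_diff = 1.0 / n - 1.0 / n * decay
    return s_same, s_diff

# Example: nucleotides (n = 4) on a branch of length v = 0.1 gives
# s_same ≈ 0.906 and s_diff ≈ 0.031, with s_same + 3 * s_diff = 1.
s_same, s_diff = jukes_cantor(4, 0.1)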
Probabilities
p_{x_i,-}, the probability that z_k evolved into a character at one child site and a gap at the other, is given by
p_{z_k=a}(x_i, -) = q_a ( \sum_b s_{ab} p_b(x_i) ) s_{a-}
The same applies for p_{-,y_j}; s_{a-} is computed just like s_{ab}.
Any other model can be used to calculate s_{ab} instead of the Jukes-Cantor model. For example, a PAM (20 × 20) substitution matrix can be extended with gaps into a (21 × 21) matrix, and the substitution probabilities derived from that.
Hidden Markov Model
[State diagram: a match state M emitting p_{x_i,y_j}, and insert states X and Y emitting p_{x_i,-} and p_{-,y_j}. M loops with probability 1-2δ and moves to X or Y with probability δ each; X and Y loop with probability ε and return to M with probability 1-ε.]
Hidden Markov Model
δ – probability of moving to an insert state (gap-opening penalty); the lower the value, the higher the penalty.
ε – probability of staying in an insert state (gap-extension penalty); again, the lower the value, the higher the extension penalty.
p_{x_i,y_j}, p_{x_i,-}, p_{-,y_j} – emission frequencies for the match, insert X, and insert Y states.
For testing purposes, δ and ε were estimated from pairwise alignments of terminal sequences such that δ = 1/(2(l_m + 1)) and ε = 1 - 1/(l_g + 1), where l_m and l_g are the mean lengths of match and gap segments.
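For illustration (with assumed values, not from the paper): a mean match length l_m = 50 and mean gap length l_g = 3 give δ = 1/(2 · 51) ≈ 0.0098 and ε = 1 - 1/4 = 0.75.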
Pairwise Alignment
In this probabilistic model, the best alignment between 2
sequences corresponds to the Viterbi path through the
HMM.
Since there are 3 states in the model and each state needs a 2-D table, we have three 2-D tables: vM for the match state, and vX and vY for the gap states.
A move within M, X or Y tables produces an additional
match or extends an existing gap. A move between M
table and either X or Y table closes or opens a gap.
Viterbi Recursion
Initialization:
v(0,0) = 1;  v(i,-1) = v(-1,j) = 0
Recursion:
vM(i,j) = p_{x_i,y_j} · max{ (1-2δ) vM(i-1,j-1), (1-ε) vX(i-1,j-1), (1-ε) vY(i-1,j-1) }
vX(i,j) = p_{x_i,-} · max{ δ vM(i-1,j), ε vX(i-1,j) }
vY(i,j) = p_{-,y_j} · max{ δ vM(i,j-1), ε vY(i,j-1) }
Termination:
vE = max( vM(n,m), vX(n,m), vY(n,m) )
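A sketch of this recursion in Python (assumptions: the emission probabilities are supplied as callables p_match(i, j), p_x(i) and p_y(j) with 1-based indices, and the fill is done in log space to avoid numerical underflow, which the slides do not discuss):

import numpy as np

def viterbi_fill(n, m, p_match, p_x, p_y, delta, eps):
    NEG = -np.inf
    # vM, vX, vY hold log probabilities; index 0 is the empty prefix.
    vM = np.full((n + 1, m + 1), NEG)
    vX = np.full((n + 1, m + 1), NEG)
    vY = np.full((n + 1, m + 1), NEG)
    vM[0, 0] = 0.0                      # log(1); v(0,0) = 1 on the slide
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:         # match: consumes x_i and y_j
                vM[i, j] = np.log(p_match(i, j)) + max(
                    np.log(1 - 2 * delta) + vM[i - 1, j - 1],
                    np.log(1 - eps) + vX[i - 1, j - 1],
                    np.log(1 - eps) + vY[i - 1, j - 1])
            if i > 0:                   # insert X: x_i against a gap
                vX[i, j] = np.log(p_x(i)) + max(
                    np.log(delta) + vM[i - 1, j],
                    np.log(eps) + vX[i - 1, j])
            if j > 0:                   # insert Y: y_j against a gap
                vY[i, j] = np.log(p_y(j)) + max(
                    np.log(delta) + vM[i, j - 1],
                    np.log(eps) + vY[i, j - 1])
    return max(vM[n, m], vX[n, m], vY[n, m])   # log of vE

A traceback over the same tables (recording the argmax at each cell) recovers the best alignment path.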
Viterbi traceback
At each cell, the relative probabilities of having entered it from the different predecessor states are stored (a small sketch follows at the end of this slide). For example,
pM-M = (1-2δ) vM(i-1,j-1) / N(i,j)
where N(i,j) is the normalizing constant, given by
N(i,j) = (1-2δ) vM(i-1,j-1) + (1-ε) [ vX(i-1,j-1) + vY(i-1,j-1) ]
Analogous quantities are calculated for each of the 3 tables.
A traceback algorithm is used to find the best path; a match step creates pointers from the parent site to both child sites, whereas a gap step creates a pointer to one child site and a gap for the other.
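A small sketch of this normalization for a match cell (probability space, unlike the log-space Viterbi sketch above; vM, vX, vY, delta and eps as before):

def match_predecessor_probs(vM, vX, vY, i, j, delta, eps):
    # Relative probabilities that the match cell (i, j) was entered
    # from the M, X or Y table, normalized by N(i, j).
    from_M = (1 - 2 * delta) * vM[i - 1][j - 1]
    from_X = (1 - eps) * vX[i - 1][j - 1]
    from_Y = (1 - eps) * vY[i - 1][j - 1]
    N = from_M + from_X + from_Y
    return from_M / N, from_X / N, from_Y / N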
Posterior Probabilities: Forward Algorithm
The forward algorithm computes the sum of the probabilities of all paths entering a given cell from the start position.
Initialization:
f(0,0) = 1;  f(i,-1) = f(-1,j) = 0
Recursion (i = 0,…,n; j = 0,…,m; except (0,0)):
fM(i,j) = p_{x_i,y_j} [ (1-2δ) fM(i-1,j-1) + (1-ε) ( fX(i-1,j-1) + fY(i-1,j-1) ) ]
fX(i,j) = p_{x_i,-} [ δ fM(i-1,j) + ε fX(i-1,j) ]
fY(i,j) = p_{-,y_j} [ δ fM(i,j-1) + ε fY(i,j-1) ]
Termination:
fE = fM(n,m) + fX(n,m) + fY(n,m)
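The forward fill mirrors the Viterbi sketch above, with sums in place of max; a sketch in probability space (scaling against underflow is omitted, and the same assumed emission callables are used):

import numpy as np

def forward_fill(n, m, p_match, p_x, p_y, delta, eps):
    fM = np.zeros((n + 1, m + 1))
    fX = np.zeros((n + 1, m + 1))
    fY = np.zeros((n + 1, m + 1))
    fM[0, 0] = 1.0                      # f(0,0) = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i, j] = p_match(i, j) * ((1 - 2 * delta) * fM[i - 1, j - 1]
                                            + (1 - eps) * (fX[i - 1, j - 1] + fY[i - 1, j - 1]))
            if i > 0:
                fX[i, j] = p_x(i) * (delta * fM[i - 1, j] + eps * fX[i - 1, j])
            if j > 0:
                fY[i, j] = p_y(j) * (delta * fM[i, j - 1] + eps * fY[i, j - 1])
    fE = fM[n, m] + fX[n, m] + fY[n, m]
    return fM, fX, fY, fE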
Backward algorithm
The backward algorithm computes the sum of the probabilities of all possible alignments between the subsequences x_{i…n} and y_{j…m}.
Initialization:
b(n,m) = 1;  b(i,m+1) = b(n+1,j) = 0
Recursion (i = n,…,1; j = m,…,1; except (n,m)):
bM(i,j) = (1-2δ) p_{x_{i+1},y_{j+1}} bM(i+1,j+1) + δ [ p_{x_{i+1},-} bX(i+1,j) + p_{-,y_{j+1}} bY(i,j+1) ]
bX(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} bM(i+1,j+1) + ε p_{x_{i+1},-} bX(i+1,j)
bY(i,j) = (1-ε) p_{x_{i+1},y_{j+1}} bM(i+1,j+1) + ε p_{-,y_{j+1}} bY(i,j+1)
Reliability Check
Assumption: the posterior probability of the sites on the alignment path is a valid estimator of the local reliability of the alignment, since it gives the proportion of the total probability that corresponds to all alignments passing through cell (i,j).
The posterior probability for a match is given by:
P(x_i ◊ y_j | x, y) = fM(i,j) bM(i,j) / fE
where fM(i,j) and bM(i,j) are the total probabilities of all possible alignments between the subsequences x_{1…i} and y_{1…j}, and x_{i…n} and y_{j…m}, respectively (a small sketch follows at the end of this slide).
Similar probabilities are calculated for Insert X and Insert
Y states too.
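A sketch of this reliability estimate, assuming fM comes from a forward fill like the one above and bM from the analogous backward fill (both numpy arrays in probability space):

def match_posteriors(fM, bM, fE):
    # P(x_i ◊ y_j | x, y) for every cell (i, j), as on this slide
    return fM * bM / fE

def path_reliability(posterior, path):
    # minimum posterior probability along a traceback path [(i, j), ...],
    # usable for flagging unreliably aligned regions
    return min(posterior[i, j] for i, j in path)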
Multiple alignment
Each parent node site has a vector of probabilities
corresponding to each possible character state
(including the gap). For a match,
p_a(z_k) = p_{z_k=a}(x_i, y_j) / \sum_b p_{z_k=b}(x_i, y_j)
(a small sketch of this update follows at the end of this slide)
Pairwise alignments are performed progressively, from the terminal nodes towards an arbitrary root.
Once the root node is reached, a trace-back is done to find the multiple alignment of the nodes below it, since each node stores pointers to the matching child sites.
If a gap occurs at one of the internal nodes, a gap character state is introduced in all of the sequences of that sub-tree, and the recursion does not proceed further down that branch.
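A sketch of the parent-profile update for a single match column, reusing the notation of the emission-probability slide (q is the vector of background probabilities q_a, S the matrix of substitution probabilities s_ab, px and py the child probability vectors; illustrative only, not ProAlign's code):

import numpy as np

def parent_profile(q, S, px, py):
    # p_{z_k=a}(x_i, y_j) = q_a * (sum_b s_ab p_b(x_i)) * (sum_b s_ab p_b(y_j)) for every a
    unnormalized = q * (S @ px) * (S @ py)
    # p_a(z_k): normalize over all character states
    return unnormalized / unnormalized.sum()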
Testing
Algorithms tested on
(i) simulated nucleotide sequences
50 random data sets were generated using the program Rose. A random root sequence (length 500) was evolved along a random tree to yield sequences of 'low' (0.5 substitutions per site) and 'high' (1.0) divergence. The insertion/deletion length distribution was set to either 'short' or 'long'.
(ii) Amino-acid data sets from Ref1 of the BAliBASE database. Ref1 contains alignments of fewer than 6 equidistant sequences, i.e. the percent identity between any 2 sequences is within a specified range, with no large insertions or deletions. Data sets were divided into 3 groups based on length, and further into 3 based on similarity.
Results of Simulation on Nucleotide
Sequences
Type 1 and Type 2 errors vs. minimum posterior probability
Performance and Future Work
ProAlign performs better than ClustalW for the nucleotide sequences, but not for amino-acid sequences with sequence identity below 25%.
A possible reason is that the model does not take protein secondary structure into account; the HMM could therefore be extended to model secondary structure as well.
The minimum posterior probability correlates well with correctness; it can be used to detect and remove unreliably aligned regions.