GTR model PROSITE – protein families database

GTR model
Criticism: the stationary probabilities π1, π2, π3, π4 of the letters observed in DNA are not equal, so the πi’s should be model parameters.
GTR (General Time Reversible) is the most general time-reversible model.
Universal assumption – time reversibility:
πi P(j|i,t) = πj P(i|j,t) ⇒ πi Rj,i = πj Ri,j
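Parameters: the πi’s and 6 more independent parameters α, β, χ, δ, ε, η:
\[
R = \begin{pmatrix}
* & \pi_1\beta & \pi_1\alpha & \pi_1\chi \\
\pi_2\beta & * & \pi_2\delta & \pi_2\eta \\
\pi_3\alpha & \pi_3\delta & * & \pi_3\varepsilon \\
\pi_4\chi & \pi_4\eta & \pi_4\varepsilon & *
\end{pmatrix}
\]
where * marks the diagonal entries.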
The more general the model, the more adequate it is, but … the more parameters must be set by the user.
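The structure of R can be checked numerically. The sketch below is only an illustration: the function names and the numeric parameter values are hypothetical, and it assumes that R[i][j] is the rate of change from letter j to letter i (as the pairing of Rj,i with P(j|i,t) suggests) and that each diagonal entry is chosen so that its column sums to zero.

```python
def gtr_rate_matrix(pi, beta, alpha, chi, delta, eta, epsilon):
    """Build a 4x4 GTR matrix with off-diagonal entries R[i][j] = pi[i] * s[i][j]."""
    # Symmetric table of the six independent parameters.
    s = [
        [0.0, beta, alpha, chi],
        [beta, 0.0, delta, eta],
        [alpha, delta, 0.0, epsilon],
        [chi, eta, epsilon, 0.0],
    ]
    n = len(pi)
    R = [[pi[i] * s[i][j] for j in range(n)] for i in range(n)]
    # Assumption: the diagonal ("*") makes every column sum to zero.
    for j in range(n):
        R[j][j] = -sum(R[i][j] for i in range(n) if i != j)
    return R


def is_time_reversible(R, pi, tol=1e-12):
    """Check pi[i] * R[j][i] == pi[j] * R[i][j] for all pairs i, j."""
    n = len(pi)
    return all(
        abs(pi[i] * R[j][i] - pi[j] * R[i][j]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Hypothetical example values, not taken from the lecture.
pi = [0.1, 0.2, 0.3, 0.4]
R = gtr_rate_matrix(pi, beta=1.0, alpha=2.0, chi=0.5, delta=1.5, eta=0.7, epsilon=3.0)
print(is_time_reversible(R, pi))
```

Under the same assumptions, multiplying R by the vector π gives the zero vector, which is the sense in which the πi’s are stationary probabilities.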
PROSITE – protein families database
PROSITE collects biologically significant sequence patterns obtained from multialignments of amino acid sequences:
- functional amino acid patterns,
- protein domains,
- protein families characterized by conserved motifs.
It can detect these patterns in given amino acid sequences.
Two kinds of records describe patterns:
- profiles,
- regular expressions in a special format.
PROSITE pattern notation:
- "-" – separator between the pattern’s elements,
- V – any single letter: the amino acid with that one-letter code,
- x – any amino acid,
- […] – one amino acid from the bracket,
- {…} – one amino acid, but not one from the bracket,
- e(i) – for an element e and a number i: repetition of e exactly i times,
- e(i,j) – repetition of e exactly k times for some k with i ≤ k ≤ j.
Example. Pattern of some RNA-binding proteins’ family:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
Fragment of multialignment:
- SRSLKMRGQAFVIFKEVSSAT
- KLTGRPRGVAFVRYNKREEAQ
- VGCSVHKGFAFVQYVNERNAR
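The notation maps almost one-to-one onto ordinary regular expressions. The following sketch translates a pattern and searches the three alignment rows above. The function name prosite_to_regex is hypothetical, and only the elements listed in the notation are supported; it is not a full PROSITE parser.

```python
import re

# One pattern element: x, a single letter, [..] or {..}, optionally followed by (i) or (i,j).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any amino acid except the listed ones
        if low is not None:
            # e(i) becomes {i}, e(i,j) becomes {i,j}
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PATTERN = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]"
SEQUENCES = [
    "SRSLKMRGQAFVIFKEVSSAT",
    "KLTGRPRGVAFVRYNKREEAQ",
    "VGCSVHKGFAFVQYVNERNAR",
]

regex = re.compile(prosite_to_regex(PATTERN))
for sequence in SEQUENCES:
    hit = regex.search(sequence)
    print(sequence, hit.group(0) if hit else None)
```

Each of the three rows contains one occurrence of the pattern: RGQAFVIF, RGVAFVRY and KGFAFVQY.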
Example. PROSITE pattern description.
Maximum likelihood
Idea: use a model of sequence evolution P(i|j,t). Find a tree T(V,E) with edge weights (time lengths) w:E→R≥0 for which the probability P(wv:v∈L|T,w) of the observed leaf sequences wv (v∈L, |wv|=l) is as large as possible.
Phylogenetic analysis of a species set L:
1. Find a gene/protein whose homologues are present in all species of L.
2. Build a multialignment and delete the columns containing spaces.
3. Computing the likelihood P(wv:v∈L|T,w) for a given tree T and weights w is efficient, but a heuristic search over both the trees and the weights of their edges (2|L|-3 continuous parameters) is necessary.
- The phylogenetic model and the final result are the most reliable.
- The heuristic optimization runs over both discrete and continuous variables: a very large search space and time-consuming numerical calculations.
- Each position in the words may be treated separately, so the likelihood is a product over all multialignment columns.
- If the evolution model is time-reversible (i.e. πi P(j|i,t) = πj P(i|j,t)), then placing the root anywhere does not change the likelihood, so the heuristic search may be performed on unrooted trees.
Problem. How to find the likelihood for one multialignment column?
Given: a binary rooted tree T(V,E), weights w:E→R≥0 and letters au∈Σ for the leaves u∈L. Knowing the letter av=a in a vertex v∈V, find the probability that the correct letters appear in the leaves as a result of evolution in the subtree Tv rooted at v.
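Solution: dynamic programming, proceeding bottom-up. For a vertex v with children x and y, joined to v by edges of weights wx and wy:
\[
P(a_u : u \in L(T_v) \mid T_v, a_v = a, w|_{E(T_v)})
= \Big( \sum_{b \in \Sigma} P(b \mid a, w_x)\, P(a_u : u \in L(T_x) \mid T_x, a_x = b, w|_{E(T_x)}) \Big)
\cdot \Big( \sum_{c \in \Sigma} P(c \mid a, w_y)\, P(a_u : u \in L(T_y) \mid T_y, a_y = c, w|_{E(T_y)}) \Big)
\]
\[
\text{Likelihood} = \sum_{a \in \Sigma} \pi_a \, P(a_u : u \in L \mid T, a_{root} = a, w)
\]
[Figure: vertex v with children x and y, edges labelled wx and wy.]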
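The bottom-up computation for one column can be sketched in a few lines. The example below makes several assumptions that are not part of the slides: it uses the Jukes–Cantor model (equal πa, closed-form P(b|a,t) with t measured in expected substitutions per site) instead of GTR, and it encodes a tree as nested tuples, where a leaf is a letter and an inner vertex is a pair ((x, wx), (y, wy)). All names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t):
    """P(b | a, t) under the Jukes-Cantor model (an assumption of this sketch)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional(node):
    """Return {a: P(observed leaf letters below node | letter a at node)}."""
    if isinstance(node, str):
        # Leaf: probability 1 for the observed letter, 0 otherwise.
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    (x, wx), (y, wy) = node
    px, py = conditional(x), conditional(y)   # children first: bottom-up order
    return {
        a: sum(jc_prob(b, a, wx) * px[b] for b in ALPHABET)
        * sum(jc_prob(c, a, wy) * py[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(tree, pi=None):
    """Likelihood of one column: sum over root letters a of pi[a] * P(leaves | root = a)."""
    if pi is None:
        pi = {a: 0.25 for a in ALPHABET}      # Jukes-Cantor stationary probabilities
    cond = conditional(tree)
    return sum(pi[a] * cond[a] for a in ALPHABET)


# Hypothetical column with leaf letters A, A, C on a rooted three-leaf tree.
tree = (((("A", 0.1), ("A", 0.2)), 0.05), ("C", 0.3))
print(column_likelihood(tree))
```

For a whole multialignment, the same function would be called once per column and the results multiplied (or their logarithms summed), as the likelihood is a product over columns.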
Bayesian approach
- D – input data: sequences wv (v∈L, |wv|=l).
- (T,w) – result: a binary unrooted tree T with edge weights w.
- But weighted trees (regardless of the data) are not seen as equiprobable: P(T,w) is the prior probability (density) of weighted trees.
- P(D|T,w) – likelihood. But we want the most probable weighted tree for the given data D, not a tree (T,w) for which the data D is the most probable!
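- Posterior probability of a tree:
\[
P(T,w \mid D) = \frac{P(T,w,D)}{P(D)} = \frac{P(D \mid T,w)\,P(T,w)}{P(D)} \sim P(D \mid T,w)\,P(T,w)
\]
Here P(D|T,w) is the likelihood and P(T,w) is the prior; their product is the new quality function of a weighted tree. P(D) is hard to estimate, but … it is an unnecessary factor, since it does not depend on (T,w).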
Metropolis–Hastings algorithm (Markov chain Monte Carlo).
Input: a connected graph G(V,E) and a function f:V→R+.
Output: transition probabilities auv for all {u,v}∈E such that the discrete-time Markov chain with states from V has a stationary distribution proportional to f.
1. Create probability distributions pv(u)>0 (for u∈NG(v)).
2. Let v∈V;
3. repeat
     choose a random u∈NG(v) according to the probability distribution pv(u);
     a := (pu(v)·f(u)) / (pv(u)·f(v));
     if a ≥ random([0;1]) then v := u;
     print v;
   until false;
In phylogenetics: the states are weighted trees and f is the posterior probability. Do not optimize! Just run the chain for a long time and take a sample of probable trees.
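The loop above can be tried on a toy graph before applying it to trees. The sketch below is an illustration under stated assumptions: the proposal pv(u) is uniform over the neighbours of v, the graph and the function f are hypothetical, and the chain is stopped after a fixed number of steps instead of running forever.

```python
import random
from collections import Counter


def metropolis_hastings(neighbors, f, start, steps, rng):
    """Run the chain for `steps` steps and return the list of visited states."""
    v = start
    visited = []
    for _ in range(steps):
        u = rng.choice(neighbors[v])          # proposal: uniform over N_G(v)
        p_v_u = 1.0 / len(neighbors[v])       # p_v(u)
        p_u_v = 1.0 / len(neighbors[u])       # p_u(v)
        a = (p_u_v * f[u]) / (p_v_u * f[v])   # acceptance ratio
        if a >= rng.random():
            v = u                             # accept the move, otherwise stay at v
        visited.append(v)
    return visited


# Hypothetical example: a path with four vertices and f proportional to 1:2:3:4.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
F = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

chain = metropolis_hastings(NEIGHBORS, F, start=0, steps=200_000, rng=random.Random(0))
counts = Counter(chain)
for state in sorted(F):
    print(state, counts[state] / len(chain), F[state] / sum(F.values()))
```

The printed empirical frequencies should approach f divided by its sum as the number of steps grows. Note that f is only ever used through the ratio f(u)/f(v), which is why the normalizing factor P(D) of the posterior is never needed.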