Combined thermodynamic and evolutionary model for RNA

Outline
Motivation
Existing implementations
Combination of two models
Application
Discussion
Combined thermodynamic and evolutionary model for RNA secondary structure prediction
Rolf Backofen (1), Jan Gorodkin (2) and Stefan Seemann (1,2)
(1) Bioinformatics, Inst. of Computer Science, Albert-Ludwigs-University Freiburg
(2) Division of Genetics and Bioinformatics, IBHV, Copenhagen University, Denmark
Bled, Feb 2007
Outline
1 Motivation
2 Existing implementations
    Pfold
    Vienna RNA Package - RNAfold
3 Combination of two models
4 Application
    Model performance
    Alignment dependencies
5 Discussion
Motivation
Non-coding RNA genes exert their function through their spatial conformation.
Functional structures are conserved in evolution.
Several independent models are available to judge consensus secondary structures:
    1 evolutionary model of RNA sequences
    2 probabilistic model for secondary structure
    3 thermodynamic model for folding energy
Pfold
Probabilistic evolutionary model [1], which consists of
1 a stochastic evolutionary model (T), with the column probabilities
    $\Pr_{\mathrm{paired}}[\vec{A}^{i} \vec{A}^{i+j-1} \mid T]$
    $\Pr_{\mathrm{single}}[\vec{A}^{i} \mid T]$
2 an SCFG-based probabilistic model (M) for secondary structure, with the production rules:
    S → LS | L      (produces loops)
    F → dFd | LS    (produces stems)
    L → s | dFd     (single base or new stem)
[1] Knudsen B, Hein J (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 31(13):3423-8.
Pfold
The most probable consensus structure σ is determined by maximizing Pr[σ | A, T, M]:
$$\sigma^{\mathrm{MAP}} = \operatorname*{argmax}_{\sigma} \; \Pr[A \mid \sigma, T, M] \, \Pr[\sigma \mid T, M]$$
This can be solved using the CYK algorithm.
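To make the CYK step concrete, the following is a minimal sketch for a single sequence, using the three production rules quoted on the Pfold slide. It is not Pfold's implementation: the rule probabilities in `RULES`, the emission functions `emit_single` and `emit_pair`, and the function name `cyk` are hypothetical placeholders, and the real model scores alignment columns with the evolutionary model T rather than single bases.

```python
import math

# Hypothetical rule probabilities (placeholders, not Pfold's trained values).
RULES = {"S->LS": 0.87, "S->L": 0.13,
         "F->dFd": 0.79, "F->LS": 0.21,
         "L->s": 0.89, "L->dFd": 0.11}
NEG_INF = float("-inf")


def emit_single(base):
    # Placeholder emission: every unpaired base is equally likely.
    return 0.25


def emit_pair(left, right):
    # Placeholder emission: strongly favor G-C / C-G pairs.
    return 0.45 if (left, right) in {("G", "C"), ("C", "G")} else 0.1 / 14


def cyk(seq, rules=RULES):
    """Return (best log-probability, dot-bracket structure) for a non-empty seq."""
    n = len(seq)
    lg = {name: math.log(p) for name, p in rules.items()}
    # score[X][i][j]: best log-probability that X derives seq[i..j] (inclusive).
    score = {X: [[NEG_INF] * n for _ in range(n)] for X in "SFL"}
    back = {X: [[None] * n for _ in range(n)] for X in "SFL"}

    def consider(X, i, j, value, choice):
        if value > score[X][i][j]:
            score[X][i][j] = value
            back[X][i][j] = choice

    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            if length == 1:
                consider("L", i, j, lg["L->s"] + math.log(emit_single(seq[i])), ("s",))
            if length >= 3 and score["F"][i + 1][j - 1] > NEG_INF:
                inner = math.log(emit_pair(seq[i], seq[j])) + score["F"][i + 1][j - 1]
                consider("L", i, j, lg["L->dFd"] + inner, ("dFd",))
                consider("F", i, j, lg["F->dFd"] + inner, ("dFd",))
            for k in range(i, j):  # split into L over [i..k] and S over [k+1..j]
                left, right = score["L"][i][k], score["S"][k + 1][j]
                if left > NEG_INF and right > NEG_INF:
                    consider("S", i, j, lg["S->LS"] + left + right, ("LS", k))
                    consider("F", i, j, lg["F->LS"] + left + right, ("LS", k))
            if score["L"][i][j] > NEG_INF:
                consider("S", i, j, lg["S->L"] + score["L"][i][j], ("L",))

    # Trace back the best parse to recover the structure.
    structure = ["."] * n
    stack = [("S", 0, n - 1)]
    while stack:
        X, i, j = stack.pop()
        choice = back[X][i][j]
        if choice[0] == "dFd":
            structure[i], structure[j] = "(", ")"
            stack.append(("F", i + 1, j - 1))
        elif choice[0] == "LS":
            stack.append(("L", i, choice[1]))
            stack.append(("S", choice[1] + 1, j))
        elif choice[0] == "L":
            stack.append(("L", i, j))
    return score["S"][0][n - 1], "".join(structure)
```

Calling `cyk("GGGAAACCC")` returns the log-probability of the best parse together with its dot-bracket string. Working in log space avoids underflow, which matters given the log2 probabilities in the thousands reported later in the talk.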
Vienna RNA Package - RNAfold [2]
The partition function measures the probability of a secondary structure σ in thermodynamic equilibrium:
$$P_\sigma = \frac{Z_\sigma}{Z} = \frac{e^{-\Delta G_\sigma / RT}}{\sum_{S \in \Omega} e^{-\Delta G_S / RT}}$$
From it, the probabilities of the following can be calculated:
    base pairs $\Pr[(A_u^{i}, A_u^{i+j-1}) \mid s_u]$
    unpaired bases $\Pr_{\mathrm{unpaired}}[A_u^{i+j-1} \mid s_u]$
[2] Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P (1994) Fast Folding and Comparison of RNA Secondary Structures. Monatshefte f. Chemie 125:167-188.
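As a small worked illustration of the Boltzmann formula, the sketch below computes structure probabilities for an explicitly listed toy ensemble. It assumes free energies in kcal/mol and a temperature of 37 °C; the function name `boltzmann_probabilities` and the example energies are hypothetical. RNAfold itself sums over the full ensemble Ω by dynamic programming and does not enumerate structures.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)


def boltzmann_probabilities(energies, temperature=310.15):
    """Map each structure to exp(-dG / RT) / Z for an explicit list of structures.

    energies: dict from structure (dot-bracket string) to free energy in kcal/mol.
    """
    rt = GAS_CONSTANT * temperature
    weights = {s: math.exp(-dg / rt) for s, dg in energies.items()}
    z = sum(weights.values())  # partition function of the toy ensemble
    return {s: w / z for s, w in weights.items()}


# Toy ensemble with made-up energies.
probs = boltzmann_probabilities({"(((...)))": -3.0, "((.....))": -1.5, ".........": 0.0})
```

Lower free energy gives a larger Boltzmann weight, so the first structure in this toy ensemble receives the highest probability.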
Extended model
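Combination of probabilistic evolutionary information with thermodynamic parameters of the standard energy minimization model:
[Figure: the consensus structure σ, scored with T and M, is connected to the sequences s1, ..., sn through Pr[σ1 | s1], ..., Pr[σn | sn], with σ1 =_A σ, ..., σn =_A σ.]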
Probabilistic evolutionary model
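Definition of the probability of structure σ:
$$\Pr[A \mid \sigma, T, M] \, \Pr[\sigma \mid T, M] = \mathrm{prob}_{M,\tau_M(\sigma)}(r, A)$$
Recursive definition for the probabilistic evolutionary model:
$$
\mathrm{prob}_{M,\tau_M(\sigma)}(n, A) = \Pr[\mathrm{rule}(n) \mid M] \times \prod_{\ell=1}^{k} \mathrm{prob}_{M,\tau_M(\sigma)}(n_\ell, A) \times
\begin{cases}
\Pr_{\mathrm{paired}}[\vec{A}^{i} \vec{A}^{i+j-1} \mid T] & \text{if } \mathrm{rule}(n) = F \to dFd \text{ or } \mathrm{rule}(n) = L \to dFd \\
\Pr_{\mathrm{single}}[\vec{A}^{i} \mid T] & \text{if } \mathrm{rule}(n) = L \to s \\
1 & \text{else}
\end{cases}
$$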
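The recursion can be read as a walk over the parse tree: each node contributes its rule probability, the product over its children, and an emission term that depends on the rule. A minimal sketch follows, in which the `Node` class, `tree_probability`, and the callables `pr_single` and `pr_paired` are hypothetical stand-ins for the column probabilities under T.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    rule: str                           # e.g. "S->LS", "F->dFd", "L->s"
    columns: Tuple[int, ...] = ()       # (i,) for L->s, (i, j) for the dFd rules
    children: List["Node"] = field(default_factory=list)


def tree_probability(node, rule_prob, pr_single, pr_paired):
    """Evaluate prob(n, A) recursively for the parse tree rooted at node."""
    p = rule_prob[node.rule]
    for child in node.children:
        p *= tree_probability(child, rule_prob, pr_single, pr_paired)
    if node.rule in ("F->dFd", "L->dFd"):
        i, j = node.columns
        p *= pr_paired(i, j)            # stands in for Pr_paired[A^i A^(i+j-1) | T]
    elif node.rule == "L->s":
        (i,) = node.columns
        p *= pr_single(i)               # stands in for Pr_single[A^i | T]
    return p                            # all other rules contribute a factor of 1


# Hypothetical parameters and a one-column tree: S -> L, L -> s.
rule_prob = {"S->LS": 0.87, "S->L": 0.13, "F->dFd": 0.79,
             "F->LS": 0.21, "L->s": 0.89, "L->dFd": 0.11}
tree = Node("S->L", children=[Node("L->s", columns=(0,))])
value = tree_probability(tree, rule_prob, lambda i: 0.25, lambda i, j: 0.45)
```

Evaluating the root node r in this way yields the product Pr[A | σ, T, M] Pr[σ | T, M] from the definition above.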
PE thermodynamic model
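$$
\mathrm{prob}_{M,\tau_M(\sigma)}(n, A) = \Pr[\mathrm{rule}(n) \mid M] \times \prod_{\ell=1}^{k} \mathrm{prob}_{M,\tau_M(\sigma)}(n_\ell, A) \times
\begin{cases}
\Pr_{\mathrm{paired}}[\vec{A}^{i} \vec{A}^{i+j-1} \mid T] \times \prod_{u=1}^{n}
  \begin{cases}
  \Pr[(A_u^{i}, A_u^{i+j-1}) \mid s_u] & \text{if } bp(s_u^{i}, s_u^{i+j-1}) \\
  \Pr_{\mathrm{unpaired}}[A_u^{i} \mid s_u] \times \Pr_{\mathrm{unpaired}}[A_u^{i+j-1} \mid s_u] & \text{if } \neg bp(s_u^{i}, s_u^{i+j-1})
  \end{cases}
  & \text{if } \mathrm{rule}(n) = L \to dFd \\[1ex]
\Pr_{\mathrm{paired}}[\vec{A}^{i} \vec{A}^{i+j-1} \mid T] \times \prod_{u=1}^{n}
  \begin{cases}
  \Pr[(A_u^{i}, A_u^{i+j-1}) \mid (A_u^{i-1}, A_u^{i+j}), s_u] & \text{if } bp(s_u^{i}, s_u^{i+j-1}) \\
  \Pr_{\mathrm{unpaired}}[A_u^{i} \mid s_u] \times \Pr_{\mathrm{unpaired}}[A_u^{i+j-1} \mid s_u] & \text{if } \neg bp(s_u^{i}, s_u^{i+j-1})
  \end{cases}
  & \text{if } \mathrm{rule}(n) = F \to dFd \\[1ex]
\Pr_{\mathrm{single}}[\vec{A}^{i} \mid T] \times \prod_{u=1}^{n} \Pr_{\mathrm{unpaired}}[A_u^{i} \mid s_u] & \text{if } \mathrm{rule}(n) = L \to s \\[1ex]
1 & \text{else}
\end{cases}
$$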
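The emission terms of the combined model multiply the evolutionary column probability by one thermodynamic factor per sequence. The sketch below shows only that product; the `SeqTerms` container and the function names are hypothetical, and the per-sequence probabilities are assumed to have been computed beforehand (for example from RNAfold's partition function). For the F → dFd case, `p_pair` would hold the pair probability conditioned on the enclosing pair.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SeqTerms:
    can_pair: bool           # bp(s_u^i, s_u^(i+j-1)) for sequence u
    p_pair: float            # pair probability given s_u (stacking-conditional for F->dFd)
    p_unpaired_left: float   # Pr_unpaired[A_u^i | s_u]
    p_unpaired_right: float  # Pr_unpaired[A_u^(i+j-1) | s_u]


def paired_column_factor(pr_paired_T: float, terms: Iterable[SeqTerms]) -> float:
    """Emission factor for the rules L->dFd and F->dFd."""
    factor = pr_paired_T
    for t in terms:
        if t.can_pair:
            factor *= t.p_pair
        else:
            factor *= t.p_unpaired_left * t.p_unpaired_right
    return factor


def single_column_factor(pr_single_T: float, p_unpaired: Iterable[float]) -> float:
    """Emission factor for the rule L->s."""
    factor = pr_single_T
    for p in p_unpaired:
        factor *= p
    return factor
```

Because one factor is multiplied in per sequence, the total probability shrinks as sequences are added, which is in line with the alignment-size dependence examined in the application part.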
Gaps
Treating gaps is a general problem in biological sequence analysis:
    alignment columns with ≥ 25% gaps are removed (as in Pfold)
    sequence-dependent probabilities are calculated without gaps
    gap probabilities are estimated as the geometric mean of the probabilities in the corresponding column
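The first and third gap rules can be sketched as follows. This is an illustration under assumptions, not the authors' code: gaps are written as `-`, the function names are hypothetical, and the geometric mean is taken over the non-gap entries of a column (marked here by `None` for the gapped sequences).

```python
import math

GAP = "-"


def drop_gappy_columns(alignment, max_gap_fraction=0.25):
    """Remove every column whose gap fraction is >= max_gap_fraction."""
    n_seq = len(alignment)
    keep = [c for c in range(len(alignment[0]))
            if sum(row[c] == GAP for row in alignment) / n_seq < max_gap_fraction]
    return ["".join(row[c] for c in keep) for row in alignment]


def gap_probability(column_probs):
    """Geometric mean of the probabilities of the non-gap entries in one column."""
    values = [p for p in column_probs if p is not None]
    return math.exp(sum(math.log(p) for p in values) / len(values))


filtered = drop_gappy_columns(["AC-G", "ACGG", "A-GG", "ACGG"])
estimate = gap_probability([0.2, None, 0.8])
```

In the toy alignment, the two middle columns each contain one gap in four sequences (exactly 25%) and are therefore dropped.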
Model performance in U1
Comparison of: PETfold, RNAalifold, Pfold
Test data: Rfam seed alignment of U1 spliceosomal RNA (av. id = 40.6%; max. id = 50%; #seq = 5)

set rules   set tree   set seq   log2 prob   sensitivity [%]   specificity [%]
0           0          0         0           0                 0
0           0          1         -1089       75                64
0           1          0         -12         95                90
0           1          1         -1381       72.5              81
1           0          0         -37         0                 0
1           0          1         -1276       72.5              74
1           1          0         -106        90                90
1           1          1         -1540       70                82
RNAalifold                                   62.5              86
Pfold                                        95                90
Optimal consensus structure
Rows in each block, from top to bottom: Rfam, Pfold, RNAalifold, PETfold
mir-9/mir-79 microRNA precursor family
...<<<<<<<<<<<<<<<..<.<<<................>.>>>..>>>>>>>>>>>>>>>...
...(((((((((((((((..(.(((................).)))..)))))))))))))))...
...(((((((((((((((..............................)))))))))))))))...
...(((((((..(.(((((.(-((((.((-......--))))-))).))))).)..)))))))...
sensitivity = 100%|79%|84%; specificity = 100%|100%|80%
U98 small nucleolar RNA
.......<<<<<.......<<<<<<<........>>>>>>>...........>>>>>......
....................(((((..........))))).......................
.....(((..(((...((.((((((((......)))))))).))(((((..............
..--((.((.......(((((((((((......)))))))...))))......)).)).....
.......<<<<<<<...........<<<...........>>>................>>>>>>>.......
........................................................................
(((((.(((......))).)))))......(((......)))))))).....)))...)))...........
......((((((((.((((.....))))..(((......)))..(((.--..)))...)))))))..)....
sensitivity = 23%|55%|73%; specificity = 100%|37.5%|52%
Small nucleolar RNA Z159/U59
..<<<<.............∼........<<<<<<<<<..∼..>>>>>>>>>..............∼..>>>>..
((((((.............∼........(((((((((..∼..)))))))))......((...)).∼..))))))
..((((.............∼........(((((((((..∼..)))))))))......(....)..∼..))))..
(.(((((.....((.((..∼..))..))(..(..(((..∼..)))..)..)...)-.(....)..∼..)))).)
sensitivity = 100%|100%|69%; specificity = 76%|93%|53%
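The sensitivity and specificity figures can be recomputed from dot-bracket strings. The sketch below assumes that sensitivity is the fraction of reference base pairs that are predicted, and that specificity is the fraction of predicted base pairs that are in the reference (a positive predictive value, as is common in RNA structure evaluation). The slides do not state the definitions, so both are assumptions, and the function names are hypothetical. Alignments that are wrapped over several blocks, such as U98 above, have to be concatenated before parsing.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs in a dot-bracket string using () or <>."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure):
        if ch in "(<":
            stack.append(pos)
        elif ch in ")>":
            pairs.add((stack.pop(), pos))
    return pairs


def sensitivity_specificity(reference, predicted):
    """Return (sensitivity, specificity) in percent; 0.0 when a set of pairs is empty."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    true_positives = len(ref & pred)
    sensitivity = 100.0 * true_positives / len(ref) if ref else 0.0
    specificity = 100.0 * true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy example: the prediction recovers two of three reference pairs.
sens, spec = sensitivity_specificity("<<<...>>>", "((.....))")
```

Any character other than a bracket, including gap symbols, is treated as unpaired.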
RNAfold vs PETfold
50 suboptimal structures of the U1 spliceosomal RNA AE003745
[Figure: scatter plot of the 50 suboptimal structures; x-axis: RNAfold Log2 Prob of substructure, y-axis: qfold Log2 Prob of substructure.]
Observation: PETfold predicts more probable
basepairs as single bases in multiloops
RNAfold vs PETfold
3 most probable structures of the U1 spliceosomal RNA
AE003745 predicted by RNAfold
[Figure: secondary-structure drawings of the three structures.]
RNAfold vs PETfold
50 suboptimal structures of the mir-9/mir-79 microRNA Z81467
[Figure: scatter plot of the 50 suboptimal structures; x-axis: RNAfold Log2 Prob of substructure, y-axis: qfold Log2 Prob of substructure.]
Alignment dependencies in U1
Influence of the average pairwise identity of an alignment on the optimal structure probability of our model:
Test data: Rfam seed alignment of U1 spliceosomal RNA
the number of sequences in the alignment is fixed
[Figure: Log2 Probability (y-axis) versus average pairwise identity [%] (x-axis, 30 to 100), with one curve each for 2, 5, 10 and 20 sequences.]
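For reference, one common way to compute the average pairwise identity of an alignment is sketched below. The slides do not say how the value was obtained, so the gap handling here is an assumption: columns where both sequences have a gap are skipped, and a gap against a base counts as a mismatch. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences, ignoring gap-gap columns."""
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def average_pairwise_identity(alignment):
    """Mean of the pairwise identities over all sequence pairs (needs >= 2 sequences)."""
    scores = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    return sum(scores) / len(scores)


api = average_pairwise_identity(["ACGU", "ACGA", "ACCA"])
```

The three pairs in the toy alignment have identities of 75%, 50% and 75%, which the function averages.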
Alignment dependencies in U1
Influence of the number of sequences in an alignment on the optimal structure probability of our model:
Test data: Rfam seed alignment of U1 spliceosomal RNA
the average pairwise identity of the alignment is fixed
[Figure: Log2 Probability (y-axis) versus number of sequences (x-axis, 2 to 12), with one curve each for an average pairwise identity of 40 %, 60 % and 80 %.]
Problems and further work
    the prior probability of structures carries no extra information content (uniform distribution)
    low number of base pairs in large alignments (Pfold also has problems with large input)
    base pairs with higher probability as single bases in multiloops
    dangling ends are not considered so far (usage of RNAfold -p2)
Extended grammar considering stacking probabilities
Modified grammar: single bases are considered in their structural context through a changed F rule
    S → LS | L
    F → dFd | dFdS | sS
    L → s | dFd
Sequence stacking probabilities are estimated using RNAfold constraints.
Thank you!!!
:-)