Turbo-Decoding of RNA Secondary Structure

Background
Turbo-Decoding RNA Secondary Structure
References
Turbo-Decoding of RNA Secondary Structure
Gaurav Sharma
Department of Electrical and Computer Engineering
Department of Biostatistics and Computational Biology
June 18, 2013
1
Outline
◮
Turbo-decoding in Communications: A Quick Review
Outline
◮
◮
Turbo-decoding in Communications: A Quick Review
RNA Structure Analysis: Motivation and Background
◮
◮
RNA, noncoding RNA, RNA structure and its significance
RNA structure prediction
◮
Single/Multiple sequence methods
Outline
◮
◮
Turbo-decoding in Communications: A Quick Review
RNA Structure Analysis: Motivation and Background
◮
◮
RNA, noncoding RNA, RNA structure and its significance
RNA structure prediction
◮
◮
Turbo-decoding RNA secondary structure
◮
◮
Single/Multiple sequence methods
Iterative probabilistic decoding of structures of multiple
homologs: TurboFold
Turbo-decoding: RNA vs communications
Outline
◮
◮
Turbo-decoding in Communications: A Quick Review
RNA Structure Analysis: Motivation and Background
◮
◮
RNA, noncoding RNA, RNA structure and its significance
RNA structure prediction
◮
◮
Single/Multiple sequence methods
Turbo-decoding RNA secondary structure
◮
Iterative probabilistic decoding of structures of multiple
homologs: TurboFold
◮
Turbo-decoding: RNA vs communications
◮
Conclusions
Outline
◮
◮
Turbo-decoding in Communications: A Quick Review
RNA Structure Analysis: Motivation and Background
◮
◮
RNA, noncoding RNA, RNA structure and its significance
RNA structure prediction
◮
◮
Single/Multiple sequence methods
Turbo-decoding RNA secondary structure
◮
Iterative probabilistic decoding of structures of multiple
homologs: TurboFold
◮
Turbo-decoding: RNA vs communications
◮
Conclusions
◮
Ongoing related work
Convolutional Codes
◮
D
D
D
c1 (D)
m1 (D)
c2 (D)
m2 (D)
c3 (D)
m3 (D)
c4 (D)
Codeword
symbol streams
Encoder
Messageword
symbol streams
◮
Finite state machine
◮
Output and next state are functions of current state and inputs
ML Decoding: Convolutional Code
◮
Convolutional code structure constrains possibilities to a trellis
Rx Bits
11
11
01
10
01
10
0
1
1
0
00
10
01
11
1
◮
0
Deoded
Message Bits
ML Decoding: Most likely path through the trellis given the
received information
Turbo Decoding in Communications
◮
An encoder construction
P1 (D)
m(l)
.
.
.
Parity
Subunit
Messageword
P(D)
m(l)
p(l)
.
.
.
cn−k
(l)
m1
cn−k+1
(l)
mk−1
m(l)
Figure : Two encoders.
(l)
m0
m(l)
(l)
.
.
.
(l)
p2
(l)
cn−k−1
Systemati Subunit
(l)
m(l)
(l)
c0
(l)
c1
Computation
m(l)
P2 (D)
Reursive Systemati Convolutional Enoder
(l)
p1
.
.
.
m(l)
m(l)
(l)
cn
(l)
P1 (D)
p1
P2 (D)
p2
Interleaver
Π
Figure : Systematic convolutional encoder
(recursive)
Figure : Parallel
concatentation.
(l)
c(l)
Turbo Decoding in Communications
m̃
P1 (D)
m̃
r̃1
p̃1
r̃1
Interleaver
Π
Channel
p̃2
P2 (D)
r̃1
Figure : Encoder + channel
t=T
r̃0
Forward-Bakward Algorithm
Approximate MAP
for Extrinsi Information
r̃1
Deoding
Update for Code 1
n
r̃
Π
(t)
Ul (m)
oL−1
n
l=0
Π−1
(t)
Tl (m)
oL−1
Final Deoding Step
l=0
Π
Forward-Bakward Algorithm
for Extrinsi Information
r̃2
Update for Code 2
Iterative Phase
Figure : Iterative Decoder
ˆ
{m̂(l) }L−1
l=0 ≡ m̃
Symbol-wise MAP Decoding: Convolutional Code
◮
Convolutional code structure constrains possibilities to a trellis
Rx Bits
11
11
01
10
01
10
00
10
01
11
◮
Most probable value of a bit for all possible paths through the
trellis, given the received information
◮
Localized probabilistic information
Turbo Decoding in Communications: Observations
◮
◮
Multiple encodings of same message information
Joint (optimal) decoding desirable
◮
◮
Exact joint decoding ≈ exponential complexity
Computationally Efficient Decoding: Iterative approximation
(belief propagation)
◮
◮
◮
◮
Localized MAP probabilitistic formulation
Decomposition into loosely coupled individual decodings +
information exchange at each iteration
Linear complexity in length of data
Pseudo-prior interpretation
Turbo Decoding in Communications: Observations
◮
◮
Multiple encodings of same message information
Joint (optimal) decoding desirable
◮
◮
Exact joint decoding ≈ exponential complexity
Computationally Efficient Decoding: Iterative approximation
(belief propagation)
◮
◮
◮
◮
Localized MAP probabilitistic formulation
Decomposition into loosely coupled individual decodings +
information exchange at each iteration
Linear complexity in length of data
Pseudo-prior interpretation
RNA?
What does this have to do with RNA?
RNA: Ribonucleic Acid
G
C
Adjacent nucleotides linked
together by strong (covalent)
phosphodiester bonds
between sugar and phosphate
C
N
C
C
O
HN
Information encoded with 4
different types of nucleotides
differentiated by base
content: Adenine, Guanine,
Cytosine, Uracil
http://www.genome.gov
C
Cytosine
NH2
C
H
O
H
HN
G
O
C
N
C
H
N
H
A
C
N
H
C
N
N
C
N
H
C
H
N
H
H
Nitrogenous
Bases
A
Adenine
NH2
N
N
C
N
C
H
N
Sugar
phosphate
backbone
H
N
T
O
C
C
N
C
C
H
N
C
H
N
H
Thymine
O
H
H
H
O
H
replaces Thymine in RNA
Nitrogeneous
Bases
N
C
H
Uracil O
H
C
C
H C
H
U
H
Adenine
Base pair
C
C
N
H
NH2
N
H
C H
C
C H
C
C
H
Guanine
O
N
H
H
Guanine
C
C
C
H
N
C
N
H C
◮
C
AT
NH2
G
◮
's
Cytosine
G
's
C
C
Nucleic Acid of long chain of
units named nucleotides:
Nitrogenous Base, Ribose
sugar, Phosphate
AU
◮
C
C
N
C
C
H
N
H3C
H
H
RNA
DNA
Ribonucleic acid
Deoxyribonucleic acid
Nitrogeneous
Bases
The Central Dogma
DNA
Nuclear membrane
mRNA Transcription
Mature mRNA
Amino acid
Transport to cytoplasm
tRNA
Amino acid
chain (protein)
Translation
Anti-codon
Codon
mRNA
Ribosome
◮
Genetic information flows unidirectionally:
◮
◮
DNA → RNA → Protein
RNA plays a passive role
◮
Transient copy created for protein synthesis
http://www.genome.gov
RNA an Active Player: ncRNAs
◮
DNA Template
◮
Transcription
Self-splicing
Intron
pre mRNA
AUG
UAA Codons
Increasing numbers (being)
discovered
◮
1989 Nobel Prize in Chemistry:
Ribozymes
UAG Anti-Codon
micro RNA
Ribosomal
RNA
tRNA
◮
Protein
ncRNAs
w/o translation to protein =⇒
“noncoding”
◮
mRNA
+/-
ncRNAs: play direct functional roles
in cellular processes
◮
Thomas Cech and Sidney
Altman
2006 Nobel Prize in
Physiology/Medicine: siRNA
◮
Andrew Fire and Craig Mello
Background
Turbo-Decoding RNA Secondary Structure
References
Noncoding RNAs (ncRNAs): Examples
◮
Commonly known ncRNAs
◮
◮
◮
Protein synthesis: tRNA, rRNA
RNA modification: snoRNAs,
Up/Down regulation of gene expression
◮
◮
◮
Regulation of transcription
siRNA/miRNA post transcription regulation silencing of genes
piRNAs regulation of retroransposons
◮
RNA Splicing (autocatalysis)
◮
Many more: ...
◮
RNA Genomes (Many viruses including HIV and SIV)
ncRNAs and diseases
◮
◮
◮
◮
Abnormal expression for ncRNAs observed in cancerous cells
Prader-Willi Syndrome (over-eating and learning disabilities)
Autism, Alzheimer’s, ...
13
Background
Turbo-Decoding RNA Secondary Structure
References
Noncoding RNAs (ncRNAs)
◮
RNA molecules that directly play functional roles in cellular
processes
◮
◮
◮
Do not code for protein synthesis =⇒ “noncoding”.
Structure determines function in noncoding roles
Determination of structure is of significant interest
◮
◮
◮
Further understanding of ncRNA function
Enhances understanding of cellular processes and interactions
Provides targets for drug design
14
Computational Prediction of RNA Structure
◮
Structure determines function in noncoding roles
Computational Prediction of RNA Structure
◮
◮
Structure determines function in noncoding roles
Experimental determination of structure is challenging
◮
◮
X-ray Crystallography
Crystallization difficult and expensive
Computational Prediction of RNA Structure
◮
◮
Structure determines function in noncoding roles
Experimental determination of structure is challenging
◮
◮
◮
X-ray Crystallography
Crystallization difficult and expensive
Computational estimation of structure is of significant interest
◮
◮
◮
Understanding ncRNA function in cellular processes and
interactions
Genome understanding: structure based ncRNA gene search
Therapeutics: targets for drug design
Computational Prediction of RNA Structure
◮
◮
Structure determines function in noncoding roles
Experimental determination of structure is challenging
◮
◮
◮
X-ray Crystallography
Crystallization difficult and expensive
Computational estimation of structure is of significant interest
◮
◮
◮
Understanding ncRNA function in cellular processes and
interactions
Genome understanding: structure based ncRNA gene search
Therapeutics: targets for drug design
RNA Structure Hierarchy [Tinoco and Bustamante, 1999]
5’
AAUUGCGGGAAAGGGGUCAA
CAGCCGUUCAGUACCAAGUC
UCAGGGGAAACUUUGAGAUG
GCCUUGCAAAGGGUAUGGUA
AUAAGCUGACGGACAUGGUC
CUAACCACGCAGCCAAGUCC
UAAGUCAACAGAUCUUCUGU
UGAUAUGGAUGCAGUUCA
3’
P 5c
A
C
A
G
G
P 5a CA
G
U
C
G
AA
U
A
AU
180 G
G
GUA U
AAAGG
C
G UC C
G G
U
160
GAC A
C
U
C
200 G
GU
G
U
P 5 UC
C
A
C
G
U
U
A
A
U
A
G
A
AA
A
G
U
C
U
C
A
G
G
G
G
A
C
C
AC
P 4 GC
A
140
P 6 GC
C
A
A
220 G
U
C
C
U
A
A
G
U
C
A
A
C
A
G
A
UC
120
G
A
A
C
C
A
P 5b GU
U
U
C
A
A
C
U
G
G
G
A
G
G
G
C
G
U
260
C A
U
U U
G AA
A
C
G
U
A
G P 6a
G
U
A
U
A
G
U
U P 6b
G
U
C 240
U
U
Figure : Hierarchy of RNA structure formation [Waring and Davies, 1984,
Doudna and Cech, 2002, Doudna and Cate, 1997]
RNA Secondary Structure
◮
◮
Folding of RNA linear molecular chain onto itself with base
pairing rules
Formation of hydrogen bonds between nucleotides
◮
Canonical base pairs
◮
◮
◮
◮
A can pair with U
G can pair with C and U
G-U pair called non Watson-Crick pair
Greater variety of structures than the DNA double helix
RNA Secondary Structure Elements
U UG
C
C
130
G
A
U
G
C
C
G
G
C
GA
120
C
A
G
A
GU
A
U
140
C
G
60
110
G
G
U
G
G C
G
C
G A G
U
C
C
G A
C
U
G
C
U
C
A
190
C
C
G
C
A
U
G
U
G
G
A
A
G
G
U
A
G
U
C
C
AU
G
G
U
G
A
C
A
C A 200
C
A
C
G
100
U
A
90
G
G
C
C
A
A
U
G
U
G
A A C AG
A 150
G
G
180
50 C
G 70
A C
U
U
A
G
C
G
G
C
U
A
A
A
U
A
U
GA
G
40
80 C
G
G
CA A C
G
C
A
A
G
GG
210
CG G C U G G
A
G
GC
G
U
A
A
U GC
UA
G
AU U G A C C
C
CC C
G
C
A
G
C
G
A
U A 160
A
170 G
30 U
A
A G U GG
C
A
A
U 270
C
C 220
G
C
G
A
G
C
A
U
GC G
GC
260 G C
GC
C
G
A
U
G
G
G
C
CU
C
C
AC
G
A
G
G AC
G
AU C
G
G 280
C 240U
G
A
C
10
U
A
U
U
20 C
G
UC
G
G
U
U
A G AA
C
G
5’ G A G G A A A G C C G G G
G
A
C
290
C
A
G
230
3’ A U U C G G C C C
GA
A A U AA G
A
A
CC
U
A
A
410
G 250 300
G
UU
A
GG
390
G GG
C
AA
400 C
GC AU
A
AA
U
CC
A GGGGGG
U
U
UU
U
GGG
A
A
CG
340 G 330
350
C
G
GCCCC
A
320 U
G
380
A
C
GCCUGUCGG
C
A
U
G
G
G
C
G
U 310
C
CGGACGGC A
G
G
C
C
C
G
U 360
C CG
U
A
U
G
A
A
GU
G
A AG
Bulge Loop
Internal Loop
Hairpin Loop
Multibranch Loop
370
Pseudoknot
Figure : Structural Elements of LGW17 sequence from RNAse P
Database [Brown, 1999]
RNA Structure: Thermodynamics
◮
Equilibrium: Boltzmann Distribution of structures
Unpaired
State
5’
3’
3’
5’
S2
exp(−∆Go(S2)/RT )
···
···
5’
exp(−∆Go(S1)/RT )
···
3’
S1
5’
Sk
exp(−∆Go(Sk )/RT )
···
3’
···
◮
Lower ∆Go (Sk ), higher the probability of Sk
◮
Most likely structure → Minimization of free energy
Modeling RNA Thermodynamics: Nearest neighbor model
◮
Nearest neighbor model [Xia et al., 1998, Mathews et al.,
1999]
◮
◮
Computational model for free energy change of RNA structure
Experimentally determined free energy terms for each nearest
neighbor interaction in secondary structure
◮
Loop decomposition
U
U
G
G
C
C
G
U
A
C
A
C
A
U
A
Terminal Mismatch (GC/AA): −1.1 kcal/mol
A
Stacked loop (GC/GC): −2.9 kcal/mol
A
1 Nucleotide Bulge Loop: +3.3 kcal/mol
A
4 nucleotide hairpin loop: +5.9 kcal/mol
Stacking of base pairs (GC/GC): −2.9 kcal/mol
Stacked loop (UA/GC): −1.8 kcal/mol
Stacked loop (AU/UA): −0.9 kcal/mol
Stacked loop (CG/AU): −1.8 kcal/mol
G
U
Stacked loop (AU/CG): −2.1 kcal/mol
3’
A
5’
5’ dangling nucleotide: −0.3 kcal/mol
Figure : Total free energy change is summation of all nearest neighbor
energies [Durbin et al., 1999]
RNA ML Decoding of Structure: Single Sequence
◮
Most likely or minimum free energy structure, given sequence
···
◮
···
···
Dynamic Programming MFold [Zuker, 1989] O(N 3 )
complexity
RNA MAP Decoding of Structure
◮
Posterior probability of base pairing, given sequence
···
···
···
◮
Dynamic Programming [McCaskill, 1990], MFold, RNAfold
(O(N 3 ) in time, O(N 2 ) in space)
◮
Localized probabilistic information
RNA Structure Prediction (Single Sequence)
RD0260 Sequence
Partition Function GCGACCGGGGCUGGCUU
GGUAAUGGUACUCCCCU
Computation
GUCACGGGAGAGAAUGU
Free Energy
Minimization
GGGUUCAAAUCCCAUCG
GUCGCGCCA
U C A
G
C
U
C G
C G
C G
60
50
40
30
20
10
0
10
20
30
40
50
60
70
RD0260 nucleotide indices
70
80
40
70
A
60
G
30 C A
U
C G A
A
A
A
20 U
G
0
80
A U
G U
50
40
U
GGU
UCG
G
GU
CG
GG
U
G C
C G
C G
70
A U
G C
C G
G C
G
C
50
30
U U C
UGGG
A
ACCC
U A A
20
10
60
0
10
20
30
40
50
60
70
RD0260 nucleotide indices
80
0
80
RD0260 nucleotide indices
-28.6 kcals/mol
C
A
RD0260 nucleotide indices
◮
Free energy minimization: “Hard” Prediction
◮
◮
Single prediction structure
Base pairing probabilities: “Soft” Prediction
◮
◮
Thresholding may yield pseudo-knotted structures
Maximum Expected Accuracy Structure Prediction, [Do et al.,
2006, Lu et al., 2009]
Structure Prediction for Multiple Sequences: Homologous
ncRNAs
◮
Homologous ncRNAs
◮
◮
◮
Share evolutionary ancestor
Serve same function
Structural similarity in terms of topology of structures
Bacteriophage T5 (Asp)
U C
RD0260
Synechocystis (Glu)
Haloferax Volcanii (Asp)
G
U
C
U C
A
A
C
40
C
C G
40
C G
C GG
U
C G
30 A C
G
30 CU A
C G
C G A
A U G
G
20
A
A
A
C A
50
A
20
G C
C
U
C
G C
G
CAU
U U C
U G U
C
G C G
U
GCGGG
U A A
G
U
C
U
U
C
G
70
G
G
G
C G
U
G
GUG
G
GUGGG
A U
G
C
A
G
U
70 C G C C C
A
G C
G
G
GUCG
U A
C
G
U
A
C
C
C
U G A
10
C
U
G
U A A
C G
G C
U C
10
C
60
G
G
60
G
C
C
C
C
A
A
G
U
C
RD0500
U
U
C
20
U C
C
C
30 U
C
C
A
A
C
G
G
A
G
G
40
C
G
A U
U C G
GGAC C
C
C G
C G
G
UCUG
C G
G
A
C G
A G
10
G U
A
C U A
C
50
AGGGA
70
UCCCU
A
C
C
A
RE2140
U U C
G
A
U A
60
Structure Prediction for Multiple Sequences: Homologous
ncRNAs
◮
Homologous ncRNAs
◮
◮
◮
Share evolutionary ancestor
Serve same function
Structural similarity in terms of topology of structures
Bacteriophage T5 (Asp)
U C
RD0260
Synechocystis (Glu)
Haloferax Volcanii (Asp)
G
U
C
U C
A
A
C
40
C
C G
40
C G
C GG
U
C G
30 A C
G
30 CU A
C G
C G A
A U G
G
20
A
A
A
C A
50
A
20
G C
C
U
C
G C
G
CAU
U U C
U G U
C
G C G
U
GCGGG
U A A
G
U
C
U
U
C
G
70
G
G
G
C G
U
G
GUG
G
GUGGG
A U
G
C
A
G
U
70 C G C C C
A
G C
G
G
GUCG
U A
C
G
U
A
C
C
C
U G A
10
C
U
G
U A A
C G
G C
U C
10
C
60
G
G
60
G
C
C
C
C
A
A
G
U
C
RD0500
U
U
C
20
U C
C
C
30 U
C
C
A
A
C
G
G
A
G
G
40
C
G
A U
U C G
GGAC C
C
C G
C G
G
UCUG
C G
G
A
C G
A G
10
G U
A
C U A
C
50
AGGGA
70
UCCCU
A
C
C
A
RE2140
U U C
G
A
U A
60
Structure Prediction for Multiple Sequences: Homologous
ncRNAs
RD0260
G
U
C
RE2140
RD0500
U C
A
C
40
C GG
C G
C
30 U A
C G A
G
A
A
A
20
G C
G
G C G
U
U A A
C
U U C
70
UGGU G
G
GUGGG
A U
C
A
G
G
C
G
U
G
UACCC
C G
U
G
U A A
G C
U C
10
G
60
C
U C
G
A
U
C
40
C
C G
G
C
U
A
30
G C
C G
A U G
20
A
C A
50
C
U
C
G C
U
A
C
U U C
U G U
C
GCGGG
G
G C
G
G
GUG
C G
G
U
70 C G C C C
A
G
U A
C
U G A
10
C G
C
60
G
G
U
U
C
U C
A
C
40
C G
G
C
A
30 U G
C
C G C
G
A
20
A
A U
50
C
C U A
U C G
C
U U C
GGA
C
C G
C
AGGGA
C G
G
G
70
UCUG
G
C
UCCCU
G
A
A
G
C
U A
A G
10
G U
60
A
C
C
C
C
C
A
5’
RD0260
RD0500
RE2140
A
A
GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
GCGACCGGGGCUGGCUUGGUA-AUGGUACUCCCCUGUCACGGGAGAGAA
UGUGGGUUCAAAUCCCAUCGGUCGCGCCA
GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGACGCGGGUUCGAAUCCCGCCUCGGGCGCCA
GCCCGGGUGGUGUAGU-GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUUCGAAUCCCGCCUCGGGCGCCA
GCCCCCAUCGUCUAGA-GGCCUAGGACACCUCCCUUUCACGGAGGCGACAGGGAUUCGAAUUCCCUUGGGGGUACCA
GCCCCCAUCGUCUAGA-GGCCUAGGACACCUCCCUUUCACGGAGGCGA-CAGGGAUUCGAAUUCCCUUGGGGGUACCA
◮
“Common” structures and conforming sequence alignment
◮
Joint estimation can harness comparative structure and sequence
information across homologs
3’
Multiple Sequence RNA Structure Prediction
Input Sequences
N1
S1
x1
xK
1
...
1
...
x2
N2
NK
Multiple
Structure
Prediction
S2
...
1
SK
A (Optional)
Multiple Sequence RNA Structure Prediction
Input Sequences
N1
S1
x1
xK
◮
1
...
1
...
x2
N2
NK
Multiple
Structure
Prediction
S2
...
1
SK
A (Optional)
Sankoff’s dynamic programming algorithm [Sankoff, 1985]
◮
◮
◮
Simultaneous folding (pseudo-knot free) and alignment of K
sequences
Time (Memory) complexity: O(N 3K ) (O(N 2K ))
Computationally infeasible even for short sequences and K = 2
w/o cutting corners
Turbo-Decoding RNA Secondary Structure
◮
Iteratively update each using information from other
TurboFold [Harmanci et al., 2007, 2011].
Base pairing
probabilities
N1
Input Sequences
1
1
N1
N1
N2
x1
1
x2
xK
1
N2
NK
TurboFold
1
1
N2
...
◮
Base pairing probabilities, posterior alignment probabilities
...
◮
...
◮
Goal: Performance similar (“better”) than joint estimation,
complexity similar to single sequence computation.
Probabilistic formulation of folding and alignment
...
◮
NK
1
1
1
NK
Structural Alignment: Joint Representation of Structures
and Alignment
Two sequence case
C
U
C
A
U
G
C
U
A
U
U
U
U C
AGGC AAUAUUC UUC UUC C GUGGUAGCUU
Structure 1
U
G
G
U
A
C
G
G
A
5’
C
U
U
A
G
3’
AGGC AAUAUUC UUC UUC C GUGGUAGCUU
Sequence 1
5’ Sequence 1
3’
AGGCAAUA UUCUU CUUCCGUGGUAGCUU
AA C GUACUUAC UUGAUCGCAUU G UU
Sequence 2
Sequence 2
Co−incidence path
for alignment
AAC GUAC UUAC UUGAUC GC AUUGUU
AAC GUACUUACUUGAUCGCAUUGUU
◮
C
Structure 2
U
U
A
C
A
U
G
A
C UU G
U
G
C
A
U
A
U
5’
3’
G
U
A
U
C
Decoupled Probabilistic Representation for RD0260,
RD0500 Structural Alignment
Formulate in probabilistic framework and separate the
folding/alignment representations
80
70
70
60
60
50
40
30
0
10
20
30
40
50
RD0500 indices
Base pairing
probabilities for
RD0500
(via partition
function)
60
70
RD0500 indices
80
40
30
20
10
10
0
80
0
0
10
20
0
10
20
80
Co-incidence
probabilities
(via HMM)
Base pairing
probabilities for
RD0260
50
20
70
60
RD0260 indices
◮
50
40
30
20
10
0
30
40
50
RD0260 indices
30
40
50
60
70
80
60
70
80
TurboFold
Given K homologous RNA sequences, each sequence contains
some information about folding of every other sequence.
TurboFold
Given K homologous RNA sequences, each sequence contains
some information about folding of every other sequence.
◮ Extrinsic information for a sequence
◮
◮
The information about folding of a sequence which is
computed using base pairing probabilities of other sequences
Thermodynamic model + Alignment model
TurboFold
Given K homologous RNA sequences, each sequence contains
some information about folding of every other sequence.
◮ Extrinsic information for a sequence
◮
◮
◮
The information about folding of a sequence which is
computed using base pairing probabilities of other sequences
Thermodynamic model + Alignment model
Base pairing probabilities of a sequence (Intrinsic Information)
◮
◮
From sequence itself
Thermodynamic model
TurboFold
Given K homologous RNA sequences, each sequence contains
some information about folding of every other sequence.
◮ Extrinsic information for a sequence
◮
◮
◮
Base pairing probabilities of a sequence (Intrinsic Information)
◮
◮
◮
The information about folding of a sequence which is
computed using base pairing probabilities of other sequences
Thermodynamic model + Alignment model
From sequence itself
Thermodynamic model
Iterative updates:
◮
◮
◮
◮
Compute extrinsic information using base pairing probabilities
and alignment co-incidence probabilities
Update base pairing probabilities using updated extrinsic
information
Update extrinsic information using updated base pairing
probabilities
...
Extrinsic Information for Base Pairing for RD0260
80
70
70
60
60
50
40
30
10
20
30
40
50
RD0500 indices
60
70
50
40
30
20
20
10
10
0
80
0
0
10
20
0
10
20
80
Extrinsic (RD0260)
0
p
RD0260
60
T
)
RD0500, RD0260
0
c
(Π
0 RD0500
p
RD0500, RD0260
Π
0
c
70
˜
Π
RD0260 indices
0
RD0500 indices
80
50
40
30
20
10
Π
0
(1 − ψ
)
RD0500, RD0260
30
40
50
RD0260 indices
30
40
50
60
70
80
60
70
80
3 Sequences
RD0500
70
70
60
60
RD0500 indices
80
50
40
30
20
0
0
50
40
30
20
10
10
10
20
30
40
50
RD0260 indices
60
70
0
0
80
10
20
30
40
50
RD0500 indices
60
70
80
RE2140
80
70
60
RE2140 indices
RD0260 indices
RD0260
80
50
40
30
20
10
0
0
10
20
30
40
50
RE2140 indices
60
70
80
Base Pairing Proclivity Matrix for RD0260 Induced by RE2140
T
RE2140
80
80
50
40
30
20
RD0260 indices
RE2140 indices
60
70
60
50
40
30
20
70
60
50
40
30
20
10
10
10
0
0
0
10
20
30
40
50
60
70
80
0
0
10
20
RE2140 indices
30
40
50
60
70
0
10
20
30
40
50
60
70
80
80
RE2140 indices
RE2140 indices
80
RD0260 indices
RD0260 indices
80
70
70
60
Base pairing proclivity matrix
for RD0260 induced by RE2140
50
40
30
20
10
0
0
10
20
30
40
50
60
70
80
RD0260 indices
◮
Information in RE2140 about folding of RD0260
t (s→m)
p Π̃
= c Π(m,s)
t−1 s
p Π
(c Π(m,s) )T
(1)
3 Sequences: Extrinsic information Computation
Proclivity Matrices for RD0260
80
Proclivity matrix
induced by RD0500
RD0260 indices
70
Extrinsic Information
Matrix for RD0260
60
50
Weight by
(1 - similarity)
40
30
20
80
10
0
10
20
30
40
50
60
70
80
RD0260 indices
Normalize by
maximum
80
Proclivity matrix
induced by RE2140
RD0260 indices
70
60
50
40
30
20
10
0
60
Weight by
(1 - similarity)
50
40
30
20
10
0
RD0260 indices
70
0
0
10
20
30
40
50
60
RD0260 indices
70
80
0
10
20
30
40
50
60
RD0260 indices
70
80
K Sequences: Extrinsic Information Computation for xm
x1
xm−1
···
xm+1
Nm−1
N1
···
Nm+1
xK
NK
Base pairing
probabilities
···
···
1
1
N1
1
N1
1
Nm−1
Nm−1
1
1
Nm+1
Nm
1
1
Nm
1
1
Nm
···
1
Nm
(1 − ψ1,m)
1
Nm
···
1
···
1
Nm
1
(1 − ψm−1,m)
1
Nm
(1 − ψm+1,m)
{tcΠ(s,m)}s∈N \m
Nm
Nm
Nm
Nm
1
Co−incidence
probabilities
···
1
1
{pt−1Πs}s∈N \m
1
NK
NK
Nm+1
···
1
1
1
1
Nm
···
(1 − ψk,m)
Induced base pairing
proclivity matrices
for xm
Normalize by maximum
Nm
1
1
Nm
Extrinsic information for
base pairing for xm
t Π̃m
p
{tpΠ̃(s→m)}s∈N \m
Base Pairing Probability Computation Using Extrinsic
Information
◮
Modified Boltzmann distribution of secondary structures:
!
∆G̃(S)
P (S) ∝ exp −
RT
Base Pairing Probability Computation Using Extrinsic
Information
◮
Modified Boltzmann distribution of secondary structures:
!
∆G̃(S)
P (S) ∝ exp −
RT
where
∆G̃(S) = ∆Go (S) − γ
X
log(π̃(i, j))
(i,j)∈S
is the modified free energy change for structure S.
◮
π̃(i, j): Extrinsic information for pairing of nucleotides at
indices i and j
◮
γ: Weight of extrinsic information on modified free energy
relative to ∆Go (S)
Extrinsic information introduced via a pseudo free energy for each
base pair
Base Pairing Probability Computation Using Extrinsic
Information
◮
Modified Boltzmann distribution of secondary structures:
!
∆G̃(S)
(2)
P (S) ∝ exp −
RT
where
∆G̃(S) = ∆Go (S) − γ
X
(i,j)∈S
log(π̃(i, j))
(3)
Base Pairing Probability Computation Using Extrinsic
Information
◮
Modified Boltzmann distribution of secondary structures:
!
∆G̃(S)
(2)
P (S) ∝ exp −
RT
where
∆G̃(S) = ∆Go (S) − γ
X
log(π̃(i, j))
(3)
(i,j)∈S
Replace (3) in (2):
∆Go (S)


Y
(π̃(i, j))γ/RT 
) 
}
(i,j)∈S
{z
}
Boltzmann distribution |
P (S) ∝ exp(−
|
{zRT
proportionality term
Extrinsic
information
Base Pairing Probability Computation Using Extrinsic
Information
◮
Modified Boltzmann distribution of secondary structures:
!
∆G̃(S)
(2)
P (S) ∝ exp −
RT
where
∆G̃(S) = ∆Go (S) − γ
X
log(π̃(i, j))
(3)
(i,j)∈S
Replace (3) in (2):
∆Go (S)


Y
(π̃(i, j))γ/RT 
) 
}
(i,j)∈S
{z
}
Boltzmann distribution |
P (S) ∝ exp(−
|
{zRT
proportionality term
◮
Extrinsic
information
Base pair (i, j) has a pseudo prior probability of (π̃(i, j))γ/RT
due to extrinsic information.
··· ···
···
Extrinsic information
computation for 1st
sequence
Π̃1
··· ···
···
Extrinsic information
computation for mth
sequence
···
Π̃m
··· ···
··· ···
··· ···
TurboFold: Iterative Updates
Extrinsic information
computation for K th
sequence
Π̃K
···
··· ···
Unit "time"
delay
Base pairing probability
Base pairing probability
Base pairing probability
th
th
computation for1st
· · · computation for m · · · computation forK
sequence
sequence
sequence
··· ···
···
···
···
Alignment
probabilities
ΠK
···
Πm
Π1
η Iterations
◮
For low K, per iteration complexity is comparable to single
sequence structure prediction
◮
Benefits from comparative analysis
TurboFold Structure Prediction Overview
◮
Obtain base pairing probabilities after η iterations, then
predict structures
◮
◮
Significant base pairs
Maximum expected accuracy (MEA) structures
Input Sequences
{xm}m∈N
N1
x1
TurboFold
1
xK
1
...
1
...
x2
N2
NK
{tpΠ̃m}
{tpΠm}
Thresholding
{ηp Πm}m∈N
Maximum Expected
η Iterations
Accuracy Prediction
TurboFold−MEA
Structure Prediction
◮
Structure for xm composed of base pairs with probabilities
greater than Pthresh :
S∗m = {(i, j) ∋ ηp π m (i, j) > Pthresh }
(4)
TurboFold: Computation Complexity
◮
Initialization
◮
◮
◮
Iterations
◮
◮
◮
Computation of co-incidence matrices: O(K 2 N 2 )
Computation of sequence similarities: O(K 2 N 2 )
Extrinsic information computation: O(ηK 2 d2 N 2 )
Base pairing probability computation: O(ηKU N 3 )
Structure prediction
◮
◮
Thresholding: O(KN 2 )
MEA prediction: O(KN 3 )
Compare to Sankoff’s algorithm: O(N 3 (U 2 d)K )
Evaluating Accuracy of Estimates
◮
Sensitivity: Ratio of number of correctly predicted base pairs
to the total number of base pairs in the known structure
True Positive
True Positive + False Negative
◮
◮
Recall
Positive Predictive Value(PPV): Ratio of number of correctly
predicted base pairs to the total number of base pairs in the
predicted structure
True Positive
True Positive + False Positive
◮
Precision
Parameter Selection: Number of iterations, η
Sensitivity vs. PPV over 5S rRNA dataset
with changing η
0.905
5
0.9
0.895
2
34
0.89
PPV
0.885
1
0.88
0.875
0.645
0.64
0.635
0.66
◮
0
0.67
η = 3 is used in TurboFold
0.87
0.88
Sensitivity
0.89
0.9
Parameter Selection: Weight of Extrinsic Information, γ
Sensitivity vs. PPV over 5S rRNA dataset
with changing γ/RT
1
1.2 1.0
0.8
0.95
0.5
0.3
0.9
0.2
PPV
0.85
0.1
0.8
0.05
0.75
0.7
0.65
0.001
0.6
0.65
◮
0.7
0.75
0.8
Sensitivity
γ = 0.3RT is used in TurboFold
0.85
0.9
Benchmarking Experiments: Datasets
◮
Randomly choose 200 RNase P, 400 5S rRNA, 400 SRP, and
400 tRNA sequences and divide into K combinations
◮
◮
Choose and divide for K = 2, . . . , 10
Yields 36 datasets
The datasets have significant diversity:
◮
RNase Ps: 336 nucleotides, 50% average pairwise identity
◮
tmRNA: 366 nucleotides, 45% average pairwise identity
◮
telomerase RNA: 445 nucleotides, 54% average pairwise
identity
◮
SRPs: 187 nucleotides, 42% average pairwise identity
◮
tRNAs: 77 nucleotides, 47% average pairwise identity
◮
5S rRNAs: 119 nucleotides, 63% average pairwise identity
Benchmarking Experiments: Datasets
◮
Randomly choose 200 RNase P, 400 5S rRNA, 400 SRP, and
400 tRNA sequences and divide into K combinations
◮
◮
Choose and divide for K = 2, . . . , 10
Yields 36 datasets
The datasets have significant diversity:
◮
RNase Ps: 336 nucleotides, 50% average pairwise identity
◮
tmRNA: 366 nucleotides, 45% average pairwise identity
◮
telomerase RNA: 445 nucleotides, 54% average pairwise
identity
◮
SRPs: 187 nucleotides, 42% average pairwise identity
◮
tRNAs: 77 nucleotides, 47% average pairwise identity
◮
5S rRNAs: 119 nucleotides, 63% average pairwise identity
Benchmarking Experiments
◮
TurboFold is benchmarked against methods that estimate
base pairing probabilities:
◮
◮
◮
LocARNA [Will et al., 2007]
RNAalifold [Bernhart et al., 2008]
Single sequence partition function [Mathews, 2004]
◮
The set of base pairs with estimated probabilities higher than
Pthresh are scored
◮
Plotted sensitivity versus PPV while varying Pthresh between 0
and 1 with step size of 0.04
Benchmarking Experiments
Sensitivity vs PPV ROC curves for TurboFold vs three alternative
methods
1
0.9
0.8
PPV
0.7
0.6
0.5
0.4
0.3
0.2
0
TurboFOLD (K = 3)
TurboFOLD (K = 10)
LocARNA (K = 3)
LocARNA (K = 10)
RNAalifold (K = 3)
RNAalifold (K = 10)
Single (K = 3)
Single (K = 10)
0.1
0.2
0.3
0.4
0.5
Sensitivity
0.6
0.7
0.8
0.9
Run Time Requirements
◮
Run time requirements over 50 RNase P sequence datasets
TurboFold
LocARNA
RNAalifold
Runtime (seconds) for
K = 3 K = 5 K = 10
136.75 277.9
517.0
746.44 2815.9 11395.8
0.2
0.3
0.6
Table : Time requirements (in seconds) for the methods.
◮
TurboFold scales slower with increase in K
Conclusions
◮
TurboFold: A multiple sequence structure prediction method
◮
Lowers Complexity with iterative combination of intrinsic and
extrinsic information for folding
◮
◮
Intrinsic information: From sequence via thermodynamic
folding model (nearest neighbor model)
Extrinsic information: From other sequences
◮
TurboFold accuracy: close to or higher than the simultaneous
folding and alignment methods
◮
Details: BMC Bioinformatics article Harmanci et al. [2011].
◮
Connections to coding theory in digital communications
Turbo Decoding: RNA vs Communications
◮
Multiple encodings of same information
◮
◮
Nature/Man
Joint (optimal) decoding desirable
◮
◮
Exact joint decoding ≈ exponential complexity
Iterative approximation (belief propagation)
◮
◮
◮
◮
Localized MAP probabilitistic formulation (base
pairing/symbol probs.)
Decomposition into loosely coupled individual decodings +
information exchange at each iteration
Linear complexity in length of data
Pseudo-prior interpretation
Ongoing Related Work
◮
Moving beyond TurboFold
◮
◮
◮
◮
Alignment probability updates based on structures
Better handling of dependencies
Domain insertions/deletions
Connecting with experiments
◮
◮
Incorporating experimental information (e.g. SHAPE) in
structural alignments
Postulating mechanisms and experimental validation (HIV)
Acknowledgments
◮
Collaborators at UR:
◮
◮
◮
Arif O. Harmanci, UR ECE, currently PostDoc at Yale
David H. Mathews, Department of Biochemistry and
Biophysics
Research support:
◮
◮
National Institutes of Health (NIH) (Award # GM097334-01)
Center for Research Computing, University of Rochester
Thank you
Questions?
Background
Turbo-Decoding RNA Secondary Structure
References
S. H. Bernhart, I. L. Hofacker, S. Will, A. R. Gruber, and P. F.
Stadler. RNAalifold: improved consensus structure prediction for
RNA alignments. BMC Bioinformatics, 9:474, 2008.
J. W. Brown. The Ribonuclease P database. Nucleic Acids Res.,
27(1):314, Jan. 1999.
Chuong B. Do, Daniel A. Woods, and Serafim Batzoglou.
CONTRAfold: RNA secondary structure prediction without
physics-based models. Bioinformatics, 22(14):90–98, 2006.
J. A. Doudna and T. R. Cech. The chemical repertoire of natural
ribozymes. Nature, 418(6894):222–228, 2002.
J.A. Doudna and J.H. Cate. RNA structure: crystal clear? Current
Opinions in Structural Biology, 7:310–316, 1997.
R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological
Sequence Analysis : Probabilistic Models of Proteins and
Nucleic Acids. Cambridge University Press, Cambridge, UK,
1999. ISBN 0521629713.
54
Background
Turbo-Decoding RNA Secondary Structure
References
A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews.
Toward turbo decoding of RNA secondary structure. In Proc.
IEEE Intl. Conf. Acoustics Speech and Sig. Proc., volume I,
pages 365–368, Apr. 2007.
A. Ozgun Harmanci, Gaurav Sharma, and David H. Mathews.
TurboFold: Iterative Probabilistic Estimation of Secondary
Structures for Multiple RNA Sequences. BMC Bioinformatics,
12:108, 2011. early access available online, April 20, 2011.
Zhi John Lu, Jason W. Gloor, and David H. Mathews. Improved
RNA secondary structure prediction by maximizing expected pair
accuracy. RNA, 15(10):1805–1813, 2009.
D. H. Mathews. Using an RNA secondary structure partition
function to determine confidence in base pairs predicted by free
energy minimization. RNA, 10(8):1178–1190, 2004.
D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded
sequence dependence of thermodynamic parameters provides
54
Background
Turbo-Decoding RNA Secondary Structure
References
improved prediction of RNA secondary structure. J. Mol. Biol.,
288(5):911–940, 1999.
J. S. McCaskill. The equilibrium partition function and base pair
binding probabilities for RNA secondary structure. Biopolymers,
29(6-7):1105–1119, Nov. 1990.
D. Sankoff. Simultaneous solution of RNA folding, alignment and
protosequence problems. SIAM J. App. Math., 45(5):810–825,
Oct. 1985.
I. Tinoco, Jr. and C. Bustamante. How RNA folds. J Mol Biol,
293(2):271–281, 1999.
Richard B. Waring and R. Wayne Davies. Assessment of a model
for intron rna secondary structure relevant to RNA self-splicing –
a review. Gene, 28(3):277–291, 1984.
Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler,
and Rolf Backofen. Inferring noncoding RNA families and
classes by means of genome-scale structure-based clustering.
PLoS Comput. Biol., 3(4):680–691, Apr. 2007.
54
Background
Turbo-Decoding RNA Secondary Structure
References
T. Xia, J. SantaLucia, Jr., R Kierzek, S. J. Schroeder, X Jiao,
C Cox, and Douglas Henry Turner. Thermodynamic parameters
for an expanded nearest-neighbor model for formation of RNA
duplexes with Watson-Crick pairs. Biochemistry, 37(42):
14719–14735, 1998.
M. Zuker. Computer prediction of RNA structure. Methods
Enzymol., 180:262–288, 1989.
54