RNAstrand: Reading direction of structured RNAs in multiple sequence alignments
Kristin Reiche, Peter F. Stadler
Bioinformatics Group, Department of Computer Science
University of Leipzig
Bled 2007
RNAstrand
Why strand predictor of structured RNAs?
[Figure: phylogenetic tree of Bilateria. Protostomes: Nematodes (Caenorhabditis elegans, Caenorhabditis briggsae, Caenorhabditis remanei). Deuterostomes: Echinoderms, Hemichordates and Chordates; the Chordates comprise Cephalochordates, Urochordates (Ascidians: Ciona intestinalis, Ciona savignyi; Larvaceans: Oikopleura dioica) and Vertebrates. The tree is annotated with the counts 3300, 3600 and 30 000.]
Genome-wide prediction of structured non-coding RNAs (RNAz, EvoFold)
Annotation is next challenge:
RNAs with homologous secondary structures (LocARNA)
Detect specific RNA families (RNAmicro, snoReport)
intronic, intergenic or non-translated exon (RNAstrand)
Why strand predictor of structured RNAs?
Naïve approach: take the strand with the higher RNAz probability or EvoFold score
Problem: RNAz and EvoFold are not trained for strand prediction of RNA
Accuracy: RNAz on miRNAs: 0.14, EvoFold on miRNAs: 0.84
Can we do better?
Task
Input alignment containing structured ncRNA

Structured ncRNA on reading direction of input alignment?
AL645782.2/32556-32631  CUAUAUAUACCCUGUAGAUCCGGAUUUGUGUA-AACAGACGCACAGUCACAAAUUCGUAUCUAGGGGAGUAUGUAGU
AY303232.1/800-874      CUAUACAUACCCUGUAGAACCAAAUGUGUGUGCAGCUGACUUG--AUCACAGAUUGGGUUCUAGGGGAGUCUAUGGG
AF321227.1/48039-48109  CUACAUCUACCCUGUAGAUCCGAAUUUGUUUGAAAUCAG------GCGACAAAUUCGGUUCUAGAGAGGUUUGUGUG
AF321227.1/86918-86845  CGUGAUCUACCCUGUAGAUCCGGGCUUUUGUAGAAUUUUUA---GAUCAGAAGCUCGUCUCUACAGGUAUCUUACGG
                        (((((.....(((.((((((((((((((((.................)))))))))))))))).)))....))))).
[Figure: drawing of the consensus secondary structure of the input alignment]

Structured ncRNA on reverse complement of input alignment?
AF321227.1/86918-86845  CCGUAAGAUACCUGUAGAGACGAGCUUCUGAUC---UAAAAAUUCUACAAAAGCCCGGAUCUACAGGGUAGAUCACG
AY303232.1/800-874      CCCAUAGACUCCCCUAGAACCCAAUCUGUGAU--CAAGUCAGCUGCACACACAUUUGGUUCUACAGGGUAUGUAUAG
AL645782.2/32556-32631  ACUACAUACUCCCCUAGAUACGAAUUUGUGACUGUGCGUCUGUU-UACACAAAUCCGGAUCUACAGGGUAUAUAUAG
AF321227.1/48039-48109  CACACAAACCUCUCUAGAACCGAAUUUGUCGC------CUGAUUUCAAACAAAUUCGGAUCUACAGGGUAGAUGUAG
                        .......((.(((.((((((((((((((((.................)))))))))))))))).)))))........
[Figure: drawing of the consensus secondary structure of the reverse complement]
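The relation between the two alignments on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the sketch handles the sequences only (the consensus structure has to be predicted again for the reverse strand), and it does not realign the reverse complement.

```python
# Complement table for RNA; gap characters are left untouched.
COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")


def reverse_complement_row(row: str) -> str:
    """Reverse one aligned RNA sequence and complement every base."""
    return row[::-1].translate(COMPLEMENT)


def reverse_complement_alignment(rows: dict[str, str]) -> dict[str, str]:
    """Apply the reverse complement to every row of an alignment."""
    return {name: reverse_complement_row(seq) for name, seq in rows.items()}


# One row of the input alignment shown on the slide.
forward = {
    "AF321227.1/86918-86845": (
        "CGUGAUCUACCCUGUAGAUCCGGGCUUUUGUAGAAUUUUUA---"
        "GAUCAGAAGCUCGUCUCUACAGGUAUCUUACGG"
    ),
}
reverse = reverse_complement_alignment(forward)
print(reverse["AF321227.1/86918-86845"])
```

Because gaps are kept in place, the column structure of the alignment is simply mirrored; G-U pairs of the forward strand turn into C-A on the reverse strand, which is one reason the two directions fold differently.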
Approach
Reading direction of input alignment / Realigned reverse complement (the same schematic is shown for both):
CGCACGGCTCTAACCGTGTGGTCGTGGG-TCGACCCCACGG
CAATCGGCT--TAACCGATTGGTCGCAGGTTCGAATCCTGCC
CAGAGGACTGCAAATCCTTTA-TCCCCAGTTCAAATCTGGGT
RNAfold: single sequence MFEs; RNAalifold: consensus MFE
(((((((........)))))))..((((((........)))))).    −19.2
((((((..........))))))...(((((.........)))))..   −21.3
.((((((........))))))....((((((........)))))).   −11.6
(((((((........)))))))..((((((........)))))).    −18.6
$\mathrm{meanz} = \frac{\sum \text{single sequence z-score}}{N}$ → ∆meanz
$\mathrm{meanmfe} = \frac{\sum \text{single sequence MFE}}{N}$ → ∆meanmfe
$\mathrm{sci} = \frac{\mathrm{consmfe}}{\mathrm{meanmfe}}$ → ∆sci
consmfe = MFE of consensus sequence → ∆consmfe
Classification via support vector machine (SVM)
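The four descriptors can be written down as a small Python sketch. Several things here are assumptions made for illustration: the class and function names, the sign convention of the differences (forward minus reverse), the reading of −18.6 as the consensus MFE and of the other three energies as single sequence MFEs, and all z-scores and reverse-strand values, which are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StrandDescriptors:
    meanz: float    # mean of the single sequence z-scores
    meanmfe: float  # mean of the single sequence MFEs
    consmfe: float  # MFE of the consensus structure
    sci: float      # structure conservation index, consmfe / meanmfe


def descriptors(single_mfes, single_zscores, consensus_mfe):
    """Compute meanz, meanmfe, consmfe and sci for one reading direction."""
    meanmfe = mean(single_mfes)
    if meanmfe == 0:
        raise ValueError("sci is undefined when the mean MFE is zero")
    return StrandDescriptors(
        meanz=mean(single_zscores),
        meanmfe=meanmfe,
        consmfe=consensus_mfe,
        sci=consensus_mfe / meanmfe,
    )


def deltas(forward, reverse):
    """Differences between both reading directions (assumed: forward - reverse)."""
    return {
        "delta_meanz": forward.meanz - reverse.meanz,
        "delta_meanmfe": forward.meanmfe - reverse.meanmfe,
        "delta_sci": forward.sci - reverse.sci,
        "delta_consmfe": forward.consmfe - reverse.consmfe,
    }


# Energies as on the slide; z-scores and the reverse strand are hypothetical.
fwd = descriptors([-19.2, -21.3, -11.6], [-3.1, -2.8, -1.9], -18.6)
rev = descriptors([-15.0, -17.5, -9.4], [-1.2, -0.9, -0.5], -9.8)
print(deltas(fwd, rev))
```

In practice the energies would come from RNAfold and RNAalifold and the z-scores from a separate estimate; none of those calls are shown here.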
Which descriptors are the best?
Receiver Operating Characteristics (5-fold cross validation):
[Figure: ROC curves, true positive rate (0.8 to 1) versus false positive rate (0 to 0.2); area under the curve in brackets]
∆meanz [0.9312]
∆sci [0.9487]
∆meanmfe [0.9798]
∆consmfe [0.9855]
Combination of two
[Figure: ROC curves, true positive rate (0.8 to 1) versus false positive rate (0 to 0.2); area under the curve in brackets]
∆sci, ∆meanz [0.9731]
∆sci, ∆consmfe [0.9850]
∆sci, ∆meanmfe [0.9869]
∆meanz, ∆consmfe [0.9871]
∆meanmfe, ∆consmfe [0.9874]
∆meanz, ∆meanmfe [0.9893]
Combination of at least three
[Figure: ROC curves, true positive rate (0.85 to 1) versus false positive rate (0 to 0.2); area under the curve in brackets]
∆meanz, ∆meanmfe, ∆consmfe [0.9878]
∆sci, ∆meanmfe, ∆consmfe [0.9887]
∆sci, ∆meanz, ∆consmfe [0.9911]
∆sci, ∆meanz, ∆meanmfe [0.9928]
∆sci, ∆meanz, ∆meanmfe, ∆consmfe [0.9939]
Maximal area under the curve (99.39%) if all four descriptors are taken
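For readers unfamiliar with the measure, the area under a ROC curve can be computed directly from classifier scores as the probability that a positive example is ranked above a negative one. The sketch below shows that rank-based definition only; it is not the cross-validation pipeline used for the slides, and the scores and labels are toy values.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: label 1 = alignment in reading direction of the ncRNA.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(toy_scores, toy_labels))
```

The quadratic loop is fine for illustration; for thousands of alignments a sort-based implementation would be the usual choice.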
Additional descriptors
∆meanmfe, ∆consmfe, ∆meanz, ∆sci depend on fraction of GU base pairs:
[Figure: panels for U70 snoRNA (RF00156) and tRNA (RF00005). Scatter plots with the axes ∆meanmfe, ∆consmfe, ∆sci and ∆z-score, and histograms of relative frequency over (GU(+)/all(+)) + (GU(-)/all(-)).]
Final set of descriptors
Strand differences are captured by:
Differences in stability: ∆meanz, ∆meanmfe
Differences in structure conservation: ∆sci, ∆consmfe
Relevance of the strand differences is interpreted by:
$\frac{\text{GU pairs in consensus}(+)}{\text{all pairs in consensus}(+)} + \frac{\text{GU pairs in consensus}(-)}{\text{all pairs in consensus}(-)}$
Average mean pairwise identity of alignment in both reading directions
Number of sequences in alignment
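The GU fraction and the mean pairwise identity could be computed along the following lines. How RNAstrand counts pairs and treats gaps is not stated on the slides, so the conventions below are assumptions: pairs with a gap on either side are skipped, non-canonical pairs still count towards the total, and columns that are gaps in both sequences are ignored for the identity. All names and the toy alignment are hypothetical.

```python
from itertools import combinations
from statistics import mean


def base_pairs(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def gu_fraction(rows, structure):
    """Fraction of G-U pairs among all ungapped pairs of the consensus."""
    pairs = base_pairs(structure)
    gu = total = 0
    for seq in rows:
        seq = seq.upper()
        for i, j in pairs:
            a, b = seq[i], seq[j]
            if "-" in (a, b):
                continue  # assumed: gapped pairs are not counted
            total += 1
            if {a, b} == {"G", "U"}:
                gu += 1
    return gu / total if total else 0.0


def mean_pairwise_identity(rows):
    """Average identity over all sequence pairs, ignoring gap-gap columns."""
    identities = []
    for a, b in combinations(rows, 2):
        columns = [(x, y) for x, y in zip(a, b) if (x, y) != ("-", "-")]
        if columns:
            identities.append(sum(x == y for x, y in columns) / len(columns))
    return mean(identities) if identities else 0.0


# Toy forward alignment and its reverse complement, same toy structure.
fwd_rows = ["GGGAAAUCU", "GGCAAAGCU"]
rev_rows = ["AGAUUUCCC", "AGCUUUGCC"]
structure = "(((...)))"
gu_sum = gu_fraction(fwd_rows, structure) + gu_fraction(rev_rows, structure)
print(gu_sum, mean_pairwise_identity(fwd_rows))
```

In the toy data the forward strand contains G-U pairs while the reverse complement has none, which is the asymmetry this descriptor is meant to expose.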
SVM specifications
SVM library libsvm
Radial basis function kernel: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$
Attributes are scaled to the interval [−1, 1]
Optimal parameters: penalty of error term C = 128, γ = 0.5
Probability estimate P that the alignment contains an ncRNA in the same reading direction
RNAstrand score: $D = 2P - 1$, $D \in [-1, 1]$
Different cutoffs c of the score provide different prediction reliabilities:
D > +c: ncRNA in reading direction of input alignment
D < −c: ncRNA is reverse complement of input alignment
−c ≤ D ≤ +c: No decision
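The kernel, the score and the decision rule from this slide translate directly into code. The following is a minimal sketch in plain Python rather than a call into libsvm; the function names and return strings are chosen for this example only.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def strand_score(p):
    """Map the probability estimate P in [0, 1] to the score D = 2P - 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be a probability")
    return 2.0 * p - 1.0


def call_strand(d, c):
    """Apply the cutoff c to the score D as described on the slide."""
    if d > c:
        return "reading direction of input alignment"
    if d < -c:
        return "reverse complement of input alignment"
    return "no decision"


print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))   # exp(-0.5) with gamma = 0.5
print(call_strand(strand_score(0.97), c=0.9))
```

A larger cutoff c trades coverage for reliability: more alignments end up in the "no decision" band, but the remaining calls rest on more extreme probability estimates.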
Training of RNAstrand
5886 training alignments including representatives of rRNAs, snRNAs, snoRNAs, tRNAs, miRNAs, nuclear RNase P and SRP RNA
[Figure: histograms (absolute frequency) of the training alignments over alignment length, mean pairwise sequence identity, (GU(+)/all(+)) + (GU(-)/all(-)) and number of sequences]
Validation of RNAstrand
Validate RNAstrand with 35766 automatically created ClustalW alignments of 313 non-coding RNA families found in Rfam 7.0

ncRNA type       N      c = 0         c = 0.5                c = 0.9
                        A+     A−     A      1-A-u   u       A      1-A-u   u
5S rRNA          860    0.98   0.98   0.98   0.00    0.01    0.95   0.00    0.03
5.8S rRNA        146    0.93   0.93   0.89   0.05    0.05    0.73   0.02    0.24
tRNA             294    0.94   0.94   0.88   0.01    0.09    0.62   0.00    0.36
miRNA            2496   0.98   0.97   0.96   0.00    0.02    0.89   0.00    0.10
snoRNA (C/D)     204    0.59   0.57   0.48   0.32    0.18    0.29   0.18    0.52
snoRNA (H/ACA)   1340   0.98   0.98   0.97   0.01    0.01    0.94   0.00    0.05
spliceos. RNA    2878   0.92   0.92   0.88   0.05    0.06    0.77   0.02    0.19
euk. SRP RNA     1000   0.99   0.99   0.99   0.00    0.00    0.97   0.00    0.02
nucl. RNaseP     260    0.93   0.93   0.92   0.04    0.03    0.85   0.01    0.12
RNase MRP        140    0.98   1.00   0.98   0.00    0.01    0.96   0.00    0.03
SECIS            76     0.65   0.64   0.51   0.25    0.22    0.32   0.19    0.48
7SK              184    0.04   0.03   0.02   0.91    0.05    0.01   0.80    0.18

Column O (nine entries listed for the twelve rows, in slide order): 12.6%, 0.9%, 4.5%, 9.9%, 1.2%, 35.0%, 8.1%, 41.9%, -
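The three columns reported per cutoff can be reproduced from a list of scores and known strands. The slide does not define the column headers, so this sketch assumes that A is the fraction of correctly classified alignments, u the fraction left undecided and 1-A-u the fraction misclassified. The function name and the toy data are hypothetical.

```python
def evaluate(scores, truths, c):
    """Return (A, 1-A-u, u) for RNAstrand scores D at cutoff c.

    truths holds +1 if the ncRNA lies in the reading direction of the
    input alignment and -1 if it lies on the reverse complement.
    """
    if not scores or len(scores) != len(truths):
        raise ValueError("scores and truths must be non-empty and equally long")
    correct = undecided = 0
    for d, truth in zip(scores, truths):
        if -c <= d <= c:
            undecided += 1              # no decision inside the cutoff band
        elif (d > c) == (truth == 1):
            correct += 1                # predicted strand matches the truth
    n = len(scores)
    a = correct / n
    u = undecided / n
    return a, 1.0 - a - u, u


# Toy scores, not taken from the validation set.
toy_d = [0.95, 0.6, 0.2, -0.7, -0.97, 0.8]
toy_truth = [1, 1, 1, -1, -1, -1]
for cutoff in (0.0, 0.5, 0.9):
    print(cutoff, evaluate(toy_d, toy_truth, cutoff))
```

On the toy data the misclassified alignment disappears into the undecided band at c = 0.9, at the price of leaving most alignments without a call; the table shows the same trade-off for the real families.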
Best cutoff c
[Figure: two ROC plots (true positive rate versus false positive rate), one including and one excluding undecided alignments, with cutoffs c = 0, c = 0.15 and c = 0.95 marked; separate curves for alignments classified to contain ncRNA in same reading direction and alignments classified to contain ncRNA on reverse complement]
Maximal Youden index of 0.75 with cutoff c = 0.15
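Youden's index is J = sensitivity + specificity − 1, which equals the true positive rate minus the false positive rate, so the best cutoff is the point of the ROC curve farthest above the diagonal. The sketch below picks that point from a list of (cutoff, TPR, FPR) triples; the triples are toy values for illustration and are not read off the plot.

```python
def youden_index(tpr, fpr):
    """Youden's J = sensitivity + specificity - 1 = TPR - FPR."""
    return tpr - fpr


def best_cutoff(points):
    """Return the (cutoff, tpr, fpr) triple with the largest Youden index."""
    if not points:
        raise ValueError("need at least one ROC point")
    return max(points, key=lambda point: youden_index(point[1], point[2]))


# Hypothetical ROC points: (cutoff c, true positive rate, false positive rate).
toy_points = [
    (0.0, 0.90, 0.30),
    (0.2, 0.85, 0.15),
    (0.8, 0.60, 0.05),
]
c, tpr, fpr = best_cutoff(toy_points)
print(c, youden_index(tpr, fpr))
```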
Distribution of RNAstrand scores
[Figure: relative frequencies (axis ticks from 0.01 to 100) of RNAstrand scores from -1 to 1, in two panels: accuracy at least 0.5 (279 families) and accuracy less than 0.5 (34 families); separate curves for alignments in reading direction of ncRNA and reverse complementary alignments]
Distribution of SVM decision values
[Figure: relative frequencies of SVM decision values (-20 to 20), one panel for test alignments and one for shuffled test alignments]
Are we better than naïve approach? (RNAz)
Accuracy (RNAstrand: c = 0)

ncRNA type       N      A(RNAstrand)   A(RNAz)
5S rRNA          860    0.98           0.97
5.8S rRNA        146    0.93           0.90
tRNA             294    0.94           0.53
miRNA            2496   0.97           0.14
snoRNA (C/D)     204    0.58           0.46
snoRNA (H/ACA)   1340   0.98           0.94
spliceos. RNA    2878   0.92           0.82
euk. SRP RNA     1000   0.99           0.84
nucl. RNaseP     260    0.93           0.82
RNase MRP        140    0.99           0.50
SECIS            76     0.65           0.48
7SK              184    0.04           0.03
Are we better than naïve approach? (RNAz)

                   RNAstrand correct   RNAstrand incorrect
RNAz correct       21536  21425        1618  1729
RNAz incorrect     8711   8639         3901  3973

2-fold reduction of misclassification rate
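The stated reduction can be checked with a few lines of arithmetic. The sketch assumes that the rows of the contingency table refer to RNAz and the columns to RNAstrand, and it uses the first of the two numbers given in each cell; both choices are interpretations of the slide, not something it states explicitly.

```python
# Keys are (RNAz outcome, RNAstrand outcome); the layout is an assumption.
table = {
    ("correct", "correct"): 21536,
    ("correct", "incorrect"): 1618,
    ("incorrect", "correct"): 8711,
    ("incorrect", "incorrect"): 3901,
}

total = sum(table.values())
rnaz_errors = sum(n for (rnaz, _), n in table.items() if rnaz == "incorrect")
rnastrand_errors = sum(n for (_, rs), n in table.items() if rs == "incorrect")

rnaz_rate = rnaz_errors / total
rnastrand_rate = rnastrand_errors / total
reduction = rnaz_rate / rnastrand_rate

print(f"alignments: {total}")
print(f"RNAz misclassification rate:      {rnaz_rate:.3f}")
print(f"RNAstrand misclassification rate: {rnastrand_rate:.3f}")
print(f"reduction factor:                 {reduction:.2f}")
```

Under this reading the cell counts add up to the 35766 validation alignments, which supports the assumed layout.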
Are we better than naïve approach? (EvoFold)

                      RNAstrand(fwd)          RNAstrand(rev)
                      correct    incorrect    correct    incorrect
EvoFold correct       104  8     16  0        102  8     18  0
EvoFold incorrect     15   0     7   2        11   0     11  2

Strand prediction of EvoFold comparable to RNAstrand.
Backup Slide - Classification of 7SK RNAs
[Figure: panels for 7SK (RF00100) and U70 snoRNA (RF00156); scatter plots with the axes ∆meanmfe, ∆consmfe, ∆sci and ∆z-score]
Backup Slide - Test alignments
[Figure: histograms (absolute frequency) of the test alignments over alignment length, mean pairwise sequence identity, (GU(+)/all(+)) + (GU(-)/all(-)) and number of sequences]
Backup Slide - Standard deviations

ncRNA class      N    c = 0         c = 0.5                c = 0.9
                      A+     A−     A      1-A-u   u       A      1-A-u   u
rRNA             2    0.02   0.02   0.04   0.02    0.01    0.11   0.01    0.10
miRNA            36   0.15   0.13   0.18   0.05    0.13    0.31   0.00    0.31
snoRNA (C/D)     23   0.38   0.40   0.38   0.36    0.15    0.37   0.28    0.35
snoRNA (H/ACA)   26   0.17   0.17   0.22   0.11    0.15    0.32   0.05    0.30
spliceos. RNA    6    0.24   0.25   0.29   0.20    0.08    0.29   0.12    0.17
Backup Slide - GU base pair fraction
[Figure: relative frequency of the fraction of GU base pairs in consensus structure (0 to 1), for original test alignments in reading direction of ncRNA and for corresponding shuffled alignments]