RNAstrand : Reading direction of structured RNAs in multiple sequence alignments Kristin Reiche, Peter F. Stadler Bioinformatics Group, Department of Computer Science University of Leipzig Bled 2007 RNAstrand Why strand predictor of structured RNAs? 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis 30 000 Vertebrates Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria Genome wide prediction of structured non-coding RNAs (RNAz , EvoFold ) Annotation is next challenge: RNAs with homologous secondary structures (LocARNA ) Detect specific RNA families (RNAmicro , snoReport ) intronic, intergenic or non-translated exon (RNAstrand ) RNAstrand Why strand predictor of structured RNAs? 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis 30 000 Vertebrates Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria Genome wide prediction of structured non-coding RNAs (RNAz , EvoFold ) Annotation is next challenge: RNAs with homologous secondary structures (LocARNA ) Detect specific RNA families (RNAmicro , snoReport ) intronic, intergenic or non-translated exon (RNAstrand ) RNAstrand Why strand predictor of structured RNAs? 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis 30 000 Vertebrates Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria Genome wide prediction of structured non-coding RNAs (RNAz , EvoFold ) Annotation is next challenge: RNAs with homologous secondary structures (LocARNA ) Detect specific RNA families (RNAmicro , snoReport ) intronic, intergenic or non-translated exon (RNAstrand ) RNAstrand Why strand predictor of structured RNAs? 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis 30 000 Vertebrates Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria Genome wide prediction of structured non-coding RNAs (RNAz , EvoFold ) Annotation is next challenge: RNAs with homologous secondary structures (LocARNA ) Detect specific RNA families (RNAmicro , snoReport ) intronic, intergenic or non-translated exon (RNAstrand ) RNAstrand Why strand predictor of structured RNAs? 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis 30 000 Vertebrates Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria Genome wide prediction of structured non-coding RNAs (RNAz , EvoFold ) Annotation is next challenge: RNAs with homologous secondary structures (LocARNA ) Detect specific RNA families (RNAmicro , snoReport ) intronic, intergenic or non-translated exon (RNAstrand ) RNAstrand Why strand predictor of structured RNAs? Naïve approach: Take strand which has higher RNAz probability or EvoFold score Problem: RNAz and EvoFold are not trained for strand prediction of RNA Accuracy: RNAz on miRNAs: 0.14, EvoFold on miRNAs: 0.84 Can we do that better? RNAstrand Why strand predictor of structured RNAs? Naïve approach: Take strand which has higher RNAz probability or EvoFold score Problem: RNAz and EvoFold are not trained for strand prediction of RNA Accuracy: RNAz on miRNAs: 0.14, EvoFold on miRNAs: 0.84 Can we do that better? RNAstrand Why strand predictor of structured RNAs? Naïve approach: Take strand which has higher RNAz probability or EvoFold score Problem: RNAz and EvoFold are not trained for strand prediction of RNA Accuracy: RNAz on miRNAs: 0.14, EvoFold on miRNAs: 0.84 Can we do that better? RNAstrand Why strand predictor of structured RNAs? Naïve approach: Take strand which has higher RNAz probability or EvoFold score Problem: RNAz and EvoFold are not trained for strand prediction of RNA Accuracy: RNAz on miRNAs: 0.14, EvoFold on miRNAs: 0.84 Can we do that better? RNAstrand Task Input alignment containing structured ncRNA AL645782.2/32556-32631 AY303232.1/800-874 AF321227.1/48039-48109 AF321227.1/86918-86845 (((((.....(((.((((((((((((((((.................)))))))))))))))).)))....))))). CUAUAUAUACCCUGUAGAUCCGGAUUUGUGUA-AACAGACGCACAGUCACAAAUUCGUAUCUAGGGGAGUAUGUAGU CUAUACAUACCCUGUAGAACCAAAUGUGUGUGCAGCUGACUUG--AUCACAGAUUGGGUUCUAGGGGAGUCUAUGGG CUACAUCUACCCUGUAGAUCCGAAUUUGUUUGAAAUCAG------GCGACAAAUUCGGUUCUAGAGAGGUUUGUGUG CGUGAUCUACCCUGUAGAUCCGGGCUUUUGUAGAAUUUUUA---GAUCAGAAGCUCGUCUCUACAGGUAUCUUACGG Structured ncRNA on reading direction of input alignment? G CG UG AU UG UA UC A U U G A CC G A AL645782.2/32556-32631 AY303232.1/800-874 AF321227.1/48039-48109 AF321227.1/86918-86845 CG UA G G UA AU GC AU UU CG CG GC AU AU UA UA UA GC UA G U CU A _ A A C U GAC (((((.....(((.((((((((((((((((.................)))))))))))))))).)))....))))). CUAUAUAUACCCUGUAGAUCCGGAUUUGUGUA-AACAGACGCACAGUCACAAAUUCGUAUCUAGGGGAGUAUGUAGU CUAUACAUACCCUGUAGAACCAAAUGUGUGUGCAGCUGACUUG--AUCACAGAUUGGGUUCUAGGGGAGUCUAUGGG CUACAUCUACCCUGUAGAUCCGAAUUUGUUUGAAAUCAG------GCGACAAAUUCGGUUCUAGAGAGGUUUGUGUG CGUGAUCUACCCUGUAGAUCCGGGCUUUUGUAGAAUUUUUA---GAUCAGAAGCUCGUCUCUACAGGUAUCUUACGG A _ _ _ _ _ Structured ncRNA on reverse complement of input alignment? C GA C U C A A U C A A G GA UA C U G CG CG CA C C UA AU GC AU A A AG CG GC AC AU UA UA UA GC UA AGCA C C _ _ _ U _ U _ A _ A GUC AF321227.1/86918-86845 AY303232.1/800-874 AL645782.2/32556-32631 AF321227.1/48039-48109 .......((.(((.((((((((((((((((.................)))))))))))))))).)))))........ CCGUAAGAUACCUGUAGAGACGAGCUUCUGAUC---UAAAAAUUCUACAAAAGCCCGGAUCUACAGGGUAGAUCACG CCCAUAGACUCCCCUAGAACCCAAUCUGUGAU--CAAGUCAGCUGCACACACAUUUGGUUCUACAGGGUAUGUAUAG ACUACAUACUCCCCUAGAUACGAAUUUGUGACUGUGCGUCUGUU-UACACAAAUCCGGAUCUACAGGGUAUAUAUAG CACACAAACCUCUCUAGAACCGAAUUUGUCGC------CUGAUUUCAAACAAAUUCGGAUCUACAGGGUAGAUGUAG RNAstrand Approach RNAstrand Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE RNAstrand Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 meanz = P RNAfold: single sequence MFEs RNAalifold: consensus MFE single sequence z-score N RNAstrand Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 meanz = P RNAfold: single sequence MFEs RNAalifold: consensus MFE single sequence z-score N (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 meanz RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE P single sequence z-score N meanmfe = single sequence MFE N meanz = P (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 meanz RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz meanmfe = P single sequence MFE N meanmfe meanz = RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz meanmfe = P single sequence MFE N meanmfe meanz = sci = consmfe meanmfe RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz meanmfe = P single sequence MFE N meanmfe meanz = sci = consmfe meanmfe sci RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz meanmfe = P single sequence MFE N meanmfe meanz = sci = consmfe meanmfe sci consmfe = MFE of consensus sequence RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz meanmfe = P single sequence MFE N meanmfe meanz = sci = consmfe meanmfe sci consmfe = MFE of consensus sequence consmfe RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz → 4meanz meanmfe = P single sequence MFE N meanmfe → 4meanmfe meanz = sci = consmfe meanmfe sci → 4sci consmfe = MFE of consensus sequence consmfe →4consmfe RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Approach Reading direction of input alignment CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG Realigned reverse complement CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 P single sequence z-score N meanz → 4meanz meanmfe = P single sequence MFE N meanmfe → 4meanmfe meanz = sci = consmfe meanmfe sci → 4sci consmfe = MFE of consensus sequence consmfe →4consmfe Classification via support vector machine (SVM) RNAstrand RNAfold: single sequence MFEs RNAalifold: consensus MFE Which descriptors are the best? RNAstrand Which descriptors are the best? Receiver Operating Characteristics (5-fold cross validation): RNAstrand Which descriptors are the best? Receiver Operating Characteristics (5-fold cross validation): true positive rate 1 0.95 0.9 ∆meanz [0.9312] ∆sci [0.9487] ∆meanmfe [0.9798] ∆consmfe [0.9855] 0.85 0.8 0 0.05 0.1 0.15 false positive rate 0.2 RNAstrand Combination of two true positive rate 1 0.95 ∆sci, ∆meanz [0.9731] ∆sci, ∆consmfe [0.9850] ∆sci, ∆meanmfe [0.9869] ∆meanz, ∆consmfe [0.9871] ∆meanmfe, ∆consmfe [0.9874] ∆meanz, ∆meanmfe [0.9893] 0.9 0.85 0.8 0 0.05 0.1 0.15 false positive rate 0.2 RNAstrand Combination of at least three true positive rate 1 0.95 0.9 ∆meanz, ∆meanmfe, ∆consmfe [0.9878] ∆sci, ∆meanmfe, ∆consmfe [0.9887] ∆sci, ∆meanz, ∆consmfe [0.9911] ∆sci, ∆meanz, ∆meanmfe [0.9928] ∆sci, ∆meanz, ∆meanmfe, ∆consmfe [0.9939] 0.85 0 0.05 0.1 0.15 false positive rate 0.2 RNAstrand Combination of at least three true positive rate 1 0.95 0.9 ∆meanz, ∆meanmfe, ∆consmfe [0.9878] ∆sci, ∆meanmfe, ∆consmfe [0.9887] ∆sci, ∆meanz, ∆consmfe [0.9911] ∆sci, ∆meanz, ∆meanmfe [0.9928] ∆sci, ∆meanz, ∆meanmfe, ∆consmfe [0.9939] 0.85 0 0.05 0.1 0.15 false positive rate 0.2 Maximal area under the curve (99.39%) if all four descriptors are taken RNAstrand Additional descriptors 4meanmfe , 4consmfe , 4meanz , 4sci depend on fraction of GU base pairs: 30 20 0.5 20 0 -10 relative frequency ∆meanmfe ∆consmfe 10 10 0 -10 -20 -0.4 -0.2 0 0.2 0.4 ∆sci -3 30 -2 -1 0 1 2 3 ∆z-score 0 0 10 20 30 40 0.07 20 50 (GU(+)/all(+)) + (GU(-)/all(-)) U70 snoRNA (RF00156) 0.06 20 ∆meanmfe 10 0 -10 relative frequency 10 ∆consmfe 0.3 0.2 0.1 -20 -30 tRNA (RF00005) 0.4 0 -10 -20 0.05 0.04 0.03 0.02 0.01 -30 -20 -0.4 -0.2 0 0.2 0.4 ∆sci -3 -2 -1 0 1 2 3 ∆z-score 0 0 RNAstrand 10 20 30 40 50 (GU(+)/all(+)) + (GU(-)/all(-)) Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe Relevance of the strand differences are interpreted by: GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe Relevance of the strand differences are interpreted by: GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand Final set of descriptors Strand differences are captured by: Differences in stability 4meanz 4meanmfe Differences in structure conservation 4sci 4consmfe Relevance of the strand differences are interpreted by: GU pairs in consensus GU pairs in consensus + all pairs in consensus all pairs in consensus Average mean pairwise identity of alignment in both reading directions Number of sequences in alignment RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand SVM specifications SVM library libsvm Radial basis function kernel: K (xi , xj ) = exp(−γ||xi − xj ||2 ) Attributes are scaled to −1 and 1 Optimal parameters: penalty of error term C = 128, γ = 0.5 Probability estimates P that alignment contains ncRNA in same reading direction RNAstrand score: D = 2 ∗ P − 1, D ∈ [−1, 1] Different cutoffs c of score provide different prediction reliabilities: D > +c: ncRNA in reading direction of input alignment D < −c: ncRNA is reverse complement of input alignment −c ≤ D ≤ +c: No decision RNAstrand Training of RNAstrand 5886 training alignments including representatives of rRNAs, snRNAs, snoRNAs, tRNAs, miRNAs, nuclear RNase P and SRP RNA RNAstrand Training of RNAstrand absolute frequency 5886 training alignments including representatives of rRNAs, snRNAs, snoRNAs, tRNAs, miRNAs, nuclear RNase P and SRP RNA 700 600 500 400 300 200 100 0 300 250 200 150 100 50 0 100 200 300 alignment length 400 0 40 50 60 70 80 90 100 mean pairwise sequence identity absolute frequency 700 300 600 500 200 400 300 100 200 100 0 0 10 20 30 40 50 60 (GU(+)/all(+)) + (GU(-)/all(-)) 0 2 3 4 5 number sequences RNAstrand 6 Validation of RNAstrand Validate RNAstrand with 35766 automatically created ClustalW alignments of 313 non-coding RNA families found in RFAM 7.0 ncRNA type 5S rRNA 5.8S rRNA tRNA miRNA snoRNA (C/D) snoRNA (H/ACA) spliceos. RNA euk. SRP RNA nucl. RNaseP RNase MRP SECIS 7SK N 860 146 294 2496 204 1340 2878 1000 260 140 76 184 O 12.6% 0.9% 4.5% 9.9% 1.2% 35.0% 8.1% 41.9% - c =0 A+ A− 0.98 0.98 0.93 0.93 0.94 0.94 0.98 0.97 0.59 0.57 0.98 0.98 0.92 0.92 0.99 0.99 0.93 0.93 0.98 1.00 0.65 0.64 0.04 0.03 A 0.98 0.89 0.88 0.96 0.48 0.97 0.88 0.99 0.92 0.98 0.51 0.02 c = 0.5 1-A-u 0.00 0.05 0.01 0.00 0.32 0.01 0.05 0.00 0.04 0.00 0.25 0.91 RNAstrand u 0.01 0.05 0.09 0.02 0.18 0.01 0.06 0.00 0.03 0.01 0.22 0.05 A 0.95 0.73 0.62 0.89 0.29 0.94 0.77 0.97 0.85 0.96 0.32 0.01 c = 0.9 1-A-u 0.00 0.02 0.00 0.00 0.18 0.00 0.02 0.00 0.01 0.00 0.19 0.80 u 0.03 0.24 0.36 0.10 0.52 0.05 0.19 0.02 0.12 0.03 0.48 0.18 Best cutoff c including undecided alignments excluding undecided alignments 0.91 c=0 c=0.15 0.9 true positive rate 0.8 c=0.95 0.89 0.88 0.75 0.87 0.86 c=0.15 0.7 0.85 c=0.95 0.65 0 0.05 0.1 0.15 false positive rate c=0 0.84 0.2 0.1 0.12 0.14 false positive rate 0.16 alignments classified to contain ncRNA in same reading direction alignments classified to contain ncRNA on reverse complement Maximal Youden index of 0.75 with cutoff c = 0.15 RNAstrand Distribution of RNAstrand scores relative frequencies Accuracy at least 0.5 (279 families) 100 10 1 0.1 0.01 -1 -0.5 0 RNAstrand score 0.5 1 relative frequencies Accuracy less than 0.5 (34 families) 100 10 1 0.1 0.01 -1 -0.5 0 0.5 RNAstrand score Alignments in reading direction of ncRNA Reverse complementary alignments RNAstrand 1 Distribution of SVM decision values relative frequencies test alignments 0.09 0.06 0.03 0 -20 -10 0 10 20 10 20 SVM decision values shuffled test alignments relative frequencies 0.12 0.09 0.06 0.03 0 -20 -10 0 SVM decision values RNAstrand Are we better than naïve approach? (RNAz ) ncRNA type 5S rRNA 5.8S rRNA tRNA miRNA snoRNA (C/D) snoRNA (H/ACA) spliceos. RNA euk. SRP RNA nucl. RNaseP RNase MRP SECIS 7SK N 860 146 294 2496 204 1340 2878 1000 260 140 76 184 Accuracy (RNAstrand: c = 0) A(RNAstrand) A(RNAz) 0.98 0.97 0.93 0.90 0.94 0.53 0.97 0.14 0.58 0.46 0.98 0.94 0.92 0.82 0.99 0.84 0.93 0.82 0.99 0.50 0.65 0.48 0.04 0.03 RNAstrand Are we better than naïve approach? (RNAz ) RNAstrand correct incorrect correct 21536 21425 1618 1729 incorrect 8711 8639 3901 3973 2-fold reduction of misclassification rate RNAstrand Are we better than naïve approach? (EvoFold ) RNAstrand(fwd) correct incorrect RNAstrand(rev) correct incorrect correct 104 8 16 0 correct 102 8 18 0 incorrect 15 0 7 2 incorrect 11 0 11 2 Strand prediction of EvoFold comparable to RNAstrand . RNAstrand RNAstrand Backup Slide - Classification of 7SK RNAs 30 20 7SK (RF00100) 20 ∆meanmfe ∆consmfe 10 10 0 -10 0 -10 -20 -30 -20 -0.4 -0.2 0 0.2 0.4 ∆sci -3 30 -2 -1 0 1 2 20 3 ∆z-score U70 snoRNA (RF00156) 20 ∆meanmfe ∆consmfe 10 10 0 -10 0 -10 -20 -30 -20 -0.4 -0.2 0 0.2 0.4 ∆sci -3 -2 -1 0 RNAstrand 1 2 3 ∆z-score absolute frequency Backup Slide - Test alignments 5000 4000 4000 3000 3000 2000 2000 1000 1000 0 0 100 200 300 alignment length 400 0 40 50 60 70 80 90 100 mean pairwise sequence identity absolute frequency 4000 8000 3000 6000 2000 4000 1000 0 2000 0 10 20 30 40 50 60 (GU(+)/all(+)) + (GU(-)/all(-)) 0 2 3 4 5 number of sequences RNAstrand 6 Backup Slide - Standard deviations ncRNA class rRNA miRNA snoRNA (C/D) snoRNA (H/ACA) spliceos. RNA N 2 36 23 26 6 c=0 A+ A− 0.02 0.02 0.15 0.13 0.38 0.40 0.17 0.17 0.24 0.25 A 0.04 0.18 0.38 0.22 0.29 c = 0.5 1-A-u 0.02 0.05 0.36 0.11 0.20 RNAstrand u 0.01 0.13 0.15 0.15 0.08 A 0.11 0.31 0.37 0.32 0.29 c = 0.9 1-A-u 0.01 0.00 0.28 0.05 0.12 u 0.10 0.31 0.35 0.30 0.17 relative frequency Backup Slide - GU base pair fraction 10 5 0 0 0.2 0.4 0.6 0.8 1 fraction of GU base pairs in consensus structure original test alignments in reading direction of ncRNA corresponding shuffled alignments RNAstrand
© Copyright 2026 Paperzz