Introduction and Motivation Results Computational RN(Az)omics of Drosophilids Dominic Rose Bioinformatics Group, Institute for Computer Science, University of Leipzig Bled, Feb 2007 Conclusion Introduction and Motivation Results Outline Introduction and Motivation Results Conclusion Conclusion Introduction and Motivation Results Conclusion A short introduction • Multiple regulatory layers of gene regulation are emphasized for eukaryotes • Many involve non-protein-coding RNAs (ncRNAs) • Vastly different structures, functions and evolutionary patterns • Gene silencing, RNA processing/modification, . . . Introduction and Motivation Results Conclusion Why Fly? ”Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.” A. Siepel, G. Bejerano, J. S. Pedersen et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15(8):1034-1050, Aug 2005 Introduction and Motivation Results Conclusion RNAz surveys of the past 3300 Oikopleura dioica Cephalochordates Larvaceans 3600 Hemichordates Caenorhabditis remanei Caenorhabditis briggsae Echinoderms Caenorhabditis elegans Ciona savignyi Ciona intestinalis Ascidians Urochordates Nematodes Chordates Protostomes Deuterostomes Bilateria 30 000 Vertebrates Introduction and Motivation Results Conclusion The RNAz approach CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT (((((((........)))))))..((((((........)))))). −19.2 ((((((..........))))))...(((((.........))))).. −21.3 .((((((........))))))....((((((........)))))). −11.6 (((((((........)))))))..((((((........)))))). −18.6 RNAfold: single sequence MFEs RNAalifold: consensus MFE • SVM classification: consensus MFE • SCI = (structure conservation index) mean single sequence MFE P single sequence z-score • z-score = (thermodynamic stability) N • Significance: RNA classification probability p Introduction and Motivation Results Screen design • 12 drosophilid genomes (CAF1-”Freeze”) • D. melanogaster serves as reference • Use existing Pecan alignments to apply RNAz pipeline • Alignment pre-processing: • Filter for valid aligned regions • Filter for 3- to 6-way alignments • Annotate predictions Conclusion Introduction and Motivation Results Outline Introduction and Motivation Results Conclusion Conclusion Introduction and Motivation Results Conclusion Overall statistics of the RNAz screen alignments aligned DNA [Mb] screened by RNAz [Mb] percentage RNAz p > 0.5 [Kb] RNAz p > 0.9 [Kb] FDR p > 0.5 hits sequence FDR p > 0.9 sequence 4077 117 57.4 49 42 482 5 079 16 377 2 167 56.5 52.8 45.3 40.2 Introduction and Motivation Results Conclusion Sensitivity on known ncRNAs Class tRNA 5S rRNA SRP RNA RNAse P snRNA snoRNA miRNA RNAz 171 0 0 1 18 96 75 input 250 0 0 1 22 202 78 annotated 297 99 2 1 22 250 85 sensitivity (%) 69 — — 100 81 48 96 not in input not in input U6 not det. Introduction and Motivation Results Conclusion miRNAs Bithorax 3R, 12681500-12682500 (ubx, abd-A, abd-B) here: mir-iab–4-3p, mir-iab-4-5p chr3R: <--- 12681600 12681700 12681800 12681900 12682000 12682100 12682200 RNAz p>0.5 RNAz p>0.5 Percentage GC in 20,000-Base Windows GC Percent iab-4 iab-4 FlyBase Protein-Coding Genes RefSeq Genes FlyBase Noncoding Genes mir-iab-4-3p mir-iab-4-5p Human(hg17) proteins mapped by chained tBLASTn D. melanogaster mRNAs from GenBank D. melanogaster mRNAs D. melanogaster ESTs That Have Been Spliced Spliced ESTs D. melanogaster ESTs Including Unspliced D. melanogaster ESTs 12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores Conservation d_simulans d_sechellia d_yakuba d_erecta d_ananassae d_pseudoobscura d_persimilis d_willistoni d_virilis d_mojavensis d_grimshawi a_gambiae a_mellifera t_castaneum Repeating Elements by RepeatMasker RepeatMasker 12682300 12682400 Introduction and Motivation Results Evolutionary patterns of conservation Blast comparison with our prior RNAz surveys (Mammals, Nematodes. Urochordates) reveals: • 167 tRNAs • 11 snRNAs • 5 miRNAs • New: U6atac Conclusion Introduction and Motivation Results Conclusion False discovery rates Numbers of positive scored RNAz windows of the control screen: shuffling method chromosomes all 2L 2R 3L 3R 4 X conservative 29 938 5 220 631 6 402 7 254 160 6 271 complete 662 123 99 132 155 1 152 (68 562 windows overall) Pessimistic estimates: 50 % (p > 0.5), 40 % (p > 0.9) (preserving gap pattern and sequence conservation) → intersection of true and control screen: 7.6 % Optimistic estimates : ∼1-2 % (without preservation, ”complete” shuffling) Introduction and Motivation Results Conclusion Genomic distribution p>0.5 p>0.9 13712 9532 741 832 31377 intron exon 32500 5’ 30000 -UTR 3’ 27500 -UTR other 25000 5602 2845 359 467 12706 100.00% 2 2.9 22.4 17.4 90.00% 3.6 4.8 80.00% 39.9 70.00% 22500 60.00% 20000 p>0.5 17500 p>0.9 15000 5’ -UTR 22.5 24.5 3’ -UTR Exon 8.4 50.00% 7.8 40.00% 17.5 Intronic proximal Intronic distal 33.6 other 12500 30.00% 10000 7500 20.00% 5000 10.00% 2500 41.6 46.6 0.00% 0 intron exon 5’ -UTR 3’ -UTR other Distance boundaries: ”Distal”: > 5 kb away from next gene ”Proximal”: <= 5 kb dmel p>0.5 dmel p>0.9 Encode Introduction and Motivation Results Conclusion Genomic distribution Comparison of the distribution of distances to nearest CDS of RNAz hits and intergenic regions in D. mel. and human. Length distribution intergenic regions (human) intergenic regions (dmel) intergenic RNAz hits (human) intergenic RNAz hits (dmel) fitted intergenic region (human) fitted intergenic RNAz hits (human) fitted intergenic regions (dmel) fitted intergenic RNAz hits (dmel) 1e+05 frequency 10000 1000 100 10 100 1000 10000 1e+05 distance 1e+06 1e+07 Introduction and Motivation Results Conclusion Further Annotation Isogai et al. (2006): • Identification of TRF1/BRF binding sites using high-resolution tiling arrays • Evidence that TRF1/BRF complex is responsible for initiation of known classes of Pol III transcription • 3x enrichment of RNAz hits (p > 0.9) in these regions • Mostly tRNAs, 7SL RNAs, and snoRNAs • 197 unannotated RNAz hits → prime candidates for novel Pol-III transcripts Moreover: • 365 plausible miRNA predictions (RNAmicro) • 1700 RNAz hits have direct evidence for expression through ESTs (not related to protein coding genes) Introduction and Motivation Results Conclusion cluster130 N=5 MPI=21.22 SCI=0.70 3R_locus5478 X_locus3103 2L_locus695 2L_locus3024 X_locus5296 X_locus5277 3R_locus1959 2R_locus1885 2R_locus5373 3R_locus5119 A A A A CC G A _ A A _ G _ A U A G C A _ A C C U _ G AU _ GAC U U G U U C G A A _ GC G A A A X_locus380 U A_ _ C G UG UU UG A G_ CG AG UA _ UA GA CG GU GC G C C G _ G UU _ G AU _ A UA GC UA U G U U A G X_locus381 G C _ 3L_locus4686 CG A A GU CG A C _ _ 3R_locus2676 _ _ AA 2R_locus1902 N=3 MPI=23.67 SCI=0.88 N=4 MPI=32.17 SCI=0.57 cluster125 cluster119 cluster115 U _ A UC _ _ CG CA CA AA CG AG AU AA AU AA AA A A C __ A G G AC 2R_locus260 2R_locus1081 2L_locus3002 2R_locus1080 3R_locus1080 C A __ CG CA _ _C G AC CG A A GA CU AA 2R_locus5124 UUA A A A AUA AU GU AU AA _ G A AU AA AU AU G A U GU A AA UA U CG _ A UA A _ A A _ A A C _ G G U U A U U C U _ _ C _U A UUAU CG AU CG AU UG AA A _ _ _G U AA _ A U _C AA GCU A A U A _C C AU UA GC A G A A_C CA A _ A U__ GC AU AA _C G AG AA GA AA GGA U A AAC _ AA A _ A C C _ C _ CC A A GA 3R_locus1077 N=4 MPI=28.91 SCI=0.74 cluster111 cluster108 N=2 MPI=27.08 SCI=1.11 N=3 MPI=22.11 SCI=0.47 C G C A A GC CG AG AC CG CA CC CG A UA A G A A _ G G C A_U_ CC GC CG AC AA CG C _ G 3L_locus209 GC CG UA C_ _ _ U C G UC AU G AA _ _ GC AA C AC A A CU A A GC _C G _ GA A A CG G G _ GC GU U G CA CC A A A G _ A G _ C AC GC AC _ _A U AG U U _ _ 3L_locus4650 X_locus5351 X_locus2076 X_locus5352 3L_locus4646 _ _ _ _ 2R_locus492 cluster106 N=4 MPI=25.47 SCI=0.77 cluster102 N=2 MPI=30.43 SCI=1.07 N=4 MPI=26.67 SCI=0.56 X_locus5297 3L_locus8416 3R_locus1079 3R_locus3165 UG_ _ A A G C CG _ _ AU UA GC _ G U _ GC GU G A A A _ CG _ A GG GU AU C U GG A _ A _ G C U G U G U _ A _ __ _ C G _ AC AG GC CG AG CG AU UG U G GU AU _ U G_ CG AG A _ G X_locus7015 U U GA CG AG A UA_ A G A A G G _ A A AU A A AC GA CC CG GC CA UA A A GA A A ACA A G C A C C AC _ _ _ G_ _ _ _ _ _ _ _ _ _ _ _ _ _ A _ G G_ _ G C _G U C U A A _UA U A G U A G A A C C U A G A U A A G CU _ _ G _ C A G__ G U U C U A G A U G _ AU A C C A A A C AA U C _ A U U CAC AC C CG _ AA AG AU UA G AU U _ AC U A_ CC CA A UA UG G C_ AG GC GU __ U G__ UA _U A UA GU U U UG UA _ G GC _ C G G C CG_ AU U U A C C_ UA CC UC CG G G U C U U G GA_ UA UA AG UA UA UG CG U A_ CG C A A A U A A U A cluster101 Structure-Based Clustering Introduction and Motivation Results Conclusion Structure-Based Clustering GG AUA A GA A _ U _ AUUC G C G A U U U_G A A _A CAGA GUUCCAAC_A_ A AU CA UA A A AC AG AG A A GA AU GA CC ACG _ CA AA A ACAGA AUA C A GAUA A A AC G G C A A A G G G A G C G C G _ _ C G G A C A A A _ U U U G A C AU AGUUAU_ C AA A G CA U A A C C C A C X_locus8475 U C X_locus5356 C _ G U _ _ _ A AGGAU AUUCC _ A A A U A C A AA 2R_locus1572 G A _ X_locus2074 _ G X_locus1045 GA A G A _ A UA_ UA CG UA _ A AU U CA _ A X_locus5295 G A C X_locus1957 3L_locus4704 3R_locus2978 3L_locus2820 3R_locus5931 2L_locus689 2R_locus264 C AG AU AU GU AC A C UC A _ UA _ A _ U C A A A A A U U A _ AU U GA UA A U_ U C A AG _ U G AU U A A U U C U A AU A C C A G U UUG _ A U A A U U A A UA U U U _ A U U U U U U A U A _ _ _ _ _ _ _ _ G C_ CA CA GA _ A AU U U _ C U GCU UA CA GC U _ AUUG C _ ACG A _ U A A A A C_ C _ U A A _ GU U ACA A U _ _ G A AG U UGU UUA C U G CA A A C C U C _ G C A _ C A A UU _ A CA U G UG _ U UA A U _ U UC N=3 MPI=24.86 SCI=1.05 cluster19 cluster20 N=2 MPI=30.70 SCI=0.90 N=4 MPI=24.15 SCI=0.46 cluster25 N=3 MPI=21.05 SCI=0.61 cluster28 MPI=25.73 SCI=0.42 cluster21 _ _ _ A AU CA CA GA A UU _ U A _ A U G AU UA CA GC _ G C U G U A _ U A _ _ A C A A C G__ GU A GU UU C _UUA GU _ GGG ACG U G G G A C U _ UU _ G A C AA A U _ A U U _ A A _ A U C U U U A U U A GA _ U G U UA A U _ UGC MPI=27.15 SCI=0.39 cluster22 A AU CA CA AU A UU _ _ A U A _ U G AU AU CA G C _C_ G A_ AG U U G A U U A _ C U _ U A A C C A _ _ G _ C _ A A AU A GAC U U _ G G U A U U _ A_ U G U A _A U G AA U _U A U CGAUU A CU _ U AU G CAUGUUC A A U _ UGC Introduction and Motivation Results Conclusion Phylogenetic Distribution DroGri_CAF1 05: 6,566 15.3 % DroVir_CAF1 09: 1,916 11.7 % 6 05: 3,444 14.3 % 09: DroMoj_CAF1 0.5: 1,215 2.9 % 0.9: DroWil_CAF1 370 2.3 % 742 10 % 5 0.5: 589 2.5 % 0.9: 126 1.7 % DroPer_CAF1 0.5: 4,873 11.5 % DroPse_CAF1 0.9: 1,415 8.7 % 4 0.5: 2,594 10.8 % 0.9: 0.5: 6,462 15.2 % 0.9: 2,188 13.4 % 3 546 7.4 % 0.083 DroAna_CAF1 0.5: 3,373 14.1 % 0.9: 884 11.9 % 0.119 0.5: 20,828 49.1 % 0.9: 2 0.9: 4,715 63.5 % Innovations/branch length: 1.39 1.05 1.27 1.12 1.12 1.33 DroYak_CAF1 0.5: 2,467 5.8 % 0.9: 0.5: 12,459 51.9 % 0.437 p>0.5 p>0.9 DroEre_CAF1 9,574 58.5 % DroSec_CAF1 905 5.3 % 1 0.5: 0.9: 1,545 6.5 % 414 5.6 % 0.067 0.87 0.79 DroSim_CAF1 DroMel_CAF1 Introduction and Motivation Results Outline Introduction and Motivation Results Conclusion Conclusion Introduction and Motivation Results Conclusion Conclusion • Submitted :-) • High-quality ncRNA prediction for drosophilid species • Insights into drosophilid genome organisation • Reliable FDR? • Very strong signals for true novel ncRNAs (Pol-III transcripts) Introduction and Motivation Results Thank you!!! ;-) Jörg Stefan Kristin Jana Sven Peter Sonja Conclusion
© Copyright 2026 Paperzz