Computational RN(Az)omics of Drosophilids

Introduction and Motivation
Results
Computational RN(Az)omics of
Drosophilids
Dominic Rose
Bioinformatics Group, Institute for Computer Science, University of Leipzig
Bled, Feb 2007
Conclusion
Introduction and Motivation
Results
Outline
Introduction and Motivation
Results
Conclusion
Conclusion
Introduction and Motivation
Results
Conclusion
A short introduction
• Multiple regulatory layers of gene regulation are
emphasized for eukaryotes
• Many involve non-protein-coding RNAs (ncRNAs)
• Vastly different structures, functions and evolutionary
patterns
• Gene silencing, RNA processing/modification, . . .
Introduction and Motivation
Results
Conclusion
Why Fly?
”Noncoding HCEs also show strong statistical evidence of
an enrichment for RNA secondary structure.”
A. Siepel, G. Bejerano, J. S. Pedersen et al.:
Evolutionarily conserved elements in vertebrate, insect, worm,
and yeast genomes.
Genome Res., 15(8):1034-1050, Aug 2005
Introduction and Motivation
Results
Conclusion
RNAz surveys of the past
3300
Oikopleura dioica
Cephalochordates
Larvaceans
3600
Hemichordates
Caenorhabditis remanei
Caenorhabditis briggsae Echinoderms
Caenorhabditis elegans
Ciona savignyi
Ciona intestinalis
Ascidians
Urochordates
Nematodes
Chordates
Protostomes
Deuterostomes
Bilateria
30 000
Vertebrates
Introduction and Motivation
Results
Conclusion
The RNAz approach
CGCACGGCTCTAACCGTGTGGTCGTGGG−TCGACCCCACGG
CAATCGGCT−−TAACCGATTGGTCGCAGGTTCGAATCCTGCC
CAGAGGACTGCAAATCCTTTA−TCCCCAGTTCAAATCTGGGT
(((((((........)))))))..((((((........)))))).
−19.2
((((((..........))))))...(((((.........)))))..
−21.3
.((((((........))))))....((((((........)))))).
−11.6
(((((((........)))))))..((((((........)))))).
−18.6
RNAfold: single sequence MFEs
RNAalifold: consensus MFE
• SVM classification:
consensus MFE
• SCI =
(structure conservation index)
mean single sequence MFE
P
single sequence z-score
• z-score =
(thermodynamic stability)
N
• Significance: RNA classification probability p
Introduction and Motivation
Results
Screen design
• 12 drosophilid genomes (CAF1-”Freeze”)
• D. melanogaster serves as reference
• Use existing Pecan alignments to apply RNAz
pipeline
• Alignment pre-processing:
• Filter for valid aligned regions
• Filter for 3- to 6-way alignments
• Annotate predictions
Conclusion
Introduction and Motivation
Results
Outline
Introduction and Motivation
Results
Conclusion
Conclusion
Introduction and Motivation
Results
Conclusion
Overall statistics of the RNAz screen
alignments
aligned DNA [Mb]
screened by RNAz [Mb]
percentage
RNAz p > 0.5
[Kb]
RNAz p > 0.9
[Kb]
FDR p > 0.5 hits
sequence
FDR p > 0.9
sequence
4077
117
57.4
49
42 482
5 079
16 377
2 167
56.5
52.8
45.3
40.2
Introduction and Motivation
Results
Conclusion
Sensitivity on known ncRNAs
Class
tRNA
5S rRNA
SRP RNA
RNAse P
snRNA
snoRNA
miRNA
RNAz
171
0
0
1
18
96
75
input
250
0
0
1
22
202
78
annotated
297
99
2
1
22
250
85
sensitivity (%)
69
—
—
100
81
48
96
not in input
not in input
U6 not det.
Introduction and Motivation
Results
Conclusion
miRNAs
Bithorax 3R, 12681500-12682500 (ubx, abd-A, abd-B)
here: mir-iab–4-3p, mir-iab-4-5p
chr3R:
<---
12681600
12681700
12681800
12681900
12682000
12682100
12682200
RNAz p>0.5
RNAz p>0.5
Percentage GC in 20,000-Base Windows
GC Percent
iab-4
iab-4
FlyBase Protein-Coding Genes
RefSeq Genes
FlyBase Noncoding Genes
mir-iab-4-3p
mir-iab-4-5p
Human(hg17) proteins mapped by chained tBLASTn
D. melanogaster mRNAs from GenBank
D. melanogaster mRNAs
D. melanogaster ESTs That Have Been Spliced
Spliced ESTs
D. melanogaster ESTs Including Unspliced
D. melanogaster ESTs
12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores
Conservation
d_simulans
d_sechellia
d_yakuba
d_erecta
d_ananassae
d_pseudoobscura
d_persimilis
d_willistoni
d_virilis
d_mojavensis
d_grimshawi
a_gambiae
a_mellifera
t_castaneum
Repeating Elements by RepeatMasker
RepeatMasker
12682300
12682400
Introduction and Motivation
Results
Evolutionary patterns of conservation
Blast comparison with our prior RNAz surveys
(Mammals, Nematodes. Urochordates) reveals:
• 167 tRNAs
• 11 snRNAs
• 5 miRNAs
• New: U6atac
Conclusion
Introduction and Motivation
Results
Conclusion
False discovery rates
Numbers of positive scored RNAz windows of the control screen:
shuffling method
chromosomes
all
2L
2R
3L
3R
4
X
conservative
29 938 5 220 631 6 402 7 254 160 6 271
complete
662
123
99
132
155
1
152
(68 562 windows overall)
Pessimistic estimates: 50 % (p > 0.5), 40 % (p > 0.9)
(preserving gap pattern and sequence conservation)
→ intersection of true and control screen: 7.6 %
Optimistic estimates : ∼1-2 %
(without preservation, ”complete” shuffling)
Introduction and Motivation
Results
Conclusion
Genomic distribution
p>0.5
p>0.9
13712
9532
741
832
31377
intron
exon
32500
5’ 30000
-UTR
3’ 27500
-UTR
other
25000
5602
2845
359
467
12706
100.00%
2
2.9
22.4
17.4
90.00%
3.6
4.8
80.00%
39.9
70.00%
22500
60.00%
20000
p>0.5
17500
p>0.9
15000
5’ -UTR
22.5
24.5
3’ -UTR
Exon
8.4
50.00%
7.8
40.00%
17.5
Intronic
proximal
Intronic
distal
33.6
other
12500
30.00%
10000
7500
20.00%
5000
10.00%
2500
41.6
46.6
0.00%
0
intron
exon
5’ -UTR
3’ -UTR
other
Distance boundaries:
”Distal”: > 5 kb away from next gene
”Proximal”: <= 5 kb
dmel
p>0.5
dmel
p>0.9
Encode
Introduction and Motivation
Results
Conclusion
Genomic distribution
Comparison of the distribution of distances to nearest CDS of RNAz
hits and intergenic regions in D. mel. and human.
Length distribution
intergenic regions (human)
intergenic regions (dmel)
intergenic RNAz hits (human)
intergenic RNAz hits (dmel)
fitted intergenic region (human)
fitted intergenic RNAz hits (human)
fitted intergenic regions (dmel)
fitted intergenic RNAz hits (dmel)
1e+05
frequency
10000
1000
100
10
100
1000
10000
1e+05
distance
1e+06
1e+07
Introduction and Motivation
Results
Conclusion
Further Annotation
Isogai et al. (2006):
• Identification of TRF1/BRF binding sites using
high-resolution tiling arrays
• Evidence that TRF1/BRF complex is responsible for
initiation of known classes of Pol III transcription
• 3x enrichment of RNAz hits (p > 0.9) in these regions
• Mostly tRNAs, 7SL RNAs, and snoRNAs
• 197 unannotated RNAz hits → prime candidates for
novel Pol-III transcripts
Moreover:
• 365 plausible miRNA predictions (RNAmicro)
• 1700 RNAz hits have direct evidence for expression
through ESTs (not related to protein coding genes)
Introduction and Motivation
Results
Conclusion
cluster130
N=5 MPI=21.22 SCI=0.70
3R_locus5478
X_locus3103
2L_locus695
2L_locus3024
X_locus5296
X_locus5277
3R_locus1959
2R_locus1885
2R_locus5373
3R_locus5119
A
A
A
A CC
G A
_ A
A _
G _ A U
A
G
C
A
_
A
C
C
U
_
G
AU
_ GAC
U U
G U
U C
G A
A
_ GC
G
A
A A
X_locus380
U A_
_
C
G UG
UU
UG
A G_
CG
AG
UA
_
UA
GA
CG
GU GC
G
C
C
G
_
G
UU _
G
AU
_ A
UA
GC
UA
U
G
U
U
A G
X_locus381
G
C
_
3L_locus4686
CG
A A
GU
CG
A
C
_
_
3R_locus2676
_ _ AA
2R_locus1902
N=3 MPI=23.67 SCI=0.88
N=4 MPI=32.17 SCI=0.57
cluster125
cluster119
cluster115
U
_
A
UC _ _
CG
CA
CA
AA
CG
AG
AU
AA
AU
AA
AA
A
A
C __
A G
G AC
2R_locus260
2R_locus1081
2L_locus3002
2R_locus1080
3R_locus1080
C A __
CG
CA
_
_C G
AC
CG
A A
GA
CU
AA
2R_locus5124
UUA
A
A
A
AUA
AU
GU
AU
AA
_ G
A
AU
AA
AU
AU
G A
U
GU
A AA
UA U
CG _ A
UA
A
_
A
A
_
A
A
C
_
G
G
U
U
A
U
U
C
U
_
_
C
_U
A
UUAU
CG
AU
CG
AU
UG
AA
A _
_
_G U
AA _
A
U
_C
AA
GCU
A
A
U A
_C C
AU
UA
GC
A
G
A A_C
CA
A
_
A U__
GC
AU
AA
_C G
AG
AA
GA
AA
GGA
U
A
AAC
_
AA A _
A
C
C
_
C
_
CC
A A GA
3R_locus1077
N=4 MPI=28.91 SCI=0.74
cluster111
cluster108
N=2 MPI=27.08 SCI=1.11
N=3 MPI=22.11 SCI=0.47
C G
C
A A
GC
CG
AG
AC
CG
CA
CC
CG
A UA
A
G
A
A
_
G
G
C A_U_
CC
GC
CG
AC
AA
CG
C
_
G
3L_locus209
GC
CG
UA
C_ _ _
U C G UC
AU
G AA
_ _
GC
AA C
AC
A A
CU
A
A
GC
_C G
_
GA
A A
CG
G
G
_
GC
GU
U
G CA
CC
A A
A
G
_
A
G
_
C AC
GC
AC
_
_A U
AG
U
U
_
_
3L_locus4650
X_locus5351
X_locus2076
X_locus5352
3L_locus4646
_
_
_
_
2R_locus492
cluster106
N=4 MPI=25.47 SCI=0.77
cluster102
N=2 MPI=30.43 SCI=1.07
N=4 MPI=26.67 SCI=0.56
X_locus5297
3L_locus8416
3R_locus1079
3R_locus3165
UG_
_
A A
G
C
CG
_
_
AU
UA
GC
_
G
U
_
GC
GU G
A
A
A
_
CG _
A
GG
GU
AU
C U GG
A
_
A
_
G
C
U
G
U
G
U
_
A
_
__
_
C G _ AC
AG
GC
CG
AG
CG
AU
UG
U
G GU
AU
_
U G_
CG
AG
A
_ G
X_locus7015
U
U
GA
CG
AG
A UA_
A
G
A
A
G
G
_ A A AU
A A
AC
GA
CC
CG
GC
CA
UA
A A
GA
A A
ACA
A
G
C
A
C
C AC
_ _ _ G_
_
_
_
_
_
_
_
_
_
_
_
_
_
A
_
G G_ _ G
C
_G
U C
U A
A
_UA
U
A
G
U A
G A
A C
C U
A G
A U
A A
G CU
_
_
G
_
C
A G__
G U
U C
U A
G A
U G
_ AU A
C
C
A
A
A
C AA
U
C
_
A
U
U
CAC
AC
C CG
_
AA
AG
AU
UA
G AU
U _
AC
U A_
CC
CA A
UA
UG
G C_
AG
GC
GU
__
U G__
UA
_U A
UA
GU
U U
UG
UA
_ G GC
_
C
G
G
C
CG_
AU
U
U
A
C C_
UA
CC
UC
CG
G
G U
C
U
U
G
GA_
UA
UA
AG
UA
UA
UG
CG
U A_
CG
C
A
A A
U
A
A
U
A
cluster101
Structure-Based Clustering
Introduction and Motivation
Results
Conclusion
Structure-Based Clustering
GG
AUA
A GA A _ U _ AUUC
G
C
G A U U U_G
A
A
_A
CAGA
GUUCCAAC_A_
A
AU
CA
UA
A A
AC
AG
AG
A A
GA
AU
GA
CC
ACG _ CA
AA
A
ACAGA AUA
C
A
GAUA A A AC
G
G C A
A A
G G
G A
G C
G C G _ _
C
G
G
A
C
A
A
A
_
U
U
U
G
A C
AU
AGUUAU_ C AA A
G CA U
A
A
C
C
C
A C
X_locus8475
U C
X_locus5356
C _
G
U
_
_
_ A AGGAU
AUUCC _
A
A
A
U
A
C
A
AA
2R_locus1572
G
A
_
X_locus2074
_
G
X_locus1045
GA
A
G
A
_
A
UA_
UA
CG
UA
_ A
AU
U CA _
A
X_locus5295
G
A
C
X_locus1957
3L_locus4704
3R_locus2978
3L_locus2820
3R_locus5931
2L_locus689
2R_locus264
C
AG
AU
AU
GU
AC
A C UC
A _
UA
_
A
_
U
C
A
A
A
A
A
U
U
A
_
AU
U
GA UA
A
U_
U
C
A AG _
U G AU U A A U
U
C
U
A AU A C
C
A
G
U
UUG
_
A
U A A U U
A A UA U
U
U
_ A
U U
U
U
U U
A U
A
_ _
_ _
_ _
_ _
G C_
CA
CA
GA
_ A AU
U
U
_
C
U
GCU
UA
CA
GC
U _
AUUG C
_
ACG A
_
U
A A
A A C_
C _ U A A _ GU
U
ACA A U
_
_
G
A AG
U
UGU
UUA
C U
G
CA A A C
C
U
C
_
G
C
A
_
C
A A UU _ A
CA U
G
UG
_
U UA
A
U
_
U UC
N=3 MPI=24.86 SCI=1.05
cluster19
cluster20
N=2 MPI=30.70 SCI=0.90
N=4 MPI=24.15 SCI=0.46
cluster25
N=3 MPI=21.05 SCI=0.61
cluster28
MPI=25.73 SCI=0.42
cluster21
_ _
_
A
AU
CA
CA
GA
A UU
_
U
A
_
A
U G AU
UA
CA
GC _
G
C
U
G
U
A
_
U A _ _ A
C
A
A
C G__
GU
A GU
UU
C _UUA
GU _
GGG ACG U G G G A C U _
UU
_
G
A
C
AA
A U
_
A
U
U _
A
A
_
A
U
C
U U
U A U U A GA _
U
G U UA
A
U
_
UGC
MPI=27.15 SCI=0.39
cluster22
A
AU
CA
CA
AU
A UU
_
_
A
U
A
_
U
G AU
AU
CA
G C _C_
G
A_
AG
U
U
G
A
U
U
A
_
C
U
_
U A
A
C
C
A _ _ G
_
C _ A A AU
A
GAC U
U
_
G
G
U
A U U _ A_
U
G
U
A _A
U
G
AA
U
_U A U
CGAUU
A CU _ U
AU
G
CAUGUUC
A
A
U
_
UGC
Introduction and Motivation
Results
Conclusion
Phylogenetic Distribution
DroGri_CAF1
05: 6,566
15.3 %
DroVir_CAF1
09: 1,916
11.7 %
6
05: 3,444
14.3 %
09:
DroMoj_CAF1
0.5: 1,215
2.9 %
0.9:
DroWil_CAF1
370
2.3 %
742
10 %
5
0.5:
589
2.5 %
0.9:
126
1.7 %
DroPer_CAF1
0.5: 4,873
11.5 %
DroPse_CAF1
0.9: 1,415
8.7 %
4
0.5: 2,594
10.8 %
0.9:
0.5: 6,462
15.2 %
0.9: 2,188
13.4 %
3
546
7.4 %
0.083
DroAna_CAF1
0.5: 3,373
14.1 %
0.9:
884
11.9 %
0.119
0.5: 20,828
49.1 %
0.9:
2
0.9:
4,715
63.5 %
Innovations/branch length:
1.39
1.05
1.27
1.12
1.12
1.33
DroYak_CAF1
0.5: 2,467
5.8 %
0.9:
0.5: 12,459
51.9 %
0.437
p>0.5
p>0.9
DroEre_CAF1
9,574
58.5 %
DroSec_CAF1
905
5.3 %
1
0.5:
0.9:
1,545
6.5 %
414
5.6 %
0.067
0.87
0.79
DroSim_CAF1
DroMel_CAF1
Introduction and Motivation
Results
Outline
Introduction and Motivation
Results
Conclusion
Conclusion
Introduction and Motivation
Results
Conclusion
Conclusion
• Submitted :-)
• High-quality ncRNA prediction for drosophilid species
• Insights into drosophilid genome organisation
• Reliable FDR?
• Very strong signals for true novel ncRNAs (Pol-III
transcripts)
Introduction and Motivation
Results
Thank you!!!
;-)
Jörg
Stefan
Kristin
Jana
Sven
Peter
Sonja
Conclusion