Identification of long non-coding RNAs (lncRNAs) using RNASeq in dogs
IGDR - UMR6290 - CNRS - Université de Rennes1
Canine Genetics Group - Catherine André
Thomas DERRIEN
December 10th
Non-coding Genome
• 80% of the variants associated with disease (by GWAS) are located outside of protein-coding genes (Manolio et al., Hindorff et al.)
• >60% of the human genome is covered by processed transcripts (~75% by primary transcripts), with only 2% corresponding to proteins (ENCODE Consortium; Djebali et al., Nature, 2012)
Back to the future: the cell as an RNA machinery
(from Amaral P, et al., 2008)
Type: functions
- miRNAs: regulation of gene expression
- siRNAs: RNA interference pathway
- snoRNAs: chemical modification of rRNA, tRNAs and small RNAs
- piRNAs: transposon defense, regulate euchromatin formation
- snRNA: splicing, regulation of TFs, telomere stability...
- ...: ...
- long ncRNAs: various
What is known about lncRNAs
• Definition: transcripts without coding potential, >200 nt, spliced, polyA+/- (Derrien et al., 2012)
• Annotation in human: e.g. GENCODE reference annotation (Harrow et al., 2012) (1000 Genomes Project)
[Figure: number of protein-coding genes and lncRNA genes over time; x-axis: dates from July 2009 to August 2012; y-axis: number of genes, 0 to 25,000]
• "Famous" lncRNAs: XIST, H19, HOTAIR... (Guttman et al., Duret et al., Navarro et al., Ponting et al.)
• Known functions: regulation of mRNA expression, X chromosome inactivation, imprinting...
LncRNAs Functions (broad overview)
• Can enhance or repress transcription of targeted mRNA(s)
• Can act in cis or in trans
• Serve as "flexible scaffolds"
• Examples:
- XIST: binds PRC2 (DNMT3A) => DNA hypermethylation => silencing of the X chromosome
- HOTTIP: binds MLL1 => H3K4me3 => activation of HOXA genes
(from Mattick JS, et al., 2010)
➡ RNASeq in dogs
➡ FEELnc : Annotation of lncRNAs
➡ Characterization of canine lncRNAs set
Dog and non-coding genome
• Unique evolutionary history
• High heterogeneity between breeds vs. high homogeneity within a breed
• One breed = one genetic isolate
• Facilitates the identification of genotype/phenotype relationships
➡ Annotate ncRNAs to exploit the strength of the dog model for identifying genotype/phenotype relationships
How to annotate lncRNAs: RNASeq
• RNASeq: high-throughput sequencing of all RNA molecules of a cell line or a tissue at a specific time point.
• RNASeq experiment for bioinformaticians (skipping all the different steps/protocols...):
Fragments of RNA => library construction => (cDNA) sequences:
@BRAIN_1_R.1 1 length=76
TATACATAAGCAGGTACCCACAAGGCAAGGTAGGACAGTTACTGTAGCTAATGAAAGAAAAAAGTCAGGGTGAGGA
+BRAIN_1_R.1 1 length=76
CCCFFFFFHHHHHJEHJJIJJJJJJIJJJJFGIIJJJJIJIJJJIJJJJJJJJJJJJJJJJJJJGHHHHH?BBFEC
@BRAIN_1_R.2 2 length=76
TGCATATATCACTTTTATTGGTAAATCCGCATTTCTTAGCTTAGAGACATATTCTGTAGATTATTCCCCTCCCCCT
+BRAIN_1_R.2 2 length=76
CCCFFFFFHHHHHJJJJJJJJIGIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJIJJJJJGHIJHIJJJJE
@BRAIN_1_R.3 3 length=76
TGCATATATCACTTTTATTGGTAAATCCGCATTTCTTAGCTTAGAGACATATTCTGTAGATTATTCCCCTCCCCCT
+BRAIN_1_R.3 3 length=76
CCCFFFFFHHHHHJJJGIGIICHHHGJJJIJJJJJJJJHIIIIIJEIIIICHGIJIJIIBGHEIJIGHGEEHIIIB
@BRAIN_1_R.4 4 length=76
GAAGTGTAATCACATTTAGTTTCAAAAGTTCAAATGCCTGTTCCTGTTATACATAAGCAGGTACCCACAAGGCAAG
+BRAIN_1_R.4 4 length=76
BCCFDFFEHHHHHJGJJIJIJJJJIJJJHIJJJIIJJJJJIJJJJIIJJJJJJJJJIIJIIDFGIIIJJJJJGGHC
@BRAIN_1_R.5 5 length=76
AAGGTTTGCCCTCTTTTCTCTGAAACTTCTAGGTATTTTTAAGTTCCAGCTGGTTCTCTGCTCTGCCATAAACGAG
+BRAIN_1_R.5 5 length=76
@CCFDFFFGHHFHGIIIGIEHGHGGGGHGIJIJ:EHIIJIJIIJGHIGHIJJIHIJIHGHIIJJJIJJIIG>HIIG
@BRAIN_1_R.6 6 length=76
GAAGTAACCGCCTTTCCTGGAGGAGTGGGTGGTCTCCGCTACAATCTCATCTGCCTCCTCTCCTGAAACAGGACTG
+BRAIN_1_R.6 6 length=76
BBCFDDEDHHHHHJJJJJJJJJJIJFGIJHIIIJJJJJJJJJJIJJJJJJJJHGHHHHFFFFFEEECEDDDDDDDB
@BRAIN_1_R.7 7 length=76
GGAAATATCAGAAGTAAAAGAGTAAATGGGAAGAGGCCAAGGATGTATTCGTCCAACGGATATTAAAATGTCCTTT
+BRAIN_1_R.7 7 length=76
CCCFFFFFHHHHHJHIJJJJIJCHHIJJJJJJJJJJIJIJJJJJJHIIJJJIJJJIJJJIJJJJJJJHHHHHHHFB
@BRAIN_1_R.8 8 length=76
TGGCGCCCTGCCTGGCTCCATTAAAACAATTACCACCCTTTTGGGATCATCTACACTTCTGCTATGTCCTCTCCCT
+BRAIN_1_R.8 8 length=76
CCCFFFFFHHHHHJJJJIJIJJJIJJGJJJJJJJJJJJJJJJJIJIJJJIJJJIJJJJJJJJHEHHHHGFFFFFFC
@BRAIN_1_R.9 9 length=76
TGGCGCCCTGCCTGGCTCCATTAAAACAATTACCACCCTTTTGGGATCATCTACACTTCTGCTATGTCCTCTCCCT
+BRAIN_1_R.9 9 length=76
CCCFFFFFHHHHHJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIHHHHHHHFFFFFFA
@BRAIN_1_R.10 10 length=76
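The reads above are in FASTQ format: each record spans four lines (header starting with "@", sequence, separator starting with "+", quality string). A minimal parsing sketch follows. It assumes Phred+33 quality encoding, which the slides do not state, and the names `FastqRecord`, `parse_fastq` and `phred33_scores` are illustrative only.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ entries.

    Records are read in fixed groups of four lines, because a quality
    string may itself start with '@' and cannot be used as a delimiter.
    """
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record: {header}")
        seq, plus, qual = seq.rstrip("\n"), plus.rstrip("\n"), qual.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header}")
        yield FastqRecord(header[1:], seq, qual)


def phred33_scores(quality: str) -> List[int]:
    """Convert a quality string to integer scores, assuming Phred+33."""
    return [ord(c) - 33 for c in quality]


toy = ["@read1 1 length=4\n", "ACGT\n", "+read1 1 length=4\n", "@CFI\n"]
records = list(parse_fastq(toy))
```

Reading in groups of four matters here: the quality line of read BRAIN_1_R.5 above begins with "@", so splitting on "@" would break that record.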
I. RNASeq samples available in dog
➡ CaniDNA BioBank: 34 samples
- 28 from dogs at GIGA (Liège)
- 6 from dogs at CNG (Evry)
- Unstranded
➡ 24 samples
- 18 from dogs
- 6 from wolves
- Stranded and not stranded
58 RNAseq, 33 dogs, 10 breeds, 17 tissues
- paired-end: 2x75 bp / 2x100 bp
- ~30-60 million reads per RNA-seq
- ~3 billion reads
- ~300 billion nucleotides

#   TISSUE           RNASeq samples (X)
1   BLOOD            1
2   CELL_DC          1
3   CELL_LINE        1
4   HEART            1
5   LIVER            1
6   MUSCLE           1
7   OVARY            1
8   TESTIS           1
9   KIDNEY           2
10  MUCOSA           2
11  SPLEEN           3
12  LYMPHATIC_NODE   4
13  LUNG             5
14  SKIN             5
15  BRAIN            7
16  BRAIN_CORTEX     11
17  MUCOSA_ORAL      11
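As a quick consistency check on the figures above, the per-tissue counts can be summed and the total read volume bracketed. This is only a sketch: it assumes that every sample falls within the quoted range of 30 to 60 million reads, and the names `TISSUE_COUNTS` and `read_range` are illustrative.

```python
# Per-tissue sample counts as listed in the table above.
TISSUE_COUNTS = {
    "BLOOD": 1, "CELL_DC": 1, "CELL_LINE": 1, "HEART": 1, "LIVER": 1,
    "MUSCLE": 1, "OVARY": 1, "TESTIS": 1, "KIDNEY": 2, "MUCOSA": 2,
    "SPLEEN": 3, "LYMPHATIC_NODE": 4, "LUNG": 5, "SKIN": 5,
    "BRAIN": 7, "BRAIN_CORTEX": 11, "MUCOSA_ORAL": 11,
}


def read_range(n_samples, low=30_000_000, high=60_000_000):
    """Lower and upper bounds on total reads, assuming each sample
    yields between `low` and `high` reads."""
    return n_samples * low, n_samples * high


total_samples = sum(TISSUE_COUNTS.values())
low_reads, high_reads = read_range(total_samples)

# ~3 billion reads at up to 100 bp each gives the order of magnitude
# quoted for nucleotides.
approx_nucleotides = 3_000_000_000 * 100
```

The counts sum to the 58 RNASeq samples, and the ~3 billion reads quoted on the slide fall inside the bracket computed here.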
Pipeline for dog RNASeq analysis (Christophe Hitte)
Dog reference annotation: Ensembl (v70)
➡ RNASeq_file (.fastq) [stats]
➡ Cleaning: fastqc + sickle... => cleaned sequences (.fastq) [stats]
➡ Mapping: tophat2 / bowtie2 => mapped files (.bam) [stats]
➡ Transcriptome reconstruction: Cufflinks2 => known and novel transcripts (.gtf) [stats]
- Expression levels (.fpkm)
- Genomic positions (bp)
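The steps above could be chained as in the sketch below, which only builds the command lines and does not run them. The tool names come from the slide, but the options shown (`-t sanger` for sickle, `-G` for TopHat2, `-g` for Cufflinks, thread count) and all file names are assumptions for illustration, not the parameters actually used; `build_pipeline_commands` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def build_pipeline_commands(sample: str, r1: str, r2: str,
                            bowtie2_index: str, annotation_gtf: str,
                            outdir: str, threads: int = 4) -> List[List[str]]:
    """Return one argument list per step: QC, cleaning, mapping,
    transcriptome reconstruction. Nothing is executed here."""
    out = Path(outdir)
    clean1 = out / f"{sample}_R1.clean.fastq"
    clean2 = out / f"{sample}_R2.clean.fastq"
    singles = out / f"{sample}_singles.clean.fastq"
    tophat_dir = out / "tophat"
    cuff_dir = out / "cufflinks"
    return [
        # Quality statistics on the raw reads
        ["fastqc", "-o", str(out), r1, r2],
        # Quality trimming of paired-end reads (quality type is assumed)
        ["sickle", "pe", "-t", "sanger", "-f", r1, "-r", r2,
         "-o", str(clean1), "-p", str(clean2), "-s", str(singles)],
        # Spliced mapping guided by the reference annotation
        ["tophat2", "-p", str(threads), "-G", annotation_gtf,
         "-o", str(tophat_dir), bowtie2_index, str(clean1), str(clean2)],
        # Reference-guided assembly of known and novel transcripts
        ["cufflinks", "-p", str(threads), "-g", annotation_gtf,
         "-o", str(cuff_dir), str(tophat_dir / "accepted_hits.bam")],
    ]


commands = build_pipeline_commands(
    "BRAIN_1", "BRAIN_1_R1.fastq", "BRAIN_1_R2.fastq",
    "canFam3", "ensembl_v70.gtf", "results")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists; the "stats" boxes of the slide would be computed between steps.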
Example of Brain (cortex) RNASeq
Current dog annotation vs. one RNASeq experiment
BRAIN RNASeq
- #Genes: 29,878
- #tcpts: 44,831
[Figure: genome browser view, canFam3 chr6:9,525,500-9,533,500 (scale 2 kb), at ZNF3-201. Tracks: BROAD2_BRAIN.transcripts_gt0_ENSv70.gtf (CUFF.25557.1 to CUFF.25557.4), LncRNAs_merged58_v70, RefSeq Genes, Ensembl Gene Predictions - archive Ensembl 70 - jan2013 (ENSCAFT00000023568), Repeating Elements by RepeatMasker, Gap Locations]
=> RNASeq allows the annotation of new isoforms with respect to the current reference annotation
Example of Brain (cortex) RNASeq: new transcript
[Figure: genome browser view, canFam3 chr9, around 60,950,000-61,040,000 (scale 50 kb), near GGTA1. Tracks: BROAD2_BRAIN.transcripts_gt0_ENSv70.gtf (CUFF.30318.1, CUFF.30324.1), RefSeq Genes, Ensembl Gene Predictions - archive Ensembl 70 - jan2013 (ENSCAFT00000043699), Repeating Elements by RepeatMasker, Gap Locations]
=> RNASeq allows the annotation of new (expressed) transcripts
=> Are these lncRNAs?
➡ RNASeq in dogs
➡ FEELnc : Annotation of lncRNAs
➡ Characterization of canine lncRNAs set
FEELnc: Fast and Effective Extraction of LncRNAs
RNASeq experiment(s)
➡ I- FEELnc_Filter
➡ II- FEELnc_CodingPot
➡ III- FEELnc_Classifier
➡ LncRNAs
FEELnc: Filters
Input: merged 58 dog RNASeq samples, known and novel transcripts
- #tcpts: 300,735
- #genes: 140,007
I- FEELnc_Filters
- discard transcripts overlapping annotated mRNA exons: 153,910 overlap mRNAs (in sense)
- keep transcripts of size > N bp [default N=200]: 3,940 are shorter than 200 bp
- discard monoexonic transcripts: 111,447 mono-exonic transcripts (from unstranded RNASeq)
- Options:
- transcripts overlapping mRNA loci (to get lincRNAs)
- ...
Candidate lncRNAs
- #tcpts: 31,157
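The three default filters can be sketched as follows. This is an illustration of the logic on the slide and not the FEELnc code: the `Transcript` class, the 1-based inclusive exon coordinates, and the rule that "in sense" means same chromosome and same strand are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # assumed 1-based, inclusive (start, end)


@dataclass
class Transcript:
    tid: str
    chrom: str
    strand: str
    exons: List[Exon]

    @property
    def length(self) -> int:
        """Spliced length: sum of exon lengths."""
        return sum(end - start + 1 for start, end in self.exons)


def exons_overlap(a: List[Exon], b: List[Exon]) -> bool:
    """True if any exon of `a` shares at least one base with an exon of `b`."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def filter_candidates(transcripts: List[Transcript], mrnas: List[Transcript],
                      min_length: int = 200,
                      keep_monoexonic: bool = False) -> List[Transcript]:
    """Keep transcripts that pass the size, exon-number and
    sense mRNA-exon-overlap filters."""
    kept = []
    for t in transcripts:
        if t.length <= min_length:
            continue  # size must be > N bp
        if not keep_monoexonic and len(t.exons) == 1:
            continue  # monoexonic transcripts are discarded
        if any(m.chrom == t.chrom and m.strand == t.strand
               and exons_overlap(t.exons, m.exons) for m in mrnas):
            continue  # overlaps an annotated mRNA exon in sense
        kept.append(t)
    return kept


mrna = Transcript("m1", "chr1", "+", [(100, 300), (500, 700)])
models = [
    Transcript("t1", "chr1", "+", [(1000, 1150), (1300, 1450)]),  # passes
    Transcript("t2", "chr1", "+", [(2000, 2050), (2100, 2150)]),  # too short
    Transcript("t3", "chr1", "+", [(3000, 3500)]),                # monoexonic
    Transcript("t4", "chr1", "+", [(250, 400), (450, 600)]),      # sense overlap
    Transcript("t5", "chr1", "-", [(250, 400), (450, 600)]),      # antisense
]
candidates = filter_candidates(models, [mrna])
```

With unstranded libraries the strand of a monoexonic model is unknown, which is one reason such models are discarded by default on the slide.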
FEELnc: Coding potential
Candidate lncRNAs
- #tcpts: 31,157
II- FEELnc_CodingPot
- Combination of 1 to 4 dedicated programs:
- CPC (Coding Potential Calculator): blast on protein database
- CPAT (Coding-Potential Assessment Tool): hexamer frequency + ORF length analysis + codon usage bias
- GeneId, TxCdsPredict: HMM trained on mRNAs...
- Get intersection/union and construct Venn diagram
[Figure: Venn diagram]
Dog stringent set of lncRNAs
- #tcpts: 18,051
- #genes: 9,810
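The intersection/union step can be illustrated with plain sets. In this sketch each program is assumed to have already produced a set of transcript IDs it calls non-coding; the IDs are made up and `combine_noncoding_calls` is a hypothetical helper, not part of FEELnc.

```python
from typing import Dict, Set


def combine_noncoding_calls(calls: Dict[str, Set[str]],
                            mode: str = "intersection") -> Set[str]:
    """Combine per-tool sets of transcript IDs predicted as non-coding.

    'intersection' keeps transcripts that every tool calls non-coding
    (stringent set); 'union' keeps those called by at least one tool.
    """
    sets = list(calls.values())
    if not sets:
        return set()
    if mode == "intersection":
        return set.intersection(*sets)
    if mode == "union":
        return set.union(*sets)
    raise ValueError(f"unknown mode: {mode}")


# Toy predictions (hypothetical IDs), one set per program
calls = {
    "CPC": {"TCONS_1", "TCONS_2", "TCONS_3"},
    "CPAT": {"TCONS_2", "TCONS_3", "TCONS_4"},
    "TxCdsPredict": {"TCONS_2", "TCONS_3", "TCONS_5"},
}
stringent = combine_noncoding_calls(calls, "intersection")
permissive = combine_noncoding_calls(calls, "union")
```

The sizes of the pairwise and global intersections are the numbers a Venn diagram would display.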
FEELnc: Classifier
• Classifying the genomic context of lncRNAs with respect to mRNAs could help predict functionality
Dog set of lncRNAs
- #tcpts: 18,051
- #genes: 9,810
III- FEELnc_Classifier
- Classify bona fide lncRNAs
- Intergenic (lincRNA), 3 classes = 14,726
- Divergent: 5,497
- Convergent: 2,777
- Same orientation: 6,452
- Genic (mRNA overlap), 5 classes = 3,325
- Exon (AS): 1,920
- Intron (S/AS): 57/1,018
- Contain, also labelled Encomp (S/AS): 129/201
[Figure: schematic overlapping scenarios between lncRNA exons (LncRNA ex.) and coding exons (Cod ex.): bidirectional promoter, exonic AS, intronic, contain]
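A simplified version of this classification, for one lncRNA and one mRNA on the same chromosome, might look like the sketch below. It is not the FEELnc_Classifier implementation: exon/intron sub-classes are not resolved, and the convention used for intergenic pairs (divergent = transcribed away from each other, convergent = transcribed toward each other) is an assumption consistent with the "bidirectional promoter" scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # '+' or '-'


def classify(lnc: Feature, mrna: Feature) -> str:
    """Classify a lncRNA relative to one mRNA on the same chromosome."""
    if lnc.chrom != mrna.chrom:
        raise ValueError("features must be on the same chromosome")
    same_strand = lnc.strand == mrna.strand
    overlaps = lnc.start <= mrna.end and mrna.start <= lnc.end
    if overlaps:
        return "genic_sense" if same_strand else "genic_antisense"
    if same_strand:
        return "intergenic_same_orientation"
    # Opposite strands, no overlap: look at the order along the chromosome
    left, right = (lnc, mrna) if lnc.end < mrna.start else (mrna, lnc)
    if left.strand == "-" and right.strand == "+":
        # The two 5' ends face each other: head-to-head, bidirectional promoter
        return "intergenic_divergent"
    return "intergenic_convergent"


mrna = Feature("chr1", 500, 900, "+")
examples = {
    "divergent": classify(Feature("chr1", 100, 200, "-"), mrna),
    "same": classify(Feature("chr1", 100, 200, "+"), mrna),
    "antisense": classify(Feature("chr1", 600, 700, "-"), mrna),
    "convergent": classify(Feature("chr1", 1000, 1200, "-"), mrna),
}
```

On the slide, the three intergenic classes add up to 14,726 and the genic classes to 3,325, i.e. 18,051 transcripts in total.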
Dog lncRNAs: a few examples
• Dog XIST: not annotated by the Ensembl reference annotation (partially by comparative genomics)
[Figure: genome browser view, canFam3 chrX:57,320,000-57,355,000 (scale 10 kb). Tracks: User Supplied Track (TCONS_00297821 to TCONS_00297831), RefSeq Genes, Ensembl Gene Predictions - archive Ensembl 70 - jan2013 (ENSCAFT00000045197, ENSCAFT00000048497), Non-Dog RefSeq Genes (Bos XIST, Homo XIST), Gap Locations]
• CDKN2B-AS (ANRIL): associated by GWAS with many diseases (coronary disease, aneurysm, type 2 diabetes)
[Figure: genome browser view, canFam3 chr11:41,270,000-41,410,000 (scale 50 kb). Tracks: LncRNAs_merged58_v70 (23 transcripts between TCONS_00028371 and TCONS_00028403), RefSeq Genes, Ensembl Gene Predictions - archive Ensembl 70 - jan2013. Labelled features: CDKN2B, ENSCAFT00000002632, ENSCAFT00000045034]
➡ RNASeq in dogs
➡ FEELnc : Annotation of lncRNAs
➡ Characterization of canine lncRNAs set
Dog lncRNAs Characterization (i)
GERP (Genomic Evolutionary Rate Profiling) identifies constrained elements in multiple alignments by quantifying substitution deficits
=> LncRNA exons appear less evolutionarily conserved than mRNA exons
=> LncRNA promoters are as conserved as mRNA promoters
Dog lncRNAs Characterization (ii)
Proportion of lncRNA/mRNA transcripts found in 58 RNASeq experiments at increasing RPKM thresholds
[Figure: x-axis: transcript type (LncRNAs, mRNAs); y-axis: proportion of elements (%), 0 to 100; one series per RPKM threshold: 0, 0.5, 1, 5, 10, 100]
=> Lower level of expression compared to mRNAs
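The quantity plotted can be computed as in this sketch. It assumes that a transcript counts as "found" at a threshold when its highest RPKM across the experiments is strictly above that threshold; the slide does not define this precisely, and the values below are toy numbers.

```python
from typing import Dict, Sequence


def proportion_expressed(max_rpkm: Sequence[float],
                         thresholds: Sequence[float] = (0, 0.5, 1, 5, 10, 100)
                         ) -> Dict[float, float]:
    """Percentage of transcripts whose maximum RPKM exceeds each threshold."""
    if not max_rpkm:
        raise ValueError("no transcripts given")
    n = len(max_rpkm)
    return {t: 100.0 * sum(v > t for v in max_rpkm) / n for t in thresholds}


# Toy maximum RPKM values for four transcripts (not real data)
toy_lncrnas = [0.0, 0.3, 2.0, 50.0]
toy_result = proportion_expressed(toy_lncrnas)
```

Running the same function on lncRNAs and on mRNAs gives the two groups of values compared in the figure.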
Dog lncRNAs Characterization (iii)
=> Dog lncRNA transcripts are more tissue-specific than mRNAs
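Tissue specificity can be quantified in several ways, and the slides do not say which measure was used. As one possibility, the sketch below computes the tau index, a commonly used score that is 0 for uniform expression and 1 for expression restricted to a single tissue. The function name and the toy values are illustrative.

```python
from typing import Sequence


def tau_index(expression: Sequence[float]) -> float:
    """Tau tissue-specificity index over per-tissue expression values.

    tau = sum(1 - x_i / x_max) / (n - 1), with n tissues.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two tissues")
    if min(expression) < 0:
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        raise ValueError("transcript is not expressed in any tissue")
    return sum(1 - x / x_max for x in expression) / (n - 1)


specific = tau_index([10.0, 0.0, 0.0, 0.0])    # single-tissue expression
uniform = tau_index([5.0, 5.0, 5.0, 5.0])      # same level everywhere
intermediate = tau_index([10.0, 5.0, 0.0])
```

Comparing the distribution of such a score between lncRNAs and mRNAs over the 17 tissues would be one way to support the statement above.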
Dog bidirectional lncRNAs and disease (J. Plassais)
• Sensory neuropathy: insensitivity to pain
• GWAS on 50 case/control dogs identifies a single locus
• Locus capture and NGS resequencing
• One mutation located in a lncRNA sharing a bidirectional promoter with a (very) interesting candidate gene
• Functional validation: qPCR + enhancer assay (Rory Johnson, CRG Barcelona)
Conclusions
• Annotation of a catalogue of ~18,000 lncRNAs in dogs
• Development of a bioinformatic method (FEELnc) to automate lncRNA identification using RNASeq
• As in humans, canine lncRNAs:
1. are modestly conserved through evolution
2. are less expressed than mRNAs
3. tend to exhibit tissue-specific expression
• Integration of the lncRNA catalogue with ongoing research projects in the team allows the study (and experimental validation) of dog lncRNAs and highlights the importance of lncRNAs in genotype/phenotype relationships (disease)
Perspectives: Finding functionally important lncRNAs
➡ by comparative genomics: AutoGRAPH (http://autograph.genouest.org/)
Using species-specific lncRNAs and gene order conservation:
- to increase the dog lncRNA repertoire
- to find evolutionarily conserved lncRNAs
➡ by structure assessment: find lncRNAs that are more likely to be folded
Collaboration with G. Rizk and D. Lavenier (Genscale, IRISA)
- GPU acceleration, because lncRNAs are much longer than miRNAs
Integrating lncRNAs catalogue with cancer projects in the lab
Affected dogs vs. healthy dogs
- Histiocytic sarcoma
- Melanoma
- Lymphoma
- Mendelian diseases
RNASeq | LncRNA catalogue
• Variants in lncRNA sequences
• eQTL affecting lncRNA expression
• LncRNAs/mRNAs differentially expressed
• Fusion genes (Mathieu Bahin)
• …
ACKNOWLEDGEMENTS
- IGDR. CNRS-UMR6290, Rennes
Christophe Hitte
Laetitia Lagoutte
Mathieu Bahin
Anne-Sophie Guillory
Benoit Hédan
Clotilde de Brito
Amaury Vaysse
Melanie Rault
Jocelyn Plassais
Ronan Ulvé
Edouard Cadieu
Morgane Bunel
Catherine ANDRÉ
- Unit of Animal Genomics, GIGA-R & Faculty of
Veterinary Medicine. University Liège
Benoit HENNUY
Wouter COPPIETERS
- BROAD Institute/Uppsala University
Jennifer MEADOW
Kerstin LINDBLAD-TOH
- Center for Genomic Regulation - Barcelona
Rory JOHNSON
Giovanni BUSSOTTI
Cédric NOTREDAME
Roderic GUIGÓ
LUPA
- Genomic Platform - Rennes
Birama N’DIAYE
Marie DE TAYRAC
Marc AUBRY
- Biogenouest - Symbiose - Genscale Team
Fabrice Legeai, Claire Lemaître, Pierre Peterlongo,
Guillaume Rizk, D. Lavenier, Olivier Collin et al
- Centre National de Génotypage - Paris
Diana ZELENIKA
Anne BOLAND
Parameters:
- reference genome (e.g. canFam3)
- reference annotation (e.g. Ensembl v.70)
Pipeline:
➡ RNASeq_file (.fastq) [stats]
➡ Cleaning: fastqc or sickle or ... => cleaned sequences (.fastq) [stats]
➡ Mapping: tophat-bowtie2 or gsnap => mapped files (.sam) [stats]
➡ Compression: Samtools => compressed mapped files (.bam) [stats]
➡ Transcriptome reconstruction: Cufflinks or Trinity => annotated and novel transcripts (.gtf)
➡ Coding potential: CPC-CPAT-TxCdsPredict => mRNAs / ncRNAs
➡ Filtering: filter (length 200 bp) => LncRNAs
➡ Characterization:
- length
- Nb exons
- comparison with mRNAs
➡ Classification:
- intergenic
- intragenic
- exon overlap
- intron overlap
- sense versus antisense
➡ LncRNAs / mRNAs:
- Correlation of expression: correlation of expression in N tissues/Dvpt
- Structure prediction: interaction prediction via structure
Searching for functional LncRNAs
➡ Determine secondary structure for lncRNAs (computationally challenging task)
➡ Interactions lncRNAs:mRNAs
- Correlation of expression lncRNAs:mRNAs
- Predict lncRNAs-mRNAs co-expression profiles
Genotype to phenotype relationship
Not only protein-coding genes...
- Environment
- Epigenetic
- Population (sub-)structure
- Non-coding transcriptome
- CNV
- ...
Why dogs?
• Unique population structure due to a unique history
• One breed = one genetic isolate
• High heterogeneity between breeds, whereas high homogeneity within breeds
• Cancers homologous to human cancers
• Breed-specific cancers (high frequency within a breed, ≈ 20%)
• Spontaneous cancers (not induced, unlike in mouse models)
• Same environment as humans
• Easier access to samples
➡ Most traits are governed by a few variants with large phenotypic effects