Comparison between Human and Mouse genomes

Alternative Splicing from ESTs
Eduardo Eyras
Bioinformatics UPF – February 2004
Intro
ESTs
Prediction of
Alternative Splicing from ESTs
[Figure: a gene (exons and introns, both DNA strands shown 5’ to 3’) is transcribed into pre-mRNA, spliced into mature mRNA (5’ CAP, poly-A tail) and translated into a peptide]
[Figure: the same gene with a different splicing gives a different mature mRNA and a different peptide]
Alt splicing as a mechanism of gene regulation
Functional domains can be added/subtracted --> protein diversity
Can introduce early stop codons, resulting in truncated proteins or
unstable mRNAs
It can modify the activity of transcription factors, affecting the
expression of genes
It is observed in nearly all metazoans
Estimated to occur in 30%-60% of human genes
Forms of alternative splicing
Exon skipping / inclusion
Alternative 3’ splice site
Alternative 5’ splice site
Mutually exclusive exons
Intron retention
Constitutive exon
Alternatively spliced exons
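As an illustration of the simplest event type listed above, exon skipping can be detected by comparing the exon coordinates of two transcripts of the same gene. This is a minimal sketch under assumptions that are not in the slides: exons are (start, end) pairs on one strand, sorted by position, and the function name `skipped_exons` is hypothetical. A stricter definition would also require the flanking exons to be shared.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates, same strand


def skipped_exons(a: List[Exon], b: List[Exon]) -> List[Exon]:
    """Return exons of transcript a that lie entirely inside an intron of b."""
    # Introns of b, written as (end of upstream exon, start of downstream exon).
    b_introns = [(b[i][1], b[i + 1][0]) for i in range(len(b) - 1)]
    return [
        exon
        for exon in a
        if any(start < exon[0] and exon[1] < end for start, end in b_introns)
    ]


# Toy example: the middle exon of the first transcript is absent from the second.
long_form = [(100, 200), (300, 400), (500, 600)]
short_form = [(100, 200), (500, 600)]
print(skipped_exons(long_form, short_form))
```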
How to study alternative splicing?
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
It may contain coding and/or non-coding regions
ESTs
Cells from a specific organ, tissue or developmental stage
mRNA extraction (5’ mRNA with poly-A tail AAAAAA 3’)
Add oligo-dT primer (3’ TTTTTT 5’)
Reverse transcriptase: RNA-DNA hybrid
Ribonuclease H and DNA polymerase
Double stranded cDNA
ESTs
Double stranded cDNA
Clone cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads: 5’ EST and 3’ EST
Sampling the Transcriptome with ESTs
Genomic
Primary transcript
Splicing
Splice variants
oligo-dT primer, reverse transcriptase
cDNA clones (double stranded)
EST sequences (single-pass sequence reads from the 5’ and 3’ ends)
Large scale EST-sequencing coupled to Genome sequencing
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs
(DB searches)
Known gene
Similar to known gene
Contaminant
Novel gene
dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism:
Homo sapiens (human)                     5,472,005
Mus musculus + domesticus (mouse)        4,056,481
Rattus sp. (rat)                           583,841
Triticum aestivum (wheat)                  549,926
Ciona intestinalis                         492,511
Gallus gallus (chicken)                    460,385
Danio rerio (zebrafish)                    450,652
Zea mays (maize)                           391,417
Xenopus laevis (African clawed frog)       359,901
…
EST lengths
~ 450 bp
Human EST length distribution
(dbEST Sep. 2003 )
ESTs provide expression data
eVOC Ontologies (http://www.sanbi.ac.za/evoc/)
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are: digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the tissue from which the sample was prepared. Examples are: normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are: pooled, pooled donor and pooled tissue.
J Kelso et al. Genome Research 2002
ESTs provide expression data
eVOC Ontologies (http://www.sanbi.ac.za/evoc/): Anatomical System, Cell Type, Pathology, Developmental Stage, Pooling
[Figure: part of the Anatomical System ontology (… nervous, brain, cerebellum …), with terms linked to libraries (Library 1, Library 2) and their ESTs]
Linking the expression vocabulary to gene annotations
ESTs
Genes
V Curwen et al. Genome Research (2004)
Gene expression vocabulary
Normalized vs. non-normalized libraries
The down side of the ESTs
Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
Random sampling: the more ESTs we sequence, the
fewer new useful sequences we get
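The diminishing return of random sampling can be made concrete with a small calculation. The sketch below assumes a simple model that is not in the slides: each EST is drawn independently, with probability proportional to the abundance of its transcript. The abundances are made-up toy numbers and `expected_genes_seen` is a hypothetical helper name.

```python
from typing import Sequence


def expected_genes_seen(abundances: Sequence[float], n_ests: int) -> float:
    """Expected number of distinct genes hit at least once by n_ests random ESTs."""
    total = sum(abundances)
    # A gene with sampling probability p is missed by all n ESTs with
    # probability (1 - p) ** n, so it is seen with probability 1 - (1 - p) ** n.
    return sum(1.0 - (1.0 - a / total) ** n_ests for a in abundances)


# Toy transcriptome (assumption): 10 highly expressed and 1000 rare genes.
toy = [1000.0] * 10 + [1.0] * 1000
for n in (1000, 2000, 4000):
    print(n, round(expected_genes_seen(toy, n), 1))
```

Under this model the first 1000 ESTs reveal more new genes than the next 1000, and the rare genes remain mostly unsampled.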
Using ESTs to study Alternative Splicing
ESTs aligned to the genome
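[Figure: an EST matched against the genome: the true match (best in genome, spliced, GT-AG intron), a paralog, and a processed pseudogene (with a stop codon (*) and a PolyA tract)]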
It defines the location of exons and introns
We can verify the splice sites of introns --> check the correct strand of spliced ESTs
It helps prevent chimeras
It can avoid putting together ESTs from paralogous genes
We can avoid including pseudogenes in our analysis
Poly-A tails must be clipped before aligning
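Clipping the poly-A tail can be sketched with a regular expression. This is only an illustration, not the procedure used in the course: the minimum run length of 10 is an assumed threshold, and the function name `clip_poly_a` is hypothetical. Reads taken from the opposite strand start with a poly-T run instead, so both ends are handled.

```python
import re


def clip_poly_a(seq: str, min_run: int = 10) -> str:
    """Strip a trailing poly-A run or a leading poly-T run of at least min_run bases."""
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)  # poly-A tail at the 3' end
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)  # poly-T head (reverse-strand read)
    return seq


print(clip_poly_a("ACGTACGT" + "A" * 15))
```

Short A runs below the threshold are left untouched, since they may be genuine transcript sequence.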
Alternative Exons / 3´ PolyA sites from ESTs
ESTs can also provide information about potential
alternative splicing when aligned to the genome (and
when aligned to mRNA data)
Aligning ESTs to the Genome
Many ESTs --> fast programs, fast computers
Nearly exact matches: Coverage >= 97%, Percent_id >= 97%
Splice sites: GT-AG, AT-AC, GC-AG
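The thresholds above can be applied as a simple filter over alignment records. The sketch below is hedged: the `Alignment` fields and the exact definitions of coverage and percent identity are assumptions made for illustration, since the slides do not define them. Only the 97% cut-offs and the three splice site pairs come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Accepted (donor, acceptor) dinucleotide pairs, read on the transcript strand.
ACCEPTED_SPLICE_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}


@dataclass
class Alignment:
    est_length: int      # length of the (clipped) EST
    aligned_bases: int   # EST bases placed in aligned blocks
    matches: int         # aligned bases identical to the genome
    intron_ends: List[Tuple[str, str]]  # first and last two bases of each intron


def coverage(a: Alignment) -> float:
    # Assumed definition: percentage of the EST that is aligned.
    return 100.0 * a.aligned_bases / a.est_length


def percent_id(a: Alignment) -> float:
    # Assumed definition: percentage of aligned bases that match.
    return 100.0 * a.matches / a.aligned_bases


def passes(a: Alignment, min_cov: float = 97.0, min_id: float = 97.0) -> bool:
    """Keep nearly exact matches whose introns all have accepted splice sites."""
    return (
        coverage(a) >= min_cov
        and percent_id(a) >= min_id
        and all(pair in ACCEPTED_SPLICE_SITES for pair in a.intron_ends)
    )


print(passes(Alignment(400, 396, 390, [("GT", "AG")])))
```

An intron that reads CT-AC suggests the EST was aligned on the wrong strand, which is one way the splice sites help to check the strand of spliced ESTs.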
Genomics as a Technology
Development of special software:
fast versus accurate alignment
Development of special technology:
efficient use of computer farms (~2000 CPUs)
Recovering full transcripts from ESTs
Recover the mRNA from the ESTs
The Problem
ESTs
Genome
What are the transcripts represented in this
set of mapped ESTs?
Predict Transcripts from ESTs
ESTs
Transcript predictions
Merge ESTs according to splicing structure compatibility
Redundant ESTs
Consider 2 ESTs in a Genomic Cluster with more ESTs
[Figure: x, z and the merged structure x+z]
z gives redundant splicing information, we could keep only x
[Figure: x, z, w and the merged structures x+z and z+w]
However, the relation with other ESTs in the cluster is important: a
third EST, w, is compatible with z but not with x.
--> keep all relations
Extension of the exon structure
Consider 2 ESTs in a Genomic Cluster with more ESTs
[Figure: x, y and the merged structure x+y]
y extends x, we can assume that they are from the same mRNA
[Figure: x, z, w]
Our success will depend on the coverage of the exons.
However, ESTs are 3’ and 5’ biased
(ESTs like z are not so frequent), hence we will have fragmentation.
Representation
For every 2 ESTs in a Genomic Cluster, we decide if they
represent equivalent splicing structures
The compatibility relation is a graph with two kinds of edges:
Extension (e.g. y extends x)
Inclusion (e.g. z is included in x)
E Eyras et al. Genome Research (2004)
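A pairwise comparison of this kind can be sketched as follows. This is a simplified illustration, not the published ClusterMerge code: it assumes each EST is a sorted list of (start, end) exons on the same strand, requires exact agreement of the introns in the overlapping region, and allows no mismatches. The function name `relation` is hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    # Each intron as (end of upstream exon, start of downstream exon).
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def relation(x: List[Exon], y: List[Exon]) -> str:
    """Classify two aligned ESTs as 'inclusion', 'extension' or 'incompatible'."""
    x_start, x_end = x[0][0], x[-1][1]
    y_start, y_end = y[0][0], y[-1][1]
    lo, hi = max(x_start, y_start), min(x_end, y_end)
    if lo > hi:
        return "incompatible"  # the two ESTs do not overlap at all

    def shared(exons: List[Exon]) -> List[Tuple[int, int]]:
        # Introns that intersect the region covered by both ESTs.
        return [i for i in introns(exons) if i[0] < hi and i[1] > lo]

    if shared(x) != shared(y):
        return "incompatible"  # different splicing in the common region
    if (x_start <= y_start and y_end <= x_end) or (
        y_start <= x_start and x_end <= y_end
    ):
        return "inclusion"  # one EST adds no structure beyond the other
    return "extension"  # same structure where they overlap, each adds sequence


x = [(100, 200), (300, 400), (500, 600)]
z = [(150, 200), (300, 350)]
w = [(320, 450)]
print(relation(x, z), relation(z, w), relation(x, w))
```

In this toy example z is included in x, w extends z, and w is incompatible with x, which is why dropping z as redundant would lose a relation.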
Criteria of “merging”
Allow edge-exon mismatches
Allow internal mismatches
Allow intron mismatches
Is this intron real?
Transitivity
[Figure: ESTs x, y, z and w related by extension and inclusion]
This reduces the number of comparisons needed
ClusterMerge graph
Each node defines an inclusion sub-tree
[Figure: example inclusion sub-tree with ESTs x, y and z]
Extensions form acyclic graphs
[Figure: example extension graph with ESTs x, y, z and w]
E Eyras et al. Genome Research (2004)
Mergeable sets
Example: seven ESTs (1 to 7) in a genomic cluster
[Figure: graph over ESTs 1 to 7, with 1 labelled as the root and 7 as a leaf]
Lists produced:
(1,2,3,5,6,7)
(1,2,3,4,5,7)
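One plausible reading of how such lists arise is that each list follows a path from a root to a leaf of the acyclic extension graph, with the ESTs of the inclusion sub-trees added along the way. The sketch below covers only the path enumeration. The graph, node names and function name are hypothetical and do not reproduce the example in the slide.

```python
from typing import Dict, List


def root_to_leaf_paths(children: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every path from a root to a leaf of an acyclic graph."""
    with_parent = {c for kids in children.values() for c in kids}
    nodes = set(children) | with_parent
    roots = sorted(nodes - with_parent)  # nodes that nothing points to
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        kids = children.get(node, [])
        if not kids:  # leaf reached: the path is one candidate list
            paths.append(path)
        for kid in kids:
            walk(kid, path)

    for root in roots:
        walk(root, [])
    return paths


# Toy extension graph (assumption): a branches into b and c, both lead to d.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(root_to_leaf_paths(toy_graph))
```

Because the graph is acyclic the recursion always terminates, and a branching point yields one list per alternative path.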
Deriving the transcripts from the lists
Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
Deriving the transcripts from the lists
Splice sites: are set to the most common coordinate
5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most
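The two rules can be sketched with a majority vote and a min/max. This is a minimal illustration under assumptions: coordinates are forward-strand genomic positions, the caller has already left out the outer ends of terminal exons when voting on an internal splice site, and both function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def consensus_splice_site(coords: Sequence[int]) -> int:
    """Pick the coordinate supported by the largest number of ESTs."""
    return Counter(coords).most_common(1)[0][0]


def transcript_ends(starts: Sequence[int], ends: Sequence[int]) -> Tuple[int, int]:
    """Choose the outermost start and end, extending the potential UTRs the most."""
    return min(starts), max(ends)


# Five ESTs report slightly different positions for the same exon boundary.
print(consensus_splice_site([200, 200, 201, 200, 198]))
print(transcript_ends([100, 90, 120], [950, 1000, 980]))
```

The vote makes the internal boundaries robust to a few misaligned ESTs, while the transcript ends simply take the longest supported extent.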
Single exon transcripts
Reject resulting single exon transcripts when using ESTs
Alternative splicing and comparative genomics
Conservation of Alternative Splicing
Degree of conservation: 30-60%
Methods:
1.- Comparison of single events
2.- Cross-alignment of full transcripts
Exon Skipping Events
Introns flanking alternatively spliced (skipped) exons have high sequence conservation,
higher on average than constitutive introns.
R Sorek & G Ast. Genome Research 13:1631-1637, 2003
Sequences regulating the (Alternative) splicing
[Figure: conserved alternative exon with its flanking introns; overrepresented hexamer (downstream)]
Overrepresented sequences in conserved introns (between human and mouse) may be
involved in the regulation of alternative splicing.
Overrepresented: found in these introns more often than expected at random AND not found
in intronic sequences flanking constitutive exons (and upstream of skipped ones)
R Sorek & G Ast. Genome Research (2003) 13:1631-1637
Sequences regulating the (Alternative) splicing
[Figure: conserved alternative exon with its flanking introns; overrepresented hexamer]
Not all types of events are equally conserved.
Introns flanking alternative 5´ and 3´ exons, and retained introns, have higher sequence
conservation.
Sugnet CW, Kent WJ, Ares M Jr, Haussler D. Pac Symp Biocomput. 2004;:66-77
Frame preservation
Frame preserving     Constitutive exons               Alternative exons
All exons            39.7% (Human), 39.5% (Mouse)     41.6% (Human), 44.7% (Mouse)
Conserved exons      40.9% (Human), 38% (Mouse)       51.8% (Human), 51.9% (Mouse)
A Resch et al. Nucleic Acids Research 2004, 32 (4) 1261-1269
Predicting alternative exons
Features differentiating between alternatively spliced and constitutively spliced exons
Feature                                                 Alternative exons   Constitutive exons
Average size                                            87                  128
Length = multiple of 3                                  73%                 37%
Average human-mouse exon conservation                   94%                 89%
(A) Exons with upstream intron conserved in mouse       92%                 45%
(B) Exons with downstream intron conserved in mouse     82%                 35%
(A) + (B)                                               77%                 17%
(A), (B): an intron is considered conserved if there are at least 12 consecutive matches over 100 bp of the intron
R Sorek et al. Genome Research (2004) 14:1617-1623
Build a classifier to make predictions
• Rule: a set of conditions over the parameters,
e.g. “at least 99% conservation with mouse AND divisible by 3, etc.”
• Try all the possible combinations of parameters
• Select the rule that correctly identifies the maximum number of true
alternative exons while minimizing the number of false positives
This rule achieved 31% sensitivity and no false positives in a set of known exons:
• At least 95% identity with the mouse orthologous exon
• Exon size is a multiple of 3
• An upstream intronic alignment of at least 15 bp with at least 85% identity
• A downstream intronic exact alignment of at least 12 bp
R Sorek et al. Genome Research (2004) 14:1617-1623
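The four conditions of the rule can be written as a single predicate. This is a sketch for illustration only: the `ExonFeatures` record and its field names are assumptions, and computing the features (the human-mouse exon and intron alignments) is outside its scope. Only the thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ExonFeatures:
    length: int                    # exon size in bp
    mouse_identity: float          # % identity with the mouse orthologous exon
    upstream_aln_len: int          # length of the upstream intronic alignment (bp)
    upstream_aln_identity: float   # % identity of that alignment
    downstream_exact_aln_len: int  # length of the downstream exact alignment (bp)


def predicted_alternative(e: ExonFeatures) -> bool:
    """Apply the four conditions of the rule; all of them must hold."""
    return (
        e.mouse_identity >= 95.0
        and e.length % 3 == 0
        and e.upstream_aln_len >= 15
        and e.upstream_aln_identity >= 85.0
        and e.downstream_exact_aln_len >= 12
    )


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of the true alternative exons that the rule recovers."""
    return true_positives / (true_positives + false_negatives)


# Made-up exon: 87 bp is a multiple of 3 and all other thresholds are met.
print(predicted_alternative(ExonFeatures(87, 96.0, 20, 90.0, 14)))
```

A strict conjunction like this trades sensitivity for specificity, which is consistent with the reported 31% sensitivity at no false positives.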
Summary
Alternative splicing is a mechanism to generate functional diversity
We can study alternative splicing using ESTs (Expressed Sequence Tags)
EST data is fragmented and full of noise: it needs to be processed
Some alternative splicing is conserved across species (Human-Mouse)
Prediction of alternative (conserved) exons is possible (with a classifier) but not ab initio
Evolution of alternative splicing?
THE END