Alternative Splicing from ESTs
Eduardo Eyras
Bioinformatics UPF – February 2004

• Intro
• ESTs
• Prediction of Alternative Splicing from ESTs

[Figure: gene with exons and introns (5’ to 3’) -> transcription -> pre-mRNA -> splicing -> mature mRNA (5’ cap, poly-A tail) -> translation -> peptide]

[Figure: the same gene with a different splicing of the pre-mRNA -> different mature mRNA -> different peptide]

Alt splicing as a mechanism of gene regulation
• Functional domains can be added or subtracted: protein diversity
• Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
• It can modify the activity of transcription factors, affecting the expression of genes
• It is observed in nearly all metazoans
• Estimated to occur in 30%-60% of human genes

Forms of alternative splicing
• Exon skipping / inclusion
• Alternative 3’ splice site
• Alternative 5’ splice site
• Mutually exclusive exons
• Intron retention
[Figure legend: constitutive exon; alternatively spliced exons]

How to study alternative splicing?

ESTs (Expressed Sequence Tags)
• Single-pass sequencing of a small (end) piece of cDNA
• Typically 200-500 nucleotides long
• It may contain coding and/or non-coding regions

ESTs
[Figure: cDNA synthesis. Cells from a specific organ, tissue or developmental stage -> mRNA extraction (5’ ... AAAAAA 3’) -> add oligo-dT primer -> reverse transcriptase (RNA/DNA hybrid) -> Ribonuclease H and DNA polymerase -> double-stranded cDNA]

ESTs
[Figure: clone cDNA into a vector -> multiple cDNA clones -> single-pass sequence reads from each end of a clone give the 5’ EST and the 3’ EST]

Sampling the Transcriptome with ESTs
[Figure: genomic sequence -> primary transcript -> splicing -> splice variants -> oligo-dT primer and reverse transcriptase -> cDNA clones (double stranded) -> EST sequences (single-pass sequence reads from the 5’ and 3’ ends)]

Large scale EST-sequencing coupled to Genome sequencing

EST sequencing
• Is fast and cheap
• Gives direct information about the gene sequence
• Partial information

Resulting ESTs (DB searches)
• Known gene
• Similar to known gene
• Contaminant
• Novel gene
dbEST release 20 February 2004
Number of public entries: 20,039,613

Summary by organism
Homo sapiens (human)                     5,472,005
Mus musculus + domesticus (mouse)        4,056,481
Rattus sp. (rat)                           583,841
Triticum aestivum (wheat)                  549,926
Ciona intestinalis                         492,511
Gallus gallus (chicken)                    460,385
Danio rerio (zebrafish)                    450,652
Zea mays (maize)                           391,417
Xenopus laevis (African clawed frog)       359,901
…

EST lengths ~ 450 bp
[Figure: Human EST length distribution (dbEST Sep. 2003)]

ESTs provide expression data
eVOC Ontologies
http://www.sanbi.ac.za/evoc/
• Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
• Cell Type: The precise cell type from which a sample was prepared. Examples are: B-lymphocyte, fibroblast and oocyte.
• Pathology: The pathological state of the sample from which the sample was prepared. Examples are: normal, lymphoma, and congenital.
• Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are: embryo, fetus, and adult.
• Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
J Kelso et al. Genome Research 2002

[Figure: the eVOC ontologies (Anatomical System, Cell Type, Pathology, Developmental Stage, Pooling) as term hierarchies, with Anatomical System terms such as nervous, brain and cerebellum linked to EST libraries (Library 1 ESTs, Library 2 ESTs)]

Linking the expression vocabulary to gene annotations
[Figure: ESTs, genes and gene expression vocabulary]
V Curwen et al. Genome Research (2004)

Normalized vs.
non-normalized libraries

The down side of the ESTs
• Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
• Random sampling: the more ESTs we sequence, the fewer new useful sequences we will get

Using ESTs to study Alternative Splicing

ESTs aligned to the genome
[Figure: an EST aligned to the genome; labels: true match (best in genome), paralog, processed pseudogene, GT, AG, stop (*), polyA]
• It defines the location of exons and introns
• We can verify the splice sites of introns and check the correct strand of spliced ESTs
• It helps prevent chimeras
• It can avoid putting together ESTs from paralogous genes
• We can prevent including pseudogenes in our analysis
• Must clip polyA tails before aligning

Alternative Exons / 3’ PolyA sites from ESTs
ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)

Aligning ESTs to the Genome
• Many ESTs: fast programs, fast computers
• Nearly exact matches: Coverage >= 97%, Percent_id >= 97%
• Splice sites: GT-AG, AT-AC, GC-AG

Genomics as a Technology
• Development of special software: fast versus accurate alignment
• Development of special technology: efficient use of computer farms (~2000 CPUs)

Recovering full transcripts from ESTs
Recover the mRNA from the ESTs

The Problem
[Figure: ESTs mapped to the genome]
What are the transcripts represented in this set of mapped ESTs?

Predict Transcripts from ESTs
[Figure: ESTs -> transcript predictions]
Merge ESTs according to splicing structure compatibility

Redundant ESTs
Consider 2 ESTs in a genomic cluster with more ESTs.
[Figure: x, z, x+z]
z gives redundant splicing information; we could keep only x.
[Figure: x, z, w, x+z, z+w]
However, the relation with other ESTs in the cluster is important: a third EST, w, is compatible with z but not with x. -> keep all relations

Extension of the exon structure
Consider 2 ESTs in a genomic cluster with more ESTs.
[Figure: x, y, x+y]
y extends x; we can assume that they are from the same mRNA.
[Figure: x, z, w]
Our success will depend on the coverage of the exons.
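The redundancy (inclusion) and extension relations between two ESTs described above can be sketched in Python. This is only an illustration under stated assumptions, not the ClusterMerge implementation: each aligned EST is assumed to be a sorted list of exons given as inclusive genomic `(start, end)` pairs, the names `introns` and `relation` are hypothetical, and introns must match exactly (the tolerated mismatches used when merging are not modelled).

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates, inclusive


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return introns as (end of upstream exon, start of downstream exon)."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def relation(x: List[Exon], y: List[Exon]) -> str:
    """Classify two aligned ESTs by their splicing structure.

    Assumption: exact coordinate agreement is required; no mismatches allowed.
    """
    lo = max(x[0][0], y[0][0])    # start of the shared genomic span
    hi = min(x[-1][1], y[-1][1])  # end of the shared genomic span
    if lo > hi:
        return "disjoint"

    def shared(exons: List[Exon]) -> List[Tuple[int, int]]:
        # Introns that reach into the region covered by both ESTs.
        return [(a, b) for a, b in introns(exons) if a < hi and b > lo]

    if shared(x) != shared(y):
        return "incompatible"

    x_span = (x[0][0], x[-1][1])
    y_span = (y[0][0], y[-1][1])
    x_contains_y = x_span[0] <= y_span[0] and y_span[1] <= x_span[1]
    y_contains_x = y_span[0] <= x_span[0] and x_span[1] <= y_span[1]
    if x_contains_y or y_contains_x:
        return "inclusion"   # one EST adds no new splicing information
    return "extension"       # the ESTs overlap and extend each other


# Made-up coordinates mirroring the x, z, w example:
x = [(100, 200), (300, 400)]
z = [(150, 200), (300, 350)]
w = [(320, 350), (450, 500)]
print(relation(x, z))  # inclusion
print(relation(z, w))  # extension
print(relation(x, w))  # incompatible
```

With these coordinates, w is compatible with z but not with x, which is why the relations of a redundant EST such as z are kept.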
However, ESTs are 3’ and 5’ biased (ESTs like z are not so frequent), hence we will have fragmentation.

Representation
For every 2 ESTs in a genomic cluster, we decide if they represent equivalent splicing structures.
The compatibility relation is a graph:
[Figure: extension (x, y) and inclusion (x, z) drawn as graph edges]
E Eyras et al. Genome Research (2004)

Criteria of “merging”
• Allow edge-exon mismatches
• Allow internal mismatches
• Allow intron mismatches: is this intron real?

Transitivity
[Figure: transitivity of extension (x, y, w) and of inclusion (x, z, w)]
This reduces the number of comparisons needed.

ClusterMerge graph
• Each node defines an inclusion sub-tree
• Extensions form acyclic graphs
[Figure: ESTs x, y, z, w and the corresponding graph]
E Eyras et al. Genome Research (2004)

Mergeable sets
[Figure: example with seven ESTs (1-7) arranged in the ClusterMerge graph from root to leaves]
Lists produced:
(1,2,3,5,6,7)
(1,2,3,4,5,7)

Deriving the transcripts from the lists
• Internal splice sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
• Splice sites: are set to the most common coordinate
• 5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most

Single exon transcripts
Reject resulting single exon transcripts when using ESTs

Alternative splicing and comparative genomics

Conservation of Alternative Splicing
• Degree of conservation: 30-60%
• Methods:
  1. Compare single events
  2. Cross-alignment of full transcripts

Exon Skipping Events
Introns flanking alternatively spliced (skipped) exons have high sequence conservation, higher on average than constitutive introns.
R Sorek & G Ast.
Genome Research 13:1631-1637, 2003

Sequences regulating the (Alternative) splicing
[Figure: conserved alternative exon with flanking introns; overrepresented hexamer (downstream)]
Overrepresented sequences in conserved introns (between human and mouse) may be involved in the regulation of alternative splicing.
Overrepresented: found in these introns more often than expected at random AND not found in intronic sequences flanking constitutive exons (and upstream of skipped ones).
R Sorek & G Ast. Genome Research (2003) 13:1631-1637

Sequences regulating the (Alternative) splicing
[Figure: conserved alternative exon with flanking introns; overrepresented hexamer]
Not all types of events are equally conserved. Introns flanking alternative 5’ and 3’ exons, and retained introns, have higher sequence conservation.
Sugnet CW, Kent WJ, Ares M Jr, Haussler D. Pac Symp Biocomput. 2004:66-77

Frame preservation
Frame preserving     Constitutive exons               Alternative exons
All exons            39.7% (Human), 39.5% (Mouse)     41.6% (Human), 44.7% (Mouse)
Conserved exon       40.9% (Human), 38% (Mouse)       51.8% (Human), 51.9% (Mouse)
A Resch et al. Nucleic Acids Research 2004, 32 (4) 1261-1269

Predicting alternative exons
Features Differentiating Between Alternatively Spliced and Constitutively Spliced Exons
                                                        Alternative exons   Constitutive exons
Average size                                            87                  128
Length = multiple of 3                                  73%                 37%
Average human-mouse exon conservation                   94%                 89%
(A) Exons with upstream intron conserved in mouse       92%                 45%
(B) Exons with downstream intron conserved in mouse     82%                 35%
(A) + (B)                                               77%                 17%
(A), (B): conservation is considered if there are at least 12 consecutive matches over 100 bp of the intron
R Sorek et al. Genome Research (2004) 14:1617-1623

Build a classifier to make predictions
• Rule: Set of conditions over the parameters: e.g.
“at least 99% conservation with mouse AND divisible by 3, etc…”
• Try all the possible combinations of parameters
• Select the rule that would correctly identify a maximum number of true alternative exons while minimizing the number of false positives

This rule achieved 31% sensitivity and no false positives in a set of known exons:
• At least 95% identity with mouse orthologous exon
• Exon size is a multiple of 3
• An upstream intronic alignment of at least 15 bp with at least 85% identity
• A downstream intronic exact alignment of at least 12 bp
R Sorek et al. Genome Research (2004) 14:1617-1623

Summary
• Alternative splicing is a mechanism to generate functional diversity
• We can study alternative splicing using ESTs (Expressed Sequence Tags)
• EST data is fragmented and full of noise: it needs to be processed
• Some alternative splicing is conserved across species (Human-Mouse)
• Prediction of alternative (conserved) exons is possible (a classifier) but not ab initio
• Evolution of alternative splicing?

THE END
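Worked example for the slide "Build a classifier to make predictions": the four-condition rule reported there can be written as a single boolean test. This is a minimal sketch, assuming the per-exon features have already been computed from a human-mouse alignment; the class `ExonFeatures`, its field names and the function `predicted_alternative` are hypothetical and do not come from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class ExonFeatures:
    """Hypothetical container for precomputed features of one human exon."""
    length: int                    # exon size in bp
    mouse_identity: float          # identity with the mouse orthologous exon (0..1)
    upstream_aln_len: int          # length of the upstream intronic alignment in bp
    upstream_aln_identity: float   # identity of that upstream alignment (0..1)
    downstream_exact_aln_len: int  # length of the downstream exact intronic match in bp


def predicted_alternative(e: ExonFeatures) -> bool:
    """Apply the four conditions of the rule; all of them must hold."""
    return (
        e.mouse_identity >= 0.95
        and e.length % 3 == 0
        and e.upstream_aln_len >= 15
        and e.upstream_aln_identity >= 0.85
        and e.downstream_exact_aln_len >= 12
    )


# Made-up feature values, for illustration only.
candidate = ExonFeatures(87, 0.97, 20, 0.90, 14)
rejected = ExonFeatures(128, 0.97, 20, 0.90, 14)  # 128 is not a multiple of 3
print(predicted_alternative(candidate))  # True
print(predicted_alternative(rejected))   # False
```

A rule of this kind trades sensitivity for precision: an exon is reported only when every condition holds, consistent with the 31% sensitivity and absence of false positives quoted on the slide.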