Background & Motivation
Problem & Feature Construction
Experiments Design & Results
Conclusions and Future Work
Exploring Alternative Splicing Features using
Support Vector Machines
Jing Xia1, Doina Caragea1, Susan J. Brown2
1 Computing and Information Sciences Kansas State University, USA
2 Bioinformatics Center Kansas State University, USA
Jan 16 2008
Outline
1 Background & Motivation
2 Problem & Feature Construction
   Problem Definition
   Data Set
   Feature Construction
3 Experiments Design & Results
   Experimental Design
   Experimental Results
4 Conclusions and Future Work
   Conclusion
Alternative Splicing
Splicing: an important step during gene expression
Variable splicing process (alternative splicing): one gene -> many proteins
[Figure: Gene expression, from genes to proteins. DNA (TSS, ATG, 5’UTR, 3’UTR; exons separated by introns with GT...AG boundaries) is transcribed into pre-mRNA (cap, AUG; introns with GU...AG boundaries), which is spliced into mRNA and translated into protein.]
[Figure: One gene to many proteins. Gene -> pre-mRNA -> alternative splicing -> transcript isoforms -> proteins.]
Patterns of Alternative Splicing
Exon skipping (most frequent)
Alternative 5’ splice sites
Alternative 3’ splice sites
Intron retention
Mutually exclusive exons
[Figure: a gene with exon1 (CSE), exon2 (CSE), exon3 (ASE), and exon4, where CSE = constitutively spliced exon and ASE = alternatively spliced exon.]
Here, we focus on predicting alternatively spliced exons (ASE) and constitutively spliced exons (CSE) with an SVM.
Identifying Alternative Splicing in the Genome
Wet lab experiments: finding AS is time consuming
Traditionally: align ESTs to the genome (limited by the amount of EST data available for the genome)
[Figure: transcripts aligned to genomic DNA, showing an alternative 3’ exon and exon skipping.]
Use machine learning algorithms to predict AS at the genome level
Problem Definition
Problem definition: given an exon, can we predict whether it is an alternatively spliced exon (ASE) or a constitutively spliced exon (CSE)?
[Figure: a gene with four exons; exon1, exon2, and exon4 are labeled CSE (constitutively spliced exon) and exon3 is labeled ASE (alternatively spliced exon).]
Problem Addressed and Our Approach
Problem definition: predict alternatively spliced exons (ASE) vs. constitutively spliced exons (CSE)
Approach: use a Support Vector Machine (SVM)
Task: two-class (ASE and CSE) classification problem
Need: training data set containing labeled examples (ASE & CSE)
Learning: train the classifier on the training data
Application: predict unknown ASEs
Need features to represent ASEs & CSEs
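The task / learning / application steps above can be sketched with a very small linear SVM. This is only an illustration under stated assumptions: the slides do not say which SVM implementation was used, the two-dimensional feature vectors are made up, and the trainer is a plain stochastic subgradient method on the soft-margin objective, not the authors' setup.

```python
import random

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01, seed=0):
    """Minimise 0.5*||w||^2 + C * sum(hinge losses) by stochastic subgradient descent.

    X: list of feature vectors; y: labels in {+1, -1} (here +1 = ASE, -1 = CSE).
    All hyperparameters are illustrative defaults, not values from the slides.
    """
    rng = random.Random(seed)
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Gradient of the regulariser, spread over the n examples.
            grad_w = [wj / n for wj in w]
            grad_b = 0.0
            if margin < 1:  # margin violated: the hinge term is active
                grad_w = [g - C * y[i] * xj for g, xj in zip(grad_w, X[i])]
                grad_b = -C * y[i]
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (ASE) or -1 (CSE) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy features (not real exon data).
X_train = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]]
y_train = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X_train, y_train)
predictions = [predict(w, b, x) for x in X_train]
```

In practice an established SVM library with kernel support would replace the hand-written trainer; the sketch only shows how labeled ASE/CSE feature vectors go in and class predictions come out.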
Data Set
Published data set from the model organism C. elegans (worm)
Includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE)
Contains 487 ASEs and 2531 CSEs
100-base local sequences around splice sites
Example of data set:
ASE  GTACTATAGCGTGCTG....ACCGTTCGTACTCGCT
ASE  ATACTATAGCGTCTTG....ACCGATCGTACACGCT
CSE  GTACTATAGCGTCTTG....ACCGATCGTACTCGCT
[Figure: an exon flanked by AG and GT splice-site dinucleotides, with a window from −100 to +100 around each splice site.]
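A minimal sketch of how the two [−100, +100] windows around an exon's splice sites could be cut from a longer sequence. The function name, the 0-based half-open coordinate convention, and the toy sequence are assumptions made for illustration; the published data set already ships the extracted windows.

```python
def splice_site_windows(sequence, exon_start, exon_end, flank=100):
    """Return (acceptor_window, donor_window) for one exon.

    exon_start: 0-based index of the first exon base (3' splice site boundary).
    exon_end:   0-based index one past the last exon base (5' splice site boundary).
    Each window holds `flank` bases on either side of its boundary.
    """
    if exon_start - flank < 0 or exon_end + flank > len(sequence):
        raise ValueError("sequence too short for the requested flank")
    acceptor = sequence[exon_start - flank:exon_start + flank]
    donor = sequence[exon_end - flank:exon_end + flank]
    return acceptor, donor

# Toy sequence: 100 intronic bases, a 150-base exon, 100 intronic bases.
toy = "A" * 100 + "C" * 150 + "G" * 100
acceptor, donor = splice_site_windows(toy, exon_start=100, exon_end=250)
```

For exons shorter than the flank the two windows overlap, which this sketch allows.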
Previous work:
Motifs captured and identified by kernels (G. Rätsch et al.)
Length of exons and flanking introns (Sorek et al.)
Our work:
Exploit more biologically significant features
Use several additional approaches to derive features
Feature List
Several features known to be biologically important:
Strength of splice sites (SSS)
Motif features
  Intronic splicing regulator (ISR)
  Motifs derived from local sequences (MAST)
  Exonic splicing enhancer (ESE)
Reduced set of motif features based on locations of motifs on secondary structure (MAST-R)
Optimal folding energy (OPE)
Basic sequence features (BSF)
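SSS: Strength of Splice Site
We consider all splice sites.
$$\text{score} = \sum_i \log \frac{F(X_i)}{F(X)},$$
where X ∈ {A, U, G, C}, with i ∈ {−3, ..., +7} for 3’ splice sites (3’ss) and i ∈ {−26, ..., +2} for 5’ splice sites (5’ss).
[Figure: example exons with their flanking splice-site sequences (CGAG / AGGTAAGT, GGAG / AGGTAGGT, CGAG / AGGTTAGT), and the scored windows: CCAG at the 3’ ss (positions −3 to +7) and AGGTAAGT at the 5’ ss (positions −26 to +2).]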
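The score above is a log-odds sum over aligned positions, which can be sketched as follows. The slide does not define F, so this assumes F(X_i) is the frequency of nucleotide X at position i among known splice sites and F(X) is a background frequency; the pseudocount, the uniform background, and the three example sites (the slide's sequences with T written as U) are illustrative choices.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def position_frequencies(sites, pseudocount=1.0):
    """Assumed F(X_i): per-position nucleotide frequencies over aligned sites."""
    freqs = []
    total = len(sites) + pseudocount * len(ALPHABET)
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        freqs.append({x: (counts[x] + pseudocount) / total for x in ALPHABET})
    return freqs

def splice_site_score(site, freqs, background):
    """score = sum_i log( F(X_i) / F(X) ) for one aligned splice-site sequence."""
    return sum(math.log(freqs[i][x] / background[x]) for i, x in enumerate(site))

# Sites taken from the slide's examples; far too few for real use.
known_sites = ["AGGUAAGU", "AGGUAGGU", "AGGUUAGU"]
freqs = position_frequencies(known_sites)
uniform = {x: 0.25 for x in ALPHABET}  # assumed background F(X)

strong = splice_site_score("AGGUAAGU", freqs, uniform)
weak = splice_site_score("CCCCCCCC", freqs, uniform)
```

A site that matches the consensus at every position gets a positive score, while a site built from nucleotides never seen at those positions gets a negative one.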
Motif: a sequence pattern that occurs repeatedly in a group of sequences
Intronic splicing regulators (ISR): identified in Kabat et al.
MAST: motifs derived by MEME from the [-100,+100] sequences
Exonic splicing enhancers (ESE): based on two assumptions
[Figure: illustration of ISR motifs dispersed among the sequences around an exon.]
[Figure: example of a 20-base motif derived from sequences around splice sites.]
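The two assumptions behind exonic splicing enhancers (ESE):
  ESE motifs are more frequent in exons than in introns
  ESE motifs are more frequent in exons with weak splice sites than in exons with strong splice sites
[Figure: ISR, MAST, and ESE motifs dispersed among exons and introns.]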
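The first ESE assumption (more frequent in exons than in introns) can be illustrated by comparing k-mer frequencies, and a motif becomes a feature once its occurrences in a sequence are counted. This is a rough sketch only: the function names, k, the ratio threshold, and the toy sequences are hypothetical, the second assumption (weak vs. strong splice sites) is not implemented, and the slides derive MAST motifs with MEME rather than by k-mer counting.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count all overlapping k-mers in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def relative_frequency(counts):
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def enriched_kmers(exons, introns, k=6, min_ratio=2.0, pseudo=1e-6):
    """k-mers whose relative frequency in exons is at least min_ratio times that in introns."""
    exon_freq = relative_frequency(kmer_counts(exons, k))
    intron_freq = relative_frequency(kmer_counts(introns, k))
    return sorted(
        kmer for kmer, f in exon_freq.items()
        if f / (intron_freq.get(kmer, 0.0) + pseudo) >= min_ratio
    )

def motif_feature(sequence, motif):
    """Number of (possibly overlapping) occurrences of motif in sequence."""
    m = len(motif)
    return sum(1 for i in range(len(sequence) - m + 1) if sequence[i:i + m] == motif)

# Toy sequences, with k=3 to keep the example small.
candidates = enriched_kmers(["GAAGAAGAAG"], ["TTTTTTTTTT"], k=3)
count = motif_feature("GAAGAAGAAG", "GAA")
```

Each candidate motif then contributes one count (or presence/absence) entry to an exon's feature vector.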
Pre-mRNA secondary structures influence exon recognition
Secondary structure: derived from Mfold
Motifs are located in different structural elements (loop or stem); filter motifs using secondary structure
Optimal folding energy: stability of the RNA secondary structure
[Figure: secondary structure with loop and stem regions, and a motif highlighted in the sequence AUCCAUGGGCCGGAUGUGACGGUAGUAGGGUAUACGUCACAUAGGCUUCCUCUCAUGA.]
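One way to picture "filter motifs using secondary structure" is to keep only motif occurrences that fall mostly in unpaired (loop) positions. The slides do not state the actual filtering rule or the structure format, so everything here is an assumption: the structure is given in dot-bracket notation (converted from the folding output), loops are the `.` positions, and the 0.5 threshold is arbitrary.

```python
def unpaired_fraction(structure, start, length):
    """Fraction of positions in structure[start:start+length] that are unpaired ('.')."""
    region = structure[start:start + length]
    return region.count(".") / len(region)

def motifs_in_loops(sequence, structure, motifs, min_unpaired=0.5):
    """Return (motif, start) pairs whose occurrence lies mostly in loop positions."""
    kept = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            if unpaired_fraction(structure, start, len(motif)) >= min_unpaired:
                kept.append((motif, start))
            start = sequence.find(motif, start + 1)
    return kept

# Toy hairpin: a 4-base stem enclosing a 4-base loop.
toy_sequence = "GGGGAAAACCCC"
toy_structure = "((((....))))"
kept = motifs_in_loops(toy_sequence, toy_structure, ["AAAA", "GGGG"])
```

Flipping the comparison would instead keep stem-located motifs; which of the two matches the MAST-R feature set is not specified in the slides.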
Basic sequence features (BSF)
GC content (G & C ratio), $\frac{G + C}{A + U + G + C}$, a characteristic of the sequence
Sequence length
Length of exons and length of the exons’ flanking introns
Frames of stop codons
Summary of features
Motif features
Secondary structure
Strength of splice sites
Sequence features
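The basic sequence features listed above are simple to compute. The sketch below shows GC content and a per-frame stop codon count; reading "frames of stop codons" as "number of stop codons in each of the three forward reading frames" is an assumption, as are the function names.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def gc_content(seq):
    """(G + C) / (A + U + G + C); DNA input is accepted by reading T as U."""
    seq = seq.upper().replace("T", "U")
    total = sum(seq.count(x) for x in "AUGC")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0

def stop_codons_per_frame(seq):
    """Count stop codons in each of the three forward reading frames."""
    seq = seq.upper().replace("T", "U")
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(1 for codon in codons if codon in STOP_CODONS))
    return counts

gc = gc_content("GGCCAAUU")
frames = stop_codons_per_frame("AUGUAAUGA")
```

Exon and flanking intron lengths are plain `len()` calls on the corresponding subsequences and are omitted here.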
Experimental Design
Use the previously defined features as SVM input
Combine different features to represent ASEs & CSEs
Tune SVM parameters during training (kernel: linear, RBF, ...; cost C)
Choose the parameters with the best cross-validation (CV) accuracy
Test the trained SVM on the testing ASEs & CSEs
[Figure: 5-fold cross validation over split1 to split5, with an 80% / 20% partition of the data.]
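The tuning loop described above can be sketched as a k-fold split plus a search over candidate parameters. This is a generic outline, not the authors' pipeline: stratifying the folds by class is an assumption (motivated by the 487 vs. 2531 class imbalance), and `evaluate` is a caller-supplied hypothetical callback that would train an SVM on the training indices and score it on the validation indices.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def select_parameter(labels, candidates, evaluate, k=5, seed=0):
    """Return (best_param, best_mean_score) over k-fold cross validation.

    evaluate(param, train_idx, valid_idx) -> score is supplied by the caller.
    """
    folds = stratified_folds(labels, k, seed)
    best_param, best_score = None, float("-inf")
    for param in candidates:
        scores = []
        for f in range(k):
            valid = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(evaluate(param, train, valid))
        mean = sum(scores) / k
        if mean > best_score:
            best_param, best_score = param, mean
    return best_param, best_score

# Toy labels and a stand-in scoring function that peaks at C = 0.1.
labels = ["ASE"] * 10 + ["CSE"] * 50
folds = stratified_folds(labels)
best_c, _ = select_parameter(labels, [0.01, 0.05, 0.1, 1.0],
                             lambda c, train, valid: -abs(c - 0.1))
```

The selected parameter would then be used to retrain on all training data before scoring the held-out test portion.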
Experimental results
Results of alternatively spliced exon classification. All features, including ISR motifs, are used.

Split    C      CV fp 1%   CV AUC %   Test fp 1%   Test AUC %
Split1   0.05   32.45      86.55      56.48        90.05
Split2   0.05   39.33      88.32      52.04        89.04
Split3   0.1    37.56      87.76      38.71        87.97
Split4   0.01   40.86      89.02      37.63        84.42
Split5   0.1    36.48      87.50      35.79        85.69

(CV = cross-validation score; Test = test score.)
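The two metrics in the table can be computed from classifier decision scores as sketched below. The reading of "fp 1%" as the true positive rate at a false positive rate of at most 1% is an assumption, and the scores and labels are toy values, not results from the study.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) from decision scores; labels are 1 (ASE) or 0 (CSE)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so a tie yields a single step.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def tpr_at_fpr(points, max_fpr=0.01):
    """Highest true positive rate reached while FPR stays at or below max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
points = roc_points(toy_scores, toy_labels)
toy_auc = auc(points)
toy_tpr = tpr_at_fpr(points)
```

On the toy data, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9, and two of the three positives are found before the first false positive.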
Experimental results
[Figure: ROC curves, with False Positive Rate on the horizontal axis (0 to 1) and a vertical axis from 0 to 1, for Mixed-Feas (85.55%) and Base-Feas (78.78%).]
Comparison of ROC curves obtained using basic features only and basic features plus other mixed features (except conserved ISR motifs). Models trained using 5-fold CV with C = 1.
Experimental results
AUC score comparison between data sets with secondary structural features and data sets without secondary structural features
Motif Evaluation
Intersection between motifs derived from sequences & intronic
splicing regulators
Motif Evaluation
Conserved ESE in metazoans (animals), Human and Mouse
Motif Evaluation
Comparison with A. thaliana
Conclusions
Alternative splicing (AS) events can be found using transcripts
Machine learning can be used effectively to predict AS events
Identified features that are informative in predicting AS
Explored comparatively comprehensive feature sets from a biological point of view
Future Work
Apply this approach to a specific organism
Identify motifs more accurately
Refine relationships between features (secondary structure and motifs)
Learn other types of AS events (not only skipped exons)
adapted from "Detection of Alternative Splicing Events Using Machine Learning"
Thank you for your attention!
Questions?
Related work
RASE http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE
Acknowledgement
Data set from Dr. Rätsch’s FML group
http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE/altsplicedexonsplits.tar.gz
Dr. Caragea’s MLB group
http://people.cis.ksu.edu/~dcaragea/mlb
Dr. Brown’s Bioinformatics Center at KSU
http://bioinformatics.ksu.edu