Protein graph kernel

Data Mining in Bioinformatics
Day 6: Classification in Bioinformatics
Karsten Borgwardt
February 25 to March 10
Bioinformatics Group
MPIs Tübingen
Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Karsten M. Borgwardt
Protein function prediction
via graph kernels
ISMB 2005
Joint work with Cheng Soon Ong and S.V.N. Vishwanathan,
Stefan Schönauer, Hans-Peter Kriegel and Alex Smola
Ludwig-Maximilians-Universität Munich, Germany
and National ICT Australia, Canberra
Content
Introduction
• The problem: protein function prediction
• The method: Support Vector Machines (SVM)
Our approach to function prediction
• Protein graph model
• Protein graph kernel
• Experimental evaluation
Technique to analyze our graph model
• Hyperkernels
Discussion
Karsten Borgwardt et al. - Protein function prediction via graph kernels
Current approaches to protein
function prediction
Proteins with similar function tend to share:
• similar structures
• similar sequences
• similar motifs
• similar phylogenetic profiles
• similar chemical properties
• similar interaction partners
• similar surface clefts
Support Vector Machines
Are new data points (x) red or black?
The blue decision boundary allows us to predict the class membership of new data points.
Kernel trick
[Figure: mapping Φ from input space to feature space; the kernel function computes inner products in feature space]
The kernel trick allows us to find a separating hyperplane in feature space without ever computing the mapping Φ explicitly.
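As a minimal illustration (not from the original slides): for the degree-2 polynomial kernel in 2D the feature map Φ is known explicitly, so we can check that evaluating the kernel in input space equals the inner product in feature space.

```python
import math

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel in 2D:
    # Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k_poly2(x, y):
    # kernel evaluated directly in input space: k(x, y) = (x . y)^2
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = k_poly2(x, y)                               # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in feature space
```

An SVM only ever needs `k_poly2`; the three-dimensional feature space is never materialized.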
Feature vectors for function
prediction
protein structure and/or protein sequence
e.g. Cai et al. (2004), Dobson and Doig (2003)
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• fraction of amino acid types
• fraction of surface area
• disulphide bonds
• size of largest surface pocket
Our approach
Sequence + Structure + Chemical properties → Graph model → SVMs with graph kernels → Protein function
Protein graph model
[Figure: a protein modelled as a graph; nodes are secondary structure elements, edges reflect sequence neighbourhood and spatial structure]
Protein graph model
Node attributes
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• length
• helix, sheet, loop
Edge attributes
• type (sequence, structure)
• length
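The node and edge schema above can be sketched as a plain data structure; the attribute values below are hypothetical, for a toy two-SSE protein.

```python
# Minimal sketch of the protein graph model: nodes are secondary structure
# elements (SSEs), edges carry a type (sequence or structure) and a length.
protein_graph = {
    "nodes": {
        0: {"sse": "helix", "length": 12, "hydrophobicity": 0.4,
            "polarity": 0.2, "polarizability": 0.1, "vdw_volume": 85.0},
        1: {"sse": "sheet", "length": 6, "hydrophobicity": 0.7,
            "polarity": 0.1, "polarizability": 0.2, "vdw_volume": 60.0},
    },
    "edges": [
        {"u": 0, "v": 1, "type": "sequence", "length": 3},   # adjacent in sequence
        {"u": 0, "v": 1, "type": "structure", "length": 8},  # close in 3D space
    ],
}

def neighbors(graph, node, edge_type=None):
    """Return neighbors of `node`, optionally restricted to one edge type."""
    return [e["v"] if e["u"] == node else e["u"]
            for e in graph["edges"]
            if node in (e["u"], e["v"])
            and (edge_type is None or e["type"] == edge_type)]
```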
Protein graph kernel
(Kashima et al. (2003) and Gärtner et al. (2003))
compares walks of identical length l:

k_walk((v_1, …, v_l), (w_1, …, w_l)) = Σ_{i=1}^{l−1} k_step((v_i, v_{i+1}), (w_i, w_{i+1}))

Walks are similar if, along both walks,
• the types of secondary structure elements (SSEs) are the same
• the distances between SSEs are similar
• the chemical properties of SSEs are similar
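A minimal sketch of this walk kernel, with a hypothetical choice of k_step: SSE types must match exactly and edge lengths are compared with a Gaussian (the real kernel also compares chemical attributes). Walks are written as alternating node/edge labels, matching the examples on the next slides.

```python
import math

def k_step(step_a, step_b, sigma=2.0):
    # step = (SSE type, edge length, SSE type); hypothetical step kernel:
    # require identical SSE types, compare edge lengths with a Gaussian
    ta1, len_a, ta2 = step_a
    tb1, len_b, tb2 = step_b
    if (ta1, ta2) != (tb1, tb2):
        return 0.0
    return math.exp(-(len_a - len_b) ** 2 / (2 * sigma ** 2))

def k_walk(walk_a, walk_b):
    # walks as alternating node/edge labels, e.g. ('H', 10, 'S', 1, 'S', 3, 'H')
    assert len(walk_a) == len(walk_b), "walks must have identical length"
    return sum(k_step(walk_a[i:i + 3], walk_b[i:i + 3])
               for i in range(0, len(walk_a) - 2, 2))
```

With this k_step, the "similar" pair from the example slide scores close to its self-similarity, while the "dissimilar" pair scores zero.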
Example: Protein kernel
[Figure: Protein A and Protein B, each a chain of helix (H) and sheet (S) nodes]
Similar walks: (H,10,S,1,S,3,H) and (H,9,S,1,S,3,H)
Example: Protein kernel
[Figure: Protein A and Protein B with no matching walk]
Dissimilar walks: (H,10,S,1,S) and (S,3,H,5,S)
Evaluation: enzymes vs. non-enzymes
10-fold cross-validation on 1128 proteins from the dataset of Dobson and Doig (2003); 59% are enzymes.
Kernel type                       Accuracy (%)   SD
Vector kernel                     76.86          1.23
Optimized vector kernel           80.17          1.24
Graph kernel                      77.30          1.20
Graph kernel without structure    72.33          5.32
Graph kernel with global info     84.04          3.33
DALI classifier                   75.07          4.58
Attribute selection
Which structural or chemical attribute is most important
for correct classification?
For this purpose, we employ hyperkernels (Ong et al., 2003).
Hyperkernels find an optimal linear combination of input kernel matrices

K = Σ_{i=1}^{m} β_i K_i

minimizing the training error while fulfilling regularization constraints.
Attribute selection
Our approach:
• Calculate the kernel matrix for 600 proteins on a graph model with only ONE single attribute
• Repeat this for all attributes
• Normalize these kernel matrices
• Determine the hyperkernel combination
• The weights then reflect the contribution of individual attributes to correct classification
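The normalization and combination steps above can be sketched as follows; the weights β are placeholders here, since in the actual pipeline they come out of the hyperkernel optimization (not shown).

```python
import numpy as np

def normalize_kernel(K):
    # cosine normalization: K_ij / sqrt(K_ii * K_jj) gives unit self-similarity
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(kernels, betas):
    # weighted sum sum_i beta_i K_i; the betas would be determined by the
    # hyperkernel optimization in the approach described above
    return sum(b * K for b, K in zip(betas, kernels))
```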
Attribute selection
Attribute              EC 1   EC 2   EC 3   EC 4   EC 5   EC 6
Amino acid length      1.00   0.31   1.00   1.00   0.73   0.00
3-bin van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
3-bin Hydrophobicity   0.00   0.00   0.00   0.00   0.00   0.00
3-bin Polarity         0.00   0.01   0.00   0.00   0.00   1.00
3-bin Polarizability   0.00   0.00   0.00   0.00   0.12   0.00
3D length              0.00   0.40   0.00   0.00   0.00   0.00
Total van der Waals    0.00   0.00   0.00   0.00   0.00   0.00
Total Hydrophobicity   0.00   0.13   0.00   0.00   0.01   0.00
Total Polarity         0.00   0.14   0.00   0.00   0.01   0.00
Total Polarizability   0.00   0.01   0.00   0.00   0.13   0.00
Discussion
• Novel combined approach to protein function prediction integrating sequence, structure and chemical information
• Reaches state-of-the-art classification accuracy with less information, and higher accuracy with the same amount of information
• Hyperkernels for finding the most informative protein characteristics
Discussion
• More detailed graph models (at the level of amino acids or atoms) might be more informative, yet they raise computational difficulties (the graphs become too large!)
Two directions of future research:
• Efficient, yet expressive graph kernels for structure
• Integrating more proteomic information, e.g. surface
pockets, into our graph model
The End
Thank you!
Questions?
ARTS: Accurate Recognition of Transcription
Starts in human
Sören Sonnenburg†, Alexander Zien∗,♮, Gunnar Rätsch♮
† Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
♮ Friedrich Miescher Laboratory of the Max Planck Society,
∗ Max Planck Institute for Biological Cybernetics,
Spemannstr. 37-39, 72076 Tübingen, Germany
[email protected],
{Alexander.Zien,Gunnar.Raetsch}@tuebingen.mpg.de
Promoter Detection
Overview:
• Transcription Start Site (TSS)
• Features to describe the TSS
• Our approach
• Evaluation with current methods
• Example - Protocadherin-α
• Summary
Sonnenburg, Zien, Rätsch
Transcription Start Site - Properties
• POL II binds to a rather vague region of ≈ [−20, +20] bp
• Upstream of TSS: promoter containing transcription factor binding sites
• Downstream of TSS: 5’ UTR, and further downstream coding regions
and introns (different statistics)
• 3D structure of the promoter must allow the transcription factors to bind
⇒ Promoter Prediction is non-trivial
Features to describe the TSS
• TFBS in Promoter region
• condition: DNA should not be too twisted
• CpG islands (often over TSS/first exon; in most, but not all promoters)
• TSS with TATA box (≈ −30 bp upstream)
• Exon content in the 5′ UTR region
• Distance to first donor splice site
Idea: Combine weak features to build strong promoter
predictor
The ARTS Approach
Use an SVM classifier:

f(x) = sign( Σ_{i=1}^{N_s} y_i α_i k(x, x_i) + b )

• key ingredient is the kernel k(x, x′) — the similarity of two sequences
• use 5 sub-kernels suited to model the aforementioned features:

k(x, x′) = k_TSS(x, x′) + k_CpG(x, x′) + k_coding(x, x′) + k_energy(x, x′) + k_twist(x, x′)
The 5 sub-kernels
1. TSS signal (including parts of core promoter with TATA box)
– use Weighted Degree Shift kernel
2. CpG Islands, distant enhancers and TFBS upstream of TSS
– use Spectrum kernel (large window upstream of TSS)
3. Model coding sequence TFBS downstream of TSS
– use another Spectrum kernel (small window downstream of TSS)
4. Stacking energy of DNA
– use the stacking energy of dinucleotides with a linear kernel
5. Twistedness of DNA
– use the twist angle of dinucleotides with a linear kernel
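Sub-kernels 2 and 3 are spectrum kernels. A minimal sketch (counting shared k-mers; the window extraction upstream/downstream of the TSS is omitted):

```python
from collections import Counter

def spectrum_kernel(x, xp, k=3):
    # k(x, x') = sum over all k-mers m of count_x(m) * count_x'(m)
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cxp = Counter(xp[i:i + k] for i in range(len(xp) - k + 1))
    return sum(cx[m] * cxp[m] for m in cx)
```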
Weighted Degree Shift Kernel
[Figure: matching substrings between sequences x1 and x2 and their contributing weights, e.g. k(x1, x2) = w_{6,3} + w_{6,−3} + w_{3,4}]
• Count matching substrings of length 1 . . . d
• Weight according to length of the match β1 . . . βd
• Position dependent but tolerates “shifts” of up to S
k(x, x′) = Σ_{k=1}^{d} β_k Σ_{l=1}^{L−k+1} Σ_{s=0, s+l≤L}^{S} δ_s · ( I(x[k : l+s] = x′[k : l]) + I(x[k : l] = x′[k : l+s]) )

x[k : l] := subsequence of x of length k starting at position l
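A direct (unoptimized) sketch of this kernel. The weights β_k and δ_s below are common choices from the WD-kernel literature, not taken from this slide, and indexing is 0-based here:

```python
def wd_shift_kernel(x, xp, d=3, S=2):
    # Weighted Degree Shift kernel: count matching substrings of length 1..d,
    # tolerating shifts of up to S positions in either direction.
    L = min(len(x), len(xp))
    # beta_k = 2*(d-k+1)/(d*(d+1)): longer matches weighted less per common choice
    beta = [2.0 * (d - k + 1) / (d * (d + 1)) for k in range(1, d + 1)]
    total = 0.0
    for k in range(1, d + 1):
        for l in range(L - k + 1):
            for s in range(S + 1):
                if l + s + k > L:          # shifted substring must fit in [0, L)
                    break
                delta = 1.0 / (2 * (s + 1))  # down-weight shifted matches
                w = beta[k - 1] * delta
                if x[l + s:l + s + k] == xp[l:l + k]:
                    total += w
                if x[l:l + k] == xp[l + s:l + s + k]:
                    total += w
    return total
```

For s = 0 both indicator terms coincide, so an unshifted match contributes 2·δ_0·β_k = β_k, consistent with the formula above.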
Training – Data Generation
True TSS:
• From dbTSSv4 (based on hg16) extract putative TSS windows of size
[−1000, +1000]
Decoy TSS:
• Annotate dbTSSv4 with transcription-stop (via BLAT alignment of
mRNAs)
• From the interior of the gene (+100bp to gene end) sample negatives for
training (10 per positive), again windows [−1000, +1000]
Processing:
• 8508 positive, 85042 negative examples
• Split into disjoint training and validation set (50% : 50%)
Training – Model Selection
16 kernel parameters + SVM regularization to be tuned!
• Full grid search infeasible
• Local axis-parallel searches instead
SVM training/evaluation on > 10,000 examples is computationally too demanding.
Speedup trick: precompute the weight vector w explicitly:

f(x) = Σ_{i=1}^{N_s} α_i k(x_i, x) + b = ( Σ_{i=1}^{N_s} α_i Φ(x_i) ) · Φ(x) + b = w · Φ(x) + b

Evaluating f(x) before: O(N_s·d·L·S); now: O(d·L) ⇒ speedup factor of up to N_s·S
⇒ Large Scale Training and Evaluation possible
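The trick above can be demonstrated with a toy linear feature map (hypothetical data; in ARTS, Φ is the implicit string-kernel feature map): the sum over support vectors collapses into a single precomputed weight vector w.

```python
import numpy as np

def f_slow(x, svs, alphas, b):
    # original form: sum over all N_s support vectors per evaluation
    return sum(a * float(np.dot(sv, x)) for a, sv in zip(alphas, svs)) + b

def f_fast(x, w, b):
    # collapsed form: a single dot product with the precomputed w
    return float(np.dot(w, x)) + b

svs = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # hypothetical support vectors
alphas = np.array([0.5, -1.0, 2.0])                   # alpha_i (y_i folded in)
b = 0.1
w = alphas @ svs                                      # w = sum_i alpha_i * Phi(x_i), computed once
```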
Comparison
Current state-of-the-art methods:
• FirstEF [Davuluri, Grosse, Zhang; 2001, Nat Genet]
QDF: for promoter, donor, first exon, WM
Range: [−1500, +500]
• McPromoter [Ohler, Liao, Niemann, Rubin; 2002, Genome Biol]
GHMM with IMC for 6 regions (e.g. upstream, TATA) NN
Range: [−250, +50]
• Eponine [Down, Hubbard; 2002 Genome Res]
RVM: WM with positional distribution for 4 regions (e.g. TATA, CpG)
Range: [−200, +200]
⇒ Do a genome wide evaluation!
⇒ How to do a fair comparison?
Evaluation
Idea: only consider “new” TSS from dbTSSv5 − dbTSSv4, with at most 30% overlap
1. Compute genome-wide outputs for each TSS finder
2. Decrease resolution: divide the genome into non-overlapping fixed-size chunks (e.g. 50 or 500 bp)
3. Annotate dbTSSv5 TSS with the gene end
4. Label a chunk positive if it intersects [TSS − 20bp, TSS + 20bp]
5. Label chunks within [TSS + 21bp, GeneEnd] negative
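The chunk-labelling steps above can be sketched as follows (hypothetical coordinates; chunks have inclusive ends, other chunks are ignored):

```python
def label_chunks(genome_len, chunk_size, tss, gene_end, margin=20):
    # step 2: non-overlapping fixed-size chunks; steps 4/5: label by position
    labels = []
    pos_lo, pos_hi = tss - margin, tss + margin
    for start in range(0, genome_len, chunk_size):
        end = start + chunk_size - 1              # inclusive chunk end
        if start <= pos_hi and end >= pos_lo:
            labels.append(+1)                     # intersects [TSS-20, TSS+20]
        elif start >= pos_hi + 1 and end <= gene_end:
            labels.append(-1)                     # inside [TSS+21, gene end]
        else:
            labels.append(0)                      # ignored in the evaluation
    return labels
```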
Results
Receiver Operating Characteristic curve and Precision-Recall curve
⇒ 35% true positives at a false positive rate of 1/1000
(the best other method finds only about half as many, 18%)
What does ARTS do better?
Entropy and Relative Entropy
[Figure: positional entropy, relative entropy and di-nucleotide frequency profiles across the sequence window; the entropy panels report auROC 86.5% and auPRC 49.8%]
⇒ strong discriminative signal around the TSS
Which kernel captures the most information?
Using or removing single kernels:
[Figure: bar chart of area under the ROC curve (in %, range ≈ 80-96) when single kernels are used alone or removed: TSS (WD shift), promoter (spectrum), 1st exon (spectrum), angles (linear)]
⇒ Most important: the Weighted Degree Shift kernel modelling the TSS signal
Alternative TSS - Protocadherin-α
Conclusion
• Developed a new TSS finder, “ARTS”
• In a genome-wide evaluation it achieves state-of-the-art results: ARTS finds about 35% true positives at a false positive rate of 1/1000 (the best other method about half as many, 18%)
• Reason: intensive modelling of the TSS region, plus large-scale SVM training/evaluation with string kernels
• Future work: Drosophila, C. elegans, zebrafish, …
Poster: H56
Datasets, Genomebrowser custom track, a lot more details:
http://www.fml.tuebingen.mpg.de/raetsch/projects/arts
Source code of SHOGUN toolbox used to train ARTS freely available:
http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun
The end
See you tomorrow! Next topic: Clustering in Bioinformatics