Data Mining in Bioinformatics
Day 6: Classification in Bioinformatics
Karsten Borgwardt, February 25 to March 10
Bioinformatics Group, MPIs Tübingen
Karsten Borgwardt: Data Mining in Bioinformatics

Protein function prediction via graph kernels (ISMB 2005)
Karsten M. Borgwardt
Joint work with Cheng Soon Ong, S.V.N. Vishwanathan, Stefan Schönauer, Hans-Peter Kriegel and Alex Smola
Ludwig-Maximilians-Universität Munich, Germany, and National ICT Australia, Canberra

Content
Introduction
• The problem: protein function prediction
• The method: Support Vector Machines (SVMs)
Our approach to function prediction
• Protein graph model
• Protein graph kernel
• Experimental evaluation
Technique to analyze our graph model
• Hyperkernels
Discussion
Karsten Borgwardt et al. - Protein function prediction via graph kernels

Current approaches to protein function prediction
Similar function is inferred from:
• similar sequences
• similar structures
• similar motifs
• similar phylogenetic profiles
• similar chemical properties
• similar interaction partners
• similar surface clefts

Support Vector Machines
Are new data points (x) red or black? The blue decision boundary allows us to predict the class membership of new data points.

Kernel trick
A mapping Φ takes points from input space to feature space; the kernel function computes inner products in feature space without constructing Φ explicitly. The kernel trick allows us to introduce a separating hyperplane in feature space.

Feature vectors for function prediction
Derived from protein structure and/or protein sequence, e.g. Cai et al.
(2004) and Dobson and Doig (2003):
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• fraction of amino acid types
• fraction of surface area
• disulphide bonds
• size of largest surface pocket

Our approach
Sequence + structure + chemical properties are integrated into a graph model; SVMs with graph kernels then predict protein function.

Protein graph model
(Figure: a protein is abstracted into a graph whose nodes are secondary structure elements, connected by sequence and structure edges.)

Protein graph model
Node attributes:
• hydrophobicity
• polarity
• polarizability
• van der Waals volume
• length
• type (helix, sheet, loop)
Edge attributes:
• type (sequence, structure)
• length

Protein graph kernel (Kashima et al. (2003); Gärtner et al. (2003))
Compares walks of identical length l:
k_walk((v_1, ..., v_l), (w_1, ..., w_l)) = Σ_{i=1}^{l−1} k_step((v_i, v_{i+1}), (w_i, w_{i+1}))
Walks are similar if, along both walks,
• the types of the secondary structure elements (SSEs) are the same
• the distances between the SSEs are similar
• the chemical properties of the SSEs are similar

Example: Protein kernel
Similar walks through proteins A and B: (H,10,S,1,S,3,H) vs. (H,9,S,1,S,3,H)
Dissimilar walks: (H,10,S,1,S) vs. (S,3,H,5,S)

Evaluation: enzymes vs. non-enzymes
10-fold cross-validation on 1128 proteins from the dataset of Dobson and Doig (2003); 59% are enzymes.

Kernel type                     Accuracy   SD
Vector kernel                   76.86      1.23
Optimized vector kernel         80.17      1.24
Graph kernel                    77.30      1.20
Graph kernel without structure  72.33      5.32
Graph kernel with global info   84.04      3.33
DALI classifier                 75.07      4.58

Attribute selection
Which structural or chemical attribute is most important for correct classification? For this purpose, we employ hyperkernels (Ong et al., 2003).
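As a minimal illustration of what the hyperkernel combination produces (this is not the actual hyperkernel optimization, which solves a regularized program for the weights), the following sketch cosine-normalizes per-attribute kernel matrices and forms the weighted sum Σ β_i K_i; the matrices and weights are made-up toy values.

```python
import math

def cosine_normalize(K):
    # normalize a kernel matrix so all self-similarities become 1:
    # K'[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j])
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def combine(kernel_matrices, betas):
    # weighted linear combination sum_i beta_i * K_i of kernel matrices
    n = len(kernel_matrices[0])
    return [[sum(b * K[i][j] for b, K in zip(betas, kernel_matrices))
             for j in range(n)] for i in range(n)]

# toy example: two 2x2 kernel matrices, one per attribute
K1 = [[4.0, 2.0], [2.0, 1.0]]
K2 = [[1.0, 0.5], [0.5, 1.0]]
K = combine([cosine_normalize(K1), cosine_normalize(K2)], betas=[0.7, 0.3])
```

After normalization, the learned weights β are directly comparable across attributes, which is what makes them usable as importance scores.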
Hyperkernels find an optimal linear combination of input kernel matrices,
Σ_{i=1}^{m} β_i K_i,
minimizing the training error while fulfilling regularization constraints.

Attribute selection
Our approach:
• Calculate the kernel matrix for 600 proteins on a graph model with only ONE single attribute
• Repeat this for all attributes
• Normalize these kernel matrices
• Determine the hyperkernel combination
• The weights then reflect the contribution of the individual attributes to correct classification

Attribute selection
Attribute             EC 1  EC 2  EC 3  EC 4  EC 5  EC 6
Amino acid length     1.00  0.31  1.00  1.00  0.73  0.00
3-bin van der Waals   0.00  0.00  0.00  0.00  0.00  0.00
3-bin Hydrophobicity  0.00  0.00  0.00  0.00  0.00  0.00
3-bin Polarity        0.00  0.01  0.00  0.00  0.00  1.00
3-bin Polarizability  0.00  0.00  0.00  0.00  0.12  0.00
3D length             0.00  0.40  0.00  0.00  0.00  0.00
Total van der Waals   0.00  0.00  0.00  0.00  0.00  0.00
Total Hydrophobicity  0.00  0.13  0.00  0.00  0.01  0.00
Total Polarity        0.00  0.14  0.00  0.00  0.01  0.00
Total Polarizability  0.00  0.01  0.00  0.00  0.13  0.00

Discussion
• Novel combined approach to protein function prediction integrating sequence, structure and chemical information
• Reaches state-of-the-art classification accuracy with less information; higher accuracy with the same amount of information
• Hyperkernels identify the most informative protein characteristics

Discussion
• More detailed graph models (amino acids, atoms) might be more interesting, yet raise computational difficulties (the graphs become too large)
Two directions of future research:
• Efficient, yet expressive graph kernels for structure
• Integrating more proteomic information, e.g. surface pockets, into the graph model

The End
Thank you! Questions?
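Looking back at the protein graph kernel from this part, the walk comparison can be sketched in a few lines. The 0/1 step kernel below is a simplified stand-in for the actual k_step (the distance tolerance of 2 is a made-up value); the walks encode the slides' worked example (H,10,S,1,S,3,H) vs. (H,9,S,1,S,3,H) as steps (type_i, type_{i+1}, distance).

```python
def k_step(step_a, step_b):
    # simplified stand-in for the step kernel: two walk steps match if the
    # SSE types at both ends agree and the distances differ by at most 2
    (t1, u1, d1), (t2, u2, d2) = step_a, step_b
    return 1.0 if t1 == t2 and u1 == u2 and abs(d1 - d2) <= 2 else 0.0

def k_walk(walk_a, walk_b):
    # k_walk sums the step kernel over the l-1 consecutive steps of two
    # walks of identical length l
    return sum(k_step(a, b) for a, b in zip(walk_a, walk_b))

# the "similar" example from the slides: all three steps match
walk_a = [("H", "S", 10), ("S", "S", 1), ("S", "H", 3)]
walk_b = [("H", "S", 9),  ("S", "S", 1), ("S", "H", 3)]

# the "dissimilar" example: no step matches
walk_c = [("H", "S", 10), ("S", "S", 1)]
walk_d = [("S", "H", 3),  ("H", "S", 5)]
```

Here k_walk(walk_a, walk_b) gives 3.0 while k_walk(walk_c, walk_d) gives 0.0, matching the slides' similar/dissimilar verdicts.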
ARTS: Accurate Recognition of Transcription Starts in human
Sören Sonnenburg†, Alexander Zien∗,♮, Gunnar Rätsch♮
† Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
♮ Friedrich Miescher Laboratory of the Max Planck Society, ∗ Max Planck Institute for Biological Cybernetics, Spemannstr. 37-39, 72076 Tübingen, Germany
[email protected], {Alexander.Zien,Gunnar.Raetsch}@tuebingen.mpg.de

Promoter Detection – Overview
• Transcription Start Site (TSS)
• Features to describe the TSS
• Our approach
• Evaluation against current methods
• Example: Protocadherin-α
• Summary
Sonnenburg, Zien, Rätsch

Transcription Start Site – Properties
• POL II binds to a rather vague region of ≈ [−20, +20] bp
• Upstream of the TSS: promoter containing transcription factor binding sites
• Downstream of the TSS: 5' UTR, and further downstream coding regions and introns (different statistics)
• The 3D structure of the promoter must allow the transcription factors to bind
⇒ Promoter prediction is non-trivial

Features to describe the TSS
• TFBS in the promoter region
• Condition: the DNA should not be too twisted
• CpG islands (often over the TSS/first exon; in most, but not all, promoters)
• TATA box (≈ −30 bp upstream of the TSS)
• Exon content in the 5' UTR region
• Distance to the first donor splice site
Idea: combine weak features to build a strong promoter predictor

The ARTS Approach
Use an SVM classifier:
f(x) = sign( Σ_{i=1}^{N_s} y_i α_i k(x, x_i) + b )
• The key ingredient is the kernel k(x, x′) — the similarity of two sequences
• Use 5 sub-kernels suited to model the aforementioned features:
k(x, x′) = k_TSS(x, x′) + k_CpG(x, x′) + k_coding(x, x′) + k_energy(x, x′) + k_twist(x, x′)

The 5 sub-kernels
1. TSS signal (including parts of the core promoter with the TATA box) – Weighted Degree Shift kernel
2. CpG islands, distant enhancers and TFBS upstream of the TSS – Spectrum kernel (large window upstream of the TSS)
3. Coding sequence and TFBS downstream of the TSS – another Spectrum kernel (small window downstream of the TSS)
4. Stacking energy of the DNA – twist energy of dinucleotides with a Linear kernel
5. Twistedness of the DNA – twist angle of dinucleotides with a Linear kernel

Weighted Degree Shift Kernel
(Figure: matching substrings between sequences x1 and x2 contribute k(x1, x2) = w_{6,3} + w_{6,−3} + w_{3,4}.)
• Count matching substrings of length 1 ... d
• Weight according to the length of the match, β_1 ... β_d
• Position dependent, but tolerates "shifts" of up to S
k(x, x′) = Σ_{k=1}^{d} β_k Σ_{l=1}^{L−k+1} Σ_{s=0, s+l≤L}^{S} δ_s ( I(x[k:l+s] = x′[k:l]) + I(x[k:l] = x′[k:l+s]) )
where x[k:l] := the subsequence of x of length k starting at position l

Training – Data Generation
True TSS:
• From dbTSSv4 (based on hg16), extract putative TSS windows of size [−1000, +1000]
Decoy TSS:
• Annotate dbTSSv4 with transcription stops (via BLAT alignment of mRNAs)
• From the interior of the gene (+100 bp to gene end), sample negatives for training (10 per positive), again as windows [−1000, +1000]
Processing:
• 8508 positive, 85042 negative examples
• Split into disjoint training and validation sets (50% : 50%)

Training – Model Selection
16 kernel parameters + SVM regularization to be tuned!
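With this many parameters, exhaustive tuning is costly. A minimal sketch of a coordinate-wise (axis-parallel) parameter search, assuming a black-box `objective` that returns a validation error to minimize; the parameter names and grids below are made-up toy values, not the actual ARTS search space.

```python
def axis_parallel_search(objective, grids, n_sweeps=2):
    # tune one parameter at a time over its grid while holding the others
    # fixed; each sweep costs the SUM of the grid sizes instead of their
    # PRODUCT, which a full grid search would require
    params = {name: grid[0] for name, grid in grids.items()}
    for _ in range(n_sweeps):
        for name, grid in grids.items():
            params[name] = min(
                grid, key=lambda v: objective({**params, name: v}))
    return params

# toy objective with a unique minimum at degree=10, shift=30
def objective(p):
    return (p["degree"] - 10) ** 2 + (p["shift"] - 30) ** 2

best = axis_parallel_search(
    objective, {"degree": range(1, 25), "shift": range(0, 100, 10)})
```

The trade-off is that axis-parallel search can miss optima when parameters interact strongly; repeating the sweep a few times mitigates this at modest cost.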
• Full grid search infeasible
• Local axis-parallel searches instead

SVM training/evaluation on > 10,000 examples is computationally too demanding.
Speedup trick:
f(x) = Σ_{i=1}^{N_s} α_i k(x_i, x) + b = ( Σ_{i=1}^{N_s} α_i Φ(x_i) ) · Φ(x) + b = w · Φ(x) + b
Before: O(N_s d L S) per evaluation; now: O(d L)
⇒ speedup factor of up to N_s · S
⇒ large-scale training and evaluation possible

Comparison
Current state-of-the-art methods:
• FirstEF [Davuluri, Grosse, Zhang; 2001, Nat Genet]: QDFs for promoter, donor and first exon, WMs; range [−1500, +500]
• McPromoter [Ohler, Liao, Niemann, Rubin; 2002, Genome Biol]: GHMM with IMCs for 6 regions (e.g. upstream, TATA), NN; range [−250, +50]
• Eponine [Down, Hubbard; 2002, Genome Res]: RVM, WMs with positional distributions for 4 regions (e.g. TATA, CpG); range [−200, +200]
⇒ Do a genome-wide evaluation!
⇒ How to do a fair comparison?

Evaluation
Idea: only consider "new" TSS from dbTSSv5 − dbTSSv4, with at most 30% overlap
1. Compute genome-wide outputs for each TSS finder
2. Decrease the resolution: divide the genome into non-overlapping fixed-size chunks (e.g. 50 or 500 bp)
3. Annotate each dbTSSv5 TSS with its gene end
4. Label a chunk positive if it intersects [TSS − 20 bp, TSS + 20 bp]
5. Label chunks in [TSS + 21 bp, GeneEnd] negative

Results
Receiver Operating Characteristic and Precision Recall curves:
⇒ 35% true positives at a false positive rate of 1/1000 (the best other method finds about half as many, 18%)

What does ARTS do better?
(Figure: entropy and relative entropy of di-nucleotide frequencies around the TSS; auROC 86.5%, auPRC 49.8%.)
⇒ strong discriminative signal around the TSS

Which kernel captures most information?
(Figure: area under the ROC curve when using or removing single kernels: TSS WD Shift, promoter Spectrum, 1st exon Spectrum, angles Linear.)
⇒ Most important: the Weighted Degree Shift kernel modelling the TSS signal

Alternative TSS – Protocadherin-α

Conclusion
• Developed a new TSS finder, "ARTS"
• Achieves state-of-the-art results in a genome-wide evaluation: about 35% true positives at a false positive rate of 1/1000 (best other method about half, 18%)
• Reason: intensive modelling of the TSS region; large-scale SVM training/evaluation with string kernels
• Future work: Drosophila, C. elegans, zebrafish, ...
Poster: H56
Datasets, genome-browser custom track, and many more details: http://www.fml.tuebingen.mpg.de/raetsch/projects/arts
Source code of the SHOGUN toolbox used to train ARTS is freely available: http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun

The end
See you tomorrow! Next topic: Clustering in Bioinformatics