ECML/PKDD 2010
Large Scale Support Vector Learning with Structural Kernels
Aliaksei Severyn and Alessandro Moschitti
University of Trento, Italy
September 21, 2010

Large scale training: existing solutions
Linear SVMs for binary classification - O(n) scaling:
• SVM-perf [Joachims, 2006]
• OCAS [Franc and Sonnenburg, 2008]
• Pegasos [Shalev-Shwartz et al., 2007]
Non-linear (e.g. kernels) - super-linear scaling:
• Sparse Kernel SVMs [Joachims, 2009]
  • Basis pursuit approach
• SVM-struct using sampled cuts [Yu and Joachims, 2008]
  • Uses sampling to approximate the sub-gradient

Adult Data Set
• Predict whether income exceeds $50K/yr
• Number of attributes: 14
• Number of instances: 48,842
• Attributes: age, workclass, education, marital-status, occupation, relationship, race, sex, etc.

Covertype Data Set
• Predicting forest cover type
• Number of attributes: 54
• Number of instances: 581,012
• Attributes: elevation, slope, aspect, soil type, horizontal & vertical distance to hydrology, hillshade, etc.

Experiments in the work of Yu and Joachims
• Number of features is very small: < 50
• Only the Gaussian kernel is used
• The task has limited complexity in terms of semantics
• There is no clear dependency between the features

Our Study
• Structural features which embed natural language syntactic/semantic information
• Huge number of features: syntactic/semantic tree fragments
• Features are very sparse
• The a-priori weights are very skewed
• There is high redundancy and inter-dependency between the features

Example: our dataset
Semantic Role Labeling:
• Exponential number of attributes
• 1M instances
• Attribute: [slide shows an example tree fragment used as a feature]

Main ideas promoted in the paper
• Test whether CPAs can be successfully applied to complex structural spaces
• Extremely fast model and kernel parameter selection via CPAs with small samples
• Application of structural kernels to large-scale NLP tasks

Contents
• Cutting plane algorithm
• Sampling cuts
• Experiments:
  • Question Classification & SRL
  • Accuracy & speed vs. sampling size
  • Fast parameterization
  • Kernel testing and selection
• Conclusions

Cutting plane (primal), 1-slack [Joachims, 2006]
[slide shows the 1-slack training formulation]

CPA in a nutshell
• Original SVM problem: an exponential number of constraints, most of which are dominated by a small set of "important" constraints
• CPA SVM approach: repeatedly finds the next most violated constraint, until the set of constraints is a good approximation

Expensive double-sum of kernel evaluations
• Each cut requires a double sum of kernel evaluations over training-set pairs
• Replace it by a smaller sample (the sketch below illustrates the resulting training loop)
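Since the formulation itself appears only as a figure above, recall the 1-slack program of [Joachims, 2006]: minimize (1/2)||w||^2 + C*xi over w and xi >= 0, subject to, for every cut c in {0,1}^n, (1/n) w · sum_i c_i y_i phi(x_i) >= (1/n) sum_i c_i - xi. The following is a minimal sketch of the sampled training loop under that formulation. It is illustrative only, not the authors' SVM-Light-TK-based code: the name train_usvm, the sample size m, and the use of SciPy's SLSQP solver for the small inner QP are all assumptions made for this example.

```python
# Minimal sketch (illustrative, not the authors' implementation) of
# 1-slack cutting-plane training with uniformly sampled cuts (uSVM),
# after [Joachims, 2006] and [Yu and Joachims, 2008].
import numpy as np
from scipy.optimize import minimize

def train_usvm(X, y, kernel, C=1.0, m=100, eps=1e-3, max_iter=100, seed=0):
    """X: structured examples (e.g. trees); y: labels in {-1, +1};
    kernel(a, b): kernel between two examples; m: sample size (assumes m <= n)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    cuts, b = [], []      # cut t: indices of margin violators; b[t] = |cut_t| / m
    H = np.zeros((0, 0))  # H[s,t] = <psi_s, psi_t>, psi_t = (1/m) sum_{i in cut_t} y_i phi(x_i)
    alpha = np.zeros(0)   # one dual weight per cut

    def psi_dot(cut, x):  # <psi_t, phi(x)>, via kernel evaluations only
        return sum(y[i] * kernel(X[i], x) for i in cut) / m

    def score(x):         # f(x) = <w, phi(x)> with w = sum_t alpha_t psi_t
        return sum(a * psi_dot(cut, x) for a, cut in zip(alpha, cuts))

    for _ in range(max_iter):
        # 1. Approximate the most violated constraint on a uniform sample,
        #    instead of scanning all n training examples.
        S = rng.choice(n, size=m, replace=False)
        cut = [i for i in S if y[i] * score(X[i]) < 1.0]
        b_new = len(cut) / m
        h_new = np.array([sum(y[i] * psi_dot(old, X[i]) for i in cut) / m
                          for old in cuts])
        # 2. Stop when the new cut is violated by at most the current slack + eps.
        xi = max([0.0] + list(np.asarray(b) - H @ alpha))
        if b_new - h_new @ alpha <= xi + eps:
            break
        # 3. Grow the cut set and the small Gram matrix between cuts.
        hh = sum(y[i] * psi_dot(cut, X[i]) for i in cut) / m
        cuts.append(cut); b.append(b_new)
        H = np.block([[H, h_new[:, None]], [h_new[None, :], [[hh]]]])
        # 4. Re-solve the small dual QP: max_a b.a - (1/2) a.H.a, a >= 0, sum(a) <= C.
        bb = np.asarray(b)
        res = minimize(lambda a: 0.5 * a @ H @ a - bb @ a, x0=np.zeros(len(cuts)),
                       jac=lambda a: H @ a - bb, bounds=[(0.0, None)] * len(cuts),
                       constraints=[{'type': 'ineq', 'fun': lambda a: C - a.sum()}],
                       method='SLSQP')
        alpha = res.x
    return alpha, cuts, score
```

The sampling shows up in steps 1 and 3: without it, the entries of H would require the double sum of kernel evaluations over up to n^2 training pairs flagged on the slide above, whereas here every cut holds at most m indices, so each entry costs at most m^2 kernel calls.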
Experimental setup
• Datasets:
  • Semantic Role Labeling
  • Question Classification

Semantic Role Labeling (SRL) dataset
• Task: identification of argument boundaries
• Example of SRL annotation for "Paul gives a talk in Rome":
  [Arg0 Paul] [target gives] [Arg1 a talk] [ArgM in Rome]
• For boundary detection, all argument labels collapse into a single BD label:
  [BD Paul] [target gives] [BD a talk] [BD in Rome]
• Consists of PropBank, Penn Treebank and Charniak parse trees as provided by CoNLL 2005
• Two training sets: 100,000 and 1 million instances
• Two test sets: Sections 23 and 24 (234,416 and 149,140 instances)

Features
Syntactic Tree Kernel [Collins and Duffy, 2002]
[slide shows example tree fragments generated by the kernel; a sketch of the kernel follows]
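The tree kernel is only cited above, so as a point of reference here is a compact sketch of the subset-tree (SST) kernel of [Collins and Duffy, 2002]: K(T1, T2) = sum over node pairs (n1, n2) of C(n1, n2), where C counts the fragments rooted at both nodes and lambda is a decay factor. The Tree class and the default lambda = 0.4 are assumptions for the example, not the SVM-Light-TK data structures.

```python
# Compact sketch of the SST tree kernel [Collins and Duffy, 2002];
# the Tree class is an illustrative assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tree:
    label: str
    children: tuple = ()          # tuple of Tree nodes; empty for leaves (words)

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def production(n):                # grammar rule expanded at node n
    return (n.label, tuple(c.label for c in n.children))

def sst_kernel(t1, t2, lam=0.4):
    """K(T1, T2) = sum_{n1 in T1} sum_{n2 in T2} C(n1, n2)."""
    memo = {}
    def C(n1, n2):                # common fragments rooted at n1 and n2
        key = (id(n1), id(n2))
        if key not in memo:
            if not n1.children or production(n1) != production(n2):
                memo[key] = 0.0
            elif all(not c.children for c in n1.children):   # pre-terminal
                memo[key] = lam
            else:
                v = lam
                for c1, c2 in zip(n1.children, n2.children):
                    v *= 1.0 + C(c1, c2)
                memo[key] = v
        return memo[key]
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))

# e.g. fragments shared by "(NP (D the) (N dog))" and "(NP (D the) (N cat))":
t1 = Tree('NP', (Tree('D', (Tree('the'),)), Tree('N', (Tree('dog'),))))
t2 = Tree('NP', (Tree('D', (Tree('the'),)), Tree('N', (Tree('cat'),))))
print(sst_kernel(t1, t2))         # lam*(1 + lam) + lam = 0.96 for lam = 0.4
```

Such a kernel can be passed directly as the kernel argument of the train_usvm sketch above; the ST, SST-bow and PT variants listed later differ in which fragments C is allowed to count.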
Models
• Exact SVM based on SVM-Light-TK [Joachims, 1999] [Moschitti, 2008]:
  • SVM
• CPA with sampling [Yu and Joachims, 2008]:
  • uSVM (uniform sampling)
  • iSVM (importance sampling)*
* not in this presentation

Accuracy vs. sampling size (SRL, 1 million)
[Plot: F1 (72 to 84) against sampling size (0 to 6,000) for uSVM on Sections 23 and 24; training time grows from 1 to 1,121 minutes with the sampling size, against 10,705 minutes for the exact SVM baselines.]

Sampling size vs. number of SVs (SRL, 1 million)
[Plot: F1 against sampling size for uSVM on Section 23, annotated with the number of support vectors, from 2,759 up to 46,849; the exact SVM uses 61,881 SVs.]

Fast selection of kernel parameters

Kernel selection from:
• ST [Vishwanathan and Smola, 2002]
• SST [Collins and Duffy, 2002]
• SST-bow [Zhang and Lee, 2003]
• PT [Moschitti, ECML 2006]
• uPT (new!)

Results (SRL, 1 million)
[slide shows the results table]

What does this buy us? (contd.)
Finding the best model is very costly:
• training a conventional SVM solver with tree kernels on 1 million examples requires more than seven days
Using a very small sample size (which takes a couple of minutes on 1 million examples) we can correctly estimate:
• the best kernel and its hyper-parameters
• the trade-off parameter C

Second experiment: Question Classification
• Sub-task of question answering [Li and Roth, 2005]
• 6 coarse-grained categories (training: 5,500; test: 500):
  • Abbreviation: What does HTML stand for?
  • Description: What's the final line in the Edgar Allan Poe poem "The Raven"?
  • Entity: What foods can cause allergic reaction in people?
  • Human: Who won the Nobel Peace Prize in 1992?
  • Location: Where is the Statue of Liberty?
  • Numeric: When was Martin Luther King Jr. born?

Accuracy vs. margin parameter C (NUM)
[Plot: F1 (40 to 90) against C (0.1 to 100) for the exact SVM and for uSVM with sampling size 100 on the NUM category.]

Accuracy vs. margin parameter C (DESC)
[Plot: F1 (0 to 100) against C (0.1 to 100) for the exact SVM and for uSVM with sampling size 100 on the DESC category.]

Conclusions
• Integration of cutting-plane training using sampling with tree kernels*
• As accurate as the exact SVM while 10 times as fast
• Provides a flexible method to train very fast models
• Efficient model and kernel parameter selection
• We were able to test and compare a new kernel: uPT
This opens the door to the efficient application of structural kernels on very large corpora.
* will be freely available soon at http://projects.disi.unitn.it/iKernels

Ideas for future work
• Handling unbalanced datasets
• CPA allows for an efficient parallel implementation

Questions…

What does this buy us?
CPA is at least 10 times as fast as the exact version while giving the same classification accuracy. Decreasing the sampling size leads to even faster training. Learning a model whose accuracy is within 1.0 percentage point of the exact model reduces the training time by a factor of 50. A sketch of the small-sample selection recipe from above follows.
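To make the selection recipe concrete, here is a hedged sketch of it built on the train_usvm and sst_kernel sketches above: sweep (kernel, C) pairs on a small subsample with the sampled CPA learner, keep the winner, then train once at full scale. select_model and all of its parameters are assumptions for the example.

```python
# Hedged sketch of small-sample model selection with the sampled CPA
# learner; reuses train_usvm and sst_kernel from the sketches above.
import itertools

def select_model(X, y, X_dev, y_dev, kernels, Cs, subsample=1000, m=100):
    """Return the (kernel name, C, dev accuracy) of the best cheap model."""
    Xs, ys = X[:subsample], y[:subsample]
    best = (None, None, -1.0)
    for (name, kernel), C in itertools.product(kernels.items(), Cs):
        _, _, score = train_usvm(Xs, ys, kernel, C=C, m=m)
        acc = sum((score(x) > 0) == (t > 0)
                  for x, t in zip(X_dev, y_dev)) / len(y_dev)
        if acc > best[2]:
            best = (name, C, acc)
    return best   # then retrain once at full scale with the winning pair

# Illustrative call (ST/PT variants would be registered analogously):
# best = select_model(X, y, X_dev, y_dev,
#                     kernels={'SST': sst_kernel}, Cs=[0.1, 1, 10, 100])
```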