Large Scale Support Vector Learning with Structural Kernels
ECML/PKDD 2010
Aliaksei Severyn and Alessandro Moschitti
University of Trento, Italy
September 21, 2010
Large scale training: existing solutions
Linear SVMs for binary classification, O(n) scaling:
•  SVM-perf [Joachims, 2006]
•  OCAS [Franc and Sonnenburg, 2008]
•  Pegasos [Shalev-Shwartz et al., 2007]
Non-linear SVMs (e.g., with kernels), super-linear scaling:
•  Sparse Kernel SVMs [Joachims, 2009]
   •  basis pursuit approach
•  SVM-struct using sampled cuts [Yu and Joachims, 2008]
   •  uses sampling to approximate the sub-gradient
Adult Data Set
•  Predict whether income exceeds $50K/yr
•  Number of attributes: 14
•  Number of instances: 48,842
•  Attributes: age, workclass, education, marital-status, occupation, relationship, race, sex, etc.

Covertype Data Set
•  Predicting forest cover type
•  Number of attributes: 54
•  Number of instances: 581,012
•  Attributes: elevation, slope, aspect, soil type, horizontal & vertical distance to hydrology, hillshade, etc.

Experiments in the work of Yu and Joachims
•  Number of features is very small: < 50
•  Only the Gaussian kernel is used
•  The task has limited complexity in terms of semantics
•  There is no clear dependency between the features

Our Study
•  Structural features which embed natural language syntactic/semantic information
•  Huge number of features
   •  syntactic/semantic tree fragments
•  Features are very sparse
•  The a priori weights are very skewed
•  There is high redundancy and inter-dependency between the features
Example: our dataset
Semantic Role Labeling:
•  Exponential number of attributes
•  1M instances
•  Attribute: [a tree fragment, shown as a figure in the original slide]

Main ideas promoted in the paper
•  Test whether CPAs can be successfully applied to complex structural spaces
•  Extremely fast model and kernel parameter selection via CPAs with small samples
•  Application of structural kernels to large-scale NLP tasks

Contents
•  Cutting plane algorithm
•  Sampling cuts
•  Experiments
   •  Question Classification & SRL
   •  Accuracy & speed vs. sampling size
   •  Fast parameterization
   •  Kernel testing and selection
•  Conclusions

Cutting plane (primal), 1-slack [Joachims, 2006]
[Algorithm shown as a figure in the original slides.]

CPA in a nutshell
•  Original SVM problem: an exponential number of constraints, most of which are dominated by a small set of "important" constraints.
•  CPA SVM approach: repeatedly finds the next most violated constraint, until the set of constraints is a good approximation.

Expensive double sum of kernel evaluations
•  Replace it by a smaller sample: [formula shown as a figure in the original slides]
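To make the last two slides concrete, here is a minimal Python sketch of the 1-slack cut and its sampled approximation; it is an illustration under assumed names (make_cut, cut_dot, the toy data), not the authors' SVM-Light-TK code. A cut averages the margin violators in a candidate pool, and the inner product between two cuts is the double sum of kernel evaluations that uniform sampling (uSVM) shrinks from n to r terms per side.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cut(scores, y, sample_size=None):
    """Most violated 1-slack cut: the margin violators (y_i * score_i < 1)
    within a candidate pool. With sample_size set, the pool is a uniform
    random sample of the training set (the uSVM variant) instead of all n
    examples. The cut is (1/pool_size) * sum_{i in violators} y_i * phi(x_i).
    """
    n = len(y)
    pool = np.arange(n) if sample_size is None else rng.choice(
        n, size=sample_size, replace=False)
    violators = pool[y[pool] * scores[pool] < 1.0]
    return violators, len(pool)

def cut_dot(kernel, X, y, cut_a, cut_b):
    """<g_a, g_b> in feature space: the expensive double sum of kernel
    evaluations, costing |cut_a| * |cut_b| kernel calls."""
    (ia, na), (ib, nb) = cut_a, cut_b
    s = sum(y[i] * y[j] * kernel(X[i], X[j]) for i in ia for j in ib)
    return s / (na * nb)

# Toy usage with a linear kernel: with w = 0 every example violates the
# margin, so the exact cut has 1,000 indices and the sampled one only 100.
X = rng.normal(size=(1000, 5))
y = rng.choice([-1.0, 1.0], size=1000)
scores = X @ np.zeros(5)
exact = make_cut(scores, y)
sampled = make_cut(scores, y, sample_size=100)
print(cut_dot(lambda a, b: a @ b, X, y, exact, sampled))
```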
Experimental setup
•  Datasets:
   •  Semantic Role Labeling
   •  Question Classification

Semantic Role Labeling (SRL) dataset
•  Task: identification of argument boundaries
•  Example of SRL annotation:
   Paul gives a talk in Rome
   [Arg0 Paul] [target gives] [Arg1 a talk] [ArgM in Rome]
•  For boundary detection, the argument labels collapse into a single BD tag:
   [BD Paul] [target gives] [BD a talk] [BD in Rome]
•  Consists of PropBank and Penn Treebank annotations with Charniak parse trees, as provided by CoNLL 2005
•  Two training sets: 100,000 and 1 million instances
•  Two test sets: Sections 23 and 24 (234,416 and 149,140 instances)
Features
•  Syntactic Tree Kernel [Collins and Duffy, 2002]
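Since the Syntactic Tree Kernel carries all of the structural features here, a short sketch of the Collins and Duffy (2002) recursion may help; the Tree class and the naive all-pairs loop below are illustrative simplifications (real implementations speed this up by matching only node pairs with identical productions).

```python
from dataclasses import dataclass, field
from math import prod

@dataclass
class Tree:
    label: str
    children: list["Tree"] = field(default_factory=list)

    def production(self):
        return (self.label, tuple(c.label for c in self.children))

    def nodes(self):
        yield self
        for c in self.children:
            yield from c.nodes()

def delta(n1, n2, lam=0.4):
    """Common subset-tree fragments rooted at n1 and n2, weighted by the
    decay factor lam (the Collins & Duffy recursion)."""
    if not n1.children or not n2.children:        # words match no fragment
        return 0.0
    if n1.production() != n2.production():
        return 0.0
    if all(not c.children for c in n1.children):  # same pre-terminal rule
        return lam
    return lam * prod(1.0 + delta(c1, c2, lam)
                      for c1, c2 in zip(n1.children, n2.children))

def tree_kernel(t1, t2, lam=0.4):
    """K(T1, T2): sum of delta over all node pairs, O(|T1| * |T2|)."""
    return sum(delta(a, b, lam) for a in t1.nodes() for b in t2.nodes())

# "(NP (D a) (N talk))" vs "(NP (D a) (N dog))" share three fragments:
# (D a), (NP D N), and (NP (D a) N). With lam=1 the kernel counts them
# exactly (3); with lam=0.4 larger fragments are downweighted (0.96).
t1 = Tree("NP", [Tree("D", [Tree("a")]), Tree("N", [Tree("talk")])])
t2 = Tree("NP", [Tree("D", [Tree("a")]), Tree("N", [Tree("dog")])])
print(tree_kernel(t1, t2))
```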
Models
•  Exact SVM based on SVM-Light-TK [Joachims, 1999] [Moschitti, 2008]:
   •  SVM
•  CPA with sampling [Yu and Joachims, 2008]:
   •  uSVM (uniform sampling)
   •  iSVM (importance sampling)*
* not covered in this presentation

Accuracy vs. sampling size (SRL, 1 million instances)
[Figure: F1 (72-84) on test sections 23 and 24 vs. sampling size (0-6,000) for uSVM; per-point training times grow from 1 to 1,121 minutes. Baselines: exact SVM on sections 23 and 24, training time = 10,705 minutes.]
Sampling size vs. number of support vectors (SRL, 1 million instances)
[Figure: F1 (72-84) vs. sampling size (0-6,000) for uSVM on section 23; points are labeled with their number of support vectors, from 2,759 to 46,849. Baseline: exact SVM on section 23 with 61,881 support vectors.]
Fast selection of kernel parameters

Kernel selection from:
•  ST [Vishwanathan and Smola, 2002]
•  SST [Collins and Duffy, 2002]
•  SST-bow [Zhang and Lee, 2003]
•  PT [Moschitti, ECML 2006]
•  uPT (new!)

Results (SRL, 1 million instances)
[Results shown as a figure in the original slides.]

What does this buy us? (contd.)
Finding the best model is very costly:
•  training a conventional SVM solver with tree kernels on 1 million examples requires more than seven days
Using a very small sample size (a couple of minutes on 1 million examples) we can correctly estimate:
•  the best kernel and its hyper-parameters
•  the trade-off parameter C
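A minimal sketch of this small-sample selection recipe, using scikit-learn's SVC on standard vector kernels as a stand-in for the sampled-cut CPA with tree kernels; the function name, grid, and sample size are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def select_on_sample(X, y, X_val, y_val, grid, sample_size=1000, seed=0):
    """Estimate the best (kernel, C) on a small uniform subsample: cheap
    to run, yet (as the slides above report) it ranks kernels and C
    values the same way full-scale training does."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=sample_size, replace=False)
    best, best_f1 = None, -1.0
    for kernel, C in grid:
        clf = SVC(kernel=kernel, C=C).fit(X[idx], y[idx])
        f1 = f1_score(y_val, clf.predict(X_val))
        if f1 > best_f1:
            best, best_f1 = (kernel, C), f1
    return best, best_f1

# Sweep a small grid on a 1,000-example sample, then train the final
# model once, at full scale, with the selected configuration.
grid = [(k, C) for k in ("linear", "rbf") for C in (0.1, 1.0, 10.0, 100.0)]
```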
Second experiment: Question Classification
•  Sub-task of question answering [Li and Roth, 2005]
•  6 coarse-grained categories (training: 5,500 & test: 500):
   •  Abbreviation: What does HTML stand for?
   •  Description: What's the final line in the Edgar Allan Poe poem "The Raven"?
   •  Entity: What foods can cause allergic reaction in people?
   •  Human: Who won the Nobel Peace Prize in 1992?
   •  Location: Where is the Statue of Liberty?
   •  Numeric: When was Martin Luther King Jr. born?
Accuracy vs. margin parameter C (NUM category)
[Figure: F1 (40-90) vs. C (0.1-100) for the exact SVM and for uSVM with sampling size = 100.]
Accuracy vs. margin parameter C (DESC category)
[Figure: F1 (0-100) vs. C (0.1-100) for the exact SVM and for uSVM with sampling size = 100.]
Conclusions
•  Integration of cutting plane training using sampling with Tree Kernels*
•  As accurate as the exact SVM while 10 times as fast
•  Provides a flexible method to train very fast models
•  Efficient model and kernel parameter selection
•  We were able to test and compare a new kernel: uPT
This opens the door for the efficient application of structural kernels on very large corpora.
* will be freely available soon at http://projects.disi.unitn.it/iKernels
Ideas for future work
•  Handling unbalanced datasets
•  CPA allows for an efficient parallel implementation

Questions…

What does this buy us?
CPA is at least 10 times as fast as the exact version while giving the same classification accuracy. Decreasing the sampling size leads to even faster training. Learning a model whose accuracy is within 1.0 percentage point of the exact model's reduces the training time by a factor of 50.