
Extreme Classification
Tighter Bounds, Distributed Training, and new Algorithms
Marius Kloft (HU Berlin)
Krakow, Feb 16, 2017
Joint work with Urun Dogan (Microsoft Research), Yunwen Lei (CU Hong Kong), Maximilian Alber (Berlin Big Data
Center), Julian Zimmert (HU Berlin), Moustapha Cisse (Facebook AI Lab), Alexander Binder (Singapore), and
Rohit Babbar (MPI Tübingen)
Outline
1. Introduction
2. Distributed Algorithms
3. Theory
4. Learning Algorithms
5. Conclusion
What is Multi-class Classification?
Multi-class classification: given a data point x, decide which class the data point is annotated with.
What is Extreme Classification?
Extreme classification is multi-class classification with an extremely large number of classes.
Example 1
We are continuously monitoring the internet for new webpages,
which we would like to categorize.
Example 2
We have data from an online biomedical bibliographic database
that we want to index for quick access to clinicians.
Example 3
We are collecting data from an online feed of photographs that
we would like to classify into image categories.
Example 4
We add new articles to an online encyclopedia and intend to
predict the categories of the articles.
Need
We need theory and algorithms tailored to extreme classification.
How do algorithms and bounds scale in #classes?
2. Distributed Algorithms
Support Vector Machine (SVM) is a Popular Method for Binary Classification (Cortes and Vapnik, '95)
Core idea:
- Which hyperplane to take?
- The one that separates the data with the largest margin.
Popular Generalization to Multiple Classes: One-vs.-Rest SVM
Put C := #classes.

One-vs.-rest SVM
1  for c = 1..C
2      class1 := c;  class2 := union(allOtherClasses)
3      w_c := solutionOfSVM(class1, class2)
4  end
5  Given a test point x, predict c_predicted := arg max_c w_c^T x
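To make the scheme concrete, here is a minimal one-vs.-rest sketch (our illustration, not the slides' code), assuming scikit-learn's LinearSVC as the binary solver; X is an (n, d) feature matrix and y holds labels 0..C-1:

    # Minimal one-vs.-rest sketch (illustration only, not the authors' code).
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ovr(X, y, num_classes, reg=1.0):
        """Train one binary SVM per class: class c vs. the union of all other classes."""
        models = []
        for c in range(num_classes):
            clf = LinearSVC(C=reg)
            clf.fit(X, (y == c).astype(int))
            models.append(clf)
        return models

    def predict_ovr(models, X):
        """Predict arg max_c w_c^T x via each binary model's decision value."""
        scores = np.column_stack([m.decision_function(X) for m in models])
        return scores.argmax(axis=1)

The C binary problems are independent, which is exactly why the training loop parallelizes over classes.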
Runtime of One-vs.-Rest
O(dn) per machine, assuming sufficient computational resources (#classes many computers running in parallel).
Problem With One-vs.-Rest
:) training can be parallelized in the number of classes
(extreme classification!)
:( It is just a hack: one-vs.-rest SVM is not built for multiple classes (the coupling between classes is not exploited)!
There are "True" Multi-class SVMs, So-called All-in-one Multi-class SVMs
binary: SVM
multi-class: Lee, Lin, and Wahba ('04); Weston and Watkins ('99); Crammer and Singer ('02)

Problem: State-of-the-art solvers require a training time complexity of O(dn · C), where d = dim, n = #examples, and C = #classes.
Aim: Develop algorithms that train all-in-one MC-SVMs on O(C) machines in parallel in O(dn) runtime.
⇒ same time complexity as one-vs.-rest, yet a more sophisticated algorithm
All-in-one SVMs
All of them minimize a trade-off between a regularizer and a loss term:

    $\min_{w=(w_1,\dots,w_C)}\ \frac{1}{2}\sum_c \|w_c\|^2 + C \cdot L(w, \mathrm{data})$

All three MC-SVMs share this regularizer, but they differ in the loss (note: $\ell(x) := \max(0, 1-x)$):

CS:   $\sum_{i=1}^n \max_{c \neq y_i} \ell\big((w_{y_i}-w_c)^T x_i\big)$
WW:   $\sum_{i=1}^n \sum_{c \neq y_i} \ell\big((w_{y_i}-w_c)^T x_i\big)$
LLW:  $\sum_{i=1}^n \sum_{c \neq y_i} \ell\big(-w_c^T x_i\big)$,  s.t. $\sum_c w_c = 0$

Sources: Crammer and Singer (2002), Weston and Watkins (1999), Lee, Lin, and Wahba (2004)
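To make the difference between the three losses concrete, here is a small NumPy sketch (our illustration; the names hinge, W, x, y are our own) evaluating each loss on a single example (x_i, y_i), with W holding one weight vector per row:

    import numpy as np

    def hinge(t):
        return np.maximum(0.0, 1.0 - t)

    def cs_loss(W, x, y):
        # Crammer & Singer: only the worst violating class counts.
        margins = (W[y] - W) @ x          # (w_y - w_c)^T x for all c
        margins = np.delete(margins, y)   # drop c = y
        return hinge(margins).max()

    def ww_loss(W, x, y):
        # Weston & Watkins: sum over all competing classes.
        margins = (W[y] - W) @ x
        margins = np.delete(margins, y)
        return hinge(margins).sum()

    def llw_loss(W, x, y):
        # Lee, Lin & Wahba: penalize wrong-class scores directly (requires sum_c w_c = 0).
        scores = np.delete(W @ x, y)
        return hinge(-scores).sum()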
Can we solve these all-in-one MC-SVMs in parallel?
Let's look at Lee, Lin, and Wahba (LLW) first.
This is the LLW Dual Problem

    $\max_{\alpha}\ \sum_{c=1}^{C}\Big[-\tfrac{1}{2}\big\|X\alpha_c-\tfrac{1}{C}\sum_{\tilde c} X\alpha_{\tilde c}\big\|^2+\sum_{i:\,y_i=c}\alpha_i\Big]$
    s.t. $\alpha_{i,y_i}=0$,  $0\le\alpha_{i,c}\le C$

Denote the average term by $\bar w:=\tfrac{1}{C}\sum_{\tilde c} X\alpha_{\tilde c}$. Maximizing over $\bar w$ as a free variable recovers exactly this average, so the problem decomposes over the classes:

    $\max_{\alpha,\bar w}\ \sum_{c} \underbrace{\Big[-\tfrac{1}{2}\|X\alpha_c-\bar w\|^2+\sum_{i:\,y_i=c}\alpha_i\Big]}_{D_c(\alpha_c,\bar w)}$
    s.t. $\alpha_{i,y_i}=0$,  $0\le\alpha_{i,c}\le C$
LLW: Proposed Algorithm

Algorithm: Simple wrapper algorithm
1: function SimpleSolve-LLW(C, X, Y)
2:   while not converged do
3:     for c = 1..C do in parallel
4:       α_c ← arg max_{α̃_c} D_c(α̃_c, w̄)
5:     end for
6:     w̄ ← arg max_w D(α, w)
7:   end while
8: end function

Alber, Zimmert, Dogan, and Kloft (2016): NIPS submitted rejected ;)
PLoS submitted, arXiv:1611.08480
Ok, fine so far with the LLW SVM.
Now, let’s look at the Weston and Watkins (WW) SVM.
WW: This is What the Dual Problem Looks Like

    $\max_{\alpha\in\mathbb{R}^{n\times C}}\ D(\alpha):=\sum_{c=1}^{C}\Big[-\tfrac{1}{2}\|{-X\alpha_c}\|^2+\sum_{i:\,y_i\neq c}\alpha_{i,c}\Big]$
    s.t. $\forall i:\ \alpha_{i,y_i}=-\sum_{c\neq y_i}\alpha_{i,c}$,   $\forall c\neq y_i:\ 0\le\alpha_{i,c}\le C$

A common strategy to optimize such a dual problem is to optimize one coordinate after another ("dual coordinate ascent"):

1  for i = 1, ..., n
2      for c = 1, ..., C
3          α_{i,c} ← arg max_{α_{i,c}} D(α)
4      end
5  end
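As a generic illustration of dual coordinate ascent (not the exact WW dual, whose equality constraints couple the coordinates of each example), the following sketch maximizes a box-constrained concave quadratic by exact, clipped single-coordinate updates:

    # Coordinate ascent for  max_alpha  -0.5 * alpha^T Q alpha + b^T alpha,  0 <= alpha <= C.
    import numpy as np

    def coordinate_ascent(Q, b, C, n_sweeps=50):
        n = len(b)
        alpha = np.zeros(n)
        for _ in range(n_sweeps):
            for j in range(n):
                # b_j minus the off-diagonal part of (Q alpha)_j; the 1-D problem is concave,
                # so clipping the unconstrained maximizer to the box gives the exact update.
                resid = b[j] - Q[j] @ alpha + Q[j, j] * alpha[j]
                alpha[j] = np.clip(resid / Q[j, j], 0.0, C)
        return alpha

    # Tiny example with a positive-definite Q.
    Q = np.array([[2.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, 1.0])
    print(coordinate_ascent(Q, b, C=10.0))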
This is Now the Story...
We optimize α_{i,c} in the gradient direction:

    $\frac{\partial}{\partial\alpha_{i,c}}:\ 1-(w_{y_i}-w_c)^T x_i$

The derivative depends only on two weight vectors (not all C of them!).
Can we exploit this?
Analogy: Soccer League Schedule
We are given a football league (e.g., the Bundesliga) with C teams.
Before the season, we have to decide on a schedule such that each team plays every other team exactly once.
Furthermore, all teams shall play on every matchday, so in total we need only C − 1 matchdays.

Example: The Bundesliga has C = 18 teams ⇒ C − 1 = 17 matchdays (or twice that many if counting home and away matches).

How can we come up with such a schedule?
This is a Classical Computer Science Problem...
This is the 1-factorization of a graph problem. The solution is known:
(Figure: the schedule construction for C = 8 teams and 7 matchdays.)
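A minimal sketch of the classical circle-method construction (our illustration, not the authors' code): it produces the C − 1 "matchdays" on which every pair of classes meets exactly once, assuming C is even (add a dummy class otherwise):

    def round_robin(C):
        """1-factorization of the complete graph on C vertices via the circle method."""
        assert C % 2 == 0, "add a dummy class if C is odd"
        teams = list(range(C))
        rounds = []
        for _ in range(C - 1):
            # pair the fixed first element with the rotating head, then fold the rest
            pairs = [(teams[0], teams[-1])]
            pairs += [(teams[i], teams[-1 - i]) for i in range(1, C // 2)]
            rounds.append(pairs)
            # rotate all elements except the first
            teams = [teams[0]] + [teams[-1]] + teams[1:-1]
        return rounds

    # Example: 8 classes -> 7 matchdays, each class plays exactly once per matchday.
    for day, pairs in enumerate(round_robin(8), 1):
        print(day, pairs)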
WW: Proposed Algorithm

Algorithm: Simplistic DBCA wrapper algorithm
 1: function SimpleSolve-WW(C, X, Y)
 2:   while not converged do
 3:     for r = 1..C − 1 do              # iterate over "matchdays"
 4:       for c = 1..C/2 do in parallel  # iterate over "matches"
 5:         (c_i, c_j) ← the two classes ("opposing teams")
 6:         α_{I_{c_i}, c_j}, α_{I_{c_j}, c_i} ← arg max_{α_1, α_2} D_c(α_1, α_2)
 7:       end for
 8:     end for
 9:   end while
10: end function

Alber, Zimmert, Dogan, and Kloft (2016): arXiv:1611.08480
Accuracies

Dataset       # Training   # Test   # Classes   # Features
ALOI              98,200   10,800       1,000          128
LSHTCsmall         4,463    1,858       1,139       51,033
DMOZ2010         128,710   34,880      12,294      381,581

Dataset       OVR      CS       WW       LLW
ALOI          0.1824   0.0974   0.0930   0.6560
LSHTCsmall    0.549    0.5919   0.5505   0.9263
DMOZ2010      0.5721   –        0.5432   0.9586

Table: Datasets used in our paper, their properties, and best test error over a grid of C values.
Results: Speedup
(Figure: speedup vs. number of nodes. Left: LLW on dmoz_2010 and aloi, 1 to 32 nodes. Right: WW on dmoz_2010 and aloi, 1 to 16 active nodes.)
Open questions
- Higher efficiencies via GPUs?
- Why does LLW accuracy break?
- Parallelization for CS?
3. Theory
Theory and Algorithms in Extreme Classification
- Just saw: algorithms that better handle a large number of classes
- But theory is not prepared for extreme classification:
- Data-dependent bounds scale at least linearly with the number of classes
  (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)
Theory of Extreme Classification
Questions:
- Can we get bounds with mild dependence on #classes?
- ⇒ Novel algorithms?
Multi-class Classification
Given:
- Training data $z_1=(x_1,y_1),\dots,z_n=(x_n,y_n)\in X\times Y$, drawn i.i.d. from P
- $Y := \{1, 2, \dots, C\}$
- C = number of classes
(Figure: example images for 20 object categories: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor.)
Formal Problem Setting
Aim:
- Define a hypothesis class H of functions $h = (h_1, \dots, h_C)$
- Find an h ∈ H that "predicts well" via $\hat y := \arg\max_{y\in Y} h_y(x)$

Multi-class SVMs:
- $h_y(x) = \langle w_y, \phi(x)\rangle$
- Introduce the notion of the (multi-class) margin
      $\rho_h(x, y) := h_y(x) - \max_{y':\,y'\neq y} h_{y'}(x)$
- The larger the margin, the better.

Want: a large expected margin $\mathbf{E}\,\rho_h(X, Y)$.
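For concreteness, a small sketch (our illustration) of the prediction rule and the multi-class margin for a linear hypothesis h_y(x) = <w_y, x>, i.e., with the identity feature map:

    import numpy as np

    def multiclass_margin(W, x, y):
        """W: (C, d) weight matrix, x: (d,) input, y: true label in {0, ..., C-1}."""
        scores = W @ x                       # h_1(x), ..., h_C(x)
        rival = np.delete(scores, y).max()   # best competing score max_{y' != y} h_{y'}(x)
        return scores[y] - rival             # positive iff x is classified correctly with a margin

    def predict(W, x):
        # The predicted label is simply arg max_y h_y(x).
        return int(np.argmax(W @ x))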
Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
- based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
- conservative, unable to adapt to the data

Data-dependent bounds
- based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
+ tighter: able to capture the real data, computable from the data
Def.: Rademacher and Gaussian Complexity
- Let σ_1, ..., σ_n be independent Rademacher variables (taking only the values ±1, with equal probability).
- The Rademacher complexity (RC) is defined as
      $R(H) := \mathbf{E}_\sigma \sup_{h\in H} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(z_i)$
- Let g_1, ..., g_n ~ N(0, 1).
- The Gaussian complexity (GC) is defined as
      $G(H) := \mathbf{E}_g \sup_{h\in H} \frac{1}{n}\sum_{i=1}^{n} g_i h(z_i)$

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991)
      $R(H) \le \sqrt{\tfrac{\pi}{2}}\, G(H) \le 3\sqrt{\tfrac{\pi}{2}}\sqrt{\log n}\, R(H).$
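As an illustration of how the Rademacher complexity can be computed from the data, here is a Monte Carlo estimate (our sketch) for the bounded linear class H = {x -> <w, x> : ||w||_2 <= 1}, for which the supremum has the closed form sup_{||w||<=1} (1/n) Σ_i σ_i <w, x_i> = (1/n) ||Σ_i σ_i x_i||_2:

    import numpy as np

    def empirical_rademacher(X, n_draws=1000, seed=0):
        """Monte Carlo estimate of E_sigma sup_{||w||<=1} (1/n) sum_i sigma_i <w, x_i>."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        vals = []
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
            vals.append(np.linalg.norm(sigma @ X) / n)   # closed-form supremum over the unit ball
        return float(np.mean(vals))

    # Example: for bounded data the estimate shrinks roughly like 1/sqrt(n).
    X = np.random.default_rng(1).normal(size=(200, 10))
    print(empirical_rademacher(X))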
Existing Data-Dependent Analysis
The key step is estimating R({ρ_h : h ∈ H}), induced by the margin operator ρ_h and the class H.

Existing bounds build on the structural result

    $R\big(\{\max\{h_1,\dots,h_C\} : h_c \in H_c,\ c=1,\dots,C\}\big) \;\le\; \sum_{c=1}^{C} R(H_c)$    (1)

Best known dependence on the number of classes:
- quadratic dependence: Koltchinskii and Panchenko (2002); Mohri et al. (2012); Cortes et al. (2013)
- linear dependence: Kuznetsov et al. (2014)

Can we do better?
The correlation among the class-wise components is ignored.
A New Structural Lemma on Gaussian Complexities
We consider the Gaussian complexity. We show:

    $G\big(\{\max\{h_1,\dots,h_C\} : h=(h_1,\dots,h_C)\in H\}\big) \;\le\; \frac{1}{n}\,\mathbf{E}_g \sup_{h=(h_1,\dots,h_C)\in H} \sum_{i=1}^{n}\sum_{c=1}^{C} g_{ic}\, h_c(x_i).$    (2)

Core idea: a comparison inequality for Gaussian processes (Slepian, 1962). For every h ∈ H define

    $X_h := \sum_{i=1}^{n} g_i \max\{h_1(x_i),\dots,h_C(x_i)\}, \qquad Y_h := \sum_{i=1}^{n}\sum_{c=1}^{C} g_{ic}\, h_c(x_i).$

Slepian's lemma: $\mathbf{E}[(X_\theta - X_{\bar\theta})^2] \le \mathbf{E}[(Y_\theta - Y_{\bar\theta})^2] \;\Longrightarrow\; \mathbf{E}[\sup_{\theta\in\Theta} X_\theta] \le \mathbf{E}[\sup_{\theta\in\Theta} Y_\theta].$

Eq. (2) preserves the coupling among the class-wise components!
Example on Comparison of the Structural Lemma
- Consider $H := \{(x_1,x_2)\mapsto (h_1,h_2)(x_1,x_2) = (w_1 x_1, w_2 x_2) : \|(w_1,w_2)\|_2 \le 1\}$.
- For the function class $\{\max\{h_1,h_2\} : h=(h_1,h_2)\in H\}$, the old structural result (1) leads to
      $\sup_{(h_1,h_2)\in H} \sum_{i=1}^{n} \sigma_i h_1(x_i) + \sup_{(h_1,h_2)\in H} \sum_{i=1}^{n} \sigma_i h_2(x_i)$,
  whereas the new result (2) leads to
      $\sup_{(h_1,h_2)\in H} \sum_{i=1}^{n} \big[g_{i1} h_1(x_i) + g_{i2} h_2(x_i)\big]$.
- Preserving the coupling means taking the supremum over a smaller space!
Estimating Multi-class Gaussian Complexity
- Consider a vector-valued function class defined by
      $H := \{h^w = (\langle w_1,\phi(x)\rangle,\dots,\langle w_C,\phi(x)\rangle) : f(w)\le\Lambda\}$,
  where f is β-strongly convex w.r.t. ‖·‖, i.e.,
      $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y) - \tfrac{\beta}{2}\alpha(1-\alpha)\|x-y\|^2$.

Theorem

    $\mathbf{E}_g \sup_{h^w\in H} \frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} g_{ic}\, h^w_c(x_i) \;\le\; \frac{1}{n}\sqrt{\frac{2\pi\Lambda}{\beta}\,\mathbf{E}_g\Big\|\Big(\sum_{i=1}^{n} g_{ic}\,\phi(x_i)\Big)_{c=1}^{C}\Big\|_*^2},$    (3)

where ‖·‖_* is the dual norm of ‖·‖.
Features of the complexity bound
- Applies to a general function class defined through a strongly convex regularizer f.
- The class-wise components h_1, ..., h_C are correlated through the term $\big\|\big(\sum_{i} g_{ic}\phi(x_i)\big)_{c=1}^{C}\big\|_*^2$.
- Consider the class $H_{p,\Lambda} := \{h^w : \|w\|_{2,p}\le\Lambda\}$ with $\frac{1}{p}+\frac{1}{p^*}=1$; then

    $\frac{1}{n}\,\mathbf{E}_g \sup_{h^w\in H_{p,\Lambda}} \sum_{i=1}^{n}\sum_{c=1}^{C} g_{ic}\, h^w_c(x_i) \;\le\; \frac{\Lambda}{n}\sqrt{\sum_{i=1}^{n} k(x_i,x_i)} \times \begin{cases}\sqrt{e}\,(4\log C)^{1+\frac{1}{2\log C}}, & \text{if } p^*\ge 2\log C,\\ (2p^*)^{1+\frac{1}{p^*}}\, C^{\frac{1}{p^*}}, & \text{otherwise.}\end{cases}$

The dependence on the number of classes is sublinear for 1 ≤ p ≤ 2, and even logarithmic as p approaches 1!
Future Directions
Theory: A data-dependent bound independent of the class size?
⇒ Needs a more powerful structural result on the Gaussian complexity of function classes induced by the maximum operator.
- Might be worth looking into ℓ∞-norm covering numbers.

Reference: Lei, Dogan, Binder, and Kloft (NIPS 2015); journal submission forthcoming
4. Learning Algorithms
ℓp-norm Multi-class SVM
Motivated by the mild dependence on C as p → 1, we consider the (ℓp-norm) Multi-class SVM, 1 ≤ p ≤ 2:

    $\min_{w}\ \frac{1}{2}\Big[\sum_{c=1}^{C}\|w_c\|_2^{p}\Big]^{\frac{2}{p}} + C\sum_{i=1}^{n}(1-t_i)_+$
    s.t. $t_i = \langle w_{y_i},\phi(x_i)\rangle - \max_{y:\,y\neq y_i}\langle w_y,\phi(x_i)\rangle$    (P)
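The regularizer in (P) is half the squared ℓ_{2,p} block norm of the weight matrix; a small sketch (our illustration, names our own):

    # ||W||_{2,p} = ( sum_c ||w_c||_2^p )^(1/p), so the penalty in (P) is 0.5 * ||W||_{2,p}^2.
    import numpy as np

    def l2p_regularizer(W, p):
        """W: (C, d) weight matrix, one row w_c per class; 1 <= p <= 2."""
        row_norms = np.linalg.norm(W, axis=1)          # ||w_c||_2 for every class
        return 0.5 * (np.sum(row_norms ** p)) ** (2.0 / p)

    # p = 2 recovers the usual sum of squared norms; p -> 1 couples the classes more strongly.
    W = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
    print(l2p_regularizer(W, p=2.0), l2p_regularizer(W, p=1.2))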
Empirical Results

Method / Dataset    Sector     News 20    Rcv1       Birds 50   Caltech 256
ℓp-norm MC-SVM      94.2±0.3   86.2±0.1   85.7±0.7   27.9±0.2   56.0±1.2
Crammer & Singer    93.9±0.3   85.1±0.3   85.2±0.3   26.3±0.3   55.0±1.1

The proposed ℓp-norm MC-SVM is consistently better on these benchmark datasets.
Wait... I performed this Experiment:
- I took the DMOZ2010 dataset (aim: categorize new webpages).
- OVR-SVM, Train = 128,710, Test = 34,880. Result: 27% of the classes are never used in prediction.
New Learning Algorithm: Schatten-SVM

    $\min_{W=(w_1,\dots,w_C)}\ \frac{1}{2}\underbrace{\|W\|_{S_p}^2}_{\text{Schatten norm}} + C\sum_{i=1}^{n}\max_{c\neq y_i}\ell\big((w_{y_i}-w_c)^T x_i\big)$

Schatten-p norm: $\|W\|_{S_p} := \sqrt[p]{\sum_i \sigma_i\big(\sqrt{W^\top W}\big)^p}$
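Since the σ_i(√(WᵀW)) are exactly the singular values of W, the Schatten-p norm is the ℓp norm of W's singular values; a small sketch (our illustration):

    import numpy as np

    def schatten_norm(W, p):
        s = np.linalg.svd(W, compute_uv=False)   # singular values sigma_i(W) = sigma_i(sqrt(W^T W))
        return (np.sum(s ** p)) ** (1.0 / p)

    W = np.array([[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(schatten_norm(W, p=2.0))   # p = 2: Frobenius norm
    print(schatten_norm(W, p=1.0))   # p = 1: trace/nuclear norm, encourages low rank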
Geometry of Schatten Norm
(Figure.)
Schatten-norm Parameter p Controls Coverage
(Figure.)

Results
(Figure.)
Future Directions
Algorithms: New models & efficient solvers
- Novel models motivated by theory
- top-k MC-SVM (Lapin et al., 2015)
- Analyze the p > 2 regime
- Extensions to multi-label learning
5. Conclusion
Conclusion
(Summary slide.)
Refs I
M. Alber, J. Zimmert, U. Dogan, and M. Kloft. Distributed Optimization of Multi-Class SVMs. in submission, 2016.
C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In
ICML-13, pages 46–54, 2013.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal
of Machine Learning Research, 2:265–292, 2002.
Y. Guermeur. Combining discriminant models with new multi-class svms. Pattern Analysis & Applications, 5(2):
168–179, 2002.
S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial
Intelligence Research, 30(1):525–564, 2007.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined
classifiers. Annals of Statistics, pages 1–50, 2002.
V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing
Systems, pages 2501–2509, 2014.
M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. CoRR, abs/1511.06683, 2015. URL
http://arxiv.org/abs/1511.06683.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer,
Berlin, 1991.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of
microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–82,
2004.
Y. Lei, Ü. Dogan, A. Binder, and M. Kloft. Multi-class svms: From tighter data-dependent generalization bounds to
novel algorithms. In Advances in Neural Information Processing Systems 28: Annual Conference on
Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages
2035–2043, 2015. URL http://papers.nips.cc/paper/
6012-multi-class-svms-from-tighter-data-dependent-generalization-bounds-to-novel-algorith
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2012.
Refs II
D. Slepian. The one-sided barrier problem for gaussian noise. Bell System Technical Journal, 41(2):463–501,
1962.
J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In M. Verleysen, editor,
Proceedings of the Seventh European Symposium On Artificial Neural Networks (ESANN), pages
219–224. Evere, Belgium: d-side publications, 1999.
T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In
Advances in Neural Information Processing Systems, pages 1625–1632, 2004a.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. The Journal of Machine
Learning Research, 5:1225–1251, 2004b.