Large-Scale SVM Optimization:
Taking a Machine Learning Perspective
Shai Shalev-Shwartz
Toyota Technological Institute at Chicago
Joint work with Nati Srebro
Talk at NEC Labs, Princeton, August, 2008
Motivation
10k training examples: 1 hour, 2.3% error
1M training examples: 1 week, 2.29% error

Can always sub-sample and get error of 2.3% using 1 hour
Can we leverage the excess data to reduce runtime?
Say, achieve error of 2.3% using 10 minutes?
Outline
Background: Machine Learning, Support Vector Machine (SVM)
SVM as an optimization problem
A Machine Learning Perspective on SVM Optimization
Approximate optimization
Re-define quality of optimization using generalization error
Error decomposition
Data-Laden Analysis
Stochastic Methods
Why Stochastic?
PEGASOS (Stochastic Gradient Descent)
Stochastic Coordinate Dual Ascent
Background: Machine Learning and SVM
A learning algorithm receives a training set $\{(x_i, y_i)\}_{i=1}^m$, a hypothesis set $\mathcal{H}$, and a loss function, and applies a learning rule to output a hypothesis $h : \mathcal{X} \to \mathcal{Y}$.
Support Vector Machine
Linear hypotheses: $h_w(x) = \langle w, x \rangle$
Prefer hypotheses with large margin, i.e., low Euclidean norm
Resulting learning rule:
$$\operatorname*{argmin}_{w} \;\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \underbrace{\max\{0,\; 1 - y_i \langle w, x_i \rangle\}}_{\text{hinge loss}}$$
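To make the objective concrete, here is a minimal NumPy sketch of the regularized hinge-loss objective above (the array layout and function name are my own choices, not from the talk):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective: (lam/2)*||w||^2 + mean hinge loss."""
    margins = y * (X @ w)                   # y_i * <w, x_i> for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # max{0, 1 - y_i <w, x_i>}
    return lam / 2 * (w @ w) + hinge.mean()
```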
Support Vector Machines and Optimization
SVM learning rule:
$$\operatorname*{argmin}_{w} \;\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\{0,\; 1 - y_i \langle w, x_i \rangle\}$$
SVM optimization problem can be written as a Quadratic
Programming problem
$$\operatorname*{argmin}_{w,\xi} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \;\; \forall i,\; 1 - y_i \langle w, x_i \rangle \le \xi_i \;\wedge\; \xi_i \ge 0$$
Standard solvers exist. End of story?
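For illustration only, a sketch of this QP using CVXPY as an off-the-shelf solver (an assumed tool choice, not one made in the talk); solving it this way is exactly what becomes prohibitive for large m:

```python
import cvxpy as cp
import numpy as np

def svm_qp(X, y, lam):
    """Solve the SVM quadratic program above with a generic QP solver."""
    m, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(m, nonneg=True)                  # slack variables
    objective = cp.Minimize(lam / 2 * cp.sum_squares(w) + cp.sum(xi) / m)
    constraints = [1 - cp.multiply(y, X @ w) <= xi]   # hinge constraints
    cp.Problem(objective, constraints).solve()
    return w.value
```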
Approximate Optimization
If we don’t have infinite computation power, we can only
approximately solve the SVM optimization problem
Traditional analysis
SVM objective:
$$P(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \ell(\langle w, x_i \rangle, y_i)$$
$\tilde{w}$ is a ρ-accurate solution if
$$P(\tilde{w}) \le \min_{w} P(w) + \rho$$
Main focus: how does the optimization runtime depend on ρ? E.g., interior-point (IP) methods converge in time $O(m^{3.5} \log\log(1/\rho))$
Large-scale problems: how does the optimization runtime depend on m? E.g., SMO converges in time $O(m^2 \log(1/\rho))$
SVM-Perf runtime is $O\!\left(\frac{m}{\lambda \rho}\right)$
Machine Learning Perspective on Optimization
Our real goal is not to solve the SVM problem P(w)
Our goal is to find w with low generalization error:
$$L(w) = \mathbb{E}_{(x,y)\sim P}\left[\ell(\langle w, x \rangle, y)\right]$$
Redefine approximate accuracy:
$\tilde{w}$ is an ε-accurate solution w.r.t. margin parameter B if
$$L(\tilde{w}) \le \min_{w : \|w\| \le B} L(w) + \epsilon$$
Study runtime as a function of ε and B
Error Decomposition
Theorem (S, Srebro ’08)
If $\tilde{w}$ satisfies
$$P(\tilde{w}) \le \min_{w} P(w) + \rho$$
then, w.p. at least $1 - \delta$ over the choice of the training set, $\tilde{w}$ satisfies
$$L(\tilde{w}) \le \min_{w : \|w\| \le B} L(w) + \epsilon$$
with
$$\epsilon = \frac{\lambda B^2}{2} + \frac{c \log(1/\delta)}{\lambda m} + 2\rho$$
(Following:
Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”, NIPS ’08)
More Data ⇒ Less Work?
[Figure: excess error L(w̃) as a function of training set size m, decomposed into approximation, estimation, and optimization error.]
When data set size increases:
Can increase ρ ⇒ can optimize less accurately ⇒ runtime decreases
But handling more data may be expensive ⇒ runtime increases
Machine Learning Analysis of Optimization Algorithms
Given a solver with optimization accuracy ρ(T, m, λ)
To ensure excess generalization error ≤ ε we need that
$$\min_{\lambda} \;\left[\frac{\lambda B^2}{2} + \frac{c \log(1/\delta)}{\lambda m} + 2\,\rho(T, m, \lambda)\right] \;\le\; \epsilon$$
From the above we get the runtime T as a function of m, B, ε.
Examples (ignoring logarithmic terms and constants, and assuming linear kernels):

  Method                         ρ(T, m, λ)      T(m, B, ε)
  SMO (Platt ’98)                exp(−T/m²)      B⁴/ε⁴
  SVM-Perf (Joachims ’06)        m/(λT)          B⁴/ε⁴
  SGD (S, Srebro, Singer ’07)    1/(λT)          B²/ε²
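A hypothetical numeric illustration of this analysis (the constants, grids, and rate functions below are arbitrary choices of mine): given a solver's guarantee ρ(T, m, λ), search for the smallest runtime T at which some λ pushes the error-decomposition bound below ε.

```python
import numpy as np

def required_runtime(rho, m, B, eps, c=1.0, delta=0.05):
    """Smallest T (on a coarse grid) such that min over lambda of the bound is <= eps."""
    lambdas = np.logspace(-8, 0, 200)        # grid over the regularization parameter
    for T in np.logspace(2, 14, 300):        # increasing candidate runtimes
        bound = (lambdas * B**2 / 2
                 + c * np.log(1 / delta) / (lambdas * m)
                 + 2 * rho(T, m, lambdas))
        if bound.min() <= eps:               # minimize over lambda
            return T
    return np.inf

# e.g. an SGD-style rate 1/(lambda*T) vs. an SVM-Perf-style rate m/(lambda*T)
print(required_runtime(lambda T, m, lam: 1 / (lam * T), m=10**6, B=10.0, eps=0.05))
print(required_runtime(lambda T, m, lam: m / (lam * T), m=10**6, B=10.0, eps=0.05))
```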
Stochastic Gradient Descent (Pegasos)
Initialize $w_1 = 0$
For $t = 1, 2, \ldots, T$:
    Choose $i \in [m]$ uniformly at random
    Define $\nabla_t = \lambda w_t - \mathbb{1}[y_i \langle w_t, x_i \rangle < 1]\, y_i x_i$
    Note: $\mathbb{E}[\nabla_t]$ is a sub-gradient of $P(w)$ at $w_t$
    Set $\eta_t = \frac{1}{\lambda t}$
    Update: $w_{t+1} = w_t - \eta_t \nabla_t = \left(1 - \tfrac{1}{t}\right) w_t + \tfrac{1}{\lambda t}\, \mathbb{1}[y_i \langle w_t, x_i \rangle < 1]\, y_i x_i$
Theorem (Pegasos Convergence)
$$\mathbb{E}[\rho] \le O\!\left(\frac{\log(T)}{\lambda T}\right)$$
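A minimal NumPy sketch of the update above (the function signature and RNG handling are my own):

```python
import numpy as np

def pegasos(X, y, lam, T, rng=np.random.default_rng(0)):
    """Stochastic sub-gradient descent on the SVM objective P(w)."""
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                 # choose an example uniformly at random
        active = y[i] * X[i].dot(w) < 1     # is the hinge loss active at w_t?
        w *= 1 - 1.0 / t                    # (1 - eta_t*lambda) * w_t with eta_t = 1/(lambda*t)
        if active:
            w += (1.0 / (lam * t)) * y[i] * X[i]
    return w
```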
Dependence on Data Set Size
Corollary (Pegasos generalization analysis)
$$T(m; \epsilon, B) = \tilde{O}\!\left(\left(\frac{B}{\epsilon - \frac{1}{\sqrt{m}}}\right)^{2}\right)$$
[Figure: runtime (millions of iterations) as a function of training set size; theoretical bound (left) and empirical results on CCAT (right). Past the sample complexity, runtime decreases with the training set size toward the data-laden regime.]
Intermediate Summary
Analyze runtime (T) as a function of
    excess generalization error (ε)
    size of competing class (B)
Up to constants and logarithmic terms, stochastic gradient descent (Pegasos) is optimal: its runtime is of the order of the sample complexity $\Omega\!\left(\frac{B^2}{\epsilon^2}\right)$
For Pegasos, running time decreases as training set size increases
Coming next
Limitations of Pegasos
Dual Coordinate Ascent methods
Limitations of Pegasos
Pegasos is a simple and efficient optimization method. However, it has some limitations:
log(sample complexity) factor in convergence rate
No clear stopping criterion
Tricky to obtain a good single solution with high confidence
Too aggressive at the beginning (especially when λ is very small)
When working with kernels, too many support vectors
Hsieh et al. recently argued that empirically dual coordinate ascent
outperforms Pegasos
Dual Methods
The dual SVM problem:
$$\max_{\alpha \in [0,1]^m} D(\alpha), \qquad \text{where} \qquad D(\alpha) = \frac{1}{m}\sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2\lambda m^2}\,\Big\|\sum_{i} \alpha_i y_i x_i\Big\|^2$$
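For concreteness, a short sketch evaluating D(α) for dense NumPy data (the helper is mine, not from the talk):

```python
import numpy as np

def dual_objective(alpha, X, y, lam):
    """SVM dual objective D(alpha) for alpha in [0, 1]^m."""
    m = len(y)
    v = (alpha * y) @ X                       # sum_i alpha_i * y_i * x_i
    return alpha.mean() - (v @ v) / (2 * lam * m**2)
```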
Decomposition Methods
Dual problem has a different variable for each example
⇒ can optimize over a subset of the variables at each iteration
Extreme case
Dual Coordinate Ascent (DCA) – optimize D w.r.t. a single variable at
each iteration
SMO – optimize over 2 variables (necessary when there is a bias term)
Linear convergence for decomposition methods
The general convergence theory of Luo and Tseng (’92) implies linear convergence
But the dependence on m is quadratic, therefore
$$T = O(m^2 \log(1/\rho))$$
This implies the machine learning analysis
$$T = O(B^4/\epsilon^4)$$
Why is SGD much better than decomposition methods?
    Primal vs. dual?
    Stochasticity?
Stochastic Dual Coordinate Ascent
The stochastic DCA algorithm
Initialize α = (0, . . . , 0) and w = 0
For $t = 1, 2, \ldots, T$:
    Choose $i \in [m]$ uniformly at random
    Update $\alpha_i = \alpha_i + \tau_i$, where
    $$\tau_i = \max\left\{-\alpha_i,\ \min\left\{1 - \alpha_i,\ \frac{\lambda m\,(1 - y_i \langle w, x_i \rangle)}{\|x_i\|^2}\right\}\right\}$$
    Update $w = w + \frac{\tau_i}{\lambda m}\, y_i x_i$
Hsieh et al. showed encouraging empirical results
No satisfactory theoretical guarantee
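A minimal NumPy sketch of this update rule (names are mine; the only stopping criterion is the iteration budget T):

```python
import numpy as np

def stochastic_dca(X, y, lam, T, rng=np.random.default_rng(0)):
    """Stochastic dual coordinate ascent, maintaining w = (1/(lam*m)) * sum_i alpha_i y_i x_i."""
    m, d = X.shape
    alpha = np.zeros(m)
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)
        # unconstrained coordinate step, clipped so that alpha_i + tau stays in [0, 1]
        step = lam * m * (1 - y[i] * X[i].dot(w)) / X[i].dot(X[i])
        tau = max(-alpha[i], min(1 - alpha[i], step))
        alpha[i] += tau
        w += (tau / (lam * m)) * y[i] * X[i]
    return w, alpha
```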
Analysis of stochastic DCA
Theorem (S ’08)
With probability at least 1 − δ, the accuracy of stochastic DCA satisfies
$$\rho \;\le\; \frac{8 \ln(1/\delta)}{T}\left(\frac{1}{\lambda} + m\right)$$
Proof idea:
Let $\alpha^\star$ be the optimal dual solution
Upper bound the dual sub-optimality at round t by the double potential
$$\frac{1}{2\lambda m}\,\mathbb{E}_i\!\left[\|\alpha_t - \alpha^\star\|^2 - \|\alpha_{t+1} - \alpha^\star\|^2\right] + \mathbb{E}_i\!\left[D(\alpha_{t+1}) - D(\alpha_t)\right]$$
Sum over t, use telescoping, and bound the result using weak-duality
Use approximated duality theory (Scovel, Hush, Steinwart ’08)
Finally, use measure concentration techniques
Comparing SGD and DCA
SGD: $\rho(m, T, \lambda) \le \frac{1}{T} \cdot \frac{\log(T)}{\lambda}$
DCA: $\rho(m, T, \lambda) \le \frac{1}{T}\left(\frac{1}{\lambda} + m\right)$
Conclusion: the relative performance depends on whether $\lambda m \lessgtr \log(T)$
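A quick illustrative comparison of the two bounds (constants dropped; the particular T, m, and λ grid below are arbitrary choices of mine):

```python
import numpy as np

T, m = 10**7, 10**6
for lam in np.logspace(-8, -2, 7):
    sgd_bound = np.log(T) / (lam * T)          # SGD: log(T)/(lambda*T)
    dca_bound = (1 / lam + m) / T              # DCA: (1/lambda + m)/T
    winner = "DCA" if dca_bound < sgd_bound else "SGD"
    print(f"lambda={lam:.0e}  SGD<={sgd_bound:.2e}  DCA<={dca_bound:.2e}  smaller: {winner}")
```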
[Figure: optimization accuracy ε_acc as a function of λ (log-log scale) for SGD and DCA, on CCAT (left) and cov1 (right).]
Combining SGD and DCA?
The above graphs raise a natural question: can we somehow combine SGD and DCA?
Seemingly, this is impossible as SGD is a primal algorithm while DCA
is a dual algorithm
Interestingly, SGD can also be viewed as a dual algorithm, but with a dual function that changes along the optimization process
This is an ongoing direction ...
Machine Learning analysis of DCA
So far, we compared SGD and DCA using the traditional measure of accuracy (ρ)
But what about runtime as a function of ε and B?
Similarly to the previous derivation (and ignoring log terms):
SGD: $T \le \frac{B^2}{\epsilon^2}$
DCA: $T \le \frac{B^2}{\epsilon^3}$
Is this really the case?
SGD vs. DCA – Machine Learning Perspective
[Figure: hinge loss (left) and 0-1 loss (right) as a function of runtime in epochs (log scale) for SGD and DCA, on CCAT (top) and cov1 (bottom).]
Analysis of DCA revisited
DCA analysis: $T \le \frac{1}{\lambda\rho} + \frac{m}{\rho}$
The first term is as in SGD, while the second term involves the training set size.
This is necessary since each dual variable has only a 1/m effect on w.
However, a more delicate analysis is possible:
Theorem (DCA refined analysis)
If $T \ge m$ then with high probability at least one of the following holds true:
    After a single epoch, DCA satisfies $L(\tilde{w}) \le \min_{w : \|w\| \le B} L(w) + \epsilon$
    DCA converges in time $\rho \le \frac{c}{T - m}\left(\frac{1}{\lambda} + \lambda m B^2 + B\sqrt{m}\right)$
The above theorem implies $T \le O(B^2/\epsilon^2)$.
Discussion
Bottou and Bousquet initiated the study of approximate optimization from the perspective of generalization error
We further develop this idea
Regularized loss (like SVM)
Comparing algorithms based on runtime for achieving certain
generalization error
Comparing algorithms in the data-laden regime
More data ⇒ less work
Two stochastic approaches are close to optimal
Best methods are extremely simple :-)
Limitations and Open Problems
Analysis is based on upper bounds of estimation and optimization
error
The online-to-batch analysis gives the same bounds for one epoch over the data (no theoretical explanation of when we need more than one pass)
We assume constant runtime for each inner-product evaluation (holds for linear kernels). How to deal with non-linear kernels?
    Sampling?
    Smart selection (online learning on a budget? Clustering?)
We assume λ is optimally chosen. Can the runtime of tuning λ be incorporated in the analysis?
Assumptions on the distribution (e.g., noise conditions) ⇒ better analysis?
A more general theory of optimization from a machine learning
perspective