Support Vector Machines (SVMs) and their Solution by Stochastic Gradient Descent
Yossi Keshet
March 18, 2014
We consider an input instance x ∈ X = Rd and a target label y ∈ Y = {−1, +1}. They
are assumed to be drawn from a fixed but unknown distribution ρ. We don’t have access to
this distribution, but instead we have a training set of examples S = {(x1 , y1 ), . . . , (xm , ym )},
which serves as a restricted window to the unknown distribution ρ.
We also consider a linear classifier parameterized by a weight vector w ∈ Rd , such that
the parameter setting w determines a mapping of any input x ∈ X to an output ŷw (x) ∈ Y,
defined as follows
ŷw(x) = sign(w · x).   (1)
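As a quick illustration, the classifier in (1) can be sketched as follows (the weight vector and instance here are made up; we adopt the convention sign(0) = +1):

```python
import numpy as np

def predict(w, x):
    """Linear classifier of Eq. (1): y_hat = sign(w . x)."""
    # convention: ties (w . x = 0) are mapped to +1
    return 1 if np.dot(w, x) >= 0 else -1

# hypothetical weight vector and instance, purely for illustration
w = np.array([1.0, -2.0])
x = np.array([3.0, 1.0])
print(predict(w, x))  # w . x = 3 - 2 = 1, so the prediction is +1
```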
The goal of the learning problem is to find a setting of the weight vector w of the classifier
so as to minimize the error
w∗ = arg min_w P[ŷw(x) ≠ y].   (2)
This probability is with respect to the distribution ρ. We can replace the probability by the expectation of an indicator function, hence

w∗ = arg min_w E(x,y)∼ρ [1[ŷw(x) ≠ y]],   (3)

where 1[π] is the indicator function, which equals 1 when the predicate π holds and 0 otherwise.
We replace the expectation with the mean over the training set and add regularization to
avoid overfitting:
w∗ = arg min_w  (1/m) Σ_{i=1}^m 1[ŷw(xi) ≠ yi] + (λ/2) ‖w‖².
The term 1[ŷw(xi) ≠ yi] is a combinatorial quantity and is thus difficult to minimize
directly. Instead we use a convex upper bound to it:
1[ŷw(xi) ≠ yi] = 1[yi w · xi ≤ 0]   (4)
               ≤ max{0, 1 − yi w · xi}   (5)
               = ℓ(w; xi, yi).   (6)
The function ℓ(w; xi, yi) is called the hinge loss. See Figure 1 for a graphical presentation of the loss as an upper bound to the 0-1 error.
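The upper-bound property in (4)-(6) can be seen numerically; in the sketch below the weight vector and instances are arbitrary choices for illustration:

```python
import numpy as np

def zero_one(w, x, y):
    # 0-1 error: 1 iff the margin y * (w . x) is non-positive
    return 1.0 if y * np.dot(w, x) <= 0 else 0.0

def hinge(w, x, y):
    # hinge loss of Eq. (6): max{0, 1 - y * (w . x)}
    return max(0.0, 1.0 - y * np.dot(w, x))

w = np.array([0.5, -0.5])  # hypothetical weights
for x, y in [(np.array([2.0, 0.0]), 1),    # margin 1.0: both losses are 0
             (np.array([1.0, 0.0]), 1),    # margin 0.5: hinge 0.5, 0-1 error 0
             (np.array([-1.0, 0.0]), 1)]:  # margin -0.5: hinge 1.5, 0-1 error 1
    assert hinge(w, x, y) >= zero_one(w, x, y)  # the upper-bound property
```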
Figure 1: The hinge loss, `(w; xi , yi ) as a convex upper bound (red) to the 0-1 binary error
(blue).
We replace 1[ŷw(x) ≠ y] with the loss:
w∗ = arg min_w  (1/m) Σ_{i=1}^m max{0, 1 − yi w · xi} + (λ/2) ‖w‖².   (7)
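The objective in (7) is straightforward to evaluate; a minimal sketch (the toy data below is invented for illustration):

```python
import numpy as np

def svm_objective(w, X, Y, lam):
    """Regularized empirical hinge loss of Eq. (7)."""
    margins = Y * (X @ w)                   # y_i * (w . x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return hinge.mean() + 0.5 * lam * np.dot(w, w)

# toy data, purely illustrative
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0]])
Y = np.array([1.0, -1.0, -1.0])
print(svm_objective(np.zeros(2), X, Y, lam=0.1))  # at w = 0 all margins are 0, so the value is 1.0
```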
Note that the loss is a convex function of w, since for any two vectors w1 and w2 and any scalar
α ∈ [0, 1] the following holds:

ℓ(αw1 + (1 − α)w2; x, y) ≤ α ℓ(w1; x, y) + (1 − α) ℓ(w2; x, y).
The regularization term (λ/2)‖w‖² is also a convex function. This means that the optimization problem
in (7) is convex and there exists a single optimum. How do we find this optimum?
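The convexity inequality above can be spot-checked numerically at random points (a sanity check under arbitrary random data, not a proof):

```python
import numpy as np

def hinge(w, x, y):
    # hinge loss: max{0, 1 - y * (w . x)}
    return max(0.0, 1.0 - y * np.dot(w, x))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
for _ in range(1000):
    w1, w2 = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()  # alpha drawn from [0, 1]
    lhs = hinge(a * w1 + (1 - a) * w2, x, y)
    rhs = a * hinge(w1, x, y) + (1 - a) * hinge(w2, x, y)
    assert lhs <= rhs + 1e-9  # the convexity inequality holds
```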
Stochastic Gradient Descent
We are interested in finding w which minimizes the following objective

min_w Σ_{i=1}^m fi(w),
where fi are functions from vectors w to scalars. The idea of stochastic gradient descent is to
iterate over the functions fi , and update the vector w after each iteration. The update at each
iteration is
w ← w − ηt ∇w fi(w),   (8)

where ηt > 0 is the learning rate and ∇w fi(w) is the gradient of fi at w.
Let’s apply that to the SVM optimization problem defined in (7). Here, the functions fi
are of the form fi(w) = max{0, 1 − yi w · xi} + (λ/2)‖w‖². The gradient of the hinge term
is not defined at the hinge point, but a sub-gradient of it is defined everywhere:
∇w max{0, 1 − y w · x} =  −y x   if 1 − y w · x ≥ 0
                           0      otherwise
Hence the (sub-)gradient of (7) for example i is:

∇w fi(w) =  −yi xi + λw   if 1 − yi w · xi ≥ 0
            λw            otherwise
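The two cases of this sub-gradient translate directly into code; a minimal sketch (the inputs below are arbitrary):

```python
import numpy as np

def svm_subgradient(w, x, y, lam):
    """Sub-gradient of f_i(w) = max{0, 1 - y w.x} + (lam/2)||w||^2."""
    if 1.0 - y * np.dot(w, x) >= 0:
        return -y * x + lam * w  # the hinge term is active
    return lam * w               # the hinge term is zero; only the regularizer remains

w = np.zeros(2)
g = svm_subgradient(w, np.array([1.0, 2.0]), 1.0, lam=0.1)
print(g)  # at w = 0 the margin is 0, so g = -y x = [-1, -2]
```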
Plugging this sub-gradient into the update rule in (8), the stochastic sub-gradient descent
algorithm for (7) is as follows:
Input: training set S = {(x1, y1), . . . , (xm, ym)}, parameters λ, η0
Initialize: w0 = 0
For: t = 1, . . . , T
  • choose example (xi, yi) uniformly at random
  • set ηt = η0/√t
  • if 1 − yi wt−1 · xi ≥ 0
      wt = (1 − ηt λ) wt−1 + ηt yi xi
    else
      wt = (1 − ηt λ) wt−1
Output: w = Σ_{t=1}^T wt

Figure 2: Stochastic sub-gradient descent algorithm for SVM
Compare this to the Perceptron algorithm:
Input: training set S = {(x1, y1), . . . , (xm, ym)}
Initialize: w0 = 0
For: t = 1, . . . , T
  • choose example (xi, yi) uniformly at random
  • if yi wt−1 · xi ≤ 0
      wt = wt−1 + yi xi
    else
      wt = wt−1
Output: w = Σ_{t=1}^T wt

Figure 3: Perceptron algorithm
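For comparison, a sketch of Figure 3 in the same style (again on invented toy data, returning the average of the iterates, which predicts identically to their sum):

```python
import numpy as np

def perceptron(X, Y, T=1000, seed=0):
    """Perceptron with random example order, following Figure 3."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)              # choose an example uniformly at random
        if Y[i] * np.dot(w, X[i]) <= 0:  # prediction error: update
            w = w + Y[i] * X[i]          # no regularization, learning rate fixed at 1
        w_sum += w
    return w_sum / T

# linearly separable toy data, purely illustrative
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, Y)
print(np.all(np.sign(X @ w) == Y))  # True on this separable toy set
```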
The differences are:

1. The Perceptron has no margin: its condition checks for prediction errors (yi w · xi ≤ 0)
   rather than margin errors (1 − yi w · xi ≥ 0).

2. The Perceptron has no regularization, so the shrinking of w by the factor (1 − ηt λ) is
   absent. It is as if λ = 0.

3. The learning rate of the Perceptron algorithm is fixed to 1, ηt = 1.