Generalization Performance of the Random Fourier Features Method
Anna C. Gilbert
University of Michigan
joint work with Yitong Sun (Univ. of Michigan)
SVM Preliminaries: two label classification
• Data with labels: Let $X \subset \mathbb{R}^d$ and data $(x, y) \in X \times \{-1, 1\}$, generated by the distribution $D$.
• Classification task: Find/approximate the Bayes classifier
  $$f(x) = \mathrm{sgn}\left( \Pr_{(X,Y)\sim D}\{Y = 1 \mid X = x\} - \tfrac{1}{2} \right).$$
• Use kernelized SVM:
  $$\mathrm{sgn}\left(\langle w, \phi(x)\rangle_{H} + b\right),$$
  where $\phi : X \to H$ satisfies
  $$\langle \phi(x), \phi(y)\rangle_{H} = k(x, y).$$
SVM Preliminaries: solution of SVM
• Primal:
  $$\hat{w} = \operatorname*{argmin}_{w \in H} \; \frac{C}{m} \sum_{j=1}^{m} \max\left(0,\, 1 - y_j\left(\langle w, \phi(x_j)\rangle_H + b\right)\right) + \frac{1}{2}\|w\|_H^2. \tag{1}$$
• Dual:
  $$\hat{\alpha} = \operatorname*{argmax}_{\substack{0 \le \alpha_j \le 1/m \\ \sum_j \alpha_j y_j = 0}} \; C \sum_{j=1}^{m} \alpha_j - \frac{C^2}{2} (\alpha \circ y)^\top K_m (\alpha \circ y),$$
  where $\alpha \circ y = (\alpha_1 y_1, \ldots, \alpha_m y_m)^\top$ and $K_m = [k(x_i, x_j)]$.
• Then,
  $$\hat{w} = C \sum_{j=1}^{m} \hat{\alpha}_j y_j \phi(x_j).$$
SVM Preliminaries: equivalent formulation
• Primal:
  $$\hat{w} = \operatorname*{argmin}_{\|w\|_H \le \Lambda} \; \frac{1}{m} \sum_{j=1}^{m} \max\left(0,\, 1 - y_j\left(\langle w, \phi(x_j)\rangle_H + b\right)\right). \tag{2}$$
• Dual:
  $$\hat{\alpha} = \operatorname*{argmax}_{\substack{0 \le \alpha_j \le 1/m \\ \sum_j \alpha_j y_j = 0}} \; \sum_{j=1}^{m} \alpha_j - \Lambda \sqrt{(\alpha \circ y)^\top K_m (\alpha \circ y)}. \tag{3}$$
• When
  $$\Lambda = C \sqrt{(\hat{\alpha} \circ y)^\top K_m (\hat{\alpha} \circ y)},$$
  the solution is the same as that of (1).
• Force $b = 0$ for technical simplicity; the corresponding dual problem is then of the same form, except that the constraint $\sum_j \alpha_j y_j = 0$ is dropped.
RFF: Radial Basis Function Kernels
• Radial Basis Function kernels (RBF):
  $$k(x, y) = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}.$$
• Have the following feature map $\phi : X \to L^2\left(\mathbb{R}^d, \gamma\right)$,
  $$[\phi(x)](g) = e^{i g^\top x / \sigma},$$
  where $\gamma$ is the standard normal distribution on $\mathbb{R}^d$.
• Easy to verify (see the numerical check below) that
  $$\mathbb{E}_{g \sim \gamma}\left[ e^{i g^\top x / \sigma} \, \overline{e^{i g^\top y / \sigma}} \right] = e^{-\frac{\|x - y\|_2^2}{2\sigma^2}}.$$
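As a sanity check (ours, not part of the slides), the identity above can be verified numerically. The sketch below compares the RBF kernel value with a Monte Carlo estimate of the expectation over $g \sim N(0, I_d)$; the dimension, bandwidth, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.3
x, y = rng.normal(size=d), rng.normal(size=d)

# Exact RBF kernel value k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Monte Carlo estimate of E_{g ~ N(0, I_d)}[ e^{i g^T x / sigma} * conj(e^{i g^T y / sigma}) ].
g = rng.normal(size=(1_000_000, d))
z = np.exp(1j * (g @ x) / sigma) * np.exp(-1j * (g @ y) / sigma)
k_mc = z.mean()

# The two values should agree to roughly three decimal places;
# the imaginary part of the average is close to zero.
print(k_exact, k_mc.real)
```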
RFF: Approximate features
• To speed up computation, reduce the dimension of $\operatorname{span}\{\phi(x_j)\}$ to $N \ll m$.
• [Rahimi and Recht, '08] proposed to approximate $\phi$ by $\tilde\phi : X \to \mathbb{R}^{2N}$ with
  $$\tilde\phi(x) = \frac{1}{\sqrt{N}}\left( \cos\left(\frac{g_1^\top x}{\sigma}\right),\, \sin\left(\frac{g_1^\top x}{\sigma}\right),\, \ldots,\, \cos\left(\frac{g_N^\top x}{\sigma}\right),\, \sin\left(\frac{g_N^\top x}{\sigma}\right) \right)^\top,$$
  where the $g_k$'s are i.i.d. from $N(0, I_d)$.
• Then as $N \to \infty$,
  $$\tilde{k}(x, y) := \tilde\phi(x)^\top \tilde\phi(y) \longrightarrow \mathbb{E}_{g \sim \gamma}\left[ e^{i g^\top x / \sigma}\, \overline{e^{i g^\top y / \sigma}} \right] = k(x, y).$$
  (A short implementation sketch follows below.)
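A minimal NumPy sketch (ours, not from the slides) of the feature map $\tilde\phi$ and of the convergence $\tilde{k} \to k$. The cos/sin coordinates are stacked block-wise rather than interleaved per frequency, which leaves the inner product unchanged; all sizes are illustrative.

```python
import numpy as np

def rff_features(X, G, sigma):
    """Map rows of X to 2N random Fourier features
    (cos(g_k^T x / sigma), sin(g_k^T x / sigma))_{k=1..N} / sqrt(N)."""
    P = X @ G.T / sigma                      # (m, N) projections g_k^T x / sigma
    N = G.shape[0]
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(N)

rng = np.random.default_rng(0)
d, sigma, m = 3, 1.0, 200
X = rng.normal(size=(m, d))

# Exact Gram matrix K_m and its approximation K̃_m = Φ̃ Φ̃^T for growing N.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
for N in [10, 100, 1000, 10000]:
    G = rng.normal(size=(N, d))
    Phi = rff_features(X, G, sigma)
    K_tilde = Phi @ Phi.T
    # Max entrywise error shrinks roughly like 1/sqrt(N).
    print(N, np.abs(K_tilde - K).max())
```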
SVM Using Random Features
• Denote
  $$\tilde{w} = \operatorname*{argmin}_{\|w\|_2 \le \tilde\Lambda} \; \frac{1}{m} \sum_{j=1}^{m} \max\left(0,\, 1 - y_j\, w^\top \tilde\phi(x_j)\right), \tag{4}$$
  $$\tilde{\alpha} = \operatorname*{argmax}_{0 \le \alpha_j \le 1/m} \; \sum_{j=1}^{m} \alpha_j - \tilde\Lambda \sqrt{(\alpha \circ y)^\top \tilde{K}_m (\alpha \circ y)}, \tag{5}$$
  where $\tilde{K}_m = \left[\tilde{k}(x_i, x_j)\right]$ and $\tilde{C} = \tilde\Lambda \big/ \sqrt{(\tilde\alpha \circ y)^\top \tilde{K}_m (\tilde\alpha \circ y)}$.
• Question: how large should $N$ be so that we do not lose too much accuracy? And in what quantity should the loss be measured? (A toy comparison of the two models is sketched below.)
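To make the comparison concrete, here is a hedged scikit-learn sketch (ours) contrasting an exact RBF-kernel SVM with a linear SVM trained on random Fourier features. Note that scikit-learn's `SVC`/`LinearSVC` use a per-sample penalty convention that differs from the $C/m$ and $\Lambda$, $\tilde\Lambda$ parameterizations above, and `RBFSampler` implements the $\cos(w^\top x + b)$ variant of random Fourier features; the synthetic data and all parameter values are arbitrary, so this only illustrates the exact model (1) versus the random-features model (4)-(5), not an exact match.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(1)
m, d, sigma = 2000, 10, 2.0
X = rng.normal(size=(m, d))
y = np.sign(np.sin(X[:, 0]) + 0.3 * rng.normal(size=m))

# Exact kernel SVM; gamma = 1 / (2 sigma^2) matches k(x, y) = exp(-||x-y||^2 / (2 sigma^2)).
svc = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)

# Linear SVM on 2N random Fourier features (the cos(w^T x + b) variant).
N = 200
rff = RBFSampler(gamma=1.0 / (2 * sigma ** 2), n_components=2 * N, random_state=0)
Z = rff.fit_transform(X)
lin = LinearSVC(C=1.0, max_iter=10000).fit(Z, y)

print("kernel SVM train accuracy:    ", svc.score(X, y))
print("RFF + linear SVM train accuracy:", lin.score(Z, y))
```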
Related work: How close are the separators? $\|\hat{w} - \tilde{w}\|$
• Alas, $\hat{w}$ and $\tilde{w}$ live in different spaces!
• [Cortes et al., 2010] mapped them to the same ambient space and showed
  $$\|\hat{w} - \tilde{w}\|_2^2 \le \frac{C}{\sqrt{m}}\, \|K_m\|^{1/2} \left\| \tilde{K}_m - K_m \right\|^{1/2},$$
  where $C$ is the common regularization parameter.
• By a matrix Bernstein inequality [Tropp, 2015],
  $$\left\| \tilde{K}_m - K_m \right\| \le \left( \frac{4m \|K_m\|}{N} \log \frac{2m}{\delta} \right)^{1/2}.$$
• Unfortunately, $\|K_m\| = O(m)$ in most cases, which makes the resulting bound uninformative for large $m$. (A small numerical illustration follows below.)
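A small NumPy experiment (ours) illustrating the last bullet: for data drawn from a tight cluster, the spectral norm $\|K_m\|$ grows roughly linearly in $m$, and for a fixed number of features $N$ the error $\|\tilde{K}_m - K_m\|$ grows along with it. The cluster scale and other constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, N = 5, 1.0, 200

def rbf_gram(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def rff_gram(X, sigma, N, rng):
    G = rng.normal(size=(N, X.shape[1]))
    P = X @ G.T / sigma
    Phi = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(N)
    return Phi @ Phi.T

for m in [250, 500, 1000, 2000]:
    X = 0.3 * rng.normal(size=(m, d))        # tightly clustered data
    K = rbf_gram(X, sigma)
    K_tilde = rff_gram(X, sigma, N, rng)
    print(m,
          np.linalg.norm(K, 2),              # ||K_m|| grows roughly linearly in m
          np.linalg.norm(K_tilde - K, 2))    # ||K̃_m - K_m|| also grows with m for fixed N
```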
Related work: What about the true risk? R (w̃ ) − R (ŵ )
• A better indicator of the performance of a model is the true (expected) risk
  $$R(h) = \mathbb{E}_{(x, y) \sim D}\, \ell\left(y, h(x)\right),$$
  where, for SVM classification, $\ell$ is usually the hinge loss.
• The hypothesis class for kernel SVM is $h(x) = \langle w, \phi(x)\rangle_H + b$.
• Regularized (expected) risk:
  $$R^C\left(\langle w, \phi(x)\rangle_H\right) = R\left(\langle w, \phi(x)\rangle_H\right) + \frac{1}{2C}\|w\|_H^2.$$
Related work: What about the true risk? R (w̃ ) − R (ŵ )
• [Rahimi and Recht, 2009] showed that
  $$R(\tilde{w}) - R(\hat{w}) \le O\left( \frac{C}{\sqrt{m}} + \frac{C}{\sqrt{N}} \right)$$
  with high probability.
• The result is, however, for a special regularization setup (instead of 2-norm regularization): $\tilde{w}$ and $\hat{w}$ are the optimal solutions over
  $$\left\{ \sum_j \alpha_j \tilde\phi(x_j) \;\Big|\; \|\alpha\|_\infty \le \frac{C}{m} \right\} \quad\text{and}\quad \left\{ \sum_j \alpha_j \phi(x_j) \;\Big|\; \|\alpha\|_\infty \le \frac{C}{m} \right\},$$
  respectively.
• This does not match the 2-norm regularized ERM formulation!
Related work: others
• [Bach, 2015] considers constrained ERM problems with Lipschitz loss functions in a more general setting. The result on the error rate requires the random features to be generated from the optimized density $q(v)$, which minimizes
  $$d_{\max}(q, \lambda) = \sup_{v \in V} \frac{1}{q(v)} \left\langle \varphi(v, \cdot),\, (\Sigma + \lambda I)^{-1} \varphi(v, \cdot) \right\rangle_{L^2(d\rho)}.$$
  However, the optimized $q(v)$ depends on the distribution of the data, which is in general not available. Our work corresponds to $q(v) = 1$, so one would need to compute $d_{\max}(1, \lambda)$.
• [Rudi, 2016] considers ridge regression under RFF, not
classification.
Our work: cast of characters
|                          | Accurate model  | Approximate model                              |
|--------------------------|-----------------|------------------------------------------------|
| Feature                  | $\phi(x) \in H$ | $\tilde\phi(x) \in \tilde{H} = \mathbb{R}^N$   |
| Gram matrix              | $k$, $K_m$      | $\tilde{k}$, $\tilde{K}_m$                     |
| Regularization parameter | $C$             | $\tilde{C}$                                    |
| Norm constraint          | $\Lambda$       | $\tilde\Lambda$                                |
| Primal solution          | $\hat{w}$       | $\tilde{w}$                                    |
| Dual solution            | $\hat{\alpha}$  | $\tilde{\alpha}$                               |

Table: Summary of notation for the accurate and approximate models
Our work: summary
| Reference   | Excess risk | Performance indicator     | Regularization parameters |
|-------------|-------------|---------------------------|---------------------------|
| Theorem 1   | $O\left(1/\sqrt{N}\right)$ | regularized expected risk | $\tilde{C} = C$ |
| Corollary 2 | $O\left(\left(\frac{\|K_m\|}{Nm}\right)^{1/2} + \left(\frac{\|K_m\|^{7}}{Nm^{5}}\right)^{1/8}\right)$ | expected risk | $\tilde{C} = C$ |
| Theorem 3   | $O\left(1/N^{1/4}\right)$ | expected risk             | $\tilde\Lambda = \Lambda$ |
| Theorem 4   | $O\left(1/\sqrt{N}\right)$ | expected risk            | $\tilde\Lambda = \eta_1 \Lambda$ or $\tilde{C} = \eta_2 C$ |

Table: Summary of the generalization performance of the random features method under different regularization constraints
Our work: Lemma (regularized empirical risk)
Lemma
Assume $|\phi(x; \omega)| \le \kappa$, $\|\tilde\phi(x)\|_2 \le \tilde\kappa$, and $\tilde{C} = C$. Let $0 < \delta < 1$. Then, with probability at least $1 - \delta$,
$$R_m^C\left(\langle \tilde{w}, \tilde\phi(x)\rangle_{\tilde{H}}\right) - R_m^C\left(\langle \hat{w}, \phi(x)\rangle_H\right) \le C \left( \frac{\kappa^2 \|K_m\|}{Nm} \log \frac{2m}{\delta} \right)^{1/2}.$$

Proof.
Each term on the left-hand side is the minimum value of a regularized empirical risk. Use strong duality and subtract the dual forms:
$$R_m^C(\tilde{w}) - R_m^C(\hat{w}) \;\le\; \sup_{0 \le \alpha_i \le \frac{1}{m}} \frac{C}{2} (\alpha \circ y)^\top \left( K_m - \tilde{K}_m \right) (\alpha \circ y) \;\le\; \frac{C \left\| K_m - \tilde{K}_m \right\|}{2m} \;\le\; C \left( \frac{\kappa^2 \|K_m\|}{Nm} \log \frac{2m}{\delta} \right)^{1/2},$$
with probability at least $1 - \delta$. The middle inequality uses $\|\alpha \circ y\|_2 \le 1/\sqrt{m}$ for any feasible $\alpha$, and the last inequality comes from the matrix Bernstein bound on $\|\tilde{K}_m - K_m\|$. (A small numerical check of the middle inequality is sketched below.)
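A quick numerical check (ours) of the middle inequality in the proof: for any feasible $\alpha$ with $0 \le \alpha_j \le 1/m$, the quadratic form $\frac{C}{2}(\alpha\circ y)^\top(K_m - \tilde{K}_m)(\alpha\circ y)$ stays below $C\|K_m - \tilde{K}_m\|/(2m)$. The check samples random feasible $\alpha$'s rather than the maximizer; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma, N, C = 300, 4, 1.0, 50, 1.0
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)

# Exact and random-feature Gram matrices.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))
G = rng.normal(size=(N, d))
P = X @ G.T / sigma
Phi = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(N)
K_tilde = Phi @ Phi.T

bound = C * np.linalg.norm(K - K_tilde, 2) / (2 * m)
# Any feasible alpha (0 <= alpha_j <= 1/m) has ||alpha ∘ y||_2 <= 1/sqrt(m),
# so the quadratic form is at most ||K - K_tilde|| / m.
for _ in range(5):
    alpha = rng.uniform(0, 1 / m, size=m)
    lhs = (C / 2) * (alpha * y) @ (K - K_tilde) @ (alpha * y)
    assert lhs <= bound + 1e-12
print("C * ||K_m - K̃_m|| / (2m) =", bound, "held for all sampled alphas")
```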
Our work: Theorem 1 (regularized expected risk)
Theorem (1)
Same assumptions as the Lemma. Then, with probability at least $1 - \delta$,
$$R_D^C\left(\langle \tilde{w}, \tilde\phi(x)\rangle_{\tilde{H}}\right) - R_D^C\left(\langle \hat{w}, \phi(x)\rangle_H\right) \le C_1 \left( \frac{1}{m} \right)^{1/2} + C_2 \left( \frac{\|K_m\|}{Nm} \right)^{1/2},$$
where $C_1$ and $C_2$ are polynomials in $\kappa$, $\tilde\kappa$, $L$, $C$, $\log m$, and $\log(1/\delta)$. Explicit constants are given in the proof.

Proof.
Use the Lemma plus the classical trick for estimating the excess risk (bounding the gap between expected and empirical risks).
Our work: risk gap for norm constraints: Λ = Λ̃
Theorem (3)
Let $\hat{w}$ and $\tilde{w}$ be the solutions to the exact and approximate models (Eqns. (2) and (4)). Assume that $\Lambda = \tilde\Lambda$. Then
$$R(\tilde{w}) - R(\hat{w}) \le O\left( \frac{\Lambda}{N^{1/4}} \right), \quad \text{with high probability.}$$
Our work: Choose parameters carefully
Theorem (4)
Assume $\hat{w}$ and $\tilde{w}$ are the solutions to the exact and approximate models (Eqns. (2) and (4)) and
$$\frac{\tilde\Lambda}{\Lambda} \ge \eta = \left( \frac{(\hat\alpha \circ y)^\top \tilde{K}_m (\hat\alpha \circ y)}{(\hat\alpha \circ y)^\top K_m (\hat\alpha \circ y)} \right)^{1/2},$$
where $\hat\alpha$ is the dual solution. Then
$$R(\tilde{w}) - R(\hat{w}) \le O\left( \frac{\eta\Lambda}{\sqrt{m}} + \frac{C}{\sqrt{N}} \right) \le O\left( \frac{C}{\sqrt{m}} + \frac{C}{\sqrt{N}} \right),$$
with high probability. (A sketch of how $\eta$ can be estimated from a dual solution follows below.)
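A hedged sketch (ours) of how $\eta$ might be estimated in practice from a fitted kernel SVM. scikit-learn's `SVC` exposes $\alpha_i y_i$ for the support vectors via `dual_coef_`, and since $\eta$ is a ratio of quadratic forms in $\alpha \circ y$, the overall dual scaling cancels; keep in mind that scikit-learn's dual is parameterized differently from problem (3), so this only illustrates the quantity, not the exact $\hat\alpha$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
m, d, sigma, N = 500, 4, 1.5, 100
X = rng.normal(size=(m, d))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=m) > 0, 1.0, -1.0)

gamma = 1.0 / (2 * sigma ** 2)
svc = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors; fill the full vector.
ay = np.zeros(m)
ay[svc.support_] = svc.dual_coef_.ravel()

# Exact and random-feature Gram matrices.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)
G = rng.normal(size=(N, d))
P = X @ G.T / sigma
Phi = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(N)
K_tilde = Phi @ Phi.T

# eta near 1 means K̃_m preserves the dual quadratic form of the exact model.
eta = np.sqrt(ay @ K_tilde @ ay / (ay @ K @ ay))
print("eta ≈", eta)
```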
Experiments: regularization versus random features
Figure: Distribution of training samples. 50 out of 1000 points are shown
in the graph. Blue crosses represent the points generated by the uniform
distribution over the annulus 0.65 ≤ R ≤ 1.15. Red circles represent the
points generated by the uniform distribution over the annulus
0.85 ≤ R ≤ 1.35. The unit circle is the best classifier for these two distributions. Due to the overlap between the two distributions, even this best classifier achieves at most 70% classification accuracy. (A data-generation sketch follows below.)
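For reference, a sketch (ours) of how training data like that in the figure can be generated. The slides do not say whether "uniform over the annulus" means uniform in area or uniform in radius; the sketch assumes uniform in area, and the labels and per-class counts are illustrative.

```python
import numpy as np

def sample_annulus(n, r_min, r_max, rng):
    """Uniform (in area) sample from the annulus r_min <= R <= r_max."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = np.sqrt(rng.uniform(r_min ** 2, r_max ** 2, size=n))
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

rng = np.random.default_rng(0)
n = 500  # 500 points per class, 1000 in total as in the figure caption
X = np.vstack([sample_annulus(n, 0.65, 1.15, rng),    # blue class
               sample_annulus(n, 0.85, 1.35, rng)])   # red class
y = np.concatenate([-np.ones(n), np.ones(n)])
# The unit circle ||x|| = 1 is the best classifier for these two distributions.
```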
Experiments: regularization versus random features
Figure: Performance of the LSVM with random Fourier features. Each curve plots the test-set classification accuracy of the learned hypothesis as a function of C; larger values of log(C) indicate weaker regularization. The error bars on the approximate model with random features show the mean and standard deviation over 50 runs.
[Plot: classification accuracy (y-axis, roughly 0.45 to 0.75) versus log(C) (x-axis, -3 to 4) for the accurate model and for approximate models with 10 and 100 random features.]
Further Questions
• In practice, the regularization parameter $C$ is always chosen greater than the sample size $m$. In such a case, the $O(C/\sqrt{N})$ bound is meaningless.
• Theorem (4) holds only when the random features model is allowed a smaller margin, controlled by $\eta$. What is the scale of $\eta$?
• When $\tilde\Lambda / \Lambda = \eta$, $\tilde{C}/C$ is not 1. This may explain why [Cortes, 2010] failed to obtain an informative upper bound for $\|\tilde{w} - \hat{w}\|$.
• Based on the results of [Cortes, 2010] and Theorem (3), it seems that additional requirements like $C = \tilde{C}$ or $\Lambda = \tilde\Lambda$ only enlarge the gap between the approximate and accurate models. What, then, are the general criteria that we should follow when comparing two different learning models?