SYS 6016/4582: Machine Learning
Spring 2017
Lecture 12
Date: Feb 28th, 2017
Instructor: Quanquan Gu
Scriber: Pan Xu
In our last lecture, we introduced the concepts of cover set and covering number. Let’s
restate them as follows.
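Definition 1 (Cover Set) We say $V \subseteq \mathbb{R}^n$ is an $\ell_p$ cover set of function class $\mathcal{F}$ on $X_1, \ldots, X_n$ at scale $\epsilon > 0$, if for any $f \in \mathcal{F}$, there exists $v_f \in V$ such that
$$\Big(\frac{1}{n} \sum_{i=1}^{n} \big|f(X_i) - v_f[i]\big|^p\Big)^{1/p} \le \epsilon,$$
where $v_f[i]$ denotes the $i$-th entry of vector $v_f$.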
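Definition 2 (Empirical Covering Number) The empirical covering number of $\mathcal{F}$ on $X_1, \ldots, X_n$ is defined as
$$N_p(\mathcal{F}, \epsilon; X_{1:n}) = \min_{V} \big\{ |V| : V \text{ is an } \ell_p \text{ cover set of } \mathcal{F} \text{ on } X_1, \ldots, X_n \text{ at scale } \epsilon \big\}.$$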
Based on these definitions, we can define the covering number as follows.
Definition 3 (Covering Number) The covering number of $\mathcal{F}$ at scale $\epsilon$ with respect to the $\ell_p$ norm is defined as
$$N_p(\mathcal{F}, \epsilon, n) = \sup_{X_1, \ldots, X_n} N_p(\mathcal{F}, \epsilon; X_{1:n}).$$
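Given a function class $\mathcal{F}$ and sample $X_1, \ldots, X_n$, for any $f \in \mathcal{F}$, let the projected vector be $\mathbf{f} = (f(X_1), \ldots, f(X_n))^\top \in \mathbb{R}^n$. If $V$ is an $\ell_p$ cover set of $\mathcal{F}$ at scale $\epsilon$, then there exists $v_f \in V$ such that the normalized $\ell_p$ distance between $\mathbf{f}$ and $v_f$ is at most $\epsilon$. We illustrate this in Figure 1.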
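[Figure: cover points $v_1, \ldots, v_5$, the projected vector $\mathbf{f}$, and its covering point $v_f$ at distance at most $\epsilon$.]
Figure 1: Illustration of a cover set of $\mathcal{F}$ on $X_1, \ldots, X_n$ at scale $\epsilon$.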
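To make Definitions 1 and 2 concrete, here is a minimal Python sketch that builds a cover set greedily from the projected vectors. The toy class of threshold functions, the sample points, and the helper names `lp_dist` and `greedy_cover` are all assumptions made for illustration. A greedy cover is a valid cover set, but its size only upper-bounds the empirical covering number, which is a minimum over all cover sets.

```python
def lp_dist(u, v, p):
    """Normalized l_p distance: ((1/n) * sum_i |u_i - v_i|^p)^(1/p)."""
    n = len(u)
    return (sum(abs(a - b) ** p for a, b in zip(u, v)) / n) ** (1.0 / p)


def greedy_cover(projections, eps, p):
    """Greedily pick cover points among the projected vectors themselves.

    The result is a valid l_p cover set at scale eps. Its size is only an
    upper bound on the empirical covering number (a minimum over all covers).
    """
    cover = []
    for f_vec in projections:
        # Add f_vec as a new cover point only if no existing point covers it.
        if not any(lp_dist(f_vec, v, p) <= eps for v in cover):
            cover.append(f_vec)
    return cover


# Hypothetical toy class: threshold functions f_t(x) = +1 if x >= t else -1.
sample = [0.1, 0.2, 0.4, 0.7, 0.9]
thresholds = [0.0, 0.15, 0.3, 0.5, 0.8, 1.0]
projections = [
    tuple(1.0 if x >= t else -1.0 for x in sample) for t in thresholds
]

cover = greedy_cover(projections, eps=0.5, p=1)

# Check the defining property: every projected vector has a cover point
# within normalized l_1 distance eps.
is_cover = all(
    any(lp_dist(f_vec, v, 1) <= 0.5 for v in cover) for f_vec in projections
)
print(len(cover), is_cover)
```

Here six functions are covered by fewer points because neighbouring thresholds produce projected vectors that differ in a single coordinate.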
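Exercise: Given a function class $\mathcal{F}$ and $\epsilon > 0$, if $p \ge q > 0$, then which one is larger: $N_p(\mathcal{F}, \epsilon, n)$ or $N_q(\mathcal{F}, \epsilon, n)$?
Hint: Fix any $X_1, X_2, \ldots, X_n$ and let $V$ be an $\ell_q$ cover set of $\mathcal{F}$ at scale $\epsilon$ with $|V| = N_q(\mathcal{F}, \epsilon; X_{1:n})$. For any $f \in \mathcal{F}$ with projected vector $\mathbf{f} = [f(X_1), f(X_2), \ldots, f(X_n)]^\top$, we can find a $v_f \in V$ such that $n^{-1/q} \|v_f - \mathbf{f}\|_q \le \epsilon$. For any vector $x \in \mathbb{R}^n$, we have $\|x\|_q \le n^{1/q - 1/p} \|x\|_p$, that is, $n^{-1/q} \|x\|_q \le n^{-1/p} \|x\|_p$. Therefore, the normalized $\ell_p$ distance between $v_f$ and $\mathbf{f}$ may exceed $\epsilon$, in which case $V$ is not a cover set of $\mathcal{F}$ at scale $\epsilon$ in $\ell_p$ distance. Conversely, the same inequality shows that every $\ell_p$ cover set at scale $\epsilon$ is also an $\ell_q$ cover set at scale $\epsilon$, so $N_q(\mathcal{F}, \epsilon; X_{1:n}) \le N_p(\mathcal{F}, \epsilon; X_{1:n})$. Taking the supremum over $X_1, \ldots, X_n$ gives $N_p(\mathcal{F}, \epsilon, n) \ge N_q(\mathcal{F}, \epsilon, n)$.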
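The norm inequality in the hint can be checked numerically. The sketch below is only an illustration under assumed values ($q = 1$, $p = 2$, random vectors in $[-1, 1]^8$); the helper name `normalized_norm` is hypothetical. It also gives one concrete vector that lies inside the normalized $\ell_1$ ball of radius $0.3$ but outside the normalized $\ell_2$ ball of the same radius.

```python
import random


def normalized_norm(x, p):
    """Normalized l_p norm: ((1/n) * sum_i |x_i|^p)^(1/p) = n^(-1/p) * ||x||_p."""
    n = len(x)
    return (sum(abs(a) ** p for a in x) / n) ** (1.0 / p)


random.seed(0)
q, p = 1, 2  # assumed exponents with p >= q

# Count vectors violating n^(-1/q) ||x||_q <= n^(-1/p) ||x||_p.
violations = 0
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0) for _ in range(8)]
    if normalized_norm(x, q) > normalized_norm(x, p) + 1e-12:
        violations += 1

# A vector within normalized l_1 distance 0.3 of the origin, but not within
# normalized l_2 distance 0.3: an l_1 cover point need not be an l_2 cover point.
spike = [1.0, 0.0, 0.0, 0.0]
close_in_l1 = normalized_norm(spike, 1) <= 0.3
close_in_l2 = normalized_norm(spike, 2) <= 0.3
print(violations, close_in_l1, close_in_l2)
```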
For a function class $\mathcal{F}$ whose output is bounded, the following theorem bounds its empirical Rademacher complexity by its $\ell_1$ covering number.
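Theorem 1 (Pollard's Bound) Let $\mathcal{F}$ be a function class such that $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$ and $X_1, \ldots, X_n$ be $n$ examples. The empirical Rademacher complexity can be bounded by
$$\widehat{R}_n(\mathcal{F}) \le \inf_{\beta \ge 0} \Bigg\{ \beta + \sqrt{\frac{2 \log N_1(\mathcal{F}, \beta; X_{1:n})}{n}} \Bigg\}.$$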
Proof: For all $\beta \ge 0$, we want to show
$$\widehat{R}_n(\mathcal{F}) \le \beta + \sqrt{\frac{2 \log N_1(\mathcal{F}, \beta; X_{1:n})}{n}}.$$
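Let $V$ be an $\ell_1$ cover of $\mathcal{F}$ on $X_1, \ldots, X_n$ at scale $\beta$ with $|V| = N_1(\mathcal{F}, \beta; X_{1:n})$. Therefore, for any $f \in \mathcal{F}$, there exists a $v_f \in V$ such that $n^{-1} \sum_{i=1}^{n} |f(X_i) - v_f[i]| \le \beta$. Recalling the definition of empirical Rademacher complexity, we have
$$\widehat{R}_n(\mathcal{F}) = \mathbb{E}_\sigma \Bigg[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \,\Bigg|\, X_{1:n} \Bigg] = \mathbb{E}_\sigma \Bigg[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big( f(X_i) - v_f[i] + v_f[i] \big) \,\Bigg|\, X_{1:n} \Bigg].$$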
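Thus we have
$$\begin{aligned}
\widehat{R}_n(\mathcal{F}) &\le \mathbb{E}_\sigma \Bigg[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \big( f(X_i) - v_f[i] \big) \,\Bigg|\, X_{1:n} \Bigg] + \mathbb{E}_\sigma \Bigg[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i v_f[i] \,\Bigg|\, X_{1:n} \Bigg] \\
&\le \beta + \mathbb{E}_\sigma \Bigg[ \sup_{v_f \in V} \frac{1}{n} \sum_{i=1}^{n} \sigma_i v_f[i] \,\Bigg|\, X_{1:n} \Bigg] \\
&\le \beta + \sqrt{\frac{2 \log N_1(\mathcal{F}, \beta; X_{1:n})}{n}},
\end{aligned}$$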
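where the first inequality comes from $\sup(a + b) \le \sup a + \sup b$. The second inequality is due to the definition of the cover set $V$: since $|\sigma_i| = 1$, we have $n^{-1} \sum_{i=1}^{n} \sigma_i (f(X_i) - v_f[i]) \le n^{-1} \sum_{i=1}^{n} |f(X_i) - v_f[i]| \le \beta$. The last inequality is due to Massart's Finite Lemma, where we may assume without loss of generality that every $v \in V$ lies in $[-1, 1]^n$ (clipping $v$ to this range does not increase its distance to any projected vector), so that $\|v\|_2 \le \sqrt{n}$.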
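As a sanity check on the bound just proved, the following sketch computes the empirical Rademacher complexity of a small finite class exactly, by enumerating all $2^n$ sign vectors, and compares it with $\beta + \sqrt{2 \log |V| / n}$ minimized over a grid of $\beta$. Everything concrete here is an assumption for illustration: the threshold class, the sample, the $\beta$ grid, and the helper names. The cover $V$ is built greedily, so $|V|$ only upper-bounds $N_1(\mathcal{F}, \beta; X_{1:n})$; the proof above applies to any $\ell_1$ cover with entries in $[-1, 1]$, so the computed value is still a valid, possibly looser, upper bound.

```python
import itertools
import math


def l1_dist(u, v):
    """Normalized l_1 distance: (1/n) * sum_i |u_i - v_i|."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def greedy_l1_cover(projections, beta):
    """Greedy l_1 cover at scale beta, using projected vectors as cover points."""
    cover = []
    for f_vec in projections:
        if not any(l1_dist(f_vec, v) <= beta for v in cover):
            cover.append(f_vec)
    return cover


def empirical_rademacher(projections):
    """Exact E_sigma[ sup_f (1/n) sum_i sigma_i f(X_i) ] over all 2^n sign vectors."""
    n = len(projections[0])
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=n):
        total += max(
            sum(s * a for s, a in zip(sigma, f_vec)) / n for f_vec in projections
        )
    return total / 2 ** n


def covering_upper_bound(projections, betas):
    """min over the beta grid of beta + sqrt(2 log|V| / n), with V a greedy cover."""
    n = len(projections[0])
    return min(
        beta + math.sqrt(2.0 * math.log(len(greedy_l1_cover(projections, beta))) / n)
        for beta in betas
    )


# Hypothetical toy class: threshold functions with values in {-1, +1}.
sample = [i / 10 for i in range(10)]
thresholds = [i / 10 - 0.05 for i in range(11)]
projections = [
    tuple(1.0 if x >= t else -1.0 for x in sample) for t in thresholds
]

rad = empirical_rademacher(projections)
bound = covering_upper_bound(projections, betas=[0.0, 0.2, 0.4, 0.8, 1.6, 2.0])
print(rad, bound)
```

Exact enumeration is only feasible for small $n$; for larger samples one would estimate the expectation by sampling sign vectors instead.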