PAC-Bayes bounds using Gaussian posterior distributions
Olivier Catoni
CNRS, INRIA (projet CLASSIC)
Département de Mathématiques et Applications,
ENS, 45 rue d’Ulm, 75 230 Paris Cedex 05,
[email protected]
Computer Vision and Machine Learning Group,
Institute for Science and Technology Austria,
Klosterneuburg, July 26, 2013
Supervised classification
We observe independent copies $(X_i, Y_i)$, $1 \le i \le n$, of a pair of
random variables $(X, Y)$, where $X \in \mathbb{R}^d$ and $Y \in \{-1, +1\}$.
We would like to relate the empirical error rate of these labeled data,
$$\int \mathbb{1}\big[\langle x, \theta\rangle\, y \le 0\big]\, d\bar{P}(x, y),
\qquad \text{where } \bar{P} = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}
\text{ is the empirical measure,}$$
with its expectation
$$\int \mathbb{1}\big[\langle x, \theta\rangle\, y \le 0\big]\, dP(x, y),$$
where $P$ is the joint distribution of $(X, Y)$.
PAC-Bayes margin error bounds
after Langford, Shawe-Taylor, and McAllester
Let us introduce the Gaussian perturbation of the parameter $\theta$:
$$\frac{d\mu_\theta}{d\theta'}(\theta') = \left(\frac{\beta}{2\pi}\right)^{d/2}
\exp\left(-\frac{\beta\,\|\theta' - \theta\|^2}{2}\right),
\qquad \theta \in \mathbb{R}^d.$$
It is such that
$$K(\mu_\theta, \mu_0) = \int \log\big(d\mu_\theta / d\mu_0\big)\, d\mu_\theta
= \frac{\beta}{2}\,\|\theta\|^2.$$
The perturbation of the loss function can be computed explicitly:
$$\int \mathbb{1}\big[\langle x, \theta'\rangle\, y \le 0\big]\, d\mu_\theta(\theta')
= \varphi\left(\frac{\sqrt{\beta}\,\langle x, \theta\rangle\, y}{\|x\|}\right),$$
where
$$\varphi(a) = \int_a^{+\infty} (2\pi)^{-1/2} \exp(-t^2/2)\, dt.$$
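As a quick sanity check of this closed form (an illustrative sketch, not from the slides; NumPy and SciPy assumed): under $\mu_\theta$, $\langle x, \theta'\rangle$ is Gaussian with mean $\langle x, \theta\rangle$ and variance $\|x\|^2/\beta$, so a Monte Carlo average of the 0-1 loss over sampled perturbations should reproduce $\varphi\big(\sqrt{\beta}\,\langle x, \theta\rangle\, y / \|x\|\big)$.

```python
# Monte Carlo check of the closed-form Gaussian perturbation of the 0-1 loss.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, beta = 5, 4.0
theta = rng.normal(size=d)
x, y = rng.normal(size=d), -1.0

# Left-hand side: sample θ' ~ N(θ, β^{-1} I_d) and average the 0-1 loss.
thetas = theta + rng.normal(size=(200_000, d)) / np.sqrt(beta)
lhs = np.mean(thetas @ x * y <= 0)

# Right-hand side: φ(a) = P(Z >= a) is the standard Gaussian upper tail.
rhs = norm.sf(np.sqrt(beta) * (x @ theta) * y / np.linalg.norm(x))
print(lhs, rhs)  # the two values agree up to Monte Carlo error
```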
More interestingly, the error function itself is upper-bounded by the
perturbation of the error function with margin:
$$\int \mathbb{1}\big[\|x\|^{-1}\langle x, \theta'\rangle\, y \le 1\big]\, d\mu_\theta(\theta')
= \varphi\Big(\sqrt{\beta}\,\big[\|x\|^{-1}\langle x, \theta\rangle\, y - 1\big]\Big)
\ge \varphi(-\sqrt{\beta})\, \mathbb{1}\big[\langle x, \theta\rangle\, y \le 0\big]
= \big[1 - \varphi(\sqrt{\beta})\big]\, \mathbb{1}\big[\langle x, \theta\rangle\, y \le 0\big],$$
so that
$$P\big[\langle X, \theta\rangle\, Y \le 0\big]
\le \big[1 - \varphi(\sqrt{\beta})\big]^{-1} \times
\mathbb{E}\,\underbrace{\int \mathbb{1}\big[\|X\|^{-1}\langle X, \theta'\rangle\, Y \le 1\big]}_{\text{error with margin}}\,
\underbrace{d\mu_\theta(\theta')}_{\text{perturbation}}.$$
This opens the possibility to use the following PAC-Bayes inequality:
$$\exp\left(\int h(\theta')\, d\mu_\theta(\theta') - K(\mu_\theta, \mu_0)\right)
\le \int \exp\big(h(\theta')\big)\, d\mu_0(\theta'),$$
according to which
$$\mathbb{E}\exp\Bigg(\sup_{\theta\in\mathbb{R}^d}\Bigg\{n\lambda \int
\bigg[\Phi_\lambda\Big(P\big[\|X\|^{-1}\langle X, \theta'\rangle\, Y \le 1\big]\Big)
- \int \mathbb{1}\big[\|x\|^{-1}\langle x, \theta'\rangle\, y \le 1\big]\, d\bar{P}(x, y)\bigg]\,
d\mu_\theta(\theta') - K(\mu_\theta, \mu_0)\Bigg\}\Bigg)$$
$$\le \mathbb{E}\int \exp\Bigg(n\lambda
\bigg[\Phi_\lambda\Big(P\big[\|X\|^{-1}\langle X, \theta'\rangle\, Y \le 1\big]\Big)
- \int \mathbb{1}\big[\|x\|^{-1}\langle x, \theta'\rangle\, y \le 1\big]\, d\bar{P}(x, y)\bigg]\Bigg)\,
d\mu_0(\theta') \le 1,$$
where $\Phi_\lambda(p) = -\lambda^{-1}\log\big(1 - p + p\exp(-\lambda)\big)$.
With probability at least $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
$$P\big[\langle X, \theta\rangle\, Y \le 0\big]
\le \big[1 - \varphi(\sqrt{\beta})\big]^{-1}
\times \inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\Bigg(
\int \varphi\Big(\sqrt{\beta}\,\big[\|x\|^{-1}\langle x, \theta\rangle\, y - 1\big]\Big)\, d\bar{P}(x, y)
+ \frac{\beta\|\theta\|^2/2 + \log(|\Lambda|/\epsilon)}{n\lambda}\Bigg)$$
$$\le \big[1 - \varphi(\sqrt{\beta})\big]^{-1}\,
B\Bigg(\int \varphi\Big(\sqrt{\beta}\,\big[\|x\|^{-1}\langle x, \theta\rangle\, y - 1\big]\Big)\, d\bar{P}(x, y),\;
\frac{\beta\|\theta\|^2}{2},\; \epsilon\Bigg),$$
where
$$B(q, e, \epsilon) = q
+ \sqrt{\frac{2q(1-q)\big[e + \log\big(\log(n)^2/\epsilon\big)\big]}{n}}\,
\cosh\big(\log(n)^{-1}\big)
+ \frac{2\big[e + \log\big(\log(n)^2/\epsilon\big)\big]}{n}\,
\cosh\big(\log(n)^{-1}\big)^2,$$
for a suitable choice of $\Lambda$.
As a consequence, for any radius $R > 0$, with probability at least
$1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
$$P\big[\langle X, \theta\rangle\, Y \le 0\big]
\le \big[1 - \varphi(\sqrt{\beta})\big]^{-1}
\times \inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\Bigg(
\int \big(2 - \|x\|_R^{-1}\langle\theta, x\rangle\, y\big)_+\, d\bar{P}(x, y)
+ \frac{\beta\|\theta\|^2}{2n\lambda}
+ \frac{\log(|\Lambda|/\epsilon)}{n\lambda}\Bigg)
+ \varphi(\sqrt{\beta}),$$
where $\|x\|_R = \max\big\{R, \|x\|\big\}$.
Assume that $P\big(\|X - c\| \le R\big) = 1$ for some $c \in \mathbb{R}^d$ and
$R \in \mathbb{R}_+$. With probability at least $1 - \epsilon$, for any
$\theta \in \mathbb{R}^d$ and any $\gamma \in \mathbb{R}$ such that
$\min_{i=1,\dots,n} \langle\theta, X_i\rangle \le \gamma \le \max_{i=1,\dots,n} \langle\theta, X_i\rangle$,
$$P\Big[\big(\langle\theta, X\rangle - \gamma\big)\, Y \le 0\Big]
\le \inf_{\beta\in\Xi}\Bigg\{\big[1 - \varphi(\sqrt{\beta})\big]^{-1}
\inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\Bigg(
\int \Big(1 - y\big(\langle\theta, x\rangle - \gamma\big)\Big)_+\, d\bar{P}(x, y)
+ \frac{4\beta R^2 \|\theta\|^2}{n\lambda}
+ \frac{\log(|\Xi||\Lambda|/\epsilon)}{n\lambda}\Bigg)
+ \varphi(\sqrt{\beta})\Bigg\}.$$
Gram matrix estimate
Let $X_i \in \mathbb{R}^d$, $1 \le i \le n$, be vector-valued i.i.d. random
variables with common distribution $P \in \mathcal{M}_+^1(\mathbb{R}^d)$.

Question: estimate the quadratic form
$$N(\theta) = \int \langle x, \theta\rangle^2\, dP(x)$$
for all $\theta \in \mathbb{R}^d$, or equivalently for all $\theta \in S_d$,
the unit sphere of $\mathbb{R}^d$.

This is a way to estimate the expected Gram matrix
$$G \stackrel{\text{def}}{=} \int x x^\top\, dP(x),$$
since $N(\theta) = \theta^\top G \theta$. We will assume that
$$\int \|x\|^2\, dP(x) = \mathrm{Tr}(G) < +\infty.$$
Let us introduce the influence function
$$\psi(z) = \begin{cases}
\log(2), & z \ge 1,\\[2pt]
-\log\big(1 - z + z^2/2\big), & 0 \le z \le 1,\\[2pt]
-\psi(-z), & z \le 0.
\end{cases}$$
[Figure: the graph of $z \mapsto \psi(z)$, compared with $z \mapsto z$,
$z \mapsto \log(1 + z + z^2/2)$, and $z \mapsto -\log(1 - z + z^2/2)$.]
It is odd ($\psi(-z) = -\psi(z)$), non-decreasing, bounded, and satisfies,
for any $z \in \mathbb{R}$,
$$-\log\big(1 - z + z^2/2\big) \le \psi(z) \le \log\big(1 + z + z^2/2\big),
\qquad -\log(2) \le \psi(z) \le \log(2).$$
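For concreteness, here is a direct NumPy transcription of $\psi$ (an illustrative sketch, not part of the slides), together with a numerical check of the displayed bounds.

```python
import numpy as np

def psi(z):
    """Influence function: log(2) for z >= 1, -log(1 - z + z^2/2) on [0, 1],
    extended to z <= 0 as an odd function, psi(-z) = -psi(z)."""
    z = np.asarray(z, dtype=float)
    a = np.minimum(np.abs(z), 1.0)               # middle-branch argument
    mid = -np.log1p(-a + a * a / 2.0)            # -log(1 - a + a^2/2)
    out = np.where(np.abs(z) >= 1.0, np.log(2.0), mid)
    return np.sign(z) * out

zs = np.linspace(-3.0, 3.0, 601)
assert np.all(psi(zs) <= np.log1p(zs + zs**2 / 2) + 1e-12)    # upper bound
assert np.all(psi(zs) >= -np.log1p(-zs + zs**2 / 2) - 1e-12)  # lower bound
assert np.all(np.abs(psi(zs)) <= np.log(2.0) + 1e-12)         # boundedness
```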
Dimension dependent bounds
Let $\bar{P} = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ and
$$r_\lambda(\theta) = \lambda^{-1} \int \psi\Big(\lambda\big[\langle\theta, x\rangle^2 - 1\big]\Big)\, d\bar{P}(x),$$
where $\lambda > 0$ will be chosen later. Let us consider the estimator
$$\widehat{N}(\theta) = \widehat{\alpha}(\theta)^{-2},
\qquad \text{where } \widehat{\alpha}(\theta) = \sup\big\{\alpha \in \mathbb{R}_+ : r_\lambda(\alpha\theta) \le 0\big\}.$$
This is justified by the fact that
$$\lim_{\lambda\to 0} \int r_\lambda(\theta)\, dP^{\otimes n} = N(\theta) - 1.$$
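A minimal numerical sketch of this estimator (assuming the `psi` implementation above, with data rows `X[i]` drawn i.i.d. from $P$): since $\alpha \mapsto r_\lambda(\alpha\theta)$ is non-decreasing and negative at $\alpha = 0$, $\widehat{\alpha}(\theta)$ can be found by bracketing and bisection.

```python
import numpy as np
from scipy.optimize import brentq

def r_lambda(theta, X, lam):
    s = X @ theta
    return np.mean(psi(lam * (s * s - 1.0))) / lam

def N_hat(theta, X, lam):
    f = lambda a: r_lambda(a * theta, X, lam)
    hi = 1.0
    while f(hi) <= 0:    # r_λ(αθ) is non-decreasing in α; bracket the root
        hi *= 2.0
    alpha = brentq(f, 0.0, hi)   # α_hat = crossing point of r_λ(αθ)
    return alpha ** -2

rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=(5000, 3))   # heavy-tailed sample
theta = np.array([1.0, 0.0, 0.0])
print(N_hat(theta, X, lam=0.05))   # close to E<θ, X>^2 = 5/3 for t(5)
```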
Proposition
Let us assume that
$$\kappa = \sup_{\theta\neq 0}\,
\frac{\displaystyle\int \langle\theta, x\rangle^4\, dP(x)}
{\displaystyle\bigg(\int \langle\theta, x\rangle^2\, dP(x)\bigg)^{\!2}} < \infty,$$
and let us put
$$\lambda = \sqrt{\frac{2}{(\kappa - 1)\,n}\Big[\log(\epsilon^{-1}) + 1.11\, d\Big]}.$$
For any $\epsilon > 0$ and any $n$ such that
$$n > 27\Bigg(\sqrt{\kappa d} + \frac{5\kappa - 4}{\sqrt{2(\kappa - 1)}}
\sqrt{\log(\epsilon^{-1}) + 1.11\, d}\Bigg)^{\!2},$$
with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
$$\Bigg|\frac{N(\theta)}{\widehat{N}(\theta)} - 1\Bigg| \le \frac{\mu}{1 - 2\mu},
\qquad \text{where } \mu = \sqrt{\frac{2(\kappa - 1)}{n}\Big[\log(\epsilon^{-1}) + 1.11\, d\Big]}
+ \sqrt{\frac{2\kappa \times 89\, d}{n}}.$$
Obtaining a true quadratic form
Let us assume that, with probability $1 - 2\epsilon$,
$$B_-(\theta) \le N(\theta) \le B_+(\theta), \qquad \theta \in \mathbb{R}^d.$$
Let $\Theta \subset \mathbb{R}^d$ be any finite set (e.g. a $\delta$-net of $S_d$). Let
$$\widehat{G} = \sum_{\theta\in\Theta} \xi(\theta)\, \theta\theta^\top,
\quad \text{where } \xi \in \arg\min_\xi\;
\frac{1}{2}\sum_{(\theta_1, \theta_2)\in\Theta^2}
\xi(\theta_1)\xi(\theta_2)\langle\theta_1, \theta_2\rangle^2
- \sum_{\theta\in\Theta}\Bigg[\xi(\theta)\, \frac{B_-(\theta) + B_+(\theta)}{2}
- |\xi(\theta)|\, \frac{B_+(\theta) - B_-(\theta)}{2}\Bigg].$$
Then, with probability at least $1 - 2\epsilon$,
$$\|\widehat{G}\|_2^2 \stackrel{\text{def}}{=} \mathrm{Tr}\big(\widehat{G}\widehat{G}^\top\big)
\le \mathrm{Tr}\big(G G^\top\big) \le \mathrm{Tr}(G)^2,$$
so that
$$\big|\theta_2^\top \widehat{G}\,\theta_2 - \theta_1^\top \widehat{G}\,\theta_1\big|
\le 2\,\mathrm{Tr}(G)\, \|\theta_2 - \theta_1\|, \qquad \theta_1, \theta_2 \in S_d,$$
and $B_-(\theta) \le \theta^\top \widehat{G}\,\theta \le B_+(\theta)$, $\theta \in \Theta$.
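The minimization over $\xi$ is a convex quadratic program. Here is an illustrative sketch (not from the slides) that solves it with SciPy via the split $\xi = \xi_+ - \xi_-$ used in the proof below, and checks the estimated quadratic form against the brackets on a toy example.

```python
import numpy as np
from scipy.optimize import minimize

def g_hat(Theta, B_minus, B_plus):
    K = (Theta @ Theta.T) ** 2          # K[i, j] = <θ_i, θ_j>^2, PSD Gram matrix
    def obj(z):
        xp, xm = np.split(z, 2)         # ξ = ξ+ - ξ-, ξ± >= 0 (smooth QP form)
        xi = xp - xm
        # Equals (1/2)Σξξ'<θ,θ'>² - Σ[ξ(B-+B+)/2 - |ξ|(B+-B-)/2] on this split.
        return 0.5 * xi @ K @ xi - xp @ B_minus + xm @ B_plus
    m = len(Theta)
    res = minimize(obj, np.zeros(2 * m), bounds=[(0, None)] * (2 * m))
    xp, xm = np.split(res.x, 2)
    xi = xp - xm
    return np.einsum('i,ij,ik->jk', xi, Theta, Theta)   # Σ ξ(θ) θ θ^T

# Toy usage: Θ a few unit vectors, [B-, B+] brackets around the true θ^T G θ.
rng = np.random.default_rng(2)
Theta = rng.normal(size=(12, 3))
Theta /= np.linalg.norm(Theta, axis=1, keepdims=True)
G = np.diag([3.0, 1.0, 0.5])
q = np.einsum('ij,jk,ik->i', Theta, G, Theta)
Ghat = g_hat(Theta, 0.9 * q, 1.1 * q)
print(np.einsum('ij,jk,ik->i', Theta, Ghat, Theta) / q)  # ratios near [0.9, 1.1]
```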
Proof
$\widehat{G}$ minimizes
$$\sup_{\xi_+\ge 0,\; \xi_-\ge 0}\; \frac{1}{2}\|\widehat{G}\|_2^2
+ \sum_{\theta\in\Theta}\Big[\xi_+(\theta)\big(B_-(\theta) - \theta^\top\widehat{G}\,\theta\big)
+ \xi_-(\theta)\big(\theta^\top\widehat{G}\,\theta - B_+(\theta)\big)\Big],$$
and there is no duality gap, so you can minimize in $\widehat{G}$ first; then,
since $\widehat{\xi}_+(\theta)\,\widehat{\xi}_-(\theta) = 0$ at the optimum, you can
assume that $\xi_-(\theta)\xi_+(\theta) = 0$ and introduce
$\xi(\theta) = \xi_+(\theta) - \xi_-(\theta)$, so that
$|\xi(\theta)| = \xi_+(\theta) + \xi_-(\theta)$.
Least square regression estimator
Let $(X, Y)$ be a random vector, where $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$.
Let $(X_i, Y_i)$, $1 \le i \le n$, be independent copies of $(X, Y)$ and
$$R(\theta) = \mathbb{E}\Big[\big(\langle\theta, X\rangle - Y\big)^2\Big].$$
Introduce the quadratic form
$N(\theta, \gamma) = \mathbb{E}\Big[\big(\langle\theta, X\rangle - \gamma Y\big)^2\Big]$.
Assume that $\widehat{N}(\theta, \gamma)$ is a quadratic estimator of $N(\theta, \gamma)$
and $\eta > 0$ a small real parameter such that, with probability at least $1 - 2\epsilon$,
$$\Bigg|\frac{N(\theta, \gamma)}{\widehat{N}(\theta, \gamma)} - 1\Bigg| \le \eta.$$
Let $\theta_* \in \arg\min R$ and
$\widehat{\theta} \in \arg\min_{\theta\in\mathbb{R}^d} \widehat{N}(\theta, 1)$.
With probability at least $1 - 2\epsilon$,
$$(1 - \eta)\,\widehat{N}(\widehat{\theta} - \theta_*, 0)
\le N(\widehat{\theta} - \theta_*, 0) = R(\widehat{\theta}) - R(\theta_*)
\le \frac{\eta^2}{1 - \eta}\,\widehat{N}(\widehat{\theta}, 1)
\le \frac{\eta^2}{1 - \eta}\,\widehat{N}(\theta_*, 1)
\le \frac{\eta^2}{(1 - \eta)^2}\, R(\theta_*).$$
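Here is a toy numerical check of this chain of inequalities (illustrative, not from the slides): we build an exact second-moment matrix $S$ of $(X, Y)$, a quadratic $\widehat{N}$ within multiplicative distance $\eta/(1-\eta)$ of $N$ by construction, and compare the excess risk of $\widehat{\theta}$ with the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
d, eta = 4, 0.05

# Second-moment matrix S of (X, Y): N(θ, γ) = v^T S v with v = (θ, -γ).
A = rng.normal(size=(d + 1, d + 1))
S = A @ A.T + np.eye(d + 1)

# Multiplicative perturbation: N_hat(v) = w^T (I + D) w with w = L^T v,
# so N_hat/N lies in [1 - eta, 1 + eta], hence |N/N_hat - 1| <= eta/(1-eta).
L = np.linalg.cholesky(S)
D = rng.normal(size=(d + 1, d + 1)); D = (D + D.T) / 2
D *= eta / np.abs(np.linalg.eigvalsh(D)).max()
S_hat = L @ (np.eye(d + 1) + D) @ L.T
eta_p = eta / (1 - eta)          # the η appearing in the proposition

theta_star = np.linalg.solve(S[:d, :d], S[:d, d])          # arg min R
theta_hat = np.linalg.solve(S_hat[:d, :d], S_hat[:d, d])   # arg min N_hat(., 1)

def R(th):
    v = np.append(th, -1.0)      # R(θ) = N(θ, 1) = v^T S v
    return v @ S @ v

excess = R(theta_hat) - R(theta_star)
bound = (eta_p / (1 - eta_p)) ** 2 * R(theta_star)
print(excess, bound)             # excess <= bound
```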
Proof: Let us put $\widehat{R}(\theta) = \widehat{N}(\theta, 1)$. Since
$\widehat{\theta}$ minimizes the quadratic form $\widehat{R}$, the corresponding
bilinear form vanishes at $\widehat{\theta}$, so
$$0 = \sup\Bigg\{\frac{1}{4}\Big[\widehat{R}(\widehat{\theta} + \theta')
- \widehat{R}(\widehat{\theta} - \theta')\Big]
: \theta' \text{ s.t. } \widehat{N}(\theta', 0) \le \widehat{R}(\widehat{\theta})\Bigg\}$$
$$\ge \sup\Bigg\{\frac{1}{4}\Big[R(\widehat{\theta} + \theta') - R(\widehat{\theta} - \theta')\Big]
- \frac{\eta}{4}\Big[\widehat{R}(\widehat{\theta} + \theta') + \widehat{R}(\widehat{\theta} - \theta')\Big]
: \theta' \text{ s.t. } \widehat{N}(\theta', 0) \le \widehat{R}(\widehat{\theta})\Bigg\}$$
$$= \sup\Bigg\{\mathbb{E}\Big[\big(\langle\widehat{\theta}, X\rangle - Y\big)\langle\theta', X\rangle\Big]
- \frac{\eta}{2}\Big[\widehat{R}(\widehat{\theta}) + \widehat{N}(\theta', 0)\Big]
: \theta' \text{ s.t. } \widehat{N}(\theta', 0) \le \widehat{R}(\widehat{\theta})\Bigg\}$$
$$\ge \sup\Bigg\{\mathbb{E}\Big[\langle\widehat{\theta} - \theta_*, X\rangle\langle\theta', X\rangle\Big]
- \eta\,\widehat{R}(\widehat{\theta})
: \theta' \text{ s.t. } N(\theta', 0) \le (1 - \eta)\,\widehat{R}(\widehat{\theta})\Bigg\}.$$
Taking $\theta' = c(\widehat{\theta} - \theta_*)$ and using
$N(\widehat{\theta} - \theta_*, 0) = R(\widehat{\theta}) - R(\theta_*)$, we get
$$0 \ge \sqrt{(1 - \eta)\,\widehat{R}(\widehat{\theta})\,\big[R(\widehat{\theta}) - R(\theta_*)\big]}
- \eta\,\widehat{R}(\widehat{\theta}).$$
Dimension free bounds (Joint work with Ilaria Giulini)
With probability $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$, and some
estimator $\widehat{N}$ to be described in the proof,
$$\mathbb{1}\big[4\mu < 1\big]\,
\Bigg|\frac{\widehat{N}(\theta)}{N(\theta)} - 1\Bigg| \le \frac{\mu}{1 - 4\mu},$$
where, for $n \le 10^{20}$,
$$\mu = \sqrt{\frac{2.07\,(\kappa - 1)}{n}
\Bigg[\log(\epsilon^{-1}) + 4.3 + \frac{1.6\,\|\theta\|^2\,\mathrm{Tr}(G)}{N(\theta)}\Bigg]}
+ \sqrt{\frac{2\kappa \times 92}{n} \times \frac{\mathrm{Tr}(G)}{N(\theta)}}.$$
Let us recall that $\mathrm{Tr}(G) = \int \|x\|^2\, dP(x) = \sum_{i=1}^d N(\theta_i)$
for any orthonormal basis $(\theta_i,\, 1 \le i \le d)$.
Proof
Let $r_\lambda(\theta) = \int \psi\big(\langle\theta, x\rangle^2 - \lambda\big)\, d\bar{P}(x)$.
Consider the Gaussian parameter perturbations
$\pi_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$, where $I_d$ is the identity
matrix of size $d \times d$. Let
$$\widehat{\alpha}(\theta) = \sup\big\{\alpha \in \mathbb{R}_+ : r_\lambda(\alpha\theta) \le 0\big\}.$$
Proposition
Let $c = \dfrac{15}{\log(4)} \le 11$. Then
$$\psi\big(\langle\theta, x\rangle^2 - \lambda\big) \le \int \Bigg[
\log\Bigg(1 + \langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta}
+ \frac{1}{2}\bigg(\langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta}\bigg)^{\!2}\Bigg)
+ \frac{2c\|x\|^2}{\beta}\bigg(4\langle\theta', x\rangle^2 + \frac{5\|x\|^2}{\beta}\bigg)
\Bigg]\, d\pi_\theta(\theta').$$
Indeed,
$$\psi\bigg(\int h\, d\rho\bigg) \le \int \psi(h)\, d\rho
+ \min\Big\{\log(4),\; \mathrm{Var}_\rho(h)\Big\},$$
because $y \mapsto \psi(y) + \big(y - \int h\, d\rho\big)^2$ is convex, since
$\psi''(y) \ge -2$. As moreover
$$\langle\theta, x\rangle^2 - \lambda = \int \langle\theta', x\rangle^2\, d\pi_\theta(\theta')
- \lambda - \frac{\|x\|^2}{\beta},$$
we get
$$\psi\big(\langle\theta, x\rangle^2 - \lambda\big)
\le \int \psi\bigg(\langle\theta', x\rangle^2 - \frac{\|x\|^2}{\beta} - \lambda\bigg)\, d\pi_\theta(\theta')
+ \min\Bigg\{\log(4),\; \frac{4\|x\|^2\langle\theta, x\rangle^2}{\beta}
+ \frac{2\|x\|^4}{\beta^2}\Bigg\}.$$
Lemma
If $W \sim \mathcal{N}(0, \sigma^2)$, then
$$\min\big\{a,\; b m^2 + c\big\}
\le \mathbb{E}\Big[\min\big\{2a,\; 2b(m + W)^2 + 2b\sigma^2 + c\big\}\Big],
\qquad a, b, c \in \mathbb{R}_+,\; m \in \mathbb{R}.$$
The proof of this lemma is based on the inequalities
$m^2 \le 2(m + W)^2 + 2W^2$ and
$$\min\{a, y + z\} \le \min\{a, y\} + \min\{a, z\} \le \min\{2a, y + z\},
\qquad a, y, z \in \mathbb{R}_+.$$
Accordingly,
$$\psi\big(\langle\theta, x\rangle^2 - \lambda\big)
\le \int \Bigg[\psi\bigg(\langle\theta', x\rangle^2 - \frac{\|x\|^2}{\beta} - \lambda\bigg)
+ \min\Bigg\{4\log(2),\; \frac{8\|x\|^2\langle\theta', x\rangle^2}{\beta}
+ \frac{10\|x\|^4}{\beta^2}\Bigg\}\Bigg]\, d\pi_\theta(\theta').$$
We will now use
Lemma
For any $a, b, y \in \mathbb{R}_+$ and $c = \dfrac{a\,\big(\exp(b) - 1\big)}{b}$,
$$\log(a) + \min\{b, y\} \le \log(a + cy).$$
Applying this lemma to $a \le 2$, $b = 4\log(2)$, and the corresponding
$c = \dfrac{15}{\log(4)}$ ends the proof of the previous proposition.
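A short numerical confirmation (illustrative, not from the slides) of the lemma and of the value of $c$ used above:

```python
import numpy as np

a, b = 2.0, 4 * np.log(2.0)
c = a * (np.exp(b) - 1.0) / b   # = 30/log(16) = 15/log(4) ≈ 10.82 <= 11
y = np.linspace(0.0, 20.0, 2001)
assert np.all(np.log(a) + np.minimum(b, y) <= np.log(a + c * y) + 1e-12)
```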
PAC-Bayes bound
Proposition
For any measure $\nu \in \mathcal{M}_+^1(\Theta)$, real number $a > -1$, and any
measurable function $f : \mathcal{X} \times \Theta \to [a, +\infty[$, with
probability at least $1 - \epsilon$, for any $\rho \in \mathcal{M}_+^1(\Theta)$
such that $K(\rho, \nu) < \infty$,
$$\int\!\!\int \log\big(1 + f(x, \theta)\big)\, d\rho(\theta)\, d\bar{P}(x)
\le \int\!\!\int f(x, \theta)\, d\rho(\theta)\, dP(x)
+ \frac{K(\rho, \nu) + \log(\epsilon^{-1})}{n}.$$
The proof is based on the fact that
$$\int h\, d\rho - K(\rho, \nu) \le \log\bigg(\int \exp(h)\, d\nu\bigg),$$
for any upper-bounded measurable function $h : \Theta \to \mathbb{R}$.
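This underlying duality inequality is easy to check numerically on a finite $\Theta$ (an illustrative sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50
nu = rng.dirichlet(np.ones(m))      # prior ν on a finite Θ
rho = rng.dirichlet(np.ones(m))     # an arbitrary posterior ρ
h = rng.normal(size=m)              # any bounded function h on Θ

kl = np.sum(rho * np.log(rho / nu))                    # K(ρ, ν)
assert rho @ h - kl <= np.log(nu @ np.exp(h)) + 1e-12  # ∫h dρ - K ≤ log ∫eʰ dν
```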
We get, with probability $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
$$\int \psi\big(\langle\theta, x\rangle^2 - \lambda\big)\, d\bar{P}(x)
\le \int \Bigg[\langle\theta, x\rangle^2 - \lambda
+ \frac{1}{2}\big(\langle\theta, x\rangle^2 - \lambda\big)^2
+ \frac{2}{\beta}\langle\theta, x\rangle^2 \|x\|^2 + \frac{2}{\beta^2}\|x\|^4
+ \frac{2c\|x\|^2}{\beta}\bigg(4\langle\theta, x\rangle^2 + \frac{9\|x\|^2}{\beta}\bigg)
\Bigg]\, dP(x)
+ \frac{\beta\|\theta\|^2}{2n} + \frac{\log(\epsilon^{-1})}{n}.$$
Using the Cauchy-Schwarz inequality makes the following quantities appear:
$$s_4 = \bigg(\int \|x\|^4\, dP(x)\bigg)^{1/4},
\qquad \kappa = \sup\bigg\{\int \langle\theta, x\rangle^4\, dP(x) :
\theta \in \mathbb{R}^d \text{ s.t. } \int \langle\theta, x\rangle^2\, dP(x) = 1\bigg\},$$
$$\xi = \frac{\kappa\lambda}{2},
\qquad \gamma = \lambda(\kappa - 1) + \frac{2}{\beta}(1 + 4c)\, s_4^2 \sqrt{\kappa},$$
$$\eta = \frac{\lambda}{2}(\kappa - 1) + \frac{2}{\beta}(1 + 4c)\, s_4^2 \sqrt{\kappa}
+ \frac{(1 + 18c)\, s_4^4}{\beta^2 \lambda}
- \frac{\log\big(\epsilon\, \nu(\lambda, \beta)\big)}{n\lambda},
\qquad \delta = \frac{\beta}{2n\lambda},$$
where $\nu(\lambda, \beta)$ is the prior weight attached to the pair
$(\lambda, \beta)$, allowing a union bound over a grid of values.
Proposition
With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$ and
any $\alpha \in \mathbb{R}_+$,
$$\frac{r_\lambda(\alpha\theta)}{\lambda}
\le \xi\bigg(\frac{N(\theta)}{\lambda}\alpha^2 - 1\bigg)^{\!2}
+ (1 + \gamma)\bigg(\frac{N(\theta)}{\lambda}\alpha^2 - 1\bigg)
+ \eta + \delta\|\theta\|^2\alpha^2,$$
$$\frac{r_\lambda(\alpha\theta)}{\lambda}
\ge -\xi\bigg(\frac{N(\theta)}{\lambda}\alpha^2 - 1\bigg)^{\!2}
+ \bigg(1 - \gamma - \frac{\lambda\|\theta\|^2\delta}{N(\theta)}\bigg)
\bigg(\frac{N(\theta)}{\lambda}\alpha^2 - 1\bigg)
- \eta - \frac{\lambda\|\theta\|^2\delta}{N(\theta)}.$$
Let
$$\Phi_-(z) = z\Bigg[1 - \bigg(\eta + \frac{\delta\lambda}{z}\bigg)
\bigg(1 + \gamma - \eta - \frac{\delta\lambda}{z}\bigg)^{\!-1}\Bigg]\,
\mathbb{1}\bigg[\xi\bigg(-\gamma + \eta + \frac{\delta\lambda}{z}\bigg) < 1\bigg],$$
$$\Phi_+(z) = z\Bigg[1 + \bigg(\eta + \frac{\delta\lambda}{z}\bigg)
\bigg(1 - \gamma - \eta - \frac{2\delta\lambda}{z}\bigg)^{\!-1}\Bigg]\,
\mathbb{1}\bigg[\xi\bigg(\gamma + \eta + \frac{2\delta\lambda}{z}\bigg) < 1\bigg].$$
Replacing $N(\theta)$ with $\Phi_-\big(\lambda/\widehat{\alpha}^2\big)$ in the first
inequality of the previous proposition, and $\lambda/\widehat{\alpha}^2$ with
$\Phi_+\big(N(\theta)\big)$ in the second inequality, we can prove that
$$\Phi_-\bigg(\frac{\lambda}{\widehat{\alpha}^2}\bigg) \le N(\theta),
\qquad \Phi_+\big(N(\theta)\big) \le \frac{\lambda}{\widehat{\alpha}^2}.$$
Proposition
With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
$$B_- = \sup_{\lambda, \beta}\, \Phi_-\bigg(\frac{\lambda}{\widehat{\alpha}^2}\bigg)
\le N(\theta)
\le \inf_{\lambda, \beta}\, \Phi_+^{-1}\bigg(\frac{\lambda}{\widehat{\alpha}^2}\bigg) = B_+.$$
Let us put $\widehat{N}(\theta) = \dfrac{B_- + B_+}{2}$. We get, with probability
at least $1 - 2\epsilon$, for any $\lambda, \beta \in \mathbb{R}_+$,
$$N(\theta) - \widehat{N}(\theta) \le \frac{N(\theta) - B_-}{2}
\le \frac{N(\theta) - \Phi_-\big(\lambda/\widehat{\alpha}^2\big)}{2}
\le \frac{N(\theta) - \Phi_- \circ \Phi_+\big(N(\theta)\big)}{2},$$
$$\widehat{N}(\theta) - N(\theta)
\le \frac{\Phi_+^{-1}\big(\lambda/\widehat{\alpha}^2\big) - N(\theta)}{2}
\le \frac{\Phi_+^{-1} \circ \Phi_-^{-1}\big(N(\theta)\big) - N(\theta)}{2}.$$
Putting
$$r(z) = \bigg(\eta + \frac{\delta\lambda}{z}\bigg)
\bigg(1 - \gamma - \eta - \frac{2\delta\lambda}{z}\bigg)^{\!-1},
\qquad c(z) = \eta + \frac{\delta\lambda}{z},$$
we can prove that
$$\Phi_-(z) \ge z\big(1 - r(z)\big)\, \mathbb{1}\big[4c(z) < 1\big],
\qquad \Phi_+(z) \ge z\big(1 + r(z)\big),$$
$$\Phi_-^{-1}(z)\, \mathbb{1}\big[4c(z) < 1\big]
\le z\big(1 - r(z)\big)^{-1}\, \mathbb{1}\big[4c(z) < 1\big],
\qquad \Phi_+^{-1}(z)\, \mathbb{1}\big[4c(z) < 1\big]
\le z\big(1 + r(z)\big)^{-1}.$$
From this we can deduce the following.
Proposition
With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
$$\mathbb{1}\Big[4c\big(N(\theta)\big) < 1\Big]\,
\big|N(\theta) - \widehat{N}(\theta)\big|
\le \frac{N(\theta)\, c\big(N(\theta)\big)}{1 - 4c\big(N(\theta)\big)}.$$
The announced result is then obtained by optimizing the value of
$c\big(N(\theta)\big)$ through appropriate choices of $\lambda$ and $\beta$.