PAC-Bayes bounds using Gaussian posterior distributions

Olivier Catoni
CNRS, INRIA (projet CLASSIC)
Département de Mathématiques et Applications, ENS, 45 rue d'Ulm, 75230 Paris Cedex 05
[email protected]

Computer Vision and Machine Learning Group, Institute for Science and Technology Austria, Klosterneuburg, July 26, 2013

Supervised classification

We observe independent copies $(X_i, Y_i)$, $1 \le i \le n$, of a pair of random variables $(X, Y)$, where $X \in \mathbb{R}^d$ and $Y \in \{-1, +1\}$. We would like to relate the empirical error rate of these labeled data,
\[
\int \mathbf{1}\bigl[\langle x, \theta\rangle y \le 0\bigr]\,\mathrm{d}\bar{P}(x, y),
\qquad \text{where } \bar{P} = \frac{1}{n}\sum_{i=1}^{n} \delta_{(X_i, Y_i)} \text{ is the empirical measure},
\]
with its expectation
\[
\int \mathbf{1}\bigl[\langle x, \theta\rangle y \le 0\bigr]\,\mathrm{d}P(x, y),
\]
where $P$ is the joint distribution of $(X, Y)$.

PAC-Bayes margin error bounds after Langford, Shawe-Taylor, and McAllester

Let us introduce the Gaussian perturbation of the parameter $\theta$,
\[
\frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta'}(\theta') = \Bigl(\frac{\beta}{2\pi}\Bigr)^{d/2} \exp\Bigl(-\frac{\beta \|\theta' - \theta\|^2}{2}\Bigr), \qquad \theta \in \mathbb{R}^d.
\]
It is such that
\[
K(\mu_\theta, \mu_0) = \int \log\bigl(\mathrm{d}\mu_\theta/\mathrm{d}\mu_0\bigr)\,\mathrm{d}\mu_\theta = \frac{\beta\|\theta\|^2}{2}.
\]
The perturbation of the loss function can be computed explicitly:
\[
\int \mathbf{1}\bigl[\langle x, \theta'\rangle y \le 0\bigr]\,\mathrm{d}\mu_\theta(\theta') = \varphi\biggl(\frac{\sqrt{\beta}\,\langle x, \theta\rangle y}{\|x\|}\biggr),
\qquad \text{where } \varphi(a) = \int_a^{+\infty} (2\pi)^{-1/2}\exp(-t^2/2)\,\mathrm{d}t.
\]
More interestingly, the error function itself is upper bounded by the perturbation of the error function with margin:
\[
\int \mathbf{1}\bigl[\|x\|^{-1}\langle x, \theta'\rangle y \le 1\bigr]\,\mathrm{d}\mu_\theta(\theta')
= \varphi\Bigl(\sqrt{\beta}\bigl(\|x\|^{-1}\langle x, \theta\rangle y - 1\bigr)\Bigr)
\ge \varphi(-\sqrt{\beta})\,\mathbf{1}\bigl[\langle x, \theta\rangle y \le 0\bigr]
= \bigl[1 - \varphi(\sqrt{\beta})\bigr]\,\mathbf{1}\bigl[\langle x, \theta\rangle y \le 0\bigr],
\]
so that
\[
P\bigl[\langle X, \theta\rangle Y \le 0\bigr] \le \bigl[1 - \varphi(\sqrt{\beta})\bigr]^{-1} \times \mathbb{E}\underbrace{\int \underbrace{\mathbf{1}\bigl[\|X\|^{-1}\langle X, \theta'\rangle Y \le 1\bigr]}_{\text{error with margin}}\,\mathrm{d}\mu_\theta(\theta')}_{\text{perturbation}}.
\]

This opens the possibility to use the following PAC-Bayes inequality:
\[
\exp\biggl(\int h(\theta')\,\mathrm{d}\mu_\theta(\theta') - K(\mu_\theta, \mu_0)\biggr) \le \int \exp\bigl(h(\theta')\bigr)\,\mathrm{d}\mu_0(\theta'),
\]
according to which
\[
\mathbb{E}\exp\biggl(\sup_{\theta\in\mathbb{R}^d} n\lambda\biggl\{\int \Phi_\lambda\Bigl(P\bigl[\|X\|^{-1}\langle X,\theta'\rangle Y\le 1\bigr]\Bigr) - \int \mathbf{1}\bigl[\|x\|^{-1}\langle x,\theta'\rangle y\le 1\bigr]\,\mathrm{d}\bar{P}(x,y)\,\mathrm{d}\mu_\theta(\theta')\biggr\} - K(\mu_\theta,\mu_0)\biggr)
\]
\[
\le \mathbb{E}\int \exp\biggl(n\lambda\biggl\{\Phi_\lambda\Bigl(P\bigl[\|X\|^{-1}\langle X,\theta'\rangle Y\le 1\bigr]\Bigr) - \int \mathbf{1}\bigl[\|x\|^{-1}\langle x,\theta'\rangle y\le 1\bigr]\,\mathrm{d}\bar{P}(x,y)\biggr\}\biggr)\,\mathrm{d}\mu_0(\theta') \le 1,
\]
where $\Phi_\lambda(p) = -\lambda^{-1}\log\bigl(1 - p + p\exp(-\lambda)\bigr)$.

With probability at least $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
P\bigl[\langle X, \theta\rangle Y \le 0\bigr] \le \bigl[1 - \varphi(\sqrt{\beta})\bigr]^{-1} \inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\biggl(\int \varphi\Bigl(\sqrt{\beta}\bigl(\|x\|^{-1}\langle x,\theta\rangle y - 1\bigr)\Bigr)\,\mathrm{d}\bar{P}(x,y) + \frac{\beta\|\theta\|^2/2 + \log(|\Lambda|/\epsilon)}{n\lambda}\biggr)
\]
\[
\le \bigl[1 - \varphi(\sqrt{\beta})\bigr]^{-1}\, B\biggl(\int \varphi\Bigl(\sqrt{\beta}\bigl(\|x\|^{-1}\langle x,\theta\rangle y - 1\bigr)\Bigr)\,\mathrm{d}\bar{P}(x,y),\ \frac{\beta\|\theta\|^2}{2},\ \epsilon\biggr),
\]
where
\[
B(q, e, \epsilon) = q + \sqrt{\frac{2q(1-q)\bigl[e + \log(\log(n)^2/\epsilon)\bigr]}{n}\cosh\bigl(\log(n)^{-1}\bigr)} + \frac{2\bigl[e + \log(\log(n)^2/\epsilon)\bigr]}{n}\cosh\bigl(\log(n)^{-1}\bigr),
\]
for a suitable choice of $\Lambda$.

As a consequence, for any radius $R > 0$, with probability at least $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
P\bigl[\langle X, \theta\rangle Y \le 0\bigr] \le \bigl[1 - \varphi(\sqrt{\beta})\bigr]^{-1} \inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\biggl(\int \bigl(2 - \|x\|_R^{-1}\langle\theta, x\rangle y\bigr)_+\,\mathrm{d}\bar{P}(x,y) + \frac{\beta\|\theta\|^2}{2n\lambda} + \frac{\log(|\Lambda|/\epsilon)}{n\lambda} + \varphi(\sqrt{\beta})\biggr),
\]
where $\|x\|_R = \max\{R, \|x\|\}$.

Assume that $P\bigl[\|X - c\| \le R\bigr] = 1$ for some $c \in \mathbb{R}^d$ and $R \in \mathbb{R}_+$. With probability at least $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$ and any $\gamma \in \mathbb{R}$ such that $\min_{i=1,\dots,n}\langle\theta, X_i\rangle \le \gamma \le \max_{i=1,\dots,n}\langle\theta, X_i\rangle$,
\[
P\bigl[(\langle\theta, X\rangle - \gamma)Y \le 0\bigr] \le \inf_{\beta\in\Xi}\bigl[1 - \varphi(\sqrt{\beta})\bigr]^{-1} \inf_{\lambda\in\Lambda} \Phi_\lambda^{-1}\biggl(\int \bigl(1 - y(\langle\theta, x\rangle - \gamma)\bigr)_+\,\mathrm{d}\bar{P}(x,y) + \frac{4\beta R^2\|\theta\|^2}{n\lambda} + \frac{\log(|\Xi||\Lambda|/\epsilon)}{n\lambda} + \varphi(\sqrt{\beta})\biggr).
\]
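A minimal numerical sketch of the first margin bound may help fix ideas. This is an illustration, not code from the slides: the function names, the grid `Lambda`, and the defaults for `beta` and `eps` are arbitrary choices, and $\varphi$ and $\Phi_\lambda$ are implemented from the formulas above, noting that $\Phi_\lambda$ inverts in closed form as $\Phi_\lambda^{-1}(q) = (1 - e^{-\lambda q})/(1 - e^{-\lambda})$.

```python
import numpy as np
from math import erfc, exp, log, sqrt

def phi(a):
    # Gaussian upper tail: phi(a) = P(Z > a) for Z ~ N(0, 1).
    return 0.5 * erfc(a / sqrt(2.0))

def Phi(p, lam):
    # Phi_lambda(p) = -log(1 - p + p e^{-lambda}) / lambda.
    return -log(1.0 - p + p * exp(-lam)) / lam

def Phi_inv(q, lam):
    # Exact inverse of Phi_lambda: p = (1 - e^{-lambda q}) / (1 - e^{-lambda}).
    # Clamp q at 1, since Phi_lambda maps [0, 1] onto [0, 1].
    return (1.0 - exp(-lam * min(q, 1.0))) / (1.0 - exp(-lam))

def margin_bound(X, y, theta, beta=1.0, Lambda=(0.05, 0.2, 0.8), eps=0.05):
    """Evaluate the first margin bound: with probability >= 1 - eps,
    P(<X, theta> Y <= 0) is at most the value returned below."""
    n = len(y)
    margins = y * (X @ theta) / np.linalg.norm(X, axis=1)
    # Closed-form Gaussian perturbation of the margin error.
    perturbed = np.mean([phi(sqrt(beta) * (m - 1.0)) for m in margins])
    complexity = beta * theta @ theta / 2.0 + log(len(Lambda) / eps)
    return min(Phi_inv(perturbed + complexity / (n * lam), lam)
               for lam in Lambda) / (1.0 - phi(sqrt(beta)))
```

Since the bound holds uniformly over $\theta$, it can be evaluated on the output of any learning algorithm; the infimum over the finite grid $\Lambda$ is what the $\log|\Lambda|$ term in the complexity pays for.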
Gram matrix estimate

Let $X_i \in \mathbb{R}^d$, $1 \le i \le n$, be vector-valued i.i.d. random variables with common distribution $P \in \mathcal{M}_+^1(\mathbb{R}^d)$.

Question: estimate the quadratic form
\[
N(\theta) = \int \langle x, \theta\rangle^2\,\mathrm{d}P(x)
\]
for all $\theta \in \mathbb{R}^d$, or equivalently for all $\theta \in \mathbb{S}_d$, the unit sphere of $\mathbb{R}^d$. This is a way to estimate the expected Gram matrix
\[
G \stackrel{\mathrm{def}}{=} \int x x^\top\,\mathrm{d}P(x), \qquad \text{since } N(\theta) = \theta^\top G\theta.
\]
We will assume that $\int \|x\|^2\,\mathrm{d}P(x) = \operatorname{Tr}(G) < +\infty$.

Let us introduce the influence function
\[
\psi(z) =
\begin{cases}
\log(2), & z \ge 1, \\
-\log\bigl(1 - z + z^2/2\bigr), & 0 \le z \le 1, \\
-\psi(-z), & z \le 0.
\end{cases}
\]

[Figure: $z \mapsto \psi(z)$, compared with $z \mapsto z$, $z \mapsto \log(1 + z + z^2/2)$, and $z \mapsto -\log(1 - z + z^2/2)$.]

It is symmetric, non-decreasing and bounded, and satisfies, for any $z \in \mathbb{R}$,
\[
-\log\bigl(1 - z + z^2/2\bigr) \le \psi(z) \le \log\bigl(1 + z + z^2/2\bigr), \qquad -\log(2) \le \psi(z) \le \log(2).
\]

Dimension dependent bounds

Let $\bar{P} = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$ and
\[
r_\lambda(\theta) = \lambda^{-1}\int \psi\Bigl(\lambda\bigl[\langle\theta, x\rangle^2 - 1\bigr]\Bigr)\,\mathrm{d}\bar{P}(x),
\]
where $\lambda > 0$ will be chosen later. Let us consider the estimator
\[
\hat{N}(\theta) = \hat{\alpha}(\theta)^{-2}, \qquad \text{where } \hat{\alpha}(\theta) = \sup\bigl\{\alpha \in \mathbb{R}_+ : r_\lambda(\alpha\theta) \le 0\bigr\}.
\]
This is justified by the fact that
\[
\lim_{\lambda\to 0} \int r_\lambda(\theta)\,\mathrm{d}P^{\otimes n} = N(\theta) - 1.
\]

Proposition

Let us assume that
\[
\kappa = \sup_{\theta\ne 0} \frac{\displaystyle\int \langle\theta, x\rangle^4\,\mathrm{d}P(x)}{\biggl(\displaystyle\int \langle\theta, x\rangle^2\,\mathrm{d}P(x)\biggr)^{2}} < \infty,
\qquad \text{and let us put } \lambda = \sqrt{\frac{2\bigl[\log(\epsilon^{-1}) + 1.11\,d\bigr]}{(\kappa - 1)n}}.
\]
For any $\epsilon > 0$ and any $n$ such that
\[
n > 27\biggl(\sqrt{\kappa d} + \frac{5\kappa - 4}{\sqrt{2(\kappa - 1)}}\sqrt{\log(\epsilon^{-1}) + 1.11\,d}\biggr)^{2},
\]
with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
\biggl|\frac{N(\theta)}{\hat{N}(\theta)} - 1\biggr| \le \frac{\mu}{1 - 2\mu},
\qquad \text{where } \mu = \sqrt{\frac{2(\kappa - 1)\bigl[\log(\epsilon^{-1}) + 1.11\,d\bigr]}{n}} + \sqrt{\frac{2\kappa \times 89\,d}{n}}.
\]

Obtaining a true quadratic form

Let us assume that, with probability $1 - 2\epsilon$,
\[
B_-(\theta) \le N(\theta) \le B_+(\theta), \qquad \theta \in \mathbb{R}^d.
\]
Let $\Theta \subset \mathbb{R}^d$ be any finite set (e.g. a $\delta$-net of $\mathbb{S}_d$). Let
\[
\hat{G} = \sum_{\theta\in\Theta} \xi(\theta)\,\theta\theta^\top,
\]
where
\[
\xi \in \arg\min_{\xi}\ \frac{1}{2}\sum_{(\theta_1,\theta_2)\in\Theta^2} \xi(\theta_1)\xi(\theta_2)\langle\theta_1,\theta_2\rangle^2 - \sum_{\theta\in\Theta} \xi(\theta)\,\frac{B_-(\theta) + B_+(\theta)}{2} + \sum_{\theta\in\Theta} |\xi(\theta)|\,\frac{B_+(\theta) - B_-(\theta)}{2}.
\]
Then, with probability at least $1 - 2\epsilon$,
\[
\|\hat{G}\|_2^2 \stackrel{\mathrm{def}}{=} \operatorname{Tr}(\hat{G}\hat{G}^\top) \le \operatorname{Tr}(GG^\top) \le \operatorname{Tr}(G)^2,
\]
so that
\[
\bigl|\theta_2^\top\hat{G}\theta_2 - \theta_1^\top\hat{G}\theta_1\bigr| \le 2\operatorname{Tr}(G)\,\|\theta_2 - \theta_1\|, \qquad \theta_1, \theta_2 \in \mathbb{S}_d,
\]
and
\[
B_-(\theta) \le \theta^\top\hat{G}\theta \le B_+(\theta), \qquad \theta \in \Theta.
\]

Proof

$\hat{G}$ minimizes
\[
\sup_{\xi_+ \ge 0,\ \xi_- \ge 0}\ \frac{1}{2}\|\hat{G}\|_2^2 + \sum_{\theta\in\Theta} \xi_+(\theta)\bigl(B_-(\theta) - \theta^\top\hat{G}\theta\bigr) + \xi_-(\theta)\bigl(\theta^\top\hat{G}\theta - B_+(\theta)\bigr),
\]
and there is no duality gap, so you can minimize in $\hat{G}$ first; then, as one can assume that $\xi_+(\theta)\xi_-(\theta) = 0$, you can introduce $\xi(\theta) = \xi_+(\theta) - \xi_-(\theta)$, so that $|\xi(\theta)| = \xi_+(\theta) + \xi_-(\theta)$.

Least squares regression estimator

Let $(X, Y)$ be a random vector, where $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$. Let $(X_i, Y_i)$, $1 \le i \le n$, be independent copies of $(X, Y)$ and
\[
R(\theta) = \mathbb{E}\Bigl[\bigl(\langle\theta, X\rangle - Y\bigr)^2\Bigr].
\]
Introduce the quadratic form
\[
N(\theta, \gamma) = \mathbb{E}\Bigl[\bigl(\langle\theta, X\rangle - \gamma Y\bigr)^2\Bigr].
\]
Assume that $\hat{N}(\theta, \gamma)$ is a quadratic estimator of $N(\theta, \gamma)$ and that $\eta > 0$ is a small real parameter such that, with probability at least $1 - 2\epsilon$,
\[
\biggl|\frac{N(\theta, \gamma)}{\hat{N}(\theta, \gamma)} - 1\biggr| \le \eta.
\]
Let $\theta_* \in \arg\min R$ and $\hat{\theta} \in \arg\min_{\theta\in\mathbb{R}^d} \hat{N}(\theta, 1)$. With probability at least $1 - 2\epsilon$,
\[
(1 - \eta)\,\hat{N}(\hat{\theta} - \theta_*, 0) \le N(\hat{\theta} - \theta_*, 0) = R(\hat{\theta}) - R(\theta_*) \le \frac{\eta^2}{1 - \eta}\hat{N}(\hat{\theta}, 1) \le \frac{\eta^2}{1 - \eta}\hat{N}(\theta_*, 1) \le \frac{\eta^2}{(1 - \eta)^2}R(\theta_*).
\]

Proof: Let us put $\hat{R}(\theta) = \hat{N}(\theta, 1)$. Since $\hat{\theta}$ minimizes the quadratic function $\hat{R}$,
\[
0 = \sup\Bigl\{\tfrac{1}{4}\bigl[\hat{R}(\hat{\theta} + \theta') - \hat{R}(\hat{\theta} - \theta')\bigr] : \theta' \text{ s.t. } \hat{N}(\theta', 0) \le \hat{R}(\hat{\theta})\Bigr\}
\]
\[
\ge \sup\Bigl\{\tfrac{1}{4}\bigl[R(\hat{\theta} + \theta') - R(\hat{\theta} - \theta')\bigr] - \tfrac{\eta}{4}\bigl[\hat{R}(\hat{\theta} + \theta') + \hat{R}(\hat{\theta} - \theta')\bigr] : \theta' \text{ s.t. } \hat{N}(\theta', 0) \le \hat{R}(\hat{\theta})\Bigr\}
\]
\[
= \sup\Bigl\{\mathbb{E}\bigl[\bigl(\langle\hat{\theta}, X\rangle - Y\bigr)\langle\theta', X\rangle\bigr] - \tfrac{\eta}{2}\bigl[\hat{R}(\hat{\theta}) + \hat{N}(\theta', 0)\bigr] : \theta' \text{ s.t. } \hat{N}(\theta', 0) \le \hat{R}(\hat{\theta})\Bigr\}
\]
\[
\ge \sup\Bigl\{\mathbb{E}\bigl[\langle\hat{\theta} - \theta_*, X\rangle\langle\theta', X\rangle\bigr] - \eta\,\hat{R}(\hat{\theta}) : \theta' \text{ s.t. } N(\theta', 0) \le (1 - \eta)\hat{R}(\hat{\theta})\Bigr\},
\]
where the last step uses $\mathbb{E}\bigl[(\langle\theta_*, X\rangle - Y)\langle\theta', X\rangle\bigr] = 0$, by optimality of $\theta_*$. Taking $\theta' = c(\hat{\theta} - \theta_*)$ with the largest admissible $c$, and using $N(\hat{\theta} - \theta_*, 0) = R(\hat{\theta}) - R(\theta_*)$, we get
\[
0 \ge \sqrt{(1 - \eta)\hat{R}(\hat{\theta})\bigl[R(\hat{\theta}) - R(\theta_*)\bigr]} - \eta\,\hat{R}(\hat{\theta}).
\]
Squaring gives $R(\hat{\theta}) - R(\theta_*) \le \eta^2\hat{R}(\hat{\theta})/(1 - \eta)$, and $\hat{R}(\hat{\theta}) \le \hat{R}(\theta_*) \le R(\theta_*)/(1 - \eta)$ by definition of $\hat{\theta}$, which yields the announced chain.
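The quadratic estimator $\hat{N}(\theta) = \hat{\alpha}(\theta)^{-2}$ introduced in the dimension dependent section is easy to compute: since $\psi$ is non-decreasing, $\alpha \mapsto r_\lambda(\alpha\theta)$ is non-decreasing, so $\hat{\alpha}(\theta)$ can be located by bisection. The Python below is a minimal sketch under that observation, not code from the slides; the sample size, the value of $\lambda$, and the Student-$t$ test data are arbitrary choices.

```python
import numpy as np

def psi(z):
    # Influence function: log(2) for z >= 1, -log(1 - z + z^2/2) for
    # 0 <= z <= 1, extended to z <= 0 by symmetry psi(-z) = -psi(z).
    z = np.asarray(z, dtype=float)
    a = np.minimum(np.abs(z), 1.0)
    return np.sign(z) * np.where(np.abs(z) >= 1.0, np.log(2.0),
                                 -np.log1p(-a + 0.5 * a * a))

def r_lambda(theta, X, lam):
    # r_lambda(theta) = lambda^{-1} * (1/n) sum_i psi(lambda(<theta, X_i>^2 - 1)).
    return np.mean(psi(lam * ((X @ theta) ** 2 - 1.0))) / lam

def N_hat(theta, X, lam, alpha_max=1e6, iters=60):
    # alpha_hat(theta) = sup{alpha >= 0 : r_lambda(alpha * theta) <= 0},
    # found by bisection: r_lambda(0) = -psi(lam)/lam < 0 and the map
    # alpha -> r_lambda(alpha * theta) is non-decreasing.
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r_lambda(mid * theta, X, lam) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo ** -2  # N_hat(theta) = alpha_hat(theta)^{-2}

# Heavy-tailed sanity check against the empirical second moment.
rng = np.random.default_rng(0)
X = rng.standard_t(df=4.0, size=(10_000, 5))
theta = np.ones(5) / np.sqrt(5.0)
print(N_hat(theta, X, lam=0.05), np.mean((X @ theta) ** 2))
```

Both printed values estimate $N(\theta)$, which equals 2 here (i.i.d. Student-$t$ coordinates with 4 degrees of freedom have variance $4/2 = 2$); the point of the proposition is that the truncated estimate keeps non-asymptotic deviation guarantees under heavy tails, where the empirical second moment does not.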
Dimension free bounds (joint work with Ilaria Giulini)

With probability $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$, and for some estimator $\hat{N}$ to be described in the proof,
\[
\mathbf{1}\bigl[4\mu < 1\bigr]\,\biggl|\frac{\hat{N}(\theta)}{N(\theta)} - 1\biggr| \le \frac{\mu}{1 - 4\mu},
\]
where, for $n \le 10^{20}$,
\[
\mu = \sqrt{\frac{2.07(\kappa - 1)}{n}\biggl[\log(\epsilon^{-1}) + 4.3 + \frac{1.6 \times \|\theta\|^2\operatorname{Tr}(G)}{N(\theta)}\biggr]} + \sqrt{\frac{2\kappa}{n} \times \frac{92\operatorname{Tr}(G)}{N(\theta)}}.
\]
Let us recall that $\operatorname{Tr}(G) = \int \|x\|^2\,\mathrm{d}P(x) = \sum_{i=1}^{d} N(\theta_i)$ for any orthonormal basis $(\theta_i, 1 \le i \le d)$.

Proof

Let
\[
r_\lambda(\theta) = \int \psi\bigl(\langle\theta, x\rangle^2 - \lambda\bigr)\,\mathrm{d}\bar{P}(x).
\]
Consider the Gaussian parameter perturbations $\pi_\theta = \mathcal{N}(\theta, \beta^{-1}I_d)$, where $I_d$ is the identity matrix of size $d \times d$, and let
\[
\hat{\alpha}(\theta) = \sup\bigl\{\alpha \in \mathbb{R}_+ : r_\lambda(\alpha\theta) \le 0\bigr\}.
\]

Proposition

Let $c = \frac{15}{\log(4)} \le 11$. Then
\[
\psi\bigl(\langle\theta, x\rangle^2 - \lambda\bigr) \le \int \log\biggl(1 + \langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta} + \frac{1}{2}\Bigl(\langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta}\Bigr)^2 + \frac{2c\|x\|^2}{\beta}\Bigl(4\langle\theta', x\rangle^2 + \frac{5\|x\|^2}{\beta}\Bigr)\biggr)\,\mathrm{d}\pi_\theta(\theta').
\]

Indeed,
\[
\psi\Bigl(\int h\,\mathrm{d}\rho\Bigr) \le \int \psi(h)\,\mathrm{d}\rho + \min\bigl\{\log(4), \operatorname{Var}_\rho(h)\bigr\},
\]
because $y \mapsto \psi(y) + \bigl(y - \int h\,\mathrm{d}\rho\bigr)^2$ is convex, since $\psi''(y) \ge -2$. As moreover
\[
\langle\theta, x\rangle^2 - \lambda = \int \langle\theta', x\rangle^2\,\mathrm{d}\pi_\theta(\theta') - \lambda - \frac{\|x\|^2}{\beta},
\]
we get
\[
\psi\bigl(\langle\theta, x\rangle^2 - \lambda\bigr) \le \int \psi\Bigl(\langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta}\Bigr)\,\mathrm{d}\pi_\theta(\theta') + \min\biggl\{\log(4),\ \frac{4\|x\|^2\langle\theta, x\rangle^2}{\beta} + \frac{2\|x\|^4}{\beta^2}\biggr\}.
\]

Lemma

If $W \sim \mathcal{N}(0, \sigma^2)$, then
\[
\min\bigl\{a,\ bm^2 + c\bigr\} \le \mathbb{E}\Bigl[\min\bigl\{2a,\ 2b(m + W)^2 + 2b\sigma^2 + c\bigr\}\Bigr], \qquad a, b, c \in \mathbb{R}_+,\ m \in \mathbb{R}.
\]
The proof of this lemma is based on the inequalities $m^2 \le 2(m + W)^2 + 2W^2$ and
\[
\min\{a, y + z\} \le \min\{a, y\} + \min\{a, z\} \le \min\{2a, y + z\}, \qquad a, y, z \in \mathbb{R}_+.
\]
Accordingly,
\[
\psi\bigl(\langle\theta, x\rangle^2 - \lambda\bigr) \le \int \psi\Bigl(\langle\theta', x\rangle^2 - \lambda - \frac{\|x\|^2}{\beta}\Bigr)\,\mathrm{d}\pi_\theta(\theta') + \int \min\biggl\{4\log(2),\ \frac{8\|x\|^2\langle\theta', x\rangle^2}{\beta} + \frac{10\|x\|^4}{\beta^2}\biggr\}\,\mathrm{d}\pi_\theta(\theta').
\]

We will now use the following

Lemma

For any $a, b, y \in \mathbb{R}_+$ and $c = \frac{a}{b}\bigl(\exp(b) - 1\bigr)$,
\[
\log(a) + \min\{b, y\} \le \log(a + cy).
\]
Applying this lemma to $a \le 2$, $b = 4\log(2)$ and the corresponding $c = \frac{15}{\log(4)}$ ends the proof of the previous proposition.

PAC-Bayes bound

Proposition

For any measure $\nu \in \mathcal{M}_+^1(\Theta)$, any real number $a > -1$ and any measurable function $f : \mathcal{X}\times\Theta \to [a, +\infty[$, with probability at least $1 - \epsilon$, for any $\rho \in \mathcal{M}_+^1(\Theta)$ such that $K(\rho, \nu) < \infty$,
\[
\int\!\!\int \log\bigl(1 + f(x, \theta)\bigr)\,\mathrm{d}\rho(\theta)\,\mathrm{d}\bar{P}(x) \le \int\!\!\int f(x, \theta)\,\mathrm{d}\rho(\theta)\,\mathrm{d}P(x) + \frac{K(\rho, \nu) - \log(\epsilon)}{n}.
\]
The proof is based on the fact that
\[
\int h\,\mathrm{d}\rho - K(\rho, \nu) \le \log\biggl(\int \exp(h)\,\mathrm{d}\nu\biggr)
\]
for any upper-bounded measurable function $h : \Theta \to \mathbb{R}$.

We get, with probability $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
\int \psi\bigl(\langle\theta, x\rangle^2 - \lambda\bigr)\,\mathrm{d}\bar{P}(x) \le \int \biggl[\bigl(\langle\theta, x\rangle^2 - \lambda\bigr) + \frac{1}{2}\bigl(\langle\theta, x\rangle^2 - \lambda\bigr)^2 + \frac{4}{\beta}\langle\theta, x\rangle^2\|x\|^2 + \frac{2}{\beta^2}\|x\|^4 + \frac{2c\|x\|^2}{\beta}\Bigl(4\langle\theta, x\rangle^2 + \frac{9\|x\|^2}{\beta}\Bigr)\biggr]\,\mathrm{d}P(x) + \frac{\beta\|\theta\|^2}{2n} + \frac{\log(\epsilon^{-1})}{n}.
\]
Using the Cauchy–Schwarz inequality will make the following quantities appear:
\[
s_4 = \biggl(\int \|x\|^4\,\mathrm{d}P(x)\biggr)^{1/4}, \qquad
\kappa = \sup\biggl\{\int \langle\theta, x\rangle^4\,\mathrm{d}P(x) : \theta \in \mathbb{R}^d \text{ s.t. } \int \langle\theta, x\rangle^2\,\mathrm{d}P(x) = 1\biggr\},
\]
\[
\xi = \frac{\kappa\lambda}{2}, \qquad
\gamma = \lambda(\kappa - 1) + \frac{2}{\beta}(1 + 4c)s_4^2\sqrt{\kappa},
\]
\[
\eta = \frac{\lambda}{2}(\kappa - 1) + \frac{2}{\beta}(1 + 4c)s_4^2\sqrt{\kappa} + \frac{(1 + 18c)s_4^4}{\beta^2\lambda} - \frac{\log \nu(\lambda, \beta)}{n\lambda}, \qquad
\delta = \frac{\beta}{2n\lambda}.
\]

Proposition

With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$ and any $\alpha \in \mathbb{R}_+$,
\[
\frac{r_\lambda(\alpha\theta)}{\lambda} \le \xi\biggl(\frac{\alpha^2 N(\theta) - 1}{\lambda}\biggr)^2 + (1 + \gamma)\,\frac{\alpha^2 N(\theta) - 1}{\lambda} + \eta + \delta\|\theta\|^2\alpha^2,
\]
\[
\frac{r_\lambda(\alpha\theta)}{\lambda} \ge -\xi\biggl(\frac{\alpha^2 N(\theta) - 1}{\lambda}\biggr)^2 + \biggl(1 - \gamma - \frac{\delta\lambda\|\theta\|^2}{N(\theta)}\biggr)\frac{\alpha^2 N(\theta) - 1}{\lambda} - \eta - \frac{\delta\lambda\|\theta\|^2}{N(\theta)}.
\]

Let
\[
\Phi_-(z) = z\biggl(1 - \frac{\eta + \delta\lambda/z}{1 + \gamma - \eta - \delta\lambda/z}\biggr)\,\mathbf{1}\Bigl[\xi^{-1}\bigl(-\gamma + \eta + \delta\lambda/z\bigr) < 1\Bigr],
\]
\[
\Phi_+(z) = z\biggl(1 + \frac{\eta + \delta\lambda/z}{1 - \gamma - \eta - 2\delta\lambda/z}\biggr)\,\mathbf{1}\Bigl[\xi^{-1}\bigl(\gamma + \eta + 2\delta\lambda/z\bigr) < 1\Bigr].
\]
Replacing $N(\theta)$ with $\Phi_-(\lambda/\hat{\alpha}^2)$ in the first inequality of the previous proposition, and $\lambda/\hat{\alpha}^2$ with $\Phi_+(N(\theta))$ in the second inequality, we can prove that
\[
\Phi_-\Bigl(\frac{\lambda}{\hat{\alpha}^2}\Bigr) \le N(\theta), \qquad \Phi_+\bigl(N(\theta)\bigr) \le \frac{\lambda}{\hat{\alpha}^2}.
\]

Proposition

With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
B_- = \sup_{\lambda, \beta}\,\Phi_-\Bigl(\frac{\lambda}{\hat{\alpha}^2}\Bigr) \le N(\theta) \le \inf_{\lambda, \beta}\,\Phi_+^{-1}\Bigl(\frac{\lambda}{\hat{\alpha}^2}\Bigr) = B_+.
\]
Let us put $\hat{N}(\theta) = \frac{B_- + B_+}{2}$. We get, with probability at least $1 - 2\epsilon$, for any $\lambda, \beta \in \mathbb{R}_+$,
\[
N(\theta) - \hat{N}(\theta) \le \frac{N(\theta) - B_-}{2} \le \frac{N(\theta) - \Phi_-(\lambda/\hat{\alpha}^2)}{2} \le \frac{N(\theta) - \Phi_-\circ\Phi_+\bigl(N(\theta)\bigr)}{2},
\]
\[
\hat{N}(\theta) - N(\theta) \le \frac{\Phi_+^{-1}(\lambda/\hat{\alpha}^2) - N(\theta)}{2} \le \frac{\Phi_+^{-1}\circ\Phi_-^{-1}\bigl(N(\theta)\bigr) - N(\theta)}{2}.
\]
Putting
\[
r(z) = \frac{\eta + \delta\lambda/z}{1 - \gamma - \eta - 2\delta\lambda/z}, \qquad c(z) = \eta + \frac{\delta\lambda}{z},
\]
we can prove that
\[
\Phi_-(z) \ge z\bigl(1 - r(z)\bigr)\,\mathbf{1}\bigl[4c(z) < 1\bigr], \qquad \Phi_+(z) \ge z\bigl(1 + r(z)\bigr)^{-1},
\]
\[
\Phi_-^{-1}(z)\,\mathbf{1}\bigl[4c(z) < 1\bigr] \le z\bigl(1 - r(z)\bigr)^{-1}\,\mathbf{1}\bigl[4c(z) < 1\bigr], \qquad \Phi_+^{-1}(z)\,\mathbf{1}\bigl[4c(z) < 1\bigr] \le z\bigl(1 + r(z)\bigr).
\]
From this we can deduce the

Proposition

With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[
\mathbf{1}\Bigl[4c\bigl(N(\theta)\bigr) < 1\Bigr]\,\bigl|N(\theta) - \hat{N}(\theta)\bigr| \le \frac{N(\theta)\,c\bigl(N(\theta)\bigr)}{1 - 4c\bigl(N(\theta)\bigr)}.
\]
The announced result is then obtained by optimizing the value of $c\bigl(N(\theta)\bigr)$ through appropriate choices of $\lambda$ and $\beta$.
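The elementary lemma $\log(a) + \min\{b, y\} \le \log(a + cy)$ with $c = a(\exp(b) - 1)/b$ is easy to sanity-check numerically: equality holds at $y = 0$ and $y = b$, concavity of the right-hand side in $y$ gives the inequality in between, and for $y > b$ the left-hand side stays constant while the right-hand side keeps growing. A minimal Python check, with an arbitrary random grid of $(a, b, y)$:

```python
import numpy as np

# Check: for a, b, y >= 0 and c = a (e^b - 1) / b,
#   log(a) + min(b, y) <= log(a + c y).
rng = np.random.default_rng(1)
a = rng.uniform(0.1, 5.0, size=100_000)
b = rng.uniform(0.1, 5.0, size=100_000)
y = rng.uniform(0.0, 20.0, size=100_000)
c = a * np.expm1(b) / b
assert np.all(np.log(a) + np.minimum(b, y) <= np.log(a + c * y) + 1e-12)

# With a = 2 and b = 4 log(2), as in the proof above:
print(2.0 * np.expm1(4.0 * np.log(2.0)) / (4.0 * np.log(2.0)))  # 15/log(4) ~ 10.82
```

The printed constant is the $c = 15/\log(4) \le 11$ used in the proposition bounding $\psi(\langle\theta, x\rangle^2 - \lambda)$.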