Mixability in Statistical Learning
Tim van Erven
Joint work with: Peter Grünwald, Mark Reid, Bob Williamson
SMILE Seminar, 24 September 2012
Summary
• Stochastic mixability ⟹ fast rates of convergence in different settings:
• statistical learning (margin condition)
• sequential prediction (mixability)
Outline
• Part 1: Statistical learning
• Stochastic mixability (definition)
• Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald’s idea for adaptation to the margin
Notation
• Data: (X_1, Y_1), …, (X_n, Y_n)
• Predict Y from X: F = {f : X → A}
• Loss: ℓ : Y × A → [0, ∞]

Classification:
  Y = {0, 1}, A = {0, 1}
  ℓ(y, a) = 0 if y = a, and 1 if y ≠ a

Density estimation:
  A = density functions on Y
  ℓ(y, p) = −log p(y)

• Without X: F ⊂ A
Statistical Learning
• Data: (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ P*
• Best predictor in the class:
  f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
• Excess risk of an estimator f̂:
  d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^−?)
• Two factors that determine the rate of convergence:
  1. the complexity of F
  2. the margin condition
Definition of Stochastic Mixability
• Let η ≥ 0. Then (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that
  E[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F.
• Stochastically mixable: this holds for some η > 0.
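To make the definition concrete, here is a small numerical check of the condition. This is an illustration of my own, not code from the talk: the toy setup (squared loss, no X, a finite grid of constant predictions, Y ∼ Bernoulli(0.7)) and the grid of η values are arbitrary choices.

```python
import numpy as np

# Toy check of eta-stochastic mixability (illustration only, not from the talk):
# squared loss, no X, a finite class of constant predictions, Y ~ Bernoulli(0.7).
ys, probs = np.array([0.0, 1.0]), np.array([0.3, 0.7])
F = np.linspace(0.0, 1.0, 11)                  # candidate constant predictions
loss = lambda y, a: (y - a) ** 2               # squared loss

risk = np.array([np.sum(probs * loss(ys, a)) for a in F])
f_star = F[np.argmin(risk)]                    # risk minimizer within F

def is_stoch_mixable(eta):
    """True iff E[exp(-eta*(loss(Y,f) - loss(Y,f*)))] <= 1 for all f in F."""
    ratios = [np.sum(probs * np.exp(-eta * (loss(ys, a) - loss(ys, f_star))))
              for a in F]
    return max(ratios) <= 1 + 1e-12

etas = np.linspace(0.01, 10, 1000)             # scan a grid of candidate eta values
mixable = [eta for eta in etas if is_stoch_mixable(eta)]
print(f"f* = {f_star}, eta-stochastically mixable up to eta ~ {max(mixable):.2f}"
      if mixable else "not stochastically mixable on this grid")
```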
Immediate Consequences
  E[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F
• f* minimizes risk over F: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
• The larger η, the stronger the property of being η-stochastically mixable.
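Both bullets follow from Jensen's inequality; the short arguments, which the slide leaves implicit, look like this:

```latex
% f* must minimize risk over F: apply Jensen's inequality to the convex function exp.
\exp\!\Bigl(-\eta\,E\bigl[\ell(Y,f(X))-\ell(Y,f^*(X))\bigr]\Bigr)
  \;\le\; E\Bigl[e^{-\eta(\ell(Y,f(X))-\ell(Y,f^*(X)))}\Bigr] \;\le\; 1
\quad\Longrightarrow\quad
E[\ell(Y,f(X))] \;\ge\; E[\ell(Y,f^*(X))] \text{ for all } f\in\mathcal{F}.

% Monotonicity in eta: for 0 < \eta' \le \eta the map x \mapsto x^{\eta'/\eta} is concave, so
E\bigl[Z^{\eta'/\eta}\bigr] \;\le\; \bigl(E[Z]\bigr)^{\eta'/\eta} \;\le\; 1,
\qquad Z := e^{-\eta(\ell(Y,f(X))-\ell(Y,f^*(X)))}.
```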
Density estimation example 1
• Log-loss: ℓ(y, p) = −log p(y), F = {p_θ | θ ∈ Θ}
• Suppose p_θ* ∈ F is the true density
• Then for η = 1 and any p_θ ∈ F:
  E[ exp(−η ℓ(Y, p_θ)) / exp(−η ℓ(Y, p_θ*)) ] = ∫ (p_θ(y) / p_θ*(y)) P*(dy) = 1
Density estimation example 2
• Normal location family with fixed variance σ²: F = {N(μ, σ²) | μ ∈ ℝ}
• True distribution: P* = N(μ*, τ²)
• For η = σ²/τ²:
  E[ exp(−η ℓ(Y, p_μ)) / exp(−η ℓ(Y, p_μ*)) ]
    = (1/√(2πτ²)) ∫ exp( −η((y − μ)² − (y − μ*)²)/(2σ²) − (y − μ*)²/(2τ²) ) dy
    = (1/√(2πτ²)) ∫ exp( −(y − μ)²/(2τ²) ) dy = 1
• So (ℓ, F, P*) is η-stochastically mixable for η = σ²/τ²
• If f̂ is the empirical mean: E[d(f̂, f*)] = τ²/(2σ²n) = 1/(2ηn)
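A quick Monte Carlo sanity check of the claimed constant η = σ²/τ² (my own check, not from the talk; the particular σ, τ, grid of means and sample size are arbitrary):

```python
import numpy as np

# Monte Carlo check of the Gaussian example: F = {N(mu, sigma^2)}, P* = N(mu*, tau^2).
# The claim is that the mixability ratio stays <= 1 exactly when eta <= sigma^2/tau^2.
rng = np.random.default_rng(0)
sigma, tau, mu_star = 1.0, 2.0, 0.0
y = rng.normal(mu_star, tau, size=1_000_000)       # samples from P*

def log_loss(y, mu):
    # -log density of N(mu, sigma^2); the additive constant cancels in the ratio
    return 0.5 * (y - mu) ** 2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

def worst_ratio(eta, mus=np.linspace(-3, 3, 13)):
    """Max over a grid of mu of the empirical ratio E[exp(-eta*(loss_mu - loss_mu*))]."""
    base = log_loss(y, mu_star)                    # mu* is the risk minimizer in F
    return max(np.mean(np.exp(-eta * (log_loss(y, mu) - base))) for mu in mus)

eta_star = sigma**2 / tau**2                        # = 0.25 here
for eta in (0.5 * eta_star, eta_star, 2 * eta_star):
    print(f"eta = {eta:.3f}: worst ratio ~ {worst_ratio(eta):.3f}")
# Expect ratios <= 1 (up to Monte Carlo error) for eta <= sigma^2/tau^2, and > 1 beyond.
```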
Outline
• Part 1: Statistical learning
• Stochastic mixability (definition)
• Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald’s idea for adaptation to the margin
Margin condition
• c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F,
  where d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))],
        V(f, f*) = E[ (ℓ(Y, f(X)) − ℓ(Y, f*(X)))² ],
        κ ≥ 1, c_0 > 0.
• For 0/1-loss this implies a rate of convergence O(n^(−κ/(2κ−1))).
• So smaller κ is better.
[Tsybakov, 2004]
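A standard instance for 0/1-loss, not spelled out on the slide (and assuming the Bayes classifier lies in F, so that f* is the Bayes classifier): Massart's bounded-noise condition gives the margin condition with κ = 1.

```latex
% 0/1-loss with f*(x) = 1{p_1(x) >= 1/2} the Bayes classifier, where p_1(x) := P(Y=1 | X=x).
% Since (ell(Y,f(X)) - ell(Y,f*(X)))^2 = 1{f(X) != f*(X)}:
V(f,f^*) = P\bigl(f(X)\neq f^*(X)\bigr),
\qquad
d(f,f^*) = E\bigl[\,|2p_1(X)-1|\,\mathbf{1}\{f(X)\neq f^*(X)\}\,\bigr].
% Massart's condition |2p_1(X)-1| >= h > 0 a.s. then yields
d(f,f^*) \;\ge\; h\,P\bigl(f(X)\neq f^*(X)\bigr) \;=\; h\,V(f,f^*),
% i.e. the margin condition with kappa = 1 and c_0 = h.
```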
Stochastic mixability ⟺ margin condition
  c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F
• Thm [κ = 1]: Suppose ℓ takes values in [0, V]. Then (ℓ, F, P*) is stochastically mixable if and only if there exists c_0 > 0 such that the margin condition is satisfied with κ = 1.
Margin condition with κ > 1
  F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε}
• Thm [all κ ≥ 1]: Suppose ℓ takes values in [0, V]. Then the margin condition is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = C ε^((κ−1)/κ).
Outline
• Part 1: Statistical learning
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald’s idea for adaptation to the margin
Sequential Prediction with Expert Advice
• For rounds t = 1, …, n:
  • K experts predict f̂_t^1, …, f̂_t^K
  • Predict (x_t, y_t) by choosing f̂_t
  • Observe (x_t, y_t)
• Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t))
• Game-theoretic (minimax) analysis: want to guarantee small regret against adversarial data
• Worst-case regret = O(1/n) iff the loss is mixable!  [Vovk, 1995]
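To see where the O(1/n) worst-case regret comes from, here is a minimal sketch of Vovk's aggregating algorithm in its simplest instance, log-loss, where it reduces to predicting with the Bayesian mixture and guarantees cumulative regret at most ln K. The data-generating process, the expert pool and all constants below are arbitrary choices of mine, not part of the talk.

```python
import numpy as np

# Aggregating algorithm for log-loss (eta = 1): predict with the weighted mixture
# of the experts' probabilities; cumulative regret is bounded by ln K, so the
# average regret is O(1/n).
rng = np.random.default_rng(1)
n, K = 1000, 5
expert_probs = np.linspace(0.1, 0.9, K)   # expert k always predicts P(y=1) = expert_probs[k]
y = rng.binomial(1, 0.65, size=n)         # binary outcomes (arbitrary choice of data)

def log_loss(y_t, p):                     # p = predicted probability of y_t = 1
    return -np.log(p if y_t == 1 else 1 - p)

weights = np.ones(K) / K                  # uniform prior weights
alg_loss, expert_loss = 0.0, np.zeros(K)
for t in range(n):
    p_mix = np.dot(weights, expert_probs)              # mixture prediction
    alg_loss += log_loss(y[t], p_mix)
    losses = np.array([log_loss(y[t], p) for p in expert_probs])
    expert_loss += losses
    weights *= np.exp(-losses)                          # exponential-weights update, eta = 1
    weights /= weights.sum()

print(f"cumulative regret = {alg_loss - expert_loss.min():.3f}  (bound: ln K = {np.log(K):.3f})")
```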
Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that
  E_{A∼π}[ exp(−η ℓ(y, A)) / exp(−η ℓ(y, a_π)) ] ≤ 1   for all y.
• Vovk: fast O(1/n) rates if and only if the loss is mixable.
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that
  E_{A∼π}[ exp(−η ℓ(y, A)) / exp(−η ℓ(y, a_π)) ] ≤ 1   for all y.
• (ℓ, F, P*) is η-stochastically mixable if
  E_{X,Y∼P*}[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F.
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that
  ℓ(y, a_π) ≤ −(1/η) ln ∫ exp(−η ℓ(y, a)) π(da)   for all y.
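The ratio form of η-mixability used earlier and the mix-loss form on this slide are related by a one-line rearrangement:

```latex
% Rearranging the ratio form into the mix-loss form (the denominator does not depend on A):
E_{A\sim\pi}\!\left[\frac{e^{-\eta\,\ell(y,A)}}{e^{-\eta\,\ell(y,a_\pi)}}\right] \le 1
\;\Longleftrightarrow\;
e^{-\eta\,\ell(y,a_\pi)} \ge \int e^{-\eta\,\ell(y,a)}\,\pi(da)
\;\Longleftrightarrow\;
\ell(y,a_\pi) \le -\tfrac{1}{\eta}\ln\!\int e^{-\eta\,\ell(y,a)}\,\pi(da).
```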
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that
  ℓ(y, a_π) ≤ −(1/η) ln ∫ exp(−η ℓ(y, a)) π(da)   for all y.
• Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that
  E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ exp(−η ℓ(Y, f(X))) π(df) ]
Equivalence of Stochastic Mixability and Ordinary Mixability
  F_full = {all functions from X to A}
• Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*.
• Proper losses include, e.g., 0/1-loss, log-loss and squared loss.
• The theorem generalizes to other losses that satisfy two technical conditions.
Summary
• Stochastic mixability ⟹ fast rates of convergence in different settings:
• statistical learning (margin condition)
• sequential prediction (mixability)
Outline
• Part 1: Statistical learning
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald’s idea for adaptation to the margin
Density estimation example 1 (recap)
• Log-loss: ℓ(y, p) = −log p(y), F = {p_θ | θ ∈ Θ}
• Suppose p_θ* ∈ F is the true density
• Then for η = 1 and any p_θ ∈ F:
  E[ exp(−η ℓ(Y, p_θ)) / exp(−η ℓ(Y, p_θ*)) ] = ∫ (p_θ(y) / p_θ*(y)) P*(dy) = 1
Log-loss example 3 (convex F)
• Log-loss: ℓ(y, p) = −log p(y), F = {p_θ | θ ∈ Θ}
• Suppose the model is misspecified: p_θ* = arg min_{p_θ ∈ F} E[−log p_θ(Y)] is not the true density
• Thm [Li, 1999]: Suppose F is convex. Then
  ∫ (p_θ(y) / p_θ*(y)) P*(dy) ≤ 1   for all p_θ ∈ F
• Convexity is a common condition for convergence of minimum description length and Bayesian methods
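A sketch of the standard first-order argument behind this inequality, which the slide leaves implicit (it glosses over the regularity needed to differentiate under the expectation):

```latex
% p_{theta*} minimizes E[-log p(Y)] over the convex set F, so for any p_theta in F the
% mixture p_alpha = (1-alpha) p_{theta*} + alpha p_theta stays in F and the one-sided
% derivative at alpha = 0 must be nonnegative:
0 \;\le\; \frac{d}{d\alpha} E\bigl[-\log p_\alpha(Y)\bigr]\Big|_{\alpha=0^+}
  \;=\; -E\!\left[\frac{p_\theta(Y)-p_{\theta^*}(Y)}{p_{\theta^*}(Y)}\right]
  \;=\; 1 - \int \frac{p_\theta(y)}{p_{\theta^*}(y)}\,P^*(dy),
% which rearranges to the inequality on the slide.
```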
Log-loss and convexity for η = 1
• Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that
  E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ exp(−η ℓ(Y, f(X))) π(df) ]
• Corollary: For log-loss, 1-stochastic mixability means
  min_{p ∈ F} E[−ln p(Y)] = min_{p ∈ co(F)} E[−ln p(Y)],
  where co(F) denotes the convex hull of F.
Log-loss and convexity for η = 1
[Figure: two panels showing the true density p*, the model F with its log-loss risk minimizer p_θ*, and the convex hull co(F); left panel: "Not stochastically mixable", right panel: "Stochastically mixable".]
• Corollary: For log-loss, 1-stochastic mixability means
  min_{p ∈ F} E[−ln p(Y)] = min_{p ∈ co(F)} E[−ln p(Y)],
  where co(F) denotes the convex hull of F.
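A small numerical illustration of the corollary, using a toy example of my own: a two-element Gaussian family whose convex hull contains a strictly better predictive density under a mixture P*, so 1-stochastic mixability fails.

```python
import numpy as np

# F = {N(-2,1), N(+2,1)}, P* = equal mixture of the two. The best mixture in co(F)
# beats both elements of F, so (log-loss, F, P*) is NOT 1-stochastically mixable.
y = np.linspace(-12, 12, 20_001)
dy = y[1] - y[0]
norm = lambda y, mu: np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2 * np.pi)
p_minus, p_plus = norm(y, -2.0), norm(y, 2.0)
p_star = 0.5 * p_minus + 0.5 * p_plus              # density of P*

def risk(p):
    """E_{P*}[-ln p(Y)], approximated by a Riemann sum on the grid."""
    return np.sum(-np.log(p) * p_star) * dy

risk_F = min(risk(p_minus), risk(p_plus))
risk_hull = min(risk(a * p_minus + (1 - a) * p_plus) for a in np.linspace(0, 1, 101))
print(f"min over F     : {risk_F:.4f}")
print(f"min over co(F) : {risk_hull:.4f}  (strictly smaller => not 1-stoch. mixable)")
```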
Convexity interpretation with pseudo-likelihoods
• Pseudo-likelihoods: p_{f,η}(Y|X) = exp(−η ℓ(Y, f(X))),
  P_F(η) = {p_{f,η}(Y|X) | f ∈ F}
• Corollary: (ℓ, F, P*) is η-stochastically mixable iff
  min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]
Outline
• Part 1: Statistical learning
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald’s idea for adaptation to the margin
Adapting to the margin / η
• Penalized empirical risk minimization:
  f̂ = arg min_{f ∈ F} { (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i)) + λ · pen(f) }
• The optimal λ ∝ 1/η depends on η / the margin
• Single model: take pen(f) = const. ⟹ no need to know λ
• Model selection: F = ∪_m F_m, pen(f) = pen(m) ≠ const.
Convexity testing
• Corollary: (ℓ, F, P*) is η-stochastically mixable iff
  min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]
• [Grünwald, 2011]: pick the largest η such that
  min_{p ∈ P_F(η)} (1/n) Σ_{i=1}^n −(1/η) ln p(Y_i|X_i) ≤ min_{p ∈ co(P_F(η))} (1/n) Σ_{i=1}^n −(1/η) ln p(Y_i|X_i) + "something",
  where "something" depends on concentration inequalities and the penalty function.
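A rough sketch of how such an empirical convexity test could look for a finite class. This is my reading of the idea, not the procedure from the paper: the 0/1-loss setup, the grids, and the constant standing in for "something" are all arbitrary choices.

```python
import numpy as np

# Empirical convexity test for a two-element class under 0/1-loss: for each eta,
# compare the best empirical risk in F with the best empirical mix-loss over
# co(P_F(eta)), and keep the largest eta for which the gap is below a threshold.
rng = np.random.default_rng(3)
n = 500
y = rng.binomial(1, 0.5, n)
preds = [(y + (rng.random(n) < 0.3)) % 2,            # classifier 1: ~30% errors
         (y + (rng.random(n) < 0.4)) % 2]            # classifier 2: ~40% errors
losses = np.array([p != y for p in preds], dtype=float)   # 0/1 losses, shape (2, n)

def empirical_gap(eta, alphas=np.linspace(0, 1, 101)):
    """min over F of empirical risk minus min over co(P_F(eta)) of empirical mix-loss (>= 0)."""
    erm = losses.mean(axis=1).min()
    mix = min(np.mean(-np.log(a * np.exp(-eta * losses[0])
                              + (1 - a) * np.exp(-eta * losses[1])) / eta)
              for a in alphas)
    return erm - mix

something = 0.01                                     # stand-in for the concentration term
etas = np.linspace(0.1, 5, 50)
ok = [eta for eta in etas if empirical_gap(eta) <= something]
print(f"largest eta passing the test: {max(ok):.2f}" if ok else "no eta passes the test")
```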
Summary
• Stochastic mixability ⟹ fast rates of convergence in different settings:
• statistical learning (margin condition)
• sequential prediction (mixability)
• Convexity interpretation
• Idea for adaptation to the margin
References
Slides and NIPS 2012 paper: www.timvanerven.nl
• P. D. Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Conference on Learning Theory (COLT 2011), pages 551–573, 2011.
• J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 2009.
• B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics, 2006.
• A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
• Y. Kalnishkan and M. V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228–1244, 2008.
• J. Li. Estimation of Mixture Models. PhD thesis, Yale University, 1999.
• V. Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 51–60. ACM, 1995.