
Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016
GUILLAUME DEHAENE
Computational troubles in Bayesian inference
If we want to approximate $p(\theta)$:
- Gaussian approximations:
1. Laplace approximation + Gradient Descent
2. Variational Bayes (and a variant)
3. Expectation Propagation
Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode
Computed using Gradient Descent on $\psi = -\log p$
[Figure: probability density plotted against $\theta$]
Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode
Computed using Gradient Descent on $\psi = -\log p$
The mathematically conservative choice:
- Gradient Descent is well-understood
- Laplace is exact in the large-data limit
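As a concrete illustration, here is a minimal sketch of that recipe on a toy 1-D target of my own choosing (a Cauchy-like likelihood times a Gaussian prior, not an example from the talk): Gradient Descent on $\psi = -\log p$ finds the mode, and the curvature $\psi''$ at the mode gives the precision of the Laplace Gaussian.

```python
# Minimal Laplace-approximation sketch on a toy 1-D target (assumption for illustration):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2  is -log p up to an additive constant.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)          # psi'
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0  # psi'' (> 0 here)

theta, step = 3.0, 0.1
for _ in range(500):                       # plain Gradient Descent on psi
    theta -= step * psi_prime(theta)

mode = theta
laplace_var = 1.0 / psi_second(mode)       # Laplace: variance = inverse curvature at the mode
print(f"Laplace approximation: N(mean={mode:.4f}, var={laplace_var:.4f})")
```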
Physical intuitions
Gradient Descent ≈ dynamics of a sliding object
[Figure: $-\log$ probability landscapes]
Linking GD, VB and EP
VB and EP iterate Gaussian approximations
We can define an algorithm that:
- iterates Gaussian approximations
- computes the Laplace approximation
- performs Gradient Descent
Algorithm 1: disguised gradient descent
- Initialize with any Gaussian $q_0$
- Loop:
πœ‡π‘› = πΈπ‘žπ‘› πœƒ
π‘Ÿ = πœ“ β€² πœ‡π‘›
𝛽 = πœ“ β€²β€² πœ‡π‘›
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
𝝍′ 𝝁𝒏
𝝁𝒏+𝟏 = 𝝁𝒏 βˆ’ β€²β€²
𝝍 𝝁𝒏
This is Newton’s method !
2
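A minimal Python sketch of Algorithm 1, on a toy 1-D target of my own choosing (not from the talk): completing the square in the $q_{n+1}$ update shows that only the mean and variance of the Gaussian need to be tracked, and the mean update is exactly a Newton step.

```python
# Sketch of Algorithm 1 ("disguised gradient descent") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2, i.e. -log p for a Cauchy-like likelihood x Gaussian prior.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)          # psi'
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0  # psi'' (> 0 here)

mu, var = 3.0, 1.0                          # any initial Gaussian q_0
for n in range(20):
    r, beta = psi_prime(mu), psi_second(mu)
    mu, var = mu - r / beta, 1.0 / beta     # completing the square: Newton step + curvature
print(f"q*: N(mean={mu:.4f}, var={var:.4f})  (= the Laplace approximation)")
```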
Algorithm 1: disguised gradient descent
Newton's method: $\psi \approx$ quadratic
DGD: $p \approx \exp(-\text{quadratic})$, i.e. $p \approx$ Gaussian
Variational Bayes Gaussian approximation
The Variational Bayes approach:
- Minimize $\mathrm{KL}(q, p) = E_q\!\left[\log \frac{q}{p}\right]$ for $q$ a Gaussian
Local minima satisfy (Opper & Archambeau, 2007):
    $E_{q^*}[\psi'] = 0$
    $E_{q^*}[\psi''] = \mathrm{var}_{q^*}^{-1}$
Algorithm 2: smoothed gradient descent
- Initialize with any Gaussian $q_0$
- Loop:
πœ‡π‘› = πΈπ‘žπ‘› πœƒ
π‘Ÿ = πΈπ‘žπ‘› πœ“ β€² πœƒ
𝛽 = πΈπ‘žπ‘› πœ“ β€²β€² πœƒ
β‰ˆ 𝝍′ (𝝁𝒏 )
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
β‰ˆ 𝝍′′ (𝝁𝒏 )
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
2
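A sketch of Algorithm 2 on the same kind of toy 1-D target as above (my own example). The only change from Algorithm 1 is that $r$ and $\beta$ are expectations of $\psi'$ and $\psi''$ under the current Gaussian $q_n$, computed here with Gauss-Hermite quadrature; a fixed point satisfies exactly the Opper-Archambeau conditions. No damping is used, which harder targets may require.

```python
# Sketch of Algorithm 2 ("smoothed gradient descent") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2.  Expectations under q_n use Gauss-Hermite quadrature.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0

nodes, weights = np.polynomial.hermite.hermgauss(40)

def gauss_expect(f, mean, var):
    """E_{N(mean, var)}[f(theta)] via Gauss-Hermite quadrature."""
    theta = mean + np.sqrt(2.0 * var) * nodes
    return np.sum(weights * f(theta)) / np.sqrt(np.pi)

mu, var = 3.0, 1.0                                # any initial Gaussian q_0
for n in range(50):
    r    = gauss_expect(psi_prime,  mu, var)      # smoothed gradient  E_{q_n}[psi']
    beta = gauss_expect(psi_second, mu, var)      # smoothed curvature E_{q_n}[psi'']
    mu, var = mu - r / beta, 1.0 / beta
print(f"q*: N(mean={mu:.4f}, var={var:.4f})  (fixed point of the VB conditions)")
```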
$\alpha$-Divergence minimization
If instead of KL, we minimize:
    $D_\alpha(p, q) = \int p^{1-\alpha} q^\alpha$
Then, local minima $q^*$ are such that:
    $h^* \propto p^{1-\alpha} (q^*)^\alpha$
    $E_{h^*}[\psi'] = 0$
    $E_{h^*}[(\theta - \mu_{h^*})\,\psi'(\theta)] = 1$
Algorithm 3: hybrid smoothing GD
- Initialize with any Gaussian $q_0$
- Loop:
β„Žπ‘› ∝ 𝑝
1βˆ’π›Ό
π‘žπ‘›
𝛼
πœ‡π‘› = πΈβ„Žπ‘› πœƒ
β€²
π‘Ÿ = πΈβ„Žπ‘› πœ“
πœƒ
β€²
β‰ˆ 𝝍 (ππ’βˆ’1
)
𝛽 = π‘£π‘Žπ‘Ÿβ„Žπ‘›
β‰ˆ 𝝍′′ (𝝁𝒏 )
πΈβ„Žπ‘› πœƒ βˆ’ πœ‡β„Ž πœ“ β€² πœƒ
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
2
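A sketch of Algorithm 3 on the same toy 1-D target (my own example, with $\alpha = 0.5$). The hybrid $h_n \propto p^{1-\alpha} q_n^\alpha$ rarely has closed-form moments, so this sketch simply evaluates it on a dense grid; that is only viable in 1-D and is meant purely as an illustration of the update.

```python
# Sketch of Algorithm 3 ("hybrid smoothing GD") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2.  Hybrid moments are computed on a dense grid.
import numpy as np

def psi(t):       return np.log1p(t**2) + 0.5 * (t - 1.0)**2
def psi_prime(t): return 2.0*t / (1.0 + t**2) + (t - 1.0)

alpha = 0.5
grid = np.linspace(-10.0, 10.0, 4001)
mu, var = 3.0, 1.0                                         # any initial Gaussian q_0

for n in range(50):
    log_q = -0.5 * (grid - mu)**2 / var
    log_h = (1.0 - alpha) * (-psi(grid)) + alpha * log_q   # h_n ∝ p^(1-a) q_n^a
    h = np.exp(log_h - log_h.max()); h /= h.sum()
    mu_h  = np.sum(h * grid)                               # mu_n = E_{h_n}[theta]
    var_h = np.sum(h * (grid - mu_h)**2)
    r     = np.sum(h * psi_prime(grid))                    # r = E_{h_n}[psi']
    beta  = np.sum(h * (grid - mu_h) * psi_prime(grid)) / var_h  # var^{-1} E[(theta - mu) psi']
    mu, var = mu_h - r / beta, 1.0 / beta                  # complete the square for q_{n+1}
print(f"q*: N(mean={mu:.4f}, var={var:.4f})")
```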
Interpreting algorithm 3
The only difference (not obvious for the $\beta$-term):
replacing $q_n$, a poor approximation of $p$,
by a superior hybrid approximation:
    $h_n \propto p^{1-\alpha} q_n^\alpha \approx p$
Expectation Propagation
Assume that the target can be factorized:
𝑝 πœƒ ∝
𝑓𝑖 πœƒ
𝑖
Then EP seeks a Gaussian approximation for each 𝑓𝑖 :
𝑔𝑖 πœƒ β‰ˆ 𝑓𝑖 πœƒ
They are improved iteratively
Algorithm 4: classic Expectation Propagation
- Loop:
π‘‘β„Ž
- Compute the 𝑖
hybrid:
β„Žπ‘– ∝ 𝑓𝑖 πœƒ
𝑗≠𝑖 𝑔𝑗
πœƒ β‰ˆπ’‘
and its mean and variance:
πœ‡π‘– = πΈβ„Žπ‘– πœƒ
-
𝑣𝑖 = π‘£π‘Žπ‘Ÿβ„Žπ‘–
New 𝑖 π‘‘β„Ž approximation:
𝑔𝑖 πœƒ =
πœƒ βˆ’ πœ‡π‘– 2
exp βˆ’
2π‘£π‘Žπ‘Ÿβ„Žπ‘–
𝑗≠𝑖 𝑔𝑗
πœƒ
β‰ˆ π’‡π’Š π’ˆπ’‹
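A sketch of classic EP on a small 1-D Bayesian logistic-regression toy; the data, the Gaussian prior, and the grid-based moment computation are all my own assumptions for illustration, not from the talk. Sites are kept in natural parameters (precision, precision times mean), and each update moment-matches the hybrid $h_i$.

```python
# Sketch of Algorithm 4 (classic EP) on a 1-D logistic-regression toy (assumptions):
# p(theta) ∝ N(theta; 0, 1) * prod_i sigmoid(y_i x_i theta), one Gaussian site g_i per factor.
import numpy as np

x = np.array([1.5, -0.7, 2.0, 0.3]); y = np.array([1, -1, 1, 1])   # toy data (assumption)
prior_prec = 1.0
grid = np.linspace(-10.0, 10.0, 4001)

lam = np.zeros(len(x)); eta = np.zeros(len(x))        # site natural params: precision, prec*mean
for sweep in range(20):
    for i in range(len(x)):
        cav_prec = prior_prec + lam.sum() - lam[i]    # cavity = prior * all other sites
        cav_eta  = eta.sum() - eta[i]
        # hybrid h_i ∝ f_i(theta) * cavity(theta); its moments are taken on the grid
        log_h = -0.5 * cav_prec * grid**2 + cav_eta * grid - np.logaddexp(0.0, -y[i]*x[i]*grid)
        h = np.exp(log_h - log_h.max()); h /= h.sum()
        mu_i = np.sum(h * grid); v_i = np.sum(h * (grid - mu_i)**2)
        # new site = (Gaussian with the hybrid's moments) / cavity
        lam[i] = 1.0 / v_i - cav_prec
        eta[i] = mu_i / v_i - cav_eta

post_prec = prior_prec + lam.sum()
print(f"EP posterior: N(mean={eta.sum()/post_prec:.4f}, var={1.0/post_prec:.4f})")
```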
Algorithm 5: smooth EP
Factorizing $p$ has split the energy landscape:
    $\psi(\theta) = \sum_i \phi_i(\theta)$
For each component $\phi_i(\theta)$, use a different smoothing:
    $h_i \propto f_i \prod_{j \neq i} g_j \approx p$
Then, update $g_i \approx f_i = \exp(-\phi_i)$
Algorithm 5: smooth EP
- Initialize with any Gaussians $g_1, g_2, \ldots, g_n$
- Loop:
β„Žπ‘– ∝ 𝑓𝑖
πœ‡π‘– =
𝛽=
𝑗≠𝑖 𝑔𝑗
β€²
πΈβ„Žπ‘– πœƒ
π‘Ÿ = πΈβ„Žπ‘– πœ™π‘–
βˆ’1
β€²
π‘£π‘Žπ‘Ÿβ„Žπ‘–
πΈβ„Žπ‘– πœƒ βˆ’ πœ‡π‘– πœ™π‘–
𝛽
𝑔𝑖 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘– βˆ’ πœƒ βˆ’ πœ‡π‘–
2
πœƒ
πœƒ
2
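A sketch of smooth EP on the same toy logistic problem as in the classic-EP sketch above (again my own example). The structure is identical, but each site is now built from $E_{h_i}[\phi_i']$ and $\mathrm{var}_{h_i}^{-1} E_{h_i}[(\theta - \mu_i)\phi_i']$ rather than from moment matching.

```python
# Sketch of Algorithm 5 (smooth EP) on the 1-D logistic toy used above (assumptions):
# phi_i(theta) = log(1 + exp(-y_i x_i theta)), so that f_i = exp(-phi_i).
import numpy as np

x = np.array([1.5, -0.7, 2.0, 0.3]); y = np.array([1, -1, 1, 1])   # toy data (assumption)
a = y * x
prior_prec = 1.0
grid = np.linspace(-10.0, 10.0, 4001)

def phi(i, t):       return np.logaddexp(0.0, -a[i] * t)            # phi_i = -log f_i
def phi_prime(i, t): return -a[i] / (1.0 + np.exp(a[i] * t))        # phi_i'

lam = np.zeros(len(x)); eta = np.zeros(len(x))       # site natural params: precision, prec*mean
for sweep in range(30):
    for i in range(len(x)):
        cav_prec = prior_prec + lam.sum() - lam[i]   # cavity = prior * all other sites
        cav_eta  = eta.sum() - eta[i]
        log_h = -0.5 * cav_prec * grid**2 + cav_eta * grid - phi(i, grid)
        h = np.exp(log_h - log_h.max()); h /= h.sum()
        mu_i  = np.sum(h * grid)                                         # E_{h_i}[theta]
        var_i = np.sum(h * (grid - mu_i)**2)
        r     = np.sum(h * phi_prime(i, grid))                           # E_{h_i}[phi_i']
        beta  = np.sum(h * (grid - mu_i) * phi_prime(i, grid)) / var_i   # var^{-1} E[(theta-mu) phi_i']
        lam[i] = beta                        # g_i ∝ exp(-r(theta - mu_i) - beta/2 (theta - mu_i)^2)
        eta[i] = beta * mu_i - r

post_prec = prior_prec + lam.sum()
print(f"Smooth-EP posterior: N(mean={eta.sum()/post_prec:.4f}, var={1.0/post_prec:.4f})")
```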
Classic vs Smooth EP
Algorithm 4:
- Computationally efficient
- Completely unintuitive
Algorithm 5:
- Intuitive: linked to Newton's method
- Tractable for analysis
Which should we choose?
Conclusion
Algorithm 1: iterating on Gaussians to perform GD
Algorithm 2: smoothed GD computes VB approx.
Algorithm 3: hybrid smoothing computes the $D_\alpha$ approx.
Algorithm 5: a more complicated hybrid smoothing which computes the EP approximation
We can re-use our understanding of Newton's method when we think about EP.
Possible path towards improved EP algorithms?
Conclusion
This might prove a path towards theoretical results on EP:
- It makes the link between EP and VB intuitive:
  - The only difference between Algorithms 2 and 5: smoothing with $q_n$ or with the $h_i$
  - In the limit where all $h_i \approx q_n$, EP $\approx$ VB
  - This corresponds to a large number of weak factors