
Expectation-Propagation performs smooth gradient descent
Advances in Approximate Bayesian Inference 2016
GUILLAUME DEHAENE
Computational troubles in Bayesian inference
If we want to approximate $p(\theta)$:
- Gaussian approximations:
1. Laplace approximation + Gradient Descent
2. Variational Bayes (and a variant)
3. Expectation Propagation
Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode
Computed using Gradient Descent on $\psi = -\log p$
[Figure: probability density plotted against $\theta$]
Laplace + Gradient Descent
Laplace = Gaussian approximation at the mode
Computed using Gradient Descent on $\psi = -\log p$
The mathematically conservative choice:
- Gradient Descent is well-understood
- Laplace is exact in the large-data limit
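As a concrete illustration, here is a minimal sketch of that recipe on a toy 1-D target of my own choosing (a Cauchy-like likelihood times a Gaussian prior, not an example from the talk): Gradient Descent on $\psi = -\log p$ finds the mode, and the curvature $\psi''$ at the mode gives the precision of the Laplace Gaussian.

```python
# Minimal Laplace-approximation sketch on a toy 1-D target (assumption for illustration):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2  is -log p up to an additive constant.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)          # psi'
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0  # psi'' (> 0 here)

theta, step = 3.0, 0.1
for _ in range(500):                       # plain Gradient Descent on psi
    theta -= step * psi_prime(theta)

mode = theta
laplace_var = 1.0 / psi_second(mode)       # Laplace: variance = inverse curvature at the mode
print(f"Laplace approximation: N(mean={mode:.4f}, var={laplace_var:.4f})")
```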
Physical intuitions
Gradient Descent ≈ dynamics of a sliding object
[Figure: $-\log$ probability landscapes]
Linking GD, VB and EP
VB and EP iterate Gaussian approximations
We can define an algorithm that:
- iterates Gaussian approximations
- computes the Laplace approximation
- performs Gradient Descent
Algorithm 1: disguised gradient descent
- Initialize with any Gaussian $q_0$
- Loop:
πœ‡π‘› = πΈπ‘žπ‘› πœƒ
π‘Ÿ = πœ“ β€² πœ‡π‘›
𝛽 = πœ“ β€²β€² πœ‡π‘›
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
𝝍′ 𝝁𝒏
𝝁𝒏+𝟏 = 𝝁𝒏 βˆ’ β€²β€²
𝝍 𝝁𝒏
This is Newton’s method !
2
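A minimal Python sketch of Algorithm 1, on a toy 1-D target of my own choosing (not from the talk): completing the square in the $q_{n+1}$ update shows that only the mean and variance of the Gaussian need to be tracked, and the mean update is exactly a Newton step.

```python
# Sketch of Algorithm 1 ("disguised gradient descent") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2, i.e. -log p for a Cauchy-like likelihood x Gaussian prior.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)          # psi'
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0  # psi'' (> 0 here)

mu, var = 3.0, 1.0                          # any initial Gaussian q_0
for n in range(20):
    r, beta = psi_prime(mu), psi_second(mu)
    mu, var = mu - r / beta, 1.0 / beta     # completing the square: Newton step + curvature
print(f"q*: N(mean={mu:.4f}, var={var:.4f})  (= the Laplace approximation)")
```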
Algorithm 1: disguised gradient descent
Newton's method: $\psi \approx$ quadratic
DGD: $p \approx \exp(-\text{quadratic})$, i.e. $p \approx$ Gaussian
Variational Bayes Gaussian approximation
The Variational Bayes approach:
- Minimize $\mathrm{KL}(q, p) = E_q\!\left[\log \frac{q}{p}\right]$ for $q$ a Gaussian
Local minima satisfy (Opper & Archambeau, 2007):
    $E_{q^*}[\psi'] = 0$
    $E_{q^*}[\psi''] = \mathrm{var}_{q^*}^{-1}$
Algorithm 2: smoothed gradient descent
- Initialize with any Gaussian $q_0$
- Loop:
πœ‡π‘› = πΈπ‘žπ‘› πœƒ
π‘Ÿ = πΈπ‘žπ‘› πœ“ β€² πœƒ
𝛽 = πΈπ‘žπ‘› πœ“ β€²β€² πœƒ
β‰ˆ 𝝍′ (𝝁𝒏 )
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
β‰ˆ 𝝍′′ (𝝁𝒏 )
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
2
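A sketch of Algorithm 2 on the same kind of toy 1-D target as above (my own example). The only change from Algorithm 1 is that $r$ and $\beta$ are expectations of $\psi'$ and $\psi''$ under the current Gaussian $q_n$, computed here with Gauss-Hermite quadrature; a fixed point satisfies exactly the Opper-Archambeau conditions. No damping is used, which harder targets may require.

```python
# Sketch of Algorithm 2 ("smoothed gradient descent") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2.  Expectations under q_n use Gauss-Hermite quadrature.
import numpy as np

def psi_prime(t):  return 2.0*t / (1.0 + t**2) + (t - 1.0)
def psi_second(t): return 2.0*(1.0 - t**2) / (1.0 + t**2)**2 + 1.0

nodes, weights = np.polynomial.hermite.hermgauss(40)

def gauss_expect(f, mean, var):
    """E_{N(mean, var)}[f(theta)] via Gauss-Hermite quadrature."""
    theta = mean + np.sqrt(2.0 * var) * nodes
    return np.sum(weights * f(theta)) / np.sqrt(np.pi)

mu, var = 3.0, 1.0                                # any initial Gaussian q_0
for n in range(50):
    r    = gauss_expect(psi_prime,  mu, var)      # smoothed gradient  E_{q_n}[psi']
    beta = gauss_expect(psi_second, mu, var)      # smoothed curvature E_{q_n}[psi'']
    mu, var = mu - r / beta, 1.0 / beta
print(f"q*: N(mean={mu:.4f}, var={var:.4f})  (fixed point of the VB conditions)")
```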
$\alpha$-Divergence minimization
If instead of KL, we minimize:
    $D_\alpha(p, q) = \int p^{1-\alpha} q^\alpha$
Then, local minima $q^*$ are such that:
    $h^* \propto p^{1-\alpha} (q^*)^\alpha$
    $E_{h^*}[\psi'] = 0$
    $E_{h^*}[(\theta - \mu_{h^*})\,\psi'(\theta)] = 1$
Algorithm 3: hybrid smoothing GD
- Initialize with any Gaussian $q_0$
- Loop:
β„Žπ‘› ∝ 𝑝
1βˆ’π›Ό
π‘žπ‘›
𝛼
πœ‡π‘› = πΈβ„Žπ‘› πœƒ
β€²
π‘Ÿ = πΈβ„Žπ‘› πœ“
πœƒ
β€²
β‰ˆ 𝝍 (ππ’βˆ’1
)
𝛽 = π‘£π‘Žπ‘Ÿβ„Žπ‘›
β‰ˆ 𝝍′′ (𝝁𝒏 )
πΈβ„Žπ‘› πœƒ βˆ’ πœ‡β„Ž πœ“ β€² πœƒ
π‘žπ‘›+1 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘›
𝛽
βˆ’ πœƒ βˆ’ πœ‡π‘›
2
2
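A sketch of Algorithm 3 on the same toy 1-D target (my own example, with $\alpha = 0.5$). The hybrid $h_n \propto p^{1-\alpha} q_n^\alpha$ rarely has closed-form moments, so this sketch simply evaluates it on a dense grid; that is only viable in 1-D and is meant purely as an illustration of the update.

```python
# Sketch of Algorithm 3 ("hybrid smoothing GD") on a toy 1-D target (assumption):
# psi(t) = log(1 + t^2) + (t - 1)^2 / 2.  Hybrid moments are computed on a dense grid.
import numpy as np

def psi(t):       return np.log1p(t**2) + 0.5 * (t - 1.0)**2
def psi_prime(t): return 2.0*t / (1.0 + t**2) + (t - 1.0)

alpha = 0.5
grid = np.linspace(-10.0, 10.0, 4001)
mu, var = 3.0, 1.0                                         # any initial Gaussian q_0

for n in range(50):
    log_q = -0.5 * (grid - mu)**2 / var
    log_h = (1.0 - alpha) * (-psi(grid)) + alpha * log_q   # h_n ∝ p^(1-a) q_n^a
    h = np.exp(log_h - log_h.max()); h /= h.sum()
    mu_h  = np.sum(h * grid)                               # mu_n = E_{h_n}[theta]
    var_h = np.sum(h * (grid - mu_h)**2)
    r     = np.sum(h * psi_prime(grid))                    # r = E_{h_n}[psi']
    beta  = np.sum(h * (grid - mu_h) * psi_prime(grid)) / var_h  # var^{-1} E[(theta - mu) psi']
    mu, var = mu_h - r / beta, 1.0 / beta                  # complete the square for q_{n+1}
print(f"q*: N(mean={mu:.4f}, var={var:.4f})")
```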
Interpreting algorithm 3
The only difference (not obvious for the $\beta$-term):
replacing $q_n$, a poor approximation of $p$,
by a superior hybrid approximation:
    $h_n \propto p^{1-\alpha} q_n^\alpha \approx p$
Expectation Propagation
Assume that the target can be factorized:
𝑝 πœƒ ∝
𝑓𝑖 πœƒ
𝑖
Then EP seeks a Gaussian approximation for each 𝑓𝑖 :
𝑔𝑖 πœƒ β‰ˆ 𝑓𝑖 πœƒ
They are improved iteratively
Algorithm 4: classic Expectation Propagation
- Loop:
π‘‘β„Ž
- Compute the 𝑖
hybrid:
β„Žπ‘– ∝ 𝑓𝑖 πœƒ
𝑗≠𝑖 𝑔𝑗
πœƒ β‰ˆπ’‘
and its mean and variance:
πœ‡π‘– = πΈβ„Žπ‘– πœƒ
-
𝑣𝑖 = π‘£π‘Žπ‘Ÿβ„Žπ‘–
New 𝑖 π‘‘β„Ž approximation:
𝑔𝑖 πœƒ =
πœƒ βˆ’ πœ‡π‘– 2
exp βˆ’
2π‘£π‘Žπ‘Ÿβ„Žπ‘–
𝑗≠𝑖 𝑔𝑗
πœƒ
β‰ˆ π’‡π’Š π’ˆπ’‹
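A sketch of classic EP on a small 1-D Bayesian logistic-regression toy; the data, the Gaussian prior, and the grid-based moment computation are all my own assumptions for illustration, not from the talk. Sites are kept in natural parameters (precision, precision times mean), and each update moment-matches the hybrid $h_i$.

```python
# Sketch of Algorithm 4 (classic EP) on a 1-D logistic-regression toy (assumptions):
# p(theta) ∝ N(theta; 0, 1) * prod_i sigmoid(y_i x_i theta), one Gaussian site g_i per factor.
import numpy as np

x = np.array([1.5, -0.7, 2.0, 0.3]); y = np.array([1, -1, 1, 1])   # toy data (assumption)
prior_prec = 1.0
grid = np.linspace(-10.0, 10.0, 4001)

lam = np.zeros(len(x)); eta = np.zeros(len(x))        # site natural params: precision, prec*mean
for sweep in range(20):
    for i in range(len(x)):
        cav_prec = prior_prec + lam.sum() - lam[i]    # cavity = prior * all other sites
        cav_eta  = eta.sum() - eta[i]
        # hybrid h_i ∝ f_i(theta) * cavity(theta); its moments are taken on the grid
        log_h = -0.5 * cav_prec * grid**2 + cav_eta * grid - np.logaddexp(0.0, -y[i]*x[i]*grid)
        h = np.exp(log_h - log_h.max()); h /= h.sum()
        mu_i = np.sum(h * grid); v_i = np.sum(h * (grid - mu_i)**2)
        # new site = (Gaussian with the hybrid's moments) / cavity
        lam[i] = 1.0 / v_i - cav_prec
        eta[i] = mu_i / v_i - cav_eta

post_prec = prior_prec + lam.sum()
print(f"EP posterior: N(mean={eta.sum()/post_prec:.4f}, var={1.0/post_prec:.4f})")
```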
Algorithm 5: smooth EP
Factorizing $p$ has split the energy landscape:
    $\psi(\theta) = \sum_i \phi_i(\theta)$
For each component $\phi_i(\theta)$, use a different smoothing:
    $h_i \propto f_i \prod_{j \neq i} g_j \approx p$
Then, update $g_i \approx f_i = \exp(-\phi_i)$
Algorithm 5: smooth EP
- Initialize with any Gaussians $g_1, g_2, \ldots, g_n$
- Loop:
β„Žπ‘– ∝ 𝑓𝑖
πœ‡π‘– =
𝛽=
𝑗≠𝑖 𝑔𝑗
β€²
πΈβ„Žπ‘– πœƒ
π‘Ÿ = πΈβ„Žπ‘– πœ™π‘–
βˆ’1
β€²
π‘£π‘Žπ‘Ÿβ„Žπ‘–
πΈβ„Žπ‘– πœƒ βˆ’ πœ‡π‘– πœ™π‘–
𝛽
𝑔𝑖 πœƒ ∝ exp βˆ’π‘Ÿ πœƒ βˆ’ πœ‡π‘– βˆ’ πœƒ βˆ’ πœ‡π‘–
2
πœƒ
πœƒ
2
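A sketch of smooth EP on the same toy logistic problem as in the classic-EP sketch above (again my own example). The structure is identical, but each site is now built from $E_{h_i}[\phi_i']$ and $\mathrm{var}_{h_i}^{-1} E_{h_i}[(\theta - \mu_i)\phi_i']$ rather than from moment matching.

```python
# Sketch of Algorithm 5 (smooth EP) on the 1-D logistic toy used above (assumptions):
# phi_i(theta) = log(1 + exp(-y_i x_i theta)), so that f_i = exp(-phi_i).
import numpy as np

x = np.array([1.5, -0.7, 2.0, 0.3]); y = np.array([1, -1, 1, 1])   # toy data (assumption)
a = y * x
prior_prec = 1.0
grid = np.linspace(-10.0, 10.0, 4001)

def phi(i, t):       return np.logaddexp(0.0, -a[i] * t)            # phi_i = -log f_i
def phi_prime(i, t): return -a[i] / (1.0 + np.exp(a[i] * t))        # phi_i'

lam = np.zeros(len(x)); eta = np.zeros(len(x))       # site natural params: precision, prec*mean
for sweep in range(30):
    for i in range(len(x)):
        cav_prec = prior_prec + lam.sum() - lam[i]   # cavity = prior * all other sites
        cav_eta  = eta.sum() - eta[i]
        log_h = -0.5 * cav_prec * grid**2 + cav_eta * grid - phi(i, grid)
        h = np.exp(log_h - log_h.max()); h /= h.sum()
        mu_i  = np.sum(h * grid)                                         # E_{h_i}[theta]
        var_i = np.sum(h * (grid - mu_i)**2)
        r     = np.sum(h * phi_prime(i, grid))                           # E_{h_i}[phi_i']
        beta  = np.sum(h * (grid - mu_i) * phi_prime(i, grid)) / var_i   # var^{-1} E[(theta-mu) phi_i']
        lam[i] = beta                        # g_i ∝ exp(-r(theta - mu_i) - beta/2 (theta - mu_i)^2)
        eta[i] = beta * mu_i - r

post_prec = prior_prec + lam.sum()
print(f"Smooth-EP posterior: N(mean={eta.sum()/post_prec:.4f}, var={1.0/post_prec:.4f})")
```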
Classic vs Smooth EP
Algorithm 4:
- Computationally efficient
- Completely unintuitive
Algorithm 5:
- Intuitive: linked to Newton's method
- Tractable for analysis
Which should we choose?
Conclusion
Algorithm 1: iterating on Gaussians to perform GD
Algorithm 2: smoothed GD computes VB approx.
Algorithm 3: hybrid smoothing computes the $D_\alpha$ approx.
Algorithm 5: a more complicated hybrid smoothing which computes the EP approximation
We can re-use our understanding of Newton's method when we think about EP.
Possible path towards improved EP algorithms?
Conclusion
This might prove a path towards theoretical results on EP:
- It makes the link between EP and VB intuitive:
  - The only difference between Algorithms 2 and 5: smoothing with $q_n$ or with the $h_i$
  - In the limit where all $h_i \approx q_n$, EP $\approx$ VB
  - This corresponds to a large number of weak factors