Approximate Decentralized Bayesian Inference
Trevor Campbell, Jonathan P. How • Laboratory for Information and Decision Systems, MIT
Introduction

• Problem: Unsupervised learning in conditionally independent Bayesian models over a decentralized, ad-hoc network
• Challenge: Approximate inference breaks the model structure required to combine posteriors

Bayes' Approximate Posterior Combination

• Latent parameters: Θ = {θ_k}_{k=1}^K, data subsets: Y = {Y_i}_{i=1}^N
• Conditional independence of the Y_i given Θ lets the agents' individual posteriors be combined exactly:

    p(Θ|Y) ∝ p(Θ) p(Y|Θ) = p(Θ) ∏_{i=1}^N p(Y_i|Θ) ∝ p(Θ) ∏_{i=1}^N p(Θ|Y_i) / p(Θ)    (1)
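For reference, the one-line derivation behind (1): Bayes' rule is applied to each subset's likelihood, and the N extra copies of the prior collapse into the p(Θ)^{1−N} factor that reappears in (2) and (5).

```latex
% Bayes' rule on each subset, then collect the N copies of the prior:
\begin{align*}
  p(Y_i \mid \Theta) &= \frac{p(\Theta \mid Y_i)\,p(Y_i)}{p(\Theta)}
     \;\propto\; \frac{p(\Theta \mid Y_i)}{p(\Theta)}, \\
  \Rightarrow\; p(\Theta \mid Y)
     &\propto p(\Theta)\prod_{i=1}^{N}\frac{p(\Theta \mid Y_i)}{p(\Theta)}
      \;=\; p(\Theta)^{1-N}\prod_{i=1}^{N} p(\Theta \mid Y_i).
\end{align*}
```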
The Naïve Approach

• Compute local variational posteriors q_i(Θ) ≈ p(Θ|Y_i) and combine via (1). With exponential family approximations (see the sketch below),

    q_i(Θ) = ∏_k q_{λ_ik}(θ_k)  ∀i = 1, …, N,    q_0(Θ) = ∏_k q_{λ_0k}(θ_k)

    ∴ q(Θ) = ∏_k q_{λ_k}(θ_k)  where  λ_k = (1 − N) λ_0k + ∑_i λ_ik    (2)
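A minimal, self-contained sketch of the naive combination (2) in the conjugate Gaussian setting of Example 1 below; the names (gaussian_nat, naive_combine) and the toy data are illustrative, not the authors' released code.

```python
import numpy as np

def gaussian_nat(m, v):
    """Natural parameters of a 1-D Gaussian N(m, v): (m/v, -1/(2v))."""
    return np.array([m / v, -1.0 / (2.0 * v)])

def naive_combine(eta0, etas):
    """Naive decentralized combination (Eq. 2): for exponential family
    densities, multiplication adds natural parameters, so
    eta = (1 - N) * eta_prior + sum_i eta_i."""
    N = len(etas)
    return (1 - N) * eta0 + sum(etas)

# Toy run mirroring Example 1: N agents each observe data from N(mu, sigma2)
# with known sigma2 and a conjugate N(0, 1) prior on mu.
rng = np.random.default_rng(0)
mu_true, sigma2, N = 1.0, 4.0, 10
eta0 = gaussian_nat(0.0, 1.0)

etas = []
for _ in range(N):
    y = rng.normal(mu_true, np.sqrt(sigma2), size=10)
    post_v = 1.0 / (1.0 + len(y) / sigma2)    # exact conjugate posterior
    post_m = post_v * (y.sum() / sigma2)      # for this agent's subset
    etas.append(gaussian_nat(post_m, post_v))

eta = naive_combine(eta0, etas)
m, v = -eta[0] / (2.0 * eta[1]), -1.0 / (2.0 * eta[1])
print(f"decentralized posterior on mu: mean={m:.3f}, var={v:.4f}")
```

Because each local posterior here is exact and conjugate, the combined natural parameters equal those of the batch posterior, which is why the naive approach works for this model.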
Example 1: Decentralized Gaussian Inference

• Works well for this simple model (N = 10 agents, 100 observations)

Figure: (1a): Batch posterior of µ in black, with histogram of observed data. (1b): Decentralized posterior of µ in black, individual posteriors in color and correspondingly colored histogram of observed data.
Example 2: Decentralized 3-Component GMM Inference

• Fails for more complex models (N = 10 agents, 30 observations)

Figure: (2a): Samples from the true posterior over µ, π. Particle position on the simplex (with π3 = 1 − π1 − π2) represents the weights, while RGB color coordinates represent the three means. (2b): Samples from the naïvely constructed decentralized approximate posterior. Panel titles: (a) Batch, (b) Naïve Decentralized; axes π1, π2.
Why Naı̈ve Fails: Approximate Inference Breaks Symmetry
• The local q_i(Θ) are missing the permutation symmetry, present in unsupervised models, that is required to combine them via (1)
Figure: Batch variational Bayes approximate posterior samples from 5 random restarts. Comparison to Figure 2a shows that the approximate inference algorithm tends to converge to a random component of the permutation symmetry in the true posterior.

Approximate Merging of Posteriors with Symmetry (AMPS)

Step 1: Artificial Reintroduction of Symmetry

• Symmetrize the approximate posteriors by summing over permutations P (sketched below):

    q̃_i(Θ) ∝ ∑_P ∏_k q_{Pλ_ik}(θ_k)    (3)

• This reintroduces the original model structure, i.e. for any permutation P,

    q̃_i(Pθ_1, …, Pθ_K) = q̃_i(θ_1, …, θ_K)    (4)

Figure: Samples from the symmetrized batch variational Bayes approximate posterior from 5 random restarts. Comparison to Figure 2a shows that symmetrization reintroduces the structure of the true posterior to the approximate posteriors.
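A minimal sketch of the symmetrized density (3) for small K, enumerating all K! permutations explicitly; log_component and the parameter layout (one row of natural parameters per component) are assumptions made for illustration.

```python
import itertools
import numpy as np

def log_symmetrized(theta, lam_i, log_component):
    """Unnormalized log-density of the symmetrized posterior q~_i in Eq. (3):
    a sum over all K! relabelings of agent i's component parameters.
      theta:         length-K sequence of component parameter values
      lam_i:         (K, d) array, one row of natural parameters per component
      log_component: callable giving log q_lambda(theta_k) for one component"""
    K = lam_i.shape[0]
    terms = [
        sum(log_component(lam_i[perm[k]], theta[k]) for k in range(K))
        for perm in itertools.permutations(range(K))
    ]
    return np.logaddexp.reduce(terms)  # log of the sum over permutations
```

AMPS never instantiates this sum in practice; Steps 2–3 below optimize over the permutations instead.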
Step 2: Combine Using Bayes' Rule

• Apply (1) as before, with q̃_i instead of q_i:

    q(·) ∝ q_0(·)^{1−N} ∏_i q̃_i(·) = ∑_{{P_i}_i} ∏_k [ q_{λ_0k}(θ_k)^{1−N} ∏_i q_{P_i λ_ik}(θ_k) ]    (5)

• This distribution is intractable to use directly; it contains (K!)^N terms
Step 3: Optimize Out Unnecessary Structure

• Solve a combinatorial optimization over the permutation matrices {P_i}_{i=1}^N (a local-search sketch follows below):

    {P_i^*}_{i=1}^N ← arg max  ∑_k A_k( (1 − N) λ_0k + ∑_i P_i λ_ik )    (6)
                       s.t.  P_i ∈ S  ∀i

  where A_k(·) is the log-partition function for parameter θ_k
• Solving (6) is equivalent to finding the maximum-weight component of (5)
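The synthetic experiment below solves (6) by swapping rows of each P_i until no swap improves the objective. A minimal sketch of that local search, assuming Dirichlet components stored as concentration-parameter vectors; the names (objective, local_search, dirichlet_A) are illustrative, not the released implementation.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_A(lam):
    """Log-partition of a Dirichlet component with concentrations lam."""
    return gammaln(lam).sum() - gammaln(lam.sum())

def objective(perms, lam0, lams, A=dirichlet_A):
    """Eq. (6): sum_k A((1 - N) lam0[k] + sum_i lam_i[perm_i[k]])."""
    N = len(lams)
    combined = (1 - N) * lam0 + sum(lam[perm] for lam, perm in zip(lams, perms))
    return sum(A(row) for row in combined)

def local_search(lam0, lams, A=dirichlet_A):
    """Greedy label-swap search over {P_i}: accept any pairwise row swap
    of an agent's permutation that increases the objective, until none does."""
    K = lam0.shape[0]
    perms = [np.arange(K) for _ in lams]
    best, improved = objective(perms, lam0, lams, A), True
    while improved:
        improved = False
        for perm in perms:
            for a in range(K):
                for b in range(a + 1, K):
                    perm[a], perm[b] = perm[b], perm[a]
                    val = objective(perms, lam0, lams, A)
                    if val > best:
                        best, improved = val, True
                    else:
                        perm[a], perm[b] = perm[b], perm[a]  # revert the swap
    return perms, best
```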
Step 4: Output the Decentralized Posterior

    q^*(Θ) = ∏_k q_{λ_k^*}(θ_k)  where  λ_k^* = (1 − N) λ_0k + ∑_i P_i^* λ_ik    (7)
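Continuing the Step 3 sketch above (same illustrative names), forming λ^* in (7) is a one-line reduction once the permutations are optimized:

```python
import numpy as np

# Example: a flat prior and three agents' Dirichlet concentrations over K = 3 labels.
lam0 = np.ones((3, 3))
lams = [np.array([[9., 1., 1.], [1., 9., 1.], [1., 1., 9.]]),
        np.array([[1., 9., 1.], [9., 1., 1.], [1., 1., 9.]]),   # labels 0/1 swapped
        np.array([[1., 1., 9.], [9., 1., 1.], [1., 9., 1.]])]   # labels rotated

perms, _ = local_search(lam0, lams)          # Step 3 sketch above
N = len(lams)
lam_star = (1 - N) * lam0 + sum(lam[perm] for lam, perm in zip(lams, perms))
print(lam_star)   # rows re-aligned so matching components reinforce each other
```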
Experiments

Synthetic: Mixture Model, Revisited

• Solve (6) by swapping rows of P_i until no swap improves the objective
• AMPS improves the model significantly w.r.t. individual agent models
• Works well despite high individual agent uncertainty
Figure: (5a): Samples from a typical individual posterior distribution q_i(Θ) from variational Bayesian inference. (5b): Samples from the decentralized posterior output by AMPS. Comparison to Figure 5a shows that the AMPS posterior merging procedure improves the posterior possessed by each agent significantly. (5c): Samples from the symmetrized decentralized posterior, for comparison to Figure 2a. (Axes: π1, π2.)
Real Data: Latent Dirichlet Allocation
• 20 Newsgroups dataset – the model symmetry is in the ordering of topics
• Solve (6) by iterating max-weight bipartite matching problems on the rows of P_i (see the sketch after this list)
• AMPS improves test log-likelihood over individual agents & batch inference
• Each agent solves a smaller inference problem, making it less susceptible to local optima
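A minimal sketch of one such matching sweep: with the other agents' permutations held fixed, the objective in (6) is linear in agent i's assignment of components to slots, so each P_i can be updated by a max-weight bipartite matching (here via scipy's linear_sum_assignment). The Dirichlet log-partition and all names are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import gammaln

def dirichlet_A(lam):
    """Log-partition of a Dirichlet component (e.g., a topic's word distribution)."""
    return gammaln(lam).sum() - gammaln(lam.sum())

def matching_sweep(lam0, lams, perms, A=dirichlet_A):
    """One coordinate-ascent sweep on Eq. (6): update each agent's permutation
    by a max-weight bipartite matching, holding the other agents fixed."""
    N, K = len(lams), lam0.shape[0]
    for i in range(N):
        # Combined parameters with agent i left out.
        base = (1 - N) * lam0 + sum(lam[perm] for j, (lam, perm)
                                    in enumerate(zip(lams, perms)) if j != i)
        # W[j, k]: objective contribution if agent i's component j fills slot k.
        W = np.array([[A(base[k] + lams[i][j]) for k in range(K)]
                      for j in range(K)])
        row, col = linear_sum_assignment(W, maximize=True)
        perm = np.empty(K, dtype=int)
        perm[col] = row            # slot k is filled by agent i's component row
        perms[i] = perm
    return perms
```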
Figure: Test log likelihood on held-out data vs. time (s) for Batch, Individual, and AMPS with 5 (6a), 10 (6b), and 50 (6c) learning agents. The boxes bound the 25th and 75th percentiles of the data with medians shown in the interior, and whiskers show the maximum and minimum values.
Hierarchical/Streaming AMPS

• AMPS naturally extends to streaming data (20 Newsgroups)
• Improves test log-likelihood over batch and SDA-Bayes (Broderick, NIPS '13)
• AMPS10x10 is effectively 100 agents with hierarchical optimization
• Hierarchical optimization reduces CPU time without sacrificing quality

Figure: Comparison with SDA-Bayes (test log likelihood vs. time (s); legend: Batch, AMPS1x10, AMPS10x1, AMPS10x10, SDA1x10, SDA10x1, SDA10x10). The A × B in the legend names refer to using A learning agents, where each splits their individual batches of data into B streaming subbatches.
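The poster does not spell out the hierarchical scheme; purely as an illustration (not the authors' method), one plausible two-level reuse of the merge machinery, assuming each agent first merges its own streaming subbatch posteriors and the per-agent results are then merged across agents:

```python
import numpy as np

def merge(lam0, lams, perms):
    """Merged natural parameters of Eq. (7) for a given set of permutations."""
    N = len(lams)
    return (1 - N) * lam0 + sum(lam[perm] for lam, perm in zip(lams, perms))

def hierarchical_merge(lam0, per_agent_lams, optimize):
    """Two-level merge: align and merge each agent's streaming subbatch
    posteriors first, then align and merge the per-agent results.
    `optimize(lam0, lams)` should return permutations (approximately) solving
    Eq. (6), e.g. optimize = lambda l0, ls: local_search(l0, ls)[0]."""
    agent_level = [
        merge(lam0, lams, optimize(lam0, lams)) for lams in per_agent_lams
    ]
    return merge(lam0, agent_level, optimize(lam0, agent_level))
```

The two-level result matches a flat merge of all subbatch posteriors, since the prior-correction terms telescope, but the inner matchings are over far fewer posteriors at a time.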
Conclusions
• AMPS provides ad-hoc, decentralized Bayesian learning
• Easy to derive and implement for exponential family approximations
• Surprisingly, it can improve the model vs. batch/distributed methods

Paper: http://arxiv.org/abs/1403.7471
Code: http://github.com/trevorcampbell/amps/
ONR MURI N000141110688 E-Mail: [email protected]