Approximate Decentralized Bayesian Inference
Trevor Campbell, Jonathan P. How • Laboratory for Information and Decision Systems, MIT

Introduction
- Problem: Unsupervised learning in conditionally independent Bayesian models over a decentralized, ad-hoc network
- Challenge: Approximate inference breaks the model structure required to combine posteriors

The Naïve Approach

Bayes' Approximate Posterior Combination
- Latent parameters Θ = {θ_k}_{k=1}^K; the data Y are split into N subsets {Y_i}_{i=1}^N, one per learning agent:

    p(\Theta \mid Y) \propto p(\Theta)\, p(Y \mid \Theta) = p(\Theta) \prod_{i=1}^{N} p(Y_i \mid \Theta) \propto p(\Theta)^{1-N} \prod_{i=1}^{N} p(\Theta \mid Y_i)    (1)

- Compute local variational posteriors q_i(Θ) ≈ p(Θ|Y_i) and combine via (1):

    q_i(\Theta) = \prod_k q_{\lambda_{ik}}(\theta_k) \quad \forall i = 1, \dots, N, \qquad q_0(\Theta) = \prod_k q_{\lambda_{0k}}(\theta_k)
    \therefore \; q(\Theta) = \prod_k q_{\lambda_k}(\theta_k) \quad \text{where} \quad \lambda_k = (1 - N)\lambda_{0k} + \sum_i \lambda_{ik}    (2)

Example 1: Decentralized Gaussian Inference
- Works well for this simple model (N = 10 agents, 100 observations)
Figure: (1a) Batch posterior of µ in black, with histogram of observed data. (1b) Decentralized posterior of µ in black, individual posteriors in color with correspondingly colored histograms of observed data.

Example 2: Decentralized 3-Component GMM Inference
- Fails for more complex models (N = 10 agents, 30 observations)
Figure: (2a) Batch: samples from the true posterior over µ, π. Particle position on the simplex (with π3 = 1 − π1 − π2) represents the weights, while RGB color coordinates represent the three means. (2b) Naïve decentralized: samples from the naïvely constructed decentralized approximate posterior.

Why Naïve Fails: Approximate Inference Breaks Symmetry
- The local posteriors q_i(Θ) are missing the symmetry in unsupervised models required to combine via (1)
Figure: Batch variational Bayes approximate posterior samples from 5 random restarts. Comparison to Figure 2a shows that the approximate inference algorithm tends to converge to a random component of the permutation symmetry in the true posterior.

Approximate Merging of Posteriors with Symmetry (AMPS)

Step 1: Artificial Reintroduction of Symmetry
- Symmetrize the approximate posteriors by summing over permutations P:

    \tilde q_i(\Theta) \propto \sum_P \prod_k q_{P\lambda_{ik}}(\theta_k)    (3)

- This reintroduces the original model structure, i.e. for any permutation P,

    \tilde q_i(P\theta_1, \dots, P\theta_K) = \tilde q_i(\theta_1, \dots, \theta_K)    (4)

Figure: Samples from the symmetrized batch variational Bayes approximate posterior from 5 random restarts. Comparison to Figure 2a shows that symmetrization reintroduces the structure of the true posterior to the approximate posteriors.

Step 2: Combine Using Bayes' Rule
- Apply (1) as before with q̃_i instead of q_i:

    \tilde q(\Theta) \propto q_0(\Theta)^{1-N} \prod_i \tilde q_i(\Theta) = \sum_{\{P_i\}_i} \prod_k q_{\lambda_{0k}}(\theta_k)^{1-N} \prod_i q_{P_i \lambda_{ik}}(\theta_k)    (5)

- This distribution is intractable to use directly; it contains (K!)^N terms

Step 3: Optimize Out Unnecessary Structure
- Solve a combinatorial optimization over the permutation matrices {P_i}_{i=1}^N:

    \{P_i^\star\}_{i=1}^N \leftarrow \arg\max_{\{P_i\}} \sum_k A_k\!\left((1 - N)\lambda_{0k} + \sum_i P_i \lambda_{ik}\right) \quad \text{s.t. } P_i \in S \;\; \forall i    (6)

  where A_k(·) is the log-partition function for parameter θ_k and S is the set of permutation matrices
- Solving (6) is equivalent to finding the maximum-weight component of (5) (see the sketch after Step 4)

Step 4: Output the Decentralized Posterior

    q^\star(\Theta) = \prod_k q_{\lambda_k^\star}(\theta_k) \quad \text{where} \quad \lambda_k^\star = (1 - N)\lambda_{0k} + \sum_i P_i^\star \lambda_{ik}    (7)
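With the permutations of all other agents held fixed, the maximization in (6) over a single P_i decomposes across the K global component slots, so the best P_i is exactly a maximum-weight bipartite matching; sweeping over the agents gives a block coordinate ascent on (6), and (7) is then read off from the final permutations. The sketch below illustrates this scheme; it is not the reference implementation from the linked repository. It assumes each agent's variational parameters are stored as a K x D array of exponential-family natural parameters (one row per latent component), and `amps_merge`, `lam0`, `lams`, and the model-specific `log_partition` callable for A_k(·) are hypothetical names introduced here.

```python
# A minimal sketch of the AMPS merge (Steps 3-4); names and data layout are
# assumptions, not the interface of the released AMPS code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def amps_merge(lam0, lams, log_partition, max_sweeps=50):
    """Approximately solve (6) by block coordinate ascent over agents.

    lam0          : (K, D) prior natural parameters, one row per component.
    lams          : list of N (K, D) arrays of agent natural parameters.
    log_partition : callable, A(.) for a single row of natural parameters.
    """
    N, K = len(lams), lam0.shape[0]
    perms = [np.arange(K) for _ in range(N)]   # row permutations P_i as index maps
    for _ in range(max_sweeps):
        changed = False
        for i in range(N):
            # Merged natural parameters excluding agent i, including the
            # (1 - N) * lam0 prior correction from (2) and (6).
            rest = (1.0 - N) * lam0 + sum(
                lams[j][perms[j]] for j in range(N) if j != i)
            # W[k, c] = A(rest_k + lambda_{i,c}): objective contribution of
            # assigning agent i's component c to global slot k.
            W = np.array([[log_partition(rest[k] + lams[i][c])
                           for c in range(K)] for k in range(K)])
            rows, cols = linear_sum_assignment(W, maximize=True)
            new_perm = cols[np.argsort(rows)]
            if not np.array_equal(new_perm, perms[i]):
                perms[i], changed = new_perm, True
        if not changed:                        # no agent update helps: local optimum
            break
    # Step 4, eq. (7): natural parameters of the decentralized posterior.
    lam_star = (1.0 - N) * lam0 + sum(lams[i][perms[i]] for i in range(N))
    return lam_star, perms
```

Because scipy.optimize.linear_sum_assignment solves each K x K matching exactly, no sweep can decrease the objective, so the loop terminates at a local optimum of (6); the experiments below attack the same objective with this matching scheme for LDA and with simple row swaps for the small mixture model.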
Experiments

Synthetic: Mixture Model, Revisited
- Solve (6) by swapping rows of P_i until no swap improves the objective
- AMPS improves the model significantly w.r.t. the individual agent models
- Works well despite high individual agent uncertainty
Figure: (5a) Samples from a typical individual posterior distribution q_i(Θ) from variational Bayesian inference. (5b) Samples from the decentralized posterior output by AMPS. Comparison to Figure 5a shows that the AMPS posterior merging procedure improves the posterior possessed by each agent significantly. (5c) Samples from the symmetrized decentralized posterior, for comparison to Figure 2a.

Real Data: Latent Dirichlet Allocation
- 20 Newsgroups dataset; the model symmetry is in the ordering of topics
- Solve (6) by iterating max-weight bipartite matching problems on the rows of P_i
- AMPS improves test log-likelihood over individual agents and batch inference
- AMPS solves smaller inference problems, so it is less susceptible to local optima
Figure: Plots of log likelihood on the test data (test log likelihood vs. time in seconds; curves: Batch, Individual, AMPS) for 5 (6a), 10 (6b), and 50 (6c) learning agents. The boxes bound the 25th and 75th percentiles of the data with medians shown in the interior, and whiskers show the maximum and minimum values.

Hierarchical/Streaming AMPS
- AMPS naturally extends to streaming data (20 Newsgroups)
- Improves test log-likelihood over batch inference and SDA-Bayes (Broderick et al., NIPS '13)
- AMPS10x10 is effectively 100 agents with hierarchical optimization (a sketch of the two-level merge follows below)
- Hierarchical optimization reduces CPU time without sacrificing quality
Figure: Comparison with SDA-Bayes (test log likelihood vs. time in seconds; curves: Batch, AMPS1x10, AMPS10x1, AMPS10x10, SDA1x10, SDA10x1, SDA10x10). The A×B in the legend names refers to using A learning agents, each of which splits its individual batch of data into B streaming subbatches.
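The two-level runs above (e.g., AMPS10x10) are illustrated by the sketch below, which reuses the hypothetical `amps_merge` from the earlier sketch; the fixed group size and the grouping by index are assumptions, since the poster does not spell out the hierarchical scheme. If the prior is symmetric across components (identical rows of λ_0), the (1 − N) prior corrections telescope, so merging the group-level posteriors again with the same rule yields a result of the same form as (7) for the full set of agents.

```python
# A minimal sketch of two-level (hierarchical) merging; amps_merge is the
# hypothetical helper defined in the earlier sketch, and the grouping
# strategy here is an assumption for illustration only.
def hierarchical_amps(lam0, lams, log_partition, group_size=10):
    # Level 1: merge the local posteriors within each group of agents.
    groups = [lams[g:g + group_size] for g in range(0, len(lams), group_size)]
    group_lams = [amps_merge(lam0, group, log_partition)[0] for group in groups]
    # Level 2: merge the group-level posteriors as if each were one agent.
    # With a component-symmetric prior, the prior corrections telescope to
    # the (1 - N_total) factor required by (7).
    return amps_merge(lam0, group_lams, log_partition)[0]
```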
Conclusions
- AMPS provides ad-hoc, decentralized Bayesian learning
- Easy to derive and implement for exponential family approximations
- Surprisingly, it can improve the model vs. batch/distributed methods

Paper: http://arxiv.org/abs/1403.7471
Code: http://github.com/trevorcampbell/amps/

ONR MURI N000141110688 • E-Mail: [email protected]