Training sparse natural image models with a fast Gibbs sampler of an extended state space

Lucas Theis (1), Jascha Sohl-Dickstein (2), Matthias Bethge (1, 3, 4)

(1) Werner Reichardt Centre for Integrative Neuroscience, Tübingen
(2) Redwood Center for Theoretical Neuroscience, Berkeley
(3) Max Planck Institute for Biological Cybernetics, Tübingen
(4) Bernstein Center for Computational Neuroscience, Tübingen

Introduction

We present an efficient inference and optimization algorithm for maximum likelihood learning in the overcomplete linear model (OLM). Using a persistent variant of expectation maximization, we find that using overcomplete representations significantly improves the performance of the linear model when applied to natural images, whereas most previous studies were unable to detect an advantage [1, 2, 3, 4].

Overcomplete linear model

x = A s, \qquad p(s) = \prod_i f_i(s_i)

Here, A \in \mathbb{R}^{M \times N} and M \le N. Note that we do not assume any additive noise on the visible units.

We use Gaussian scale mixtures (GSMs) to represent the source distributions. This family contains the Laplace, Student-t, Cauchy and exponential power distributions as special cases:

f_i(s_i) = \int g_i(\lambda_i) \, \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, d\lambda_i

Inference

The posterior distribution over the latent source variables is constrained to a linear subspace and can have multiple modes with heavy tails.

[Figure: the posterior p(s | x) is confined to the subspace defined by A s = x; p(z | x) is the corresponding distribution over the auxiliary variables z.]

Persistent EM

For maximum likelihood learning, we use a Monte Carlo variant of expectation maximization (EM) in which the data is completed in each E-step by sampling from the posterior. To further speed up learning, we initialize the Markov chain with the samples of the previous iteration.

1. initialize \theta
2. repeat
3.   z \sim T(\cdot; z, x)
4.   maximize \log p(x, z \mid \theta)

This algorithm works and will converge because each iteration can only improve a lower bound on the log-likelihood,

F[q, \theta] = \log p(x \mid \theta) - D_{\mathrm{KL}}[q(z \mid x) \,\|\, p(z \mid x, \theta)] \le \log p(x \mid \theta),

since applying the posterior transition operator T can only reduce the KL divergence, so that F[Tq, \theta] \ge F[q, \theta].

Blocked Gibbs sampling

We introduce two sets of auxiliary variables. One set, z, represents the missing visible variables [5],

\begin{pmatrix} x \\ z \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix} s, \qquad s = A^+ x + B^+ z.

The other set, \lambda, represents the precisions of the GSM source distributions. Our blocked Gibbs sampler alternately samples z and \lambda:

p(z \mid x, \lambda) = \mathcal{N}(z; \mu_{z|x}, \Sigma_{z|x}),
p(\lambda \mid x, z) \propto \prod_i \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, g_i(\lambda_i).
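To make the sampler concrete, the following sketch implements one blocked Gibbs sweep for a single image patch. It is not the authors' released code: the Gamma mixing density over precisions (which makes the sources Student-t distributed), the choice of B as an orthonormal basis of the null space of A, and the direct evaluation of the Gaussian conditional for z are illustrative assumptions; the fast sampler described on this poster may exploit additional structure.

```python
import numpy as np
from scipy.linalg import null_space

def gibbs_sweep(x, z, A, B, a=2.0, b=2.0, rng=None):
    """One blocked Gibbs sweep: sample lambda | x, z, then z | x, lambda.

    x    : observed image patch, shape (M,)
    z    : current auxiliary variables, shape (N - M,)
    A    : basis of the OLM, shape (M, N) with M < N
    B    : orthonormal basis of the null space of A, shape (N - M, N)
    a, b : shape and rate of the assumed Gamma prior over precisions
    """
    rng = np.random.default_rng() if rng is None else rng

    # Recover the sources from the extended state (x, z): s = A+ x + B^T z.
    s = np.linalg.pinv(A) @ x + B.T @ z

    # 1) lambda | x, z: with a Gamma(a, b) mixing density over each precision,
    #    conjugacy gives lambda_i | s_i ~ Gamma(a + 1/2, b + s_i^2 / 2).
    lam = rng.gamma(a + 0.5, 1.0 / (b + 0.5 * s ** 2))

    # 2) z | x, lambda: under s ~ N(0, diag(1/lambda)), the pair (x, z) =
    #    (A s, B s) is jointly Gaussian, so z | x is Gaussian as well.
    P = np.diag(1.0 / lam)                 # prior covariance of the sources
    Cxx = A @ P @ A.T
    Czx = B @ P @ A.T
    Czz = B @ P @ B.T
    K = np.linalg.solve(Cxx, Czx.T).T      # gain Czx Cxx^{-1}
    mu = K @ x
    Sigma = Czz - K @ Czx.T
    Sigma = 0.5 * (Sigma + Sigma.T)        # symmetrize for numerical stability
    z = rng.multivariate_normal(mu, Sigma)

    return z, lam

# Toy usage: a random 2x overcomplete basis for 4-dimensional "patches".
rng = np.random.default_rng(0)
M, N = 4, 8
A = rng.standard_normal((M, N))
B = null_space(A).T                        # rows span the null space of A
x = A @ rng.standard_normal(N)
z = np.zeros(N - M)
for _ in range(100):
    z, lam = gibbs_sweep(x, z, A, B, rng=rng)
```

With a Gamma mixing density the precision update is conjugate, which is what makes the lambda-step an elementwise Gamma draw; the z-step is an exact Gaussian conditional, so no step-size tuning is required.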
Performance of the blocked Gibbs sampler

We compare the performance of the Gibbs sampler to the performance of Hamiltonian Monte Carlo (HMC) sampling.

[Figure: trace plots of the average posterior energy and autocorrelation functions over time for MALA, HMC and the Gibbs sampler, on a toy model and on the image model.]

The upper plots show trace plots of the posterior density, the lower plots show autocorrelation functions averaged over several data points,

R(\tau) = \frac{\mathbb{E}[(z_t - \mu)^\top (z_{t+\tau} - \mu) \mid x]}{\mathbb{E}[(z_t - \mu)^\top (z_t - \mu) \mid x]}.

Likelihood estimation

We use a form of importance sampling (annealed importance sampling) to estimate the likelihood of the model. This yields an unbiased estimator of the likelihood and a conservative estimator of the log-likelihood (a schematic implementation is sketched after the references below):

p(x) = \int \frac{p(x, z)}{q(z \mid x)} \, q(z \mid x) \, dz \approx \frac{1}{N} \sum_n \frac{p(x, z_n)}{q(z_n \mid x)}, \qquad z_n \sim q(z \mid x).

Model comparison

We compared the performance of the OLM with a product of experts (PoE) model, which represents another generalization of the linear model to overcomplete representations,

s = W x, \qquad p(x) \propto \prod_i f_i(s_i).

Here, we use a product of Student-t distributions.

[Figure: log-likelihood ± SEM [bit/pixel] as a function of overcompleteness (1x to 4x) for 8 × 8 and 16 × 16 image patches.]

Overcompleteness and sparsity

When applied to natural image patches, the model learns highly sparse source distributions and three to four times overcomplete representations.

[Figure: basis vector norms ||a_i|| over basis coefficients i = 1, ..., 256 for Laplace (2x) and GSM (2x, 3x, 4x) models.]

Conclusions and remarks

• Efficient maximum likelihood learning in overcomplete linear models is possible and enables us to jointly optimize filters and sources.
• A linear model for natural images can benefit from overcomplete representations if the source distributions are highly leptokurtotic.
• The presented algorithm can easily be extended to more powerful models of natural images such as subspace or bilinear models.

Resources

Code for training and evaluating overcomplete linear models:
http://bethgelab.org/code/theis2012d/

References

[1] M. Lewicki and B. A. Olshausen, JOSA A, 1999
[2] P. Berkes, R. Turner, and M. Sahani, NIPS 21, 2007
[3] M. Seeger, JMLR, 2008
[4] D. Zoran and Y. Weiss, NIPS 25, 2012
[5] R.-B. Chen and Y. N. Wu, Computational Statistics & Data Analysis, 2007

Acknowledgements

This study was financially supported through the Bernstein award (BMBF; FKZ: 01GQ0601), the German Research Foundation (DFG; priority program 1527, Sachbeihilfe BE 3848/2-1), and a DFG-NSF collaboration grant (TO 409/8-1).
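The sketch below spells out the plain importance-sampling estimate from the likelihood estimation section above; annealed importance sampling refines it by moving the samples through a sequence of intermediate distributions, which is omitted here. The Gaussian toy model used to check the estimator is purely illustrative and not part of the poster.

```python
import numpy as np

def log_likelihood_estimate(x, log_joint, sample_q, log_q, n_samples=1000):
    """Importance-sampling estimate of log p(x).

    log_joint(x, z) : log p(x, z) under the model
    sample_q(x, n)  : draws n samples from the proposal q(z | x)
    log_q(x, z)     : log q(z | x)

    The mean of the importance weights is unbiased for p(x); taking the log
    afterwards gives a conservative (downward-biased) estimate of log p(x)
    by Jensen's inequality.
    """
    zs = sample_q(x, n_samples)
    log_w = np.array([log_joint(x, z) - log_q(x, z) for z in zs])
    m = log_w.max()                                   # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))

# Toy check: p(x, z) = N(x; z, 1) N(z; 0, 1), hence p(x) = N(x; 0, 2).
def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

log_joint = lambda x, z: log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
sample_q = lambda x, n: x / 2 + np.random.default_rng(1).standard_normal(n)
log_q = lambda x, z: log_normal(z, x / 2, 1.0)

estimate = log_likelihood_estimate(1.5, log_joint, sample_q, log_q)
print(estimate, log_normal(1.5, 0.0, 2.0))            # the two should be close
```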