
Training sparse natural image models with a fast Gibbs sampler of an extended state space
Lucas Theis¹, Jascha Sohl-Dickstein², Matthias Bethge¹,³,⁴
1 Werner Reichardt Centre for Integrative Neuroscience, Tübingen
2 Redwood Center for Theoretical Neuroscience, Berkeley
3 Max Planck Institute for Biological Cybernetics, Tübingen
4 Bernstein Center for Computational Neuroscience, Tübingen

Introduction

We present an efficient inference and optimization algorithm for maximum likelihood learning in the overcomplete linear model (OLM). Using a persistent variant of expectation maximization, we find that using overcomplete representations significantly improves the performance of the linear model when applied to natural images, whereas most previous studies were unable to detect an advantage [1, 2, 3, 4].

Overcomplete linear model

x = As, \qquad p(s) = \prod_i f_i(s_i)

Here, A \in \mathbb{R}^{M \times N} and M \le N. Note that we don't assume any additive noise on the visible units.

We use Gaussian scale mixtures (GSMs) to represent the source distributions. This family contains the Laplace, Student-t, Cauchy and exponential power distributions as special cases:

f_i(s_i) = \int_0^\infty g_i(\lambda_i) \, \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, d\lambda_i
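
Because each source is a Gaussian scale mixture, data can be drawn from the model by first sampling precisions and then Gaussian sources. The sketch below is illustrative only: the gamma mixing densities (which give Student-t-like marginals), the random basis and the dimensions are assumptions, not the settings used on the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 16, 64                                   # x = As with A in R^{M x N}, M <= N
A = rng.standard_normal((M, N)) / np.sqrt(N)    # random (unlearned) basis for illustration

def sample_olm(num_samples, alpha=1.5, beta=1.5):
    """Draw noiseless samples x = As from the overcomplete linear model.

    Each source s_i is a Gaussian scale mixture: a precision lambda_i is drawn
    from the mixing density g_i (here a gamma density), then s_i ~ N(0, 1/lambda_i).
    """
    lam = rng.gamma(alpha, 1.0 / beta, size=(N, num_samples))   # precisions lambda_i
    s = rng.standard_normal((N, num_samples)) / np.sqrt(lam)    # s_i ~ N(0, 1/lambda_i)
    return A @ s                                                # no additive noise on x

X = sample_olm(1000)
print(X.shape)   # (16, 1000)
```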

Inference

The posterior distribution over the latent source variables is constrained to a linear subspace and can have multiple modes with heavy tails.

[Figure: the posterior p(s | x) in source space (s_1, s_2) is concentrated on the linear subspace defined by x = As; the auxiliary variables z parameterize this subspace, with posterior p(z | x).]

Blocked Gibbs sampling

We introduce two sets of auxiliary variables. One set, z, represents the missing visible variables [5]:

\begin{pmatrix} x \\ z \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix} s, \qquad s = A^+ x + B^+ z,

where (A^+ \; B^+) denotes the inverse of the stacked matrix, so that s is uniquely determined by x and z. The other set, \lambda, represents the precisions of the GSM source distributions. Our blocked Gibbs sampler alternately samples z and \lambda from the conditional distributions

p(z \mid x, \lambda) = \mathcal{N}(z; \mu_{z|x}, \Sigma_{z|x}),

p(\lambda \mid x, z) \propto \prod_i \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, g_i(\lambda_i).
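
A minimal sketch of one sweep of such a blocked Gibbs sampler is given below. It assumes gamma mixing densities g_i (so the \lambda update is conjugate) and derives \mu_{z|x} and \Sigma_{z|x} by conditioning the Gaussian prior over s on the constraint As = x; it is not the poster's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(x, z, A, B, alpha=1.5, beta=1.5):
    """One sweep of a blocked Gibbs sampler over (lambda, z) (sketch).

    Assumes gamma mixing densities g_i, so p(lambda | x, z) is again gamma;
    mu_{z|x} and Sigma_{z|x} follow from conditioning the Gaussian prior over s
    (given lambda) on the linear constraint As = x and mapping s to z = Bs.
    """
    W_inv = np.linalg.inv(np.vstack([A, B]))            # [A; B]^{-1} = (A+, B+)
    s = W_inv @ np.concatenate([x, z])                  # s = A+ x + B+ z

    # p(lambda | x, z) prop. to prod_i N(s_i; 0, 1/lambda_i) g_i(lambda_i)
    lam = rng.gamma(alpha + 0.5, 1.0 / (beta + 0.5 * s**2))

    # p(z | x, lambda) = N(z; mu_{z|x}, Sigma_{z|x})
    C = np.diag(1.0 / lam)                              # prior covariance of s given lambda
    K = C @ A.T @ np.linalg.inv(A @ C @ A.T)
    mu_s = K @ x                                        # E[s | x, lambda]
    Sigma_s = C - K @ A @ C                             # Cov[s | x, lambda]
    mu_z, Sigma_z = B @ mu_s, B @ Sigma_s @ B.T
    L = np.linalg.cholesky(Sigma_z + 1e-9 * np.eye(len(mu_z)))
    z = mu_z + L @ rng.standard_normal(len(mu_z))
    return z, lam

# B completes A to an invertible square matrix, e.g. a null-space basis of A
M, N = 16, 64
A = rng.standard_normal((M, N)) / np.sqrt(N)
B = np.linalg.svd(A)[2][M:]
x, z = rng.standard_normal(M), rng.standard_normal(N - M)
for _ in range(100):
    z, lam = gibbs_sweep(x, z, A, B)
```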

Persistent EM

For maximum likelihood learning, we use a Monte Carlo variant of expectation maximization (EM) where in each E-step the data is completed by sampling from the posterior. To further speed up learning, we initialize the Markov chain with the samples of the previous iteration.

1. initialize z
2. repeat
3.   z \sim T(\cdot\,; z, x)
4.   maximize \log p(x, z \mid \theta)

This algorithm works and will converge because each iteration can only improve a lower bound on the log-likelihood,

F[q, \theta] = \log p(x \mid \theta) - D_{\mathrm{KL}}[\, q(z \mid x) \,\|\, p(z \mid x, \theta) \,]:

since the transition operator T leaves the posterior invariant, it can only decrease the KL divergence to the posterior, so F[Tq, \theta] \ge F[q, \theta].
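
To make the loop structure concrete, here is a rough sketch of the persistent EM iteration, reusing gibbs_sweep from the sketch above. The parameterization by the square filter matrix V = [A; B]^{-1}, the single gradient step standing in for the M-step, and all hyperparameters are illustrative assumptions rather than the poster's actual optimizer.

```python
import numpy as np

def persistent_em(X, V, num_iter=50, lr=0.01, alpha=1.5, beta=1.5):
    """Persistent Monte Carlo EM (sketch), reusing gibbs_sweep from above.

    V = [A; B]^{-1} is the square filter matrix of the extended model; a single
    gradient step stands in for "maximize log p(x, z | theta)".
    """
    (M, n), N = X.shape, V.shape[0]
    Z = np.zeros((N - M, n))                    # persistent chain states, one per data point
    for _ in range(num_iter):
        W = np.linalg.inv(V)
        A, B = W[:M], W[M:]
        S, Lam = np.empty((N, n)), np.empty((N, n))
        # E-step: advance each chain from its previous state (no re-initialization)
        for j in range(n):
            Z[:, j], Lam[:, j] = gibbs_sweep(X[:, j], Z[:, j], A, B, alpha, beta)
            S[:, j] = V @ np.concatenate([X[:, j], Z[:, j]])
        # M-step: one gradient step on (1/n) sum_j log p(x_j, z_j, lambda_j | V),
        # whose gradient w.r.t. V is V^{-T} - (lambda * s) y^T / n with y = (x; z)
        Y = np.vstack([X, Z])
        V = V + lr * (np.linalg.inv(V).T - (Lam * S) @ Y.T / n)
    return V
```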


Performance of the blocked Gibbs sampler

We compare the performance of the Gibbs sampler to the performance of Hamiltonian Monte Carlo (HMC) sampling.

[Figure: trace plots of the average posterior energy over time (toy model and image model) and the corresponding autocorrelation functions, for MALA, HMC and the blocked Gibbs sampler.]

The upper plots show trace plots of the posterior density, the lower plots show autocorrelation functions averaged over several data points,

R(\tau) = \frac{\mathbb{E}[(z_t - \mu)^\top (z_{t+\tau} - \mu) \mid x]}{\mathbb{E}[(z_t - \mu)^\top (z_t - \mu) \mid x]}
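
A straightforward estimator of R(\tau) from a single chain is sketched below; on the poster the expectations are taken with respect to the posterior given x and the resulting curves are averaged over several data points. The function name and the plain time-average estimator are assumptions.

```python
import numpy as np

def autocorrelation(Z, max_lag=50):
    """Estimate R(tau) from one chain of samples z_t (rows of Z).

    R(tau) = E[(z_t - mu)^T (z_{t+tau} - mu)] / E[(z_t - mu)^T (z_t - mu)],
    with mu and both expectations replaced by averages over the chain.
    """
    D = Z - Z.mean(axis=0)                       # center the chain
    T = D.shape[0]
    denom = np.mean(np.sum(D * D, axis=1))       # average squared norm
    R = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        R[tau] = np.mean(np.sum(D[:T - tau] * D[tau:], axis=1)) / denom
    return R
```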

Likelihood estimation

We use a form of importance sampling (annealed importance sampling) to estimate the likelihood of the model. This yields an unbiased estimator of the likelihood and a conservative estimator of the log-likelihood:

p(x) = \int \frac{p(x, z)}{q(z \mid x)} \, q(z \mid x) \, dz \approx \frac{1}{N} \sum_n \frac{p(x, z_n)}{q(z_n \mid x)}, \qquad z_n \sim q(z \mid x)
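
A minimal sketch of the plain importance sampling version of this estimator is shown below; the callables log_joint and log_proposal are hypothetical placeholders. Annealed importance sampling, as used on the poster, refines this by moving samples through a sequence of intermediate distributions, but the weights play the same role.

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood_estimate(x, z_samples, log_joint, log_proposal):
    """Plain importance sampling estimate of log p(x) (sketch).

    log_joint(x, z) = log p(x, z) and log_proposal(z, x) = log q(z | x) are
    hypothetical callables; z_samples are draws from q(z | x).
    """
    log_w = np.array([log_joint(x, z) - log_proposal(z, x) for z in z_samples])
    # The average weight is an unbiased estimate of p(x); by Jensen's inequality,
    # the log of the average weight is a conservative (downward-biased) estimate
    # of log p(x).
    return logsumexp(log_w) - np.log(len(log_w))
```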

Model comparison

We compared the performance of the OLM with a product of experts (PoE) model, which represents another generalization of the linear model to overcomplete representations:

s = Wx, \qquad p(x) \propto \prod_i f_i(s_i)

Here, we use a product of Student-t distributions.

[Figure: log-likelihood \pm SEM in bit/pixel on 8 \times 8 and 16 \times 16 image patches for 1x to 4x overcomplete OLM and PoE models.]
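
For reference, a sketch of the unnormalized PoE density used in this comparison; the particular Student-t expert parameterization and the shape parameter are illustrative assumptions, not the poster's exact settings.

```python
import numpy as np

def poe_log_density(x, W, alpha=1.5):
    """Unnormalized log-density of a product of Student-t experts (sketch),

        log p(x) = sum_i log f_i(w_i^T x) + const,

    with f_i(s) proportional to (1 + s^2 / 2)^(-alpha). The normalization
    constant is generally intractable for overcomplete W.
    """
    s = W @ x
    return -alpha * np.sum(np.log1p(0.5 * s**2))
```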

Overcompleteness and sparsity

When applied to natural image patches, the model learns highly sparse source distributions and three to four times overcomplete representations.

[Figure: sorted basis vector norms ||a_i|| against the basis coefficient index i for Laplace (2x) and GSM (2x, 3x, 4x) source distributions.]

Conclusions and remarks

• Efficient maximum likelihood learning in overcomplete linear models is possible and enables us to jointly optimize filters and sources.
• A linear model for natural images can benefit from overcomplete representations if the source distributions are highly leptokurtotic.
• The presented algorithm can easily be extended to more powerful models of natural images such as subspace or bilinear models.

Resources

Code for training and evaluating overcomplete linear models:
http://bethgelab.org/code/theis2012d/

References
[1] M. Lewicki and B. A. Olshausen, JOSA A, 1999
[2] P. Berkes, R. Turner, and M. Sahani, NIPS 21, 2007
[3] M. Seeger, JMLR, 2008
[4] D. Zoran and Y. Weiss, NIPS 25, 2012
[5] R.-B. Chen and Y. N. Wu, Computational Statistics & Data Analysis, 2007

Acknowledgements

This study was financially supported through the Bernstein award (BMBF; FKZ: 01GQ0601), the German Research Foundation (DFG; priority program 1527, Sachbeihilfe BE 3848/2-1), and a DFG-NSF collaboration grant (TO 409/8-1).