Efficient Variational Inference in Large-Scale
Bayesian Compressed Sensing
George Papandreou and Alan Yuille
Department of Statistics
University of California, Los Angeles
ICCV Workshop on Information Theory in Computer Vision
November 13, 2011, Barcelona, Spain
Inverse Image Problems
◮ Denoising
◮ Deblurring
◮ Inpainting
The Sparse Linear Model

A hidden vector x ∈ R^N and noisy measurements y ∈ R^M.

Sparse linear model
    P(x; θ) ∝ ∏_{k=1}^K t(g_k^T x)
    P(y|x; θ) = N(y; Hx, σ² I)

[Factor graph: hidden variables x_1, …, x_N connected to sparsity factors g_1, …, g_K and measurement factors h_1, …, h_M.]

◮ Sparsity directions: s = Gx, with G = [g_1^T; …; g_K^T]
◮ Measurement directions: H = [h_1^T; …; h_M^T]
◮ Sparse potential: t(s_k), e.g., Laplacian t(s_k) = e^{−τ_k |s_k|}
◮ Model parameters: θ = (G, H, σ²)
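To make the notation concrete, here is a small Python sketch of the model on a toy 1-D problem. The specific choices (identity H for denoising, first-difference G, a Laplacian potential, and the names H, G, y, sigma, tau) are illustrative assumptions rather than the talk's setup; the later sketches reuse them.

```python
# Hypothetical 1-D instantiation of the sparse linear model (a sketch, not the
# talk's code): H is the identity (denoising), G takes first differences, and
# the sparse potential is Laplacian, t(s_k) = exp(-tau * |s_k|).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N = 256                                    # number of hidden variables, x in R^N

# Measurement operator H (M = N here) and sparsity operator G (K = N - 1 rows).
H = sp.identity(N, format="csr")
G = sp.diags([-np.ones(N - 1), np.ones(N - 1)], offsets=[0, 1],
             shape=(N - 1, N), format="csr")

sigma, tau = 0.1, 10.0                     # noise level and Laplacian scale

# Piecewise-constant ground truth and noisy measurements y = Hx + noise.
x_true = np.repeat(rng.normal(size=8), N // 8)
y = H @ x_true + sigma * rng.normal(size=N)
```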
Deterministic or Probabilistic Modeling?

Deterministic modeling: Standard Compressive Sensing
◮ Find the minimum energy configuration
◮ Same as finding the posterior MAP

Probabilistic modeling: Bayesian Compressive Sensing
◮ Try to capture the full posterior distribution
◮ Suitable for learning parameters by maximum likelihood (ML)
◮ Harder than just a point estimate
Deterministic Modeling

MAP estimate as an optimization problem
The estimate is x̂_MAP = argmin_x φ_MAP(x), where
    φ_MAP(x) = σ^{-2} ‖y − Hx‖² − 2 ∑_{k=1}^K log t(s_k),   s_k = g_k^T x.

Properties
◮ Modern optimization techniques allow us to find x̂_MAP efficiently for large-scale problems.
◮ How much do we trust the solution? What about error bars?
◮ Is the MAP estimate best in terms of PSNR performance?
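A minimal sketch of the MAP primitive for the toy model above, assuming |s| is smoothed to √(s² + ε) so that a generic quasi-Newton solver (L-BFGS) applies; any modern sparse-optimization method could be substituted.

```python
# Sketch of MAP estimation for the toy model above: L-BFGS on phi_MAP with
# |s| smoothed to sqrt(s^2 + eps) so the objective is differentiable.
import numpy as np
from scipy.optimize import minimize

eps = 1e-8

def phi_map(x):
    # phi_MAP(x) = sigma^{-2} ||y - Hx||^2 + 2 tau sum_k |g_k^T x|   (smoothed)
    s = G @ x
    return np.sum((y - H @ x) ** 2) / sigma**2 + 2.0 * tau * np.sum(np.sqrt(s**2 + eps))

def grad_phi_map(x):
    s = G @ x
    return 2.0 * (H.T @ (H @ x - y)) / sigma**2 + 2.0 * tau * (G.T @ (s / np.sqrt(s**2 + eps)))

res = minimize(phi_map, x0=np.zeros(N), jac=grad_phi_map, method="L-BFGS-B")
x_map = res.x                               # x_hat_MAP
```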
Probabilistic Modeling

Work with the full posterior distribution
    P(x|y) ∝ N(y; Hx, σ² I) ∏_{k=1}^K t(g_k^T x)
    (posterior ∝ Gaussian measure × sparse prior)

(Figure from Seeger & Wipf, '10)
Probabilistic Modeling

Markov Chain Monte-Carlo vs. Variational Bayes

Markov Chain Monte-Carlo
◮ Draw samples from the posterior
◮ Typically model the prior with Gaussian mixtures and perform block Gibbs sampling.
◮ Very general, but can be slow and convergence is difficult to monitor
◮ [Schmidt, Rao & Roth, '10], [Papandreou & Yuille, '10], ...

Variational Bayes
◮ Approximate the posterior distribution with a tractable parametric form
◮ Systematic error, but often guaranteed convergence
◮ [Attias, '99], [Girolami, '01], [Lewicki & Sejnowski, '00], [Palmer et al., '05], [Levin et al., '11], [Seeger & Nickisch, '11], ...
Variational Bounding

◮ Approximate the posterior distribution with a Gaussian
    Q(x|y) ∝ N(y; Hx, σ² I) e^{−(1/2) s^T Γ^{-1} s} = N(x; x̂_Q, A^{-1}),
  with x̂_Q = A^{-1} b,   A = σ^{-2} H^T H + G^T Γ^{-1} G,   b = σ^{-2} H^T y,   Γ = diag(γ).
◮ Suitable for super-Gaussian priors
    t(s_k) = sup_{γ_k > 0} e^{−s_k²/(2γ_k) − h_k(γ_k)/2}
◮ Optimization problem: find the variational parameters γ that give the tightest fit.
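For a fixed γ, the Gaussian bound is fully specified by A and b; the sketch below forms them explicitly for the toy problem. Here γ = 1 is only a stand-in value, and the dense inverse is used because the toy N is small.

```python
# Sketch of the Gaussian variational posterior Q(x|y) for a fixed gamma,
# using dense linear algebra (only feasible at this toy scale).
import numpy as np

K = G.shape[0]
gamma = np.ones(K)                                       # stand-in variational parameters gamma_k > 0

A = ((H.T @ H) / sigma**2 + G.T @ sp.diags(1.0 / gamma) @ G).toarray()
b = (H.T @ y) / sigma**2
x_hat_Q = np.linalg.solve(A, b)                          # variational mean A^{-1} b
Sigma_Q = np.linalg.inv(A)                               # covariance A^{-1}; hopeless at large N
```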
Variational Bounding: Double-Loop Algorithm

Outer Loop: Variance Computation
Compute z = diag(G A^{-1} G^T), i.e., the vector of variances z_k = Var_Q(s_k | y) along the sparsity directions s_k = g_k^T x.

Inner Loop: Smoothed Estimation
◮ Obtain the variational mean x̂_Q = argmin_x φ_Q(x; z), where
    φ_Q(x; z) = σ^{-2} ‖y − Hx‖² − 2 ∑_{k=1}^K log t((s_k² + z_k)^{1/2})
◮ Update the variational parameters
    γ_k^{-1} = −2 [d log t(√v) / dv]_{v = ŝ_k² + z_k}

Convex if standard MAP is convex. See [Seeger & Nickisch, '11].
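A hedged sketch of the double loop for the Laplacian potential, for which the γ update above has the closed form γ_k = √(ŝ_k² + z_k)/τ. Exact dense variances are used for clarity, and the iteration counts and IRLS-style inner loop are illustrative choices rather than the talk's exact schedule; the Monte-Carlo estimator of the following slides replaces the dense inverse at large scale.

```python
# Hedged sketch of the double-loop algorithm for the Laplacian potential
# t(s) = exp(-tau |s|), reusing H, G, y, sigma, tau from the earlier sketches.
import numpy as np

gamma = np.ones(G.shape[0])
b = (H.T @ y) / sigma**2
G_dense = G.toarray()

for outer in range(5):
    # Outer loop: variances z_k = Var_Q(s_k | y) = g_k^T A^{-1} g_k.
    A = ((H.T @ H) / sigma**2 + G.T @ sp.diags(1.0 / gamma) @ G).toarray()
    z = np.einsum("kn,kn->k", G_dense @ np.linalg.inv(A), G_dense)

    # Inner loop: minimize the smoothed energy phi_Q(x; z) by iteratively
    # reweighted least squares, interleaving the gamma update.
    for inner in range(10):
        A = ((H.T @ H) / sigma**2 + G.T @ sp.diags(1.0 / gamma) @ G).toarray()
        x_hat = np.linalg.solve(A, b)
        s_hat = G @ x_hat
        gamma = np.sqrt(s_hat**2 + z) / tau              # closed-form update for the Laplacian
```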
Variance Computation

Goal: Estimate elements of Σ = A^{-1}, where
    A = σ^{-2} H^T H + G^T Γ^{-1} G

◮ Direct inversion is hopeless (N ≈ 10^6).
◮ Accurate and fast techniques exist for problems of special structure [Malioutov et al., '08].
◮ Lanczos iteration (only matrix-vector multiplications required) [Schneider & Willsky, '01], [Seeger & Nickisch, '11].
◮ This work: Monte-Carlo variance estimation.
Gaussian Sampling by Local Perturbations

[Factor graphs: the model and its noise-injected copy, over the same variables x_1, …, x_N and factors g_1, …, g_K, h_1, …, h_M.]

Gaussian MRF sampling by local noise injection
1. Local Perturbations: ỹ ∼ N(0, σ² I) and β̃ ∼ N(0, Γ^{-1})
2. Gaussian Mode: solve A x̃ = σ^{-2} H^T ỹ + G^T β̃

Then x̃ ∼ N(0, A^{-1}), where A = σ^{-2} H^T H + G^T Γ^{-1} G.
[Papandreou & Yuille, '10]
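A sketch of this sampler on the toy model: noise is injected at each factor and one linear system is solved per sample. The dense solve and the function name are illustrative; at large scale the solve is the same PCG primitive discussed later.

```python
# Sketch of the perturb-and-solve Gaussian sampler, reusing H, G, sigma, gamma
# from the earlier sketches.
import numpy as np

def sample_posterior_zero_mean(num_samples, rng=np.random.default_rng(1)):
    """Draw x_tilde ~ N(0, A^{-1}), with A = sigma^{-2} H^T H + G^T Gamma^{-1} G."""
    A = ((H.T @ H) / sigma**2 + G.T @ sp.diags(1.0 / gamma) @ G).toarray()
    samples = []
    for _ in range(num_samples):
        y_tilde = sigma * rng.normal(size=H.shape[0])               # y~ ~ N(0, sigma^2 I)
        beta_tilde = rng.normal(size=G.shape[0]) / np.sqrt(gamma)   # beta~ ~ N(0, Gamma^{-1})
        rhs = (H.T @ y_tilde) / sigma**2 + G.T @ beta_tilde         # local perturbations
        samples.append(np.linalg.solve(A, rhs))                     # Gaussian mode
    return np.array(samples)
```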
Monte-Carlo Variance Estimation

Let x̃_i ∼ N(0, A^{-1}), with i = 1, …, N_s.

General-purpose Monte-Carlo variance estimator
    Σ̂ = (1/N_s) ∑_{i=1}^{N_s} x̃_i x̃_i^T,    ẑ_k = (1/N_s) ∑_{i=1}^{N_s} s̃_{k,i}²,
where s̃_{k,i} = g_k^T x̃_i.

Properties
◮ Unbiased: E{ẑ_k} = z_k.
◮ Marginal distribution of the estimates: ẑ_k / z_k ∼ (1/N_s) χ²(N_s).
◮ Relative error: r = ∆(ẑ_k)/z_k = √(2/N_s).
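Driving the sampler above gives the variance estimator of this slide; with N_s = 50 samples, for example, the predicted relative error √(2/N_s) is about 20%.

```python
# Sketch of the Monte-Carlo variance estimator along the sparsity directions,
# driven by the perturbation sampler above.
Ns = 50
x_tilde = sample_posterior_zero_mean(Ns)       # shape (Ns, N)
s_tilde = (G @ x_tilde.T).T                    # s~_{k,i} = g_k^T x~_i, shape (Ns, K)
z_hat = np.mean(s_tilde**2, axis=0)            # unbiased: E{z_hat_k} = z_k
rel_err = np.sqrt(2.0 / Ns)                    # predicted relative error, ~0.2 here
```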
Monte-Carlo vs. Lanczos Variance Estimates

[Scatter plot of estimated variances ẑ_k versus exact variances z_k (both on the order of 10^-3), comparing the Monte-Carlo estimates (SAMPLE) and Lanczos estimates (LANCZOS) against EXACT.]
Application: Image Deconvolution

[Figure: blurred image ≈ blur kernel k ∗ sharp image x.]

◮ Measurement equation: y ≈ k ∗ x = Hx.
◮ Non-blind deconvolution (known blur kernel k).
◮ Blind deconvolution (unknown blur kernel k).
Blind Image Deconvolution

Blur kernel recovery by Maximum Likelihood
◮ ML objective: k̂ = argmax_k P(y; k) = argmax_k ∫ P(y, x; k) dx.
◮ Variational ML: k̂ = argmax_k Q(y; k)
◮ Contrast with argmax_k (max_x P(x, y; k)).
◮ [Fergus et al., '06], [Levin et al., '09].
Variational EM for Maximum Likelihood

Find k by maximizing Q(y; k) [Girolami, '01], [Levin et al., '11].

E-Step
Given the current kernel estimate k_t, do variational Bayesian inference, i.e., fit Q(x|y; k_t).

M-Step
Maximize w.r.t. k the expected complete log-likelihood E_{Q(x|y;k_t)} {log Q(x, y; k)}. Equivalently, minimize w.r.t. k
    E_{Q(x|y;k_t)} { (1/2) ‖y − Hx‖² } = (1/2) tr((H^T H)(A^{-1} + x̂ x̂^T)) − y^T H x̂ + const
                                       = (1/2) k^T R_xx k − r_xy^T k + const

Expected moments R_xx estimated by Gaussian sampling.
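A hedged sketch of this M-step for a hypothetical 1-D blur kernel of length P with circular boundaries. Here `x_samples` should be full posterior draws (for instance x_hat_Q plus the zero-mean perturbation samples from the earlier sketches), so that averaging X_i^T X_i captures both A^{-1} and x̂ x̂^T; the helper names and the unconstrained solve (no positivity or sum-to-one constraint on k) are illustrative simplifications.

```python
# Hedged sketch of the M-step: each posterior sample x_i defines a convolution
# matrix X_i with X_i @ k a centered circular convolution of x_i with k;
# averaging X_i^T X_i and X_i^T y over samples gives R_xx and r_xy, and the
# kernel update minimizes (1/2) k^T R_xx k - r_xy^T k.
import numpy as np

def conv_matrix(x, P):
    """Columns are centered circular shifts of x."""
    X = np.zeros((len(x), P))
    for j in range(P):
        X[:, j] = np.roll(x, j - (P - 1) // 2)
    return X

def m_step_kernel(x_samples, y, P):
    R_xx = np.zeros((P, P))
    r_xy = np.zeros(P)
    for x_i in x_samples:                       # samples from the full posterior Q(x|y; k_t)
        X = conv_matrix(x_i, P)
        R_xx += X.T @ X / len(x_samples)
        r_xy += X.T @ y / len(x_samples)
    return np.linalg.solve(R_xx, r_xy)          # unconstrained quadratic minimizer
```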
Summary of Computational Primitives

Smoothed estimation
Obtain the variational mean x̂_Q = argmin_x φ_Q(x; z), where
    φ_Q(x; z) = σ^{-2} ‖y − Hx‖² − 2 ∑_{k=1}^K log t((s_k² + z_k)^{1/2})
◮ Inner loop of variational inference.

Sparse linear system
    A x = b,   where A = σ^{-2} H^T H + G^T Γ^{-1} G.
◮ Estimates variances in the outer loop of variational inference and the moments R_xx in blind image deconvolution.
◮ Solve with preconditioned conjugate gradients.
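A sketch of the second primitive on the toy problem: A is applied matrix-free, so only multiplications by H, H^T, G, G^T are needed, and scipy's conjugate-gradient solver does the rest (reusing the toy H, G, y, sigma, gamma from the earlier sketches).

```python
# Sketch of the sparse-system primitive: A x = b solved matrix-free with CG.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

N = H.shape[1]

def A_matvec(v):
    # A v = sigma^{-2} H^T H v + G^T Gamma^{-1} G v
    return (H.T @ (H @ v)) / sigma**2 + G.T @ ((G @ v) / gamma)

A_op = LinearOperator((N, N), matvec=A_matvec)
b = (H.T @ y) / sigma**2
x_hat, info = cg(A_op, b, maxiter=500)          # info == 0 on convergence
```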
Efficient Circulant Preconditioning

Approximate
    A = σ^{-2} H^T H + G^T Γ^{-1} G   with   P = σ^{-2} H^T H + γ̄^{-1} G^T G,
with γ̄^{-1} = (1/K) ∑_{k=1}^K γ_k^{-1} [Lefkimmiatis et al., '12].

Properties
◮ Thanks to the stationarity of P, DFT techniques apply.
◮ Optimality: P = argmin_{X ∈ C} ‖X − A‖
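A hedged sketch of the preconditioner in 1-D with periodic boundaries, assuming both H and G are convolutions (the 3-tap blur and difference filter below are made-up examples). Because P is stationary, its eigenvalues are available by FFT and applying P^{-1} costs one FFT/IFFT pair.

```python
# Hedged sketch of the circulant preconditioner for a 1-D convolutional setup.
import numpy as np
from scipy.sparse.linalg import LinearOperator

N = 256
sigma = 0.1
gamma = np.ones(N)                            # stand-in variational parameters
gamma_bar_inv = np.mean(1.0 / gamma)          # (1/K) sum_k gamma_k^{-1}

h = np.array([0.25, 0.5, 0.25])               # assumed blur kernel
g = np.array([1.0, -1.0])                     # assumed finite-difference filter
H_f = np.fft.fft(h, n=N)                      # transfer functions of the circulant H, G
G_f = np.fft.fft(g, n=N)

# Eigenvalues of P = sigma^{-2} H^T H + gamma_bar^{-1} G^T G in the Fourier basis.
P_eig = np.abs(H_f) ** 2 / sigma**2 + gamma_bar_inv * np.abs(G_f) ** 2

def apply_P_inv(v):
    return np.real(np.fft.ifft(np.fft.fft(v) / P_eig))

M = LinearOperator((N, N), matvec=apply_P_inv)
```

In a fully convolutional setup one would pass M as the M argument of scipy's cg to turn the plain CG of the previous sketch into PCG; the closer the γ_k are to their average, the better P approximates A.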
Effect of Preconditioner

[Convergence plot: residual (log scale, 10^10 down to 10^-15) versus iteration count (0 to 120) for plain CG and preconditioned CG (PCG).]
Non-Blind Image Deblurring Example
ground truth
our result (PSNR=31.93dB)
blurred (PSNR=22.57dB)
VB stdev
20 / 22
Blind Image Deblurring Example
ground truth
our result (PSNR=27.54dB)
blurred (PSNR=22.57dB)
kernel
21 / 22
Summary

Main Points
◮ Variational Bayesian inference using standard optimization primitives.
◮ Scalable to large-scale problems.
◮ Open question: Monte-Carlo or Variational?

Our software is integrated in the glm-ie open source toolbox.

THANK YOU!