An EM-algorithm approach for the design of orthonormal bases
adapted to sparse representations
Angélique Drémeau and Cédric Herzet
INRIA Centre Rennes - Bretagne Atlantique, Campus universitaire de Beaulieu, 35000 Rennes, France
Context

Sparse representation problem

Let $D \in \mathbb{R}^{N \times M}$ be a dictionary with $N \le M$ and $y \in \mathbb{R}^N$ an observed signal. Find the vector $x \in \mathbb{R}^M$ such that

  $\min_x \|y - Dx\|_2^2$ subject to $\|x\|_0 \le L$,

where $\|x\|_0$ denotes the $\ell_0$-norm, i.e., the number of nonzero coefficients in $x$, and $L$ is a given constant. Or, in its Lagrangian version,

  $\min_x \|y - Dx\|_2^2 + \lambda \|x\|_0$,

where $\lambda$ is a Lagrange multiplier.
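When $D$ is a single orthonormal basis, the Lagrangian problem has a closed-form solution: the transform coefficients $D^T y$ are kept only if their squared magnitude exceeds $\lambda$. A minimal sketch of this solver (NumPy; the function name is ours), reused by the algorithm sketches below:

```python
import numpy as np

def hard_threshold_code(y, D, lam):
    """Sparse code of y in an orthonormal basis D.

    Solves min_x ||y - D x||_2^2 + lam * ||x||_0 when D^T D = I:
    the cost decouples per coefficient c_i = (D^T y)_i, so c_i is
    kept iff c_i**2 > lam and set to zero otherwise.
    """
    c = D.T @ y                       # orthonormal analysis coefficients
    return np.where(c**2 > lam, c, 0.0)

# Toy usage: identity basis, lam = 0.5 keeps only the two large coefficients
x = hard_threshold_code(np.array([2.0, 0.3, -1.5, 0.1]), np.eye(4), lam=0.5)
```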
Design of dictionaries adapted to sparse representations

Given a training set $\{y_j\}_{j=1}^K$, find the dictionary $D$ which leads to the best distortion-sparsity compromise, i.e.,

  $D = \arg\min_D \sum_j \left\{ \min_{x_j} \|y_j - D x_j\|_2^2 + \lambda \|x_j\|_0 \right\}$.
Low bit rate compression and sparsity in the context of orthonormal bases [1]

The distortion and the rate depend on the number of nonzero quantized transform coefficients, say $L$:

  $D = \varphi(L)$,  $R = \gamma L$,

where $\varphi(L)$ and $\gamma$ depend on the basis. At low bit rates and in the context of an orthonormal basis, the rate-distortion performance therefore depends on the ability of the basis to provide a good approximation of the signal with few coefficients.

Contributions of this paper

In this paper, we focus on the learning of dictionaries made up of a union of orthonormal bases. This topic has been the object of several recent contributions ([2], [3]). We propose here a probabilistic interpretation of one of them and suggest a novel optimization procedure based on the expectation-maximization (EM) algorithm ([4]).
A probabilistic framework

Let $\{y_j\}_{j=1}^K$ be a set of training signals for the optimization of an overcomplete dictionary $D$. We suppose that $D$ is made up of $P$ orthonormal bases, i.e.,

  $D = [D_1, \ldots, D_i, \ldots, D_P]$,  $D_i^T D_i = I_N$,

where $I_N$ is the $N$-dimensional identity matrix. Let finally $x_j$ be the vector made up of the $x_{ji}$'s, which correspond to the sparse representations of $y_j$ in the bases $D_i$, i.e., $x_j = [x_{j1}^T, \ldots, x_{ji}^T, \ldots, x_{jP}^T]^T$.

We consider the following model for $y_j$:

  $p(y_j \mid D) = \sum_{c_j=1}^{P} \int_{\mathbb{R}^M} p(y_j \mid x_j, D, c_j)\, p(x_j \mid c_j)\, p(c_j)\, dx_j$,

with

  $p(y_j \mid x_j, D, c_j = i) = \mathcal{N}(D_i x_{ji}, \sigma^2 I_N)$,
  $p(x_j \mid c_j = i) \propto \exp\{-\lambda' \|x_{ji}\|_0\}$,

where $\mathcal{N}(\mu, \Gamma)$ denotes a Gaussian distribution with mean $\mu$ and covariance $\Gamma$, and $\lambda' > 0$.
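The learning algorithms below all manipulate the joint probability $p(y_j, x_j, D, c_j = i) = p(y_j \mid x_j, D, c_j = i)\, p(x_j \mid c_j = i)\, p(c_j = i)$. A minimal sketch of its logarithm (up to the normalization constant of the $\ell_0$ prior), assuming a uniform $p(c_j) = 1/P$ as in the equivalence discussed below:

```python
import numpy as np

def joint_log_prob(y, x_i, D_i, sigma2, lam_prime, P):
    """log p(y, x, D, c = i), up to an additive constant.

    Gaussian likelihood N(D_i x_i, sigma2 * I), l0 prior
    exp(-lam_prime * ||x_i||_0), and uniform p(c = i) = 1/P.
    """
    r = y - D_i @ x_i
    log_lik = -0.5 * (r @ r) / sigma2 - 0.5 * y.size * np.log(2 * np.pi * sigma2)
    return log_lik - lam_prime * np.count_nonzero(x_i) - np.log(P)
```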
Learning algorithms

Sezer's algorithm [3]

0. Initialization
Set $D^{(0)} = [D_1^{(0)}, \ldots, D_i^{(0)}, \ldots, D_P^{(0)}]$, and
$\forall i \in \{1, \ldots, P\}$, $\forall j \in \{1, \ldots, K\}$, $x_{ji}^{(0)} = \arg\min_{x_{ji}} \left\{ \|y_j - D_i^{(0)} x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$.

1. Classification
$\forall i \in \{1, \ldots, P\}$, compute $S_i^{(k)} = \left\{ j \in \{1, \ldots, K\} \mid c_j^{(k)} = i \right\}$,
where $c_j^{(k)} = \arg\min_{i \in \{1, \ldots, P\}} \left\{ \|y_j - D_i^{(k-1)} x_{ji}^{(k-1)}\|_2^2 + \lambda \|x_{ji}^{(k-1)}\|_0 \right\}$.

2. Basis update
$\forall i \in \{1, \ldots, P\}$, $\forall j \in \{1, \ldots, K\}$, update $D_i$ and $x_{ji}$ as follows:
  $D_i^{(k)} = \arg\min_{D_i} \sum_{j \in S_i^{(k)}} \min_{x_{ji}} \left\{ \|y_j - D_i x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$ subject to $D_i^T D_i = I_N$,
  $x_{ji}^{(k)} = \arg\min_{x_{ji}} \left\{ \|y_j - D_i^{(k)} x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$.
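A compact sketch of one classification / basis-update iteration. The poster does not specify how the orthonormality-constrained minimization is carried out; the sketch assumes an orthogonal Procrustes (SVD) update of each basis, in the spirit of [2], together with the hard-thresholding codes sketched above:

```python
import numpy as np

def sezer_iteration(Y, bases, lam):
    """One classification / basis-update pass over the training set.

    Y     : (N, K) matrix whose columns are the training signals y_j
    bases : list of P orthonormal (N, N) matrices D_i
    lam   : l0 regularization weight lambda
    """
    # Sparse codes x_ji and costs ||y_j - D_i x_ji||^2 + lam * ||x_ji||_0
    codes, costs = [], []
    for D in bases:
        C = D.T @ Y
        X = np.where(C**2 > lam, C, 0.0)
        codes.append(X)
        costs.append(np.sum((Y - D @ X)**2, axis=0) + lam * np.count_nonzero(X, axis=0))

    # 1. Classification: assign each signal to its cheapest basis
    c = np.argmin(np.stack(costs), axis=0)

    # 2. Basis update on each class (orthogonal Procrustes), then refresh the codes
    new_bases = []
    for i, D in enumerate(bases):
        idx = np.flatnonzero(c == i)
        if idx.size:
            U, _, Vt = np.linalg.svd(Y[:, idx] @ codes[i][:, idx].T)
            D = U @ Vt                 # maximizes tr(D^T Y_i X_i^T) under D^T D = I
        new_bases.append(D)
    return new_bases, c
```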
...Revisited

With the assumptions $p(c_j) = \frac{1}{P}$, $\forall c_j$, $\forall j$, and $\lambda = 2\lambda'\sigma^2$, Sezer's algorithm is equivalent to the MAP estimation problem

  $(D, X, c) = \arg\max_{(D,X,c)} \sum_{j=1}^{K} \log p(y_j, x_j, D, c_j)$,  with $c = [c_1, \ldots, c_K]^T$.

We can indeed recognize the two steps:

  $c^{(k)} = \arg\max_{c} \sum_{j=1}^{K} \log p(y_j, x_j^{(k-1)}, D^{(k-1)}, c_j)$,
  $(D^{(k)}, X^{(k)}) = \arg\max_{(D,X)} \sum_{j=1}^{K} \log p(y_j, x_j, D, c_j^{(k)})$.

Alternative approach

We consider instead the marginalized MAP estimation problem

  $(D, X) = \arg\max_{(D,X)} \sum_{j=1}^{K} \log p(y_j, x_j, D)$,  where $p(y_j, x_j, D) = \sum_{c_j=1}^{P} p(y_j, x_j, D, c_j)$.

This problem is solved by the EM algorithm:
• the E-step computes a lower bound on the log-likelihood with respect to the current estimates of the model parameters,
• the M-step estimates the model parameters which maximize the function evaluated in the E-step.
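As a small companion to the objective above, the sketch below evaluates $\sum_j \log \sum_{c_j} p(y_j, x_j, D, c_j)$ (e.g., to monitor the iterations of the EM-based algorithm that follows); it reuses the joint_log_prob helper sketched earlier:

```python
import numpy as np
from scipy.special import logsumexp

def marginal_map_objective(Y, codes, bases, sigma2, lam_prime):
    """sum_j log sum_i p(y_j, x_j, D, c_j = i) for the current (D, X).

    Y     : (N, K) training signals
    codes : (P, N, K) array of sparse codes x_ji
    """
    P, K = len(bases), Y.shape[1]
    total = 0.0
    for j in range(K):
        log_terms = [joint_log_prob(Y[:, j], codes[i][:, j], bases[i],
                                    sigma2, lam_prime, P) for i in range(P)]
        total += logsumexp(log_terms)   # log of the marginal over c_j
    return total
```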
EM-based algorithm

0. Initialization
Set $D^{(0)} = [D_1^{(0)}, \ldots, D_i^{(0)}, \ldots, D_P^{(0)}]$, $\lambda = 2\lambda'\sigma^2$, and
$\forall i \in \{1, \ldots, P\}$, $\forall j \in \{1, \ldots, K\}$, $x_{ji}^{(0)} = \arg\min_{x_{ji}} \left\{ \|y_j - D_i^{(0)} x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$.

1. E-step
$\forall i \in \{1, \ldots, P\}$, $\forall j \in \{1, \ldots, K\}$, compute
  $w_{ji}^{(k)} \propto \exp\!\left( -\frac{1}{2\sigma^2} \|y_j - D_i^{(k-1)} x_{ji}^{(k-1)}\|_2^2 - \lambda' \|x_{ji}^{(k-1)}\|_0 \right) p(c_j)$.

2. M-step
$\forall i \in \{1, \ldots, P\}$, $\forall j \in \{1, \ldots, K\}$, update $D_i$ and $x_{ji}$ as follows:
  $D_i^{(k)} = \arg\min_{D_i} \sum_{j=1}^{K} w_{ji}^{(k)} \min_{x_{ji}} \left\{ \|y_j - D_i x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$ subject to $D_i^T D_i = I_N$,
  $x_{ji}^{(k)} = \arg\min_{x_{ji}} \left\{ \|y_j - D_i^{(k)} x_{ji}\|_2^2 + \lambda \|x_{ji}\|_0 \right\}$.
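A sketch of one EM iteration under the same assumptions as the Sezer sketch (uniform $p(c_j)$, hard-thresholding codes, Procrustes-style basis update); the only difference is that the hard classification is replaced by the posterior weights $w_{ji}^{(k)}$:

```python
import numpy as np

def em_iteration(Y, bases, codes, sigma2, lam_prime):
    """One EM iteration for the union-of-orthonormal-bases model.

    Y     : (N, K) training signals
    bases : list of P orthonormal (N, N) matrices D_i^{(k-1)}
    codes : (P, N, K) sparse codes x_ji^{(k-1)}
    """
    lam = 2.0 * lam_prime * sigma2

    # E-step: w_ji ∝ exp(-||y_j - D_i x_ji||^2 / (2 sigma2) - lam' ||x_ji||_0), p(c_j) = 1/P
    log_w = np.stack([-0.5 * np.sum((Y - D @ X)**2, axis=0) / sigma2
                      - lam_prime * np.count_nonzero(X, axis=0)
                      for D, X in zip(bases, codes)])
    w = np.exp(log_w - log_w.max(axis=0))
    w /= w.sum(axis=0)                               # normalize over the P bases

    # M-step: weighted Procrustes update of each basis, then refresh its codes
    new_bases, new_codes = [], []
    for i in range(len(bases)):
        U, _, Vt = np.linalg.svd((Y * w[i]) @ codes[i].T)  # w_ji weights each signal's term
        D_new = U @ Vt
        C = D_new.T @ Y
        new_bases.append(D_new)
        new_codes.append(np.where(C**2 > lam, C, 0.0))
    return new_bases, np.stack(new_codes), w
```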
Estimation of the noise variance

The noise variance can be included as a new unknown variable in the MAP problem:

  $(D, X, \sigma^2) = \arg\max_{(D,X,\sigma^2)} \sum_{j=1}^{K} \log p(y_j, x_j, D, \sigma^2)$.

This amounts to the addition of the estimation of the noise variance in the M-step:

  $(\sigma^2)^{(k)} = \frac{1}{NK} \sum_{j=1}^{K} \sum_{i=1}^{P} w_{ji}^{(k)} \|y_j - D_i^{(k)} x_{ji}^{(k)}\|_2^2$.
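The corresponding one-line addition to the M-step sketch above, with the same array conventions (Y is (N, K), codes is (P, N, K), w is (P, K)):

```python
import numpy as np

def update_sigma2(Y, bases, codes, w):
    """(sigma^2)^{(k)} = (1/NK) * sum_{j,i} w_ji * ||y_j - D_i x_ji||^2."""
    N, K = Y.shape
    sq_err = np.stack([np.sum((Y - D @ X)**2, axis=0)    # (P, K) residual energies
                       for D, X in zip(bases, codes)])
    return float(np.sum(w * sq_err) / (N * K))
```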
Performance analysis

• Synthetic signals: 500 16-dimensional signals are generated as the noisy combination of 4 atoms of one single basis taken from a set of 6 bases. The amplitudes of the nonzero coefficients are drawn from a zero-mean Gaussian distribution with variance $\sigma_a^2 = 16$. Finally, the dictionary is initialized from the original dictionary as
  $\forall i \in \{1, \ldots, P\}, \quad D_i^{(0)} = D_i M^T$,
where $M = \mathrm{GS}(I_{16} + N(a))$, $\mathrm{GS}$ represents the Gram-Schmidt orthogonalization process and $N(a)$ represents a $16 \times 16$ matrix whose elements are i.i.d. realizations of a uniform law on $[-a, a]$ (see the data-generation sketch after this list).

• Performance measurements: We evaluate and compare the performance of three algorithms:
  – "Sezer": the learning algorithm proposed in [3],
  – "EM": the proposed algorithm, in which the noise variance estimation is also implemented,
  – "EM thresholded": similar to "EM", with the E-step approximated by the thresholded decision
    $c_j^{(k)} = \arg\max_{i} p(c_j = i \mid y_j, x_j^{(k-1)}, D^{(k-1)})$.
Performance is evaluated via the missed-detection rate versus the signal-to-noise ratio (SNR).
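A sketch of this synthetic setup. The draw of the ground-truth bases themselves is not specified on the poster, so random orthonormal matrices are assumed here; np.linalg.qr also stands in for the Gram-Schmidt orthogonalization:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(N=16, P=6, K=500, n_atoms=4, sigma_a2=16.0, sigma2=1.0, a=0.3):
    """Synthetic signals and perturbed dictionary initialization.

    Each y_j combines n_atoms atoms of one basis chosen uniformly among P,
    with N(0, sigma_a2) amplitudes and N(0, sigma2) additive noise; sigma2
    is the knob used to sweep the SNR.
    """
    bases = []
    for _ in range(P):                                  # ground-truth orthonormal bases
        Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
        bases.append(Q)

    labels = rng.integers(P, size=K)
    Y = np.empty((N, K))
    for j in range(K):
        x = np.zeros(N)
        support = rng.choice(N, n_atoms, replace=False)
        x[support] = np.sqrt(sigma_a2) * rng.standard_normal(n_atoms)
        Y[:, j] = bases[labels[j]] @ x + np.sqrt(sigma2) * rng.standard_normal(N)

    # Initialization D_i^(0) = D_i M^T with M = GS(I_16 + N(a))
    M, _ = np.linalg.qr(np.eye(N) + rng.uniform(-a, a, size=(N, N)))
    init = [D @ M.T for D in bases]
    return Y, labels, bases, init
```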
Figure 1: Comparison between Sezer's, EM and EM-thresholded algorithms for different dictionary initializations (left: a = 0.3, right: a = 0.4). Both panels plot the missed detection rate (%) against the SNR (dB).
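The poster does not spell out how the missed-detection rate is computed; one plausible reading, sketched below purely for illustration, counts a miss whenever a training signal is not associated with the basis that actually generated it:

```python
import numpy as np

def missed_detection_rate(w, labels):
    """Percentage of signals whose most probable basis differs from the true one.

    w      : (P, K) E-step weights (or a one-hot hard classification)
    labels : (K,) indices of the bases that generated the signals
    """
    return 100.0 * np.mean(np.argmax(w, axis=0) != labels)
```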
References
[1] S. Mallat and F. Falzon, "Analysis of low bit rate image transform coding," IEEE Trans. on Signal Processing, vol. 46, no. 4, pp. 1027–1042, April 1998.
[2] S. Lesage, R. Gribonval, F. Bimbot, and L. Benaroya, "Learning unions of orthonormal bases with thresholded singular value decomposition," in Proc. IEEE Int'l Conference on Acoustics, Speech and Signal Processing (ICASSP), 18-23 March 2005, vol. 5, pp. v293–v296.
[3] O. G. Sezer, O. Harmanci, and O. G. Guleryuz, "Sparse orthonormal transforms for image compression," in Proc. IEEE Int'l Conference on Image Processing (ICIP), San Diego, CA, October 2008.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1–38, 1977.