Shrinkage Fields for Effective Image Restoration
– Supplemental Material –
Uwe Schmidt
Stefan Roth
Department of Computer Science, TU Darmstadt
1. Derivations
While the following derivations may look complex at first glance, they require only standard (multivariate) calculus. We use the numerator layout notation¹ for all derivations.
Please also note that all derived formulae can be computed efficiently and implemented compactly (Matlab code for learning and inference is available on the authors' webpages).
We derive the iterative update equation for the image variables x in Eq. (6) of the main paper as:
\begin{align}
g_\beta(\mathbf{z}) &= \arg\min_{\mathbf{x}} E_\beta(\mathbf{x} \mid \mathbf{z}, \mathbf{y}) = \arg\max_{\mathbf{x}} \exp\bigl(-E_\beta(\mathbf{x} \mid \mathbf{z}, \mathbf{y})\bigr) \tag{1}\\
&= \arg\max_{\mathbf{x}} \exp\Biggl(-\frac{\lambda}{2}\|\mathbf{y}-\mathbf{K}\mathbf{x}\|^2 - \sum_{i=1}^{N}\sum_{c\in\mathcal{C}}\Bigl[\frac{\beta}{2}\bigl(\mathbf{f}_i^T\mathbf{x}_{(c)} - z_{ic}\bigr)^2 + \rho_i(z_{ic})\Bigr]\Biggr) \tag{2}\\
&= \arg\max_{\mathbf{x}} \exp\Bigl(-\frac{\lambda}{2}(\mathbf{y}-\mathbf{K}\mathbf{x})^T(\mathbf{y}-\mathbf{K}\mathbf{x}) - \frac{\beta}{2}\sum_{i=1}^{N}(\mathbf{F}_i\mathbf{x}-\mathbf{z}_i)^T(\mathbf{F}_i\mathbf{x}-\mathbf{z}_i)\Bigr) \tag{3}\\
&= \arg\max_{\mathbf{x}} \exp\Biggl(-\frac{1}{2}\mathbf{x}^T\Bigl(\lambda\mathbf{K}^T\mathbf{K} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr)\mathbf{x} + \mathbf{x}^T\Bigl(\lambda\mathbf{K}^T\mathbf{y} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{z}_i\Bigr)\Biggr) \tag{4}\\
&= \arg\max_{\mathbf{x}} \; \mathcal{N}\Biggl(\mathbf{x} \;\Bigl|\; \Bigl[\lambda\mathbf{K}^T\mathbf{K} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr]^{-1}\Bigl[\lambda\mathbf{K}^T\mathbf{y} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{z}_i\Bigr],\; \Bigl[\lambda\mathbf{K}^T\mathbf{K} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr]^{-1}\Biggr) \tag{5}\\
&= \Bigl[\lambda\mathbf{K}^T\mathbf{K} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr]^{-1}\Bigl[\lambda\mathbf{K}^T\mathbf{y} + \beta\sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{z}_i\Bigr] \tag{6}\\
&= \Bigl[\tfrac{\lambda}{\beta}\mathbf{K}^T\mathbf{K} + \sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr]^{-1}\Bigl[\tfrac{\lambda}{\beta}\mathbf{K}^T\mathbf{y} + \sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{z}_i\Bigr] \tag{7}\\
&= \mathcal{F}^{-1}\Biggl[\frac{\mathcal{F}\bigl(\tfrac{\lambda}{\beta}\mathbf{K}^T\mathbf{y} + \sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{z}_i\bigr)}{\tfrac{\lambda}{\beta}\check{\mathbf{K}}^*\circ\check{\mathbf{K}} + \sum_{i=1}^{N}\check{\mathbf{F}}_i^*\circ\check{\mathbf{F}}_i}\Biggr]. \tag{8}
\end{align}
As in the main paper, convolution is denoted by $\mathbf{F}\mathbf{x} = [\mathbf{f}^T\mathbf{x}_{(C_1)}, \ldots, \mathbf{f}^T\mathbf{x}_{(C_{|\mathcal{C}|})}]^T \equiv \mathbf{f}\otimes\mathbf{x} \equiv \mathcal{F}^{-1}(\check{\mathbf{F}}\circ\mathcal{F}(\mathbf{x}))$. The optical transfer function $\check{\mathbf{F}} \equiv \mathcal{F}(\mathbf{f})$ is derived from filter (point spread function) $\mathbf{f}$, where $\mathcal{F}$ denotes the discrete Fourier transform (DFT). Note that division and multiplication ($\circ$) are applied element-wise in Eq. (8); $\check{\mathbf{F}}^*$ denotes the complex conjugate of $\check{\mathbf{F}}$.
The last step is possible since $\mathbf{K}$ and the $\mathbf{F}_i$ are block-circulant matrices with circulant blocks, as they correspond to convolutions with circular boundary conditions. Consequently, they are diagonalized by DFTs, which allows element-wise operations.
¹ Cf. http://en.wikipedia.org/w/index.php?title=Matrix_calculus&oldid=576691987#Numerator-layout_notation.
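To make Eq. (8) concrete, the following Matlab sketch implements it for the denoising case ($\mathbf{K} = \mathbf{I}$) under circular boundary conditions. This is an illustration, not the authors' released code: the function name solve_eq8 and all variable names are ours, and psf2otf (Image Processing Toolbox) computes the optical transfer functions $\check{\mathbf{F}}_i$.

```matlab
% Minimal sketch of Eq. (8) for denoising (K = I) with circular boundary
% conditions. y: h-by-w input image; f: cell array of 2D filter kernels;
% z: cell array of h-by-w shrinkage outputs z_i; lambda, beta: weights.
function x = solve_eq8(y, f, z, lambda, beta)
  [h, w] = size(y);
  num = (lambda / beta) * fft2(y);     % F(lambda/beta * K'*y), with K = I
  den = (lambda / beta) * ones(h, w);  % lambda/beta * K_check^* o K_check = lambda/beta
  for i = 1:numel(f)
    Fc  = psf2otf(f{i}, [h, w]);          % optical transfer function F_check_i
    num = num + conj(Fc) .* fft2(z{i});   % adds F(F_i' * z_i)
    den = den + conj(Fc) .* Fc;           % adds F_check_i^* o F_check_i
  end
  x = real(ifft2(num ./ den));         % element-wise division, inverse DFT
end
```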
1.1. Boundary handling
To reduce boundary artifacts, which may arise from using convolutions with circular boundary conditions, we pad the input image. Specifically, we first take the input image $\mathbf{x} \in \mathbb{R}^D$ (of width $w$ and height $h$ with $D = h \cdot w$) and replicate $b$ pixels of its boundary on all sides by multiplication with the sparse "padding" matrix $\mathbf{P}_b$². In image denoising, $b$ can be freely chosen (we use $b = 10$); for deconvolution, $b = (r-1)/2$ with square blur kernel $\mathbf{k} \in \mathbb{R}^{r^2}$ of size $r \times r$ pixels³.
² In Matlab, $\mathbf{P}_b\mathbf{x}$ ≡ reshape(padarray(reshape(x,h,w),[b,b],'replicate','both'),D,1).
³ Square kernel only assumed for simplicity here.
For deconvolution we additionally apply "edge-tapering" to the padded input image $\mathbf{P}_b\mathbf{x}$ so that it better matches the assumption of convolution with circular boundary handling. In particular, we apply $u$ iterations of Matlab's edgetaper function (where $\boldsymbol{\alpha}_k$ is a weighting vector based on $\mathbf{k}$), which can be formalized as
\begin{align}
\mathrm{edgetaper}(\mathbf{x}, \mathbf{k}) &= \boldsymbol{\alpha}_k \circ \mathbf{x} + (\mathbf{1} - \boldsymbol{\alpha}_k) \circ (\mathbf{k} \otimes \mathbf{x}) \tag{9}\\
&= \mathrm{diag}\{\boldsymbol{\alpha}_k\}\mathbf{x} + \mathrm{diag}\{\mathbf{1} - \boldsymbol{\alpha}_k\}(\mathbf{K}\mathbf{x}) \tag{10}\\
&= \bigl(\mathrm{diag}\{\boldsymbol{\alpha}_k\} + \mathrm{diag}\{\mathbf{1} - \boldsymbol{\alpha}_k\}\mathbf{K}\bigr)\mathbf{x} \tag{11}\\
&= \mathbf{E}_k\mathbf{x}. \tag{12}
\end{align}
Performing $u$ iterations of edge-tapering can thus be expressed by multiplication with the matrix $\mathbf{E}_k = \mathrm{diag}\{\boldsymbol{\alpha}_k\} + \mathrm{diag}\{\mathbf{1} - \boldsymbol{\alpha}_k\}\mathbf{K}$ raised to the $u$-th power. (For denoising, $\mathbf{E}_k^u = \mathbf{I}$ is set to the identity matrix.)
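In Matlab, both boundary operations can be written directly with library routines (padarray and edgetaper, both from the Image Processing Toolbox). The fragment below is a sketch that assumes a 2D image x, pad width b, blur kernel k, and iteration count u are in scope.

```matlab
% Sketch of the boundary operations: P_b (replicate padding), then u
% applications of E_k (Matlab's edgetaper, Eqs. (9)-(12)).
xP = padarray(x, [b, b], 'replicate', 'both');  % x^P = P_b * x
for iter = 1:u
  xP = edgetaper(xP, k);  % Eq. (9): alpha_k o xP + (1 - alpha_k) o (k conv xP)
end
% For denoising there is no blur kernel; use u = 0, i.e., E_k^u = I.
```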
We denote the boundary-padded and possibly edge-tapered input image as $\mathbf{x}^P = \mathbf{E}_k^u\mathbf{P}_b\mathbf{x}$, and as $\hat{\mathbf{x}}^B$ the output image where the padded boundary region has not been removed yet. Consequently, we use the following minor variation of $g_\Theta(\mathbf{x})$ (Eqs. (10) and (11) of the main paper) in practice:
\begin{align}
g_\Theta(\mathbf{x}) &= \mathbf{T}_b\,\mathcal{F}^{-1}\Biggl[\frac{\mathcal{F}\bigl(\lambda\mathbf{K}^T\mathbf{y} + \sum_{i=1}^{N}\mathbf{F}_i^T f_{\pi_i}(\mathbf{F}_i\mathbf{x}^P)\bigr)}{\lambda\check{\mathbf{K}}^*\circ\check{\mathbf{K}} + \sum_{i=1}^{N}\check{\mathbf{F}}_i^*\circ\check{\mathbf{F}}_i}\Biggr] \tag{13}\\
&= \mathbf{T}_b\underbrace{\Bigl[\lambda\mathbf{K}^T\mathbf{K} + \sum_{i=1}^{N}\mathbf{F}_i^T\mathbf{F}_i\Bigr]^{-1}}_{\boldsymbol{\Omega}^{-1}}\underbrace{\Bigl[\lambda\mathbf{K}^T\mathbf{y} + \sum_{i=1}^{N}\mathbf{F}_i^T f_{\pi_i}(\mathbf{F}_i\mathbf{x}^P)\Bigr]}_{\boldsymbol{\eta}} \tag{14}\\
&= \mathbf{T}_b\hat{\mathbf{x}}^B. \tag{15}
\end{align}
The uncropped output image is given as $\hat{\mathbf{x}}^B = \boldsymbol{\Omega}^{-1}\boldsymbol{\eta}$. To obtain an output image $\hat{\mathbf{x}} = g_\Theta(\mathbf{x})$ of the same size as the input image $\mathbf{x}$, we remove the padded boundary region by multiplication with the sparse "cropping" matrix $\mathbf{T}_b$.
It is important to note that padding, edge-tapering, and cropping can all be expressed as linear transformations, which allows us to easily compute gradients of the transformed images. Furthermore, please note that for inference and learning (Sec. 1.2), the convolution matrices $\mathbf{K}$ and $\mathbf{F}_i$ (as well as $\mathbf{K}^T$ and $\mathbf{F}_i^T$) are never explicitly constructed, since all matrix-vector products can be computed efficiently through convolutions.
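To illustrate how one stage of Eqs. (13)-(15) can be computed without ever building $\boldsymbol{\Omega}$ or $\boldsymbol{\eta}$ as matrices, consider the Matlab sketch below. The function name, the representation of the learned shrinkage functions as handles shrink{i}, and passing the already padded/edge-tapered image xP are our own assumptions, not the released implementation.

```matlab
% Sketch of one stage g_Theta (Eqs. 13-15) for deconvolution. Inputs:
% y (observed image), xP (padded and edge-tapered current estimate x^P),
% k (blur kernel), f{i} (filters), shrink{i} (handles v -> f_pi_i(v)),
% lambda (weight), b (pad width). All FFTs assume circular boundaries.
function xhat = sf_stage(y, xP, k, f, shrink, lambda, b)
  [h, w] = size(xP);
  Kc  = psf2otf(k, [h, w]);                       % K_check
  yP  = padarray(y, [b, b], 'replicate', 'both'); % padded observation
  num = lambda * conj(Kc) .* fft2(yP);            % F(lambda * K' * y)
  den = lambda * (conj(Kc) .* Kc);                % lambda * K_check^* o K_check
  for i = 1:numel(f)
    Fc  = psf2otf(f{i}, [h, w]);
    v   = real(ifft2(Fc .* fft2(xP)));            % filter responses F_i * x^P
    num = num + conj(Fc) .* fft2(shrink{i}(v));   % F(F_i' * f_pi_i(F_i * x^P))
    den = den + conj(Fc) .* Fc;
  end
  xB   = real(ifft2(num ./ den));                 % x^B = Omega^{-1} * eta
  xhat = xB(b+1:end-b, b+1:end-b);                % T_b: crop padded boundary
end
```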
1.2. Learning
In the following we use $g_\Theta(\mathbf{x})$ with boundary handling as defined in Eq. (15).
Recall from the main paper that given training data $\{\mathbf{x}_{gt}^{(s)}, \mathbf{y}^{(s)}, \mathbf{k}^{(s)}\}_{s=1}^S$, we learn all model parameters $\Theta_t = \{\lambda_t, \boldsymbol{\pi}_{ti}, \mathbf{f}_{ti}\}_{i=1}^N$ greedily stage-by-stage for $t = 1, \ldots, T$ via
\begin{align}
J(\Theta_t) = \sum_{s=1}^{S} \ell\bigl(\hat{\mathbf{x}}_t^{(s)}; \mathbf{x}_{gt}^{(s)}\bigr), \tag{16}
\end{align}
or jointly for all $T$ stages via
\begin{align}
J(\Theta_{1,\ldots,T}) = \sum_{s=1}^{S} \ell\bigl(\hat{\mathbf{x}}_T^{(s)}; \mathbf{x}_{gt}^{(s)}\bigr). \tag{17}
\end{align}
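At a high level, greedy training then amounts to one gradient-based optimization per stage. The Matlab sketch below only illustrates this structure: stage_objective (returning $J(\Theta_t)$ and its gradient via the formulas of Secs. 1.2.1-1.2.2), advance_stage, and theta_init are hypothetical helpers, and fminunc (Optimization Toolbox) merely stands in for whatever gradient-based optimizer is actually used.

```matlab
% Structural sketch of greedy training (Eq. 16): one optimization per stage.
opts = optimoptions('fminunc', 'SpecifyObjectiveGradient', true);
for t = 1:T
  % stage_objective returns [J(Theta_t), dJ/dTheta_t] for the current stage
  obj      = @(theta) stage_objective(theta, data, t);
  Theta{t} = fminunc(obj, theta_init, opts);
  data     = advance_stage(data, Theta{t});  % compute x_hat_t^(s) for all s
end
```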
To that end, at each stage of greedy learning we need to compute Eq. (16) and its gradient $\frac{\partial J(\Theta_t)}{\partial \Theta_t}$; for joint training we need to compute Eq. (17) and its gradient $\Bigl[\frac{\partial J(\Theta_{1,\ldots,T})}{\partial \Theta_1}, \ldots, \frac{\partial J(\Theta_{1,\ldots,T})}{\partial \Theta_T}\Bigr]$.
1.2.1 Gradient of loss function
We can address one training example at a time since the gradients of Eqs. (16) and (17) decompose into sums over training examples. The gradient w.r.t. the model parameters of the last/current stage can be computed as follows:
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \Theta_t} &= \frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \hat{\mathbf{x}}_t} \cdot \frac{\partial\, \mathbf{T}_b\boldsymbol{\Omega}_t^{-1}\boldsymbol{\eta}_t}{\partial \Theta_t} \tag{18}\\
&= \frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \hat{\mathbf{x}}_t}\, \mathbf{T}_b\boldsymbol{\Omega}_t^{-1}\Bigl(-\frac{\partial \boldsymbol{\Omega}_t}{\partial \Theta_t}\boldsymbol{\Omega}_t^{-1}\boldsymbol{\eta}_t + \frac{\partial \boldsymbol{\eta}_t}{\partial \Theta_t}\Bigr) \tag{19}\\
&= \hat{\mathbf{c}}_t^T\Bigl(\frac{\partial \boldsymbol{\eta}_t}{\partial \Theta_t} - \frac{\partial \boldsymbol{\Omega}_t}{\partial \Theta_t}\hat{\mathbf{x}}_t^B\Bigr) \quad\text{with}\quad \hat{\mathbf{c}}_t = \boldsymbol{\Omega}_t^{-1}\mathbf{T}_b^T\Bigl[\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \hat{\mathbf{x}}_t}\Bigr]^T. \tag{20}
\end{align}
The gradient of the chosen loss function (negative PSNR)
\begin{align}
\ell(\hat{\mathbf{x}}; \mathbf{x}_{gt}) = -20\log_{10}\Biggl(\frac{R\sqrt{D}}{\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}\Biggr), \tag{21}
\end{align}
where $D$ denotes the number of pixels of $\hat{\mathbf{x}}$ and $R$ the maximum intensity level of a pixel (i.e., $R = 255$), is derived as
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}; \mathbf{x}_{gt})}{\partial \hat{\mathbf{x}}} &= -\frac{20}{\log 10}\frac{\partial}{\partial \hat{\mathbf{x}}}\log\Biggl(\frac{R\sqrt{D}}{\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}\Biggr) \tag{22}\\
&= \frac{20}{\log 10}\frac{\partial \log\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}{\partial \hat{\mathbf{x}}} \tag{23}\\
&= \frac{20}{\log 10}\frac{1}{\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}\frac{\partial \|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}{\partial \hat{\mathbf{x}}} \tag{24}\\
&= \frac{20}{\log 10}\frac{1}{\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|}\frac{(\hat{\mathbf{x}} - \mathbf{x}_{gt})^T}{\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|} \tag{25}\\
&= \frac{20\,(\hat{\mathbf{x}} - \mathbf{x}_{gt})^T}{\log 10\,\|\hat{\mathbf{x}} - \mathbf{x}_{gt}\|^2}. \tag{26}
\end{align}
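Eqs. (21) and (26) translate directly into a few lines of Matlab; the function below is our own sketch (with $R = 255$, and the gradient returned as a row vector to match the numerator layout used throughout).

```matlab
% Negative PSNR loss (Eq. 21) and its gradient w.r.t. x_hat (Eq. 26).
function [loss, grad] = neg_psnr(xhat, xgt)
  R = 255;                               % maximum intensity level
  D = numel(xhat);                       % number of pixels
  r = xhat(:) - xgt(:);                  % residual
  n = norm(r);                           % ||x_hat - x_gt||
  loss = -20 * log10(R * sqrt(D) / n);   % Eq. (21)
  grad = (20 / log(10)) * r' / n^2;      % Eq. (26), 1-by-D row vector
end
```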
For joint training we additionally require gradients w.r.t. the model parameters of previous stages. The gradient w.r.t. the second-to-last stage is obtained as ($t \geq 2$)
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \Theta_{t-1}} &= \hat{\mathbf{c}}_t^T\frac{\partial \boldsymbol{\eta}_t}{\partial \Theta_{t-1}} \tag{27}\\
&= \hat{\mathbf{c}}_t^T\frac{\partial}{\partial \Theta_{t-1}}\Biggl[\sum_{i=1}^{N}\mathbf{F}_{ti}^T f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\Biggr] \tag{28}\\
&= \hat{\mathbf{c}}_t^T\sum_{i=1}^{N}\mathbf{F}_{ti}^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\mathbf{E}_k^u\mathbf{P}_b\frac{\partial \hat{\mathbf{x}}_{t-1}}{\partial \Theta_{t-1}} \tag{29}\\
&= \Biggl[\sum_{i=1}^{N}(\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\Biggr]\mathbf{E}_k^u\mathbf{P}_b\frac{\partial\, \mathbf{T}_b\boldsymbol{\Omega}_{t-1}^{-1}\boldsymbol{\eta}_{t-1}}{\partial \Theta_{t-1}} \tag{30}\\
&= \Biggl[\mathbf{T}_b^T\mathbf{P}_b^T\mathbf{E}_k^{uT}\sum_{i=1}^{N}\mathbf{F}_{ti}^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t\Biggr]^T\boldsymbol{\Omega}_{t-1}^{-1}\Bigl(\frac{\partial \boldsymbol{\eta}_{t-1}}{\partial \Theta_{t-1}} - \frac{\partial \boldsymbol{\Omega}_{t-1}}{\partial \Theta_{t-1}}\hat{\mathbf{x}}_{t-1}^B\Bigr) \tag{31}\\
&= \hat{\mathbf{c}}_{t-1}^T\Bigl(\frac{\partial \boldsymbol{\eta}_{t-1}}{\partial \Theta_{t-1}} - \frac{\partial \boldsymbol{\Omega}_{t-1}}{\partial \Theta_{t-1}}\hat{\mathbf{x}}_{t-1}^B\Bigr) \quad\text{with}\quad \hat{\mathbf{c}}_{t-1} = \boldsymbol{\Omega}_{t-1}^{-1}\mathbf{T}_b^T\mathbf{P}_b^T\mathbf{E}_k^{uT}\sum_{i=1}^{N}\mathbf{F}_{ti}^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t, \tag{32}
\end{align}
where $f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) = \frac{\partial f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial\, \mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P}$ (cf. Sec. 1.2.2). Note that the form is the same as that of Eq. (20), which means we can apply this recursively to compute gradients for earlier stages ($t \geq 2$, $s = 1, \ldots, t-1$):
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \Theta_{t-s}} &= \hat{\mathbf{c}}_{t-s}^T\Bigl(\frac{\partial \boldsymbol{\eta}_{t-s}}{\partial \Theta_{t-s}} - \frac{\partial \boldsymbol{\Omega}_{t-s}}{\partial \Theta_{t-s}}\hat{\mathbf{x}}_{t-s}^B\Bigr) \quad\text{with} \tag{33}\\
\hat{\mathbf{c}}_{t-s} &= \boldsymbol{\Omega}_{t-s}^{-1}\mathbf{T}_b^T\mathbf{P}_b^T\mathbf{E}_k^{uT}\sum_{i=1}^{N}\mathbf{F}_{t-s+1,i}^T f'_{\boldsymbol{\pi}_{t-s+1,i}}(\mathbf{F}_{t-s+1,i}\hat{\mathbf{x}}_{t-s}^P)\mathbf{F}_{t-s+1,i}\hat{\mathbf{c}}_{t-s+1}. \tag{34}
\end{align}
Note that jointly training T stages only takes approximately T times as long as training a single stage.
1.2.2 Gradients for specific model parameters
Now that we have derived the generic gradients for the given loss function, we need to derive the specific gradients w.r.t. the actual model parameters $\Theta_t = \{\lambda_t, \boldsymbol{\pi}_{ti}, \mathbf{f}_{ti}\}_{i=1}^N$ at all stages $t$.
Regularization weight λ. We define $\lambda_t = \exp(\tilde{\lambda}_t)$ to ensure positive values of $\lambda_t$ and learn $\tilde{\lambda}_t$ via its partial derivative
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \tilde{\lambda}_t} &= \frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \lambda_t} \cdot \frac{\partial \exp(\tilde{\lambda}_t)}{\partial \tilde{\lambda}_t} \tag{35}\\
&= \hat{\mathbf{c}}_t^T\bigl(\mathbf{K}^T\mathbf{y} - \mathbf{K}^T\mathbf{K}\hat{\mathbf{x}}_t^B\bigr) \cdot \lambda_t \tag{36}\\
&= \lambda_t(\mathbf{K}\hat{\mathbf{c}}_t)^T\bigl(\mathbf{y} - \mathbf{K}\hat{\mathbf{x}}_t^B\bigr). \tag{37}
\end{align}
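For illustration, Eq. (37) can be evaluated with two FFT-based convolutions. The sketch below assumes the blur kernel k, the arrays chat ($\hat{\mathbf{c}}_t$), xB ($\hat{\mathbf{x}}_t^B$), and the padded observation yP are given as 2D arrays of matching (padded) size.

```matlab
% Sketch of Eq. (37): gradient w.r.t. lambda_tilde = log(lambda).
Kc      = psf2otf(k, size(xB));
Kchat   = real(ifft2(Kc .* fft2(chat)));            % K * c_hat_t
KxB     = real(ifft2(Kc .* fft2(xB)));              % K * x_hat_t^B
grad_lt = lambda * (Kchat(:)' * (yP(:) - KxB(:)));  % lambda * (K c_hat)' (y - K x^B)
```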
Shrinkage function. Recall that the shrinkage function is modeled as a linear combination of Gaussian RBF kernels:
\begin{align}
f_{\boldsymbol{\pi}}(v) = \sum_{j=1}^{M}\pi_j\exp\Bigl(-\frac{\gamma}{2}(v - \mu_j)^2\Bigr). \tag{38}
\end{align}
Its derivatives w.r.t. the weights and the input can be computed as
\begin{align}
\frac{\partial f_{\boldsymbol{\pi}}(v)}{\partial \pi_j} &= \exp\Bigl(-\frac{\gamma}{2}(v - \mu_j)^2\Bigr) \tag{39}\\
\frac{\partial f_{\boldsymbol{\pi}}(v)}{\partial v} &= -\sum_{j=1}^{M}\pi_j\exp\Bigl(-\frac{\gamma}{2}(v - \mu_j)^2\Bigr)\cdot\gamma(v - \mu_j). \tag{40}
\end{align}
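Eqs. (38)-(40) vectorize naturally over all filter responses at once. The Matlab fragment below is a sketch with example values of our own choosing for gamma, mu ($1 \times M$ RBF means), pi_ ($M \times 1$ weights), and v (filter responses); it relies on implicit expansion (Matlab R2016b or later).

```matlab
% RBF shrinkage (Eq. 38) and its derivatives (Eqs. 39-40) for all inputs v.
gamma = 1/4; mu = linspace(-10, 10, 51);     % example RBF parameters
pi_ = randn(51, 1); v = randn(1000, 1);      % example weights and inputs
G     = exp(-(gamma/2) * (v(:) - mu).^2);    % D-by-M kernel matrix
f_v   = G * pi_;                             % f_pi(v), Eq. (38)
dfdpi = G;                                   % Eq. (39): d f_pi(v) / d pi
dfdv  = -(G .* (gamma * (v(:) - mu))) * pi_; % Eq. (40): diagonal of f'_pi(v)
```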
Concerning vector-valued inputs $\mathbf{v} = [v_1, \ldots, v_L]^T$ to the univariate shrinkage function, please note that
\begin{align}
f_{\boldsymbol{\pi}}(\mathbf{v}) &= [f_{\boldsymbol{\pi}}(v_1), \ldots, f_{\boldsymbol{\pi}}(v_L)]^T \tag{41}\\
\frac{\partial f_{\boldsymbol{\pi}}(\mathbf{v})}{\partial \mathbf{v}} &= \mathrm{diag}\Bigl\{\frac{\partial f_{\boldsymbol{\pi}}(v_1)}{\partial v_1}, \ldots, \frac{\partial f_{\boldsymbol{\pi}}(v_L)}{\partial v_L}\Bigr\} =: f'_{\boldsymbol{\pi}}(\mathbf{v}) \in \mathbb{R}^{L\times L}. \tag{42}
\end{align}
For practical purposes, we not only precompute the shrinkage function $f_{\boldsymbol{\pi}}(v)$ for all (sensible) inputs $v$ and store it in a lookup table (LUT), but also all of its derivatives that are required for learning the model; all values are then quickly retrieved from the LUT via linear interpolation.
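A sketch of the LUT idea, reusing the example values from the previous fragment; the grid range and resolution are hypothetical choices, and interp1 performs the linear interpolation (here with zero extrapolation outside the grid, an assumption of ours). The same table layout can be reused for the derivatives in Eqs. (39)-(40).

```matlab
% Precompute f_pi on a dense grid once, then evaluate by interpolation.
vgrid = linspace(-300, 300, 4096);                   % assumed input range
lut   = exp(-(gamma/2) * (vgrid(:) - mu).^2) * pi_;  % f_pi on the grid
f_v   = interp1(vgrid, lut, v, 'linear', 0);         % fast LUT evaluation
```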
Concretely, the relevant gradient w.r.t. the shrinkage function weights is computed as
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \boldsymbol{\pi}_{ti}} &= \hat{\mathbf{c}}_t^T\mathbf{F}_{ti}^T\frac{\partial f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial \boldsymbol{\pi}_{ti}} \tag{43}\\
&= (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T\frac{\partial f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial \boldsymbol{\pi}_{ti}}, \tag{44}
\end{align}
where $\frac{\partial f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial \boldsymbol{\pi}_{ti}} \in \mathbb{R}^{D\times M}$ is a matrix that contains in each row the derivatives w.r.t. all $M$ entries of vector $\boldsymbol{\pi}_{ti}$, for each of the $D$ filter responses $\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P$.
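Combining Eq. (44) with Eq. (39), this gradient reduces to one convolution with $\hat{\mathbf{c}}_t$ and one matrix product with the RBF kernel matrix. The sketch assumes the filter f_ti, the padded previous estimate xP, chat ($\hat{\mathbf{c}}_t$, same padded size), and the RBF parameters mu and gamma are in scope.

```matlab
% Sketch of Eq. (44): gradient of the loss w.r.t. the RBF weights pi_ti.
Fc    = psf2otf(f_ti, size(xP));
v     = real(ifft2(Fc .* fft2(xP)));       % filter responses F_ti * x^P_{t-1}
Fchat = real(ifft2(Fc .* fft2(chat)));     % F_ti * c_hat_t
G     = exp(-(gamma/2) * (v(:) - mu).^2);  % D-by-M matrix of Eq. (39)
gpi   = Fchat(:)' * G;                     % Eq. (44): 1-by-M gradient
```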
Filter. We define each filter of size $m \times n$ w.r.t. a basis $\mathbf{B} \in \mathbb{R}^{mn\times V}$ as $\mathbf{f} = \mathbf{B}\tilde{\mathbf{f}}$ and learn the entries of $\tilde{\mathbf{f}} \in \mathbb{R}^V$. We follow Chen et al. [1] and choose DCT filters for $\mathbf{B}$, omitting the DC component to guarantee zero-mean filters.
First, it is useful to denote $\mathbf{f}\otimes\mathbf{x} \equiv \mathbf{F}\mathbf{x} = (\mathbf{f}^T[\mathbf{x}]_C)^T = [\mathbf{x}]_C^T\mathbf{f}$, where $[\mathbf{x}]_C \in \mathbb{R}^{mn\times D}$ is a matrix of all cliques $C$ of $\mathbf{x} \in \mathbb{R}^D$ that filter $\mathbf{f}$ is applied to. Then, we can differentiate the convolved image w.r.t. the filter coefficients $\tilde{\mathbf{f}}$ as follows:
\begin{align}
\frac{\partial \mathbf{F}\mathbf{x}}{\partial \tilde{\mathbf{f}}} = \frac{\partial\, [\mathbf{x}]_C^T\mathbf{B}\tilde{\mathbf{f}}}{\partial \tilde{\mathbf{f}}} = [\mathbf{x}]_C^T\mathbf{B} \in \mathbb{R}^{D\times V}. \tag{45}
\end{align}
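For concreteness, a zero-mean DCT basis $\mathbf{B}$ for 5×5 filters can be built as follows; this is a sketch using dctmtx (Image Processing Toolbox), and the filter size and variable names are our own choices.

```matlab
% Build B in R^{25x24}: 2D DCT basis for 5x5 filters, DC component dropped.
D1 = dctmtx(5);             % 1D DCT matrix; rows are 1D basis functions
B  = kron(D1, D1)';         % columns are 2D (separable) DCT basis filters
B  = B(:, 2:end);           % drop DC column -> all basis filters zero-mean
f_tilde = randn(24, 1);     % example coefficients f_tilde in R^24
f  = reshape(B * f_tilde, 5, 5);  % filter f = B * f_tilde (cf. Eq. 45)
```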
The gradient of the loss w.r.t. the filter coefficients is derived as
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \tilde{\mathbf{f}}_{ti}} &= \hat{\mathbf{c}}_t^T\Biggl(\frac{\partial\, \mathbf{F}_{ti}^T f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial \tilde{\mathbf{f}}_{ti}} - \frac{\partial\, \mathbf{F}_{ti}^T\mathbf{F}_{ti}\hat{\mathbf{x}}_t^B}{\partial \tilde{\mathbf{f}}_{ti}}\Biggr) \tag{46}\\
&= f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)^T\frac{\partial\, \mathbf{F}_{ti}\hat{\mathbf{c}}_t}{\partial \tilde{\mathbf{f}}_{ti}} + (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T\frac{\partial f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)}{\partial\, \mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P}\cdot\frac{\partial\, \mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P}{\partial \tilde{\mathbf{f}}_{ti}} - \frac{\partial\, (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T\mathbf{F}_{ti}\hat{\mathbf{x}}_t^B}{\partial \tilde{\mathbf{f}}_{ti}} \tag{47}\\
&= f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)^T[\hat{\mathbf{c}}_t]_C^T\mathbf{B} + (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)[\hat{\mathbf{x}}_{t-1}^P]_C^T\mathbf{B} - (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T[\hat{\mathbf{x}}_t^B]_C^T\mathbf{B} - (\mathbf{F}_{ti}\hat{\mathbf{x}}_t^B)^T[\hat{\mathbf{c}}_t]_C^T\mathbf{B} \tag{48}\\
&= \Bigl[[\hat{\mathbf{c}}_t]_C\bigl(f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) - \mathbf{F}_{ti}\hat{\mathbf{x}}_t^B\bigr) + [\hat{\mathbf{x}}_{t-1}^P]_C f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t - [\hat{\mathbf{x}}_t^B]_C\mathbf{F}_{ti}\hat{\mathbf{c}}_t\Bigr]^T\mathbf{B}. \tag{49}
\end{align}
To avoid duplicating degrees of freedom, in practice we learn filters with fixed unit norm
\begin{align}
\mathbf{f} = \frac{\mathbf{B}\tilde{\mathbf{f}}}{\|\mathbf{B}\tilde{\mathbf{f}}\|} = \mathbf{B}\tilde{\mathbf{f}}\cdot\bigl(\tilde{\mathbf{f}}^T(\mathbf{B}^T\mathbf{B})\tilde{\mathbf{f}}\bigr)^{-1/2}. \tag{50}
\end{align}
To perform learning for these normalized filters, it is again useful to first derive
\begin{align}
\frac{\partial \mathbf{F}\mathbf{x}}{\partial \tilde{\mathbf{f}}} &= \frac{\partial}{\partial \tilde{\mathbf{f}}}\Bigl[[\mathbf{x}]_C^T\mathbf{B}\tilde{\mathbf{f}}\cdot\bigl(\tilde{\mathbf{f}}^T(\mathbf{B}^T\mathbf{B})\tilde{\mathbf{f}}\bigr)^{-1/2}\Bigr] \tag{51}\\
&= [\mathbf{x}]_C^T\mathbf{B}\cdot\bigl(\tilde{\mathbf{f}}^T(\mathbf{B}^T\mathbf{B})\tilde{\mathbf{f}}\bigr)^{-1/2}\cdot\mathbf{I} - [\mathbf{x}]_C^T\mathbf{B}\tilde{\mathbf{f}}\cdot\tfrac{1}{2}\bigl(\tilde{\mathbf{f}}^T(\mathbf{B}^T\mathbf{B})\tilde{\mathbf{f}}\bigr)^{-3/2}\cdot 2\tilde{\mathbf{f}}^T\mathbf{B}^T\mathbf{B} \tag{52}\\
&= \|\mathbf{B}\tilde{\mathbf{f}}\|^{-1}\cdot[\mathbf{x}]_C^T\mathbf{B} - [\mathbf{x}]_C^T\mathbf{B}\tilde{\mathbf{f}}\cdot\|\mathbf{B}\tilde{\mathbf{f}}\|^{-1}\bigl(\|\mathbf{B}\tilde{\mathbf{f}}\|^{-1}\cdot\tilde{\mathbf{f}}^T\mathbf{B}^T\mathbf{B}\bigr)\|\mathbf{B}\tilde{\mathbf{f}}\|^{-1} \tag{53}\\
&= \|\mathbf{B}\tilde{\mathbf{f}}\|^{-1}\cdot\bigl([\mathbf{x}]_C^T\mathbf{B} - [\mathbf{x}]_C^T\mathbf{f}\cdot\mathbf{f}^T\mathbf{B}\bigr) = \bigl([\mathbf{x}]_C^T - \mathbf{F}\mathbf{x}\cdot\mathbf{f}^T\bigr)\frac{\mathbf{B}}{\|\mathbf{B}\tilde{\mathbf{f}}\|} \tag{54}
\end{align}
to then derive the gradient of the loss w.r.t. the filter coefficients
\begin{align}
\frac{\partial \ell(\hat{\mathbf{x}}_t; \mathbf{x}_{gt})}{\partial \tilde{\mathbf{f}}_{ti}} &= \Bigl[[\hat{\mathbf{c}}_t]_C\bigl(f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) - \mathbf{F}_{ti}\hat{\mathbf{x}}_t^B\bigr) + [\hat{\mathbf{x}}_{t-1}^P]_C f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t - [\hat{\mathbf{x}}_t^B]_C\mathbf{F}_{ti}\hat{\mathbf{c}}_t\Bigr]^T\frac{\mathbf{B}}{\|\mathbf{B}\tilde{\mathbf{f}}_{ti}\|} \tag{55}\\
&\quad - \Bigl[(\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T\bigl(f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) - \mathbf{F}_{ti}\hat{\mathbf{x}}_t^B\bigr) + (\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)^T f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t - (\mathbf{F}_{ti}\hat{\mathbf{x}}_t^B)^T\mathbf{F}_{ti}\hat{\mathbf{c}}_t\Bigr]\frac{\mathbf{f}^T\mathbf{B}}{\|\mathbf{B}\tilde{\mathbf{f}}_{ti}\|} \tag{56}\\
&= \Bigl[[\hat{\mathbf{c}}_t]_C\bigl(f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) - \mathbf{F}_{ti}\hat{\mathbf{x}}_t^B\bigr) + [\hat{\mathbf{x}}_{t-1}^P]_C f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{c}}_t - [\hat{\mathbf{x}}_t^B]_C\mathbf{F}_{ti}\hat{\mathbf{c}}_t\Bigr]^T\frac{\mathbf{B}}{\|\mathbf{B}\tilde{\mathbf{f}}_{ti}\|} \tag{57}\\
&\quad - (\mathbf{F}_{ti}\hat{\mathbf{c}}_t)^T\Bigl[f_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P) - 2\mathbf{F}_{ti}\hat{\mathbf{x}}_t^B + f'_{\boldsymbol{\pi}_{ti}}(\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P)\mathbf{F}_{ti}\hat{\mathbf{x}}_{t-1}^P\Bigr]\frac{\mathbf{f}^T\mathbf{B}}{\|\mathbf{B}\tilde{\mathbf{f}}_{ti}\|}, \tag{58}
\end{align}
where we can re-use most of our previous derivation (Eq. 49).
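Since derivations like Eq. (54) are error-prone, it is convenient to verify them numerically. The following self-contained Matlab sketch (all sizes, names, and random stand-in data are ours) checks Eq. (54) against central finite differences; the reported deviation should be negligible if the formula is correct.

```matlab
% Finite-difference check of Eq. (54): d(Fx)/d(f_tilde) for the
% unit-norm filter f = B*ft/norm(B*ft), using random stand-in data.
XC = randn(5, 64);             % stand-in for cliques [x]_C (mn-by-D)
B  = orth(randn(5, 4));        % stand-in basis B (mn-by-V)
ft = randn(4, 1);              % filter coefficients f_tilde
f  = @(ft) B*ft / norm(B*ft);  % normalized filter, Eq. (50)
Fx = @(ft) XC' * f(ft);        % "convolution" as matrix product
J  = (XC' - Fx(ft) * f(ft)') * B / norm(B*ft);  % analytic Jacobian, Eq. (54)
Jn = zeros(64, 4); e = 1e-6;
for j = 1:4
  d = zeros(4, 1); d(j) = e;
  Jn(:, j) = (Fx(ft + d) - Fx(ft - d)) / (2*e);  % central differences
end
fprintf('max abs deviation: %g\n', max(abs(J(:) - Jn(:))));
```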
2. Proof of Lemma 1
Lemma 1. For any function $f: \mathbb{R} \to \mathbb{R}$ and $\varepsilon \geq 0$, $\arg\min_z f(z) \leq \arg\min_z\,\bigl(f(z) - \varepsilon z\bigr)$.
Proof. If $\varepsilon = 0$, Lemma 1 is trivially true; hence assume $\varepsilon > 0$ from now on. Let
\begin{align}
g(z) &= f(z) - \varepsilon z, \tag{59}\\
\hat{z}_f &= \arg\min_z f(z), \tag{60}\\
\hat{z}_g &= \arg\min_z g(z), \tag{61}
\end{align}
then the inequalities
\begin{align}
f(\hat{z}_g) &\geq f(\hat{z}_f) = \min_z f(z) \tag{62}\\
g(\hat{z}_f) &\geq g(\hat{z}_g) = \min_z g(z) \tag{63}
\end{align}
hold. Combining the two inequalities above proves the lemma:
\begin{align}
g(\hat{z}_f) &\geq g(\hat{z}_g) \tag{64}\\
\Rightarrow\quad f(\hat{z}_f) - \varepsilon\hat{z}_f &\geq f(\hat{z}_g) - \varepsilon\hat{z}_g \tag{65}\\
\Rightarrow\quad f(\hat{z}_f) - \varepsilon\hat{z}_f &\geq f(\hat{z}_f) - \varepsilon\hat{z}_g \tag{66}\\
\Rightarrow\quad -\varepsilon\hat{z}_f &\geq -\varepsilon\hat{z}_g \tag{67}\\
\Rightarrow\quad \hat{z}_f &\leq \hat{z}_g \tag{68}\\
\Rightarrow\quad \arg\min_z f(z) &\leq \arg\min_z\,\bigl(f(z) - \varepsilon z\bigr). \tag{69}
\end{align}
Here, step (66) applies Eq. (62), i.e., $f(\hat{z}_g) \geq f(\hat{z}_f)$, to the right-hand side, and step (68) uses $\varepsilon > 0$.
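As a quick sanity check of Lemma 1 (our example): for $f(z) = (z-1)^2$ and $\varepsilon = 2$, we have $\arg\min_z f(z) = 1$, while $f(z) - 2z = z^2 - 4z + 1$ is minimized at $z = 2 \geq 1$, as claimed.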
Acknowledgements. We thank StackExchange (Mathematics) user JiK for their advice.
References
[1] Y. Chen, T. Pock, R. Ranftl, and H. Bischof. Revisiting loss-specific training of filter-based MRFs for image restoration. In GCPR, 2013.