Deep Sparse Coding and Dictionary Learning
Subhadip Mukherjee and Chandra Sekhar Seelamantula
Department of Electrical Engineering
Indian Institute of Science, Bangalore 560012, India
Email: {subhadip, chandra.sekhar}@ee.iisc.ernet.in
1. Sparse Coding

- Problem statement: Given a signal y ∈ R^m and an overcomplete dictionary A ∈ R^{m×n}, find an s-sparse vector x ∈ R^n, s ≪ n, such that y ≈ Ax ⟹ basis selection: express y = Σ_{i∈S} x_i a_i, |S| = s.
- Solution: Seek the minimum ℓ0-(quasi)-norm solution, which is NP-hard:
  min_x ‖x‖_0 subject to y = Ax
- Relaxation techniques: Solve the sparsity-regularized least-squares problem
  x̂ = arg min_x (1/2)‖y − Ax‖_2^2 + λ R(x),
  where f(x) = (1/2)‖y − Ax‖_2^2 is the data-fidelity term and R(x) promotes sparsity.

2. Iterative Shrinkage-Thresholding Algorithm (ISTA)

- ISTA update: x^{t+1} = T_{λη}(x^t − η ∇f(x^t)) = T_{λη}(W x^t + b), where W = I − η A^⊤A, b = η A^⊤y, and T_ν denotes soft thresholding with threshold ν (see the sketch below).
- Each iteration is an affine transformation followed by a pointwise nonlinearity, i.e., one layer of a feed-forward neural net.
- Our approach: learn the activation (shrinkage) function in a data-driven manner.
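The following is a minimal NumPy sketch of the ISTA iteration written in the affine-plus-shrinkage form above. The step size η = 1/‖A‖_2^2, the regularization weight, and the iteration count are illustrative assumptions, not values prescribed by the poster.

```python
import numpy as np

def soft_threshold(v, nu):
    """Pointwise soft thresholding: T_nu(v) = sign(v) * max(|v| - nu, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)

def ista(y, A, lam, num_iters=1000):
    """Run ISTA for (1/2)||y - Ax||^2 + lam * ||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2     # step size from the spectral norm of A
    W = np.eye(A.shape[1]) - eta * A.T @ A    # fixed linear weights
    b = eta * A.T @ y                         # fixed bias
    x = np.zeros(A.shape[1])                  # x^0: all-zero initial estimate
    for _ in range(num_iters):
        x = soft_threshold(W @ x + b, lam * eta)  # one "layer": affine map + shrinkage
    return x
```

Each iteration is exactly one layer of the unfolded network: a fixed affine map (W, b) followed by the shrinkage nonlinearity that the LETnet later replaces with a learned activation.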
3. Architecture and Training of LETnet

- LET modeling of the activation: ψ^t(u) = Σ_{k=1}^K c_k^t φ_k(u), where φ_k(u) = u exp(−(k−1)u²/(2τ²)).
- LETnet architecture: the ISTA iterations are unfolded into an L-layer network in which layer t computes x^t = ψ^t(x̃^t) with x̃^t = W x^{t−1} + b. The weights W and bias b are fixed by the sensing matrix A and the measurement y; only the LET coefficients are learned. The nonlinear LET shrinkage operators ψ^t are applied pointwise on x̃^t at layer t (cf. Fig. 2 of the paper).
- Two variants: LETnetFixed, where the LET parameters are tied across all layers, and LETnetVar, where they are allowed to vary across layers. LETnetVar gives better reconstruction owing to its higher flexibility.
- Advantage 1: the number of parameters to learn does not grow with n.
- Advantage 2: it is possible to learn a rich variety of activations and induced regularizers. In contrast with MMSE-ISTA (Kamilov et al. [28]), which parametrizes the activation with K ≈ 8000 cubic B-spline bases, the LET expansion needs only K ≈ 5 basis functions, so the network is less likely to over-fit a fixed training set.

LETnet training:

- Training cost J(c) = (1/2) Σ_{q=1}^N ‖x^L(y_q, c) − x_q‖_2^2, where c = [c^1; c^2; ⋯; c^L].
- Vanilla gradient descent tends to diverge because of exploding gradients, caused primarily by the unboundedness of the LET expansion, so we use second-order Hessian-free optimization (HFO).
- In the i-th epoch of HFO, a locally quadratic model J_i(c_i + Δc) = J(c_i) + Δc^⊤ g_i + (1/2) Δc^⊤ H_i Δc is minimized with a damping term λ‖Δc‖_2^2; the optimal direction Δc* is computed using conjugate gradients (CG) at every epoch, and the parameters are updated as c_{i+1} = c_i + Δc*.
- Two ingredients of CG: the gradient g_i = ∇J(c)|_{c=c_i} and the Hessian-vector product H_i u for any vector u, obtained via the R-operator: R_u(h(c)) = lim_{α→0} [h(c + αu) − h(c)]/α ⟹ H_i u = R_u(∇_c J(c))|_{c=c_i}. Both g_i and H_i u are computed using back-propagation.

Algorithm 1: The back-propagation algorithm for the LETnetVar architecture
Input: a training pair (y, x) and the sensing matrix A.
1: Perform a forward pass through the LETnetVar to compute x^t and x̃^t, t = 1 : L.
2: Initialize ∇_{x^L} J = x^L − x.
3: for t = L down to 1 do
4:   [∇_{c^t} J]_k = φ_k(x̃^t)^⊤ ∇_{x^t} J
5:   ∇_{x̃^t} J = diag(ψ^t′(x̃^t)) ∇_{x^t} J
6:   ∇_{x^{t−1}} J = W^⊤ ∇_{x̃^t} J
Output: ∇_{c^t} J, for t = 1 : L.
A code sketch of this back-propagation is given below.
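A minimal NumPy sketch of Algorithm 1, computing the gradient of the squared estimation error with respect to the per-layer LET coefficients for a single training pair. The function names, the coefficient layout C (one row per layer), and the values of eta and tau are illustrative assumptions; the recursions follow the algorithm above.

```python
import numpy as np

def let_basis(u, K, tau):
    """phi_k(u) = u * exp(-(k-1) u^2 / (2 tau^2)), stacked as shape (K, len(u))."""
    ks = np.arange(K).reshape(-1, 1)  # k-1 = 0, 1, ..., K-1
    return u[None, :] * np.exp(-ks * u[None, :] ** 2 / (2.0 * tau ** 2))

def let_basis_deriv(u, K, tau):
    """d/du phi_k(u) = exp(-(k-1)u^2/(2 tau^2)) * (1 - (k-1) u^2 / tau^2)."""
    ks = np.arange(K).reshape(-1, 1)
    e = np.exp(-ks * u[None, :] ** 2 / (2.0 * tau ** 2))
    return e * (1.0 - ks * u[None, :] ** 2 / tau ** 2)

def letnet_backprop(y, x_true, A, C, eta, tau):
    """Gradient of J = 0.5 * ||x^L - x_true||^2 w.r.t. the per-layer LET coefficients C (L x K)."""
    L, K = C.shape
    W = np.eye(A.shape[1]) - eta * A.T @ A
    b = eta * A.T @ y
    # Forward pass: store pre-activations x_tilde^t and activations x^t.
    x = np.zeros(A.shape[1])
    xs, xts = [x], []
    for t in range(L):
        xt = W @ x + b
        x = C[t] @ let_basis(xt, K, tau)      # psi^t(x_tilde^t) = sum_k c_k^t phi_k(x_tilde^t)
        xts.append(xt)
        xs.append(x)
    # Backward pass (Algorithm 1): the gradient w.r.t. x^L is the residual.
    g_x = xs[-1] - x_true
    grad_C = np.zeros_like(C)
    for t in reversed(range(L)):
        phi = let_basis(xts[t], K, tau)
        grad_C[t] = phi @ g_x                                  # [grad_{c^t}]_k = phi_k(x~^t)^T grad_{x^t}
        g_xt = (C[t] @ let_basis_deriv(xts[t], K, tau)) * g_x  # grad_{x~^t} = diag(psi'^t) grad_{x^t}
        g_x = W.T @ g_xt                                       # grad_{x^{t-1}} = W^T grad_{x~^t}
    return grad_C
```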
4. Validation on Synthetic Signals

- Data generation: for a given A, the training dataset D consists of N examples {(y_q, x_q)}_{q=1}^N, where y_q = A x_q + ξ_q and the noise vectors ξ_q are independent and identically distributed (see the sketch below).
- Settings: n = 256, m = ⌈0.7n⌉; A_{i,j} ~ N(0, 1/m); x = x_supp ⊙ x_mag, where x_supp ~ Bernoulli(ρ) and x_mag ~ N(0, 1); 0 < ρ < 1, and the smaller ρ is, the sparser x is.
- λ is chosen optimally using cross-validation; 100 examples are used for training and 100 for testing; performance is averaged over 10 independent trials.
- Results: LETnetVar with 100 layers performs 3 to 4 dB better than ISTA with an optimally chosen λ, and the improvement in recovery SNR is higher when the measurements are noisier. Both training and validation errors decrease with the training epochs, indicating no over-fitting.
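A minimal NumPy sketch of the Bernoulli-Gaussian data generation described above. The way the noise is scaled to reach a target input SNR is an assumption for illustration; the poster states the SNR values but not the scaling convention.

```python
import numpy as np

def make_dataset(n=256, rho=0.2, num_examples=100, input_snr_db=20.0, seed=0):
    """Generate a Bernoulli-Gaussian sparse-coding dataset y = A x + noise."""
    rng = np.random.default_rng(seed)
    m = int(np.ceil(0.7 * n))
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))        # A_ij ~ N(0, 1/m)
    X = rng.binomial(1, rho, size=(num_examples, n)) * rng.normal(size=(num_examples, n))
    Y_clean = X @ A.T
    noise = rng.normal(size=Y_clean.shape)
    # Scale the noise so that ||Y_clean|| / ||noise|| matches the requested SNR (in dB).
    scale = np.linalg.norm(Y_clean) / (np.linalg.norm(noise) * 10 ** (input_snr_db / 20.0))
    Y = Y_clean + scale * noise
    return A, X, Y
```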
5. fLETnet: A Deep Architecture Motivated by FISTA

- FISTA iterations unfolded: z^t = (1 + γ_t) x^{t−1} − γ_t x^{t−2}; x^t = T_ν(x̃^t), where x̃^t = W z^t + b, for t = 1, 2, ⋯, L (see the sketch below).
- fLETnet architecture: replace T_ν with a parametric LET activation ψ^t. Every layer (except the first) is connected to two of its previous layers, and the activation functions are untied across layers.
- Key features of fLETnet:
  - Direct links from two previous layers (second-order memory); the additional links are identity mappings, so no new parameters have to be learned.
  - Circumvents the issue of vanishing/exploding gradients.
  - The resulting architecture is essentially a deep residual network.
  - Delivers performance equivalent to the LETnet with half as many layers.
- What regularizers does the fLETnet learn? The learnt LET activations and their induced regularizers cover a wide range between hard and soft thresholding, and even go beyond these, balancing noise rejection against signal preservation (cf. Fig. 6 of the paper).
- Support recovery: the layer-wise reconstruction SNR of a trained fLETnet rises steeply over roughly the first 20 layers, oscillates, and then settles about 5.5 dB above the FISTA output, suggesting that the network stabilizes its support estimate in that regime.
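A minimal NumPy sketch of the FISTA-style unfolded forward pass with second-order memory. Plain soft thresholding stands in for the learned LET activation, and the momentum schedule γ_t below is the standard FISTA choice; both are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, nu):
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)

def flet_forward(y, A, nu, num_layers=50):
    """FISTA-style unfolded forward pass with links to the two previous layers."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    W = np.eye(A.shape[1]) - eta * A.T @ A
    b = eta * A.T @ y
    x_prev = np.zeros(A.shape[1])   # x^{t-2}
    x_curr = np.zeros(A.shape[1])   # x^{t-1}
    s = 1.0                         # FISTA momentum sequence
    for _ in range(num_layers):
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0
        gamma = (s - 1.0) / s_next                      # momentum weight gamma_t
        z = (1.0 + gamma) * x_curr - gamma * x_prev     # combine the two previous layers
        x_prev, x_curr = x_curr, soft_threshold(W @ z + b, nu)  # a learned psi^t would replace this
        s = s_next
    return x_curr
```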
6. Comparison of Testing Run-times

Algorithm   | Per-iteration/layer run-time (ms) | # layers/iterations | Total time (ms)
ISTA        | 0.0331                            | 1000                | 33.10
FISTA       | 0.0394                            | 1000                | 39.40
LETnet      | 0.0895                            | 100                 | 8.95
fLETnet     | 0.1088                            | 50                  | 5.44
MMSE-ISTA   | 0.6184                            | 1000                | 618.40
CoSaMP      | 11.7672                           | 50                  | 588.36
IRLS        | 5.2784                            | 50                  | 263.92

Although a LETnet/fLETnet layer costs more than an ISTA or FISTA iteration, the networks need far fewer layers, leading to less computation time overall. MMSE-ISTA requires relatively more time because it uses significantly more parameters to model the activation function, while IRLS and CoSaMP are considerably slower since they require a matrix inversion and a pseudo-inverse, respectively, within every iteration. The reconstruction accuracy is also superior in the case of LETnetVar/LETnetFixed.
7. Deep Dictionary Learning

- Problem statement: given a set of signals {y_j}_{j=1}^N ⊂ R^m, learn an overcomplete dictionary A ∈ R^{m×n} and s-sparse vectors {x_j}_{j=1}^N ⊂ R^n, s ≪ n, such that y_j ≈ A x_j for all j.
- Proposed approach: Â = arg min_A Σ_{j=1}^N ‖A net_{y_j}(A) − y_j‖_2^2, where net_{y_j}(A) is the output of ISTA corresponding to the signal y_j and dictionary A.
- Optimization by gradient descent: A ← A − μ ∇J(A); computing ∇J(A) requires only matrix-vector products.
- Advantages over conventional dictionary learning algorithms: online and distributed implementations are possible, and desirable properties of the dictionary, such as incoherence, can be promoted by adding a penalty and appropriately modifying the gradient.
- Performance metrics (see the sketch below): atom detection rate = #{recovered atoms}/n; distance of Â from the ground truth A*: δ = (1/n) Σ_{i=1}^n min_{1≤j≤n} (1 − a_j^⊤ a*_i).
- Experimental validation: n = 50, m = 20, sparsity s = 3, N = 2000 examples, SNR_input = 30 dB, L = 200 layers, N_epoch = 80 training epochs. The DNN-based dictionary learning is compared with K-SVD in terms of atom detection rate and distance from the ground-truth dictionary as training progresses.
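A minimal NumPy sketch of the two evaluation metrics. The detection threshold of 0.99 and the use of the absolute correlation (to make the comparison sign-invariant) are assumptions made for illustration, not values stated on the poster.

```python
import numpy as np

def dictionary_metrics(A_hat, A_star, detect_thresh=0.99):
    """Atom detection rate and average distance between a learned and a ground-truth dictionary."""
    # Normalize columns so that inner products are correlations.
    A_hat = A_hat / np.linalg.norm(A_hat, axis=0, keepdims=True)
    A_star = A_star / np.linalg.norm(A_star, axis=0, keepdims=True)
    corr = np.abs(A_star.T @ A_hat)          # corr[i, j] = |<a*_i, a_hat_j>|
    best = corr.max(axis=1)                  # best match for each ground-truth atom
    n = A_star.shape[1]
    detection_rate = np.sum(best > detect_thresh) / n   # fraction of recovered atoms
    distance = np.mean(1.0 - best)                       # (1/n) sum_i min_j (1 - |a_j^T a*_i|)
    return detection_rate, distance
```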
Acknowledgement: This is a joint work with Debabrata Mahapatra.
Deep Sparse Coding and Dictionary Learning
Subhadip Mukherjee and Chandra Sekhar Seelamantula
Joint work with Debabrata Mahapatra
Department of Electrical Engineering
Indian Institute of Science
Bangalore – 560012, India
Email:
{subhadip, chandra.sekhar}@ee.iisc.ernet.in
April 7, 2017
EECS Divisional Symposium, 2017
Outline

- What is sparse coding?
  - Iterative shrinkage-thresholding algorithms (ISTA)
  - Iterative unfolding and connections to deep neural networks (DNNs)
  - Building a learnable model for sparse coding
- Our contributions
  - Modeling the non-linearity using a linear expansion of thresholds (LETs)
  - Parametric flexibility for designing regularizers
  - Efficient second-order (Hessian-free) optimization for learning
  - Reducing the number of layers and the link with deep residual networks
  - Building a deep architecture for dictionary learning
- Simulation results and insights
- Summary and future works
What is sparse coding?

Problem statement: Given a signal y ∈ R^m and an overcomplete dictionary A ∈ R^{m×n}, find an s-sparse vector x ∈ R^n, s ≪ n, such that y = Ax
- Basis selection: express y = Σ_{i∈S} x_i a_i, |S| = s
- Estimating A from given data is the problem of dictionary learning

Solution: Seek the minimum ℓ0-(quasi)-norm solution, which is combinatorially hard:
  min_x ‖x‖_0 subject to y = Ax
Iterative shrinkage-thresholding algorithm (ISTA) meets neural network

Relaxation techniques: Solve the sparsity-regularized least-squares problem
  x̂ = arg min_x (1/2)‖y − Ax‖_2^2 + λ R(x),
where f(x) = (1/2)‖y − Ax‖_2^2 is the data-fidelity term and R(x) promotes sparsity.

ISTA update rule (Daubechies et al., 2004)
  x^{t+1} = T_{λη}(x^t − η ∇f(x^t)), where T_ν denotes soft thresholding with threshold ν

Unfolding of ISTA iterations
  x^{t+1} = T_{λη}(W x^t + b), where W = I − η A^⊤A and b = η A^⊤y
  (an affine map followed by a pointwise nonlinearity, i.e., one layer of a network)
Building learnable models

- Learn the linear transformation parameters W and b from the data – data-intensive
- Learn the nonlinear shrinkage function
LETnet: Modeling the activation nonlinearity using LETs

LET modeling:
  ψ^t(u) = Σ_{k=1}^K c_k^t φ_k(u), where φ_k(u) = u exp(−(k−1)u²/(2τ²))
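A minimal NumPy sketch showing one way the LET coefficients can be initialized so that ψ approximates the soft-thresholding operator, here by a least-squares fit on a grid. The grid, τ, K, and the threshold value are illustrative assumptions.

```python
import numpy as np

def let_basis(u, K, tau):
    """phi_k(u) = u * exp(-(k-1) u^2 / (2 tau^2)), shape (len(u), K)."""
    ks = np.arange(K)
    return u[:, None] * np.exp(-ks[None, :] * u[:, None] ** 2 / (2.0 * tau ** 2))

# Fit coefficients c so that psi(u) = sum_k c_k phi_k(u) approximates soft thresholding T_nu(u).
K, tau, nu = 5, 1.0, 0.3
u = np.linspace(-4.0, 4.0, 1001)
target = np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)
Phi = let_basis(u, K, tau)
c, *_ = np.linalg.lstsq(Phi, target, rcond=None)   # least-squares LET coefficients
psi = Phi @ c                                      # the fitted LET activation on the grid
print(np.max(np.abs(psi - target)))                # residual of the initialization fit
```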
[Embedded figure from the IEEE T-PAMI paper, p. ix – Fig. 3: (Color online) The variation of the (ensemble-averaged) reconstruction SNR for different algorithms as a function of the sparsity parameter ρ, corresponding to input SNRs of 10, 15, 20, 25, and 30 dB, along with the training and validation error.]

Unfolding of FISTA (Nesterov, 1980s) and the fLETnet

FISTA iterations unfolded:
  1. z^t = (1 + γ_t) x^{t−1} − γ_t x^{t−2}
  2. x̃^t = W z^t + b
  3. x^t = T_ν(x̃^t), for t = 1, 2, ⋯, L
[Layer diagram omitted.]

Fig. 4: Architecture of fLETnetVar. Every layer (except the first layer) is connected to two of its previous layers. The activation functions are untied across layers.
fLETnet: Network architecture motivated by FISTA

- Replace T_ν with a parametric activation ψ^t
- Direct links from two previous layers (second-order memory)
- The additional links are identity mappings; thus no new parameters to learn
- Circumvents the issue of vanishing/exploding gradient
- The fLETnet architecture is essentially a deep residual network (He et al., 2015)
- Results in equivalent performance as the LETnet with half as many layers

[Embedded excerpt from the IEEE T-PAMI paper – 6.2 Training Details of LETnetVar and MMSE-ISTA:
For a fixed sensing matrix A, a training dataset consisting of N = 100 examples is generated for learning the optimal nonlinearities of LETnetVar and MMSE-ISTA. A validation set containing 20 examples is used during training in order to avoid over-fitting. The trained network is tested on a dataset containing N_test = 100 examples drawn according to the same statistical model that was used to generate the training dataset (cf. Section 6.1). The procedure is repeated T = 10 times with different datasets, corresponding to 10 different sensing matrices, to measure the ensemble-averaged performance of LETnetVar and MMSE-ISTA. The optimal regularization parameter λ for ISTA is chosen as explained in Section 6.1 and the same value is used in MMSE-ISTA, as suggested in [28]. The parameter λ_opt used in training LETnetVar is chosen by cross-validation over five values of λ placed logarithmically in the interval [0.05, 0.5]. Both networks contain L = 100 layers.
For the MMSE-ISTA architecture, K = 501 coefficients are used to parametrize the activation function using cubic B-splines, and the same set of coefficients is shared across all the layers. The MMSE-ISTA network is trained using GD with a step size of 10^−4, as suggested in [28], and the algorithm is terminated when the number of training epochs exceeds 1000. The LET-based activation function in LETnetVar is parametrized with K = 5 coefficients in every layer, leading to a total of 500 parameters, which is comparable to the number of free parameters in MMSE-ISTA. The HFO algorithm for training LETnetVar is terminated after 60 epochs.
6.2.1 Initialization and termination criteria for CG: The CG algorithm required to compute the optimal direction Δc in every epoch of HFO is initialized with that obtained in the previous epoch of HFO. Theoretically, it takes KL iterations for the CG algorithm to compute an exact solution to (17), since the variable Δc is of dimension KL. Martens et al. [37] reported that it suffices to run the CG algorithm in each HFO iteration only a few times according to a …]
Training the LETnet and the fLETnet

- Training dataset D consists of N examples {(y_q, x_q)}_{q=1}^N, where y_q = A x_q + ξ_q, for a given A
- Training cost J(c) = (1/2) Σ_{q=1}^N ‖x^L(y_q, c) − x_q‖_2^2, where c = [c^1; c^2; ⋯; c^L]

Second-order optimization
- Quadratic approximation in the i-th training epoch: J_i(c_i + Δc) = J(c_i) + Δc^⊤ g_i + (1/2) Δc^⊤ H_i Δc
- Compute the optimal direction using conjugate gradients (CG) at every epoch i: Δc* = arg min_{Δc} J_i(c_i + Δc) + λ‖Δc‖_2^2
- Update parameters as c_{i+1} = c_i + Δc*

Two ingredients of CG
- Gradient g_i = ∇J(c)|_{c=c_i}
- Hessian-vector product H_i u for any vector u, where H_i = ∇²J(c)|_{c=c_i} (see the sketch below)
- R_u(h(c)) = lim_{α→0} [h(c + αu) − h(c)]/α ⟹ H_i u = R_u(∇_c J(c))|_{c=c_i}
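The R-operator identity above can be checked numerically with a finite-difference approximation of the Hessian-vector product built from two gradient evaluations. This is only a sketch of that approximation; the paper computes the exact product with a back-propagation-style R-operator pass, and grad_fn and alpha here are assumptions for illustration.

```python
import numpy as np

def hessian_vector_product(grad_fn, c, u, alpha=1e-5):
    """Approximate H u = R_u(grad J(c)) by a central finite difference of the gradient.

    grad_fn: callable returning grad J at a given coefficient vector (e.g., obtained
    from the back-propagation of Algorithm 1)."""
    return (grad_fn(c + alpha * u) - grad_fn(c - alpha * u)) / (2.0 * alpha)

# Example usage with a quadratic J(c) = 0.5 * c^T Q c, whose Hessian is exactly Q:
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda c: Q @ c
c0, u = np.zeros(2), np.array([1.0, -1.0])
print(hessian_vector_product(grad_fn, c0, u))   # approximately Q @ u = [2., -3.]
```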
Experimental details and parameter settings

Data generation
- n = 256, m = ⌈0.7n⌉
- A_{i,j} ~ N(0, 1/m)
- x = x_supp ⊙ x_mag, where x_supp ~ Bernoulli(ρ) and x_mag ~ N(0, 1)
- 0 < ρ < 1: the smaller the value of ρ, the sparser the vector x
- λ chosen optimally using cross-validation
- N_train = 100 examples used for training
- N_test = 100 examples used for testing
- Performance averaged over 10 independent trials
Performance assessment of LETnetVar

[Plots omitted: (a) recovery SNR versus ρ at SNR_input = 20 dB for ISTA, FISTA, LETnetVar, MMSE-ISTA, CoSaMP, and IRLS; (b) training and validation error versus training epochs.]

Figure: Comparison of ensemble-averaged reconstruction SNR and its standard deviation.
Observations

- LETnetVar with 100 layers performs 3 to 4 dB better than ISTA with an optimally chosen hyper-parameter (λ).
- When the measurements are noisier, the improvement in recovery SNR of LETnet was found to be higher.
- Training and validation errors reduce monotonically with the training epochs.
What regularizers did the fLETnet learn?

[Figure omitted: layers 8 to 14 of a trained fLETnet.]

Figure (Fig. 6 of the IEEE T-PAMI paper): The learnt parametric LET functions (blue) of fLETnet over a typical dataset and their corresponding induced regularizers (red). The activation functions and the regularizers are compared with ST and the ℓ1 penalty (black), respectively. Zoomed-in portions of the estimated signals at the layers are also shown for visual comparison.

Key observations
- Can learn a wide variety of regularizers
- A signal component missed in a layer can be recovered subsequently
- Balance between signal preservation and noise cancellation

[Embedded excerpt from the IEEE T-PAMI paper, p. xii:
… less computation time overall. The MMSE-ISTA algorithm requires relatively more time than the proposed architectures, mainly because it uses significantly more parameters to model the activation function. The run-times for IRLS and CoSaMP are considerably higher, as they require computing a matrix inversion and a pseudo-inverse, respectively, within every iteration.
7.3 What Regularizers Does the fLETnet Learn? To gain insights into the functioning of fLETnet, it is instructive to study the activation functions and the corresponding regularizers in each layer at the end of training. In Figure 6, we show an illustration of the typical learnt activations and the regularizers over each layer from 8 to 14 of a trained fLETnet. A segment of the corresponding output signals is also shown. The activation functions in all 50 layers are provided in the supplementary document. Take, for instance, layer 8, where the regularizer is closer to the ℓ1 penalty over the dynamic range of the input. On the other hand, the induced regularizers in layers 9, 13, and 14 are between the ℓ1 and ℓ2 penalties. Interestingly, the regularizers in layers 10, 11, and 12 decrease beyond a certain range of inputs and even go negative, thereby encouraging an already large input to increase further in magnitude. Such a behavior is in stark contrast with the familiar ℓ0/ℓ1-norm-based regularizers, which never amplify the input. This behavior corresponds to |ψ(v)| ≈ c1|v| > |v| for |v| > v0 and is reflected in the corresponding regularizer, which slopes downwards and even becomes negative for inputs of large magnitudes. However, the induced regularizers across all layers offer a positive penalty for small magnitudes, resulting in noise rejection. The network effects two simultaneous operations – reduction of small amplitudes (noise) and enhancement of large amplitudes, which are due to the signal. The signal component that gets eliminated in one layer in the process of canceling noise can again be recovered in a subsequent layer due to the negative penalty, as highlighted in the third row of Figure 6. Therefore, the regularizers learnt by fLETnet cover a wide range between hard and soft thresholding, and even go beyond these, leading to a balance between noise rejection and signal preservation.
7.4 Support Recovery in fLETnet: In Figure 7, we plot the layer-wise reconstruction SNR on test signals, defined from the relative error ‖x^t − x‖_2^2/‖x‖_2^2, versus the layer index t of a trained fLETnet. The input SNR and the sparsity parameter ρ are taken as 20 dB and 0.2, respectively. The variation of reconstruction SNR with respect to the corresponding number of iterations of FISTA is also shown …]
A comparison of testing run-times

Algorithm   | Per-iteration/layer run-time (ms) | Number of layers/iterations | Total time (ms)
ISTA        | 0.0331                            | 1000                        | 33.10
FISTA       | 0.0394                            | 1000                        | 39.40
LETnet      | 0.0895                            | 100                         | 8.95
fLETnet     | 0.1088                            | 50                          | 5.44
MMSE-ISTA   | 0.6184                            | 1000                        | 618.40
CoSaMP      | 11.7672                           | 50                          | 588.36
IRLS        | 5.2784                            | 50                          | 263.92
Deep dictionary learning

Problem statement: Given a set of signals {y_j}_{j=1}^N ⊂ R^m, learn an overcomplete dictionary A ∈ R^{m×n} and s-sparse vectors {x_j}_{j=1}^N ⊂ R^n, s ≪ n, such that y_j ≈ A x_j

Proposed approach:
  Â = arg min_A Σ_{j=1}^N ‖A net_{y_j}(A) − y_j‖_2^2
- net_{y_j}(A) is the output of ISTA corresponding to the signal y_j and dictionary A
- Gradient descent: A ← A − μ ∇J(A) (see the sketch below)
- Computing ∇J(A) requires only matrix-vector products

Advantages over conventional dictionary learning algorithms:
- Online implementation
- Distributed implementation for a large training dataset
- Certain desirable properties of the dictionary, such as incoherence, can be promoted by adding a penalty and appropriately modifying the gradient
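A minimal NumPy sketch of the proposed training loop, under simplifying assumptions: the sparse codes net_{y_j}(A) come from a fixed number of unfolded ISTA layers, and the gradient step treats those codes as constants, i.e., it does not back-propagate through the unfolded network as the full method does. The step size, λ, layer count, and the atom re-normalization are illustrative choices.

```python
import numpy as np

def soft_threshold(v, nu):
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)

def ista_net(y, A, lam, num_layers=200):
    """Unfolded ISTA: returns the sparse code net_y(A)."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_layers):
        x = soft_threshold(x - eta * A.T @ (A @ x - y), lam * eta)
    return x

def learn_dictionary(Y, n_atoms, lam=0.1, mu=0.01, num_epochs=80, seed=0):
    """Gradient-descent dictionary learning on J(A) = sum_j ||A net_{y_j}(A) - y_j||^2."""
    rng = np.random.default_rng(seed)
    m = Y.shape[1]
    A = rng.normal(size=(m, n_atoms))
    A /= np.linalg.norm(A, axis=0, keepdims=True)        # unit-norm atoms
    for _ in range(num_epochs):
        grad = np.zeros_like(A)
        for y in Y:
            x = ista_net(y, A, lam)                      # sparse code for this signal
            grad += 2.0 * np.outer(A @ x - y, x)         # d/dA of ||A x - y||^2 with x held fixed
        A -= mu * grad / len(Y)
        A /= np.linalg.norm(A, axis=0, keepdims=True)    # re-normalize atoms (a common heuristic)
    return A
```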
Deep dictionary learning: Performance validation

Data generation
- n = 50, m = 20
- Number of examples N = 2000
- Sparsity s = 3
- Number of layers L = 200
- Add noise to the training dataset such that SNR_input = 30 dB

[Plots omitted: atom detection rate and distance from the ground-truth dictionary versus training epochs, for DNN and K-SVD.]

Figure: Comparison of DNN-based dictionary learning with K-SVD
Summary and future works

Summary
- Sparse coding as a function approximation problem
- Link between ISTA and a DNN
- Learning data-driven parameterized nonlinearities
- Efficient Hessian-free second-order optimization for training the network
- Link between FISTA and deep residual networks
- The induced regularizers are nonstandard! They go beyond ℓ1 and ℓ0
- Extension to dictionary learning

Future works
- Restricted to a fixed dictionary. How about adaptive/slowly varying dictionaries?
- A generic DNN solution to any iterative algorithm for inverse problems?