Neural networks
Sparse coding - inference (ISTA algorithm)

Hugo Larochelle
Département d'informatique
Université de Sherbrooke
[email protected]
November 1, 2012

Abstract

Math for my slides “Sparse coding”.
SPARSE CODING
Topics: sparse coding

• For each input x(t), find a latent representation h(t) such that:
  ‣ it is sparse: the vector h(t) has many zeros
  ‣ we can reconstruct the original input x(t) as much as possible
• More formally:

    min_D (1/T) Σ_{t=1}^T min_{h(t)} (1/2) ||x(t) − D h(t)||₂² + λ ||h(t)||₁

  ‣ the first term is the reconstruction error, the second is the sparsity
    penalty, and λ controls the trade-off between reconstruction and sparsity
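The per-example objective can be checked numerically. Below is a minimal NumPy sketch; the function name and toy data are my own illustration, not from the slides:

```python
import numpy as np

def sparse_coding_loss(x, D, h, lam):
    """Per-example sparse coding objective:
    0.5 * ||x - D h||_2^2 + lam * ||h||_1."""
    residual = x - D @ h
    return 0.5 * residual @ residual + lam * np.abs(h).sum()

# Toy example: a sparse h that reconstructs x exactly pays only the L1 penalty.
D = np.eye(3)
x = np.array([1.0, 0.0, 0.0])
h = np.array([1.0, 0.0, 0.0])
print(sparse_coding_loss(x, D, h, lam=0.1))  # 0.5*0 + 0.1*1 = 0.1
```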
SPARSE CODING
Topics: sparse coding

• The code h(x(t)) is obtained by the minimization

    h(x(t)) = arg min_{h(t)} (1/2) ||x(t) − D h(t)||₂² + λ ||h(t)||₁

• The reconstruction is

    x̂(t) = D h(x(t)) = Σ_{k s.t. h(x(t))_k ≠ 0} D_{·,k} h(x(t))_k

• D is equivalent to the output weight matrix of an autoencoder
  ‣ however, h(x(t)) is now a complicated function of x(t)
  ‣ the encoder is the minimization itself
SPARSE CODING
Topics: inference of sparse codes

• Given D, how do we compute h(x(t))?
  ‣ we want to optimize the loss w.r.t. h(t):

      l(x(t)) = (1/2) ||x(t) − D h(t)||₂² + λ ||h(t)||₁

• We could use a gradient descent method. For a single hidden unit:

      ∂l(x(t))/∂h_k(t) = (D_{·,k})^⊤ (D h(t) − x(t)) + λ sign(h_k(t))

  and for the whole vector:

      ∇_{h(t)} l(x(t)) = D^⊤ (D h(t) − x(t)) + λ sign(h(t))

  ‣ issue: the L1 norm is not differentiable at 0
    - it is very unlikely for gradient descent to ‘‘land’’ exactly on
      h_k(t) = 0, even if that is the solution
  ‣ solution: if h_k(t) changes sign because of the L1 gradient, clamp it to 0
• Each update would be performed as follows:
  ‣ update from reconstruction:

      h(t) ⇐ h(t) − α D^⊤ (D h(t) − x(t))

  ‣ update from sparsity:
    - if sign(h_k(t)) ≠ sign(h_k(t) − α λ sign(h_k(t))), then h_k(t) ⇐ 0
    - else h_k(t) ⇐ h_k(t) − α λ sign(h_k(t))
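As a sketch, the two-part update (gradient step on the reconstruction error, then the sign-clamped L1 step) could look like this in NumPy; the function name and the toy values are illustrative:

```python
import numpy as np

def clamped_grad_step(x, D, h, alpha, lam):
    """One inference update on h:
    (1) gradient step on the reconstruction error,
    (2) L1 step, clamping any coordinate that flips sign to exactly 0."""
    # update from reconstruction: h <- h - alpha * D^T (D h - x)
    h = h - alpha * D.T @ (D @ h - x)
    # update from sparsity: h_k <- h_k - alpha * lam * sign(h_k) ...
    h_new = h - alpha * lam * np.sign(h)
    # ... unless that step makes h_k change sign, in which case clamp to 0
    h_new[np.sign(h_new) != np.sign(h)] = 0.0
    return h_new

# With D = I and x = 0, a small coordinate lands exactly on 0 in one step,
# which a plain (unclamped) gradient step would almost never achieve.
h = clamped_grad_step(np.zeros(2), np.eye(2), np.array([0.05, 1.0]),
                      alpha=0.5, lam=0.2)
# h is now [0.0, 0.4]
```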
SPARSE CODING
Topics: ISTA (Iterative Shrinkage and Thresholding Algorithm)

• This process corresponds to the ISTA algorithm:
  ‣ initialize h(t) (for instance to 0)
  ‣ while h(t) has not converged:
    - h(t) ⇐ h(t) − α D^⊤ (D h(t) − x(t))
    - h(t) ⇐ shrink(h(t), α λ)
  ‣ return h(t)
• Here shrink(a, b) = [. . . , sign(a_i) max(|a_i| − b_i, 0), . . . ]
• ISTA will converge if 1/α is bigger than the largest eigenvalue of D^⊤ D
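A minimal NumPy implementation of the ISTA loop above (the function names are mine; the step size follows the slide's condition that 1/α exceed the largest eigenvalue of D^⊤D):

```python
import numpy as np

def shrink(a, b):
    """Soft-thresholding: sign(a_i) * max(|a_i| - b, 0), elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def ista(x, D, lam, n_iters=100):
    """ISTA inference: h(x) = argmin_h 0.5 ||x - D h||^2 + lam ||h||_1.
    alpha is set from the largest eigenvalue of D^T D so that 1/alpha
    exceeds it (the convergence condition on the slide)."""
    L = np.linalg.eigvalsh(D.T @ D).max()
    alpha = 1.0 / (L + 1e-8)
    h = np.zeros(D.shape[1])                # initialize h (for instance to 0)
    for _ in range(n_iters):
        h = h - alpha * D.T @ (D @ h - x)   # gradient step on reconstruction
        h = shrink(h, alpha * lam)          # shrinkage step on the L1 penalty
    return h

# With D = I, ISTA converges to the soft-thresholded input.
out = ista(np.array([1.0, 0.05, -0.5]), np.eye(3), lam=0.1)
# out ≈ [0.9, 0.0, -0.4]
```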
SPARSE CODING
Topics: coordinate descent for sparse coding inference

• ISTA updates all hidden units simultaneously
  ‣ this is wasteful if many hidden units have already converged
• Idea: update only the ‘‘most promising’’ hidden unit
  ‣ see the coordinate descent algorithm in
    Learning Fast Approximations of Sparse Coding. Gregor and LeCun, 2010.
  ‣ this algorithm has the advantage of not requiring a learning rate α
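To make the idea concrete, here is a simplified greedy coordinate-descent sketch (my own illustration, not Gregor and LeCun's exact CoD algorithm): each 1-D subproblem has a closed-form soft-threshold solution, so no learning rate α is needed, and only the coordinate whose exact minimizer moves it the most is updated at each step.

```python
import numpy as np

def shrink_scalar(a, b):
    """Scalar soft-thresholding: sign(a) * max(|a| - b, 0)."""
    return np.sign(a) * max(abs(a) - b, 0.0)

def coord_descent_inference(x, D, lam, n_iters=200):
    """Greedy coordinate descent for min_h 0.5 ||x - D h||^2 + lam ||h||_1.
    Each 1-D subproblem is solved exactly, so there is no learning rate;
    only the 'most promising' hidden unit is updated per step."""
    K = D.shape[1]
    h = np.zeros(K)
    residual = x.astype(float).copy()   # residual = x - D h (h starts at 0)
    col_sq = (D ** 2).sum(axis=0)       # ||D_{.,k}||^2 for each column
    for _ in range(n_iters):
        # exact minimizer of each coordinate, holding the others fixed
        z = np.array([shrink_scalar(h[k] + D[:, k] @ residual / col_sq[k],
                                    lam / col_sq[k]) for k in range(K)])
        k = int(np.argmax(np.abs(z - h)))   # most promising coordinate
        if abs(z[k] - h[k]) < 1e-10:        # all coordinates have converged
            break
        residual -= (z[k] - h[k]) * D[:, k]
        h[k] = z[k]
    return h

# With D = I this recovers soft-thresholding of x, one coordinate at a time.
h = coord_descent_inference(np.array([1.0, 0.2]), np.eye(2), lam=0.1)
# h ≈ [0.9, 0.1]
```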