Neural networks
Sparse coding - inference (ISTA algorithm)

Hugo Larochelle
Département d'informatique
Université de Sherbrooke
[email protected]
November 1, 2012

SPARSE CODING
Topics: sparse coding
• For each input $x^{(t)}$, find a latent representation $h^{(t)}$ such that:
‣ it is sparse: the vector $h^{(t)}$ has many zeros
‣ we can reconstruct the original input as well as possible
• More formally:
$$\min_{D} \frac{1}{T} \sum_{t=1}^{T} \min_{h^{(t)}} \frac{1}{2} \|x^{(t)} - D\,h^{(t)}\|_2^2 + \lambda \|h^{(t)}\|_1$$
‣ the first term is the reconstruction error, the second is the sparsity penalty, and $\lambda$ controls the reconstruction vs. sparsity trade-off
• $D$ is equivalent to the output weight matrix of an autoencoder
‣ however, $h(x^{(t)})$ is now a complicated function of $x^{(t)}$: the encoder is itself the result of the inner minimization
• the reconstruction is
$$\widehat{x}^{(t)} = D\,h(x^{(t)}) = \sum_{k \text{ s.t. } h(x^{(t)})_k \neq 0} D_{\cdot,k}\, h(x^{(t)})_k$$

SPARSE CODING
Topics: inference of sparse codes
• Given $D$, how do we compute
$$h(x^{(t)}) = \arg\min_{h^{(t)}} \frac{1}{2} \|x^{(t)} - D\,h^{(t)}\|_2^2 + \lambda \|h^{(t)}\|_1 \,?$$
‣ we want to optimize $l(x^{(t)}) = \frac{1}{2} \|x^{(t)} - D\,h^{(t)}\|_2^2 + \lambda \|h^{(t)}\|_1$ with respect to $h^{(t)}$
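As a concrete reference, here is a minimal NumPy sketch of this per-example objective; the names `x`, `D`, `h`, and `lam` are illustrative choices, not from the slides:

```python
import numpy as np

def sparse_coding_loss(x, D, h, lam):
    """Per-example objective: 0.5 * ||x - D h||_2^2 + lam * ||h||_1."""
    residual = x - D @ h
    return 0.5 * residual @ residual + lam * np.abs(h).sum()
```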
‣ we could use a gradient descent method:
• For a single hidden unit:
$$\frac{\partial}{\partial h_k^{(t)}} l(x^{(t)}) = (D_{\cdot,k})^\top \left( D\,h^{(t)} - x^{(t)} \right) + \lambda\,\text{sign}(h_k^{(t)})$$
• For the whole code vector:
$$\nabla_{h^{(t)}}\, l(x^{(t)}) = D^\top \left( D\,h^{(t)} - x^{(t)} \right) + \lambda\,\text{sign}(h^{(t)})$$
‣ issue: the L1 norm is not differentiable at 0
- it is very unlikely for gradient descent to ‘‘land’’ exactly on $h_k^{(t)} = 0$ (even if that is the solution)
‣ solution: clamp $h_k^{(t)}$ to 0 if it changes sign because of the L1 gradient
‣ each hidden unit update would then be performed as follows (a code sketch follows this list):
- update from reconstruction: $h_k^{(t)} \Leftarrow h_k^{(t)} - \alpha\, D_{\cdot,k}^\top \left( D\,h^{(t)} - x^{(t)} \right)$
- update from sparsity: if $\text{sign}\!\left(h_k^{(t)} - \alpha\lambda\,\text{sign}(h_k^{(t)})\right) \neq \text{sign}(h_k^{(t)})$ then $h_k^{(t)} \Leftarrow 0$, else $h_k^{(t)} \Leftarrow h_k^{(t)} - \alpha\lambda\,\text{sign}(h_k^{(t)})$
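A sketch of one sweep of these per-unit updates, assuming a sequential pass over the hidden units; `alpha` (the step size) and `lam` (the L1 weight $\lambda$) are assumed hyperparameters, and all names are illustrative:

```python
import numpy as np

def clamped_update(x, D, h, alpha, lam):
    """One pass of per-unit updates with sign clamping.

    For each hidden unit k:
      1. gradient step on the reconstruction term,
      2. gradient step on the L1 term, clamped to 0 if h_k
         would change sign because of the L1 update.
    """
    h = h.copy()
    for k in range(h.shape[0]):
        # update from reconstruction: h_k <- h_k - alpha * D_{.,k}^T (D h - x)
        h[k] -= alpha * D[:, k] @ (D @ h - x)
        # update from sparsity, clamped at the sign change
        if h[k] != 0:
            proposed = h[k] - alpha * lam * np.sign(h[k])
            h[k] = proposed if np.sign(proposed) == np.sign(h[k]) else 0.0
    return h
```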
SPARSE CODING
Topics: ISTA (Iterative Shrinkage and Thresholding Algorithm)
• This process corresponds to the ISTA algorithm:
‣ initialize $h^{(t)}$ (for instance to 0)
‣ while $h^{(t)}$ has not converged:
- $h^{(t)} \Leftarrow h^{(t)} - \alpha\, D^\top \left( D\,h^{(t)} - x^{(t)} \right)$
- $h^{(t)} \Leftarrow \text{shrink}(h^{(t)}, \alpha\lambda)$
‣ return $h^{(t)}$
• Here $\text{shrink}(a, b) = [\dots,\ \text{sign}(a_i) \max(|a_i| - b_i, 0),\ \dots]$
• Will converge if $1/\alpha$ is bigger than the largest eigenvalue of $D^\top D$
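The following is a compact NumPy rendition of ISTA as just described; it is a sketch, with a fixed iteration budget `n_iters` standing in for a real convergence test, and the step size chosen to satisfy the eigenvalue condition above:

```python
import numpy as np

def shrink(a, b):
    """Elementwise soft-thresholding: sign(a_i) * max(|a_i| - b_i, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def ista(x, D, lam, n_iters=100):
    """Sparse code inference by ISTA (sketch, fixed iteration count)."""
    h = np.zeros(D.shape[1])  # initialize h, here to 0
    # step size such that 1/alpha exceeds the largest eigenvalue of D^T D
    alpha = 0.9 / np.linalg.eigvalsh(D.T @ D).max()
    for _ in range(n_iters):
        h = h - alpha * D.T @ (D @ h - x)  # gradient step on reconstruction
        h = shrink(h, alpha * lam)         # shrinkage step on the L1 term
    return h
```

Note that shrink maps any coordinate within $\pm\alpha\lambda$ of zero exactly to zero, so ISTA produces exact zeros, which plain gradient descent would be very unlikely to do.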
SPARSE CODING
Topics: coordinate descent for sparse coding inference
• ISTA updates all hidden units simultaneously
‣ this is wasteful if many hidden units have already converged
• Idea: update only the ‘‘most promising’’ hidden unit (see the sketch after this list)
‣ see the coordinate descent algorithm in
- Learning Fast Approximations of Sparse Coding. Gregor and LeCun, 2010.
• this algorithm has the advantage of not requiring a learning rate $\alpha$
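The slides do not spell out the coordinate descent algorithm itself, so the sketch below follows the CoD procedure from the cited Gregor and LeCun paper as I understand it; the unit-norm assumption on the columns of $D$ and all names are assumptions on my part, not from the slides:

```python
import numpy as np

def shrink(a, b):
    """Soft-thresholding, as in the ISTA sketch above."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def coordinate_descent(x, D, lam, n_iters=100):
    """Coordinate descent sparse code inference (CoD-style sketch).

    At each step, only the hidden unit whose value would change the
    most is updated ("most promising"); no learning rate is required.
    Assumes the columns of D have unit norm.
    """
    B = D.T @ x                        # pre-threshold activations
    S = np.eye(D.shape[1]) - D.T @ D   # mutual-inhibition matrix
    z = np.zeros(D.shape[1])
    for _ in range(n_iters):
        z_bar = shrink(B, lam)                  # tentative codes for all units
        k = np.argmax(np.abs(z - z_bar))        # most promising unit
        B = B + S[:, k] * (z_bar[k] - z[k])     # propagate only unit k's change
        z[k] = z_bar[k]
    return shrink(B, lam)
```

Updating only the argmax unit costs roughly one column operation per iteration instead of a full matrix-vector product, and the shrink threshold takes over the role of the learning rate.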