On the Expressive Power of Deep Learning: A Tensor Analysis
Nadav Cohen, Or Sharir, Amnon Shashua
The Hebrew University of Jerusalem
Conference on Learning Theory (COLT) 2016

The Expressive Power of Deep Learning

- Expressive power of depth – the driving force behind Deep Learning
- Depth efficiency: a polynomially sized deep network realizes a function that requires shallow networks to have super-polynomial size
- Prior works on depth efficiency:
  - show its existence, without discussing how frequent it is
  - do not apply to convolutional networks (locality + sharing + pooling), the most successful deep learning architecture to date

Convolutional Arithmetic Circuits

[Figure: convolutional arithmetic circuit. Input X = (x_1, ..., x_N); representation layer rep(i, d) = f_{\theta_d}(x_i) with M channels; hidden layer l consists of a 1x1 conv, conv_l(j, \gamma) = \langle a^{l,j,\gamma}, \cdot \rangle, with r_l channels, followed by product pooling, pool_l(j, \gamma) = \prod_{j' \in \text{window } j} conv_l(j', \gamma); the last pooling covers the entire space, and a dense output layer computes out(y) = \langle a^{L,1,y}, pool_{L-1}(:) \rangle over Y classes]

- Convolutional networks: locality, sharing (optional), product pooling
- Computation in log-space leads to SimNets – a new deep learning architecture showing promising empirical performance [1]

[1] Deep SimNets, CVPR'16

Convolutional Arithmetic Circuits (cont'd)

Function realized by output y:

    h_y(x_1, \ldots, x_N) = \sum_{d_1, \ldots, d_N = 1}^{M} A^y_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)

- x_1, ..., x_N – input patches
- f_{\theta_1}, ..., f_{\theta_M} – representation layer functions
- A^y – coefficient tensor (M^N entries, polynomials in the weights a^{l,j,\gamma})

Shallow Network ↔ CP Decomposition

Shallow network (single hidden layer, global pooling):

[Figure: input X, representation layer with M channels, a single 1x1 conv with r_0 channels, global product pooling over all locations, dense output layer]

Coefficient tensor A^y given by the classic CP decomposition:

    A^y = \sum_{\gamma=1}^{r_0} a^{1,1,y}_\gamma \cdot a^{0,1,\gamma} \otimes a^{0,2,\gamma} \otimes \cdots \otimes a^{0,N,\gamma}

Each summand is a rank-1 tensor, hence rank(A^y) \le r_0.

Deep Network ↔ Hierarchical Tucker Decomposition

Deep network (L = log_2 N hidden layers, size-2 pooling windows):

[Figure: input X, representation layer with M channels, hidden layers 0 through L-1, each a 1x1 conv followed by pooling over pairs of adjacent locations, dense output layer]

Coefficient tensor A^y given by the Hierarchical Tucker decomposition:

    \phi^{1,j,\gamma} = \sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_\alpha \cdot a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha}
    \vdots
    \phi^{l,j,\gamma} = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \cdot \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha}
    \vdots
    A^y = \sum_{\alpha=1}^{r_{L-1}} a^{L,1,y}_\alpha \cdot \phi^{L-1,1,\alpha} \otimes \phi^{L-1,2,\alpha}

Theorem of Network Capacity

Theorem. The rank of the tensor A^y given by the Hierarchical Tucker decomposition is at least \min\{r_0, M\}^{N/2} almost everywhere with respect to the decomposition parameters.
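To make the theorem concrete, here is a minimal NumPy sketch (not part of the original slides; the sizes N = 4, M = r_0 = r_1 = 2 and all variable names are illustrative assumptions). It builds the coefficient tensor once through the CP decomposition of a shallow network and once through the Hierarchical Tucker decomposition of a deep network, then compares the ranks of their matricizations (the rearrangement of a tensor as a matrix used in the proof sketch below, with odd modes indexing rows and even modes indexing columns):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): N = 4 patches, M = 2 representation
# channels, r0 = r1 = 2 hidden channels, so min{r0, M}^(N/2) = 4.
N, M, r0, r1 = 4, 2, 2, 2

def matricize(A):
    """Arrange an order-N tensor as a matrix: odd modes index rows, even modes columns."""
    n = A.ndim
    odd, even = list(range(0, n, 2)), list(range(1, n, 2))
    rows = int(np.prod([A.shape[i] for i in odd]))
    return A.transpose(odd + even).reshape(rows, -1)

# --- Shallow network <-> CP decomposition ----------------------------------
# A^y = sum_gamma a^{1,1,y}_gamma * a^{0,1,gamma} (x) ... (x) a^{0,N,gamma}
a_out = rng.standard_normal(r0)          # output weights a^{1,1,y}
a_rep = rng.standard_normal((N, r0, M))  # first-layer weights a^{0,j,gamma}
A_cp = np.zeros((M,) * N)
for g in range(r0):
    t = a_rep[0, g]
    for j in range(1, N):
        t = np.multiply.outer(t, a_rep[j, g])   # build the rank-1 term
    A_cp += a_out[g] * t

# --- Deep network <-> Hierarchical Tucker decomposition ---------------------
# phi^{1,j,gamma} = sum_alpha a^{1,j,gamma}_alpha * a^{0,2j-1,alpha} (x) a^{0,2j,alpha}
# A^y            = sum_alpha a^{2,1,y}_alpha     * phi^{1,1,alpha}   (x) phi^{1,2,alpha}
b_rep = rng.standard_normal((N, r0, M))     # a^{0,j,alpha}
a1 = rng.standard_normal((N // 2, r0, r0))  # a^{1,j,gamma}
a2 = rng.standard_normal(r1)                # a^{2,1,y}
phi = np.zeros((N // 2, r0, M, M))
for j in range(N // 2):
    for g in range(r0):
        for al in range(r0):
            phi[j, g] += a1[j, g, al] * np.multiply.outer(b_rep[2 * j, al], b_rep[2 * j + 1, al])
A_ht = np.zeros((M,) * N)
for al in range(r1):
    A_ht += a2[al] * np.multiply.outer(phi[0, al], phi[1, al])

print("CP matricization rank:", np.linalg.matrix_rank(matricize(A_cp)))  # <= r0 = 2
print("HT matricization rank:", np.linalg.matrix_rank(matricize(A_ht)))  # = 4 almost everywhere
```

With generic random weights the CP matricization rank is at most r_0 = 2, while the Hierarchical Tucker one reaches \min\{r_0, M\}^{N/2} = 4; this gap between the two decompositions is exactly what the theorem quantifies.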
Since the rank of an A^y generated by a CP decomposition is no more than the number of terms (the number of hidden channels of the shallow network):

Corollary. Randomizing the linear weights of the deep network by a continuous distribution gives, with probability one, functions that cannot be approximated by a shallow network with fewer than \min\{r_0, M\}^{N/2} hidden channels.

Depth efficiency holds almost always!

Theorem of Network Capacity – Proof Sketch

- [A] – arrangement of the tensor A as a matrix (matricization)
- \odot – Kronecker product for matrices; it holds that rank(A \odot B) = rank(A) \cdot rank(B)
- Relation between tensor and Kronecker products: [A \otimes B] = [A] \odot [B]
- Implies: A = \sum_{z=1}^{Z} \lambda_z \, v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)}  =>  rank[A] \le Z
- By induction over l = 1, ..., L, almost everywhere w.r.t. \{a^{l,j,\gamma}\}_{l,j,\gamma}:

    \forall j \in [N/2^l], \gamma \in [r_l]: \quad rank[\phi^{l,j,\gamma}] \ge (\min\{r_0, M\})^{2^l/2}

  - Base: "SVD has maximal rank almost everywhere"
  - Step: rank[A \otimes B] = rank([A] \odot [B]) = rank[A] \cdot rank[B], and "a linear combination preserves rank almost everywhere"

Generalization

Comparison between arbitrary depths shows that the penalty in resources grows double exponentially with respect to the number of layers cut off.

[Plot: # of parameters vs. # of layers l (from 1 to L); the required size f(l) = O(r^{2^{L-l}}) falls from O(r^{N/2}) at l = 1 to the optimum at l = L]

Conclusion

- Through tensor decompositions, we showed that depth efficiency holds almost always with convolutional arithmetic circuits
- The equivalence between convolutional networks and tensor decompositions has many other applications, for example:
  - Expressiveness of convolutional ReLU networks: [1]
    - average pooling leads to loss of universality
    - depth efficiency exists but does not hold almost always
  - Inductive bias of convolutional arithmetic circuits: [2]
    - deep networks can model strong correlations between input elements, shallow networks cannot
    - the pooling geometry of a deep network selects the supported correlations

[1] Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML'16
[2] Inductive Bias of Deep Convolutional Networks through Pooling Geometry, arXiv

Thank You
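Supplementary note (not part of the original slides): the three facts used in the proof sketch above are easy to check numerically. The sketch below, with illustrative sizes and names, verifies that the Kronecker product multiplies matrix ranks, that matricization maps tensor products to Kronecker products, and that a sum of Z rank-1 tensors has matricization rank at most Z.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3  # mode dimension (illustrative)

def matricize(T):
    """[T]: odd modes of an even-order tensor index rows, even modes index columns."""
    n = T.ndim
    odd, even = list(range(0, n, 2)), list(range(1, n, 2))
    rows = int(np.prod([T.shape[i] for i in odd]))
    return T.transpose(odd + even).reshape(rows, -1)

# 1) rank(A kron B) = rank(A) * rank(B)
A = rng.standard_normal((M, 2)) @ rng.standard_normal((2, M))   # rank 2
B = rng.standard_normal((M, 1)) @ rng.standard_normal((1, M))   # rank 1
assert np.linalg.matrix_rank(np.kron(A, B)) == 2 * 1

# 2) [A (x) B] = [A] kron [B]  (tensor product on the left, Kronecker on the right)
T1 = rng.standard_normal((M, M, M, M))   # order-4 tensor
T2 = rng.standard_normal((M, M))         # order-2 tensor
T12 = np.multiply.outer(T1, T2)          # order-6 tensor T1 (x) T2
assert np.allclose(matricize(T12), np.kron(matricize(T1), matricize(T2)))

# 3) A = sum_z lambda_z v_1^(z) (x) ... (x) v_N^(z)  =>  rank[A] <= Z
N, Z = 4, 3
A_sum = np.zeros((M,) * N)
for z in range(Z):
    t = rng.standard_normal(M)
    for _ in range(N - 1):
        t = np.multiply.outer(t, rng.standard_normal(M))  # rank-1 term
    A_sum += rng.standard_normal() * t
assert np.linalg.matrix_rank(matricize(A_sum)) <= Z

print("All three identities verified numerically.")
```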