On the Expressive Power of Deep Learning:
A Tensor Analysis
Nadav Cohen
Or Sharir
Amnon Shashua
The Hebrew University of Jerusalem
Conference on Learning Theory (COLT) 2016
The Expressive Power of Deep Learning
Expressive power of depth – the driving force behind Deep Learning
Depth efficiency: when a polynomially sized deep network realizes a
function that requires shallow networks to have super-polynomial size
Prior works on depth efficiency:
Show that it exists, without discussing how frequently it occurs
Do not apply to convolutional networks (locality+sharing+pooling),
the most successful deep learning architecture to date
Convolutional Arithmetic Circuits
[Figure: convolutional arithmetic circuit — input $X$ split into patches $x_i$, a representation layer with $M$ channels, hidden layers $0,\ldots,L-1$ of 1x1 conv (with $r_0,\ldots,r_{L-1}$ channels) each followed by product pooling, and a dense output layer with $Y$ outputs. Layer computations:]
$\mathrm{rep}(i,d) = f_{\theta_d}(x_i)$
$\mathrm{conv}_0(j,\gamma) = \langle a^{0,j,\gamma}, \mathrm{rep}(j,:) \rangle$
$\mathrm{pool}_0(j,\gamma) = \prod_{j' \in \mathrm{window}(j)} \mathrm{conv}_0(j',\gamma)$
$\mathrm{pool}_{L-1}(\gamma) = \prod_{j' \text{ covers space}} \mathrm{conv}_{L-1}(j',\gamma)$
$\mathrm{out}(y) = \langle a^{L,1,y}, \mathrm{pool}_{L-1}(:) \rangle$
Convolutional networks:
locality
sharing (optional)
product pooling
Computation in log-space leads to SimNets – a new deep learning architecture showing promising empirical performance¹ (sketched below)
¹ Deep SimNets, CVPR'16
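A minimal sketch of the log-space idea (assuming positive activations; this is not the SimNets implementation): product pooling becomes a sum of logs, avoiding numerical underflow. The helper name product_pool_logspace and the toy shapes are assumptions for illustration.

```python
import numpy as np

def product_pool_logspace(conv_log, window=2):
    """Product pooling in log-space: conv_log holds the log of positive conv
    activations, shape (num_positions, channels). Summing logs over each
    non-overlapping window equals the log of the product."""
    n, c = conv_log.shape
    assert n % window == 0
    return conv_log.reshape(n // window, window, c).sum(axis=1)

# toy check: log-space pooling matches the direct product
conv = np.random.rand(8, 4) + 0.1                       # positive activations
direct = conv.reshape(4, 2, 4).prod(axis=1)             # naive product pooling
via_logs = np.exp(product_pool_logspace(np.log(conv)))
assert np.allclose(direct, via_logs)
```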
Convolutional Arithmetic Circuits (cont.)
[Figure: the same convolutional arithmetic circuit diagram as on the previous slide.]
Function realized by output $y$:
$$h_y(x_1, \ldots, x_N) = \sum_{d_1,\ldots,d_N=1}^{M} \mathcal{A}^y_{d_1,\ldots,d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)$$
$x_1, \ldots, x_N$ – input patches
$f_{\theta_1}, \ldots, f_{\theta_M}$ – representation layer functions
$\mathcal{A}^y$ – coefficient tensor ($M^N$ entries, polynomials in the weights $a^{l,j,\gamma}$)
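For concreteness, a brute-force sketch of this formula (illustration only, not how the network computes): evaluate $h_y$ directly from a given coefficient tensor for tiny $M$ and $N$. The helper name h_y and the random toy inputs are assumptions.

```python
import numpy as np
from itertools import product

def h_y(A, reps):
    """Brute-force evaluation of h_y: A is the coefficient tensor of shape
    (M,)*N, reps is an (N, M) matrix with reps[i, d] = f_{theta_d}(x_i).
    Sums A[d_1,...,d_N] * prod_i reps[i, d_i] over all M^N index tuples."""
    N, M = reps.shape
    total = 0.0
    for d in product(range(M), repeat=N):
        total += A[d] * np.prod([reps[i, d[i]] for i in range(N)])
    return total

# toy usage: N = 4 patches, M = 3 representation functions
N, M = 4, 3
A = np.random.randn(*(M,) * N)        # M^N = 81 coefficients
reps = np.random.randn(N, M)          # representation layer outputs
print(h_y(A, reps))
```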
Shallow Network ↔ CP Decomposition
Shallow network (single hidden layer, global pooling):
[Figure: shallow ConvAC — representation layer with $M$ channels, a single 1x1 conv layer with $r_0$ channels, global product pooling over all locations, dense output with $Y$ outputs. Layer computations:]
$\mathrm{rep}(i,d) = f_{\theta_d}(x_i)$
$\mathrm{conv}(j,\gamma) = \langle a^{0,j,\gamma}, \mathrm{rep}(j,:) \rangle$
$\mathrm{pool}(\gamma) = \prod_{j \text{ covers space}} \mathrm{conv}(j,\gamma)$
$\mathrm{out}(y) = \langle a^{1,1,y}, \mathrm{pool}(:) \rangle$
Coefficient tensor $\mathcal{A}^y$ given by the classic CP decomposition:
$$\mathcal{A}^y = \sum_{\gamma=1}^{r_0} a^{1,1,y}_\gamma \cdot \underbrace{a^{0,1,\gamma} \otimes a^{0,2,\gamma} \otimes \cdots \otimes a^{0,N,\gamma}}_{\text{rank-1 tensor}} \qquad \bigl(\mathrm{rank}(\mathcal{A}^y) \le r_0\bigr)$$
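A minimal sketch of this construction in NumPy (a standard CP build, not the authors' code): form $\mathcal{A}^y$ as a weighted sum of $r_0$ rank-1 tensors; a balanced matricization then has rank at most $r_0$. The helper name cp_tensor and the toy sizes are assumptions.

```python
import numpy as np
from functools import reduce

def cp_tensor(weights, factors):
    """weights: (r0,) vector a^{1,1,y}; factors: list of N arrays of shape (r0, M),
    factors[i][g] = a^{0,i+1,g}. Returns the (M,)*N coefficient tensor as a
    weighted sum of r0 rank-1 tensors."""
    r0 = len(weights)
    terms = []
    for g in range(r0):
        vecs = [f[g] for f in factors]                          # one vector per mode
        rank1 = reduce(lambda a, b: np.multiply.outer(a, b), vecs)
        terms.append(weights[g] * rank1)
    return sum(terms)

N, M, r0 = 4, 3, 5
A = cp_tensor(np.random.randn(r0), [np.random.randn(r0, M) for _ in range(N)])
# a balanced matricization has rank at most r0, in line with rank(A^y) <= r0
print(np.linalg.matrix_rank(A.reshape(M ** (N // 2), M ** (N // 2))))   # <= 5
```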
Deep Network ↔ Hierarchical Tucker Decomposition
Deep network ($L = \log_2 N$ hidden layers, size-2 pooling windows):
[Figure: deep ConvAC with $L = \log_2 N$ hidden layers — representation layer with $M$ channels, 1x1 conv layers with $r_0,\ldots,r_{L-1}$ channels, product pooling over size-2 windows, dense output with $Y$ outputs. Layer computations:]
$\mathrm{rep}(i,d) = f_{\theta_d}(x_i)$
$\mathrm{conv}_0(j,\gamma) = \langle a^{0,j,\gamma}, \mathrm{rep}(j,:) \rangle$
$\mathrm{pool}_0(j,\gamma) = \prod_{j' \in \{2j-1,\, 2j\}} \mathrm{conv}_0(j',\gamma)$
$\mathrm{pool}_{L-1}(\gamma) = \prod_{j' \in \{1,2\}} \mathrm{conv}_{L-1}(j',\gamma)$
$\mathrm{out}(y) = \langle a^{L,1,y}, \mathrm{pool}_{L-1}(:) \rangle$
Coefficient tensor $\mathcal{A}^y$ given by the Hierarchical Tucker decomposition:
$$\phi^{1,j,\gamma} = \sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_\alpha \cdot a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha}$$
$$\cdots$$
$$\phi^{l,j,\gamma} = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \cdot \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha}$$
$$\cdots$$
$$\mathcal{A}^y = \sum_{\alpha=1}^{r_{L-1}} a^{L,1,y}_\alpha \cdot \phi^{L-1,1,\alpha} \otimes \phi^{L-1,2,\alpha}$$
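A minimal sketch of this recursion in NumPy (illustrative, not the paper's code), with all ranks set to a common $r$; the helper name ht_tensor and the toy sizes are assumptions.

```python
import numpy as np

def ht_tensor(N, M, r, rng):
    """Build a coefficient tensor via the Hierarchical Tucker recursion above,
    with random weights and r_0 = ... = r_{L-1} = r. Returns an order-N tensor
    with every mode of dimension M."""
    L = int(np.log2(N))
    # level 0: r weight vectors a^{0,j,alpha} of length M per location j
    phi = [[rng.standard_normal((r, M)) for _ in range(N)]]
    for l in range(1, L):                                  # levels 1 .. L-1
        level = []
        for j in range(N // 2 ** l):
            channels = []
            for gamma in range(r):                         # builds phi^{l,j,gamma}
                a = rng.standard_normal(r)                 # weights a^{l,j,gamma}
                channels.append(sum(a[al] * np.multiply.outer(phi[l - 1][2 * j][al],
                                                              phi[l - 1][2 * j + 1][al])
                                    for al in range(r)))
            level.append(np.stack(channels))
        phi.append(level)
    a_top = rng.standard_normal(r)                         # a^{L,1,y}
    return sum(a_top[al] * np.multiply.outer(phi[L - 1][0][al], phi[L - 1][1][al])
               for al in range(r))

A = ht_tensor(N=8, M=2, r=3, rng=np.random.default_rng(0))
print(A.shape)   # order-8 tensor: (2, 2, 2, 2, 2, 2, 2, 2)
```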
Theorem of Network Capacity
Theorem
The rank of the tensor $\mathcal{A}^y$ given by the Hierarchical Tucker decomposition is at least $\min\{r_0, M\}^{N/2}$ almost everywhere w.r.t. the decomposition parameters.
Since the rank of an $\mathcal{A}^y$ generated by a CP decomposition is no more than the number of terms (the number of hidden channels in the shallow network):
Corollary
Randomizing the linear weights of the deep network according to a continuous distribution gives, with probability one, functions that cannot be approximated by a shallow network with fewer than $\min\{r_0, M\}^{N/2}$ hidden channels.
Depth efficiency holds almost always!
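A numerical illustration of this under assumed small sizes (a sanity check, not a proof): draw random weights, build the coefficient tensor via the Hierarchical Tucker recursion, and compare the rank of its odd/even matricization to $\min\{r_0, M\}^{N/2}$.

```python
import numpy as np

N, M, r = 8, 2, 2                         # bound: min{r, M}^(N/2) = 2^4 = 16
rng = np.random.default_rng(1)

# Hierarchical Tucker recursion with random weights, all ranks equal to r;
# each level combines pairs of lower-level tensors, channel by channel.
phi = [rng.standard_normal((r, M)) for _ in range(N)]   # level 0: a^{0,j,alpha}
while len(phi) > 1:
    nxt = []
    for j in range(len(phi) // 2):
        channels = []
        for gamma in range(r):
            a = rng.standard_normal(r)                  # a^{l,j,gamma}
            channels.append(sum(a[al] * np.multiply.outer(phi[2 * j][al], phi[2 * j + 1][al])
                                for al in range(r)))
        nxt.append(np.stack(channels))
    phi = nxt
A = phi[0][0]                             # coefficient tensor of one output class

# odd/even matricization: odd modes index rows, even modes index columns
perm = list(range(0, N, 2)) + list(range(1, N, 2))
mat = A.transpose(perm).reshape(M ** (N // 2), -1)
print(np.linalg.matrix_rank(mat), ">=", min(r, M) ** (N // 2))   # generically 16 >= 16
```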
Theorem of Network Capacity – Proof Sketch
$[\![\mathcal{A}]\!]$ – arrangement of the tensor $\mathcal{A}$ as a matrix (matricization)
$\odot$ – Kronecker product for matrices. Holds: $\mathrm{rank}(A \odot B) = \mathrm{rank}(A) \cdot \mathrm{rank}(B)$
Relation between tensor and Kronecker products: $[\![\mathcal{A} \otimes \mathcal{B}]\!] = [\![\mathcal{A}]\!] \odot [\![\mathcal{B}]\!]$
Implies: $\mathcal{A} = \sum_{z=1}^{Z} \lambda_z \, v_1^{(z)} \otimes \cdots \otimes v_{2^L}^{(z)} \implies \mathrm{rank}[\![\mathcal{A}]\!] \le Z$
By induction over $l = 1 \ldots L$, almost everywhere w.r.t. $\{a^{l,j,\gamma}\}_{l,j,\gamma}$:
$$\forall j \in [N/2^l],\ \gamma \in [r_l]: \quad \mathrm{rank}[\![\phi^{l,j,\gamma}]\!] \ge (\min\{r_0, M\})^{2^l/2}$$
Base: "SVD has maximal rank almost everywhere"
Step: $\mathrm{rank}[\![\mathcal{A} \otimes \mathcal{B}]\!] = \mathrm{rank}([\![\mathcal{A}]\!] \odot [\![\mathcal{B}]\!]) = \mathrm{rank}[\![\mathcal{A}]\!] \cdot \mathrm{rank}[\![\mathcal{B}]\!]$, and "a linear combination preserves rank almost everywhere"
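The two identities used in the step can be checked numerically on small random matrices; the sketch below uses assumed sizes and is an illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) The Kronecker product multiplies ranks: rank(A ⊙ B) = rank(A)·rank(B).
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))   # rank 2
B = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
kron = np.kron(A, B)
print(np.linalg.matrix_rank(kron), "==",
      np.linalg.matrix_rank(A) * np.linalg.matrix_rank(B))      # 6 == 6

# 2) Matricization turns the tensor product into a Kronecker product:
#    [[A ⊗ B]] = [[A]] ⊙ [[B]]  (odd modes index rows, even modes index columns).
T = np.multiply.outer(A, B)                            # order-4 tensor: T[i,j,k,l] = A[i,j]·B[k,l]
matricized = T.transpose(0, 2, 1, 3).reshape(36, 36)   # rows: modes (1,3), cols: modes (2,4)
print(np.allclose(matricized, kron))                   # True
```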
Generalization
Comparing networks of arbitrary depths shows that the penalty in resources grows double-exponentially w.r.t. the number of layers cut off:
$$f(l) = O\!\left(r^{2^{L-l}}\right)$$
[Plot: # of parameters required vs. # of layers $l$, decreasing from $O(r^{N/2})$ at $l = 1$ to optimal at $l = L$.]
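A tiny numeric illustration with assumed values (not taken from the paper): for $r = 10$ and full depth $L = 5$, the bound $r^{2^{L-l}}$ explodes as the replacement depth $l$ decreases.

```python
# How r^(2^(L-l)) blows up as layers are cut off (assumed r, L for illustration)
r, L = 10, 5
for l in range(L, 0, -1):
    print(f"depth l = {l}:  r^(2^(L-l)) = {r ** (2 ** (L - l)):.1e}")
# depth 5 -> 1.0e+01, 4 -> 1.0e+02, 3 -> 1.0e+04, 2 -> 1.0e+08, 1 -> 1.0e+16
```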
Conclusion
Through tensor decompositions, we showed that depth efficiency holds
almost always with convolutional arithmetic circuits
Equivalence between convolutional networks and tensor decompositions
has many other applications, for example:
Expressiveness of convolutional ReLU networks:¹
Average pooling leads to loss of universality
Depth efficiency exists but does not hold almost always
Inductive bias of convolutional arithmetic circuits:²
Deep networks can model strong correlation between input elements, shallow networks can't
Pooling geometry of a deep network selects supported correlations
¹ Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML'16
² Inductive Bias of Deep Convolutional Networks through Pooling Geometry, arXiv
Thank You