
Generalization Error of Linear Neural Networks in an Empirical Bayes Approach

Shinichi Nakajima, Sumio Watanabe
Tokyo Institute of Technology / Nikon Corporation
Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) Approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions
Regular Models

Conventional learning theory deals with regular models, for which det(Fisher Information) > 0 everywhere, e.g.:
- Mean estimation
- Linear regression:  $y(x) = \sum_{k=1}^{K} a_k x^{k-1}$
(x: input, y: output, K: dimensionality of the parameter space, n: # of samples)

1. Asymptotic normality of the distribution of the ML estimator and of the Bayes posterior holds, i.e., the likelihood is (asymptotically) normal for ANY true parameter. This justifies the model selection methods (AIC, BIC, MDL):
  AIC: $2\hat{G}(n) + 2K$;   BIC, MDL: $2F(n) \simeq K \log n$

2. Asymptotic generalization error:  $\lambda(\mathrm{ML}) = \lambda(\mathrm{Bayes})$
  GE:  $G(n) = \lambda n^{-1} + o(n^{-1})$
  FE:  $F(n) = \lambda' \log n + o(\log n)$
  $2\lambda = 2\lambda' = K$
Unidentifiable Models

Unidentifiable models have singularities, where det(Fisher Information) = 0:
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models

Example (H: # of components):
  $y(x) = \sum_{h=1}^{H} b_h a_h^t x$,   $x \in R^M$, $y \in R^N$, $a_h \in R^M$, $b_h \in R^N$
  Unidentifiable set:  $\{(a_h, b_h) : a_h = 0 \text{ or } b_h = 0\}$

1. The asymptotic normalities do NOT hold: the likelihood is NON-normal when the true parameter is on the singularities, so there is no (penalized likelihood type) information criterion.
Superiority of Bayes to ML

For unidentifiable models (with singularities where det(Fisher Information) = 0), in addition to the lack of asymptotic normality:

2. Bayes has an advantage:  G(Bayes) < G(ML).

How do singularities work in learning? When the true parameter is on the singularities:
- The increased neighborhood of the true accelerates overfitting: in ML, $2\lambda \ge K$.
- The increased population of parameters denoting the true suppresses overfitting (only in Bayes): in Bayes, $2\lambda \le K$.
What's the purpose?

- Bayes provides good generalization.
- But it is expensive (needs Markov chain Monte Carlo).

Is there any approximation with good generalization and tractability?
- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]: analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB): this work.
Linear Neural Networks (LNNs)

LNN with M input, N output, and H hidden units:
  $p(y|x, A, B) = \frac{1}{(2\pi)^{N/2}} \exp\left(-\frac{\|y - BAx\|^2}{2}\right)$

A: input parameter (H x M) matrix,  $A = (a_1, \ldots, a_H)^t$
B: output parameter (N x H) matrix,  $B = (b_1, \ldots, b_H)$
  $BA = \sum_{h=1}^{H} b_h a_h^t$,   $a_h \in R^M$,  $b_h \in R^N$

Essential parameter dimensionality:  $K = H(M + N) - H^2$
(Trivial redundancy:  $BA = (BT)(T^{-1}A)$ for any nonsingular H x H matrix T.)

True map:  $B^* A^*$ with rank $H^*$ ($\le H$).

Known generalization coefficients:
                               H* < H            H* = H
  ML [Fukumizu99]              $\lambda > K/2$   $\lambda = K/2$
  Bayes [Aoyagi&Watanabe03]    $\lambda < K/2$   $\lambda = K/2$

(Figure: learner network with M inputs, H hidden units, and N outputs, compared with the true map of rank H*.)
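As a concrete reference point for the model above, here is a minimal sketch (not from the slides) that samples a data set from an LNN whose true map $B^*A^*$ has rank H*, with unit noise variance; the sizes M, N, H*, n are placeholder values roughly matching the experiments later in the deck.

# Minimal sketch (not from the slides): sample n pairs (x_i, y_i) from an LNN
# y = B*A*x + e with a rank-H* true map and e ~ N(0, I_N).
import numpy as np

rng = np.random.default_rng(0)
M, N, H_star, n = 50, 30, 5, 1000           # input dim, output dim, true rank, sample size

A_true = rng.standard_normal((H_star, M))    # true input matrix  (H* x M)
B_true = rng.standard_normal((N, H_star))    # true output matrix (N x H*)

X = rng.standard_normal((n, M))              # inputs with E[x x^t] = I_M (orthonormality assumption)
Y = X @ A_true.T @ B_true.T + rng.standard_normal((n, N))   # rows: y_i^t = x_i^t A*^t B*^t + e_i^t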
Maximum Likelihood estimator [Baldi&Hornik95]

The ML estimator is given by
  $\hat{B}\hat{A} = \sum_{h=1}^{H} \hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t}$,
where
  $\hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t} = \left(1 + O(n^{-1})\right) \omega_{b_h} \omega_{b_h}^t R Q^{-1}$.

Here
  $R(X^n, Y^n) = \frac{1}{n}\sum_{i=1}^{n} y_i x_i^t$,    $Q(X^n) = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^t$,
  $\gamma_h$: h-th largest singular value of $R Q^{-1/2}$,
  $\omega_{a_h}$: right singular vector,  $\omega_{b_h}$: left singular vector, i.e.,
  $\gamma_h \omega_{b_h} = R Q^{-1/2} \omega_{a_h}$,   $\gamma_h \omega_{a_h}^t = \omega_{b_h}^t R Q^{-1/2}$.
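The ML estimator above can be sketched numerically as follows, assuming the X, Y, n from the sampling sketch; the code forms R and Q, takes the SVD of $R Q^{-1/2}$, and keeps the H largest singular components. The variable names (gamma, U, Vt, Q_inv_sqrt) are illustrative, not the paper's notation.

# Minimal sketch of the rank-H ML estimator via the SVD of R Q^{-1/2}.
# Assumes X (n x M), Y (n x N), n from the sampling sketch above; H is the learner's rank.
import numpy as np

H = 20
R = (Y.T @ X) / n                      # R = (1/n) sum_i y_i x_i^t   (N x M)
Q = (X.T @ X) / n                      # Q = (1/n) sum_i x_i x_i^t   (M x M)

evals, evecs = np.linalg.eigh(Q)
Q_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T   # Q^{-1/2}

U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)   # gamma_h; omega_b_h = U[:, h]; omega_a_h = Vt[h]

# ML estimate of BA: keep the H largest singular components, mapped back through Q^{-1/2}.
BA_ml = sum(gamma[h] * np.outer(U[:, h], Vt[h]) for h in range(H)) @ Q_inv_sqrt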
Bayes estimation

True: $q(y|x)$;  Learner: $p(y|x, w)$ (w: parameter, x: input, y: output);  Prior: $\varphi(w)$;
n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$.

Marginal likelihood:  $Z(Y^n|X^n) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\, \varphi(w)\, dw$
Posterior:  $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n)} \prod_{i=1}^{n} p(y_i|x_i, w)\, \varphi(w)$
Predictive:  $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$

In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.
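To make the "ensemble of models" point concrete, here is a minimal, purely illustrative sketch of the Bayes predictive density by naive Monte Carlo: draw parameters from the prior, weight them by their training likelihood, and average the model. The callables log_model and prior_sample are hypothetical stand-ins for $\log p(y|x,w)$ and a sampler from $\varphi(w)$; this is not the method analyzed in this talk.

# Naive Monte Carlo sketch of the Bayes predictive p(y | x, X^n, Y^n).
# log_model(y, x, w) and prior_sample() are hypothetical stand-ins for
# log p(y|x,w) and a sampler from the prior phi(w).
import numpy as np

def bayes_predictive(y, x, X, Y, log_model, prior_sample, n_mc=10_000):
    ws = [prior_sample() for _ in range(n_mc)]
    # Unnormalized posterior weight of each w: prod_i p(y_i | x_i, w), computed in log space.
    log_w = np.array([sum(log_model(yi, xi, w) for xi, yi in zip(X, Y)) for w in ws])
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()                      # self-normalized importance weights
    # Predictive density: ensemble average of p(y|x,w) under the posterior weights.
    return float(np.sum(weights * np.exp([log_model(y, x, w) for w in ws])))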
Empirical Bayes (EB) approach [Efron&Morris73]

Hyperparameter: $\tau$ (the notation $w \| \tau$ separates the parameter w from the hyperparameter $\tau$).
True: $q(y|x)$;  n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$;
Learner: $p(y|x, w \| \tau)$;  Prior: $\varphi(w \| \tau)$.

Marginal likelihood:  $Z(Y^n|X^n \| \tau) = \int \prod_{i=1}^{n} p(y_i|x_i, w \| \tau)\, \varphi(w \| \tau)\, dw$

The hyperparameter is estimated by maximizing the marginal likelihood:
  $\hat{\tau}(X^n, Y^n) = \arg\max_{\tau} Z(Y^n|X^n \| \tau)$

Posterior:  $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n \| \hat{\tau})} \prod_{i=1}^{n} p(y_i|x_i, w \| \hat{\tau})\, \varphi(w \| \hat{\tau})$
Predictive:  $p(y|x, X^n, Y^n) = \int p(y|x, w \| \hat{\tau})\, p(w|X^n, Y^n)\, dw$
Subspace Bayes (SB) approach

SB is an EB approach in which part of the parameters are regarded as hyperparameters.

a) MIP (Marginalizing in Input Parameter space) version:
   A: parameter,  B: hyperparameter.
   Learner:  $p(y|x, A \| B) = \frac{1}{(2\pi)^{N/2}} \exp\left(-\frac{\|y - BAx\|^2}{2}\right)$
   Prior:  $\varphi(A) = \frac{1}{(2\pi)^{MH/2}} \exp\left(-\frac{\mathrm{tr}(A A^t)}{2}\right)$

b) MOP (Marginalizing in Output Parameter space) version:
   A: hyperparameter,  B: parameter.

The marginalization can be done analytically in LNNs.
Intuitive explanation

(Figure: for a redundant component, the Bayes posterior over $(a_h, b_h)$ is compared with the SB posterior, in which the hyperparameter direction is optimized instead of integrated out.)
Free energy (a.k.a. evidence, stochastic complexity)

Free energy:  $F(n) = -\log Z(Y^n|X^n)$

An important quantity used for model selection [Akaike80; Mackay92].
We minimize the free energy when optimizing the hyperparameter (equivalently, we maximize the marginal likelihood).
Generalization error

Generalization error:  $G(n) = \left\langle D(\mathrm{True} \,\|\, \mathrm{Predictive}) \right\rangle_{q(X^n, Y^n)}$
where
  $D(q \| p)$: Kullback-Leibler divergence between q and p,
  $\langle V \rangle_q$: expectation of V over q.

Asymptotic expansion:  $G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right)$,  $\lambda$: generalization coefficient.

In regular models $2\lambda = K$; in unidentifiable models, in general $2\lambda \ne K$.
James-Stein (JS) estimator

Domination of $\alpha$ over $\beta$:  $G_\alpha \le G_\beta$ for any true, and $G_\alpha < G_\beta$ for a certain true.

K-dimensional mean estimation (a regular model):
  $\{z_1, \ldots, z_n\}$: samples;   $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$: ML estimator (arithmetic mean).

ML is efficient (never dominated by any unbiased estimator), but it is inadmissible (dominated by a biased estimator) when $K \ge 3$ [Stein56].

James-Stein estimator [James&Stein61]:
  $\hat{\mu}^{\mathrm{js}} = \left(1 - \frac{K - 2}{n \|\bar{z}\|^2}\right) \bar{z}$

(A certain relation between EB and JS was discussed in [Efron&Morris73].)

(Figure: $2\lambda/K$ versus the true mean for the ML estimator and the JS estimator with K = 3.)
Positive-part JS estimator

Positive-part JS type (PJS) estimator:
  $\hat{\mu}^{\mathrm{pjs}} = \theta\!\left(\|\bar{z}\|^2 > \tfrac{L}{n}\right) \left(1 - \frac{L}{n \|\bar{z}\|^2}\right) \bar{z}$,
where $\theta(\mathrm{event}) = 1$ if the event is true, $0$ if it is false, and L is the degree of shrinkage.

The thresholding amounts to model selection: PJS is a model-selecting, shrinkage estimator. Equivalently,
  $\hat{\mu}^{\mathrm{pjs}} = \left(1 - \chi'^{-1}\right) \bar{z}$,  where  $\chi' = \max\!\left(1, \frac{n \|\bar{z}\|^2}{L}\right)$.
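A minimal sketch of the PJS estimator in the thresholded-shrinkage form reconstructed above, i.e., $(1 - \chi'^{-1})\bar{z}$ with $\chi' = \max(1, n\|\bar{z}\|^2/L)$; the function name and the choice of L are illustrative.

# Positive-part James-Stein type (PJS) estimator for K-dimensional mean estimation.
# L is the degree of shrinkage (e.g., L = K - 2 recovers the classical JS flavor).
import numpy as np

def pjs_estimator(z_samples, L):
    n, K = z_samples.shape
    z_bar = z_samples.mean(axis=0)                    # ML estimator (arithmetic mean)
    chi = max(1.0, n * float(z_bar @ z_bar) / L)      # chi' = max(1, n ||z_bar||^2 / L)
    return (1.0 - 1.0 / chi) * z_bar                  # zero when n ||z_bar||^2 <= L (pruned)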
Hyperparameter optimization

Assume orthonormality:  $\int q(x)\, x x^t\, dx = I_M$  ($I_d$: d x d identity matrix).

Then the free energy decomposes per component:
  $F(Y^n|X^n \| B) = \sum_{h=1}^{H} F_h(Y^n|X^n \| b_h)$

Optimum hyperparameter value (analytically solved in LNNs!):
  $\hat{b}_h = \pm \sqrt{\dfrac{n\gamma_h^2 - M}{nM}}\; \omega_{b_h}$  if $\gamma_h^2 > M/n$,
  $\hat{b}_h = 0$  if $\gamma_h^2 \le M/n$.

(Figure: $F_h$ as a function of $b_h$ in the two cases $\gamma_h^2 > M/n$ and $\gamma_h^2 < M/n$.)
SB solution (Theorem 1, Lemma 1)

L: dimensionality of the marginalized subspace (per component), i.e., L = M in MIP and L = N in MOP.

Theorem 1: The SB estimator is given by
  $\hat{B}\hat{A} = \sum_{h=1}^{H} \left(1 - \frac{L}{\max(L,\, n\gamma_h^2)}\right) \hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t} \left(1 + O(n^{-1})\right)$.

Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.

Hence SB is asymptotically equivalent to PJS estimation:
  $\hat{\mu}^{\mathrm{pjs}} = \left(1 - \chi'^{-1}\right) \bar{z}$,  $\chi' = \max\!\left(1, \frac{n \|\bar{z}\|^2}{L}\right)$.
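A minimal sketch of the Theorem 1 shrinkage as reconstructed above, reusing gamma, U, Vt, Q_inv_sqrt, and n from the ML sketch; L = M gives the MIP version and L = N the MOP version, and the $(1 + O(n^{-1}))$ factor is ignored.

# SB estimator sketch: shrink each ML singular component by 1 - L / max(L, n * gamma_h^2),
# which prunes the component entirely when n * gamma_h^2 <= L.
import numpy as np

def sb_estimator(gamma, U, Vt, Q_inv_sqrt, n, H, L):
    BA_sb = np.zeros((U.shape[0], Vt.shape[1]))
    for h in range(H):
        shrink = 1.0 - L / max(L, n * gamma[h] ** 2)
        BA_sb += shrink * gamma[h] * np.outer(U[:, h], Vt[h])
    return BA_sb @ Q_inv_sqrt

# e.g. BA_sb = sb_estimator(gamma, U, Vt, Q_inv_sqrt, n, H, L=M)   # MIP version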
Generalization error (Theorem 2)

Theorem 2: The SB generalization coefficient is given by
  $2\lambda = H^*(M+N) - H^{*2} + \sum_{h=1}^{H-H^*} \left\langle \theta\!\left(\gamma_h'^2 > L\right) \left(\gamma_h'^2 - 2L + \frac{L^2}{\gamma_h'^2}\right) \right\rangle$,
where $\gamma_h'^2$ is the h-th largest eigenvalue of a matrix subject to the Wishart distribution $W_{N-H^*}(M-H^*, I_{N-H^*})$, and $\langle \cdot \rangle$ denotes the expectation over that Wishart distribution.
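The Wishart expectation in Theorem 2 (as reconstructed above) can be evaluated numerically; here is a minimal Monte Carlo sketch, with function names and the sample count chosen for illustration only.

# Monte Carlo evaluation of the redundant-component term of the reconstructed Theorem 2:
# average of theta(g > L) * (g - 2L + L^2/g) over the H - H* largest eigenvalues g of
# a Wishart matrix W_{N-H*}(M-H*, I_{N-H*}).
import numpy as np

def redundant_term(M, N, H, H_star, L, n_mc=2000, seed=0):
    rng = np.random.default_rng(seed)
    p, dof = N - H_star, M - H_star
    total = 0.0
    for _ in range(n_mc):
        Z = rng.standard_normal((dof, p))
        g = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]     # Wishart eigenvalues, descending
        top = g[:H - H_star]
        total += float(np.sum((top > L) * (top - 2 * L + L ** 2 / top)))
    return total / n_mc

def two_lambda(M, N, H, H_star, L):
    # Regular part plus the redundant-component term; L = M (MIP) or L = N (MOP).
    return H_star * (M + N) - H_star ** 2 + redundant_term(M, N, H, H_star, L)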
Large scale approximation (Theorem 3)

Theorem 3: In the large scale limit where $M, N, H, H^* \to \infty$ with their ratios fixed, the generalization coefficient converges to
  $2\lambda = H^*(M+N) - H^{*2} + (M-H^*)(N-H^*) \left\{ J(s_M; 1) - 2\alpha J(s_M; 0) + \alpha^2 J(s_M; -1) \right\}$,
where $\alpha = L/(M-H^*)$, $a = (N-H^*)/(M-H^*)$ with $0 < a \le 1$, and $b = (H-H^*)/(N-H^*)$. The functions $J(s; k)$ have closed forms in terms of $\sqrt{1-s^2}$ and $\cos^{-1}$ expressions arising from the limiting Wishart eigenvalue distribution, and the threshold $s_M$ is given by a maximum involving $a$, $b$, and $J(\cdot; 0)$; the explicit expressions are omitted here. (See paper.)
Results 1 (true rank dependence)

(Figure: $2\lambda/K$ versus the true rank $H^*$ for ML, Bayes, SB(MIP), and SB(MOP); learner with M = 50 inputs, N = 30 outputs, and H = 20 hidden units.)

SB provides good generalization.

Note: this does NOT mean domination of SB over Bayes; a discussion of domination needs consideration of a delicate situation. (See paper.)
Results 2 (redundant rank dependence)

(Figure: $2\lambda/K$ versus the learner's rank $H$ for ML, Bayes, SB(MOP), and SB(MIP); true rank $H^* = 0$, M = 50, N = 30.)

The SB generalization coefficient depends on the redundant rank H similarly to that of ML; in this respect SB has an ML-like property.
Features of SB

- SB provides good generalization.
  - In LNNs, it is asymptotically equivalent to PJS.
- SB requires a smaller computational cost.
  - Reduction of the marginalized space.
  - In some models, the marginalization can be done analytically.
- SB is related to the variational Bayes (VB) approach.
Variational Bayes (VB) Solution [Nakajima&Watanabe05]

- VB results in the same solution as MIP.
- VB automatically selects the larger dimension to marginalize.

(Figure: for $M \ge N$ and a redundant component $H^* < h \le H$, the Bayes posterior and the VB posterior over $(a_h, b_h)$, with scales $O(1)$ and $O(n^{-1})$; the VB posterior is similar to the SB posterior.)
Conclusions

- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
  - Even asymptotically, the SB estimator of a redundant component converges not to the ML estimator but to a smaller value, which means suppression of overfitting.
  - Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
  - SB has both Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.
Future work

- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.
Thank you!