Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
Shinichi Nakajima, Sumio Watanabe
Tokyo Institute of Technology / Nikon Corporation

Contents
- Backgrounds: regular models, unidentifiable models, superiority of Bayes to ML, what's the purpose?
- Setting: model, subspace Bayes (SB) approach
- Analysis: (James-Stein estimator), solution, generalization error
- Discussion & conclusions

Regular models
Conventional learning theory deals with regular models, where det(Fisher information) > 0 everywhere, e.g. mean estimation and linear regression (K: dimensionality of the parameter space, n: number of samples).
1. Asymptotic normality of the ML estimator distribution and of the Bayes posterior holds; the likelihood is (asymptotically) normal for ANY true parameter. This underlies the model selection methods (AIC, BIC, MDL): AIC = -2 log(maximum likelihood) + 2K, BIC/MDL = -2 log(maximum likelihood) + K log n.
2. Asymptotic generalization error: $\lambda(\mathrm{ML}) = \lambda(\mathrm{Bayes})$.
   Generalization error (GE): $G(n) = \lambda / n + o(1/n)$
   Free energy (FE): $F(n) = \lambda \log n + o(\log n)$
   with $2\lambda = K$.

Unidentifiable models
Unidentifiable models have singularities, where det(Fisher information) = 0, e.g.
- neural networks
- Bayesian networks
- mixture models
- hidden Markov models
Component-wise model: $y(x) = \sum_{h=1}^{H} b_h a_h^{t} x$, with $x \in \mathbb{R}^M$, $y \in \mathbb{R}^N$, $a_h \in \mathbb{R}^M$, $b_h \in \mathbb{R}^N$, and H the number of components.
Unidentifiable set: $\{a_h = 0\} \cup \{b_h = 0\}$.
1. The asymptotic normalities do NOT hold, so there is no (penalized-likelihood-type) information criterion: the likelihood is non-normal when the true parameter lies on the singularities.

Superiority of Bayes to ML
2. Bayes has an advantage: G(Bayes) < G(ML).
How do singularities work in learning? When the true parameter lies on the singularities,
- the increased neighborhood of the true parameter accelerates overfitting;
- the increased population of parameters denoting the true suppresses overfitting (only in Bayes).
Hence in ML $2\lambda \ge K$, while in Bayes $2\lambda \le K$.

What's the purpose?
Bayes provides good generalization, but it is expensive (it needs Markov chain Monte Carlo). Is there any approximation with both good generalization and tractability?
- Variational Bayes (VB) [Hinton & van Camp 93; MacKay 95; Attias 99; Ghahramani & Beal 00]: analyzed in another paper [Nakajima & Watanabe 05].
- Subspace Bayes (SB): this work.

Linear Neural Networks (LNNs)
LNN with M input, N output, and H hidden units:
$p(y \mid x, A, B) = (2\pi)^{-N/2} \exp\!\left(-\tfrac{1}{2}\|y - BAx\|^2\right)$
A: input parameter (H x M) matrix; B: output parameter (N x H) matrix.
Essential parameter dimensionality: $K = H(M + N - H)$.
Component decomposition: $BA = \sum_{h=1}^{H} b_h a_h^{t}$, where $A^{t} = (a_1, \ldots, a_H)$, $B = (b_1, \ldots, b_H)$, $a_h \in \mathbb{R}^M$, $b_h \in \mathbb{R}^N$.
Trivial redundancy: $BA = (BT)(T^{-1}A)$ for any invertible H x H matrix T.
True map: $B^{*}A^{*}$ with rank $H^{*}$ ($\le H$).
Known generalization coefficients:
                                 H* < H             H* = H
  ML    [Fukumizu 99]            lambda > K/2       lambda = K/2
  Bayes [Aoyagi & Watanabe 03]   lambda < K/2       lambda = K/2
(Diagram: learner with M inputs, H hidden units, N outputs; true map with rank H*.)

Maximum likelihood estimator
The ML estimator [Baldi & Hornik 95] is given by
$\hat{B}\hat{A} = \sum_{h=1}^{H} \hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t}$, where $\hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t} = \gamma_h\, \omega_{b_h} \omega_{a_h}^{t} Q^{-1/2} + O_p(n^{-1})$.
Here $R = \frac{1}{n}\sum_{i=1}^{n} y_i x_i^{t}$, $Q = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{t}$, and
$\gamma_h$: h-th largest singular value of $R Q^{-1/2}$; $\omega_{a_h}$: corresponding right singular vector; $\omega_{b_h}$: corresponding left singular vector.
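As a concrete illustration of the ML estimator above, the following is a minimal NumPy sketch (not from the paper; the function and variable names such as lnn_ml_estimator, X, Y are illustrative assumptions) that computes the rank-H reduced-rank regression estimate via the SVD of $R Q^{-1/2}$.

    # Minimal sketch of the rank-H ML (reduced-rank regression) estimator of an LNN,
    # following the SVD formula above.
    import numpy as np

    def lnn_ml_estimator(X, Y, H):
        """X: (n, M) inputs, Y: (n, N) outputs, H: number of hidden units.
        Returns the rank-H ML estimate of the N x M map BA."""
        n = X.shape[0]
        Q = X.T @ X / n                      # Q = (1/n) sum_i x_i x_i^t   (M x M)
        R = Y.T @ X / n                      # R = (1/n) sum_i y_i x_i^t   (N x M)
        evals, evecs = np.linalg.eigh(Q)     # symmetric inverse square root Q^{-1/2}
        Q_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
        # gamma[h]: singular values; U[:, h] = omega_{b_h}; Vt[h, :] = omega_{a_h}
        U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)
        BA_hat = np.zeros((Y.shape[1], X.shape[1]))
        for h in range(H):                   # keep the H largest singular components
            BA_hat += gamma[h] * np.outer(U[:, h], Vt[h, :])
        return BA_hat @ Q_inv_sqrt           # map back from the whitened input space

For H = min(M, N) this reduces to the ordinary least-squares map $R Q^{-1}$; smaller H truncates the singular spectrum.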
Bayes estimation
True: $q(y \mid x)$; n training samples $(X^n, Y^n) = \{(x_i, y_i);\; i = 1, \ldots, n\}$. Learner: $p(y \mid x, w)$ with parameter w; prior: $\varphi(w)$.
Marginal likelihood: $Z(Y^n \mid X^n) = \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w)\, dw$
Posterior: $p(w \mid X^n, Y^n) = \frac{1}{Z(Y^n \mid X^n)} \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w)$
Predictive: $p(y \mid x, X^n, Y^n) = \int p(y \mid x, w)\, p(w \mid X^n, Y^n)\, dw$
In ML (or MAP) we predict with one model; in Bayes we predict with an ensemble of models.

Empirical Bayes (EB) approach [Efron & Morris 73]
The prior now carries a hyperparameter $\tau$: $\varphi(w \,\|\, \tau)$.
Marginal likelihood: $Z(Y^n \mid X^n \,\|\, \tau) = \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w \,\|\, \tau)\, dw$
The hyperparameter is estimated by maximizing the marginal likelihood:
$\hat{\tau}(X^n, Y^n) = \arg\max_{\tau} Z(Y^n \mid X^n \,\|\, \tau)$
Posterior: $p(w \mid X^n, Y^n) = \frac{1}{Z(Y^n \mid X^n \,\|\, \hat{\tau})} \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w \,\|\, \hat{\tau})$
Predictive: $p(y \mid x, X^n, Y^n) = \int p(y \mid x, w)\, p(w \mid X^n, Y^n)\, dw$

Subspace Bayes (SB) approach
SB is an EB approach in which part of the parameters are regarded as hyperparameters.
a) MIP (Marginalizing in Input Parameter space) version: A is a parameter, B is a hyperparameter.
   Learner: $p(y \mid x, A \,\|\, B) = (2\pi)^{-N/2} \exp\!\left(-\tfrac{1}{2}\|y - BAx\|^2\right)$
   Prior: $\varphi(A) \propto \exp\!\left(-\tfrac{1}{2} \operatorname{tr}(A^{t} A)\right)$
b) MOP (Marginalizing in Output Parameter space) version: A is a hyperparameter, B is a parameter.
The marginalization can be done analytically in LNNs.

Intuitive explanation
(Figure: Bayes posterior vs. SB posterior over a redundant component $(a_h, b_h)$; SB marginalizes one direction and optimizes the other.)

Free energy (a.k.a. evidence, stochastic complexity)
Free energy: $F(n) = -\log Z(Y^n \mid X^n)$.
It is an important quantity used for model selection [Akaike 80; MacKay 92]. We minimize the free energy when optimizing the hyperparameter.

Generalization error
Generalization error: $G(n) = \big\langle D(\mathrm{True} \,\|\, \mathrm{Predictive}) \big\rangle_{q(X^n, Y^n)}$,
where $D(q \| p)$ is the Kullback-Leibler divergence between q and p, and $\langle V \rangle_q$ is the expectation of V over q.
Asymptotic expansion: $G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right)$, where $\lambda$ is the generalization coefficient.
In regular models $2\lambda = K$; in unidentifiable models, in general $2\lambda \ne K$.

James-Stein (JS) estimator
Domination of estimator $\alpha$ over $\beta$: $G_\alpha \le G_\beta$ for any true parameter, and $G_\alpha < G_\beta$ for a certain true parameter.
K-dimensional mean estimation (a regular model) from samples $z_1, \ldots, z_n$:
$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$ is the ML estimator (arithmetic mean).
ML is efficient (never dominated by any unbiased estimator), but it is inadmissible (dominated by a biased estimator) when $K \ge 3$ [Stein 56].
James-Stein estimator [James & Stein 61]: $\hat{u}^{\mathrm{JS}} = \left(1 - \frac{K-2}{n \|\bar{z}\|^2}\right)\bar{z}$.
A certain relation between EB and JS was discussed in [Efron & Morris 73].
(Plot: $2\lambda$ vs. the true mean for ML and JS, K = 3.)

Positive-part JS estimator
Positive-part JS type (PJS) estimator:
$\hat{u}^{\mathrm{PJS}} = \theta\!\left(\|\bar{z}\|^2 > \tfrac{L}{n}\right)\left(1 - \frac{L}{n\|\bar{z}\|^2}\right)\bar{z}$,
where $\theta(\text{event}) = 1$ if the event is true and 0 if it is false, and L is the degree of shrinkage.
Equivalently, $\hat{u}^{\mathrm{PJS}} = (1 - \chi'^{-1})\,\bar{z}$ with $\chi' = \max(1,\; n\|\bar{z}\|^2 / L)$.
Thresholding corresponds to model selection: PJS is a model-selecting, shrinkage estimator.

Hyperparameter optimization
Assume orthonormality: $\int q(x)\, x x^{t}\, dx = I_M$ ($I_d$: d x d identity matrix).
Then the free energy decomposes componentwise: $F(Y^n \mid X^n \,\|\, B) = \sum_{h=1}^{H} F_h(Y^n \mid X^n \,\|\, b_h)$.
The optimum hyperparameter value can be solved analytically in LNNs:
$\hat{b}_h \propto \omega_{b_h}$, with a norm determined by $n\gamma_h^2$ and M, if $\gamma_h^2 > M/n$; $\hat{b}_h = 0$ if $\gamma_h^2 \le M/n$.
(Figure: $F_h$ as a function of $b_h$ in the two cases $\gamma_h^2 > M/n$ and $\gamma_h^2 < M/n$.)

SB solution (Theorem 1, Lemma 1)
L: dimensionality of the marginalized subspace (per component), i.e. L = M in MIP and L = N in MOP.
Theorem 1: The SB estimator is given by
$\hat{B}\hat{A} = \sum_{h=1}^{H} \left(1 - \frac{L}{\max(L,\; n\gamma_h^2)}\right) \hat{b}_h^{\mathrm{MLE}} \hat{a}_h^{\mathrm{MLE}\,t} + O_p(n^{-1})$.
Lemma 1: The posterior is localized, so that the predictive distribution can be replaced by the model at the SB estimator.
Hence SB is asymptotically equivalent to PJS estimation: each component is shrunk by the factor $(1 - \chi_h'^{-1})$ with $\chi_h' = \max(1,\; n\gamma_h^2 / L)$.
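To make Theorem 1 concrete, here is a minimal NumPy sketch (my illustration, not the authors' code; the function and variable names such as lnn_sb_estimator, X, Y are assumptions) that applies the PJS factor $1 - L/\max(L, n\gamma_h^2)$ to the ML singular components computed as in the earlier sketch.

    # Minimal sketch of the SB estimator of Theorem 1: shrink each ML singular
    # component by the positive-part James-Stein factor 1 - L / max(L, n * gamma_h^2).
    import numpy as np

    def lnn_sb_estimator(X, Y, H, L):
        """X: (n, M) inputs, Y: (n, N) outputs, H: hidden units,
        L: dimensionality of the marginalized subspace (M for MIP, N for MOP)."""
        n = X.shape[0]
        Q = X.T @ X / n
        R = Y.T @ X / n
        evals, evecs = np.linalg.eigh(Q)
        Q_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
        U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)
        BA_sb = np.zeros((Y.shape[1], X.shape[1]))
        for h in range(H):
            shrink = 1.0 - L / max(L, n * gamma[h] ** 2)   # 0 whenever gamma_h^2 <= L/n
            BA_sb += shrink * gamma[h] * np.outer(U[:, h], Vt[h, :])
        return BA_sb @ Q_inv_sqrt

Components with $\gamma_h^2 \le L/n$ are pruned entirely (the model-selection effect), while the surviving components are shrunk toward zero, which is the suppression-of-overfitting effect noted in the conclusions.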
Generalization error (Theorem 2)
Theorem 2: The SB generalization coefficient is given by
$2\lambda = H^{*}(M + N - H^{*}) + \sum_{h=1}^{H - H^{*}} \langle\, \cdot \,\rangle$,
where the summand (see the paper for its explicit form) depends on L and on the h-th largest eigenvalue of a random matrix following the Wishart distribution $W_{N-H^{*}}(M - H^{*},\, I_{N-H^{*}})$, and $\langle \cdot \rangle$ denotes the expectation over that Wishart distribution.

Large scale approximation (Theorem 3)
Theorem 3: In the large-scale limit where $M, N, H, H^{*} \to \infty$ (with their ratios fixed), the generalization coefficient converges to a closed form,
$2\lambda = H^{*}(M + N - H^{*}) + \ldots$,
expressed through explicit functions $J(s; \cdot)$ of the ratios of $M - H^{*}$, $N - H^{*}$ and $H - H^{*}$, involving $\cos^{-1}$ and square-root terms (see the paper for the full expression).

Results 1 (true rank dependence)
(Plot: $2\lambda / K$ vs. the true rank $H^{*}$ for ML, Bayes, SB(MIP), and SB(MOP); learner with M = 50 inputs, N = 30 outputs, H = 20 hidden units.)
SB provides good generalization.
Note: this does NOT mean domination of SB over Bayes; a discussion of domination needs consideration of a delicate situation. (See paper.)

Results 2 (redundant rank dependence)
(Plot: $2\lambda / K$ vs. the number of hidden units H for ML, Bayes, SB(MIP), and SB(MOP); M = 50, N = 30, true rank $H^{*} = 0$.)
The SB generalization coefficient depends on the redundant rank H similarly to ML; in this respect SB also has an ML-like property.

Feature of SB
SB
- provides good generalization: in LNNs it is asymptotically equivalent to PJS;
- requires smaller computational costs, by reduction of the marginalized space; in some models the marginalization can be done analytically;
- is related to the variational Bayes (VB) approach.

Variational Bayes (VB) solution [Nakajima & Watanabe 05]
VB results in the same solution as SB(MIP). VB automatically selects the larger dimension to marginalize.
(Figure: Bayes posterior vs. VB posterior over a redundant component $(a_h, b_h)$, $H^{*} < h \le H$; the VB posterior is similar to the SB posterior.)

Conclusions
- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
- Even asymptotically, the SB estimator of a redundant component converges not to the ML estimator but to a smaller value, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
- SB has both Bayes-like and ML-like properties, i.e. shrinkage, and acceleration of overfitting by basis selection.

Future work
- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.

Thank you!