Local Loss Optimization in Operator Models: A New Insight into Spectral Learning

Borja Balle, Ariadna Quattoni, Xavier Carreras
ICML 2012, June 2012, Edinburgh
This work is partially supported by the PASCAL2 Network and a Google Research Award.

A Simple Spectral Method [HKZ09]

Discrete homogeneous hidden Markov model: a hidden chain Y_1 → Y_2 → Y_3 → Y_4 → ⋯ with emissions X_1, X_2, X_3, X_4, ⋯
- n states: Y_t ∈ {1, …, n}
- k symbols: X_t ∈ {σ_1, …, σ_k}; for now assume n ≤ k

Forward-backward equations with operators A_σ ∈ R^{n×n}:
  \Pr[X_{1:t} = w] = \alpha_1^\top A_{w_1} \cdots A_{w_t} \alpha_\infty

Probabilities arranged into matrices H, H_{σ_1}, …, H_{σ_k}:
  H(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma_j]
  H_\sigma(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma, X_3 = \sigma_j]

Spectral learning algorithm for B_σ = Q A_σ Q^{-1}:
1. Compute the SVD H = U D V^\top and take the top n right singular vectors V_n
2. Set B_\sigma = (H V_n)^+ H_\sigma V_n
(For simplicity, in this talk we ignore learning of the initial and final vectors.)

A Local Approach to Learning?

Maximum likelihood uses the whole of the sample S = {w_1, …, w_N} and is always consistent in the realizable case:
  \max_{\alpha_1, \{A_\sigma\}} \frac{1}{N} \sum_{i=1}^{N} \log\big( \alpha_1^\top A_{w_{i,1}} \cdots A_{w_{i,t_i}} \alpha_\infty \big)

The spectral method only uses local information from the sample, through Ĥ, Ĥ_a, Ĥ_b, and its consistency depends on properties of H.
  S = {abbabba, aabaa, baaabbbabab, bbaaba, bababbabbaaaba, abbb, …}

Questions
- Is the spectral method minimizing a "local" loss function?
- When does this minimization yield a consistent algorithm?

Outline
1. Spectral Learning as Local Loss Optimization
2. A Convex Relaxation of the Local Loss
3. Choosing a Consistent Local Loss

Loss Function of the Spectral Method

Both ingredients of the spectral method have optimization interpretations:
- SVD: \min_{V_n^\top V_n = I} \| H V_n V_n^\top - H \|_F
- Pseudo-inverse: \min_{B_\sigma} \| H V_n B_\sigma - H_\sigma V_n \|_F

We can formulate a joint optimization for the spectral method:
  \min_{\{B_\sigma\},\, V_n^\top V_n = I} \sum_{\sigma \in \Sigma} \| H V_n B_\sigma - H_\sigma V_n \|_F^2

Properties of the Spectral Optimization

  \min_{\{B_\sigma\},\, V_n^\top V_n = I} \sum_{\sigma \in \Sigma} \| H V_n B_\sigma - H_\sigma V_n \|_F^2

Theorem. The optimization is consistent under the same conditions as the spectral method.

The loss is non-convex because of the product V_n B_σ and the constraint V_n^\top V_n = I. The spectral method is equivalent to:
1. Choosing V_n using the SVD
2. Optimizing {B_σ} with V_n fixed

Intuition about the Loss Function

Minimize the ℓ_2 norm of the unexplained (finite set of) futures when a symbol σ is generated and the transition is explained using B_σ (over a finite set of pasts).
This is strongly based on the Markovianity of the process, which generic maximum likelihood does not exploit.
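To make the two-step view above concrete, here is a minimal NumPy sketch (not part of the original talk) of the spectral method read as a local-loss optimizer. It assumes empirical estimates of H and of each H_σ are already available, and, as in the slides, it ignores the initial and final vectors.

```python
import numpy as np

def spectral_operators(H, H_sigmas, n):
    """Spectral method viewed as a two-step local-loss optimization (sketch).

    H        : (k, k) empirical estimate of H(i, j) = Pr[X1 = s_i, X2 = s_j]
    H_sigmas : dict mapping each symbol sigma to its (k, k) estimate of
               H_sigma(i, j) = Pr[X1 = s_i, X2 = sigma, X3 = s_j]
    n        : number of states of the target model
    """
    # Step 1: choose V_n as the top-n right singular vectors of H,
    # i.e. the minimizer of ||H V V^T - H||_F over orthonormal V.
    _, _, Vt = np.linalg.svd(H)
    Vn = Vt[:n].T                      # shape (k, n)

    # Step 2: with V_n fixed, each B_sigma solves the least-squares problem
    # min_B ||H V_n B - H_sigma V_n||_F, whose solution is (H V_n)^+ H_sigma V_n.
    HVn_pinv = np.linalg.pinv(H @ Vn)  # shape (n, k)
    return {sigma: HVn_pinv @ (Hs @ Vn) for sigma, Hs in H_sigmas.items()}
```

The decomposition mirrors the previous slide: the SVD fixes V_n, and the pseudo-inverse solves the per-symbol least-squares problems with V_n held fixed.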
A Convex Relaxation of the Local Loss

For algorithmic purposes a convex local loss function is more desirable. A relaxation of
  \min_{\{B_\sigma\},\, V_n^\top V_n = I} \sum_{\sigma \in \Sigma} \| H V_n B_\sigma - H_\sigma V_n \|_F^2
can be obtained by replacing the projection V_n with a regularization term:
1. Fix n = |S| and take V_n = I
2. Let B_Σ = [B_{σ_1} | ⋯ | B_{σ_k}] and H_Σ = [H_{σ_1} | ⋯ | H_{σ_k}]
3. Regularize via the nuclear norm to emulate V_n:
  \min_{B_\Sigma} \| H B_\Sigma - H_\Sigma \|_F^2 + \tau \| B_\Sigma \|_*

This optimization is convex and has some interesting theoretical (see paper) and empirical properties.
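One illustrative way to prototype this convex objective (not described in the talk, and only a sketch under the stated assumptions) is proximal gradient descent, where the proximal operator of the nuclear norm is singular-value soft-thresholding. The step size and iteration count below are arbitrary illustrative choices.

```python
import numpy as np

def svt(M, thresh):
    """Proximal operator of the nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - thresh, 0.0)) @ Vt

def convex_local_loss(H, H_Sigma, tau, iters=500):
    """Sketch of  min_B ||H B - H_Sigma||_F^2 + tau * ||B||_*  via proximal gradient.

    H       : (p, p) empirical Hankel block
    H_Sigma : (p, p*k) concatenation [H_sigma_1 | ... | H_sigma_k]
    """
    # Step size 1/L, where L = 2 * ||H||_2^2 bounds the Lipschitz constant
    # of the gradient of the smooth (Frobenius) term.
    step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)
    B = np.zeros((H.shape[1], H_Sigma.shape[1]))
    for _ in range(iters):
        grad = 2.0 * H.T @ (H @ B - H_Sigma)   # gradient of ||H B - H_Sigma||_F^2
        B = svt(B - step * grad, step * tau)   # proximal step for tau * ||B||_*
    return B
```

An accelerated variant (FISTA) or an off-the-shelf solver would be a natural upgrade; the paper should be consulted for the actual algorithm and its guarantees.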
Experimental Results with the Convex Local Loss

In experiments with synthetic targets the following facts are observed:
- By tuning the regularization parameter τ, a better trade-off between generalization and model complexity can be achieved.
- The largest gains from the convex relaxation are attained for targets that are supposedly hard for the spectral method.

[Figure: left, L1 error vs. the regularization parameter τ for the convex optimization (CO) and for the spectral method (SVD) with n = 1, …, 5; right, L1 error of SVD and CO, and their difference, vs. the minimum singular value of the target model.]

The Hankel Matrix

For any function f : Σ* → R, its Hankel matrix H_f ∈ R^{Σ* × Σ*} is defined as
  H_f(p, s) = f(p \cdot s)
For example, with rows indexed by prefixes and columns by suffixes:

        λ      a      b      aa     ab     ⋯
  λ     1      0.3    0.7    0.05   0.25
  a     0.3    0.05   0.25   0.02   0.03
  b     0.7    0.6    0.1    0.03   0.2
  aa    0.05   0.02   0.03   0.017  0.003
  ab    0.25   0.23   0.02   0.11   0.12
  ⋮

[Figure: the matrices H and H_a appear as sub-blocks of the full Hankel matrix H_f.]

Blocks are defined by sets of rows (prefixes P) and columns (suffixes S):
- Can parametrize the spectral method by P and S, taking H ∈ R^{P×S}
- Each pair (P, S) defines a different local loss function

Consistency of the Local Loss

Theorem (Schützenberger '61). rank(H_f) = n iff f can be computed with operators A_σ ∈ R^{n×n}.

Consequences
- The spectral method is consistent iff rank(H) = rank(H_f) = n
- There always exist P and S with |P| = |S| = n such that rank(H) = n

Trade-off: larger P and S are more likely to give rank(H) = n, but also require larger samples for a good estimate Ĥ.

Question: Given a sample, how to choose good P and S?
Answer: Random sampling succeeds w.h.p. with |P| and |S| depending polynomially on the complexity of the target.

Visit us at poster 53
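As a closing illustration of the last point (not part of the talk), here is a small sketch of estimating the Hankel blocks over chosen prefix and suffix sets from a sample of strings, with P and S drawn at random from substrings seen in the data. The splitting and sampling scheme below is a placeholder for illustration, not the paper's actual sampling procedure.

```python
import random
from collections import Counter

def estimate_hankel_blocks(sample, prefixes, suffixes, alphabet):
    """Empirical Hankel blocks over chosen prefixes P and suffixes S (sketch).

    H[p][s]        estimates f(p + s), here with f = empirical string probability
    H_sig[a][p][s] estimates f(p + a + s) for each symbol a
    """
    counts = Counter(sample)
    N = len(sample)

    def f(w):
        # Empirical probability of the string w in the sample.
        return counts[w] / N

    H = {p: {s: f(p + s) for s in suffixes} for p in prefixes}
    H_sig = {a: {p: {s: f(p + a + s) for s in suffixes} for p in prefixes}
             for a in alphabet}
    return H, H_sig

# Hypothetical usage: draw P and S from prefixes of strings that occur in the sample.
sample = ["abbabba", "aabaa", "baaabbbabab", "bbaaba", "abbb"]
pieces = sorted({w[:i] for w in sample for i in range(len(w) + 1)})
P = random.sample(pieces, k=4)
S = random.sample(pieces, k=4)
H, H_sig = estimate_hankel_blocks(sample, P, S, alphabet="ab")
```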