Local Loss Optimization in Operator Models:
A New Insight into Spectral Learning
Borja Balle, Ariadna Quattoni, Xavier Carreras
ICML 2012
June 2012, Edinburgh
This work is partially supported by the PASCAL2 Network and a Google Research Award
A Simple Spectral Method [HKZ09]
- Discrete Homogeneous Hidden Markov Model
  - $n$ states: $Y_t \in \{1, \ldots, n\}$
  - $k$ symbols: $X_t \in \{\sigma_1, \ldots, \sigma_k\}$
  - for now assume $n \le k$
  [Figure: graphical model with hidden states $Y_1, Y_2, Y_3, Y_4, \ldots$ emitting observations $X_1, X_2, X_3, X_4, \ldots$]
- Forward-backward equations with $A_\sigma \in \mathbb{R}^{n \times n}$:
  $\Pr[X_{1:t} = w] = \alpha_1^\top A_{w_1} \cdots A_{w_t} \alpha_\infty$
- Probabilities arranged into matrices $H, H_{\sigma_1}, \ldots, H_{\sigma_k} \in \mathbb{R}^{k \times k}$:
  $H(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma_j]$
  $H_\sigma(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma, X_3 = \sigma_j]$
Spectral learning algorithm for $B_\sigma = Q A_\sigma Q^{-1}$:
1. Compute the SVD $H = U D V^\top$ and take the top $n$ right singular vectors $V_n$
2. $B_\sigma = (H V_n)^+ H_\sigma V_n$
(For simplicity, in this talk we ignore learning of initial and final vectors)
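As an illustration, a minimal numpy sketch of the two steps above, assuming the matrices $H$ and $H_\sigma$ are given (in practice they are empirical estimates from data); the function name and interface are ours, not the paper's:

    import numpy as np

    def spectral_operators(H, H_sigmas, n):
        """Spectral method sketch.
        H: k x k matrix with H[i, j] = Pr[X1 = s_i, X2 = s_j].
        H_sigmas: dict mapping each symbol to its k x k matrix with entries
                  Pr[X1 = s_i, X2 = sigma, X3 = s_j].
        n: number of states of the target model."""
        # Step 1: top-n right singular vectors of H.
        _, _, Vt = np.linalg.svd(H)
        Vn = Vt[:n].T                          # k x n
        # Step 2: B_sigma = (H Vn)^+ H_sigma Vn for every symbol.
        HVn_pinv = np.linalg.pinv(H @ Vn)      # n x k
        return {s: HVn_pinv @ Hs @ Vn for s, Hs in H_sigmas.items()}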
A Local Approach to Learning?
- Maximum likelihood uses the whole of the sample $S = \{w_1, \ldots, w_N\}$ and is always consistent in the realizable case:
  $\max_{\alpha_1, \{A_\sigma\}} \; \frac{1}{N} \sum_{i=1}^{N} \log\big(\alpha_1^\top A_{w_{i,1}} \cdots A_{w_{i,t_i}} \alpha_\infty\big)$
- The spectral method only uses local information from the sample in $\hat{H}$, $\hat{H}_a$, $\hat{H}_b$, and its consistency depends on properties of $H$:
  $S = \{abbabba,\ aabaa,\ baaabbbabab,\ bbaaba,\ bababbabbaaaba,\ abbb, \ldots\}$
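For concreteness, a hedged sketch of how such local statistics could be collected from a sample of strings; normalizing counts by the sample size is only a rough stand-in for the estimator actually analyzed in the paper:

    import numpy as np

    def local_statistics(sample, alphabet):
        """Estimate H_hat and H_hat_sigma from the first two / three symbols of each string."""
        k = len(alphabet)
        idx = {s: i for i, s in enumerate(alphabet)}
        H = np.zeros((k, k))
        H_sig = {s: np.zeros((k, k)) for s in alphabet}
        for w in sample:
            if len(w) >= 2:
                H[idx[w[0]], idx[w[1]]] += 1            # counts for Pr[X1, X2]
            if len(w) >= 3:
                H_sig[w[1]][idx[w[0]], idx[w[2]]] += 1  # counts for Pr[X1, X2 = sigma, X3]
        N = len(sample)
        return H / N, {s: M / N for s, M in H_sig.items()}

    H_hat, H_hat_sig = local_statistics(["abbabba", "aabaa", "baaabbbabab", "bbaaba"], "ab")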
Questions
- Is the spectral method minimizing a "local" loss function?
- When does this minimization yield a consistent algorithm?
Outline
Spectral Learning as Local Loss Optimization
A Convex Relaxation of the Local Loss
Choosing a Consistent Local Loss
Loss Function of the Spectral Method
- Both ingredients in the spectral method have optimization interpretations:
  - SVD: $\min_{V_n^\top V_n = I} \|H V_n V_n^\top - H\|_F$
  - Pseudo-inverse: $\min_{B_\sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F$
- Can formulate a joint optimization for the spectral method:
  $\min_{\{B_\sigma\},\, V_n^\top V_n = I} \; \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$
Properties of the Spectral Optimization
$\min_{\{B_\sigma\},\, V_n^\top V_n = I} \; \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$

- Theorem: the optimization is consistent under the same conditions as the spectral method
- The loss is non-convex due to the product $V_n B_\sigma$ and the constraint $V_n^\top V_n = I$
- The spectral method is equivalent to:
  1. Choosing $V_n$ using the SVD
  2. Optimizing $\{B_\sigma\}$ with $V_n$ fixed
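A small numerical check of the last point (a sketch with random matrices standing in for $H$ and $H_\sigma$): once $V_n$ is fixed from the SVD, the inner minimization over $B_\sigma$ is an ordinary least-squares problem whose solution is exactly the pseudo-inverse step of the spectral method.

    import numpy as np

    rng = np.random.default_rng(0)
    k, n = 6, 3
    H, Hs = rng.random((k, k)), rng.random((k, k))   # stand-ins for H and H_sigma

    _, _, Vt = np.linalg.svd(H)
    Vn = Vt[:n].T                                    # Vn fixed by the SVD of H

    # Least-squares solution of min_B ||H Vn B - Hs Vn||_F ...
    B_lstsq = np.linalg.lstsq(H @ Vn, Hs @ Vn, rcond=None)[0]
    # ... coincides with the pseudo-inverse step of the spectral method.
    B_pinv = np.linalg.pinv(H @ Vn) @ (Hs @ Vn)
    assert np.allclose(B_lstsq, B_pinv)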
Intuition about the Loss Function
- Minimize the $\ell_2$ norm of the unexplained (finite set of) futures when a symbol $\sigma$ is generated and the transition is explained using $B_\sigma$ (over a finite set of pasts)
- Strongly based on the Markovianity of the process, which generic maximum likelihood does not exploit
A Convex Relaxation of the Local Loss
- For algorithmic purposes a convex local loss function is more desirable
- A relaxation of $\min_{\{B_\sigma\},\, V_n^\top V_n = I} \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$ can be obtained by replacing the projection $V_n$ with a regularization term:
  1. Fix $n = |S|$ and take $V_n = I$
  2. Stack $B_\Sigma = [B_{\sigma_1} | \cdots | B_{\sigma_k}]$ and $H_\Sigma = [H_{\sigma_1} | \cdots | H_{\sigma_k}]$
  3. Regularize via the nuclear norm to emulate $V_n$:
  $\min_{B_\Sigma} \|H B_\Sigma - H_\Sigma\|_F^2 + \tau \|B_\Sigma\|_*$
- This optimization is convex and has some interesting theoretical (see paper) and empirical properties
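One standard way to solve this kind of problem (a sketch, not necessarily the solver used in the paper) is proximal gradient descent, where the proximal operator of the nuclear norm is singular-value soft-thresholding:

    import numpy as np

    def svt(M, t):
        """Singular-value soft-thresholding: prox of t * nuclear norm."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

    def convex_local_loss(H, H_Sigma, tau, iters=500):
        """Minimize ||H B - H_Sigma||_F^2 + tau * ||B||_* by proximal gradient."""
        B = np.zeros((H.shape[1], H_Sigma.shape[1]))
        eta = 1.0 / (2 * np.linalg.norm(H, 2) ** 2)   # step size from the Lipschitz constant
        for _ in range(iters):
            grad = 2 * H.T @ (H @ B - H_Sigma)        # gradient of the smooth term
            B = svt(B - eta * grad, eta * tau)        # proximal step
        return B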
Experimental Results with the Convex Local Loss
Performing experiments with synthetic targets, the following facts are observed:
- By tuning the regularization parameter $\tau$, a better trade-off between generalization and model complexity can be achieved
- The largest gains from using the convex relaxation are attained for targets that are supposedly hard for the spectral method
[Figure: left, $L_1$ error vs. regularization parameter $\tau$ for SVD with $n = 1, \ldots, 5$ and for the convex optimization (CO); right, $L_1$ error of SVD and CO, and their difference, vs. the minimum singular value of the target model]
The Hankel Matrix
For any function $f : \Sigma^\star \to \mathbb{R}$, its Hankel matrix $H_f \in \mathbb{R}^{\Sigma^\star \times \Sigma^\star}$ is defined as $H_f(p, s) = f(p \cdot s)$:

          λ       a       b       aa      ab      ...
    λ     1       0.3     0.7     0.05    0.25    ...
    a     0.3     0.05    0.25    0.02    0.03    ...
    b     0.7     0.6     0.1     0.03    0.2     ...
    aa    0.05    0.02    0.03    0.017   0.003   ...
    ab    0.25    0.23    0.02    0.11    0.12    ...
    ...

[Figure: finite sub-blocks $H$, $H_a$, ... of the Hankel matrix, picked out by particular sets of rows and columns]
- Blocks are defined by sets of rows (prefixes $P$) and columns (suffixes $S$)
- The spectral method can be parametrized by $P$ and $S$, taking $H \in \mathbb{R}^{P \times S}$
- Each pair $(P, S)$ defines a different local loss function
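A minimal sketch of the parametrized sub-block: given any function $f$ on strings (e.g. empirical probabilities) and chosen prefix and suffix sets, build $H \in \mathbb{R}^{P \times S}$; the helper name and example values are illustrative only.

    import numpy as np

    def hankel_block(f, prefixes, suffixes):
        """Sub-block of the Hankel matrix: entry (p, s) equals f(p + s)."""
        return np.array([[f(p + s) for s in suffixes] for p in prefixes])

    # Example using a few of the values shown above ("" denotes the empty string):
    probs = {"": 1.0, "a": 0.3, "b": 0.7, "aa": 0.05, "ab": 0.25}
    H = hankel_block(lambda w: probs.get(w, 0.0), ["", "a"], ["", "a", "b"])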
Consistency of the Local Loss
Theorem (Schützenberger '61): $\mathrm{rank}(H_f) \le n$ iff $f$ can be computed with operators $A_\sigma \in \mathbb{R}^{n \times n}$

Consequences
- The spectral method is consistent iff $\mathrm{rank}(H) = \mathrm{rank}(H_f) = n$
- There always exist $|P| = |S| = n$ with $\mathrm{rank}(H) = n$

Trade-off
- Larger $P$ and $S$ make $\mathrm{rank}(H) = n$ more likely, but also require larger samples for a good estimate $\hat{H}$
Question
- Given a sample, how to choose good $P$ and $S$?

Answer
- Random sampling succeeds w.h.p. with $|P|$ and $|S|$ depending polynomially on the complexity of the target
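To make the answer concrete, a hedged sketch of one possible random choice of $P$ and $S$ from the sample; the sizes and sampling scheme here are illustrative assumptions, not the paper's exact procedure. Combined with the hankel_block helper above, one would then inspect the numerical rank of the resulting $\hat{H}$.

    import numpy as np

    def random_basis(sample, size, rng):
        """Draw random prefixes and suffixes of strings from the sample (deduplicated)."""
        prefixes = [w[:rng.integers(0, len(w) + 1)] for w in rng.choice(sample, size)]
        suffixes = [w[rng.integers(0, len(w) + 1):] for w in rng.choice(sample, size)]
        return list(dict.fromkeys(prefixes)), list(dict.fromkeys(suffixes))

    rng = np.random.default_rng(0)
    sample = ["abbabba", "aabaa", "baaabbbabab", "bbaaba", "abbb"]
    P, S = random_basis(sample, size=10, rng=rng)
    # With an estimate f_hat of string probabilities one would form
    # H_hat = hankel_block(f_hat, P, S) and check np.linalg.matrix_rank(H_hat).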
Visit us at poster 53