Spectral Methods for Learning Finite State Machines

Borja Balle
(based on joint work with X. Carreras, M. Mohri, and A. Quattoni)
McGill University
Montreal, September 2012
This work is partially supported by the PASCAL2 Network and a Google Research Award
Outline
Weighted Automata and Functions over Strings
A Spectral Method for Learning Weighted Automata
Survey of Recent Applications to Learning Problems
Conclusion
Notation

- Finite alphabet Σ = {σ1, σ2, ..., σr}
- Free monoid Σ* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
- Functions over strings f : Σ* → R
- Examples:
  - f(x) = P[x]          (probability of a string)
  - f(x) = P[xΣ*]        (probability of a prefix)
  - f(x) = I[x ∈ L]      (characteristic function of language L)
  - f(x) = |x|_a         (number of a's in x)
  - f(x) = E[|w|_x]      (expected number of substrings equal to x)

Weighted Automata

- Class of WA parametrized by alphabet Σ and number of states n
- A = ⟨α_1, α_∞, {A_σ}_{σ∈Σ}⟩ with
  - α_1 ∈ R^n        (initial weights)
  - α_∞ ∈ R^n        (terminal weights)
  - A_σ ∈ R^{n×n}    (transition weights)
- Computes a function f_A : Σ* → R given by

  f_A(x) = f_A(x_1 ⋯ x_t) = α_1^⊤ A_{x_1} ⋯ A_{x_t} α_∞ = α_1^⊤ A_x α_∞

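To make the definition concrete, here is a minimal sketch in Python/NumPy (not from the slides) of how f_A(x) is evaluated; the automaton parameters used below are illustrative toy values.

```python
import numpy as np

def wa_evaluate(alpha1, alphainf, A, x):
    """Compute f_A(x) = alpha1^T A_{x_1} ... A_{x_t} alphainf.

    alpha1, alphainf : (n,) initial and terminal weight vectors
    A                : dict mapping each symbol to an (n, n) transition matrix
    x                : iterable of symbols (a string)
    """
    v = alpha1.copy()
    for symbol in x:           # multiply the forward vector by each A_sigma in turn
        v = v @ A[symbol]
    return float(v @ alphainf)

# Toy two-state automaton over {a, b} (illustrative values only)
alpha1 = np.array([1.0, 0.0])
alphainf = np.array([0.0, 1.0])
A = {"a": np.array([[0.5, 0.1], [0.2, 0.3]]),
     "b": np.array([[0.1, 0.4], [0.0, 0.6]])}
print(wa_evaluate(alpha1, alphainf, A, "abba"))
```
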
Examples – Probabilistic Finite Automata

- Compute / generate distributions over strings: f(x) = P[x]

  α_1^⊤ = [0.3, 0, 0.7]
  α_∞^⊤ = [0.2, 0, 0.2]

  [Figure: three-state PFA with transitions labelled by symbols a/b and their
   probabilities, together with the corresponding transition matrices A_a and A_b]

Examples – Hidden Markov Models

- Generate infinite strings, compute probabilities of prefixes: f(x) = P[xΣ*]
- Emission and transition are conditionally independent given the state

  α_1^⊤ = [0.3, 0.3, 0.4],  α_∞^⊤ = [1, 1, 1],  A_a = O_a T, with

  O_a = diag(0.3, 0.9, 0.5)                          (probability of emitting a in each state)
  T = [[0, 0.7, 0.3], [0, 0.75, 0.25], [0, 0.4, 0.6]]  (state transition matrix)

  [Figure: three-state HMM with emission probabilities for a/b at each state]

Examples – Probabilistic Finite State Transducers

- Compute conditional probabilities P[y|x] = α_1^⊤ A^y_x α_∞ for pairs
  (x, y) ∈ (Σ × ∆)*, where one must have |x| = |y|
- Can also assume models factorized as in the HMM case

  α_1^⊤ = [0.3, 0, 0.7]
  α_∞^⊤ = [1, 1, 1]

  [Figure: three-state PFST with transitions labelled by input/output pairs
   (e.g. A/b, B/a) and their probabilities, together with the operators A^y_σ]

Examples – Deterministic Finite Automata

- Compute membership in a regular language: f(x) = I[x ∈ L]

  α_1^⊤ = [1, 0, 0]
  α_∞^⊤ = [0, 0, 1]

  [Figure: three-state DFA with 0/1 transition matrices, e.g. A_a with
   A_a(i, j) = 1 iff reading a in state i leads to state j]

Facts About Weighted Automata (I)

Invariance Under Change of Basis

- Let Q ∈ R^{n×n} be invertible
- Let QAQ^{-1} = ⟨Q^{-⊤} α_1, Q α_∞, {Q A_σ Q^{-1}}⟩
- Then f_A = f_{QAQ^{-1}}, since

  (α_1^⊤ Q^{-1})(Q A_{x_1} Q^{-1}) ⋯ (Q A_{x_t} Q^{-1})(Q α_∞) = α_1^⊤ A_{x_1} ⋯ A_{x_t} α_∞

Example

  A_a = [[0.5, 0.1], [0.2, 0.3]],   Q = [[0, 1], [1, 0]],   Q A_a Q^{-1} = [[0.3, 0.2], [0.1, 0.5]]

Consequences

- For learning WA it is not necessary to recover the original parametrization
- PFA is only one way to parametrize probability distributions
- Unfortunately, given A it is undecidable whether f_A(x) ≥ 0 for all x

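A quick numerical check of the invariance (a sketch, not from the slides; the random automaton and the construction of Q are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
alpha1 = rng.random(n)
alphainf = rng.random(n)
A = {"a": rng.random((n, n)), "b": rng.random((n, n))}

Q = rng.random((n, n)) + n * np.eye(n)    # well-conditioned, hence invertible
Qinv = np.linalg.inv(Q)

# Transformed automaton: <Q^{-T} alpha1, Q alphainf, {Q A_sigma Q^{-1}}>
alpha1_t = np.linalg.solve(Q.T, alpha1)   # = Q^{-T} alpha1
alphainf_t = Q @ alphainf
A_t = {s: Q @ A[s] @ Qinv for s in A}

def f(a1, ainf, ops, x):
    v = a1.copy()
    for s in x:
        v = v @ ops[s]
    return v @ ainf

x = "abba"
print(np.isclose(f(alpha1, alphainf, A, x), f(alpha1_t, alphainf_t, A_t, x)))  # True
```
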
Facts About Weighted Automata (II)

Forward–Backward Factorization

- A defines forward and backward maps f^F_A, f^B_A : Σ* → R^n
- Such that for any splitting x = y·z one has f_A(x) = f^F_A(y) f^B_A(z), where

  f^F_A(y) = α_1^⊤ A_y   and   f^B_A(z) = A_z α_∞

Example

- For a PFA A and i ∈ [n], one has
  - [f^F_A(y)]_i = [α_1^⊤ A_y]_i = P[y, h_{|y|+1} = i]
  - [f^B_A(z)]_i = [A_z α_∞]_i = P[z | h_1 = i]

Consequences

- String structure has a direct relation to computation structure
- In particular, strings sharing prefixes or suffixes share computations
- Information on A_a can be recovered from f_A(yaz), f^F_A(y), and f^B_A(z):

  f_A(yaz) = f^F_A(y) A_a f^B_A(z)

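A short sketch (not from the slides) of the forward and backward maps and of the factorization f_A(y a z) = f^F_A(y) A_a f^B_A(z); the random automaton is illustrative.

```python
import numpy as np

def forward(alpha1, A, y):
    v = alpha1.copy()
    for s in y:
        v = v @ A[s]
    return v                        # row vector alpha1^T A_y

def backward(alphainf, A, z):
    v = alphainf.copy()
    for s in reversed(z):
        v = A[s] @ v
    return v                        # column vector A_z alphainf

rng = np.random.default_rng(1)
n = 3
alpha1, alphainf = rng.random(n), rng.random(n)
A = {"a": rng.random((n, n)), "b": rng.random((n, n))}

y, z = "ab", "ba"
lhs = forward(alpha1, A, y + "a" + z) @ alphainf               # f_A(y a z)
rhs = forward(alpha1, A, y) @ A["a"] @ backward(alphainf, A, z)
print(np.isclose(lhs, rhs))  # True
```
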
Learning Weighted Automata

Goal

- Given some kind of (partial) information about f : Σ* → R, find a
  weighted automaton A such that f ≈ f_A

Types of Target

- Realizable case, f = f_B for some WA B – exact learning, PAC learning
- Agnostic setting, arbitrary f – agnostic learning, generalization bounds

Information on the Target

- Total knowledge (e.g. via queries) – algorithmic/compression problem
- Approximate global knowledge – noise filtering problem
- Exact local knowledge (e.g. random sampling) – interpolation problem

Precedents and Alternative Approaches

Related Work

- Subspace methods for identification of linear dynamical systems
  [Overschee–Moor '94]
- Results on identifiability and learning of HMM and phylogenetic trees
  [Chang '96, Mossel–Roch '06]
- Query learning algorithms for DFA and multiplicity automata
  [Angluin '87, Bergadano–Varrichio '94]

Other Spectral Methods

This presentation does not cover recent spectral learning methods for:

- Mixture models [Anandkumar et al. '12]
- Latent tree graphical models [Parikh et al. '11, Anandkumar et al. '11]
- Tree automata [Bailly et al. '10]
- Probabilistic context-free grammars [Cohen et al. '12]
- Models with continuous observables or feature maps [Song et al. '10]

The Hankel Matrix

- The Hankel matrix of f : Σ* → R is H_f ∈ R^{Σ* × Σ*}
- For y, z ∈ Σ*, entries are defined by H_f(y, z) = f(y·z)
- Given P, S ⊆ Σ*, we will consider sub-blocks H_f(P, S) ∈ R^{P × S}
- Very redundant representation for f – the value f(x) appears |x| + 1 times

  [Figure: Hankel matrix with rows and columns indexed by ε, a, b, aa, ab, ...;
   for instance f(aab) appears both in row a, column ab and in row aa, column b]

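A sketch (not from the slides) of how the finite sub-blocks H = H_f(P, S) and H_σ = H_f(P, σS) can be assembled for a given function f; the helper names and the commented usage are illustrative and reuse the wa_evaluate sketch above.

```python
import numpy as np

def hankel_blocks(f, P, S, alphabet):
    """Build H = H_f(P, S), the shifted blocks H_sigma = H_f(P, sigma S),
    and the vectors H_f(P, eps) and H_f(eps, S)."""
    H = np.array([[f(p + s) for s in S] for p in P])
    H_sigma = {a: np.array([[f(p + a + s) for s in S] for p in P]) for a in alphabet}
    h_P = np.array([f(p) for p in P])      # column H_f(P, eps)
    h_S = np.array([f(s) for s in S])      # row    H_f(eps, S)
    return H, H_sigma, h_P, h_S

# Example usage with f = f_A for the toy automaton from the earlier sketch:
# H, H_sigma, h_P, h_S = hankel_blocks(
#     lambda x: wa_evaluate(alpha1, alphainf, A, x), ["", "a", "b"], ["", "a", "b"], "ab")
```
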
Schützenberger's Theorem

Theorem: rank(H_f) ≤ n if and only if f = f_A for some WA A with |A| = n.
In particular, rank(H_f) is the size of the smallest WA computing f.

Proof (⇐)

- Write F = f^F_A(Σ*) ∈ R^{Σ* × n} and B = f^B_A(Σ*) ∈ R^{n × Σ*}
- Note H_f = F B
- Then rank(H_f) ≤ n

Proof (⇒)

- Assume rank(H_f) = n
- Take a rank factorization H_f = F B with F ∈ R^{Σ* × n} and B ∈ R^{n × Σ*}
- Let α_1^⊤ = F(ε, [n]) and α_∞ = B([n], ε)   (note α_1^⊤ α_∞ = f(ε))
- Let A_σ = B([n], σΣ*) B^+ ∈ R^{n×n}   (note A_σ B([n], x) = B([n], σx))
- By induction on |x| we get α_1^⊤ A_x α_∞ = f(x)

Towards the Spectral Method

Remarks about the (⇒) proof

- A finite sub-block H = H_f(P, S) such that rank(H) = rank(H_f) is
  sufficient – (P, S) is called a basis when also ε ∈ P ∩ S
- A compatible factorization of H and H_σ = H_f(P, σS) is needed –
  A_σ = B([n], σS) B([n], S)^+ (in fact, for all σ)

Another expression for A_σ

- Instead of factorizing H and every H_σ, do the following
- Factorize only H = F B and note H_σ = F A_σ B
- Solving yields A_σ = F^+ H_σ B^+
- Also, α_1^⊤ = H(ε, S) B^+ and α_∞ = F^+ H(P, ε)

The Spectral Method

Idea: use an SVD to obtain a factorization of H

- Given H and H_σ over a basis (P, S)
- Compute the compact SVD H = U S V^⊤ with U ∈ R^{P×n}, S ∈ R^{n×n}, V ∈ R^{S×n}
- Note H = (HV) V^⊤ – this corresponds to a rank factorization
- Let A_σ = (HV)^+ (H_σ V)

Properties

- Easy to implement: just linear algebra
- Fast to compute: O(max{|P|, |S|}^3)
- Noise tolerant: Ĥ ≈ H and Ĥ_σ ≈ H_σ implies Â_σ ≈ A_σ  ⇒  learning!

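A minimal NumPy sketch of the spectral method as described above; the recovery of α_1 and α_∞ follows the expressions from the previous slide, and the signature is an illustrative choice rather than a reference implementation.

```python
import numpy as np

def spectral_wa(H, H_sigma, h_S, h_P, n):
    """Spectral method sketch, assuming (P, S) is a basis with eps in P and S.

    H       : |P| x |S| Hankel block H_f(P, S)
    H_sigma : dict sigma -> |P| x |S| block H_f(P, sigma S)
    h_S     : length-|S| row H_f(eps, S)
    h_P     : length-|P| column H_f(P, eps)
    n       : number of states (target rank)
    """
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n].T                        # |S| x n, top-n right singular vectors
    HV = H @ V                          # rank factorization H = (HV) V^T
    HV_pinv = np.linalg.pinv(HV)
    A = {a: HV_pinv @ (Hs @ V) for a, Hs in H_sigma.items()}   # A_sigma = (HV)^+ H_sigma V
    alpha1 = h_S @ V                    # alpha_1^T = H(eps, S) V
    alphainf = HV_pinv @ h_P            # alpha_inf = (HV)^+ H(P, eps)
    return alpha1, alphainf, A
```
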
Overview
(of a biased selection)

Direct Applications

- Learning stochastic rational languages – any probability distribution
  computed by a WA
- Learning probabilistic finite state transducers – learn P[y|x] from
  example pairs (x, y)

Composition with Other Methods

- Combination with matrix completion for learning non-stochastic
  functions – when f : Σ* → R is not related to a probability distribution

Algorithmic and Miscellaneous Problems

- Interpretation as an optimization problem – from linear algebra to
  convex optimization
- Finding a basis via random sampling – knowing (P, S) is a prerequisite
  for learning

Learning Stochastic Rational Languages
[HKZ'09, BDR'09, etc.]

Idea: given a sample from a probability distribution f_A over Σ*, find a WA Â

Algorithms

- Given a basis, use the sample to compute Ĥ and Ĥ_σ
- Apply the spectral method to obtain Â_σ

Properties

- Can PAC learn any distribution computed by a WA (w.r.t. L1 distance)
- May not output a probability distribution
- Sample bound poly(1/ε, log(1/δ), n, |Σ|, |P|, |S|, 1/s_n(H), 1/s_n(B))

Open Problems / Future Work

- Learn models guaranteed to be probability distributions [Bailly '11]
- Study inference problems in such models
- Provide smoothing procedures, PAC learn w.r.t. KL
- How do "infrequent" states in the target affect learning?

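For the string-probability case f(x) = P[x], the empirical blocks Ĥ and Ĥ_σ can be filled with relative frequencies, as in the following sketch (an assumption about the estimator, with a fixed basis; it feeds directly into the spectral_wa sketch above).

```python
from collections import Counter
import numpy as np

def empirical_hankel(sample, P, S, alphabet):
    """Estimate Hankel blocks from an i.i.d. sample of strings."""
    counts = Counter(sample)
    N = len(sample)

    def fhat(x):                      # empirical probability of the string x
        return counts[x] / N

    H = np.array([[fhat(p + s) for s in S] for p in P])
    H_sigma = {a: np.array([[fhat(p + a + s) for s in S] for p in P]) for a in alphabet}
    h_S = np.array([fhat(s) for s in S])
    h_P = np.array([fhat(p) for p in P])
    return H, H_sigma, h_S, h_P
```
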
Learning Probabilistic Finite State Transducers
[BQC'11]

Idea: learn a function f : (Σ × ∆)* → R computing P[y|x]

Learning Model

- Input is a sample of aligned sequences (x_i, y_i) with |x_i| = |y_i|
- Drawn i.i.d. from the distribution P[x, y] = P[y|x] D(x)
- Want to assume as little as possible on D
- Performance measured against x generated from D

Properties

- Assuming the factorized (independence) form A^δ_σ = O_δ T_σ, the sample
  bound scales mildly with the input alphabet size |Σ|
- For applications, sequences need to be aligned prior to learning – or
  iterative procedures must be used

Open Problems / Future Work

- Deal with alignments inside the model
- Smoothing and inference questions (again!)

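One simple preprocessing view (an assumption for illustration, not taken from the slides): aligned input/output sequences can be recoded as strings over the product alphabet Σ × ∆, after which the Hankel estimation and spectral method sketches above apply to the recoded strings.

```python
def encode_pair(x, y):
    """Turn aligned sequences x, y (with |x| == |y|) into a tuple of (input, output) symbols."""
    assert len(x) == len(y)
    return tuple(zip(x, y))

print(encode_pair("AB", "ba"))   # (('A', 'b'), ('B', 'a'))
```
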
Matrix Completion and Spectral Learning
[BM'12]

Idea

- In stochastic learning tasks (e.g. P[x], P[xΣ*], E[|w|_x]) a sample S
  yields global approximate knowledge f̂_S
- The supervised learning setup instead provides pairs (x, f(x)), where x ~ D
- But the spectral method needs (approximate) information on whole sub-blocks
- Matrix completion finds the missing entries under constraints (e.g. a
  low-rank Hankel matrix); then the spectral method is applied

  [Figure: observed pairs (x_i, f(x_i)) fill some entries of a Hankel block;
   matrix completion fills in the remaining entries]

Result: generalization bounds for some MC + SM combinations

Open Problems / Future Work

- Design specific convex optimization algorithms for the completion problem
- Analyze combinations with other completion algorithms

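As a rough sketch of the pipeline (using one simple generic completion algorithm, not necessarily the formulation in the paper), a partially observed Hankel block can be completed by iterative singular value thresholding before running the spectral method; the regularization and iteration counts below are illustrative.

```python
import numpy as np

def soft_impute(M, mask, lam=0.1, iters=200):
    """SoftImpute-style completion. M: observed values (arbitrary where mask is
    False); mask: boolean array, True = observed entry."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        Y = np.where(mask, M, X)              # keep observed entries, current guess elsewhere
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - lam, 0.0)          # shrink singular values -> low rank
        X = (U * s) @ Vt
    return X

# Toy usage: a (nearly) rank-1 block with two unobserved entries
M = np.array([[2.0, 1.0, 0.5], [4.0, 0.0, 1.0], [1.0, 0.5, 0.0]])
mask = np.array([[True, True, True], [True, False, True], [True, True, False]])
print(np.round(soft_impute(M, mask, lam=0.01), 2))
```
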
An Optimization Point of View
[BQC'12]

Idea: replace the linear algebra with optimization primitives – make it
possible to use the "ML optimization toolkit"

Algorithms

- Spectral optimization:  min over {A_σ} and V_n with V_n^⊤ V_n = I of
  Σ_{σ∈Σ} ||H V_n A_σ − H_σ V_n||_F^2
- Convex relaxation:  min over A_Σ of ||H A_Σ − H_Σ||_F^2 + τ ||A_Σ||_*

Properties

- Equivalent in some situations and for some choices of parameters
- Experiments show the convex relaxation can do better in cases known to
  be difficult for the spectral method

Open Problems / Future Work

- Design problem-specific optimization algorithms
- Constrain the learned models by imposing further regularizations, e.g.
  sparsity

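A hedged sketch of one way to solve the convex relaxation above (assuming the regularizer is the nuclear norm) by proximal gradient descent with singular value thresholding; the step size and iteration count are illustrative choices, not the algorithm used in the paper.

```python
import numpy as np

def prox_nuclear(A, t):
    """Proximal operator of t * nuclear norm (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def convex_relaxation(H, H_Sigma, tau=0.1, iters=500):
    """Minimize ||H A - H_Sigma||_F^2 + tau ||A||_* over A."""
    step = 0.5 / (np.linalg.norm(H, 2) ** 2 + 1e-12)    # 1/L for the smooth term
    A = np.zeros((H.shape[1], H_Sigma.shape[1]))
    for _ in range(iters):
        grad = 2.0 * H.T @ (H @ A - H_Sigma)            # gradient of the Frobenius term
        A = prox_nuclear(A - step * grad, step * tau)   # proximal step on the nuclear norm
    return A
```
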
Finding a Basis
[BQC '12]

Idea: choose a basis in a data-driven manner – as opposed to using a fixed
set of prefixes and suffixes

Algorithm (a runnable version is sketched right after this slide)

  Input: strings (x_1, ..., x_N)
  Initialize P ← ∅, S ← ∅
  for i = 1 to N do
    Choose 0 ≤ t ≤ |x_i| uniformly at random
    Split x_i = u_i v_i with |u_i| = t and |v_i| = |x_i| − t
    Add u_i to P and v_i to S
  end for
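A direct Python rendering of the sampling procedure above (the function name and seed handling are illustrative):

```python
import random

def sample_basis(strings, seed=None):
    rng = random.Random(seed)
    P, S = set(), set()
    for x in strings:
        t = rng.randint(0, len(x))     # split point chosen uniformly at random
        P.add(x[:t])                   # prefix u_i
        S.add(x[t:])                   # suffix v_i
    return P, S

print(sample_basis(["abba", "ba", "aab"], seed=0))
```
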
Result

- For x_i drawn i.i.d. from a distribution D over Σ* with full support,
  and f = f_A with ||A_σ|| ≤ 1:
  if N ≥ C η(f, D) log(1/δ), then (P, S) is a basis with high probability

Open Problems / Future Work

- Do something smarter and more practical
- Find smaller bases containing shorter strings

Take-home Message

- Efficient, easy-to-implement learning method
- Alternative to EM that does not suffer from local minima
- Can be extended to many probabilistic (and some non-probabilistic) models
- Comes with a theoretical analysis, quantifies the hardness of models,
  provides intuitions
- Lots of interesting open problems, theoretical and practical