A Spectral Learning Algorithm for Finite State
Transducers
Borja Balle, Ariadna Quattoni, Xavier Carreras
ECML PKDD — September 7, 2011
Overview
Probabilistic Transducers
- Model input-output relations with hidden states
- As a conditional distribution Pr[y | x] over strings
- With certain independence assumptions
[Diagram: input sequence X_1 X_2 X_3 X_4 ···, hidden sequence H_1 H_2 H_3 H_4 ···, output sequence Y_1 Y_2 Y_3 Y_4 ···]
- Used in many applications: NLP, biology, ...
- Hard to learn in general; usually the EM algorithm is used
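As a concrete illustration of these independence assumptions, here is a minimal sketch (ours, not from the talk) of sampling y ~ Pr[· | x]; the tables T, O and alpha are hypothetical stand-ins for the transition, emission and initial-state parameters defined later in the deck:

```python
import numpy as np

def sample_output(x, T, O, alpha, rng=None):
    """Sample an output string y given input string x from an
    input-output hidden-state model. T[a][:, h] is the next-state
    distribution given input symbol a and state h; O[:, h] is the
    output distribution in state h; alpha is the initial state
    distribution. All names here are illustrative."""
    rng = rng or np.random.default_rng()
    h = rng.choice(len(alpha), p=alpha)              # H_1 ~ alpha
    y = []
    for s, a in enumerate(x):
        y.append(rng.choice(O.shape[0], p=O[:, h]))  # Y_s depends only on H_s
        if s + 1 < len(x):
            h = rng.choice(len(alpha), p=T[a][:, h]) # H_{s+1} depends on X_s, H_s
    return y
```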
Overview
Spectral Learning Probabilistic Transducers
Our contribution:
- Fast learning algorithm for probabilistic FST
- With PAC-style theoretical guarantees
- Based on an Observable Operator Model for FST
- Using spectral methods (Chang '96, Mossel-Roch '05, Hsu et al. '09, Siddiqi et al. '10)
- Performing better than EM in experiments with real data
Outline
- Observable Operators for FST
- Learning Observable Operator Models
- Experimental Evaluation
- Conclusion
Observable Operators for FST
Deriving Observable Operator Models
Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability
Pr[y | x]
  = \sum_{h \in H^{t+1}} Pr[y, h | x]                 (marginalize states)
  = \sum_{h_{t+1} \in H} Pr[y, h_{t+1} | x]           (independence assumptions)
  = 1^\top \alpha_{t+1}                               (vector form, \alpha_{t+1} \in R^m)
  = 1^\top A_{x_t}^{y_t} \alpha_t                     (forward-backward equations)
  = 1^\top A_{x_t}^{y_t} \cdots A_{x_1}^{y_1} \alpha  (induction on t)
The choice of an operator A_a^b depends only on observable symbols
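To make the operator form concrete, here is a minimal sketch (our names, not the paper's) of the forward computation above; A maps an aligned symbol pair to its m × m operator:

```python
import numpy as np

def fst_prob(x, y, A, alpha):
    """Compute Pr[y | x] = 1^T A_{x_t}^{y_t} ... A_{x_1}^{y_1} alpha
    for aligned strings x, y. A is a dict from (input symbol, output
    symbol) to the corresponding m x m observable operator."""
    state = np.asarray(alpha, dtype=float)
    for a, b in zip(x, y):          # one operator per aligned pair
        state = A[(a, b)] @ state   # alpha_{s+1} = A_{x_s}^{y_s} alpha_s
    return state.sum()              # contract with the all-ones vector
```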
Observable Operators for FST
Observable Operator Model Parameters
Given X = {a_1, ..., a_k}, Y = {b_1, ..., b_l}, H = {c_1, ..., c_m}, then

  Pr[y | x] = 1^\top A_{x_t}^{y_t} \cdots A_{x_1}^{y_1} \alpha

with parameters:

A_a^b = T_a D_b \in R^{m \times m}                                          (factorized operator)
T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j] \in R^{m \times m}   (state transition)
D_b(i, j) = \delta_{i,j} Pr[Y_s = b | H_s = c_j] \in R^{m \times m}         (observation emission)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] \in R^{l \times m}                      (collected emissions)
\alpha(i) = Pr[H_1 = c_i] \in R^m                                           (initial probabilities)
The choice of an operator A_a^b depends only on observable symbols ...
... but the operator parameters are conditioned on hidden states
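A short sketch (function and variable names are ours) of how the factorized operators could be assembled from these parameters; together with fst_prob above it reproduces Pr[y | x]:

```python
import numpy as np

def build_operators(T, O, Y):
    """Assemble A_a^b = T_a D_b. T maps each input symbol a to its
    m x m transition matrix T_a; O is the l x m emission matrix with
    O(i, j) = Pr[Y_s = b_i | H_s = c_j]; Y lists the output alphabet
    in the same order as the rows of O."""
    A = {}
    for a, Ta in T.items():
        for i, b in enumerate(Y):
            Db = np.diag(O[i, :])   # D_b(j, j) = Pr[Y_s = b | H_s = c_j]
            A[(a, b)] = Ta @ Db
    return A
```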
Observable Operators for FST
A Learnable Set of Observable Operators
Note that for any invertible Q \in R^{m \times m}:

  Pr[y | x] = 1^\top Q^{-1} (Q A_{x_t}^{y_t} Q^{-1}) \cdots (Q A_{x_1}^{y_1} Q^{-1}) Q \alpha
Idea (subspace identification methods for linear systems, '80s):
Find a basis for the state space such that operators in the new basis are related to observable quantities.
Following multiplicity automata and spectral HMM learning . . .
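The basis-change invariance behind this idea is easy to confirm numerically; a small sketch with toy values (entirely ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2
A1, A2 = rng.random((m, m)), rng.random((m, m))  # toy operators
alpha = rng.random(m)                            # toy initial vector
Q = rng.random((m, m)) + np.eye(m)               # invertible (generically)
Qi = np.linalg.inv(Q)

p = np.ones(m) @ A2 @ A1 @ alpha
p_conj = (np.ones(m) @ Qi) @ (Q @ A2 @ Qi) @ (Q @ A1 @ Qi) @ (Q @ alpha)
assert np.allclose(p, p_conj)  # the Q's cancel telescopically
```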
Observable Operators for FST
A Learnable Set of Observable Operators
Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities:

\rho(i) = Pr[Y_1 = b_i] \in R^l
P(i, j) = Pr[Y_1 = b_j, Y_2 = b_i] \in R^{l \times l}
P_a^b(i, j) = Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i | X_2 = a] \in R^{l \times l}

Theorem (\rho, P and P_a^b are sufficient statistics)
Let P = U \Sigma V^* be a thin SVD decomposition; then Q = U^\top O yields (under certain assumptions):

Q \alpha = U^\top \rho
1^\top Q^{-1} = \rho^\top (U^\top P)^+
Q A_a^b Q^{-1} = (U^\top P_a^b)(U^\top P)^+
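A sketch of the recovery step the theorem licenses, using numpy's SVD and pseudoinverse (the function name and return convention are ours; the triple represents Q alpha, 1^T Q^{-1} and the conjugated operators):

```python
import numpy as np

def recover_operators(rho, P, Pab, m):
    """rho: (l,) unigram vector; P: (l, l) bigram matrix; Pab: dict
    from (a, b) to the (l, l) trigram matrix P_a^b; m: number of
    hidden states."""
    U, _, _ = np.linalg.svd(P)
    U = U[:, :m]                        # top m left singular vectors
    UP_pinv = np.linalg.pinv(U.T @ P)   # (U^T P)^+
    alpha_q = U.T @ rho                 # Q alpha
    ones_q = rho @ UP_pinv              # 1^T Q^{-1}
    B = {ab: (U.T @ M) @ UP_pinv for ab, M in Pab.items()}  # Q A_a^b Q^{-1}
    return alpha_q, ones_q, B
```

Probabilities are then computed exactly as before, with ones_q replacing 1^T, alpha_q replacing alpha, and B replacing the original operators.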
Learning Observable Operator Models
Spectral Learning Algorithm
Given
- Input alphabet X and output alphabet Y
- Number of hidden states m
- Training sample S = {(x^1, y^1), ..., (x^n, y^n)}

Do
- Compute unigram \hat\rho, bigram \hat P and trigram \hat P_a^b relative frequencies in S
- Perform SVD on \hat P and take \hat U with the top m left singular vectors
- Return operators computed using \hat\rho, \hat P, \hat P_a^b and \hat U (sketched below)

In Time
- O(n) to compute relative frequencies
- O(|Y|^3) to compute the SVD
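Putting the pieces together, a minimal end-to-end sketch (ours; it assumes every training pair has length at least 3 so that unigram, bigram and trigram events are all observed, and it reuses recover_operators from the previous sketch):

```python
import numpy as np

def spectral_learn(S, X, Y, m):
    """S: list of aligned pairs (x, y); X, Y: input/output alphabets;
    m: number of hidden states."""
    l = len(Y)
    yi = {b: i for i, b in enumerate(Y)}
    rho, P = np.zeros(l), np.zeros((l, l))
    Pab = {(a, b): np.zeros((l, l)) for a in X for b in Y}
    na = dict.fromkeys(X, 0)                        # counts of X_2 = a
    for x, y in S:
        rho[yi[y[0]]] += 1                          # unigram: Y_1
        P[yi[y[1]], yi[y[0]]] += 1                  # bigram: rows Y_2, cols Y_1
        na[x[1]] += 1
        Pab[(x[1], y[1])][yi[y[2]], yi[y[0]]] += 1  # trigram given X_2
    rho /= len(S)
    P /= len(S)
    for (a, _), M in Pab.items():
        if na[a] > 0:
            M /= na[a]                              # condition on X_2 = a
    return recover_operators(rho, P, Pab, m)
```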
Learning Observable Operator Models
PAC-Style Result
- Input distribution D_X over X^* with \lambda = E[|X|] and \mu = \min_a Pr[X_2 = a]
- Conditional distributions D_{Y|x} on Y^* given x \in X^*, modeled by an FST with m states (satisfying certain rank assumptions)
- Sampling i.i.d. from the joint distribution D_X \otimes D_{Y|X}

Theorem
For any 0 < \varepsilon, \delta < 1, if the algorithm receives a sample of size

  n \geq O( (\lambda^2 m |Y|) / (\varepsilon^4 \mu \sigma_O^2 \sigma_P^4) \cdot \log(|X| / \delta) )

(where \sigma_O and \sigma_P are the m-th singular values of O and P in the target), then with probability at least 1 - \delta the hypothesis \hat D_{Y|x} satisfies

  E_X [ \sum_{y \in Y^*} | D_{Y|X}(y) - \hat D_{Y|X}(y) | ] \leq \varepsilon

(this is the L1 distance between the joint distributions D_X \otimes D_{Y|X} and D_X \otimes \hat D_{Y|X}).
Experimental Evaluation
Synthetic Experiments
Goal: Compare against baselines when the learning hypotheses hold
Target: Randomly generated with |X| = 3, |Y| = 3, |H| = 2
[Plot: L1 distance (0 to 0.7) vs. number of training samples, in thousands (32 to 32768), for HMM, k-HMM and FST]

- HMM: model input-output jointly
- k-HMM: one model for each input symbol
- Results averaged over 5 runs
Experimental Evaluation
Transliteration Experiments
Goal: Compare against EM in a real task (where modeling assumptions fail)
Task: English to Russian transliteration (brooklyn → бруклин)
[Plot: normalized edit distance (20 to 80) vs. number of training sequences (75 to 6000), for Spectral with m = 2, 3 and EM with m = 2, 3]

Training times:
  Spectral         26 s
  EM (iteration)   37 s
  EM (best)        1133 s

- Sequence alignment done in preprocessing
- Standard techniques used for inference
- Test size: 943, |X| = 82, |Y| = 34
Conclusion
Summary of Contributions
- Fast spectral method for learning input-output OOM
- Strong theoretical guarantees with few assumptions on input distribution
- Outperforming previous spectral algorithms on FST
- Faster and better than EM in some real tasks
Technical Assumptions
X = {a_1, ..., a_k}, Y = {b_1, ..., b_l}, H = {c_1, ..., c_m}

Parameters

T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j] \in R^{m \times m}   (state transition)
T = \sum_a T_a Pr[X_1 = a] \in R^{m \times m}                               ("mean" transition matrix)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] \in R^{l \times m}                      (collected emissions)
\alpha(i) = Pr[H_1 = c_i] \in R^m                                           (initial probabilities)

Assumptions
1. l \geq m
2. \alpha > 0
3. rank(T) = rank(O) = m
4. \min_a Pr[X_2 = a] > 0
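For completeness, a small sketch (ours) of checking assumptions 1 to 3 on a candidate target; assumption 4 concerns the input distribution D_X and cannot be verified from these parameters alone:

```python
import numpy as np

def check_assumptions(T, O, alpha, px1):
    """T maps each input symbol a to T_a; O is the l x m emission
    matrix; px1 is our name for the distribution Pr[X_1 = a]."""
    l, m = O.shape
    Tbar = sum(px1[a] * Ta for a, Ta in T.items())  # "mean" transition matrix
    assert l >= m                                   # 1. l >= m
    assert np.all(alpha > 0)                        # 2. alpha > 0
    assert np.linalg.matrix_rank(Tbar) == m         # 3. rank(T) = m
    assert np.linalg.matrix_rank(O) == m            #    and rank(O) = m
```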