Learning probabilistic finite automata
Colin de la Higuera
University of Nantes
Darmstadt, September 2014
Links
http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/
http://videolectures.net/colin_de_la_higuera/
A bibliography linked with this presentation can
be found here:
http://cdlh7.free.fr/WATA2014/wata_cdlh.pdf
Chapters 5 and 16
Outline
1. About grammatical inference and distributions
  1.1 A bit of history
  1.2 Some motivations
2. About probabilistic finite state automata
  2.1 Representation issues
  2.2 Parsing issues
  2.3 Distances
  2.4 Most probable string
3. Estimation issues
4. Learning models
5. Learning DPFA: state merging
6. Extending the model
7. Further challenges
1. INTRODUCTION
Grammatical inference
Applications
The "need" for probabilities
Grammatical inference
is about learning a grammar given information
about a language
Information is strings, trees or graphs
Information can be (typically)
Text: only positive information
Informant: labelled data
Actively sought (query learning, teaching)
These lists are not exhaustive
The grammars
Languages and grammars from the Chomsky
hierarchy
Probabilistic automata and context-free grammars
Hidden Markov Models
Patterns
Transducers
…any device that can generate, recognise, or define a language
Grammatical inference: some goals
Invent algorithms which can take data and identify
in the limit a given target
Understand when this is not possible
Put some polynomial bounds on the identification
process
Build empirical learning systems
A typical situation:
unsupervised learning
Only positive data
Perhaps some noise
Not necessarily an identification situation
Looks like clustering, but isn’t
Explanatory clustering?
The grammar induction
problem, revisited
The data consists of positive strings, "generated" following an unknown distribution
The goal is now to find (identify, learn) this
distribution
or the grammar/automaton that is used to
generate the strings
Example
[Figure: target DPFA; state 1: a:1/4, b:1/4, final 1/2; state 2: a:2/9, b:1/9, final 2/3]
S={λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10),
aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3),
aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1),
bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1),
aabba(1), abbaa(1), abbab(1)}
[Figure: DPFA learned from S; state 1: a:.252, b:.250, final .497; state 2: a:.215, b:.110, final .676]
2. DEFINITIONS
2.1 Definitions
2.2 Equivalence between models
2.3 Parsing issues
2.4 Distances
2.5 Most probable string
2.1 Distributions, definitions
A distribution D over Σ* satisfies:
0 ≤ Pr_D(w) ≤ 1
∑_{w∈Σ*} Pr_D(w) = 1
It can be seen as a vector in a space of infinite dimension, where each dimension is a w ∈ Σ*
A Probabilistic Finite (state) Automaton is a tuple ⟨Q, Σ, I_P, Fin_P, δ_P⟩:
Q: set of states
I_P : Q → [0,1] (initial probabilities)
Fin_P : Q → [0,1] (halting probabilities)
δ_P : Q × Σ × Q → [0,1] (transition probabilities)
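As a concrete reading of this tuple, here is a minimal sketch of a PFA as a Python data structure; the class and field names are illustrative assumptions, not notation from the slides. The check follows the usual normalisation convention (initial mass sums to 1; at each state, halting plus outgoing mass sums to 1).

```python
from dataclasses import dataclass

@dataclass
class PFA:
    # <Q, Sigma, I_P, Fin_P, delta_P>
    states: list[str]
    alphabet: list[str]
    init: dict[str, float]                    # I_P : Q -> [0,1]
    final: dict[str, float]                   # Fin_P : Q -> [0,1]
    trans: dict[tuple[str, str, str], float]  # delta_P : Q x Sigma x Q -> [0,1]

    def check(self, eps: float = 1e-9) -> bool:
        """Usual normalisation: initial mass sums to 1, and at every state
        the halting probability plus outgoing transition mass sums to 1."""
        ok = abs(sum(self.init.values()) - 1.0) < eps
        for q in self.states:
            out = sum(p for (q1, _, _), p in self.trans.items() if q1 == q)
            ok &= abs(self.final.get(q, 0.0) + out - 1.0) < eps
        return ok
```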
What does a PFA do?
It defines the probability of each string w as the sum, over all paths reading w, of the products of the probabilities:
Pr_A(w) = ∑_{π∈paths(w)} Pr(π)
where a path is π = q_0 a_1 q_1 a_2 … a_n q_n and
Pr(π) = I_P(q_0) · ∏_{j=1..n} δ_P(q_{j−1}, a_j, q_j) · Fin_P(q_n)
Note that if λ-transitions are allowed, the sum may be infinite
DPFA: Deterministic Probabilistic Finite Automaton
[Figure: example DPFA; transition and final probabilities are fractions such as 1/2, 1/3, 1/4, 2/3, 3/4]
[Figure: the same DPFA, used to parse abab]
Pr_A(abab) = 1/2 × 1/2 × 1/3 × 2/3 × 3/4 = 1/24
PFA: Probabilistic Finite (state) Automaton
[Figure: example nondeterministic PFA, with fractional initial, transition and final probabilities]
λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions
[Figure: example λ-PFA; some transitions are labelled λ]
Fuzzy automata are different objects!
[Figure: fuzzy automaton with weights such as 2/3, 3/4, 1/2, 1/4 on its transitions]
Pr_A(abab ∈ L) = 2/3 × 3/4 × 1/2 × 2/3 × 3/4 = 1/8
How useful are these automata?
They can define a distribution over Σ*
They do not tell us if a string belongs to a
language
They are good candidates for grammar induction
2.3 Parsing issues
Computation of the probability of a string or of a
set of strings
Deterministic case
Simple: apply definitions
Technically, sum up the logs of the probabilities rather than multiplying them: this is easier, safer (no underflow) and cheaper
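A minimal sketch of deterministic parsing in the log domain, reusing the hypothetical PFA class above (for a DPFA there is one initial state and at most one successor per (state, symbol) pair):

```python
import math

def log_prob_dpfa(A: PFA, w: str) -> float:
    """Log-probability of w in a deterministic PFA: follow the unique path,
    summing logs instead of multiplying probabilities (avoids underflow)."""
    q = next(s for s, p in A.init.items() if p > 0)   # unique initial state
    logp = math.log(A.init[q])
    for a in w:
        succ = [(q2, p) for (q1, s, q2), p in A.trans.items()
                if q1 == q and s == a and p > 0]
        if not succ:
            return float("-inf")   # no path: probability 0
        (q, p), = succ             # determinism: exactly one successor
        logp += math.log(p)
    f = A.final.get(q, 0.0)
    return logp + math.log(f) if f > 0 else float("-inf")
```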
Non-deterministic case
[Figure: nondeterministic PFA with transition probabilities 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45 and two paths reading aba]
Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532
In the literature
The computation of the probability of a string is by dynamic programming: O(n²m)
2 algorithms: Backward and Forward
If we want the most probable derivation to define
the probability of a string, then we can use the Viterbi
algorithm
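A sketch of the Forward computation on the same hypothetical structure, assuming numpy; each step is a vector-matrix product over the states, which is where the O(n²m) comes from:

```python
import numpy as np

def forward_prob(A: PFA, w: str) -> float:
    """Pr_A(w) by the Forward algorithm: f[q] is the probability of having
    read the prefix so far and being in state q."""
    idx = {q: i for i, q in enumerate(A.states)}
    n = len(A.states)
    # one transition matrix per symbol
    M = {a: np.zeros((n, n)) for a in A.alphabet}
    for (q1, a, q2), p in A.trans.items():
        M[a][idx[q1], idx[q2]] = p
    f = np.array([A.init.get(q, 0.0) for q in A.states])
    for a in w:
        f = f @ M[a]          # one O(n^2) step per symbol of w
    fin = np.array([A.final.get(q, 0.0) for q in A.states])
    return float(f @ fin)
```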
2.4 Distances
What for?
Estimate the quality of a language model
Have an indicator of the convergence of
learning algorithms
Construct kernels
Main idea
Suppose each distribution (or PFA) is a vector in
an infinite dimension space
Compute distances between these vectors
2 issues
Definitions
Having algorithms
Distance d2
d₂(D, D′) = √( ∑_{w∈Σ*} (Pr_D(w) − Pr_{D′}(w))² )
Can be computed in polynomial time if D and D′ are given by PFA
This also means that equivalence of PFA is in P
Note that this is not the case for d₁
A small word before concluding
(Recent work with M-J Nederhof and J. Scicluna)
None of this seems possible for PCFGs:
d₁, d₂, d_KL are all intractable (the associated problems are undecidable)
Equivalence of PCFGs is interreducible with the multiple ambiguity problem
2.5 Most probable string
Name: Most probable string (MPS)
Instance: a probabilistic automaton A, a probability p > 0
Question: is there a string x ∈ Σ* such that Pr_A(x) > p?
Name: Consensus string (CS)
Instance: a probabilistic automaton A
Question: find a string x ∈ Σ* such that ∀y ∈ Σ*, Pr_A(y) ≤ Pr_A(x)
Most probable string is?
[Figure: a small PFA with transitions a:0.7, a:0.3, a:0.9, b:0.1]
Pr_A(b) = 0.1
Pr_A(aaaaa) = 3 × 0.9 × 0.3² × 0.7² ≈ 0.119
Result #1 (cdlh & Oncina 2013)
Key lemma: if w has probability p, then it has length at most |A|²/p
The lemma allows us to bound the search space, given some value p
Concretely, this means that MPS can be solved by searching only the strings within this length bound
As a corollary, MPS is decidable!
Result #2
There exists an algorithm which computes the consensus string in time O(|Σ|·|A|²/p_opt²)
Note this is no longer the decision problem
Note: p_opt is the probability of the most probable string
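The following is not the algorithm of the paper, only a sketch of how such a search can be organised in the deterministic case: in a DPFA prefix probabilities only decrease along a path, so a best-first search may stop as soon as the best open prefix cannot beat the best complete string found.

```python
import heapq

def consensus_string_dpfa(A: PFA) -> tuple[str, float]:
    """Best-first search for the most probable string of a DPFA.
    Assumes a consistent DPFA (no probability-1 cycles), so prefix
    probabilities shrink and only finitely many prefixes beat any p > 0."""
    q0 = next(s for s, p in A.init.items() if p > 0)
    best_w, best_p = "", 0.0
    heap = [(-A.init[q0], "", q0)]     # max-heap on prefix probability
    while heap:
        negp, w, q = heapq.heappop(heap)
        p = -negp
        if p <= best_p:                # no extension can beat the incumbent
            break
        halt = p * A.final.get(q, 0.0)
        if halt > best_p:
            best_w, best_p = w, halt
        for (q1, a, q2), t in A.trans.items():
            if q1 == q and p * t > best_p:
                heapq.heappush(heap, (-(p * t), w + a, q2))
    return best_w, best_p
```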
3 ESTIMATION ISSUES
Expectation Maximisation (EM)
We can use an approach similar to that of clustering: loop between
E-step (expectation): use the current parameters to determine which data belongs to which cluster
M-step (maximization): use each cluster to determine the parameters
And for PFA?
We use a dataset of strings
E-step: parse the strings and obtain virtual counts
M-step: use the counts to update the weights
The algorithm used for PFA is called Baum-Welch
E-step uses Forward-Backward
We have to update the count for transition (q_i, a_k, q_j): its contribution at position k combines F[q_i,k], δ_P(q_i,a_k,q_j) and B[q_j,k+1]
[Figure: a transition from q_i to q_j reading a_k, with the forward term F[q_i,k] on the left and the backward term B[q_j,k+1] on the right]
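A sketch of this E-step for a single string, on the same hypothetical representation (numpy assumed): F and B are the Forward and Backward tables, and dividing each F[q_i,k] · δ_P(q_i,a_k,q_j) · B[q_j,k+1] product by Pr_A(w) yields the expected counts.

```python
import numpy as np
from collections import defaultdict

def expected_transition_counts(A: PFA, w: str) -> dict:
    """Expected number of uses of each transition while generating w."""
    idx = {q: i for i, q in enumerate(A.states)}
    n, L = len(A.states), len(w)
    M = {a: np.zeros((n, n)) for a in A.alphabet}
    for (q1, a, q2), p in A.trans.items():
        M[a][idx[q1], idx[q2]] = p
    init = np.array([A.init.get(q, 0.0) for q in A.states])
    fin = np.array([A.final.get(q, 0.0) for q in A.states])
    # forward: F[k] = state distribution after reading w[:k]
    F = [init]
    for a in w:
        F.append(F[-1] @ M[a])
    # backward: B[k][q] = probability of generating w[k:] from state q
    B = [None] * (L + 1)
    B[L] = fin
    for k in range(L - 1, -1, -1):
        B[k] = M[w[k]] @ B[k + 1]
    total = float(F[0] @ B[0])   # = Pr_A(w)
    counts = defaultdict(float)
    for k, a in enumerate(w):
        for (q1, s, q2), p in A.trans.items():
            if s == a:
                counts[(q1, s, q2)] += F[k][idx[q1]] * p * B[k + 1][idx[q2]] / total
    return counts
```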
4. LEARNING MODELS
Is learning an ill-posed question?
Can we answer "we have learnt well"?
Can we answer "this learner is better than that one"?
Can we answer "our learning algorithm is going to learn OK"?
Yes!
Provided we replace (for analysis only) the
learning problem by an identification problem
We can answer:
In the limit our algorithm will identify
With high probability our algorithm will do really well
Advert: learning probabilistic objects can lead to
convergence proofs
5 LEARNING WITH STATE MERGING
A learning sample
is a multiset
Strings appear with a frequency (or multiplicity)
S={λ (3), aaa (4), aaba (2), ababa (1), bb (3),
bbaaa (1)}
DFFA
A deterministic frequency finite automaton (DFFA) is a DFA with a frequency function returning a positive integer for every state, every transition, and for entering the initial state, such that
at every state, the frequency entering equals the frequency leaving (halting included), and
the total halting frequency equals the starting frequency
Example
[Figure: a DFFA; initial frequency 6; transition frequencies a:2, a:1, a:5, b:3, b:4, b:5; halting frequencies 1, 2, 3]
From a DFFA to a DPFA
Frequencies become relative frequencies by
dividing by sum of exiting frequencies
[Figure: the same DFFA normalised, e.g. a:2/6, b:3/6, halting 1/6; a:1/7, b:4/7, halting 2/7; a:5/13, b:5/13, halting 3/13; initial 6/6]
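A sketch of this normalisation on hypothetical count dictionaries (names are illustrative), producing the PFA structure used earlier:

```python
def dffa_to_dpfa(state_freq: dict, trans_freq: dict, init_state: str) -> PFA:
    """state_freq[q]: halting frequency; trans_freq[(q, a, q2)]: transition
    frequency. Divide everything leaving q by the total leaving q.
    The initial frequency normalises to 1 in a DFFA."""
    states = sorted(state_freq)
    alphabet = sorted({a for (_, a, _) in trans_freq})
    total = {q: state_freq[q] + sum(f for (q1, _, _), f in trans_freq.items()
                                    if q1 == q) for q in states}
    return PFA(
        states=states,
        alphabet=alphabet,
        init={init_state: 1.0},
        final={q: state_freq[q] / total[q] for q in states},
        trans={k: f / total[k[0]] for k, f in trans_freq.items()},
    )
```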
From a DFA and a sample to a
DFFA
S = {λ, aaaa, ab, babb, bbbb, bbbbaa}
[Figure: the DFA with the counts of S distributed along its paths, yielding the same DFFA as above]
Note
Another sample may lead to the same DFFA
Doing the same with an NFA is a much harder problem: typically what the Baum-Welch algorithm (EM) was invented for…
The frequency prefix tree
acceptor
The data is a multi-set
The FTA is the smallest tree-like FFA consistent
with the data
Can be transformed into a PFA if needed
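A sketch of the construction on such a multiset (hypothetical helper; the tree is stored as halting and transition counts indexed by string prefixes, with λ as the empty string):

```python
from collections import defaultdict

def build_fta(sample: dict):
    """Frequency prefix tree acceptor: states are the prefixes of the sample;
    each prefix carries its halting count and per-symbol transition counts."""
    halt = defaultdict(int)    # prefix -> how many strings end here
    trans = defaultdict(int)   # (prefix, symbol) -> how many pass through
    for w, freq in sample.items():
        for k, a in enumerate(w):
            trans[(w[:k], a)] += freq
        halt[w] += freq
    return halt, trans

S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
halt, trans = build_fta(S)
# e.g. trans[("", "a")] == 7: the strings starting with a
```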
From the sample to the frequency
prefix tree acceptor
FTA(S)
[Figure: FTA(S); root count 14 with 3 halting at λ; the a-branch carries 7 and the b-branch 4; leaf counts match the multiset below]
S={λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
Red, Blue and White states
- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others
[Figure: a prefix tree with Red, Blue and White states marked]
Same as with DFA and what RPNI does
State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
while Blue ≠ ∅ do
  choose q from Blue such that Freq(q) ≥ t0
  if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
  Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red
The real question
How do we decide if d(Ap,Aq) is small?
Use a distance…
Be able to compute this distance
If possible update the computation easily
Have properties related to this distance
Deciding if two distributions are
similar
If the two distributions are known, equality can
be tested
Distance (L2 norm) between distributions can be
exactly computed
But what if the two distributions are unknown?
A run of Alergia
Our learning multisample
S={λ(490), a(128), b(170), aa(31), ab(42), ba(38),
bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9),
bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3),
aabb(2), abaa(2), abab(2), abba(2), abbb(1),
baaa(2), baab(2), baba(1), babb(1), bbaa(1),
bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1),
aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}
Parameter α is arbitrarily set to 0.05. We choose 30 as the value for threshold t0.
Note that for the blue states whose frequency is less than the threshold, a special merging operation takes place (once all the informed merges have taken place).
[Figure: FTA of the learning multisample; root count 1000 with 490 halting at λ; the a-subtree carries 257 strings (128 halting at a) and the b-subtree 253 (170 halting at b)]
Can we merge λ and a?
Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257,
257/1000 with 64/257,
253/1000 with 65/257, …
All tests return true
Merge…
[Figure: state a is merged into λ; the a:257 transition becomes a loop on the root and the a-subtree is to be folded in]
And fold
[Figure: after folding; the root now carries 1000 initial and an a:341 loop, with 660 halting; b:340 leads to a state with 225 halting]
Next merge?
λ with b?
[Figure: the same automaton; the blue state reached by b (count 340, halting 225) is compared with λ]
Can we merge λ and b?
Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*
660/1341 and 225/340 are different (giving γ = 0.162)
On the other hand, the Hoeffding-style bound is
(√(1/n₁) + √(1/n₂)) · √(½ ln(2/α)) = 0.111
so the test rejects the merge
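The same test as a function; a sketch with the values from this run (the bound is the Hoeffding-style term above):

```python
import math

def alergia_compatible(f1: int, n1: int, f2: int, n2: int,
                       alpha: float = 0.05) -> bool:
    """Accept f1/n1 ~ f2/n2 unless the observed gap exceeds the
    Hoeffding-style bound at confidence alpha."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) \
            * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma <= bound

# the test from the slide: 660/1341 vs 225/340 fails (gap > 0.111)
print(alergia_compatible(660, 1341, 225, 340))   # False
```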
Promotion
[Figure: the blue state is promoted to Red; the automaton is otherwise unchanged]
Merge
[Figure: the next blue state is merged into a red state]
And fold
[Figure: after folding; counts now include an a:341 loop on the root, b:340, a:95, b:49, a:11, b:9]
Merge
[Figure: a further merge on the same automaton]
And fold
As a PFA
[Figure: the final DFFA (root: 1000 initial, a:354 loop, b:351 out, 698 halting; second state: a:96 loop, b:49 back, 302 halting) and the corresponding DPFA: a:.252, b:.250, halting .497; a:.215, b:.110, halting .676]
Conclusion and logic
Alergia builds a DFFA in polynomial time
Alergia can identify DPFA in the limit with
probability 1
Alternative tests exist
6 ABOUT EXTENSIONS
6.1 Context-free grammars
6.2 Transducers
6.1 About probabilistic context-free grammars
First problem: consistency
A → AA + a (i.e. rules A → AA and A → a, each with probability ½)
Parsing is possible with the CYK algorithm
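To see the issue (a standard branching-process argument, not spelled out on the slide): let q be the probability that A eventually derives a finite string, and let p be the probability of rule A → AA. Then
q = (1 − p) + p·q²
whose smallest non-negative solution is q = min(1, (1−p)/p). For p > ½ some probability mass escapes into infinite derivations (q < 1) and the grammar is not consistent; p = ½ is the critical case, where the total mass is still 1 but the expected derivation length is infinite.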
Estimating probabilities
One option is to estimate the probabilities
The Inside-Outside algorithm deals with this (similar to EM / Baum-Welch)
Bayesian approach: add priors to help the algorithm
6.2 Learning probabilistic
transducers
Work by Akram et al.
Learns probabilistic subsequential transducers
From queries
From data
Current work
Learn semi-deterministic transducers
Deterministic input
Nondeterministic output
7 Conclusion and open questions
Grammatical Inference Software
Repository
https://logiciels.lina.univ-nantes.fr/redmine/projects/gisr/wiki
Merge and fold
Suppose we decide to merge
b with state a
[Figure: a DFFA with initial count 100; λ has a b:24 transition to state b, which is to be merged into state a]
Merge and fold
First disconnect the b:24 transition and reconnect it to state a
[Figure: the b:24 edge from λ now enters state a; the subtree that was below b awaits folding]
Merge and fold
Then fold
[Figure: the subtree formerly below b is folded into the subtree of a, adding the counts]
Merge and fold
After folding
[Figure: the result; the added counts appear as b:30 (6+24) and a:10 (6+4)]
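A sketch of the folding recursion on hypothetical FTA count dictionaries (children maps a (state, symbol) pair to its successor state; all names are illustrative):

```python
def fold(halt, trans, children, target, subtree):
    """Recursively add the counts of `subtree` into `target`.
    halt[q]: halting count; trans[(q, a)]: transition count."""
    halt[target] = halt.get(target, 0) + halt.get(subtree, 0)
    for a in [a for (q, a) in list(trans) if q == subtree]:
        trans[(target, a)] = trans.get((target, a), 0) + trans[(subtree, a)]
        if (target, a) in children:
            # both states have an a-successor: fold those too
            fold(halt, trans, children, children[(target, a)],
                 children[(subtree, a)])
        else:
            children[(target, a)] = children[(subtree, a)]
        del trans[(subtree, a)]
```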