
Characterization of state merging strategies which ensure identification in the limit from complete data
Cristina Bibire
History
Motivation
Preliminaries
RPNI
Further Research
Bibliography
History
In the second half of the 1960s, Gold was the first to formulate the process of learning
formal languages. Motivated by observing how children learn, he proposed
the idea that learning is an infinite process of guessing grammars: it does not
terminate in finitely many steps, but is only able to converge to a correct grammar in
the limit.
Gold’s algorithm for learning regular languages from both positive and negative
examples finds the correct automaton when a characteristic sample is included in
the data.
The problem of learning the minimum-state DFA that is consistent with a given
sample has been actively studied for over two decades. Many algorithms have
been developed: RPNI (Regular Inference from Positive and Negative Data),
ALERGIA, MDI (Minimum Divergence Inference), DDSM (Data Driven State
Merging) and many others.
Even when there is no guarantee of identification from the available data, the existence
of the associated characteristic sets means that these algorithms converge towards
the correct solution in the limit.
Motivation
Given two sets of strings S⁺ and S⁻, how can we decide whether or not they contain a
characteristic sample for a given algorithm? How do we decide which algorithm to
apply? How many consistent DFAs can we find? Which is the best search
strategy: exhaustive search, beam search, greedy search, etc.?
The importance of learning regular languages (or, equivalently, of identifying the
corresponding DFA) is justified by the fact that algorithms treating the inference
problem for DFAs can be nicely adapted to larger classes of grammars, for
instance: even linear grammars (Takada 88 & 94; Sempere & García 94; Mäkinen
96), subsequential functions (Oncina, García & Vidal 93), tree automata (Knuutila)
or context-free grammars from skeletons (Sakakibara 90).
The problem of exactly learning the target DFA from an arbitrary set of labeled
examples and the problem of approximating the target DFA from labeled examples
are both known to be hard. Thus the question of whether DFAs are
efficiently learnable under some restricted but fairly general and practically useful
classes of distributions is clearly of interest.
Preliminaries
We will assume that the target DFA being learned is a canonical DFA.

Let S⁺ and S⁻ denote sets of positive and negative examples of A, respectively. A
is consistent with a sample S = S⁺ ∪ S⁻ if it accepts all positive examples and
rejects all negative examples.
A set is said to be structurally complete with respect to a DFA A if it covers each
transition of A and uses each final state of A.
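As a quick illustration, here is a minimal Python sketch of this check (the representation of A as a transition dictionary is an assumption of the sketch, not from the slides; it assumes every sample string has a run in A):

def structurally_complete(sample, delta, q0, finals):
    # A sample is structurally complete w.r.t. A if, running A on its strings,
    # every transition of A is exercised and every final state is used.
    used_transitions, end_states = set(), set()
    for w in sample:
        q = q0
        for a in w:
            used_transitions.add((q, a))
            q = delta[(q, a)]
        end_states.add(q)
    return used_transitions == set(delta) and end_states == set(finals)

For the automaton M used later (two states p, q with p →0 q and q →1 p, p accepting), structurally_complete({"01"}, {("p", "0"): "q", ("q", "1"): "p"}, "p", {"p"}) returns True.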
Given a set S⁺, let PTA(S⁺) denote the prefix tree automaton for S⁺. PTA(S⁺)
is a DFA that contains a path from the start state to an accepting state for each
string in S⁺, modulo common prefixes.
Ex: S⁺ = {00, 1, 010, 011, 100}
The states of the PTA(S⁺) are labeled based on the standard order of the set Pr(S⁺).
[Figure: PTA(S⁺), a tree with root λ and states 0, 1, 00, 01, 10, 010, 011, 100; the states 00, 1, 010, 011 and 100 are accepting.]
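A minimal Python sketch of the PTA construction (the dictionary-based DFA representation and the function name are assumptions of this sketch):

def build_pta(positive):
    # Return (states, delta, initial, finals) for PTA(S+): states are the
    # prefixes of S+, transitions follow the tree, and exactly S+ accepts.
    states = {""}                               # lambda, the start state
    delta = {}                                  # (state, symbol) -> state
    for w in positive:
        for i in range(len(w)):
            states.add(w[:i + 1])
            delta[(w[:i], w[i])] = w[:i + 1]
    return states, delta, "", set(positive)

states, delta, q0, finals = build_pta({"00", "1", "010", "011", "100"})
print(sorted(states, key=lambda s: (len(s), s)))
# ['', '0', '1', '00', '01', '10', '010', '011', '100'] -- standard order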
Preliminaries
Given a DFA A = (Q, Σ, δ, q₀, F), π = {B₁, …, Bₙ} is a partition of Q iff
1. each Bᵢ is nonempty,
2. ∀i ≠ j, Bᵢ ∩ Bⱼ = ∅,
3. ∪ᵢ Bᵢ = Q.
Ex: The DFA is A = ({p, q, r}, {0, 1}, δ, p, {r}):
[Figure: p →0 q →1 r, with r accepting.]
The partitions of Q are:
π₁ = {{p, q, r}}
π₂ = {{p, q}, {r}}
π₃ = {{p}, {q, r}}
π₄ = {{p, r}, {q}}
π₅ = {{p}, {q}, {r}}
The lattice of partitions is:
[Figure: Hasse diagram of the lattice: π₁ at the top; π₂, π₃, π₄ in the middle; π₅ at the bottom. There is an edge from πᵢ down to πⱼ iff πᵢ covers πⱼ, i.e. πⱼ ≤ πᵢ with nothing strictly between them.]
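The five partitions above can be reproduced with a short recursive generator (a hedged sketch; the function name is ours):

def partitions(elems):
    # Yield every partition of a list of states; there are Bell(n) of them.
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for smaller in partitions(rest):
        for i, block in enumerate(smaller):     # put `first` into each block...
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller               # ...or into a block of its own

for p in partitions(["p", "q", "r"]):
    print(p)        # the five partitions pi_1 ... pi_5, in some order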
Preliminaries
Given a DFA A and a partition π on the set of states Q of A, we define the quotient
automaton A/π, obtained by merging the states of A that belong to the same block of
the partition π.
Note that a quotient automaton of a DFA might be an NFA, and vice-versa.
Ex: Given M:
[Figure: the two-state automaton M, with a 0-transition from the initial (accepting) state to the second state and a 1-transition back.]
A structurally complete set for M is S⁺ = {01}.
A = PTA(S⁺):
[Figure: p →0 q →1 r, with r accepting.]
If π₄ = {{p, r}, {q}}, then A/π₄ is:
[Figure: block {p, r} →0 {q} and {q} →1 {p, r}; the block {p, r} is initial and accepting.]
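A sketch of the quotient construction in Python (same assumed representation as build_pta above): transitions are collected per block as sets of successor blocks, so nondeterminism is detected whenever two merged states disagree on a symbol.

def quotient(delta, q0, finals, partition):
    # Merge states block-wise; the result is an NFA if any transition
    # of the quotient has more than one successor block.
    block_of = {s: frozenset(b) for b in partition for s in b}
    q_delta = {}
    for (s, a), t in delta.items():
        q_delta.setdefault((block_of[s], a), set()).add(block_of[t])
    is_dfa = all(len(succ) == 1 for succ in q_delta.values())
    return q_delta, block_of[q0], {block_of[s] for s in finals}, is_dfa

# A = PTA({"01"}) has states p = "", q = "0", r = "01" in the slide's notation.
_, delta, q0, finals = build_pta({"01"})
qd, b0, bf, det = quotient(delta, q0, finals, [{"", "01"}, {"0"}])   # pi_4
print(det)   # True: A/pi_4 is deterministic and accepts (01)*, i.e. L(M)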
Preliminaries
Search space comprising the π-quotient automata of A:
[Figure: the five quotient automata. A/π₁ is the single accepting state {p,q,r} with a 0,1 self-loop; A/π₂ has states {p,q} (with a 0 self-loop) and {r}; A/π₃ has states {p} and {q,r}; A/π₄ is the two-state cycle {p,r} →0 {q} →1 {p,r}; A/π₅ = A.]
Preliminaries
The set of all derived automata obtained by systematically merging the states of A
forms a lattice of finite-state automata.
Given a canonical DFA M and a set S⁺ that is structurally complete with respect to
M, the lattice derived from PTA(S⁺) is guaranteed to contain M (Pao & Carr,
1978; Parekh & Honavar, 1993; Dupont et al., 1994).
Pr(α) – the set of prefixes of α
Pr(L) = ∪_{α∈L} Pr(α) – the set of prefixes of L
α⁻¹L = {γ : αγ ∈ L} – the set of tails of α in L
The standard order on strings over the alphabet Σ is denoted by <. The standard
enumeration of strings over Σ = {a, b} is λ, a, b, aa, ab, ba, bb, …
Sp(L) = {α ∈ Pr(L) : there is no β ∈ Σ* such that β⁻¹L = α⁻¹L and β < α} – the short prefixes of L
N(L) = {λ} ∪ {αa : α ∈ Sp(L), a ∈ Σ, αa ∈ Pr(L)} – the kernel of L
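Since the short prefixes are exactly the first strings (in standard order) to reach each state of the canonical DFA, both sets can be computed by a breadth-first traversal. A hedged sketch, assuming a trimmed canonical DFA given as a transition dictionary (so that every defined run stays inside Pr(L)):

from collections import deque

def short_prefixes(delta, q0, alphabet):
    # BFS enumerates strings in standard order, so the first string reaching
    # each state is that state's short prefix.
    sp, seen, queue = {}, {q0}, deque([("", q0)])
    while queue:
        w, s = queue.popleft()
        sp[s] = w
        for a in sorted(alphabet):
            t = delta.get((s, a))
            if t is not None and t not in seen:
                seen.add(t)
                queue.append((w + a, t))
    return sp                                   # state -> its short prefix

def kernel(delta, q0, alphabet):
    ker = {""}                                  # lambda is always in N(L)
    for s, w in short_prefixes(delta, q0, alphabet).items():
        for a in sorted(alphabet):
            if (s, a) in delta:                 # wa is still a prefix of L
                ker.add(w + a)
    return ker

# Canonical DFA of the later running example: binary value = 2 (mod 3).
delta3 = {(s, b): (2 * s + int(b)) % 3 for s in (0, 1, 2) for b in "01"}
print(short_prefixes(delta3, 0, "01"))   # {0: '', 1: '1', 2: '10'}
print(sorted(kernel(delta3, 0, "01")))   # lambda, 0, 1, 10, 100, 101, 11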
Preliminaries
Definition: A sample S = S⁺ ∪ S⁻ is said to be characteristic with respect to a
regular language L (with the canonical DFA A) if it satisfies the following two
conditions:
1. ∀α ∈ N(L): if α ∈ L then α ∈ S⁺, else ∃γ ∈ Σ* such that αγ ∈ S⁺;
2. ∀α ∈ Sp(L), ∀β ∈ N(L): if α⁻¹L ≠ β⁻¹L then ∃γ ∈ Σ* such that
(αγ ∈ S⁺ and βγ ∈ S⁻) or (αγ ∈ S⁻ and βγ ∈ S⁺).
Intuitively, condition 1 implies structural completeness with respect to A, and
condition 2 implies that for any two distinct states of A there is a suffix γ in the
sample that correctly distinguishes them.
Notice that:
- if you add more strings to a characteristic sample, it remains characteristic;
- there can be many different characteristic samples.
RPNI
The regular positive and negative inference (RPNI) algorithm [Oncina & García,
1992] is a polynomial-time algorithm for identifying a DFA consistent with a given
sample S = S⁺ ∪ S⁻. It can be shown that, given a characteristic sample for the
target DFA, the algorithm is guaranteed to return a canonical representation of the
target DFA [Oncina & García, 1992; Dupont, 1996].
A := PTA(S⁺);
K := {q₀}; Fr := {δ(q₀, a) : a ∈ Σ};
while Fr ≠ ∅ do
    choose q from Fr
    if ∃p ∈ K : L(dmerge(A, p, q)) ∩ S⁻ = ∅
        then A := dmerge(A, p, q)
        else K := K ∪ {q}
    Fr := {δ(q, a) : q ∈ K, a ∈ Σ} \ K
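The following Python sketch implements this loop on top of the build_pta representation used earlier. The fold function plays the role of dmerge: it merges a frontier state q (still the root of an untouched PTA subtree) into a state p and recursively re-merges successors so the automaton stays deterministic. Names and representation are ours, not the original authors'; each merge is trialled on a copy and kept only if no negative string becomes accepted.

def fold(delta, finals, p, q, alphabet):
    # dmerge: merge q into p, folding q's subtree to preserve determinism.
    if q in finals:
        finals.add(p)
    for a in alphabet:
        if (q, a) in delta:
            if (p, a) in delta:
                fold(delta, finals, delta[(p, a)], delta[(q, a)], alphabet)
            else:
                delta[(p, a)] = delta[(q, a)]

def accepts(delta, finals, q0, w):
    s = q0
    for a in w:
        s = delta.get((s, a))
        if s is None:
            return False
    return s in finals

def rpni(s_plus, s_minus, alphabet=("0", "1")):
    _, delta, q0, finals = build_pta(s_plus)
    std = lambda s: (len(s), s)                    # standard order
    K = [q0]                                       # accepted ("red") states
    Fr = sorted({t for (s, a), t in delta.items() if s == q0}, key=std)
    while Fr:
        q = Fr.pop(0)
        for p in K:
            d, f = dict(delta), set(finals)        # trial merge on a copy
            for (s, a), t in delta.items():        # redirect the edge into q
                if t == q:
                    d[(s, a)] = p
            fold(d, f, p, q, alphabet)             # dmerge(A, p, q)
            if not any(accepts(d, f, q0, w) for w in s_minus):
                delta, finals = d, f               # consistent: keep the merge
                break
        else:
            K.append(q)                            # no merge works: promote q
        Fr = sorted({delta[(s, a)] for s in K for a in alphabet
                     if (s, a) in delta and delta[(s, a)] not in K}, key=std)
    return K, delta, finals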
RPNI
Ex: Suppose our language L is the set of all words w ∈ {0,1}* whose value, read as
a binary number, is congruent to 2 (mod 3).
A canonical automaton for this language is:
[Figure: a three-state DFA over {0,1} whose states 0, 1, 2 record the value read so far modulo 3; on reading bit b, state v moves to 2v + b (mod 3); state 2 is accepting.]
It can be easily verified that S = S⁺ ∪ S⁻ is a characteristic sample, where
S⁺ = {0101, 10100, 1110}
S⁻ = {0, 1, 1001}
The PTA(S⁺) is:
[Figure: the prefix tree acceptor with states λ, 0, 1, 01, 10, 11, 010, 101, 111, 0101, 1010, 1110, 10100; the states 0101, 10100 and 1110 are accepting.]
RPNI
S⁺ = {0101, 10100, 1110}, S⁻ = {0, 1, 1001}
[The following slides trace RPNI on this sample; each slide shows the current automaton, the accepted states K, the frontier Fr, and the outcome of the attempted merge. The recoverable steps are:]
[Figure: PTA(S⁺) with K = {λ} and Fr = {0, 1}.]
[Figure: state 0 is merged with λ; folding the subtree of 0 produces the blocks {λ,0}, {1,01}, {10,010}, {101,0101}. No string of S⁻ is accepted, so the merge is kept.]
[Figure: trial merge of {1,01} with {λ,0}; the fold collapses the automaton into the blocks {λ,0,1,01,10,010,101,0101} and {1010,10100,11,111,1110}. Rejected: 0 ∈ L(A) but 0 ∈ S⁻.]
[Figure: trial merge of {10,010} with {λ,0}, producing the accepting block {λ,0,10,010,1010,10100}. Rejected: 0 ∈ L(A) but 0 ∈ S⁻.]
[Figure: trial merge of {10,010} with {1,01}, producing the blocks {1,01,10,010} and {11,101,0101}. Rejected: 1001 ∈ L(A) but 1001 ∈ S⁻; the state 10 is therefore promoted to K.]
[Figure: trial merge of {101,0101} with {λ,0}, producing the accepting block {λ,0,101,0101}. Rejected: 0 ∈ L(A) but 0 ∈ S⁻.]
[Figure: trial merge of {101,0101} with {1,01}, producing the accepting block {1,01,101,0101}. Rejected: 1 ∈ L(A) but 1 ∈ S⁻.]
[Figure: {101,0101} is merged with {10,010}, producing the accepting block {10,010,101,0101}. L(A) ∩ S⁻ = ∅, so the merge is kept.]
[Figure: trial merge of {1010} with {λ,0}, producing the accepting block {λ,0,1010,10100}. Rejected: 0 ∈ L(A) but 0 ∈ S⁻.]
[Figure: {1010} is merged with {1,01}, producing the block {1,01,1010} and the accepting block {10,010,101,0101,10100}. L(A) ∩ S⁻ = ∅, so the merge is kept.]
[Figure: {11} is merged with {λ,0}; the fold produces the blocks {λ,0,11}, {1,01,1010,111} and the accepting block {10,010,101,0101,10100,1110}. L(A) ∩ S⁻ = ∅, so the merge is kept.]
[Figure: Fr is now empty and the algorithm stops. The final hypothesis has the three states {λ,0,11}, {1,01,1010,111} and {10,010,101,0101,10100,1110} (accepting), i.e. the residue classes 0, 1 and 2 (mod 3) of the canonical automaton.]
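Running the rpni sketch from earlier on this sample reproduces the result (the order in which merges are attempted may differ from the slides, but the hypothesis converges to the same canonical three-state machine):

K, delta, finals = rpni({"0101", "10100", "1110"}, {"0", "1", "1001"})
print(K)                                    # ['', '1', '10']
print({(s, a): t for (s, a), t in delta.items() if s in K and t in K})
# the six transitions of the canonical mod-3 DFA, e.g. ('1', '0') -> '10'
print([s for s in finals if s in K])        # ['10'], the class of residue 2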
RPNI
The convergence of the RPNI algorithm relies on the fact that, sooner or later, the
set of labeled examples seen by the learner will include a characteristic set.
If the stream of examples provided to the learner is drawn according to a simple
distribution, the characteristic set will be made available relatively early (during
learning) with sufficiently high probability, and hence the algorithm will
converge quickly to the desired target.
RPNI is an optimistic algorithm: at each step two states are compared, and the
question is: can they be merged? No positive evidence can be produced; the merge
takes place whenever it does not produce an inconsistency.
Obviously, an early mistake can have disastrous effects, and Lang showed that a
breadth-first exploration of the lattice is likely to do better.
Further Research
o The known RPNI complexity bound is not tight. Find the exact complexity.
o Are DFAs PAC-identifiable if examples are drawn from the uniform
distribution, or from some other known simple distribution?
o The study of data-independent algorithms (which do not use the state
merging strategy).
o The development of software that would facilitate the merging of states
in any given algorithm (any merging strategy).
Bibliography
• Colin de la Higuera, José Oncina, Enrique Vidal. “Identification of DFA:
Data-Dependent versus Data-Independent Algorithms”. Lecture Notes in
Artificial Intelligence 1147, Grammatical Inference: Learning Syntax from
Sentences, pp. 313-325.
• Rajesh Parekh, Vasant Honavar. “Learning DFA from Simple Examples”.
Lecture Notes in Artificial Intelligence 1316, Algorithmic Learning Theory,
pp. 116-131.
• Satoshi Kobayashi. Lecture notes for the 3rd International PhD School on
Formal Languages and Applications, Tarragona, Spain.
• Colin de la Higuera. Lecture notes for the 3rd International PhD School on
Formal Languages and Applications, Tarragona, Spain.
• Michael J. Kearns, Umesh V. Vazirani. “An Introduction to Computational
Learning Theory”.