
CS 360: Programming Languages
Lecture 12: Finite Automata and Regular
Expressions
Geoffrey Mainland
Drexel University
Tuesday, February 17, 2015
Section 1
Administrivia
Administrivia
- The final exam has been moved! It is now on Wednesday,
  March 18 from 8am–10am in MacAlister Hall 4014.
- Homework 6, due February 23, is posted.
- For Homework 6, you will implement conversion of regular
  expressions to NFAs, conversion of NFAs to graphical form,
  NFA matching, and compilation of regular expressions to C.
- Not as bad as it sounds: most of the code is written for you.
Section 2
Implementing Sets in Haskell
Implementing a Set data type
- Our investigation of automata and regular expressions will be
  intertwined with implementations in Haskell.
- First, we will implement a simple data structure that will be
  useful for our investigation: sets.
- Question: what options are there for representing sets?
- We will represent sets as sorted lists with no duplicates.
- Question: can we write a Haskell type that expresses the
  invariant that a list is sorted?
- Let’s get to work...
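The sorted-list representation can be sketched as follows. This is only one possible version, not necessarily the module used in class: the function names are chosen to match the ones that appear in later slides (fromList, toList, union, intersection, singleton, empty, isEmpty), and member is an extra helper added for illustration. The newtype does not express the sortedness invariant in the type; the invariant holds only because every function below preserves it.

```haskell
-- A set represented as a strictly increasing list (sorted, no duplicates).
-- The invariant is maintained by the functions below, not by the type.
newtype Set a = Set [a]
  deriving (Eq, Ord, Show)

empty :: Set a
empty = Set []

singleton :: a -> Set a
singleton x = Set [x]

toList :: Set a -> [a]
toList (Set xs) = xs

isEmpty :: Set a -> Bool
isEmpty (Set xs) = null xs

-- Because the list is sorted, we can stop looking once we pass x.
member :: Ord a => a -> Set a -> Bool
member x (Set xs) = x `elem` takeWhile (<= x) xs

-- Merge two sorted lists, keeping one copy of elements found in both.
union :: Ord a => Set a -> Set a -> Set a
union (Set xs) (Set ys) = Set (merge xs ys)
  where
    merge [] bs = bs
    merge as' [] = as'
    merge (a:at) (b:bt)
      | a < b     = a : merge at (b:bt)
      | a > b     = b : merge (a:at) bt
      | otherwise = a : merge at bt

-- Walk both sorted lists, keeping only elements found in both.
intersection :: Ord a => Set a -> Set a -> Set a
intersection (Set xs) (Set ys) = Set (go xs ys)
  where
    go (a:at) (b:bt)
      | a < b     = go at (b:bt)
      | a > b     = go (a:at) bt
      | otherwise = a : go at bt
    go _ _ = []

-- Build a set by inserting the elements one at a time.
fromList :: Ord a => [a] -> Set a
fromList = foldr (\x s -> singleton x `union` s) empty
```

Because equal sets always have the same underlying list, the derived Eq instance agrees with set equality. Hiding the Set constructor behind a module export list would keep callers from building an unsorted list by hand.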
Section 3
Finite Automata
Finite Automata
- A finite automaton is a model of computation (weaker than
  Turing Machines) that is often used to design hardware circuits
  and can also describe some computer programs. In particular,
  finite automata can be used to describe regular expression
  matching.
- A finite automaton has a finite number of states. One of these
  states is the start state, and some subset of the automaton’s
  states is the set of accepting states.
- An automaton operates on a finite, non-empty alphabet, Σ, of
  symbols.
- An automaton moves between states as defined by a transition
  function, δ. Given a state and a symbol from the alphabet Σ,
  the transition function gives the new state(s) of the automaton.
- Finite automata come in two varieties: deterministic finite
  automata (DFAs) and non-deterministic finite automata
  (NFAs). We will start with DFAs.
Deterministic Finite Automata: Formally
Formally, a deterministic finite automaton is a tuple (Q, Σ, δ, q0, F)
where:
- Q is a finite set of states.
- Σ is a finite, non-empty alphabet of symbols.
- δ : Q × Σ → Q is the transition function.
- q0 ∈ Q is the start state.
- F ⊆ Q is the set of accept states.
A language over an alphabet Σ is a set of strings, all of which are
chosen from Σ∗, i.e., strings containing zero or more symbols from
Σ.
DFA Example
We will build a DFA A = (Q, Σ, δ, q0, F) for the language L given by
L = {x01y | x and y are any strings of 0’s and 1’s}
- What do we know about an automaton that can accept L?
- First, its input alphabet is Σ = {0, 1}.
- It has some states Q, including a start state, say q0.
- It must remember important facts about the input it has seen
  so far. To decide whether 01 is a substring of the input, it
  must remember:
  0. Has it never seen 01, and has it either seen no input yet or
     last seen a 1? Then it cannot accept until it first sees a 0 and
     then immediately sees a 1.
  1. Has it never seen 01, but its most recent input is 0? If it sees a
     1, then it can accept everything it sees from then on.
  2. Has it already seen 01? If so, then accept every sequence of
     further input.
- We can represent the three conditions above as states, which
  we will call q0, q1, and q2, in a DFA.
DFA Example cont’d
0. Has it never seen 01, but its last input was either nonexistent or it
last saw a 1? Then it cannot accept until it first sees a 0 and then
immediately sees a 1.
1. Has it never seen 01, but its most recent input is 0? If it sees a 1,
then it can accept everything it sees from then on.
2. Has it already seen 01? If so, then accept every sequence of further
input.
- Condition (0) is surely our start state, q0. We still need to see
  01. If we see 1, we are no closer to seeing 01, so δ(q0, 1) = q0.
- If in state q0 we see a 0, we are in condition (1), so
  δ(q0, 0) = q1.
- If we are in q1 and we see a 0, we are no better or worse off
  than before, so δ(q1, 0) = q1. If we see a 1, we know we saw a
  0 followed by a 1, so we can enter q2, which is the accepting
  state. Therefore, δ(q1, 1) = q2.
- Finally, in state q2 we have already seen 01, so we can remain
  in this state regardless of what happens. Therefore,
  δ(q2, 0) = δ(q2, 1) = q2.
DFA Example cont’d
What is our DFA, A = (Q, Σ, δ, q0, F)?
- Q = {q0, q1, q2}
- Σ = {0, 1}
- F = {q2}
- A = ({q0, q1, q2}, {0, 1}, δ, q0, {q2}), where δ is the function
  we described previously.
- We can represent the DFA A graphically as follows:
[Figure: state diagram of A. Start state q0 loops on 1 and moves to q1 on 0; q1 loops on 0 and moves to q2 on 1; accepting state q2 loops on 0, 1.]
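As a quick check of the example, here is a minimal sketch that encodes this DFA directly in Haskell. The names State, delta, and accepts are chosen for illustration and are not part of the course code; the input is assumed to be a String over the alphabet {0, 1}, and any other character causes rejection.

```haskell
-- The three states of the example DFA.
data State = Q0 | Q1 | Q2
  deriving (Eq, Show)

-- The transition function from the example.
delta :: State -> Char -> State
delta Q0 '0' = Q1
delta Q0 '1' = Q0
delta Q1 '0' = Q1
delta Q1 '1' = Q2
delta Q2 _   = Q2
delta q  _   = q   -- symbols outside the alphabet; accepts rejects them first

-- Run the DFA from the start state Q0 and test whether it ends in the
-- accept state Q2. Inputs containing symbols outside {0, 1} are rejected.
accepts :: String -> Bool
accepts input = all (`elem` "01") input && foldl delta Q0 input == Q2
```

Running the automaton is just a left fold of delta over the input, which is why a DFA needs only constant memory beyond the current state.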
Nondeterministic Finite Automata
- When a DFA is in a given state and reads the next input
  symbol, we know what the next state will be: the automaton
  is deterministic.
- In a nondeterministic finite automaton (NFA), the
  automaton may be in multiple states at the same time.
- An NFA may also contain transitions between states that don’t
  require an input symbol. These are called ε transitions.
- Conceptually, when encountering an ε transition, an NFA splits
  into multiple copies of itself and follows all the possibilities in
  parallel.
- We now have an extra pseudo-symbol ε. We write Σ ∪ {ε} as
  Σ_ε.
Nondeterministic Finite Automata: Formally
Formally, a nondeterministic finite automaton is a tuple
(Q, Σ, δ, q0, F) where:
- Q is a finite set of states.
- Σ is a finite, non-empty alphabet of symbols, which does not
  contain the symbol ε.
- δ : Q × Σ_ε → P(Q) is the transition function, where Σ_ε = Σ ∪ {ε}.
- q0 ∈ Q is the start state.
- F ⊆ Q is the set of accept states.
Note that:
- The difference between a DFA and an NFA is the transition
  function δ. For a deterministic finite automaton, we had
  δ : Q × Σ → Q.
- P(Q), the power set of Q, is the set of all subsets of
  Q, including the empty set ∅ and the set Q itself.
- Question: if Q contains n elements, how many elements are in
  P(Q)? Hint: use induction.
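The inductive structure behind the power set can be sketched in a few lines of Haskell. The function powerset below is a hypothetical helper, not part of the course code, and it represents sets as lists of distinct elements for brevity.

```haskell
-- All subsets of a list of distinct elements.
-- Inductive step: every subset of (x:xs) either omits x, in which case
-- it is a subset of xs, or contains x, in which case it is x added to
-- a subset of xs.
powerset :: [a] -> [[a]]
powerset []     = [[]]
powerset (x:xs) = rest ++ map (x:) rest
  where
    rest = powerset xs
```

Each step of the recursion doubles the number of subsets, which is the observation the induction hint is pointing at.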
NFA: Example
- The NFA below recognizes the language of strings taken from
  Σ = {0, 1} that end in 01.
- Notice the ε transition. Question: could we have defined an
  equivalent NFA without this transition? If so, what would it
  look like?
- Even if we didn’t have the ε transition, the automaton would
  be in multiple states after consuming the symbol 0.
[Figure: NFA with states q0 to q3. Start state q0 loops on 0, 1; q0 → q1 on ε; q1 → q2 on 0; q2 → q3 on 1; q3 is accepting.]
NFAs and DFAs
- We built an NFA that recognizes the language of strings taken
  from Σ = {0, 1} that end in 01. Question: could we construct
  a deterministic finite automaton that recognizes the same
  language?
- For this particular language, we can construct a DFA
  recognizer.
- Question: in general, given an NFA, can we construct a DFA
  that recognizes the same language?
[Figure: two state diagrams, the four-state NFA (states q0 to q3) and a three-state DFA (states q0, q1, q2) that recognizes the same language.]
From NFA to DFA
Goal: Given an NFA N = (Q, Σ, δ, q0, F), we want to construct a
DFA M = (Q′, Σ′, δ′, q0′, F′) that recognizes the same language.
- Recall that the transition function for a DFA has the form
  δ : Q × Σ → Q, whereas the transition function for an NFA has
  the form δ : Q × Σ_ε → P(Q), where Σ_ε = Σ ∪ {ε}.
- This should suggest to you an obvious candidate for Q′. What
  is it?
Transforming (ε-free) NFAs to DFAs
Goal: Given an NFA N = (Q, Σ, δ, q0, F) with no ε transitions, we
can construct a DFA M = (Q′, Σ, δ′, q0′, F′) that recognizes the
same language as follows.
1. Q′ = P(Q). Every state of M is a set of states of N.
2. For R ∈ Q′ and a ∈ Σ, let
   δ′(R, a) = {q ∈ Q | q ∈ δ(r, a) for some r ∈ R}
3. q0′ = {q0}. M starts in the state corresponding to the
   collection of states containing just the start state of N.
4. F′ = {R ∈ Q′ | r ∈ F for some r ∈ R}. The machine M accepts if
   one of the possible states that N could be in at this point is an
   accept state.
To handle ε transitions, we define the function E(R) as follows:
E(R) = {q | q ∈ δ(r, ε) for some r ∈ R}
We write E∗ for the reflexive transitive closure of E, so E∗(R) is
the set of all states reachable from R in zero or more “hops” along
an ε transition.
Transforming NFAs to DFAs
Goal: Given an NFA N = (Q, Σ, δ, q0, F), we can construct a DFA
M = (Q′, Σ, δ′, q0′, F′) that recognizes the same language as follows.
1. Q′ = P(Q). Every state of M is a set of states of N.
2. For R ∈ Q′ and a ∈ Σ, let
   δ′(R, a) = {q ∈ Q | q ∈ E∗(δ(r, a)) for some r ∈ R}
3. q0′ = E∗({q0}).
4. F′ = {R ∈ Q′ | r ∈ F for some r ∈ R}.
Simulating an NFA
Instead of converting an NFA to a DFA, we can write a program to
simulate an NFA by keeping track of sets of states. To simulate an
NFA (Q, Σ, δ, q0, F):
1. Start with the initial set of states S = E∗({q0}).
2. Given a symbol a ∈ Σ, update S to
   S′ = {q′ | q′ ∈ E∗(δ(q, a)) for some q ∈ S}.
3. If S ∩ F is non-empty, then we are in an accepting NFA state.
Section 4
Implementing Finite Automata in Haskell
A Haskell Implementation of Finite Automata
- How can we represent (Q, Σ, δ, q0, F) in Haskell?
- We will only worry about transitions on Chars, but we want to
  be polymorphic in the type of the state (why?).
- We won’t explicitly include the alphabet, Σ.

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show, Read)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show, Read)

- We can still find the alphabet.

-- nub (from Data.List) removes duplicate characters
alphabet :: Nfa q -> [Char]
alphabet (Nfa _ moves _ _) =
    nub [c | Move _ c _ <- toList moves]

- Notice that alphabet uses a list comprehension. You will
  find list comprehensions very useful in the homework. Read
  about them in LYAH.
A Haskell Implementation of Deterministic Finite Automata
- What about DFAs?
- A DFA is also an NFA...

type Dfa q = Nfa q

- Dfa is a type alias for Nfa. This is a lot like a typedef in C.
- The type Dfa doesn’t make this explicit, but there are
  additional constraints on a value of type Nfa for it to be a valid
  deterministic finite automaton. Question: what are these
  constraints?
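One way to make the constraints concrete is a checker like the sketch below. The function isDeterministic is hypothetical and not part of the course code, and Data.Set from the containers package stands in for the course's own Set module. The sketch assumes that a missing transition is allowed and means "reject"; under the formal definition, where δ is a total function, every state would additionally need a move for every symbol.

```haskell
import Data.List (nub)
import qualified Data.Set as Set
import Data.Set (Set)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show)

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show)

-- True when the automaton has no epsilon moves and no state has two
-- different target states for the same character.
isDeterministic :: Eq q => Nfa q -> Bool
isDeterministic (Nfa _ moves _ _) = noEpsilon && singleTarget
  where
    ms = Set.toList moves
    -- No Emove may appear at all.
    noEpsilon = null [() | Emove _ _ <- ms]
    -- The moves form a set, so two moves with the same (state, char)
    -- key must differ in their target state.
    keys = [(q, c) | Move q c _ <- ms]
    singleTarget = length keys == length (nub keys)
```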
ε-Closure and Fixpoints
- To calculate the ε-closure of a set of states Q, we first need to
  find the states reachable from Q in one step via an ε transition.
- We then need to add these states to Q and repeat the process
  until we stop adding states.
- It is useful to think of this in terms of a fixpoint. A fixpoint of
  a function f is an x ∈ dom(f) such that f(x) = x.
- If the function E(Q) finds the states reachable from the set of
  states Q via a single ε transition, then E∗(Q) is a fixpoint of
  the function that maps a set of states S to S ∪ E(S).
- We can write a generic Haskell function to find a fixpoint of a
  function, given an argument, as follows:

fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x | x == x'   = x
             | otherwise = fixpoint f x'
  where
    x' = f x
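To see fixpoint in action outside the automata setting, here is a small sketch with two made-up functions, capAtTen and halve, which exist only for illustration. Note that fixpoint terminates only if repeated application eventually stops changing the value.

```haskell
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x
  | x == x'   = x
  | otherwise = fixpoint f x'
  where
    x' = f x

-- Hypothetical example: count up by one, but never past 10.
-- Its only fixpoint is 10.
capAtTen :: Int -> Int
capAtTen n = min 10 (n + 1)

-- Hypothetical example: integer halving. Starting from a non-negative
-- number, repeated halving reaches the fixpoint 0.
halve :: Int -> Int
halve n = n `div` 2
```

For instance, fixpoint capAtTen 0 climbs through 1, 2, 3 and so on and stops at 10, because capAtTen 10 is again 10.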
ε-Closure in Haskell
Given an NFA and a set of states Q, epsilonAccessible returns
the set of states that are accessible from Q via (single) ε transitions.

epsilonAccessible :: Ord q => Nfa q -> Set q -> Set q
epsilonAccessible (Nfa _ moves _ _) qs =
    fromList accessible
  where
    accessible = [r | q <- toList qs,
                      Emove q' r <- toList moves,
                      q == q']

Given an NFA and a set of states Q, epsilonClosure returns the
set of states that are accessible from Q via zero or more ε transitions.

epsilonClosure :: Ord q => Nfa q -> Set q -> Set q
epsilonClosure nfa qs0 =
    fixpoint addEpsilonAccessible qs0
  where
    addEpsilonAccessible qs = qs `union` epsilonAccessible nfa qs
NFA Transitions in Haskell
Given an NFA, a set of states Q, and a symbol a, how can we
compute the set of states reachable by a single transition on a?
Recall how we computed states reachable by a single ε transition.

epsilonAccessible :: Ord q => Nfa q -> Set q -> Set q
epsilonAccessible (Nfa _ moves _ _) qs =
    fromList accessible
  where
    accessible = [r | q <- toList qs,
                      Emove q' r <- toList moves,
                      q == q']

We have to look for moves of the form Move q c r instead of
Emove q r.

onemove :: Ord q => Nfa q -> Set q -> Char -> Set q
onemove (Nfa _ moves _ _) qs c =
    fromList [s | q <- toList qs,
                  Move r c' s <- toList moves,
                  r == q,
                  c == c']
NFA Transitions in Haskell
To compute the set of states reachable from Q via a transition on a,
we need to take the ε-closure of the result of onemove.

onetrans :: Ord q => Nfa q -> Set q -> Char -> Set q
onetrans nfa q c = epsilonClosure nfa (onemove nfa q c)
Section 5
Regular Languages and Regular Expressions
Regular Languages
- A regular language is a language that can be recognized by a
  finite automaton.
- Here is an example of a non-regular language:
  L = {a^n b^n | n is a positive integer}
- Regular expressions are another, equivalent way to specify
  regular languages. They are very much like the “regular
  expressions” you are familiar with from tools like grep, awk,
  perl, and vi.
- UNIX “regular expressions” actually include features that allow
  them to recognize non-regular languages. We will not consider
  these features.
Regular Expressions: Examples
- The regular expression a matches the string a.
- The regular expression ab matches the string ab. That is, ab
  matches a followed by b.
- The regular expression a|b matches the string a and the string
  b, but not the string ab. That is, a|b matches a or b.
- The regular expression a* matches the empty string, the string
  a, and the string aa, among others. That is, a* matches zero
  or more occurrences of a.
Regular Expressions: Formally
A regular expression denotes a set of strings and is built from the
following:
- The empty regexp, ∅, that matches nothing (the empty set).
- The regexp ε that matches the empty string, which has no
  characters (this is distinct from the empty set!).
- The literal regexp that matches a single literal from the
  alphabet.
- The concatenation of two regexps, RS, which denotes the set
  of strings obtained by concatenating a string in R with a string
  in S.
- The alternation of two regexps, R|S, which denotes the set of
  strings obtained by taking the union of R and S.
- Kleene star, R∗, which denotes the smallest superset of the set
  R that contains the empty string and is closed under
  concatenation. This is the set of all strings that can be formed
  by concatenating a finite number of elements (zero or more) of R.
Some “Extensions” to Regular Expressions
- You may have seen regular expressions like a?, a+, and (less
  commonly) a{m,n}.
- a? means match 0 or 1 occurrences of a, a+ means match 1 or
  more occurrences of a, and a{m,n}, where m and n are
  integers, means match at least m and not more than n
  occurrences of a.
- We can rewrite all expressions in the above forms into
  equivalent regular expressions that use only the operations on
  the previous slide.
- For a regular expression R, how can we rewrite the following?
  R?, R+, and R{m,n}?
R? ≡ ε|R
R+ ≡ RR∗
R{m,n} ≡ RR···R (ε|R)(ε|R)···(ε|R), with m copies of R followed by n − m copies of (ε|R)
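These rewrites can be sketched as functions over the RegExp data type used in this lecture. The names opt, plus, and range are made up for illustration, and range assumes 0 ≤ m ≤ n. The trailing Epsilon produced by the fold is harmless, since concatenating with ε does not change the language.

```haskell
data RegExp = Empty
            | Epsilon
            | Lit Char
            | Cat RegExp RegExp
            | Alt RegExp RegExp
            | Star RegExp
  deriving (Eq, Ord, Show, Read)

-- R? is rewritten to ε|R.
opt :: RegExp -> RegExp
opt r = Alt Epsilon r

-- R+ is rewritten to RR*.
plus :: RegExp -> RegExp
plus r = Cat r (Star r)

-- R{m,n} is rewritten to m copies of R followed by n - m copies of
-- (ε|R), all concatenated. Assumes 0 <= m <= n.
range :: Int -> Int -> RegExp -> RegExp
range m n r = foldr Cat Epsilon (replicate m r ++ replicate (n - m) (opt r))
```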
Representing Regular Expressions in Haskell
- We can write a Haskell data type for regular expressions (over
  Chars) as given below.
- Writing a regular expression matcher in Haskell is relatively
  straightforward (there are two tricky cases).
- What is the type of our matching function, which we will call
  regExpMatch?
- We should be able to knock off the empty, ε, literal, and
  alternation cases easily, so let’s do that.

data RegExp = Empty
            | Epsilon
            | Lit Char
            | Cat RegExp RegExp
            | Alt RegExp RegExp
            | Star RegExp
  deriving (Eq, Ord, Show, Read)
Matching Regular Expressions: Concatenation
- Concatenation is more difficult. We know that a string that
  matches the regular expression RS can be split into two parts,
  one part that matches R and one part that matches S, but
  where should we split the string?
- Our solution: try all possible ways of splitting the string.

regExpMatch (Cat r s) cs =
    or [regExpMatch r cs1 && regExpMatch s cs2 | (cs1, cs2) <- splits cs]

splits :: [a] -> [([a], [a])]
splits st = [splitAt n st | n <- [0 .. length st]]

or :: [Bool] -> Bool
or []     = False
or (x:xs) = x || or xs
Matching Regular Expressions: Kleene Star
- Kleene star is also tricky. R∗ can match the empty string, or it
  can match R followed by zero or more additional matches of R.
  But where does the first match of R end and the match of the
  other R’s begin?
- Again, we will try possible ways of splitting the string.
- The catch this time is that if we don’t match the empty string,
  we need to match something.
- Why do we need to make sure the match is non-empty?
  Consider matching the pattern (a∗)∗ against the string b.

regExpMatch (Star r) cs =
    regExpMatch Epsilon cs ||
    or [regExpMatch r cs1 && regExpMatch (Star r) cs2 |
        (cs1, cs2) <- frontSplits cs]

frontSplits :: [a] -> [([a], [a])]
frontSplits st = [splitAt n st | n <- [1 .. length st]]
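Putting the cases together gives a complete, if slow, matcher. The sketch below assumes the type RegExp -> String -> Bool for regExpMatch and fills in the four easy cases in the obvious way; those cases and the type are assumptions rather than the version written in class. It uses the Prelude's or instead of redefining it.

```haskell
data RegExp = Empty
            | Epsilon
            | Lit Char
            | Cat RegExp RegExp
            | Alt RegExp RegExp
            | Star RegExp
  deriving (Eq, Ord, Show, Read)

regExpMatch :: RegExp -> String -> Bool
regExpMatch Empty     _  = False                 -- matches nothing
regExpMatch Epsilon   cs = null cs               -- matches only ""
regExpMatch (Lit c)   cs = cs == [c]             -- matches exactly one char
regExpMatch (Alt r s) cs = regExpMatch r cs || regExpMatch s cs
regExpMatch (Cat r s) cs =
  or [regExpMatch r cs1 && regExpMatch s cs2 | (cs1, cs2) <- splits cs]
regExpMatch (Star r)  cs =
  null cs ||
  or [regExpMatch r cs1 && regExpMatch (Star r) cs2
     | (cs1, cs2) <- frontSplits cs]

-- Every way of cutting a list into a prefix and a suffix.
splits :: [a] -> [([a], [a])]
splits st = [splitAt n st | n <- [0 .. length st]]

-- Like splits, but the prefix is never empty, so the recursive call on
-- (Star r) always sees a strictly shorter string.
frontSplits :: [a] -> [([a], [a])]
frontSplits st = [splitAt n st | n <- [1 .. length st]]
```

With frontSplits, matching Star (Star (Lit 'a')) against "b" terminates and returns False; with splits in its place, the empty prefix would lead to the same call being made again on the same input.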
Efficiency of our Matcher
- Our matcher is actually quite inefficient. Why?
- Here is the performance of two regular expression
  implementations when matching the regular expression (a?)^n a^n
  against the string a^n. Note that the X axis is n, and the Y axis
  is time. Also note the different Y axis scales!
- You can read more about this topic here:
  http://swtch.com/~rsc/regexp/regexp1.html.
- Punch line: we can match a regular expression much more
  efficiently by constructing a finite automaton. But how?
Section 6
Regular Expression Matching with Finite
Automata
From Regular Expression to NFA
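Given a regular expression, we can straightforwardly construct an
NFA.
- Automata construction for ∅
  [Figure: a start state and an accept state with no transitions.]
- Automata construction for ε
  [Figure: a start state with an ε transition to the accept state.]
- Automata construction for a ∈ Σ
  [Figure: a start state with a transition on a to the accept state.]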
From Regular Expression to NFA cont’d
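- Automata construction for R|S
  [Figure: a new start state with ε transitions into the automata for R and S, whose accept states have ε transitions to a new accept state.]
- Automata construction for RS
  [Figure: the automaton for R followed by the automaton for S.]
- Automata construction for R∗
  [Figure: the automaton for R with a new start state and a new accept state, joined by ε transitions that allow R to be skipped or repeated.]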
Section 7
Matching NFAs
NFA Example
Consider the NFA corresponding to a*b.
[Figure: NFA with states q0 to q5. Start state q0; q1 → q2 on a; q4 → q5 on b; the remaining edges are ε transitions.]
- Now that we have an NFA, we need a way to run (or simulate)
  it.
- Since an NFA can be in multiple states at the same time, when
  we simulate the NFA, we will have to keep track of a set of
  states.
- What is the set of states this NFA can initially be in before
  having consumed any input?
- Which states are we in after consuming a single input, a?
- In general, given a set of states Q, we need a way to find all
  the states that are accessible from some state in Q via zero or
  more ε transitions.
Finding ε-accessible States
E(Q) = {r | r ∈ δ(q, ε) for some q ∈ Q}
- The function E tells us which states are accessible from Q via
  a single ε transition. How do we find the set of initial states?
  E({q0}) = {q1, q3}
  E({q0, q1, q3}) = {q1, q3, q4}
  E({q0, q1, q3, q4}) = {q1, q3, q4} ⊆ {q0, q1, q3, q4}
- How do we find the set of states the NFA is in after consuming
  a single input, a?
  E({q2}) = {q1, q3}
  E({q1, q2, q3}) = {q1, q3, q4}
  E({q1, q2, q3, q4}) = {q1, q3, q4} ⊆ {q1, q2, q3, q4}
Finding ε-accessible States in Haskell
E({q0}) = {q1, q3}
E({q0, q1, q3}) = {q1, q3, q4}
E({q0, q1, q3, q4}) = {q1, q3, q4} ⊆ {q0, q1, q3, q4}
Given an NFA and a set of states qs, epsilonAccessible returns
the set of states that are accessible from qs via (single) ε transitions.
It implements the function E.

epsilonAccessible :: Ord q => Nfa q -> Set q -> Set q
epsilonAccessible (Nfa _ moves _ _) qs =
    fromList [r | q <- toList qs,
                  Emove q' r <- toList moves,
                  q == q']
Finding the ε-closure
E({q0}) = {q1, q3}
E({q0, q1, q3}) = {q1, q3, q4}
E({q0, q1, q3, q4}) = {q1, q3, q4} ⊆ {q0, q1, q3, q4}
E∗({q0}) = {q0, q1, q3, q4}
- The function E∗ is the reflexive transitive closure of E.
  That is, E∗(Q) is all the states we get via zero or more
  applications of E to Q.
- Equivalently, E∗(Q) is the set of all states reachable from Q
  via zero or more ε transitions.
- We can compute E∗(Q) by applying E to the set of states we
  have, Q, combining Q with E(Q), and repeating the process.
- When the set stops growing, i.e., when E(Q) ⊆ Q, we’re done.
Fixpoints
E({q0}) = {q1, q3}
E({q0, q1, q3}) = {q1, q3, q4}
E({q0, q1, q3, q4}) = {q1, q3, q4} ⊆ {q0, q1, q3, q4}
E∗({q0}) = {q0, q1, q3, q4}
- The set {q0, q1, q3, q4} is a fixpoint of the function that
  maps a set of states Q to Q ∪ E(Q).
- Given a function f, x is a fixpoint of f if f(x) = x.
- We can write a generic Haskell function to find a fixpoint of a
  function, given an argument, as follows:

fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x | x == x'   = x
             | otherwise = fixpoint f x'
  where
    x' = f x

- The Haskell function fixpoint repeatedly applies f until the
  result stops changing, i.e., it calculates f(f(···f(x))), using as
  many copies of f as needed.
Using fixpoint to Calculate E∗
E({q0}) = {q1, q3}
E({q0, q1, q3}) = {q1, q3, q4}
E({q0, q1, q3, q4}) = {q1, q3, q4} ⊆ {q0, q1, q3, q4}
E∗({q0}) = {q0, q1, q3, q4}
- Note that when we repeatedly applied E, we didn’t calculate
  E(E(···E(Q))), but instead added E(Q) back to Q before
  calling E again.
- We can write this in Haskell as follows:

epsilonClosure :: Ord q => Nfa q -> Set q -> Set q
epsilonClosure nfa qs0 =
    fixpoint addEpsilonAccessible qs0
  where
    addEpsilonAccessible qs = qs `union` epsilonAccessible nfa qs
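The calculation above can be reproduced with a self-contained sketch. Two assumptions are made here: Data.Set from the containers package stands in for the course's Set module, and the move list of the a*b automaton (called aStarB below, a name made up for this example) is reconstructed from the values of E shown on these slides, with state qN written as the integer N.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show)

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show)

fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x
  | x == x'   = x
  | otherwise = fixpoint f x'
  where
    x' = f x

-- The function E: states reachable by exactly one epsilon move.
epsilonAccessible :: Ord q => Nfa q -> Set q -> Set q
epsilonAccessible (Nfa _ moves _ _) qs =
  Set.fromList [r | q <- Set.toList qs
                  , Emove q' r <- Set.toList moves
                  , q == q']

-- The function E*: keep adding E(Q) to Q until nothing new appears.
epsilonClosure :: Ord q => Nfa q -> Set q -> Set q
epsilonClosure nfa =
  fixpoint (\qs -> qs `Set.union` epsilonAccessible nfa qs)

-- The NFA for a*b (assumed edge list, see the note above).
aStarB :: Nfa Int
aStarB =
  Nfa (Set.fromList [0 .. 5])
      (Set.fromList [ Emove 0 1, Emove 0 3
                    , Move 1 'a' 2
                    , Emove 2 1, Emove 2 3
                    , Emove 3 4
                    , Move 4 'b' 5 ])
      0
      (Set.singleton 5)
```

With these definitions, epsilonClosure aStarB (Set.singleton 0) evaluates to the set containing 0, 1, 3, and 4, matching E∗({q0}).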
Simulating the NFA in Haskell
Last lecture we saw the functions for calculating states reachable via
a transition on a literal. How can we use what we’ve seen to
simulate an NFA?

onetrans :: Ord q => Nfa q -> Set q -> Char -> Set q
onetrans nfa q c = epsilonClosure nfa (onemove nfa q c)

onemove :: Ord q => Nfa q -> Set q -> Char -> Set q
onemove (Nfa _ moves _ _) qs c =
    fromList [s | q <- toList qs,
                  Move r c' s <- toList moves,
                  r == q,
                  c == c']

- For an NFA nfa with start state q0, the initial set of states in our
  simulation is epsilonClosure nfa (singleton q0).
- When we see a character, make a transition using onetrans
  (not onemove).
- When we have reached the end of our input, we need to test
  and see if we are in a final state. How can we do that? Assume
  that the set of final states of the NFA is given by the Haskell
  variable f and that the current set of states is given by qs.
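The three steps can be combined into one function, sketched below under some assumptions. The function accepts is a made-up name, not necessarily the one used in the homework; Data.Set from containers stands in for the course's Set module; and the example automaton aStarB uses an edge list reconstructed from the slides. The final-state test is simply whether the intersection of f and the current set of states is non-empty.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show)

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show)

-- Add epsilon-reachable states until the set stops growing.
epsilonClosure :: Ord q => Nfa q -> Set q -> Set q
epsilonClosure nfa@(Nfa _ moves _ _) qs
  | qs' == qs = qs
  | otherwise = epsilonClosure nfa qs'
  where
    qs' = qs `Set.union`
          Set.fromList [r | Emove q r <- Set.toList moves, q `Set.member` qs]

-- States reachable by one move on character c (no epsilon moves).
onemove :: Ord q => Nfa q -> Set q -> Char -> Set q
onemove (Nfa _ moves _ _) qs c =
  Set.fromList [s | Move r c' s <- Set.toList moves
                  , r `Set.member` qs, c == c']

onetrans :: Ord q => Nfa q -> Set q -> Char -> Set q
onetrans nfa qs c = epsilonClosure nfa (onemove nfa qs c)

-- Simulate the NFA: start in the closure of the start state, take one
-- onetrans step per input character, then look for an accept state.
accepts :: Ord q => Nfa q -> String -> Bool
accepts nfa@(Nfa _ _ q0 f) input =
  not (Set.null (f `Set.intersection` final))
  where
    start = epsilonClosure nfa (Set.singleton q0)
    final = foldl (onetrans nfa) start input

-- The NFA for a*b, with state qN written as N (assumed edge list).
aStarB :: Nfa Int
aStarB =
  Nfa (Set.fromList [0 .. 5])
      (Set.fromList [ Emove 0 1, Emove 0 3, Move 1 'a' 2
                    , Emove 2 1, Emove 2 3, Emove 3 4, Move 4 'b' 5 ])
      0
      (Set.singleton 5)
```

If the current set of states ever becomes empty, every later step also yields the empty set, so the input is rejected without any special case.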
Section 8
Constructing NFAs in Haskell
Implementing NFA Construction
Recall our definition of the data type for NFAs.

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show, Read)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show, Read)

We want to write a function

regExpToNfa :: RegExp -> Nfa Int

How can we handle the empty regular expression?

regExpToNfa Empty =
    Nfa (fromList [0..1])
        empty
        0
        (singleton 1)
NFA Construction for ε

regExpToNfa Epsilon =
    Nfa (fromList [0..1])
        (singleton (Emove 0 1))
        0
        (singleton 1)
Invariants: Implementing NFA Construction
Note a few invariants about the NFAs we constructed.
1. All states are consecutively numbered, starting from 0.
2. State 0 is the start state.
3. There is a single accept state, and it is the highest-numbered
state.
These invariants will make it easier to work with the NFAs we
construct.
NFA Construction for R|S
[Figure: a new start state with ε transitions into the automata for R and S, whose accept states have ε transitions to a new accept state.]

regExpToNfa (Alt r s) =
    Nfa (qs1 `union` qs2 `union` fromList [q0, qf])
        (moves1 `union` moves2 `union` fromList altMoves)
        0
        (singleton qf)
  where
    r_nfa@(Nfa qs1 moves1 start1 _) =
        numberNfaFrom 1 $ regExpToNfa r
    s_nfa@(Nfa qs2 moves2 start2 _) =
        numberNfaFrom (nfaSize r_nfa + 1) $ regExpToNfa s
    q0 = 0
    qf = nfaSize r_nfa + nfaSize s_nfa + 1
    altMoves = [ Emove q0 start1
               , Emove q0 start2
               , Emove (acceptState r_nfa) qf
               , Emove (acceptState s_nfa) qf]
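The construction above relies on three helpers, numberNfaFrom, nfaSize, and acceptState, whose definitions are not shown on the slides. The sketch below gives one plausible reading of them based on how they are used and on the stated invariants; the actual course versions may differ. Data.Set from containers stands in for the course's Set module.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

data Move q = Emove q q
            | Move q Char q
  deriving (Eq, Ord, Show)

data Nfa q = Nfa (Set q) (Set (Move q)) q (Set q)
  deriving (Eq, Show)

-- Number of states in the automaton.
nfaSize :: Nfa q -> Int
nfaSize (Nfa qs _ _ _) = Set.size qs

-- The accept state. This relies on invariant 3: there is a single
-- accept state, so the largest element of the accept set is that state.
acceptState :: Nfa q -> q
acceptState (Nfa _ _ _ f) = Set.findMax f

-- Shift every state number by n, so that an automaton numbered from 0
-- is renumbered to start from n.
numberNfaFrom :: Int -> Nfa Int -> Nfa Int
numberNfaFrom n (Nfa qs moves q0 f) =
  Nfa (Set.map (+ n) qs) (Set.map shift moves) (q0 + n) (Set.map (+ n) f)
  where
    shift (Emove q r)  = Emove (q + n) (r + n)
    shift (Move q c r) = Move (q + n) c (r + n)
```

Under this reading, the automaton for R occupies states 1 through nfaSize r_nfa, the automaton for S follows immediately after, and qf is the next unused number, which keeps the states consecutively numbered.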
Section 9
Converting an NFA to a DFA
Converting an NFA to a DFA
r0 = {q0, q1, q3, q4}
r1 = {q1, q2, q3, q4}
r2 = {q5}
- Our method for simulating an NFA suggests a way to convert an
  NFA to a DFA.
- First, find the set of possible initial states of the NFA.
- This set of NFA states is the start state for the DFA. We will
  call this state r0.
- Then, for each symbol in the alphabet, find the set of states
  accessible via transitions on that symbol.
- First, let’s handle the symbol a.
- This set of states is r1.
- Next, let’s handle the symbol b.
- This set of states is r2.
Converting an NFA to a DFA cont’d
r0 = {q0, q1, q3, q4}
r1 = {q1, q2, q3, q4}
r2 = {q5}
- We now know all the transitions from the start state.
- What about transitions from r1?
- On an a, r1 transitions to r1.
- On a b, r1 transitions to r2.
- The state r2 doesn’t transition to anything.
Converting an NFA to a DFA cont’d
r0 = {q0, q1, q3, q4}
r1 = {q1, q2, q3, q4}
r2 = {q5}
[Figure: the resulting DFA. Start state r0 moves to r1 on a and to r2 on b; r1 loops on a and moves to r2 on b.]
- Now we know the states and transitions of our DFA.
- Notice that our conversion produced a DFA with unnecessary
  states. We can get rid of these by minimizing the DFA. We
  won’t worry about this step here.
Converting an NFA to a DFA
- We have seen intuitively how to convert an NFA to a DFA.
  Now we will implement this conversion.
- Recall our formally-defined conversion from an NFA
  N = (Q, Σ, δ, q0, F) to a DFA M = (Q′, Σ, δ′, q0′, F′):
  1. Q′ = P(Q). Every state of M is a set of states of N.
  2. For R ∈ Q′ and a ∈ Σ, let
     δ′(R, a) = {q ∈ Q | q ∈ E∗(δ(r, a)) for some r ∈ R}
  3. q0′ = E∗({q0}).
  4. F′ = {R ∈ Q′ | r ∈ F for some r ∈ R}.
- This formal description hopefully makes more sense now that
  we’ve seen an example. However, our implementation will draw
  inspiration from intuition rather than rigorously following the
  above definition.
- Note in particular that the DFA we constructed in our example
  had 3 states, not 2^6 = 64 states!
Transforming an NFA to a DFA
First, we need a way to add to the DFA the new state that results
from moving from the set of NFA states q on a single character c.

addmove :: Ord q => Nfa q -> Set q -> Char -> Dfa (Set q) -> Dfa (Set q)
addmove nfa@(Nfa _ _ _ f0) q c (Nfa qs moves q0 f) =
    Nfa qs' moves' q0 f'
  where
    qs'    = qs `union` singleton new
    moves' = moves `union` singleton (Move q c new)
    f' | isEmpty (f0 `intersection` new) = f
       | otherwise                       = f `union` singleton new
    new    = onetrans nfa q c

Recall that onetrans computes the set of states reachable from the
set of states q via a transition on a single character c.

onetrans :: Ord q => Nfa q -> Set q -> Char -> Set q
onetrans nfa q c = epsilonClosure nfa (onemove nfa q c)
Transforming an NFA to a DFA cont’d
Next, we need a way to add to the DFA all the new states that
result from moving from the set of NFA states q via a whole list of
characters. We will use this to find all the moves for any character
in the alphabet.

addmoves :: Ord q => Nfa q -> Set q -> [Char] -> Dfa (Set q) -> Dfa (Set q)
addmoves _   _ []     dfa = dfa
addmoves nfa q (c:cs) dfa = addmoves nfa q cs (addmove nfa q c dfa)

Question: How could we have written addmoves more succinctly?
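One possible answer is that addmoves is a left fold. The sketch below shows the pattern on its own, using two made-up helpers, applyAll and applyAll', so that it can be run without the rest of the NFA code; they are not part of the course code.

```haskell
-- Explicit recursion in the same shape as addmoves: thread a value
-- through one step per list element, left to right.
applyAll :: (c -> s -> s) -> [c] -> s -> s
applyAll _    []     s = s
applyAll step (c:cs) s = applyAll step cs (step c s)

-- The same function written with foldl.
applyAll' :: (c -> s -> s) -> [c] -> s -> s
applyAll' step cs s = foldl (flip step) s cs

-- By the same pattern, addmoves could be written as
--   addmoves nfa q cs dfa = foldl (\d c -> addmove nfa q c d) dfa cs
```

The fold makes it explicit that the DFA under construction is an accumulator updated once per character.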
Transforming an NFA to a DFA cont’d
Now, we need a way to add to the DFA all the new states which
can be reached from some DFA state by a single transition on
some character of the alphabet.

addstep :: Ord q => Nfa q -> [Char] -> Dfa (Set q) -> Dfa (Set q)
addstep nfa alpha dfa0@(Nfa qs _ _ _) =
    add dfa0 (toList qs)
  where
    add dfa []     = dfa
    add dfa (r:rs) = add (addmoves nfa r alpha dfa) rs
Transforming an NFA to a DFA cont’d
Finally, we need to start with an initial DFA and then iterate this
process until we can no longer add any new states.

deterministic :: Nfa Int -> [Char] -> Dfa (Set Int)
deterministic nfa@(Nfa _ _ q0 f) alpha =
    fixpoint (addstep nfa alpha) (Nfa (singleton q0') empty q0' f')
  where
    q0' = epsilonClosure nfa (singleton q0)
    f' | isEmpty (f `intersection` q0') = empty
       | otherwise                      = singleton q0'
Executing a DFA
- Now that we have a DFA, how can we execute it?
- One catch: the DFA we constructed may not be minimal.
- We say that string x distinguishes state s from state t if exactly
  one of the states reached from s and t by following the path
  with label x is an accepting state. State s is distinguishable
  from state t if there is some string that distinguishes them.
- A minimal DFA has no indistinguishable states.
- We have implemented a DFA minimization algorithm for you.
Section 10
Compiling a DFA to C
Executing a DFA in C
Rather than simulating the DFA in Haskell, we can generate a C
function that simulates the DFA on an input. Here is a matcher
generated from the regular expression a*b.

int match(const char* cs)
{
    int state = 0;
    int accept = 0;

    while (1) {
        switch (*(cs++)) {
        case 'a':
            switch (state) {
            case 0:
                state = 0;
                accept = 0;
                break;
            default: return 0;
            }
            break;
        case 'b':
            switch (state) {
            case 0:
                state = 1;
                accept = 1;
                break;
            default: return 0;
            }
            break;
        case '\0': return accept;
        default: return 0;
        }
    }
}