COMP 412
FALL 2010
Lexical Analysis — Part II
From Regular Expression
to Scanner
Comp 412
Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make copies
of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit
educational purposes, provided this copyright notice is preserved.
Quick Review
compile
time
design
time
source code
Scanner
parts of speech & words
tables
or code
specifications
Scanner
Generator
Represent
words as
indices into a
global table
Specifications written as
“regular expressions”
Last class:
— The scanner is the first stage in the front end
— Specifications can be expressed using regular expressions
— Build tables and code from a DFA
Comp 412, Fall 2010
1
Quick Review of Regular Expressions
• All strings of 1s and 0s ending in a 1
( 0 | 1 )* 1
• All strings over lowercase letters where the vowels (a,e,i,o,
& u) occur exactly once, in ascending order
Let Cons be (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z)
Cons* a Cons* e Cons* i Cons* o Cons* u Cons*
• All strings of 1s and 0s that do not contain three 0s in a row:
( 1* ( |01 | 001 ) 1* )* ( | 0 | 00 )
Comp 412, Fall 2010
2
Example
(from Lab 1)
Consider the problem of recognizing ILOC register names
Register r (0|1|2| … | 9) (0|1|2| … | 9)*
• Allows registers of arbitrary number
• Requires at least one digit
RE corresponds to a recognizer (or DFA)
(0|1|2| … 9)
(0|1|2| … 9)
r
S0
S1
S2
accepting state
Recognizer for Register
Transitions on other inputs go to an error state, se
Comp 412, Fall 2010
3
Example
(continued)
DFA operation
• Start in state S0 & make transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (S2 )
(0|1|2| … 9)
(0|1|2| … 9)
r
S0
S1
S2
accepting state
Recognizer for Register
So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se
Comp 412, Fall 2010
4
Example
(continued)
To be useful, the recognizer must be converted into code
Char next character
State s0
while (Char EOF)
State (State,Char)
Char next character
if (State is a final state )
then report success
else report failure
r
0,1,2,3,4,
5,6,7,8,9
s0
s1
se
se
s1
se
s2
se
s2
se
s2
se
se
se
se
se
Skeleton recognizer
O(1) cost per character
per2010
transition)
Comp (or
412, Fall
All
others
Table encoding the RE
(0|1|2| … 9)
(0|1|2| … 9)
r
S0
S1
S2
5
Example
(continued)
We can add “actions” to each transition
Char next character
State s0
while (Char EOF)
Next (State,Char)
Act (State,Char)
perform action Act
State Next
Char next character
if (State is a final state )
then report success
else report failure
Skeleton recognizer
Comp 412, Fall 2010
r
0,1,2,3,4,
All
5,6,7,8,9 others
s0
s1
start
se
error
se
error
s1
se
error
s2
add
se
error
s2
se
error
s2
add
se
error
se
se
error
se
error
se
error
Table encoding RE
Typical action is to capture the lexeme
6
What if we need a tighter specification?
r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31 ?
Write a tighter regular expression
— Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
— Register r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
• DFA has more states
• DFA has same cost per transition
• DFA has same basic implementation
Comp 412, Fall 2010
(or per character)
7
Tighter register specification
(continued)
The DFA for
Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
(0|1|2| … 9)
S2
S3
0,1,2
S0
r
S1
3
S5
0,1
S6
4,5,6,7,8,9
S4
• Accepts a more constrained set of register names
• Same set of actions, more states
Comp 412, Fall 2010
8
Tighter register specification
(continued)
r
0,1
2
3
4-9
All
others
s0
s1
se
se
se
se
se
s1
se
s2
s2
s5
s4
se
s2
se
s3
s3
s3
s3
se
s3
se
se
se
se
se
se
s4
se
se
se
se
se
se
s5
se
s6
se
se
se
se
s6
se
se
se
se
se
se
se
se
se
se
se
se
se
This table runs
in the same
skeleton
recognizer
This table uses
the same O(1)
time per
character
The extra
precision costs
us table space,
not time
(0|1|2| … 9)
Table encoding RE for the tighter register specification
S0
Comp 412, Fall 2010
S2
S3
0,1,2
r
S1
3
S5
0,1
S6
9
4,5,6,7,8,9
S4
Tighter register specification
(continued)
State
Action
r
0,1
2
3
4,5,6
7,8,9
other
0
1
start
e
e
e
e
e
1
e
2
add
2
add
5
add
4
add
e
2
e
3
add
3
add
3
add
3
add
e
exit
3,4
e
e
e
e
e
e
exit
5
e
6
add
e
e
e
e
exit
6
e
e
e
e
e
x
exit
e
e
e
e
e
e
e
(0|1|2| … 9)
We care about path lengths (time) and finite
size of set of states (representability), but we
Comp
Fall 2010
don’t412,
worry
(much) about number of states.
S2
S3
0,1,2
S0
r
S1
3
S5
0,1
10
4,5,6,7,8,9
S4
S6
Where are we going?
• We will show how to construct a finite state automaton
to recognize any RE
Introduce NFAs
• Overview:
— Direct construction of a nondeterministic finite automaton
(NFA) to recognize a given RE
Easy to build in an algorithmic way
Requires -transitions to combine regular subexpressions
— Construct a deterministic finite automaton (DFA) to simulate
the NFA
Use a set-of-states construction
— Minimize the number of states in the DFA
Optional, but
worthwhile
Hopcroft state minimization algorithm
— Generate the scanner code
Additional specifications needed for the actions
Comp 412, Fall 2010
11
Non-deterministic Finite Automata
What about an RE such as ( a | b )* abb ?
b
S0
a
a
S1
b
S2
b
S3
a
a
b
Each RE corresponds to a deterministic finite automaton (DFA)
• We know a DFA exists for each RE
• The DFA may be hard to build directly
• Automatic techniques will build it for us …
Comp 412, Fall 2010
Nothing here that would change
the O(1) cost per transition
12
Non-deterministic Finite Automata
Here is a simpler RE for ( a | b )* abb
a|b
S0
S1
a
S2
b
S3
b
S4
This recognizer is more intuitive
• Structure seems to follow the RE’s structure
This recognizer is not a DFA
• S0 has a transition on
• S1 has two transitions on a
This is a non-deterministic finite automaton (NFA)
Comp 412, Fall 2010
This NFA needs one more transition,
at O(1) cost per transition
13
Non-deterministic Finite Automata
An NFA accepts a string x iff a path though the
transition graph from s0 to a final state such that
the edge labels spell x, ignoring ’s
• Transitions on consume no input
• To “run” the NFA, start in s0 and guess the right transition at
each step
— Always guess correctly
— If some sequence of correct guesses accepts x then accept
Why study NFAs?
• They are the key to automating the REDFA construction
• We can paste together NFAs with -transitions
NFA
Comp 412, Fall 2010
NFA
becomes an
NFA
14
Relationship between NFAs and DFAs
DFA is a special case of an NFA
• DFA has no transitions
• DFA’s transition function is single-valued
• Same rules will work
DFA can be simulated with an NFA
— Obviously
NFA can be simulated with a DFA
(less obvious)
• Simulate sets of possible states
• Possible exponential blowup in the state space
• Still, one state per character in the input stream
Comp 412, Fall 2010
Rabin & Scott, 1959
15
Automating Scanner Construction
To convert a specification into code:
1 Write down the RE for the input language
2 Build a big NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code
Scanner generators
• Lex and Flex work along these lines
• Algorithms are well-known and well-understood
• Key issue is interface to parser
(define all parts of speech)
• You could build one in a weekend!
Comp 412, Fall 2010
16
Where are we? Why are we doing this?
RE NFA (Thompson’s construction)
• Build an NFA for each term
• Combine them with -moves
NFA DFA (Subset construction)
• Build the simulation
The Cycle of Constructions
DFA Minimal DFA
• Hopcroft’s algorithm
DFA RE
RE
NFA
DFA
minimal
DFA
• All pairs, all paths problem
• Union together paths from s0 to a final state
Comp 412, Fall 2010
17
RE NFA using Thompson’s Construction
Key idea
• NFA pattern for each symbol & each operator
• Join them with moves in precedence order
a
S0
a
S0
S1
S1
a
S2
S5
S3
b
S4
b
S4
S0
S3
NFA for ab
NFA for a
S1
S0
S1
a
S3
S4
NFA for a*
NFA for a | b
Ken Thompson, CACM, 1968
Comp 412, Fall 2010
18
Example of Thompson’s Construction
Let’s try a ( b | c )*
1. a, b, & c
2. b | c
a
S0
S1
b
S1
S0
S2
b
S1
S0
c
S1
S0
S5
c
S3
S4
3. ( b | c )*
S0
S2
b
S3
S1
S6
S4
c
S5
S7
Comp 412, Fall 2010
19
Example of Thompson’s Construction
4. a ( b | c )*
S0
a
(con’t)
S1
S2
S4
b
S5
S3
S8
S6
c
S7
S9
Of course, a human would design something simpler ...
b|c
S0
Comp 412, Fall 2010
a
S1
But, we can automate production of
the more complex NFA version ...
20
Where are we? Why are we doing this?
RE NFA (Thompson’s construction)
• Build an NFA for each term
• Combine them with -moves
NFA DFA (subset construction)
• Build the simulation
The Cycle of Constructions
DFA Minimal DFA
• Hopcroft’s algorithm
DFA RE
RE
NFA
DFA
minimal
DFA
• All pairs, all paths problem
• Union together paths from s0 to a final state
Comp 412, Fall 2010
21
NFA DFA with Subset Construction
Need to build a simulation of the NFA
Two key functions
• Move(si , a) is the set of states reachable from si by a
•
-closure(si) is the set of states reachable from si by
The algorithm:
• Start state derived from s0 of the NFA
• Take its -closure S0 = -closure({s0})
• Take the image of S0, Move(S0, ) for each , and take
its -closure
• Iterate until no more states are added
Sounds more complex than it is…
Comp 412, Fall 2010
Rabin & Scott, 1959
22
NFA DFA with Subset Construction
The algorithm:
s0 -closure({n0})
S { s0 }
W { s0 }
while ( W ≠ Ø )
select and remove s from W
for each
t -closure(Move(s,))
T[s,] t
if ( t S ) then
add t to S
add t to W
Let’s think about why this works
s0 is a set of states
Comp
412, Fall 2010
S & W are sets of sets of states
The algorithm halts:
1. S contains no duplicates
(test before adding)
2. 2{NFA states} is finite
3. while loop adds to S, but
does not remove from S
(monotone)
the loop halts
S contains all the reachable
NFA states
It tries each character in each si.
It builds every possible NFA
configuration.
S and T form the DFA
This test is a little tricky
23
NFA DFA with Subset Construction
The algorithm:
s0 -closure({n0})
S { s0 }
W { s0 }
while ( W ≠ Ø )
select and remove s from W
for each
t -closure(Move(s,))
T[s,] t
if ( t S ) then
add t to S
add t to W
Let’s think about why this works
Comp 412, Fall 2010
The algorithm halts:
1. S contains no duplicates
(test before adding)
2. 2{NFA states} is finite
3. while loop adds to S, but
does not remove from S
(monotone)
the loop halts
S contains all the reachable
NFA states
It tries each character in each si.
It builds every possible NFA
configuration.
S and T form the DFA
Any DFA state containing a final state of the NFA
final state becomes a final state of the DFA.
24
NFA DFA with Subset Construction
Example of a fixed-point computation
• Monotone construction of some finite set
• Halts when it stops adding to the set
• Proofs of halting & correctness are similar
• These computations arise in many contexts
Other fixed-point computations
• Canonical construction of sets of LR(1) items
— Quite similar to the subset construction
• Classic data-flow analysis (& Gaussian Elimination)
— Solving sets of simultaneous set equations
We will see many more fixed-point computations
Comp 412, Fall 2010
25
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
26
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
s0
q0
q1, q2, q3,
q4, q6, q9
s1
q1, q2, q3,
q4, q6, q9
s2
s3
Comp 412, Fall 2010
b
c
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
q5, q8, q9,
q3, q4, q6
none
s2
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
27
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
28
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
29
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
30
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
31
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
32
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
33
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
34
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s3
q7, q8, q9,
q3, q4, q6
none
Comp 412, Fall 2010
35
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
q7 is the core
state of s3
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
36
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
q5 is the core
state of s2
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
37
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
q1, q2, q3,
q4, q6, q9
none
none
s1
q1, q2, q3,
q4, q6, q9
none
q5, q8, q9,
q3, q4, q6
q7, q8, q9,
q3, q4, q6
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
Final states because of q9
38
NFA DFA with Subset Construction
a ( b | c )* :
q0
a
q1
q2
q4
q3
q6
b
c
q5
q8
q7
q9
States
-closure(Move(s,*))
DFA
NFA
a
b
c
s0
q0
s1
none
none
s1
q1, q2, q3,
q4, q6, q9
none
s2
s3
s2
q5, q8, q9,
q3, q4, q6
none
s2
s3
s3
q7, q8, q9,
q3, q4, q6
none
s2
s3
Comp 412, Fall 2010
Transition table for the DFA
39
NFA DFA with Subset Construction
The DFA for a ( b | c )*
b
s2
b
s0
a
s1
b
c
c
s3
c
a
b
c
s0
s1
none
none
s1
none
s2
s3
s2
none
s2
s3
s3
none
s2
s3
• Much smaller than the NFA (no -transitions)
• All transitions are deterministic
• Use same code skeleton as before
Comp 412, Fall 2010
But, remember
our goal:
S0
b|c
a
S1
40
Where are we? Why are we doing this?
RE NFA (Thompson’s construction)
• Build an NFA for each term
• Combine them with -moves
NFA DFA (subset construction)
• Build the simulation
The Cycle of Constructions
DFA Minimal DFA
• Hopcroft’s algorithm
DFA RE
RE
NFA
DFA
minimal
DFA
• All pairs, all paths problem
• Union together paths from s0 to a final state
Not enough time to teach Hopcroft’s algorithm today
Comp 412, Fall 2010
41
Alternative Approach to DFA Minimization
The Intuition
• The subset construction merges prefixes in the NFA
s0
s1
s5
s8
a
b
a
d
a
s0
b
s1
s4
Comp 412, Fall 2010
b
c
s2
s6
s9
b
c
d
s3
c
s4
abc | bc | ad
s7
s10
Thompson’s construction would leave
-transitions between each singlecharacter automaton
s3
Subset construction eliminates transitions and merges the paths for a.
It leaves duplicate tails, such as bc.
s6
s2
c
s5
42
Alternative Approach to DFA Minimization
Idea: use the subset construction twice
• For an NFA N
— Let reverse(N) be the NFA constructed by making initial states
final (& vice-versa) and reversing the edges
— Let subset(N) be the DFA that results from applying the
subset construction to N
— Let reachable(N) be N after removing all states that are not
reachable from the initial state
• Then,
reachable(subset(reverse[reachable(subset(reverse(N))]))
is the minimal DFA that implements N
[Brzozowski, 1962]
This result is not intuitive, but it is true.
Neither algorithm dominates the other.
Comp 412, Fall 2010
43
Alternative Approach to DFA Minimization
Step 1
• The subset construction on reverse(NFA) merges suffixes in
original NFA
s0
s1
s5
s8
s1
s8
Comp 412, Fall 2010
a
b
a
a
a
s2
s6
s9
s2
s9
b
c
d
b
s3
c
s4
s7
s11
Reversed NFA
s10
s3
d
c
s11
subset(reverse(NFA))
44
Alternative Approach to DFA Minimization
Step 2
• Reverse it again & use subset to merge prefixes …
s0
a
s1
s2
b
s3
c
Reverse it, again
a
s8
s11
d
s9
d
s2
a
s0
b
b
s3
c
And subset it, again
s11
The Cycle of Constructions
Minimal DFA
Comp 412, Fall 2010
RE
NFA
DFA
Brzozowski
minimal
DFA
45
© Copyright 2026 Paperzz